Statistical Tests

1. Intro

As a statistician, people will often come to you hoping to be told the nature of reality: what really happened. You cannot do that; you can only offer insight into what is statistically likely to have happened.

2. Hypothesis

A statistical hypothesis is a supposition about the relationship between two or more data elements.

a) The Null Hypothesis

This is the hypothesis to be tested. It is the alternative to the theory you are researching. For example, if you are researching a model that predicts Y is related to X, then the null hypothesis is that there is no relationship between Y and X; the relationship is zero, or null.

b) Type I Error

The null hypothesis is falsely rejected; that is, rejecting the null hypothesis when it is true.

c) Type II Error

The null hypothesis is falsely accepted; that is, failing to reject the null hypothesis when it is false.

d) Type III Error

This is not an official term, but it often refers to wrongly categorizing a treatment.
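
To make these definitions concrete, here is a minimal simulation sketch in R (illustrative code, not tied to any particular package): a test run at the 5% level should commit a Type I error about 5% of the time when the null hypothesis is true, and it commits a Type II error whenever it fails to reject a false null.

set.seed(42)
# Type I: 10,000 t-tests on samples where the null (mean = 0) is actually true
pvals <- replicate(10000, t.test(rnorm(30), mu = 0)$p.value)
mean(pvals < 0.05)   # fraction of false rejections; should be near 0.05
# Type II: the null is false (true mean = 0.3), but small samples often fail to reject
pvals <- replicate(10000, t.test(rnorm(30, mean = 0.3), mu = 0)$p.value)
mean(pvals >= 0.05)  # fraction of failures to reject a false null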

3. Common Tests

Z-Test

The Z-Test is used to test whether a sample mean is significantly different from its population's mean when the variance is known.

Use

Testing a sample's mean

Assumption

Requires normality
Known variance

Calculation

z = (x̄ − μ) / (σ / √n)

x̄ = sample mean
μ = population mean
σ² = population variance
n = sample size

Example Code(R)

library(BSDA)
# Small sample: less power to detect a difference from the null mean
x <- rnorm(10)
z.test(x, sigma.x = 1)
# Larger sample: tighter confidence interval around the sample mean
x <- rnorm(100)
z.test(x, sigma.x = 1)

T-Test

The T-Test, or Student's t-test, is used to determine whether the means of two randomly drawn samples are statistically different, without requiring the population variance to be known. It is also used to test the mean of a population against a specific value, as in the case of parameters in regression models.

Use

Compare means between samples
Compare mean of a sample to a specific value.

Assumption

Requires normality

Calculation

t = (x̄1 − x̄2) / √( Var1/n1 + Var2/n2 )

Var1 = variance of sample 1, n1 = size of sample 1
Var2 = variance of sample 2, n2 = size of sample 2

Example Code(R)

# Two groups of 10 observations each
MyData <- data.frame(group = rep(c(1, 0), each = 10), x = NA_real_)
MyData$x[1:10]  <- rnorm(10, mean = 0)
MyData$x[11:20] <- rnorm(10, mean = 5)
# Welch's two-sample t-test (var.equal = FALSE)
t.test(x ~ group, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE, data = MyData)

Kruskal-Wallis Test

Tests whether two or more random samples could have come from populations with the same median. It does not assume a normal distribution.

Use
Compare medians across two or more samples when distributional assumptions cannot be made.

Assumption

Variance is the same in all samples.

Calculation
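
For reference, the standard form of the statistic is:

H = (12 / (N(N + 1))) Σ (R_i² / n_i) − 3(N + 1)

R_i = sum of the ranks of sample i (ranks taken over all observations combined)
n_i = size of sample i
N = total number of observations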

Example Code(R)

# beaver1 (built-in): beaver body temperature with an activity indicator
data(beaver1)
kruskal.test(temp ~ activ, data = beaver1)

F-Test  

The F-test is used to test whether the means of two or more sampled populations are significantly different.

Use

Means test.

Assumption

Assumes a normal distribution

Calculation
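
For the one-way case (as in oneway.test below), the statistic is the ratio of between-group to within-group variability:

F = [ Σ n_i (x̄_i − x̄)² / (k − 1) ] / [ Σ Σ (x_ij − x̄_i)² / (N − k) ]

k = number of groups, N = total number of observations
x̄_i = mean of group i, x̄ = overall mean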

Example Code(R)

# One-way comparison of means (Welch correction applied by default)
oneway.test(temp ~ activ, data = beaver1)

Shapiro-Wilk Normality Test

The Shapiro-Wilk test is used to determine whether the population a sample came from is normally distributed.

Use
Test distributional assumptions.

Assumption

Calculation
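
For reference, the statistic has the standard form:

W = ( Σ a_i x_(i) )² / Σ (x_i − x̄)²

x_(i) = the i-th smallest value in the sample
a_i = constants derived from the expected order statistics of a standard normal sample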

Example Code(R)

# Normal data: expect a large p-value (no evidence against normality)
x <- rnorm(1000, mean = 0, sd = 1)
shapiro.test(x)
# Skewed gamma data: expect a very small p-value (reject normality)
x <- rgamma(1000, 10, 1)
shapiro.test(x)

Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test, also known as the K-S test, is used, like the Shapiro-Wilk, to determine whether the population a sample came from follows a given distribution; it can also test whether two samples came from the same distribution.

Use

Test distributional assumptions.

Assumption

Calculation
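
The statistic is the largest vertical distance between the two cumulative distribution functions being compared:

D = max over x of | F1(x) − F2(x) |

F1(x) = empirical cumulative distribution of the sample
F2(x) = reference cumulative distribution (or the empirical CDF of a second sample)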

Example Code(R)

# 1) Samples from the same distribution?

x <- rnorm(20, mean = 0)
y <- rnorm(20, mean = 0)
ks.test(x, y)

# 2) Sample from a specific distribution (gamma, shape = 3, rate = 2)?
ks.test(x + 2, pgamma, 3, 2)

Wilcoxon Signed Rank
The Wilcoxon signed rank test determines whether two related or repeated samples could have come from the same population. It is an alternative to the T-Test and does not have the same reliance on the assumption of normality that the t-test has. (Its unpaired form, the Wilcoxon rank sum test, is covered under Mann-Whitney below.)

Use
Compare locations between two related (paired) samples.

Assumption

The differences between paired observations come from a symmetric distribution

Calculation
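
The statistic (the V reported by wilcox.test with paired = TRUE) is the sum of the ranks of the absolute differences over the pairs with positive differences:

V = Σ rank(|d_i|) over pairs where d_i = x_i − y_i > 0

Zero differences are dropped, and ranks are taken over all nonzero |d_i|.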

Example Code(R)

# Related samples drawn from the same population: expect a large p-value
x <- rnorm(100, mean = 0)
y <- rnorm(100, mean = 0)
wilcox.test(x, y, paired = TRUE)
# Second sample shifted: expect a small p-value
x <- rnorm(100, mean = 0)
y <- rnorm(100, mean = 1)
wilcox.test(x, y, paired = TRUE)

Mann-Whitney

The Mann-Whitney test (equivalent to the Wilcoxon rank sum test) tests whether two samples could have come from the same population. Like the signed rank test, it does not rely as heavily on the assumption of normality. Unlike the signed rank test, the samples do not need to be related.

Use

Compare locations between two independent samples.

Assumption

Samples are independent and come from distributions of the same shape.

Calculation

U = R1 − n1(n1 + 1)/2

R1 = sum of the ranks in sample 1 (ranks taken over both samples combined)
n1 = size of sample 1

Example Code(R)

# Same population: expect a large p-value
x <- rnorm(800, mean = 0)
y <- rnorm(1000, mean = 0)
wilcox.test(x, y)
# Shifted population: expect a small p-value
x <- rnorm(800, mean = 0)
y <- rnorm(1000, mean = 1)
wilcox.test(x, y)

Log-likelihood ratio (G-Test)

The log-likelihood ratio test, sometimes known as the G-Test, is commonly used to test the independence of two variables. Like the Chi-Squared test, it is useful in determining whether an event had any effect on an outcome.

Use

Independence

Calculation
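
The statistic, in its standard form, is:

G = 2 Σ O_i ln(O_i / E_i)

O_i = observed count in cell i
E_i = expected count in cell i under independence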

Example Code(R)

 
library(hierfstat)
e <- rlnorm(20, meanlog = 1)
prob1 <- rlnorm(20, meanlog = 3)
prob2 <- rlnorm(20, meanlog = 1)
x <- prob1 + 0.6 * e
### Related by the equation above
test1 <- g.stats.glob(data.frame(x, prob1))
### No relation
test2 <- g.stats.glob(data.frame(x, prob2))
test1$g.stats
test2$g.stats

Pearson’s Chi-Squared Test

Pearson's chi-squared test is an approximation of the log-likelihood ratio test. Both test the independence of two variables.

Use

Independence

Goodness of fit

Assumption

1. Sample is randomly drawn.
2. Observations are independent.
3. Variables are counts, not percentages.

Calculation

Chi-Squared = Σ ( (O_i − E_i)² / E_i )

O_i = observed count in cell i, E_i = expected count in cell i

Example Code(R)

e <- rlnorm(20, meanlog = 1)
prob1 <- rlnorm(20, meanlog = 3)
prob2 <- rlnorm(20, meanlog = 1)
x <- prob1 + 0.6 * e
### Related by the equation above
chisq.test(x, p = prob1, rescale.p = TRUE)
### No relation
chisq.test(x, p = prob2, rescale.p = TRUE)

Fisher's Exact Test

Fisher's exact test is used to test whether a significant association between two variables exists. It is used with contingency tables that have a small sample size. Note: with large samples, use a Chi-Squared test.

Use
Independence

Goodness of fit

Assumption
Calculation

             Method1   Method2
Outcome1        a         b
Outcome2        c         d

P = ( (a+b)! (c+d)! (a+c)! (b+d)! ) / ( n! a! b! c! d! )

n = a + b + c + d

Example Code(R)

## Random relationship
MyMatrix <- matrix(round(rlnorm(4, meanlog = 3)), nrow = 2)
MyMatrix
fisher.test(MyMatrix)

## Relation
MyMatrix <- matrix(c(100, 34, 12, 120), nrow = 2,
                   dimnames = list(Result = c("T", "F"),
                                   Prediction = c("T", "F")))
MyMatrix
fisher.test(MyMatrix)