1. Intro
People will often come to you, the statistician, in hopes of being told the nature of reality: what really happened. You cannot do that; you can only give insight into what is statistically likely to have happened.
2. Hypothesis
A statistical hypothesis is a supposition about the relationship between two or more data elements.
a) The Null Hypothesis
This is the hypothesis to be tested. It is the alternative to the theory you are researching. For example, if you are researching a model that predicts Y is related to X, then the Null Hypothesis is that there is no relationship between Y and X; the relationship is zero, or null.
b) Type I Error
Rejecting the null hypothesis when it is true; the null hypothesis is falsely rejected (see the simulation sketch after these definitions).
c) Type II Error
Failing to reject the null hypothesis when it is false; the null hypothesis is falsely accepted.
d) Type III Error
This is not an official term, but it often refers to wrongly categorizing a treatment.
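A small simulation makes the Type I error rate concrete: when data are generated under a true null hypothesis, every rejection is a Type I error, so the rejection rate should land near the chosen significance level. This is a minimal sketch in base R; the sample size of 30 and alpha of 0.05 are arbitrary choices for illustration.
## Data are simulated under a true null (mean = 0), so each
## rejection at alpha = 0.05 is a Type I error.
set.seed(1)
pvals <- replicate(5000, t.test(rnorm(30))$p.value)
mean(pvals < 0.05)   ## rejection rate; should be close to 0.05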
3. Common Tests
Z-Test
The Z-Test is used to test whether a sample mean is significantly different from the population mean when the population variance is known.
Use
Testing a sample's mean
Assumption
Requires normality
Known variance
Calculation
Z = (x̄ - μ) / ( σ / sqrt(N) )

x̄ = sample mean
μ = population mean
σ^2 = population variance (σ = population standard deviation)
N = sample size
Example Code(R)
library(BSDA)
x <- rnorm(10)
z.test(x,sigma.x=1)
x <- rnorm(100)
z.test(x,sigma.x=1)
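To tie the formula to the output, the statistic can be computed by hand. This sketch assumes the second sample x and sigma.x = 1 from the code above, with the default null mean of 0 used by z.test.
## Hand computation of the z statistic for x above (null mean 0, sigma = 1)
z <- (mean(x) - 0) / (1 / sqrt(length(x)))
z                     ## matches the statistic from z.test(x, sigma.x = 1)
2 * pnorm(-abs(z))    ## two-sided p-value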
T-Test
The T-Test, or Student's t, is used to determine whether the means of two randomly drawn samples are statistically different without requiring the population variance to be known. It is also used to test the mean of a population against a specific value, as in the case of parameters in regression models.
Use
Compare means between samples
Compare mean of a sample to a specific value.
Assumption
Requires normality
Calculation
t = (mean1 - mean2) / sqrt( Var1/n1 + Var2/n2 )

mean1, mean2 = sample means
Var1 = variance of sample 1
Var2 = variance of sample 2
n1, n2 = sample sizes
Example Code(R)
MyData <- data.frame(group = c(1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0))
MyData$x <- NA
MyData$x[1:10] <- rnorm(10, mean = 0)
MyData$x[11:20] <- rnorm(10, mean = 5)
t.test(x ~ group, alternative = "two.sided", conf.level = .95, var.equal = FALSE, data = MyData)
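The Welch statistic from the formula above can be reproduced by hand. A minimal sketch, assuming the MyData frame just created; t.test with a formula puts group 0 first, so the subtraction below follows that order.
## Hand computation of the Welch t statistic for MyData
x0 <- MyData$x[MyData$group == 0]
x1 <- MyData$x[MyData$group == 1]
(mean(x0) - mean(x1)) / sqrt(var(x0)/length(x0) + var(x1)/length(x1))
## matches the statistic reported by t.test above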
Kruskal-Wallis Test
Tests whether two or more random samples could have come from populations with the same median. It does not assume a normal distribution.
Use
Compare medians between samples when normality cannot be assumed.
Assumption
Variance is similar in each sample.
Calculation
H = ( 12 / (N(N+1)) ) * sum( Ri^2 / ni ) - 3(N+1)

Ri = sum of the ranks in sample i
ni = size of sample i
N = total number of observations
Example Code(R)
data(beaver1)
kruskal.test(temp ~ activ, data = beaver1)
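The H statistic can also be computed directly from the formula above. A sketch using the same beaver1 data; kruskal.test additionally applies a correction for tied ranks, so the hand value agrees only approximately when ties are present.
## Hand computation of H for beaver1 (no tie correction)
r <- rank(beaver1$temp)
Ri <- tapply(r, beaver1$activ, sum)      ## rank sums per group
ni <- tapply(r, beaver1$activ, length)   ## group sizes
N <- nrow(beaver1)
12 / (N * (N + 1)) * sum(Ri^2 / ni) - 3 * (N + 1)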
F-Test
The F-test is used to test whether the means of two or more sample populations differ significantly.
Use
Means test.
Assumption
Assumes a normal distribution
Calculation
F = MSB / MSW

MSB = SSB / (k - 1), the between-group mean square
MSW = SSW / (N - k), the within-group mean square
k = number of groups, N = total number of observations
Example Code(R)
oneway.test(temp ~ activ, data = beaver1)
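The classic F statistic can be built from the sums of squares in the formula above. Note that oneway.test defaults to the Welch variant; the hand computation below matches oneway.test(temp ~ activ, data = beaver1, var.equal = TRUE).
## Hand computation of the one-way F statistic for beaver1
g <- factor(beaver1$activ)
N <- nrow(beaver1); k <- nlevels(g)
ni <- tapply(beaver1$temp, g, length)
mi <- tapply(beaver1$temp, g, mean)
SSB <- sum(ni * (mi - mean(beaver1$temp))^2)  ## between-group sum of squares
SSW <- sum((beaver1$temp - mi[g])^2)          ## within-group sum of squares
(SSB / (k - 1)) / (SSW / (N - k))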
Shapiro-Wilk Normality Test
The Shapiro-Wilk test is used to determine whether the population a sample came from is normally distributed.
Use
Test distributional assumptions.
Assumption
Calculation
Example Code(R)
x <- rnorm(1000, mean = 0, sd=1)
shapiro.test(x)
x <- rgamma(1000, 10,1)
shapiro.test(x)
Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test, also known as the K-S test, is used like the Shapiro-Wilk to check distributional assumptions: it can test whether a sample came from a specified distribution, or whether two samples came from the same distribution.
Use
Test distributional assumptions.
Assumption
Calculation
Example Code(R)
# 1) Samples from the same distribution?
x <- rnorm(20, mean = 0)
y <- rnorm(20, mean = 0)
ks.test(x,y)
# 2) Sample from a specific distribution?
ks.test(x + 2, "pgamma", 3, 2)
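The two-sample K-S statistic is just the largest vertical gap between the two empirical CDFs, which can be checked by hand using the x and y above:
## Hand computation of the two-sample K-S statistic D
grid <- sort(c(x, y))
max(abs(ecdf(x)(grid) - ecdf(y)(grid)))   ## matches ks.test(x, y)$statistic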
Wilcoxon Rank Sum
The Wilcoxon Rank Sum test determines whether two samples could have come from the same population; a paired variant, the Wilcoxon signed-rank test (wilcox.test with paired = TRUE), handles related or repeated samples. It is an alternative to the T-Test and does not have the t-test's reliance on the assumption of normality.
Use
Compare means between samples
Independence
Assumption
Samples come from a symmetric distribution
Calculation
W = R1, the sum of the ranks of sample 1 in the pooled sample
(R's wilcox.test reports W - n1(n1+1)/2, the Mann-Whitney form)
Example Code(R)
x <- rnorm(100, mean = 0)
y <- rnorm(100, mean = 0)
wilcox.test(x, y)
x <- rnorm(100, mean = 0)
y <- rnorm(100, mean = 1)
wilcox.test(x, y)
Mann-Whitney
The Mann-Whitney U test checks whether two samples could have come from the same population; it is equivalent to the Wilcoxon Rank Sum test, which is why R's wilcox.test performs both. Like the Wilcoxon tests it does not rely on the assumption of normality, and unlike the signed-rank variant the samples do not need to be related.
Use
Goodness of fit to a distribution
Compare means between samples
Independence
Assumption
Samples come from a symmetric distribution
Calculation
U1 = n1*n2 + n1(n1+1)/2 - R1

R1 = sum of the ranks in sample 1
n1, n2 = sample sizes
Example Code(R)
x <- rnorm(800, mean = 0)
y <- rnorm(1000, mean = 0)
wilcox.test(x, y)
x <- rnorm(800, mean = 0)
y <- rnorm(1000, mean = 1)
wilcox.test(x, y)
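The U formula above can be verified with the last pair of samples; note that R's wilcox.test reports W = n1*n2 - U1 (the same count taken from the other sample's perspective).
## Hand computation of U1 from the formula above
n1 <- length(x); n2 <- length(y)
R1 <- sum(rank(c(x, y))[1:n1])       ## sum of ranks of x in the pooled sample
n1 * n2 + n1 * (n1 + 1) / 2 - R1     ## U1; wilcox.test reports n1*n2 - U1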
Log-likelihood ratio (G-Test )
The Log-likelihood ratio, sometimes known as the G-Test, is commonly used to test the independence of two variables. Like the Chi-Squared test, it is useful in determining whether an event had any effect on an outcome.
Use
Independence
Calculation
G = 2 * sum( Observed(i) * ln( Observed(i) / Expected(i) ) )
Example Code(R)
library(hierfstat)
e <- rlnorm(20, mean = 1)
prob1 <- rlnorm(20, mean = 3)
prob2 <- rlnorm(20, mean = 1)
x <- prob1 + .6* e
### Related by equation above
test1 <- g.stats.glob(data.frame(x, prob1))
### No relation
test2 <- g.stats.glob(data.frame(x, prob2))
test1$g.stats
test2$g.stats
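Because g.stats.glob comes from a population-genetics package, a more generic check is to compute G directly from a contingency table using the formula above. The 2x2 counts below are made up for illustration.
## Hand computation of G for a hypothetical 2x2 table of counts
O <- matrix(c(100, 34, 12, 120), nrow = 2)
E <- outer(rowSums(O), colSums(O)) / sum(O)  ## expected counts under independence
G <- 2 * sum(O * log(O / E))
G                                            ## compare chisq.test(O)$statistic
pchisq(G, df = 1, lower.tail = FALSE)        ## p-value on (2-1)*(2-1) = 1 df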
Pearson’s Chi-Squared Test
Pearson's chi-squared test is an approximation of the log-likelihood ratio test. Both test the independence of two variables.
Use
Independence
Goodness of fit
Assumption
1. Sample is randomly drawn.
2. Observations are independent.
3. Variables are counts, not percentages.
Calculation
Chi-Squared = sum( (Observed(i) - Expected(i))^2 / Expected(i) )
Example Code(R)
e <- rlnorm(20, mean = 1)
prob1 <- rlnorm(20, mean = 3)
prob2 <- rlnorm(20, mean = 1)
x <- prob1 + .6 * e
### Related by equation above
chisq.test(x, p = prob1, rescale.p = TRUE)
### No relation
chisq.test(x, p = prob2, rescale.p = TRUE)
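The statistic from the formula above can be reproduced by hand for the related case; chisq.test rescales p to sum to one, which is mirrored here.
## Hand computation of the chi-squared goodness-of-fit statistic
E <- sum(x) * prob1 / sum(prob1)   ## expected counts under prob1
sum((x - E)^2 / E)                 ## matches chisq.test(x, p = prob1, rescale.p = TRUE)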
Fisher's Exact Test
Fisher's Exact Test is used to test whether a significant association between two variables exists. It is used with contingency tables that have small sample sizes; with large samples, use a Chi-Squared test.
Use
Independence
Goodness of fit
Assumption
Calculation
            Method1   Method2
Outcome1       a         b
Outcome2       c         d

P = ( (a+b)! (c+d)! (a+c)! (b+d)! ) / ( n! a! b! c! d! )
Example Code(R)
## Random Relationship
MyMatrix <- matrix(c(round(rlnorm(1, mean = 3)), round(rlnorm(1, mean = 3)),
                     round(rlnorm(1, mean = 3)), round(rlnorm(1, mean = 3))), nrow = 2)
MyMatrix
fisher.test(MyMatrix)
## Relation
MyMatrix <- matrix(c(100, 34, 12, 120), nrow = 2,
                   dimnames = list(Result = c("T", "F"),
                                   Prediction = c("T", "F")))
MyMatrix
fisher.test(MyMatrix)
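The factorial formula above can be checked for the second table. Working on the log scale avoids overflow; the result is the hypergeometric probability of the observed table, and fisher.test's p-value sums such probabilities over all tables at least as extreme.
## Probability of the observed table from the factorial formula
a <- 100; b <- 12; c <- 34; d <- 120; n <- a + b + c + d
logP <- lfactorial(a + b) + lfactorial(c + d) + lfactorial(a + c) + lfactorial(b + d) -
  (lfactorial(n) + lfactorial(a) + lfactorial(b) + lfactorial(c) + lfactorial(d))
exp(logP)   ## equals dhyper(a, a + b, c + d, a + c)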