1. Intro
People will often come to you, as a statistician, in hopes of being told the nature of reality, what really happened. You cannot do that; you can only offer insight into what can be said to be statistically likely to have happened.
2. Hypothesis
A statistical hypothesis is a supposition about the relationship between two or more data elements.
a) The Null Hypothesis
This is the hypothesis to be tested. It is the alternative to the theory you are researching. For example, if you are researching a model that predicts Y is related to X, then the null hypothesis is that there is no relationship between Y and X; the relationship is zero, or null.
b) Type I Error
The null hypothesis is falsely rejected; that is, the null hypothesis is rejected when it is true.
c) Type II Error
The null hypothesis is falsely accepted; that is, there is a failure to reject the null hypothesis when it is false.
d) Type III Error
This is not an official term, but it often refers to wrongly categorizing a treatment.
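The Type I and Type II error rates can be seen directly by simulation; a minimal sketch (not from the original notes), assuming a t-test at alpha = 0.05 on simulated normal data:

```r
set.seed(1)
# Type I: the null (mean = 0) is true, so every rejection is a false one.
# The false-rejection rate should be close to alpha = 0.05.
p_null <- replicate(5000, t.test(rnorm(20, mean = 0))$p.value)
type1_rate <- mean(p_null < 0.05)

# Type II: the null is false (true mean = 0.5), so every failure to
# reject is a Type II error.
p_alt <- replicate(5000, t.test(rnorm(20, mean = 0.5))$p.value)
type2_rate <- mean(p_alt >= 0.05)

type1_rate  # near 0.05
type2_rate
```

Raising the rejection threshold lowers the Type I rate but raises the Type II rate; the two trade off against each other.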
3. Common Tests
Z-Test
The Z-Test is used to test whether a sample mean is significantly different from the population mean when the population variance is known.
Use
Testing a sample's mean.
Assumption
Requires normality
Known variance
Calculation
Z = (x̄ - μ) / (σ / √N)
x̄ = sample mean
μ = population mean
σ² = population variance
N = sample size
Example Code(R)
library(BSDA)
x <- rnorm(10)
z.test(x, sigma.x = 1)
x <- rnorm(100)
z.test(x, sigma.x = 1)
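The statistic z.test reports can be reproduced by hand from the calculation above; a minimal sketch (assuming a null mean of 0 and a known sigma of 1, so BSDA is not required):

```r
set.seed(42)
x <- rnorm(100)
mu <- 0; sigma <- 1; N <- length(x)
z <- (mean(x) - mu) / (sigma / sqrt(N))  # Z = (x-bar - mu) / (sigma / sqrt(N))
p <- 2 * pnorm(-abs(z))                  # two-sided p-value
z
p
```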
T-Test
The T-Test, or Student's t-test, is used to determine whether the means of two randomly drawn samples are statistically different, without requiring the population variance to be known. It is also used to test the mean of a population against a specific value, as in the case of a parameter in a regression model.
Use
Compare means between samples
Compare mean of a sample to a specific value.
Assumption
Requires normality
Calculation
t = (x̄1 - x̄2) / √( Var1/N1 + Var2/N2 )
Var1 = variance of sample 1
Var2 = variance of sample 2
N1, N2 = sizes of samples 1 and 2
Example Code(R)
MyData <- data.frame(group = c(1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0))
MyData$x <- c(rnorm(10, mean = 0), rnorm(10, mean = 5))
t.test(x ~ group, alternative = "two.sided", conf.level = .95, var.equal = FALSE, data = MyData)
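The Welch statistic that t.test returns with var.equal = FALSE matches the calculation above; a hand computation for comparison (a sketch on made-up samples):

```r
set.seed(7)
x1 <- rnorm(10, mean = 0)
x2 <- rnorm(10, mean = 5)
# t = (x-bar1 - x-bar2) / sqrt(Var1/N1 + Var2/N2)
t_manual <- (mean(x1) - mean(x2)) /
  sqrt(var(x1) / length(x1) + var(x2) / length(x2))
t_r <- unname(t.test(x1, x2, var.equal = FALSE)$statistic)
t_manual
t_r  # same value as t_manual
```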
Kruskal-Wallis Test
Tests whether two or more random samples could have come from populations with the same median. It does not assume a normal distribution.
Use
Compare medians between two or more samples.
Assumption
Variance is the same across all samples.
Calculation
H = ( 12 / (N(N+1)) ) Σ (Ri² / ni) - 3(N+1), where Ri is the sum of ranks in sample i, ni is the size of sample i, and N = Σ ni.
Example Code(R)
data(beaver1)
kruskal.test(temp ~ activ, data = beaver1)
F-Test
The F-Test is used to test whether the means of two or more populations, based on samples from them, are significantly different.
Use
Compare means across two or more samples.
Assumption
Assumes a normal distribution.
Calculation
Example Code(R)
oneway.test(temp ~ activ, data = beaver1)
Shapiro-Wilk Normality Test
The Shapiro-Wilk test is used to determine whether the population a sample came from is normally distributed.
Use
Test distributional assumptions.
Assumption
Calculation
Example Code(R)
x <- rnorm(1000, mean = 0, sd = 1)
shapiro.test(x)
x <- rgamma(1000, 10, 1)
shapiro.test(x)
Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test, also known as the K-S test, is used, like the Shapiro-Wilk, to check whether a sample came from a given distribution; unlike the Shapiro-Wilk, it is not limited to the normal distribution, and it can also compare two samples against each other.
Use
Test distributional assumptions.
Assumption
Calculation
Example Code(R)
#1) Samples from the same distribution?
x <- rnorm(20, mean = 0)
y <- rnorm(20, mean = 0)
ks.test(x, y)
#2) Sample from a specific distribution?
ks.test(x + 2, "pgamma", 3, 2)
Wilcoxon Rank Sum
The Wilcoxon Rank Sum test determines whether two samples could have come from the same population. It is an alternative to the T-Test and does not share the T-Test's reliance on the assumption of normality. (For related or repeated samples, use the paired form, the Wilcoxon Signed Rank test, by passing paired = TRUE to wilcox.test.)
Use
Compare means between samples.
Assumption
Samples come from a symmetric distribution
Calculation
Example Code(R)
x <- rnorm(100, mean = 0)
y <- rnorm(100, mean = 0)
wilcox.test(x, y)
x <- rnorm(100, mean = 0)
y <- rnorm(100, mean = 1)
wilcox.test(x, y)
Mann-Whitney
The Mann-Whitney U test determines whether two independent samples could have come from the same population. It is equivalent to the Wilcoxon Rank Sum test, and likewise does not rely on the assumption of normality; in R both are performed by wilcox.test.
Use
Goodness of fit to a distribution.
Compare means between samples.
Independence
Assumption
Samples come from a symmetric distribution
Calculation
U = N1*N2 + N1(N1+1)/2 - R1, where R1 is the sum of ranks in sample 1 and N1, N2 are the sample sizes.
Example Code(R)
x <- rnorm(800, mean = 0)
y <- rnorm(1000, mean = 0)
wilcox.test(x, y)
x <- rnorm(800, mean = 0)
y <- rnorm(1000, mean = 1)
wilcox.test(x, y)
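The W that wilcox.test reports and the U in the formula above are complementary conventions: R computes W = R1 - N1(N1+1)/2, so the two always sum to N1*N2. A small check on made-up samples:

```r
set.seed(3)
x <- rnorm(8)
y <- rnorm(12)
n1 <- length(x); n2 <- length(y)
R1 <- sum(rank(c(x, y))[1:n1])          # sum of ranks of sample 1
U  <- n1 * n2 + n1 * (n1 + 1) / 2 - R1  # U from the formula above
W  <- unname(wilcox.test(x, y)$statistic)
U + W == n1 * n2                        # TRUE: the conventions are complementary
```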
Log-likelihood Ratio (G-Test)
The log-likelihood ratio test, sometimes known as the G-Test, is commonly used to test the independence of two variables. Like the chi-squared test, it is useful in determining whether an event had any effect on an outcome.
Use
Calculation
Example Code(R)
library(hierfstat)
e <- rlnorm(20, meanlog = 1)
prob1 <- rlnorm(20, meanlog = 3)
prob2 <- rlnorm(20, meanlog = 1)
x <- prob1 + .6 * e
### Related by the equation above
test1 <- g.stats.glob(data.frame(x, prob1))
### No relation
test2 <- g.stats.glob(data.frame(x, prob2))
test1$g.stats
test2$g.stats
Pearson’s Chi-Squared Test
Pearson’s chi-squared test is an approximation of the log-likelihood ratio test. Both test the independence of two variables.
Use
Independence
Goodness of fit
Assumption
1. Sample is randomly drawn.
2. Observations are independent.
3. Variables are not percentages.
Calculation
Chi-squared = Σ ( (Observed_i - Expected_i)² / Expected_i )
Example Code(R)
e <- rlnorm(20, meanlog = 1)
prob1 <- rlnorm(20, meanlog = 3)
prob2 <- rlnorm(20, meanlog = 1)
x <- prob1 + .6 * e
### Related by the equation above
chisq.test(x, p = prob1, rescale.p = TRUE)
### No relation
chisq.test(x, p = prob2, rescale.p = TRUE)
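The chi-squared statistic can be verified against the summation formula above; a minimal sketch with hypothetical counts:

```r
observed <- c(30, 50, 20)
p        <- c(0.25, 0.50, 0.25)  # hypothesized proportions
expected <- sum(observed) * p
# Chi-squared = sum( (Observed_i - Expected_i)^2 / Expected_i )
stat_manual <- sum((observed - expected)^2 / expected)
stat_r <- unname(chisq.test(observed, p = p)$statistic)
stat_manual
stat_r  # same value as stat_manual
```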
Fisher's Exact Test
Fisher's Exact Test is used to test whether a significant association exists between two variables. It is used with contingency tables when the sample size is small; with large samples, use a chi-squared test.
Use
Independence
Goodness of fit
Assumption
Calculation
For a 2x2 contingency table:

             Method1   Method2
  Outcome1      a         b
  Outcome2      c         d

P = ( (a+b)! (c+d)! (a+c)! (b+d)! ) / ( n! a! b! c! d! )

where n = a + b + c + d.
Example Code(R)
## Random relationship
MyMatrix <- matrix(c(round(rlnorm(1, meanlog = 3)), round(rlnorm(1, meanlog = 3)),
                     round(rlnorm(1, meanlog = 3)), round(rlnorm(1, meanlog = 3))), 2)
MyMatrix
fisher.test(MyMatrix)
## Relation
MyMatrix <- matrix(c(100, 34, 12, 120), nrow = 2,
                   dimnames = list(Result = c("T", "F"),
                                   Prediction = c("T", "F")))
MyMatrix
fisher.test(MyMatrix)
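The factorial formula above gives the probability of one particular 2x2 table with fixed margins; it equals the hypergeometric density, which fisher.test sums over all tables at least as extreme to get its p-value. A check with hypothetical counts:

```r
a <- 3; b <- 1; c <- 1; d <- 3
n <- a + b + c + d
# P = (a+b)! (c+d)! (a+c)! (b+d)! / (n! a! b! c! d!)
p_table <- factorial(a + b) * factorial(c + d) *
           factorial(a + c) * factorial(b + d) /
           (factorial(n) * factorial(a) * factorial(b) *
            factorial(c) * factorial(d))
p_hyper <- dhyper(a, a + b, c + d, a + c)  # hypergeometric density
p_table
p_hyper  # equal to p_table
```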