Theory Behind the Model

You must answer one question before you begin running any statistical technique: what is your theory about the true underlying model? If you do not have a theory, find one, any one. Read the literature or ask an expert. Without a theory you may find an answer, but only by dumb luck will it be right.

Suppose you were given the task of estimating demand for your product. Without any theory you put in the price of the product and the quantity purchased and, lo and behold, as you increase the price the quantity purchased also increases! You run to the marketing department, who take your advice and double the price. The next week you go out of business because no one bought your product. Why? The theory of supply and demand tells us why. Quantity purchased rose when price rose because demand for the product was increasing while the data was collected. Your model violated an assumption of OLS: no simultaneity.

You need to understand the theory of supply and demand to model this problem correctly. The same is true of all problems. There is a growing trend to use software without any theoretical forethought. Do not follow this trend.

Distributions

Distributions are critical to understanding the properties of stochastic variables. The distribution of a variable describes the likelihood that the variable will take on a particular value or, in the continuous case, fall within a particular interval.

Important Terms
a) Continuous: The values of the distribution are expressed as continuous numbers (1.001, 1.002, 1.003, ...).
b) Discrete: The values of the distribution are expressed as whole numbers (1, 2, 3, ...).
c) Mean: The mean, or average, is the central tendency of a population. It is the sum of the values divided by the size of the population.
d) Variance: The variance measures how much the values in a sample differ from the mean on average.
e) Standard Deviation: The standard deviation is the square root of the variance.
f) Kurtosis: Kurtosis measures the peakedness of a distribution.
g) Skewness: Skewness measures the degree to which a distribution is not symmetric.
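
To make these terms concrete, here is a minimal R sketch computing each one on a random sample. The e1071 package is an assumption (the text names no package); it supplies skewness() and kurtosis(), which base R lacks.

library(e1071)  # assumed package providing skewness() and kurtosis()

x <- rnorm(1000, mean = 5, sd = 2)

mean(x)      # central tendency: sum of the values divided by their count
var(x)       # average squared deviation from the mean
sd(x)        # square root of the variance
skewness(x)  # degree of asymmetry; near 0 for a symmetric distribution
kurtosis(x)  # peakedness; excess kurtosis is near 0 for a normal distribution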

Common Distributions

Normal

The most familiar distribution, the Normal or Gaussian distribution is used in a wide variety of tests and models. A key assumption of many models is the normality of the error structure.

It is symmetric and continuous.

Log Normal

The lognormal distribution is related to the Normal distribution: the logarithms of values from a lognormal distribution are normally distributed.

It is asymmetric and continuous.

Bernoulli

If there are only two possible outcomes from an experiment, then the number of successes has a Bernoulli distribution.

It is discrete and, unless the two outcomes are equally likely, asymmetric.

Binomial

Used in repeated independent trials with two outcomes: x successes in n trials, sampling with replacement, follow a Binomial distribution.

It is discrete, and symmetric only when the two outcomes are equally likely.

Multinomial

When there are more than two outcomes, use a multinomial instead of a binomial.

Negative Binomial

Used to answer the question: how many trials are required for the kth success to occur?

It is asymmetric and discrete.

Geometric

A special case of the negative binomial with k = 1: how many trials are required for the first success to occur?

It is asymmetric and discrete.

Hypergeometric

The hypergeometric is a binomial without replacement. It answers the question: how many successes in n trials when sampling without replacement?

It is discrete.

Poisson

It is computationally simpler than the binomial and can be used to answer similar questions. The Poisson distribution is used in queuing models for arrival rates.

It is asymmetric and discrete.
Gamma

The Gamma is often used to answer questions concerning the waiting time between events drawn from a Poisson process: what is the probability of waiting less than x?

It is asymmetric and continuous.

Erlang

A special case of the Gamma distribution, the Erlang is also used in queuing models. It is more flexible than the exponential and is used when the number of servers is finite.

It is asymmetric and continuous.

Exponential

A special case of the Gamma distribution, it is often used to estimate waiting times in queues.

It is asymmetric and continuous.

Chi-Squared

Often used in inferential statistics, the Chi-Squared distribution is a special case of the Gamma distribution.

It is asymmetric and continuous.

Beta

Used in Bayesian inference, the Beta distribution can take on a variety of different shapes.

It is continuous, and symmetric or asymmetric depending on its parameters.

Uniform

When the probability is constant across the range of possible values, the variable has a uniform distribution.

It is symmetric and continuous.
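
As a quick illustration of working with these distributions, base R follows a consistent naming scheme: the prefixes d, p, q, and r give the density, cumulative probability, quantile, and random draws for each distribution. A minimal sketch (the parameter values are arbitrary):

rnorm(5, mean = 0, sd = 1)        # Normal draws
rlnorm(5, meanlog = 0, sdlog = 1) # Log-normal draws
rbinom(5, size = 10, prob = 0.3)  # Binomial: successes in 10 trials
rpois(5, lambda = 2)              # Poisson counts
rgamma(5, shape = 3, rate = 2)    # Gamma waiting times
pgamma(1, shape = 3, rate = 2)    # Gamma: P(waiting less than 1)
punif(0.25, min = 0, max = 1)     # Uniform cumulative probability at 0.25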

Logistic

1. Intro

Logistic regressions are used for binary and categorical dependent variables. They model the likelihood that the dependent variable will take on a particular value, whereas a linear model predicts the value itself. A binary dependent variable violates one of the key assumptions of OLS, that variables are at least interval-level data, so in these circumstances OLS cannot be used. Examples of a binary dependent variable are whether someone votes yes or no, or whether someone buys your product. The key to understanding logistic regression is the odds ratio.

2. Odds Ratio

Odds are just another form of probability. If the probability of an event occurring is 10%, then the odds are 1 to 9 (0.10/0.90). The odds ratio allows comparison of the probability of an event across groups. For example, does an increase in the interest rate cause people to increase or decrease their saving?

           Yes   No
Increase    a     b
Decrease    c     d

OR = Odds Ratio = ad/bc. In the two-group example it is the ratio of the products of the complementary outcomes. The outcomes a and d can be seen as complementing one another: you increase your savings if rates go up and decrease your savings if rates go down. How to read an odds ratio: if all cells are the same then OR = 1, and there is no relationship between the groups. Values different from one indicate a relationship.
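
A minimal sketch of the calculation, using made-up counts for the savings example (the numbers are purely illustrative):

## Hypothetical 2x2 table: rows = savings response, columns = rate increase
tbl <- matrix(c(60, 20,   # Increase: a = 60 (Yes), b = 20 (No)
                15, 45),  # Decrease: c = 15 (Yes), d = 45 (No)
              nrow = 2, byrow = TRUE,
              dimnames = list(Savings = c("Increase", "Decrease"),
                              RateUp = c("Yes", "No")))

## OR = ad/bc; here (60 * 45) / (20 * 15) = 9, far from 1,
## suggesting a relationship between rates and savings behavior
OR <- (tbl[1, 1] * tbl[2, 2]) / (tbl[1, 2] * tbl[2, 1])
OR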

3. Link Functions

Link functions are used in generalized linear models (GLMs) to define the relationship between the independent variables and the distribution of the dependent variable.

E(Y) = G^-1(Beta*X)

Where G is the link function.

Note that in the case of OLS, the link function is the identity function.

There are two main link functions that allow modeling of binary dependent variables: the Logit and the Probit. The Probit link function assumes a normal distribution. The Logit link assumes a logistic distribution and is computationally simpler than the Probit.

There is debate over which is the preferred link function. The Probit is better when the tails of the distribution are a concern, when the theoretical model predicts a normal distribution, or when the event is a proportion rather than a binary outcome.
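
A minimal sketch comparing the two links on the same simulated binary outcome; in practice their fitted probabilities are usually very close:

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.2 * x))  # simulated binary outcome

logit.fit  <- glm(y ~ x, family = binomial(link = "logit"))
probit.fit <- glm(y ~ x, family = binomial(link = "probit"))

## The two sets of fitted probabilities track each other closely
cor(fitted(logit.fit), fitted(probit.fit))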

4. Assumptions

Important assumptions for the Logit & Probit regression

Assumption (effect of violation in parentheses):

1. Logit: conditional probabilities are represented by a logistic function; Probit: conditional probabilities are represented by a normal function. (Violation: wrong model.)

2. No missing or extraneous variables. (Violation: inflated error term.)

3. Observations are uncorrelated. (Violation: inflated error term.)

4. No measurement error in the independent variables. (Violation: coefficients can be biased.)

5. No outliers. (Violation: coefficients can be biased.)

6. No exact linear relationship between independent variables, and more observations than independent variables. (Violation: coefficients are biased.)

Assumptions not required

1. The relationship need not be linear

2. Dependent variable does not need to be normally distributed

5. Example Code (R)


library(MASS)    # provides the birthwt data set
library(gplots)  # provides plotmeans()

Results.Model1 <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                      data = birthwt, family = "binomial")

## Convert the linear predictor to predicted probabilities (inverse logit)
birthwt$Yhat <- predict(Results.Model1, birthwt)
birthwt$Yhat <- exp(birthwt$Yhat) / (1 + exp(birthwt$Yhat))

## Confusion matrix and misclassification rate
confusion <- function(a, b){
  tbl <- table(a, b)
  mis <- 1 - sum(diag(tbl)) / sum(tbl)
  list(table = tbl, misclass.prob = mis)
}

confusion(birthwt$Yhat > 0.50, birthwt$low)

## Mean predicted probability by observed outcome, with 95% confidence bars
plotmeans(Yhat ~ low, data = birthwt, p = 0.95)

library(ROCR)

Pred <- prediction(birthwt$Yhat, birthwt$low)
par(mfrow = c(2, 2))

## ROC curve
perf <- performance(Pred, "tpr", "fpr")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "ROC Curve")

## Precision/recall curve (x-axis: recall, y-axis: precision)
perf <- performance(Pred, "prec", "rec")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Precision/Recall Curve")

## Sensitivity/specificity curve (x-axis: specificity, y-axis: sensitivity)
perf <- performance(Pred, "sens", "spec")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Sensitivity/Specificity Curve")

## Lift chart
perf <- performance(Pred, "lift", "acc")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Lift Chart")

## Lorenz curve of the predicted probabilities
library(ineq)
lor <- Lc(birthwt$Yhat, birthwt$low)
plot(lor, main = "Lorenz Curve")

Statistical Tests

1. Intro

As a statistician, people will often come to you in hopes of being told the nature of reality, what really happened. You cannot do that; you can only give insight into what is statistically likely to have happened.

2. Hypothesis

A statistical hypothesis is a supposition about the relationship between two or more data elements.

a) The Null Hypothesis

This is the hypothesis to be tested. It is the alternative to the theory you are researching. For example, if you are researching a model that predicts Y is related to X, then the Null Hypothesis is that there is no relationship between Y and X; the relationship is zero, or null.

b) Type I Error

The Null Hypothesis is falsely rejected: rejecting the null hypothesis when it is true.

c) Type II Error

The null hypothesis is falsely accepted: failing to reject the null hypothesis when it is false.

d) Type III Error

This is not an official term, but it often refers to wrongly categorizing a treatment.
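
A short simulation sketch of the Type I error rate: when the null hypothesis is true and we test at the 0.05 level, we should falsely reject about 5% of the time.

set.seed(1)
## Both samples come from the same population, so the null is true
p.values <- replicate(5000, t.test(rnorm(30), rnorm(30))$p.value)

## Share of false rejections; should be close to the 0.05 significance level
mean(p.values < 0.05)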

3. Common Tests

Z-Test

The Z-Test is used to test whether a sample mean is significantly different from the population mean when the population variance is known.

Use

Testing a sample's mean

Assumption

Requires normality
Known variance

Calculation

Z = (xbar - mu) / (sigma / sqrt(N))

xbar = sample mean

mu = population mean

sigma^2 = population variance

N = sample size

Example Code(R)

library(BSDA)
## Small sample
x <- rnorm(10)
z.test(x, sigma.x = 1)
## Larger sample: narrower confidence interval
x <- rnorm(100)
z.test(x, sigma.x = 1)

T-Test

The T-Test, or Student's t, is used to determine whether the means of two randomly drawn samples are statistically different without requiring the population variance to be known. It is also used to test the mean of a population against a specific value, as in the case of parameters in regression models.

Use

Compare means between samples
Compare mean of a sample to a specific value.

Assumption

Requires normality

Calculation

t = (mean1 - mean2) / sqrt(Var1/n1 + Var2/n2)

mean1, mean2 = means of samples 1 and 2

Var1 = variance of sample 1

Var2 = variance of sample 2

n1, n2 = sample sizes

Example Code(R)

## Two groups of 10, drawn from populations with different means
MyData <- data.frame(group = c(rep(1, 10), rep(0, 10)))
MyData$x[1:10]  <- rnorm(10, mean = 0)
MyData$x[11:20] <- rnorm(10, mean = 5)
t.test(x ~ group, alternative = "two.sided", conf.level = 0.95,
       var.equal = FALSE, data = MyData)

Kruskal-Wallis Test

Tests whether two or more random samples could have come from populations with the same median. It does not assume a normal distribution.

Use

Compare two or more samples when distributional assumptions (such as normality) cannot be met.

Assumption

Variance is the same in all samples.

Calculation

H = (12 / (N(N+1))) * sum( R_j^2 / n_j ) - 3(N+1)

R_j = sum of ranks in group j, n_j = size of group j, N = total number of observations

Example Code(R)

data(beaver1)
kruskal.test(temp ~ activ, data = beaver1)

F-Test  

The F-test is used to test whether the means of two or more sample populations differ significantly.

Use

Means test.

Assumption

Assumes a normal distribution

Calculation

F = (variance between the group means) / (variance within the groups)

Example Code(R)

oneway.test(temp ~ activ, data = beaver1)

Shapiro-Wilk Normality Test

The Shapiro-Wilk test is used to determine whether the population a sample came from is normally distributed.

Use
Test distributional assumptions.

Assumption

Calculation

Example Code(R)

x <- rnorm(1000, mean = 0, sd = 1)
shapiro.test(x)  # normal sample: a large p-value is expected
x <- rgamma(1000, 10, 1)
shapiro.test(x)  # gamma sample: normality should be rejected

Kolmogorov-Smirnov test

The Kolmogorov-Smirnov, also known as the K-S test, is used to determine whether a sample came from a specified distribution or whether two samples came from the same distribution. Unlike the Shapiro-Wilk, it is not limited to testing normality.

Use

Test distributional assumptions.

Assumption

Calculation

D = max |F1(x) - F2(x)|, the largest vertical distance between the two cumulative distribution functions

Example Code(R)

## 1) Samples from the same distribution?

x <- rnorm(20, mean = 0)
y <- rnorm(20, mean = 0)
ks.test(x,y)

## 2) Sample from a specific distribution?
ks.test(x + 2, "pgamma", 3, 2)

Wilcoxon Signed Rank
The Wilcoxon Signed Rank test determines whether two related or repeated samples could have come from the same population. It is an alternative to the paired T-Test and does not have the same reliance on the assumption of normality that the t-test has.

Use
Compare means between related (paired) samples.

Assumption

The paired differences come from a symmetric distribution

Calculation

Example Code(R)

## Paired samples from the same population
x <- rnorm(100, mean = 0)
y <- rnorm(100, mean = 0)
wilcox.test(x, y, paired = TRUE)
## Paired samples from different populations
x <- rnorm(100, mean = 0)
y <- rnorm(100, mean = 1)
wilcox.test(x, y, paired = TRUE)

Mann-Whitney

The Mann-Whitney test (also known as the Wilcoxon Rank Sum test) tests whether two samples could have come from the same population. Like the Wilcoxon Signed Rank, it does not have the same reliance on the assumption of normality. Unlike the signed-rank test, the samples do not need to be related.

Use

Compare means between samples

Independence

Assumption

Samples come from distributions with the same shape

Calculation

U = R1 - n1(n1 + 1)/2

R1 = sum of ranks in sample 1, n1 = size of sample 1

Example Code(R)

x <- rnorm(800, mean = 0)
y <- rnorm(1000, mean = 0)
wilcox.test(x, y)
x <- rnorm(800, mean = 0)
y <- rnorm(1000, mean = 1)
wilcox.test(x, y)

Log-likelihood Ratio (G-Test)

The log-likelihood ratio test, sometimes known as the G-Test, is commonly used to test the independence of two variables. Like the Chi-Squared, it is useful in determining whether an event had any effect on an outcome.

Use

Independence

Goodness of fit

Calculation

G = 2 * sum( Observed(i) * ln( Observed(i) / Expected(i) ) )

Example Code(R)

 
library(hierfstat)
e <- rlnorm(20, meanlog = 1)
prob1 <- rlnorm(20, meanlog = 3)
prob2 <- rlnorm(20, meanlog = 1)
x <- prob1 + 0.6 * e
### Related by the equation above
test1 <- g.stats.glob(data.frame(x, prob1))
### No relation
test2 <- g.stats.glob(data.frame(x, prob2))
test1$g.stats
test2$g.stats

Pearson’s Chi-Squared Test

Pearson's chi-squared test is an approximation of the log-likelihood ratio test. Both test the independence of two variables.

Use

Independence

Goodness of fit

Assumption

1. The sample is randomly drawn.

2. Observations are independent.

3. Data are counts, not percentages.

Calculation

Chi-Squared = sum( ( (Observed(i) - Expected(i))^2 ) / Expected(i) )

Example Code(R)

e <- rlnorm(20, meanlog = 1)
prob1 <- rlnorm(20, meanlog = 3)
prob2 <- rlnorm(20, meanlog = 1)
x <- prob1 + 0.6 * e
### Related by the equation above
chisq.test(x, p = prob1, rescale.p = TRUE)
### No relation
chisq.test(x, p = prob2, rescale.p = TRUE)

Fisher's Exact Test

The Fisher Exact Test is used to test whether a significant association between two variables exists. It is used with contingency tables with small sample sizes. Note: with large samples, use a Chi-Squared test.

Use
Independence

Goodness of fit

Assumption
Calculation

           Method1   Method2
Outcome1      a         b
Outcome2      c         d

P = ( (a+b)! (c+d)! (a+c)! (b+d)! ) / ( n! a! b! c! d! )

Example Code(R)

## Random relationship
MyMatrix <- matrix(c(round(rlnorm(1, meanlog = 3)), round(rlnorm(1, meanlog = 3)),
                     round(rlnorm(1, meanlog = 3)), round(rlnorm(1, meanlog = 3))), 2)
MyMatrix
fisher.test(MyMatrix)

## Relation
MyMatrix <- matrix(c(100, 34, 12, 120), nrow = 2,
                   dimnames = list(Result = c("T", "F"),
                                   Prediction = c("T", "F")))
MyMatrix
fisher.test(MyMatrix)