Logistic

1. Intro

Logistic regression is used for binary and categorical dependent variables.  It models the probability that the dependent variable takes a particular value, whereas a linear model predicts the value itself.  A binary dependent variable violates one of the key assumptions of OLS, that the dependent variable is measured at least at the interval level, so OLS should not be used in these circumstances.  Examples of binary dependent variables are whether someone votes yes or no, or whether someone buys your product.  The key to understanding logistic regression is the odds ratio.

2. Odds Ratio

Odds are another way of expressing a probability: the odds of an event are the probability that it occurs divided by the probability that it does not.  If the probability of an event is 10%, the odds are 1 to 9 (0.1/0.9, about 0.11).  The odds ratio allows the odds of an event to be compared across groups.  For example, consider whether people increase their savings (Yes or No) after the interest rate increases or decreases:

            Yes   No
Increase    a     b
Decrease    c     d

OR = Odds Ratio = ad/bc, which is the odds of Yes in the first group (a/b) divided by the odds of Yes in the second group (c/d).  In the two-group example it is the ratio of the products of the complementary outcomes.  The outcomes a and d can be seen as complementing one another: you increase your savings if rates go up and decrease your savings if rates go down.  To read an odds ratio: if the odds are the same in both groups then OR = 1, and there is no relationship between the group and the outcome.  Values greater than 1 indicate that the outcome is more likely in the first group, and values less than 1 indicate that it is less likely.
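As a rough illustration of the formula above, the following R sketch computes the odds ratio for a 2x2 table.  The counts are invented for illustration and do not come from any real survey, and the object names (tab, odds_ratio, long, fit) are hypothetical.  Under those assumptions it also shows that a logistic regression with a single binary predictor gives the same odds ratio once the slope is exponentiated.

```r
# Hypothetical counts (invented for illustration only).
# Rows: direction of the interest-rate change; columns: did savings increase?
tab <- matrix(c(40, 10,    # a, b
                20, 30),   # c, d
              nrow = 2, byrow = TRUE,
              dimnames = list(rate = c("Increase", "Decrease"),
                              saved_more = c("Yes", "No")))

# Odds of saving more within each group
odds_increase <- tab["Increase", "Yes"] / tab["Increase", "No"]  # a / b
odds_decrease <- tab["Decrease", "Yes"] / tab["Decrease", "No"]  # c / d

# Odds ratio = (a / b) / (c / d) = ad / bc
odds_ratio <- odds_increase / odds_decrease
print(odds_ratio)

# The same quantity from a logistic regression fitted to the grouped counts
long <- data.frame(rate_up = c(1, 1, 0, 0),
                   saved   = c(1, 0, 1, 0),
                   n       = c(40, 10, 20, 30))
fit <- glm(saved ~ rate_up, family = binomial, weights = n, data = long)
print(exp(coef(fit)["rate_up"]))
```

With these made-up counts the odds ratio is (40/10)/(20/30) = 6, and the exponentiated slope from glm() agrees with it up to numerical tolerance.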

3. Link Functions

Link functions are used in generalized linear models (GLMs) to define the relationship between the linear combination of the independent variables and the expected value of the dependent variable.

E(Y) = G^-1(Beta*X)

Where G is the link function.

Note that in the case of OLS, the link function is the identity function.

There are two main link functions that allow modeling of binary dependent variables: the Logit and the Probit.  The Probit link function assumes a normal distribution.  The Logit link function assumes a logistic distribution and is computationally simpler than the Probit.

There is debate over which link function is preferable.  The Probit is better when the tails of the distribution are a concern, when the theoretical model predicts a normal distribution, or when the event is a proportion rather than a binary outcome.
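To make the two links concrete, here is a small R sketch using functions from base R: qlogis() and plogis() are the logit and its inverse, and qnorm() and pnorm() are the probit and its inverse.  The probabilities in p are arbitrary illustrative values, and the object names are hypothetical.

```r
# Arbitrary probabilities chosen for illustration
p <- c(0.1, 0.5, 0.9)

# Logit link: G(p) = log(p / (1 - p)), the log-odds
eta_logit <- qlogis(p)
# Probit link: G(p) = quantile of the standard normal distribution
eta_probit <- qnorm(p)

# Inverse links map the linear predictor back to a probability, E(Y) = G^-1(eta)
p_logit  <- plogis(eta_logit)   # 1 / (1 + exp(-eta))
p_probit <- pnorm(eta_probit)

print(rbind(p, eta_logit, eta_probit))

# The same functions are stored inside the family objects that glm() uses
logit_family  <- binomial(link = "logit")
probit_family <- binomial(link = "probit")
print(logit_family$linkinv(0))   # a linear predictor of 0 maps to probability 0.5
print(probit_family$linkinv(0))  # likewise 0.5 under the probit link
```

Both links are symmetric around a probability of 0.5, where the linear predictor is 0; they differ mainly in how quickly the probability approaches 0 or 1 in the tails.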

4. Assumptions

Important assumptions for Logit and Probit regression:

Assumption                                                   Effect of violation
1. a) Logit: conditional probabilities are represented       Wrong model
      by a logistic function
   b) Probit: conditional probabilities are represented
      by a normal function
2. No missing or extraneous variables                        Inflated error term
3. Observations are uncorrelated                             Inflated error term
4. No measurement error in the independent variables         Coefficients can be biased
5. No outliers                                               Coefficients can be biased
6. No exact linear relationship between independent          Coefficients are biased
   variables, and more observations than independent
   variables

Assumptions not required

1. The relationship need not be linear

2. Dependent variable does not need to be normally distributed

5. Example Code (R)


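library(MASS)  # provides the birthwt data set

# Fit a logistic regression for low birth weight (low = 1) using the binomial family
Results.Model1 <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                      data = birthwt, family = "binomial")

# predict() returns the linear predictor (log-odds) by default
birthwt$Yhat <- predict(Results.Model1, birthwt)
# Apply the inverse logit to convert the log-odds into predicted probabilities
birthwt$Yhat <- exp(birthwt$Yhat) / (1 + exp(birthwt$Yhat))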

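# Cross-tabulate predicted against observed classes and compute the misclassification rate
confusion <- function(a, b) {
  tbl <- table(a, b)
  mis <- 1 - sum(diag(tbl)) / sum(tbl)  # share of off-diagonal (misclassified) cases
  list(table = tbl, misclass.prob = mis)
}

# Classify as low birth weight when the predicted probability exceeds 0.50
confusion(birthwt$Yhat > 0.50, birthwt$low)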

library(gplots)  # provides plotmeans()

# Mean predicted probability by observed outcome, with 95% confidence intervals
plotmeans(Yhat ~ low, data = birthwt, p = 0.95)

library(ROCR)

# Build a ROCR prediction object from the predicted probabilities and observed labels
Pred <- prediction(birthwt$Yhat, birthwt$low)
par(mfrow = c(2, 2))  # arrange the four plots in a 2 x 2 grid

## ROC curve (x-axis: false positive rate, y-axis: true positive rate)
perf <- performance(Pred, "tpr", "fpr")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "ROC Curve")

## Precision/recall curve (x-axis: recall, y-axis: precision)
perf <- performance(Pred, "prec", "rec")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Precision/Recall Curve")

## Sensitivity/specificity curve (x-axis: specificity, y-axis: sensitivity)
perf <- performance(Pred, "sens", "spec")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Sensitivity/Specificity Curve")

## Lift chart (x-axis: rate of positive predictions, y-axis: lift)
perf <- performance(Pred, "lift", "rpp")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Lift Chart")

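library(ineq)

# Lorenz curve of the predicted probabilities; the second argument (n) weights
# each observation, so here only the low-birth-weight cases (low = 1) contribute
lor <- Lc(birthwt$Yhat, birthwt$low, plot = FALSE)
plot(lor, main = "Lorenz Curve")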
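
Since the odds ratio is central to reading a logistic regression, one possible follow-up is to exponentiate the fitted coefficients, which turns each log-odds coefficient into an odds ratio.  The sketch below refits the same model so that it runs on its own.  The names fit, odds_ratios and ci are illustrative, and the Wald-type intervals from confint.default() are one choice among several, not a method taken from the example above.

```r
library(MASS)  # provides the birthwt data set

fit <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
           data = birthwt, family = "binomial")

# exp(coefficient) is the multiplicative change in the odds of low birth weight
# for a one-unit increase in that predictor, holding the others fixed
odds_ratios <- exp(coef(fit))

# Wald confidence intervals on the log-odds scale, then exponentiated
ci <- exp(confint.default(fit, level = 0.95))

print(round(cbind(OR = odds_ratios, ci), 3))
```

An odds ratio above 1 suggests the predictor is associated with higher odds of low birth weight, and one below 1 with lower odds; an interval that contains 1 is consistent with no relationship.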