1. Intro
Logistic regression is used for binary and categorical dependent variables. It models the probability that the dependent variable takes a particular value, whereas a linear model predicts the value itself. A binary dependent variable violates one of the key assumptions of OLS, namely that the dependent variable is measured at least at the interval level, so OLS should not be used in these circumstances. Examples of binary dependent variables are whether someone votes yes or no, or whether someone buys your product. The key to understanding logistic regression is the odds ratio.
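To see why a linear model is a poor fit for a binary outcome, the following sketch compares OLS with logistic regression on simulated data. The data, the seed, and all variable names are hypothetical and chosen only for illustration.

```r
# Simulated data (hypothetical): x drives the probability that y = 1.
set.seed(1)
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(1.5 * x))

ols   <- lm(y ~ x)                      # linear model fitted to a 0/1 outcome
logit <- glm(y ~ x, family = binomial)  # logistic regression

p_ols   <- fitted(ols)    # fitted values from the straight line
p_logit <- fitted(logit)  # fitted probabilities from the logistic curve

range(p_ols)    # a straight line is not constrained to the interval [0, 1]
range(p_logit)  # logistic fitted values stay strictly between 0 and 1
```

The fitted values from the logistic model can be read as probabilities, while the straight line has nothing that keeps it inside the valid range.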
2. Odds Ratio
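Odds are just another way of expressing probability: the odds of an event are p / (1 - p), the probability that it occurs divided by the probability that it does not. If the probability of an event occurring is 10%, then the odds are 1 to 9 (about 0.11). The odds ratio allows comparison of the odds of an event across groups. For example, does an increase in the interest rate cause people to increase or decrease their saving? In the table below, the rows give the direction of the rate change and the columns record whether saving increased: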
Interest rate | Yes | No |
Increase | a | b |
Decrease | c | d |
OR = Odds Ratio = ad/bc. In the two-group example it is the ratio of the products of the complementary outcomes. The outcomes a and d can be seen as complementing one another: you increase your saving if rates go up and decrease your saving if rates go down. Equivalently, the OR is the odds in the first group (a/b) divided by the odds in the second group (c/d). To read an odds ratio: if the odds are the same in both groups, then OR = 1 and there is no relationship between group and outcome. Values different from 1 indicate a relationship.
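The arithmetic above can be sketched in a few lines of R. The cell counts below are hypothetical and exist only to make the calculation concrete; they are not real survey data.

```r
# Probability -> odds: odds = p / (1 - p)
p    <- 0.10
odds <- p / (1 - p)   # 1/9, i.e. odds of 1 to 9

# Hypothetical 2x2 table of counts (illustrative only)
tab <- matrix(c(40, 10,
                15, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(rate   = c("Increase", "Decrease"),
                              saving = c("Yes", "No")))

# OR = ad / bc, using the cell positions from the table in the text
odds_ratio <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Same quantity written as a ratio of the two groups' odds: (a/b) / (c/d)
odds_ratio_alt <- (tab[1, 1] / tab[1, 2]) / (tab[2, 1] / tab[2, 2])

odds_ratio   # 1400 / 150, roughly 9.33
```

With these made-up counts the odds of increased saving are about nine times higher when the rate rises than when it falls.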
3. Link Functions
Link functions are used in generalized linear models (GLMs) to define the relationship between the independent variables and the expected value of the dependent variable.
E(Y) = G^-1(Beta*X)
Where G is the link function.
Note: in the case of OLS, the link function is the identity function, so E(Y) = Beta*X.
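As a small illustration of E(Y) = G^-1(Beta*X), R's family objects expose both the link G (linkfun) and its inverse (linkinv). The values of eta below are hypothetical values of the linear predictor Beta*X, picked only for the example.

```r
eta <- c(-2, 0, 2)   # hypothetical values of the linear predictor Beta * X

logit_link  <- binomial(link = "logit")
probit_link <- binomial(link = "probit")

mu_logit  <- logit_link$linkinv(eta)    # inverse logit, same as plogis(eta)
mu_probit <- probit_link$linkinv(eta)   # inverse probit, same as pnorm(eta)
mu_ident  <- gaussian(link = "identity")$linkinv(eta)  # identity: eta unchanged

# Applying the link G to the mean recovers the linear predictor
eta_back <- logit_link$linkfun(mu_logit)

rbind(eta, mu_logit, mu_probit, mu_ident)
```

Both binary links map any real-valued linear predictor into the interval (0, 1), which is what makes them usable for a binary dependent variable.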
There are two main link functions that allow modeling of binary dependent variables: the Logit and the Probit. The Probit link function assumes a normal distribution. The Logit link function assumes a logistic distribution and is computationally simpler than the Probit.
There is debate over which is the preferred link function. The Probit is preferred when the tails of the distribution are a concern, when the theoretical model predicts a normal distribution, or when the event is a proportion rather than a binary outcome.
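One way to compare the two links in practice is to fit both to the same data. The sketch below does this on simulated data; the sample size, seed, and true coefficients are assumptions made for illustration and do not come from the original text.

```r
# Simulated binary outcome (hypothetical parameters)
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

fit_logit  <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))

# Coefficients are on different scales, so they are not directly comparable
coef(fit_logit)
coef(fit_probit)

# Fitted probabilities, in contrast, can be compared directly
max_diff <- max(abs(fitted(fit_logit) - fitted(fit_probit)))
max_diff

# Information criteria for the two fits
AIC(fit_logit, fit_probit)
```

Comparing fitted probabilities rather than raw coefficients is the safer way to judge how much the choice of link matters for a given data set.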
4. Assumptions
Important assumptions for the Logit and Probit regressions:
Assumption | Effect of violation |
1. a) Logit: conditional probabilities are represented by a logistic function; b) Probit: conditional probabilities are represented by a normal function | Wrong model |
2. No missing or extraneous variables | Inflated error term |
3. Observations are uncorrelated | Inflated error term |
4. No measurement error in the independent variables | Coefficients can be biased |
5. No outliers | Coefficients can be biased |
6. No exact linear relationship between independent variables, and more observations than independent variables | Coefficients are biased |
Assumptions not required
1. The relationship need not be linear
2. Dependent variable does not need to be normally distributed
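Assumption 6 is the easiest one to check mechanically. The sketch below builds a predictor that is an exact linear function of another and shows how glm() reports it; the data are simulated and every name is hypothetical.

```r
# Simulated data in which x2 is an exact linear function of x1
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- 2 * x1                          # violates assumption 6
y  <- rbinom(n, size = 1, prob = plogis(0.8 * x1))

fit <- glm(y ~ x1 + x2, family = binomial)

# glm() cannot separate the two effects and reports NA for the aliased term
coef(fit)

# Names of coefficients that could not be estimated
aliased <- names(coef(fit))[is.na(coef(fit))]
aliased
```

An NA coefficient in the output is therefore a signal to look for redundant predictors before interpreting the rest of the model.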
5. Example Code (R)
library(MASS)    # provides the birthwt data set
library(gplots)  # plotmeans()
library(ROCR)    # prediction(), performance()
library(ineq)    # Lc()

# Fit a logistic regression for low birth weight
Results.Model1 <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                      data = birthwt, family = "binomial")

# Predictions on the log-odds scale, then converted to probabilities
birthwt$Yhat <- predict(Results.Model1, birthwt)
birthwt$Yhat <- exp(birthwt$Yhat) / (1 + exp(birthwt$Yhat))

# Confusion table and misclassification rate
confusion <- function(a, b) {
  tbl <- table(a, b)
  mis <- 1 - sum(diag(tbl)) / sum(tbl)
  list(table = tbl, misclass.prob = mis)
}
confusion(birthwt$Yhat > 0.50, birthwt$low)

# Mean predicted probability by observed outcome, with 95% confidence bars
plotmeans(birthwt$Yhat ~ birthwt$low, p = 0.95)

Pred <- prediction(birthwt$Yhat, birthwt$low)
par(mfrow = c(2, 2))

## ROC curve
perf <- performance(Pred, "tpr", "fpr")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "ROC Curve")

## Precision/recall curve (x-axis: recall, y-axis: precision)
perf <- performance(Pred, "prec", "rec")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Precision/Recall Curve")

## Sensitivity/specificity curve (x-axis: specificity, y-axis: sensitivity)
perf <- performance(Pred, "sens", "spec")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Sensitivity/Specificity Curve")

## Lift chart (x-axis: rate of positive predictions, y-axis: lift)
perf <- performance(Pred, "lift", "rpp")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Lift Chart")

## Lorenz curve of the predicted probabilities, weighted by the observed outcome
lor <- Lc(birthwt$Yhat, birthwt$low, plot = FALSE)
plot(lor, main = "Lorenz Curve")
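To make the output of the confusion() helper easier to read, here is a small worked check on a toy example. The two vectors are hypothetical and short enough to verify by hand; the function body is repeated so the snippet runs on its own.

```r
# Same helper as in the example code, repeated so this snippet is self-contained
confusion <- function(a, b) {
  tbl <- table(a, b)                      # rows: predicted class, columns: observed
  mis <- 1 - sum(diag(tbl)) / sum(tbl)    # share of off-diagonal (wrong) cases
  list(table = tbl, misclass.prob = mis)
}

# Hypothetical predictions and observed outcomes (5 cases, 1 of them wrong)
pred_class <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
observed   <- c(1, 0, 0, 0, 1)

res <- confusion(pred_class, observed)
res$table           # 2 true negatives, 2 true positives, 1 false positive
res$misclass.prob   # 1 wrong out of 5, i.e. 0.2
```

Note that the diagonal only counts correct predictions when both classes appear in the predictions and in the observed values, so the table is 2 x 2 with matching row and column order. If every prediction falls in one class, the misclassification rate from this helper should not be trusted.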