Oftentimes our data conspires against us, violating essential assumptions of the standard linear model. One critical assumption is that the error terms of a model must be uncorrelated. There can be many causes of such correlation: repeated measurements from the same source (as with customers over time), missing variables, or data that is skewed over time. If those variations are linear, you can model them by adding linear random effects (i.e., allow the intercept to vary by the source of the variation). One such solution is the Generalized Linear Mixed Model (GLMM). It is called a mixed model because it has both random effects and fixed effects. Remember, fixed effects are assumed to have no measurement error and to generalize across groups (no group effects); random effects can have measurement bias or group effects.
Example one: suppose the probability of a customer returning increases after each visit. If this increase in probability is linear (e.g., after the first visit 10% more likely to return, after the second visit 20% more likely, after the third visit 30% more likely, ...), then by adding a linear random effect that varies by customer you can correctly model this. Oftentimes if you do not correct for this, a related variable such as customer age (because the older you are, the more likely you are to have returned more than once) may proxy for it, leading to incorrect conclusions.
Example two: suppose you are collecting data at the county level. Each county may have a different baseline. Imagine crime rates across counties: the base rate of crime will vary from county to county while the influence of, say, the poverty rate stays the same. If you did not allow the intercept to vary, the relationship between poverty and crime may be obscured by the base-rate variation across counties.
GLMMs do not correct for non-linear variation across sampling units. If you believe customer retention has a non-linear relationship with the number of visits, or that the influence of poverty varies by county, then a GLMM alone cannot correct for this. In many circumstances, however, such as the influence of poverty on crime, the variation across counties is due to missing or latent variables such as the support programs available for individuals at or below the poverty line. Likewise, the fixed effect can also be driven by omitted or latent variables. The base crime rate across counties is a function of general underlying conditions that oftentimes cannot be measured. Such latent variables are often impossible to uncover, so using a GLMM is an acceptable solution to correct for this missing variable bias.
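As a minimal sketch, the customer example can be fit with the lme4 package, assuming a hypothetical data frame visits with columns returned (0/1), visit_number, and customer_id:

library(lme4)
# Random intercept by customer: each customer gets their own baseline
# probability of returning, while the visit_number effect stays fixed.
m1 <- glmer(returned ~ visit_number + (1 | customer_id),
            data = visits, family = binomial)
summary(m1)   # fixed effects plus the variance of the customer-level intercept

The (1 | customer_id) term is what lets the intercept vary by customer; the rest is an ordinary logistic regression.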
Further Reading
http://support.sas.com/rnd/app/papers/glimmix.pdf
http://www.wiley.com/legacy/wileychi/eosbs/pdfs/bsa251.pdf
http://www.stat.umu.se/forskning/reports/glmmML.pdf
http://web.maths.unsw.edu.au/~wand/kowpap.pdf
http://arxiv.org/PS_cache/math/pdf/0606/0606491v1.pdf
http://staff.pubhealth.ku.dk/~pd/mixed-jan.2006/glmm.pdf
http://www.stat.umn.edu/geyer/bernor/library/bernor/doc/examples.pdf
1. Intro
Understanding and using diagnostics separates a good statistician from a hack. It is not enough just to run the diagnostics; you must challenge your model with a critical eye. Building a model is simple; assuring it is stable and accurate is a work of art. There are a variety of tools and tests that can aid you in evaluating your model. When doing diagnostics never assume anything; always seek proof.
2. Tools for linear models
a. QQ plot
The QQ or quantile-quantile plot shows the residual errors for the first and last quantiles of the dependent variable plotted against a 45-degree line. This allows you to see how well the model fits at both extremes. A model that fits poorly will appear to curl up on itself on the plot. Having a model that fits poorly at the extremes is not a good thing, but oftentimes it is not a showstopper. By setting maximum allowable values for the model it can still be useful in segmenting cases. To correct for poorly fitting tails, look for new explanatory variables or double check to see if you missed any non-linearities that could be confusing the system.
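A minimal sketch in R, assuming a fitted lm object named fit (a hypothetical name used throughout these sketches):

qqnorm(residuals(fit), main = "Normal Q-Q Plot of Residuals")  # sample vs. theoretical quantiles
qqline(residuals(fit))                                         # reference 45-degree line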
b. Residual Plots
By observing the residual plots, much can be uncovered about how the model is performing. A key thing to look for is any pattern in the plots. Since the residuals should be random, there should be no observable trend in them. Any observable pattern indicates trouble with the model.
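A minimal sketch, again assuming the fitted lm object fit:

plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # residuals should form an even, patternless band around this line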
c. R-Squared
R-Squared measures the proportion of the variation of the dependent variable explained (I am using that term very loosely) by the model. R-Squared has poor standing among statisticians but can be useful if it is not the only measure of fitness of the model. It ranges from zero to one, with one being a perfect fit. One is only possible if you include the dependent variable as an explanatory variable and therefore is an indication of error. With the data I typically look at, a good model typically ranges from .1 to .3; however, I have seen models in production working well with an R-Squared as low as .07.
R^2 = 1 - (y - Xb)'(y - Xb) / sum((y - yBar)^2)
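As a minimal sketch, assuming a fitted lm object fit and its response vector y (hypothetical names), the reported value and the hand computation from the formula above should agree:

summary(fit)$r.squared                              # R-squared as reported by lm
1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)    # hand computation from the formula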
d. MSE
MSE, or Mean Squared Error, is useful in choosing between multiple models. It is simply the average of the squared errors.
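A one-line sketch, again assuming the fitted lm object fit:

mean(residuals(fit)^2)   # mean squared error; compare across candidate models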
e. 1. Partial Regression
Partial regressions are an important tool for determining how the independent variables affect the model as well as one another. A partial regression shows the net effect of an independent variable after correcting for the other regressors.
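Added-variable plots are the graphical form of a partial regression; a minimal sketch assuming the car package and the fitted lm object fit:

library(car)
avPlots(fit)   # one added-variable (partial regression) plot per regressor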
e. 2. Partial Residual Plots
Partial residual plots are residuals plotted against each independent variable's value. This shows how the residuals of the model vary as the value of the independent variable changes. This will uncover situations such as a variable at high values causing too great a variation in the model, leading to high residuals. In this case you would cap the independent variable. What you want to see is an even cloud of data points with a zero slope, centered on zero.
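A minimal sketch using the car package's component-plus-residual plots, which are partial residual plots, again assuming the fitted lm object fit:

library(car)
crPlots(fit)   # partial residual plot for each regressor; look for a flat, even cloud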
f. T-stats on Coefficients
The T-statistics on the regressors test the null hypothesis that the coefficient is zero, that is, that it has no effect on the model. If you cannot statistically justify a variable's inclusion in the model, it is preferred to remove it. Reasons for a variable failing a t-test can range from it having no relation with the dependent variable, to non-linearities influencing the results, to other independent variables clouding the true relationship. If there are firm theoretical reasons for including the variable, investigate further.
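The t-statistics and their p-values appear in the coefficient table of the model summary; a minimal sketch assuming the fitted lm object fit:

summary(fit)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)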
g. Economic Significance of Coefficients
An independent variable may be statistically significant but have no explanatory power. By calculating the economic significance of a variable you can roughly measure its contribution to the overall value of the dependent variable. The economic significance of a coefficient is the coefficient times the standard deviation of the independent variable. There is no clear definition of whether a coefficient is economically significant; instead a researcher has to look at the values and decide for herself whether a given coefficient has enough, well, oomph to be considered important. It is a powerful tool to rid yourself of those pesky statistically significant but unimportant variables.
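A minimal sketch, assuming the fitted lm object fit, the data frame d it was fit on, and numeric regressors whose names match the coefficient names (all hypothetical):

vars <- names(coef(fit))[-1]            # drop the intercept
coef(fit)[vars] * sapply(d[vars], sd)   # coefficient times regressor standard deviation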
h. Cook's Distance
Cook's test is used to uncover outliers in the data. Using the Cook's distance values you can target outliers for removal. It should be remembered that not all outliers should be removed; some are representative of important behavior in the system. In modeling weather in Florida, hurricanes may look like outliers in the data, but they are a critical feature to model.
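A minimal sketch, assuming the fitted lm object fit:

cd <- cooks.distance(fit)
plot(cd, type = "h", ylab = "Cook's distance")
which(cd > 4 / length(cd))   # a common rule-of-thumb cutoff for flagging influential points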
i. Chow Test
The Chow test is used to test for structural or regime changes within the data. In monetary and other financial models it is an important test. If a structural change seldom occurs, modeling the change using dummy variables can be a good choice, but if structural changes occur often you may need to model the underlying causes of those changes to have any chance of forecasting the process.
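A minimal hand-rolled sketch of the Chow F-statistic, assuming a data frame d ordered in time and a hypothesized break after observation brk (hypothetical names):

chow.test <- function(formula, data, brk) {
  n <- nrow(data)
  k <- length(coef(lm(formula, data = data)))              # parameters, incl. intercept
  rss.pooled <- sum(residuals(lm(formula, data = data))^2)
  rss.split  <- sum(residuals(lm(formula, data = data[1:brk, ]))^2) +
                sum(residuals(lm(formula, data = data[(brk + 1):n, ]))^2)
  f <- ((rss.pooled - rss.split) / k) / (rss.split / (n - 2 * k))
  list(F = f, p.value = pf(f, k, n - 2 * k, lower.tail = FALSE))
}
chow.test(y ~ x, data = d, brk = 50)   # test for a break after the 50th observation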
j. Durbin-Watson
Durbin-Watson (DW) is the standard test for serial correlation (autocorrelation). Remember, serial correlation violates the BLUE conditions and biases the estimated standard errors, and you can employ autoregressive models to correct for it. When investigating time series data you always have to be conscious of the DW statistic.
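A minimal sketch with the lmtest package, assuming the fitted lm object fit:

library(lmtest)
dwtest(fit)   # DW statistic near 2 suggests no first-order serial correlation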
k. Bag Plot
Bag plots uncover outliers in the data and are useful in conjunction with Cook's test.
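A minimal sketch, assuming the aplpack package and two numeric vectors x and y (hypothetical names):

library(aplpack)
bagplot(x, y)   # bivariate boxplot; points plotted outside the loop are candidate outliers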
l. White Test
The White test is a standard test for heteroskedasticity. Heteroskedasticity leaves the coefficient estimates unbiased but biases their estimated standard errors, which can lead to incorrect inference and to excluding relevant variables.
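The lmtest package's bptest can run a White-style test by supplying the squares and cross products of the regressors as the auxiliary terms; a minimal sketch assuming a model with two regressors x1 and x2 in a data frame d (hypothetical names):

library(lmtest)
fit <- lm(y ~ x1 + x2, data = d)
bptest(fit, ~ x1 * x2 + I(x1^2) + I(x2^2), data = d)   # levels, squares, and cross product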
3. Tools for Probabilistic and Categorical Models
a) Odds Ratios
The odds ratios for each independent variable indicate whether to keep that variable in the model. If the odds ratio is 1, that variable does not help the predictive power of the model, while an odds ratio statistically significantly greater than or less than one indicates the variable has predictive power.
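A minimal sketch, assuming a fitted logistic regression object logit.fit (hypothetical name):

exp(coef(logit.fit))      # odds ratios
exp(confint(logit.fit))   # confidence intervals; an interval containing 1 fails the test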
b) Receiver Operating Characteristic (ROC) Curve
The ROC curve is used to graphically show the trade-off between sensitivity (the true positive rate) and specificity (the true negative rate). If the model has no predictive power, all the points will lie on a 45-degree line. A greater area between the 45-degree line and the ROC curve indicates a more predictive model.
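The ROC plot itself is produced in the example code at the end of this document; the area under the curve can be extracted with ROCR as well. A minimal sketch, assuming the prediction object Pred built there:

library(ROCR)
auc <- performance(Pred, "auc")
auc@y.values[[1]]   # 0.5 = no predictive power, 1.0 = perfect separation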
c) Lorenz Curve
The Lorenz curve is a rotated ROC curve. In other words, it is a plot of the cumulative percentage of cases selected by the model against the cumulative percentage of actual positive events. As with the ROC curve, the area between the curve and the 45-degree line is used to compute the Gini coefficient, which measures the fit of the model. The higher the coefficient, the better-fitting the model.
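A minimal sketch of the Gini coefficient as a fit measure, computed from the area under the ROC curve (Gini = 2*AUC - 1) and again assuming the ROCR prediction object Pred from the example code later in this document:

library(ROCR)
auc <- performance(Pred, "auc")@y.values[[1]]
2 * auc - 1   # Gini coefficient: 0 = no discrimination, 1 = perfect ranking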
d) Confusion Matrix
The confusion matrix shows the actual versus forecasted outcomes for a binary or categorical process.
              Predicted
              Yes    No
Actual  Yes    a      b
        No     c      d

a: The number of times the model predicted Yes and the outcome was Yes.
b: The number of times the model predicted No and the outcome was Yes.
c: The number of times the model predicted Yes and the outcome was No.
d: The number of times the model predicted No and the outcome was No.
e) Profit Curve
The profit curve shows what the model's expected return would be if it were used in production: cumulative profit plotted against the model score cutoff.
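A minimal sketch, assuming a vector of model scores, a 0/1 outcome vector, and made-up economics of $50 revenue per true positive and $10 cost per case acted on (all hypothetical):

ord    <- order(score, decreasing = TRUE)    # act on the highest-scoring cases first
profit <- cumsum(50 * outcome[ord] - 10)     # cumulative profit as the cutoff is lowered
plot(seq_along(profit) / length(profit), profit, type = "l",
     xlab = "Proportion of cases acted on", ylab = "Cumulative profit")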
1. Intro
Any rule or set of rules that reduces the uncertainty of an event is a heuristic rule or expert system. Heuristic rules can be used in a stand-alone system or coded as a variable into a model. There are cases where heuristic rules are the best choice when building a model.
When heuristic rules are the best choice:
1. The process is too non-linear for other forms of modeling.
2. There is good common-sense knowledge of the process.
3. The market does not trust statistical models.
Do not discount heuristic rules when building models. They can be just as powerful in forecasting as any math-based model. Also, by using a hybrid system with both modeling and heuristic rules you can improve your forecast dramatically. Heuristic rules excel when there are a plethora of complex or non-linear relationships, exactly where statistical models can fall apart.
2. Building a Heuristic model
a. Experts
Talk to people! You may know statistics, but that does not mean you know everything about the problem you are trying to solve. Find an expert in the field you are analyzing and listen. Here is an example: in one of my prior jobs, a non-technical person researching prison populations noted that incarceration rates always increase when a new prison is built, mainly due to the tremendous political pressure to fill new prisons. This simple heuristic rule ends up explaining a great deal of the variation in incarceration rates.
b. Data Analysis
Look at your data! The biggest mistake a modeler can make is to ignore his or her data. Simple rules may emerge just from looking at plots or frequencies that can have enormous predictive power. Once I noticed a system was shutting down periodically once a day. A simple frequency table showed a relationship between output and whether it was noon; obviously the system was shutting down for lunch. This is common sense, but we had not thought of it until we looked at the data. Knowing this greatly improved the efficiency of the model.
c. Data Mining
Data mining is ideal for discovering heuristic rules. Trees, discussed in the data mining sections, are well suited to discovering both complex and simple rules.
You must answer this question before you begin running any statistical technique: what is your theory about the true underlying model? If you do not have a theory, find one, any one. Read the literature or ask an expert. Without a theory you may find an answer, but only by dumb luck will it be right.
For example, suppose you were given the task of estimating demand for your product. Without any theory you put in the price of the product and the quantity purchased and, lo and behold, as you increase price the quantity purchased also increases! You run to the marketing department, who use your advice and double the price. The next week you go out of business because no one bought your product. Why? The theory of supply and demand tells us why: quantity purchased rose when price rose because demand for the product was increasing while the data was collected. Your model violated an assumption of OLS: no simultaneous models.
You need to understand the theory of supply and demand to model the problem correctly. This is true of all problems. There is a growing trend to use software without any theoretical forethought. Do not follow this trend.
1. Intro
Logistic regressions are used for binary and categorical dependent variables. They model the likelihood that the dependent variable will take on a particular value, whereas a linear model predicts the value itself. A binary dependent variable violates one of the key assumptions of OLS, that variables are at least interval-level data, so in these circumstances OLS cannot be used. Examples of a binary dependent variable are whether someone votes yes or no, or whether someone buys your product. The key to understanding logistic regression is the odds ratio.
2. Odds Ratio
Odds are just another form of probability. If the probability of an event occurring is 10%, then the odds are 1 to 9 (0.10/0.90). The odds ratio allows comparison of the probability of an event across groups. For example, consider whether an increase in the interest rate causes people to increase or decrease their saving:
                    Rate increased?
                    Yes    No
Savings  Increase    a      b
         Decrease    c      d
OR = Odds Ratio = ad/bc. In the two-group example it is the ratio of the products of the complementary outcomes. The outcomes a and d can be seen as complementing one another: you increase your savings if rates go up and decrease your savings if rates go down. How to read an odds ratio: if all cells are the same, then OR = 1 and there is no relationship between the groups; values different from one indicate a relationship.
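A minimal sketch with made-up cell counts to show the arithmetic:

a <- 40; b <- 10; c <- 15; d <- 35   # hypothetical counts laid out as in the table above
(a * d) / (b * c)                    # OR = 9.33: rate increases associated with increased saving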
3. Link Functions
Link functions are used in generalized linear models (GLM) to define the relationship between the independent variables and the distribution of the dependent variable.
E(Y) = G^-1(Beta*X)
Where G is the link function.
Note, in the case of OLS, the link function is the identity function.
There are two main link functions that allow modeling of binary dependent variables: the Logit and the Probit. The Probit link function assumes a normal distribution; the Logit assumes a logistic distribution and is computationally simpler than the Probit.
There is debate over which is the preferred link function. The Probit is better when the tails of the distribution are a concern, when the theoretical model predicts a normal distribution, or when the event is a proportion rather than a binary outcome.
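A minimal sketch of fitting the same model with both links, assuming a data frame d with a 0/1 response y and regressors x1 and x2 (hypothetical names):

logit.fit  <- glm(y ~ x1 + x2, data = d, family = binomial(link = "logit"))
probit.fit <- glm(y ~ x1 + x2, data = d, family = binomial(link = "probit"))
cor(fitted(logit.fit), fitted(probit.fit))   # fitted probabilities are usually nearly identical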
4. Assumptions
Important assumptions for Logit and Probit regressions:

Assumption                                                          Effect of violation
1. Logit: conditional probabilities follow a logistic function;     Wrong model
   Probit: conditional probabilities follow a normal function
2. No missing or extraneous variables                               Inflated error term
3. Observations are uncorrelated                                    Inflated error term
4. No measurement error in the independent variables                Coefficients can be biased
5. No outliers                                                      Coefficients can be biased
6. No exact linear relationship between independent variables,      Coefficients are biased
   and more observations than independent variables

Assumptions not required:
1. The relationship need not be linear.
2. The dependent variable does not need to be normally distributed.
5. Example Code (R)
library(MASS)     # birthwt data set
library(gplots)   # plotmeans()
library(ROCR)     # ROC, precision/recall, lift charts
library(ineq)     # Lorenz curve

# Logistic regression of low birth weight on the available predictors
Results.Model1 <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                      data = birthwt, family = "binomial")

# Predicted log-odds, converted to probabilities with the logistic function
birthwt$Yhat <- predict(Results.Model1, birthwt)
birthwt$Yhat <- exp(birthwt$Yhat) / (1 + exp(birthwt$Yhat))

# Confusion matrix and misclassification rate
confusion <- function(a, b) {
  tbl <- table(a, b)
  mis <- 1 - sum(diag(tbl)) / sum(tbl)
  list(table = tbl, misclass.prob = mis)
}
confusion(birthwt$Yhat > .50, birthwt$low)

# Mean predicted probability by actual outcome, with 95% confidence intervals
plotmeans(Yhat ~ low, data = birthwt, p = 0.95)

Pred <- prediction(birthwt$Yhat, birthwt$low)
par(mfrow = c(2, 2))

## ROC curve
perf <- performance(Pred, "tpr", "fpr")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "ROC Curve")

## Precision/recall curve (x-axis: recall, y-axis: precision)
perf <- performance(Pred, "prec", "rec")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Precision/Recall Curve")

## Sensitivity/specificity curve (x-axis: specificity, y-axis: sensitivity)
perf <- performance(Pred, "sens", "spec")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Sensitivity/Specificity Curve")

## Lift chart
perf <- performance(Pred, "lift", "acc")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Lift Chart")

## Lorenz curve of the predicted probabilities, weighted by the actual outcome
lor <- Lc(birthwt$Yhat, birthwt$low)
plot(lor, main = "Lorenz Curve")