Instrumental Variable

Here is an example problem: you want to determine how the competitive environment affects store performance. You estimate a model with store performance as a function of the number of rival stores nearby, hoping to see how rivals affect performance. But you have violated a key assumption needed for OLS to be BLUE: at least one independent variable is determined jointly with the dependent variable and is therefore contemporaneously correlated with the error term. If a store is profitable, it will attract rivals. When an independent variable is correlated with the error term in this way, regression analysis cannot give you a consistent estimate.

The result is a biased estimate of the effect of rivals on store performance, one that understates the harm rivals do. If a location is highly profitable, more firms will enter the market, increasing the number of rival stores while not necessarily adversely affecting a store's performance. This is a common occurrence. Other examples are income and education, supply and demand, and store location and credit scores.

One solution is to use Instrumental Variables (IV). The key to IV is the use of one or more proxy variables (instruments) that are correlated with the problem independent variable but not contemporaneously correlated with the error term. The IV can be a variable, such as the number of small streams as a proxy for urban/city, or a fitted value. Two-stage least squares (2SLS) is a common IV technique: a first-stage regression predicts the problem variable from the instruments, and its fitted value then replaces that variable in the main regression. When building a 2SLS model you still need independent variables for the first-stage regression that are correlated with the variable you are trying to proxy for but not correlated with the primary model's error term.

In the store performance example you could employ a 2SLS model that estimates the number of rivals from variables such as development tax incentives, the number of rivals before the store opened and other factors not directly correlated with the performance of the store you are examining.
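The two stages can be sketched by hand in base R. This is a minimal illustration on simulated data, not a production estimator: the variable names (`profitability`, `tax_incentive`), the coefficients and the sample size are all assumptions made up for the example, and the standard errors reported by the hand-built second stage are not valid for inference.

```r
set.seed(1)
n <- 5000

# Hypothetical simulated data. 'profitability' is the unobserved quality of a
# location: it raises performance AND attracts rivals (the source of the bias).
profitability <- rnorm(n)
tax_incentive <- rnorm(n)   # assumed instrument: moves rivals, not performance
rivals        <- 1.0 * profitability + 1.0 * tax_incentive + rnorm(n)
performance   <- -0.5 * rivals + 2.0 * profitability + rnorm(n)  # true effect: -0.5

# Naive OLS: rivals is correlated with the error term (which contains profitability)
ols <- lm(performance ~ rivals)

# Stage 1: predict the endogenous variable from the instrument
stage1     <- lm(rivals ~ tax_incentive)
rivals_hat <- fitted(stage1)

# Stage 2: use the fitted value in place of the endogenous variable
stage2 <- lm(performance ~ rivals_hat)

print(coef(ols)[["rivals"]])         # biased toward (here even past) zero
print(coef(stage2)[["rivals_hat"]])  # close to the true -0.5
```

In this simulation the naive regression even gets the sign wrong, while the two-stage estimate recovers a value near the true effect.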

Instrumental variables can also be used to correct for omitted variables by choosing a proxy variable closely correlated with the missing variable.


Spatial autocorrelation

Let's start by clarifying two terms that are easily confused: autocorrelation and autoregression. Time series autocorrelation is where the error terms are correlated across time. In other words, past errors in the model affect the present outcome. This violates the assumption of uncorrelated errors needed for BLUE, resulting in biased standard errors. Biased standard errors are bad because you cannot say whether an independent variable's effect on the dependent variable is statistically significant or not. An example of time series autocorrelation is when past forecasting errors affect the present value of the dependent variable. If a system exhibits memory, this can also lead to autocorrelation. Autoregression is a means to correct for this by regressing a variable on lagged values of itself.

Spatial autocorrelation is another type of autocorrelation, but instead of spanning time it spans space. If a variable is correlated with itself through space it is said to be spatially autocorrelated. This can be due to misspecification of the model, measurement bias or many other reasons. Another term for spatial autocorrelation is spatially dependent errors. The Moran's I and Geary's C tests are the most commonly used to detect spatial autocorrelation. One example is when an area affects nearby regions. Imagine a high-crime neighborhood. The surrounding areas should also exhibit a higher than average crime rate due to spillover effects. This spillover effect will weaken the further you move from the crime epicenter. The mechanics of the spatial autocorrelation in this example could be transportation routes or poor police coverage that extend the crime to outlying areas. Another example is the effect of large shopping centers on the number of stores. Stores could be concentrated at the shopping center while outlying areas are devoid of stores, following Hotelling's Law.
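As a rough illustration of how Moran's I measures this, the sketch below computes the statistic by hand for six regions laid out in a row, where each region's neighbors are the regions next to it. The helper `morans_i`, the neighbor matrix and the crime values are hypothetical; dedicated spatial packages also provide significance tests, which this sketch omits.

```r
# Moran's I for values x and a spatial weight matrix w (w[i, j] > 0 if i and j
# are neighbors). Hypothetical helper written for this example.
morans_i <- function(x, w) {
  z <- x - mean(x)
  n <- length(x)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Six regions in a row; adjacent regions are neighbors
n <- 6
w <- matrix(0, n, n)
for (i in 1:(n - 1)) {
  w[i, i + 1] <- 1
  w[i + 1, i] <- 1
}

crime_clustered   <- c(1, 2, 3, 4, 5, 6)  # similar values sit next to each other
crime_alternating <- c(1, 6, 1, 6, 1, 6)  # neighbors are as different as possible

i_clustered   <- morans_i(crime_clustered, w)    # 0.6: positive autocorrelation
i_alternating <- morans_i(crime_alternating, w)  # -1: negative autocorrelation
print(c(i_clustered, i_alternating))
```

Values well above zero indicate that neighboring regions resemble each other, which is the spillover pattern described above.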

Spatial Autoregression (SAR) models correct for spatial autocorrelation by adding the surrounding territories' dependent variables (referred to as spatially lagged values of the dependent variable) as regressors. If you were modeling crime across zip codes, you would include the crime rate of nearby zip codes for each zip code as an independent variable. This is similar to how time series autoregression models add lagged values of the dependent variable as regressors.


Autoregressive Exogenous Model

ARX models are autoregressive models with exogenous inputs. The term exogenous variables should not be confused with independent variables. Exogenous variables are determined outside of the process you are modeling. An exogenous variable can be a shift in the oil supply affecting prices or a change in consumer preferences for foreign-manufactured products affecting price. Simply put, exogenous variables are independent of the process you are trying to model. Why are ARX models different from standard models? As an example, let's look at a model trying to predict industrial output. In this model you may want to include lagged output (industrial capacity is carried over from one period to the next) and lagged interest rates (the past cost of money influences current contracts). Both lagged output and lagged interest rates are endogenous to the system. What affects output also affects the price of money (interest rates). In this model an exogenous variable would be an oil crisis or natural disaster. These events happen regardless of the values of output or interest rates.
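A single-equation ARX model can be sketched with ordinary regression on a lagged value plus an exogenous input. The simulated series below is purely illustrative: the `shock` indicator (standing in for something like an oil crisis), the coefficients 0.6 and -2.0, and the sample size are assumptions made for the example.

```r
set.seed(2)
n <- 500

# Hypothetical exogenous input: a rare shock that occurs regardless of output
shock <- rbinom(n, 1, 0.05)

# Simulate output that depends on its own past value and on the shock
y <- numeric(n)
for (t in 2:n) {
  y[t] <- 0.6 * y[t - 1] - 2.0 * shock[t] + rnorm(1)
}

# Build the regression data: current output, lagged output, current shock
d <- data.frame(y = y[-1], y_lag = y[-n], shock = shock[-1])

# ARX(1): autoregressive term plus exogenous regressor
arx <- lm(y ~ y_lag + shock, data = d)
print(coef(arx))
```

The estimated coefficient on `y_lag` captures the memory of the process, while the coefficient on `shock` captures the effect of the outside event.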

How are ARX models different?

In vector autoregressive models with exogenous inputs (VARX) the distinction becomes clear. In a vector autoregression (VAR) model all the variables are assumed to be correlated with one another. To identify the model you make an assumption about how the variables are contemporaneously correlated with one another, e.g. interest rates affect money immediately but only lagged money affects interest rates today. With a VARX model you estimate a system of correlated variables together with exogenous variables. VARX allows outside shocks to be taken into consideration.

There are many variations of ARX models.

Non-linear auto-regressive models (NARX)

Additive nonlinear autoregressive exogenous

Vector auto-regressive models (VARX)




GLMM Models

Oftentimes our data conspires against us, violating essential assumptions of the standard linear model. One critical assumption is that the error terms of a model must be uncorrelated. There can be many causes of such correlation: repeated data sources (as with customers over time), missing variables, or data that is skewed over time. If those variations are linear you can model them by adding linear random effects (i.e. allowing the intercept to vary by the source of the variation). One such solution is Generalized Linear Mixed Models (GLMM). It is called a mixed model because it has both random effects and fixed effects. Remember, fixed effects are assumed to have no measurement error and to generalize the same way across groups (no group effects). Random effects can have measurement bias or group effects.

Example one: suppose the probability of a customer returning increases after each visit. If this increase in probability is linear (e.g. 10% more likely to return after the first visit, 20% more likely after the second, 30% more likely after the third, ...), then you can model it correctly by adding a linear random effect that varies by customer. If you do not correct for this relationship, variables such as customer age (the older you are, the more likely you are to have returned more than once) may proxy for it, leading to incorrect conclusions.

Example two:  suppose you are collecting data at the county level.  Each county may have different fixed effects.  Imagine crime rates across counties.  The base rate of crime will vary while the influence of say poverty rate will be the same.  If you did not allow the intercept to vary the relation between poverty and crime may be obscured by the base rate variation across counties.
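The varying-intercept idea in the two examples can be written as a random-intercept model. This is a sketch in assumed notation: i indexes observations, j indexes the grouping unit (customer or county), G is the link function, and u_j is the group-specific shift in the intercept.

```latex
G\bigl(E(Y_{ij} \mid u_j)\bigr) = \beta_0 + u_j + \beta_1 X_{ij},
\qquad u_j \sim N(0, \sigma_u^2)
```

In the county example, \beta_0 + u_j is the base crime rate of county j, while the effect of the poverty rate, \beta_1, is the same in every county.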

GLMMs do not correct for non-linear variations across sampling units. If you believe customer retention is a non-linear function of visits, or that the influence of poverty varies by county, then GLMM alone cannot correct for this. In many circumstances, however, such as the influence of poverty on crime, the variation across counties is due to missing or latent variables such as the support programs available for individuals at or below the poverty line. Likewise, the fixed effect can also be caused by omitted or latent variables. The base crime rate across counties is a function of general underlying conditions that oftentimes cannot be measured. Such latent variables are often impossible to uncover, so using GLMM is an acceptable solution to correct for this missing variable bias.




  Understanding and using diagnostics separates a good statistician from a hack. It is not enough just to run the diagnostics; you must challenge your model with a critical eye. To build a model is simple; to ensure it is stable and accurate is a work of art. There are a variety of tools and tests that can aid you in evaluating your model. When doing diagnostics never assume anything; always seek proof.


2. Tools for linear models



a. QQ plot



The QQ, or quantile-quantile, plot shows the quantiles of the model's residuals plotted against the quantiles expected under a normal distribution, with a 45-degree reference line. This allows you to see how well the model fits at both extremes. A model that fits poorly in the tails will curl away from the line at the ends of the plot. Having a model that fits poorly at the extremes is not a good thing, but oftentimes it is not a showstopper. By setting maximum allowable values for the model it can still be useful in segmenting cases. To correct for poorly fitting tails, look for new explanatory variables or double-check whether you missed any non-linearities that could be confusing the system.
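A QQ plot of residuals takes only a few lines in base R. The sketch below uses the built-in `cars` data set and a simple regression purely as a stand-in model; any fitted `lm` object could be used instead.

```r
# Stand-in model on a built-in data set (chosen only for illustration)
fit <- lm(dist ~ speed, data = cars)

# Standardized residuals put the points on the same scale as the normal quantiles
res <- rstandard(fit)

# Plot sample quantiles against theoretical normal quantiles
qq <- qqnorm(res, main = "Normal Q-Q plot of standardized residuals")
qqline(res)  # reference line through the first and third quartiles
```

Points that bend away from the reference line at either end of the plot are the poorly fitting tails discussed above.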



b. Residual Plots


    By observing the residual plots, much can be uncovered about how the model is performing. A key thing to look for is any pattern in the plots. Since the residuals should be random, there should be no observable trend in them. Any observable pattern indicates trouble with the model.



c. R-Squared


    R-Squared measures the proportion of the variation of the dependent variable explained (I am using that term very loosely) by the model. R-Squared has poor standing among statisticians but can be useful if it is not the only measure of fitness of the model. It ranges from zero to one, with one being a perfect fit. In practice a value of one is only possible if you include the dependent variable as an explanatory variable, and it is therefore an indication of error. With the data I typically look at, a good model ranges from .1 to .3; however, I have seen models in production working well with an R-Squared as low as .07.

R^2 = 1 - ((y - Xb)'(y - Xb)) / sum((y - yBar)^2)
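The formula can be checked against R's own output. This sketch uses the built-in `mtcars` data and an arbitrary two-variable model as an example; only the comparison between the hand calculation and `summary()` matters.

```r
# Example model on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

y <- mtcars$mpg
X <- model.matrix(fit)   # design matrix, including the intercept column
b <- coef(fit)

# Residual vector e = y - Xb
e <- y - X %*% b

# R^2 = 1 - e'e / sum((y - yBar)^2)
r2_manual  <- 1 - as.numeric(t(e) %*% e) / sum((y - mean(y))^2)
r2_summary <- summary(fit)$r.squared

print(c(manual = r2_manual, summary = r2_summary))
```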



d. MSE


    MSE or Mean Squared Error is useful in choosing between multiple models. It is simply the average of the squared errors.
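A minimal sketch of comparing two candidate models by MSE, using the built-in `mtcars` data as a stand-in. The helper `mse` is hypothetical. Note the assumption being made: this is in-sample MSE, which can never get worse when a regressor is added, so in practice the comparison is more informative on held-out data.

```r
# Hypothetical helper: average of the squared residuals of a fitted model
mse <- function(model) mean(residuals(model)^2)

m1 <- lm(mpg ~ wt, data = mtcars)        # smaller model
m2 <- lm(mpg ~ wt + hp, data = mtcars)   # adds one regressor

print(c(m1 = mse(m1), m2 = mse(m2)))
```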



e. 1. Partial Regression


    Partial regressions are an important tool for determining how the independent variables affect the model as well as one another. A partial regression shows the net effect of an independent variable after correcting for the other regressors.
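One way to see the "net effect" is to remove the influence of the other regressors from both the dependent variable and the variable of interest, then regress one set of residuals on the other. The sketch below does this for `wt` in an example model on the built-in `mtcars` data; the slope matches the coefficient from the full regression.

```r
# Full model: the coefficient on wt is its effect net of hp
full <- lm(mpg ~ wt + hp, data = mtcars)

# Step 1: strip the influence of the other regressor (hp) from mpg and from wt
y_res <- residuals(lm(mpg ~ hp, data = mtcars))
x_res <- residuals(lm(wt ~ hp, data = mtcars))

# Step 2: regress the cleaned dependent variable on the cleaned regressor
partial <- lm(y_res ~ x_res)

print(c(full = coef(full)[["wt"]], partial = coef(partial)[["x_res"]]))

# Plotting y_res against x_res gives the partial regression plot for wt
plot(x_res, y_res, xlab = "wt | hp", ylab = "mpg | hp")
abline(partial)
```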



e. 2. Partial Residual Plots


    Partial residual plots are residuals plotted against each independent variable's value. This shows how the residuals of the model vary as the value of the independent variable changes. This will uncover situations such as a variable at high values causing too great a variation in the model, leading to high residuals. In this case you would cap the independent variable. What you want to see is an even cloud of data points with zero slope, centered on zero.



f. T-stats on Coefficients


    The t-statistics on the regressors test the null hypothesis that the coefficient is zero, that is, that the variable has no effect on the model. If you cannot statistically justify a variable's inclusion in the model it is preferable to remove it. Reasons for a variable failing a t-test can range from it having no relation with the dependent variable, to non-linearities influencing the results, to other independent variables clouding the true relationship. If there are firm theoretical reasons for including the variable, investigate further.



g. Economic Significance of Coefficients


    An independent variable may be statistically significant but have little practical explanatory power. By calculating the economic significance of a variable you can roughly measure its contribution to the overall value of the dependent variable. The economic significance of a coefficient is the coefficient times the standard deviation of its independent variable. There is no clear definition of whether a coefficient is economically significant; instead a researcher has to look at the values and decide for herself whether a given coefficient has enough, well, oomph to be considered important. It is a powerful tool to rid yourself of those pesky statistically significant but unimportant variables.
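The calculation is a one-liner once the model is fitted. This sketch uses an example model on the built-in `mtcars` data; the result is the change in the dependent variable associated with a one standard deviation change in each regressor.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficients of the regressors (intercept dropped)
b <- coef(fit)[c("wt", "hp")]

# Standard deviation of each regressor, in the same order
sds <- sapply(mtcars[c("wt", "hp")], sd)

# Economic significance: coefficient times standard deviation
econ_sig <- b * sds
print(econ_sig)
```

Because the result is expressed in units of the dependent variable, the values can be compared across regressors measured on very different scales.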




h. Cook's Test


    Cook's test is used to uncover outliers in the data. Using the Cook's value you can target outliers for removal. It should be remembered that not all outliers should be removed. Some are representative of important behavior in the system. In modeling weather in Florida, hurricanes may look like outliers in the data but they are a critical feature to model.
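Base R computes Cook's values directly from a fitted linear model. In this sketch the model on the built-in `mtcars` data is only an example, and the cutoff of 4/n is a commonly used rule of thumb assumed here, not a rule from the text above.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One Cook's value per observation: how much the fit changes if it is dropped
cd <- cooks.distance(fit)

# Assumed rule-of-thumb cutoff (4 / n); treat flagged points as candidates to
# inspect, not as points to delete automatically
cutoff  <- 4 / nrow(mtcars)
flagged <- names(cd)[cd > cutoff]

print(sort(cd, decreasing = TRUE)[1:5])
print(flagged)
```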


  i. CHOW


    The Chow test is used to test for structural or regime changes within the data. In monetary and other financial models it is an important test. If a structural change seldom occurs, modeling the change using dummy variables can be a good choice, but if structural changes occur often you may need to model the underlying causes of those changes to have any chance of forecasting the process.
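The Chow statistic can be built from three ordinary regressions: one pooled and one for each regime. The sketch below uses simulated data with a known break halfway through the sample; the break point, coefficients and helper `rss` are assumptions for illustration.

```r
set.seed(3)
n <- 200
x <- rnorm(n)
regime <- rep(c(0, 1), each = n / 2)   # assumed known break at the midpoint

# Intercept and slope both shift in the second regime
y <- 1 + 2 * x + regime * (1.5 + 1 * x) + rnorm(n)

rss <- function(m) sum(residuals(m)^2)  # residual sum of squares

pooled <- lm(y ~ x)                        # one model for all data
m1     <- lm(y ~ x, subset = regime == 0)  # first regime only
m2     <- lm(y ~ x, subset = regime == 1)  # second regime only

k <- 2  # parameters per regression (intercept and slope)
chow_F  <- ((rss(pooled) - rss(m1) - rss(m2)) / k) /
           ((rss(m1) + rss(m2)) / (n - 2 * k))
p_value <- pf(chow_F, k, n - 2 * k, lower.tail = FALSE)

print(c(F = chow_F, p = p_value))
```

A small p-value rejects the hypothesis that the same coefficients hold in both regimes. The same F statistic results from comparing the pooled model with one that interacts every regressor with a regime dummy, which is the dummy-variable approach mentioned above.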


  j. Durbin-Watson


    Durbin-Watson (DW) is the standard test for serial correlation (autocorrelation). Remember, serial correlation violates the assumptions needed for BLUE and results in biased standard errors, and you employ autoregression models to correct for it. When investigating time series data you always have to be conscious of the DW statistic.
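The DW statistic itself is simple to compute from the residuals: it is the sum of squared changes in consecutive residuals divided by the sum of squared residuals. Values near 2 suggest no first-order serial correlation, while values well below 2 suggest positive serial correlation. The sketch below uses simulated data with autocorrelated errors; the AR coefficient of 0.7 is an assumption for the example, and the sketch does not compute critical values.

```r
set.seed(4)
n <- 300
x <- rnorm(n)

# Errors follow an AR(1) process with coefficient 0.7 (assumed for the example)
e <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))
y <- 1 + 2 * x + e

fit <- lm(y ~ x)
r   <- residuals(fit)

# Durbin-Watson statistic: roughly 2 * (1 - first-order autocorrelation)
dw <- sum(diff(r)^2) / sum(r^2)
print(dw)
```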


  k. Bag Plot


    Bag plots uncover outliers in the data and are useful alongside Cook's test.


  l. White


    The White test is a standard test for heteroskedasticity. Heteroskedasticity does not bias the OLS coefficient estimates themselves, but it biases their standard errors, so the t-tests on the coefficients become unreliable. This can lead to excluding relevant variables or keeping irrelevant ones.
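The White test can be sketched by hand: regress the squared residuals on the regressors, their squares and their cross-product, then compare n times the R-squared of that auxiliary regression with a chi-square distribution. The simulated data below, in which the error variance grows with the size of `x1`, is an assumption made for illustration.

```r
set.seed(6)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)

# Heteroskedastic errors: the standard deviation depends on |x1|
y <- 1 + x1 + x2 + rnorm(n, sd = 0.5 + abs(x1))

fit <- lm(y ~ x1 + x2)
u2  <- residuals(fit)^2

# Auxiliary regression: levels, squares and cross-product of the regressors
aux <- lm(u2 ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))

# Test statistic n * R^2, chi-square with 5 degrees of freedom (5 auxiliary terms)
white_stat <- n * summary(aux)$r.squared
p_value    <- pchisq(white_stat, df = 5, lower.tail = FALSE)

print(c(statistic = white_stat, p = p_value))
```

A small p-value rejects the hypothesis of constant error variance.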


3. Tools for Probabilistic and Categorical Models



a) Odds Ratios



The odds ratios for each independent variable indicate whether to keep that variable in the model. If the odds ratio is 1, that variable does not help the predictive power of the model, while an odds ratio that is statistically significantly greater than or less than one indicates the variable has predictive power.
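For a logistic regression, the odds ratios are the exponentiated coefficients. The sketch below fits a model to simulated data with one useful predictor and one pure-noise predictor; the variable names, coefficients and sample size are assumptions. The intervals are Wald intervals from `confint.default`.

```r
set.seed(5)
n <- 2000
x_useful <- rnorm(n)   # truly related to the outcome
x_noise  <- rnorm(n)   # unrelated to the outcome

# True model: log-odds = -0.5 + 0.8 * x_useful
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x_useful))

fit <- glm(y ~ x_useful + x_noise, family = binomial)

# Odds ratios and Wald 95% confidence intervals on the odds-ratio scale
odds_ratios <- exp(coef(fit))
ci          <- exp(confint.default(fit))

print(cbind(OR = odds_ratios, ci))
```

An interval that excludes 1 corresponds to an odds ratio that is statistically different from one.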



b) Receiver Operating Characteristic (ROC) Curve



The ROC curve graphically shows the trade-off between sensitivity (the true positive rate) and specificity (the true negative rate), plotting sensitivity against one minus specificity. If the model has no predictive power all the points will lie on a 45-degree line. A greater area between the 45-degree line and the ROC curve indicates a more predictive model.
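To make the construction concrete, the sketch below builds the ROC points and the area under the curve by hand for eight made-up cases. The helpers `roc_points` and `auc` and the scores are hypothetical and assume no tied scores.

```r
# Sensitivity and (1 - specificity) at every score threshold
roc_points <- function(score, actual) {
  thresholds <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) mean(score[actual == 1] >= t))
  fpr <- sapply(thresholds, function(t) mean(score[actual == 0] >= t))
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))  # start the curve at (0, 0)
}

# Area under the curve by the trapezoid rule
auc <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Made-up model scores and actual outcomes (1 = event occurred)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
actual <- c(1,   1,   0,   1,   0,    1,   0,   0)

roc       <- roc_points(score, actual)
area      <- auc(roc)        # 0.8125
area_gain <- area - 0.5      # area between the curve and the 45-degree line

print(c(auc = area, above_diagonal = area_gain))
```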



c) Lorenz Curve


        The Lorenz curve is a rotated ROC curve. In other words, it is a plot of the cumulative percentage of cases selected by the model against the cumulative percentage of actual positive events. As with the ROC curve, the area between the curve and the 45-degree line is used to measure the fit of a model: the Gini coefficient is that area divided by the total area under the 45-degree line. The higher the coefficient, the better fitting the model.



d) Confusion Matrix



The confusion matrix shows the actual versus forecasted outcome for a binary or categorical process.


             Predicted Yes   Predicted No
Actual Yes         a               b
Actual No          c               d

a: The number of times the model predicted Yes and the outcome was Yes.

b: The number of times the model predicted No and the outcome was Yes.

c: The number of times the model predicted Yes and the outcome was No.

d: The number of times the model predicted No and the outcome was No.
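In R the four cells come straight out of `table()`. The outcomes and predictions below are made up for illustration; the variable names `n_a` to `n_d` simply mirror the cell labels a to d.

```r
# Made-up actual outcomes and model predictions for eight cases
actual    <- factor(c("Yes", "Yes", "No", "Yes", "No",  "No", "Yes", "No"),
                    levels = c("Yes", "No"))
predicted <- factor(c("Yes", "No",  "No", "Yes", "Yes", "No", "Yes", "No"),
                    levels = c("Yes", "No"))

# Rows are actual outcomes, columns are predictions
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

n_a <- cm["Yes", "Yes"]  # predicted Yes, outcome Yes
n_b <- cm["Yes", "No"]   # predicted No,  outcome Yes
n_c <- cm["No",  "Yes"]  # predicted Yes, outcome No
n_d <- cm["No",  "No"]   # predicted No,  outcome No

accuracy    <- (n_a + n_d) / sum(cm)
sensitivity <- n_a / (n_a + n_b)   # true positive rate
specificity <- n_d / (n_c + n_d)   # true negative rate
```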



e) Profit curve


    Shows what the expected return would be if the model were used in production, plotted as profit versus score.


Heuristic Rules

1. Intro

Any rule or set of rules that reduces the uncertainty of an event is a heuristic rule or expert system. Heuristic rules can be used in a stand-alone system or coded as a variable in a model. There are cases where heuristic rules are the best choice when building a model.

When heuristic rules are the best choice:

1. The process is too non-linear for other forms of modeling.

2. There is good common sense knowledge of processes.

3. The market does not trust statistical models.

Do not discount heuristic rules when building models. They can be just as powerful in forecasting as any math-based model. Also, by using a hybrid system with both modeling and heuristic rules you can improve your forecast dramatically. Heuristic rules excel when there is a plethora of complex or nonlinear relationships, exactly where statistical models can fall apart.

2. Building a Heuristic model

a. Experts

Talk to people! You may know statistics but that does not mean you know everything about the problem you are trying to solve. Find an expert in the field you are analyzing and listen. Here is an example: in one of my prior jobs a non-technical person researching prison populations noted that incarceration rates always increase when a new prison is built, mainly due to the tremendous political pressure to fill new prisons. This simple heuristic rule ends up explaining a great deal of the variation in incarceration rates.

b. Data Analysis

Look at your data! The biggest mistake a modeler can make is to ignore his or her data. Simple rules may emerge out of just looking at plots or frequencies that can have enormous predictive power. Once I noticed a system was shutting down once a day. A simple frequency showed a relationship between output and whether it was noon. Obviously the system was shutting down for lunch. This is common sense, but we had not thought of it until we looked at the data. Knowing this greatly improved the efficiency of the model.

c. Data Mining

Data mining is ideal for discovering heuristic rules. Trees, discussed in the data mining sections, are well designed to discover both complex and simple rules.

Theory Behind the Model

You must answer this question before you begin running any statistical technique: what is your theory about the true underlying model? If you do not have a theory, find one, any one. Read the literature or ask an expert. Without a theory you may find an answer, but only by dumb luck will it be right.

Suppose you were given the task of estimating demand for your product. Without any theory you put in the price of the product and the quantity purchased and, lo and behold, as you increase price the quantity purchased also increases! You run to the marketing department, who use your advice and double the price. The next week you go out of business because no one bought your product. Why? The theory of supply and demand tells us why. Quantity purchased rose when price rose because demand for the product was increasing while the data was collected. Your model violated an assumption of OLS: no simultaneity between the dependent and independent variables.

You need to understand the theory of supply and demand to model the problem correctly. This is true of all problems. There is a growing trend to use software without any theoretical forethought. Do not follow this trend.


1. Intro

Logistic regressions are used for binary and categorical dependent variables. They model the likelihood that the dependent variable will take on a particular value, whereas a linear model predicts the value itself. Binary dependent variables violate one of the key assumptions for OLS, that variables are at least interval-level data, so in these circumstances OLS cannot be used. Examples of a binary dependent variable are whether someone votes yes or no, or whether someone buys your product. The key to understanding logistic regression is the odds ratio.

2. Odds Ratio

Odds are just another way of expressing probability. If the probability of an event occurring is 10%, then the odds are 1 to 9 (one occurrence for every nine non-occurrences). The odds ratio allows comparison of the odds of an event across groups. Example: does an increase in the interest rate cause people to increase or decrease their saving?

                           Saving increased?
                           Yes     No
Interest rate  Increase     a       b
               Decrease     c       d

OR = Odds Ratio = ad/bc. In the two-group example it is the ratio of the products of the complementary outcomes. The outcomes a and d can be seen as complementing one another: you increase your savings if rates go up and decrease your savings if rates go down. To read an odds ratio: if all values are the same then OR = 1, so there is no relationship between the groups. Values different from one indicate a relationship.
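A small worked example of both calculations. The counts in the table are made up for illustration, and the helper `odds` is hypothetical.

```r
# Odds from a probability: p / (1 - p)
odds <- function(p) p / (1 - p)
odds_10pct <- odds(0.10)   # 0.111..., i.e. odds of 1 to 9

# Made-up counts: rows are the interest rate change, columns are whether
# the person increased their saving
tab <- matrix(c(40, 10,
                20, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(rate = c("Increase", "Decrease"),
                              saved_more = c("Yes", "No")))

# OR = ad / bc
odds_ratio <- (tab["Increase", "Yes"] * tab["Decrease", "No"]) /
              (tab["Increase", "No"]  * tab["Decrease", "Yes"])

print(odds_ratio)   # (40 * 30) / (10 * 20) = 6
```

With these counts the odds of saving more are six times higher after a rate increase than after a rate decrease.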

3. Link Functions

Link functions are used in generalized linear models (GLM) to define the relationship between the independent variables and the distribution of the dependent variable.

E(Y) = G^-1(Beta*X)

Where G is the link function.

Note, in the case of OLS, the link function is the identity function.

There are two main link functions that allow modeling of binary dependent variables: the Logit and the Probit. The Probit link function assumes a normal distribution. The Logit link function assumes a logistic distribution and is computationally simpler than the Probit.

There is debate over which is the preferred link function. The Probit is better when the tails of the distribution are a concern, when the theoretical model predicts a normal distribution, or when the event is a proportion rather than a binary outcome.
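The two inverse link functions, G^-1 in the notation above, are available in base R as `plogis` (logistic) and `pnorm` (normal). The sketch below evaluates both on a few example values of the linear predictor Beta*X, chosen arbitrarily, to show where they agree and where the tails differ.

```r
# Example values of the linear predictor Beta * X
eta <- c(-3, -1.5, 0, 1.5, 3)

# Inverse Logit link: 1 / (1 + exp(-eta))
p_logit <- plogis(eta)

# Inverse Probit link: standard normal cumulative distribution function
p_probit <- pnorm(eta)

print(round(cbind(eta, p_logit, p_probit), 4))
```

Both functions map a linear predictor of zero to a probability of 0.5, but the logistic curve has heavier tails: at the same extreme value of the linear predictor it stays further from 0 and 1 than the normal curve does.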

4. Assumptions

Important assumptions for the Logit & Probit regression

Assumption | Effect of violation
1. a) Logit: conditional probabilities are represented by a logistic function; b) Probit: conditional probabilities are represented by a normal function | Wrong model
2. No missing or extraneous variables | Inflated error term
3. Observations are uncorrelated | Inflated error term
4. No measurement error of independent variables | Coefficients can be biased
5. No outliers | Coefficients can be biased
6. No exact linear relationship between independent variables, and more observations than independent variables | Coefficients are biased

Assumptions not required

1. The relationship need not be linear

2. Dependent variable does not need to be normally distributed

5. Example Code (R)

library(MASS)   # provides the birthwt data set

Results.Model1 <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, data = birthwt, family = "binomial")

# Predicted log-odds, converted to probabilities with the inverse logit
birthwt$Yhat <- predict(Results.Model1, birthwt)
birthwt$Yhat <- exp(birthwt$Yhat) / (1 + exp(birthwt$Yhat))

# Confusion table and misclassification rate
confusion <- function(a, b) {
  tbl <- table(a, b)
  mis <- 1 - sum(diag(tbl)) / sum(tbl)
  list(table = tbl, misclass.prob = mis)
}

confusion((birthwt$Yhat > .50), birthwt$low)

# Mean predicted probability by actual outcome with 95% intervals.
# Assumption: gplots::plotmeans with its own argument p; the error-bar
# setting in the original call did not survive.
library(gplots)
plotmeans(Yhat ~ low, data = birthwt, p = 0.95)

library(ROCR)

Pred <- prediction(birthwt$Yhat, birthwt$low)
par(mfrow = c(2, 2))
## ROC Curve
perf <- performance(Pred, "tpr", "fpr")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "ROC Curve")

## precision/recall curve (x-axis: recall, y-axis: precision)
perf <- performance(Pred, "prec", "rec")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "precision/recall curve")

## sensitivity/specificity curve (x-axis: specificity,
## y-axis: sensitivity)
perf <- performance(Pred, "sens", "spec")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "sensitivity/specificity curve")

## Lift Chart
perf <- performance(Pred, "lift", "acc")
plot(perf, avg = "threshold", colorize = TRUE, lwd = 3, main = "Lift Chart")

## Lorenz Curve
library(ineq)   # provides Lc()
lor <- Lc(birthwt$Yhat, birthwt$low)
plot(lor, main = "Lorenz Curve")