The S-Language


The S programming language is a statistical programming language developed at Bell Laboratories specifically for statistical modeling. There are two main versions of S. One is a commercial version developed by Insightful under the name S-PLUS. The other is an open-source implementation called R. S lets you create objects, is highly extensible, and has powerful graphing capabilities.

Tips
Tip 1

Set Memory Size

memory.size(max = TRUE)   # Windows only: reports the maximum memory available to the R session
Tip 2

Today’s Date

Today <- format(Sys.Date(), "%d %b %Y")
Tip 3

Set Working Directory

setwd("C:/")
Tip 4

Load In Data

ExampleData.path    <- file.path(getwd(), "USDemographics.CSV")
ExampleData.FullSet <- read.table(ExampleData.path, header = TRUE, sep = ",",
                                  na.strings = "NA", dec = ".", strip.white = TRUE)
Tip 5

Split Data

ExampleData.Nrows      <- nrow(ExampleData.FullSet)
ExampleData.NCol       <- ncol(ExampleData.FullSet)
ExampleData.SampleSize <- floor(ExampleData.Nrows / 2)
ExampleData.Sample     <- sample(nrow(ExampleData.FullSet), size = ExampleData.SampleSize,
                                 replace = FALSE, prob = NULL)
ExampleData.HoldBack   <- ExampleData.FullSet[ExampleData.Sample,  c(5, 1:ExampleData.NCol)]
ExampleData.Run        <- ExampleData.FullSet[-ExampleData.Sample, c(5, 1:ExampleData.NCol)]
Tip 6

Create Function

Confusion <- function(a, b){
                  tbl <- table(a, b)
                  mis <- 1 - sum(diag(tbl))/sum(tbl)
                  list(table = tbl, misclass.prob = mis)
                   }
Tip 7

Recode Fields

library(car)   # recode() comes from the car package
ExampleData.FullSet$SavingsCat <- recode(ExampleData.FullSet$Savings,
    "-40000.00:-100.00 = 'HighNeg'; -100.00:-50.00 = 'MedNeg'; -50.00:10.00 = 'LowNeg';
     10.00:50.00 = 'Low'; 50.00:100.00 = 'Med'; 100.00:1000.00 = 'High'",
    as.factor.result = TRUE)
Tip 8

Summarize Data

summary(ExampleData.FullSet)
Tip 9

Save output

save.image(file = "c:/test.RData", version = NULL, ascii = FALSE, compress = FALSE, safe = TRUE)
Tip 10

Subset

MyData.SubSample <- subset(MyData.Full, MyField ==0)
Tip 11

Remove Object From Memory

remove(list = c("MyObject"))
Tip  12

Create a Dataframe

TmpOutput <- data.frame(Fields = c("Field1", "Field2", "Field3"), Values = c(1, 2, 2))
Tip 13

Cut

data(swiss)
x <- swiss$Education  
swiss$Educated <- cut(x, breaks = c(0, 11, 999), labels = c("0", "1"))
Tip 14

Create Directories

dir.create("C:/MyProjects")

Statistical/AI Techniques

1 Intro

There is a forest of traditional statistical techniques and new artificial intelligence algorithms for forecasting. Choosing the right one can be difficult.

2. Choosing variables

With the Information Age, forecasters got a mixed blessing. We now have more data than the most optimistic forecaster dreamed of just fifteen years ago. We typically work with datasets consisting of thousands of data elements and millions of records, but what to do with all this… stuff? Most of the data elements logically have no relation whatsoever to the features we are studying. Worse, many of the variables are hopelessly correlated with one another, and by the law of large numbers many erroneous relationships will emerge from this plethora of information.

a) Research

Again, many problems can be solved by communication or by reading the existing research. If it is important, someone has done it before.

b) Systematic algorithms

The method I favor is writing systematic algorithms that cycle through all the data elements available, analyze each one's relationship with the target feature using a measure like MSE, and then cherry-pick the best elements for further analysis.
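As a rough sketch of this approach (the data frame MyData, its Target column, and the use of one-variable linear fits are all assumptions for illustration):

Candidate.Names <- setdiff(names(MyData), "Target")
Candidate.MSE   <- sapply(Candidate.Names, function(v) {
    fit <- lm(reformulate(v, response = "Target"), data = MyData)
    mean(residuals(fit)^2)              # MSE of a one-variable fit
})
sort(Candidate.MSE)[1:10]               # cherry-pick the most promising elements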

b.1) Stepwise Regression

Stepwise regressions reduce the number of variables in a model by adding or removing variables one at a time and calculating the marginal gain from including each variable.
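R's built-in step() function performs this kind of selection by AIC; a minimal sketch, with the swiss data and starting model chosen purely for illustration:

data(swiss)
Full.Model     <- lm(Fertility ~ ., data = swiss)        # start from the full model
Stepwise.Model <- step(Full.Model, direction = "both")   # add/drop variables by marginal gain (AIC)
summary(Stepwise.Model)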

b.2) Lorenz, ROI, and ROC curves.

Cycle through each potential independent variable and generate curves showing the relationship with the dependent variable.

b.3) Correlation

A simple two-way correlation between the potential independent and dependent variables is another technique for finding potential independent variables.

c) Data mining

There are several data mining techniques I will discuss in the data mining section that are geared to uncover linear and non-linear relationships in the data.

d) Principal Components\Factor Analysis

As mentioned in the previous section, this technique can reduce the number of variables whose relationships need to be estimated, with the hope of not losing too much information. Again, this is an estimation technique and should be treated as such.

3) Forecasting Techniques

Below is a cursory overview of common forecasting techniques. A more detailed overview is provided in the statistics and data mining sections. All the example code is in R except where noted.

a) Ordinary Least Squares Regression

This is the classic forecasting technique taught in schools.

1) Building a model

a) There are many packages that provide ordinary least squares estimates.

b) Variable selection is important.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Continuous or Categorical Variables

2) Diagnostic Tools

In the statistical section, I go over the evaluation of OLS models in detail, but here are some tools for uncovering major issues when building an OLS model:

a) QQ Plot

b) Residuals plots

c) Correlations

d) Partial Regressions

e) MSE

f) R-Squared

3) Caveats

OLS requires strong assumptions about the nature of the data and the relationships involved to remain BLUE. BLUE is discussed in detail in the statistical section.

4) Example Code(R)

data(trees)
# Illustrative OLS fit on the built-in trees data (the model choice is an assumption)
Results.Model1 <- lm(Volume ~ Girth + Height, data = trees)
summary(Results.Model1)

b) Logistic Regressions

Logistic regressions have a very different theoretical foundation from ordinary least squares models. You are trying to estimate a probability, so the dependent variable only takes values of 1 or 0. This violates the assumptions required for BLUE.

1) Building the model

a) Many software packages have Logistic regression models included.

b) Variable selection is important.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  Variables

2) Diagnostic Tools

a. Confusion Matrix

b. Lift Charts

c. ROC chart

d. Lorenz Curve

e. Cost Sensitivity/ROI

3) Caveats

Non-linearities can obscure relationships between variables.

4) Example Code(R)

library(MASS)
data(birthwt)   # illustrative data set from MASS; the model below is an assumption
Results.Model1 <- glm(low ~ age + lwt + smoke, family = binomial(link = "logit"), data = birthwt)
summary(Results.Model1)

c) Vector Autoregression Models (VAR)

Vector autoregression models require a firm theoretical foundation. They are designed to estimate the relationships among a matrix of codependent, autocorrelated variables. To identify the structure you must make strong assumptions about the structure of the errors, namely how the errors are temporally related.

1) Building a model

a) There are few packages that provide Vector Auto Regression models.

b) Variable selection is critical.

c) Outputs a simple equation.

d) They correct for auto-correlation and simultaneous equations.

e) Can model time series data.

f) Continuous Variables

2) Diagnostic Tools

Same as for an OLS model.

3) Caveats

By changing the order of the variables in the model, you completely change your theory of what the true relationship is. For example, if you order money first, you believe the money supply drives output. If you order output first, you believe output drives money. These are two contradictory models. Because of this strong reliance on a detailed understanding of the true relationship between the variables, plus all the assumptions required for an OLS model, they have fallen out of favor in many forecasting circles.

4) Example Code(R)

# This example fits a Bayesian VAR model with flat priors.
library(MSBVAR)
data(longley)

# Flat-priors model
Results.Model1 <- szbvar(longley, p = 1, z = NULL, lambda0 = 1, lambda1 = 1,
                         lambda3 = 1, lambda4 = 1, lambda5 = 0, mu5 = 0, mu6 = 0,
                         nu = 0, qm = 4, prior = 2, posterior.fit = FALSE)

d) MARS

Multivariate Adaptive Regression Splines (MARS) are designed to better handle nonlinear relationships. They can be seen as a blending of CART and OLS.

1) Building a model

a) There are few packages that have MARS models.

b) Variable selection is similar to OLS, but you do not need to worry as much about nonlinearities.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Continuous or Categorical  Variables
2) Diagnostic Tools

Similar to an OLS model.

3) Caveats
The output can be difficult to read for a complex model but is still understandable. MARS models are prone to overfitting.

4) Example Code (R)

library(mda)
data(glass)   # glass data as in the original example (assumed to ship with mda)
# Illustrative MARS fit: predict refractive index from the chemical-composition
# columns (the column layout of glass is an assumption).
Results.Model1 <- mars(glass[, 2:9], glass[, 1])
summary(Results.Model1)

e) Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN) are an attempt to simulate how the mind works.

1) Building a model

a) There are many good neural network packages.

b) Variable selection is similar to OLS but many non-linearities can be assumed to be handled by the ANN.

c) The output is not as interpretable as that of the previously mentioned models.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

I have not found ANNs to be the black boxes they are often criticized as being. You can use the same tools as with an OLS or logistic regression. To find the influence of each variable, you can cycle through the variables, removing each one and re-running the model. The effect of the variable can then be measured via the change in MSE.

3) Caveats
Overfitting

4) Example Code (R)

data(swiss)
library(nnet)
# Illustrative single-hidden-layer network; the formula, size, and decay settings are assumptions.
results.Model <- nnet(Fertility ~ ., data = swiss, size = 3, linout = TRUE, decay = 0.01)
summary(results.Model)

f) Support Vector Machines (SVM )

Support Vector Machines are closer to classical statistical methods but hold the promise of uncovering nonlinear relationships.

1) Building a model

a) There are a few good SVM packages, both commercial and open source.

b) Variable selection is similar to OLS, but many non-linearities can be assumed to be handled by the SVM.

c) The output is not as interpretable as that of the previously mentioned models.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

I have also found SVMs not to be black boxes. As with the ANN, you can use the same diagnostic tools as for OLS and logistic regression.

3) Caveats

Overfitting

4) Example Code (R)

data(swiss)
library(kernlab)

## train a support vector machine (the kernel and cost settings are assumptions)
results.KVSM1 <- ksvm(Fertility ~ ., data = swiss, kernel = "rbfdot", C = 1)
results.KVSM1

g) Regression Trees

Regression trees briefly became popular as a forecasting technique around the turn of the century. It was hoped that they could better model nonlinearities, but they proved to be prone to overfitting.

1) Building a model

a) There are several good Tree packages both commercial and open source.

b) Automatic variable selection.

c) The output is easy to understand.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools
You can use the same tools as OLS and logistic regression to diagnose.

3) Caveats

Overfitting

4) Example Code (R)

library(rpart)     # the kyphosis data ships with rpart
library(maptree)
data(kyphosis)

# Tree grown with rpart's default settings (the formula is the standard kyphosis example)
DefaultSettings <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
draw.tree(DefaultSettings)

h) Bagging, Boosting and Voting

Bagging is a way to help unstable models become more stable by combining many models together.

i) Boosted Trees and Random Forests

Boosted trees apply the boosting methodology to trees. You run many, in some cases hundreds, of small regression trees and then combine all the models using a voting methodology to stabilize the results. The resulting model is very complex, but much more stable than any individual tree model would be.

1) Building a model

a) There are several good tree packages both commercial and open source.

b) Automatic variable selection.

c) The output is easy to understand but very large.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

You can use the same tools as OLS and logistic regression to diagnose.

3) Caveats

Overfitting

4) Example Code (R)

data(swiss)
library(randomForest)
set.seed(131)
# Illustrative random forest; the formula and settings are assumptions.
Results.Model1 <- randomForest(Fertility ~ ., data = swiss, ntree = 500, importance = TRUE)
print(Results.Model1)

Ordinary Least Squares

Ordinary least squares (OLS) estimators are among the most commonly used statistical techniques for forecasting and causal inference. OLS is an attempt to estimate a linear process. Below is a classic example of a linear model.

Suppose you want to model the relationship between a mother's height and her child's height. You can assume this relationship is linear (we do not see a 5-foot mother having a 25-foot child). Also, the child's height should not influence the mother's, so the direction of causation is one-way. This can be represented as follows:

Y = Alpha + Beta*X + error

Where

Y is the child’s height at 18

X is the mother’s height

Beta is the influence of the mother’s height on the child’s height

Alpha is the intercept or base height for the people in the study

Error includes missing variables, measurement error, and inherent randomness to the relationship.

This model would use cross-sectional data: the data consists of individuals, and time is not a relevant factor. Ordinary least squares can also model time series data, like the influence of interest rates on GNP.

OLS is a powerful tool; however, it has many restrictive assumptions. Violating just one of the assumptions can render a model invalid. In the above example, the exclusion of a relevant variable such as poverty level (which may influence the health of the child) or a time indicator capturing advances in medical technology may invalidate or weaken the results.
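A minimal sketch of fitting this model in R, using simulated heights since no real data set is given here (the sample size and parameter values are assumptions):

set.seed(42)
MotherHeight <- rnorm(200, mean = 64, sd = 2.5)              # simulated mothers' heights (inches)
ChildHeight  <- 20 + 0.7 * MotherHeight + rnorm(200, 0, 2)   # assumed Alpha = 20, Beta = 0.7
Height.Model <- lm(ChildHeight ~ MotherHeight)
summary(Height.Model)                                        # estimates of Alpha (intercept) and Beta (slope)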

BLUE

The ordinary least squares estimator is BLUE (Best Linear Unbiased Estimator) as long as certain assumptions hold. The Gauss-Markov theorem defines the conditions under which a least squares estimator is BLUE. When a linear model violates the assumptions required for BLUE, it is no longer guaranteed to be the most accurate reflection of the true underlying relationship.

BLUE means the model is unbiased, efficient (it has the minimum variance), and consistent (the estimate improves as the sample size increases).

Some of the consequences of violating BLUE:

1. Including irrelevant independent variables or excluding relevant independent variables.

2. Poor or biased fit

3. Unclear or misleading causal relationships

Assumptions for BLUE

Each assumption is listed below with examples of violations and the effect of a violation.

1. Linearity: the relationship between the dependent and independent variables is linear. Violations: wrong regressors, non-linearity, changing parameters. Effect: coefficients are biased.
2. The error term's mean is zero and comes from a normal distribution. Violations: omitted variables, biased intercept. Effect: the intercept is biased.
3. Homoscedasticity: the error terms have a uniform variance. Violation: heteroskedasticity. Effect: standard errors are biased.
4. No serial correlation: observations are uncorrelated. Violations: errors in variables, autoregression, simultaneous equations. Effect: standard errors are biased.
5. Variables are at least interval-level data. Violation: a dummy (categorical) dependent variable. Effect: coefficients are biased.
6. No exact linear relationship between the independent variables, and more observations than independent variables. Violation: perfect multicollinearity. Effect: standard errors are inflated.

Other important assumptions

 

No outliers: this is an issue when examining data with rare events. The question becomes whether this is a rare event that must be modeled or an atypical case. Violations can lead to a biased estimate.
No measurement error: there will always be some level of measurement error, and it can lead to biased coefficients.

3. Testing if you are BLUE

Testing for BLUE

Each assumption is listed below with a test for it and a possible solution.

1. Linearity: the relationship between the dependent and independent variables is linear. Test: plot the residuals and look for non-linearity. Possible solution: transform variables.
2. The error term's mean is zero and comes from a normal distribution. Test: plot the residuals. Possible solution: use a GLM.
3. Homoscedasticity: the error terms have a uniform variance. Test: Breusch-Pagan and White tests for heteroskedasticity. Possible solution: use a GLM.
4. No serial correlation: observations are uncorrelated. Test: Durbin-Watson for autocorrelation. Possible solution: include lagged regressors.
5. Variables are at least interval-level data. Test: understand how the dependent variable is constructed. Possible solution: use logistic regression.
6. No exact linear relationship between the independent variables, and more observations than independent variables. Test: run correlations on the independent variables. Possible solution: exclude collinear variables or use two-stage least squares for strongly correlated data.
No outliers. Test: Cook's distance for outliers, plot residuals. Possible solution: remove outliers.
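A sketch of how several of these tests can be run in R, assuming a fitted lm object called Results.Model1 and the add-on packages lmtest and car:

library(lmtest)                       # Breusch-Pagan and Durbin-Watson tests
library(car)                          # variance inflation factors

plot(residuals(Results.Model1))       # eyeball non-linearity and non-normality
bptest(Results.Model1)                # Breusch-Pagan test for heteroskedasticity
dwtest(Results.Model1)                # Durbin-Watson test for serial correlation
vif(Results.Model1)                   # check for multicollinearity
cooks.distance(Results.Model1)        # Cook's distances for possible outliers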

Instrumental Variable

Here is an example problem: you want to determine how the competitive environment is affecting store performance. You proceed to estimate a model with store performance as a function of the number of rival stores nearby, in hopes of seeing how rivals affect performance. But you have violated a key assumption of BLUE: at least one independent variable is contemporaneously correlated with the error term. If a store is profitable it will attract rivals. When an independent variable is correlated with the error term, you cannot get consistent estimates using regression analysis.

The result is a downward-biased estimate of the effect of rivals on store performance. If a location is highly profitable, more firms will enter the market, increasing the number of rival stores while not necessarily adversely affecting the store's performance. This is a common occurrence. Other examples are income and education, supply and demand, and store location and credit scores.

One solution is to use Instrumental Variables (IV). The key to IV is the use of a proxy variable (or variables) correlated with the endogenous independent variable but not contemporaneously correlated with the error term. The IV can be a variable such as the number of small streams to proxy for urban versus rural, or a fitted value. Two-stage least squares (2SLS) is a common IV technique that uses a fitted value from a second regression as the IV in the first regression. When building a 2SLS model you still need independent variables that are correlated with the variable you are trying to proxy for, but not correlated with the primary model's error term, to use in the second stage of the model.

In the store performance example, you could employ a 2SLS model to estimate the number of rivals based on variables such as development tax incentives, the number of rivals before the store opened, and other factors not directly correlated with the performance of the store you are examining.

Instrumental variables can also be used to correct for omitted variables by choosing a proxy variable closely correlated with the missing variable.
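A sketch of a 2SLS fit using ivreg() from the AER package; the data frame StoreData and every variable name in it are hypothetical:

library(AER)   # ivreg() for two-stage least squares

# Performance regressed on the number of rivals, with rivals instrumented by
# tax incentives and the pre-opening rival count (all names are hypothetical).
IV.Model <- ivreg(Performance ~ NumRivals + StoreAge |
                  TaxIncentives + RivalsBeforeOpen + StoreAge,
                  data = StoreData)
summary(IV.Model, diagnostics = TRUE)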

Further Reading

Wikipedia

Economics Glossary

Missouri state Working_Paper_Series

Software:

OpenBayes A free Bayesian Belief Network library written in Python.

Resource:

The-Data-Mine.com Another website focused on data mining.

Polytomous (Multinomial) and Ordinal Logistic Models

If your dependent variable is continuous or near-continuous you use a regression technique; if the dependent variable is binary you can use a logistic regression. Oftentimes, however, what we are trying to model is neither continuous nor binary. Multi-level dependent variables are common occurrences in the real world. Examples of multi-level dependent variables are:

1: ‘yes’, ‘no’ or ‘does not respond’

2: ‘high’, ‘medium’ or ‘low’ levels of agreement.

These can sometimes be modeled using binary models. You can collapse two categories together, say 'no' and 'does not respond', resulting in a binary choice model. This, however, can obscure relationships, especially if the groups are incorrectly formed or there are several potential outcomes that do not group together logically (i.e., does medium go with low or high?).

There are two main types of models for handling mutually exclusive, multi-level dependent variables: one for ordinal outcomes and one for nominal outcomes. Nominal outcomes are ones where the ordering of the outcomes does not matter, as in example 1. Ordinal outcomes are ones where the order matters, as in example 2. For nominal outcomes you can use polytomous (multinomial) models, and for ordinal outcomes you can use ordinal or polytomous models.

Polytomous models are similar to standard logistic models, but instead of one log odds ratio to estimate you have multiple, one for each response. Ordinal models are stricter in their assumptions than polytomous models, namely they assume the response variable has a natural ordering. Like the polytomous models, you estimate a log odds for each response.
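A minimal sketch in R: multinom() from nnet fits a polytomous (multinomial) model and polr() from MASS fits an ordinal (proportional-odds) model; the Survey data frame and its columns are hypothetical:

library(nnet)   # multinom(): polytomous (multinomial) logistic model
library(MASS)   # polr(): ordinal (proportional-odds) logistic model

Nominal.Model <- multinom(Response ~ Age + Income, data = Survey)   # 'yes', 'no', 'does not respond'
Ordinal.Model <- polr(Agreement ~ Age + Income, data = Survey)      # Agreement must be an ordered factor

summary(Nominal.Model)
summary(Ordinal.Model)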

Both of these models have the same assumptions as the standard logistic model, plus a few more. One new assumption for both models is that the outcomes are mutually exclusive. In the two examples above it is clear the categories are mutually exclusive. But consider this example: which candidate do you like, A, B, or C? Someone could easily like more than one candidate. In cases where this assumption is too restrictive you may use data mining approaches such as SVM or ANN.

Another assumption is the independence of irrelevant alternatives (IIA). Since the log odds for each response are pitted against one another, it is assumed people are behaving rationally. An example of a violation of IIA is if someone prefers candidate A to B, B to C, and C to A. Given the preferences A to B and B to C, a rational person would prefer A to C. If IIA is violated it can make interpreting the log odds impossible. For a detailed discussion go to:

http://en.wikipedia.org/wiki/Independence_of_irrelevant_alternatives.
www.stat.psu.edu/~jglenn/stat504/08_multilog/10_multilog_logits.htm

www2.chass.ncsu.edu/garson/PA765/logistic.htm

www.mrc-bsu.cam.ac.uk/bugs/documentation/exampVol2/node21.html

en.wikipedia.org/wiki/Logistic_regression

nlp.stanford.edu/IR-book/html/htmledition/node189.html

www.ats.ucla.edu/STAT/mult_pkg/perspective/v18n2p25.htm

Spatial autocorrelation

Let's start off by clarifying two terms that are easy to confuse: autocorrelation and autoregression. Time series autocorrelation is where the error terms are correlated across time. In other words, past errors in the model affect the present outcome. This violates the no-serial-correlation assumption needed for BLUE, resulting in biased standard errors. Biased standard errors are bad because you cannot say for certain whether an independent variable's effect on the dependent variable is statistically valid or not. An example of time series autocorrelation is when past forecasting errors affect the present value of the dependent variable. If a system exhibits memory, this can also lead to autocorrelation. Autoregression is a means to correct for this bias by regressing a variable on lagged values of itself.

Spatial autocorrelation is another type of autocorrelation, but instead of spanning time it spans space. If a variable is correlated with itself through space, it is said to exhibit spatial autocorrelation. This can be due to misspecification of the model, measurement bias, or many other reasons. Another term for spatial autocorrelation is spatially dependent errors. Moran's I and Geary's C are the tests most commonly used to detect spatial autocorrelation. A common situation is an area affecting nearby regions. For example, imagine a high-crime neighborhood. The surrounding areas should also exhibit a higher than average crime rate due to spillover effects, and this spillover effect will degrade the further you move from the crime epicenter. The mechanics of the spatial autocorrelation in this example could be transportation routes or poor police coverage that extend the crime to outlying areas. Another example is the effect of large shopping centers on the number of stores: stores could be concentrated at the shopping center while outlying areas are devoid of stores, following Hotelling's Law.

Spatial Autoregression (SAR) models correct for spatial autocorrelation by adding the surrounding territories' dependent variables (referred to as spatially lagged values of the dependent variable) as regressors. If you were modeling crime across zip codes, you would include the crime rate of nearby zip codes for each zip code as an independent variable. This is similar to how time series autoregression models add lagged values of the dependent variable as regressors.
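A rough sketch using the spdep package, assuming a neighbour structure for the zip codes has already been built (the objects zip.nb and ZipData are hypothetical):

library(spdep)   # spatial weights plus Moran's I and Geary's C tests

zip.lw <- nb2listw(zip.nb, style = "W")           # row-standardized spatial weights
moran.test(ZipData$CrimeRate, listw = zip.lw)     # Moran's I
geary.test(ZipData$CrimeRate, listw = zip.lw)     # Geary's C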

Further Reading

Hunter College

Canada Forestry Service

Cornell U

North Carolina State University

University of West Alabama

Spatial-Statistics.com

Autoregressive Exogenous Model

ARX models are auto-regressive models with exogenous inputs. The term exogenous variable should not be confused with independent variable. Exogenous variables are determined outside of the process you are modeling. An exogenous variable can be a shift in the oil supply affecting prices, or a change in consumer preferences for foreign manufactured products affecting prices. Simply put, exogenous variables are independent of the process you are trying to model. Why are ARX models different from standard models? As an example, let's look at a model trying to predict industrial output. In this model you may want to include lagged output (industrial capacity is carried over from one period to the next) and lagged interest rates (the past cost of money influences current contracts). Both lagged output and lagged interest rates are endogenous to the system: what affects output also affects the price of money (interest rates). In this model an exogenous variable would be an oil crisis or a natural disaster. These events happen regardless of the values of output or interest rates.

How are ARX models different?

In vector auto-regressive models with exogenous variables (VARX) the distinction becomes clear. In a vector auto-regression model (VAR) all the variables are assumed to be correlated with one another. To identify the model you make an assumption about how the variables are contemporaneously correlated with one another, e.g. interest rates affect money immediately, but only lagged money affects interest rates today. With a VARX model you estimate a system of correlated variables plus exogenous variables. VARX allows outside shocks to be taken into consideration (a short code sketch follows the list of variations below).

There are many variations of ARX models.

Non-linear auto-regressive models (NARX)

Additive nonlinear autoregressive exogenous

Vector auto-regressive models (VARX)
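As a minimal illustration of an ARX-style fit, R's built-in arima() accepts exogenous regressors through its xreg argument; the output series and the oil-shock dummy below are stand-ins, not real data:

# Hypothetical example: an AR(2) process for output with an exogenous shock dummy.
Output    <- log(UKgas)                           # stand-in series for industrial output
OilShock  <- as.numeric(time(Output) >= 1973)     # assumed exogenous oil-crisis indicator
ARX.Model <- arima(Output, order = c(2, 0, 0), xreg = OilShock)
ARX.Model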

Further Reading

AMC Portal

Wikipedia

SAS

GLMM Models

Oftentimes our data conspires against us, violating essential assumptions of the standard linear model. One critical assumption is that the error terms of a model must be uncorrelated. There can be many causes of such correlation: repeated data sources (as with customers over time), missing variables, or data that is skewed over time. If those variations are linear you can model them by adding linear random effects (i.e., allow the intercept to vary by the source of the variation). One such solution is the Generalized Linear Mixed Model (GLMM). It is called a mixed model because it has both random effects and fixed effects. Remember, fixed effects are assumed to have no measurement error and to generalize across groups (no group effects), while random effects can have measurement bias or group effects.

Example one: suppose the probability of a customer returning increases after each visit. If this increase in probability is linear (e.g., after the first visit 10% more likely to return, after the second visit 20%, after the third visit 30%, and so on), then by adding a linear random effect that varies by customer you can model this correctly. If you do not correct for this, variables such as customer age (because the older you are, the more likely you are to have returned more than once) may proxy for this relationship, leading to incorrect conclusions.

Example two: suppose you are collecting data at the county level. Each county may have different fixed effects. Imagine crime rates across counties. The base rate of crime will vary while the influence of, say, the poverty rate will be the same. If you did not allow the intercept to vary, the relation between poverty and crime may be obscured by the base-rate variation across counties.
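A sketch of how such a model might be fit with glmer() from the lme4 package; the county crime data frame and its columns are hypothetical:

library(lme4)   # glmer(): generalized linear mixed models

# Crime counts by county: poverty rate as a fixed effect, plus a random
# intercept per county to absorb each county's base rate.
Crime.GLMM <- glmer(CrimeCount ~ PovertyRate + (1 | County),
                    family = poisson, data = CountyData)
summary(Crime.GLMM)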

GLMMs do not correct for non-linear variations across sampling units. If you believe customer retention is a non-linear function of visits, or that the influence of poverty varies by county, then a GLMM alone cannot correct for this. In many circumstances, however, such as the influence of poverty on crime, the variation across counties is due to missing or latent variables such as the support programs available for individuals at or below the poverty line. Likewise, the fixed effect can also be caused by omitted or latent variables. The base crime rate across counties is a function of general underlying conditions that oftentimes cannot be measured. Such latent variables are often impossible to uncover, so using a GLMM is an acceptable solution to correct for this missing-variable bias.

Further Reading

http://support.sas.com/rnd/app/papers/glimmix.pdf
http://www.wiley.com/legacy/wileychi/eosbs/pdfs/bsa251.pdf
http://www.stat.umu.se/forskning/reports/glmmML.pdf
http://web.maths.unsw.edu.au/~wand/kowpap.pdf
http://arxiv.org/PS_cache/math/pdf/0606/0606491v1.pdf
http://staff.pubhealth.ku.dk/~pd/mixed-jan.2006/glmm.pdf
http://www.stat.umn.edu/geyer/bernor/library/bernor/doc/examples.pdf

Diagnostics


1. Intro

 

Understanding and using diagnostics separates a good statistician from a hack. It is not enough just to run the diagnostics; you must challenge your model with a critical eye. To build a model is simple; to assure it is stable and accurate is a work of art. There is a variety of tools and tests that can aid you in evaluating your model. When doing diagnostics, never assume anything; always seek proof.

 


2. Tools for linear models

 

   

a. QQ plot

 

   

The QQ, or quantile-quantile, plot shows the quantiles of the residual errors plotted against the quantiles of a theoretical distribution, with a 45-degree reference line; it is especially revealing for the first and last quantiles of the dependent variable. This allows you to see how well the model fits at the extremes. A model that fits poorly will appear to curl away from the line at the ends of the plot. Having a model that fits poorly at the extremes is not a good thing, but oftentimes it is not a showstopper: by setting maximum allowable values for the model, it can still be useful in segmenting cases. To correct for poorly fitting tails, look for new explanatory variables or double-check whether you missed any non-linearities that could be confusing the system.

 

 

b. Residual Plots

 

By observing the residual plots, much can be uncovered about how the model is performing. A key thing to look for is any pattern in the plots. Since the residuals should be random, there should be no observable trend in them. Any observable pattern indicates trouble with the model.
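In R, plotting a fitted lm object produces the standard residual and QQ plots; Results.Model1 is assumed to be any fitted linear model:

par(mfrow = c(2, 2))
plot(Results.Model1)        # residuals vs. fitted, QQ plot, scale-location, leverage
par(mfrow = c(1, 1))

plot(fitted(Results.Model1), residuals(Results.Model1),
     xlab = "Fitted values", ylab = "Residuals")   # look for any pattern or trend
abline(h = 0, lty = 2)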

 

 

c. R-Squared

 

R-Squared measures the proportion of the variation of the dependent variable explained (I am using that term very loosely) by the model. R-Squared has poor standing among statisticians but can be useful if it is not the only measure of fitness of the model. It ranges from zero to one, with one being a perfect fit. One is only possible if you include the dependent variable as an explanatory variable and is therefore an indication of error. With the data I typically look at, a good model typically ranges from .1 to .3; however, I have seen models in production working well with an R-Squared as low as .07.

R^2 = 1 - ((y - Xb)'(y - Xb)) / sum((y - ybar)^2)

 

 

d. MSE

 

    MSE or Mean Squared Error is useful in choosing between multiple models. It is simply the average of the squared errors.

 

 

e. 1. Partial Regression

 

Partial regressions are an important tool for determining how the independent variables affect the model as well as one another. A partial regression gives the net effect of an independent variable after correcting for the other regressors.

 

 

e. 2. Partial Residual Plots

 

Partial residual plots show the residuals plotted against each independent variable's values. This shows how the residuals of the model vary as the value of the independent variable changes. This will uncover situations such as a variable causing too great a variation in the model at high values, leading to high residuals; in that case you would cap the independent variable. What you want to see is an even cloud of data points with a zero slope, centered on zero.

 

 

f. T-stats on Coefficients

 

The t-statistics on the regressors test the null hypothesis that the coefficient is zero, that is, that the variable has no effect on the model. If you cannot statistically justify a variable's inclusion in the model, it is preferred to remove it. Reasons for a variable failing a t-test can range from it having no relation to the dependent variable, to non-linearities influencing the results, to other independent variables clouding the true relationship. If there are firm theoretical reasons for including the variable, investigate further.

 

 

g. Economic Significance of Coefficients

 

An independent variable may be statistically significant but have little explanatory power. By calculating the economic significance of a variable you can roughly measure its contribution to the overall value of the dependent variable. The economic significance of a coefficient is the coefficient times the standard deviation of the independent variable. There is no clear cutoff for whether a coefficient is economically significant; instead, a researcher has to look at the values and decide for herself whether a given coefficient has enough, well, oomph to be considered important. It is a powerful tool to rid yourself of those pesky statistically significant but unimportant variables.
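A small sketch of the calculation, assuming a fitted model Results.Model1 with numeric regressors:

X        <- model.matrix(Results.Model1)[, -1]           # regressors, intercept column dropped
Econ.Sig <- coef(Results.Model1)[-1] * apply(X, 2, sd)   # coefficient times regressor standard deviation
sort(abs(Econ.Sig), decreasing = TRUE)                   # variables with the most oomph first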

 

h. Cook's Test

 

Cook's test is used to uncover outliers in the data. Using the Cook's distance values you can target outliers for removal. It should be remembered that not all outliers should be removed; some are representative of important behavior in the system. In modeling weather in Florida, hurricanes may look like outliers in the data, but they are a critical feature to model.

 

  i. CHOW

 

The Chow test is used to test for structural or regime changes within the data. In monetary and other financial models these are important tests. If a structural change seldom occurs, modeling the change using dummy variables can be a good choice; but if structural changes occur often, you may need to model the underlying causes of those changes to have any chance of forecasting the process.

 

  j. Durbin-Watson

 

Durbin-Watson (DW) is the standard test for serial correlation (autocorrelation). Remember, serial correlation violates BLUE and results in a biased model, and you can employ autoregression models to correct for it. When investigating time series data you always have to be conscious of the DW statistic.

 

  k. Bag Plot

 

Bag plots uncover outliers in the data and are useful alongside Cook's test.

 

  l. White

 

The White test is a standard test for heteroskedasticity. Heteroskedasticity biases the standard errors, which can lead to misjudging the significance of variables and excluding relevant ones.

 


3. Tools for Probabilistic and Categorical Models

 

 

a) Odds Ratios

 

   

The odds ratio for each independent variable indicates whether to keep that variable in the model. If the odds ratio is 1, the variable does not help the predictive power of the model, while an odds ratio statistically significantly greater than or less than one indicates the variable has predictive power.

 

 

b) Receiver Operating Characteristic (ROC) Curve

 

   

The ROC curve is used to graphically show the trade-off between sensitivity (the true positive rate) and specificity (the true negative rate). If the model has no predictive power, all the points will lie on a 45-degree line. A greater area between the 45-degree line and the ROC curve indicates a more predictive model.
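A sketch using the ROCR package, assuming a vector of predicted probabilities Pred.Prob and the observed 0/1 outcomes Actual:

library(ROCR)

Pred.Obj <- prediction(Pred.Prob, Actual)
ROC.Perf <- performance(Pred.Obj, measure = "tpr", x.measure = "fpr")
plot(ROC.Perf)
abline(0, 1, lty = 2)                              # the 45-degree no-skill line
performance(Pred.Obj, measure = "auc")@y.values    # area under the ROC curve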

 

 

c) Lorenz Curve

 

The Lorenz curve is a rotated ROC curve. In other words, it is a plot of the cumulative percentage of cases selected by the model against the cumulative percentage of actual positive events. As with the ROC curve, the area between the curve and the 45-degree line, called the Gini coefficient, is used to measure the fit of a model. The higher the coefficient, the better fitting the model.

 

 

d) Confusion Matrix

 

   

The confusion matrix shows the actual versus forecasted outcomes for a binary or categorical process.

 

                 Predicted
                 Yes    No
Actual   Yes      a      b
         No       c      d

a: The number of times the model predicted Yes and the outcome was Yes.

b: The number of times the model predicted No and the outcome was Yes

c: The number of times the model predicted Yes and the outcome was No

d: The number of times the model predicted No and the outcome was No.
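The Confusion() helper defined in Tip 6 above builds this table; Pred.Prob, Actual, and the 0.5 cut-off below are assumptions:

Predicted.Class <- ifelse(Pred.Prob > 0.5, "Yes", "No")   # assumed probability cut-off
Confusion(Actual, Predicted.Class)                        # table plus misclassification rate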

 

 

e) Profit curve

 

Shows what the model's expected return would be if it were used in production: profit versus score.

 

Descriptive Analysis

1. Frequencies\Charts\Plots

Simple frequencies and plots can tell you quickly if a relationship exists between two or more variables. However, relying solely on graphs as a diagnostic or research tool, as with any single technique, can potentially blind you to the true underlying relationship.

Example code (R)

library(car)      # scatter3d() comes from the car package (originally Rcmdr); requires rgl
data(longley)

hist(longley$Unemployed, breaks = "Sturges", col = "darkgray")
boxplot(longley$Unemployed, ylab = "Unemployed")
scatter3d(longley$Unemployed, longley$GNP, longley$Year, fit = "linear", bg = "white", grid = TRUE)

2. Correlations

Correlations measure the strength of the association between variables. The values range from -1 to 1, with 1 indicating perfect positive correlation and -1 perfect negative correlation. Remember, correlations do not show causality, only whether the variations in two or more variables are related. Also, non-linearities and interactions can obscure the relationship.

Example Code (R)

data(swiss)
results.Corr <- cor(swiss)   # pairwise correlations of all variables
results.Corr

3. ANOVA

Analysis of Variance (ANOVA) is a powerful tool to show correlation between two or more variables. While it may not lead directly to a forecast model, it can help a researcher gain knowledge of the relationships between the data elements. It is also useful for seeing how the variables in a system are related to one another.

Example Code (R)

data(Seatbelts)
anova(lm(DriversKilled ~ PetrolPrice, data = as.data.frame(Seatbelts)))

4. Cluster Analysis

Cluster analysis is oftentimes confused with principal components (factor analysis). Both are powerful unsupervised data reduction tools. While principal components is concerned with grouping the columns of a dataset together, cluster analysis is concerned with grouping the rows together. It can be a powerful tool for building rules and dummy variables. For example, if a strong group emerges from cluster analysis for young males, it would be prudent to test this subgroup, either by splitting the data or by adding a dummy variable based on it.

Example code (R)

data(swiss)
library(cluster)
# Hierarchical (agglomerative) clustering of the provinces; the method choice is an assumption.
hc <- agnes(scale(swiss), method = "ward")
plot(hc)

5. OLAP Cubes\Pivot Charts

Online Analytical Processing (OLAP) is a powerful data mining tool. It allows users to run ad hoc queries on a database quickly with little understanding of data access languages such as SQL. The end results are frequencies. OLAP requires an intelligent user (preferably a statistician) to wield it and will not uncover relationships by itself.

Most OLAP tools come with a graphical user interface (GUI). OLAP can be thought of more as a substitute for SAS or SQL. It allows users to program complex queries using a drag-and-drop interface that is intuitive to use.