Model Validation


1. Intro

When managing a modeling project, validation should take up roughly fifty percent of development time. The last thing you want is for model validation to be a meaningless rubber stamp. Expect the first run of every model to fail the validation stage. To save time, I recommend having several models finished before entering the final stage, and including the person in charge of validation in all discussions of the model.

Carl Sagan, in The Demon-Haunted World, said this about long-range forecasting:

"Each field of science has its own complement of pseudo-science. Physicists have perpetual motion machines, an army of amateur relativity disprovers and perhaps cold fusion. Chemists still have alchemy. Economists have long-range forecasting."

In the validation stage we strive to make Carl Sagan's statement false.

2. Issues

a) Overfitting

Overfitting (the bias-variance tradeoff) is when you fit the model too closely to the sample data. Why is fitting your model closely to the data bad?

1) Remember you are estimating a stochastic process. We assume there is some unexplained error. You do not want to explain random error, or model measurement error that exists only in the sample you are observing.

2) The more complex the model, the more likely you are to be wrong. One common mistake is to include multiple non-linear transformations of a variable to fit a complex relationship. Every added variable is an added assumption about the true underlying model. Every added assumption is one more assumption that could be wrong, even if it worked in sample.

In forecasting you want a general model, not one specific to the test data. Think of tables with uneven surfaces, where the large unevenness is systemic and the minor unevenness is particular to each table. If you tried to model the surface with clay, the best-fitting general model would come from pressing the clay onto a tabletop with light pressure. The worst model would come from applying great pressure, thereby creating one perfect mold of a particular table that fits no other table. Remember the goal. The sketch below makes the tradeoff concrete.
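A minimal sketch of the idea, assuming NumPy is available; the sine signal, noise level, sample sizes, and polynomial degrees are all illustrative choices. As the degree grows, in-sample error keeps shrinking while error on fresh data from the same process eventually worsens.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n):
        # true signal plus irreducible noise we should not try to explain
        x = rng.uniform(-1, 1, n)
        return x, np.sin(3 * x) + rng.normal(0, 0.3, n)

    x_train, y_train = sample(30)   # the sample we fit on
    x_new, y_new = sample(1000)     # fresh data from the same process

    for degree in (1, 3, 9):
        coefs = np.polyfit(x_train, y_train, degree)
        mse_in = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
        mse_out = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
        print(f"degree {degree}: in-sample MSE {mse_in:.3f}, "
              f"new-data MSE {mse_out:.3f}")

The exact numbers will vary; the pattern of in-sample error falling while new-data error rises is the point.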

b) Sub-Sample Stability

Does the model produce stable results for the major sub-segments of the population? For example, how well does it predict for people over 65? The sketch below shows one way to check.
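A minimal sketch of a per-segment check, assuming pandas; the column names and toy numbers are illustrative. A large gap in error between segments is a warning sign.

    import pandas as pd

    def mse_by_segment(df, segment_col):
        # mean squared error computed separately for each segment
        err = (df["actual"] - df["predicted"]) ** 2
        return err.groupby(df[segment_col]).mean()

    df = pd.DataFrame({
        "actual":    [10, 12, 30, 28, 55, 60],
        "predicted": [11, 13, 25, 27, 40, 42],
        "age_band":  ["under_65", "under_65", "under_65",
                      "over_65", "over_65", "over_65"],
    })

    print(mse_by_segment(df, "age_band"))  # unstable if one segment is far worse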

c) Predictive Power

Questions to answer:

1. Does it treat subgroups fairly? 

2. Does it exhibit adverse selection for any subgroup?

3. Does it provide sufficient separation? 

4. How stable is its performance across samples?

5. And, not to be forgotten, is it profitable? (A sketch of this last question follows the list.)
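A minimal sketch of the profitability question, assuming NumPy; the scores, outcomes, and per-decision payoffs are illustrative. Sweeping the score threshold and assigning a value to each kind of decision turns fit into money:

    import numpy as np

    scores = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.85, 0.9])
    actual = np.array([0,   0,   0,    1,   0,   1,   1,    1])

    PROFIT_TP = 100.0   # assumed value of accepting a true positive
    COST_FP = -40.0     # assumed cost of accepting a false positive

    for t in (0.3, 0.5, 0.7):
        accept = scores >= t
        tp = np.sum(accept & (actual == 1))
        fp = np.sum(accept & (actual == 0))
        print(f"threshold {t}: profit = {tp * PROFIT_TP + fp * COST_FP:.0f}")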


3. Methods

a) Out-of-Sample Forecasting

The easiest way to test the performance of a forecasting model is to take data the model has never seen and measure how well it predicts. This is the best and simplest way to test a model's robustness. The unseen data is often called the holdout sample. A minimal example follows.
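A minimal sketch, assuming scikit-learn and simulated data (the coefficients and noise level are illustrative): the holdout rows play no part in fitting and are touched only once, for the final evaluation.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1.0, 500)

    # set aside 30% of the data; the model never sees it during fitting
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("holdout MSE:", mean_squared_error(y_hold, model.predict(X_hold)))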

b) Cross-Validation

When you do not have enough data for separate training, validation, and holdout samples, cross-validation is an alternative. There are many types of cross-validation. A simple version (sketched in code after the list) would be as follows:

1. Partition your sample into multiple subgroups.

2. Train your model using all but one partition.

3. Validate your model on the remaining partition.

4. Repeat until all partitions have been used as the validation set.
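A minimal sketch of that procedure, assuming scikit-learn; the simulated data and the choice of five folds are illustrative.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(0, 1.0, 200)

    fold_mse = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                    random_state=0).split(X):
        # train on all partitions but one, validate on the one left out
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        fold_mse.append(mean_squared_error(y[val_idx],
                                           model.predict(X[val_idx])))

    print("per-fold MSE:", np.round(fold_mse, 3))
    print("mean MSE:", np.mean(fold_mse))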

I have not had great luck with cross-validation. When I have used it in production, the models proved less stable. One thing I have noticed is that papers using cross-validation tend to favor overfitting techniques. Since you typically use data sampled at one point in time, most of the within-sample variance will be low. If the sample were pulled at different points in time and from different sources, cross-validation should be more effective.

4. Model Diagnostic Tools

Below is a brief overview of some common diagnostic tools. These tools are reviewed more closely in the statistics sections.

Diagnostic | Purpose | Type of dependent variable
QQ plot | Stability of predictions at the tail | Linear
Residual plots | Fit, non-linearities, structural shifts | Linear
R-squared | Fit | Linear
MSE | Fit | All
Partial regression | Parameter validity, structural shifts | All
T-statistics | Parameter validity | Linear
White test | Heteroskedasticity | Linear
Durbin-Watson | Autocorrelation | Linear
Economic significance | Parameter significance | All
Receiver Operating Characteristic (ROC) curve | Fit | Binary, categorical
Lorenz curve | Fit, lift | Binary, categorical
Profit curve | Model's relevance | All
Confusion matrix | Fit | Binary, categorical
Odds ratios | Parameter relevance | Binary, categorical
Trees | Non-linear relationships | All
Influence of variables on MSE | Like a partial regression for learning methods | All
Out-of-sample forecasting | Stability, fit | All
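Many of these diagnostics come packaged with standard regression output. A minimal sketch, assuming statsmodels and simulated data (the coefficients and noise level are illustrative), computing a few of the tabled items:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_white
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(200, 2)))
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1.0, 200)

    fit = sm.OLS(y, X).fit()
    print("R-squared:    ", fit.rsquared)                # fit
    print("t-statistics: ", fit.tvalues)                 # parameter validity
    print("Durbin-Watson:", durbin_watson(fit.resid))    # autocorrelation
    print("White p-value:", het_white(fit.resid, X)[1])  # heteroskedasticity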