Five Model Evaluation Criteria

There are many different criteria you can use to evaluate a statistical or data-mining model.  So many, in fact, that it can be a bit confusing and at times seem like a sporting event where proponents of one criterion are constantly trying to prove it is the best.  There is no such thing as a best criterion.  Different criteria tell you different things about how a model behaves.  In a given situation one criterion may be better than others, but that will change as situations change.  My recommendation, as with many other tools, is to use multiple methods and understand the strengths and weaknesses of each for the problem you are currently facing.  Many of the criteria are slight variations of one another, and most involve the residual sum of squares (RSS) in one manner or another.  The differences may be subtle but can lead to very different conclusions about the fit of a model.  This month we examine non-visual measures.  Next month we will look at visual tools.

 

MSE Criterion

The simplest measure, the mean squared error (MSE), is the average of the squared differences between actual and predicted values.  A lower MSE means a better-fitting model.  It does not provide an absolute measure of fit the way R-Squared does, so MSE is used to compare models with the same dependent variable rather than as a measure of overall fit.  Remember that you sum the squared residuals because the sum of the raw residuals should be zero, i.e. no bias in the model.
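
As a quick illustration, here is a minimal Python sketch of the calculation (the actual and predicted values are made up purely for the example):

actual    = [3.1, 2.8, 4.0, 5.2, 3.9]        # observed values (illustrative only)
predicted = [3.0, 3.1, 3.8, 5.0, 4.2]        # model predictions (illustrative only)

n = len(actual)
residuals = [a - b for a, b in zip(actual, predicted)]
mse = sum(r ** 2 for r in residuals) / n      # average of the squared residuals
print(mse)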

R-Squared

R-Squared is probably the first method you were taught in school.  It is the most derided of all the measures of goodness-of-fit, mainly because people have had a tendency to overstate its importance.  R-Squared should not be the only test you do, but it should not be ignored either.  It is criticized in textbooks because it is overused, not because it is invalid.  R-Squared values range from 0 to 1: 0 indicates the model explains none of the variation in the dependent variable, while 1 indicates it explains all of it.  In the social sciences, R-Squared values for a good model may range from .05 to .2.  In the physical sciences, a good R-Squared is much higher, between .7 and .9. 

R-Squared = 1-MSE/((1/n)*TSS)

MSE = mean squared error = RSS/n
TSS = total sum of squares, the sum of the dependent variable's squared deviations from its mean
n = number of observations.

In layman’s terms it can be thought of as a normalized MSE.
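
A minimal Python sketch of the calculation, again with made-up values for illustration:

actual    = [3.1, 2.8, 4.0, 5.2, 3.9]   # illustrative data only
predicted = [3.0, 3.1, 3.8, 5.0, 4.2]

n = len(actual)
mean_actual = sum(actual) / n
rss = sum((a - b) ** 2 for a, b in zip(actual, predicted))   # residual sum of squares
tss = sum((a - mean_actual) ** 2 for a in actual)            # total sum of squares

r_squared = 1 - rss / tss          # equivalently 1 - MSE/(TSS/n)
print(r_squared)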

The standard R-Squared does not take into consideration the number of parameters used in the model.  This leads to one flaw, namely that you can increase R-Squared simply by adding random variables to your model.  The adjusted R-Squared corrects for this.

 

Adj. R-Squared = 1 - (RSS/(n-p-1)) / (TSS/(n-1))

RSS = residual sum of squares
TSS = total sum of squares, the sum of the dependent variable's squared deviations from its mean
n = number of observations
p = number of parameters
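
Continuing the same toy example, a minimal Python sketch of the adjusted version (the value of p is an assumption for illustration):

actual    = [3.1, 2.8, 4.0, 5.2, 3.9]   # illustrative data only
predicted = [3.0, 3.1, 3.8, 5.0, 4.2]

n = len(actual)
p = 2                                    # assumed number of parameters
mean_actual = sum(actual) / n
rss = sum((a - b) ** 2 for a, b in zip(actual, predicted))
tss = sum((a - mean_actual) ** 2 for a in actual)

adj_r_squared = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
print(adj_r_squared)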

 

Akaike's Information Criterion (AIC) 

In 1972 Akaike introduced the Akaike Information Criterion (AIC).  AIC was an attempt to improve upon previous measures by penalizing the number of free parameters more heavily than the adjusted R-Squared does.  Its goal is to build the best possible model with the fewest parameters, which should reduce the likelihood of overfitting compared with R-Squared.

 

AIC = -2*ln(L(theta hat)) + 2*p
Where
        L is the maximized likelihood function
        p is the number of parameters
Or
AIC = n*(ln((2*pi*RSS)/n) +1) + 2*p

Where
      RSS is the residual sum of squares
      p is the number of parameters
      n is the number of observations
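
A minimal Python sketch of the RSS form of the formula, using the same made-up toy data (the value of p is assumed for illustration):

import math

actual    = [3.1, 2.8, 4.0, 5.2, 3.9]   # illustrative data only
predicted = [3.0, 3.1, 3.8, 5.0, 4.2]

n = len(actual)
p = 2                                    # assumed: intercept plus one slope
rss = sum((a - b) ** 2 for a, b in zip(actual, predicted))

aic = n * (math.log((2 * math.pi * rss) / n) + 1) + 2 * p
print(aic)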

 

Schwarz's Information Criterion (BIC or SIC)

The SIC penalizes additional variables more heavily than the AIC; otherwise it behaves the same.  

It is always better to penalize additional variables, right?  That sounds good; however, it does not always make for a superior measure compared with R-Squared.  For example, if you are building a model with potentially noisy data, reducing the number of parameters may make the model less stable out of sample.  By reducing the number of parameters (independent variables), each individual variable's contribution to the model will increase.  If those variables have stability issues out of sample, the model may be more likely to "explode".  By having a greater number of parameters you reduce the chance that any one variable's anomaly will yield wild results, because the other variables can compensate.  In essence you are spreading the risk of a random measurement error across multiple variables.  This example aside, the AIC and SIC are sound measures of model performance.

 

SIC = -2*ln(L(theta hat)) + p*ln(n)
Where
      L is the maximized likelihood function

Or

SIC = n*(ln((2*pi*RSS)/n) + 1) + p*ln(n)
Where
         RSS is the residual sum of squares
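
A minimal Python sketch of the RSS form, again on the made-up toy data; note that only the penalty term differs from the AIC sketch above:

import math

actual    = [3.1, 2.8, 4.0, 5.2, 3.9]   # illustrative data only
predicted = [3.0, 3.1, 3.8, 5.0, 4.2]

n = len(actual)
p = 2                                    # assumed number of parameters
rss = sum((a - b) ** 2 for a, b in zip(actual, predicted))

sic = n * (math.log((2 * math.pi * rss) / n) + 1) + p * math.log(n)
print(sic)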
 

 

Information-Based Complexity Inverse-Fisher Information Matrix Criterion (ICOMP(IFIM))

Bozdogan came up with ICOMP(IFIM) as an alternative to the AIC-based approaches.  ICOMP(IFIM) balances model fit against model complexity as measured by the inverse-Fisher information matrix.  This is superior to AIC-based approaches because it defines complexity based on the covariance matrix of the independent variables as opposed to just the count of independent variables.  For example, suppose you have one model with five independent variables that are not correlated with one another versus a model with four highly correlated parameters, and suppose both have the same MSE.  The first model intuitively should be superior to the second; however, with AIC-based approaches the second model would look superior.  ICOMP(IFIM) should compensate for this. 
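
As a rough sketch only, here is one common formulation of ICOMP(IFIM) for an ordinary least-squares model in Python: ICOMP = -2*ln(L) + 2*C1(F^-1), where the complexity term C1 compares the arithmetic and geometric means of the eigenvalues of the estimated inverse-Fisher information (covariance) matrix of the coefficients.  The data, and the exact form used, are assumptions for illustration; see Bozdogan's page in the links below for the full treatment.

import numpy as np

# Made-up design matrix X (intercept plus one regressor) and response y.
X = np.column_stack([np.ones(6), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # OLS fit
rss = float(np.sum((y - X @ beta) ** 2))
sigma2 = rss / n

log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)   # Gaussian log-likelihood at the fit

inv_fisher = sigma2 * np.linalg.inv(X.T @ X)            # covariance of the coefficient estimates
s = inv_fisher.shape[0]
c1 = (s / 2) * np.log(np.trace(inv_fisher) / s) - 0.5 * np.log(np.linalg.det(inv_fisher))

icomp = -2 * log_lik + 2 * c1
print(icomp)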

 

 

Further Reading

www.psy.vanderbilt.edu

http://web.utk.edu/~bozdogan/infomod.html

www.ntua.gr/ISBA2000/new/Fin_sessions_1.html

en.wikipedia.org/wiki/Coefficient_of_determination

etd.fcla.edu/UF/UFE1000122/erdogmus_d.pdf

faculty.psy.ohio-state.edu/myung/personal/info_matrix.pdf