1. Intro

A simulation is an attempt to mimic the real world using analytical methods. Simulations are ideal for forecasting systems where the true relationship is too difficult to estimate or where the system is easily modeled. They are not necessarily alternatives to heuristic rules and statistical techniques, but rather another way of forecasting with those techniques.  To build a simulation model you may have to rely on statistics, heuristics, or both to build the core logic. Alternatively, you could use theoretical models as the core of your simulation. Simulations are powerful predictive tools and are also useful for running what-if scenarios. Example simulation models:

a) Supply and Demand

b) Queuing models, prisons, phone,…

c) Factory floor

2. Monte Carlo Simulations

Monte Carlo simulations are stochastic models. They simulate the real world by assuming it is a random or stochastic process.

a. Random Number Generators

Most of the time we do not have truly random numbers but pseudo-random numbers, typically generated by an algorithm seeded with the date and time at which the generator was started. These random numbers are drawn from a uniform distribution.

One way to generate a random number drawn from a particular distribution is to work out the cumulative distribution function (CDF) for the random variable, then treat a random number from a uniform distribution as a cumulative probability and invert the CDF to find the value that corresponds to it. In cases where the CDF cannot easily be inverted, a simple algorithm will often work.
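As a minimal sketch of turning a uniform random number into a draw from another distribution, the JavaScript below inverts the exponential CDF. The function names and the `rng` parameter are illustrative assumptions, not part of the original text; `rng` can be `Math.random` or any function returning a uniform number in [0, 1).

```javascript
// Inverse transform sampling for an exponential distribution.
// The CDF is F(x) = 1 - exp(-rate * x); setting F(x) = u and solving
// for x gives x = -ln(1 - u) / rate.
// `rng` is any function returning a uniform number in [0, 1).
function sampleExponential(rate, rng) {
  var u = rng();
  return -Math.log(1 - u) / rate;
}

// Average of n draws; it should approach the theoretical mean 1 / rate.
function exponentialSampleMean(rate, n, rng) {
  var total = 0;
  for (var i = 0; i < n; i++) {
    total += sampleExponential(rate, rng);
  }
  return total / n;
}

// With rate 0.5 the printed average should be close to 2.
console.log(exponentialSampleMean(0.5, 100000, Math.random));
```

Using `1 - u` rather than `u` keeps the argument of the logarithm strictly positive, since `Math.random()` can return 0 but never 1.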

b. Markov Chains

A Markov chain is a sequence of random values in which each value depends only on the one immediately before it, not on the rest of the history. A classic example of a Markov chain is a random walk. A random walk, or drunkard's walk, is a walk in which each successive step is taken in a random direction. Studying Markov chains is not an excuse to hang out in bars more often; a real drunk has an intended direction but an impaired capacity for executing that intention.
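A random walk is short enough to sketch directly. The version below is a one-dimensional illustration with assumed names (`randomWalk`, `rng`); the next position is computed from the current position alone, which is what makes it a Markov chain.

```javascript
// One-dimensional random walk: each step moves one unit left or right
// with equal probability. The next position depends only on the current one.
// `rng` is any function returning a uniform number in [0, 1).
function randomWalk(steps, rng) {
  var position = 0;
  var path = [position];
  for (var i = 0; i < steps; i++) {
    position += rng() < 0.5 ? -1 : 1;
    path.push(position);
  }
  return path;
}

// Print the positions visited during a ten-step walk.
console.log(randomWalk(10, Math.random).join(", "));
```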

c. Example Model (Queuing)

1) Intro

Queuing models have one or more servers that process either people or items. If a server cannot process people instantaneously and more than one person arrives, a line forms behind the server.

2) Open Jackson Network Queuing

1. Arrival follows a Poisson process

2. Service time is independent and exponentially distributed

3. The probability of completing one node and going on to another is independent

4. The network is open, so items can re-enter the system

5. Assume an infinite number of servers.

Note: Use an Erlang instead of Poisson distribution when you have a finite number of servers.

Statistical/AI Techniques

1. Intro

There is a forest of traditional statistical techniques and new artificial intelligence algorithms for forecasting. Choosing the right one can be difficult.

2. Choosing variables

With the Information Age, forecasters got a mixed blessing.  Now we have more data than the most optimistic forecaster dreamed of just fifteen years ago.  We typically work with datasets consisting of thousands of data elements and millions of records, but what do we do with all this… stuff? Most of the data elements logically have no relation whatsoever to the features we are studying. Worse, many of the variables are hopelessly correlated with one another, and with so many variables to compare, many spurious relationships will emerge from this plethora of information by chance alone.

a) Research

Again, many problems can be solved by communicating or by reading the research. If it is important, someone has probably done it before.

b) Systematic algorithms

The method I favor is writing systematic algorithms to cycle through all data elements available, analyze their relationship with the target feature using a measure like MSE then cherry-pick the best element for further analysis.

b.1) Stepwise Regression

Stepwise regressions reduce the number of variables in a model by adding or removing variables one at a time and calculating the marginal gain from including each variable.

b.2) Lorenz, ROI, and ROC curves.

Cycle through each potential independent variable and generate curves showing the relationship with the dependent variable.

b.3) Correlation

A simple two-way correlation between the potential independent and dependent variables is another technique for finding potential independent variables.
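A minimal sketch of this kind of screening follows: compute the Pearson correlation of each candidate column with the dependent variable and rank the candidates by absolute correlation. The function names and the variable names `income` and `noise` are hypothetical, and the sketch assumes no candidate column is constant (a constant column would give a division by zero).

```javascript
// Pearson correlation between two equal-length numeric arrays.
function pearson(x, y) {
  var n = x.length, sumX = 0, sumY = 0, i;
  for (i = 0; i < n; i++) { sumX += x[i]; sumY += y[i]; }
  var meanX = sumX / n, meanY = sumY / n;
  var sxy = 0, sxx = 0, syy = 0;
  for (i = 0; i < n; i++) {
    var dx = x[i] - meanX, dy = y[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Rank candidate variables by absolute correlation with the target.
// `candidates` maps a variable name to its column of values.
function rankByCorrelation(candidates, target) {
  return Object.keys(candidates)
    .map(function (name) {
      return { name: name, r: pearson(candidates[name], target) };
    })
    .sort(function (a, b) { return Math.abs(b.r) - Math.abs(a.r); });
}

// Hypothetical data: `income` tracks the target closely, `noise` does not.
var target = [1, 2, 3, 4, 5];
var ranking = rankByCorrelation(
  { noise: [3, 1, 4, 1, 5], income: [2, 4, 6, 8, 10] },
  target
);
console.log(ranking);
```

Ranking by absolute value keeps strongly negative relationships near the top, since they are as useful for forecasting as positive ones.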

c) Data mining

There are several data mining techniques I will discuss in the data mining section that are geared to uncover linear and non-linear relationships in the data.

d) Principal Components/Factor Analysis

As mentioned in the previous section this technique can aid in reducing the number of variables whose relationships need to be estimated with the hope of not losing too much information. Again, this is an estimation technique and should be treated as such.

3. Forecasting Techniques

Below is a cursory overview of common forecasting techniques. A more detailed overview is provided in the statistics and data mining sections. All the example code is in R except where noted.

a) Ordinary Least Squares Regression

This is the classic forecasting technique taught in schools.

1) Building a model

a) There are many packages that provide ordinary least squares estimates.

b) Variable selection is important.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Continuous or Categorical Variables

2) Diagnostic Tools

In the statistical section, I go over the evaluation of OLS models in detail. Here are some tools for uncovering major issues when building an OLS model:

a) QQ Plot

b) Residuals plots

c) Correlations

d) Partial Regressions

e) MSE

f) R-Squared

3) Caveats

OLS requires strong assumptions about the nature of the data and the relationships in it to remain the best linear unbiased estimator (BLUE).  BLUE is discussed in detail in the statistical section.

4) Example Code(R)


b) Logistic Regressions

Logistic regressions have a very different theoretical foundation from ordinary least squares models.  You are trying to estimate a probability, so the dependent variable only takes the values 1 or 0. This violates the assumptions OLS requires to be BLUE.

1) Building the model

a) Many software packages have Logistic regression models included.

b) Variable selection is important.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  Variables

2) Diagnostic Tools

a. Confusion Matrix

b. Lift Charts

c. ROC chart

d. Lorenz Curve

e. Cost Sensitivity/ROI
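Of the tools listed above, the confusion matrix is the simplest to compute by hand. The sketch below assumes the model outputs a probability for each case and that a case is classified as 1 when that probability reaches a chosen threshold; the function name and the threshold value are illustrative assumptions.

```javascript
// Build a confusion matrix from actual 0/1 outcomes and predicted probabilities.
// A case is classified as 1 when its probability is at or above `threshold`.
function confusionMatrix(actual, predictedProb, threshold) {
  var counts = { truePositive: 0, falsePositive: 0, trueNegative: 0, falseNegative: 0 };
  for (var i = 0; i < actual.length; i++) {
    var predicted = predictedProb[i] >= threshold ? 1 : 0;
    if (predicted === 1 && actual[i] === 1) { counts.truePositive++; }
    else if (predicted === 1 && actual[i] === 0) { counts.falsePositive++; }
    else if (predicted === 0 && actual[i] === 0) { counts.trueNegative++; }
    else { counts.falseNegative++; }
  }
  return counts;
}

// Hypothetical outcomes and model probabilities.
var matrix = confusionMatrix([1, 0, 1, 0, 1], [0.9, 0.8, 0.4, 0.1, 0.7], 0.5);
console.log(matrix);
```

Re-running the function over a range of thresholds gives the points needed for an ROC chart.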

3) Caveats

Non-linearities can obscure relationships between variables.

4) Example Code(R)


c) Vector Autoregression Models (VAR)

Vector Autoregression models require a firm theoretical foundation.  They are designed for estimating the relationships among a matrix of codependent, autocorrelated variables.  To identify the structure you must make strong assumptions about the structure of the errors, namely how the errors are temporally related.

1) Building a model

a) There are few packages that provide Vector Auto Regression models.

b) Variable selection is critical.

c) Outputs a simple equation.

d) They correct for auto-correlation and simultaneous equations.

e) Can model time series data.

f) Continuous Variables

2) Diagnostic Tools

Same as for an OLS model.

3) Caveats

By changing the order of the variables in the model you completely change your theory of what the true relationship is.  For example, if you order money first you believe the money supply drives output.  If you order output first you believe output drives money.  These are two contradictory models.  Because VAR models rely so heavily on a detailed understanding of the true relationship between variables, and also require all the assumptions of an OLS model, they have fallen out of favor in many forecasting circles.

4) Example Code(R)

# This example uses a Bayesian VAR model with flat priors.

# Flat priors models
szbvar (longley, p=1 , z=NULL, lambda0=1, lambda1=1, lambda3=1, lambda4=1, lambda5=0, mu5=0, mu6=0, nu=0, qm=4, prior=2,


d) Multivariate Adaptive Regression Splines (MARS)

Multivariate Adaptive Regression Splines are designed to deal better with nonlinear relationships.  They can be seen as a blending of CART and OLS.

1) Building a model

a) There are few packages that have MARS models.

b) Variable selection is similar to OLS but you do not need to worry as much about nonlinearities.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Continuous or Categorical  Variables
2) Diagnostic Tools

Similar to an OLS model.

3) Caveats
The output of a complex model can be difficult to read but is still understandable.  MARS models are prone to overfitting.

4) Example Code (R)


e) Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN) are an attempt to simulate how the mind works.

1) Building a model

a) There are many good neural network packages.

b) Variable selection is similar to OLS but many non-linearities can be assumed to be handled by the ANN.

c) The output is not understandable in the way the output of the previously mentioned models is.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

I have not found ANNs to be the black boxes they are often criticized as being. You can use the same tools as with an OLS or logistic regression.  To find the influence of each variable, cycle through the variables, removing each one in turn and re-running the model.  The effect of the variable can be measured via the change in MSE.
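The remove-and-re-run idea can be sketched without committing to a particular model. In the JavaScript below, `scoreModel` is a hypothetical callback that trains whatever model you are using on the listed variables and returns its MSE on validation data; the function and variable names are illustrative assumptions.

```javascript
// Drop-one variable influence.
// `scoreModel(names)` is a hypothetical callback: it trains a model on the
// listed variables and returns that model's MSE on validation data.
function variableInfluence(variableNames, scoreModel) {
  var baseline = scoreModel(variableNames);
  return variableNames
    .map(function (dropped) {
      var remaining = variableNames.filter(function (name) { return name !== dropped; });
      // A large increase in MSE means the dropped variable mattered.
      return { variable: dropped, mseIncrease: scoreModel(remaining) - baseline };
    })
    .sort(function (a, b) { return b.mseIncrease - a.mseIncrease; });
}

// Toy stand-in for a real training run: variable "a" removes 6 units of
// error, "b" removes 1, and "c" contributes nothing.
function toyScore(names) {
  var mse = 10;
  if (names.indexOf("a") !== -1) { mse -= 6; }
  if (names.indexOf("b") !== -1) { mse -= 1; }
  return mse;
}

console.log(variableInfluence(["a", "b", "c"], toyScore));
```

Each call to `scoreModel` retrains the model, so the cost grows with the number of variables.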

3) Caveats

4) Example Code (R)


f) Support Vector Machines (SVM)

Support Vector Machines are closer to classical statistical methods but hold the promise of uncovering nonlinear relationships.

1) Building a model

a) There are a few good SVM packages, both commercial and open source.

b) Variable selection is similar to OLS but many non-linearities can be assumed to be handled by the SVM.

c) The output is not understandable in the way the output of the previously mentioned models is.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

I have also found SVMs not to be black boxes. As with ANNs, you can use the same diagnostic tools as for OLS and logistic regression.

3) Caveats


4) Example Code (R)


## train a support vector machine


g) Regression Trees

Regression trees briefly became popular as a forecasting technique around the turn of the century.  It was hoped that they could better model nonlinearities, but they proved to be prone to overfitting.

1) Building a model

a) There are several good Tree packages both commercial and open source.

b) Automatic variable selection.

c) The output is easy to understand.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools
You can use the same tools as OLS and logistic regression to diagnose.

3) Caveats


4) Example Code (R)



h) Bagging, Boosting and Voting

Bagging is a way to help unstable models become more stable by combining many models together.
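A minimal sketch of bagging follows, assuming bootstrap resampling and simple averaging of predictions. The base model (`fitStump`, a one-split tree on rows of the form `{x, y}`) is a hypothetical stand-in chosen only because it is short and unstable; any fitting function that returns a predictor could be passed to `bag` instead.

```javascript
// Draw a bootstrap sample: same size as the data, sampled with replacement.
// `rng` is any function returning a uniform number in [0, 1).
function bootstrapSample(rows, rng) {
  var sample = [];
  for (var i = 0; i < rows.length; i++) {
    sample.push(rows[Math.floor(rng() * rows.length)]);
  }
  return sample;
}

// Hypothetical unstable base model: a one-split "stump" on rows of {x, y}.
// It splits at the median x and predicts the mean y on each side.
function fitStump(rows) {
  var xs = rows.map(function (r) { return r.x; }).sort(function (a, b) { return a - b; });
  var split = xs[Math.floor(xs.length / 2)];
  var left = [], right = [];
  rows.forEach(function (r) { (r.x < split ? left : right).push(r.y); });
  function mean(values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) { total += values[i]; }
    return total / values.length;
  }
  var rightMean = mean(right); // never empty: the split point itself falls on this side
  var leftMean = left.length ? mean(left) : rightMean; // fall back when nothing lies left of the split
  return function (x) { return x < split ? leftMean : rightMean; };
}

// Bagging: fit one model per bootstrap sample and average their predictions.
function bag(rows, fit, modelCount, rng) {
  var models = [];
  for (var m = 0; m < modelCount; m++) {
    models.push(fit(bootstrapSample(rows, rng)));
  }
  return function (x) {
    var total = 0;
    for (var i = 0; i < models.length; i++) { total += models[i](x); }
    return total / models.length;
  };
}

// Hypothetical data: y jumps from 0 to 10 at x = 5.
var stepData = [];
for (var i = 0; i < 10; i++) { stepData.push({ x: i, y: i < 5 ? 0 : 10 }); }
var bagged = bag(stepData, fitStump, 50, Math.random);
console.log(bagged(0), bagged(9));
```

Averaging suits a continuous target; for a categorical target the same structure works with a majority vote in place of the mean.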

i) Boosted Trees and Random Forests

Boosted trees apply the boosting methodology to trees. You run many, in some cases hundreds, of small regression trees, then combine all the models using a voting methodology to stabilize the results.  The resulting model is very complex, but much more stable than any individual tree model would be.

1) Building a model

a) There are several good tree packages both commercial and open source.

b) Automatic variable selection.

c) The output is easy to understand but very large.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

You can use the same tools as OLS and logistic regression to diagnose.

3) Caveats


4) Example Code (R)

library (randomForest)

Simulation Example

This month we are going to take a stab at programming a simple queuing model in JavaScript.  First, why JavaScript and not R or SAS?  JavaScript is a good language to use for programming examples because nearly anyone with a computer and a browser can play around with it. Another good reason is that while JavaScript is not Java, its syntax is very similar to that of Java and of other languages such as C++ and C#.  Learning JavaScript will help you code in those languages as well.

The model:

We are going to code a very simple queuing model. For this simulation we are going to assume a single server (where the people are processed), that people arrive following a Poisson process, and that the service time is a random number with an exponential distribution.

See the code running here: example simple queuing model.

Code was written by Ted Harris, Dec 2007 for
Questions and comments to:
<script type="text/javascript">

/* First, we need to decide which distributions to use for the arrival and service times.
Assuming arrivals follow a Poisson process means the time between arrivals follows an exponential distribution.
For service time we can also assume an exponential distribution.
Generating a random number from an exponential distribution is easy. We just need to solve
u = exp(-alpha*x) for x, giving us x = (-1/alpha)*log(u). Generate u from a uniform distribution and you are done. */

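function exponential(intAlpha) {
  /* 1 - Math.random() lies in (0, 1], which avoids taking the log of zero. */
  var intR = 1 - Math.random();
  var intX = (-1 / intAlpha) * Math.log(intR);
  /* Round to whole time units. */
  return Math.round(intX);
}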

/* Here is the core of the code that runs the simulation. */
function runsim() {

  /* First we declare dynamic arrays to hold our cases and results. */
  var aryCases = new Array();
  var aryResults = new Array();
  var intTotCases = 100; /* The total number of cases you want to simulate. */
  var fltArrival = .5; /* This variable sets the average arrival rate. */
  var fltWait = .5; /* This sets the average service rate; the mean service time is 1/fltWait. */
  var i;

  /* Next we populate the interarrival and service times for each case. */
  for (i = 0; i < intTotCases; i++) {
    aryCases[aryCases.length] = new Array(i, exponential(fltArrival), exponential(fltWait));
  }

  /* Since we are assuming one server and one queue this is a relatively simple setup.
     Everything depends on the first case.
     Now we can set the arrival and departure time for the first case. This one is simple:
     Arrival = Exponential random number
     Departure = Arrival + Exponential random number. */
  aryResults[aryResults.length] = new Array(0, aryCases[0][1], aryCases[0][1] + aryCases[0][2], aryCases[0][1], aryCases[0][2]);

  /* Now we set the arrival and departure times for the rest of the cases.
     Arrival = Arrival time of previous queue member + Exponential random number
     Departure = the maximum of (Arrival + Exponential random number) and
     (exit time of previous queue member + Exponential random number). */
  for (i = 1; i < aryCases.length; i++) {
    aryResults[aryResults.length] = new Array(i, aryResults[i-1][1] + aryCases[i][1], Math.max((aryResults[i-1][1] + aryCases[i][1] + aryCases[i][2]), (aryResults[i-1][2] + aryCases[i][2])));
  }

  /* Now write the results to the webpage. */
  writeresults(aryResults);
}

/* This function writes the results. */
function writeresults(aryResults) {
  var i;
  /* First, find the target object in the document (webpage) to overwrite. */
  var objParent = document.getElementById("results");
  /* Create a new version of that object to replace it. */
  var objNewParent = document.createElement("span");
  /* Set the object ID to match that of the object we are replacing. */
  objNewParent.id = "results";
  /* Now, clear out any child objects (content) associated with the old parent. */
  while (objParent.firstChild) {
    objParent.removeChild(objParent.firstChild);
  }

  /* Write the header to the output. */
  objNewParent.innerHTML += "<p>Arrives, Departs</p>";
  /* Now, loop through the results table and write the contents to the new span we have created. */
  for (i = 0; i < aryResults.length; i++) {
    objNewParent.innerHTML += "<p>" + aryResults[i][1] + ", " + aryResults[i][2] + "</p>";
  }
  /* Replace the old object with our new object, which has our new results written to it. */
  objParent.parentNode.replaceChild(objNewParent, objParent);
}
</script>


<!-- Create a button with an onclick event to run the code. -->
<button onclick="runsim();">Run simulation</button>
<!-- Add a break to make the page look a little better. -->
<br />
<!-- Create a span to write the output to. It is important to set the id for the span so JavaScript can find it later. -->
<span id="results">
</span>

Now you can play around with the parameters and copy and paste the output into Excel to do further research. In some versions of Excel the columns will not be properly placed.  To correct for this, copy and paste the output into a text document, then change the file extension to csv.  Excel should now open the file correctly. One fun thing to do is to slowly increase the time to be served while keeping the arrival rate constant and watch how quickly the average wait time increases.
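For comparison with the simulated output, the textbook steady-state results for a single-server queue with Poisson arrivals and exponential service times (an M/M/1 queue) can be computed directly. This sketch is an addition, not part of the original program: the function name is an assumption, the formulas hold only when the arrival rate is below the service rate, and the simulation above rounds times to whole units, so its averages will only approximate these values.

```javascript
// Steady-state M/M/1 results. Valid only when arrivalRate < serviceRate;
// otherwise the queue grows without bound and null is returned.
function mm1(arrivalRate, serviceRate) {
  if (arrivalRate >= serviceRate) { return null; }
  var rho = arrivalRate / serviceRate; // server utilization
  return {
    utilization: rho,
    meanWaitInQueue: rho / (serviceRate - arrivalRate),
    meanTimeInSystem: 1 / (serviceRate - arrivalRate)
  };
}

// Hold the arrival rate at 0.5 and slow the server down step by step.
var serviceRates = [2, 1, 0.75, 0.6, 0.55];
for (var s = 0; s < serviceRates.length; s++) {
  var result = mm1(0.5, serviceRates[s]);
  console.log("service rate " + serviceRates[s] + ": mean wait in queue " + result.meanWaitInQueue);
}
```

The mean wait rises slowly at first and then very sharply as the service rate approaches the arrival rate.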


Five Model Evaluation Criteria

There are many different criteria for evaluating a statistical or data-mining model.  So many, in fact, that it can be a bit confusing and at times seem like a sporting event where proponents of one criterion are constantly trying to prove it is the best.  There is no such thing as a best criterion.  Different criteria tell you different things about how a model behaves.  In a given situation one criterion may be better than the others, but that will change as situations change.  My recommendation, as with many other tools, is to use multiple methods and understand the strengths and weaknesses of each method for the problem you are currently facing.   Many of the criteria are slight variations of one another and most contain the residual sum of squares (RSS) in one manner or another.  The differences may be subtle but can lead to very different conclusions about the fit of a model.  This month we examine non-visual measures.  Next month we will look at visual tools.


MSE Criterion

The simplest of the measures, the mean squared error (MSE), is the average of the squared differences between actual and predicted values.   A lower MSE means a better fitting model.  It does not provide an absolute measure like R-Squared, so MSE is used to compare models of the same dependent variable, not as a measure of overall fit.  Remember that you sum the squared residuals because the sum of the residuals themselves should be zero, i.e. there is no bias in the model.
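As a small illustration, MSE can be computed in a few lines. The function name and the sample numbers are assumptions made for the example.

```javascript
// Mean squared error: the average squared difference between
// actual and predicted values.
function meanSquaredError(actual, predicted) {
  var total = 0;
  for (var i = 0; i < actual.length; i++) {
    var residual = actual[i] - predicted[i];
    total += residual * residual;
  }
  return total / actual.length;
}

// Two hypothetical models of the same dependent variable.
var observed = [3, 5, 7, 9];
console.log(meanSquaredError(observed, [3, 5, 7, 11])); // model A
console.log(meanSquaredError(observed, [2, 6, 6, 10])); // model B
```

Both models here have an MSE of 1, although model A misses badly on one case and model B misses slightly on all four, which is one reason to look at more than one criterion.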


R-Squared

R-Squared is probably the first measure you were taught in school.  It is the most derided of all the measures of goodness-of-fit, mainly because in the past people tended to overstate its importance.  R-Squared should not be the only test you do, but it should not be ignored.  It is criticized in textbooks because it is overused, not because it is invalid. For a least squares model with an intercept, R-Squared values range from 0 to 1: 0 indicates the model explains none of the variation in the dependent variable, while 1 indicates it explains all of it.  (It is the correlation coefficient r, not R-Squared, that ranges from -1 for a perfect negative correlation to 1 for a perfect positive correlation.)  In the social sciences, R-Squared values for a good model range from .05 to .2.  In the physical sciences, a good R-Squared is much higher, between .7 and .9.

R-Squared = 1-MSE/((1/n)*TSS)

MSE = mean squared error = RSS/n
RSS = residual sum of squares
TSS = total sum of squares, the sum of squared deviations of the dependent variable from its mean
n = number of observations.

In layman’s terms it can be thought of as a normalized MSE.
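A minimal sketch of the formula above, written as 1 - MSE/(TSS/n) so that the "normalized MSE" reading is visible in the code. The function name is an assumption.

```javascript
// R-Squared = 1 - MSE / (TSS / n), where MSE = RSS / n.
function rSquared(actual, predicted) {
  var n = actual.length, sum = 0, i;
  for (i = 0; i < n; i++) { sum += actual[i]; }
  var mean = sum / n;
  var rss = 0, tss = 0;
  for (i = 0; i < n; i++) {
    rss += Math.pow(actual[i] - predicted[i], 2); // residual sum of squares
    tss += Math.pow(actual[i] - mean, 2);         // total sum of squares
  }
  var mse = rss / n;
  return 1 - mse / (tss / n);
}

console.log(rSquared([1, 2, 3], [1, 2, 3])); // perfect fit
console.log(rSquared([1, 2, 3], [2, 2, 2])); // always predicting the mean
```

A model that always predicts the mean scores 0, and a model that predicts worse than the mean on the data it is scored against can score below 0.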

The standard R-Squared does not take into consideration the number of parameters used in the model.  This leads to one flaw, namely you can increase the R-Squared simply by adding random variables to your model.  The adjusted R-square corrects for this.


Adj. R-Squared = 1 - (RSS/(n-p-1)) / (TSS/(n-1))

RSS = residual sum of squares
TSS = total sum of squares, the sum of squared deviations of the dependent variable from its mean
n = number of observations
p = number of parameters, not counting the intercept
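The effect of the adjustment is easiest to see with numbers. The sketch below assumes RSS and TSS have already been computed; the function name and the figures are hypothetical.

```javascript
// Adjusted R-Squared = 1 - (RSS / (n - p - 1)) / (TSS / (n - 1)),
// where p counts the parameters other than the intercept.
function adjustedRSquared(rss, tss, n, p) {
  return 1 - (rss / (n - p - 1)) / (tss / (n - 1));
}

// Same fit (RSS = 10, TSS = 100, n = 12) with one and then two parameters.
console.log(adjustedRSquared(10, 100, 12, 1));
console.log(adjustedRSquared(10, 100, 12, 2));
```

With RSS unchanged, the second value is lower: an added variable that does not reduce the residuals is penalized rather than rewarded.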


Akaike's Information Criterion (AIC)

In the early 1970s Akaike introduced Akaike's Information Criterion (AIC). AIC was an attempt to improve upon previous measures by penalizing the number of free parameters more heavily than adjusted R-Squared does.  Its goal is to build the best possible model with the fewest parameters. This should reduce the likelihood of overfitting compared with R-Squared.


AIC = -2*log(L(theta hat)) + 2*p
        L(theta hat) is the maximized value of the likelihood function
        p is the number of parameters

For a least squares model with normally distributed errors this becomes:

AIC = n*(ln((2*pi*RSS)/n) + 1) + 2*p

      RSS is the residual sum of squares
      p is the number of parameters
      n is the number of observations


Schwarz's Information Criterion (BIC or SIC)

The SIC penalizes additional variables more heavily than AIC; otherwise it behaves the same.

It is always better to penalize additional variables, right?  That sounds good; however, the result is not always superior to R-Squared.  For example, if you are building a model with potentially noisy data, reducing the number of parameters may make the model more unstable out of sample.  By reducing the number of parameters (independent variables), each individual variable’s contribution to the model increases.  If those variables have stability issues out of sample, the model may be more likely to “explode”.  By having a greater number of parameters you reduce the chance that any one variable’s anomaly will yield wild results, because the other variables can compensate. In essence you are spreading the risk of a random measurement error across multiple variables. This example aside, the AIC is a sound measure of model performance.


SIC = -2*log(L(theta hat)) + p*ln(n)
      L(theta hat) is the maximized value of the likelihood function


SIC = n*(ln((2*pi*RSS)/n) + 1) + p*ln(n)
         RSS is the residual sum of squares
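The RSS forms of the two criteria differ only in the penalty term, which the sketch below makes explicit. The function names are assumptions, and the formulas assume a least squares model with normally distributed errors.

```javascript
// Shared fit term for a least squares model with normal errors:
// -2 * log-likelihood = n * (ln(2 * pi * RSS / n) + 1).
function fitTerm(rss, n) {
  return n * (Math.log((2 * Math.PI * rss) / n) + 1);
}

// AIC adds a penalty of 2 per parameter.
function aic(rss, n, p) {
  return fitTerm(rss, n) + 2 * p;
}

// SIC adds a penalty of ln(n) per parameter.
function sic(rss, n, p) {
  return fitTerm(rss, n) + p * Math.log(n);
}

// Hypothetical model: RSS = 50, 100 observations, 3 parameters.
console.log(aic(50, 100, 3), sic(50, 100, 3));
```

Because ln(n) exceeds 2 once n is 8 or more, the SIC penalty is the heavier of the two for all but the smallest samples. For both criteria, lower values are better.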


Information based complexity Inverse-Fisher Information Matrix Criterion  (ICOMP(IFIM))

Bozdogan came up with ICOMP(IFIM) as an alternative to the AIC based approaches.
ICOMP(IFIM) balances model fit against model complexity as measured by the inverse-Fisher information matrix.  This is superior to AIC based approaches because it defines complexity based on the covariance matrix of the independent variables as opposed to just the count of independent variables.  For example, suppose you have one model with five independent variables that are not correlated with one another versus a model with four highly correlated variables. Now suppose both have the same MSE.  The first model intuitively should be superior to the second; however, with AIC based approaches the second model would look superior.  ICOMP(IFIM) should compensate for this.




Model Validation

1. Intro



When managing a project, model validation should take fifty percent of development time. The last thing you want is for model validation to be a meaningless rubber stamp. The first run of every model should be expected to fail the validation stage. To save time I recommend having several models finished before going into the last stage and including the person in charge of validation in all discussions of the model.

Carl Sagan in The Demon-Haunted World said about long-term forecasts:

Each field of science has its own complement of pseudo-science. Physicists have perpetual motion machines, an army of amateur relativity disprovers and perhaps cold fusion. Chemists still have alchemy. Economists have long-range forecasting.

In the validation stage we strive to make Carl Sagan's statement false.


2. Issues



a) Over fitting

Over-fitting (Bias-Variance Tradeoff) is when you fit the model too closely to the sample data.  Why is fitting your model closely to the data bad?

1) Remember you are estimating a stochastic process.  We assume there is some unexplained error. You do not want to explain random error or model measurement error that only exists in the sample you are observing. 

2) The more complex the model, the more likely you will be wrong.  One common mistake is to include multiple non-linear transformations of a variable to fit a complex relationship.  Every added variable is an added assumption about the true underlying model.  Every added assumption is another assumption that could be wrong even if it worked in sample.

In forecasting you want a general model, not one specific to the test data.  Think of tables with uneven surfaces, where the large unevenness is systemic and the minor unevenness is particular to each table. If you tried to model the surface with clay, the best fitting general model would come from applying minor pressure when pressing the clay to a tabletop.  The worst model would come from applying great pressure to the clay, thereby creating one perfect model for a particular table that fits no other table.  Remember the goal.



b) Sub Sample Stability

Does the model produce stable results for a majority of the sub segments of the population?  For example, how well does it predict for people over 65?



c) Predictive Power

Questions to answer:

1. Does it treat subgroups fairly? 

2. Does it exhibit adverse selection for any subgroup?

3. Does it provide sufficient separation? 

4. How stable is its performance across samples?

5. And, not to be forgotten, is it profitable?   

3. Methods






a) Out of Sample Forecasting 

The easiest way to test the performance of a forecasting model is to take data the model has not seen and see how well it performs. This is the best and simplest way to test a model's robustness. The data set aside for this is often called the holdout sample.
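A minimal sketch of setting aside a holdout sample follows, assuming a random split is appropriate for your data. The function name, the `rng` parameter and the 30 percent fraction are illustrative assumptions.

```javascript
// Randomly split rows into a training set and a holdout set.
// `rng` is any function returning a uniform number in [0, 1).
function holdoutSplit(rows, holdoutFraction, rng) {
  var shuffled = rows.slice(); // copy so the caller's array is untouched
  // Fisher-Yates shuffle.
  for (var i = shuffled.length - 1; i > 0; i--) {
    var j = Math.floor(rng() * (i + 1));
    var tmp = shuffled[i];
    shuffled[i] = shuffled[j];
    shuffled[j] = tmp;
  }
  var holdoutSize = Math.round(rows.length * holdoutFraction);
  return {
    holdout: shuffled.slice(0, holdoutSize),
    training: shuffled.slice(holdoutSize)
  };
}

// Hold out 30 percent of ten hypothetical records.
var split = holdoutSplit([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 0.3, Math.random);
console.log(split.training.length, split.holdout.length);
```

For time series data a random split leaks future information into training, so holding out the most recent period is usually the safer choice.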






b) Cross-Validation

When you do not have enough data for a test, validation and holdout sample, cross-validation is an alternative. There are many types of cross-validation methods.  A simple version would be as follows:


1. Partition your sample into multiple sub groups.

2. Train your model using all but one partition.

3. Validate your model on the remaining partition.

4. Repeat till all partitions have been used as a validation set.
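The four steps above can be sketched as a short function. The names are assumptions, the rows are assumed to be in random order already, and `fitMean` and `scoreMse` are toy stand-ins for a real model and scoring rule.

```javascript
// k-fold cross-validation. `fit(trainingRows)` returns a model and
// `score(model, validationRows)` returns a number such as MSE.
// Rows are assigned to partitions by index, so they should already be in random order.
function kFoldCrossValidate(rows, k, fit, score) {
  var scores = [];
  for (var fold = 0; fold < k; fold++) {
    var training = [], validation = [];
    for (var i = 0; i < rows.length; i++) {
      (i % k === fold ? validation : training).push(rows[i]);
    }
    var model = fit(training);              // train on all but one partition
    scores.push(score(model, validation));  // validate on the remaining partition
  }
  return scores;
}

// Toy model: always predict the training mean.
function fitMean(values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) { total += values[i]; }
  var mean = total / values.length;
  return function () { return mean; };
}

// Toy score: MSE of the model on the validation values.
function scoreMse(model, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += Math.pow(values[i] - model(), 2);
  }
  return total / values.length;
}

console.log(kFoldCrossValidate([1, 2, 3, 4, 5, 6], 3, fitMean, scoreMse));
```

The spread of the per-partition scores is as informative as their average: a model whose score swings widely between partitions is unlikely to be stable in production.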


I have not had great luck with cross-validation. When I have used it in production, the models have proved to be less stable. One thing I have noticed is that papers using cross-validation favor techniques that overfit. Since you typically use data that was sampled at one time, most of the within-sample variance should be low. If the sample was pulled at different points in time from different sources, cross-validation should be more effective.


4. Model Diagnostic Tools

Below is a brief overview of some common diagnostic tools.  In the statistics sections these tools are reviewed more closely.




Type of dependent variable

QQ plot

Stability of predictions at tail


Residual plots

Fit, non-linearities, structural shifts








Partial Regression

Parameter validity, structural shifts



Parameter validity








Economic Significance 

Parameter significance


Receiver Operating Characteristic Curve (ROC)



Lorenz Curve



Profit curve

Model's relevance


Confusion Matrix



Odds Ratios

Parameter relevance 



Non-linear relationships


Influence of Variables on MSE

Like a partial regression for learning methods


Out of sample forecasting

Stability, fit


Heuristic Rules

1. Intro

Any rule or set of rules that reduces the uncertainty of an event is a heuristic rule or expert system.  Heuristic rules can be used in a stand-alone system or coded as a variable into a model.  There are cases where heuristic rules are the best choice when building a model.

When heuristic rules are the best choice:

1. The process is too non-linear for other forms of modeling.

2. There is good common sense knowledge of processes.

3. The market does not trust statistical models.

Do not discount heuristic rules when building models.  They can be just as powerful in forecasting as any math based model. Also, by using a hybrid system with both modeling and heuristic rules you can improve your forecast dramatically. Heuristic rules excel when there is a plethora of complex or nonlinear relationships, exactly where statistical models can fall apart.

2. Building a Heuristic model

a. Experts

Talk to people!  You may know statistics but that does not mean you know everything about the problem you are trying to solve. Find an expert in the field you are analyzing and listen.  Here is an example: in one of my prior jobs, a non-technical person researching prison populations noted that incarceration rates always increase when a new prison is built, mainly due to the tremendous political pressure to fill new prisons.  This simple heuristic rule ends up explaining a great deal of the variation in incarceration rates.

b. Data Analysis

Look at your data!  The biggest mistake a modeler can make is to ignore his or her data.  Simple rules with enormous predictive power may emerge from just looking at plots or frequencies.  Once I noticed a system was shutting down periodically once a day. A simple frequency showed a relationship between output and whether it was noon.  Obviously the system was shutting down for lunch.  This is common sense, but we had not thought of it till we looked at the data.  Knowing this greatly improved the efficiency of the model.

c. Data Mining

Data mining is ideal for discovering heuristic rules.  Trees, discussed in the data mining sections, are well designed to discover both complex and simple rules.