Choosing a Model

1. What Technique to Use?

A statistician has many tools available, from artificial intelligence to heuristic rules to traditional techniques. Often we use the same technique repeatedly without seriously considering the alternatives. This has a cost, both in the quality of the work you provide and in your professional growth. As computers and programming tools become more powerful, AI is a field that cannot be ignored. But new techniques have costs of their own: you need to build in-house knowledge and figure out how to make them operational, and they often require more maintenance than traditional techniques. A good statistician does a cost/benefit analysis to match the right technique(s) to the problem at hand. Above all, do not become a Johnny One-Note.

2. Questions to Consider

First question: what are you trying to model?

  • Linear: Is the dependent variable continuous, like price?
  • Probability: Are you trying to model the probability of an event occurring?
  • Classification: Are you grouping outcomes into fixed categories, such as low and high risk of default?
  • Rules: Do you need a set of rules, for example to route an order through a factory or to flag a claim for further investigation?

Next question: what are the requirements for the project?

  •  What are the client's requirements for the project?
  •  What have you learned from your data analysis?
  •  What is the timeline?  Do you have enough time to learn a new technology?
  •  Do you have the in-house knowledge to maintain the product?
  •  How will it be deployed?
  •  Are there potential competitors who will be researching more advanced techniques?
  •  How stable does the end product need to be?  Will it get yearly updates, monthly updates, or is it build and forget?
  •  What is the software expense required for each technique?
  •  Is there a marketing bias toward one technique? If you are in a highly competitive market, ignoring non-traditional techniques is the surest way to get pushed out of that market.
  •  How will you validate the model?
  •  Are non-linear relationships an issue with your data?

3. Brief Overview of Techniques

Below is a grid of various techniques with their strengths and weaknesses.

Method | Strengths | Weaknesses
Linear Regression | Stable, simple | Tends toward the mean. Requires strong assumptions about the underlying relationship.
Logistic Regression | Stable, simple | Weak separation. Requires strong assumptions about the underlying relationship.
Vector Autoregression (VAR) | Stable | Tends toward the mean. Requires very strong assumptions about the underlying relationship.
GLM | Stable, simple | Tends toward the mean. Requires strong assumptions about the underlying relationship.
MARS | Can model complex relationships | Can be unstable if not properly built. More complex.
Tree Ensembles (TreeNet, Random Forests) | Can model complex relationships | Can be unstable if not properly built. More complex.
ANN | Can model complex relationships | Can be unstable if not properly built. More complex.
SVM | Can model complex relationships | Can be unstable if not properly built. More complex.
GA | Mostly used to find variables or parameters for a model | Unstable.
Bagging | Stable | Tends toward the mean. Complex.
Heuristic Rules | Stable, can be simple | Complexity grows substantially as you try to increase discrimination. Limited in power.

Further discussion of these techniques will be in the next few sections.

4. Conclusion

Choosing the perfect technique for the process you are trying to model is an impossible task. I recommend not even trying. Instead, develop two or three models simultaneously, then hold a horse race to pick the best one. You can even combine the models, using ensemble techniques such as bagging, to get the best features of each. No one model will outperform all others in all categories. In my personal experience, ANN and Boosted Trees (TreeNet) perform better than traditional techniques.
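
As a rough illustration of the horse race described above, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for illustration: the synthetic data, the toy candidates (a mean baseline, a one-variable least-squares line, a nearest-neighbour average standing in for a more flexible technique, and an equal-weight blend of the last two), and the use of 5-fold cross-validated RMSE to judge the race. The blend is plain averaging, not bagging in the strict sense. Real candidates such as ANN or TreeNet would slot in wherever a toy fit function appears.

```python
import random
import statistics


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mu = statistics.fmean(ys)
    return lambda x: mu


def fit_linear(xs, ys):
    """One-variable ordinary least squares (assumes xs are not all equal)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_nearest(xs, ys, k=5):
    """Average of the k nearest training points: a simple non-linear candidate."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return statistics.fmean(y for _, y in nearest)

    return predict


def fit_blend(xs, ys):
    """Equal-weight average of the linear and nearest-neighbour candidates."""
    members = [fit_linear(xs, ys), fit_nearest(xs, ys)]
    return lambda x: statistics.fmean(m(x) for m in members)


def cross_val_rmse(fit, xs, ys, folds=5):
    """Out-of-sample RMSE: each fold is held out once and scored."""
    n = len(xs)
    squared_error = 0.0
    for f in range(folds):
        test_idx = set(range(f, n, folds))  # every folds-th row, offset f
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        model = fit(train_x, train_y)
        squared_error += sum((model(xs[i]) - ys[i]) ** 2 for i in test_idx)
    return (squared_error / n) ** 0.5


def horse_race(candidates, xs, ys, folds=5):
    """Score every candidate the same way; the lowest RMSE wins."""
    scores = {name: cross_val_rmse(fit, xs, ys, folds)
              for name, fit in candidates.items()}
    winner = min(scores, key=scores.get)
    return winner, scores


# Synthetic data (an assumption for illustration): a curved relationship plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [x ** 2 + rng.gauss(0, 2) for x in xs]

candidates = {
    "mean": fit_mean,
    "linear": fit_linear,
    "nearest": fit_nearest,
    "blend": fit_blend,
}
winner, scores = horse_race(candidates, xs, ys)
for name, rmse in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:8s} RMSE = {rmse:.2f}")
print("winner:", winner)
```

The point of the design is that every candidate is scored on rows it never saw during fitting, with the same folds and the same metric, so the comparison is fair. On curved data like this, the flexible candidate tends to come out ahead of the straight line. On a different data set the order could easily change, which is why the race is worth running.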