1. What Technique to Use?
A statistician has many tools available, from artificial intelligence to heuristic rules to traditional techniques. Often we use the same technique repeatedly without seriously considering the alternatives. This has a cost, both in the quality of the work you provide and in your professional growth. As computers and software become more powerful, AI is a field that cannot be ignored. But new techniques have costs of their own: you need to build in-house knowledge and figure out how to make them operational, and they are often higher maintenance than traditional techniques. A good statistician does a cost/benefit analysis to match the right technique(s) to the problem at hand. Above all, do not become a Johnny One-Note.
2. Questions to Consider
First question: what are you trying to model?
- Linear: Is the dependent variable continuous, like price?
- Probability: Are you trying to model the probability of an event occurring?
- Classification: Are you grouping outcomes into fixed categories, such as low and high risk of default?
- Rules: Do you need a set of rules, such as for routing an order through a factory or flagging a claim for further investigation?
Next question: what are the requirements for the project?
- What are the client's requirements for the project?
- What have you learned from your data analysis?
- What is the timeline? Do you have enough time to learn a new technology?
- Do you have the in-house knowledge to maintain the product?
- How will it be deployed?
- Are there potential competitors who will be researching more advanced techniques?
- How stable does the end product need to be? Yearly updates, monthly updates or build and forget?
- What is the software expense required for each technique?
- Is there a marketing bias towards one technique? In a highly competitive market, ignoring non-traditional techniques is the surest way of getting pushed out of that market.
- How will you validate the model?
- Are non-linear relationships an issue in your data?
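The last question in the list can be given a rough first answer before any serious modelling. The sketch below is one possible check, not a standard named test: fit a straight line, then look at the average residual in the low, middle and high ranges of the predictor. The function names (fit_line, binned_residuals) and the toy data are made up for illustration. If the bin averages sit near zero, a straight line may be adequate; a systematic pattern such as positive, negative, positive suggests curvature that a linear technique will miss.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def binned_residuals(xs, ys, bins=3):
    """Mean residual of a straight-line fit within equal-count bins of x."""
    a, b = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))          # order observations by x
    size = len(pairs) // bins
    out = []
    for i in range(bins):
        # The last bin absorbs any leftover observations.
        end = (i + 1) * size if i < bins - 1 else len(pairs)
        chunk = pairs[i * size:end]
        out.append(mean([y - (a + b * x) for x, y in chunk]))
    return out

# Toy data (an assumption for illustration only): one straight, one curved.
xs = [i / 10 for i in range(30)]
straight = [2 + 3 * x for x in xs]
curved = [2 + 3 * x * x for x in xs]

print("straight:", binned_residuals(xs, straight))
print("curved:  ", binned_residuals(xs, curved))
```

With several predictors the same idea applies one variable at a time, although it will not reveal interactions between variables.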
3. Brief Overview of Techniques
Below is a grid of various techniques with their strengths and weaknesses.
Method | Strengths | Weaknesses |
Linear Regression | Stable, Simple | Tends towards mean. Requires strong assumption about underlying relationship. |
Logistic | Stable, Simple | Weak separations. Requires strong assumption about underlying relationship. |
Vector Auto Regressions | Stable | Tends towards mean. Requires very strong assumption about underlying relationship. |
GLM | Stable, Simple | Tends towards mean. Requires strong assumption about underlying relationship. |
MARS | Can model complex relationships. | If not properly built can be unstable, more complex. |
Tree Ensembles (TreeNet boosted trees, Random Forests) | Can model complex relationships. | If not properly built can be unstable, more complex. |
ANN | Can model complex relationships. | If not properly built can be unstable, more complex. |
SVM | Can model complex relationships. | If not properly built can be unstable, more complex. |
GA | Mostly used to find variables or parameters for a model | Unstable |
Bagging | Stable | Tends towards mean, complex |
Heuristic Rules | Stable, can be simple | The complexity grows substantially as you try to increase discrimination. Limited in power. |
These techniques are discussed further in the next few sections.
4. Conclusion
Choosing the perfect technique for the process you are trying to model is an impossible task. I recommend not even trying. Instead, develop two or three models simultaneously, then hold a horse race to pick the best one. You can even combine the models using ensemble techniques such as bagging to get the best features of each. No one model will outperform all others in all categories. In my personal experience, ANN and Boosted Trees (TreeNet) perform better than traditional techniques.
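A horse race can be as simple as scoring every candidate on the same held-out folds. The sketch below is a minimal illustration under stated assumptions, not a recommended toolkit: the data are synthetic, the contenders (a mean baseline, a straight line, a nearest-neighbour average) are small stand-ins for the techniques in the grid rather than ANN or TreeNet, and all function names are hypothetical. The "blend" entrant is a plain average of two models' predictions, which is a simple way of combining models and not bagging in the strict bootstrap sense.

```python
import random
from statistics import mean

def fit_mean(train):
    """Baseline: always predict the training mean of y."""
    m = mean([y for _, y in train])
    return lambda x: m

def fit_line(train):
    """Straight line y = a + b*x by ordinary least squares."""
    mx = mean([x for x, _ in train])
    my = mean([y for _, y in train])
    b = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(train, k=5):
    """Average of the k training points nearest to x."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return mean([y for _, y in nearest])
    return predict

def fit_blend(train):
    """Plain average of the line and nearest-neighbour predictions."""
    line, knn = fit_line(train), fit_knn(train)
    return lambda x: (line(x) + knn(x)) / 2

def cv_rmse(fit, data, folds=5):
    """Root mean squared error over held-out folds."""
    sq_errors = []
    for i in range(folds):
        test = data[i::folds]  # every folds-th point, starting at i
        train = [p for j, p in enumerate(data) if j % folds != i]
        model = fit(train)
        sq_errors.extend((model(x) - y) ** 2 for x, y in test)
    return mean(sq_errors) ** 0.5

# Synthetic, deliberately non-linear data (an assumption for illustration).
random.seed(0)
xs = [random.uniform(-3, 3) for _ in range(200)]
data = [(x, x * x + random.gauss(0, 0.5)) for x in xs]

contenders = {"mean": fit_mean, "line": fit_line,
              "knn": fit_knn, "blend": fit_blend}
scores = {name: cv_rmse(fit, data) for name, fit in contenders.items()}
winner = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:6s} cross-validated RMSE = {score:.3f}")
print("winner:", winner)
```

Because every contender is judged on identical folds, the comparison is fair, and the blend competes as just another entrant. On real projects the scoring metric should follow the project requirements (for example, a classification measure for a default model), and stability and maintenance cost belong in the decision alongside the score.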