Bagging and Boosting

Even data mining has a democracy of sorts, although the preferred method of voting does not always guarantee one agent one vote; some agents are more equal than others.  Two such methods are bagging and boosting.  They are ensemble methods, designed to aggregate multiple models in order to produce more robust and accurate predictions or classifications.  They are called ensemble methods because, at the end of the process, a set of classifiers behaves as one.

The two methods are very different in their approach.  In bagging every vote is equal, while in boosting the voting weight is determined by the strength of the predictor.  They also differ in how the data is employed: bagging builds each model on a sub-sample, while boosting does not.  Boosting is generally considered more accurate; however, it can also lead to overfitting on some datasets.  The trade-off between the two methods is accuracy (boosting) versus stability (bagging).  This makes intuitive sense: if you can correctly assess a predictor's strength, giving it a heavier weight will yield superior results; however, if the voting weights themselves are not stable, then the model will deteriorate more quickly over time.  In general, the idea is that a weakness due to overfitting in one model will be overwhelmed by the other models, while important relationships that are drowned out in the full sample can still contribute to the final model.  Below is a quick overview of the bagging and boosting methods.

Bagging:  First, create n sub-samples from your training dataset.  Next, build a separate model on each sub-sample.  Finally, to make a prediction, run all n models and average (or vote on) the results.  Typically bagging employs the same data mining technique, such as trees or artificial neural networks, for each model; however, more traditional techniques such as logistic regression, or even a combination of techniques, can be used.  Using a variety of techniques may allow very complex relationships to be modeled and defend your model against an unseen weakness one technique may have for your particular problem.
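To make the recipe concrete, here is a minimal sketch of bagging in Python.  It assumes scikit-learn decision trees as the base technique and a synthetic toy dataset standing in for a real training set; the sub-samples are drawn as bootstrap samples (rows sampled with replacement) and the final prediction is a simple majority vote.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data stands in for your training dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_models = 25                       # number of sub-samples / models
rng = np.random.RandomState(0)
models = []

# Steps 1 and 2: create n bootstrap sub-samples and fit one model on each.
for _ in range(n_models):
    idx = rng.randint(0, len(X_train), size=len(X_train))
    models.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Step 3: run all n models and average the results (a majority vote here,
# since the predictions are 0/1 class labels).
votes = np.mean([m.predict(X_test) for m in models], axis=0)
y_pred = (votes >= 0.5).astype(int)
print("bagged accuracy:", np.mean(y_pred == y_test))

Note that every model gets an equal say in the vote, which is exactly the stability-over-accuracy trade mentioned above.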

Boosting:  Instead of dividing the dataset into sub-samples, boosting repeatedly uses the same sample for all the models in an iterative process.  The process is iterative because the error rate of the first model is fed into the second model, and so forth.  The error rate is used to focus the next model on the harder-to-classify sub-groups while spending less effort on groups already correctly classified by a previous model.  In essence, the data is re-weighted at each iteration based on the misclassification rate from the previous iterations.  Boosting requires that the same data mining technique be used for each model.  The most widely used boosting technique is AdaBoost M1.
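As an illustration, below is a minimal sketch of two-class AdaBoost in Python (the standard discrete formulation, equivalent up to constants to AdaBoost M1), again assuming scikit-learn decision stumps as the weak learner and a synthetic toy dataset.  The lines to watch are the weighted error rate, the voting weight alpha derived from it, and the re-weighting step that pushes the next model toward the previously mis-classified points.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = np.where(y == 1, 1, -1)          # two-class AdaBoost uses labels in {-1, +1}
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_rounds = 25
w = np.full(len(X_train), 1.0 / len(X_train))   # start with equal weights
models, alphas = [], []

for _ in range(n_rounds):
    # Fit a weak learner (a one-split decision stump) on the weighted sample.
    stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train, sample_weight=w)
    pred = stump.predict(X_train)

    # Weighted error rate of this round's model.
    err = np.sum(w * (pred != y_train)) / np.sum(w)
    if err >= 0.5:                   # no better than chance: stop adding models
        break
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # this model's voting weight

    # Re-weight: mis-classified points get more weight in the next round.
    w *= np.exp(-alpha * y_train * pred)
    w /= w.sum()

    models.append(stump)
    alphas.append(alpha)

# Final prediction: a weighted vote, stronger models counting for more.
scores = sum(a * m.predict(X_test) for a, m in zip(alphas, models))
y_pred = np.sign(scores)
print("boosted accuracy:", np.mean(y_pred == y_test))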

The most widely used data mining technique that employs boosting is boosted trees (also known as TreeNet from Salford Systems).
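TreeNet itself is a commercial product, but the same idea is available in open-source libraries.  As an assumption for illustration, here is a short sketch using scikit-learn's GradientBoostingClassifier, in which each round fits a small tree to the mistakes of the ensemble built so far.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 rounds of small trees, each correcting the errors of the previous rounds.
model = GradientBoostingClassifier(n_estimators=100, max_depth=3).fit(X_train, y_train)
print("boosted-tree accuracy:", model.score(X_test, y_test))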

Further Reading

www.nada.kth.se/kurser/kth/2D1431/2004/ML_lecture7.pdf

http://www.icaen.uiowa.edu/~comp/Public/Bagging.pdf

http://gnomo.fe.up.pt/~nnig/papers/boo_bag.pdf

www.d.umn.edu/~rmaclin/publications/opitz-jair99.pdf

preprints.stat.ucla.edu/366/ensemble.pdf