Trees

1. Intro

Trees are among the most widely used data mining tools today. They are good at discovering non-linear relationships hidden within data. For example, if you are trying to uncover the effect of age on savings using traditional techniques, you have to code dummy variables to account for non-linearities in the effect of age on saving behavior. A tree will quickly and automatically uncover facts such as younger and older people behaving differently from middle-aged people with regard to their savings rate.
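The age and savings example can be sketched in a few lines of Python. This is a minimal illustration, not any particular tree algorithm: the data values are invented, and the names `sse`, `best_split`, and `find_thresholds` are hypothetical helpers. It searches for age cut points that minimize the squared error of the savings rate on each side, which is one common way a regression-style tree chooses splits.

```python
# Hypothetical data: (age, savings rate). The numbers are invented for illustration.
data = [
    (22, 0.03), (25, 0.04), (28, 0.05),
    (35, 0.14), (42, 0.15), (48, 0.16), (55, 0.15),
    (66, 0.06), (70, 0.05), (75, 0.04),
]

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows):
    """Return (threshold, error) for the age cut that minimizes total SSE,
    or None when the rows contain fewer than two distinct ages."""
    ages = sorted({age for age, _ in rows})
    best = None
    for lo, hi in zip(ages, ages[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring ages
        left = [s for a, s in rows if a <= threshold]
        right = [s for a, s in rows if a > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[1]:
            best = (threshold, error)
    return best

def find_thresholds(rows, depth=0, max_depth=2, min_rows=4):
    """Recursively split the rows and return the sorted age thresholds.
    max_depth and min_rows are simple, assumed stopping rules."""
    if depth >= max_depth or len(rows) < min_rows:
        return []
    split = best_split(rows)
    if split is None:
        return []
    threshold = split[0]
    left = [(a, s) for a, s in rows if a <= threshold]
    right = [(a, s) for a, s in rows if a > threshold]
    return (find_thresholds(left, depth + 1, max_depth, min_rows)
            + [threshold]
            + find_thresholds(right, depth + 1, max_depth, min_rows))

thresholds = find_thresholds(data)
print(thresholds)
```

With the invented data above, the two thresholds separate the young, middle-aged, and older groups without any dummy variables being coded by hand.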

2. Overview

The above example is shown in the chart below.

This information space can be partitioned using a tree algorithm.  The result is shown below.

To make the relationships clearer you can represent the above chart in a tree diagram as the image below shows.


Binary Trees:

Each node has two branches.

        1
       / \
      /   \
     2     3
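As a rough sketch of how such a structure might look in code, the snippet below models a binary node with exactly two branches and walks from the root to a leaf. The class name `BinaryNode`, the function `classify`, the thresholds, and the leaf labels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class BinaryNode:
    """An internal node: one test on a numeric input, exactly two branches."""
    threshold: float
    left: Union["BinaryNode", str]   # branch taken when value <= threshold
    right: Union["BinaryNode", str]  # branch taken when value > threshold

def classify(node, value):
    """Follow branches from the root until a leaf label (a string) is reached."""
    while isinstance(node, BinaryNode):
        node = node.left if value <= node.threshold else node.right
    return node

# Hypothetical tree: thresholds and labels are invented for illustration.
tree = BinaryNode(30.0, "low saver", BinaryNode(60.0, "high saver", "low saver"))
print(classify(tree, 45))
```

Note that separating three groups takes two binary nodes here, because each node can only ask one yes/no question.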

Multiway Trees:

Each node has two or more branches. Any multiway tree can be represented as a binary tree, although the resulting tree is more complex.

        1
      / | \
     /  |  \
    2   3   4
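The claim that a multiway tree can be rewritten as a binary tree can be illustrated with a small sketch. The functions `multiway_to_binary` and `follow_binary` are hypothetical, and the sketch assumes every input value is one of the known categories (anything else falls through to the last branch). A three-way split on categories A, B, and C becomes a chain of two yes/no tests.

```python
def multiway_to_binary(branches):
    """Turn a {category: outcome} split into nested
    (category, if_equal, otherwise) tuples."""
    items = list(branches.items())
    # The last branch needs no test: it is what remains when all others fail.
    node = items[-1][1]
    for category, outcome in reversed(items[:-1]):
        node = (category, outcome, node)
    return node

def follow_binary(node, value):
    """Answer one yes/no question per level until an outcome is reached."""
    while isinstance(node, tuple):
        category, if_equal, otherwise = node
        node = if_equal if value == category else otherwise
    return node

# Node 1 from the diagram above, with branches leading to nodes 2, 3, and 4.
# The category names "A", "B", "C" are invented for illustration.
multiway = {"A": 2, "B": 3, "C": 4}
binary = multiway_to_binary(multiway)
print(binary)
```

The single three-way node turns into two nested binary nodes, which is the extra complexity the text refers to: the same decisions are made, but the tree is deeper.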


3. Common Tree Algorithms

a) THAID

THAID is a binary tree designed to infer relationships with a nominal response (dependent) variable. It uses statistical significance tests to prune the tree.


b) CHAID

CHAID is a multiway tree used to infer relationships with categorical (for example, nominal) response variables. It uses statistical significance tests to prune the tree. It belongs to the same family (AID) as THAID trees.
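To give a feel for the kind of significance test involved, the sketch below computes a Pearson chi-squared statistic for a candidate split. It is only an illustration under stated assumptions: the counts are invented, the function name `chi_squared` is hypothetical, and a full CHAID procedure also merges similar categories and adjusts p-values for multiple comparisons, which is not shown here.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a
    list of rows. Assumes every row and column has a non-zero total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are the candidate child nodes,
# columns are the classes of the response variable.
table = [[30, 10],
         [10, 30]]

stat = chi_squared(table)

# Critical value of the chi-squared distribution with 1 degree of freedom
# at the 5% level; a 2x2 table has (2-1)*(2-1) = 1 degree of freedom.
CRITICAL_5PCT_DF1 = 3.841
keep_split = stat > CRITICAL_5PCT_DF1
print(stat, keep_split)
```

A split whose statistic does not exceed the critical value would be treated as not significant, so the tree would not branch there.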


c) CART

CART is a binary tree that supports continuous, nominal, and ordinal response variables. It is geared toward forecasting and uses cross-validation and pruning to control the size of the tree.
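One way to picture how pruning controls tree size is a cost-complexity trade-off: each candidate subtree is scored by its error plus a penalty per leaf, and the lowest score wins. The sketch below is a simplified illustration, not the full CART procedure. The subtree figures are invented, the name `prune` is hypothetical, and the penalty `alpha` is set by hand here, whereas in practice it would be chosen by cross-validation.

```python
# Hypothetical sequence of nested subtrees: (number of leaves, error rate).
# Larger trees fit better, but with diminishing returns.
subtrees = [(1, 0.40), (2, 0.25), (3, 0.15), (5, 0.12), (8, 0.11)]

def prune(candidates, alpha):
    """Pick the subtree that minimizes error + alpha * leaves."""
    return min(candidates, key=lambda t: t[1] + alpha * t[0])

# alpha is an assumed, hand-picked penalty per leaf.
for alpha in (0.0, 0.02, 0.2):
    leaves, error = prune(subtrees, alpha)
    print(f"alpha={alpha}: keep the subtree with {leaves} leaves (error {error})")
```

With no penalty the largest tree is kept; as the penalty grows, smaller trees win, which is how the size of the tree is held in check.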


d) Boosted Trees/Random Forests

These methods are discussed in the forecasting section.