Sympathy for the learner: Abuse #1

Abuse #1: Throwing data at the learner

As data mining becomes more popular in our risky times, the profession is invariably becoming sloppy. I see this in research papers, in interactions with consultants, and in vendor presentations. It is not technical knowledge that I find lacking but sympathy for the learner. Many in the data mining field, for lack of a better word, abuse their learners. For those of you who are not data miners, let me give a brief overview of what I mean by a learner. Suppose you have a collection of data and a problem (or concept) that you hope can be better understood via that data. The learner is whatever method or tool you use to learn (estimate) the concept you are trying to describe. The learner can be a linear regression, a neural network, a boosted tree, or even a human.

One way we abuse our learners is the growing tendency to throw data at the learner with little consideration for how that data is presented, in the hope that amidst the cloud of information the concept will magically become clear. Remember, a boosted tree knows nothing more than what is in the data. A boosted tree was never given an education or even the ability to read a book. Most learners have no common-sense knowledge and even forget what they learned in the previous model. Because of this, any common-sense knowledge about how the data works can provide a tremendous amount of information to the learner, sometimes even exceeding the initial information content of the data alone.

Example: Say you are trying to model the optimal coverage for an automobile insurance policy. In the data, you have the number of drivers and the number of vehicles. Common sense tells you it is important whether there is a disparity between drivers and vehicles: an extra vehicle can go unused, and an extra driver can’t drive. How can a learner ‘see’ this pattern? If it is a tree, it creates numerous splits (if 1 driver and 2 vehicles do this, if 2 drivers and 1 vehicle do that, …). Essentially the learner is forced to construct a proxy for the fact of whether there are more vehicles than drivers. There are several problems with this: there is no guarantee the proxy will be constructed correctly, it makes the model needlessly complex, and it crowds out other patterns from being included in the tree. A better solution is to introduce a flag indicating whether there are more vehicles than drivers. Although this is a mere one-bit field, behind it lies the complex reasoning about why the disparity between drivers and vehicles matters, so it contains far more information than one bit. A simple one-bit field like this can make or break a model.
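As a quick sketch of what this looks like in practice, the R snippet below builds the one-bit flag described above; the data frame and column names are invented for illustration.

# A minimal sketch of the one-bit flag; data frame and column names are invented.
policies <- data.frame(num_drivers = c(1, 2, 2), num_vehicles = c(2, 2, 1))
policies$more_vehicles_than_drivers <-
  as.integer(policies$num_vehicles > policies$num_drivers)
policies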

The presentation of the data to the learner is just as important as the data itself. What can seem obvious (more vehicles than drivers, international versus domestic transactions) can be pivotal in uncovering complex concepts. As a data miner, put yourself in the learner’s shoes and you will find yourself giving more sympathy to the learner.

Trees

1. Intro

Trees are the most common data mining tool used today. They are powerful at discovering non-linear relationships hidden within data. For example, if you are trying to uncover the effect of age on savings using traditional techniques, you have to code dummy variables to account for non-linearities in the effect of age on saving behavior. Trees will quickly and automatically uncover facts like younger and older people behaving differently from middle-aged people with regard to their savings rate.
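As a toy illustration, the R sketch below uses the rpart package on invented data with a U-shaped age effect; the tree recovers the break points without any hand-coded dummy variables.

# Toy sketch: a tree recovering a non-linear age effect; the data-generating
# process is invented for illustration.
library(rpart)
set.seed(1)
age     <- sample(18:80, 500, replace = TRUE)
savings <- ifelse(age < 30 | age > 65, 0.05, 0.15) + rnorm(500, sd = 0.02)
fit <- rpart(savings ~ age, data = data.frame(age, savings))
fit   # splits appear near ages 30 and 65 without manual dummy coding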

2. Overview

The above example is shown in the chart below.

This information space can be partitioned using a tree algorithm.  The result is shown below.

To make the relationships clearer you can represent the above chart in a tree diagram as the image below shows.

 

Binary Trees:

Each node has two branches.

      1
     / \
    2   3

Multiway Trees:

Each node has two or more branches. Any multiway tree can be represented as a binary tree, although the output is more complex.

      1
    / | \
   2  3  4

 

 

3. Common Tree Algorithms

a) THAID

THAID is a binary tree algorithm designed to infer relationships with a nominal response variable. It uses statistical significance tests to prune the tree.

 

b) CHAID

CHAID is a multiway tree algorithm used to infer relationships with nominal (categorical) response variables. It uses statistical significance tests to prune the tree. It is of the same family (AID) as THAID trees.

 

c) CART

CART is a binary tree algorithm that supports continuous, nominal, and ordinal response variables. It is geared toward forecasting and uses cross-validation and pruning to control the size of the tree.
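As a quick sketch, the R code below uses the rpart package (an open-source implementation of CART-style trees) on its bundled kyphosis data; the cross-validation table then guides a pruning step. The settings are illustrative, not a recommendation.

# A minimal CART-style fit with cross-validation-guided pruning.
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
printcp(fit)                                  # cross-validated error by tree size
pruned <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])
pruned                                        # the pruned tree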

 

d) Boosted Trees/Random Forests

These methods are discussed in the forecasting section.

Bootstrapping

Bootstrapping is one of the most useful statistical techniques. It is also one that is often misunderstood, overused, and avoided. In general, the bootstrap is a process that uses simulation to make inferences about the probability distribution of a sample population. Operationally, you take repeated samples with replacement from a given dataset to build a new dataset. This process is repeated many times, until enough replicates exist for statistically valid conclusions to be drawn. It was first introduced by Efron (1979). The jackknife, a similar but less popular technique, pre-dates bootstrapping; it systematically leaves observations out rather than re-sampling with replacement as the bootstrap does.

The power of bootstrapping is its ability to make statistical inferences about a population using only a small, potentially biased sub-sample. That is why the term comes from the phrase “to pull oneself up by one’s bootstraps”, a seemingly impossible task. And it does all of this without the restrictive assumption of normality. Bootstrapping is also valid with small sample sizes (as small as twenty).

There are two main types of bootstrapping: parametric and non-parametric. Non-parametric bootstrapping does not assume a distribution for the population but instead lets the data define the distribution. Parametric bootstrapping assumes the population follows a known, parameterized distribution such as the log-normal.

Below are example uses of bootstrapping:

* If faced with a biased sample or an infrequent event, you can employ bootstrapping to re-sample cases. This is useful, for example, when estimating a logistic regression with a rare event.

 

* By re-sampling residuals with the bootstrap technique you can make inferences about the properties of the confidence intervals (CIs) and other goodness-of-fit statistics. This is useful when the sample size is small or the assumption of normality is too restrictive.

* To make inferences about a population you can bootstrap the sampling distribution. This offers a powerful alternative to standard methods when the assumption of normality is too restrictive.

* Bootstrapping can also be useful for detecting outliers.
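As a minimal sketch of the non-parametric case, the R code below uses the boot package to build a percentile confidence interval for a mean; the dataset and statistic are illustrative choices.

# Non-parametric bootstrap of mean Fertility in the built-in swiss data.
library(boot)
data(swiss)
mean_fert <- function(d, idx) mean(d$Fertility[idx])   # statistic computed on each resample
b <- boot(swiss, statistic = mean_fert, R = 2000)      # 2000 bootstrap replicates
boot.ci(b, type = "perc")                              # percentile CI, no normality assumption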

Further Reading

www.uvm.edu/~dhowell/StatPages/Resampling/Bootstrapping.html

www.uvm.edu/~dhowell/StatPages/Resampling/Resampling.html

wikipedia.org/wiki/Bootstrapping

Sas.com/kb/24/982.html

Multivariate Adaptive Regression Splines

Multivariate Adaptive Regression Splines (MARS) is a data mining technique developed by Jerry Friedman (who also co-developed CART). Like other data mining techniques, MARS is designed to model complex non-linear relationships, both via interactions and via non-linear transformations. It also corrects for issues such as irrelevant regressors and missing values. One touted benefit of MARS over other data mining techniques is the model’s readability. Since MARS is built from standard pieces (spline functions and regression models) the results are more familiar.

MARS works by fitting multiple splines to the independent variables. A spline in this context is a function that allows the relationship to change at specific points (capturing non-linearities). For example, imagine demand for cars in a four-person household. You could imagine that the household’s demand for an additional car is more price sensitive (elastic) if the household already has four cars than if it has fewer than four. In this situation the relationship between price and household demand changes once the household has four cars. This point is called a knot, and a spline function can model it. MARS finds the knots and interactions by using a brute-force search procedure to minimize a lack-of-fit criterion. In this manner missing values and non-predictive variables are also addressed. The key to MARS is its efficient algorithm for searching this very large space (all possible interactions and knots).

This type of complex relationship can also be modeled by artificial neural networks, support vector machines, and regression trees. Where MARS shines is its understandability. In the end you get a model with knots, interactions, and coefficients. Many find this more digestible than weights on hidden neurons in hidden layers. MARS has a commercial version available from Salford Systems and an open source version in R.
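As a brief sketch, the R code below uses the earth package (one open-source MARS implementation) on the built-in trees data; the formula and degree setting are illustrative.

# A small MARS-style fit; degree = 2 allows interactions between hinge terms.
library(earth)
fit <- earth(Volume ~ Girth + Height, data = trees, degree = 2)
summary(fit)   # lists the hinge (knot) terms and their coefficients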

Bagging and Boosting

Even data mining has a democracy of sorts, even if the preferred method of voting does not guarantee one agent one vote; some agents are more equal than others. Two such methods are boosting and bagging. They are ensemble methods designed to aggregate models in order to come up with more robust and accurate predictions or classifications. They are referred to as ensemble methods because at the end of the process a set of classifiers behaves as one.

The two methods are very different in their approach. In bagging each vote is equal, while in boosting the voting weight is determined by the strength of the predictor. They also differ in how the data is employed: bagging uses bootstrap sub-samples while boosting does not. Boosting is considered more accurate, but it can also lead to over-fitting on some datasets. The trade-off between the two methods is accuracy (boosting) versus stability (bagging). This makes intuitive sense: if you can correctly assess each predictor’s strength, giving it a heavier weight will yield superior results; however, if the voting weights themselves are not stable, the model will deteriorate more quickly over time. In general, the idea is that a weakness due to over-fitting in one model will be overwhelmed by the other models, while important relationships that are swamped in the full sample can still contribute to the final model. Below is a quick overview of bagging and boosting methods.

Bagging:  First create n bootstrap sub-samples from your training dataset. Next, build a separate model on each sub-sample. Finally, to make a prediction, run all n models and average (or vote on) the results. Typically bagging employs the same data mining technique, such as trees or artificial neural networks, for each model; however, more traditional techniques such as logistic regressions, or even a combination of techniques, can be used. Using a variety of techniques may allow very complex relationships to be modeled and defend your model against an unseen weakness one technique may have for your particular problem.
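A bare-bones sketch of this procedure in R (base R plus rpart, with an illustrative number of models and the kyphosis data that ships with rpart) is shown below; each tree gets an equal vote.

# Bagging by hand: bootstrap samples, one tree per sample, majority vote.
library(rpart)
set.seed(2)
n_models <- 25
fits <- vector("list", n_models)
for (i in 1:n_models) {
  idx <- sample(nrow(kyphosis), replace = TRUE)            # bootstrap sub-sample
  fits[[i]] <- rpart(Kyphosis ~ Age + Number + Start,
                     data = kyphosis[idx, ], method = "class")
}
votes <- sapply(fits, function(f) as.character(predict(f, kyphosis, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))   # equal-weight vote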

Boosting:  Instead of dividing the dataset into sub-samples, boosting repeatedly uses the same sample for all the models in an iterative process. The process is iterative because the error rate of the first model is fed into the second model, and so forth. The error rate is used to focus the next model on the harder-to-classify sub-groups while spending less time on groups already correctly classified by a previous model. In essence, the data is re-weighted each iteration based on the misclassification rate from the previous iterations. Boosting requires that the same data mining technique be used for each model. The most widely used boosting technique is AdaBoost M1.

The most used data mining technique that employs boosting is boosted trees (also known as TreeNet from Salford Systems).
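To make the re-weighting idea concrete, here is a minimal AdaBoost.M1-style sketch in R using rpart stumps as the weak learner; the dataset, number of rounds, and stump depth are illustrative choices rather than a production implementation.

# AdaBoost.M1-style re-weighting by hand with tree stumps.
library(rpart)
set.seed(3)
d <- kyphosis
w <- rep(1 / nrow(d), nrow(d))            # start with equal case weights
M <- 10                                   # boosting rounds
alpha <- numeric(M); models <- vector("list", M)
for (m in 1:M) {
  fit  <- rpart(Kyphosis ~ Age + Number + Start, data = d, weights = w,
                method = "class", control = rpart.control(maxdepth = 1))
  pred <- predict(fit, d, type = "class")
  miss <- pred != d$Kyphosis
  err  <- min(max(sum(w * miss) / sum(w), 1e-10), 0.499)   # weighted error, bounded
  alpha[m] <- log((1 - err) / err)        # voting weight for this round's model
  w <- w * exp(alpha[m] * miss)           # up-weight the misclassified cases
  w <- w / sum(w)
  models[[m]] <- fit
}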

Further Reading

www.nada.kth.se/kurser/kth/2D1431/2004/ML_lecture7.pdf

http://www.icaen.uiowa.edu/~comp/Public/Bagging.pdf

http://gnomo.fe.up.pt/~nnig/papers/boo_bag.pdf

www.d.umn.edu/~rmaclin/publications/opitz-jair99.pdf

preprints.stat.ucla.edu/366/ensemble.pdf

Artificial Intelligence

Inference Theory of Learning (ITL) models are concerned with how knowledge is interpreted, the validity of the knowledge obtained, the use of prior knowledge, what knowledge can be derived at a given point given prior knowledge, and how learning goals and their structure influence the learning process. This theory assumes learning is a process in which agents are guided by a set of goals and use past experience and knowledge to help reach those goals. Some alternative theories are Computational Learning Theory (also known as Statistical Learning Theory, a simpler theory not concerned with multi-strategic learning goals) and Multi-strategy Task-adaptive Learning (these models focus on the cognitive powers of the agents). ITL algorithms are at the core of many data mining techniques. However, you must remember that we do not know for certain this is how humans learn, so be careful when drawing parallels between a model’s behavior and human behavior. For data miners, the goal is not to model humans but to come up with powerful tools for analyzing patterns in data, and the last thing you want to invite when presenting results is a debate about how humans think; that is the third rail of any data mining discussion. While a data miner does not need to be an expert in theories of learning, it is very useful to be at least aware of them when reading papers and evaluating new techniques.

2. Learning Goal

The learning goals are the key to any ITL process. Without a set of goals (which may contain only one goal), agents (or learners) would just be compiling information with no selection or inference from that information. How the agent uses the knowledge space, which is the totality of its beliefs, to benefit itself is defined by the set of learning goals. A learning goal can be defined by questions such as:

a) What part of prior knowledge is relevant?

b) What knowledge to acquire and of what form?

c) How to evaluate the knowledge?

d) When to stop learning?

3. Learning Methodology

New knowledge can enter in three forms:

a) Derived knowledge (deductive transformation): knowledge generated by deduction from past knowledge inputs.

b) Intrinsic (intrinsically new) knowledge: knowledge from an external source, such as a teacher, or from analysis of past knowledge.

c) Pragmatically new knowledge (managed knowledge): knowledge that cannot be obtained within the space and time defined by the knowledge space.

4. ITL algorithms

Examples of ITL algorithms.

Neural Networks

Genetic Algorithms

Some Monte Carlo techniques with a feedback loop.

Support Vector Machines (Statistical Learning Theory)

MARS

Genetic Algorithms

1. Intro

Genetic Algorithms (GA) have been around for a while but have not seen widespread adoption in data mining circles as a primary modeling technique. They are widely used in academic research to simulate evolutionary environments and human behavior. In economics, many papers have shown GA are effective in simulating how learning affects the economy. In these papers GA are used to model how people learn and adapt to changes in an economic system.

In data mining, GA are mainly used in combination with other techniques, such as ANN, or as a voting mechanism. Here they can be quite powerful in choosing the best parameters or model. As a primary modeling technique, such as for predicting income, they have been less successful. If you can remember genetics from your biology class, GA will seem familiar to you. Genetic algorithms were developed following the Inference Theory of Learning (ITL).

2. Key Features of a GA

1. Heterogeneous beliefs across agents.

2. The information requirements for agents are minimal.

3. Natural model for experimentation by agents with alternative rules.

4. Heterogeneity allows parallel processing of information.

3. Overview

The basic principles of heredity are at the core of GA. In a GA the genes represent knowledge or a belief, and the evolutionary process picks the best belief(s). The fitness of a belief is determined by its forecast error. A select percentage of beliefs are eliminated based on their fitness. Key to GA is the fact that knowledge, good or bad, is quickly shared throughout the knowledge space. The agents’ beliefs are typically represented by bit strings. Below is an example of a simple belief bit string, X = (X1, X2, … , Xn).

Position 4 indicates 1000

Position 3 indicates 100

Position 2 indicates 10

Position 1 indicates 1

So the bit string 0010 indicates a belief of 10.

4. Learning Methodology

a) Cross Over

Crossover is when two agents share beliefs (genes); it can be viewed as simulated mating. First, two newborn agents are randomly chosen to mate. With some probability the agents will perform a crossover of their beliefs. Crossover is achieved by dividing each agent’s beliefs at a random point and then swapping the pieces. This creates two new bit strings that represent new agents with new beliefs.

X = (X1, X2, … , Xn)  ->  X' = (X1, … , Xi, Yi+1, … , Yn)

Y = (Y1, Y2, … , Yn)  ->  Y' = (Y1, … , Yi, Xi+1, … , Xn)

Example

Before Mating:

Agent 1’s  belief: 0010  = 10

Agent 2’s belief: 0001  = 1

Mating:

Exchange segments 2 and 3.

After Mating:

Agent 3’s belief: 0000  = 0

Agent 4’s belief: 0011  = 11

b)  Mutations

To achieve an even more random new belief you can take the crossed-over belief and mutate it. Mutation involves randomly changing a bit in the string to a new value, in this case from zero to one or one to zero.

e.g.  X’ = (X1, … , Xi-1 , Z ,  Xi+1 , … ,Xn) where Z is a mutation.

Agent 3’s belief: 0000  = 0 -> 1000  = 1000

c) Inversion

Another approach to creating new information is to invert existing strings. Inversion takes a randomly chosen segment of an agent’s bit string and reverses it. This method, like mutation, does not require the newborns to interact with one another while still allowing new information to be generated from a newborn’s belief.

e.g.  X' = (X1, … , Xi-1, Xj, Xj-1, … , Xi, Xj+1, … , Xn), where the segment from position i to position j has been reversed.

5. Selection

After the crossover, mutation, and inversion processes are complete, each newborn has a new belief. Once the new information has entered the system, each newborn holds two beliefs: one received from the smarter parent and one it has created itself. The newborn then applies its learning goal: the belief that would have yielded the highest utility in the previous period becomes the newborn’s forecast rule. Now one cycle of the learning process is complete. This cycle is repeated until a stopping criterion (such as a minimum forecast error) is met or the maximum number of cycles is reached.
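The base-R sketch below pulls these pieces together: selection by fitness, crossover at a random cut point, and random mutation of bit strings. The fitness function, population size, and rates are invented for illustration.

# A minimal bit-string GA: selection, crossover, and mutation on a toy problem.
set.seed(42)
n_bits <- 8; pop_size <- 20; n_gen <- 50
p_cross <- 0.8; p_mut <- 0.02
decode  <- function(bits) sum(bits * 2^(seq_along(bits) - 1))   # bits -> integer
fitness <- function(bits) -abs(decode(bits) - 200)              # reward beliefs near 200
pop <- matrix(sample(0:1, pop_size * n_bits, replace = TRUE), nrow = pop_size)
for (gen in 1:n_gen) {
  fit  <- apply(pop, 1, fitness)
  keep <- order(fit, decreasing = TRUE)[1:(pop_size / 2)]       # drop the weaker half
  parents  <- pop[keep, , drop = FALSE]
  children <- parents[sample(nrow(parents), pop_size / 2, replace = TRUE), , drop = FALSE]
  for (i in seq(1, nrow(children) - 1, by = 2)) {
    if (runif(1) < p_cross) {                                   # crossover: swap tails
      cut <- sample(1:(n_bits - 1), 1)
      tail1 <- children[i, (cut + 1):n_bits]
      children[i, (cut + 1):n_bits] <- children[i + 1, (cut + 1):n_bits]
      children[i + 1, (cut + 1):n_bits] <- tail1
    }
  }
  flips <- matrix(runif(length(children)) < p_mut, nrow = nrow(children))
  children[flips] <- 1 - children[flips]                        # mutation: flip bits
  pop <- rbind(parents, children)
}
decode(pop[which.max(apply(pop, 1, fitness)), ])                # best belief found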

Relational Data Mining

1. Intro

Much of the data and many of the processes we are trying to model are relational in nature. The data tables often relate in a one-to-many fashion. An example of a one-to-many relationship is that one person can own multiple books. This is troublesome for most statistical and data mining techniques: they require a flat file where one row contains all the information needed to process that row. With relational data this is not true. Relational data mining holds the promise of improved pattern discovery in relational data.

2. East-West Train

Ryszard Michalski helped bring the issue of relational data mining to the attention of data miners in 1980 with his East-West train challenge. In this challenge he gave ten trains, each pulling a diverse set of cars. The challenge was to design an algorithm that would predict which trains are traveling east by the type of car(s) they are pulling. This is a relational problem because each train pulls many cars, each of which has many attributes. Using traditional pattern discovery methodology this problem would take an enormous amount of computational time, as every permutation of the data would need a corresponding flat file. Many new techniques grew out of the challenge.

Figure 1. Michalski's Original Ten Trains

Solution

One rule was simple: if the train is pulling a small, enclosed car, it is traveling east.

3. ILP

Inductive Logic Programming (ILP) is one way of answering Michalski's challenge. ILP combines logic with machine learning algorithms in the hope of greatly reducing the number of searches required to intelligently explore the information space.

a) FOIL

First Order Inductive Learner (FOIL) is a popular ILP system.

b) LINUS

c)  Progol

Progol is comparable in performance to FOIL.

Artificial Neural Networks

1. Intro
Artificial Neural Networks (ANN) are an attempt to simulate how the mind works. They were developed using the connectionist approach to computer learning. They gained popularity among forecasters due to their ability to model non-linearities and interactions across variables. ANN are used for a variety of purposes, from forecasting the stock market to pattern recognition to compression algorithms. Critics of ANN often decry them as a black box. In reality, the workings of an ANN model are viewable, but their foundation is not in statistics but in artificial intelligence, which has a more limited audience. ANNs have a tendency to over-fit, so use of a hold-out sample and sound forecasting practices is required.

2. Overview
ANNs simulate how neurons operate in the brain by using a network of artificial neurons organized in layers, with one input layer and one output layer. Artificial neurons (nodes) are simple processing elements that typically have one response value and one to many input values. A neuron is trained to minimize the prediction error by modifying how it responds to its data elements; a data element may be the response of other neuron(s) from another layer. The simplest ANN has two layers, an input and an output layer. The two layers are connected via weights, which are adjusted to minimize the forecast error. A model with one input layer and no hidden layer is very similar to a simple regression model. The hidden layers are all the layers between the input and output layers. The name is misleading: they are not truly hidden, and their weights can be viewed.

Figure: A Multi-Layer Perceptron with one hidden layer.

3. Classification
There are many ways to classify neural networks, and no consensus among researchers as to the best method. Below, I briefly cover three types of classification for ANN. These classifications do overlap.

a) Learning features

1) Supervised: these ANN fit a model to predict outputs from inputs. This is analogous to a regression model where the researcher chooses a dependent variable. The output of the model is a fitted or predicted value.

2) Unsupervised: the ANN has no desired output from the model. They can be viewed as a form of data reduction, like cluster analysis, that finds patterns among variables across observations.

b) Layer Structure

1) Single-layer Perceptron

Single-layer networks are an early attempt at ANN, with no hidden layer. The inputs are fed directly to the output using a series of weights.

Input       Output
  O --\
        >--   O
  O --/

2) Multi-layer Perceptron

A Multi-layer Perceptron consists of at least three layers: input, hidden, and output. This allows the system to model interactions across variables as well as non-linearities.

Input       Hidden Layer       Output
  O --\       O --\
        >--         >--          O
  O --/       O --/

c) Network Structures

1) Feed Forward 

In feed-forward models the data flows in one direction, from the input to the output.

2) Feed Back

Feedback models allow the output of the model to influence itself, feeding back into the system. This is a powerful means of dealing with issues such as serial correlation.

3) Kohonen Self-Organizing Network

This is an unsupervised learning algorithm.

4. Common ANN Training Functions

a) Error Backward Propagation

The most widely used training algorithm is backward propagation (BP). Backward propagation works by repeatedly looping through a learning cycle (in which the neuron weights are recalculated) and readjusting the neuron weights and importance. In each iteration you calculate a scaling factor (the adjustment to the weights needed to better match the desired output) and assign an error to each neuron (the error is also called the blame and is used to adjust the neuron’s importance). It is called backward propagation because information obtained by looking at the output node, the final node, is propagated backward through the structure.

Common activation functions include the signum function, the step function, the sigmoid function, and the hyperbolic tangent function.
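To make one learning cycle concrete, here is a one-step sketch in R of backward propagation for a tiny 2-input, 1-hidden-neuron, 1-output network with sigmoid activations; the weights, input values, and learning rate are all invented for illustration.

# One backward-propagation update for a toy 2-1-1 sigmoid network.
sigmoid <- function(z) 1 / (1 + exp(-z))
x <- c(0.5, 0.8); y <- 1               # one training case (invented values)
W1 <- matrix(c(0.1, -0.2), nrow = 1)   # input -> hidden weights
W2 <- 0.3                              # hidden -> output weight
lr <- 0.5                              # learning rate

h <- sigmoid(W1 %*% x)                 # forward pass: hidden activation
o <- sigmoid(W2 * h)                   # forward pass: output activation

delta_o <- (o - y) * o * (1 - o)       # blame assigned to the output neuron
delta_h <- delta_o * W2 * h * (1 - h)  # blame propagated back to the hidden neuron

W2 <- W2 - lr * as.numeric(delta_o * h)      # adjust hidden -> output weight
W1 <- W1 - lr * as.numeric(delta_h) * t(x)   # adjust input -> hidden weights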

b) Radial Basis Function

Radial Basis Function (RBF) networks are used primarily for image recognition. They are similar to BP but with more restrictive assumptions on learning, resulting in faster computation. They can have only one hidden layer.

c) Probabilistic

Probabilistic neural networks (PNN) are also similar to BP but make only one pass through the data and therefore compute much faster. They are also mainly used for images, and in cases where a rapid response is necessary, such as robotics.

d) Recurrent Neural Network

Recurrent Neural Networks (RNN) are a type of feedback ANN. They are useful where serial correlation exists or the data is noisy.

e) Self-Organizing Feature Maps

SOFM are a type of Kohonen Self-Organizing Network ANN.

5. Building a model

a. Preparing the data
Many neural network packages require you to do the following:

1. Input variables must be bounded between [0, 1] if using a sigmoid function.
2. Binary inputs are transformed to [0.1, 0.9]. This is because 0 and 1 are at the extremes of the activation function and convergence may not be possible.
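A small sketch of these two steps in R (with invented input vectors) follows.

# Rescale a continuous input to [0, 1] and keep a binary input off the extremes.
x <- c(5, 20, 12, 8)                           # a continuous input (invented)
x_scaled <- (x - min(x)) / (max(x) - min(x))   # bounded to [0, 1] for a sigmoid
b <- c(0, 1, 1, 0)                             # a binary input (invented)
b_scaled <- ifelse(b == 1, 0.9, 0.1)           # mapped to [0.1, 0.9]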

b. Settings
1. Number of hidden layers

The hidden layers sit between the input and output layers. Typically, for modeling linear processes, one hidden layer is sufficient. Image recognition typically requires many hidden layers. Too many hidden layers can result in over-fitting. In practice, unless the problem warrants it (as with data compression), stick with one hidden layer.

2. Number of neurons for each hidden layer

One critical choice is the number of hidden layers and the number of neurons in each layer. A triangle formation is a good place to start: the first hidden layer has the same number of neurons as there are input nodes, the next hidden layer has half as many neurons, and so forth.

6. Example Code (R)

data(swiss)
library(nnet)

# Fit a single-hidden-layer network predicting Fertility from the other variables;
# the size and decay settings here are illustrative starting points.
results.Model <- nnet(Fertility ~ ., data = swiss, size = 3,
                      linout = TRUE, decay = 0.01, maxit = 500)
summary(results.Model)

Support Vector Machines

1. Intro

Support Vector Machines (SVM), like artificial neural networks, come from the computational theory of learning. They have gained popularity recently as an alternative to ANN, partly because statisticians better understand the theory behind them and partly because the computational time can be much shorter than with an ANN. They do not model interactions as effectively as ANN.

2. Overview

Support Vector Machines work by using non-linear transformations of the data elements to minimize the error. The input space is transformed into a feature space using a kernel method. Kernel functions are at the heart of integral transformations, which are used to map functions to new domains that are easier to solve. They are called Support Vector Machines because support vectors are used to separate the transformed variables. Below are two illustrations of how an SVM’s transformation of the input space results in better-fitting models:



By using a non-linear transformation of the input space, the support vectors can find an optimal solution where before there was none or there were multiple solutions. As with ANN, a supervised SVM is used to forecast and an unsupervised SVM is used for classification.

3. SVM Types

There are four main types of SVM. The central differences between them are the training function used and whether it is a classification or regression model. As with any technique, it is advisable to try more than one type to see how the fit changes.

a) Regression models

1) Epsilon-SVR

With epsilon-SVR the cost is zero when the prediction is close to (within epsilon of) the actual value. This corrects for noise.

2) NU-SVR
A non-linear loss function is used with NU-SVR. This can lead to over-fitting and long computation times.

b) Classification model

1) C-SVM

The complexity penalty (the cost factor C) is constant in C-SVM.

2) NU-SVC
NU-SVC is the classification counterpart of NU-SVR; the nu parameter takes the place of the cost parameter C.

c) Distribution estimation

One-class SVMs are used to estimate distributions.

4. Kernel Methods

The choice of kernel method can have a dramatic effect on the outcome of the model. There are a variety of different kernel methods; below are the ones most commonly used for SVM.

a) Linear

Linear kernel methods only do a linear transformation of the input space.

b) Radial Basis Function (RBF)

The most commonly used kernel type; the RBF provides a non-linear transformation of the input space.

c) Polynomial

The polynomial kernel also provides a non-linear transformation of the input space.  Its values can go to infinity or zero in some circumstances.

d) Sigmoid

The sigmoid kernel is a special case of the RBF kernel and provides a non-linear transformation of the input space.  Because it is a special case it is less flexible than the RBF type.
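As a short sketch, the R code below uses the svm() function from the e1071 package (an interface to libsvm) to fit the same classification model with two different kernels; the dataset and settings are illustrative.

# Fit an SVM classifier with an RBF kernel and with a linear kernel, then compare.
library(e1071)
data(iris)
fit_rbf <- svm(Species ~ ., data = iris, type = "C-classification", kernel = "radial")
fit_lin <- svm(Species ~ ., data = iris, type = "C-classification", kernel = "linear")
mean(predict(fit_rbf, iris) == iris$Species)   # in-sample accuracy, RBF kernel
mean(predict(fit_lin, iris) == iris$Species)   # in-sample accuracy, linear kernel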