Sympathy for the learner: Abuse #1

Abuse #1: Throwing data at the learner

As data mining becomes more popular in our risky times, the profession is invariably becoming sloppy. I see this in research papers, interactions with consultants, and vendor presentations. It is not technical knowledge that I see lacking but sympathy for the learner. Many in the data mining field, for lack of a better word, abuse their learners. For those of you who are not data miners, let me give a brief overview of what I mean by a learner. Suppose you have a collection of data and a problem (or concept) that you hope can be better understood through that data. The learner is whatever method or tool you use to learn (estimate) the concept you are trying to describe. The learner can be a linear regression, a neural network, a boosted tree, or even a human.

One way we abuse our learners is the growing tendency to throw data at the learner with little consideration for how the data is presented, in the hope that amidst the cloud of information the concept will magically become clear. Remember, a boosted tree knows nothing more than what is in the data. A boosted tree was not given an education or even the ability to read a book. Most learners have no common-sense knowledge and do not even remember what they learned in the previous model. Because of this, any common-sense knowledge about how the data works can provide a tremendous amount of information to the learner, sometimes even exceeding the information content of the data alone.

Example: Say you are trying to model the optimal coverage for an automobile insurance policy. In the data, you have the number of drivers and the number of vehicles. Common sense tells you it is important whether there is a disparity between drivers and vehicles: an extra vehicle can go unused, and an extra driver has nothing to drive. How can a learner 'see' this pattern? If it is a tree, it creates numerous splits (if 1 driver and 2 vehicles do this, if 2 drivers and 1 vehicle do this, …). Essentially the learner is forced to construct a proxy for the fact of whether there are more vehicles than drivers. There are several problems with this: there is no guarantee the proxy will be constructed correctly, it makes the model needlessly complex, and it crowds out other patterns from being included in the tree. A better solution is to introduce a flag indicating more vehicles than drivers. Although this is a mere one-bit field, behind it lies the complex reasoning about why the disparity between drivers and vehicles matters, and therefore it carries far more information than one bit. A simple one-bit field like this can make or break a model.
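
As an illustration, once the domain knowledge is written down, the flag itself is trivial to construct. Below is a minimal sketch in R; the data frame and column names are hypothetical.

# Hypothetical policy table with driver and vehicle counts
policies <- data.frame(drivers = c(1, 2, 2, 3), vehicles = c(2, 2, 1, 3))

# Encode the common-sense knowledge directly instead of forcing the learner to rebuild it
policies$more.vehicles.than.drivers <- as.integer(policies$vehicles > policies$drivers)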

The presentation of the data to the learner is just as important as the data itself. What can be obvious (more vehicles than drivers, international versus domestic transactions) can be pivotal in uncovering complex concepts. As a data miner, put yourself in the learner's shoes and you will find yourself giving more sympathy to the learner.

Statistical/AI Techniques

1. Intro

There is a forest of traditional statistical techniques and new artificial intelligence algorithms for forecasting. Choosing the right one can be difficult.

2. Choosing variables

With the Information Age, forecasters got a mixed blessing. We now have more data than the most optimistic forecaster dreamed of just fifteen years ago. We typically work with datasets consisting of thousands of data elements and millions of records, but what do we do with all this… stuff? Most of the data elements logically have no relation whatsoever to the features we are studying. And worse, many of the variables are hopelessly correlated with one another, and by the law of large numbers many erroneous relationships will emerge from this plethora of information.

a) Research

Again, many problems can be solved by communication or by reading the research. If it is important, someone has done it before.

b) Systematic algorithms

The method I favor is writing systematic algorithms that cycle through all the available data elements, analyze each element's relationship with the target feature using a measure like MSE, and then cherry-pick the best elements for further analysis.
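
A minimal sketch of such a screening loop, with the swiss data set standing in for a real modeling table and Fertility assumed as the target feature:

# Cycle through each candidate element and score its one-variable model by MSE
data(swiss)
candidates <- setdiff(names(swiss), "Fertility")

screen <- sapply(candidates, function(v) {
  fit <- lm(reformulate(v, response = "Fertility"), data = swiss)
  mean(residuals(fit)^2)          # MSE of the one-variable model
})

sort(screen)                      # cherry-pick the elements with the lowest MSE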

b.1) Stepwise Regression

Stepwise regressions reduce the number of variables in a model by adding or removing variables one at a time and calculating the marginal gain from including each variable.
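
R's built-in step() function is one way to do this; note that it adds and drops variables by AIC rather than by raw MSE. The data and starting formula below are illustrative.

# Stepwise selection starting from a full OLS model
data(swiss)
full.model <- lm(Fertility ~ ., data = swiss)
stepped.model <- step(full.model, direction = "both", trace = 0)
summary(stepped.model)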

b.2) Lorenz, ROI, and ROC curves.

Cycle through each potential independent variable and generate curves showing its relationship with the dependent variable.
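
A minimal ROC sketch for a single candidate variable, using the ROCR package (assumed to be installed); the birthwt data from MASS and the variable lwt are purely illustrative.

library(ROCR)
library(MASS)
data(birthwt)

# Score the binary target with a one-variable logistic fit, then plot its ROC curve;
# repeat for each candidate variable and compare the curves.
fit  <- glm(low ~ lwt, family = binomial, data = birthwt)
pred <- prediction(fitted(fit), birthwt$low)
plot(performance(pred, "tpr", "fpr"))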

b.3) Correlation

A simple two-way correlation between each potential independent variable and the dependent variable is another technique for finding promising independent variables.
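
A one-line screen along these lines, again using the swiss data as an illustrative stand-in with Fertility as the target:

# Correlation of each candidate column with the target
data(swiss)
cor(swiss[, -1], swiss$Fertility)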

c) Data mining

There are several data mining techniques I will discuss in the data mining section that are geared to uncover linear and non-linear relationships in the data.

d) Principal Components/Factor Analysis

As mentioned in the previous section, this technique can aid in reducing the number of variables whose relationships need to be estimated, with the hope of not losing too much information. Again, this is an estimation technique and should be treated as such.
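
A minimal principal components sketch on the swiss data (illustrative); the first few components can replace the raw columns as model inputs.

data(swiss)
pca <- prcomp(swiss[, -1], scale. = TRUE)
summary(pca)            # proportion of variance captured by each component
head(pca$x[, 1:2])      # scores on the first two components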

3) Forecasting Techniques

Below is a cursory overview of common forecasting techniques. A more detailed overview is provided in the statistics and data mining sections. All the example code is in R except where noted.

a) Ordinary Least Squares Regression

This is the classic forecasting technique taught in schools.

1) Building a model

a) There are many packages that provide ordinary least squares estimates.

b) Variable selection is important.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Continuous or Categorical Variables

2) Diagnostic Tools

In the statistical section I go over the evaluation of OLS models in detail, but here are some tools for uncovering major issues when building an OLS model (a short sketch using them follows the list):

a) QQ Plot

b) Residual plots

c) Correlations

d) Partial Regressions

e) MSE

f) R-Squared
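
As a quick illustration of these diagnostics, the sketch below fits an OLS model and pulls out several of the measures named above; the swiss data and formula are illustrative.

data(swiss)
fit <- lm(Fertility ~ ., data = swiss)

par(mfrow = c(2, 2))
plot(fit)                       # residual, QQ, scale-location, and leverage plots
summary(fit)$r.squared          # R-squared
mean(residuals(fit)^2)          # MSE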

3) Caveats

OLS requires strong assumptions about the nature of the data and the relationships in order to remain BLUE (the best linear unbiased estimator). BLUE is discussed in detail in the statistical section.

4) Example Code(R)

data(trees)
# Illustrative model (assumed formula): predict timber volume from girth and height
Results.Model1 <- lm(Volume ~ Girth + Height, data = trees)
summary(Results.Model1)

b) Logistic Regressions

Logistic regressions have a very different theoretical foundation from ordinary least squares models. You are trying to estimate a probability, so the dependent variable only takes values of 1 or 0. This violates the assumptions required for BLUE.

1) Building the model

a) Many software packages have Logistic regression models included.

b) Variable selection is important.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  Variables

2) Diagnostic Tools

a) Confusion Matrix (see the sketch after this list)

b) Lift Charts

c) ROC chart

d) Lorenz Curve

e) Cost Sensitivity/ROI
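
A minimal confusion matrix sketch; the logistic model on the MASS birthwt data and the 0.5 cutoff are illustrative assumptions.

library(MASS)
data(birthwt)
fit <- glm(low ~ age + lwt + smoke, family = binomial, data = birthwt)

# Cross-tabulate actual outcomes against predictions at a 0.5 probability cutoff
predicted <- ifelse(fitted(fit) > 0.5, 1, 0)
table(actual = birthwt$low, predicted = predicted)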

3) Caveats

Non-linearities can obscure relationships between variables.

4) Example Code(R)

library(MASS)
# Illustrative fit (assumed): model low birth weight with the birthwt data from MASS
Results.Model1 <- glm(low ~ age + lwt + smoke, family = binomial, data = birthwt)
summary(Results.Model1)

c) Vector Autoregression Models (VAR)

Vector autoregression models require a firm theoretical foundation. They are designed for estimating the relationships among a matrix of codependent, autocorrelated variables. To identify the structure you must make strong assumptions about the structure of the errors, namely how the errors are temporally related.

1) Building a model

a) There are few packages that provide Vector Auto Regression models.

b) Variable selection is critical.

c) Outputs a simple equation.

d) They account for autocorrelation and simultaneity between equations.

e) Can model time series data.

f) Continuous Variables

2) Diagnostic Tools

Same as for an OLS model.

3) Caveats

By changing the order of the variables in the model, you completely change your theory of what the true relationship is. For example, if you order money first, you believe the money supply drives output; if you order output first, you believe output drives money. These are two contradictory models. Because of this strong reliance on a detailed understanding of the true relationship between variables, plus all the assumptions required for an OLS model as well, VARs have fallen out of favor in many forecasting circles.

4) Example Code(R)

# This example estimates a Bayesian VAR model with flat priors.
library(MSBVAR)
data(longley)
longley.ts <- ts(longley, start = 1947)   # the estimator expects time-series data (annual here)

# Flat-prior model
Results.Model1 <- szbvar(longley.ts, p = 1, z = NULL, lambda0 = 1, lambda1 = 1,
                         lambda3 = 1, lambda4 = 1, lambda5 = 0, mu5 = 0, mu6 = 0,
                         nu = 0, qm = 4, prior = 2, posterior.fit = FALSE)

d) MARS

Multivariate Adaptive Regression Splines are designed to better handle nonlinear relationships. They can be seen as a blend of CART and OLS.

1) Building a model

a) There are few packages that have MARS models.

b) Variable selection is similar to OLS, but you do not need to worry as much about nonlinearities.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Continuous or Categorical Variables

2) Diagnostic Tools

Similar to an OLS model.

3) Caveats

The output of a complex model can be difficult to read but is still understandable. MARS models are prone to overfitting.

4) Example Code (R)

library(mda)
data(glass)   # forensic glass data assumed: column 1 = RI, columns 2-9 = chemistry, column 10 = type
# Illustrative MARS fit (assumed layout): predict refractive index from the chemistry columns
Results.Model1 <- mars(x = glass[, 2:9], y = glass[, 1])

e) Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN) are an attempt to simulate how the mind works.

1) Building a model

a) There are many good neural network packages.

b) Variable selection is similar to OLS but many non-linearities can be assumed to be handled by the ANN.

c) The output is not as interpretable as that of the previously mentioned models.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

I have not found ANNs to be the black boxes they are often criticized as being. You can use the same tools as with an OLS or logistic regression. To find the influence of each variable, cycle through the variables, removing each one in turn and re-running the model; the effect of a variable can then be measured via the change in MSE.
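
A minimal sketch of this drop-one-variable check; the nnet fit on the swiss data is illustrative, and a fixed seed is assumed so the refits are comparable.

library(nnet)
data(swiss)
predictors <- setdiff(names(swiss), "Fertility")

drop.one.mse <- sapply(predictors, function(v) {
  keep <- setdiff(predictors, v)
  set.seed(1)
  fit <- nnet(reformulate(keep, response = "Fertility"), data = swiss,
              size = 3, linout = TRUE, trace = FALSE)
  mean((swiss$Fertility - fitted(fit))^2)   # MSE without variable v
})

sort(drop.one.mse, decreasing = TRUE)       # largest MSE = most influential missing variable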

3) Caveats

Overfitting

4) Example Code (R)

library(nnet)
data(swiss)
# Illustrative single-hidden-layer network (assumed formula): model Fertility from the rest
results.Model <- nnet(Fertility ~ ., data = swiss, size = 3, linout = TRUE)

f) Support Vector Machines (SVM )

Support Vector Machines are closer to classical statistical methods but hold the promise of uncovering nonlinear relationships.

1) Building a model

a) There are few good SVM packages both commercial and open source.

b) Variable selection is similar to OLS, but many non-linearities can be assumed to be handled by the SVM.

c) The output is not as interpretable as that of the previously mentioned models.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

I have also found SVMs not to be black boxes. You can use the same tools as with OLS and logistic regression to diagnose them, just as with an ANN.

3) Caveats

Overfitting

4) Example Code (R)

library(kernlab)
data(swiss)

## train a support vector machine (illustrative formula and kernel, assumed)
results.KVSM1 <- ksvm(Fertility ~ ., data = swiss, kernel = "rbfdot")
results.KVSM1

g) Regression Trees

Regression trees briefly became popular as a forecasting technique around the turn of the century. It was hoped that they could better model nonlinearities, but they proved to be prone to overfitting.

1) Building a model

a) There are several good Tree packages both commercial and open source.

b) Automatic variable selection.

c) The output is easy to understand.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

You can use the same tools as with OLS and logistic regression to diagnose the model.

3) Caveats

Overfitting

4) Example Code (R)

library(rpart)
library(maptree)
data(kyphosis)

# Illustrative tree grown with rpart's default settings (assumed formula)
DefaultSettings <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
draw.tree(DefaultSettings)   # maptree's plot of the fitted tree

h) Bagging, Boosting and Voting

Bagging (bootstrap aggregating) is a way to help unstable models become more stable by fitting many models on bootstrap samples of the data and combining their predictions.
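
A minimal hand-rolled bagging sketch: fit many trees on bootstrap samples and average their predictions. The trees data, formula, and 25 replicates are illustrative choices.

library(rpart)
data(trees)
set.seed(1)

bagged.preds <- replicate(25, {
  boot <- trees[sample(nrow(trees), replace = TRUE), ]
  fit  <- rpart(Volume ~ Girth + Height, data = boot)
  predict(fit, newdata = trees)
})

rowMeans(bagged.preds)   # the bagged forecast is the average over the 25 trees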

i) Boosted Trees and Random Forests

Boosted trees apply the boosting methodology to trees. You run many, in some cases hundreds, of small regression trees and then combine all the models using a voting methodology to stabilize the results. The resulting model is very complex but much more stable than any individual tree model would be.

1) Building a model

a) There are several good tree packages both commercial and open source.

b) Automatic variable selection.

c) The output is easy to understand but very large.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

You can use the same tools as with OLS and logistic regression to diagnose the model.

3) Caveats

Overfitting

4) Example Code (R)

library(randomForest)
data(swiss)
set.seed(131)
# Illustrative random forest (assumed formula): regress Fertility on the remaining columns
Results.Model1 <- randomForest(Fertility ~ ., data = swiss, importance = TRUE)

Relational Data Mining

1. Intro

Much of the data and many of the processes we are trying to model are relational in nature. The data tables often relate in a one-to-many fashion; for example, one person can own multiple books. This is troublesome for most statistical and data mining techniques, which require a flat file where one row contains all the information needed to process that row. In relational data this is not true. Relational data mining holds the promise of improved pattern discovery in relational data.
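
A minimal sketch of the usual workaround, flattening a one-to-many relationship (person to books) into a one-row-per-person file; the tables and columns are made up for illustration.

people <- data.frame(person.id = 1:3, age = c(34, 51, 27))
books  <- data.frame(person.id = c(1, 1, 2, 3, 3, 3),
                     pages     = c(200, 350, 120, 90, 400, 310))

# Collapse the many side into summary columns, then join back to the one side
book.counts <- aggregate(pages ~ person.id, data = books, FUN = length)
names(book.counts)[2] <- "n.books"
flat <- merge(people, book.counts, by = "person.id", all.x = TRUE)
flat   # one row per person, with the one-to-many detail summarized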

2. East-West Train

Ryszard Michalski helped bring the issue of relational data mining to the attention of data miners in 1980 with his East-West Train challenge. In this challenge he gave ten trains, each pulling a diverse set of cars. The challenge was to design an algorithm that would predict which trains are traveling east by the type of car(s) they are pulling. This is a relational problem because each train pulls many cars, each of which has many attributes. Using traditional pattern-discovery methodology, this problem would take an enormous amount of computational time, as every permutation of the data would need a corresponding flat file. Many new techniques were created from the challenge.

Figure 1. Michalski's Original Ten Trains

Solution

One rule was simple: if the train is pulling a small, enclosed car, it is traveling east.
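
A toy R encoding of that rule on a hypothetical car table, which also shows the relational wrinkle: the per-car attributes have to be aggregated up to the train level.

cars <- data.frame(train = c(1, 1, 2, 2, 3),
                   size  = c("small", "large", "large", "large", "small"),
                   roof  = c("closed", "open", "closed", "open", "open"))

# A train heads east if any of its cars is both small and enclosed
heads.east <- tapply(cars$size == "small" & cars$roof == "closed", cars$train, any)
heads.east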

3. ILP

Inductive Logic Programming is one way of answering Michalski's challenge. ILP combines logic with machine learning algorithms in the hope of greatly reducing the number of searches required to intelligently explore the information space.

a) FOIL

First-order inductive learner (FOIL) is a popular ILP algorithm.

b) LINUS

c) Progol

Comparable in performance to FOIL.