Data Manipulation

1. Sample Size

Before you go out hacking through the data looking for gems of information, think: how are you going to test validity?  At the outset you should split your data into at least a training and a validation dataset.  I prefer splitting the data into three randomly assigned groups.  The first group is the training set.  This is the data you build your model on.  The next dataset is the validation set.  It is used to test your model's performance out of sample.  The final dataset is the holdout sample.  It is the final test to see if your model works with unseen data and should be used sparingly.  Typically, I go back and forth between the training and validation sets several times per model, and only two or three times between the training set and the holdout set.  I will discuss this topic further in the validation section.
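As an illustration, here is a minimal R sketch of a random three-way split (the 60/20/20 proportions and the data frame name mydata are only placeholders; adjust them to your data):

set.seed(42)                                   # make the split reproducible
n     <- nrow(mydata)                          # mydata is your modeling dataset
group <- sample(c("train", "valid", "hold"), n, replace = TRUE,
                prob = c(0.6, 0.2, 0.2))       # random assignment to the three groups
train   <- mydata[group == "train", ]          # build the model here
valid   <- mydata[group == "valid", ]          # test out-of-sample performance here
holdout <- mydata[group == "hold", ]           # touch sparingly, as a final check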

2. Data Transformations

With thousands of data elements, why do we feel the need to create more?  Many relationships are hidden by nonlinearities and obscure interactions.  To build a successful model you need to get your hands dirty transforming the data to uncover these relationships.  Nonlinear relations are any that cannot be modeled as a weighted sum of independent variables.  This is a broad definition, encompassing everything from models where the output is the product of independent variables, to structural breaks in the relationship, to just about anything else.  Nonlinearities can be the showstopper when trying to build a model, and knowing how to get around them is essential.  Data transformation is an important tool to make the show go on.

a) Nonlinear Transformations

The classic example of a nonlinear relationship in economics is the Cobb-Douglas production function:

Y = f(L,K) = A*L^b*K^c

L: Labour

K: Capital

A: Technology constant

where

b + c = 1: constant returns to scale

b + c < 1: diminishing returns to scale

b + c > 1: increasing returns to scale.

You will not be able to estimate this model using linear techniques.  However, taking the log of both sides yields:  log(Y) = log(A) + b*log(L) + c*log(K)

This you can estimate using linear modeling. As seen, a nonlinear transformation can make an unsolvable problem solvable; however, nonlinear transformations should only be used when there are theoretical reasons for them. Misuse of transformations leads to curve fitting and overfitting, and your model will perform poorly out of sample.
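For illustration, here is a minimal R sketch of fitting the log-transformed Cobb-Douglas model with ordinary least squares; the data are simulated with assumed values A = 2, b = 0.7, c = 0.3 just to show the mechanics:

set.seed(1)
L <- runif(200, 10, 100)                              # labour input
K <- runif(200, 10, 100)                              # capital input
Y <- 2 * L^0.7 * K^0.3 * exp(rnorm(200, sd = 0.1))    # Cobb-Douglas output with noise

fit <- lm(log(Y) ~ log(L) + log(K))                   # linear in the logs
summary(fit)                                          # intercept estimates log(A); slopes estimate b and c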

Another example of a nonlinear relationship is average travel time given distance from the city center. When you are very close to the city center, the average travel time is longer than if you are further out.  This could be because traffic increases exponentially as you approach the city center, reducing your speed:

AvgTime = a*Distance^b

By taking the log of both sides, you can make this relation estimable using linear techniques.

log(AvgTime) = log(a) + b*log(Distance)

 

Example Nonlinear Transformations:

1. Square

2. Square root

3. Log

4. Exp
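In R, these candidate transformations can be added to a dataset in one step; the sketch below assumes a hypothetical data frame mydata with a positive numeric column x:

mydata <- transform(mydata,
                    x_sq   = x^2,       # square
                    x_sqrt = sqrt(x),   # square root (requires x >= 0)
                    x_log  = log(x),    # log (requires x > 0)
                    x_exp  = exp(x))    # exponential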

b) Dummy Variables

Dummy variables are important for modeling structural breaks in the system, variables with noncontinuous relationships, and other nonlinearities. It is more common for variables to have a noncontinuous relationship with one another than a continuous one.  A continuous relationship is like speed to distance: the faster you travel, the further you will travel in any given amount of time (assuming a straight highway, no police, ...).  Let's use the example above of average travel time versus distance from the city center.  Above I assumed the model AvgTime = a*Distance^b.  But that is not the only possibility.  It may be that all the delay is caused by one bridge.  The model then would be

AvgTime = a + b*Distance   where Distance < distance of the bridge from the city center

AvgTime = c + b*Distance   where Distance >= distance of the bridge from the city center

This is best modeled by putting in a dummy variable that captures whether the journey begins before or after the bridge. Dummy variables can also incorporate cumulative effects, such as income vs. education.

Example:

High School:      1 0 0 0 0

Junior College:   1 1 0 0 0

College:          1 1 1 0 0

Graduate School:  1 1 1 1 0

(No High School is the intercept.)
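Here is a minimal R sketch of this cumulative coding, assuming a hypothetical data frame mydata with an Education column holding the five levels (No High School is left to the intercept):

ed <- factor(mydata$Education,
             levels = c("No High School", "High School", "Junior College",
                        "College", "Graduate School"))

# Each dummy switches on at its level and stays on for all higher levels,
# so each coefficient measures the incremental effect of that additional step
mydata$hs   <- as.numeric(as.integer(ed) >= 2)   # at least high school
mydata$jc   <- as.numeric(as.integer(ed) >= 3)   # at least junior college
mydata$coll <- as.numeric(as.integer(ed) >= 4)   # at least college
mydata$grad <- as.numeric(as.integer(ed) >= 5)   # graduate school

# e.g. fit <- lm(Income ~ hs + jc + coll + grad, data = mydata)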

c) Fuzzy Variables

Fuzzy variables promise to better incorporate real-world notions of numbers and nonlinear relationships.  Many variables are thought of, well, in fuzzy terms, like temperature, age and weight.  We seldom think of temperature as a continuous number but rather in terms of hot or mild or cold. This gets more problematic because our definitions overlap: the temperature can be hot and mild in our minds.  And if we define these variables in our heads in a fuzzy manner, we react to them in a fuzzy manner.  Fuzzy logic holds the promise of better modeling these relationships.

Membership Functions: These define which set(s) a particular value might belong to.  For example, 85 degrees may be both hot and warm. Membership functions are developed through surveys, experimentation and logic.

Fuzzy variables avoid key problems that plague dummy variables, namely the sharp cutoff between being included and excluded and the lumping of disparate groups into the same category. For example, say you want to model savings against education and you want to correct for age differences. The effect of age on employment is nonlinear: younger and older people have lower employment rates than the ages in between. To capture this, you may want to include a dummy variable for nearing retirement age, which is 0 if under 65 and 1 if 65 or older. But why 65? Why not 64.5 or 66.5? The cutoff at 65 is arbitrary and weakens the estimated relationship between employment and retirement age. To capture this more complex relationship you can define a membership function that allows a 64-year-old to belong partially to both the retirement group and the non-retirement group.
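A minimal R sketch of such a membership function, assuming (purely for illustration) a linear ramp in which membership in the retirement group rises from 0 at age 60 to 1 at age 70:

retirement_membership <- function(age, lower = 60, upper = 70) {
  # Degree of membership in the "retirement age" group, clamped to [0, 1]
  pmin(pmax((age - lower) / (upper - lower), 0), 1)
}

retirement_membership(c(55, 64, 65, 72))   # returns 0.0 0.4 0.5 1.0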

Below is example SAS code generating fuzzy membership functions for temperature.


data Fuzzy;
    /* Generate membership values for each temperature from 0 to 100 */
    do Temp = 0 to 100;

        /* Cold: full membership below 40, ramps down to 0 at 60 */
        if Temp < 40 then Cold = 1;
        if Temp >= 40 then Cold = -(0.05)*Temp + 3;
        if Temp >= 60 then Cold = 0;

        /* Cool: ramps up from 30 to 55, back down to 0 at 80 */
        if Temp < 30 then Cool = 0;
        if Temp >= 30 then Cool = (0.04)*Temp - 1.2;
        if Temp >= 55 then Cool = -(0.04)*Temp + 3.2;
        if Temp > 80 then Cool = 0;

        /* Warm: ramps up from 60 to 75, back down to 0 at 90 */
        if Temp < 60 then Warm = 0;
        if Temp >= 60 then Warm = (0.06667)*Temp - 4;
        if Temp >= 75 then Warm = -(0.06667)*Temp + 6;
        if Temp >= 90 then Warm = 0;

        /* Hot: 0 below 80, ramps up to full membership at 100 */
        if Temp < 80 then Hot = 0;
        if Temp >= 80 then Hot = (0.05)*Temp - 4;
        if Temp >= 100 then Hot = 1;

        output;
    end;
run;


title 'Temperature';

proc gplot data=Fuzzy;
     /* Overlay the four membership functions on one plot */
     plot Cold*Temp Cool*Temp Warm*Temp Hot*Temp / overlay frame legend;
run;
quit;

d) Splitting the Data

Splitting a dataset is useful in uncovering hidden relationships. Let's use the example above of average travel time versus distance from the city center. It may be that all the delay is caused by one bridge, but the bridge also affects the maximum speed you can travel before reaching it.  The model then would be

AvgTime = a + b*Distance   where Distance < distance of the bridge from the city center

AvgTime = a + c*Distance   where Distance >= distance of the bridge from the city center

This could be modeled by splitting the dataset.
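A minimal R sketch of this approach, assuming a hypothetical data frame trips with columns AvgTime and Distance, and the bridge's distance from the city center stored in the scalar bridge_dist:

before <- subset(trips, Distance <  bridge_dist)      # journeys starting before the bridge
after  <- subset(trips, Distance >= bridge_dist)      # journeys starting at or beyond the bridge

fit_before <- lm(AvgTime ~ Distance, data = before)   # estimates slope b
fit_after  <- lm(AvgTime ~ Distance, data = after)    # estimates slope c

coef(fit_before); coef(fit_after)                     # compare how the bridge changes the relationship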

3. Data Reduction Techniques

a) Principal Components / Factor Analysis

Principal components is a powerful data reduction tool. It estimates the structure between variables, producing factors that represent those structures.  By looking at the factors you can deduce relationships by seeing how different variables relate to one another.  The factors are by definition orthogonal, and some would argue they can be used as independent variables in a regression.  One common mistake is to forget that principal components is an estimation technique and needs to be treated as such.

Example Code (R)


# Factor analysis of the built-in Swiss socioeconomic data
data(swiss)

Results.FA <- factanal(~ Fertility + Agriculture + Examination + Education
                       + Catholic + Infant.Mortality,
                       factors = 2, rotation = "varimax", scores = "none",
                       data = swiss)

# Print the loadings, uniquenesses and the test of the number of factors
Results.FA
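The factanal call above fits a maximum-likelihood factor analysis; for principal components themselves you can use R's prcomp on the same data. A minimal sketch:

Results.PCA <- prcomp(swiss, scale. = TRUE)   # scale. = TRUE standardizes the variables first

summary(Results.PCA)      # proportion of variance explained by each component
Results.PCA$rotation      # loadings: how each variable contributes to each component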

b) Other Data Reduction Techniques

There are a number of unsupervised AI techniques that will be discussed in the AI data mining section.