JSON and JAQL

JSON (JavaScript Object Notation) is growing in popularity as a data format. I find myself using it routinely when interfacing with sites such as FreeBase.com and at work when processing datasets with Hadoop.

One great strength of the JSON format is the rich set of tools available for working with the data. One example is JAQL, a JSON query language similar to SQL that works well with Hadoop. A great overview is found here: http://code.google.com/p/jaql/wiki/JaqlOverview. The strength of JAQL is that it lets users write simple, extendable code to manipulate data that is in a non-proprietary, readable, and commonly used file format.

I have added a JAQL example to the data section and to the Data Management Quick Comparison.

Sympathy for the learner: Abuse #1

Abuse #1: Throwing data at the learner

As data mining becomes more popular in our risky times, the profession is invariably becoming sloppy. I see this in research papers, interactions with consultants, and vendor presentations. It is not technical knowledge that I find lacking but sympathy for the learner. Many in the data mining field, for lack of a better word, abuse their learners. For those of you who are not data miners, let me give a brief overview of what I mean by a learner. Suppose you have a collection of data and a problem (or concept) that you hope can be better understood via that data. The learner is whatever method or tool you use to learn (estimate) the concept you are trying to describe. The learner can be a linear regression, a neural network, a boosted tree, or even a human.

One way we abuse our learners is the growing tendency to throw data at the learner with little consideration for the data's presentation, in the hope that amidst the cloud of information the concept will magically become clear. Remember, a boosted tree knows nothing more than what is in the data. A boosted tree was not provided an education or even given the ability to read a book. Most learners have no common-sense knowledge and even forget what they learned in the previous model. Because of this, any common-sense knowledge about how the data works can provide a tremendous amount of information to the learner, sometimes even exceeding the initial information content of the data alone.

Example: Say you are trying to model the optimal coverage for an automobile insurance policy. In the data, you have the number of drivers and the number of vehicles. Common sense tells you it is important whether there is a disparity between drivers and vehicles: an extra vehicle can go unused, and an extra driver can't drive. How can a learner 'see' this pattern? If it is a tree, it creates numerous splits (if 1 driver and 2 vehicles do this, if 2 drivers and 1 vehicle do this, ...). Essentially the learner is forced to construct a proxy for the fact of whether there are more vehicles than drivers. There are several problems with this: there is no guarantee the proxy will be correctly created, it makes the model needlessly complex, and it crowds out other patterns from being included in the tree. A better solution is to introduce a flag indicating more vehicles than drivers. Although this is a mere one-bit field, behind it lies the complex reasoning about why the disparity between drivers and vehicles matters, and therefore it carries far more information than one bit. A simple one-bit field like this can make or break a model.
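
To make this concrete, here is a minimal R sketch (simulated policy counts; the column names are hypothetical) of handing the learner that one-bit flag directly instead of forcing a tree to reconstruct it:

set.seed(1)
Policies <- data.frame(NumDrivers  = sample(1:4, 1000, replace = TRUE),
                       NumVehicles = sample(1:4, 1000, replace = TRUE))

# One-bit flag: are there more vehicles than drivers?
Policies$MoreVehiclesThanDrivers <- as.numeric(Policies$NumVehicles > Policies$NumDrivers)
table(Policies$MoreVehiclesThanDrivers)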

The presentation of the data to the learner is just as important as the data itself. What can be obvious to us (more vehicles than drivers, international versus domestic transactions) can be pivotal in uncovering complex concepts. As a data miner, put yourself in the learner's shoes and you will find yourself giving more sympathy to the learner.

Depersonalization

Depersonalization of data is a growing issue for modelers as privacy concerns about consumers' data increase. It is often necessary to de-associate personal identifiers from datasets or take other precautions to assure the anonymity of the individuals studied. This is difficult because many fields we use in modeling (gender, date of birth, and zip code) can be used to identify individuals. A study by Latanya Sweeney showed that gender, date of birth, and zip code can uniquely identify 87% of the US population. To meet privacy concerns, removing the driver license number, Social Security number, and full name is often not enough.

 

Here is an example: you are given two datasets; one has a demographic profile of each individual along with results from a medical study, and the other has full name, address, and date of birth. The concern is that you do not want someone to uniquely identify individuals across these datasets. As mentioned before, if both datasets contain gender, date of birth, and home zip code, you can identify individuals with roughly 87% accuracy. Here there has been no depersonalization. If age had replaced date of birth in the study dataset, one-to-one identification across datasets would not have been so easily achievable.

Concept: K-anonymization

K-anonymization enables you to talk about the degree to which one dataset can be related to another. It is not the only measure of depersonalization and it has some issues (optimal k-anonymization is NP-hard), but it is an important concept to understand. If each record in one dataset can be matched to k records in another dataset, the dataset is said to be (k-1)-anonymized. For example, if you can uniquely match each record in the two datasets (one-to-one matching), the k-anonymization is zero. If, however, many records can match a given record, the k-anonymization is greater than zero. A large value of k indicates a greater degree of depersonalization of the study dataset. When calculating the value, you use a full-information dataset and a study dataset that requires depersonalization.
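
To make the idea concrete, here is a minimal R sketch (simulated data; the quasi-identifier columns are assumptions) that measures the size of each quasi-identifier group. The smallest group size is the k in k-anonymity, and groups of size 1 are uniquely identifiable:

set.seed(2)
study <- data.frame(gender    = sample(c("F", "M"), 1000, replace = TRUE),
                    birthyear = sample(1940:2000, 1000, replace = TRUE),
                    zip       = sample(10000:10020, 1000, replace = TRUE))

# Count how many records share each gender/birthyear/zip combination
groupsize <- aggregate(rep(1, nrow(study)),
                       by = study[, c("gender", "birthyear", "zip")], FUN = sum)
min(groupsize$x)         # the k for this dataset
sum(groupsize$x == 1)    # combinations that identify exactly one person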

 

Further Reading

L. Sweeney, Uniqueness of Simple Demographics in the U.S. (2002), Carnegie Mellon University, Laboratory for International Data Privacy. http://privacy.cs.cmu.edu/courses/pad1/lectures/identifiability.pdf

http://reports-archive.adm.cs.cmu.edu/anon/isri2006/CMU-ISRI-06-105.pdf

http://lorrie.cranor.org/courses/fa04/malin_slides.pdf

Simulation

1. Intro

A simulation is an attempt to mimic the real world using analytical methodology. Simulations are ideal for forecasting systems where the true relationship is too difficult to estimate directly but the underlying process is easy to model. They are not necessarily alternatives to heuristic rules and statistical techniques but rather an alternative way of forecasting using those techniques. To build a simulation model you may rely on statistics and/or heuristics for the core logic; alternatively, you could use theoretical models as the core of your simulation. Simulations are powerful predictive tools and are also useful for running what-if scenarios. Example simulation models:

a) Supply and Demand

b) Queuing models (prisons, phone systems, ...)

c) Factory floor

2. Monte Carlo Simulations

Monte Carlo simulations are stochastic models: they simulate the real world by treating it as a random, or stochastic, process.

a. Random Number Generators

Most of the time we do not have truly random numbers but pseudo-random numbers, typically seeded from the date and time at which the number was created. These random numbers are drawn from a uniform distribution.

One way to generate a random number drawn from a particular distribution is to compute the cumulative distribution function (CDF) of the random variable, draw a number from a uniform distribution, and treat that uniform draw as the cumulative probability: the desired value is the one whose CDF equals the uniform draw (the inverse-transform method). In cases where we cannot easily invert the CDF, a simple algorithm (such as acceptance-rejection sampling) will often work.
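
Here is a minimal R sketch of the inverse-transform idea, assuming an exponential distribution with rate 2 as the target for illustration:

set.seed(42)
u <- runif(10000)          # pseudo-random uniform draws
x <- -log(1 - u) / 2       # inverse CDF of an exponential with rate 2
mean(x)                    # should be close to the theoretical mean of 0.5
hist(x, breaks = 50, main = "Exponential draws via the inverse CDF")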

b. Markov Chains

A Markov chain is a sequence of random variables in which the next value depends only on the current value and not on the history (the memoryless property). A classic example of a Markov chain is a random walk. In a random walk, or drunkard's walk, each successive step is taken in a random direction. Studying Markov chains is not an excuse to hang out in bars more often; a real drunk has an intended direction but an impaired capacity for executing that intention.
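
A minimal R sketch of a one-dimensional random walk, a simple Markov chain in which each position depends only on the previous position plus a random step:

set.seed(7)
steps <- sample(c(-1, 1), 1000, replace = TRUE)
walk  <- cumsum(steps)
plot(walk, type = "l", xlab = "Step", ylab = "Position")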

c. Example Model (Queuing)

1) Intro

Queuing models have one or more servers that process either people or items. If a server cannot process arrivals instantaneously and more than one person arrives, a line forms behind the server. A small Monte Carlo sketch of a single-server queue follows the assumptions below.

2) Open Jackson Network Queuing

1. Arrivals follow a Poisson process

2. Service times are independent and exponentially distributed

3. The probability of completing service at one node and moving on to another node is independent of the state of the system

4. The network is open, so customers can enter and leave the system and may re-enter it

5. An infinite number of servers is assumed

Note: Use an Erlang distribution instead of a Poisson when you have a finite number of servers.
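
Here is the promised sketch: a Monte Carlo simulation of a single-server (M/M/1) queue in R, with assumed arrival and service rates, that estimates the average time spent waiting in line:

set.seed(42)
lambda <- 2       # arrival rate (customers per minute, an assumption)
mu     <- 3       # service rate (customers per minute, an assumption)
n      <- 10000   # customers to simulate

arrivals <- cumsum(rexp(n, rate = lambda))   # Poisson process arrival times
service  <- rexp(n, rate = mu)               # exponential service times
start    <- numeric(n)
finish   <- numeric(n)
for (i in seq_len(n)) {
  start[i]  <- max(arrivals[i], if (i > 1) finish[i - 1] else 0)
  finish[i] <- start[i] + service[i]
}
wait <- start - arrivals
mean(wait)    # compare with the theoretical M/M/1 value lambda / (mu * (mu - lambda))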

The S-Language


The S statistical programming language was developed at Bell Laboratories specifically for statistical modeling. There are two versions of S: a commercial implementation developed by Insightful under the name S-Plus, and an open-source initiative called R. S lets you create objects, is very extendable, and has powerful graphing capabilities.

Tips
Tip 1

Check Memory Size

memory.size(max = TRUE)
Tip 2

Today’s Date

Today <- format(Sys.Date(), "%d %b %Y")
Tip 3

Set Working Directory

setwd("C:/")
Tip 4

Load In Data

ExampleData.path    <- file.path(getwd(), "USDemographics.CSV")
ExampleData.FullSet <- read.table(ExampleData.path, header = TRUE, sep = ",", na.strings = "NA", dec = ".", strip.white = TRUE)
Tip 5

Split Data

ExampleData.Nrows <- nrow(ExampleData.FullSet)
ExampleData.NCol  <- ncol(ExampleData.FullSet)
ExampleData.SampleSize <- ExampleData.Nrows / 2
ExampleData.Sample <- sample(nrow(ExampleData.FullSet), size = ExampleData.SampleSize,
                             replace = FALSE, prob = NULL)
ExampleData.HoldBack <- ExampleData.FullSet[ExampleData.Sample, c(5, 1:ExampleData.NCol)]
ExampleData.Run      <- ExampleData.FullSet[-ExampleData.Sample, c(5, 1:ExampleData.NCol)]
Tip 6

Create Function

Confusion <- function(a, b){
                  tbl <- table(a, b)
                  mis <- 1 - sum(diag(tbl))/sum(tbl)
                  list(table = tbl, misclass.prob = mis)
                   }
Tip 7

Recode Fields

library(car)  # provides recode()
ExampleData.FullSet$SavingsCat <- recode(ExampleData.FullSet$Savings,
    "-40000.00:-100.00 = 'HighNeg'; -100.00:-50.00 = 'MedNeg'; -50.00:10.00 = 'LowNeg'; 10.00:50.00 = 'Low'; 50.00:100.00 = 'Med'; 100.00:1000.00 = 'High'",
    as.factor.result = TRUE)
Tip 8

Summarize Data

summary(ExampleData.FullSet)
Tip 9

Save output

save.image(file = "c:/test.RData", version = NULL, ascii = FALSE, compress = FALSE, safe = TRUE)
Tip 10

Subset

MyData.SubSample <- subset(MyData.Full, MyField ==0)
Tip 11

Remove Object From Memory

remove(list = c("MyObject"))
Tip  12

Create a Dataframe

TmpOutput <- data.frame(Fields = c("Field1", "Field2", "Field3"), Values = c(1, 2, 2))
Tip 13

Cut

data(swiss)
x <- swiss$Education
swiss$Educated <- cut(x, breaks = c(0, 11, 999), labels = c("0", "1"))
Tip 14

Create Directories

dir.create("C:/MyProjects")

Statistical/AI Techniques

1 Intro

There is a forest of traditional statistical techniques and new artificial intelligence algorithms for forecasting. Choosing the right one can be difficult.

2. Choosing variables

With the Information Age, forecasters got a mixed blessing. We now have more data than the most optimistic forecaster dreamed of just fifteen years ago. We typically work with datasets consisting of thousands of data elements and millions of records, but what do we do with all this... stuff? Most of the data elements logically have no relation whatsoever to the features we are studying. Worse, many of the variables are hopelessly correlated with one another, and with so many candidates to sift through, many spurious relationships will emerge from this plethora of information.

a) Research

Again, many problems can be solved by communicating with others or reading the research. If it is important, someone has done it before.

b) Systematic algorithms

The method I favor is writing systematic algorithms that cycle through all the available data elements, analyze each element's relationship with the target feature using a measure such as MSE, and then cherry-pick the best elements for further analysis.
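
A minimal R sketch of such a screening loop (using the built-in swiss data with Fertility as an assumed target): fit a one-variable model for each candidate and rank the candidates by MSE:

data(swiss)
target     <- "Fertility"
candidates <- setdiff(names(swiss), target)

mse <- sapply(candidates, function(v) {
  fit <- lm(reformulate(v, target), data = swiss)
  mean(residuals(fit)^2)
})
sort(mse)   # cherry-pick the lowest-MSE candidates for further analysis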

b.1) Stepwise Regression

Stepwise regressions reduce the number of variables in a model by adding or removing variables one at a time and calculating the marginal gain from including each variable.
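
A minimal R sketch using step() (swiss data; AIC is the assumed selection criterion): start from the full model and let the procedure add or drop variables one at a time:

data(swiss)
full    <- lm(Fertility ~ ., data = swiss)
reduced <- step(full, direction = "both", trace = FALSE)
summary(reduced)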

b.2) Lorenz, ROI, and ROC curves.

Cycle through each potential independent variable and generate curves showing the relationship with the dependent variable.

b.3) Correlation

A simple two-way correlation between each potential independent variable and the dependent variable is another technique for finding promising independent variables.

c) Data mining

There are several data mining techniques, discussed in the data mining section, that are geared toward uncovering linear and non-linear relationships in the data.

d) Principal Components\Factor Analysis

As mentioned in the previous section, this technique can reduce the number of variables whose relationships need to be estimated, with the hope of not losing too much information. Again, it is an estimation technique and should be treated as such.

3) Forecasting Techniques

Below is a cursory overview of common forecasting techniques. A more detailed overview is provided in the statistics and data mining sections. All the example code is in R except where noted.

a) Ordinary Least Squares Regression

This is the classic forecasting technique taught in schools.

1) Building a model

a) There are many packages that provide ordinary least squares estimates.

b) Variable selection is important.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Continuous or Categorical Variables

2) Diagnostic Tools

In the statistical section I go over the evaluation of OLS models in detail, but here are some tools for uncovering major issues when building an OLS model:

a) QQ Plot

b) Residuals plots

c) Correlations

d) Partial Regressions

e) MSE

f) R-Squared

3) Caveats

OLS requires strong assumptions about the nature of the data and the relationships in order to remain BLUE. BLUE is discussed in detail in the statistical section.

4) Example Code(R)

data(trees)
# Illustrative fit (assumed formula): model timber volume on girth and height
Results.Model1 <- lm(Volume ~ Girth + Height, data = trees)
summary(Results.Model1)

b) Logistic Regressions

Logistic regressions have a very different theoretical foundation from ordinary least squares models. You are trying to estimate a probability, so the dependent variable only takes the values 1 or 0. This violates the assumptions required for BLUE.

1) Building the model

a) Many software packages have Logistic regression models included.

b) Variable selection is important.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  Variables

2) Diagnostic Tools

a. Confusion Matrix

b. Lift Charts

c. ROC chart

d. Lorenz Curve

e. Cost Sensitivity/ROI

3) Caveats

Non-linearities can obscure relationships between variables.

4) Example Code(R)

library(MASS)
# Illustrative fit (assumed formula); MASS's birthwt data has the 0/1 outcome 'low'
data(birthwt)
Results.Model1 <- glm(low ~ age + lwt + smoke, family = binomial, data = birthwt)

c) Vector Autoregression Models (VAR)

Vector autoregression models require a firm theoretical foundation. They are designed for estimating the relationships among a set of codependent, autocorrelated variables. To identify the structure you must make strong assumptions about the structure of the errors, namely how the errors are temporally related.

1) Building a model

a) There are few packages that provide Vector Auto Regression models.

b) Variable selection is critical.

c) Outputs a simple equation.

d) They correct for auto-correlation and simultaneous equations.

e) Can model time series data.

f) Continuous Variables

2) Diagnostic Tools

Same as for an OLS model.

3) Caveats

By changing the order of the variables in the model you completely change your theory of what the true relationship is. For example, if you order money first, you believe the money supply drives output; if you order output first, you believe output drives money. These are two contradictory models. Because of this strong reliance on a detailed understanding of the true relationships between variables, plus all the assumptions required for an OLS model, VARs have fallen out of favor in many forecasting circles.

4) Example Code(R)

# This example fits a Bayesian VAR model with flat priors.
library(MSBVAR)
data(longley)

# Flat priors model
szbvar(longley, p = 1, z = NULL, lambda0 = 1, lambda1 = 1, lambda3 = 1, lambda4 = 1,
       lambda5 = 0, mu5 = 0, mu6 = 0, nu = 0, qm = 4, prior = 2, posterior.fit = F)

d) MARS

Multivariate Adaptive Regression Splines (MARS) are designed to better handle nonlinear relationships. They can be seen as a blend of CART and OLS.

1) Building a model

a) There are few packages that have MARS models.

b) Variable selection is similar to OLS, but you do not need to worry as much about nonlinearities.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Continuous or Categorical  Variables
2) Diagnostic Tools

Similar to an OLS model.

3) Caveats
The output can be difficult to read for a complex model but is understandable. MARS models are prone to overfitting.

4) Example Code (R)

library(mda)
# Illustrative MARS fit; the built-in trees data is used here in place of the original glass example
data(trees)
Results.Model1 <- mars(as.matrix(trees[, c("Girth", "Height")]), trees$Volume)

e) Artificial Neural Networks (ANN)

Artificial Neural Networks(ANN) are an attempt to simulate how the mind works.

1) Building a model

a) There are many good neural network packages.

b) Variable selection is similar to OLS but many non-linearities can be assumed to be handled by the ANN.

c) The output is not as interpretable as that of the previously mentioned models.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

I have not found ANNs to be the black boxes they are often criticized as being. You can use the same tools as with an OLS or logistic regression. To find the influence of each variable, you can cycle through the variables, removing each in turn and re-running the model; the effect of each variable can then be measured via the change in MSE.
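
A minimal R sketch of that drop-one-variable diagnostic (nnet package; swiss data with Fertility as an assumed continuous target): refit the network without each input and record the increase in MSE:

library(nnet)
data(swiss)
inputs <- setdiff(names(swiss), "Fertility")

set.seed(1)
base.fit <- nnet(Fertility ~ ., data = swiss, size = 3, linout = TRUE, trace = FALSE)
base.mse <- mean((swiss$Fertility - predict(base.fit, swiss))^2)

influence <- sapply(inputs, function(v) {
  fit <- nnet(reformulate(setdiff(inputs, v), "Fertility"), data = swiss,
              size = 3, linout = TRUE, trace = FALSE)
  mean((swiss$Fertility - predict(fit, swiss))^2) - base.mse
})
sort(influence, decreasing = TRUE)   # a larger rise in MSE suggests a more influential variable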

3) Caveats
Overfitting

4) Example Code (R)

data(swiss)
library(nnet)
# Illustrative fit (assumed formula); linout = TRUE because the target is continuous
results.Model <- nnet(Fertility ~ ., data = swiss, size = 3, linout = TRUE)

f) Support Vector Machines (SVM )

Support Vector Machines are closer to classical statistical methods but hold the promise of uncovering nonlinear relationships.

1) Building a model

a) There are a few good SVM packages, both commercial and open source.

b) Variable selection is similar to OLS, but many non-linearities can be assumed to be handled by the SVM.

c) The output is not as interpretable as that of the previously mentioned models.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

I have also found SVMs not to be black boxes. You can use the same tools as for OLS and logistic regression to diagnose them, just as with the ANN.

3) Caveats

Overfitting

4) Example Code (R)

data(swiss)
library(kernlab)

## train a support vector machine (illustrative formula; rbfdot is ksvm's default kernel)
results.KVSM1 <- ksvm(Fertility ~ ., data = swiss, kernel = "rbfdot")

g) Regression Trees

Regression trees briefly became popular as a forecasting technique around the turn of the century. It was hoped that they could better model nonlinearities, but they proved to be prone to overfitting.

1) Building a model

a) There are several good Tree packages both commercial and open source.

b) Automatic variable selection.

c) The output is easy to understand.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools
You can use the same tools as OLS and logistic regression to diagnose.

3) Caveats

Overfitting

4) Example Code (R)

library(rpart)
library(maptree)
data(kyphosis)

# Illustrative tree grown with rpart's default settings, then drawn with maptree
DefaultSettings <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
draw.tree(DefaultSettings)

h) Bagging, Boosting and Voting

Bagging is a way to help unstable models become more stable by combining many models together.

i) Boosted Trees and Random Forests

Boosted trees apply the boosting methodology to trees: you run many, in some cases hundreds, of small regression trees and then combine all the models, using a voting or averaging methodology, to stabilize the results. The resulting model is very complex but much more stable than any individual tree model would be.

1) Building a model

a) There are several good tree packages both commercial and open source.

b) Automatic variable selection.

c) The output is easy to understand but very large.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

You can use the same tools as OLS and logistic regression to diagnose.

3) Caveats

Overfitting

4) Example Code (R)

data(swiss)
library(randomForest)
set.seed(131)
# Illustrative random forest fit (assumed formula)
Results.Model1 <- randomForest(Fertility ~ ., data = swiss, importance = TRUE)

Python Data

Python is a scripting language with a heavy emphasis on code reuse and simplicity. It is a popular language with a large and active user group. Like Perl, it has a rich library set; it is popular with projects like MapReduce with Hadoop and is rapidly becoming the default language in areas like computational intelligence. One interesting feature of Python is that indentation acts as the block delimiter, so when copying code do not change the indentation or the code will not operate as intended. It is also a powerful language, meaning it can do a lot with very little code. Below is an example of a database containing two tables, Customer and PurchaseOrder. The Customer table has CustomerID (the unique identifier for the customer), Name (the customer's name), and Age (the customer's age). The PurchaseOrder table has POID (the unique ID for the purchase order), CustomerID (which refers back to the customer who made the purchase), and Purchase (what the purchase was).

Example:
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
PurchaseOrder
POID CustomerID Purchase
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction
 
     
SELECT
Create the file select.py with the following code:

import sys
import fileinput

# Loop through the file and print each line (end="" avoids doubling the newline)
for line in fileinput.input(sys.argv[1]):
    print(line, end="")

Then run the code:
python select.py customer.txt

CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
ORDER BY
Create the file orderby.py with the following code:

import sys
import fileinput

# Initialize a list to hold the file's lines

listLines = []

# Load the file into the list

for line in fileinput.input(sys.argv[1]):
    listLines.append(line)

# Custom sort key: the third field (age)

def getSortKey(line):
    return line.split(",", 2)[-1]

# Sort the list

listLines.sort(key=getSortKey)

# Print the sorted lines

for line in listLines:
    print(line, end="")

Then run the code:
python orderby.py customer.txt

CustomerID Name Age
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56
WHERE
Create the file select_by_id.py with the following code:
import sys
import fileinput

# Loop through the file

for line in fileinput.input(sys.argv[1]):

    # Split the line on commas

    tokens = line.split(",")

    # If the ID matches the passed ID then print the line

    if tokens[0] == sys.argv[2]:
        print(line, end="")

Then run the code:
python select_by_id.py customer.txt 1

1 Joe 23
INNER JOIN
Create the file innerjoin.py with the following code:

import sys
import fileinput

# Initialize variables

listB = []

# Load the second file into a list to loop through

for lineB in fileinput.input(sys.argv[2]):
    listB.append(lineB)

# Loop through the first file

for lineA in fileinput.input(sys.argv[1]):

    # Split the line on commas

    tokensA = lineA.split(",")

    # Loop through the second file's lines

    for lineB in listB:

        # Split the line on commas

        tokensB = lineB.split(",")

        # If there is a match, print the joined line

        if tokensA[0] == tokensB[1]:

            # strip() removes the newline from the first file's line

            print(lineA.strip() + "," + lineB, end="")

Then run the code:
python innerjoin.py customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
LEFT OUTER JOIN
Create the file leftouterjoin.py with the following code:

import sys
import fileinput

# Initialize variables

listB = []
iFound = 0

# Load the second file into a list to loop through

for lineB in fileinput.input(sys.argv[2]):
    listB.append(lineB)

# Loop through the first file

for lineA in fileinput.input(sys.argv[1]):

    # Split the line on commas

    tokensA = lineA.split(",")
    iFound = 0

    # Loop through the second file's lines

    for lineB in listB:

        # Split the line on commas

        tokensB = lineB.split(",")

        # If there is a match, print the joined line

        if tokensA[0] == tokensB[1]:

            # strip() removes the newline from the first file's line

            print(lineA.strip() + "," + lineB, end="")
            iFound = 1

    # If there was no match, print the line padded with NULLs

    if iFound == 0:
        print(lineA.strip() + ",NULL,NULL")

Then run the code:
python leftouterjoin.py customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NULL NULL
GROUP BY
Create the file groupby.py with the following code:

import sys
import fileinput

# Initialize variables

iCnt = 1
iLoop = 0
priorTokens = None

# Load and loop through the file

for lineA in fileinput.input(sys.argv[1]):

    # Split the line on commas

    tokensA = lineA.split(",")

    # Adjust for the header and the first data line

    if iLoop < 2:
        priorTokens = tokensA
        iCnt = 0

    # Count consecutive lines that share the same first field

    if tokensA[0] == priorTokens[0]:
        iCnt = iCnt + 1
    else:
        print(priorTokens[1] + "," + priorTokens[2].strip() + "," + str(iCnt))
        iCnt = 1

    iLoop = iLoop + 1
    priorTokens = tokensA

# Print the last group

print(priorTokens[1] + "," + priorTokens[2].strip() + "," + str(iCnt))

Then run the code against the inner-join output (assuming it was saved first, e.g. python innerjoin.py customer.txt orders.txt > joined.txt):
python groupby.py joined.txt

Name Age Orders
Joe 23 2
Mika 45 1
Lin 34 2
Sara 56 1
UPDATE
Create the file update.py with the following code:

import sys
import fileinput

# Loop through the file

for line in fileinput.input(sys.argv[1]):

    # Split the line on commas

    tokens = line.split(",")

    # If the ID does not match the passed ID print the line as is,
    # otherwise replace the age with the passed parameter

    if tokens[0] != sys.argv[2]:
        print(line, end="")
    else:
        print(tokens[0] + "," + tokens[1] + "," + sys.argv[3])

Then run the code:
python update.py customer.txt 1 26

Customer
CustomerID Name Age
1 Joe 26
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
INSERT
Create the file insert.py with the following code:

import sys
import fileinput

# Loop through the file and print each line

for line in fileinput.input(sys.argv[1]):
    print(line, end="")

# Add a new line from the passed arguments

print(sys.argv[2] + "," + sys.argv[3] + "," + sys.argv[4])

Then run the code:
python insert.py customer.txt 6 Terry 50

Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50
DELETE
Create the file delete.py with the following code:

import sys
import fileinput

# Loop through the file

for line in fileinput.input(sys.argv[1]):

    # Split the line on commas

    tokens = line.split(",")

    # If the ID does not match the passed ID then print the line

    if tokens[0] != sys.argv[2]:
        print(line, end="")

Then run the code:
python delete.py customer.txt 1

Customer
CustomerID Name Age
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

Ordinary Least Squares

Ordinary least squares (OLS) estimation is one of the most commonly used statistical techniques for forecasting and causal inference. OLS is an attempt to estimate a linear process. Below is a classic example of a linear model:

Suppose you want to model the relationship between a mother's height and her child's height. You can assume this relationship can be estimated as linear (we do not see a 5-foot mother having a 25-foot child). Also, the child's height should not influence the mother's, so the direction of causation is one-way. This can be represented as the following:

Y= Alpha + Beta *X + error

Where

Y is the child’s height at 18

X is the mother’s height

Beta is the influence of the mother’s height on the child’s height

Alpha is the intercept or base height for the people in the study

Error includes missing variables, measurement error, and the inherent randomness of the relationship.

This model would use cross-sectional data: the data consist of individuals, and time is not a relevant factor. Ordinary least squares can also model time series data, such as the influence of interest rates on GNP.

OLS is a powerful tool; however, it has many restrictive assumptions. Violation of just one of the assumptions can render a model invalid. In the above example, the exclusion of a relevant variable such as poverty level (which may influence the health of the child) or of a time indicator to capture advances in medical technology may invalidate or weaken the results.
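
A minimal R sketch (simulated data with assumed values for Alpha and Beta) showing the model being recovered with lm():

set.seed(1)
n      <- 500
alpha  <- 30      # base height in inches (assumption)
beta   <- 0.55    # influence of the mother's height (assumption)
mother <- rnorm(n, mean = 64, sd = 2.5)
child  <- alpha + beta * mother + rnorm(n, sd = 2)

fit <- lm(child ~ mother)
coef(fit)    # the estimates should land close to alpha and beta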

BLUE

The ordinary least squares estimator is BLUE (Best Linear Unbiased Estimator) as long as certain assumptions hold true. The Gauss-Markov theorem defines the conditions under which a least squares estimator is BLUE. When a linear model violates the assumptions required for BLUE, it is no longer true that the model most accurately reflects the true underlying relationship.

BLUE means: the estimator is unbiased, efficient (has the minimum variance), and consistent (the estimate improves as the sample size increases).

Some of the results of violating BLUE:

1. Including irrelevant independent variables or excluding relevant independent variables.

2. Poor or biased fit

3. Unclear or misleading causal relationships

Assumptions for BLUE

Assumption / Examples of Violations / Effect
1. Linearity: the relationship between the dependent and independent variables is linear in nature. Violations: non-linearity, wrong regressors, changing parameters. Effect: coefficients are biased.
2. The error term's mean is zero and comes from a normal distribution. Violations: omitted variables, biased intercept. Effect: the intercept is biased.
3. Homoscedasticity: the error terms have a uniform variance. Violation: heteroskedasticity. Effect: standard errors are biased.
4. No serial correlation: observations are uncorrelated. Violations: errors in variables, autoregression, simultaneous equations. Effect: standard errors are biased.
5. Variables are at least interval-level data. Violation: dummy (categorical) dependent variables. Effect: coefficients are biased.
6. No exact linear relationship between independent variables, and more observations than independent variables. Violation: perfect multicollinearity. Effect: standard errors are inflated.

Other important assumptions

 

No outliers: this is an issue when examining data with rare events; the question becomes whether a point is a rare event that must be modeled or an atypical case. Violations can lead to a biased estimate.
No measurement error: some level of measurement error will always exist, and it can lead to biased coefficients.

3. Testing if you are BLUE

Testing for BLUE

Assumption / Test / Possible Solution
1. Linearity: the relationship between the dependent and independent variables is linear in nature. Test: plot the residuals for signs of non-linearity. Possible solution: transform variables.
2. The error term's mean is zero and comes from a normal distribution. Test: plot the residuals. Possible solution: use a GLM.
3. Homoscedasticity: the error terms have a uniform variance. Test: Breusch-Pagan and White tests for heteroskedasticity. Possible solution: use a GLM.
4. No serial correlation: observations are uncorrelated. Test: Durbin-Watson test for autocorrelation. Possible solution: include lagged regressors.
5. Variables are at least interval-level data. Test: understand how the dependent variable is constructed. Possible solution: use logistic regression.
6. No exact linear relationship between independent variables, and more observations than independent variables. Test: run correlations between the independent variables. Possible solution: exclude collinear variables or use two-stage least squares for strongly correlated data.
No outliers. Test: Cook's distance, plot the residuals. Possible solution: remove outliers.
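
A minimal R sketch of these checks (the lmtest and car packages are assumed to be installed), applied to an illustrative OLS fit on the built-in trees data:

library(lmtest)
library(car)

fit <- lm(Volume ~ Girth + Height, data = trees)

plot(fit, which = 1:2)    # residuals vs fitted (linearity) and QQ plot (normality)
bptest(fit)               # Breusch-Pagan test for heteroskedasticity
dwtest(fit)               # Durbin-Watson test for serial correlation
vif(fit)                  # variance inflation factors flag multicollinearity
head(sort(cooks.distance(fit), decreasing = TRUE))   # influential outliers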

Trees

1. Intro

Trees are the most common data mining tool used today. They are powerful at discovering non-linear relationships hidden within data. For example, if you are trying to uncover the effect of age on savings using traditional techniques, you will have to code dummy variables to account for non-linearities in the effect of age on saving behavior. Trees will quickly and automatically uncover facts such as younger and older people behaving differently than middle-aged people with regard to their savings rate.
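
A minimal R sketch (simulated data with an assumed savings pattern) of a tree uncovering that non-linear age effect without any hand-coded dummy variables:

library(rpart)
set.seed(7)
age     <- sample(18:85, 2000, replace = TRUE)
savings <- ifelse(age < 30 | age > 70, 0.02, 0.08) + rnorm(2000, sd = 0.01)

fit <- rpart(savings ~ age)
print(fit)   # the splits should land near the true breakpoints around 30 and 70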

2. Overview

The above example is shown in the chart below.

This information space can be partitioned using a tree algorithm.  The result is shown below.

To make the relationships clearer you can represent the above chart in a tree diagram as the image below shows.

 

Binary Trees :

Each node has two branches.

         1          

                 / \         

                /   \        

              2     3                             

Multiway Trees:

Each node has two or more branches. Any multiway tree can be represented as a binary tree, although the output is more complex.

                1          

                 /  |  \         

                /   |   \         

              2   3   4       

 

 

3 Common Tree Algorithms

a) THAID

THAID is a binary tree designed to infer relationships with a nominal response variable. It uses statistical significance tests to prune the tree.

 

b) CHAID

CHAID is a multiway tree used to infer relationships with nominal and categorical response variables. It uses statistical significance tests to prune the tree. It is of the same family (AID) as THAID trees.

 

c) CART

CART is a binary tree that supports continuous, nominal, and ordinal response variables. It is geared toward forecasting and uses cross-validation and pruning to control the size of the tree.

 

d) Boosted Trees/Random Forests

These methods are discussed in the forecasting section.

Data Manipulation

1. Sample Size

Before you go out hacking through the data looking for gems of information, think: how are you going to test validity? At the outset you should split your data into at least a training and a validation dataset. I prefer splitting the data into three randomly assigned groups. The first group is the training set; this is the data you build your model on. The next dataset is the validation set, used to test your model's performance out of sample. The final dataset is the holdout sample; it is the final test of whether your model works on unseen data and should be used sparingly. Typically, I go back and forth between training and validation several times per model and two or three times between training and the holdout set. I will discuss this topic further in the validation section.
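
A minimal R sketch (assumed 60/20/20 proportions, using the built-in swiss data for illustration) of randomly assigning each row to a training, validation, or holdout set:

data(swiss)
set.seed(123)
group <- sample(c("train", "validate", "holdout"), nrow(swiss),
                replace = TRUE, prob = c(0.6, 0.2, 0.2))

Training   <- swiss[group == "train", ]
Validation <- swiss[group == "validate", ]
Holdout    <- swiss[group == "holdout", ]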

2. Data transformations
With thousands of data elements, why do we feel the need to create more? Many relationships are hidden by nonlinearities and obscure interactions. To build a successful model you need to get your hands dirty transforming the data to uncover these relationships. Nonlinear relationships are any that cannot be modeled as a weighted sum of the independent variables. This is a broad definition, encompassing everything from models where the output is the product of independent variables to structural breaks in the relationship. Nonlinearities can be the showstopper when building a model, and knowing how to get around them is essential. Data transformation is an important tool to make the show go on.

a) Nonlinear Transformations

The classic example of a nonlinear relationship in economics is the Cobb-Douglas  production function.

Y = f(K,L) = A*L^b*K^c

L: Labour

K: Capital

A: Technology constant

where

b + c = 1: constant returns to scale

b + c < 1: diminishing returns to scale

b + c > 1: increasing returns to scale.

You will not be able to estimate this model using linear techniques. However, taking the log of the equation yields: log(Y) = log(A) + b*log(L) + c*log(K)

This you can estimate using linear modeling. As seen, a nonlinear transformation can make an unsolvable problem solvable; however, nonlinear transformations should only be used when there are theoretical reasons for them. Misuse of transformations will lead to curve fitting and overfitting of the model, and your model will perform poorly out of sample.
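
A minimal R sketch (simulated data with assumed elasticities) of estimating a Cobb-Douglas function by fitting the log-log form with lm():

set.seed(10)
n  <- 300
L  <- runif(n, 10, 100)             # labour
K  <- runif(n, 10, 100)             # capital
A  <- 2; bL <- 0.6; bK <- 0.4       # technology constant and elasticities (assumptions)
Y  <- A * L^bL * K^bK * exp(rnorm(n, sd = 0.05))

fit <- lm(log(Y) ~ log(L) + log(K))
coef(fit)   # intercept ~ log(A); slopes ~ bL and bK; a sum near 1 implies constant returns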

Another example of a nonlinear relationship is average travel time given distance from the city center. When you are very close to the city center, the average travel time is longer than if you are further out. This could be because traffic increases exponentially as you approach the city center, thereby reducing your speed: AvgTime = a*Distance^b

By taking logs of both sides you can make this relationship estimable using linear techniques:

log(AvgTime) = log(a) + b*log(Distance)

 

Example Nonlinear Transformations:

1. Square

2. Square root

3. Log

4. Exp

b) Dummy Variables

Dummy variables are important for modeling structural breaks in a system, variables with noncontinuous relationships, and other nonlinearities. It is more common for variables to have a noncontinuous relationship with one another than a continuous one. A continuous relationship is like that of speed to distance: the faster you travel, the further you will travel in any given amount of time (assuming a straight highway, no police, ...). Let's use the example above of average travel time versus distance from the city center. Above I assumed the model AvgTime = a*Distance^b, but that is not the only possibility. It may be that all the delay is caused by one bridge. The model then would be:

AvgTime = a + b*Distance   where Distance < bridge from city center

AvgTime = c + b*Distance   where Distance >= bridge from city center

This is best modeled by putting in a dummy variable capturing whether the journey begins before or after the bridge. Dummy variables can also incorporate cumulative effects, such as income versus education.

Example :

1 0 0 0 0 High School

1 1 0 0 0 Junior College

1 1 1 0 0 College

1 1 1 1 0  Graduate School

(No High School is the Intercept).
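
Returning to the bridge model above, here is a minimal R sketch (simulated data with an assumed breakpoint five miles from the city center) of letting a dummy variable carry the structural break:

set.seed(3)
Distance   <- runif(400, 0, 20)
PastBridge <- as.numeric(Distance >= 5)     # 1 if the trip starts past the bridge
AvgTime    <- 10 + 2 * Distance + 15 * (1 - PastBridge) + rnorm(400, sd = 2)

fit <- lm(AvgTime ~ Distance + PastBridge)
summary(fit)   # the PastBridge coefficient captures the shift in the intercept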

c) Fuzzy Variables

Fuzzy variables promise to better incorporate real-world notions of numbers and nonlinear relationships. Many variables are thought of, well, in fuzzy terms, like temperature, age, and weight. We seldom think of temperature as a continuous number but rather in terms of hot, mild, or cold. This gets more problematic because our definitions overlap: a temperature can be both hot and mild in our minds. And if we define these variables in our heads in a fuzzy manner, we react to them in a fuzzy manner. Fuzzy logic holds the promise of better modeling these relationships.

Membership functions: these define which set(s) a particular value might belong to. For example, 85 degrees may be both hot and warm. Membership functions are developed through surveys, experimentation, and logic.

Fuzzy variables avoid key problems that plague dummy variables, namely the sharp cutoff between being included and excluded and the lumping of all groups into the same category. For example, say you want to model savings against education and you want to correct for age differences. The effect of age on employment is non-linear: younger and older people have lower employment rates than the ages in between. To capture this, you may want to include a dummy variable indicating nearing retirement age, which is 0 if under 65 and 1 if 65 or older. But why 65? Why not 64.5 or 66.5? The cutoff at 65 is arbitrary and weakens the relationship between employment and retirement age. To capture this complex relationship you can define a membership function that allows a 64-year-old to belong to both the retirement group and the non-retirement group.

Below is example SAS code generating fuzzy membership functions for temperature.


data Fuzzy;

do TempLoop =0 to 100;

Temp = TempLoop;

if Temp < 40 then Cold = 1 ;

if Temp >= 40 then Cold =-(0.05)*Temp + 3 ;

if Temp >= 60 then Cold = 0 ;

if Temp < 30 then Cool = 0;

if Temp >= 30 then Cool = (0.04)*Temp -1.2 ;

if Temp >= 55 then Cool =-(0.04)*Temp +3.2 ;

if Temp > 80 then Cool =0;

if Temp < 60 then Warm =0;

if Temp >= 60 then Warm = (0.06667)*Temp - 4 ;

if Temp >= 75 then Warm =-(0.06667)*Temp + 6 ;

if Temp >= 90 then Warm =0;

if Temp < 80 then Hot = 0;

if Temp >= 80 then Hot = (0.05)*Temp - 4 ;

if Temp >= 100 then Hot = 1 ;

output;

end;

run;


title 'Temperature';

proc gplot data=Fuzzy;
     plot Cold*Temp Cool*Temp Warm*Temp Hot*Temp / overlay frame legend;
run;

quit;

d) Splitting the Data

Splitting a dataset is useful for uncovering hidden relationships. Let's use the example above of average travel time versus distance from the city center. It may be that all the delay is caused by one bridge, but the effect of the bridge also changes the maximum speed you can travel before reaching it. The model then would be:

AvgTime = a + b*Distance   where Distance < bridge from city center

AvgTime = a + c*Distance   where Distance >= bridge from city center

This could be modeled by splitting the dataset.
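
A minimal R sketch (simulated travel data with an assumed five-mile breakpoint) of splitting the dataset at the bridge and fitting a separate model on each side:

set.seed(4)
Travel <- data.frame(Distance = runif(400, 0, 20))
Travel$AvgTime <- ifelse(Travel$Distance < 5,
                         20 + 2 * Travel$Distance,
                         10 + 3 * Travel$Distance) + rnorm(400, sd = 2)

fit.before <- lm(AvgTime ~ Distance, data = subset(Travel, Distance < 5))
fit.after  <- lm(AvgTime ~ Distance, data = subset(Travel, Distance >= 5))
coef(fit.before); coef(fit.after)   # different coefficients on each side of the bridge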

3. Data Reduction Techniques

a) Principal Components \ Factor Analysis

Principal components analysis is a powerful data reduction tool. It estimates the structure among variables, producing factors that represent those structures. By looking at the factors you can deduce relationships by seeing how different variables relate to one another. The factors are by definition orthogonal, and some would argue they can be used as independent variables in a regression. One common mistake is to forget that principal components analysis is an estimation technique and needs to be treated as such.

Example Code (R)


data(swiss)

Results.FA <- factanal(~ Fertility + Agriculture + Examination + Education + Catholic + Infant.Mortality,
                       factors = 2, rotation = "varimax", scores = "none", data = swiss)

 

summary(Results.FA)

Results.FA

b) Other data reduction techniques

There are a number of unsupervised AI techniques that will be discussed in the AI data mining section.