Depersonalization

Depersonalization of data is a growing issue for modelers as privacy concerns about consumer data increase.  It is often necessary to de-associate personal identifiers from datasets, or take other precautions, to assure the anonymity of the individuals studied.  This is difficult because many fields we use in modeling, such as gender, date of birth, and ZIP code, can be used to identify individuals.  A study by Latanya Sweeney showed that gender, date of birth, and ZIP code can uniquely identify roughly 87% of the US population.  Removing driver's license number, Social Security number, and full name is often not enough to meet privacy concerns.

 

Here is an example: you are given two datasets. One has a demographic profile of each individual and results from a medical study; the other has full name, address, and date of birth.  The concern is that someone could uniquely identify individuals across these datasets. As mentioned above, if both datasets contain gender, date of birth, and home ZIP code, you can identify individuals with roughly 87% accuracy; no depersonalization has taken place.  If age had replaced date of birth in the study dataset, one-to-one identification across the datasets would not have been so easily achievable.

Concept: K-anonymization

K-anonymization lets you talk about the degree to which one dataset can be linked to another. It is not the only measure of depersonalization, and it has practical issues (optimal k-anonymization is NP-hard), but it is an important concept to understand. If each record in one dataset can be matched to k records in another dataset, the dataset is said to provide (k-1) anonymity. For example, if you can uniquely match every record across two datasets (a one-to-one matching), the anonymity is zero.  If, however, many records can match a given record, the anonymity is greater than zero. A larger value of k indicates a greater degree of depersonalization of the study dataset.  When calculating the value you compare a full-information dataset against the study dataset that requires depersonalization.
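
As a rough sketch of how you might measure this in R: assuming two data frames, full.data (the full-information dataset) and study.data (the study dataset), that share hypothetical quasi-identifier columns gender, age, and zip, the code below counts how many full-information records share each study record's quasi-identifier combination; the smallest count is the k discussed above.

# Hypothetical quasi-identifier columns shared by the two datasets (assumed names)
quasi <- c("gender", "age", "zip")

# Count how many full-information rows share each quasi-identifier combination
full.counts <- aggregate(list(matches = rep(1, nrow(full.data))),
                         by = full.data[, quasi], FUN = sum)

# Attach the counts to the study records and report the worst case
study.k <- merge(study.data, full.counts, by = quasi, all.x = TRUE)
min(study.k$matches, na.rm = TRUE)   # a value of 1 means someone is uniquely identifiable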

 

Further Reading

L. Sweeney, Uniqueness of Simple Demographics in the U.S. (2002), Carnegie Mellon University, Laboratory for International Data Privacy. http://privacy.cs.cmu.edu/courses/pad1/lectures/identifiability.pdf

http://reports-archive.adm.cs.cmu.edu/anon/isri2006/CMU-ISRI-06-105.pdf

http://lorrie.cranor.org/courses/fa04/malin_slides.pdf

Simulation

1. Intro

A simulation is an attempt to mimic the real world using analytical methodology. Simulations are ideal for forecasting systems where the true relationship is too difficult to estimate directly, or where the underlying process is easy to model explicitly. Simulations are not necessarily alternatives to heuristic rules and statistical techniques but an alternative way of forecasting with those techniques.  To build a simulation model you may rely on statistics and/or heuristics for the core logic; alternatively, you could use theoretical models as the core of your simulation. Simulations are powerful predictive tools and are also useful for running what-if scenarios. Example simulation models:

a) Supply and Demand

b) Queuing models (prisons, phone systems, …)

c) Factory floor

2. Monte Carlo Simulations

Monte Carlo simulations are stochastic models. They simulate the real world by assuming it is a random or stochastic process.

a. Random Number Generators

Most of the time we do not have truly random numbers but pseudo-random numbers, typically seeded with the date and time at which the sequence was started. These pseudo-random numbers are drawn from a uniform distribution.

One way to generate a random number drawn from a particular distribution is to calculate the cumulative distribution function (CDF) of the random variable and then treat a uniform random draw as a probability, inverting the CDF to find the corresponding value. In cases where we cannot easily invert the CDF, a simple algorithm will often work.
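
As a small sketch of the inverse-CDF idea in R: the exponential distribution has CDF F(x) = 1 - exp(-lambda*x), which inverts to x = -log(1 - u)/lambda, so pushing uniform draws through the inverse CDF yields exponential draws (the rate parameter here is arbitrary).

set.seed(42)
lambda <- 2                  # assumed rate parameter for the example
u <- runif(10000)            # pseudo-random numbers from a uniform distribution
x <- -log(1 - u) / lambda    # inverse CDF of the exponential distribution
mean(x)                      # should be close to 1/lambda = 0.5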

b. Markov Chains

A Markov chain is a sequence of random variables in which each value depends only on the one immediately before it, not on the rest of the history. A classic example of a Markov chain is a random walk. In a random walk, or drunkard's walk, each successive step is taken in a random direction. Studying Markov chains is not an excuse to hang out in bars more often; a real drunk has an intended direction but an impaired capacity for executing that intention.
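
A minimal sketch of a one-dimensional random walk in R: each step is +1 or -1 with equal probability, and the current position depends only on the previous position.

set.seed(7)
steps <- sample(c(-1, 1), size = 1000, replace = TRUE)   # random +/- 1 steps
walk <- cumsum(steps)                                    # position after each step
plot(walk, type = "l", xlab = "Step", ylab = "Position")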

c. Example Model (Queuing)

1) Intro

Queuing models have one or more servers that process either people or items. If a server cannot process arrivals instantaneously and more than one person arrives, a line forms behind the server. A small simulation sketch follows the Jackson network assumptions below.

2) Open Jackson Network Queuing

1. Arrivals follow a Poisson process.

2. Service times are independent and exponentially distributed.

3. The probability of completing service at one node and moving on to another is independent.

4. The network is open, so items can re-enter the system.

5. Assume an infinite number of servers.

Note: Use an Erlang distribution instead of a Poisson when you have a finite number of servers.
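
As a small illustration of a queuing simulation, the R sketch below simulates a single-server queue (a simplification, not a full Jackson network) with Poisson arrivals and exponential service times; the rates are arbitrary.

set.seed(1)
n <- 1000                    # number of customers to simulate
lambda <- 0.8                # arrival rate (assumed)
mu <- 1.0                    # service rate (assumed)

arrivals <- cumsum(rexp(n, rate = lambda))   # Poisson-process arrival times
service  <- rexp(n, rate = mu)               # exponential service times

start <- numeric(n)
finish <- numeric(n)
for (i in 1:n) {
  # Service starts at arrival, or when the previous customer finishes, whichever is later
  start[i]  <- if (i == 1) arrivals[i] else max(arrivals[i], finish[i - 1])
  finish[i] <- start[i] + service[i]
}
mean(start - arrivals)       # average time spent waiting in line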

Statistical/AI Techniques

1. Intro

There is a forest of traditional statistical techniques and new artificial intelligence algorithms for forecasting. Choosing the right one can be difficult.

2. Choosing variables

With the Information Age, forecasters got a mixed blessing.  We now have more data than the most optimistic forecaster dreamed of just fifteen years ago.  We typically work with datasets consisting of thousands of data elements and millions of records, but what do we do with all this… stuff? Most of the data elements logically have no relation whatsoever to the features we are studying. Worse, many of the variables are hopelessly correlated with one another, and with so many variables to sift through, many spurious relationships will emerge from this plethora of information.

a) Research

Again, many problems can be solved by communication or by reading the research. If it is important, someone has done it before.

b) Systematic algorithms

The method I favor is writing systematic algorithms that cycle through all available data elements, analyze each element's relationship with the target feature using a measure like MSE, and then cherry-pick the best elements for further analysis.

b.1) Stepwise Regression

Stepwise regression reduces the number of variables in a model by adding or removing variables one at a time and calculating the marginal gain from including each variable.
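
A minimal sketch using R's built-in step(), which adds and drops terms based on AIC rather than MSE; the trees dataset and formula are just placeholders.

data(trees)
full.model <- lm(Volume ~ Girth + Height, data = trees)
step.model <- step(full.model, direction = "both")   # stepwise search by AIC
summary(step.model)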

b.2) Lorenz, ROI, and ROC curves.

Cycle through each potential independent variable and generate curves showing the relationship with the dependent variable.

b.3) Correlation

A simple two-way correlation between the potential independent and dependent variables is another technique for finding potential independent variables.
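
A sketch of this screening in R, ranking candidate variables by the absolute value of their correlation with the dependent variable; the swiss dataset with Fertility as the target is only a stand-in.

data(swiss)
target <- "Fertility"
candidates <- setdiff(names(swiss), target)
correlations <- sapply(candidates, function(v) cor(swiss[[v]], swiss[[target]]))
sort(abs(correlations), decreasing = TRUE)   # strongest relationships first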

c) Data mining

There are several data mining techniques I will discuss in the data mining section that are geared to uncover linear and non-linear relationships in the data.

d) Principal Components/Factor Analysis

As mentioned in the previous section, this technique can reduce the number of variables whose relationships need to be estimated, with the hope of not losing too much information. Again, it is an estimation technique and should be treated as such.

3. Forecasting Techniques

Below is a cursory overview of common forecasting techniques. A more detailed overview is provided in the statistics and data mining sections. All the example code is in R except where noted.

a) Ordinary Least Squares Regression

This is the classic forecasting technique taught in schools.

1) Building a model

a) There are many packages that provide ordinary least squares estimates.

b) Variable selection is important.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Continuous or Categorical Variables

2) Diagnostic Tools

In the statistical section I go over the evaluation of OLS models in detail, but here are some tools for uncovering major issues when building an OLS model:

a) QQ Plot

b) Residuals plots

c) Correlations

d) Partial Regressions

e) MSE

f) R-Squared

3) Caveats

OLS requires strong assumptions about the nature of the data and the relationships in order to remain BLUE.  BLUE is discussed in detail in the statistical section.

4) Example Code(R)

data(trees)
# Predict timber volume from girth and height (formula assumed; the original example was truncated)
Results.Model1 <- lm(Volume ~ Girth + Height, data = trees)
summary(Results.Model1)

b) Logistic Regressions

Logistic regressions have a very different theoretical foundation from ordinary least squares models.  You are trying to estimate a probability, so the dependent variable only takes the values 1 or 0. This violates the assumptions required for BLUE.

1) Building the model

a) Many software packages have Logistic regression models included.

b) Variable selection is important.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  Variables

2) Diagnostic Tools

a. Confusion Matrix

b. Lift Charts

c. ROC chart

d. Lorenz Curve

e. Cost Sensitivity/ROI

3) Caveats

Non-linearities can obscure relationships between variables.

4) Example Code(R)

library(MASS)
# Model low birth weight (binary outcome) using the MASS birthwt data (dataset and formula assumed; the original example was truncated)
Results.Model1 <- glm(low ~ age + lwt + smoke, family = binomial, data = birthwt)
summary(Results.Model1)

c) Vector Autoregression Models (VAR)

Vector autoregression models require a firm theoretical foundation.  They are designed for estimating the relationships among a matrix of codependent, autocorrelated variables.  To identify the structure you must make strong assumptions about the structure of the errors, namely how the errors are temporally related.

1) Building a model

a) There are few packages that provide Vector Auto Regression models.

b) Variable selection is critical.

c) Outputs a simple equation.

d) They correct for auto-correlation and simultaneous equations.

e) Can model time series data.

f) Continuous Variables

2) Diagnostic Tools

Same as for an OLS model.

3) Caveats

By changing the order of the variables in the model you completely change your theory of what the true relationship is.  For example, if you order money first, you believe the money supply drives output; if you order output first, you believe output drives money.  These are two contradictory models.  Because of this strong reliance on a detailed understanding of the true relationships between variables, on top of all the assumptions required for an OLS model, VARs have fallen out of favor in many forecasting circles.

4) Example Code(R)

# This example fits a Bayesian VAR model with flat priors.
library(MSBVAR)
data(longley)

# Flat-prior model (the longley data frame is converted to a time series object)
Results.Model1 <- szbvar(ts(longley), p=1, z=NULL, lambda0=1, lambda1=1, lambda3=1,
    lambda4=1, lambda5=0, mu5=0, mu6=0, nu=0, qm=4, prior=2, posterior.fit=F)

d) MARS

Multivariate Adaptive Regression Splines are designed to deal better with nonlinear relationships.  They can be seen as a blending of CART and OLS.

1) Building a model

a) There are few packages that have MARS models.

b) Variable selection is similar to OLS, but you do not need to worry as much about nonlinearities.

c) Outputs a simple equation.

d) Can model time series and cross-sectional data.

e) Continuous or Categorical Variables

2) Diagnostic Tools

Similar to an OLS model.

3) Caveats

The output can be difficult to read for a complex model, but it is still understandable.  MARS models are prone to overfitting.

4) Example Code (R)

library(mda)
data(glass)
# Fit a MARS model; the last column is assumed to be the response (the original example was truncated)
Results.Model1 <- mars(x = glass[, -ncol(glass)], y = glass[, ncol(glass)])

e) Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN) are an attempt to simulate how the mind works.

1) Building a model

a) There are many good neural network packages.

b) Variable selection is similar to OLS but many non-linearities can be assumed to be handled by the ANN.

c) The output is not interpretable in the manner of the previously mentioned models.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

I have not found ANNs to be the black boxes they are often criticized as being. You can use the same tools as with an OLS or logistic regression.  To find the influence of each variable you can cycle through the variables, removing each one and re-running the model; the effect of the variable can then be measured via the change in MSE.
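
A sketch of that drop-one procedure with nnet on the swiss data; the dataset, formula, and network size are placeholders, and in practice you would also scale the inputs.

library(nnet)
data(swiss)
set.seed(1)

vars <- setdiff(names(swiss), "Fertility")
base.fit <- nnet(Fertility ~ ., data = swiss, size = 3, linout = TRUE, trace = FALSE)
base.mse <- mean((swiss$Fertility - predict(base.fit, swiss)) ^ 2)

# Refit the network without each variable in turn and record the increase in MSE
influence <- sapply(vars, function(v) {
  f <- reformulate(setdiff(vars, v), response = "Fertility")
  fit <- nnet(f, data = swiss, size = 3, linout = TRUE, trace = FALSE)
  mean((swiss$Fertility - predict(fit, swiss)) ^ 2) - base.mse
})
sort(influence, decreasing = TRUE)   # a larger increase in MSE suggests a more influential variable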

3) Caveats
Overfitting

4) Example Code (R)

data(swiss)
library(nnet)
# Fit a small network for Fertility (formula and network size assumed; the original example was truncated)
results.Model <- nnet(Fertility ~ ., data = swiss, size = 3, linout = TRUE)

f) Support Vector Machines (SVM )

Support Vector Machines are closer to classical statistical methods but hold the promise of uncovering nonlinear relationships.

1) Building a model

a) There are few good SVM packages both commercial and open source.

b) Variable selection is similar to OLS, but many non-linearities can be assumed to be handled by the SVM.

c) The output is not interpretable in the manner of the previously mentioned models.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

I have also found that SVMs are not black boxes. You can use the same tools as OLS and logistic regression to diagnose them, just as with the ANN.

3) Caveats

Overfitting

4) Example Code (R)

data(swiss)
library(kernlab)

## train a support vector machine
# Formula and kernel assumed; the original example was truncated
results.KVSM1 <- ksvm(Fertility ~ ., data = swiss, kernel = "rbfdot")

g) Regression Trees

Regression trees briefly became popular as a forecasting technique around the turn of the century.  It was hoped that they could better model nonlinearities but proved to be prone to overfitting.

1) Building a model

a) There are several good Tree packages both commercial and open source.

b) Automatic variable selection.

c) The output is easy to understand.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

You can use the same tools as OLS and logistic regression to diagnose.

3) Caveats

Overfitting

4) Example Code (R)

library(rpart)
library(maptree)
data(kyphosis)

# Fit a classification tree with rpart's default settings and draw it
# (formula from the standard rpart example; the original line was truncated)
DefaultSettings <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
draw.tree(DefaultSettings)

h) Bagging, Boosting and Voting

Bagging is a way to help unstable models become more stable by combining many models together.

i) Boosted Trees and Random Forests

Boosted trees apply the boosting methodology to trees. You run many, in some cases hundreds, of small regression trees and then combine all the models using a voting methodology to stabilize the results.  The resulting model is very complex, but much more stable than any individual tree model would be.

1) Building a model

a) There are several good tree packages both commercial and open source.

b) Automatic variable selection.

c) The output is easy to understand but very large.

d) Can model time series and cross-sectional data.

e) Probabilities or Categorical  or Continuous Variables

2) Diagnostic Tools

You can use the same tools as OLS and logistic regression to diagnose.

3) Caveats

Overfitting

4) Example Code (R)

data(swiss)
library(randomForest)
set.seed(131)
# Fit a random forest for Fertility (formula assumed; the original example was truncated)
Results.Model1 <- randomForest(Fertility ~ ., data = swiss, importance = TRUE)

Python Data

Python is a scripting language with a heavy emphasis on code reuse and simplicity. It is a popular language with a large and active user community.  Like Perl, it has a rich library set; it is popular for projects like MapReduce with Hadoop and is rapidly becoming a default language in areas like computational intelligence.  One interesting feature of Python is that indentation acts as the block delimiter, so if you change the indentation when copying code, the code will not operate as intended.  It is also a powerful language, meaning it can do a lot with very little code.  Below is an example of a database containing two tables, Customer and PurchaseOrder.  The Customer table has CustomerID (the unique identifier for the customer), Name (the customer's name), and Age (the customer's age).  The PurchaseOrder table has POID (the unique ID for the purchase order), CustomerID (which refers back to the customer who made the purchase), and Purchase (what was purchased).

Example:
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
PurchaseOrder
POID CustomerID Purchase
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction
 
     
SELECT
Create the file select.py with the following code:

import sys
import fileinput
# Loop through file and print lines
for line in fileinput.input(sys.argv[1]):
    print(line, end="")

Then run the code:
python select.py customer.txt

CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
ORDER BY
Create the file orderby.py with the following code:

import sys
import fileinput

# Initialize variables

listLines = []

# Load file into a list

for line in fileinput.input(sys.argv[1]):
    listLines.append(line)

# Create a custom sort key based on the third field (Age)

def getAge(line):
    return line.split(",", 2)[-1]

# Sort the list

listLines.sort(key=getAge)

# Print the sorted lines

for line in listLines:
    print(line, end="")

Then run the code:
python orderby.py customer.txt

CustomerID Name Age
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56
WHERE
Create the file select_by_id.py with the following code:
import sys
import fileinput

# Loop through file

for line in fileinput.input(sys.argv[1]):

    # Split line using a comma

    tokens = line.split(",")

    # If the ID matches the passed ID then print

    if tokens[0] == sys.argv[2]:
        print(line, end="")

Then run the code:
python select_by_id.py customer.txt 1

1 Joe 23
INNER JOIN
Create the file innerjoin.py with the following code:

import sys
import fileinput

# Initialize variables

listB = []

# Load second file into a list to loop through

for lineB in fileinput.input(sys.argv[2]):
    listB.append(lineB)

# Loop through first file

for lineA in fileinput.input(sys.argv[1]):

    # Split line using a comma

    tokensA = lineA.split(",")

    # Loop through the second file's lines

    for lineB in listB:

        # Split line using a comma

        tokensB = lineB.split(",")

        # If there is a match, print the joined line

        if tokensA[0] == tokensB[1]:

            # Remove the newline character with the strip method

            print(lineA.strip() + "," + lineB, end="")

Then run the code:
python innerjoin.py customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
LEFT OUTER JOIN
Create the file leftouterjoin.py with the following code:

import sys
import fileinput

# Initialize variables

listB = []
iFound = 0

# Load second file into a list to loop through

for lineB in fileinput.input(sys.argv[2]):
    listB.append(lineB)

# Loop through first file

for lineA in fileinput.input(sys.argv[1]):

    # Split line using a comma

    tokensA = lineA.split(",")
    iFound = 0

    # Loop through the second file's lines

    for lineB in listB:

        # Split line using a comma

        tokensB = lineB.split(",")

        # If there is a match, print the joined line

        if tokensA[0] == tokensB[1]:

            # Remove the newline character with the strip method

            print(lineA.strip() + "," + lineB, end="")
            iFound = 1

    # If there was no match, print the line with NULL fields

    if iFound == 0:
        print(lineA.strip() + ",NULL,NULL")

Then run the code:
python leftouterjoin.py customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NULL NULL
GROUP BY
Create the file groupby.py with the following code:

import sys
import fileinput

# Initialize variables

iCnt = 1
iLoop = 0

# Load and loop through the file

for lineA in fileinput.input(sys.argv[1]):

    # Split line using a comma

    tokensA = lineA.split(",")

    # Adjust for the header and the first line

    if iLoop < 2:
        priorTokens = tokensA
        iCnt = 0

    if tokensA[0] == priorTokens[0]:
        iCnt = iCnt + 1
    else:
        print(priorTokens[1] + "," + priorTokens[2].strip() + "," + str(iCnt))
        iCnt = 1

    iLoop = iLoop + 1
    priorTokens = tokensA

# Print the last group

print(priorTokens[1] + "," + priorTokens[2].strip() + "," + str(iCnt))

Then run the code:
python groupby.py customer.txt

Name Age Orders
Joe 23 1
Mika 45 2
Lin 34 2
Sara 56 1
UPDATE
Create the file update.py with the following code:

import sys
import fileinput

# Loop through file

for line in fileinput.input(sys.argv[1]):

    # Split line using a comma

    tokens = line.split(",")

    # If the ID is not the passed ID then print the line unchanged, else replace the age with the passed parameter

    if tokens[0] != sys.argv[2]:
        print(line, end="")
    else:
        print(tokens[0] + "," + tokens[1] + "," + sys.argv[3])

Then run the code:
python update.py customer.txt 1 26

Customer
CustomerID Name Age
1 Joe 26
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
INSERT
Create the file insert.py with the following code:

import sys
import fileinput

# Loop through file and print lines

for line in fileinput.input(sys.argv[1]):
    print(line, end="")

# Add a new line from the passed arguments

print(sys.argv[2] + "," + sys.argv[3] + "," + sys.argv[4])

Then run the code:
python insert.py customer.txt 6 Terry 50

Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50
DELETE
Create the file delete.py with the following code:

import sys
import fileinput

# Loop through file

for line in fileinput.input(sys.argv[1]):

    # Split line using a comma

    tokens = line.split(",")

    # If the ID is not the passed ID then print

    if tokens[0] != sys.argv[2]:
        print(line, end="")

Then run the code:
python delete.py customer.txt 1

Customer
CustomerID Name Age
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

Ordinary Least Squares

Ordinary least squares estimators are among the most commonly used statistical techniques for forecasting and causal inference.  OLS is an attempt to estimate a linear process.  Below is a classic example of a linear model:

Suppose you want to model the relationship between a mother's height and her child's height.  You can assume this relationship is linear (we do not see a 5-foot mother having a 25-foot child).  Also, the child's height should not influence the mother's, so the direction of causation is one-way. This can be represented as the following:

Y= Alpha + Beta *X + error

Where

Y is the child’s height at 18

X is the mother’s height

Beta is the influence of the mother’s height on the child’s height

Alpha is the intercept or base height for the people in the study

Error includes missing variables, measurement error, and inherent randomness to the relationship.

This model would use cross-sectional data, the data consists of individuals, and time is not a relevant factor.  Ordinary least squares can also model time series data like the influence of interest rates on GNP.

OLS is a powerful tool; however, it has many restrictive assumptions.  Violation of just one of the assumptions can render a model invalid. In the above example, the exclusion of a relevant variable, such as poverty level (which may influence the health of the child) or a time indicator to capture advances in medical technology, may invalidate or weaken the results.

BLUE

The ordinary least squares estimator is BLUE (Best Linear Unbiased Estimator) as long as certain assumptions hold true. The Gauss-Markov theorem defines the conditions under which a least squares estimator is BLUE.  When a linear model violates the assumptions required for BLUE, it is no longer guaranteed to be the most accurate reflection of the true underlying relationship.

BLUE means the model is unbiased, efficient (has the minimum variance), and consistent (the estimate improves as the sample size increases).

Violating BLUE, for example by including irrelevant independent variables or excluding relevant ones, can result in:

1. A poor or biased fit

2. Unclear or misleading causal relationships

Assumptions for BLUE

Each assumption is listed below with examples of violations and the effect of a violation.

1. Linearity: the relationship between the dependent and independent variables is linear. Violations: wrong regressors, non-linearity, changing parameters. Effect: coefficients are biased.

2. The error term has a mean of zero and comes from a normal distribution. Violations: omitted variables, a biased intercept. Effect: the intercept is biased.

3. Homoscedasticity: the error terms have a uniform variance. Violation: heteroskedasticity. Effect: standard errors are biased.

4. No serial correlation: the observations are uncorrelated. Violations: errors in variables, autoregression, simultaneous equations. Effect: standard errors are biased.

5. Variables are at least interval-level data: no dummy (categorical) dependent variables. Effect: coefficients are biased.

6. No exact linear relationship between the independent variables, and more observations than independent variables (no perfect multicollinearity). Effect: standard errors are inflated.

Other important assumptions

 

No outliers: this is an issue when examining data with rare events; the question becomes whether this is a rare event that must be modeled or an atypical case. Violations can lead to biased estimates.

No measurement error: some level of measurement error will always exist, and it can lead to biased coefficients.

3. Testing if you are BLUE

Testing for BLUE

Each assumption is listed below with a test and a possible solution.

1. Linearity: plot the residuals and look for non-linearity. Solution: transform variables.

2. Zero-mean, normally distributed errors: plot the residuals. Solution: use a GLM.

3. Homoscedasticity: Breusch-Pagan and White tests for heteroskedasticity. Solution: use a GLM.

4. No serial correlation: Durbin-Watson test for autocorrelation. Solution: include lagged regressors.

5. Variables are at least interval-level data: understand how the dependent variable is constructed. Solution: use logistic regression.

6. No perfect multicollinearity: run correlations among the independent variables. Solution: exclude collinear variables or use two-stage least squares for strongly correlated data.

No outliers: Cook's distance and residual plots. Solution: remove outliers.
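
A sketch of a few of these checks in R, using the lmtest package for the Breusch-Pagan and Durbin-Watson tests; the model object is the OLS example fitted earlier, and the package choice is one option among several.

library(lmtest)
data(trees)
Results.Model1 <- lm(Volume ~ Girth + Height, data = trees)

plot(Results.Model1)             # residual, QQ, scale-location, and leverage plots
bptest(Results.Model1)           # Breusch-Pagan test for heteroskedasticity
dwtest(Results.Model1)           # Durbin-Watson test for serial correlation
cooks.distance(Results.Model1)   # influence of each observation (outlier check)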

Trees

1. Intro

Trees are the most common data mining tool used today. They are powerful at discovering non-linear relationships hidden within data.  For example, if you are trying to uncover the effect of age on savings using traditional techniques, you will have to code dummy variables to account for non-linearities in the effect of age on saving behavior. Trees will quickly and automatically uncover facts like younger and older people behaving differently from middle-aged people with regard to their savings rate.

2. Overview

The above example is shown in the chart below.

This information space can be partitioned using a tree algorithm.  The result is shown below.

To make the relationships clearer you can represent the above chart in a tree diagram as the image below shows.

 

Binary Trees :

Each node has two branches.

      1
     / \
    /   \
   2     3

Multiway Trees:

Each node has two or more branches.  Any Multiway tree can be represented as a Binary tree although the output is more complex.

      1
    / | \
   /  |  \
  2   3   4

 

 

3. Common Tree Algorithms

a) THAID

THAID is a binary tree designed to infer relationships with a nominal response variable. It uses statistical significance tests to prune the tree.

 

b) CHAID

CHAID is a multiway tree used to infer relationships with nominal and categorical response variables. It uses statistical significance tests to prune the tree.  It is of the same family (AID) as THAID trees.

 

c) CART

CART is a binary tree that supports continuous, nominal, and ordinal response variables.  It is geared toward forecasting and uses cross-validation and pruning to control the size of the tree.

 

d) Boosted Trees/Random Forests

These methods are discussed in the forecasting section.

Data Manipulation

1. Sample Size

Before you go hacking through the data looking for gems of information, think: how are you going to test validity?  At the outset you should split your data into at least a training and a validation dataset.  I prefer splitting the data into three randomly assigned groups.  The first group is the training set; this is the data you build your model on.  The next dataset is the validation set; it is used to test your model's performance out of sample.  The final dataset is the holdout sample; it is the final test of whether your model works on unseen data and should be used sparingly.  Typically, I go back and forth between training and validation several times per model and two or three times between training and the holdout set.  I will discuss this topic further in the validation section.
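
A sketch of a three-way random split in R; the dataset and the proportions are arbitrary.

set.seed(123)
data(swiss)
groups <- sample(c("train", "validate", "holdout"), size = nrow(swiss),
                 replace = TRUE, prob = c(0.6, 0.2, 0.2))
train.set    <- swiss[groups == "train", ]
validate.set <- swiss[groups == "validate", ]
holdout.set  <- swiss[groups == "holdout", ]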

2. Data transformations

With thousands of data elements, why do we feel the need to create more?  Many relationships are hidden by nonlinearities and obscure interactions.  To build a successful model you need to get your hands dirty transforming the data to uncover these relationships.  A nonlinear relationship is any that cannot be modeled as a weighted sum of the independent variables. This is a broad definition, encompassing everything from models where the output is the product of independent variables to structural breaks in the relationship.  Nonlinearities can be the showstopper when trying to build a model, and knowing how to get around them is essential.  Data transformation is an important tool to make the show go on.

a)Nonlinear Transformations

The classic example of a nonlinear relationship in economics is the Cobb-Douglas  production function.

Y = f(K,L) = A*L^b*K^c

L: Labour

K: Capital

A: Technology constant

where

b+c = 1: constant returns to scale

b+c < 1: diminishing returns to scale

b+c > 1: increasing returns to scale.

You will not be able to estimate this model using linear techniques.  However, taking the log of the equation yields:  log(Y) = log(A) + b*log(L) + c*log(K)

This you can estimate using linear modeling. As you can see, a nonlinear transformation can make an unsolvable problem solvable; however, nonlinear transformations should only be used when there are theoretical reasons for them. Misuse of transformations leads to curve fitting and over-fitting, and your model will perform poorly out of sample.
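
A sketch of estimating the logged Cobb-Douglas form on simulated data; the true parameters below are invented purely for the illustration.

set.seed(10)
L <- runif(200, 10, 100)                              # labour input
K <- runif(200, 10, 100)                              # capital input
Y <- 2 * L^0.6 * K^0.4 * exp(rnorm(200, sd = 0.1))    # assumed true process with noise

cd.model <- lm(log(Y) ~ log(L) + log(K))              # linear in the logs
coef(cd.model)                                        # intercept estimates log(A); slopes estimate b and c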

Another example of a nonlinear relationship is average travel time given distance from the city center. When you are very close to the city center, the average travel time is longer than if you are further out.  This could be because traffic increases exponentially as you approach the city center, thereby reducing your speed: AvgTime = a*Distance^b

By transforming distance you can make this relation estimable using linear techniques.

AvgTime = a + b*log(Distance)

 

Example Nonlinear Transformations:

1. Square

2. Square root

3. Log

4. Exp

b) Dummy Variables

Dummy variables are important for modeling structural breaks in a system, variables with noncontinuous relationships, and other nonlinearities. It is more common for variables to have a noncontinuous relationship with one another than a continuous one.  A continuous relationship is like speed to distance: the faster you travel, the further you will go in any given amount of time (assuming a straight highway, no police,….).  Let's use the example above of average travel time versus distance from the city center.  Above I assumed the model AvgTime = a*Distance^b, but that is not the only possibility.  It may be that all the delay is caused by one bridge.  The model would then be

AvgTime = a + b*Distance   where Distance < bridge from city center

AvgTime = c + b*Distance   where Distance >= bridge from city center

This is best modeled by adding a dummy variable that pulls in the information on whether the journey begins before or after the bridge (a short regression sketch of the bridge model follows the coding example below). Dummy variables can also incorporate cumulative effects, such as income versus education.

Example :

1 0 0 0 0 High School

1 1 0 0 0 Junior College

1 1 1 0 0 College

1 1 1 1 0  Graduate School

(No High School is the Intercept).
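
A sketch of the bridge example as a dummy-variable regression in R; the data and the bridge location are simulated for illustration only.

set.seed(3)
Distance <- runif(300, 0, 20)
AfterBridge <- as.numeric(Distance >= 8)                       # assumed bridge location at 8 miles
AvgTime <- 10 + 2 * Distance + 15 * AfterBridge + rnorm(300, sd = 2)

bridge.model <- lm(AvgTime ~ Distance + AfterBridge)
coef(bridge.model)   # the AfterBridge coefficient estimates the jump in the intercept at the bridge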

c) Fuzzy Variables

Fuzzy variables promise to better incorporate real-world notions of numbers and nonlinear relationships.  Many variables are thought of, well, in fuzzy terms, like temperature, age, and weight.  We seldom think of temperature as a continuous number but in terms of hot, mild, or cold. This gets more problematic because our definitions overlap: a temperature can be both hot and mild in our minds.  And if we define these variables in our heads in a fuzzy manner, we react to them in a fuzzy manner.  Fuzzy logic holds the promise of better modeling these relationships.

Membership functions: these define which set(s) a particular value may belong to.  For example, 85 degrees may be both hot and warm. Membership functions are developed through surveys, experimentation, and logic.

Fuzzy variables avoid key problems that plague dummy variables, namely the sharp cutoff between being included and excluded and the lumping of all groups into the same category. For example, say you want to model savings against education and you want to correct for age differences. The effect of age on employment is non-linear: younger and older people have lower employment rates than the ages in between. To capture this, you may want to include a dummy variable for nearing retirement age, which is 0 if under 65 and 1 if 65 or older. But why 65? Why not 64.5 or 66.5? The cutoff at 65 is arbitrary and weakens the estimated relationship between employment and retirement age. To capture this complex relationship you can define a membership function that allows a 64-year-old to belong to both the retirement group and the non-retirement group.

Below is example SAS code generating fuzzy membership functions for temperature.


data Fuzzy;

do TempLoop =0 to 100;

Temp = TempLoop;

if Temp < 40 then Cold = 1 ;

if Temp >= 40 then Cold =-(0.05)*Temp + 3 ;

if Temp >= 60 then Cold = 0 ;

if Temp < 30 then Cool = 0;

if Temp >= 30 then Cool = (0.04)*Temp -1.2 ;

if Temp >= 55 then Cool =-(0.04)*Temp +3.2 ;

if Temp > 80 then Cool =0;

if Temp < 60 then Warm =0;

if Temp >= 60 then Warm = (0.06667)*Temp - 4 ;

if Temp >= 75 then Warm =-(0.06667)*Temp + 6 ;

if Temp >= 90 then Warm =0;

if Temp < 80 then Hot = 0;

if Temp >= 80 then Hot = (0.05)*Temp - 4 ;

if Temp >= 100 then Hot = 1 ;

output;

end;

run;


title 'Temperature';

proc gplot data=Fuzzy;
     plot Cold*Temp Cool*Temp Warm*Temp Hot*Temp / overlay frame legend;
run;

quit;

d) Splitting the Data

Splitting a dataset is useful for uncovering hidden relationships. Let's use the example above of average travel time versus distance from the city center. It may be that all the delay is caused by one bridge, but the bridge also affects the maximum speed you can travel before reaching it.  The model would then be

AvgTime = a + b*Distance   where Distance < bridge from city center

AvgTime = a + c*Distance   where Distance >= bridge from city center

This could be modeled by splitting the dataset.

3. Data Reduction Techniques

a) Principal Components/Factor Analysis

Principal components is a powerful data reduction tool. It estimates the structure between variables, producing factors that represent that structure.  By looking at the factors you can deduce relationships by seeing how different variables relate to one another.  The factors are by definition orthogonal, and some would argue they can be used as independent variables in a regression. One common mistake is to forget that principal components is an estimation technique; it needs to be treated as such.

Example Code (R)


data(swiss)

Results.FA <- factanal(~ Fertility + Agriculture + Examination + Education + Catholic
    + Infant.Mortality, factors = 2, rotation = "varimax", scores = "none", data = swiss)

 

summary(Results.FA)

Results.FA

b) Other data reduction techniques

There are a number of unsupervised AI techniques that will be discussed in the AI data mining section.

Bootstrapping

Bootstrapping is one of the most useful statistical techniques. It is also one that is often misunderstood, overused, or avoided. In general, the bootstrap is a process that uses simulation to make inferences about the probability distribution of a sample population. Operationally, you take repeated samples with replacement from a given dataset to build a new dataset, and this is repeated until the number of samples is sufficient for statistically valid conclusions. The technique was first introduced by Efron (1979). The jackknife, a similar but less popular re-sampling technique, pre-dates the bootstrap; it systematically leaves out observations rather than re-sampling with replacement as the bootstrap does.

The power of bootstrapping is its ability to make statistical inferences about a population using only a small, potentially biased sub-sample, which is why the term comes from the phrase "to pull oneself up by one's bootstraps", a seemingly impossible task. And it does all of this without the restrictive assumption of normality. Bootstrapping is also valid with small sample sizes (as small as twenty).

There are two main types of bootstrapping: parametric and non-parametric. Non-parametric bootstrapping does not assume a distribution for the population but instead defines the distribution from the data. Parametric bootstrapping assumes the population follows a known, parameterized distribution such as the log-normal.

Below are example uses of bootstrapping:

* If faced with a biased sample or an infrequent event, you can employ bootstrapping to re-sample cases. You can use this when estimating a logistic regression with a rare event.

 

* By re-sampling residuals with the bootstrap you can make inferences about the asymptotic properties of the confidence intervals (CI) and other goodness-of-fit statistics. This is useful when the sample size is small or the assumption of normality is too restrictive.

*To make inferences about a population you can bootstrap the sampling distribution. This offers a powerful alternative to using standard methods when the assumption of normality is too restrictive.

*Bootstrapping can also be useful to detect outliers.
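
A minimal non-parametric bootstrap in R, estimating a confidence interval for a mean by re-sampling with replacement; the sample data and the number of replications are arbitrary.

set.seed(99)
x <- rexp(30, rate = 0.2)                # a small, skewed sample
boot.means <- replicate(5000, mean(sample(x, replace = TRUE)))
quantile(boot.means, c(0.025, 0.975))    # percentile bootstrap confidence interval for the mean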

Further Reading

www.uvm.edu/~dhowell/StatPages/Resampling/Bootstrapping.html

www.uvm.edu/~dhowell/StatPages/Resampling/Resampling.html

wikipedia.org/wiki/Bootstrapping

Sas.com/kb/24/982.html

Instrumental Variable

Here is an example problem: you want to determine how the competitive environment affects store performance. You proceed to estimate a model with store performance as a function of the number of rival stores nearby, in hopes of seeing how rivals affect performance. But you have violated a key assumption of BLUE: at least one independent variable is contemporaneously correlated with the error term. If a store is profitable it will attract rivals. When an independent variable is correlated with the error term you cannot get consistent estimates using regression analysis.

The result is a downward-biased estimate of the effect of rivals on store performance: if a location is highly profitable, more firms will enter the market, increasing the number of rival stores while not necessarily adversely affecting the store's performance. This is a common occurrence. Other examples are income and education, supply and demand, and store location and credit scores.

One solution is to use instrumental variables (IV). The key to IV is the use of one or more proxy variables that are correlated with the troublesome independent variable but not contemporaneously correlated with the error term. The IV can be an observed variable, such as the number of small streams as a proxy for an urban versus rural location, or a fitted value. Two-stage least squares (2SLS) is a common IV technique that uses a fitted value from a second regression as the IV in the first regression. When building a 2SLS model you still need variables that are correlated with the variable you are trying to proxy for, but not correlated with the primary model's error term, to use in the second stage of the model.

In the store performance example you could employ a 2SLS model to estimate the number of rivals from variables such as development tax incentives, the number of rivals before the store opened, and other factors not directly correlated with the performance of the store you are examining.

Instrumental variables can also be used to correct for omitted variables by choosing a proxy variable closely correlated with the missing variable.
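
A sketch of 2SLS on simulated data using ivreg() from the AER package; the data-generating process and the instrument are invented purely for illustration.

library(AER)
set.seed(5)
n <- 500
instrument <- rnorm(n)                                  # e.g., a tax-incentive measure
u <- rnorm(n)                                           # unobserved profitability shock
rivals <- 1 + 0.8 * instrument + 0.5 * u + rnorm(n)     # rivals respond to profitability
performance <- 2 - 0.4 * rivals + u + rnorm(n)

ols.fit <- lm(performance ~ rivals)                     # biased by the endogeneity
iv.fit  <- ivreg(performance ~ rivals | instrument)     # two-stage least squares
coef(ols.fit); coef(iv.fit)                             # compare the rivals coefficients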

Further Reading

Wikipedia

Economics Glossary

Missouri state Working_Paper_Series

Software:

OpenBayes: a free Bayesian belief network library written in Python.

Resource:

The-Data-Mine.com: another website focused on data mining.

Polytomous (Multinomial) and Ordinal Logistic Models

If your dependent variable is continuous or nearly continuous you can use a regression technique; if the dependent variable is binary you can use a logistic regression. Often, however, what we are trying to model is neither continuous nor binary. Multi-level dependent variables are a common occurrence in the real world. Examples of multi-level dependent variables are:

1: ‘yes’, ‘no’ or ‘does not respond’

2: ‘high’, ‘medium’ or ‘low’ levels of agreement.

These can sometimes be modeled using binary models. You can collapse two categories together, say 'no' and 'does not respond', resulting in a binary choice model. This, however, can obscure relationships, especially if the groups are incorrectly formed or if there are several potential outcomes that do not group together logically (i.e., does medium go with low or high?).

There are two main types of models for handling mutually exclusive, multi-level dependent variables: one for ordinal outcomes and one for nominal outcomes. Nominal outcomes are those where the ordering of the outcomes does not matter, as in example 1. Ordinal outcomes are those where the order matters, as in example 2. For nominal outcomes you can use polytomous (multinomial) models, and for ordinal outcomes you can use ordinal or polytomous models.

Polytomous models are similar to standard logistic models, but instead of one log odds ratio to estimate you have several, one for each response. Ordinal models are stricter in their assumptions than polytomous models, namely they assume the response variable has a natural ordering. As with polytomous models, you estimate a log odds for each response.
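
A sketch of both model types in R, fitting a multinomial model with nnet::multinom and an ordinal (proportional odds) model with MASS::polr; the housing data from MASS is just a convenient stand-in.

library(MASS)
library(nnet)
data(housing)

# Nominal outcome: multinomial logistic regression
nominal.fit <- multinom(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

# Ordinal outcome: proportional-odds logistic regression
ordinal.fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

summary(ordinal.fit)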

Both models carry the assumptions of the standard logistic model plus a few more. One new assumption for both models is that the outcomes are mutually exclusive. In the two examples above it is clear the categories are mutually exclusive. But consider this example: whom do you like, candidate A, B, or C? Someone could easily like more than one candidate. In cases where this assumption is too restrictive you may use data mining approaches such as SVM or ANN.

Another assumption is the independence of irrelevant alternatives (IIA). Since the log odds for each response are pitted against one another, it is assumed people behave rationally. An example of a violation is someone who prefers candidate A to B, B to C, and C to A: given the preferences A over B and B over C, a rational person would prefer A to C. If IIA is violated it can make interpreting the log odds impossible. For a detailed discussion go to:

http://en.wikipedia.org/wiki/Independence_of_irrelevant_alternatives.
www.stat.psu.edu/~jglenn/stat504/08_multilog/10_multilog_logits.htm

www2.chass.ncsu.edu/garson/PA765/logistic.htm

www.mrc-bsu.cam.ac.uk/bugs/documentation/exampVol2/node21.html

en.wikipedia.org/wiki/Logistic_regression

nlp.stanford.edu/IR-book/html/htmledition/node189.html

www.ats.ucla.edu/STAT/mult_pkg/perspective/v18n2p25.htm