R/S-Plus

The S language's power is not its data management capability, nor is data management the intent of the S language. However, when evaluating the output of a model you will often need to perform basic data management, and you will find R/S-Plus acceptable in this role. The commands feel closer to Unix/Linux than to SQL, but the S language keeps many of the benefits of Unix/Linux (a concise language for data management) while being more data-centric (data frames carry metadata such as column names).
Example:
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
PurchaseOrder
POID CustomerID Purchase
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction
 

SELECT

customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

ORDER BY

customer[order(customer$Age), ]
CustomerID Name Age
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56

WHERE

subset(customer, CustomerID == 5)
CustomerID Name Age
5 Susan 18

INNER JOIN

merge(customer, purchaseorder, by = "CustomerID", all = FALSE)
CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction

LEFT OUTER JOIN

merge(customer, purchaseorder, by = "CustomerID", all.x = TRUE)
CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NA NA

GROUP BY

cust_sum <- merge(customer, purchaseorder, by = "CustomerID", all = FALSE)
xtabs(~ Name, cust_sum)
Joe Lin Mika Sara
2 2 1 1

UPDATE

customer[1, "Age"] <- 26
 
CustomerID Name Age
1 Joe 26
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

INSERT

newcust <- data.frame(CustomerID = 6, Name = "Terry", Age = 50)
rbind(customer, newcust)
 
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50

DELETE

subset(customer, CustomerID != 1)
 
CustomerID Name Age
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

 

Example R Function

You can extend R/S-Plus by writing functions, much as you would write macros in the SAS language. These functions can be anything from a new statistical algorithm to a file operation to data manipulation. Below is an example of a custom R function. It takes the output of an rpart tree and converts it to SAS code suitable for use in a data step, which is useful when coding nodes into a model. A dirty little secret: I developed the code by looking at the default print method in the rpart package and adapting it to generate SAS code. It could be modified to generate SQL as well. When writing new code, I suggest first looking at published packages that do something similar and adapting them to your own use.

The S language (which both R and S-Plus implement) is similar to C. There are many good editors for S; this code was written using Tinn-R.

printSAS.rpart <- function(x, minlength = 0, spaces = 2, cp,
                           digits = getOption("digits"), ...) {

  tree.depth <- getFromNamespace("tree.depth", "rpart")

  if (!inherits(x, "rpart")) stop("Not a legitimate rpart object")
  if (!is.null(x$frame$splits)) x <- rpconvert(x)  # help for old objects
  if (!missing(cp)) x <- prune.rpart(x, cp = cp)
  frame <- x$frame

  ylevel <- attr(x, "ylevels")
  node <- as.numeric(row.names(frame))
  depth <- tree.depth(node)
  indent <- paste(rep(" ", spaces * 32), collapse = "")

  # 32 is the maximal depth
  if (length(node) > 1) {
    indent <- substring(indent, 1, spaces * seq(depth))
    indent <- paste(c("", indent[depth]), format(node), ")", sep = "")
  }
  else indent <- paste(format(node), ")", sep = "")

  tfun <- (x$functions)$print
  if (!is.null(tfun)) {
    if (is.null(frame$yval2))
      yval <- tfun(frame$yval, ylevel, digits)
    else yval <- tfun(frame$yval2, ylevel, digits)
  }
  else yval <- format(signif(frame$yval, digits = digits))

  z <- labels(x, digits = digits, minlength = minlength, ...)

  term <- rep("", length(depth))
  final <- rep("", length(depth))
  temp1 <- rep("", length(depth))
  tempnode <- rep(10000, length(depth))
  term[frame$var == "<leaf>"] <- "Terminal"

  for (i in 1:length(depth)) # main loop over nodes
  {
    if (term[i] != "Terminal")
    {
      final[i] <- ""
    }

    if (term[i] == "Terminal")
    {
      # walk back up the tree, collecting the split conditions for this leaf
      for (j in 1:length(depth))
      {
        if (node[i - j] == 1) break
        if (term[i - j] != "Terminal")
        {
          if (node[i - j] != tempnode[i] - 1)
          {
            if (node[i - j] < tempnode[i])
            {
              temp1[i] <- paste(z[i - j], "And", temp1[i])
              tempnode[i] <- node[i - j]
            }
          }
        }
      } # end for

      final[i] <- paste("If", temp1[i], z[i], "then NodeVal =", yval[i], ";")
    } # end if Terminal
  } # end main loop
  cat(final, sep = "\n")  ## print results
}

Model Validation


1. Intro

 

 

When managing a project, model validation should be fifty percent of development time. The last thing you want is for model validation to be a meaningless rubber stamp. Expect the first run of every model to fail the validation stage. To save time, I recommend having several models finished before going into the last stage and including the person in charge of validation in all discussions of the model.

Carl Sagan, in The Demon-Haunted World, said this about long-term forecasts:

Each field of science has its own complement of pseudo-science. Physicists have perpetual motion machines, an army of amateur relativity disprovers and perhaps cold fusion. Chemists still have alchemy. Economists have long-range forecasting.

In the validation stage we strive to make Carl Sagan's statement false.

 


2. Issues

 

 

a) Over-fitting

Over-fitting (think of the bias-variance tradeoff) is fitting the model too closely to the sample data. Why is fitting your model closely to the data bad?

1) Remember you are estimating a stochastic process. We assume there is some unexplained error. You do not want to explain random error, or model measurement error that exists only in the sample you are observing.

2) The more complex the model, the more likely you are wrong. One common mistake is to include multiple non-linear transformations of a variable to fit a complex relationship. Every added variable is an added assumption about the true underlying model, and every added assumption is another assumption that could be wrong even if it worked in sample.

In forecasting you want a general model, not one specific to the test data. Think of tables with uneven surfaces, where the large unevenness is systemic and the minor unevenness is particular to one table. If you modeled the surface with clay, the best-fitting general model would come from applying minor pressure when pressing the clay to a tabletop. The worst model would come from applying great pressure, thereby creating one perfect model of a particular table that fits no other table. Remember the goal.
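The clay analogy can be made concrete with a deliberately extreme sketch (illustrative Python with made-up numbers, not from the original text): a model that simply memorizes the training sample has perfect in-sample fit, yet gains nothing over the crudest general model on a new observation.

```python
# Training data: (x, y) pairs; one held-out test point.
train = [(1, 2.1), (2, 3.9), (3, 6.2)]
test = [(4, 8.1)]

memorized = dict(train)                            # the "over-fit" model: memorize the sample
mean_y = sum(y for _, y in train) / len(train)     # the "general" model: a single number

def mse(pairs, predict):
    return sum((predict(x) - y) ** 2 for x, y in pairs) / len(pairs)

train_err_memo = mse(train, lambda x: memorized[x])            # 0.0: perfect in sample
train_err_mean = mse(train, lambda x: mean_y)                  # > 0 in sample
test_err_memo = mse(test, lambda x: memorized.get(x, mean_y))  # nothing to say off sample
test_err_mean = mse(test, lambda x: mean_y)
assert train_err_memo < train_err_mean and test_err_memo == test_err_mean
```

The memorizer wins in sample by construction, which is exactly why in-sample fit alone cannot detect over-fitting.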

 

    

b) Sub Sample Stability

Does the model produce stable results for a majority of the sub-segments of the population? For example, how well does it predict for customers over age 65?
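One way to check this is to compute your fit statistic within each segment separately and compare. A minimal sketch (illustrative Python; the records and segment names are invented):

```python
# Hypothetical scored records: (segment, actual, predicted)
records = [
    ("under_65", 10.0, 9.0),
    ("under_65", 12.0, 12.5),
    ("over_65", 30.0, 20.0),
    ("over_65", 28.0, 40.0),
]

def mse_by_segment(rows):
    # Group squared errors by segment, then average within each group.
    groups = {}
    for seg, actual, pred in rows:
        groups.setdefault(seg, []).append((actual - pred) ** 2)
    return {seg: sum(errs) / len(errs) for seg, errs in groups.items()}

print(mse_by_segment(records))  # a much larger MSE for one segment flags instability
```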

 

 

c) Predictive Power

Questions to answer:

1. Does it treat subgroups fairly? 

2. Does it exhibit adverse selection for any subgroup?

3. Does it provide sufficient separation? 

4. How stable is its performance across samples?

5. And, not to be forgotten, is it profitable?   


3. Methods

 

 

 

 

 

a) Out of Sample Forecasting 

The easiest way to test the performance of a forecasting model is to take data the model has not seen and see how well it performs. This is the best and simplest way to test a model's robustness. The unseen data is often called the holdout sample.

 

 

 

 

 

b) Cross-Validation

When you do not have enough data for separate test, validation and holdout samples, cross-validation is an alternative. There are many types of cross-validation; a simple version is as follows:

 

1. Partition your sample into multiple sub groups.

2. Train your model using all but one partition.

3. Validate your model on the remaining partition.

4. Repeat until all partitions have been used as a validation set.
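The steps above can be sketched in a few lines (illustrative Python; the fold-splitting scheme is one of many reasonable choices):

```python
import random

def kfold_indices(n, k, seed=0):
    # Shuffle the row indices once, then deal them into k validation folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(10, 3)
for held_out in folds:
    train_idx = [j for f in folds if f is not held_out for j in f]
    # fit the model on train_idx, validate on held_out,
    # then average the k validation scores
```

Every observation is used for validation exactly once, so no single lucky partition drives the result.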

 

I have not had great luck with cross-validation. When I have used it in production, the models proved to be less stable. One thing I have noticed is that papers using cross-validation tend to favor over-fitting techniques. Since you typically use data that was sampled at one point in time, most of the within-sample variance will be low. If the sample was pulled at different points in time from different sources, cross-validation should be more effective.

 


4. Model Diagnostic Tools

 

 

Below is a brief overview of some common diagnostic tools. In the statistics sections these tools are reviewed more closely.

 

Diagnostic | Purpose | Type of dependent variable
QQ plot | Stability of predictions at the tails | Linear
Residual plots | Fit, non-linearities, structural shifts | Linear
R-Squared | Fit | Linear
MSE | Fit | All
Partial regression | Parameter validity, structural shifts | All
T-statistics | Parameter validity | Linear
White | Heteroskedasticity | Linear
Durbin-Watson | Autocorrelation | Linear
Economic significance | Parameter significance | All
Receiver Operating Characteristic (ROC) curve | Fit | Binary, categorical
Lorenz curve | Fit, lift | Binary, categorical
Profit curve | Model's relevance | All
Confusion matrix | Fit | Binary, categorical
Odds ratios | Parameter relevance | Binary, categorical
Trees | Non-linear relationships | All
Influence of variables on MSE | Like a partial regression for learning methods | All
Out-of-sample forecasting | Stability, fit | All

SQL


The most common data access method is Structured Query Language (SQL). The language is designed to allow easy access to relational data. If you are familiar with set theory or Venn diagrams, SQL will seem very intuitive. SQL allows you to quickly and simply combine relational data. Below is an example of a database containing two tables, Customer and PurchaseOrder. The Customer table has CustomerID (the unique identifier for the customer), Name (the customer's name) and Age (the customer's age). The PurchaseOrder table has POID (the unique identifier for the purchase order), CustomerID (which refers back to the customer who made the purchase) and Purchase (what was purchased).

Example:
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
PurchaseOrder
POID CustomerID Purchase
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction
 
SELECT
SELECT * FROM MyDataBase.Customer

CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
ORDER BY
SELECT * FROM MyDataBase.Customer ORDER BY Age
CustomerID Name Age
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56
WHERE
SELECT * FROM MyDataBase.Customer WHERE age = 18
CustomerID Name Age
5 Susan 18
INNER JOIN
SELECT * FROM MyDataBase.Customer INNER JOIN MyDataBase.PurchaseOrder ON Customer.CustomerID = PurchaseOrder.CustomerID
CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
LEFT OUTER JOIN
SELECT * FROM MyDataBase.Customer LEFT OUTER JOIN MyDataBase.PurchaseOrder ON Customer.CustomerID = PurchaseOrder.CustomerID
CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NULL NULL
GROUP BY
SELECT Name, count(*) as Orders FROM Customer INNER JOIN PurchaseOrder ON Customer.CustomerID = PurchaseOrder.CustomerID GROUP BY Name
Name Orders
Joe 2
Mika 1
Lin 2
Sara 1
UPDATE
UPDATE Customer SET Age = 26 WHERE CustomerID = 1
Customer
CustomerID Name Age
1 Joe 26
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
INSERT
INSERT INTO Customer VALUES (6, 'Terry', 50)
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50
DELETE
DELETE FROM Customer WHERE CustomerID = 1
Customer
CustomerID Name Age
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

Diagnostics


1. Intro

 

Understanding and using diagnostics separates a good statistician from a hack. It is not enough just to run the diagnostics; you must challenge your model with a critical eye. To build a model is simple; to assure it is stable and accurate is a work of art. There are a variety of tools and tests that can aid you in evaluating your model. When doing diagnostics, never assume anything; always seek proof.

 


2. Tools for linear models

 

   

a. QQ plot

 

   

The QQ or quantile-quantile plot shows the residual errors for the first and last quantiles of the dependent variable plotted against a 45-degree line. This allows you to see how well the model fits at both extremes. A model that fits poorly will appear to curl up on itself on the plot. Having a model that fits poorly at the extremes is not a good thing, but oftentimes it is not a showstopper; by setting maximum allowable values for the model, it can still be useful in segmenting cases. To correct for poorly fitting tails, look for new explanatory variables or double-check whether you missed any non-linearities that could be confusing the system.

 

 

b. Residual Plots

 

By observing the residual plots much can be uncovered about how the model is performing. A key thing to look for is any pattern in the plots. Since the residuals should be random, there should be no observable trend in them; any observable pattern indicates trouble with the model.

 

 

c. R-Squared

 

R-Squared measures the proportion of the variation of the dependent variable explained (I am using that term very loosely) by the model. R-Squared has poor standing among statisticians but can be useful if it is not the only measure of fitness of the model. It ranges from zero to one, with one being a perfect fit. One is only possible if you include the dependent variable as an explanatory variable, and therefore is an indication of error. With the data I typically look at, a good model ranges from .1 to .3; however, I have seen models in production working well with an R-Squared as low as .07.

R^2 = 1 - ((y - Xb)'(y - Xb)) / sum((y - ybar)^2)
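As a quick sanity check of the formula, a small self-contained sketch (Python for illustration; the numbers are made up):

```python
def r_squared(y, y_hat):
    # 1 - SS_residual / SS_total, matching R^2 = 1 - ((y - Xb)'(y - Xb)) / sum((y - ybar)^2)
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))  # close fit, R^2 near 1
```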

 

 

d. MSE

 

MSE, or Mean Squared Error, is useful in choosing between multiple models. It is simply the average of the squared errors.
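A minimal sketch (Python for illustration):

```python
def mse(actual, predicted):
    # Average of the squared differences between actual and predicted values.
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

print(mse([1, 2], [1, 4]))  # → 2.0
```

When choosing between models, the one with the lower MSE on the same holdout data is the better fit by this criterion.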

 

 

e. 1. Partial Regression

 

Partial regressions are an important tool in determining how the independent variables affect the model as well as each other. A partial regression shows the net effect of an independent variable after correcting for the other regressors.

 

 

e. 2. Partial Residual Plots

 

Partial residual plots are residuals plotted against each independent variable's value. This shows how the residuals of the model vary as the value of the independent variable changes. It will uncover situations such as a variable causing too great a variation in the model at high values, leading to high residuals; in that case you would cap the independent variable. What you want to see is an even cloud of data points with a zero slope, centered on zero.

 

 

f. T-stats on Coefficients

 

The T-statistics on the regressors test the null hypothesis that a coefficient is zero, that is, that it has no effect on the model. If you cannot statistically justify a variable's inclusion in the model, it is preferred to remove it. Reasons for a variable failing a t-test can range from it having no relation to the dependent variable, to non-linearities influencing the results, to other independent variables clouding the true relationship. If there are firm theoretical reasons for including the variable, investigate further.

 

 

g. Economic Significance of Coefficients

 

An independent variable may be statistically significant but have no explanatory power. By calculating the economic significance of a variable you can roughly measure its contribution to the overall value of the dependent variable. The economic significance of a coefficient is the coefficient times the standard deviation of the independent variable. There is no clear definition of when a coefficient is economically significant; instead a researcher has to look at the values and decide for herself whether a given coefficient has enough, well, oomph to be considered important. It is a powerful tool to rid yourself of those pesky statistically significant but unimportant variables.
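A minimal sketch of the calculation (Python for illustration; the coefficients and data are invented). It shows why a tiny coefficient on a widely varying regressor can matter more than a large coefficient on a nearly constant one:

```python
def economic_significance(coefficient, x_values):
    # Coefficient times the (sample) standard deviation of the regressor:
    # roughly, the change in the dependent variable for a one-SD move in x.
    mean = sum(x_values) / len(x_values)
    var = sum((x - mean) ** 2 for x in x_values) / (len(x_values) - 1)
    return coefficient * var ** 0.5

small_coef_big_spread = economic_significance(0.001, [0, 1000, 2000, 3000])
big_coef_tiny_spread = economic_significance(5.0, [0.00, 0.01, 0.02, 0.03])
print(small_coef_big_spread, big_coef_tiny_spread)
```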

 

h. Cook's Distance

 

Cook's distance is used to uncover outliers in the data. Using the Cook's distance values you can target outliers for removal. Remember, though, that not all outliers should be removed; some are representative of important behavior in the system. In modeling weather in Florida, hurricanes may look like outliers in the data, but they are a critical feature to model.

 

i. Chow

 

The Chow test is used to test for structural or regime changes within the data. In monetary and other financial models it is an important test. If a structural change seldom occurs, modeling the change using dummy variables can be a good choice, but if structural changes occur often you may need to model the underlying causes of those changes to have any chance of forecasting the process.

 

  j. Durbin-Watson

 

Durbin-Watson (DW) is the standard test for serial correlation (autocorrelation). Remember, serial correlation violates the Gauss-Markov assumptions (the estimator is no longer BLUE), and autoregressive models can be employed to correct for it. When investigating time-series data you always have to be conscious of the DW statistic.
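The statistic itself is easy to compute from the residual series (illustrative Python; the residual vectors are contrived to show the two extremes):

```python
def durbin_watson(residuals):
    # DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2; near 2 means little
    # autocorrelation, near 0 strong positive, near 4 strong negative.
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

print(durbin_watson([1, 1, 1, 1, -1, -1, -1, -1]))  # → 0.5 (positive autocorrelation)
print(durbin_watson([1, -1, 1, -1, 1, -1, 1, -1]))  # → 3.5 (negative autocorrelation)
```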

 

  k. Bag Plot

 

Bag plots uncover outliers in the data and are a useful complement to Cook's distance.

 

  l. White

 

The White test is a standard test for heteroskedasticity. Heteroskedasticity biases the estimated standard errors of the coefficients, which can lead to flawed inference, such as wrongly excluding relevant variables.

 


3. Tools for Probabilistic and Categorical Models

 

 

a) Odds Ratios

 

   

The odds ratio for each independent variable indicates whether to keep that variable in the model. If the odds ratio is 1, the variable does not help the predictive power of the model, while an odds ratio statistically significantly greater than or less than one indicates the variable has predictive power.
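For a 2x2 outcome-by-group table, the odds ratio is the cross-product ratio. A small sketch (Python for illustration; the counts are invented):

```python
def odds_ratio(exposed_yes, exposed_no, unexposed_yes, unexposed_no):
    # (odds of the outcome in one group) / (odds of the outcome in the other).
    # An odds ratio of 1 means the grouping variable carries no information.
    return (exposed_yes / exposed_no) / (unexposed_yes / unexposed_no)

print(odds_ratio(30, 70, 10, 90))  # (30/70) / (10/90) ≈ 3.86
print(odds_ratio(50, 50, 50, 50))  # → 1.0, no predictive power
```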

 

 

b) Receiver Operating Characteristic (ROC) Curve

 

   

The ROC curve graphically shows the trade-off between sensitivity (the true positive rate) and specificity (the true negative rate). If the model has no predictive power, all the points will lie on a 45-degree line. A greater area between the 45-degree line and the ROC curve indicates a more predictive model.

 

 

c) Lorenze Curve

 

The Lorenz curve is a rotated ROC curve; in other words, it is a plot of the cumulative percentage of cases selected by the model against the cumulative percentage of actual positive events. As with the ROC curve, the area between the curve and the 45-degree line (the Gini coefficient) is used to measure the fit of a model. The higher the coefficient, the better fitting the model.
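Because the Gini coefficient equals 2*AUC - 1, it can be computed from the pairwise-ranking definition of the area under the ROC curve. A small sketch (Python for illustration; the labels and scores are made up):

```python
def auc(labels, scores):
    # Probability that a randomly chosen positive outscores a randomly
    # chosen negative (ties count half) -- the area under the ROC curve.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
a = auc(labels, scores)   # 0.75: one positive/negative pair is misordered
gini = 2 * a - 1          # 0.5
```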

 

 

d) Confusion Matrix

 

   

The confusion matrix shows the actual versus forecasted outcomes for a binary or categorical process.

 

              Predicted
              Yes    No
Actual  Yes    a      b
        No     c      d

a: The number of times the model predicted Yes and the outcome was Yes.

b: The number of times the model predicted No and the outcome was Yes.

c: The number of times the model predicted Yes and the outcome was No.

d: The number of times the model predicted No and the outcome was No.
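From the four cells, the standard summary rates follow directly (illustrative Python with invented counts):

```python
def confusion_metrics(a, b, c, d):
    # a: predicted Yes, actual Yes   b: predicted No, actual Yes
    # c: predicted Yes, actual No    d: predicted No, actual No
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "sensitivity": a / (a + b),   # true positive rate
        "specificity": d / (c + d),   # true negative rate
    }

print(confusion_metrics(40, 10, 20, 30))
```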

 

 

e) Profit curve

 

Shows what the model's expected return would be if it were used in production: profit plotted against model score.