Data Integrity

1. Intro

Before you can run a model you need to know the data. I know of forecasters who avoid, at all cost, examination of the data. Sometimes this is due to laziness or lack of ability or the mistaken belief such a task is beneath them. Collecting, researching, cleaning and manipulating data is the backbone of any model. Eighty percent of a model is complete when a properly built flat file is ready for the gauntlet of forecasting algorithms to consume. This should not be the least rewarding part of model building. You should learn to enjoy it; and there is a lot to enjoy, the biggest finds are made here.  If you ignore data analysis, you may end up building a model explaining why performance of a product was poor in some states only to learn in a meeting that the data was incomplete in those states.

2. Accessing The Data

There are many tools for accessing and manipulating data. This is not a good thing. It can become a serious issue because everyone has his or her favorite method, language, and file format. The only certainty is, nothing will work well with one another. A tremendous amount of time and effort can be spent getting all the various data combined to form the flat file that will be used in the modeling. A flat file is what is required for most statistical procedures. Each row in a flat file represents one case and all the rows have the same columns.  There are some AI methods that can use relational tables which I go over a few in the Data Mining sections. At the beginning of a project it is important to catalogue what file formats, languages and databases are being used and who are the experts in those areas.

a. SQL

The most common data access method is structured query language (SQL). This language is designed to allow easy access to relational data. If you are familiar with set theory or Venn diagrams SQL will seem very intuitive. In the programming section I go over in details how to build queries. SQL is typically associated with a particular Relational Database Server.

Common Relational Database Servers

1. Oracle

2. MS SQL Server

3. DB2

5. MySQL

b. SAS

Statistical Analysis System (SAS) has been around longer than SQL. It is a powerful language that can even use SQL.  One of its strongest features is the ability to access data in a variety of formats.  SAS excels at what SQL excels at, manipulating data.  There is virtually no imaginable end form for a flat file that SAS cannot produce. Again, further discussion of SAS is in the programming section.

c. SPSS

SPSS predates SAS.  It has limited ability to manipulate and combine data but can import a variety of different sources.

d. Excel

Excel is used by many forecasters.  I find it difficult to merge relational data with Excel but it is powerful for manipulating flat files via combining and weighting columns.

3. Communication

The most important step in build a good data is communication. You need to talk to the domain experts of the data. There are many nonlinarities and measurement errors in the data lurking to spoil any well-planned model.  There could be regime changes, changes in the definitions of a variable and numerous other irregularities that is impossible to uncover without talking to someone.  This may be the hardest and scariest part of the job but actually talking to a human is an essential first step in building a good model.

4. Examine the Data

Inspection of the data can save a lot of time an effort of the model builder. You do not want to build a model explaining, completely erroneously, that small towns are a high risk when no data was collected there.

Methods:

a. Frequencies, mean, mode min, max, var

b. Compare with previous data is available

c. Plots and more plots

 Common Data Analysis

What to look for Method  Example SQL/SAS
 Assure all data was collected. Determine sample size. Sample Size
SQL

 Select count(*) from my MyTable

Look for any extreme values, outliers, missing values, variations, truncation and censoring within data.

 

Mean
SQL

Select avg(myField) from MyTable

Mode
SQL

Select mode(myField) from MyTable

Variance
SQL

Select Var(myField) from MyTable

Max
SQL

Select max(myField) from MyTable

Min
SQL

Select min(myField) from MyTable

Count Nulls
SQL

Select count(*) from my MyTable where my field is null

Does the data make sense?
Do you have the correct field definitions?
Are there known values in the category fields?
Frequencies

SAS

Proc Contents Data = MyTable;

Run;

Proc Freq Data = MyTable;

Tables MyField;

Run;

Does the data look right? Plots

SAS

Proc plot Data = MyTable;

Plot MyField;

Run;

*in Oracle only, for MS SQL Server you must write specialized code to get the mode.

5. Creating a flat file.

I go over creating a flat file in the programming section under SQL.

6. Security

Do not under-estimate the importance of security.  Security is a growing issue with personal data.  There have been a number of breaches and consequences of those lapses in security are years of aggravation and lost income for those affected.  As the information age marches onward and disk space increases a statistician s laptop could contain a considerable amount of private information. Always follow the best practices of your firm but try and go beyond them.  When using data asked yourself whether you need any personal identifiers.  Personal identifiers will seldom be an independent variable and unless you are doing data matching latter, never pull it into your dataset.  Examples of personal identifiers: Social Security number, financial account number, driver’s licenses, first and last name with address, first and last name with date of birth, vehicle plate numbers.  There are times when patterns in number can indicate fraud but these can be coded separately then latter brought back into your working set.  Treat other people’s data as if were your own because be assured there is some statistician out there that does have yours.