Data Management Quick Comparison 2

The following article provides a comparison between BASH and JAQL for basic data management and manipulation. 

 

Selecting – All Fields

BASH

more mydata.dat

JAQL

read(hdfs("Customers.dat"));

– One Field

BASH

cut -c 13-22 mydata.dat 

JAQL

$Customers = read(hdfs("Customers.dat"));
$Customers -> transform { name: $.name };

– Subset

BASH

less mydata.dat | awk '{if (substr($0,13,10) == "2786586641") print $0}'

JAQL

$Customers = read(hdfs("Customers.dat"));
$Customers -> filter $.age == 18 -> transform { name: $.name };

Sorting

BASH

sort -n -k 2 mydata.dat

JAQL

$Customers = read(hdfs("Customers.dat"));
$Customers -> sort by [$.age] -> transform { name: $.name };

Join

BASH

join -1 2 -2 1 mydata_2.dat mydata_lkup.dat | less
or (if no unmatched values and both files are sorted)
paste mydata_2.dat mydata_lkup.dat

JAQL

$CustomersPurchases = join $Customers, $PurchaseOrder where $Customers.CustomerId == $PurchaseOrder.CustomerId into {$Customers.name, $PurchaseOrder.*};

Sample

BASH

more mydata.dat| head -10

JAQL

$Customers = read(hdfs("Customers.dat"));
$Customers -> top(2);

Aggregate Analysis

BASH

awk 'BEGIN { FS=OFS=SUBSEP=" " }{arr[$2,$3]++ }END {for (i in arr) print i,arr[i]}' mydata_lkup.dat

JAQL

$CustomersPurchases -> group by $Customers_group = $.CustomerId into
{$Customers_group, total: count($[*].POID)};

Unique

BASH

less mydata_lkup.dat | cut -c 12 | sort | uniq

JAQL

$CustomersPurchases ->group by $Customers_group = $.CustomerId into
{$Customers_group };

JSON

JSON (JavaScript Object Notation) is a data format that is non-proprietary, human-readable, easy to parse, simple to generate, and widely adopted. You can read more about JSON at the project website: http://www.json.org/. Some of you might be saying: wait, there is another data format with the same properties, namely XML, so why bother with JSON? For me there are two reasons JSON is superior to XML for data. First, it is designed for data, so it handles things such as arrays much more elegantly than XML. There are flavors of XML better suited to data than the generic one, but on the whole XML is trying to provide a flexible structure for describing complex metadata, not to be a simple format for data. Second, there is only one type of JSON, whereas XML comes in endless varieties. XML's flexibility is both its asset and its liability: because it is flexible it can be used in many ways, but confusion often arises about exactly how to consume and generate the data, which is one of the things XML was supposed to solve. When transferring data I do not like ambiguity; I want strict conventions.

A great article about XML versus JSON, with good comments, can be found here: http://ajaxian.com/archives/json-vs-xml-the-debate.

But I use JSON not because it is better for data (there is always something better) but mainly because a lot of the data I want to consume uses the JSON format, and you can load it directly into R: http://cran.r-project.org/web/packages/rjson/index.html

Below is an example of a JSON formatted dataset:

[{
"CustomerId" : 1,
"age" : 23,
"name" : "Joe"
},{
"CustomerId" : 2,
"age" : 45,
"name" : "Mika"
},{
"CustomerId" : 3,
"age" : 34,
"name" : "Lin"
}]

You can see that it is human-readable. Curly braces {} enclose each record (row), and each field is expressed as "ColName" : value.
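If you are working in Python rather than R, the same structure parses directly with the standard json module; a minimal sketch using the sample records above:

import json

# The sample dataset from above as a JSON string
customers_json = '''
[{"CustomerId": 1, "age": 23, "name": "Joe"},
 {"CustomerId": 2, "age": 45, "name": "Mika"},
 {"CustomerId": 3, "age": 34, "name": "Lin"}]
'''

# Parse the text into a list of dictionaries, one per record
customers = json.loads(customers_json)

for row in customers:
    print(row["CustomerId"], row["name"], row["age"])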

Depersonalization

Depersonalization of data is a growing issue for modelers as privacy concerns about consumers' data increase. It is often necessary to de-associate personal identifiers from datasets or take other precautions to assure the anonymity of the individuals studied. This is difficult because many fields we use in modeling, such as gender, date of birth, and zip code, can themselves be used to identify individuals. A study by Latanya Sweeney showed that gender, date of birth, and zip code can uniquely identify 85% of the US population. To meet privacy concerns, removing driver's license number, Social Security number, and full name is often not enough.

 

Here is an example: you are given two datasets, one containing a demographic profile of each individual and the results from a medical study, and the other containing full name, address, and date of birth. The concern is that you do not want someone to uniquely identify individuals across these datasets. As mentioned above, if both datasets contain gender, date of birth, and home zip code, you can identify individuals with 85% accuracy; in that case there has been no depersonalization. If age had replaced date of birth in the study dataset, one-to-one identification across the datasets would not have been so easily achievable.

Concept: K-anonymization

K-anonymization lets you talk about the degree to which one dataset can be related to another. It is not the only measure of depersonalization and it has some issues (finding an optimal k-anonymization is NP-hard), but it is an important concept to understand. If each record in one dataset can be matched to k records in another dataset, that dataset is said to be (k-1)-anonymous. For example, if you can uniquely match each record in two datasets (one-to-one matching), k-anonymity is zero. If, however, many records can match a given record, k-anonymity is greater than zero. A large value of k indicates a greater degree of depersonalization of the study dataset. When calculating the value you use a full-information dataset and the study dataset that requires depersonalization.
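As a rough illustration (not a formal k-anonymization algorithm), you can get a feel for how identifiable a study dataset is by counting how many records share each combination of quasi-identifiers such as gender, date of birth, and zip code. A minimal Python sketch, with made-up records and field names:

from collections import Counter

# Hypothetical study records containing only quasi-identifiers
study = [
    {"gender": "F", "dob": "1990-02-14", "zip": "60614"},
    {"gender": "F", "dob": "1990-02-14", "zip": "60614"},
    {"gender": "M", "dob": "1985-07-01", "zip": "60622"},
]

# Count the records sharing each combination of quasi-identifiers
groups = Counter((r["gender"], r["dob"], r["zip"]) for r in study)

# The smallest group size drives identifiability: a group of one means that
# record is unique on these fields and could be matched one-to-one elsewhere
print("smallest group size:", min(groups.values()))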

 

Further Reading

L. Sweeney, Uniqueness of Simple Demographics in the U.S. (2002), Carnegie Mellon University, Laboratory for International Data Privacy. http://privacy.cs.cmu.edu/courses/pad1/lectures/identifiability.pdf

http://reports-archive.adm.cs.cmu.edu/anon/isri2006/CMU-ISRI-06-105.pdf

http://lorrie.cranor.org/courses/fa04/malin_slides.pdf

Python Data

Python is a scripting language with a heavy emphasis on code reuse and simplicity. It is a popular language with a large and active user group. Like Perl, it has a rich library set; it is popular for projects like MapReduce with Hadoop and is rapidly becoming the default language in areas like computational intelligence. One interesting feature of the Python language is that indentation acts as the block delimiter, so when copying code do not change the indentation or the code will not operate as intended. It is also a powerful language, meaning it can do a lot with very little code. Below is an example of a database containing two tables, Customer and PurchaseOrder. The Customer table has CustomerID (the unique identifier for the customer), Name (the customer's name), and Age (the customer's age). The PurchaseOrder table has POID (the unique id for the purchase order), CustomerID (which refers back to the customer who made the purchase), and Purchase (what the purchase was).

Example:
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
PurchaseOrder
POID CustomerID Purchase
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction
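The scripts that follow split each record on a comma, so for these examples assume customer.txt (and orders.txt for the joins) are stored as comma-separated files with a header row, for example:

customer.txt:
CustomerID,Name,Age
1,Joe,23
2,Mika,45
3,Lin,34
4,Sara,56
5,Susan,18

orders.txt:
POID,CustomerID,Purchase
1,3,Fiction
2,1,Biography
3,1,Fiction
4,2,Biography
5,3,Fiction
6,4,Fiction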
 
     
SELECT
Create the file select.py with the following code:

import sys
import fileinput

# Loop through the file and print each line
for line in fileinput.input(sys.argv[1]):
    print(line, end="")

Then run the code:
python select.py customer.txt

CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
ORDER BY
Create the file orderby.py with the following code:

import sys
import fileinput

# Initialize variables
listLines = []

# Load the file into a list
for line in fileinput.input(sys.argv[1]):
    listLines.append(line)

# Custom sort key based on the last (age) field
def getCustId(line):
    return line.split(",", 2)[-1]

# Sort the list
listLines.sort(key=getCustId)

# Print the sorted lines
for object in listLines:
    print(object, end="")

Then run the code:
python orderby.py customer.txt

CustomerID Name Age
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56
WHERE
Create the file select_by_id.py with the following code:
import sys
import fileinput

# Loop through the file
for line in fileinput.input(sys.argv[1]):

    # Split the line using a comma
    tokens = line.split(",")

    # If the ID matches the passed ID then print
    if tokens[0] == sys.argv[2]:
        print(line, end="")

Then run the code:
python select_by_id.py customer.txt 1

1 Joe 23
INNER JOIN
Create the file innerjoin.py with the following code:

import sys
import fileinput

# Initialize variables
listB = []

# Load the second file into a list to loop through
for lineB in fileinput.input(sys.argv[2]):
    listB.append(lineB)

# Loop through the first file
for lineA in fileinput.input(sys.argv[1]):

    # Split the line using a comma
    tokensA = lineA.split(",")

    # Loop through the list
    for object in listB:

        # Split the line using a comma
        tokensB = object.split(",")

        # If there is a match print
        if tokensA[0] == tokensB[1]:

            # Remove the newline character with the strip method
            print(lineA.strip() + "," + object, end="")

Then run the code:
python innerjoin.py customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
LEFT OUTER JOIN
Create the file leftouterjoin.py with the following code:

import sys
import fileinput

# Initialize variables
listB = []
iFound = 0

# Load the second file into a list to loop through
for lineB in fileinput.input(sys.argv[2]):
    listB.append(lineB)

# Loop through the first file
for lineA in fileinput.input(sys.argv[1]):

    # Split the line using a comma
    tokensA = lineA.split(",")
    iFound = 0

    # Loop through the list
    for object in listB:

        # Split the line using a comma
        tokensB = object.split(",")

        # If there is a match print
        if tokensA[0] == tokensB[1]:

            # Remove the newline character with the strip method
            print(lineA.strip() + "," + object, end="")
            iFound = 1

    # If there was no match print the line with empty order fields
    if iFound == 0:
        print(lineA.strip() + ",NULL,NULL")

Then run the code:
python leftouterjoin.py customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NULL NULL
GROUP BY
Create the file groupby.py with the following code:

import sys
import fileinput

# Initialize variables
iCnt = 1
iLoop = 0

# Load and loop through the file
for lineA in fileinput.input(sys.argv[1]):

    # Split the line using a comma
    tokensA = lineA.split(",")

    # Adjust for the header and the first line
    if iLoop < 2:
        priorTokens = tokensA
        iCnt = 0

    if tokensA[0] == priorTokens[0]:
        iCnt = iCnt + 1
    else:
        print(priorTokens[1] + "," + priorTokens[2].strip() + "," + str(iCnt))
        iCnt = 1

    iLoop = iLoop + 1
    priorTokens = tokensA

# Print the last group
print(priorTokens[1] + "," + priorTokens[2].strip() + "," + str(iCnt))

Then run the code:
python groupby.py customer.txt

Name Age Orders
Joe 23 1
Mika 45 2
Lin 34 2
Sara 56 1
UPDATE
Create the file update.py with the following code:

import sys
import fileinput

# Loop through the file
for line in fileinput.input(sys.argv[1]):

    # Split the line using a comma
    tokens = line.split(",")

    # If the ID is not the passed ID then print, otherwise replace the age with the passed parameter
    if tokens[0] != sys.argv[2]:
        print(line, end="")
    else:
        print(tokens[0] + "," + tokens[1] + "," + sys.argv[3])

Then run the code:
python update.py customer.txt 1 26

Customer
CustomerID Name Age
1 Joe 26
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
INSERT
Create the file insert.py with the following code:

import sys
import fileinput

# Loop through the file and print each line
for line in fileinput.input(sys.argv[1]):
    print(line, end="")

# Add a new line from the passed arguments
print(sys.argv[2] + "," + sys.argv[3] + "," + sys.argv[4])

Then run the code:
python insert.py customer.txt 6 Terry 50

Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50
DELETE
Create the file delete.py with the following code:

import sys
import fileinput

# Loop through the file
for line in fileinput.input(sys.argv[1]):

    # Split the line using a comma
    tokens = line.split(",")

    # If the ID is not the passed ID then print
    if tokens[0] != sys.argv[2]:
        print(line, end="")

Then run the code:
python delete.py customer.txt 1

Customer
CustomerID Name Age
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

Unix/Linux Data Management

Data management using Unix/Linux is easy, but it does have a few quirks. First, the header (the first row, which contains the field names) is typically not stored in the data file, so a schema or layout of the data files is required; most file operations do not treat the first line differently from the rest of the file. Also, complex multi-table queries can only be achieved in multiple steps, unlike in SQL and SAS. But the speed and efficiency of the code make Unix/Linux a strong data management tool.

 
Example:
customer.dat
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
purchaseorder.dat
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction
 
     

  This data would be stored without column names.

SELECT

less customer.dat
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

ORDER BY

less customer.dat | sort -n -k 3
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56

WHERE

less customer.dat | awk '{if (substr($0,3,5) == "Susan") print $0}'
5 Susan 18

INNER JOIN

sort -k 2 purchaseorder.dat > srt_purchaseorder.dat
join -1 1 -2 2 customer.dat srt_purchaseorder.dat
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction

LEFT OUTER JOIN

sort -k 2 purchaseorder.dat > srt_purchaseorder.dat
join -a1 -1 1 -2 2 customer.dat srt_purchaseorder.dat
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NULL NULL

GROUP BY

join  -1 1 -2 2 customer.dat srt_purchaseorder.dat > sum.dat
awk 'BEGIN { FS=OFS=SUBSEP=" " }{arr[$2,$3]++ }END {for (i in arr) print i,arr[i]}' sum.dat
Joe 23 1
Mika 45 2
Lin 34 2
Sara 56 1

UPDATE

less customer.dat | awk '{if (substr($0,3,3) == "Joe") print substr($0,1,6) "26"; else print $0}'
1 Joe 26
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

INSERT

cat customer.dat new_cust.dat
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50

DELETE

less customer.dat | awk '{if (substr($0,1,1) != "1") print $0}'
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

 

Perl Data


Perl is an established language with lots of code examples available on the Web. It is a powerful language (meaning you can do a lot with very little code). Perl has a rich library set, especially in regard to text processing. The language can take some getting used to because of its heavy use of control characters. For example, '.' means concatenate strings, and the leading character (@, $, & and %) indicates the data type being accessed (e.g. $foobar is a scalar while @foobar is an array). Below is an example of a database containing two tables, Customer and PurchaseOrder. The Customer table has CustomerID (the unique identifier for the customer), Name (the customer's name) and Age (the customer's age). The PurchaseOrder table has POID (the unique id for the purchase order), CustomerID (which refers back to the customer who made the purchase) and Purchase (what the purchase was).

Example:
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
PurchaseOrder
POID CustomerID Purchase
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction
 
     
SELECT
Create the file select.pl with the following code:

#!/usr/bin/perl -w

my @file_lines;

# Loop through the data and print each line
while ( my $line = <> ) {

    # Print the line
    print $line;

    # Save the line for later use
    push @file_lines, $line;
}

Then run the code:
perl select.pl customer.txt

CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
ORDER BY
Create the file orderby.pl with the following code:

#!/usr/bin/perl -w

# Open the file
open(DATA, $ARGV[0]);

my $line;
my $iCnt;
my $iLoop;

# Initialize variables
$iCnt = 0;
$iLoop = 0;

# Loop through the data and populate an array for later use
while($line = <DATA>)
{

    # Copy the data to an array, skipping the header (the first line with the field names)
    if ($iCnt > 0)
    {
        $lines[$iCnt-1] = $line;
    }

    $iCnt ++;
}

# Close the file handle
close DATA;

# Sort using a custom comparison based on the age field
@sorted = sort{ @PANa = split(/,/, $a); @PANb = split(/,/, $b); return $PANa[2] cmp $PANb[2] } (@lines);

# Loop through the data and print the lines
while($iLoop < $iCnt-1)
{
    print $sorted[$iLoop];
    $iLoop ++;
}

Then run the code:
perl orderby.pl customer.txt

CustomerID Name Age
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56
WHERE
Create the file select_by_id.pl with the following code:

#!/usr/bin/perl -w

my @file_lines;

# Take the ID to look for from the last argument so <> reads only the file
my $id = pop @ARGV;

# Loop through the passed file
while ( my $line = <> ) {

    # Split the line using a comma
    @PAN = split(/,/, $line);
    $out = $PAN[0];

    # If the ID is the passed ID then print
    if ($out eq $id)
    {
        print $line;
    }

    # Save the line for later use
    push @file_lines, $line;
}

Then run the code:
perl select_by_id.pl customer.txt 1

1 Joe 23
INNER JOIN
Create the file innerjoin.pl with the following code:

#!/usr/bin/perl -w

# Open both files
open(DATA1, $ARGV[0]);
open(DATA2, $ARGV[1]);

my $lineCust;
my $lineOrder;
my $iCnt;
my $iLoop;

# Initialize variables
$iCnt = 0;

# Loop through the order data and populate an array for later use
while($lineOrder = <DATA2>)
{
    $orders[$iCnt] = $lineOrder;
    $iCnt ++;
}

# Close the file handle
close DATA2;

# Loop through the customer data
while($lineCust = <DATA1>)
{

    # Split the line using a comma
    @PAN1 = split(/,/, $lineCust);
    $out1 = $PAN1[0];

    # Remove the newline character
    chomp($PAN1[2]);

    $iLoop = 0;

    while($iLoop < $iCnt)
    {

        # Split the line using a comma
        @PAN2 = split(/,/, $orders[$iLoop]);
        $out2 = $PAN2[1];

        # If there is a match print
        if ($out1 eq $out2)
        {
            print $PAN1[0] . "," . $PAN1[1] . "," . $PAN1[2] . "," . $orders[$iLoop];
        }

        $iLoop ++;
    }
}

# Close the file handle
close DATA1;

Then run the code:
perl innerjoin.pl customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
LEFT OUTER JOIN
Create the file leftouterjoin.pl with the following code:

#!/usr/bin/perl -w

# Open both files
open(DATA1, $ARGV[0]);
open(DATA2, $ARGV[1]);

my $lineCust;
my $lineOrder;
my $iCnt;
my $iLoop;
my $iFound;

# Initialize variables
$iCnt = 0;

# Loop through the order data and populate an array for later use
while($lineOrder = <DATA2>)
{
    $orders[$iCnt] = $lineOrder;
    $iCnt ++;
}

# Close the file handle
close DATA2;

# Loop through the customer data
while($lineCust = <DATA1>)
{

    # Split the line using a comma
    @PAN1 = split(/,/, $lineCust);
    $out1 = $PAN1[0];

    # Remove the newline character
    chomp($PAN1[2]);

    $iLoop = 0;
    $iFound = 0;

    while($iLoop < $iCnt)
    {

        # Split the line using a comma
        @PAN2 = split(/,/, $orders[$iLoop]);
        $out2 = $PAN2[1];

        # If there is a match print
        if ($out1 eq $out2)
        {
            print $PAN1[0] . "," . $PAN1[1] . "," . $PAN1[2] . "," . $orders[$iLoop];
            $iFound = 1;
        }

        $iLoop ++;
    }

    # If there was no match print the line with empty order fields
    if ($iFound == 0)
    {
        print $PAN1[0] . "," . $PAN1[1] . "," . $PAN1[2] . ",NULL,NULL\n";
    }
}

# Close the file handle
close DATA1;

Then run the code:
perl leftouterjoin.pl customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NULL NULL
GROUP BY
Create the file groupby.pl with the following code:

#!/usr/bin/perl -w

# Open the file
open(DATA1, $ARGV[0]);

my $line;
my $iCnt;
my $iLoop;

# Initialize variables
$iCnt = 1;
$iLoop = 0;

# Loop through the data
while($line = <DATA1>)
{

    # Split the line using a comma
    @PAN1 = split(/,/, $line);
    $id = $PAN1[0];
    $name = $PAN1[1];
    $age = $PAN1[2];

    # Remove the newline character
    chomp($age);

    # Adjust for the header and the first line
    if ($iLoop < 2)
    {
        $prior_id = $id;
        $iCnt = 0;
    }

    if ($prior_id eq $id)
    {
        $iCnt ++;
    }
    else
    {
        print $prior_name . "," . $prior_age . "," . $iCnt . "\n";
        $iCnt = 1;
    }

    $iLoop ++;
    $prior_id = $id;
    $prior_name = $name;
    $prior_age = $age;
}

# Print the last group
print $prior_name . "," . $prior_age . "," . $iCnt . "\n";

# Close the file handle
close DATA1;

Then run the code:
perl groupby.pl customer.txt

Name Age Orders
Joe 23 1
Mika 45 2
Lin 34 2
Sara 56 1
UPDATE
Create the file update.pl with the following code:

#!/usr/bin/perl -w

# Open the file
open(DATA, $ARGV[0]);

# Loop through the data
while(<DATA>)
{

    # Split the line using a comma
    @PAN = split(/,/, $_);
    $out = $PAN[0];

    # Print if the ID is not equal to the passed ID, otherwise replace the age
    if ($out eq $ARGV[1])
    {
        print $PAN[0] . "," . $PAN[1] . "," . $ARGV[2] . "\n";
    }
    else
    {
        print $_;
    }
}

# Close the file handle
close DATA;

Then run the code:
perl update.pl customer.txt 1 26

Customer
CustomerID Name Age
1 Joe 26
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
INSERT
Create the file insert.pl with the following code:

#!/usr/bin/perl -w

# Open the file
open(DATA, $ARGV[0]);

# Loop through the data and print each line
while(<DATA>)
{
    print $_;
}

# Add a new line from the passed arguments
print $ARGV[1] . "," . $ARGV[2] . "," . $ARGV[3] . "\n";

# Close the file handle
close DATA;

Then run the code:
perl insert.pl customer.txt 6 Terry 50

Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50
DELETE
Create the file delete.pl with the following code:

#!/usr/bin/perl -w

# Open the file
open(DATA, $ARGV[0]);

# Loop through the data
while(<DATA>)
{

    # Split the line using a comma
    @PAN = split(/,/, $_);
    $out = $PAN[0];

    # If the ID is not the passed ID then print
    if ($out ne $ARGV[1])
    {
        print $_;
    }
}

# Close the file handle
close DATA;

Then run the code:
perl delete.pl customer.txt 1

Customer
CustomerID Name Age
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

Data Management Quick Comparison 1

The following provides a comparison between SQL and SAS for basic data management and manipulation.

Selecting – All Fields

SQL

Select * from mydata;

SAS

Proc print data = mydata;

– One Field

SQL

Select col1 from mydata

SAS

Proc print data = mydata (keep = col1);

– Subset

SQL

Select * from mydata where col2 = 2786586641

SAS

Proc print data = mydata (where = (col2 = 2786586641));

Sorting

SQL

select * from mydata order by col2

SAS

Proc sort data = mydata; by col2;

Join

SQL

Select * from mydata_2 as mydt inner join mydata_lkup lkup on lkup.col1 = mydt.col2

SAS

Proc sort data = mydata; 
    by col2;
Proc sort data = mydata_lkup; 
    by col1; 

data tmp;
     merge mydata (in = _mydt rename = (col2 = _lnk)) 
        mydata_lkup (in = _lkup rename = (col1 = _lnk));
             by _lnk;
run; 

Sample

SQL

Select top 10 * from mydata

SAS

Proc print data = mydata (obs =10);

Aggregate Analysis

SQL

Select col2, count(*) from mydata_lkup group by col2

SAS

proc freq data = mydata_lkup;

tables col2;
run;

______or_________________

Proc sort data = mydata_lkup;
By col2;

Data agg;
    set mydata_lkup;
       by col2;
         retain count;
     If first.col2 eq 1 then do;
         Count = 0;
    end;

Count = count + 1;

If last.col2;
     keep col2 count;
run;

Unique

SQL

Select distinct col2 from mydata_lkup;

SAS

Proc sort data = mydata_lkup nodupkey;
By col2;

Data Integrity

1. Intro

Before you can run a model you need to know the data. I know of forecasters who avoid, at all costs, examination of the data. Sometimes this is due to laziness, lack of ability, or the mistaken belief that such a task is beneath them. Collecting, researching, cleaning, and manipulating data is the backbone of any model. Eighty percent of a model is complete when a properly built flat file is ready for the gauntlet of forecasting algorithms to consume. This should not be the least rewarding part of model building. You should learn to enjoy it; there is a lot to enjoy, and the biggest finds are made here. If you ignore data analysis, you may end up building a model explaining why performance of a product was poor in some states, only to learn in a meeting that the data was incomplete in those states.

2. Accessing The Data

There are many tools for accessing and manipulating data. This is not a good thing. It can become a serious issue because everyone has his or her favorite method, language, and file format, and the only certainty is that nothing will work well with anything else. A tremendous amount of time and effort can be spent combining the various data sources to form the flat file that will be used in the modeling. A flat file is what is required for most statistical procedures: each row represents one case and all rows have the same columns. There are some AI methods that can use relational tables directly; I go over a few in the Data Mining sections. At the beginning of a project it is important to catalogue what file formats, languages, and databases are being used and who the experts in those areas are.
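For example, a flat file built from the Customer and PurchaseOrder tables used elsewhere in this text might collapse the orders into a count column, giving one row per customer:

CustomerID,Name,Age,OrderCount
1,Joe,23,2
2,Mika,45,1
3,Lin,34,2
4,Sara,56,1
5,Susan,18,0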

a. SQL

The most common data access method is Structured Query Language (SQL). This language is designed to allow easy access to relational data. If you are familiar with set theory or Venn diagrams, SQL will seem very intuitive. In the programming section I go over in detail how to build queries. SQL is typically associated with a particular relational database server.
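As a small illustration, here is a basic query issued through Python's built-in sqlite3 module (any relational database server accepts similar SQL); the table and rows echo the Customer example used elsewhere in this text:

import sqlite3

# In-memory database purely for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerID INTEGER, Name TEXT, Age INTEGER)")
conn.executemany("INSERT INTO Customer VALUES (?, ?, ?)",
                 [(1, "Joe", 23), (2, "Mika", 45), (3, "Lin", 34)])

# A simple SELECT with a WHERE clause
for row in conn.execute("SELECT Name, Age FROM Customer WHERE Age > 30"):
    print(row)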

Common Relational Database Servers

1. Oracle

2. MS SQL Server

3. DB2

4. MySQL

b. SAS

Statistical Analysis System (SAS) has been around longer than SQL. It is a powerful language that can even use SQL. One of its strongest features is the ability to access data in a variety of formats. SAS excels at what SQL excels at: manipulating data. There is virtually no imaginable end form for a flat file that SAS cannot produce. Again, further discussion of SAS is in the programming section.

c. SPSS

SPSS predates SAS. It has limited ability to manipulate and combine data but can import from a variety of different sources.

d. Excel

Excel is used by many forecasters. I find it difficult to merge relational data with Excel, but it is powerful for manipulating flat files by combining and weighting columns.

3. Communication

The most important step in building a good dataset is communication. You need to talk to the domain experts for the data. There are many nonlinearities and measurement errors lurking in the data, waiting to spoil any well-planned model. There could be regime changes, changes in the definition of a variable, and numerous other irregularities that are impossible to uncover without talking to someone. This may be the hardest and scariest part of the job, but actually talking to a human is an essential first step in building a good model.

4. Examine the Data

Inspection of the data can save the model builder a lot of time and effort. You do not want to build a model explaining, completely erroneously, that small towns are high risk when no data was collected there. A sketch of the basic checks in Python follows the list below.

Methods:

a. Frequencies, mean, mode min, max, var

b. Compare with previous data is available

c. Plots and more plots
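A minimal sketch of these checks using only the Python standard library, assuming a comma-separated customer file with a header row (CustomerID,Name,Age):

import csv
import statistics

# Read the file into a list of dictionaries keyed by the header row
with open("customer.txt") as f:
    rows = list(csv.DictReader(f))

ages = [int(r["Age"]) for r in rows if r["Age"] not in ("", None)]

print("sample size:", len(rows))
print("mean:", statistics.mean(ages))
print("mode:", statistics.mode(ages))   # raises an error before Python 3.8 if no value repeats
print("variance:", statistics.variance(ages))
print("min/max:", min(ages), max(ages))
print("missing ages:", len(rows) - len(ages))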

Common Data Analysis

What to look for: Assure all data was collected; determine the sample size.
Method: Sample Size
SQL

Select count(*) from MyTable

What to look for: Extreme values, outliers, missing values, variations, truncation, and censoring within the data.
Method: Mean
SQL

Select avg(myField) from MyTable

Method: Mode*
SQL

Select mode(myField) from MyTable

Method: Variance
SQL

Select var(myField) from MyTable

Method: Max
SQL

Select max(myField) from MyTable

Method: Min
SQL

Select min(myField) from MyTable

Method: Count Nulls
SQL

Select count(*) from MyTable where myField is null

What to look for: Does the data make sense? Do you have the correct field definitions? Are there only known values in the category fields?
Method: Frequencies
SAS

Proc Contents Data = MyTable;
Run;

Proc Freq Data = MyTable;
Tables MyField;
Run;

What to look for: Does the data look right?
Method: Plots
SAS

Proc Plot Data = MyTable;
Plot MyField;
Run;

* mode() is available in Oracle only; for MS SQL Server you must write specialized code to get the mode.

5. Creating a flat file.

I go over creating a flat file in the programming section under SQL.

6. Security

Do not underestimate the importance of security. Security is a growing issue with personal data. There have been a number of breaches, and the consequences of those lapses in security are years of aggravation and lost income for those affected. As the information age marches onward and disk space increases, a statistician's laptop can contain a considerable amount of private information. Always follow the best practices of your firm, but try to go beyond them. When using data, ask yourself whether you need any personal identifiers. A personal identifier will seldom be an independent variable, and unless you are doing data matching later, never pull it into your dataset. Examples of personal identifiers: Social Security number, financial account numbers, driver's license number, first and last name with address, first and last name with date of birth, vehicle plate numbers. There are times when patterns in a number can indicate fraud, but these can be coded separately and later brought back into your working set. Treat other people's data as if it were your own, because you can be assured there is some statistician out there who has yours.
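One practical pattern, sketched here with purely illustrative names and values, is to replace a personal identifier with a salted one-way hash before the data reaches your working dataset; records can still be matched later by applying the same salt to the other file:

import hashlib

SALT = "project-specific-secret"  # illustrative only; keep the real salt out of the dataset

def surrogate_key(personal_id):
    # One-way hash so the raw identifier never appears in the modeling data
    return hashlib.sha256((SALT + personal_id).encode("utf-8")).hexdigest()

print(surrogate_key("123-45-6789"))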