Sorting in Hadoop with Itertools

Previously, I gave a simple example of a Python map/reduce program. In this example I will show you how to sort the data within a key using itertools, a Python standard-library module that provides fast, memory-efficient looping tools. This example is an extension of Michael Noll's detailed Hadoop example. The data in this example is comma-delimited.

AccountId,SequenceId,Amount,Product
0123,5,4.60,Product5
0123,3,6.34,Product3
0321,3,2.35,Product3

Notice the data is sorted by AccountId, but within a given account it is not sorted by SequenceId. One important thing to remember is that Hadoop does not preserve the order of the data when passing it to the nodes that run the reduce step. The map step groups the data by the specified key, but even if the data is sorted correctly on the server, that order is not guaranteed to survive into the reduce. If your analysis requires the data to be sorted, not just grouped by key, this must be handled in the reduce step. Now suppose you need to know the ordered sequence of products purchased for a given account. For clarity, each product in the sample data is numbered to match its SequenceId. While this is achievable with the Python base library alone, itertools provides functions that make the code quite elegant. First, let's create a simple map step:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    # Split the line by the delimiter
    aline = line.split(',')
    # Choose the first field as the map key
    key = aline[0]
    # Output the map key followed by a tab and the rest of the data
    print('%s\t%s' % (key, line.strip()))

This defines AccountId as the key. Remember, including the SequenceId in the key will not help you: the Hadoop server would then distribute the records across the nodes by that composite key, and a given node would not see all of an account's transactions.

Now, here is the reduce step,

#!/usr/bin/env python
import sys
from itertools import groupby
from operator import itemgetter

First, we define how to parse the input data stream. Notice how simple it is to handle two different delimiters: the tab that defines the key for the Hadoop server, and the comma that separates the data fields.

def read_mapper_output(file, sortBy, sepA, sepB):
    for line in file:
        key1, line1 = line.rstrip().split(sepA, 1)
        aline = line.split(sepB)
        # Create a second key to sort on within the group
        key2 = aline[sortBy]
        yield key1, key2, line1

The yield statement makes this function a generator, and each yielded tuple holds the account number, the sequence ID, and the full data line.

def main(separator='\t'):
    # Define the iterator
    data = read_mapper_output(sys.stdin, 1, '\t', ',')
    # Loop through the iterator grouped by the first key
    for grpKey, grpDt in groupby(data, itemgetter(0)):
        # Start each account with an empty product list
        products = ''
        # Sort the grouped data by the second key (the SequenceId)
        for key1, key2, line in sorted(grpDt, key=itemgetter(1)):
            aline = line.split(',')
            # Append this record's product to the list
            products = products + ',' + aline[3]
        # Output the account followed by its ordered products
        print(grpKey + products)

Finally, we call main when the script is run directly:

if __name__ == "__main__":
    main()

Run a pipeline like the following to test the code (the file names here are placeholders for your data file and the mapper and reducer scripts):

cat data.dat | python mapper.py | python reducer.py

For the sample data above, it will output:

0123,Product3,Product5
0321,Product3
Even complex problems have an elegant and fast solution with Python.

Data Management Quick Comparison 2

The following article provides a comparison between BASH and JAQL for basic data management and manipulation. 


Selecting – All Fields

BASH:

more mydata.dat

JAQL:

read(hdfs("Customers.dat"));

– One Field

BASH:

cut -c 13-22 mydata.dat

JAQL:

$Customers = read(hdfs("Customers.dat"));
$Customers -> transform { name: $.name };

– Subset

BASH:

less mydata.dat | awk '{if (substr($0,13,10) == 2786586641) print $0}'

JAQL:

$Customers = read(hdfs("Customers.dat"));
$Customers -> filter $.age == 18 -> transform { name: $.name };



– Sort

BASH:

sort -n -k 2 mydata.dat

JAQL:

$Customers = read(hdfs("Customers.dat"));
$Customers -> sort by [$.age] -> transform { name: $.name };



– Join

BASH:

join -1 2 -2 1 mydata_2.dat mydata_lkup.dat | less

or (if there are no unmatched values and both files are sorted):

paste mydata_2.dat mydata_lkup.dat

JAQL:

$CustomersPurchases = join $Customers, $PurchaseOrder
    where $Customers.CustomerId == $PurchaseOrder.CustomerId
    into {$, $PurchaseOrder.*};



– Limiting Rows

BASH:

more mydata.dat | head -10

JAQL:

$Customers = read(hdfs("Customers.dat"));
$Customers -> top(2);

Aggregate Analysis

BASH:

awk 'BEGIN { FS=OFS=SUBSEP="" } { arr[$2,$3]++ } END { for (i in arr) print i, arr[i] }' mydata_lkup.dat

JAQL:

$CustomersPurchases -> group by $Customers_group = $.CustomerId into
{$Customers_group, total: count($[*].POID)};



– Distinct Values

BASH:

less mydata_lkup.dat | cut -c 12 | sort | uniq

JAQL:

$CustomersPurchases -> group by $Customers_group = $.CustomerId into
{$Customers_group};


Depersonalization of data is a growing issue for modelers as privacy concerns about consumer data increase. It is often necessary to de-associate personal identifiers from datasets or take other precautions to assure the anonymity of the individuals studied. This is difficult because many fields we use in modeling, such as gender, date of birth, and zip code, can be used to identify individuals. A study by Latanya Sweeney showed that gender, date of birth, and zip code can uniquely identify 87% of the US population. To meet privacy concerns, removing driver's license number, Social Security number, and full name is often not enough.


Here is an example: you are given two datasets, one with a demographic profile of each individual and results from a medical study, and the other with full name, address, and date of birth. The concern is that you do not want someone to uniquely identify individuals across these datasets. As mentioned above, if both datasets contain gender, date of birth, and home zip code, you can identify individuals with roughly 87% accuracy; there has been no depersonalization. If age had replaced date of birth in the study dataset, one-to-one identification across the datasets would not have been so easily achievable.

Concept: K-anonymization

K-anonymization enables you to talk about the degree to which one dataset can be related to another. It is not the only measure of depersonalization, and it has some issues, namely that optimal k-anonymization is NP-hard, but it is an important concept to understand. If every record in one dataset can be matched to at least k records in another dataset, that dataset is said to be (k-1)-anonymous. For example, if you can uniquely match each record in the two datasets (one-to-one matching), the anonymity is zero. If, however, many records can match a given record, it is greater than zero. A large value of k indicates a greater degree of depersonalization of the study dataset. When calculating the value, you use a full-information dataset and the study dataset that requires depersonalization.
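To make the concept concrete, the anonymity level can be computed by grouping records on their quasi-identifiers and taking the smallest group size. Below is a minimal Python sketch; the helper name and the sample records are hypothetical, not drawn from any real study data.

```python
from collections import Counter

def k_anonymity(dataset, quasi_identifiers):
    """Smallest number of records sharing any one combination of
    quasi-identifier values; k == 1 means at least one record is unique."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in dataset)
    return min(groups.values())

# Hypothetical study records: gender + date of birth + zip identify each row uniquely
study = [
    {"gender": "F", "dob": "1970-01-01", "zip": "02138", "result": 1.2},
    {"gender": "M", "dob": "1970-01-01", "zip": "02138", "result": 0.8},
    {"gender": "F", "dob": "1982-05-12", "zip": "02138", "result": 2.1},
]

print(k_anonymity(study, ["gender", "dob", "zip"]))  # 1: every record is unique
print(k_anonymity(study, ["zip"]))                   # 3: zip alone identifies no one
```

Generalizing fields (age instead of date of birth, a truncated zip code) raises k, and with it the degree of depersonalization.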


Further Reading

L. Sweeney, Uniqueness of Simple Demographics in the U.S. (2002), Carnegie Mellon University, Laboratory for International Data Privacy

Data Manipulation

1. Sample Size

Before you go out hacking through the data looking for gems of information, think: how are you going to test validity? At the outset you should split your data into at least a training and a validation dataset. I prefer splitting the data into three randomly assigned groups. The first group is the training set; this is the data you build your model on. The next is the validation set, used to test your model's performance out of sample. The final dataset is the holdout sample; it is the final test of whether your model works on unseen data and should be used sparingly. Typically, I go back and forth between training and validation several times per model and two or three times between the training and holdout sets. I will discuss this topic further in the validation section.
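A minimal sketch of such a three-way random split (the 60/20/20 proportions and the helper name are my own choices, not prescribed by the text):

```python
import random

def three_way_split(rows, train=0.6, valid=0.2, seed=42):
    """Randomly assign rows to training, validation, and holdout sets.
    The remaining share (here 0.2) becomes the holdout sample."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train_set, valid_set, holdout_set = three_way_split(list(range(100)))
print(len(train_set), len(valid_set), len(holdout_set))  # 60 20 20
```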

2. Data Transformations
With thousands of data elements, why do we feel the need to create more? Because many relationships are hidden by nonlinearities and obscure interactions. To build a successful model you need to get your hands dirty transforming the data to uncover these relationships. A nonlinear relation is any that cannot be modeled as a weighted sum of the independent variables; this broad definition covers everything from models whose output is a product of the independent variables, to structural breaks in a relationship, to just about anything else. Nonlinearities can be a showstopper when building a model, and knowing how to get around them is essential. Data transformation is an important tool to make the show go on.

a) Nonlinear Transformations

The classic example of a nonlinear relationship in economics is the Cobb-Douglas production function:

Y = f(K,L) = A*L^a*K^b

A: technology constant

a+b = 1: constant returns to scale

a+b < 1: diminishing returns to scale

a+b > 1: increasing returns to scale

You will not be able to estimate this model using linear techniques. However, taking the log of the equation yields: log(Y) = log(A) + a*log(L) + b*log(K)

This you can estimate using linear modeling. As shown, a nonlinear transformation can make an unsolvable problem solvable; however, nonlinear transformations should only be used when there are theoretical reasons for them. Misuse of transformations leads to curve fitting and overfitting, and your model will perform poorly out of sample.
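To illustrate, here is a sketch that recovers the Cobb-Douglas parameters from simulated data with an ordinary least-squares fit on the logged equation. The parameter values and data are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.uniform(1, 10, 200)     # labor
K = rng.uniform(1, 10, 200)     # capital
A, a, b = 2.0, 0.7, 0.3         # "true" parameters for the simulation
Y = A * L**a * K**b             # Cobb-Douglas output

# log(Y) = log(A) + a*log(L) + b*log(K) is linear in the coefficients
X = np.column_stack([np.ones_like(L), np.log(L), np.log(K)])
coef, *_ = np.linalg.lstsq(X, np.log(Y), rcond=None)

print(np.exp(coef[0]), coef[1], coef[2])  # recovers A, a, b
```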

Another example of a nonlinear relationship is average travel time as a function of distance from the city center. When you are very close to the city center, the average travel time is longer than when you are further out. This could be because traffic increases exponentially as you approach the city center, reducing your speed: AvgTime = a*Distance^b

Taking logs of both sides makes this relation estimable using linear techniques:

log(AvgTime) = log(a) + b*log(Distance)


Example Nonlinear Transformations:

1. Square

2. Square root

3. Log

4. Exp

b) Dummy Variables

Dummy variables are important for modeling structural breaks in a system, variables with discontinuous relationships, and other nonlinearities. It is more common for variables to have a discontinuous relationship with one another than a continuous one. A continuous relationship is like speed to distance: the faster you travel, the further you will travel in any given amount of time (assuming a straight highway, no police, …). Let's reuse the example above of average travel time versus distance from the city center. Above I assumed the model AvgTime = a*Distance^b, but that is not the only possibility. It may be that all the delay is caused by one bridge. The model then would be

AvgTime = a + b*Distance   where Distance < bridge from city center

AvgTime = c + b*Distance   where Distance >= bridge from city center

This is best modeled by adding a dummy variable that captures whether the journey begins before or after the bridge. Dummy variables can also incorporate cumulative effects, such as income versus education.

Example :

1 0 0 0 0 High School

1 1 0 0 0 Junior College

1 1 1 0 0 College

1 1 1 1 0  Graduate School

(No High School is the Intercept).
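As a sketch, a cumulative encoding like the table above can be produced with a small helper. This simplified version uses four indicator columns for the four named levels; the function name is hypothetical:

```python
LEVELS = ["High School", "Junior College", "College", "Graduate School"]

def cumulative_dummies(education):
    """Each completed level turns on its own indicator and keeps the
    lower levels on; "No High School" is all zeros (the intercept)."""
    rank = LEVELS.index(education) + 1 if education in LEVELS else 0
    return [1 if i < rank else 0 for i in range(len(LEVELS))]

print(cumulative_dummies("College"))         # [1, 1, 1, 0]
print(cumulative_dummies("No High School"))  # [0, 0, 0, 0]
```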

c) Fuzzy Variables

Fuzzy variables promise to better incorporate real-world notions of numbers and nonlinear relationships. Many variables are thought of, well, in fuzzy terms: temperature, age, weight. We seldom think of temperature as a continuous number, but rather in terms of hot or mild or cold. This gets more problematic because our definitions overlap; a temperature can be both hot and mild in our minds. And if we define these variables in our heads in a fuzzy manner, we react to them in a fuzzy manner. Fuzzy logic holds the promise of better modeling these relationships.

Membership functions: these define which set(s) a particular value may belong to. For example, 85 degrees may be both hot and warm. Membership functions are developed through surveys, experimentation, and logic.

Fuzzy variables avoid key problems that plague dummy variables, namely the sharp cutoff between being included and excluded, and the lumping of all groups into the same category. For example, say you want to model savings against education while correcting for age differences. The effect of age on employment is nonlinear: younger and older people have lower employment rates than the ages in between. To capture this, you may want to include a dummy variable for nearing retirement age, which is 0 if under 65 and 1 if 65 or over. But why 65? Why not 64.5 or 66.5? The cutoff at 65 is arbitrary and weakens the estimated relationship between employment and retirement age. To capture this complex relationship you can instead define a membership function that allows a 64-year-old to belong partly to both the retirement group and the non-retirement group.

Below is an example SAS Code generating fuzzy membership functions for temperature.

data Fuzzy;
    do Temp = 0 to 100;
        if Temp <  40 then Cold = 1;
        if Temp >= 40 then Cold = -(0.05)*Temp + 3;
        if Temp >= 60 then Cold = 0;

        if Temp <  30 then Cool = 0;
        if Temp >= 30 then Cool = (0.04)*Temp - 1.2;
        if Temp >= 55 then Cool = -(0.04)*Temp + 3.2;
        if Temp >  80 then Cool = 0;

        if Temp <  60 then Warm = 0;
        if Temp >= 60 then Warm = (0.06667)*Temp - 4;
        if Temp >= 75 then Warm = -(0.06667)*Temp + 6;
        if Temp >= 90 then Warm = 0;

        if Temp <  80 then Hot = 0;
        if Temp >= 80 then Hot = (0.05)*Temp - 4;
        if Temp >= 100 then Hot = 1;
        output;
    end;
run;

title "Temperature";

proc gplot data=Fuzzy;
    plot Cold*Temp Cool*Temp Warm*Temp Hot*Temp / overlay frame legend=legend;
run;
quit;


d) Splitting the Data

Splitting a dataset is useful for uncovering hidden relationships. Let's use the example above of average travel time versus distance from the city center. It may be that all the delay is caused by one bridge, but the bridge also affects the maximum speed you can travel before reaching it. The model then would be

AvgTime = a + b*Distance   where Distance < bridge from city center

AvgTime = a + c*Distance   where Distance >= bridge from city center

This could be modeled by splitting the dataset.
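A sketch of this split-and-fit approach, where a separate line is estimated on each side of the break (the bridge location and coefficients are made-up numbers for illustration):

```python
import numpy as np

bridge = 5.0                           # assumed distance of the bridge from city center
dist = np.linspace(0.5, 12, 100)
# Synthetic travel times: same intercept, steeper slope past the bridge
time = np.where(dist < bridge, 10 + 2.0 * dist, 10 + 3.5 * dist)

def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

before = fit_line(dist[dist < bridge], time[dist < bridge])
after = fit_line(dist[dist >= bridge], time[dist >= bridge])
print(before, after)  # the two slopes are recovered separately
```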

3. Data Reduction Techniques

a) Principal Components / Factor Analysis

Principal components is a powerful data reduction tool. It estimates the structure between variables, producing factors that represent those structures. By looking at the factors you can deduce relationships by seeing how different variables relate to one another. The factors are by definition orthogonal, and some would argue they can be used as independent variables in a regression. One common mistake is to forget that principal components is an estimation technique and needs to be treated as such.

Example Code (R)


Results.FA <- factanal(~ Fertility + Agriculture + Examination + Education
    + Catholic + Infant.Mortality, factors = 2, rotation = "varimax",
    scores = "none", data = swiss)




b) Other data reduction techniques

There are a number of unsupervised AI techniques that will be discussed in the AI data mining section.

Data Management Quick Comparison 1


The following article provides a comparison between SQL and SAS for basic data management and manipulation.

Selecting – All Fields

SQL:

Select * from mydata;

SAS:

Proc print data = mydata;
run;

– One Field

SQL:

Select col1 from mydata;

SAS:

Proc print data = mydata (keep = col1);
run;

– Subset

SQL:

Select * from mydata where col2 = 2786586641;

SAS:

Proc print data = mydata (where = (col2 = 2786586641));
run;



– Sort

SQL:

Select * from mydata order by col2;

SAS:

Proc sort data = mydata;
    by col2;
run;



– Join

SQL:

Select * from mydata_2 as mydt inner join mydata_lkup as lkup on lkup.col1 = mydt.col2;

SAS:

Proc sort data = mydata_2;
    by col2;
run;
Proc sort data = mydata_lkup;
    by col1;
run;

data tmp;
    merge mydata_2 (in = _mydt rename = (col2 = _lnk))
          mydata_lkup (in = _lkup rename = (col1 = _lnk));
    by _lnk;
run;



– Limiting Rows

SQL:

Select top 10 * from mydata;

SAS:

Proc print data = mydata (obs = 10);
run;

Aggregate Analysis

SQL:

Select col2, count(*) from mydata_lkup group by col2;

SAS:

proc freq data = mydata_lkup;
    tables col2;
run;

or, with a data step:

Proc sort data = mydata_lkup;
    by col2;
run;

Data agg;
    set mydata_lkup;
    by col2;
    retain count;
    if first.col2 then count = 0;
    count = count + 1;
    if last.col2;
    keep col2 count;
run;


– Distinct Values

SQL:

Select distinct col2 from mydata_lkup;

SAS:

Proc sort data = mydata_lkup nodupkey;
    by col2;
run;