Python: Profile and Why to Avoid Loops

If you have coded in R or S-Plus, you should be familiar with avoiding loops and capitalizing on functions like tapply whenever you can. These rules of thumb will serve you well when coding in Python. Many Python objects have built-in methods that accept iterators as parameters, and Python also has functions such as map, reduce, and filter that behave similarly to R's tapply. Below are two quick examples showing how to use the profile module and the performance hit from using loops in Python.

import profile

myList1 = []
for i in range(10000):
    myList1.append(str(i))
 
def test_1():
    s_out1 = ""
    for val in myList1:
        s_out1 = s_out1 + val + ','

def test_2():
    s_out2 = ",".join(myList1)
 
profile.run('test_1()')
profile.run('test_2()')
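The filter and reduce functions mentioned above follow the same pattern as map; a minimal sketch using the myList1 built earlier:

# Keep only the even numbers (filter applies a predicate to each element)
evens = filter(lambda s: int(s) % 2 == 0, myList1)
# Fold the list into a single running total (reduce is a built-in in Python 2)
total = reduce(lambda acc, s: acc + int(s), myList1, 0)
print len(evens), total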

The join method lets you quickly concatenate the strings within an iterable without the hassle of looping through it. Below is an example of transforming all the elements within an iterable, first by looping through the object and then by using the map function.

myList2 = []

for i in range(26):
    myList2.append(chr(i + 97))

myList2 = myList2 * 1000

def test_3():
    list_out1 = []
    for val in myList2:
        list_out1.append(val.upper())

def test_4():
    list_out2 = map(str.upper, myList2)
 
profile.run('test_3()')
profile.run('test_4()') 
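If you only want wall-clock timings rather than a full call profile, the standard timeit module works as well; a minimal sketch (the choice of number=100 is arbitrary):

import timeit
# Time 100 calls of each version; the setup string imports the function under test
print timeit.Timer('test_3()', 'from __main__ import test_3').timeit(number=100)
print timeit.Timer('test_4()', 'from __main__ import test_4').timeit(number=100)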

Although Python is quick to pick up, there is, as with any language, good code and bad code.

Sorting in Hadoop with Itertools

Previously, I gave a simple example of a Python map/reduce program. In this example I will show you how to sort the data within a key using itertools, a powerful Python library focused on looping algorithms. This example is an extension of Michael Noll's detailed Hadoop example. The data in this example is comma-delimited.

AccountId,SequenceId,Amount,Product
0123,9,12.50,Product9
0123,5, 4.60,Product5
0123,7,54.00,Product7
0123,2,34.75,Product2
0123,3, 6.34,Product3
0123,8,14.50,Product8
0123,1,52.56,Product1
0123,4,78.45,Product4
0123,6,89.50,Product6
0321,1,2.12,Product1
0321,8,90.50,Product8
0321,3, 2.35,Product3
0321,4,56.25,Product4
0321,9,71.00,Product9
0321,2,24.75,Product2
0321,7,34.34,Product7
0321,6,34.23,Product6
0321,5,37.03,Product5       

Notice the data is sorted by AccountId, but for a given account the data is not sorted by SequenceId. One important thing to remember is that Hadoop does not preserve the order of the data when passing it to the nodes which run the reduce step. The map step will group the data by the specified key, but even if the data is sorted on the server correctly, that order is not guaranteed to be preserved in the reduce. If your analysis requires the data to be sorted, not just grouped by key, this must be handled in the reduce step. Now let's suppose you need to know the ordered sequence of products purchased for a given account. For clarity, each product is numbered in accordance with its SequenceId. While this is easy to achieve with the Python base library, thankfully there is a library, itertools, with a number of functions that make this quite elegant to code. First, let's create a simple map step, mapper.py.

#!/usr/bin/env python
import sys

for line in sys.stdin:
    # Split the line by the delimiter
    aline = line.split(',')
    # Choose the first field as the map key
    key = aline[0]
    # Output the map key followed by a tab and the rest of the data
    print '%s\t%s' % (key, line.strip())

This defines the AccountId as the key. Remember, even if you included the SequenceId in the key it would not help: the Hadoop server would then just distribute each record across the nodes, and if you need to do an analysis by account, a given node would not see all of that account's transactions.
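For the sample data above, the first two lines the mapper emits look like this (the key, a tab, then the original record):

0123	0123,9,12.50,Product9
0123	0123,5, 4.60,Product5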

Now, here is the reduce step, reducer.py:

#!/usr/bin/env python
import sys
from itertools import groupby
from operator import itemgetter

First, we define how to parse the input data stream. Notice how simple it is to handle two different delimiters: the tab that defines the key for the Hadoop server and the comma that separates the data fields.

def read_mapper_output(file, sortBy, sepA, sepB):
    for line in file:
        key1, line1 = line.rstrip().split(sepA, 1)
        aline = line.split(sepB)
        # Create a second key to use on our data structure
        key2 = aline[sortBy]
        yield key1, key2, line1

The yield statement makes this function a generator: each call produces a tuple of account number, sequence ID, and the entire data line.
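For example, the mapper line '0123\t0123,9,12.50,Product9' yields the tuple:

('0123', '9', '0123,9,12.50,Product9')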

def main(separator='\t'):
    # Define the iterator
    data = read_mapper_output(sys.stdin, 1, '\t', ',')
    # Loop through the iterator grouped by the first key.
    for grpKey, grpDt in groupby(data, itemgetter(0)):
        # Set the product list blank
        products = ''
        # Sort the grouped data by the second key.
        for key1, key2, line in sorted(grpDt, key=itemgetter(1)):
            aline = line.split(',')
            # Populate the product list
            products = products + ',' + aline[3]
        # Print once per account, after the whole group is processed
        print grpKey + products

Finally, we call the main function when the script is run directly.

 
if __name__ == "__main__":
    main() 
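One caveat worth noting: groupby only groups consecutive items, which is why it pairs so well with Hadoop's key-sorted stream. A minimal standalone illustration:

from itertools import groupby

pairs = [('a', 1), ('a', 2), ('b', 3)]
for key, grp in groupby(pairs, key=lambda p: p[0]):
    print key, list(grp)
# a [('a', 1), ('a', 2)]
# b [('b', 3)]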

Run the following to test the code:

cat test.data | python mapper.py | python reducer.py

It will output:

0123,Product1,Product2,Product3,Product4,Product5,Product6,Product7,Product8,Product9
0321,Product1,Product2,Product3,Product4,Product5,Product6,Product7,Product8,Product9
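Note the pipe test works here because the sample data is already grouped by AccountId. To mimic Hadoop's shuffle-and-sort on input that is not, add a sort between the two steps:

cat test.data | python mapper.py | sort | python reducer.py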

Even complex problems have an elegant and fast solution in Python.

Simple Hadoop Streaming Example

Hadoop is a software framework that distributes workloads across many machines. With a moderate-sized data center, it can process huge amounts of data in a brief period of time (think weeks become hours). Because it is open source, easy to use, and works, it is rapidly becoming a default tool in many analytic shops working with large datasets. In this post, I will give a very simple example of how to use Hadoop with Python. For more on Hadoop, see here or type "Hadoop overview" into your favorite search engine.

For this example, let’s use the US Census County Business Patterns dataset.
Location: http://www2.census.gov/econ2007/CBP_CSV
Download: zbp07detail.zip then unzip the data.

This file provides the number of business establishments (EST) by NAICS code and zip code for the US. We are interested in the total number of establishments (EST) by zip code (zip).

The first step is to create the mapper. The map step creates the key the Hadoop system uses to keep parts of the data together for processing, and it is expected to output the key followed by a tab ('\t') and then the rest of the data required for processing. For this example, we will use Python. Create the file, map.py, and add the code below. The key, which tells the Hadoop system how to group the data, is defined as the value before the first tab ('\t').

#!/usr/bin/env python
import sys

for lineIn in sys.stdin:
    # Note: the key (the five-digit zip code) is defined here
    zip = lineIn[0:5]
    print zip + '\t' + lineIn.strip()

Now we create the reducer. The reduce step performs the analysis or data manipulation work. You only need a reduce step if the analysis requires a group-by key, which the map step provides; in this example, it is the zip code. Create the file, reducer.py, and add the code below. In this step, you need to make sure you split the piped stream into the key and the data line.

#!/usr/bin/env python
import sys

counts = {}
for lineIn in sys.stdin:
    # Note: we separate the key here
    key, line = lineIn.rsplit('\t', 1)
    aLine = line.split(',')
    counts[key] = counts.get(key, 0) + int(aLine[3])

# Print the totals once the whole stream has been read
for k, v in counts.items():
    print k + ',' + str(v)

To test the code just run the map and reduce steps alone. If it is a large file make sure to use a sub-sample.

cat zbp07detail.txt | python map.py > zbp07detail.map.txt
cat zbp07detail.map.txt | python reducer.py > Results.test.txt

Now just run following commands at the shell prompt:

hadoop fs -put zbp07detail.txt /user/mydir/
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-XXXX-streaming.jar \
    -input /user/mydir/zbp07detail.txt \
    -output /user/mydir/out \
    -mapper ./map.py \
    -reducer ./reducer.py \
    -file ./map.py \
    -file ./reducer.py
hadoop fs -cat /user/mydir/out/* > Results.txt

That is all there is to it. The file Results.txt will have the sums by zip.

Data Management Quick Comparison 3

The following article provides a comparison between R/S-Plus and Python for basic data management and manipulation. 

Load Data

R/S-Plus

mydata <- read.csv("c:/customer.csv") 
mydata_lkup <- read.csv("c:/purchaseorder.csv") 

Python

 
import csv 
custDS=[] 
for row in csv.reader(open('c:/customer.csv'), delimiter=',', quotechar='"'): 
    custDS.append(row) 
poDS=[] 
for row in csv.reader(open('c:/purchaseorder.csv'), delimiter=',', quotechar='"'): 
    poDS.append(row)

Selecting- All Fields

R/S-Plus

mydata

Python

print(custDS)

One Field

R/S-Plus

  
mydata$col1

Python

  
for x in custDS:
    print x[1] 

Subset

R/S-Plus

 
subset(mydata, col2 == 18)

Python

[x for x in custDS if x[2]=='18']

Sorting

R/S-Plus

 
mydata[order(mydata[,1]),]

Python

 
sorted(custDS, key=lambda customer: customer[2])

Join

R/S-Plus

 
merge(mydata, mydata_lkup, by.x = "col1", by.y = "col2", all = TRUE)

Python

 
poDt = {}
for row in csv.reader(open('c:/purchaseorder.csv'), delimiter=',', quotechar='"'):
    poDt[row[0]] = row[1:4]
dsOut = []
for x in custDS:
    if x[0] in poDt:
        x.extend(poDt[x[0]])
        dsOut.append(x)
        print(x)

 

Sample

R/S-Plus

head(mydata, 10)

Python

for i in range(3):
    print(poDS[i])

Aggregate Analysis

R/S-Plus

xtabs( ~ col2, mydata_lkup) 

Python

poCounts = {}
for row in poDS:
    poCounts[row[1]] = poCounts.get(row[1], 0) + 1
print(poCounts)

Unique

R/S-Plus

unique(mydata_lkup$col2)

Python

uniqpoList = []
[uniqpoList.append(i[1]) for i in poDS if not uniqpoList.count(i[1])]
print(uniqpoList)
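If you do not need to preserve the original order, a set is the more idiomatic way to collect unique values; a minimal sketch:

uniqpoList = list(set([i[1] for i in poDS]))
print(uniqpoList)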

Python Data

Python is a scripting language with a heavy emphasis on code reuse and simplicity. It is a popular language with a large and active user group. Like Perl, it has a rich library set; it is popular with projects like MapReduce with Hadoop and is rapidly becoming the default language in areas like computational intelligence. One interesting feature of the Python language is that indentation acts as the block delimiter, so when copying code, do not change the indentation or the code will not operate as intended. It is also a powerful language, meaning it can do a lot with very little code. Below is an example of a database containing two tables, Customer and PurchaseOrder. The Customer table has CustomerID (the unique identifier for the customer), Name (the customer's name), and Age (the customer's age). The PurchaseOrder table has POID (the unique ID for the purchase order), CustomerID (which refers back to the customer who made the purchase), and Purchase (what the purchase was).

Example:
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
PurchaseOrder
POID CustomerID Purchase
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction
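Before the SQL-style examples, here is a quick illustration of the indentation rule mentioned above; the two print statements below differ only in indentation:

for i in range(3):
    print i      # indented: runs on every pass through the loop
print 'done'     # not indented: runs once, after the loop ends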
 
     
SELECT
Create the file select.py with the following code:

import sys
import fileinput

# Loop through the file and print each line
for line in fileinput.input(sys.argv[1]):
    print line,

Then run the code:
python select.py customer.txt

CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
ORDER BY
Create the file orderby.py with the following code:

import sys
import fileinput

# Initialize variables
listLines = []

# Load the file into a list
for line in fileinput.input(sys.argv[1]):
    listLines.append(line)

# Create a custom sort key based on the third field (age)
def getAge(line):
    return line.split(',', 2)[-1]

# Sort the list
listLines.sort(key=getAge)

# Print the sorted lines
for object in listLines:
    print object,

Then run the code:
python orderby.py customer.txt

CustomerID Name Age
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56
WHERE
Create the file select_by_id.py with the following code:
import sys
import fileinput

# Loop through the file
for line in fileinput.input(sys.argv[1]):
    # Split the line using a comma
    tokens = line.split(',')
    # If the ID matches the passed ID then print
    if tokens[0] == sys.argv[2]:
        print line,

Then run the code:
python select_by_id.py customer.txt 1

1 Joe 23
INNER JOIN
Create the file innerjoin.py with the following code:

import sys
import fileinput

# Initialize variables
listB = []

# Load the second file into a list to loop through
for lineB in fileinput.input(sys.argv[2]):
    listB.append(lineB)

# Loop through the first file
for lineA in fileinput.input(sys.argv[1]):
    # Split the line using a comma
    tokensA = lineA.split(',')
    # Loop through the list
    for object in listB:
        # Split the line using a comma
        tokensB = object.split(',')
        # If there is a match, print
        if tokensA[0] == tokensB[1]:
            # Remove the newline character with the strip method
            print lineA.strip() + ',' + object,

Then run the code:
python innerjoin.py customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
LEFT OUTER JOIN
Create the file leftouterjoin.py with the following code:

import sys
import fileinput

# Initialize variables
listB = []
iFound = 0

# Load the second file into a list to loop through
for lineB in fileinput.input(sys.argv[2]):
    listB.append(lineB)

# Loop through the first file
for lineA in fileinput.input(sys.argv[1]):
    # Split the line using a comma
    tokensA = lineA.split(',')
    iFound = 0
    # Loop through the list
    for object in listB:
        # Split the line using a comma
        tokensB = object.split(',')
        # If there is a match, print
        if tokensA[0] == tokensB[1]:
            # Remove the newline character with the strip method
            print lineA.strip() + ',' + object,
            iFound = 1
    # If there was no match, print the line with NULL fields
    if iFound == 0:
        print lineA.strip() + ',NULL,NULL'

Then run the code:
python leftouterjoin.py customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NULL NULL
GROUP BY
Create the file groupby.py with the following code:

import sys
import fileinput

# Initialize variables
iCnt = 1
iLoop = 0

# Load and loop through the file
for lineA in fileinput.input(sys.argv[1]):
    # Split the line using a comma
    tokensA = lineA.split(',')
    # Adjust for the header and the first data line
    if iLoop < 2:
        priorTokens = tokensA
        iCnt = 0
    if tokensA[0] == priorTokens[0]:
        iCnt = iCnt + 1
    else:
        print priorTokens[1] + ',' + priorTokens[2].strip() + ',' + str(iCnt)
        iCnt = 1
    iLoop = iLoop + 1
    priorTokens = tokensA

# Print the last group
print priorTokens[1] + ',' + priorTokens[2].strip() + ',' + str(iCnt)

Then run the code:
python groupby.py customer.txt

Name Age Orders
Joe 23 1
Mika 45 2
Lin 34 2
Sara 56 1
UPDATE
Create the file update.py with the following code:

import sys
import fileinput

# Loop through the file
for line in fileinput.input(sys.argv[1]):
    # Split the line using a comma
    tokens = line.split(',')
    # If the ID is not the passed ID then print; else replace the age with the passed parameter
    if tokens[0] != sys.argv[2]:
        print line,
    else:
        print tokens[0] + ',' + tokens[1] + ',' + sys.argv[3]

Then run the code:
python update.py customer.txt 1 26

Customer
CustomerID Name Age
1 Joe 26
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
INSERT
Create the file insert.py with the following code:

import sys
import fileinput

# Loop through the file and print each line
for line in fileinput.input(sys.argv[1]):
    print line,

# Add the new line from the passed arguments
print sys.argv[2] + ',' + sys.argv[3] + ',' + sys.argv[4],

Then run the code:
python insert.py customer.txt 6 Terry 50

Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50
DELETE
Create the file delete.py with the following code:

import sys
import fileinput

# Loop through the file
for line in fileinput.input(sys.argv[1]):
    # Split the line using a comma
    tokens = line.split(',')
    # If the ID is not the passed ID then print
    if tokens[0] != sys.argv[2]:
        print line,

Then run the code:
python delete.py customer.txt 1

Customer
CustomerID Name Age
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18