Creating the New

Listen. Ideate. Create. When you sit down and just listen, without doubt or fear, without judgment or pride, to your customers, to your coworkers, and to yourself, something wonderful happens. Something is born: a weave of ideas and aspirations that, if nurtured and guided, can become an innovation with the potential to change the world.

I have built a career helping others see the potential in themselves to do something new, something authentic. I have helped others create hundreds of patent applications and have over a dozen patents and trade secrets to my own name. From these efforts, I learned a simple truth: if you want to create something new, you need to look beyond yourself, and to do that you need to listen.

Today we often hero-worship those who embody arrogance in the name of innovation when, in reality, these are marketing constructs of people. To create, we must look past the media version of a creator and see the creators in our midst, the true drivers of innovation who quietly listen to needs and craft new solutions. There are endless possibilities when I just listen, but when hubris takes over, the world narrows to what is within my single mind.

Python: Profile and Why to Avoid Loops

If you have coded in R or S-Plus, you should be familiar with avoiding loops and capitalizing on the apply family of functions (such as tapply) whenever you can. These rules of thumb will serve you well when coding in Python. Many Python objects have built-in methods that accept iterators as parameters, and Python also has functions such as map, reduce, and filter that behave much like R's apply functions. Below are two quick examples showing how to use the profile module and the performance hit from using loops in Python.

import profile

myList1 = []
for i in range(10000):
    myList1.append(str(i))
 
def test_1():
    s_out1 = ""
    for val in myList1:
        s_out1 = s_out1 + val + ','

def test_2():
    s_out2 = ",".join(myList1)

profile.run('test_1()')
profile.run('test_2()')

The join function allows you to quickly concatenate the strings within an iterable object without the hassle of looping through it. Below is an example of transforming all the elements within an iterable, first by looping through the object and then by using the map function.

myList2 = []

for i in range(26):
    myList2.append(chr(i + 97))

myList2 = myList2 * 1000

def test_3():
    lst_out = []
    for val in myList2:
        lst_out.append(val.upper())

def test_4():
    lst_out = map(str.upper, myList2)

profile.run('test_3()')
profile.run('test_4()')
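The map function used above has the siblings filter and reduce mentioned earlier. Here is a minimal sketch, not profiled and with illustrative numbers only, of how they replace explicit loops:

from functools import reduce  # reduce is a builtin in Python 2 but also lives in functools

# Keep only the even numbers, without writing a loop
evens = filter(lambda x: x % 2 == 0, range(10))

# Collapse a sequence to a single value, here a running sum
total = reduce(lambda a, b: a + b, range(10))

print(list(evens))
print(total)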

Although Python is quick to pick up, there is, as with any language, good code and bad code.

Sorting in Hadoop with Itertools

Previously, I gave a simple example of a Python map/reduce program. In this example I will show you how to sort the data within a key using itertools, a powerful Python library for efficient looping. This example is an extension of Michael Noll's detailed Hadoop example. The data in this example is comma-delimited.

AccountId,SequenceId,Amount,Product
0123,9,12.50,Product9
0123,5, 4.60,Product5
0123,7,54.00,Product7
0123,2,34.75,Product2
0123,3, 6.34,Product3
0123,8,14.50,Product8
0123,1,52.56,Product1
0123,4,78.45,Product4
0123,6,89.50,Product6
0321,1,2.12,Product1
0321,8,90.50,Product8
0321,3, 2.35,Product3
0321,4,56.25,Product4
0321,9,71.00,Product9
0321,2,24.75,Product2
0321,7,34.34,Product7
0321,6,34.23,Product6
0321,5,37.03,Product5       

Notice the data is sorted by AccountId, but within a given account it is not sorted by SequenceId. One important thing to remember is that Hadoop does not preserve the order of the data when passing it to the nodes that run the reduce step. The map step will group the data by the specified key, but even if the data is sorted correctly on the server, that order is not guaranteed to be preserved in the reduce. If your analysis requires the data to be sorted, not just grouped by key, this must be handled in the reduce step. Now let's suppose you need to know the ordered sequence of products purchased for a given account. For clarity, I numbered each product in accordance with its SequenceId. While this is possible with the Python base library alone, thankfully there is a library, itertools, with a number of functions that make this quite elegant to code. First, let's create a simple map step, mapper.py.

#!/usr/bin/env python 
import sys 
for line in sys.stdin: 
    # Split the line by the delimiter
    aline = line.split(',')
    # Choose the first field as the map key
    key = aline[0]
    # Output map key followed by a tab and the rest of the data
    print '%s\t%s' % (key, line.strip()) 

This defines the AccountId as the key. Remember, even if you included the SequenceId in the key it would not help you: the Hadoop server would then scatter each record across the nodes, and if you need to do an analysis by account, a given node would not see all of that account's transactions.

Now, here is the reduce step, reducer.py:

#!/usr/bin/env python
import sys
from itertools import groupby
from operator  import itemgetter 

First, we define how to parse the input data stream. Notice how simple it is to handle two different delimiters: the tab that defines the key for the Hadoop server and the comma that separates the data fields.

def read_mapper_output(file, sortBy, sepA, sepB):
    for line in file:
        key1, line1 = line.rstrip().split(sepA, 1)
        aline = line.split(sepB)
        # Create a second key to use on our data structure
        key2 = aline[sortBy]
        yield key1, key2, line1

The yield statement makes this function a generator. Each record it produces is a tuple containing the account number, the sequence ID, and the entire data line.
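Before looking at the reduce logic, here is a minimal standalone sketch, using made-up toy tuples shaped like the generator's output, of how itertools.groupby and operator.itemgetter work together. Note that groupby only groups consecutive records, so the stream must already be grouped by the key.

from itertools import groupby
from operator import itemgetter

# Toy records shaped like (key1, key2, line); already grouped by the first element
records = [('0123', '2', 'a'), ('0123', '1', 'b'), ('0321', '1', 'c')]
for grpKey, grpDt in groupby(records, itemgetter(0)):
    # Within each group, sort by the second element before using the rows
    print(sorted(grpDt, key=itemgetter(1)))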

def main(separator='\t'):
    # Define the iterator
    data = read_mapper_output(sys.stdin, 1, '\t', ',')
    # Loop through the iterator, grouped by the first key.
    for grpKey, grpDt in groupby(data, itemgetter(0)):
        # Set the product list blank
        products = ''
        # Sort the grouped data by the second key.
        for key1, key2, line in sorted(grpDt, key=itemgetter(1)):
            aline = line.split(',')
            # Populate your product list
            products = products + ',' + aline[3]
        # Print one line per account: the key followed by its ordered products
        print grpKey + products

Finally, we call the main function when the script is run directly.

 
if __name__ == "__main__":
    main() 

Run the following to test the code:

cat test.data | python mapper.py | python reducer.py

It will output:

0123,Product1,Product2,Product3,Product4,Product5,Product6,Product7,Product8,Product9
0321,Product1,Product2,Product3,Product4,Product5,Product6,Product7,Product8,Product9

Even complex problems have an elegant and fast solution with Python.

Simple Hadoop Streaming Example

Hadoop is a software framework that distributes workloads across many machines. With a moderate-sized data center, it can process huge amounts of data in a brief period of time: think weeks reduced to hours. Because it is open-source, easy to use, and works, it is rapidly becoming a default tool in many analytic shops working with large datasets. In this post, I will give a very simple example of how to use Hadoop with Python. For more on Hadoop, see the Apache Hadoop project site or type "Hadoop overview" into your favorite search engine.

For this example, let’s use the US Census County Business Patterns dataset.
Location: http://www2.census.gov/econ2007/CBP_CSV
Download: zbp07detail.zip then unzip the data.

This file provides the number of business establishments (EST) by NAICS code and ZIP code for the US. We are interested in the total number of establishments by ZIP code (zip).

The first step is to create the mapper. The map step creates the key the Hadoop system uses to keep parts of the data together for processing. The map step is expected to output the key followed by a tab ('\t') and then the rest of the data required for processing. For this example, we will use Python. Create the file map.py and add the code below. The key, which tells the Hadoop system how to group the data, is defined as the value before the first tab ('\t').

#!/usr/bin/env python
import sys
for lineIn in sys.stdin:
    zip = lineIn[0:5]
    # Note: the key is defined here as the first five characters (the ZIP code)
    print zip + '\t' + lineIn

Now we create the reducer. The reducer step performs the analysis or data manipulation work. You only need a reducer step if the analysis requires the group-by key that the map step provides; in this example, it is the ZIP code. Create the file reducer.py and add the code below. In this step, you need to make sure you split the piped stream into the key and the data line.

#!/usr/bin/env python
import sys
counts={}
for lineIn in sys.stdin:
    # Note: we separate the key from the data line here
    key, line = lineIn.rsplit('\t', 1)
    aLine = line.split(',')
    counts[key] = counts.get(key, 0) + int(aLine[3])
# Print the totals once all the input has been read
for k, v in counts.items():
    print k + ',' + str(v)

To test the code, just run the map and reduce steps alone. If it is a large file, make sure to use a sub-sample.

cat zbp07detail.txt | python map.py > zbp07detail.map.txt
cat zbp07detail.map.txt | python reducer.py > Results.test.txt

Now just run the following commands at the shell prompt:

hadoop fs -put /zbp07detail.txt /user/mydir/ 
hadoop jar HadoopHome/contrib/streaming/hadoop-XXXX-streaming.jar \
    -input /user/mydir/zbp07detail.txt \
    -output /user/mydir/out \
    -mapper ./map.py \
    -reducer ./reducer.py \
    -file ./map.py \
    -file ./reducer.py
hadoop fs -cat /user/mydir/out/* > Results.txt

That is all there is to it. The file Results.txt will have the sums by zip.

Data Management Quick Comparison 3

The following article provides a comparison between R/S-Plus and Python for basic data management and manipulation. 

Load Data

R/S-Plus

mydata <- read.csv("c:/customer.csv") 
mydata_lkup <- read.csv("c:/purchaseorder.csv") 

Python

 
import csv 
custDS=[] 
for row in csv.reader(open('c:/customer.csv'), delimiter=',', quotechar='"'): 
    custDS.append(row) 
poDS=[] 
for row in csv.reader(open('c:/purchaseorder.csv'), delimiter=',', quotechar='"'): 
    poDS.append(row)

Selecting- All Fields

R/S-Plus

 mydata 

Python

print(custDS)

One Field

R/S-Plus

  
 mydata$col1 

Python

  
for x in custDS:
    print x[1] 

Subset

R/S-Plus

 
subset(mydata, col2 == 18 )

Python

[x for x in custDS if x[2]=='18']

Sorting

R/S-Plus

 
mydata[order(mydata[,1]),]

Python

 
sorted(custDS, key=lambda customer: customer[2])

Join

R/S-Plus

 
merge(mydata, mydata_lkup, by.x = "col1", by.y = "col2", all = TRUE)

Python

 
poDt = {}
for row in csv.reader(open('c:/purchaseorder.csv'), delimiter=',', quotechar='"'):
    poDt[row[0]] = row[1:4]
dsOut = []
for x in custDS:
    if x[0] in poDt:
        x.extend(poDt[x[0]])
        dsOut.append(x)
        print(x)

 

Sample

R/S-Plus

head(mydata , 10)

Python

for i in [0,1,2]:
    print(poDS[i])

Aggregate Analysis

R/S-Plus

xtabs( ~ col2, mydata_lkup) 

Python

poCounts = {}
for row in poDS:
    poCounts[row[1]] = poCounts.get(row[1],0) + 1
print(poCounts)
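As an aside, a more idiomatic sketch of the same count, building on the poDS list loaded above and requiring Python 2.7 or later, uses collections.Counter:

from collections import Counter
# Count occurrences of the second field across all purchase-order rows
poCounts = Counter(row[1] for row in poDS)
print(dict(poCounts))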

Unique

R/S-Plus

unique(mydata_lkup$col2)

Python

uniqpoList = []
[uniqpoList.append(i[1]) for i in poDS if not uniqpoList.count(i[1])]
uniqpoList
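If the order of the values does not matter, a simpler alternative sketch, again building on the poDS list loaded above, is to use a set:

# A set keeps only distinct values of the second field
uniqpoSet = set(row[1] for row in poDS)
print(uniqpoSet)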

Data Management Quick Comparison 2

The following article provides a comparison between BASH and JAQL for basic data management and manipulation. 

 

Selecting- All Fields

BASH

more mydata.dat

JAQL

read(hdfs("Customers.dat"));

One Field

BASH

cut -c 13-22 mydata.dat 

JAQL

$Customers = read("Customers.dat");
$Customers -> transform { name: $.name };

Subset

BASH

less mydata.dat | awk '{if (substr($0,13,10) == 2786586641) print $0}'

JAQL

$Customers = read("Customers.dat");
$Customers -> filter $.age == 18 -> transform { name: $.name };

Sorting

BASH

sort -n -t +2 mydata.dat

JAQL

$Customers = read("Customers.dat");
$Customers -> sort by [$.age] -> transform { name: $.name };

Join

BASH

join -1 2 -2 1 mydata_2.dat mydata_lkup.dat | less
or (if no unmatched values and both files are sorted)
paste mydata_2.dat mydata_lkup.dat

JAQL

$CustomersPurchases = join $Customers, $PurchaseOrder where $Customers.CustomerId == $PurchaseOrder.CustomerId into {$Customers.name, $PurchaseOrder.*};

Sample

BASH

more mydata.dat| head -10

JAQL

$Customers = read("Customers.dat");
$Customers -> top(2);

Aggregate Analysis

BASH

awk 'BEGIN { FS = OFS = SUBSEP = " " } { arr[$2,$3]++ } END { for (i in arr) print i, arr[i] }' mydata_lkup.dat

JAQL

$CustomersPurchases -> group by $Customers_group = $.CustomerId into
{$Customers_group, total: count($[*].POID)};

Unique

BASH

less mydata_lkup.dat|cut -c 12|uniq

JAQL

$CustomersPurchases ->group by $Customers_group = $.CustomerId into
{$Customers_group };

JSON

JSON (JavaScript Object Notation) is a data format that is non-proprietary, human-readable, easy to parse, simple to generate, and widely adopted. You can read more about JSON at the project website, http://www.json.org/. Some of you might be saying: wait, there is another data format with the same properties, namely XML, so why bother with JSON? For me, there are two reasons JSON is superior to XML for data. First, it is designed for data, so it handles things such as arrays much more elegantly than XML. There are flavors of XML better suited to data than the generic one, but on the whole XML is trying to provide a flexible structure for describing complex metadata, not a simple format for data. Second, there is only one type of JSON, whereas XML comes in endless varieties. XML's flexibility is both its asset and its liability: because it is flexible, XML can be used in a variety of ways, but this often creates confusion about exactly how to consume and generate data, one of the very things XML was supposed to solve. When transferring data I do not like ambiguity; I want strict conventions.

A great article about XML versus JSON, with good comments, is found here: http://ajaxian.com/archives/json-vs-xml-the-debate.

But I use JSON not because it is better for data (there is always something better) but mainly because a lot of the data I want to consume uses the JSON format, and you can load it directly into R: http://cran.r-project.org/web/packages/rjson/index.html

Below is an example of a JSON formatted dataset:

[{
"CustomerId" : 1,
"age" : 23,
"name" : "Joe"
},{
"CustomerId" : 2,
"age" : 45,
"name" : "Mika"
},{
"CustomerId" : 3,
"age" : 34,
"name" : "Lin"
}]

You can see that it is human-readable: braces ({}) enclose each record, and each field is expressed as "ColName" : value.
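As a quick illustration of how easy JSON is to parse, here is a minimal Python sketch using the standard json module on the dataset above:

import json

jsonText = """
[{"CustomerId": 1, "age": 23, "name": "Joe"},
 {"CustomerId": 2, "age": 45, "name": "Mika"},
 {"CustomerId": 3, "age": 34, "name": "Lin"}]
"""

# Parse the text into a list of dictionaries
customers = json.loads(jsonText)
for customer in customers:
    print(customer["name"])

Each record becomes an ordinary dictionary, so no custom parsing code is needed.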

JAQL Data Management

JAQL is a JSON query language similar to SQL. One key difference is that the dataset is accessed more like an object. A great overview is found here: http://code.google.com/p/jaql/wiki/JaqlOverview. The strength of JAQL is that it allows users to write simple, extendable code to manipulate data stored in a non-proprietary, readable, and commonly used file format. It is primarily used when processing data with Hadoop. JAQL tends to crash when it hits exceptions, so I would copy and paste commands from an editor. Also, never press the up arrow.

Example Data:  
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
PurchaseOrder
POID CustomerID Purchase
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction

 

Create Data
$Customers = [ {CustomerId: 1, name: "Joe", age: 23}
, {CustomerId: 2, name: "Mika", age: 45}
, {CustomerId: 3, name: "Lin", age: 34}
, {CustomerId: 4, name: "Sara", age: 56}
, {CustomerId: 5, name: "Susan", age: 18}];

$PurchaseOrder = [
{POID: 1, CustomerId: 3, Purchase: "Fiction"},
{POID: 2, CustomerId: 1, Purchase: "Biography"},
{POID: 3, CustomerId: 1, Purchase: "Fiction"},
{POID: 4, CustomerId: 2, Purchase: "Biography"},
{POID: 5, CustomerId: 3, Purchase: "Fiction"},
{POID: 6, CustomerId: 4, Purchase: "Fiction"} ];
SELECT
hdfs($Customers) ; 
hdfs($PurchaseOrder) ;
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

Sort BY

$Customers -> sort by [$.age];
CustomerID Name Age
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56
Filter
$Customers -> filter $.age == 18 ;
CustomerID Name Age
5 Susan 18
INNER JOIN
join $Customers, $PurchaseOrder where $Customers.CustomerId == $PurchaseOrder.CustomerId into {$Customers.name, $PurchaseOrder.*} ; 
CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
LEFT OUTER JOIN

join preserve $Customers, $PurchaseOrder where $Customers.CustomerId == $PurchaseOrder.CustomerId into {$Customers.name, $PurchaseOrder.*} ;

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NULL NULL
GROUP BY
 
 $Customers -> group by $Customers_group = $.CustomerID into {$Customers_group, $Customers.age, total: count($) }; 
Name Age Orders
Joe 23 1
Mika 45 2
Lin 34 2
Sara 56 1
UPDATE

$Customer1 = $Customers -> filter $.CustomerId == 1 -> transform {CustomerId: $.CustomerId, name: $.name, age: 16};
$Customerne1 = $Customers -> filter $.CustomerId != 1;
hdfs($Customer1, $Customerne1);

Customer
CustomerID Name Age
1 Joe 26
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
INSERT
$Customer6 = [{CustomerId: 6, name: "Terry", age: 50}];
hdfs($Customers, $Customer6) -> write(file("Customers.dat"));
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50
DELETE
$Customerne1 = $Customers -> filter $.CustomerId != 1;
Customer
CustomerID Name Age
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

JSON and JAQL

JSON (JavaScript Object Notation) is growing in popularity as a data format. I find I am using it routinely when interfacing with sites such as FreeBase.com and at work when processing datasets with Hadoop.

One great strength of the JSON format is the rich set of tools available to work with the data. One example is JAQL, a JSON query language similar to SQL that works well with Hadoop. A great overview is found here: http://code.google.com/p/jaql/wiki/JaqlOverview. The strength of JAQL is that it allows users to write simple, extendable code to manipulate data stored in a non-proprietary, readable, and commonly used file format.

I have added a JAQL example in the data section and to the Data Management Quick Comparison.

Sympathy for the learner: Abuse #1

Abuse #1: Throwing data at the learner

As data mining becomes more popular in our risky times, the profession is invariably becoming sloppy. I see this in research papers, interactions with consultants, and vendor presentations. It is not technical knowledge that I see lacking but sympathy for the learner. Many in the data mining field, for lack of a better word, abuse their learners. For those of you who are not data miners, let me give a brief overview of what I mean by a learner. Suppose you have a collection of data and a problem (or concept) that you hope can be better understood through that data. The learner is whatever method or tool you use to learn (estimate) the concept you are trying to describe. The learner can be a linear regression, a neural network, a boosted tree, or even a human.

One way we abuse our learners is the growing tendency to throw data at the learner with little consideration for the data's presentation, in the hope that amidst the cloud of information the concept will magically become clear. Remember, a boosted tree knows nothing more than what is in the data. A boosted tree was not given an education or even the ability to read a book. Most learners have no common-sense knowledge and even forget what they learned in the previous model. Because of this, any common-sense knowledge about how the data works can provide a tremendous amount of information to the learner, sometimes even exceeding the initial information content of the data alone.

Example: say you are trying to model the optimal coverage for an automobile insurance policy. In the data, you have the number of drivers and the number of vehicles. Common sense tells you it is important whether there is a disparity between drivers and vehicles: an extra vehicle can go unused, and an extra driver can't drive. How can a learner 'see' this pattern? If it is a tree, it creates numerous splits (if 1 driver and 2 vehicles do this, if 2 drivers and 1 vehicle do this, ...). Essentially the learner is forced to construct a proxy for whether there are more vehicles than drivers. There are several problems with this: there is no guarantee the proxy will be constructed correctly, it makes the model needlessly complex, and it crowds out other patterns from being included in the tree. A better solution is to introduce a flag indicating more vehicles than drivers. Although this is a mere one-bit field, behind it is the complex reasoning about why the disparity between drivers and vehicles matters, and therefore it contains far more information than one bit. A simple one-bit field like this can make or break a model.
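As a minimal sketch of what this looks like in practice (the field names num_drivers and num_vehicles and the records themselves are hypothetical), the derived flag is a single line of feature engineering:

# Hypothetical policy records
policies = [
    {"policy_id": 1, "num_drivers": 2, "num_vehicles": 3},
    {"policy_id": 2, "num_drivers": 2, "num_vehicles": 2},
]

for policy in policies:
    # One-bit flag: are there more vehicles than drivers on the policy?
    policy["more_vehicles_than_drivers"] = int(policy["num_vehicles"] > policy["num_drivers"])

print(policies)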

The presentation of the data to the learner is just as important as the data itself. What can be obvious to us (more vehicles than drivers, international versus domestic transactions) can be pivotal in uncovering complex concepts. As a data miner, put yourself in the learner's shoes and you will find yourself giving more sympathy to the learner.