Unleashing Janus eBook

Posted in AW Blog on August 2nd, 2010 by Ted Harris – Be the first to comment

I am releasing a free eBook version of my science fiction book, Unleashing Janus. It is the story of a group of hackers trying to build the first conscious machine. While the novel is not textbook data mining, hopefully it is entertaining. You can read more and download a copy here.

Simple Hadoop Streaming Example

Posted in Data on June 6th, 2010 by Ted Harris – Be the first to comment

Hadoop is a software framework that distributes workloads across many machines. With a moderately sized data center it can process huge amounts of data in a brief period of time (think weeks reduced to hours). Because it is open source, easy to use, and works, it is rapidly becoming a default tool in many analytic shops working with large datasets. In this post I will give a very simple example of how to use Hadoop with Python. For more on Hadoop see here or type hadoop overview in your favorite search engine.

For this example let's use the US Census County Business Patterns dataset.
Location: http://www2.census.gov/econ2007/CBP_CSV
Download: zbp07detail.zip then unzip the data.

This file provides the number of business establishments (EST) by NAICS code and zip code for the US. We are interested in the total number of establishments (EST) by zip code (zip).

The first step is to create the mapper. The map step creates the key the Hadoop system uses to keep parts of the data together for processing. It is expected to output the key, followed by a tab ('\t'), then the rest of the data required for processing. The key, which tells the Hadoop system how to group the data, is defined as the value before the first tab ('\t'). For this example we will use Python. Create the file map.py and add the code below.

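#!/usr/bin/env python
import sys
for lineIn in sys.stdin:
       # Drop the trailing newline so print does not emit blank lines
       lineIn = lineIn.rstrip('\n')
       if not lineIn:
              continue
#       Note: Key is defined here (the first five characters are the zip code)
       zipCode = lineIn[0:5]
       print(zipCode + '\t' + lineIn)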

Now we create the reducer. The reduce step performs the analysis or data manipulation work. You only need a reduce step if the analysis requires grouping by the key that the map step provides; in this example that key is the zip code. Create the file reducer.py and add the code below. In this step you need to split each line of the piped stream into the key and the data line.

#!/usr/bin/env python
import sys
counts = {}
for lineIn in sys.stdin:
#       Note: We separate the key here (everything before the first tab)
       key, line = lineIn.rstrip('\n').split('\t', 1)
       aLine = line.split(',')
       # Add this row's count to the running total for its zip code
       counts[key] = counts.get(key, 0) + int(aLine[3])
for k, v in counts.items():
       print(k + ',' + str(v))
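
Because Hadoop Streaming sorts the mapper output by key before handing it to the reducer, the totals can also be computed one key at a time instead of keeping every zip code in a dictionary. The following is only a minimal sketch of that alternative, not the code used in this post. The function name sum_by_key and the sample lines are hypothetical, and it keeps the assumption made above that the count sits at index 3 of the comma-separated data line.

```python
from itertools import groupby


def sum_by_key(lines, column=3):
    """Yield (key, total) pairs from key-sorted 'key<TAB>csv line' input."""
    # Split each non-blank line into the key (before the first tab) and the data.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges adjacent items, so the input must already be sorted
    # by key, which is the order a streaming reducer receives it in.
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(int(data.split(",")[column]) for _, data in group)


# Hypothetical mapper output, already sorted by zip code.
sample = [
    "10001\t10001,44,0,12\n",
    "10001\t10001,72,0,30\n",
    "94105\t94105,54,0,7\n",
]

for key, total in sum_by_key(sample):
    print(key + "," + str(total))
```

In a real reducer the sample list would be replaced by sys.stdin. The trade-off is that this version depends on sorted input, while the dictionary version above does not.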

To test the code, just run the map and reduce steps on their own. If it is a large file, make sure to use a subsample.

cat zbp07detail.txt | python map.py > zbp07detail.map.txt
cat zbp07detail.map.txt | python reducer.py > Results.test.txt

Now just run the following commands at the shell prompt:

hadoop fs -put /zbp07detail.txt /user/mydir/


hadoop jar HadoopHome/contrib/streaming/hadoop-XXXX-streaming.jar \
-input /user/mydir/zbp07detail.txt \
-output /user/mydir/out \
-mapper ./map.py \
-reducer ./reducer.py \
-file ./map.py \
-file ./reducer.py


hadoop fs -cat /user/mydir/out/part-* > Results.txt

That is all there is to it. The file Results.txt will have the sums by zip.

Artistic Infographics for Model Analysis

Posted in AW Blog on May 1st, 2010 by Ted Harris – Be the first to comment

The concept of artistic infographics, making the artistic element of a graph as important as its information content, has been around for a while. Two good sites on the topic of infographics are FlowingData.com and Lee Byron. I used to have mixed feelings about infographics, coming from an academic background where people who used ordinary graphs were called graphists. What they would call someone who used artistic infographics is best not written. I felt most of the data visualizations were beautiful and clever but ultimately impractical for the daily data miner.

But as my workload increased I began to see the beauty of glancing at a graph and knowing right away whether the model worked. When faced with literally hundreds of production models, many of which are misbehaving (some of them in very subtle ways), you have to rely on simple solutions. Automated QC systems are a quick solution, but they typically only uncover previously known issues. Automated systems cannot learn and adapt easily. The beauty of a visualization is that once seen it will be processed by the most flexible neural network on Earth, the human mind. The trick is to present the data properly to this learner and keep its attention.

One strength of artistic infographics is that they are entertaining. Motivation is one of the scarcest commodities in the private sector. Making graphics enjoyable and fun is a good start to getting employees to pay attention. Remember, if the graph is used in a QC system for a scoring product, someone may have to look at your graph every day, seven days a week, for the life of the product. Also, most good data miners have a strong artistic streak that should not be ignored.

Here is a simple example of how to make analysis graphs more entertaining. The code below is written in R. R is a good choice for static graphs; for dynamic graphs I prefer Flash. A good open source package can be found here: http://code.google.com/p/ofcgwt/

Colorful QC Charts

Posted in Predictive Modeling on May 1st, 2010 by Ted Harris – 1 Comment

This is an example of how to make monitoring graphs that are more interesting and therefore more likely to be paid attention to. The goal for such graphs is to make boring and uninteresting mean OK. This is surprisingly easy. When the data has lots of structure (lots of information), exciting and interesting patterns emerge. When it is random, the graphs tend to look less interesting. This is exactly what we want: residuals should have no structure; they should be random.

The data is GDP, wages and employment for the US from 1949 to 2009 via the St. Louis Federal Reserve. This is a great place for data on the economy.

The model’s goal is to predict next quarter’s employment using the previous quarter’s wage and GDP data, hopefully creating colorful errors in the process.

The data pull code can be found here.

And the code for generating the models and graphs here.

First we run a simple linear regression, leaving out the last 50 observations.


mdl.v1 <-lm(EMP ~ L1_GDP + L1_WAGES ,data = df.Mdl[1:SmpEnd, ])




If you looked at the summary it would have been obvious that this model has issues. The R-squared is near 1. But let's ignore that for now.



First we generate the out of sample predictions.


v1.pred <- predict(mdl.v1 , df.Mdl[OutSmpStart:OutSmpEnd,])
v1.pred


Next we normalize the residuals using the within-sample variance of the dependent variable (this is the Y value in the plot).


v1.res <- v1.pred- df.Mdl[OutSmpStart:OutSmpEnd,]$EMP

v1.var <- var(df.Mdl[1:SmpEnd, ]$EMP )

v1.NormRes <- v1.res/sqrt(v1.var )
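
For readers who do not use R, here is a rough Python sketch of the same normalization, written under the same choice as the R code above: the residuals are scaled by the standard deviation of the in-sample dependent variable. The function name and the numbers are made up for illustration only.

```python
from math import sqrt
from statistics import variance


def normalized_residuals(predicted, actual, in_sample):
    """Scale out-of-sample residuals by the in-sample standard deviation."""
    # statistics.variance is the sample variance (n - 1 denominator), like R's var().
    scale = sqrt(variance(in_sample))
    return [(p - a) / scale for p, a in zip(predicted, actual)]


# Hypothetical values, not taken from the employment data.
in_sample = [1.0, 3.0, 5.0]   # sample variance is 4, standard deviation is 2
predicted = [10.0, 11.0]
actual = [9.0, 13.0]

norm_res = normalized_residuals(predicted, actual, in_sample)
print(norm_res)
```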



Now we get a sense of the magnitude of the estimates, which is used for the size of the points.


v1.maxpred <- max(v1.pred^2)

v1.predAdj <- v1.pred^2/v1.maxpred



Next we create the ratio of the lagged dependent variable to the dependent variable to indicate color. This is a good measure of potential unit roots and concept drift.


v1.tRateOfChange <- abs(df.Mdl[(OutSmpStart-1):(OutSmpEnd-1),]$EMP/ df.Mdl[OutSmpStart:(OutSmpEnd),]$EMP)

v1.maxRateOfChange <- max(v1.tRateOfChange)

v1.RateOfChange <- v1.tRateOfChange/v1.maxRateOfChange
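
The ratio-to-color idea can be sketched in Python as well. This is an illustration under assumptions, not a port of the full script: the function name is hypothetical, the series is made up, and the constants 35 and 165 are the ones used in the plotting step later in this post.

```python
def palette_positions(series, low=35, span=165):
    """Map |lagged / current| ratios onto positions low..low+span of a palette."""
    # Ratio of each value's predecessor to the value itself.
    ratios = [abs(prev / curr) for prev, curr in zip(series, series[1:])]
    # Scale by the largest ratio so the results fall in (0, 1].
    top = max(ratios)
    return [low + (r / top) * span for r in ratios]


# Hypothetical series; ratios are 0.5 and 0.25.
positions = palette_positions([1.0, 2.0, 8.0])
print(positions)
```

A series whose ratios are all close to each other lands in a narrow band of the palette, which is why a uniform color in the plot hints at a strong link between a value and its lag.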


Set the plot options.


op <- par(bg ="black", col="white", col.lab ="white" ,col.axis ="white" , col.main ="white" ,col.sub ="white" )

plot(c(1996, 2009), c(-50, 50), type = "n", xlab="Year", ylab="Normalized Residuals", main = "Emp = L1 GDP + L1 Wages", sub="Color: Ratio Dep vs Lagged Dep Radius: Normalized Estimate" )
abline( h=0, col = "white")

palette(rainbow(200))



Now loop through the out of sample estimates and plot the values.

loops <- OutSmpEnd-OutSmpStart

## loop through the data
for (i in 1:loops)
{
  ptx = 1996 + i/4                         # x position: quarters since 1996
  pty = v1.NormRes[i]*100                  # y position: normalized residual
  ptr = v1.predAdj[i]*2                    # point size: scaled estimate
  ptcolor = 35 + v1.RateOfChange[i]*165    # color: position in the rainbow palette
  points(ptx, pty, pch = 19, col = ptcolor, bg = ptcolor, cex = ptr)
}





There is structure (patterns) in the residuals. This is bad. Residuals should appear random in plots. First, the residuals are trending up over time, indicating heteroscedasticity. Also note that large values (the size of the circles) have larger errors. Second, the color is uniform and high on the color chart, showing a strong correlation between the dependent variable and its lagged value. This indicates a unit root.

Let's take the log of the variables to deal with the heteroscedasticity and correct for the unit root by taking the first difference.




df.Mdl.2<-df.Mdl[2:OutSmpEnd,]

df.Mdl.2$EMP_log_1D <- log(df.Mdl[2:OutSmpEnd,]$EMP ) - log(df.Mdl[1:(OutSmpEnd-1),]$EMP)

df.Mdl.2$L1_GDP_log_1D <- log(df.Mdl[2:OutSmpEnd,]$L1_GDP) - log(df.Mdl[1:(OutSmpEnd-1),]$L1_GDP)

df.Mdl.2$L1_WAGES_log_1D<- log(df.Mdl[2:OutSmpEnd,]$L1_WAGES) - log(df.Mdl[1:(OutSmpEnd-1),]$L1_WAGES)
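
As a small illustration of what the log first difference does, here is a Python sketch with a made-up series. The function name is hypothetical. The point it shows is that a series growing at a constant rate turns into a roughly constant series after the transformation, which removes the trend.

```python
from math import log


def log_first_difference(series):
    """Return log(x[t]) - log(x[t-1]) for each consecutive pair of values."""
    # The result is one element shorter than the input series.
    return [log(curr) - log(prev) for prev, curr in zip(series, series[1:])]


# Hypothetical level series that grows by 10% each period.
levels = [100.0, 110.0, 121.0]
diffs = log_first_difference(levels)
print(diffs)  # both values are close to log(1.1)
```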



Now re-estimate the model.


mdl.v2 <-lm(EMP_log_1D ~ L1_GDP_log_1D + L1_WAGES_log_1D ,data = df.Mdl.2[1:SmpEnd, ])





Now that is better; not perfect, but better. The residuals are centered near zero and show an inconsistent relationship between the dependent variable and its lagged value. This is not the best model, but hopefully this shows how more creative charts can aid in model maintenance and development.

Note: Why did I only use the range 24-200 from the rainbow palette? It is easier to see. Below is a plot of all colors using the rainbow palette. I find the lower end of the spectrum hard to distinguish from the high end, so I skip it. Also note how the colors do not pop out as much with the white background. Black backgrounds are a quick way to make the colors stand out.





Data Management Quick Comparison 3

Posted in Data on May 1st, 2010 by Ted Harris – Be the first to comment

Load Data

R/S-Plus:
mydata <- read.csv("c:/customer.csv")
mydata_lkup <- read.csv("c:/purchaseorder.csv")

Python:
import csv
custDS = []
for row in csv.reader(open('c:/customer.csv'), delimiter=',', quotechar='"'):
        custDS.append(row)
poDS = []
for row in csv.reader(open('c:/purchaseorder.csv'), delimiter=',', quotechar='"'):
        poDS.append(row)

Selecting - All Fields

R/S-Plus:
mydata

Python:
print(custDS)

Selecting - One Field

R/S-Plus:
mydata$col1

Python:
for x in custDS:
        print(x[1])

Selecting - Subset

R/S-Plus:
subset(mydata, col2 == 18)

Python:
[x for x in custDS if x[2] == '18']

Sorting

R/S-Plus:
mydata[order(mydata[,1]),]

Python:
sorted(custDS, key=lambda customer: customer[2])

Join

R/S-Plus:
merge(mydata_2, mydata_lkup, by.x = "col1", by.y = "col2", all = TRUE)

Python:
# Build a lookup keyed on the first purchase order column
poDt = {}
for row in csv.reader(open('c:/purchaseorder.csv'), delimiter=',', quotechar='"'):
        poDt[row[0]] = row[1:4]
# Keep customers that have a matching purchase order
dsOut = []
for x in custDS:
        if x[0] in poDt:
                x.extend(poDt[x[0]])
                dsOut.append(x)
                print(x)

Sample

R/S-Plus:
head(mydata, 10)

Python:
for i in [0, 1, 2]:
        print(poDS[i])

Aggregate Analysis

R/S-Plus:
xtabs(~ col2, mydata_lkup)

Python:
poCounts = {}
for row in poDS:
        poCounts[row[1]] = poCounts.get(row[1], 0) + 1
poCounts

Unique

R/S-Plus:
unique(mydata_lkup$col2)

Python:
uniqpoList = []
for i in poDS:
        if i[1] not in uniqpoList:
                uniqpoList.append(i[1])
uniqpoList
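
The join row above can also be written with csv.DictReader, which lets the code refer to columns by name rather than by position. The sketch below is an illustration only: the in-memory CSV text, the column names (CustomerId, name, age, POID, amount) and the values are hypothetical stand-ins for customer.csv and purchaseorder.csv, and it assumes at most one purchase order per customer. Like the Python join above, it is an inner join.

```python
import csv
import io

# Hypothetical in-memory stand-ins for customer.csv and purchaseorder.csv.
customer_csv = "CustomerId,name,age\n1,Joe,23\n2,Mika,45\n3,Lin,34\n"
purchase_csv = "CustomerId,POID,amount\n1,100,9.99\n3,101,25.00\n"

# Build a lookup table keyed on the join column.
orders = {
    row["CustomerId"]: row
    for row in csv.DictReader(io.StringIO(purchase_csv))
}

# Inner join: keep only customers that have a matching purchase order.
joined = []
for cust in csv.DictReader(io.StringIO(customer_csv)):
    order = orders.get(cust["CustomerId"])
    if order is not None:
        joined.append({**cust, **order})

for row in joined:
    print(row)
```

With real files, io.StringIO(...) would be replaced by an open file handle. If a customer can have several purchase orders, the lookup would need to hold a list of rows per key.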

Data Management Quick Comparison 2

Posted in Data on April 19th, 2010 by Ted Harris – Be the first to comment

Selecting - All Fields

Unix:
more mydata.dat

JAQL:
read(hdfs( Customers.dat ));

Selecting - One Field

Unix:
cut -c 13-22 mydata.dat

JAQL:
$Customers = read( Customers.dat );
$Customers
-> transform { name: $.name };

Selecting - Subset

Unix:
less mydata.dat | awk '{if (substr($0,13,10) == 2786586641) print $0}'

JAQL:
$Customers = read( Customers.dat );
$Customers
-> filter $.age == 18
-> transform { name: $.name };

Sorting

Unix:
sort -n -t +2 mydata.dat

JAQL:
$Customers = read( Customers.dat );
$Customers
-> sort by [$.age]
-> transform { name: $.name };

Join

Unix:
join -1 2 -2 1 mydata_2.dat mydata_lkup.dat | less
or (if no unmatched values and both files are sorted)
paste mydata_2.dat mydata_lkup.dat

JAQL:
$CustomersPurchases = join $Customers, $PurchaseOrder where $Customers.CustomerId
== $PurchaseOrder.CustomerId into {$Customers.name, $PurchaseOrder.*} ;

Sample

Unix:
more mydata.dat | head -10

JAQL:
$Customers = read( Customers.dat );
$Customers -> top(2);

Aggregate Analysis

Unix:
awk BEGIN { FS=OFS=SUBSEP= }{arr[$2,$3]++ }END {for (i in arr) print i,arr[i]} mydata_lkup.dat

JAQL:
$CustomersPurchases -> group by $Customers_group = $.CustomerId into
{$Customers_group, total: count($[*].POID)};

Unique

Unix:
less mydata_lkup.dat | cut -c 12 | uniq

JAQL:
$CustomersPurchases -> group by $Customers_group = $.CustomerId into
{$Customers_group };

New Poll, Links and Twitter Bots

Posted in AW Blog on March 23rd, 2010 by Ted Harris – Be the first to comment

We have added a new poll, ‘Plans for Hadoop’, and several new links. If you have a suggested link, please submit it. The form is at the bottom of the links page.

I am still working on winding down some of my older sites, so hopefully soon I will have more spare time to devote to AnalyticalWay.com and MoodRelate.com. For now I have added two Twitter bots, one for each site. The Twitter.com/AnalyticalWay bot will re-post old articles and links Monday through Friday. The Twitter.com/MoodRelate bot will tweet each day the cities with the most and the fewest tweets about being happy, sad, bored, in love and sick. I am testing several other bots and more should come in the next few months.

Here is a good link showing how to post to Twitter via PHP: www.website-ideas.co.uk.

Finished Migration

Posted in AW Blog on February 24th, 2010 by Ted Harris – Be the first to comment

We have completed the first phase of migration to WordPress. It was amazingly easy. If your Joomla and WordPress databases are on the same server, just run a few select and insert commands and you will be up in no time. A detailed description of how to migrate to WP can be found here.

Although comments from the prior site were not transferred, comments have been re-opened on all posts. Older posts that have formatting issues will be updated eventually. If you find a formatting or other error in a post, please leave a comment.

We have also extended AnalyticalWay’s social media connections. AnalyticalWay now has a YouTube channel that will promote videos we find useful or fun. This replaces our lecture links section. If you have a YouTube channel, let us know. The bulletin board will now be on our new FaceBook page. It seems to make sense to have the bulletin board integrated fully with FaceBook. Older posts will not be migrated. We have added a new Twitter account dedicated to AnalyticalWay.com. Ted will continue to post to the original Twitter account about interesting articles he finds around the web. The new Twitter account will only tweet about new posts or changes here at AnalyticalWay.com. We may have additional writers, so we thought it was best to have separate Twitter accounts.

We have improved the links and RSS feeds pages. If you have an RSS feed or site you think should be listed, scroll to the bottom of the links section and submit your link there. Please note that not all RSS feeds work with the plugin we are using. In addition, we have added a Calendar page. If you have any upcoming events you want posted, please let us know.

Just Migrated From Joomla to WordPress

Posted in AW Blog on February 13th, 2010 by Ted Harris – Be the first to comment

I just migrated the site from Joomla. The majority of the content has been migrated, but it will take me a while to finish setting up the categories.

JSON

Posted in Data on December 31st, 2009 by Ted Harris – Be the first to comment

JSON (JavaScript Object Notation) is a data format that is non-proprietary, human readable, easy to parse, simple to generate and widely adopted. You can read more about JSON at the project website (http://www.json.org/). Some of you might be saying, wait, there is another data format with the same properties, namely XML, so why bother with JSON? For me there are two reasons JSON is superior to XML for data. First, it is designed for data, so it handles things such as arrays much more elegantly than XML. There are versions of XML better suited for data than the generic flavor, but on the whole XML is trying to provide a flexible structure to describe complex metadata, not to be a simple format for data. Second, there is only one type of JSON, whereas XML can come in endless variety. XML's flexibility is both its asset and its liability. By being flexible XML can be used in a variety of ways, but because of this, confusion often arises about exactly how to consume and generate data, which is one of the things XML was supposed to solve. When transferring data I do not like ambiguity; I want strict conventions.

A great article about XML versus JSON, with good comments, is found here: http://ajaxian.com/archives/json-vs-xml-the-debate.

But I use JSON not because it is better for data (there is always something better) but mainly because a lot of the data I want to consume uses the JSON format, and you can load it directly into R: http://cran.r-project.org/web/packages/rjson/index.html

Below is an example of a JSON formatted dataset:
[{
"CustomerId" : 1,
"age" : 23,
"name" : "Joe"
},{
"CustomerId" : 2,
"age" : 45,
"name" : "Mika"
},{
"CustomerId" : 3,
"age" : 34,
"name" : "Lin"
},{
"CustomerId" : 4,
"age" : 56,
"name" : "Sara"
},{
"CustomerId" : 5,
"age" : 18,
"name" : "Susan"
}]

You can see that it is human readable. Each {} pair encloses a row, and each column within a row is expressed as "Colname" : value.
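
The post mentions loading JSON into R. As a comparison, here is a minimal Python sketch that parses the same customer dataset with the standard json module. The dataset is written with quoted keys and strings, as JSON requires, and the filter on age is just an arbitrary example of working with the parsed rows.

```python
import json

# The customer dataset from the example above, as JSON text.
text = """
[{"CustomerId": 1, "age": 23, "name": "Joe"},
 {"CustomerId": 2, "age": 45, "name": "Mika"},
 {"CustomerId": 3, "age": 34, "name": "Lin"},
 {"CustomerId": 4, "age": 56, "name": "Sara"},
 {"CustomerId": 5, "age": 18, "name": "Susan"}]
"""

# json.loads turns the array of objects into a list of dictionaries.
customers = json.loads(text)

# Arbitrary example: names of customers older than 30.
over_30 = [c["name"] for c in customers if c["age"] > 30]
print(over_30)
```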

One quirk of JSON is that a / in text data may be escaped as \/. The format allows this escape but does not require it, and some encoders apply it by default. This is similar to escaping in regex and can make URLs look a bit funky: http://www.json.org may be encoded as http:\/\/www.json.org
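
A short Python sketch of how the slash escape behaves in practice, using only the standard json module. Whether you see \/ in a file depends on the encoder that wrote it. A decoder turns it back into a plain slash either way.

```python
import json

# JSON text in which the slashes have been escaped as \/ by some encoder.
escaped = '"http:\\/\\/www.json.org"'

# Decoding restores the plain slashes.
decoded = json.loads(escaped)
print(decoded)

# Python's own encoder leaves "/" unescaped.
encoded = json.dumps("http://www.json.org")
print(encoded)
```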
