Sympathy for the Learner post-COVID

One thing is certain: COVID-19 has upended the traditional relationships within our data. Even if the economy returns to a more normal rhythm, how we interact with others, how we react to stimuli, and even our lifelong aspirations have changed forever. We will adjust and find a new normal, just as humanity has in times of crisis and great tragedy before, but we must recognize that this change will also have an impact on our new workplace partners, Artificial Intelligence (AI). Now that AI is becoming a ubiquitous part of everyone's daily lives, when your AI gets confused it may do real-world harm to an already struggling populace.

As we adjust and learn how to navigate this new world, we have access to a well-educated brain, creativity, and intuition that AI does not have. The world AI sees, the topology of its training data, has shifted drastically to show unfamiliar terrain and broken fundamental relationships. For example, what once served as an incentive to human behavior (a crowded beach) might now cause fear in a large segment of the population. This is a time to make sure you continuously measure your AI's performance, focus your data scientists on understanding the impact of COVID-19, and stay open-minded about the variety of AI methods available. You must have sympathy for your learner (AI), but you should also make sure the right one is being used to answer your questions and to solve your problems.

Post COVID-19, sympathy for the learner is not just required for high-quality models; it is necessary to assure even minimally performing models. The world has changed, quite dramatically. The learner you choose (or how you present the data to the learner) needs to be able to adjust rapidly to this ever-changing landscape in production. Severe concept drift, when the fundamental relationships between drivers and targets change, is now expected for all models. Key drivers of our economy, behavior, and relationships are rapidly evolving, but our historical data still sees the old world order. Even something as simple as toilet paper purchases is now distorted. A sudden peak in toilet paper purchases for a given household prior to COVID-19 might have meant a particularly good sale, more people moving into the household, or purchases for a small business. Now it probably means run-of-the-mill hoarding. Recommendation engines, risk models, and segmentation scores will collapse as this new data enters the system. Worse still, as data distorted by shelter-in-place orders enters the training data, the AI will try to use it to project into the future rather than understand that this was a transitory event. Care must be taken when presenting this data to your learner. The learner has not been following the news, does not understand the concept of a global pandemic, and has no ability to imagine the future after COVID-19. While this is frustrating, as long as you recognize the limitations of AI you can mitigate the impact of COVID-19 on your modeling data and, again, get back to looking into the future with AI.
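There is no single fix for this, but one lightweight way to monitor for it is to compare the recent distribution of each key feature against its pre-pandemic baseline. Below is a minimal sketch of that idea, assuming scipy and numpy are available; the arrays are simulated stand-ins for whatever feature matters to your model.

import numpy as np
from scipy.stats import ks_2samp

# Simulated stand-ins for a feature's pre-COVID baseline and its recent values
rng = np.random.default_rng(0)
baseline = rng.normal(loc=2.0, scale=1.0, size=5000)
recent = rng.normal(loc=3.5, scale=2.0, size=1000)

# A two-sample Kolmogorov-Smirnov test flags a shift in the distribution
stat, p_value = ks_2samp(baseline, recent)
if p_value < 0.01:
    print('Distribution shift detected: KS=%.3f, p=%.4f; review before retraining.' % (stat, p_value))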

 

COVID-19 and Homeschooling

This is a time when forgotten skills are awoken to help us survive, and hopefully thrive, under challenging circumstances. From cooking to yoga, we are all reaching back to our prior lives to reinvent what is normal. For us, how we teach our children at home is evolving as remote schooling continues. At first, remote classes were engaging, but soon both of my daughters yearned for more. Luckily, I had taught at the University of Oregon, where we explored student-led instruction with the students driving the goals for the course. In my course, Microeconomics, students chose topics and I guided them to use economic principles to answer their questions. Here are some examples:

  • How can we migrate to renewable energy faster?

  • Why are tuition and books so expensive?

This was more challenging than traditional methods, but the rewards were great. True, my ratings improved, but, more importantly, many students embraced economics. Fast forward to today: my kids want more. Teaching is a stressful endeavor. We feel pressured to make sure our kids meet goals, deadlines, and milestones. What I learned from my students, mentors, and my daughters' teachers is a simple truth: there are no set milestones. Milestones create walls when we need to create paths.

The first major project was HarmoniousWorlds.com, which I talked about in a prior post. This project, a website outlining the rules to a game they created, was spawned by asking them how to resolve a dispute over the game. They saw the need to define and write down all the rules. This gave them ownership and well-defined goals. They pushed their language and math skills to make sure they could explain the game to their friends. When crafting posts, they took photos and considered how best to present the material to a general audience. The trick was to make sure they were not blocked in achieving their vision without taking over and marginalizing them. While my eldest embraced the project and quickly took command, my youngest had difficulties once the initial rules were created.

In the prior post we created a curriculum that, after the initial setup, was not engaging for our youngest daughter, who was just beginning to read. So we focused on helping her achieve a long-time goal of hers: creating a graphic novel. She picked a topic (for example, an enchanted forest) and we created sentences, mostly using words we knew she understood, based on that topic. It was her job to read each sentence and create a drawing based on it. We framed this as working together to create a book rather than as an assignment. After several days she had a book of sentences with pictures that she drew. She then used this as reference material to create her own stories and even wrote her own sentences that we would have to draw for her. While these examples sound like organized plans, they were not. Both grew organically, with the children contributing as much to the plan as us. The truth is all kids want to learn. As parents, we can choose to guide them rather than educate them. Both paths are challenging. For us, working to help our daughters meet their goals made homeschooling more fun and brought us closer together.

Creating the New

Listen. Ideate. Create.

When you sit down and just listen, without doubt or fear, without judgment or pride, to your customers, to your coworkers, and to yourself, something wonderful happens. Something is born: a weave of ideas and aspirations that, if nurtured and guided, can become an innovation with the potential to change the world.

I have built a career helping others see the potential in themselves to do something new, something authentic. I have helped others create hundreds of patent applications and have over a dozen patents and trade secrets to my own name. And from these efforts I learned a simple truth: if you want to create something new you need to look beyond yourself, and to do that you need to listen.

Too often today we hero-worship those who embody arrogance in the name of innovation when, in reality, they are marketing constructs of people. To create, we must look past the media version of a creator and see the creators in our midst, the true drivers of innovation who quietly listen to needs and craft new solutions. There are endless possibilities when I just listen, but when hubris takes over, the world narrows to what is within my single mind.

Python: profile and Why to Avoid Loops

If you have coded in R or S-Plus, you should be familiar with avoiding loops and capitalizing on the tapply function whenever you can. These rules of thumb will serve you well when coding in Python. Many Python objects have built-in functions that accept iterators as parameters, and Python also has functions such as map, reduce, and filter that behave much like R's tapply. Below are two quick examples of how to use the profile module and the performance hit from using loops in Python.

import profile

# Build a list of 10,000 numeric strings to test with
myList1 = []
for i in range(10000):
    myList1.append(str(i))

def test_1():
    # Concatenate with an explicit loop
    s_out1 = ""
    for val in myList1:
        s_out1 = s_out1 + val + ','

def test_2():
    # Concatenate with str.join
    s_out2 = ",".join(myList1)

profile.run('test_1()')
profile.run('test_2()')

The join function lets you quickly concatenate the strings within an iterable without the hassle of looping through it. Below is an example of transforming all the elements within an iterable, first by looping through the object and then by using the map function.

# Build a list of the lowercase alphabet, repeated 1,000 times
myList2 = []
for i in range(26):
    myList2.append(chr(i + 97))

myList2 = myList2 * 1000

def test_3():
    # Call upper() on each element with an explicit loop
    s_out1 = ""
    for val in myList2:
        s_out1 = val.upper()

def test_4():
    # Transform every element with map
    s_out2 = map(str.upper, myList2)

profile.run('test_3()')
profile.run('test_4()')

Although Python is quick to pick up, there is, as with any language, good code and bad code.
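As an aside that is not covered in the examples above, the standard library's cProfile module is a lower-overhead drop-in replacement for profile, and it can sort the report for you:

import cProfile

# cProfile has the same interface as profile but much less overhead;
# sorting by cumulative time puts the hot spots at the top of the report.
cProfile.run('test_1()', sort='cumulative')
cProfile.run('test_2()', sort='cumulative')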

Sorting in Hadoop with Itertools

Previously, I gave a simple example of a Python map/reduce program. In this example, I will show you how to sort the data within a key using itertools, a powerful Python library focused on looping algorithms. This example is an extension of Michael Noll's detailed Hadoop example. The data in this example is comma-delimited.

AccountId, SequenceId,Amount,Product
0123,9,12.50,Product9
0123,5, 4.60,Product5
0123,7,54.00,Product7
0123,2,34.75,Product2
0123,3, 6.34,Product3
0123,8,14.50,Product8
0123,1,52.56,Product1
0123,4,78.45,Product4
0123,6,89.50,Product6
0321,1,2.12,Product1
0321,8,90.50,Product8
0321,3, 2.35,Product3
0321,4,56.25,Product4
0321,9,71.00,Product9
0321,2,24.75,Product2
0321,7,34.34,Product7
0321,6,34.23,Product6
0321,5,37.03,Product5       

Notice the data is sorted by AccountId, but within a given account it is not sorted by SequenceId. One important thing to remember is that Hadoop does not preserve the order of the data when passing it to the nodes that run the reduce step. The map step will group the data by the specified key, but even if the data is sorted correctly on the server, that order is not guaranteed to survive into the reduce. If your analysis requires the data to be sorted, not just grouped by key, this must be handled in the reduce step. Now suppose you need to know the ordered sequence of products purchased for a given account. For clarity, I numbered each product in accordance with its SequenceId. While this is achievable with the Python base library alone, thankfully there is a library with a number of functions that make it quite elegant to code: itertools. First, let's create a simple map step, mapper.py.

#!/usr/bin/env python
import sys

for line in sys.stdin:
    # Split the line by the delimiter
    aline = line.split(',')
    # Choose the first field (AccountId) as the map key
    key = aline[0]
    # Output the map key followed by a tab and the rest of the data
    print '%s\t%s' % (key, line.strip())

This defines the AccountId as the key. Remember, even if you included the SequenceId in the key it would not help you: the Hadoop server would then just spread each account's records across the nodes, and if you need to do an analysis by account, a given node would not see all of that account's transactions.
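Before looking at the reducer, here is a small standalone illustration, with made-up records of my own, of the itertools pattern it relies on: groupby only groups consecutive items, so you sort by the group key first and then, within each group, by the sequence.

from itertools import groupby
from operator import itemgetter

# Hypothetical records: (AccountId, SequenceId, Product)
rows = [('0123', 2, 'Product2'), ('0123', 1, 'Product1'), ('0321', 1, 'Product1')]

rows.sort(key=itemgetter(0))                    # groupby needs contiguous keys
for acct, grp in groupby(rows, key=itemgetter(0)):
    ordered = sorted(grp, key=itemgetter(1))    # order within the account
    print('%s: %s' % (acct, [r[2] for r in ordered]))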

Now, here is the reduce step, reducer.py:

#!/usr/bin/env python
import sys
from itertools import groupby
from operator  import itemgetter 

First, we define how to parse the input data stream. Notice how simple it is to handle the two different delimiters: the tab that defines the key for the Hadoop server and the comma that separates the data fields.

def read_mapper_output(file, sortBy, sepA, sepB):
    for line in file:
        key1, line1 = line.rstrip().split(sepA, 1)
        aline = line.split(sepB)
        # Create a second key to sort on within our data structure
        key2 = aline[sortBy]
        yield key1, key2, line1

The yield statement makes this function a generator. Each item it yields is a tuple of the account number, the sequence ID, and the entire data line.

def main(separator='\t'):
    # Define the iterator
    data = read_mapper_output(sys.stdin, 1, separator, ',')
    # Loop through the iterator grouped by the first key (AccountId).
    for grpKey, grpDt in groupby(data, itemgetter(0)):
        # Set the product list blank
        products = ''
        # Sort the grouped data by the second key (SequenceId).
        for key1, key2, line in sorted(grpDt, key=itemgetter(1)):
            aline = line.split(',')
            # Populate your product list
            products = products + ',' + aline[3]
        # Print one line per account once its group is complete
        print grpKey + products

Finally, we call the main function when the script is run directly.

 
if __name__ == "__main__":
    main() 

Run the following to test the code:

cat test.data | python mapper.py | python reducer.py

It will output:

0123,Product1,Product2,Product3,Product4,Product5,Product6,Product7,Product8,Product9
0321,Product1,Product2,Product3,Product4,Product5,Product6,Product7,Product8,Product9

Even complex problems have an elegant and fast solution with Python.

Simple Hadoop Streaming Example

Hadoop is a software framework that distributes workloads across many machines. With a moderate-sized data center, it can process huge amounts of data in a brief period of time (think weeks becoming hours). Because it is open-source, easy to use, and works, it is rapidly becoming a default tool in many analytic shops working with large datasets. In this post, I will give a very simple example of how to use Hadoop with Python. For more on Hadoop, see here or type "Hadoop overview" into your favorite search engine.

For this example, let’s use the US Census County Business Patterns dataset.
Location: http://www2.census.gov/econ2007/CBP_CSV
Download: zbp07detail.zip, then unzip the data.

This file provides the number of business establishments (EST) by NAICS code and ZIP code for the US. We are interested in the total number of establishments (EST) by ZIP code (zip).

The first step is to create the mapper. The map step creates the key the Hadoop system uses to keep parts of the data together for processing, and it is expected to output that key followed by a tab ('\t') and then the rest of the data required for processing. For this example, we will use Python. Create the file map.py and add the code below. The key, which tells the Hadoop system how to group the data, is defined as the value before the first tab ('\t').

#!/usr/bin/env python
import sys

for lineIn in sys.stdin:
    # The first five characters are the ZIP code
    zip = lineIn[0:5]
    # Note: the key is defined here, before the tab
    print zip + '\t' + lineIn

Now we create the reducer. The reducer step performs the analysis or data manipulation work. You only need a reducer step if the analysis requires the group-by-key behavior that the map step provides; in this example, the group is a ZIP code. Create the file reducer.py and add the code below. In this step, you need to make sure you split the piped stream into the key and the data line.

#!/usr/bin/env python
import sys

counts = {}
for lineIn in sys.stdin:
    # Note: we separate the key from the data line here
    key, line = lineIn.rsplit('\t', 1)
    aLine = line.split(',')
    # Accumulate the establishment counts by ZIP code
    counts[key] = counts.get(key, 0) + int(aLine[3])

# Output the totals once all input has been read
for k, v in counts.items():
    print k + ',' + str(v)

To test the code, just run the map and reduce steps alone at the shell. If it is a large file, make sure to use a sub-sample.

cat zbp07detail.txt | python map.py > zbp07detail.map.txt
cat zbp07detail.map.txt | python reducer.py > Results.test.txt

Now just run the following commands at the shell prompt:

hadoop fs -put zbp07detail.txt /user/mydir/
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-XXXX-streaming.jar \
    -input /user/mydir/zbp07detail.txt \
    -output /user/mydir/out \
    -mapper ./map.py \
    -reducer ./reducer.py \
    -file ./map.py \
    -file ./reducer.py
hadoop fs -cat /user/mydir/out/part-* > Results.txt

That is all there is to it. The file Results.txt will have the sums by ZIP code.

Automated Quality Control Charts

This is an example of how to make monitoring graphs that are more interesting and therefore more likely to be paid attention to. The goal for such graphs is for boring and uninteresting to be OK. This is surprisingly easy. When the data has a lot of structure (a lot of information), exciting and interesting patterns emerge; when it is random, the graphs tend to look less interesting. That is exactly what we want here: residuals should have no structure; they should be random.

The data is GDP, wages, and employment for the US from 1949 to 2009, via the St. Louis Federal Reserve. This is a great place for data on the economy.

The model's goal is to predict next quarter's employment using the previous quarter's wage and GDP data, hopefully creating colorful errors in the process.

The data pull code can be found here.
And the code for generating the models and graphs here.

First, we run a simple linear regression leaving out the last 50 observations.

mdl.v1 <- lm(EMP ~ L1_GDP + L1_WAGES, data = df.Mdl[1:SmpEnd, ])

If you looked at the summary, it would be obvious that this model has issues; the R-squared is near 1. But let's ignore that for now.
First, we generate the out-of-sample predictions.

v1.pred

Next, we normalize the residuals using the within-sample variance (this is the y-axis).

v1.res

Now we get a sense of the magnitude of the estimates, which is used to scale the point sizes.

v1.maxpred

Next, we create the ratio of the dependent variable over its lagged value to drive the color. This is a good measure of potential unit roots and concept drift.

v1.tRateOfChange

Set the plot options.

op

Now loop through the out-of-sample estimates and plot the values.

loops

There is structure (patterns) in the residuals. This is bad; residuals should appear random in plots. First, the residuals are trending up over time, indicating heteroscedasticity. Also note that larger values (the size of the circles) have larger errors. Second, the color is uniform and high on the color chart, showing a strong correlation between the dependent variable and its lagged value. This indicates a unit root.

Let’s take the log of the variables to deal with the heteroscedasticity and correct for the unit root by taking the first difference.
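Concretely, in my notation rather than the post's, the new dependent variable is the first difference of the log, which is approximately the quarter-over-quarter growth rate:

EMP_log_1D(t) = log(EMP(t)) - log(EMP(t-1))

The lagged GDP and wage regressors are transformed the same way.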

df.Mdl.2 <- df.Mdl[2:OutSmpEnd, ]
df.Mdl.2$EMP_log_1D

Now re-estimate the model.

 
mdl.v2 <- lm(EMP_log_1D ~ L1_GDP_log_1D + L1_WAGES_log_1D, data = df.Mdl.2[1:SmpEnd, ])

Now that is better; not perfect, but better. The residuals are centered near zero and show an inconsistent relationship between the dependent variable and its lagged value. This is not the best model, but hopefully it shows how more creative charts can aid in model maintenance and development.

Note: Why did I only use the range 24-200 from the rainbow palette? It is easier to see. Below is a plot of all the colors in the rainbow palette. I find the lower end of the spectrum hard to distinguish from the high end, so I skip it. Also note how the colors do not pop out as much against a white background; black backgrounds are a quick way to make the colors stand out.

Data Management Quick Comparison 3

The following article provides a comparison between R/S-Plus and Python for basic data management and manipulation. 

Load Data

R/S-Plus

mydata <- read.csv("c:/customer.csv") 
mydata_lkup <- read.csv("c:/purchaseorder.csv") 

Python

 
import csv 
custDS=[] 
for row in csv.reader(open('c:/customer.csv'), delimiter=',', quotechar='"'): 
    custDS.append(row) 
poDS=[] 
for row in csv.reader(open('c:/purchaseorder.csv'), delimiter=',', quotechar='"'): 
    poDS.append(row)

Selecting- All Fields

R/S-Plus

 mydata 

Python

print(custDS)

One Field

R/S-Plus

  
 mydata$col1 

Python

  
for x in custDS:
    print x[1] 

Subset

R/S-Plus

 
subset(mydata, col2 == 18 )

Python

[x for x in custDS if x[2]=='18']

Sorting

R/S-Plus

 
mydata [order(mydata [,1]),] 

Python

 
sorted(custDS, key=lambda customer: customer[2])

Join

R/S-Plus

 
merge(mydata, mydata_lkup, by.x = "col1", by.y = "col2", all = TRUE)

Python

 
poDt = {}
for row in csv.reader(open('c:/purchaseorder.csv'), delimiter=',', quotechar='"'):
    poDt[row[0]] = row[1:4]

# Extend each customer row with its matching purchase order fields
for x in custDS:
    if x[0] in poDt:
        x.extend(poDt[x[0]])
        print(x)

 

Sample

R/S-Plus

head(mydata , 10)

Python

for i in [0,1,2]:
    print(poDS[i])

Aggregate Analysis

R/S-Plus

xtabs( ~ col2, mydata_lkup) 

Python

poCounts = {}
for row in poDS:
    poCounts[row[1]] = poCounts.get(row[1], 0) + 1
print(poCounts)

Unique

R/S-Plus

unique(mydata_lkup$col2)

Python

uniqpoList = []
[uniqpoList.append(i[1]) for i in poDS if not uniqpoList.count(i[1])]
uniqpoList
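As a side note that is not part of the original comparison, the same distinct list can be built more idiomatically with a set, at the cost of losing the original order:

# Collect distinct values of the second column; a set does not preserve order.
uniqpoList = list(set(row[1] for row in poDS))
print(uniqpoList)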

Data Management Quick Comparison 2

The following article provides a comparison between BASH and JAQL for basic data management and manipulation. 

 

Selecting- All Fields

BASH

more mydata.dat

JAQL

read(hdfs("Customers.dat"));

One Field

BASH

cut -c 13-22 mydata.dat 

JAQL

$Customers = read("Customers.dat");
$Customers -> transform { name: $.name };

Subset

BASH

less mydata.dat | awk '{if (substr($0,13,10) == 2786586641) print $0}'

JAQL

$Customers = read("Customers.dat");
$Customers -> filter $.age == 18 ->  transform { name: $.name };

Sorting

BASH

sort -n -t +2 mydata.dat

JAQL

$Customers = read("Customers.dat");
$Customers -> sort by [$.age] -> transform { name: $.name };

Join

BASH

join -1 2 -2 1 mydata_2.dat mydata_lkup.dat | less
or (if no unmatched values and both files are sorted)
paste mydata_2.dat mydata_lkup.dat

JAQL

$CustomersPurchases = join $Customers, $PurchaseOrder where $Customers.CustomerId
== $PurchaseOrder.CustomerId into {$Customers.name, $PurchaseOrder.*};

Sample

BASH

more mydata.dat| head -10

JAQL

$Customers = read("Customers.dat");
$Customers -> top(2); 

Aggregate Analysis

BASH

awk 'BEGIN { FS=OFS=SUBSEP="" }{arr[$2,$3]++ }END {for (i in arr) print i,arr[i]}' mydata_lkup.dat

JAQL

$CustomersPurchases -> group by $Customers_group = $.CustomerId into
{$Customers_group, total: count($[*].POID)};

Unique

BASH

less mydata_lkup.dat|cut -c 12|uniq

JAQL

$CustomersPurchases -> group by $Customers_group = $.CustomerId into
{$Customers_group};

JSON

JSON (JavaScript Object Notation) is a data format that is non-proprietary, human-readable, easy to parse, simple to generate, and widely adopted. You can read more about JSON at the project website: http://www.json.org/. Some of you might be saying: wait, there is another data format with the same properties, namely XML, so why bother with JSON? For me there are two reasons why JSON is superior to XML for data. First, it is designed for data, so it handles things such as arrays much more elegantly than XML. There are flavors of XML better suited to data than the generic one, but on the whole XML is trying to provide a flexible structure for describing complex metadata, not to be a simple format for data. Second, there is only one type of JSON, whereas XML can come in endless varieties. XML's flexibility is both its asset and its liability. By being flexible, XML can be used in a variety of ways, but because of this, confusion often arises about exactly how to consume and generate the data, one of the things XML was supposed to solve. When transferring data I do not like ambiguity; I want strict conventions.

A great article about XML versus JSON, with good comments, can be found here: http://ajaxian.com/archives/json-vs-xml-the-debate.

But I use JSON not because it is better for data (there is always something better) but mainly because a lot of the data I want to consume uses the JSON format, and you can load it directly into R: http://cran.r-project.org/web/packages/rjson/index.html

Below is an example of a JSON formatted dataset:

[{
"CustomerId" : 1,
"age" : 23,
"name" : "Joe"
},{
"CustomerId" : 2,
"age" : 45,
"name" : "Mika"
},{
"CustomerId" : 3,
"age" : 34,
"name" : "Lin"
}
]

You can see that it is human-readable. Curly braces {} enclose each row, and a field is expressed as "ColumnName" : value.
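As a quick illustration of my own (assuming the records above are saved in a hypothetical file named customers.json), Python's standard json module parses this directly into a list of dictionaries:

import json

# Load the example records above (hypothetical file name)
with open('customers.json') as f:
    customers = json.load(f)

# Each row becomes a dict, so fields are accessed by name
for c in customers:
    print('%s is %d years old' % (c['name'], c['age']))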