Sympathy for the Learner post-COVID

One thing is certain: COVID-19 has upended traditional relationships within our data. Even if the economy returns to a more normal rhythm, how we interact with others, react to stimuli, and even our lifelong aspirations have changed forever. We will adjust and find a new normal, just as humanity has in times of crisis and great tragedy before, but we must recognize that this change will also have an impact on our new workplace partners, Artificial Intelligence (AI). And now that AI is becoming a ubiquitous part of everyone's daily lives, when your AI gets confused, it may do real-world harm to an already struggling populace.

As we adjust and learn how to navigate this new world, we have access to a well-educated brain, creativity, and intuition that AI does not have. The world AI sees, the topology of its training data, has shifted drastically to show unfamiliar terrain and broken fundamental relationships. For example, what served as an incentive to human behavior (a crowded beach) might now cause fear in a large segment of the population. This is the time to make sure you continuously measure your AI's performance, focus your data scientists on understanding the impact of COVID-19, and be open-minded about a variety of AI methods. You must have sympathy for your learner (AI), but you should also make sure the right one is being used to answer your questions and to solve your problems.

Post COVID-19, sympathy for the learner is not just required for high-quality models but necessary to assure minimally performing models. The world has changed, quite dramatically. The learner you choose (or how you present the data to the learner) needs to be able to adjust rapidly to this ever-changing landscape in production. Severe concept drift, when fundamental driver-to-target relationships change, is now expected with all models. Key drivers of our economy, behavior, and relationships are rapidly evolving, but our historical data still see the old world order. Even something as simple as toilet paper purchases is now distorted. A sudden peak in toilet paper purchases for a given household prior to COVID-19 might mean a particularly good sale, more people moving into the household, or purchases for a small business. Now it probably means run-of-the-mill hoarding. Recommendation engines, risk models, and segmentation scores will collapse as this new data enters the system. Worse still, as the data distorted by the shelter-in-place order enters the training data, the AI will try to use it to project into the future rather than understand that this was a transitory event. Care must be taken when presenting this data to your learner. The learner has not been following the news, does not understand the concept of a global pandemic, and has no ability to imagine the future post COVID-19. While this is frustrating, as long as you recognize the limitations of AI you can mitigate the impact of COVID-19 on your modeling data and, again, get back to looking into the future with AI.
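One simple way to keep a transitory shock from being projected forward, assuming you can identify the affected period, is to flag (and down-weight) those observations when you present the data to the learner. Below is a minimal R sketch with made-up field names (week, purchases); it illustrates one possible mitigation, not the only one.

# Hypothetical weekly purchase data with a spike in the final 20 weeks
sales <- data.frame(week = 1:120,
                    purchases = c(rnorm(100, 50, 5), rnorm(20, 120, 30)))

# Flag the shelter-in-place period and down-weight it so the transitory
# spike does not dominate the fitted trend
sales$pandemic <- as.numeric(sales$week > 100)
w <- ifelse(sales$pandemic == 1, 0.2, 1.0)

# The indicator absorbs the shift instead of distorting the long-run trend
fit <- lm(purchases ~ week + pandemic, data = sales, weights = w)
summary(fit)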

 

COVID-19 and Home Schooling

This is a time when forgotten skills are awoken to help us survive, and hopefully thrive, under challenging circumstances. From cooking to yoga, we are all reaching back to our prior lives to reinvent what is normal. For us, how we teach our children at home is evolving as remote schooling continues. At first, remote classes were engaging, but soon both of my daughters yearned for more. Luckily, I taught at the University of Oregon, where we explored student-led instruction with the students driving the goals for the course. For my microeconomics courses, students chose topics and I guided them to use economic principles to answer their questions. Here are some examples:

  • How can we migrate to renewable energy faster?

  • Why are tuition and books so expensive?

This was more challenging than traditional methods, but the rewards were great. True, my ratings improved but, more importantly, many students embraced economics. Fast forward to today: my kids want more. Teaching is a stressful endeavor. We feel pressured to make sure our kids meet goals, deadlines, and milestones. What I learned from my students, mentors, and my daughters' teachers is a simple truth: there are no set milestones. Milestones create walls when we need to create paths.

The first major project was HarmoniousWorlds.com, which I talked about in a prior post. This project, a website outlining the rules to a game they created, was spawned by asking them how to resolve a dispute over the game. They saw the need to define and write down all the rules. This gave them ownership and well-defined goals. They pushed their language and math skills to make sure they could explain the game to their friends. When crafting posts, they took photos and considered how best to present the material to a general audience. The trick was to make sure they were not blocked in achieving their vision without taking over and marginalizing them. While my eldest embraced the project and quickly took command, my youngest had difficulties once the initial rules were created.

In the prior post we created a curriculum that, after the initial setup, was not engaging for our youngest daughter, who was just beginning to read. So we focused on helping her achieve a long-time goal of hers: creating a graphic novel. She picked a topic (for example, an enchanted forest) and we created sentences, mostly using words we knew she understood, based on that topic. Her job was to read each sentence and create a drawing based on it. We framed this as working together to create a book rather than as an assignment. After several days she had a book of sentences with pictures that she drew. She then used this as reference material to create her own stories and even wrote her own sentences that we would have to draw for her. While these examples sound like organized plans, they were not. Both grew organically, with the children contributing as much to the plan as us. The truth is all kids want to learn. As parents, we can choose to guide them rather than educate them. Both paths are challenging. For us, working to help our daughters meet their goals made homeschooling more fun and brought us closer together.

The S-Language


The S statistical programming language was developed at Bell Laboratories specifically for statistical modeling. There are two versions of S. One was developed by Insightful under the name S-PLUS. The other is an open-source implementation called R. S allows you to create objects, is very extensible, and has powerful graphing capabilities.
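For a quick taste of those capabilities, here is a minimal sketch in R using the built-in swiss data set (the same data used in Tip 13) to create a model object and plot the fit:

data(swiss)

# Fit a simple linear model and store it as an object
fit <- lm(Fertility ~ Education, data = swiss)
summary(fit)

# Base graphics: scatter plot with the fitted line overlaid
plot(swiss$Education, swiss$Fertility,
     xlab = "Education", ylab = "Fertility",
     main = "Swiss provinces, 1888")
abline(fit, col = "red")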

Tips
Tip 1

Set Memory Size

memory.size(max = TRUE)
Tip 2

Today’s Date

Today <- format(Sys.Date(), "%d %b %Y")
Tip 3

Set Working Directory

setwd("C:/")
Tip 4

Load In Data

ExampleData.path    <- file.path(getwd(), "USDemographics.CSV")
ExampleData.FullSet <- read.table(ExampleData.path, header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
Tip 5

Split Data

ExampleData.Nrows <- nrow(ExampleData.FullSet)
ExampleData.NCol  <- ncol(ExampleData.FullSet)
ExampleData.SampleSize <- ExampleData.Nrows / 2
ExampleData.Sample <- sample(nrow(ExampleData.FullSet), size = ExampleData.SampleSize,
                             replace = FALSE, prob = NULL)
ExampleData.HoldBack <- ExampleData.FullSet[ExampleData.Sample, c(5, 1:ExampleData.NCol)]
ExampleData.Run      <- ExampleData.FullSet[-ExampleData.Sample, c(5, 1:ExampleData.NCol)]
Tip 6

Create Function

Confusion <- function(a, b){
                  tbl <- table(a, b)
                  mis <- 1 - sum(diag(tbl))/sum(tbl)
                  list(table = tbl, misclass.prob = mis)
                   }
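For example, the function can be called with a vector of actual classes and a vector of predicted classes (hypothetical values shown here):

Actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
Predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)
Confusion(Actual, Predicted)
# Returns the cross-tabulation and a misclassification probability of 0.25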
Tip 7

Recode Fields

library(car)   # recode() comes from the car package
ExampleData.FullSet$SavingsCat <- recode(ExampleData.FullSet$Savings,
    "-40000.00:-100.00 = 'HighNeg'; -100.00:-50.00 = 'MedNeg'; -50.00:10.00 = 'LowNeg';
     10.00:50.00 = 'Low'; 50.00:100.00 = 'Med'; 100.00:1000.00 = 'High'",
    as.factor.result = TRUE)
Tip 8

Summarize Data

summary(ExampleData.FullSet)
Tip 9

Save output

save.image(file = "c:/test.RData", version = NULL, ascii = FALSE, compress = FALSE, safe = TRUE)
Tip 10

Subset

MyData.SubSample <- subset(MyData.Full, MyField ==0)
Tip 11

Remove Object From Memory

remove(list = c("MyObject"))
Tip  12

Create a Dataframe

TmpOutput <- data.frame(Fields = c("Field1", "Field2", "Field3"), Values = c(1, 2, 2))
Tip 13

Cut

data(swiss)
x <- swiss$Education
swiss$Educated <- cut(x, breaks=c(0, 11, 999), labels=c("0", "1"))
Tip 14

Create Directories

dir.create("c:/MyProjects")

Unix/Linux Data Management

Data management using Unix/Linux is easy, but it does have a few quirks. First, the header (the first row, which contains the field names) is typically not contained in the data file, requiring a separate schema or layout for the data files. That is because most file operations do not recognize the first line as different from the rest of the file. Also, complex multi-table queries can only be achieved in multiple steps, unlike in SQL and SAS. But the speed and efficiency of the code make Unix/Linux a strong data management tool.

 
Example:
Customer.dat
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
PurchaseOrder.dat
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction
 
     

  This data would be stored without column names.

SELECT

less customer.dat
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

ORDER BY

less customer.dat | sort -n -k 3
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56

WHERE

less customer.dat | awk '{if (substr($0,3,5) == "Susan") print $0}'
5 Susan 18

INNER JOIN

sort -k 2 purchaseorder.dat > srt_purchaseorder.dat
join -1 1 -2 2 customer.dat srt_purchaseorder.dat
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction

LEFT OUTER JOIN

sort -k 2 purchaseorder.dat > srt_purchaseorder.dat
join -a1 -e NULL -o '0,1.2,1.3,2.1,2.3' -1 1 -2 2 customer.dat srt_purchaseorder.dat
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NULL NULL

GROUP BY

join  -1 1 -2 2 customer.dat srt_purchaseorder.dat > sum.dat
awk 'BEGIN { FS=OFS=SUBSEP=" " }{arr[$2,$3]++ }END {for (i in arr) print i,arr[i]}' sum.dat
Joe 23 1
Mika 45 2
Lin 34 2
Sara 56 1

UPDATE

less customer.dat | awk '{if (substr($0,3,5) == "Susan") print substr($0,1,8) "59"; else print $0}'
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 59

INSERT

cat customer.dat new_cust.dat
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50

DELETE

less customer.dat | awk '{if (substr($0,1,1) != "1") print $0}'
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

 

Perl Data


Perl is an established language with lots of code examples provided on the Web. It is a powerful language (meaning you can do a lot with very little code). Perl has a rich library set, especially with regard to text processing. The language can take some getting used to because of its heavy use of control characters. For example, '.' means concatenate strings, and the leading character (@, $, & and %) indicates the data type being accessed (e.g., $foobar is a scalar while @foobar is an array). Below is an example of a database containing two tables, Customer and PurchaseOrder. The Customer table has CustomerID (the unique identifier for the customer), Name (the customer's name), and Age (the customer's age). The PurchaseOrder table has POID (the unique ID for the purchase order), CustomerID (which refers back to the customer who made the purchase), and Purchase (what was purchased).

Example:
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
PurchaseOrder
POID CustomerID Purchase
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction
 
     
SELECT
Create the file select.pl with the following code:

#!/usr/bin/perl -w

my @file_lines;

# Loop through data and print lines

while ( my $line = <> ) {

# Print Line

print $line;

# Populate next line

push @file_lines, $line; }

Then run the code:
perl select.pl customer.txt

CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
ORDER BY
Create the file orderby.pl with the following code:

#!/usr/bin/perl -w

# Open the file

open(DATA, $ARGV[0]);
my $line;
my $iCnt;

# Initialize variables

$iCnt =0;
$iLoop =0;

# Loop through data and populate a array for later use.

while($line = <DATA>)
{

# Copy data to an array, skipping the header (first line with field names)

if ($iCnt >0)
{
$lines[$iCnt-1] = $line;
}

$iCnt ++;
}

# Close file object

close DATA;

# Sort using on custom sort based on second field

@sorted = sort { @PANa = split(/,/, $a); @PANb = split(/,/, $b); return $PANa[2] cmp $PANb[2] } (@lines);

# Loop through data and print lines

while($iLoop < $iCnt-1)
{
print $sorted[$iLoop];
$iLoop ++;
}

Then run the code:
perl orderby.pl customer.txt

CustomerID Name Age
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56
WHERE
Create the file select_by_id.pl with the following code:

#!/usr/bin/perl -w
my @file_lines;

# Take the ID to select from the last command line argument
my $id = pop @ARGV;

# Loop through passed file

while ( my $line = <> ) {

# Split line using a comma

@PAN = split(/,/, $line);
$out = $PAN[0];

# If the ID matches the passed ID then print

if ($out eq $id)
{
print $line;
}

# Populate next line

push @file_lines, $line; }

Then run the code:
perl select_by_id.pl customer.txt 1

1 Joe 23
INNER JOIN
Create the file innerjoin.pl with the following code:

#!/usr/bin/perl -w

# Open both files

open(DATA1, $ARGV[0]);
open(DATA2, $ARGV[1]);
my $lineCust;
my $lineOrder;
my $iCnt;
my $iLoop;

# Initialize variables

$iCnt =0;

# Loop through data and populate a array for later use.

while($lineOrder = <DATA2>)
{
$orders[$iCnt] = $lineOrder;
$iCnt ++;
}

# Close file object

close DATA2;
while($lineCust = <DATA1>)
{

# Split line using a comma

@PAN1 = split(/,/, $lineCust);
$out1 = $PAN1[0];

# Remove newline character

chop($PAN1[2]);
chop($PAN1[2]);

$iLoop =0;

while($iLoop < $iCnt)
{

# Split line using a comma

@PAN2 = split(/,/, $orders[$iLoop]);
$out2 = $PAN2[1];

#If there is a match print

if ($out1 eq $out2)
{
print $PAN1[0] . "," . $PAN1[1] . "," . $PAN1[2] . "," . $orders[$iLoop];
}

$iLoop ++;
}
}

# Close file object

close DATA1;

Then run the code:
perl innerjoin.pl customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
LEFT OUTER JOIN
Create the file leftouterjoin with the following code:

#!/usr/bin/perl -w

# Open both files

open(DATA1, $ARGV[0]);
open(DATA2, $ARGV[1]);
my $lineCust;
my $lineOrder;
my $iCnt;
my $iLoop;
my $iFound;

# Initialize variables

$iCnt =0;

# Loop through data and populate a array for later use.

while($lineOrder = <DATA2>)
{
$orders[$iCnt] = $lineOrder;
$iCnt ++;
}

# Close file object

close DATA2;

while($lineCust = <DATA1>)
{

# Split line using a comma

@PAN1 = split(/,/, $lineCust);
$out1 = $PAN1[0];

# Remove newline character

chop($PAN1[2]);
chop($PAN1[2]);

$iLoop =0;
$iFound =0;

while($iLoop < $iCnt)
{

# Split line using a comma

@PAN2 = split(/,/, $orders[$iLoop]);
$out2 = $PAN2[1];

#If there is a match print

if ($out1 eq $out2)
{
print $PAN1[0] . "," . $PAN1[1] . "," . $PAN1[2] . "," . $orders[$iLoop];
$iFound =1;
}

$iLoop ++;
}

#If there was no match print

if ($iFound ==0)
{
print $PAN1[0] . "," . $PAN1[1] . "," . $PAN1[2] . ",NULL,NULL\n";
}
}

# Close file object

close DATA1;

Then run the code:
perl leftouterjoin.pl customer.txt orders.txt

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NULL NULL
GROUP BY
Create the file groupby.pl with the following code:

#!/usr/bin/perl -w

#Open the file

open(DATA1, $ARGV[0]);
my $line;
my $iCnt;
my $iLoop;

# Initialize variables

$iCnt = 1;
$iLoop = 0;

# Loop through data

while($line = <DATA1>)
{

# Split line using a comma

@PAN1 = split(/,/, $line);
$id = $PAN1[0];
$name = $PAN1[1];
$age = $PAN1[2];

# Remove newline character

chop($age);
chop($age);

if ($iLoop < 2)
{
$prior_id = $id ;
$iCnt = 0;
}

if ($prior_id eq $id)
{
$iCnt ++;
}
else
{
print $prior_name . "," . $prior_age . "," . $iCnt . "\n";
$iCnt =1;
}

$iLoop ++;
$prior_id = $id ;
$prior_name = $name ;
$prior_age = $age ; }

print $prior_name . "," . $prior_age . "," . $iCnt . "\n";

# Close file object

close DATA1;

Then run the code:
perl groupby.pl customer.txt

Name Age Orders
Joe 23 1
Mika 45 2
Lin 34 2
Sara 56 1
UPDATE
Create the file update.pl with the following code:

#!/usr/bin/perl -w

# Open the file

open(DATA, $ARGV[0]);

# Loop through data

while(<DATA>)
{

# Split line using a comma

@PAN = split(/,/, $_);
$out = $PAN[0];

# Print the line unchanged unless the ID equals the passed ID; otherwise update the age.

if ($out eq $ARGV[1])
{
print $PAN[0] . "," . $PAN[1] . "," . $ARGV[2] . "\n";
}
else
{
print $_ ;
}

}

# Close file object

close DATA;

  Then run the code:
perl update.pl customer.txt 1 26

Customer
CustomerID Name Age
1 Joe 26
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
INSERT
Create the file insert.pl with the following code:

#!/usr/bin/perl -w

#Open the file

open(DATA, $ARGV[0]);

# Loop through data and print lines

while(<DATA>)
{
print $_ ;
}

# Add new line from passed arguments

print $ARGV[1] . "," . $ARGV[2] . "," . $ARGV[3] . "\n";

# Close file object

close DATA;

Then run the code:
perl insert.pl customer.txt 6 Terry 50

Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50
DELETE
Create the file delete.pl with the following code:

#!/usr/bin/perl -w

#Open the file

open(DATA, $ARGV[0]);

# Loop through data

while(<DATA>)
{

# Split line using a comma

@PAN = split(/,/, $_);
$out = $PAN[0];

# If ID is not passed ID then print

if ($out ne $ARGV[1])
{
print $_ ;
} }

# Close file object

close DATA;

Then run the code:
perl delete.pl customer.txt 1

Customer
CustomerID Name Age
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

Unix Primer

A Primer on Data Management in Unix/Linux

Data manipulation in Unix/Linux is powerful yet easy after some practice. Much of basic file manipulation can be achieved using the basic toolset provided with most Unix/Linux installations. First let's generate the data files we will use for this exercise: some random data and a list of the files in your home directory.
Type: od -D -A n /dev/random | head -100 > mydata.dat
Result: You will now have a 100-record data file with four columns of random numbers (the od command dumps data in various usable formats). Now let's create another dataset.
Type: ls -l > mydir.out

wc Word count (the -l option gives the number of lines).
Type: less mydata.dat | wc -l
Result:  100 
gzip Compresses a file. Much of the size of a file (especially a text file) can be shrunk; the trade-off for the smaller size is slower access time and the need to uncompress the file to process it.
Type: gzip -c mydata.dat > mydata.gz
Result: You have created a gz file from mydata.dat.
Type: ls -l mydata.*
Result
-rw-rw-rw- 1 tharris mkgroup-l-d 2348 May 20 08:44 mydata.gz
-rw-rw-rw- 1 tharris mkgroup-l-d 4500 May 20 08:42 mydata.dat
Notice the gzipped file (2,348 bytes) is considerably smaller than the original (4,500 bytes).
zcat Allows you to decompress a gzipped file. You can pipe the output to a reader like less or to a file.
Type: zcat mydata.gz | less
Result: The resulting output should be the same as the original file.
grep Allows you to search a file for a particular string; it then outputs the complete line containing that string.
Type: grep Apr mydir.out
Result: -rwx——   1 tharris        mkgroup-l-d    2402 Apr 12 09:41 myproject.r
-rwx——+  1 tharris        ????????       1905 Apr 12 09:29 DesktopGarpLog.txt
drwx——+  3 tharris        ????????          0 Apr 12 09:34 Favorites Note: Remember to change the month ‘Apr’  to the month you are interested in.
sed Search and replace.
Type: less mydir.out | sed 's/????????/windows /'
Result: -rwx——   1 tharris        mkgroup-l-d    2402 Apr 12 09:41 myproject.r
drwx——+ 13 tharris      windows           0 Jan 28 06:45 Application Data
drwx——+  6 tharris       windows           0 May 25 06:12 Desktop
-rwx——+  1 tharris       windows        1905 Apr 12 09:29 DesktopGarpLog.txt
drwx——+  3 tharris       windows           0 Apr 12 09:34 Favorites
Now those annoying question marks are gone.
Type: sed 's/????????/windows /' mydir.out > mydir_2.out
Result: You will now have a text file called mydir_2.out with ???????? replaced by windows.
cut Allows you to access data by column position.
Type: cut -c 50-56 mydir_2.out | less
Result (of course with different dates):
Apr 1
Apr 25
May 12
May 12
Apr 17
awk Allows you to access data columns but is more powerful than cut. Both cut and awk can be used like a WHERE clause in SQL or an IF clause in SAS.
Type: less mydir_2.out | awk '{if (substr($0,51,3) == "Apr") print $0}' | less
Results -rwx——   1 tharris        mkgroup-l-d    2402 Apr 12 09:41 myproject.r
-rwx——+  1 tharris       windows        1905 Apr 12 09:29 DesktopGarpLog.txt
drwx——+  3 tharris       windows           0 Apr 12 09:34 Favorites
To create a file:
Type: less mydir_2.out | awk '{if (substr($0,51,3) == "Apr") print $0}' > mydir_3.out
sort Sort allows you to order a file in either descending or ascending order. You can specify a column to use as the key to sort the file by.
Type: sort -n -k 2 mydata.dat
Result: The output of mydata.dat will be displayed sorted by the second field. The '-n' option is for a numeric rather than an alphabetical sort, the '-r' option is for a reverse (descending) order, and '-k 2' means sort by the second (whitespace-separated) column.
head When working with very large files it is sometimes useful to work with a subset, especially when debugging code. The head command allows you to do this.
Type: less mydata.dat | head -5
Results: Only the top five lines of the output will be shown.
tail To work with the bottom rather than the top of a file, use tail.
Type: less mydata.dat | tail -5
Result: Only the bottom five records of the file will be shown.
join Unix has a join command similar to a SQL join or a SAS merge statement. To test the join function let's first construct two new data sets. Enter the following code:
less mydata.dat > mydata_2.dat
less mydata_2.dat | awk '{if (substr($0,14,1) == 3 || substr($0,15,1) == 1) print substr($0,13,10) " Y"; else print substr($0,13,10) " N"}' > mydata_lkup.dat
Now you have two new datasets: a copy of our original random number data set and a lookup table with a key pointing back to the original data. Now type:
join -1 2 -2 1 mydata_2.dat mydata_lkup.dat | less
Result:
614230376 2116315928 2808687127 1513727505 Y
2786586641 1078697315 4284908016 933354663 N
901415638 2527438256 3497368500 3894108367 N
3338765228 3463564639 3715602095 3944235862 Y
2901961487 2787207594 3739011318 4040597610 N
2380204561 2381578890 2611563505 292512547 Y
3810153523 2377573389 44853491 2382807132 Y
1853161002 851838940 4237925568 3627299786 N
2070425071 1236857502 150640963 2672607003 N
534159806 1991382958 2279021152 3452133675 N
Note: col2 has been swapped with col1 and new data has been appended to the end of the data set. The flags -1 2 -2 1 indicate which fields to use for the join; in this example we want col2 in our dataset to match col1 in the lookup table.
paste Another way to join two files is to use the paste command. Paste will merge two files horizontally regardless of any key value. If your two files are sorted properly and do not contain any unlinked values, like the datasets we constructed, paste is a faster way to merge the files.
Type: paste mydata_2.dat mydata_lkup.dat
Result:
2116315928  614230376 2808687127 1513727505     614230376 Y
1078697315 2786586641 4284908016  933354663    2786586641 N
2527438256  901415638 3497368500 3894108367     901415638 N
3463564639 3338765228 3715602095 3944235862    3338765228 Y
2787207594 2901961487 3739011318 4040597610    2901961487 N
2381578890 2380204561 2611563505  292512547    2380204561 Y
2377573389 3810153523   44853491 2382807132    3810153523 Y
 851838940 1853161002 4237925568 3627299786    1853161002 N
1236857502 2070425071  150640963 2672607003    2070425071 N
1991382958  534159806 2279021152 3452133675     534159806 N
Paste can also be used to pivot a file so that all the text is on one line.
Type: paste -d: -s mydata_2.dat
Results: All the data will be on one line. This is sometimes useful in data processing.
split This command is used to break apart a file into smaller parts.
Type: split -l 10 mydata.dat new
Results: You will have ten new files called newaa, newab, ..., newaj, each with 10 observations.
uniq This command collapses sequential identical lines into a single line.
Type: less mydata_lkup.dat | cut -c 12 | sort | uniq
Result
N
Y
Now let's see what happens if we remove the sort command.
Type: less mydata_lkup.dat | cut -c 12 | uniq
Result
Y
N
Y
N
Y
N
Without the sort command only identical sequential lines are collapsed.

Quick DOS Guide


DOS (Disk Operating System) was at the heart of Windows computers until relatively recently; more recently it has been hidden beneath the GUI we now think of as Windows. To run a DOS command via the command window: Start > Run, type cmd at the prompt, then press Enter. A command window should appear: a window with gray text on a black background that will look familiar to anyone who worked with DOS-based computers pre-Windows 95. From this window you can issue DOS commands in an interactive session to execute programs, copy or delete files, and much more. Most of these commands have been wrapped in a GUI (what we now think of when we think of Windows) but are still accessible via the command window. DOS's equivalent in UNIX/Linux is a korn or bash shell; however, DOS lacks many of the data management tools that come standard with a Unix/Linux installation.

cd Change directory.
Type: cd %userprofile%
Result: You are now in your home directory.
dir Gets a list of all files and directories in the current directory.
Type: dir
Result:
Microsoft Windows [Version 6.0.6000]
Copyright (c) 2006 Microsoft Corporation. All rights reserved.
C:\Users\Ted>dir
Volume in drive C has no label.
Volume Serial Number is xxxx-xxxx
Directory of C:\Users\Ted
03/19/2008 07:47 PM <DIR> .
03/19/2008 07:47 PM <DIR> ..
01/28/2008 07:06 PM <DIR> Contacts
05/27/2008 05:16 PM <DIR> Desktop
05/07/2008 06:59 AM <DIR> Documents
05/07/2008 06:51 AM <DIR> Downloads
03/19/2008 07:47 PM 575,342 ElfPDFStream.pdf
04/15/2008 06:34 PM <DIR> Favorites
01/28/2008 07:06 PM <DIR> Links
05/09/2008 07:01 PM <DIR> Music
02/29/2008 04:42 PM <DIR> Pictures
01/30/2008 07:33 PM <DIR> Saved Games
01/28/2008 07:06 PM <DIR> Searches
01/28/2008 07:06 PM <DIR> Videos
2 File(s) 580,944 bytes
13 Dir(s) 138,648,293,376 bytes free
PIPING | Pipes or redirects output to another command. This allows you to chain multiple commands together without having to create files at each intermediate step. > redirects output to a physical file.
Type: dir > test.out
Results: You have created a text file containing the directory listing.
mkdir Make a directory.
Type: mkdir test
Results: Type dir and see your new directory listed.
rename Renames a file. This is useful to rename output files when debugging a process.
Type: rename test.out list.out
Results: now the file test.out has been renamed to list.out.
copy Copy a file (use xcopy to copy directories).
Type: copy list.out test.out
Results: Now you have two files, list.out and test.out.
rmdir Removes a directory.
del Removes a file.
Type: del list.out
Results: The list.out file has been deleted.
type Displays the contents of a file in the command window.
Type: type test.out
Result: The output should look the same as the dir listing captured above.
edit Edit is a command line text editor for DOS that is similar to VI.
echo To make text appear in the command window use the echo command.  This can be useful to alert users to how a script or program is running.
attrib To change the permissions or attributes of a file use attrib.
Type: attrib +r test.out
cls Clears the command window of all text written so far.

REM Use REM to put comments into the script
tasklist Lists the processes running on the machine. Important to tell whether you are playing nice with others.
Set
Use to set environment variables.

Set path =

Call
Runs another batch file from a script without stopping the current one.
Call c:\MyBatchTwo.bat
PAUSE Halts the execution of the script until user input. Useful for seeing error messages and the output of the code.
Environment Variables
Example environment variables below.
%winDir% Path to the Windows directory.
%userprofile% Path to the user's directory.
%OS% The operating system name.

Unix/Linux vs DOS

 

Command                     UNIX            DOS
List files                  ls              dir
Change directory            cd              cd
Comments                    #               rem
File permissions            chmod           attrib
Copy                        cp              xcopy
Print text in console       echo            echo
Spawn a new thread          &               call
Read a text file            more or less    type
Delete file                 rm              del
Delete directory            rmdir           rd
Create a directory          mkdir           mkdir
Copy a file or directory    cp              xcopy
Move or rename a file       mv              rename
Piping                      >               >
Edit a file                 vi or emacs     edit

Quick Unix/Linux Guide

Unix/Linux is an operating system mainly used on servers in business settings. There are numerous customer-oriented Unix/Linux editions as well as CYGWIN (a virtual Linux environment available for Windows). Many of the Unix/Linux commands for file operations covered here are like those you typically do via the Windows GUI. If you are familiar with DOS, these commands are the Unix/Linux equivalents of the standard DOS file operation commands. There are numerous GUI front ends to Unix/Linux (CDE, KDE, Window Maker, OS X, ...) that allow you to execute these commands via menus or the mouse as you can from Windows. The power of using commands, however, is speed, clarity in what you are trying to achieve, and, if put into a script, repeatability. Another benefit is not having to hunt through multiple layers of menus to find (if it even exists) the command you want. This section will cover the basic file operation commands available in Unix/Linux. Most of the commands listed here are available in the korn and bash shells. If you do not have access to a Unix or Linux machine I recommend downloading and installing CYGWIN. You can find it here.

xterm A Unix/Linux terminal is an application that allows you to communicate with the system. Typically when you start up a terminal it has a shell attached to it that allows commands to be sent to and received from the Unix/Linux system. An xterm shell is like a command window in Windows. Two of the most common Unix/Linux shells are korn (ksh) and bash.
& Runs a command as a separate process (thread).
Type: xterm &
Results: A new command shell should appear and you will be able to use both command shells. If you had typed xterm without the &, you would not be able to use the first command shell until you exited the second.
cd change directory
Type cd ~
Result ~ is your home directory.
ls Gets a list of all files and directories in the current directory (the -l option gives a detailed view and -a shows configuration files). In Unix, configuration files have a "." prefix.
Type: ls -l
Results: Something like this on a CYGWIN installation:
-rwx——   1 tharris        mkgroup-l-d    2402 Apr 12 09:41 myproject.r
drwx——+ 13 tharris        ????????          0 Jan 28 06:45 Application Data
drwx——+  6 tharris        ????????          0 May 25 06:12 Desktop
-rwx——+  1 tharris        ????????       1905 Apr 12 09:29 DesktopGarpLog.txt
drwx——+  3 tharris        ????????          0 Apr 12 09:34 Favorites
The first column, with the cryptic sequence of letters, indicates the rights and permissions.
-rwx: a file with read, write and execute permissions
drwx: a directory with read, write and execute permissions
The second column is the user who created the file (owner).
The third is the Unix group under which the file was created. Files created while in Windows do not have a Unix group, so they appear as ????????.
The fourth is the file size, followed by the creation date and the file name.
PIPING | Pipes or redirects output to another command. This allows you to chain multiple commands together without having to create files at each intermediate step. > redirects output to a physical file.
Type: ls -l > test.out
Results: You have created a text file containing the directory listing.
mkdir Make a directory. Type: mkdir test
Results: type ls and see your directory listed.
mv Moves file. This is useful to rename output files when debugging a process.
Type: mv test.out list.out
Results: now the file test.out has been renamed to list.out.
cp Copy a file or directory (with the -r option).
Type: cp list.out test.out
Results: Now you have two files, list.out and test.out.
rm Remove a file or directory.
Type rm list.out
Results the list.out file has been deleted.
more Allows read-only access to a file. To quit out of more type q.
Type: more test.out
Result: The output should look the same as if you ran the ls -l command.
less Also allows read-only access to a text file. The name is a misnomer; less has greater capabilities than more. Like more, to quit type q. Both more and less have far more capability than I will discuss here.
vi and emacs Two powerful text editors for Unix. vi is a command line text editor and does require a little sit-down time with the manual. Once you have mastered a few simple commands, vi is a quick tool for editing text. If, however, you want a tool more familiar to Windows text editors, try emacs.
echo To make text appear in the command window use the echo command.  This can be useful to alert users to how a script or program is running.
Creating a script   To create a script type emacs myscript.sh & at the command prompt. Now let's put all that we have learned to use. Type the following code:
#!/usr/bin/sh
echo I am starting
mkdir mytest
ls -l > ./mytest/test2.out
cd ./mytest
cp test2.out test3.out
mv test3.out test4.out
echo I am done
Now save the file by clicking on the familiar save icon. Everything should look familiar except #!/usr/bin/sh. That line tells Unix which shell to use to execute the script.
chmod To change the permissions or mode of a file use chmod. A good setting is 775.
Type: chmod 775 myscript.sh
Results: Now we can execute the script myscript.sh.
Executing a Script
Type: ./myscript.sh
Result:
I am starting
I am done
Note: If nothing else appears between the two statements in the command terminal, the script ran successfully. To check the results, change to the directory mytest and examine the files there.
sleep If you want a script to pause between commands use the sleep command. This can be useful to allow the user to kill a process that spawns lots of other processes at waypoints in the script.
Type: sleep 360
Results: The command terminal should pause for six minutes.
nohup Allows you to execute a script and log out without terminating the script. This is useful with scripts that take a long time to run.
Type nohup ./myscript.sh
Result: nohup: appending output to `nohup.out  
        [1] 272
You will not see anything in the command window.  The output will be redirected to a text file.  The text [1] 272 tells you the process id and will be different for you.
Reading output from Nohup Type less nohup.out
ps Allows you to see the status of processes (with the -p option you can specify a process to examine). This can be useful when running long jobs. It is similar to Task Manager. For example, add the following line to your script: sleep 360. Now when you run the script it will pause for six minutes, which gives us time to use the ps command.
Type: nohup ./myscript.sh
Result: nohup: appending output to `nohup.out
           [1] 5056
Type: ps -p 5056
Result:  
      PID    PPID    PGID     WINPID  TTY  UID    STIME COMMAND
     5056    1896    5056       2168    0 156949 08:23:37 /usr/bin/sh
Another example is to show all the processes you have spawned.
Type: ps -u
Results:
      PID    PPID    PGID     WINPID  TTY  UID    STIME COMMAND
     2528       1    2528       2528  con 156949 08:03:43 /usr/X11R6/bin/Xwin
     2336       1    2336       2876  con 156949 08:03:43 /usr/bin/xterm
     1220       1    1220       2984  con 156949 08:03:43 /usr/X11R6/bin/wmaker
     3912    1220    1220       1260  con 156949 08:03:45 /usr/X11R6/bin/wmaker
     4368    2336    4368       4528    1 156949 08:03:49 /usr/bin/bash
     5056    1896    5056       2168    0 156949 08:23:37 /usr/bin/sh
     6088    5056    5056       5848    0 156949 08:23:40 /usr/bin/sleep
     4612    1896    4612       3736    0 156949 08:25:24 /usr/bin/ps
top Lists the top processes on the machine. Important to tell whether you are playing nice with others.
kill Allows a user to stop a process.
Type: kill 6088
Result: Process 6088 will be terminated regardless of state.
nice If you are not playing nice with others (i.e., you are hogging resources) you can change the priority of your process to allow others to get done with whatever they need to do. nice enables this. The values range from 1 (highest priority) to 19 (lowest); 10 is the default.
Type: nice -17 nohup ./myscript.sh
Result: The process will be toward the back of the priority queue. When allocating resources, Unix will prioritize other users' processes over yours.

Five Model Evaluation Criteria

There are many different criteria to use to evaluate a statistical or data-mining model. So many, in fact, that it can be a bit confusing and at times seem like a sporting event where proponents of one criterion are constantly trying to prove it is the best. There is no such thing as a best criterion. Different criteria tell you different things about how a model behaves. In a given situation one criterion may be better than others, but that will change as situations change. My recommendation, as with many other tools, is to use multiple methods and understand the strengths and weaknesses of each method for the problem you are currently facing. Many of the criteria are slight variations of one another, and most include the residual sum of squares (RSS) in one manner or another. The differences may be subtle but can lead to very different conclusions about the fit of a model. This month we examine non-visual measures. Next month we will look at visual tools.

 

MSE Criterion

The simplest measure, the mean squared error, is the average of the squared differences between actual and predicted values. A lower MSE means a better-fitting model. It does not provide an absolute measure of fit like R-Squared, meaning MSE is used to compare models with the same dependent variable, not as a measure of overall fit. Remember you sum the squared residuals because the sum of the raw residuals should be zero, i.e., no bias in the model.
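As a rough illustration, MSE can be computed directly from a fitted model's residuals in R; the built-in swiss data set is used here purely as stand-in data:

data(swiss)
fit <- lm(Fertility ~ Education + Agriculture, data = swiss)

# Mean squared error: the average of the squared residuals
mse <- mean(residuals(fit)^2)
mse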

R-Squared

R-Squared is probably the first method you were taught in school. It is the most derided of all the measures of goodness-of-fit, mainly because, in the past, people had a tendency to overstate its importance. R-Squared should not be the only test you do, but it should not be ignored. It is criticized in textbooks because it is overused, not because it is invalid. R-Squared values range from 0 to 1, with 1 indicating that the model explains all of the variation in the dependent variable. In social sciences, R-Squared values for a good model range from .05 to .2. In physical sciences, a good R-Squared is much higher, between .7 and .9.

R-Squared = 1-MSE/((1/n)*TSS)

MSE = mean squared errors = RSS/N
TSS = Total Sum of Squares, the sum of the dependent variable's squared deviations from the mean
n = observations.

In layman’s terms it can be thought of as a normalized MSE.

The standard R-Squared does not take into consideration the number of parameters used in the model. This leads to one flaw: you can increase the R-Squared simply by adding random variables to your model. The adjusted R-Squared corrects for this.

 

Adj. R-Squared = 1 - (RSS/(n-p-1))/(TSS/(n-1))

MSE = mean squared errors = RSS/N
TSS = Total Sum of Squares, the sum of the dependent variable's squared deviations from the mean
n = observations
p= number of parameters
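Continuing the same stand-in swiss example, both quantities can be computed by hand from RSS and TSS and checked against what summary(lm) reports:

data(swiss)
fit <- lm(Fertility ~ Education + Agriculture, data = swiss)

rss <- sum(residuals(fit)^2)                             # residual sum of squares
tss <- sum((swiss$Fertility - mean(swiss$Fertility))^2)  # total sum of squares
n   <- nrow(swiss)
p   <- length(coef(fit)) - 1                             # parameters, excluding the intercept

r.squared     <- 1 - rss/tss
adj.r.squared <- 1 - (rss/(n - p - 1))/(tss/(n - 1))

# These should agree with summary(fit)$r.squared and summary(fit)$adj.r.squared
c(r.squared, adj.r.squared)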

 

Akaike's Information Criterion (AIC)

In 1972 Akaike introduced the Akaike Information Criterion (AIC). AIC was an attempt to improve upon previous measures by penalizing the number of free parameters more heavily than adjusted R-Squared does. Its goal is to build the best possible model with the fewest parameters. This should reduce the likelihood of overfitting compared with R-Squared.

 

AIC = -2*log(L(theta hat)) + 2*p
Where
        L is the maximum likelihood function
        p is the number of parameters
Or
AIC = n*(ln((2*pi*RSS)/n) +1) + 2*p

Where
      RSS is the residual sum of squares
      p is the number of parameters
      n is the number of observations
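In R the built-in AIC() function reports this directly, and the RSS-based form above can be computed by hand. A small sketch, again using the stand-in swiss models (note that R also counts the estimated error variance as a parameter, so p below is the coefficient count plus one):

data(swiss)
fit1 <- lm(Fertility ~ Education, data = swiss)
fit2 <- lm(Fertility ~ Education + Agriculture, data = swiss)

# Built-in AIC: lower is better
AIC(fit1, fit2)

# Hand calculation from the RSS form for the larger model
rss <- sum(residuals(fit2)^2)
n   <- nrow(swiss)
p   <- length(coef(fit2)) + 1     # coefficients plus the error variance
n * (log((2 * pi * rss)/n) + 1) + 2 * p   # matches AIC(fit2)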

 

Schwarz's Information Criterion (BIC or SIC)

The SIC penalizes additional variables more heavily than AIC; otherwise it behaves the same.

It is always better to penalize additional variables, right? That sounds good; however, this is not always a superior measure to R-Squared. For example, if you are building a model with potentially noisy data, reducing the number of parameters may make the model more unstable out of sample. By reducing the number of parameters (independent variables), each individual variable's contribution to the model will increase. If those variables have stability issues out of sample, the model may be more likely to "explode". By having a greater number of parameters you reduce the chance that any one variable's anomaly will yield wild results; the other variables can compensate. In essence you are spreading the risk of a random measurement error across multiple variables. This example aside, the AIC-based measures are sound measures of model performance.

 

SIC = -2*log(L(theta hat)) + p*ln(n)
Where
      L is the maximum likelihood function

Or

SIC = n*(ln((2*pi*RSS)/n) + 1) + p*ln(n)
Where
         RSS is the residual sum of squares
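The same hand check works for the SIC/BIC, which R exposes as BIC() with the log(n) penalty:

data(swiss)
fit <- lm(Fertility ~ Education + Agriculture, data = swiss)

BIC(fit)

# Hand calculation (again counting the error variance as a parameter)
rss <- sum(residuals(fit)^2)
n   <- nrow(swiss)
p   <- length(coef(fit)) + 1
n * (log((2 * pi * rss)/n) + 1) + p * log(n)   # matches BIC(fit)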
 

 

Information-Based Complexity Inverse-Fisher Information Matrix Criterion (ICOMP(IFIM))

Bozdogan came up with ICOMP(IFIM) as an alternative to the AIC-based approaches.
ICOMP(IFIM) balances model fit against model complexity as measured by the inverse-Fisher information matrix. This is superior to AIC-based approaches because it defines complexity based on the covariance matrix of the parameter estimates, as opposed to just the count of independent variables. For example, suppose you have one model with five independent variables that are not correlated with one another versus a model with four highly correlated parameters, and suppose both have the same MSE. The first model intuitively should be superior to the second; however, with AIC-based approaches the second model would look superior. ICOMP(IFIM) should compensate for this.

 

 

Further Reading

www.psy.vanderbilt.edu

http://web.utk.edu/~bozdogan/infomod.html

www.ntua.gr/ISBA2000/new/Fin_sessions_1.html

en.wikipedia.org/wiki/Coefficient_of_determination

etd.fcla.edu/UF/UFE1000122/erdogmus_d.pdf

faculty.psy.ohio-state.edu/myung/personal/info_matrix.pdf