Data Management Quick Comparison 2

The following article provides a comparison between BASH and JAQL for basic data management and manipulation. 

 

Selecting- All Fields

BASH

more mydata.dat

JAQL

read(hdfs( Customers.dat )); 

-One Field

BASH

cut -c 13-22 mydata.dat 

JAQL

$Customers = read( Customers.dat );
$Customers -> transform { name: $.name };

– Subset

BASH

less mydata.dat |awk {if (substr($0,13,10) == 2786586641) print $0}

JAQL

$Customers = read( Customers.dat );
$Customers -> filter $.age == 18 ->  transform { name: $.name };

Sorting

BASH

sort -n -t +2 mydata.dat

JAQL

$Customers = read( Customers.dat );
$Customers -> sort by [$.age] -> transform { name: $.name };

Join

BASH

join -1 2 -2 1 mydata_2.dat mydata_lkup.dat | less
or (if no unmatched values and both files are sorted)
paste mydata_2.dat mydata_lkup.dat

JAQL

$CustomersPurchases =join $Customers, $PurchaseOrder where $Customers.CustomerId
== $PurchaseOrder.CustomerId into {$Customers.name, $PurchaseOrder.*} ;   

Sample

BASH

more mydata.dat| head -10

JAQL

$Customers = read( Customers.dat );
$Customers -> top(2); 

Aggregate Analysis

BASH

awk BEGIN { FS=OFS=SUBSEP= }{arr[$2,$3]++ }END {for (i in arr) print i,arr[i]} mydata_lkup.dat

JAQL

$CustomersPurchases -> group by $Customers_group = $.CustomerId into
{$Customers_group, total: count($[*].POID)};

Unique

BASH

less mydata_lkup.dat|cut -c 12|uniq

JAQL

$CustomersPurchases ->group by $Customers_group = $.CustomerId into
{$Customers_group };

JAQL Data Management

JAQL is a JSON query language similar to SQL. One key difference is the dataset is accessed more like objects. A great overview is found here:JaqlOverview the strength of JAQL is it allows users simple, extendable code to manipulate data that is in a non-proprietary, readable, and commonly used file format. It is primarily used when processing data with Hadoop. JAQL tends to crash when it has exceptions so I would copy-paste my command from an editor. Also never press the up arrow.

Example Data:  
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
PurchaseOrder
POID CustomerID Purchase
1 3 Fiction
2 1 Biography
3 1 Fiction
4 2 Biography
5 3 Fiction
6 4 Fiction

 

Create Data
$Customer = [ {CustomerId:1, name: Joe , age: 23}
, {CustomerId:1, name: Mika , age: 45}
, {CustomerId:1, name: Lin , age: 34}
, {CustomerId:1, name: Sara , age: 56}
, {CustomerId:1, name: Susan , age: 18}];

$PurchaseOrder = [
{POID: 1, CustomerId:3, Purchase: Fiction }
{POID: 2, CustomerId:1, Purchase: Biography }
{POID: 3, CustomerId:1, Purchase: Fiction }
{POID: 4, CustomerId:2, Purchase: Biography }
{POID: 5, CustomerId:3, Purchase: Fiction }
{POID: 6, CustomerId:3, Purchase: Fiction } ]; 
SELECT
hdfs($Customers) ; 
hdfs($PurchaseOrder) ;
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

Sort BY

$Customers -> sort by [$.age];
CustomerID Name Age
5 Susan 18
1 Joe 23
3 Lin 34
2 Mika 45
4 Sara 56
Filter
$Customers -> filter $.age == 18 ;
CustomerID Name Age
5 Susan 18
INNER JOIN
join $Customers, $PurchaseOrder where $Customers.CustomerId == $PurchaseOrder.CustomerId into {$Customers.name, $PurchaseOrder.*} ; 
CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
LEFT OUTER JOIN

join preserve $Customers, $PurchaseOrder where $Customers.CustomerId == $PurchaseOrder.CustomerId into {$Customers.name, $PurchaseOrder.*} ;

CustomerID Name Age POID Purchase
1 Joe 23 2 Biography
1 Joe 23 3 Fiction
2 Mika 45 4 Biography
3 Lin 34 1 Fiction
3 Lin 34 5 Fiction
4 Sara 56 6 Fiction
5 Susan 18 NULL NULL
GROUP BY
 
 $Customers -> group by $Customers_group = $.CustomerID into {$Customers_group, $Customers.age, total: count($) }; 
Name Age Orders
Joe 23 1
Mika 45 2
Lin 34 2
Sara 56 1
UPDATE

$Customer1= $Customers -> filter $.CustomerId == 1 -> transform {CustomerId: $.CustomerId, name: $.name ,age: 16} ; $Customerne1= $Customers -> filter $.CustomerId != 1 ; hdfs($Customer1 , $Customerne1);

Customer
CustomerID Name Age
1 Joe 26
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
INSERT
$Customer6=[{CustomerId:6,name: Terry ,age:50}]; hdfs($Customers , $Customer6); -> write(file( Customers.dat ));
Customer
CustomerID Name Age
1 Joe 23
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18
6 Terry 50
DELETE
$Customerne1= $Customers ->filter $.CustomerId != 1
Customer
CustomerID Name Age
2 Mika 45
3 Lin 34
4 Sara 56
5 Susan 18

JSON and JAQL

JSON (Javascript Object Notation) is growing in popularity as a data format. I am finding I am using it routinely when interfacing with sites such as FreeBase.com and at work when processing datasets using Hadoop.

One great strength of the JSON format is the rich set of tools available to work with the data. One example is JAQL, a JSON query language similar to SQL that works well with Hadoop. A great overview is found here <a href = http://code.google.com/p/jaql/wiki/JaqlOverview > here </a> he strength of JAQL is it allows users simple, extendable code to manipulate data that is in a non-proprietary, readable and commonly used file format.

I have added a JAQL example in the data section and to the data Management Quick Comparison.