Unix Primer

A Primer on Data Management in Unix/Linux

Data manipulation in Unix/Linux is powerful yet, after some practice, easy. Most basic file manipulation can be accomplished with the standard toolset provided by most Unix/Linux installations. First, let's generate the data files we will use for this exercise: some random data, and a list of the files in your home directory. Type: od -D -A n /dev/random | head -100 > mydata.dat
Result: you will now have a data file of 100 records with four columns of random numbers (the od command dumps data in various usable formats). Now let's create another dataset:
Type: ls -l > mydir.out

wc Word count (the -l option returns the number of lines).
Type: less mydata.dat | wc -l
Result: 100
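Beyond line counts, wc can also report words and bytes. A small self-contained sketch (sample.txt is a throwaway name invented for this example):

```shell
# Create a small sample file, then count its lines, words, and bytes.
printf 'alpha beta\ngamma delta epsilon\n' > sample.txt

wc -l sample.txt   # number of lines (2)
wc -w sample.txt   # number of words (5)
wc -c sample.txt   # number of bytes

# Reading from stdin prints the bare number, which is handy in scripts.
lines=$(wc -l < sample.txt)
echo "$lines"

rm -f sample.txt
```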
gzip Compresses a file for you. Much of the size of a file (especially a text file) can be squeezed out. The trade-off for the smaller size is slower access and the need to uncompress the file to process it. Type: gzip -c mydata.dat > mydata.gz
Result: you have created a gz file from mydata.dat. Type: ls -l mydata.*
-rw-rw-rw- 1 tharris mkgroup-l-d 2348 May 20 08:44 mydata.gz
-rw-rw-rw- 1 tharris mkgroup-l-d 4500 May 20 08:42 mydata.dat
Notice the gzipped file is 2348 bytes while the original file is 4500.
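gzip can also report how well it compressed. A self-contained sketch (nums.txt is an invented example file):

```shell
# Build a compressible file, compress it, and inspect the result.
seq 1 1000 > nums.txt
gzip -c nums.txt > nums.gz

# gzip -l lists the compressed size, uncompressed size, and ratio.
gzip -l nums.gz

# Round trip: decompressing should reproduce the original byte for byte.
gzip -dc nums.gz | cmp - nums.txt && echo "round trip OK"

rm -f nums.txt nums.gz
```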
zcat   Decompresses a gzipped file to standard output. You can pipe the output to a pager like less or redirect it to a file. Type: zcat mydata.gz | less
Result: The resulting output should be the same as the original file.
grep Searches a file for a particular string and outputs each complete line containing that string. Type: grep Apr mydir.out
Result: -rwx——   1 tharris        mkgroup-l-d    2402 Apr 12 09:41 myproject.r
-rwx——+  1 tharris        ????????       1905 Apr 12 09:29 DesktopGarpLog.txt
drwx——+  3 tharris        ????????          0 Apr 12 09:34 Favorites
Note: remember to change the month ‘Apr’ to the month you are interested in.
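A few grep flags worth knowing, shown on a throwaway file (notes.txt is invented for this example):

```shell
printf 'Apr report\nMay report\napr draft\n' > notes.txt

grep -i apr notes.txt   # case-insensitive: matches both "Apr" and "apr"
grep -v Apr notes.txt   # invert: lines that do NOT contain "Apr"
grep -c Apr notes.txt   # count matching lines instead of printing them

rm -f notes.txt
```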
sed     Search and replace.
Type: less mydir.out | sed 's/????????/windows /'
Result: -rwx——   1 tharris        mkgroup-l-d    2402 Apr 12 09:41 myproject.r
drwx——+ 13 tharris      windows           0 Jan 28 06:45 Application Data
drwx——+  6 tharris       windows           0 May 25 06:12 Desktop
-rwx——+  1 tharris       windows        1905 Apr 12 09:29 DesktopGarpLog.txt
drwx——+  3 tharris       windows           0 Apr 12 09:34 Favorites
Now those annoying question marks are gone. Type: sed 's/????????/windows /' mydir.out > mydir_2.out
Result: you will now have a text file called mydir_2.out with the ???????? replaced by windows.
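A note on quoting and portability: sed programs are best wrapped in single quotes so the shell leaves them alone. GNU sed also offers -i to edit a file in place (BSD/macOS sed wants -i '' instead), but redirecting to a new file is the portable approach. A small sketch with an invented file name:

```shell
printf 'status: ????????\n' > conf.txt

# Portable approach: write the edited copy to a new file.
sed 's/????????/windows/' conf.txt > conf_2.txt
cat conf_2.txt    # status: windows

rm -f conf.txt conf_2.txt
```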
cut Extracts columns of data by character position.
Type: cut -c 50-56 mydir_2.out | less
Result (of course with different dates):
Apr 1
Apr 25
May 12
May 12
Apr 17
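cut can also split on a delimiter rather than fixed character positions, using -d and -f. A sketch with invented data (the layout mimics /etc/passwd):

```shell
# Colon-delimited records.
printf 'tharris:x:1000:staff\nroot:x:0:wheel\n' > users.txt

cut -d: -f1 users.txt     # field 1 only: the user names
cut -d: -f1,4 users.txt   # fields 1 and 4: user name and group

rm -f users.txt
```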
awk  Also extracts data columns but is more powerful than cut. Both cut and awk can be used like a where clause in SQL or an if clause in SAS. Type:
less mydir_2.out | awk '{if (substr($0,51,3) == "Apr") print $0}' | less
Results -rwx——   1 tharris        mkgroup-l-d    2402 Apr 12 09:41 myproject.r
-rwx——+  1 tharris       windows        1905 Apr 12 09:29 DesktopGarpLog.txt
drwx——+  3 tharris       windows           0 Apr 12 09:34 Favorites
To create a file instead, type: less mydir_2.out | awk '{if (substr($0,51,3) == "Apr") print $0}' > mydir_3.out
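awk also splits each line into whitespace-separated fields ($1, $2, ...), which is often easier than byte offsets. A self-contained sketch (vals.txt is invented for the example):

```shell
printf '10 alpha\n900 beta\n55 gamma\n' > vals.txt

awk '$1 > 50' vals.txt                          # rows where column 1 exceeds 50
awk '{ sum += $1 } END { print sum }' vals.txt  # total of column 1 (965)

rm -f vals.txt
```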
sort Orders a file in either ascending or descending order, using a column of your choice as the key. Type: sort -n -t ' ' -k 2 mydata.dat
Result: the output of mydata.dat will be displayed sorted by the second field. The -n option requests a numeric rather than an alphabetical sort; adding -r gives reverse (descending) order. The -t ' ' indicates the fields are separated by a space, and -k 2 means sort by the second column.
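The -k syntax generalizes to any column and direction. A sketch on made-up data (fruit.txt is invented):

```shell
printf '3 carrot\n1 apple\n2 banana\n' > fruit.txt

sort -n -k 1 fruit.txt    # ascending numeric sort on column 1
sort -rn -k 1 fruit.txt   # same key, descending
sort -k 2 fruit.txt       # alphabetical sort on column 2

rm -f fruit.txt
```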
head When working with very large files it is sometimes useful to work with a subset, especially when debugging code. The head command allows you to do this.
Type: less mydata.dat | head -5
Result: only the top five lines of the output will be shown.
tail To work with the bottom rather than the top of a file, use tail.
Type: less mydata.dat | tail -5
Result: only the bottom five records of the file will be shown.
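head and tail compose nicely to cut a window out of the middle of a file:

```shell
# Pull lines 41-45 out of a 100-line file.
seq 1 100 > rows.txt
head -n 45 rows.txt | tail -n 5   # prints 41 through 45
rm -f rows.txt
```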
join Unix has a join command similar to a SQL join or a SAS merge statement. To test the join function, let's first construct two new datasets. Enter the following:
less mydata.dat > mydata_2.dat
less mydata_2.dat | awk '{if (substr($0,14,1) == 3 || substr($0,15,1) == 1) print substr($0,13,10)" Y"; else print substr($0,13,10)" N"}' > mydata_lkup.dat
Now you have two new datasets: a copy of our original random-number dataset, and a lookup table with a key pointing back to the original data. (Strictly speaking, join expects both inputs ordered on the join field; here the lookup table inherits the original file's order, which is what matters for this exercise.) Now type: join -1 2 -2 1 mydata_2.dat mydata_lkup.dat | less
614230376 2116315928 2808687127 1513727505 Y
2786586641 1078697315 4284908016 933354663 N
901415638 2527438256 3497368500 3894108367 N
3338765228 3463564639 3715602095 3944235862 Y
2901961487 2787207594 3739011318 4040597610 N
2380204561 2381578890 2611563505 292512547 Y
3810153523 2377573389 44853491 2382807132 Y
1853161002 851838940 4237925568 3627299786 N
2070425071 1236857502 150640963 2672607003 N
534159806 1991382958 2279021152 3452133675 N
Note: col2 has been swapped into the first position and the lookup value has been appended to the end of each record. The -1 2 -2 1 arguments indicate which fields to use for the join; in this example we want col2 of our dataset to match col1 of the lookup table.
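Because join expects its inputs ordered on the join field, sorting first makes the merge reliable on arbitrary data. A tiny sketch with invented files:

```shell
printf '20 beta\n10 alpha\n' > left.txt
printf '10 N\n20 Y\n'        > lkup.txt

sort -o left.txt left.txt   # sort in place on the key (field 1)
join left.txt lkup.txt      # default: join on field 1 of each file
                            # -> "10 alpha N" then "20 beta Y"
rm -f left.txt lkup.txt
```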
paste Another way to join two files is with the paste command. paste merges two files horizontally, regardless of any key value. If your two files are sorted consistently and contain no unmatched records, like the datasets we constructed, paste is a faster way to merge them. Type: paste mydata_2.dat mydata_lkup.dat
Result:
2116315928  614230376 2808687127 1513727505     614230376 Y
1078697315 2786586641 4284908016  933354663    2786586641 N
2527438256  901415638 3497368500 3894108367     901415638 N
3463564639 3338765228 3715602095 3944235862    3338765228 Y
2787207594 2901961487 3739011318 4040597610    2901961487 N
2381578890 2380204561 2611563505  292512547    2380204561 Y
2377573389 3810153523   44853491 2382807132    3810153523 Y
 851838940 1853161002 4237925568 3627299786    1853161002 N
1236857502 2070425071  150640963 2672607003    2070425071 N
1991382958  534159806 2279021152 3452133675     534159806 N
paste can also be used to pivot a file (or two files) so that all the text ends up on one line. Type: paste -s -d: mydata_2.dat
Result: all the data will be on one line, separated by colons. This is sometimes useful in data processing.
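The serialize flag is easy to check on a toy file (short.txt is invented for the example):

```shell
seq 1 5 > short.txt
paste -s -d: short.txt   # joins all lines into one: 1:2:3:4:5
rm -f short.txt
```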
split   Breaks a file apart into smaller pieces. Type: split -l 10 mydata.dat new
Result: you will have ten new files called newaa, newab, … newaj, each with 10 observations.
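The pieces can be verified and stitched back together; a sketch assuming GNU split's default two-letter suffixes:

```shell
seq 1 100 > big.txt
split -l 10 big.txt piece_

ls piece_* | wc -l         # 10 pieces: piece_aa through piece_aj
cat piece_* > rebuilt.txt  # the glob sorts the names, restoring the order
cmp big.txt rebuilt.txt && echo "files match"

rm -f big.txt rebuilt.txt piece_*
```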
uniq This command collapses each run of identical sequential lines into a single line. Type: less mydata_lkup.dat | cut -c 12 | sort | uniq
Result:
N
Y
Now let's see what happens if we remove the sort command. Type: less mydata_lkup.dat | cut -c 12 | uniq
Result: a longer list of alternating Y and N values. Without the sort, only identical sequential lines are collapsed.
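uniq -c turns the collapse into a frequency count; with sort in front it becomes a quick tally:

```shell
# One line per distinct value, prefixed by how many times it appeared.
printf 'Y\nY\nN\nY\n' | sort | uniq -c   # counts: 1 N, 3 Y
```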