wc |
Counts the lines, words, and bytes in its input; with the -l option it prints only the number of lines.
Type: less mydata.dat | wc -l
Result: 100 |
gzip |
Compresses a file. Much of the size of a file (especially a text file) can be squeezed out. The trade-off for the smaller size is slower access and the need to uncompress the file before processing it.
Type: gzip -c mydata.dat > mydata.gz
Result: you have created a gzipped copy of mydata.dat.
Type: ls -l mydata.*
Result:
-rw-rw-rw- 1 tharris mkgroup-l-d 2348 May 20 08:44 mydata.gz
-rw-rw-rw- 1 tharris mkgroup-l-d 4500 May 20 08:42 mydata.dat
Notice that the gzipped file is 2348 bytes while the original file is 4500 bytes. |
zcat |
Decompresses a gzipped file to standard output, leaving the .gz file itself untouched. You can pipe the output to a reader such as less, or redirect it to a file.
Type: zcat mydata.gz | less
Result: the output should be the same as the original file. |
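The gzip and zcat steps above can be checked end to end. This is only a minimal sketch: sample.dat is a made-up file (not one of the tutorial's files), and gzip -dc is used in place of zcat because it should behave the same way on a .gz file.

```shell
# Hypothetical sample file; the name and contents are made up for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 1000 > sample.dat

# -c writes the compressed stream to stdout, so sample.dat is left in place.
gzip -c sample.dat > sample.dat.gz

# Compare the two sizes in bytes.
orig_size=$(wc -c < sample.dat)
gz_size=$(wc -c < sample.dat.gz)
echo "original: $orig_size bytes, gzipped: $gz_size bytes"

# Decompress to stdout and compare against the original file.
if gzip -dc sample.dat.gz | cmp -s - sample.dat; then
    roundtrip=same
else
    roundtrip=different
fi
echo "round trip: $roundtrip"
```

Because the compressed data goes to stdout in both directions, neither the original nor the .gz file is modified or deleted along the way.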
grep |
Searches a file for a particular string and prints every complete line that contains it.
Type: grep Apr mydir.out
Result:
-rwx------ 1 tharris mkgroup-l-d 2402 Apr 12 09:41 myproject.r
-rwx------+ 1 tharris ???????? 1905 Apr 12 09:29 DesktopGarpLog.txt
drwx------+ 3 tharris ???????? 0 Apr 12 09:34 Favorites
Note: remember to change the month 'Apr' to the month you are interested in. |
sed |
Searches for a pattern and replaces it.
Type: less mydir.out | sed 's/????????/windows /'
Result:
-rwx------ 1 tharris mkgroup-l-d 2402 Apr 12 09:41 myproject.r
drwx------+ 13 tharris windows 0 Jan 28 06:45 Application Data
drwx------+ 6 tharris windows 0 May 25 06:12 Desktop
-rwx------+ 1 tharris windows 1905 Apr 12 09:29 DesktopGarpLog.txt
drwx------+ 3 tharris windows 0 Apr 12 09:34 Favorites
Now those annoying question marks are gone. The single quotes keep the shell from splitting the expression at the space or treating the ? characters as wildcards.
Type: sed 's/????????/windows /' mydir.out > mydir_2.out
Result: you will now have a text file called mydir_2.out with ???????? replaced by windows. |
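To see the substitution on a single line, here is a small sketch. The input line is copied from the listing above and held in a shell variable rather than read from mydir.out. In sed's basic regular expressions a ? is an ordinary character, so the eight question marks match literally.

```shell
# One line from the directory listing, held in a variable for illustration.
line='drwx------+ 3 tharris ???????? 0 Apr 12 09:34 Favorites'

# Replace the eight literal question marks with "windows" plus a padding space,
# so the replacement is also eight characters wide.
fixed=$(printf '%s\n' "$line" | sed 's/????????/windows /')
echo "$fixed"
```

The padding space is why two spaces follow "windows" in the output; it keeps the later columns at the same character positions, which matters for position-based tools such as cut -c.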
cut |
Extracts selected character columns (or fields) from each line.
Type: cut -c 50-56 mydir_2.out | less
Result (of course with different dates):
Apr 1
Apr 25
May 12
May 12
Apr 17 |
awk |
Also extracts data columns, but is more powerful than cut: it can filter rows as well, much like a WHERE clause in SQL or an IF statement in SAS.
Type: less mydir_2.out | awk '{if (substr($0,51,3) == "Apr") print $0}' | less
Result:
-rwx------ 1 tharris mkgroup-l-d 2402 Apr 12 09:41 myproject.r
-rwx------+ 1 tharris windows 1905 Apr 12 09:29 DesktopGarpLog.txt
drwx------+ 3 tharris windows 0 Apr 12 09:34 Favorites
To create a file, type: less mydir_2.out | awk '{if (substr($0,51,3) == "Apr") print $0}' > mydir_3.out |
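The substr() filter above depends on the month starting at a fixed character position. As an alternative sketch (not the method used above), the same kind of row filter can test a whitespace-separated field instead. The file listing.txt is a made-up name, and its lines are taken from the sample output in this tutorial; in them the month is the sixth field.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample lines copied from the listing shown earlier; listing.txt is hypothetical.
cat > listing.txt <<'EOF'
-rwx------ 1 tharris mkgroup-l-d 2402 Apr 12 09:41 myproject.r
drwx------+ 13 tharris windows 0 Jan 28 06:45 Application Data
drwx------+ 6 tharris windows 0 May 25 06:12 Desktop
-rwx------+ 1 tharris windows 1905 Apr 12 09:29 DesktopGarpLog.txt
drwx------+ 3 tharris windows 0 Apr 12 09:34 Favorites
EOF

# Keep only the rows whose sixth field is the string "Apr".
# A pattern with no action prints the whole matching line.
apr_lines=$(awk '$6 == "Apr"' listing.txt)
apr_count=$(printf '%s\n' "$apr_lines" | wc -l)

echo "$apr_lines"
echo "rows kept: $apr_count"
```

Field tests survive changes in column width, while substr() is the better fit when the data is strictly fixed-width and may contain blank columns.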
sort |
Orders the lines of a file in either ascending or descending order. You can specify a column to use as the sort key.
Type: sort -n -t ' ' -k 2,2 mydata.dat
Result: the contents of mydata.dat are displayed sorted by the second field. The -n option requests a numeric rather than an alphabetical sort. Adding the -r option gives a reverse (descending) order. The -t ' ' option says the fields are separated by a space, and -k 2,2 means sort by the second column. The older +N key notation is obsolete and is rejected by many current versions of sort. |
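To make the effect of -n and -r concrete, here is a small sketch using a three-line file. The file name nums.dat and its values are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-column, space-separated data.
printf '%s\n' '30 200' '10 1000' '20 50' > nums.dat

# Numeric sort on the second field only (-k 2,2 starts and ends the key at field 2).
numeric=$(LC_ALL=C sort -n -t ' ' -k 2,2 nums.dat)

# Same key, descending order.
reversed=$(LC_ALL=C sort -n -r -t ' ' -k 2,2 nums.dat)

# Without -n the key is compared as text, so "1000" sorts before "200" and "50".
as_text=$(LC_ALL=C sort -t ' ' -k 2,2 nums.dat)

printf 'numeric:\n%s\n' "$numeric"
printf 'reversed:\n%s\n' "$reversed"
printf 'as text:\n%s\n' "$as_text"
```

The text-mode result puts 1000 first, which is the usual sign that -n was forgotten on a numeric column.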
head |
When working with very large files it is sometimes useful to work with a subset, especially when debugging code. The head command lets you do this.
Type: less mydata.dat | head -5
Result: only the first five lines of the file are shown. |
tail |
To work with the bottom of a file rather than the top, use tail.
Type: less mydata.dat | tail -5
Result: only the last five records of the file are shown. |
join |
Unix has a join command similar to a SQL join or a SAS MERGE statement. To test join, let's first construct two new data sets. Enter the following code:
less mydata.dat > mydata_2.dat
less mydata_2.dat | awk '{if (substr($0,14,1) == "3" || substr($0,15,1) == "1") print substr($0,13,10) " Y"; else print substr($0,13,10) " N"}' > mydata_lkup.dat
Now you have two new data sets: a copy of our original random-number data set and a lookup table with a key pointing back to the original data.
Now type: join -1 2 -2 1 mydata_2.dat mydata_lkup.dat | less
Result:
614230376 2116315928 2808687127 1513727505 Y
2786586641 1078697315 4284908016 933354663 N
901415638 2527438256 3497368500 3894108367 N
3338765228 3463564639 3715602095 3944235862 Y
2901961487 2787207594 3739011318 4040597610 N
2380204561 2381578890 2611563505 292512547 Y
3810153523 2377573389 44853491 2382807132 Y
1853161002 851838940 4237925568 3627299786 N
2070425071 1236857502 150640963 2672607003 N
534159806 1991382958 2279021152 3452133675 N
Note that col2 has been swapped with col1 and the new data has been appended to the end of each record. The options -1 2 -2 1 indicate which field to use for the join: in this example we want col2 in our data set to match col1 in the lookup table. |
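One point worth adding: join expects both inputs to be sorted on the join field. The sketch below shows the same -1 2 -2 1 pattern with that sort step made explicit. The file names and the tiny key values (k1, k2, k3) are made up for illustration and stand in for the random-number data.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data set: the join key is the second column.
printf '%s\n' 'a1 k3 x' 'a2 k1 y' 'a3 k2 z' > data.txt
# Hypothetical lookup table: the join key is the first column.
printf '%s\n' 'k2 N' 'k3 Y' 'k1 Y' > lookup.txt

# Sort each file on its own join field, using the same locale as join.
LC_ALL=C sort -k 2,2 data.txt > data_sorted.txt
LC_ALL=C sort -k 1,1 lookup.txt > lookup_sorted.txt

# Match field 2 of the first file against field 1 of the second.
joined=$(LC_ALL=C join -1 2 -2 1 data_sorted.txt lookup_sorted.txt)
echo "$joined"
```

As in the result above, the join field is printed first, followed by the remaining fields of the first file and then those of the second.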
paste |
Another way to combine two files is the paste command. paste merges two files horizontally, line by line, regardless of any key value. If your two files are sorted properly and contain no unmatched values, like the data sets we constructed, paste is a faster way to merge the files.
Type: paste mydata_2.dat mydata_lkup.dat
Result:
2116315928 614230376 2808687127 1513727505 614230376 Y
1078697315 2786586641 4284908016 933354663 2786586641 N
2527438256 901415638 3497368500 3894108367 901415638 N
3463564639 3338765228 3715602095 3944235862 3338765228 Y
2787207594 2901961487 3739011318 4040597610 2901961487 N
2381578890 2380204561 2611563505 292512547 2380204561 Y
2377573389 3810153523 44853491 2382807132 3810153523 Y
851838940 1853161002 4237925568 3627299786 1853161002 N
1236857502 2070425071 150640963 2672607003 2070425071 N
1991382958 534159806 2279021152 3452133675 534159806 N
paste can also be used to pivot a file (or several files) so that all the text is on one line.
Type: paste -d: -s mydata_2.dat
Result: all the data will be on one line, with the original lines separated by colons. This is sometimes useful in data processing. |
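The two uses of paste described above can be sketched on a pair of tiny files. The names left.txt and right.txt and their contents are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two hypothetical three-line files.
printf '%s\n' 1 2 3 > left.txt
printf '%s\n' a b c > right.txt

# Side-by-side merge: line N of left.txt is glued to line N of right.txt.
# -d ' ' uses a space instead of the default tab as the separator.
side_by_side=$(paste -d ' ' left.txt right.txt)

# Serial mode (-s): all lines of one file are joined onto a single line.
one_line=$(paste -s -d ':' left.txt)

echo "$side_by_side"
echo "$one_line"
```

Note that paste pairs lines purely by position, so if the two files are in a different order the rows will be glued together without any warning.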
split |
Breaks a file apart into smaller pieces.
Type: split -l 10 mydata.dat new
Result: you will have ten new files called newaa, newab, … newaj, each with 10 observations. |
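A quick way to convince yourself that split loses nothing is to glue the pieces back together. This is a minimal sketch using a made-up 100-line file called hundred.dat and the prefix part_ (both are assumptions, not files from the tutorial).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical 100-line input.
seq 1 100 > hundred.dat

# Ten lines per piece; output names are the prefix plus a two-letter suffix.
split -l 10 hundred.dat part_

piece_count=$(ls part_* | wc -l)
first_piece=$(ls part_* | head -n 1)
last_piece=$(ls part_* | tail -n 1)
echo "pieces: $piece_count ($first_piece .. $last_piece)"

# The shell expands part_* in sorted order, so cat restores the original sequence.
if cat part_* | cmp -s - hundred.dat; then
    rebuilt=same
else
    rebuilt=different
fi
echo "rebuilt file: $rebuilt"
```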
uniq |
Collapses each run of identical adjacent lines into a single line.
Type: less mydata_lkup.dat | cut -c 12 | sort | uniq
Result:
N
Y
Now let's see what happens if we remove the sort command.
Type: less mydata_lkup.dat | cut -c 12 | uniq
Result:
Y
N
Y
N
Y
N
Without the sort command, only identical lines that sit next to each other are collapsed. |
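The difference between the two pipelines above can be reproduced without the data files. In this sketch the Y/N flags are typed in directly, in the same order as the lookup table shown in the join result. The uniq -c step is an extra, not part of the tutorial's commands: it counts how many lines were collapsed into each value.

```shell
# The ten flags from the lookup table, one per line, in their original order.
flags=$(printf '%s\n' Y N N Y N Y Y N N N)

# uniq alone only collapses neighbouring duplicates.
# paste -s puts the result on one line for easy reading.
unsorted=$(printf '%s\n' "$flags" | uniq | paste -s -d ' ' -)

# Sorting first brings all identical lines together, so each value appears once.
sorted=$(printf '%s\n' "$flags" | sort | uniq | paste -s -d ' ' -)

# uniq -c prefixes each value with its count; awk tidies the spacing.
counts=$(printf '%s\n' "$flags" | sort | uniq -c | awk '{print $2 "=" $1}' | paste -s -d ' ' -)

echo "uniq only:      $unsorted"
echo "sort then uniq: $sorted"
echo "counts:         $counts"
```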