The S programming language is a statistical programming language developed at Bell Laboratories specifically for statistical modeling. There are two versions of S. One was developed by Insightful under the name S-Plus. The other is an open-source initiative called R. S allows you to create objects, is very extensible, and has powerful graphing capabilities.
Tips
Tip 1
Check Memory Size (Windows only; to raise the limit, use memory.limit)
    memory.size(max = TRUE)
Tip 2
Today's Date
    Today <- format(Sys.Date(), "%d %b %Y")
Tip 3
Set Working Directory
    setwd("C:/")
Tip 4
Load In Data
    ExampleData.path <- file.path(getwd(), "USDemographics.CSV")
    ExampleData.FullSet <- read.table(ExampleData.path, header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
Tip 5
Split Data
    ExampleData.Nrows <- nrow(ExampleData.FullSet)
    ExampleData.NCol <- ncol(ExampleData.FullSet)
    # Half of the rows, rounded down to a whole number
    ExampleData.SampleSize <- floor(ExampleData.Nrows / 2)
    ExampleData.Sample <- sample(ExampleData.Nrows, size = ExampleData.SampleSize, replace = FALSE, prob = NULL)
    # Column 5 is placed first, followed by all columns
    ExampleData.HoldBack <- ExampleData.FullSet[ExampleData.Sample, c(5, 1:ExampleData.NCol)]
    ExampleData.Run <- ExampleData.FullSet[-ExampleData.Sample, c(5, 1:ExampleData.NCol)]
Tip 6
Create Function
    Confusion <- function(a, b){
        tbl <- table(a, b)
        mis <- 1 - sum(diag(tbl))/sum(tbl)
        list(table = tbl, misclass.prob = mis)
    }
Tip 7
Recode Fields (recode comes from the car package)
    library(car)
    # Recent versions of car may name the last argument as.factor
    ExampleData.FullSet$SavingsCat <- recode(ExampleData.FullSet$Savings,
        "-40000.00:-100.00='HighNeg'; -100.00:-50.00='MedNeg'; -50.00:10.00='LowNeg'; 10.00:50.00='Low'; 50.00:100.00='Med'; 100.00:1000.00='High'",
        as.factor.result=TRUE)
Tip 8
Summarize Data
    summary(ExampleData.FullSet)
Tip 9
Save Output
    save.image(file = "c:/test.RData", version = NULL, ascii = FALSE, compress = FALSE, safe = TRUE)
Tip 10
Subset
    MyData.SubSample <- subset(MyData.Full, MyField == 0)
Tip 11
Remove Object From Memory
    remove(list = c("MyObject"))
Tip 12
Create a Dataframe
    TmpOutput <- data.frame(Fields = c("Field1", "Field2", "Field3"), Values = c(1, 2, 2))
Tip 13
Cut
    data(swiss)
    x <- swiss$Education
    swiss$Educated <- cut(x, breaks=c(0, 11, 999), labels=c("0", "1"))
Tip 14
Create Directories
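Several of the tips above (cut, split, and the Confusion function) can be combined into one short session. The following is a minimal sketch, not a recommended modeling workflow: it uses the built-in swiss data set, and the seed, the 50/50 split, and the Examination > 16 rule are assumptions made purely for illustration.

```r
# Self-contained sketch: recode a field, split the data, and score a rule.
data(swiss)
set.seed(42)  # assumed seed so the split is reproducible

# Tip 13: binary field from Education ("0" = up to 11, "1" = above 11)
swiss$Educated <- cut(swiss$Education, breaks = c(0, 11, 999), labels = c("0", "1"))

# Tip 5: hold back half of the rows, run on the rest
n.rows <- nrow(swiss)
sample.size <- floor(n.rows / 2)
held <- sample(n.rows, size = sample.size, replace = FALSE)
swiss.HoldBack <- swiss[held, ]
swiss.Run <- swiss[-held, ]

# Tip 6: confusion table and misclassification rate
Confusion <- function(a, b) {
  tbl <- table(a, b)
  mis <- 1 - sum(diag(tbl)) / sum(tbl)
  list(table = tbl, misclass.prob = mis)
}

# Hypothetical rule: predict "1" when Examination exceeds 16
predicted <- factor(ifelse(swiss.HoldBack$Examination > 16, "1", "0"),
                    levels = c("0", "1"))
result <- Confusion(swiss.HoldBack$Educated, predicted)
print(result$table)
print(result$misclass.prob)
```

Giving both factors the same levels keeps the table square, so diag() always picks out the correctly classified counts.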
Unix/Linux Data Management
Data management using Unix/Linux is easy, but it does have a few quirks. First, the header (the first row, which contains the field names) is typically not stored in the data file, so a separate schema or layout describing the file is required. That is because most file operations do not treat the first line differently from the rest of the file. Also, unlike SQL and SAS, complex multi-table queries can only be achieved in multiple steps. But the speed and efficiency of the code make Unix/Linux a strong data management tool.
This data would be stored without column names.
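To illustrate what keeping the layout separate from the data means in practice, here is a minimal sketch in R, the language used elsewhere in this document. The field names and the three sample records are hypothetical; in real use the layout would come from your own documentation of the file.

```r
# Hypothetical layout: field names kept separately from the data file
layout <- c("custid", "fname", "age")

# Stand-in for a header-less, comma-delimited data file
raw.lines <- c("1,Ann,34", "2,Bob,45", "3,Cara,29")

# header = FALSE because the first line is data, not field names
customer <- read.table(text = raw.lines, sep = ",", header = FALSE,
                       col.names = layout, stringsAsFactors = FALSE)
print(customer)
```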
Unix Primer
Primer in Data Management in Unix/Linux
Data manipulation in Unix/Linux is powerful yet easy after some practice. Much basic file manipulation can be achieved using the basic toolset provided with most Unix/Linux installations. First, let's generate the data file we will use for this exercise. Let's start off with some random data and a list of the files in your home directory. Type:
    od -D -A n /dev/random | head -100 > mydata.dat
Quick Unix/Linux Guide
Unix/Linux is an operating system mainly used on servers in business settings. There are numerous consumer-oriented Unix/Linux editions as well as CYGWIN (a Unix-like environment available for Windows). Many of the Unix/Linux file operation commands covered here do what you would typically do via the Windows GUI. If you are familiar with DOS, these commands are the Unix/Linux equivalents of the standard DOS file operation commands. There are numerous GUI front ends to Unix/Linux (CDE, KDE, Window Maker, OSX, …) that allow you to execute these commands via menus or the mouse, as you can in Windows. The power of using commands, however, is speed, clarity about what you are trying to achieve and, if put into a script, repeatability. Another benefit is not having to hunt through multiple layers of menus to find the command you want (if it even exists). This section covers the basic file operation commands available in Unix/Linux. Most of the commands listed here are available in the Korn and Bash shells. If you do not have access to a Unix or Linux machine, I recommend downloading and installing CYGWIN. You can find it here.
Unix/Linux vs DOS
R/Splus
The S language's power is not its data management capability, nor is data management the intent of the S language. However, when evaluating the output of a model you often need to perform basic data management with R/S-Plus, and you will find the S language acceptable in this role. The commands will seem more similar to Unix/Linux than to SQL. However, the S language has many of the benefits of Unix/Linux (a concise language for data management) while being more data-centric (data frames carry metadata, including column names).
Example:
SQL | R equivalent
SELECT | customer
ORDER BY | customer[order(customer[,2]),]
WHERE | subset(customer, custid == 5)
INNER JOIN | merge(purchaseorder, customer, by.x = "custid", by.y = "custid", all = FALSE)
LEFT OUTER JOIN | merge(purchaseorder, customer, by.x = "custid", by.y = "custid", all.x = TRUE)
GROUP BY | cust_sum <- merge(purchaseorder, customer, by.x = "custid", by.y = "custid", all = FALSE); xtabs(~ fname, cust_sum)
UPDATE | customer[1,]$age <- 23
INSERT | newcust <- data.frame(custid = 6, fname = "Terry", age = 50); rbind(newcust, customer)
DELETE | subset(customer, custid != 1)
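The SQL-to-R equivalents above can be tried end to end on two small data frames. This is a sketch under stated assumptions: the customer and purchaseorder tables, their columns (poid, amount), and all values are made up for illustration. Note that a left outer join uses all.x = TRUE, whereas all = TRUE would keep unmatched rows from both tables (a full outer join).

```r
# Hypothetical sample tables; all names and values are made up for illustration
customer <- data.frame(custid = c(1, 2, 3, 5),
                       fname  = c("Ann", "Bob", "Cara", "Dev"),
                       age    = c(34, 45, 29, 51),
                       stringsAsFactors = FALSE)
purchaseorder <- data.frame(poid   = c(101, 102, 103, 104),
                            custid = c(1, 1, 3, 9),
                            amount = c(20, 35, 15, 60))

# ORDER BY fname
sorted <- customer[order(customer$fname), ]

# WHERE custid = 5
where5 <- subset(customer, custid == 5)

# INNER JOIN: only purchase orders with a matching customer
inner <- merge(purchaseorder, customer, by = "custid", all = FALSE)

# LEFT OUTER JOIN: keep every purchase order, matched or not
left <- merge(purchaseorder, customer, by = "custid", all.x = TRUE)

# GROUP BY fname: number of orders per customer
orders.per.customer <- xtabs(~ fname, inner)

# UPDATE: change the age of customer 1
customer[customer$custid == 1, "age"] <- 35

# INSERT: append a new row (the result must be assigned to keep it)
newcust <- data.frame(custid = 6, fname = "Terry", age = 50,
                      stringsAsFactors = FALSE)
customer <- rbind(customer, newcust)

# DELETE: drop customer 1
customer <- subset(customer, custid != 1)
```

Unlike SQL statements, subset(), merge(), and rbind() return new data frames rather than modifying a table in place, so the result has to be assigned back for the change to persist.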
Example R Function
You can extend the usability of R/S-Plus by writing functions. This is similar to macros in the SAS language. These functions can be anything from a new statistical algorithm to file operations to data manipulation. Below I give an example of a custom R function. It takes output from an rpart tree and converts it to SAS code suitable for use in a DATA step. This is useful when coding nodes into a model. The dirty little secret: I developed the code by looking at the default print method for the rpart package and adapting it to generate SAS code. This code can also be modified to generate SQL code. When attempting to write new code, I suggest first looking at published packages that do something similar and then adapting them to your own use. The S language (which both R and S-Plus use) is syntactically similar to C. There are many good editors for S. This code was written using Tinn-R.
    printSAS.rpart <- function(x, minlength=0, spaces=2, cp,
    tree.depth <- getFromNamespace("tree.depth", "rpart")
    if(!inherits(x, "rpart")) stop("Not legitimate rpart object")
    ylevel <- attr(x, "ylevels")
    #32 is the maximal depth
    tfun <- (x$functions)$print
    z <- labels(x, digits=digits, minlength=minlength, ...)
    term <- rep(" ", length(depth))
    for(i in 1:length(depth))
    #print(1:i)
    if(term[i] == "Terminal")
    } # end for
    final[i] <- paste("If", temp1[i], z[i], "then NodeVal =", yval[i], ";")
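The core idea of the function, pasting a node's rule and its predicted value into a SAS IF-THEN statement, can be shown without rpart. The sketch below is not the author's function: rulesToSAS, its arguments, and the example rules and values are hypothetical and exist only to show the string-building step.

```r
# Hypothetical helper: turn rule/value pairs into SAS IF-THEN statements.
# The name rulesToSAS and its arguments are illustrative, not part of rpart.
rulesToSAS <- function(rules, values, target = "NodeVal") {
  if (length(rules) != length(values)) {
    stop("rules and values must have the same length")
  }
  # paste0 is vectorized, so one statement is built per rule
  paste0("If ", rules, " then ", target, " = ", values, ";")
}

# Made-up terminal-node rules and predicted values
sas.code <- rulesToSAS(
  rules  = c("age < 30 and income >= 50000", "age >= 30"),
  values = c(0.82, 0.17)
)
cat(sas.code, sep = "\n")
```

The same pattern can emit SQL instead by swapping the template for a CASE WHEN ... THEN ... clause.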