In-Class Exercise 2:

Using the foreign package and extracting from data frames

You can take a break or do the following for the next 15 minutes. I’ll walk around the classroom answering questions.

Quit R, then start it up again.
I have stored some data files on the internet, most of them in non-R formats; see how many of these you can read.
- You’ll probably need to load the foreign package, with library(foreign), for some of these

http://www.stat.ucla.edu/~vlew/datasets/divorce3.dta

http://www.stat.ucla.edu/~vlew/datasets/HoustonCrime.Rdata

http://www.stat.ucla.edu/~vlew/datasets/yrbss.sav

(extra - this is a SAS file) http://www.stat.ucla.edu/~vlew/datasets/depressionbmi.sas7bdat
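
If you get stuck, here is a minimal sketch of the general pattern. It does not touch the files above. It assumes the foreign package is installed (it normally ships with R) and uses a tiny made-up data frame, toy, written to temporary files and read back. As far as I know, read.dta() will also take a URL in place of a file path, and read.spss() is the matching reader for .sav files. The SAS .sas7bdat format is probably outside what foreign can read directly, so a separate package would be needed for that one.

```r
library(foreign)  # read.dta(), write.dta(), read.spss() live here

# Toy data frame, invented purely for illustration
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Stata: write a .dta file to a temporary path, then read it back
dta_file <- tempfile(fileext = ".dta")
write.dta(toy, dta_file)
from_stata <- read.dta(dta_file)

# R's own format: save() writes an .Rdata file, load() restores the objects.
# load() returns the names of the objects it restored.
rdata_file <- tempfile(fileext = ".Rdata")
save(toy, file = rdata_file)
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)

str(from_stata)
print(loaded_names)

# For an .Rdata file on the web, load() needs a connection:
#   load(url("http://..."))
# For SPSS, the usual call is:
#   read.spss(path, to.data.frame = TRUE)
```

Loading into a separate environment, as done with restored here, is optional. It just keeps the loaded objects from overwriting anything in your workspace.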

Data frames are a type of list and are a common format for data in R, particularly for imported data.
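
To see the "type of list" claim for yourself, here is a small sketch. It uses mtcars, a data frame that comes built into R, rather than any of the files above.

```r
# mtcars is a built-in example data frame (datasets package)
is.data.frame(mtcars)  # a data frame...
is.list(mtcars)        # ...is also a list

# Each list element is one column, so length() counts columns
length(mtcars)
names(mtcars)
```

Because each column is a list element, the list tools you already know ($, [[ ]], and [ ]) all work on data frames.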

Let’s practice extracting what we want from a data frame. Recall the Apple data:

apple <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=AAPL")
## check it
str(apple)

## 'data.frame':    8403 obs. of  7 variables:
##  $ Date     : Factor w/ 8403 levels "1980-12-12","1980-12-15",..: 8403 8402 8401 8400 8399 8398 8397 8396 8395 8394 ...
##  $ Open     : num  525 528 540 541 542 ...
##  $ High     : num  526 531 540 542 543 ...
##  $ Low      : num  519 522 531 538 540 ...
##  $ Close    : num  523 523 532 539 543 ...
##  $ Volume   : int  8697800 10309000 9830400 5798000 6443600 7170000 6023900 7163000 7929700 10706000 ...
##  $ Adj.Close: num  523 523 532 539 543 ...

Please answer a few questions based on the results of str(apple):

What type of R object is this? (hint: R tells you immediately)
How many observations are there?
How many variables?
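
You can also ask R these questions directly instead of reading them off str(). A quick sketch, using the built-in mtcars data frame as a stand-in for apple:

```r
# mtcars stands in for apple here; swap in your own data frame
class(mtcars)  # the type of object
nrow(mtcars)   # number of observations (rows)
ncol(mtcars)   # number of variables (columns)
dim(mtcars)    # both at once: rows, then columns
```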

(If you don't like Apple, choose another symbol instead of AAPL, like HOG for Harley-Davidson or LUV for Southwest Airlines.)
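
One way to swap symbols without retyping the address is to build it with paste0(). This is only a sketch: symbol and quote_url are names I made up, and it builds the address without checking whether the server still answers at it.

```r
# Hypothetical variable names; the base address is the one used above
symbol <- "HOG"
quote_url <- paste0("http://ichart.finance.yahoo.com/table.csv?s=", symbol)
print(quote_url)

# The download step, left commented out so the sketch runs offline:
# stock <- read.csv(quote_url)
```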

Try something easy. A great first function is summary(); we can use it to quickly generate summary statistics for an entire data frame.

summary(apple)

##          Date           Open            High            Low       
##  1980-12-12:   1   Min.   : 11.1   Min.   : 11.1   Min.   : 11.0  
##  1980-12-15:   1   1st Qu.: 25.6   1st Qu.: 26.0   1st Qu.: 25.1  
##  1980-12-16:   1   Median : 40.5   Median : 41.2   Median : 39.9  
##  1980-12-17:   1   Mean   : 96.2   Mean   : 97.5   Mean   : 94.9  
##  1980-12-18:   1   3rd Qu.: 77.0   3rd Qu.: 78.0   3rd Qu.: 75.8  
##  1980-12-19:   1   Max.   :702.4   Max.   :705.1   Max.   :699.6  
##  (Other)   :8397                                                  
##      Close           Volume           Adj.Close    
##  Min.   : 11.0   Min.   :4.96e+04   Min.   :  1.2  
##  1st Qu.: 25.5   1st Qu.:5.28e+06   1st Qu.:  6.3  
##  Median : 40.5   Median :9.50e+06   Median :  9.9  
##  Mean   : 96.2   Mean   :1.34e+07   Mean   : 74.2  
##  3rd Qu.: 77.0   3rd Qu.:1.69e+07   3rd Qu.: 55.7  
##  Max.   :702.1   Max.   :2.65e+08   Max.   :677.7  
##

But if you have lots and lots of variables, you should use the extract operator $ or the square brackets to reduce the amount of information on the screen. Here is an example. Each call below uses the same function (summary) on the same piece of information (the high), but the output differs slightly in format and rounding. Which one you choose depends on your needs.

summary(apple$High)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    11.1    26.0    41.2    97.5    78.0   705.0

summary(apple[[3]])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    11.1    26.0    41.2    97.5    78.0   705.0

summary(apple[3])

##       High      
##  Min.   : 11.1  
##  1st Qu.: 26.0  
##  Median : 41.2  
##  Mean   : 97.5  
##  3rd Qu.: 78.0  
##  Max.   :705.1

summary(apple["High"])

##       High      
##  Min.   : 11.1  
##  1st Qu.: 26.0  
##  Median : 41.2  
##  Mean   : 97.5  
##  3rd Qu.: 78.0  
##  Max.   :705.1
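
The four calls above pair up, and you can check that with identical(). A sketch on the built-in mtcars data frame, whose 3rd column is named disp:

```r
# $ and [[ ]] both pull out the column itself
by_dollar   <- mtcars$disp
by_position <- mtcars[[3]]
by_name     <- mtcars[["disp"]]
identical(by_dollar, by_position)
identical(by_dollar, by_name)

# [ ] with a position or a name both give a one-column data frame
df_by_position <- mtcars[3]
df_by_name     <- mtcars["disp"]
identical(df_by_position, df_by_name)
```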

The difference between the single square bracket and the double square bracket is that the single bracket returns a data.frame and the double bracket returns a vector; otherwise, they look the same. I find the single bracket more applicable at this stage of your training. For example, suppose I want the 3rd through 5th variables:

summary(apple[3:5])

##       High            Low            Close      
##  Min.   : 11.1   Min.   : 11.0   Min.   : 11.0  
##  1st Qu.: 26.0   1st Qu.: 25.1   1st Qu.: 25.5  
##  Median : 41.2   Median : 39.9   Median : 40.5  
##  Mean   : 97.5   Mean   : 94.9   Mean   : 96.2  
##  3rd Qu.: 78.0   3rd Qu.: 75.8   3rd Qu.: 77.0  
##  Max.   :705.1   Max.   :699.6   Max.   :702.1

## this won't work
summary(apple[[3:5]])

## Error: recursive indexing failed at level 2

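The easy-to-understand answer for the error above is that the double bracket extracts exactly one element, so it doesn't support extracting a set of values the way the single bracket does. (Given a vector of indices, it tries to index recursively, one level per index, which is what fails here.)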
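
Here is a sketch of both points on the built-in mtcars data frame. It wraps the failing call in tryCatch() so the script keeps running. The exact wording of the error message may differ between R versions, so the sketch only checks that an error occurred.

```r
# Single bracket keeps the data frame; double bracket gives the column itself
class(mtcars[3])    # "data.frame"
class(mtcars[[3]])  # "numeric"

# With two indices, [[ ]] drills down: column 3, then element 4 of that column
drilled <- mtcars[[c(3, 4)]]
identical(drilled, mtcars[[3]][[4]])

# With 3:5 it tries to drill a third level into a single number, which fails
attempt <- tryCatch(mtcars[[3:5]], error = function(e) e)
inherits(attempt, "error")
```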

Finally, can you explain why this one works?

summary(apple[c(2, 3, 5)])

##       Open            High           Close      
##  Min.   : 11.1   Min.   : 11.1   Min.   : 11.0  
##  1st Qu.: 25.6   1st Qu.: 26.0   1st Qu.: 25.5  
##  Median : 40.5   Median : 41.2   Median : 40.5  
##  Mean   : 96.2   Mean   : 97.5   Mean   : 96.2  
##  3rd Qu.: 77.0   3rd Qu.: 78.0   3rd Qu.: 77.0  
##  Max.   :702.4   Max.   :705.1   Max.   :702.1
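
A hint, sketched on the built-in mtcars data frame: c(2, 3, 5) is just a vector of positions, the same kind of thing that 3:5 produces, and the single bracket accepts a vector of positions or of names.

```r
# A vector of positions picks several columns at once
idx <- c(2, 3, 5)
picked <- mtcars[idx]
names(picked)

# The same columns chosen by name
identical(picked, mtcars[c("cyl", "disp", "drat")])

# 3:5 is shorthand for the positions 3, 4, 5
identical(mtcars[3:5], mtcars[c(3, 4, 5)])
```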