http://www.stat.ucla.edu/~vlew/datasets/divorce3.dta
http://www.stat.ucla.edu/~vlew/datasets/HoustonCrime.Rdata
http://www.stat.ucla.edu/~vlew/datasets/yrbss.sav
(extra - this is a SAS file) http://www.stat.ucla.edu/~vlew/datasets/depressionbmi.sas7bdat
Let’s practice extracting what we want from a data frame. Recall the Apple data:
apple <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=AAPL")
## check it
str(apple)
## 'data.frame': 8403 obs. of 7 variables:
## $ Date : Factor w/ 8403 levels "1980-12-12","1980-12-15",..: 8403 8402 8401 8400 8399 8398 8397 8396 8395 8394 ...
## $ Open : num 525 528 540 541 542 ...
## $ High : num 526 531 540 542 543 ...
## $ Low : num 519 522 531 538 540 ...
## $ Close : num 523 523 532 539 543 ...
## $ Volume : int 8697800 10309000 9830400 5798000 6443600 7170000 6023900 7163000 7929700 10706000 ...
## $ Adj.Close: num 523 523 532 539 543 ...
Please answer a few questions based on the results of str(apple).
(If you don't like Apple, choose another symbol instead of AAPL, such as HOG for Harley-Davidson or LUV for Southwest Airlines.)
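One way to make the symbol swap less error-prone is to build the URL from a variable. This is only a sketch: the names `symbol` and `quote_url` are made up for illustration, and it assumes the Yahoo address above accepts any ticker in its `s=` parameter, which the text implies but which may no longer hold.

```r
# Hypothetical variable names; only the base URL comes from the text above.
symbol <- "HOG"   # Harley-Davidson; use "LUV" for Southwest Airlines
quote_url <- paste0("http://ichart.finance.yahoo.com/table.csv?s=", symbol)
quote_url

# Uncomment to download (assumes the service is still reachable):
# stock <- read.csv(quote_url)
# str(stock)
```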
Start with something easy. A great first function is summary(), which quickly generates summary statistics for an entire data frame.
summary(apple)
## Date Open High Low
## 1980-12-12: 1 Min. : 11.1 Min. : 11.1 Min. : 11.0
## 1980-12-15: 1 1st Qu.: 25.6 1st Qu.: 26.0 1st Qu.: 25.1
## 1980-12-16: 1 Median : 40.5 Median : 41.2 Median : 39.9
## 1980-12-17: 1 Mean : 96.2 Mean : 97.5 Mean : 94.9
## 1980-12-18: 1 3rd Qu.: 77.0 3rd Qu.: 78.0 3rd Qu.: 75.8
## 1980-12-19: 1 Max. :702.4 Max. :705.1 Max. :699.6
## (Other) :8397
## Close Volume Adj.Close
## Min. : 11.0 Min. :4.96e+04 Min. : 1.2
## 1st Qu.: 25.5 1st Qu.:5.28e+06 1st Qu.: 6.3
## Median : 40.5 Median :9.50e+06 Median : 9.9
## Mean : 96.2 Mean :1.34e+07 Mean : 74.2
## 3rd Qu.: 77.0 3rd Qu.:1.69e+07 3rd Qu.: 55.7
## Max. :702.1 Max. :2.65e+08 Max. :677.7
##
But if you have lots and lots of variables, use the extract operator $ or the square brackets to reduce the amount of information on the screen. In the examples below, each call applies the same function (summary) to the same piece of information (the High column), but the output format differs slightly: the first two print a single row of values, while the last two print a labeled column. Which one you choose depends on your needs.
summary(apple$High)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.1 26.0 41.2 97.5 78.0 705.0
summary(apple[[3]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.1 26.0 41.2 97.5 78.0 705.0
summary(apple[3])
## High
## Min. : 11.1
## 1st Qu.: 26.0
## Median : 41.2
## Mean : 97.5
## 3rd Qu.: 78.0
## Max. :705.1
summary(apple["High"])
## High
## Min. : 11.1
## 1st Qu.: 26.0
## Median : 41.2
## Mean : 97.5
## 3rd Qu.: 78.0
## Max. :705.1
The difference between the single and the double square bracket is that the single bracket returns a data.frame and the double bracket returns a vector; otherwise, the results look the same. I find the single bracket more applicable at this stage of your training. For example, suppose I want the 3rd through 5th variables:
summary(apple[3:5])
## High Low Close
## Min. : 11.1 Min. : 11.0 Min. : 11.0
## 1st Qu.: 26.0 1st Qu.: 25.1 1st Qu.: 25.5
## Median : 41.2 Median : 39.9 Median : 40.5
## Mean : 97.5 Mean : 94.9 Mean : 96.2
## 3rd Qu.: 78.0 3rd Qu.: 75.8 3rd Qu.: 77.0
## Max. :705.1 Max. :699.6 Max. :702.1
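To see the data.frame versus vector distinction for yourself, class() is a quick check. The sketch below uses a small made-up data frame called `toy` (not part of the Apple data) so that it runs without a download; the same calls should behave the same way on `apple`.

```r
# `toy` is a hypothetical stand-in for the apple data frame.
toy <- data.frame(Open  = c(10, 20, 30),
                  High  = c(12, 22, 32),
                  Low   = c( 9, 19, 29),
                  Close = c(11, 21, 31))

class(toy[2])        # single bracket: still a data.frame
class(toy[[2]])      # double bracket: a plain numeric vector
class(toy$High)      # $ behaves like the double bracket here

# Single brackets accept a range and keep the data.frame structure.
names(toy[2:4])
```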
## this won't work
summary(apple[[3:5]])
## Error: recursive indexing failed at level 2
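The easy-to-understand answer for the error above is that the double bracket extracts exactly one element, not a set of them. Given a longer index such as 3:5, it instead tries to index recursively (roughly apple[[3]][[4]][[5]]), and that fails here once it reaches a single number with nothing further inside it.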
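Here is a short sketch of what the double bracket does with a longer index. The list `nested` and the data frame `toy` are made up for illustration. tryCatch() is used only so the script keeps running after the expected failure; the exact error text may differ between R versions.

```r
# Hypothetical nested list: a vector index inside [[ ]] walks down
# one level per element.
nested <- list(a = 1, b = list(x = "first", y = "second"))
nested[[c(2, 1)]]                                # same as nested[[2]][[1]]
identical(nested[[c(2, 1)]], nested[[2]][[1]])

# Hypothetical data frame: its columns are plain numeric vectors ...
toy <- data.frame(Open = c(10, 20, 30),
                  High = c(12, 22, 32),
                  Low  = c( 9, 19, 29))

# ... so walking three levels deep runs out of structure and signals an error.
result <- tryCatch(toy[[1:3]], error = function(e) e)
inherits(result, "error")

# To pull several columns at once, use the single bracket instead.
toy[1:3]
```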
Finally, can you explain why this one works?
summary(apple[c(2, 3, 5)])
## Open High Close
## Min. : 11.1 Min. : 11.1 Min. : 11.0
## 1st Qu.: 25.6 1st Qu.: 26.0 1st Qu.: 25.5
## Median : 40.5 Median : 41.2 Median : 40.5
## Mean : 96.2 Mean : 97.5 Mean : 96.2
## 3rd Qu.: 77.0 3rd Qu.: 78.0 3rd Qu.: 77.0
## Max. :702.4 Max. :705.1 Max. :702.1
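As a hint for the question above, it can help to look at the index on its own. The sketch below again uses a made-up data frame `toy` rather than the Apple data; it shows that c(2, 3, 5) and 3:5 are both ordinary vectors of positions, and that a vector of column names can be used in the same place.

```r
# `toy` is a hypothetical stand-in with the same first columns as apple.
toy <- data.frame(Date  = c("d1", "d2", "d3"),
                  Open  = c(10, 20, 30),
                  High  = c(12, 22, 32),
                  Low   = c( 9, 19, 29),
                  Close = c(11, 21, 31))

idx <- c(2, 3, 5)     # c() combines values into one vector
length(idx)           # three positions, just not consecutive ones
length(3:5)           # the colon form is also a vector of three positions

# Select by position and by name, then compare the two results.
by_position <- toy[idx]
by_name     <- toy[c("Open", "High", "Close")]
all.equal(by_position, by_name)
```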