1. Enrollment.
2. Plotting the sample mean.
3. R Cookbook.


1. Enrollment.

I'm not giving out any more PTE numbers. If you are an unenrolled Statistics MS or PhD student, then please see me.
Note small changes to hw1 ("below" in line 5, and "100,000" in 2b).


2. Plotting the sample mean.

cumsum() outputs the cumulative sum from i = 1 to k of a vector, as k goes from 1 to n.

## Suppose we want to plot the sample mean of 200,000 iid normals with mean 0.12 and sd 10.
n = 200000
x = rnorm(n, mean=0.12, sd=10)
y = cumsum(x)/(1:n)  ## y[k] is the mean of the first k draws
plot(y)
## this is useful to see if the sample mean has converged.
## problems: a) the line is so thick you can't see what it's converged to.
## b) the y-axis is too broad to see what it's converged to.
## c) the x and y labels.

plot(c(1,n), c(-0.4, 0.4), type="n", xlab="k", ylab="")
mtext(expression(paste(frac(1,k), " ", sum(x[i], i==1, k))), side=2, line=2, cex=0.7)
points(1:n, y, pch=".")  ## or lines(1:n,y)
abline(h=mean(x), lty=2)
mean(x)

## By the central limit theorem,
## the standard error is sigma/sqrt(n) = 10/sqrt(200000) ~ 0.02236,
## and a 95% range for mean(x) is 0.12 +/- 1.96*10/sqrt(200000) ~ (0.0762, 0.1638).
se2 = 1.96*10/sqrt(1:n)
lines(1:n, y+se2, lty=2, col="blue")
lines(1:n, y-se2, lty=2, col="blue")


3. R Cookbook.

p40 gives the order of operations in R.
You might wonder what unary minus and plus are, and why they take precedence over multiplication or division. Unary means they act on just one element. For instance, the minus in the number -3 acts only on the 3.
3 * - 4
3 * + 4
Anyway, the most important example is the one Teetor shows on p41, 0:n-1, when n = 10. R does 0:n first, creating the vector from 0 to 10, and then subtracts 1 from each element, so the result runs from -1 to 9. This is a VERY common mistake.
n = 10
0:n
0:n-1
0:(n-1)
(0:(n-1))  ## it doesn't hurt to put parentheses around : expressions.

p41, %in% gives TRUE or FALSE for each element on the left, according to whether it is in the vector on the right.
c(1:3) %in% c(14:24)
c(1:3) %in% c(14:24,2)

p42, functions return the LAST expression, or you can specify return(x).
Variables in a function are local. x in a function is different from x outside the function.
x = 3
f2 = function(n){
    n*x  ## no x is defined inside f2, so R uses the x from outside
}
f2 = function(n){
    x = 4  ## a local x; the x outside is untouched
    n*x
}
f2(5)
x
However, you can do x <<- 4 to change x globally.
f2 = function(n){
    x <<- 4
    n*x
}
f2(5)
x

p44 tells you how to open a new editor window. This does not work on all platforms.

The common mistakes on p46-49 are a must read.
A common mistake is using =, which is for assignment, instead of the comparison operator ==.
x = c(1,3)
y = x[x=3]
R doesn't really know what to do with this. What it ends up doing is essentially ignoring the "x=" part and just taking y = x[3], which here is NA because x has only two elements.
x = c(1,3,4,8,3)
y = x[x=3]  ## x[3], which is 4
z = x[x==3]  ## the elements of x equal to 3

p47, the problem
total = 1 + 2 + 3
+ 4 + 5
is really common. This happens often when text editors wrap your code strangely, and then you cut and paste it. R sees a complete expression at the end of the first line, so total is 6. Note that if you do
total = 1 + 2 + 3 +
4 + 5
then the problem goes away, because the trailing + tells R that the expression continues. R is smart about that.
A related problem is this:
total = 1 + 2 + 3 ## this will be the total of all my elements in
my dataset
R treats "my dataset" as a new line, and gives a syntax error.

p48, two common problems: aList[i] versus aList[[i]], and & versus &&.
aList = list(w = rep(0,4), x = 1:10, y = rep(3,12), z = 1:5)
aList[[2]]  ## the second item in the list
aList[2]  ## a list containing the second item, and usually not what you want.
x = aList[[2]]
y = aList[2]
x+1
y+1  ## error, since y is a list
mode(x)
mode(y)
mode() is kinda nice.
x = 3
mode(x)  ## "numeric".
x = "a"
mode(x)  ## "character".

help("&")
p48, &, |, && and || are for logical arguments.
"& and && indicate logical AND and | and || indicate logical OR. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right examining only the first element of each vector."
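To make the quoted distinction concrete, here is a minimal sketch on a made-up vector. The all() and any() calls are an addition here, not taken from the Cookbook or the help page: they collapse a logical vector to a single TRUE or FALSE before it reaches && or if(). That matters because recent versions of R (4.3.0 and later) stop with an error when && is given a vector longer than one element.

```r
x = 1:10

## & works element by element and returns a logical vector as long as x.
both = (x < 6) & (x > 3)
both
x[both]                        ## the elements strictly between 3 and 6

## if() wants a single TRUE or FALSE, so collapse each vector first
## with all() or any(), and then combine the single values with &&.
if (all(x > 0) && any(both)) cat("good\n")
```

Whether all() or any() is the right collapse depends on what you actually want to test, so treat the condition above as one possible choice.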
Basically, in my personal experience, you use && within if(), and you use & to generate a vector of TRUEs and FALSEs.
## Note: newer versions of R are stricter here. As of R 4.2.0, if() with a condition
## longer than one element is an error, and as of R 4.3.0 so is && with an operand
## longer than one element.
x = c(1:10)
(x<6)
(x>3)
((x < 6) & (x > 3))  ## what you expect
((x < 6) && (x > 3))  ## evaluates only the first element.
y = x^2
y
if((x < 5) & (y < 5)) cat("good")  ## Teetor says avoid this
if((x < 5) && (y < 5)) cat("good")
| means "or". Similar issues as with &.

End of chapter 2.


Finding the statistical mode, in R.

mode(x) gives you the type of variable x is, not the most commonly appearing element of a vector. How do you find the mode, in the statistical sense?
We can find the mode using table(). table(x) gives the sorted distinct values in x, along with their counts.
x = c(17,4.3,2.1,4.3,1)
table(x)
y = which.max(table(x))  ## Outputs the index in the sorted list of values of x, not the mode.
y = table(x)
names(y)
z = as.numeric(names(y))  ## equivalent to z = sort(unique(x))
z[which.max(y)]

mode2 = function(x){
    ## finds the mode,
    ## but if there's a tie, defaults to the smallest mode.
    y = table(x)
    z = as.numeric(names(y))
    z[which.max(y)]
}
x = c(17,17,4,4,2)
mode2(x)

mode3 = function(x){
    ## finds the mode(s)
    y = table(x)
    z = as.numeric(names(y))
    y1 = as.numeric(y)
    w = (y1 == max(y1))
    z[w]
}
## equivalently, we could replace w = (y1 == max(y1)) with
## w = which(y1 == max(y1))

mode3 = function(x){
    ## finds the mode(s); a shorter version, but it returns them as character strings.
    y = table(x)
    z = names(y)
    w = (y == max(y))
    z[w]
}
mode3(x)
x = c(rep(1,7), rep(10,7), rep(-1,8))
mode3(x)
x1 = x[x != -1]
mode3(x1)


Chapter 3.

p51, setting the working directory is really important. You need this to read files or write to files and know where they go.

p54, search() lists all packages currently loaded [not just installed but loaded into the current session].
search()
library(MASS)
search()
detach(package:MASS)
search()

p57, datasets.
head(pressure)  ## or just pressure or pressure[1,]
data()  ## lists all the preloaded datasets.
Instead of
data(Cars93, package="MASS")
you could just do
library(MASS)
data(Cars93)

p58, library() lists all installed (but not necessarily loaded) packages. install.packages() is useful to install a new one.

p63, source() is sometimes useful, though it can have problems, especially if your comments wrap onto a new line, as in the example with
## this will be the total of all the elements in my dataset.


Chapter 4, Input and Output.

p72, you can type straight into R [scores = c(61, 66, 90, 88, 100)] or use the editor [scores = edit(score)].
## typo in the text. score <- data.frame() should be scores <- data.frame().

sink() p75 is sometimes useful, especially when outputting within a loop. Similar to
cat("x", file = "x.txt")
Note that you have to do sink("x.txt"), then your commands, and then sink() to return to normal output.

read.fwf() p77 is nice. A similar function is strsplit(). You can load a whole page of numbers, text, etc. using read.delim() and then choose the column you want using substr().

read.table() is extremely useful, p79.

scan() p87 is great, and once you get the data into R, you can usually manipulate the data quite easily into a table or matrix anyway. For instance,
sink("y.txt")
z = runif(102)
cat(z)
sink()
y = scan("y.txt")
x = matrix(y, ncol=3, byrow=TRUE)  ## 102 numbers, so 34 rows of 3

## I stopped here.

read.csv p81 and readHTMLTable p84 don't work too well in my experience. Same with dbConnect() on p90.

The example on p88 and 89 about scan is definitely worth reading, and also illustrates the use of order(). You save
perm = order(world.series$year)
so that you can keep year and pattern in the same order as each other.
Note that for a list, like world.series, you use the $ operator to get one element of the list. For instance,
x = list(a = c(1:3), b = rep(5,7))
x$a
x$b
You can also do x[[1]] or x[[2]] to get the first or 2nd objects in the list.

p91 illustrates the useful function paste(), which merges character strings and numeric objects.
x = 1
y = 3.4
z = paste("The answer to problem",x,"is",y,".")
cat(z)
## Note that paste adds a space between elements by default.
## You can set sep to "" to change this.
z = paste("The answer to problem",x,"is",y,".",sep="")
cat(z)

You can extract part of a character string using substr().
substr(z,5,10)  ## "answer"
a = "abcdefghij"
substr(a,3,5)  ## "cde"

## Done with Chapter 4.
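The order() idea from the p88-89 scan example can be sketched as follows. The data are made up for illustration (they are not the Cookbook's world.series data), and the names ws, year and pattern are just placeholders.

```r
## Hypothetical data: years out of order, with a matching label for each year.
ws = list(year = c(2003, 2001, 2002), pattern = c("c", "a", "b"))

perm = order(ws$year)          ## positions that put year in increasing order
perm

year = ws$year[perm]           ## apply the same permutation to both vectors,
pattern = ws$pattern[perm]     ## so each pattern stays with its own year

z = paste(year, pattern, sep=": ")
cat(z, sep="\n")
```

Sorting year and pattern separately with sort() would lose the pairing between them; indexing both by the same perm keeps it.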