1. Enrollment.
2. Notes on hw1.
3. R Cookbook.

1. Enrollment.
I'm not giving out any more PTE numbers. If you are an unenrolled Statistics MS or PhD student, then please see me.

2. Notes on hw1.
Read through ch. 6 of the R Cookbook for next week.
As arguments to plot(), pch changes the plotting symbol and col the plotting color.
type="n" sets up the plot but does not plot the points.
points(x,y) adds the points to the current plot.
cex is for character size.
lines(x,y) connects the dots and adds them to the current plot.
lty adjusts the line type.
; is equivalent to hitting return.

x = 1:5; y = c(3,2,4,3,4)
plot(x,y)
plot(x,y,type="n")
points(x,y,pch=".") ## tiny dots
points(x,y,pch=2) ## triangles
plot(x,y,type="n"); points(x,y,pch=3) ## plus signs
plot(x,y,type="n"); points(x,y,pch=3, col="blue")
plot(x,y,type="n"); points(x,y,pch=x, col="blue")
points(x,y,pch=as.character(x))
plot(x,y,pch=x, col="blue",cex=2); points(x,y,pch=as.character(x),cex=.7)
lines(x,y,col="red",lty=3)

runif(n) generates n pseudo-random uniform variables on [0,1].

x = runif(10000)
y = runif(10000)*20-7 ## 10000 random uniforms on [-7,13].
plot(x,y)
plot(x,y,pch=".")
quantile(x,0.9)
quantile(y,0.9)

sort() sorts a vector, by default from smallest to largest.

z = sort(y)
z[1:5]
z[9000]
z = sort(y,decreasing=T) ## to sort from biggest to smallest.
z[1:5]

3. R Cookbook.
p41, %in% gives TRUE or FALSE for whether each element on the left is in the vector on the right.

c(1:3) %in% c(14:24)
c(1:3) %in% c(14:24,2)

p42, functions return the LAST expression, or you can specify return(x).
Variables in a function are local. x in a function is different from x outside the function.

x = 3
f2 = function(n){
  x = 4
  n*x
}
f2(5)
x

However, you can do x <<- 4 to change x globally.

f2 = function(n){
  x <<- 4
  n*x
}
f2(5)
x

p44 tells you how to open a new editor window. This does not work on all platforms.
The common mistakes on p46-49 are a must read.
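Before turning to those mistakes, the local-versus-global behavior from p42 can be checked directly. The following is only a minimal sketch: the names f_local, f_global, and the saved result variables are made up for illustration and are not from the R Cookbook.

```r
x = 3

f_local = function(n){
  x = 4        ## assigns to a local x; the x outside is untouched
  n*x
}

f_global = function(n){
  x <<- 4      ## assigns to the x outside the function
  n*x
}

res_local = f_local(5)
x_after_local = x     ## value of the outside x after the local version runs

res_global = f_global(5)
x_after_global = x    ## value of the outside x after the <<- version runs

cat(res_local, x_after_local, res_global, x_after_global, "\n")
```

Both functions return the same product; the only difference is whether the x outside the function has changed afterwards.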
A common mistake is using =, which is for assignment, instead of the logical operator ==.

x = c(1,3)
y = x[x=3]

R doesn't really know what to do with this. What it ends up doing is essentially ignoring the "x=" part and just taking y = x[3].

x = c(1,3,4,8,3)
y = x[x=3]
z = x[x==3]

p47, the problem

total = 1 + 2 + 3
+ 4 + 5

is really common. This happens often when text editors wrap your code strangely, and then you cut and paste it. R treats the first line as a complete statement, so total is 6. Note that if you do

total = 1 + 2 + 3 +
4 + 5

then the problem goes away. R is smart about that: the trailing + tells it the expression continues on the next line.
A related problem is this:

total = 1 + 2 + 3 ## this will be the total of all my elements in
my dataset

R treats "my dataset" as a new line, and gives a syntax error.

p48, two common problems. aList[i] and aList[[i]], and & and &&.

aList = list(w = rep(0,4), x = 1:10, y = rep(3,12), z = 1:5)
aList[[2]] ## the second item in the list
aList[2] ## a list containing the second item, and usually not what you want.
x = aList[[2]]
y = aList[2]
x+1
y+1 ## error, because y is a list rather than a numeric vector.
mode(x)
mode(y)

mode() is kinda nice.

x = 3
mode(x) ## "numeric".
x = "a"
mode(x) ## "character".
help("&")

p48, &, |, && and || are for logical arguments. "& and && indicate logical AND and | and || indicate logical OR. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right examining only the first element of each vector."
Basically, in my personal experience, you use && within if(), and you use & to generate a vector of TRUEs and FALSEs.

x = c(1:10)
(x<6)
(x>3)
((x < 6) & (x > 3)) ## what you expect
((x < 6) && (x > 3)) ## evaluates only the first element.
y = x^2
y
if((x < 5) & (y < 5)) cat("good") ## Teetor says avoid this
if((x < 5) && (y < 5)) cat("good")

Note that recent versions of R are stricter: since R 4.2.0, if() with a condition of length greater than one is an error, and since R 4.3.0 so is && or || with an operand of length greater than one.
| means "or". Similar issues as with &.
End of chapter 2.

Finding the statistical mode, in R.
mode(x) gives you the type of variable x is, not the most commonly appearing element of a vector. We can find the mode using table().
table(x) gives a list of the sorted elements in x, along with their counts.

x = c(17,4.3,2.1,4.3,1)
table(x)
y = which.max(table(x)) ## Outputs the index in sorted list of x, not the mode.
y = table(x)
names(y)
z = as.numeric(names(y)) ## equivalent to z = sort(unique(x))
z[which.max(y)]

mode2 = function(x){
  ## finds the mode,
  ## but if there's a tie, defaults to the smallest mode.
  y = table(x)
  z = as.numeric(names(y))
  z[which.max(y)]
}
x = c(17,17,4,4,2)
mode2(x)

mode3 = function(x){
  ## finds the mode(s)
  y = table(x)
  z = as.numeric(names(y))
  y1 = as.numeric(y)
  w = (y1 == max(y1))
  z[w]
}
## equivalently, we could replace w = (y1 == max(y1)) with w = which(y1 == max(y1))
mode3(x)
x = c(rep(1,7), rep(10,7), rep(-1,8))
mode3(x)
x1 = x[x != -1]
mode3(x1)

Chapter 3.
p51, setting the working directory is really important. You need this to read files or write to files and know where they go.
p54, search() lists all packages currently loaded [not just installed but loaded into the current session].

search()
library(MASS)
search()
detach(package:MASS)
search()

p57, datasets.

head(pressure) ## or just pressure or pressure[1,]
data() ## lists all the preloaded datasets.

Instead of data(Cars93, package="MASS") you could just do

library(MASS)
data(Cars93)

p58, library() lists all installed (but not necessarily loaded) packages. install.packages() is useful to install a new one.
p63, source() is sometimes useful though it can have problems, especially if your comments go over lines, as in the example with ## this will be the total of all the elements in my dataset.

Chapter 4, Input and Output.
p72, you can type straight into R [scores = c(61, 66, 90, 88, 100)] or use the editor [scores = edit(score)]. ## typo in the text. score <- data.frame() should be scores <- data.frame().
sink() p75 is sometimes useful, especially when outputting within a loop.
Similar to cat("x", file = "x.txt"). Note that you have to do sink("x.txt"), then your commands, and then sink() to return to normal output.
read.fwf() p77 is nice. A similar function is strsplit(). You can load a whole page of numbers, text, etc. using read.delim() and then choose the column you want using substr().
read.table() is extremely useful, p79.
scan() p87 is great, and once you get the data in R, you can usually manipulate the data quite easily into a table or matrix anyway. For instance,

sink("y.txt")
z = runif(102)
cat(z)
sink()
y = scan("y.txt")
x = matrix(y, ncol=3, byrow=T)

read.csv() p81 and readHTMLTable() p84 don't work too well in my experience. Same with dbConnect() on p90.
The example on p88 and 89 about scan() is definitely worth reading, and also illustrates the use of order(). You save perm = order(world.series$year) so that you can keep year and pattern in the same order as each other.
Note that for a list, like world.series, you use the $ operator to get one element of the list. For instance,

x = list(a = c(1:3), b = rep(5,7))
x$a
x$b

You can also do x[[1]] or x[[2]] to get the first or 2nd objects in the list.
p91 illustrates the useful function paste(), which merges character strings and numeric objects.

x = 1
y = 3.4
z = paste("The answer to problem",x,"is",y,".")
cat(z)
## Note that paste adds a space between elements by default.
## You can change sep to "" to change this.
z = paste("The answer to problem",x,"is",y,".",sep="")
cat(z)

You can extract part of a character string using substr().

substr(z,5,10)
a = "abcdefghij"
substr(a,3,5)

## Done with Chapter 4.
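
Going back to the order() idea from the p88-89 example, here is a minimal sketch of why the permutation is saved. The list ws and its values are made up for illustration (they are not the world.series data from the book); only order() and the $ indexing come from the notes above.

```r
## Hypothetical stand-in for world.series: two vectors of the same length.
ws = list(year = c(2003, 2001, 2002), pattern = c("c", "a", "b"))

perm = order(ws$year)   ## indices that put year in increasing order

## Applying the same permutation to both vectors keeps each
## pattern lined up with its own year.
year_sorted = ws$year[perm]
pattern_sorted = ws$pattern[perm]

cat(year_sorted, "\n")
cat(pattern_sorted, "\n")
```

Using sort(ws$year) alone would sort the years but leave pattern in its old order, which is why the permutation itself is what gets saved and reused.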