Outline:
1. Admin stuff.
2. Rejection sampling again.
3. R Cookbook ch. 8-9.
4. R Cookbook ch. 10.
5. R Cookbook ch. 11.

1. Admin stuff.

Reminder, no class or office hour Tue Oct 22. We will switch to C next lecture, and then return to the R Cookbook later in the course.

Presentation schedule:

Tue, Nov 26
[1] DAS GUPTA, APARUPA . MAO, JUNHUA . WANG, PENG .
[2] CAI, YUAN . SINGH, VIKRAM VIR . WU, GUANI .
[3] LIU, YUNLEI . KARWA, ROHAN SUNIL . KUHFELD, MEGAN REBECCA .
[4] MURPHY, JOHN LEE . HONG, JI YEON . SHAH, VAIBHAV CHETAN .
[5] Csapo, Marika . LEWIS, BRONWYN ASHLEIGH . ZHANG, QIANG .

Tue, Dec 3
[6] NAN, XIAOMENG . ARELLANO, RYAN CHRISTOPHER . SHU, LE .
[7] WANG, XINYUE . CHEN, JIAMIN . HAN, TIAN .
[8] THAKUR, MANOJ RAMESHCHANDRA . POWELL, CHRISTOPHER GRANT . KHODAVIRDI, KHATEREH .
[9] XIE, MEIHUI . WONG, ALEX KING LAP . XIE, DAN .
[10] TAN, MENGXIN . ROBERTS, SHANE WILLIAM STROUTH . HUANG, PENG JUN .
[11] YU, CHENGCHENG . WU, JOHN SHING . MOUSAVI, SEYED ALI .

Thu, Dec 5
[12] FERNANDEZ, KEITH EDWARD . AHLUWALIA, ANSUYA . HAROUNIAN, BENJAMIN .
[13] GU, JIAYING . RAMIREZ, ARTURO . DEMIRDJIAN, LEVON .
[14] HUYNH, HIEN TRINH . SHEN, LINLING . CHANDRASEKHARAN, MANJANA .
[15] WISNER, TRISTAN GARDNER . MAGYAR, ZSUZSANNA BLANKA . HE, JIA .
[16] KRUMHOLZ, SAMUEL DAVID . KATTE, SAMIR RAJENDRA . WAINFAN, KATHRYN TANYA .
[17] WANG, TENG . SCHWEIG, JONATHAN DAVID . Gupta, Hitesh .

There will be a couple of changes in the teams and orders, probably next class, Oct 24. Attendance is mandatory for the last 3 classes.

2. Rejection sampling again.

Sample with density g(x). Keep each point with prob. f(x)/[bg(x)]. This makes sense as a probability since f(x) is always <= bg(x). The resulting (unnormalized) density of kept points is thus g(x) f(x)/[bg(x)] = f(x)/b. So we're keeping point x with prob. density f(x)/b. But since b is constant, if you consider the density of the KEPT points, after renormalizing it's simply f(x).
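For concreteness, here is a quick sketch of rejection sampling in R (my own toy example, not from the book), taking the Beta(2,5) density as the target f, the Uniform(0,1) density as g, and b = max of f(x)/g(x):

f = function(x) dbeta(x, 2, 5)                   ## target density f
b = optimize(f, c(0,1), maximum=TRUE)$objective  ## b = max of f(x)/g(x), since g(x) = 1 here
n = 100000
x = runif(n)                                     ## candidate points drawn from g
keep = (runif(n) < f(x)/b)                       ## keep each point with prob. f(x)/[b g(x)]
z = x[keep]
mean(keep)                                       ## acceptance rate, roughly 1/b
hist(z, nclass=50, prob=TRUE, main="")
curve(dbeta(x, 2, 5), add=TRUE)                  ## the kept points follow the density f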
3. Back to R Cookbook, p178.

Last time we discussed rnorm, rexp, etc.

rexp(3, 10)
rexp(3, 0.1)

The arguments can be vectors instead of scalars, and R will cycle through them one at a time, as described on p182.

rexp(3, c(.01, 10, .02))
signif(rexp(3, c(.01, 10, .02)), 3)
signif(rexp(3, c(.01, 10)), 2)  ## it cycles through, so the 3rd has mean 100.

The note about ?Normal and ?TDist on pp 178-179 is silly. You can use help() with a function as usual. For instance, the functions for simulating normal or t distributed random variables are rnorm() and rt(), and you can do help(rnorm) or help(rt). You just can't do help(norm) or help(t) to get these help files, because norm() and t() are matrix functions [norm and transpose, respectively].

p179, choose(n,k) gives the number of combinations.

p180, combn(1:5,3) or combn(5,3) gives the actual combinations. See for instance

combn(4,3)
combn(c(8,-5,40,20),2)
combn(c(8,8,40,20),2)

p184, sample(1:10, 5) chooses 5 items randomly from the vector 1:10.

sample(1:10, 5)

By default it is without replacement, but you can do

sample(1:10, 5, rep=T)

sample(1:10, c(5,4,2))  ## doesn't do what you'd think. Only the 5 is read.

If you want to generate 3 samples of different sizes from the vector 1:10, you could do

y = c(5,4,2)
x = list()
for(i in 1:3){
    x[[i]] = sample(1:10, y[i])
}

You can also have sample() choose elements with different probabilities, by giving it a vector of probabilities as an argument, as on p185.

sample(1:10, 50, rep=T, prob = (1:10)/55)

p187, diff() calculates the differences between successive elements of a vector. It returns a vector of length n-1 if your initial vector had length n. Note that when your vector is a vector of times, you often want to sort it before using diff().

t = rpois(10, lambda=20)
diff(t)
t1 = sort(t)
t1
diff(t1)

pp 191-192 show an interesting example of plotting densities. He calculates the density on a whole vector x of points at once.

dnorm((1:10)/10)

We can do this with the exponential distribution.

x = seq(0, 10, length.out=100)
y = dexp(x)
plot(x, y, main="Exponential distribution", type="l", ylab="density", xlab="value")

Note that he uses the polygon() function to shade in a region. region.x consists of all the x-coordinates of the region, and region.y consists of all the y-coordinates of the region.

x1 = x[1 <= x & x <= 3]
region.x = c(min(x1), x1, max(x1))  ## Teetor uses x1[1] and tail(x1,1) instead of min() and max()
y1 = y[1 <= x & x <= 3]
region.y = c(0, y1, 0)
polygon(region.x, region.y, col="blue", density=20)

p199, mean(x > 0) computes the relative frequency of positive values, because (x > 0) is a vector of TRUEs (1) and FALSEs (0).

y = runif(50000)
mean(y > 0.8)

p200 describes table(), which we've seen before.

p201, summary() applied to a table does a chi-squared test for independence between 2 factors. A small p-value indicates statistically significant dependence, i.e. strong evidence against the assumption of independence.

shy = factor(c(rep("Y", 23), rep("N", 20)))
gpa = factor(c(rep("A",10), rep("B",10), rep("C",10), rep("A",2), rep("C",2), rep("B",9)))
table(shy)
table(gpa)
table(shy, gpa)
summary(table(shy, gpa))

p203, scale() normalizes a numerical vector, matrix, or data frame.

x = c(1,2,4,4,3)
y = scale(x)
y[1:5]

pp 203-204, t.test() to test a null hypothesis about the population mean, or to get a confidence interval for the population mean.

t.test(x, mu=1)
t.test(x, conf.level = 0.975)
y = t.test(x, conf.level = 0.975)
attributes(y)
y$conf.int
y$conf.int[1:2]

p206, wilcox.test(x, conf.int=TRUE) gives a CI for the median.

x = c(1:10, -7.1, 20)
wilcox.test(x, conf.int=TRUE)

p207, testing if the population proportion of successes = some null hypothesis value. Say you have 70 successes out of 97 trials, and your null hypothesis is that p = 0.80.

prop.test(70, 97, 0.8)

You can use this too to get a CI for p.

prop.test(70, 97, 0.8, conf.level = 0.9)

Test for normality:

shapiro.test(x)

p210, the nortest package has more tests for normality.

p211, testing if the total number of runs, or streaks, or equivalently changes, in binary time series data is significantly different from what you'd expect by chance if the observations are iid, using runs.test().

install.packages("tseries")
library(tseries)
x = runif(50)
y = (x > 0.4)
runs.test(as.factor(y))
x = sample(c(0,1,2), 50, rep=T)
runs.test(as.factor(x))  ## error because the data are not binary.

p212, t.test() to test if the difference between two sample means is statistically significant.

x = runif(400)
y = rnorm(70000, mean=.57)
t.test(x,y)

By default, paired = FALSE, but if you have paired data and x and y have the same length you can say

t.test(x, y[1:400], paired = TRUE)

p214, an alternative is the nonparametric Wilcoxon-Mann-Whitney test, where you calculate the sum of the ranks of the values of x in the combined vector c(x,y), and compare with what you'd expect if the two distributions were equivalent.

wilcox.test(x,y)
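Just to see what is behind this (a quick sketch of my own, not from the book), you can compute that rank sum directly and compare it with its expected value under the assumption that the two distributions are the same:

r = rank(c(x,y))      ## ranks in the combined sample, using x and y from the t.test example above
m = length(x)
n = length(y)
sum(r[1:m])           ## observed sum of the ranks of the x values
m*(m+n+1)/2           ## expected rank sum if the two distributions are the same

The statistic W reported by wilcox.test(x,y) above is, I believe, this rank sum minus m(m+1)/2.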
p215, Pearson test to see if a correlation is statistically significant. x and y must have the same length.

x = runif(200)
y = 0.8*x + rnorm(200)
plot(x, y, main=expression(x[i]^-2))
cor.test(x,y)
abline(lsfit(x,y))

Equivalently, you can fit a line by regression and use the t-test on the coefficient for x.

b = lm(y ~ x)  ## similar to b = lsfit(x,y)
summary(b)
abline(b)
attributes(b)

The p-values are the same:

summary(b)$coef[8]
cor.test(x,y)$p.value

The book also discusses prop.test() on p217, for comparing two proportions, pairwise.t.test() on p218, for comparing the means of > 2 samples, and the Kolmogorov-Smirnov test ks.test(), to see if two cumulative distribution functions might be equivalent.

4. R Cookbook ch. 10.

p222, list of plotting functions, such as plot(), boxplot(), and hist().

p225, the main = "my title" argument in plot() creates a title.

x = runif(10)
y = rnorm(10)
plot(x, y, xlab="uniforms", ylab="normals", main="Figure 1")

p226, grid() adds a background grid, like graph paper.

grid()
grid(col=grey(.1))

p227, pch is a useful argument to plot(). We've seen pch="." for small points. You can give pch a vector and it will plot each point differently. cex controls the size of the characters.

names = c("a","b","c","d","e","f","g","h","i","j")
plot(x, y, pch=names, cex=1.5)
plot(x, y, pch=".", cex=5)

When pch takes a number, it plots a symbol, with 1 = circle, 2 = triangle, 3 = plus, etc.

types2 = c(rep(1,5), rep(2,3), rep(3,2))
plot(x, y, pch=types2)

If you want it to plot the index of each point, you have to convert the indices to characters.

names = as.character(1:10)
plot(x, y, pch=names)  ## note that number 10 gets cut off.
plot(x[1:9], y[1:9], pch=names[1:9])

text() works better for this.

plot(x, y, type="n")
text(x, y, names)

Use points() to overlay symbols.

points(x, y, pch=types2)  ## It looks bad.
plot(x, y, type="n")
text(x+.05, y, names)
points(x, y, pch=types2, col="red", cex=2)

You might want the axes to start exactly at 0, instead of R's default, which pads a little beyond the data range. Use the arguments xaxs = "i" and yaxs = "i" in plot().

x = c((0:10)/10)
y = x
plot(x, y, type="l")
plot(x, y, type="l", yaxs="i")
plot(x, y, type="l", xaxs="i", yaxs="i")
plot(x, y, type="l", xlim=c(0.1,.9), ylim=c(0.1,.9), xaxs="i", yaxs="i")

Legend, p229.

legend(0.5, .6, c("shoes", "socks", "hats"), pch=c(1:3), cex=.8, col=c("red","blue","green"))

You can use "lty" in place of "pch" above. See p230. Note that this is NOT the kind of legend I want you to do for hw2 in problem 1e.

barplot() p236, abline() p245, boxplot() p247, and hist() p249 are self-explanatory.

p244 shows a common mistake when plotting two curves on the same plot. You have to set the ranges first so that both curves fit. See the sketch at the end of these notes.

p253, qqnorm(), qqline().

x = rt(500, df=4)  ## generate 500 draws from the t_4 distribution.
hist(x, nclass=40, main = "")
qqnorm(x)
qqline(x)

To make other types of quantile plots, see p255.

plot(qt(ppoints(x), df=4), sort(x))
abline(a=0, b=1)
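Going back to the p244 point about plotting two curves on the same plot, here is a minimal sketch (my own example, not the book's) of setting the range first so both curves fit:

x = seq(0, 10, length.out=100)
y1 = dexp(x, rate=1)
y2 = dexp(x, rate=0.3)
plot(x, y2, type="l", ylim=range(y1, y2), ylab="density")  ## set ylim using both curves first
lines(x, y1, lty=2)                                        ## then overlay the second curve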