1. Final projects.
2. Rejection sampling again.
3. R Cookbook ch. 9 continued.
4. R Cookbook ch. 10.
5. R Cookbook ch. 11.

No lectures or office hours Nov 1 or Nov 6.

1. Final projects.

3 people have dropped recently.
* 16. BAE, JOONBUM.
* 18. HO, KING CHUNG.
* 8. SHEN, ZHIYUAN.
So, I'll dissolve team 8, and I'm going to move the others from team 8 into 16 and 18. Medha Uppala goes to 16, and Yang Lu goes to 18.

Thur, Nov 29
1. MALEKMOHAMMADI, MAHSA . CHUGH, KARAN . XIA, FANGTING .
2. BELOGOLOVA, HELEN . WANG, YUTING . ELLIOTT, PETER WILLIAM .
3. LIANG, MUZHOU . CHEN, YU-CHING . CARLITZ, RUTH DENALI .
4. DANIELSON, AARON . CUNG, BIANCA . CARLEN, JANE .
5. HUANG, MING-CHUN . KARYEKAR, AMIT SUHAS . SHI, HAOLUN .
6. CHANG, HUA-I . TRICKEY, KIRSTEN ALEXANDRA . SKIPPER, PETER TYREL .

Tues, Dec 4
7. ZHU, HUA . KIM, BRIAN . LACOUR, MICHAEL JULES .
9. HUANG, YAFEI . CHOI, ICK KYU . LIU, JIA LIN .
10. GLEASON, TANIA MARIE . LUO, YIYANG . JASWANI, DRUMIL PARAS .
11. LI, JINCHAO . CHEN, RUI . WANG, YAN .
12. WANG, CHENHUI . GAURAV, ADITYA . TCHEMODANOV, NATALIA .
13. XIE, JIANWEN . XIAO, QIAN . BUDHRANI, VINEESHA .

Thur, Dec 6
14. CAO, RENZHE . WANG, WENJIA . ZHOU, YUXI .
15. VASILYEV, ALEXANDER . LIAN, XIAOCHEN . TSUI, JASON .
16. LI, LU . NAYAK, SHILPI . UPPALA, MEDHA .
17. RICHMOND, ROBERT JAMISON . LI, ZHONGBO . ZHOU, YUAN .
18. ARFA, JONATHAN MICHAEL . WOLF, JEFFREY ARIEN . LU, YANG .
19. KAPADIA, HIRAL . NIE, XIAOHAN . CHEN, XIANJIE .

We will switch to C next lecture, and then return to R Cookbook later in the course.

2. Discuss rejection sampling again.

Sample with density g(x). Keep each point with prob. f(x)/[b g(x)]. This makes sense since f(x) is always ≤ b g(x), so this is a valid probability. The resulting density is thus g(x) f(x)/[b g(x)] = f(x)/b. So we're keeping point x with prob. density f(x)/b. The overall probability of keeping a point is ∫ f(x)/b dx = 1/b, so if you consider the density of KEPT pts, it's [f(x)/b] / (1/b), which is simply f(x).
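Here is a minimal sketch of this in R, with a made-up target of my own (not from the book): f(x) = 2x on [0,1], proposal g = uniform on [0,1], and b = 2, so that f(x) ≤ b g(x) everywhere.

f = function(x) 2*x          ## made-up target density on [0,1]
b = 2                        ## bound with f(x) <= b*g(x); g = unif(0,1)
x = runif(100000)            ## candidate points drawn from g
u = runif(100000)            ## uniforms for the accept/reject step
keep = x[u < f(x)/b]         ## keep x with prob. f(x)/[b g(x)]; g(x) = 1 here
length(keep)/100000          ## acceptance rate, should be near 1/b = 0.5
hist(keep, nclass=40, prob=TRUE)
curve(f, add=TRUE)           ## the kept points follow f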
3. R Cookbook ch. 9 continued.

p199, mean(x > 0) computes the relative frequency of positive values, because (x > 0) is a vector of TRUEs (1) and FALSEs (0).
y = runif(500)
mean(y > 0.8)

p200 describes table(), which we've seen before.

p201, summary(table) does a chi-square test for independence between 2 factors. A small p-value indicates statistically significant dependence, i.e. strong evidence against the assumption of independence.
shy = factor(c(rep("Y", 23), rep("N",20)))
gpa = factor(c(rep("A",10), rep("B", 10), rep("C",10), rep("A", 2), rep("C", 2), rep("B", 9)))
table(shy)
table(gpa)
table(shy,gpa)
summary(table(shy, gpa))

p203, scale() normalizes a numerical vector, matrix, or data frame.
x = c(1,2,4,4,3)
y = scale(x)
y[1:5]

p203-204, t.test() to test against a null hypothesis of the mean, or to get a confidence interval for the population mean.
t.test(x, mu=1)
t.test(x, conf.level = 0.975)
y = t.test(x, conf.level = 0.975)
attributes(y)
y$conf.int
y$conf.int[1:2]

p206, wilcox.test(x, conf.int=TRUE) ## CI for the median
x = c(1:10,-7.1,20)
wilcox.test(x, conf.int=TRUE)

p207, testing if the population proportion of successes = some null hypothesis value. Say you have 70 successes out of 97 trials, and your null hypothesis is that p = 0.80.
prop.test(70,97,0.8)
You can also use this to get a CI for p.
prop.test(70,97,0.8, conf.level = 0.9)

Test for normality.
shapiro.test(x)
p210, the nortest package has more tests for normality.

p211, use runs.test() to test whether the total number of runs, or streaks (or equivalently, changes), in binary time series data is significantly different from what you'd expect by chance if the observations were iid.
install.packages("tseries")
library(tseries)
x = runif(50)
y = (x > 0.4)
runs.test(as.factor(y))
x = sample(c(0,1,2),50,rep=T)
runs.test(as.factor(x)) ## error because the data are not binary.

p212, t.test() to test if the difference between two sample means is statistically significant.
x = runif(400)
y = rnorm(700, mean=.57)
t.test(x,y)
By default, paired = FALSE, but if you have paired data and x and y have the same length, you can say
t.test(x,y[1:400],paired = TRUE)

p214, an alternative is the nonparametric Wilcoxon-Mann-Whitney test, where you calculate the sum of the ranks of the values of x in the combined vector c(x,y), and compare with what you'd expect if the two distributions were equivalent.
wilcox.test(x,y)

p215, Pearson test to see if a correlation is statistically significant. x and y must have the same length.
x = runif(200)
y = 0.8*x + rnorm(200)
plot(x,y,main=expression(x[i]^-2))
cor.test(x,y)
abline(lsfit(x,y))
Equivalently, you can fit a line by regression and use the t-test on the coefficient for x.
b = lm(y ~ x) ## similar to b = lsfit(x,y)
summary(b)
abline(b)
attributes(b)
They're the same:
summary(b)$coef[8]
cor.test(x,y)$p.value

The book also discusses prop.test() on p217, for comparing two proportions, pairwise.t.test() on p218, for comparing the means of > 2 samples, and the Kolmogorov-Smirnov test ks.test(), to see if two cumulative distribution functions might be equivalent.

4. R Cookbook ch. 10.

p222, list of plotting functions, such as plot(), boxplot(), and hist().

p225, the main = "my title" argument in plot() creates a title.
x = runif(10)
y = rnorm(10)
plot(x,y,xlab="uniforms", ylab="normals", main="Figure 1")

p226, grid() adds grid lines, making the plot like graph paper.
grid()
grid(col=grey(.2))

p227, pch is a useful argument to plot(). We've seen pch="." for small points. You can give pch a vector and it will plot each point differently. cex controls the size of the characters.
names = c("a","b","c","d","e","f","g","h","i","j")
plot(x,y,pch=names,cex=1.5)
plot(x,y,pch=".",cex=5)
When pch takes a number, it plots a symbol, with 1 = circle, 2 = triangle, 3 = plus, etc.
types2 = c(rep(1,5),rep(2,3),rep(3,2))
plot(x,y,pch=types2)
If you want it to plot the index of your data, you have to make the indices characters.
names = as.character(1:10)
plot(x,y,pch=names) ## note that number 10 gets cut off.
plot(x[1:9],y[1:9],pch=names[1:9])
text() works better for this.
plot(x,y,type="n")
text(x,y,names)
Use points() to overlay symbols.
points(x,y,pch=types2) ## It looks bad.
plot(x,y,type="n")
text(x+.05,y,names)
points(x,y,pch=types2,col="red",cex=2)

You might want to make the y-axis intersect 0 exactly, with no padding around the data. Use the arguments xaxs = "i" and yaxs = "i" in plot().
x = c((0:10)/10)
y = x
plot(x,y,type="l")
plot(x,y,type="l",yaxs="i")
plot(x,y,type="l",xaxs="i",yaxs="i")
plot(x,y,type="l",xlim=c(0.1,.9),ylim=c(0.1,.9),xaxs="i",yaxs="i")

Legend, p229.
legend(0.8, 1, c("shoes", "socks", "hats"), pch=c(1:3),cex=.8)
You can use "lty" in place of "pch" above. See p230. Note that this is NOT the kind of legend I want you to do for hw2 in problem 1e.

barplot() p236, abline() p245, boxplot() p247, and hist() p249 are self-explanatory.

p244 shows a common mistake when plotting two curves on the same plot: you have to set the ranges first so that both curves fit. (A sketch of one fix appears at the end of these ch. 10 notes.)

p253, qqnorm() and qqline().
x = rt(500,df=4) ## generate 500 draws from the t_4 distribution.
hist(x, nclass=40, main = "")
qqnorm(x)
qqline(x)
To make other types of quantile plots, see p255.
plot(qt(ppoints(x),df=4),sort(x))
abline(a=0,b=1)
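Here's a quick sketch of the p244 point, with made-up curves of my own: if you plot the first curve and then add the second with lines(), the second can run off the plot region unless you set ylim from both series first.

x = seq(0, 2*pi, length=100)
y1 = sin(x)
y2 = 3*cos(x)
plot(x, y1, type="l")                      ## wrong: range set from y1 only
lines(x, y2, lty=2)                        ## parts of y2 get cut off
plot(x, y1, type="l", ylim=range(y1, y2))  ## right: set the range first
lines(x, y2, lty=2)
legend("topright", c("sin(x)", "3cos(x)"), lty=1:2, cex=.8)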
5. R Cookbook ch. 11. lm, p269-285.

lm(y ~ x) fits y = b0 + b1 x + eps.
lm(y ~ u + v + w) fits y = b0 + b1 u + b2 v + b3 w + eps.
lm(y ~ u + v + w + 0) for no intercept, b0.
lm(y ~ u*v*w) for all interactions. See p279. This fits y = b0 + b1 u + b2 v + b3 w + b4 uv + b5 uw + b6 vw + b7 uvw + eps.
To specify a particular interaction, use :. lm(y ~ u + v + w + u:v:w) for just the uvw term. See p280. This fits y = b0 + b1 u + b2 v + b3 w + b4 uvw + eps.
p280, lm(y ~ (u+v+w)^i) for all interactions of order i or lower.
u = rnorm(10000)
v = rnorm(10000)
w = rnorm(10000)
x = rnorm(10000)
y = 10 + 20*u + 30*u*v + rnorm(10000)
j = lm(y ~ (u + v + w + x)^3)
summary(j)
Use - to eliminate a term.
j = lm(y ~ (u + v + w + x)^3 - v:w:x - v)
summary(j)

step, p281, to do stepwise regression, forward or backward.
k = step(j,direction="backward")
AIC = Akaike's Information Criterion; the model with the lowest AIC is considered the best fitting. Assuming eps is iid Normal with mean 0 and some variance sigma^2, you can calculate the likelihood associated with a particular model and its parameters. Then AIC = -2 log(likelihood) + 2p, where p is the number of parameters in the model. Lower AIC is preferred. (See the sketch at the end of these notes.)
summary(k)
k2 = step(lm(y ~ 1), direction="forward", scope = ( ~ (u+v+w+x)^4), trace=0)
summary(k2)

p285, I(u+v) for a term that actually just represents u+v. Similarly, I(u^2) for a term that is u^2, or I(u*v) for a column that is u*v.
k = lm(y ~ I(u*v*w))
k = lm(y ~ I(x^2 + x))

p292, confint() for confidence intervals.
confint(k2)

p293, plot the residuals with plot(k2, which=1), and other plots with plot(k2).
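As promised, a quick check of the AIC formula above by hand, using logLik() on the model k2 from the stepwise example. One caveat: R counts the error variance sigma^2 as an estimated parameter, so p here is the number of coefficients plus 1.

ll = logLik(k2)
p = attr(ll, "df")           ## number of estimated parameters, incl. sigma^2
-2*as.numeric(ll) + 2*p      ## AIC computed by hand
AIC(k2)                      ## should match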
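And a short end-to-end sketch tying the chapter together, with simulated data of my own: fit a quadratic using I(), get confidence intervals, and look at the residual plot.

x = rnorm(200)
y = 1 + 2*x + 3*x^2 + rnorm(200)
fit = lm(y ~ x + I(x^2))     ## I(x^2) makes the quadratic term literal
summary(fit)
confint(fit)                 ## CIs for b0, b1, b2; should cover 1, 2, 3
plot(fit, which=1)           ## residuals vs. fitted values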