1. Final projects and group assignment.
2. A brief glance at regression in R.
3. Kernel density estimation.
4. 2-d kernel smoothing.
5. Simulating from a density by rejection sampling.
6. Maps.
7. R Cookbook ch. 8.

1. Final projects.

I will randomly assign you to groups of 3-4 students, and your group will analyze some data using the methods we have discussed in class, including the methods used for the homeworks. Your group will write up the analysis in a written report and will also give an oral presentation, in which every member of the group will speak. Each oral presentation will be 8-10 minutes in total. You will be graded on the performance of your group rather than individually: each group will receive one overall grade based on the oral report and written report combined. This combined grade will count for 15% of your overall grade in the class, and the other 85% will be based on your homework assignments.

Your dataset, which you will find yourselves on the web, can be anything you choose, but it should be:
a) regression-style data, i.e. a response variable and, for each observation, several (at least 2 or 3) explanatory variables, with at the very least n = 30 observations.
b) something of genuine interest to you.

Analyze the data using the methods we have talked about in class, such as linear regression, univariate kernel density estimation, 2-d kernel density estimation, statistical testing, quantile plots, and kernel regression. At least one component of your data analysis should be done in C.

Your final project should be submitted to me as a pdf by email to frederic@stat.ucla.edu by Sun, Dec 9, 11:59pm. Note the address: do not send it to stat202a@stat.ucla.edu. Obviously, you only need to send one per group.

-- Read up through the end of ch. 10 in R Cookbook for next class.

I will now randomly assign people to groups. First, add any missing names to roster.txt. If you want to swap oral presentation dates and times with another group, feel free, but let me know. I will use sample(), which randomly permutes the elements of a vector.

a = c("a","b","c","d")
sample(a)

x = scan("roster.txt",what="char",sep="\n")
n = length(x)
y = sample(x)
fl = floor(n/3)
for(i in 1:fl){
    if(i == 1) cat("\n\n Thur, Nov 29\n")
    if(i == 7) cat("\n\n Tues, Dec 4 \n")
    if(i == 14) cat("\n\n Thur, Dec 6 \n")
    for(j in 1:3) cat(y[3*(i-1)+j],". ")
    cat("\n")
}

Thur, Nov 29
1. MALEKMOHAMMADI, MAHSA . CHUGH, KARAN . XIA, FANGTING .
2. BELOGOLOVA, HELEN . WANG, YUTING . ELLIOTT, PETER WILLIAM .
3. LIANG, MUZHOU . CHEN, YU-CHING . CARLITZ, RUTH DENALI .
4. DANIELSON, AARON . CUNG, BIANCA . CARLEN, JANE .
5. HUANG, MING-CHUN . KARYEKAR, AMIT SUHAS . SHI, HAOLUN .
6. CHANG, HUA-I . TRICKEY, KIRSTEN ALEXANDRA . SKIPPER, PETER TYREL .

Tues, Dec 4
7. ZHU, HUA . KIM, BRIAN . LACOUR, MICHAEL JULES .
8. UPPALA, MEDHA . SHEN, ZHIYUAN . LU, YANG .
9. HUANG, YAFEI . CHOI, ICK KYU . LIU, JIA LIN .
10. GLEASON, TANIA MARIE . LUO, YIYANG . JASWANI, DRUMIL PARAS .
11. LI, JINCHAO . CHEN, RUI . WANG, YAN .
12. WANG, CHENHUI . GAURAV, ADITYA . TCHEMODANOV, NATALIA .
13. XIE, JIANWEN . XIAO, QIAN . BUDHRANI, VINEESHA .

Thur, Dec 6
14. CAO, RENZHE . WANG, WENJIA . ZHOU, YUXI .
15. VASILYEV, ALEXANDER . LIAN, XIAOCHEN . TSUI, JASON .
16. BAE, JOONBUM . LI, LU . NAYAK, SHILPI .
17. RICHMOND, ROBERT JAMISON . LI, ZHONGBO . ZHOU, YUAN .
18. ARFA, JONATHAN MICHAEL . HO, KING CHUNG . WOLF, JEFFREY ARIEN .
19. KAPADIA, HIRAL . NIE, XIAOHAN . CHEN, XIANJIE .
!!! YOU CAN ALSO EMAIL ME AND I WILL PUT YOU IN TOUCH WITH YOUR TEAMMATES !!!

2. A brief glance at regression in R.

Adding a regression line, p232.

x = runif(100)
y = -1.5*x + rnorm(100)
z = lm(y ~ x)
plot(x,y)
abline(z, lty=2, col="blue")
z$coef

lm() can also be used for multiple linear regression. See p269-271, lm(y ~ x1 + x2 + x3).

x1 = runif(100)
x2 = rnorm(100)
x3 = rexp(100)
y = 17.0 + x1 + 1.5*x2 - 5.0*x3 + rnorm(100)
z = lm(y ~ x1 + x2 + x3)
z$coef
z$coef[1] ## the estimated intercept
z$coef[3] ## the estimated slope corresponding to x2
attributes(z)
summary(z)

In ch. 11, p297, there are functions to calculate the "influence" of each point in regression. You won't use this for hw2.

3. Kernel density estimation.

x = rnorm(1000)
hist(x,nclass=40,prob=T,main="simulated N(0,1) random variables")
lines(density(x),lty=2,col="red")

density() (p250) is useful for kernel density estimation. Suppose your univariate data are x_1, x_2, ..., x_n. For any number x0, and any bandwidth h,

f^(x0) = 1/(hn) · Σ k((x0 - x_i)/h), where the sum is over i from 1 to n,

and where k() is some density, usually centered at 0 and usually with a peak at 0. By f^, I mean fhat, our estimate of the true density f. Here the true f is the N(0,1) density. h is called the bandwidth. If h is small, then each point contributes a lot of density close to itself, whereas if h is large, then each point gets smoothed out heavily. People have shown that the choice of kernel k doesn't matter nearly as much as the choice of bandwidth.

See help(density). By default, kernel = "gaussian". By default, bw = "nrd0", which is Silverman's (1986) rule of thumb, 0.9 s n^(-.2). Here s is an estimate of the sd, given by min{sample sd, IQR/1.34}. This is often too small, so Scott (1992) recommended 1.06 s n^(-.2). bw.nrd0() yields Silverman's rule, and bw.nrd() yields Scott's.

bw.nrd0(x)
0.9 * min(sd(x),IQR(x)/1.34) * 1000^(-.2)
bw.nrd(x)
1.06 * min(sd(x),IQR(x)/1.34) * 1000^(-.2)

hist(x,nclass=40,prob=T,main="simulated N(0,1) random variables")
b2 = bw.nrd(x)
lines(density(x,bw=b2),col="red")
lines(density(x,bw=10*b2),col="blue")
lines(density(x,bw=.1*b2),col="brown")
lines(sort(x),dnorm(sort(x)),col="green")
legend("topright",c("Scott's bw", "huge bw", "tiny bw", "truth"),
    lty=1,col=c("red","blue","brown","green"),cex=0.8)

To compute kernel estimates on a grid, supply density() with n, from, and to. Or do it yourself. Let's take a really small example.

x = c(1,2.8,3)
z = density(x,bw=0.08,n=101,from=0,to=4)
## by default, n = 512 and the grid goes from min(x) - 3*bw to max(x) + 3*bw.
plot(c(0,4),c(0,2),type="n",xlab="x",ylab="density",yaxs="i")
lines(z,col="red")

Note that the curve looks smooth but it's really discrete.

z$x[1:10]
plot(z$x, z$y)

Consider, for instance, the kernel density estimate f^(x0) at x0 = 1.20.

z$x[31] ## 1.20
z$y[31] ## f^(1.2) = 0.07465388.

But density() does some slightly weird things, especially with small datasets. From help(density): "The algorithm used in density.default disperses the mass of the empirical distribution function over a regular grid of at least 512 points and then uses the fast Fourier transform to convolve this approximation with a discretized version of the kernel and then uses linear approximation to evaluate the density at the specified points."
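You can see the effect of this approximation by evaluating the kernel formula directly at x0 = 1.2. With the Gaussian kernel, (1/h) k((x0 - x_i)/h) is simply dnorm(x0 - x_i, sd = h), so a quick one-line check, using the same three data points and bw = 0.08 as above, is:

sum(dnorm(1.2 - c(1,2.8,3), sd=0.08)) / 3  ## 0.07303459 from the formula, vs. 0.07465388 from density()

The small difference comes from the binning and linear interpolation that density() does, as described in the help page quote above.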
Alternatively, you can find the kernel density estimate yourself.

density2 = function(x1,xgrid,bw2){
    ## x1 = data.
    ## xgrid = vector of values where you'll compute the kernel estimate.
    ## bw2 = bandwidth.
    n = length(xgrid)
    y = rep(0, n)
    for(i in 1:n){
        ## Gaussian kernel estimate at xgrid[i]: average of dnorm(x1 - xgrid[i], sd = bw2).
        y[i] = sum(dnorm(x1-xgrid[i], sd=bw2)) / length(x1)
    }
    y
}

x = c(1,2.8,3)
g = seq(0,4,length=101)
y = density2(x,g,0.08)
y[31] ## 0.07303459
plot(g,y,type="l",xlab="x",ylab="density")

Look at the difference between density() and density2().

plot(g, z$y - y)

You can make the grid and bandwidth whatever you want.

x1 = c(1,2.8,3)
grid1 = seq(-2,6,length=101)
bw1 = bw.nrd(x1)
y = density2(x1,grid1,bw1)
plot(grid1,y,type="l",xlab="x",ylab="density")

4. 2-d kernel smoothing, using kernel2d() in splancs.

install.packages("splancs")
library(splancs)

## First, input 23 points using the mouse.
n = 23
plot(c(0,1),c(0,1),type="n",xlab="longitude",ylab="latitude",main="locations")
x1 = rep(0,n)
y1 = rep(0,n)
for(i in 1:n){
    z1 = locator(1)
    x1[i] = z1$x
    y1[i] = z1$y
    points(x1[i],y1[i])
}

I stopped here.
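When we pick this up, the smoothing step itself might look roughly like the following sketch with kernel2d(). The unit-square boundary polygon, the bandwidth 0.15, and the 100 x 100 grid below are illustrative choices, not values from class:

pts = cbind(x1, y1)                                    ## the 23 clicked points as a 2-column matrix
bdry = matrix(c(0,0, 1,0, 1,1, 0,1), ncol=2, byrow=T)  ## boundary polygon: the unit square
bdw = 0.15                                             ## bandwidth, an arbitrary choice for illustration
z2 = kernel2d(pts, bdry, bdw, nx=100, ny=100)          ## 2-d kernel estimate on a 100 x 100 grid
image(z2$x, z2$y, z2$z, xlab="longitude", ylab="latitude", main="2-d kernel smoothing")
points(x1, y1)

As in the 1-d case, the choice of bandwidth bdw matters much more than the choice of kernel.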