1. Final projects and group assignment.
2. Kernel density estimation in R, continued.
3. 2-d kernel smoothing.
4. Simulating from a density by rejection sampling.
5. Maps.
6. R Cookbook ch. 8.

Reminder: no class or office hour Tue Oct 22.

1. Final projects.

I will randomly assign you to groups of 2-3 students, and your group will analyze some data using the methods we have talked about in class, including the methods used for the homeworks as well as other methods we have discussed. Your group will write up its analysis in a written report, and the group will also make an oral presentation, in which every member of the group will speak. The oral presentations will be 8-10 minutes each in total. You will be graded on the performance of your group rather than individually: each group will receive one overall grade based on the oral report and written report combined. This combined grade will count for 15% of your overall grade in the class; the other 85% will be based on your homework assignments.

Your dataset, which you will find yourselves on the web, can be anything you choose, but it should be:
a) regression-style data, i.e. a response variable and, for each observation, several (at least 2 or 3) explanatory variables. You should have at the very least n = 30 observations.
b) something of genuine interest to you.

Analyze the data using the methods we have talked about in class, such as linear regression, univariate kernel density estimation, 2-d kernel density estimation, statistical testing, quantile plots, and kernel regression. At least one component of your data analysis should be done in C.

Your final project should be submitted to me as a pdf, by email to frederic@stat.ucla.edu, by Sun, Dec 8, 11:59pm. Note the address: do not send projects to stat202a@stat.ucla.edu. Obviously, you only need to send one per group.

-- Read up through the end of ch. 10 in R Cookbook for next class.

I will now randomly assign people to groups. First, add any missing names to roster.txt. If you want to swap oral presentation dates and times with another group, feel free, but let me know.

I will use sample(), which randomly permutes the elements of a vector. (A note on making the permutation reproducible with set.seed() appears at the end of this section.)

a = c("a","b","c","d")
sample(a)

mygroups1 = function(){
    x = scan("roster.txt", what="char", sep="\n")  ## read the roster, one name per line
    n = length(x)
    y = sample(x)                                  ## random permutation of the roster
    fl = floor(n/3)
    for(i in 1:fl){
        if(i == 1) cat("\n\n Tue, Nov 26\n")
        if(i == 6) cat("\n\n Tue, Dec 3 \n")
        if(i == 12) cat("\n\n Thu, Dec 5 \n")
        for(j in 1:3) cat(y[3*(i-1)+j], ". ")
        cat("\n")
    }
}
mygroups1()

Tue, Nov 26
[1] DAS GUPTA, APARUPA . MAO, JUNHUA . WANG, PENG .
[2] CAI, YUAN . SINGH, VIKRAM VIR . WU, GUANI .
[3] LIU, YUNLEI . KARWA, ROHAN SUNIL . KUHFELD, MEGAN REBECCA .
[4] MURPHY, JOHN LEE . HONG, JI YEON . SHAH, VAIBHAV CHETAN .
[5] Csapo, Marika . LEWIS, BRONWYN ASHLEIGH . ZHANG, QIANG .

Tue, Dec 3
[6] NAN, XIAOMENG . ARELLANO, RYAN CHRISTOPHER . SHU, LE .
[7] WANG, XINYUE . CHEN, JIAMIN . HAN, TIAN .
[8] THAKUR, MANOJ RAMESHCHANDRA . POWELL, CHRISTOPHER GRANT . KHODAVIRDI, KHATEREH .
[9] XIE, MEIHUI . WONG, ALEX KING LAP . XIE, DAN .
[10] TAN, MENGXIN . ROBERTS, SHANE WILLIAM STROUTH . HUANG, PENG JUN .
[11] YU, CHENGCHENG . WU, JOHN SHING . MOUSAVI, SEYED ALI .

Thu, Dec 5
[12] FERNANDEZ, KEITH EDWARD . AHLUWALIA, ANSUYA . HAROUNIAN, BENJAMIN .
[13] GU, JIAYING . RAMIREZ, ARTURO . DEMIRDJIAN, LEVON .
[14] HUYNH, HIEN TRINH . SHEN, LINLING . CHANDRASEKHARAN, MANJANA .
[15] WISNER, TRISTAN GARDNER . MAGYAR, ZSUZSANNA BLANKA . HE, JIA .
[16] KRUMHOLZ, SAMUEL DAVID . KATTE, SAMIR RAJENDRA . WAINFAN, KATHRYN TANYA .
[17] WANG, TENG . SCHWEIG, JONATHAN DAVID . Gupta, Hitesh .

!!! PLEASE FIND YOUR TEAMMATES AFTER CLASS, OR YOU CAN EMAIL ME AND I WILL PUT YOU IN TOUCH WITH YOUR TEAMMATES !!!

Attendance is mandatory these last 3 classes.
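One practical note on the group assignment above: sample() draws from R's random number generator, so rerunning mygroups1() gives different groups each time. A minimal sketch of how to fix the permutation with set.seed(), if you want a reproducible assignment (the seed value 202 is arbitrary, my choice for illustration):

set.seed(202)               ## fix the RNG state; any integer seed works
sample(c("a","b","c","d"))
set.seed(202)               ## reset to the same state
sample(c("a","b","c","d"))  ## identical permutation to the one above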
2. Kernel density estimation in R, continued.

To compute kernel estimates on a grid, supply density() with n, from, and to. Or do it yourself. Let's take a really small example.

x = c(1,2.8,3)
z = density(x, bw=0.08, n=101, from=0, to=4)
## By default, n = 512 and the grid runs from min(x) - 3*bw to max(x) + 3*bw.
plot(c(0,4), c(0,2), type="n", xlab="x", ylab="density", yaxs="i")
lines(z, col="red")

Note that the curve looks smooth, but it's really discrete.

z$x[1:10]
plot(z$x, z$y)

Consider, for instance, the kernel density estimate f^(x0) at x0 = 1.20.

z$x[31]  ## 1.20
z$y[31]  ## f^(1.2) = 0.07465388.

But density() does some slightly weird things, with small datasets especially. From help(density): "The algorithm used in density.default disperses the mass of the empirical distribution function over a regular grid of at least 512 points and then uses the fast Fourier transform to convolve this approximation with a discretized version of the kernel and then uses linear approximation to evaluate the density at the specified points."

Alternatively, you can compute the kernel density estimate yourself.

density2 = function(x1, xgrid, bw2){
    ## x1 = data.
    ## xgrid = vector of values where you'll compute the kernel estimate.
    ## bw2 = bandwidth.
    n = length(xgrid)
    y = rep(0, n)
    for(i in 1:n){
        ## average of Gaussian kernels centered at the data, evaluated at xgrid[i]
        y[i] = sum(dnorm(x1 - xgrid[i], sd=bw2)) / length(x1)
    }
    y
}

x = c(1,2.8,3)
g = seq(0,4,length=101)
y = density2(x, g, 0.08)
y[31]  ## 0.07303459
plot(g, y, type="l", xlab="x", ylab="density")

Look at the difference between density() and density2().

plot(g, z$y - y)

You can make the grid and bandwidth whatever you want.

x1 = c(1,2.8,3)
grid1 = seq(-2,6,length=101)
bw1 = bw.nrd(x1)
y = density2(x1, grid1, bw1)
plot(grid1, y, type="l", xlab="x", ylab="density")

3. 2-d kernel smoothing, using kernel2d() in splancs.

install.packages("splancs")
library(splancs)

## First, input 23 points using the mouse.
n = 23
plot(c(0,1), c(0,1), type="n", xlab="longitude", ylab="latitude", main="locations")
x1 = rep(0,n)
y1 = rep(0,n)
for(i in 1:n){
    z1 = locator(1)      ## click once on the plot to place each point
    x1[i] = z1$x
    y1[i] = z1$y
    points(x1[i], y1[i])
}

bdw = sqrt(bw.nrd0(x1)^2 + bw.nrd0(y1)^2)  ## a possible default bandwidth
b1 = as.points(x1, y1)
bdry = matrix(c(0,0, 1,0, 1,1, 0,1, 0,0), ncol=2, byrow=T)  ## unit-square boundary
z = kernel2d(b1, bdry, bdw)
attributes(z)

par(mfrow=c(1,2))
image(z, col=gray((64:20)/64), xlab="km E of origin", ylab="km N of origin")
points(b1)

## Draw a grayscale legend by hand in the second panel.
x4 = seq(min(z$z), max(z$z), length=100)
plot(c(0,10), c(.8*min(x4), 1.2*max(x4)), type="n", axes=F, xlab="", ylab="")
image(c(-1:1), x4, matrix(rep(x4,2), ncol=100, byrow=T), add=T, col=gray((64:20)/64))
text(2, min(x4), as.character(signif(min(x4),2)), cex=1)
text(2, (max(x4)+min(x4))/2, as.character(signif((max(x4)+min(x4))/2, 2)), cex=1)
text(2, max(x4), as.character(signif(max(x4),2)), cex=1)
mtext("density (pts/km^2)", side=3, line=-3, at=1)

4. Simulating from a density by rejection sampling.

To simulate by rejection sampling, you need the density f(x) you're interested in to satisfy f(x) <= b g(x), where b is some constant and g(x) is some density that's easy to sample from. For instance, if f(x) has some known maximum c and finite support of length d, then you can let g(x) be the uniform density on that support and let b = cd, since then f(x) <= c = b g(x). Basically, the way it works is you simulate x0 from g(x), keep it with probability f(x0)/[b g(x0)], delete it otherwise, and repeat; a generic version of this loop is sketched below.
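Here is a minimal sketch of that accept/reject loop written as a reusable function, before we apply it to concrete densities. The name rejectsample() and its arguments are mine, for illustration only: f and g are density functions, rg() returns one draw from g, and b is assumed to satisfy f(x) <= b*g(x) everywhere rg() can land.

rejectsample = function(n, f, g, rg, b){
    ## n  = number of accepted draws wanted.
    ## f  = target density; g = proposal density; rg = function returning one draw from g.
    ## b  = constant with f(x) <= b*g(x) wherever rg() proposes.
    x = rep(0, n)
    i = 0
    while(i < n){
        x0 = rg()                         ## propose a draw from g
        if(runif(1) < f(x0)/(b*g(x0))){   ## accept with probability f(x0)/[b g(x0)]
            i = i + 1
            x[i] = x0
        }
    }
    x
}

## For instance, 1000 draws from the triangular density treated next:
## rejectsample(1000, function(x) 1-abs(x-1), function(x) 0.5, function() runif(1)*2, 2)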
First, for simplicity, suppose the density has a nice closed form, like the triangular density on x in (0,2), f(x) = 1 - |x-1|. Here c = 1 and the support is (0,2), so d = 2 and b = cd = 2.

## Plot the density.
x = seq(-1,3,length=101)
y = rep(0,101)
y[(x>0) & (x<2)] = 1 - abs(x[(x>0) & (x<2)] - 1)
plot(x, y, type="l", ylab="f(x)", lty=2)  ## dashed line

To simulate from this density using rejection sampling:
a) Simulate a draw x0 from g(x) = uniform on (0,2), i.e. g(x) = 1/2 on (0,2).
b) Keep x0 with probability f(x0) / [b g(x0)].
c) Repeat.

## 10000 draws from the triangular f(x).
b = 2; g = 0.5
n = 10000
x = c()  ## this will contain the results.
i = 0
while(i < n){
    x0 = runif(1)*2
    fx0 = 1 - abs(x0 - 1)
    if(runif(1) < fx0 / (b*g)){        ## keep x0
        i = i+1
        x[i] = x0
        if(i %% 100 == 0) cat(i, " ")  ## progress report every 100 accepted draws
    }
}
hist(x, nclass=40, prob=T, add=T)

What if you want to sample from a more complicated density f(x), such as a kernel smoothing? Again, let c = the maximum of f(x), which you can find by looking at a plot of f(x). The support might not be finite, technically, but you can often find a finite region that effectively contains almost all the support of f(x). Let's use the simple example from before, where we smoothed 3 points at 1, 2.8, and 3.

x1 = c(1,2.8,3)
grid1 = seq(-2,6,length=101)
bw1 = bw.nrd(x1)
y = density2(x1, grid1, bw1)
plot(grid1, y, type="l", xlab="x", ylab="density")

Suppose we want to simulate 30,000 draws from this density f(x). The density clearly does not go higher than 0.5 anywhere, and (-2,6) is plenty wide enough to contain almost all the support of f(x). So c = 0.5, d = 8, and b = cd = 4. The only hard part is that f(x0) now needs to be computed for each x0.

b = 4; g = 1/8
n = 30000
x = c()  ## this will contain the results.
i = 0
while(i < n){
    x0 = runif(1)*8 - 2            ## this simulates the uniform g(x) on (-2,6)
    fx0 = density2(x1, x0, bw1)    ## this computes f(x0)
    if(runif(1) < fx0 / (b*g)){    ## keep x0
        i = i+1
        x[i] = x0
        if(i %% 100 == 0) cat(i, " ")
    }
}
hist(x, nclass=40, prob=T, add=T, density=50)

5. Maps.

Map of California counties.

install.packages("maps")
library(maps)
map('county', 'California')
plot(seq(-118,-116,length=10), seq(33,35,length=10))  ## plot some points, then overlay the map
map('county', 'California', add=T)
## It's not too easy to make it look good.

6. R Cookbook ch. 8.

dnorm(), pnorm(), qnorm(), and rnorm() are the normal pdf, cdf, quantile function, and simulator, respectively.

dnorm(1.2)   ## normal density
1/(sqrt(2*pi))*exp(-1.2^2/2)  ## the same thing, computed by hand
pnorm(1.96)  ## normal cdf
pnorm(0)
qnorm(.5)    ## normal quantile
qnorm(0.975) ## see p. 190
qnorm(0)     ## -Inf
qnorm(1)     ## Inf
qnorm(1.1)   ## NaN, with a warning
rnorm(3)     ## pseudo-random normals

The default is mean = 0 and sd = 1, but you can set these.

rnorm(3, mean = 50, sd = 10)
qnorm(0.975, mean = 50, sd = 10)
rnorm(3, 5, 40)       ## mean = 5 and sd = 40
rnorm(3, 5, mean=40)  ## mean = 40 and sd = 5: the named argument takes mean, so the 5 falls through to sd

You can substitute "binom", "geom", etc. for "norm". See the tables on pp. 177-178. We've seen "unif" already.

See the warning about rexp() on p. 178: the parameter is the rate 1/beta, not the mean beta.

rexp(3, 10)
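A quick sanity check on that rate convention, if you want to convince yourself (the sample size 10000 is arbitrary):

mean(rexp(10000, 10))  ## roughly 0.1, i.e. 1/rate, not 10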