hw2 notes. You must work independently for all 4 homeworks!

1. Housing data.
2. Regression.
3. Kernel density estimation.
4. 2d kernel smoothing.
5. Rejection sampling.
6. Maps.

1. Housing data.

Loading in the housing data. I removed spaces from the city names, like Beverly Hills, and I removed symbols like $ % , so it would be easier to load.

x = read.table("LAhousingpricesaug2013.txt",header=T)
y = as.numeric(as.vector(x[,3]))
x1 = as.numeric(as.vector(x[,4]))
x2 = as.numeric(as.vector(x[,7]))
x3 = as.numeric(as.vector(x[,9]))
z = !is.na(y+x1+x2+x3)
yy = y[z]
x11 = x1[z]
x22 = x2[z]
x33 = x3[z]

Alternatively,

x = scan("LAhousingpricesaug2013.txt",skip=1,what="char")
z = matrix(x,ncol=9,byrow=T)
y = as.numeric(z[,3])
x1 = as.numeric(z[,4])
x2 = as.numeric(z[,7])
x3 = as.numeric(z[,9])
z1 = (!is.na(y) & !is.na(x1) & !is.na(x2) & !is.na(x3))
myy = y[z1]
myx1 = x1[z1]
myx2 = x2[z1]
myx3 = x3[z1]

2. A brief glance at regression in R.

Adding a regression line.

x = runif(100)
y = -1.5*x + rnorm(100)
z = lm(y ~ x)
plot(x,y)
abline(z, lty=2, col="blue")
z$coef
lines(ksmooth(x,y,bandwidth=bw.nrd(x)),lty=1) ## kernel regression

lm() can also be used for multivariate regression, as in lm(y ~ x1 + x2 + x3).

x1 = runif(100)
x2 = rnorm(100)
x3 = rexp(100)
y = 17.0 + x1 + 1.5*x2 - 5.0*x3 + rnorm(100)
z = lm(y ~ x1 + x2 + x3)
z$coef
z$coef[1] ## the estimated intercept
z$coef[3] ## the estimated slope corresponding to x2
attributes(z)
summary(z)

There are functions in R to calculate the "influence" of each point in regression. Do not use these for hw2. You should write your own loop to do this.

3. Kernel density estimation.

x = rnorm(1000)
hist(x,nclass=40,prob=T,main="simulated N(0,1) random variables")
lines(density(x),lty=2,col="red")

density() is useful for kernel density estimation. Suppose your univariate data are x1, x2, ..., xn.
For any number x0, and any bandwidth h,

f^(x0) = 1/(hn) ∑ k((x0 - x_i)/h),

where the sum is from i = 1 to n, and where k() is some density, usually centered at 0 and usually with a peak at 0. By f^, I mean fhat, our estimate of the true density, f. Here f and k are normal. h is called the bandwidth. If h is small, then each point contributes a lot of density nearby, whereas if h is large, then each point gets smoothed out heavily. People have shown that the choice of kernel, k, matters far less than the choice of bandwidth.

See help(density). By default, kernel = "gaussian". By default, bw = "nrd0", which is Silverman's (1986) rule of thumb, 0.9 s n^(-.2). Here, s is an estimate of the sd, given by min{sample sd, IQR/1.34}. This is often too small, so Scott (1992) recommended 1.06 s n^(-.2). bw.nrd0() yields Silverman's rule, and bw.nrd() yields Scott's rule.

bw.nrd0(x)
0.9 * min(sd(x),IQR(x)/1.34) * 1000^(-.2)
bw.nrd(x)
1.06 * min(sd(x),IQR(x)/1.34) * 1000^(-.2)

##### PLOTTING A DENSITY ESTIMATE WITH A LEGEND

hist(x,nclass=40,prob=T,main="simulated N(0,1) random variables")
b2 = bw.nrd(x)
lines(density(x,bw=b2),col="red")
lines(density(x,bw=10*b2),col="blue")
lines(density(x,bw=.1*b2),col="brown")
lines(sort(x),dnorm(sort(x)),col="green")
legend("topright",c("Scott's bw", "huge bw", "tiny bw", "truth"),
  lty=1,col=c("red","blue","brown","green"),cex=0.8)

To compute kernel estimates on a grid, supply density() with n, from, and to. Or do it yourself. Let's take a really small example.

x = c(1,2.8,3)
z = density(x,bw=0.08,n=101,from=0,to=4)
## by default, n = 512 and it goes from min(x) - 3bw to max(x) + 3bw.
plot(c(0,4),c(0,2),type="n",xlab="x",ylab="density",yaxs="i")
lines(z,col="red")

Note that the curve looks smooth but it's really discrete.

z$x[1:10]
points(z$x, z$y)

Consider, for instance, the kernel density estimate f^(x0) at x0 = 1.20.

z$x[31] ## 1.20
z$y[31] ## f^(1.2) = 0.07465388.

But density() does some slightly weird things, with small datasets especially.
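As a check, you can compute f^(1.2) directly from the definition of the kernel estimate. This is a minimal sketch using the same 3-point example and bandwidth 0.08; the slight disagreement with density() comes from its grid approximation.

```r
## Direct kernel sum at x0 = 1.2, for x = c(1, 2.8, 3) with bw = 0.08.
x1 = c(1, 2.8, 3)
h = 0.08
x0 = 1.2
fhat = sum(dnorm((x0 - x1)/h)) / (length(x1)*h)
fhat ## about 0.0730, versus density()'s 0.07465388 at the same grid point
```

Only the point at 1 contributes noticeably here; the points at 2.8 and 3 are many bandwidths away from 1.2, so their kernel weights are essentially zero.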
In help(density): "The algorithm used in density.default disperses the mass of the empirical distribution function over a regular grid of at least 512 points and then uses the fast Fourier transform to convolve this approximation with a discretized version of the kernel and then uses linear approximation to evaluate the density at the specified points."

Alternatively, you can find the kernel density estimate yourself.

density2 = function(x1,xgrid,bw2){
  ## x1 = data.
  ## xgrid = vector of values where you'll compute the kernel estimate.
  ## bw2 = bandwidth.
  n = length(xgrid)
  y = rep(0, n)
  for(i in 1:n){
    y[i] = sum(dnorm(x1 - xgrid[i], sd=bw2)) / length(x1)
  }
  y
}

x = c(1,2.8,3)
g = seq(0,4,length=101)
y = density2(x,g,0.08)
y[31] ## 0.07303459
plot(g,y,type="l",xlab="x",ylab="density")

Look at the difference between density() and density2().

plot(g,z$y - y)

You can make the grid and bandwidth whatever you want.

x1 = c(1,2.8,3)
grid1 = seq(-2,6,length=101)
bw1 = bw.nrd(x1)
y = density2(x1,grid1,bw1)
plot(grid1,y,type="l",xlab="x",ylab="density")

4. 2-d kernel smoothing, using kernel2d() in splancs.

install.packages("splancs")
library(splancs)

## First, input 23 points using the mouse.
n = 23
plot(c(0,1),c(0,1),type="n",xlab="longitude",ylab="latitude",main="locations")
x1 = rep(0,n)
y1 = rep(0,n)
for(i in 1:n){
  z1 = locator(1)
  x1[i] = z1$x
  y1[i] = z1$y
  points(x1[i],y1[i])
}

##### PLOT THE POINTS WITH A 2D KERNEL SMOOTHING IN GREYSCALE PLUS A LEGEND

bdw = sqrt(bw.nrd0(x1)^2+bw.nrd0(y1)^2) ## a possible default bandwidth
b1 = as.points(x1,y1)
bdry = matrix(c(0,0,1,0,1,1,0,1,0,0),ncol=2,byrow=T)
z = kernel2d(b1,bdry,bdw)
attributes(z)
par(mfrow=c(1,2))
image(z,col=gray((64:20)/64),xlab="km E of origin",ylab="km N of origin")
points(b1)
x4 = seq(min(z$z),max(z$z),length=100)
plot(c(0,10),c(.8*min(x4),1.2*max(x4)),type="n",axes=F,xlab="",ylab="")
image(c(-1:1),x4,matrix(rep(x4,2),ncol=100,byrow=T),add=T,col=gray((64:20)/64))
text(2,min(x4),as.character(signif(min(x4),2)),cex=1)
text(2,(max(x4)+min(x4))/2,as.character(signif((max(x4)+min(x4))/2,2)),cex=1)
text(2,max(x4),as.character(signif(max(x4),2)),cex=1)
mtext(s=3,l=-3,at=1,"density (pts/km^2)")

Note that there is also ksmooth(), which does kernel regression; that is something a bit different, and we will get to it later. One important note on ksmooth(): its default bandwidth is 0.5, which is often a terrible choice. Use bw.nrd(x) instead.

5. Simulating from a density by rejection sampling.

To do simulation by rejection sampling, you need the density f(x) you're interested in to have the property f(x) <= b g(x), where b is some constant and g(x) is some density that's easy to sample from. For instance, if f(x) has some known maximum c and finite support of length d, then you can take g(x) to be the uniform density on that support and let b = cd, since then f(x) <= c = cd * (1/d) = b g(x).

The way it works: you simulate x0 from g(x), keep it with probability f(x0)/[b g(x0)], delete it otherwise, and repeat. This acceptance probability is always valid (i.e., at most 1) since f(x) <= b g(x) everywhere. The density of points generated and kept at x is thus g(x) f(x)/[b g(x)] = f(x)/b.
So we're keeping point x with prob. density f(x)/b. But since b is constant, if you consider the density of the KEPT points and normalize, it's simply f(x).

First, for simplicity, suppose the density has a nice closed form, like the triangular density on (0,2), f(x) = 1 - |x-1|. Here the maximum is c = 1, and the support is (0,2), so d = 2 and b = cd = 2.

## Plot the density.
x = seq(-1,3,length=101)
y = rep(0,101)
y[(x>0) & (x<2)] = 1 - abs(x[(x>0) & (x<2)] - 1)
plot(x,y,type="l",ylab="f(x)",lty=2)

To simulate from this density using rejection sampling:
a) Simulate a draw x0 from g(x) = uniform on (0,2), i.e. g(x) = 1/2 on (0,2).
b) Keep x0 with probability f(x0) / [bg(x0)].
c) Repeat.

## 10000 draws from the triangular f(x).
b = 2; g = 0.5
n = 10000
x = c() ## this will contain the results.
i = 0
while(i < n){
  x0 = runif(1)*2
  fx0 = 1 - abs(x0 - 1)
  if(runif(1) < fx0 / (b*g)){ ## keep x0
    i = i+1
    x[i] = x0
    if(i/100 == floor(i/100)) cat(i, " ")
  }
}
hist(x,nclass=40,prob=T,add=T,col="green")

What if you want to sample from a more complicated density f(x), such as a kernel smoothing? Again, let c = the maximum of f(x), which you can find by looking at a plot of f(x). The support might not be finite, technically, but you can often find a finite region that effectively contains almost all the support of f(x). Let's use the simple example from before, where we smoothed the 3 points 1, 2.8, and 3.

x1 = c(1,2.8,3)
grid1 = seq(-2,6,length=101)
bw1 = bw.nrd(x1)
y = density2(x1,grid1,bw1)
plot(grid1,y,type="l",xlab="x",ylab="density")

Suppose we want to simulate 30,000 draws from this density f(x). The density clearly does not go higher than 0.5 anywhere, and (-2,6) is plenty wide to contain almost all the support of f(x). So c = 0.5, d = 8, and b = cd = 4. The only hard part is that f(x0) now needs to be computed for each x0.

b = 4; g = 1/8
n = 30000
x = c() ## this will contain the results.
i = 0 while(i < n){ x0 = runif(1)*8 - 2 ## this simulates the uniform g(x) on (-2,6) fx0 = density2(x1,x0,bw1) ## this computes f(x0) if(runif(1) < fx0 / (b*g)){ ## keep x0 i = i+1 x[i] = x0 if(i/100 == floor(i/100)) cat(i, " ") } } hist(x,nclass=40,prob=T,add=T,col="green") 6. Maps. Map of California counties. install.packages("maps") library(maps) map('county', 'California') plot(seq(-118,-116,length=10),seq(33,35,length=10)) map('county', 'California', add=T) ## It's not easy to make it look good. See also RgoogleMaps. install.packages("RgoogleMaps") library(RgoogleMaps) help(bubbleMap)
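A minimal sketch of overlaying point data on the county map, in the spirit of plotting the housing locations. The lon1 and lat1 vectors here are made up for illustration; substitute real longitude/latitude coordinates.

```r
library(maps)

## Hypothetical coordinates near Los Angeles; replace with real data.
lon1 = runif(10, -119, -117)
lat1 = runif(10, 33, 35)

## Zoom the county map to the LA area, then add the points.
map('county', 'California', xlim=c(-120,-116), ylim=c(32,36))
points(lon1, lat1, pch=19, col="red")
```

Restricting xlim and ylim on the map() call, rather than calling plot() first and then map(..., add=T), keeps the map's own aspect ratio, which tends to look better.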