0. Misc.
1. Sum of squared differences from observations for each gridpoint.
2. Kernel regression.
3. Teetor ch. 11.
4. Projects.

0. Misc.

Reminder, no class or OH Tue Nov 5.

After today, you will have seen everything you need for all 4 homeworks, in my opinion. I may explain things like passing a function to a C function, building R packages, etc., but you won't need those for the homework problems.

When I say in 2d "sample 100 pairs of observations (Xi, Yi) with replacement," I mean: if your dataset has length n, then let b = sample(1:n, 100, rep=T), and for each element i in b, take (Xi, Yi).

1. Sum of squared differences from observations for each gridpoint.

In sumsgrid.c,

#include <R.h>
#include <math.h>

void ssg2 (double *x, int *n, double *g, int *m, double *y)
/* x will be the vector of data of length n,
   g will be the vector of m gridpoints, and
   y will be a vector of length m, whose ith entry is the sum of
   squared differences from gridpoint i to the n observations. */
{
    int i;
    int j;
    double a;
    for(i = 0; i < *m; i++){
        a = 0.0;
        for(j = 0; j < *n; j++){
            a += pow(g[i] - x[j], 2);
        }
        y[i] = a;
    }
}

In R,

system("R CMD SHLIB sumsgrid.c")
dyn.load("sumsgrid.so")
sum4 = function(data2, grid2){
    n = length(data2)
    m = length(grid2)
    a = .C("ssg2", as.double(data2), as.integer(n), as.double(grid2),
        as.integer(m), y=double(m))
    a$y ## or equivalently a[[5]]
}
b = c(1,3,4)
gr = c(0,1,2,3,4,5)
sum4(b,gr)

This function ssg2 will be similar to your kernel density code, except that you will also have a bandwidth as an additional input, and you will replace the line
a += pow(g[i] - x[j], 2);
with the kernel density estimate.
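As a quick check, not in the original notes, you can reproduce sum4's output in vectorized R using outer(), assuming b and gr as defined above:

## A minimal sketch: rows of the outer() matrix correspond to the
## observations in b, columns to the gridpoints in gr, so colSums()
## gives the same vector that sum4(b, gr) returns.
colSums(outer(b, gr, function(x, g) (g - x)^2))
## 26 13 6 5 10 21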
2. Kernel regression.

Given vectors X and Y of length n, suppose you want to look at the relationship between X and Y, without assuming a linear model or other model. You could just plot(x,y), but perhaps you want to look at a smooth summary of the relationship between X and Y.

Suppose Y_i = f(X_i) + eps_i, for i = 1 to n, where f is some smooth function, and where eps_i are iid with mean 0. The Nadaraya-Watson or kernel regression estimate of f is
fhat(x) = Sum [Y_i k(x-X_i; b)] / Sum [k(x-X_i; b)],
where k is some kernel function, b is some bandwidth, and the sums are from i = 1 to n. Note that k(x-X_i; b) = k{(x-X_i)/b}/b. [For instance, compare dnorm(7.0/33.3)/33.3 with dnorm(7.0, sd=33.3).]

A commonly used kernel is the Epanechnikov kernel, k(u) = 3/4 (1-u^2) if |u| <= 1, and 0 otherwise. This was proposed in Epanechnikov, V.A. (1969). "Non-parametric estimation of a multivariate probability density." Theory of Probability and its Applications 14: 153-158, and shown to have minimal variance under certain conditions.

In epanech.c,

#include <R.h>
#include <math.h>

/* x and y will be the data vectors, each of length n, b is the bandwidth,
   g will be the vector of m gridpoints where we want to calculate the
   kernel regression estimates, est, which is thus also of length m. */

void kernreg2 (double *x, double *y, int *n, double *b, double *g, int *m, double *est)
{
    int i, j;
    double a1, a2, c;
    for(i = 0; i < *m; i++){
        a1 = 0.0;
        a2 = 0.0;
        for(j = 0; j < *n; j++){
            if(fabs((x[j] - g[i])/ *b) <= 1.0){
                c = 0.75 * (1.0 - pow((x[j] - g[i])/ *b, 2.0))/ *b;
                a1 += y[j] * c;
                a2 += c;
            }
        }
        if(a2 > 0.0) est[i] = a1/a2;
        else est[i] = 0.0;
        Rprintf("I'm on gridpoint %f and index %d right now.\n", g[i], i);
    }
}

// Watch out for /* in your code. This is treated as the start of a comment.
// Use / *b instead of /*b.
// These double slashes can be used in C to comment out an individual line,
// like # in R.

In R,

system("R CMD SHLIB epanech.c")
dyn.load("epanech.so")
n = 1000
x = runif(n)*4
y = 0.1*x^5 - 2*x^3 + 10*x + rnorm(n)
bw = bw.nrd(x)
m = 100
g2 = seq(0,4,length=m)
## This time, I'm going to call the C function directly,
## without making a separate R function to call it.
a = .C("kernreg2", as.double(x), as.double(y), as.integer(n), as.double(bw),
    as.double(g2), as.integer(m), est=double(m))
plot(c(0,4),c(-3,17),type="n",xlab="x",ylab="y")
points(x,y,pch=".",col="red")
lines(g2,a$est,col="black")
lines(g2,.1*g2^5 - 2*g2^3 + 10*g2,col="blue")

Here, bw ~ 0.3. Try it again with bw = .01 or bw = 2. The Nadaraya-Watson estimator tends to be biased at the edges, especially when the bandwidth is large.

Play around with a[[1]], a[[2]], etc. C functions called by R output back to R a list containing all the arguments passed to them, though here the last argument, a$est, is of course changed after running the function kernreg2. Note that, when calling your C function, the order of the arguments matters, and their types matter, but their names don't matter. For instance, here in R I called the bandwidth bw, but in C I called it b. Same with g2 and g.
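As a sanity check on kernreg2 (this sketch is not in the original notes, and kernregR is a hypothetical name), the same estimate can be computed in pure R, and should match a$est up to floating-point error, assuming x, y, g2, bw, and a as above:

## Hypothetical pure-R version of kernreg2.
kernregR = function(x, y, g, b){
    sapply(g, function(gi){
        u = (x - gi)/b
        w = ifelse(abs(u) <= 1, 0.75*(1 - u^2)/b, 0)  ## Epanechnikov weights
        if(sum(w) > 0) sum(w*y)/sum(w) else 0          ## same convention as kernreg2
    })
}
est2 = kernregR(x, y, g2, bw)
max(abs(est2 - a$est))  ## should be near 0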
3. A couple more things from R Cookbook ch. 11.

p297: im = influence.measures(k2) outputs all 7 influence measures listed on p298 for each of the 10,000 data points. To get just the Cook's distances, use im2 = im$infmat[,6].

Cook's distance for observation i = SUM from j = 1 to 10000 of [^Y_j - ^Y_j(-i)]^2 / (p MSE), where
^Y_j = prediction of observation j based on the model,
^Y_j(-i) = prediction of observation j using the model with observation i omitted,
p = number of parameters estimated, and
MSE = mean squared error.

p299: dwtest() in the lmtest package, and acf(), to test the autocorrelation. The Durbin-Watson test has been criticized, e.g. by Chatfield (1986), on the grounds that it basically just tests the first autocorrelation, but users often think it looks at more.

p300-1: prediction intervals and predictions using predict(). A short sketch of predict() appears at the end of these notes.

4. Final project and oral report tips.

For the oral projects.

Rule 1: Do not look at me when you are talking!

Rule 2: 8-10 minutes per oral report, and everybody in the group must speak for roughly equal amounts of time. I will cut you off if your group goes over. You can have someone in the audience or on your team helping you with the time.

Rule 3: Everyone must be respectful and quiet during other people's talks. No questions, because we won't have time.

Rule 4: Send me a pdf or powerpoint version of your group's slides by 6pm the night before your talk, to frederic@stat.ucla.edu. That way, I can set up the talks in order ahead of time, and we won't have to waste time in class waiting for each person to connect their laptop to the projector. About 7 or 8 slides seems right, though it's fine with me if you want fewer or more.

Rule 5: Give us a sense of your data. Assume that the listener knows what the statistical methods you are using are. Tell us what they say about your data. Emphasize the results more than the methods. Go slowly in the beginning so that the listener really understands what your data are.

Rule 6: Speculate and generalize, but use careful language. Say "it seems" or "it appears" rather than "it is" when it comes to speculative statements or models. For example, you might say "The residuals appear approximately normal" or "A linear model seems to fit well," but not "The residuals are normal" or "The data come from a linear model."

Rule 7: Start with an introduction explaining what your data are, how you got them, and why they are interesting (roughly 2 minutes), then show your results as clearly as possible, with figures preferred (roughly 3 minutes), and then conclude (roughly 2 minutes). In your conclusion, mention the limitations of your analysis and speculate about what might make a future analysis better, if you had infinite time. This might include collecting more data, or getting data on more variables, as well as using more sophisticated statistical methods.

For your written reports, rules 5-7 apply again. Replace "minutes" with "pages". Have just the text in the beginning, and then the figures at the end. Email your pdf document to me, at frederic@stat.ucla.edu, by Sun, Dec 8, 11:59pm.
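Returning to the predict() item in section 3, here is a minimal sketch, with hypothetical simulated data (x1, y1, fit, and newdat are illustrative names, not from the Cookbook):

## Hypothetical example of predict() with confidence and prediction intervals.
x1 = runif(50)
y1 = 2 + 3*x1 + rnorm(50, sd=0.5)
fit = lm(y1 ~ x1)
newdat = data.frame(x1 = c(0.25, 0.75))
predict(fit, newdata=newdat)                         ## point predictions
predict(fit, newdata=newdat, interval="confidence")  ## CI for the mean response
predict(fit, newdata=newdat, interval="prediction")  ## wider interval for a new observation
cooks.distance(fit)  ## another way to get the Cook's distances for a fitted lm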