1. Administrative stuff.
2. Kernel regression.
3. Newton-Raphson optimization.

1. Admin.

-- No class or office hours Thur Nov 3, due to the mandatory Statistics faculty retreat.

-- Use the computer labs in BH 9413 if you can't get C code to compile and run from within R on your own computer.

-- Hw4 is up on the course website and is due Tues, Nov 15, 10:30am. Late homeworks will not be accepted. After today, you will have seen everything you need for all 4 homeworks, in my opinion. I will explain things like C++ and passing a function to a C function, etc., but you won't need those for the homework problems.

2. Kernel regression.

Given vectors X and Y of length n, suppose you want to look at the relationship between X and Y without assuming a linear model or any other particular model. You could just plot(x,y), but perhaps you want a smooth summary of the relationship between X and Y.

Suppose Y_i = f(X_i) + eps_i, for i = 1 to n, where f is some smooth function and the eps_i are iid with mean 0. The Nadaraya-Watson or kernel regression estimate of f is

fhat(x) = Sum [Y_i k(x - X_i; b)] / Sum [k(x - X_i; b)],

where k is some kernel function, b is some bandwidth, and the sums run from i = 1 to n. Note that k(x - X_i; b) = k{(x - X_i)/b}/b. [For instance, compare dnorm(1.2/4.1)/4.1 with dnorm(1.2, sd=4.1).]

A commonly used kernel is the Epanechnikov kernel, k(u) = 3/4 (1 - u^2) if |u| <= 1, and 0 otherwise. This was proposed in Epanechnikov, V.A. (1969), "Non-parametric estimation of a multivariate probability density", Theory of Probability and its Applications 14: 153-158, and shown to have minimal variance under certain conditions.

In epanech.c,

#include <R.h>
#include <math.h>

/* x and y will be the data vectors, each of length n, b is the bandwidth,
   g will be the vector of m gridpoints where we want to calculate the
   kernel regression estimates, est, which is thus also of length m. */

void kernreg2 (double *x, double *y, int *n, double *b, double *g, int *m, double *est)
{
    int i, j;
    double a1, a2, c;
    for(i = 0; i < *m; i++){
        a1 = 0.0;
        a2 = 0.0;
        for(j = 0; j < *n; j++){
            if(fabs((x[j] - g[i]) / *b) <= 1.0){
                c = 0.75 * (1.0 - pow((x[j] - g[i]) / *b, 2.0)) / *b;
                a1 += y[j] * c;
                a2 += c;
            }
        }
        if(a2 > 0.0) est[i] = a1/a2;
        else est[i] = 0.0;
        Rprintf("I'm on gridpoint %f and index %d right now.\n", g[i], i);
    }
}

// Watch out for /* in the code: "/*b" is treated as the start of a comment.
// Use "/ *b" (with a space) instead of "/*b".
// These double slashes can be used in C to comment out an individual line, like # in R.

In R,

system("R CMD SHLIB epanech.c")
dyn.load("epanech.so")
n = 1000
x = runif(n)*4
y = x^2 + rnorm(n)
bw = bw.nrd(x)
m = 100
g2 = seq(0,4,length=m)

## This time, I'm going to call the C function directly, without making a
## separate R function to call it.

a = .C("kernreg2", as.double(x), as.double(y), as.integer(n), as.double(bw),
       as.double(g2), as.integer(m), est=double(m))

plot(c(0,4), c(-1,17), type="n", xlab="x", ylab="y")
points(x, y, pch=".", col="red")
lines(g2, a$est, col="black")
lines(g2, g2^2, col="blue")

Try it again with bw = .01 or bw = 2. The Nadaraya-Watson estimator tends to be biased at the edges, especially when the bandwidth is large.

Note that, when calling your C function, the order of the arguments matters, and their types matter, but their names don't matter. For instance, here in R I called the bandwidth bw, but in C I called it b. Same with g2 and g.
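By the way, if you want to convince yourself that the C function is doing what you think, you can compute the same estimate directly in R and compare it to a$est. This is just a sanity check, not part of any homework; it is much slower than the C version, the names epa and fhat below are only for this check, and it assumes the x, y, bw, g2, and a from the code above.

## Plain-R version of the same Nadaraya-Watson estimate with the
## Epanechnikov kernel, for checking the C output.
epa = function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
fhat = sapply(g2, function(gpt){
    w = epa((x - gpt)/bw) / bw                 ## weights k{(x - X_i)/b}/b
    if(sum(w) > 0) sum(w * y)/sum(w) else 0    ## same convention as the C code
})
max(abs(fhat - a$est))                         ## should be essentially 0

If that maximum difference is not essentially 0, something is off either in the C code or in how it was called.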
3. Newton-Raphson optimization.

With Newton-Raphson methods, or quasi-Newton methods, you try to find the minimum of a smooth function f(x) by starting with an initial guess x_0 and then moving to x_1, then x_2, and so on. If f is smooth, then at the minimum x, f'(x) = 0. The Taylor expansion of f around x is

f(x + delta) = f(x) + f'(x) delta + f''(x) delta^2 / 2! + ....

It is common to approximate f using just these first terms, through the quadratic. Taking the derivative of this approximation with respect to delta and setting it to 0 at the minimum gives f'(x) + f''(x) delta ~ 0. Therefore, delta ~ -f'(x) / f''(x). So, what quasi-Newton methods do is start with some guess x_0, and then for i = 0, 1, 2, ..., let x_{i+1} = x_i + delta = x_i - f'(x_i) / f''(x_i). This extends readily to higher-dimensional functions.

The function optim() in R already does fast quasi-Newton optimization, because it calls compiled Fortran code. Also, there are numerous websites that have quasi-Newton C programs, so I do not want to make this part of Project B. Instead, what I'd like you to do is this.

a) First, write a C function called neglog, to compute the negative loglikelihood. This function should take, as inputs, x, y, t, mag, n, xmin, xmax, ymin, ymax, tmin, tmax, mag0, theta, and res, where the first 4 are vectors of length n, indicating the longitudes, latitudes, times, and magnitudes of the data, n is an integer indicating the number of observations, the next 6 entries are double-precision numbers indicating the range of the observed space, mag0 is a double-precision number which is the cutoff (minimum) magnitude of the data, theta = {mu, K, c, p, a, d, q} is a vector of 7 parameters, and res is the resulting negative loglikelihood.

Have your function calculate the negative loglikelihood; it's probably easiest to approximate the integral term numerically, using some large number of space-time locations around each point. Note that the integral term,

∫ lambda(t,x,y) dt dx dy = mu [tmax-tmin] [xmax-xmin] [ymax-ymin] + ∫ (Sum g) dt dx dy,

and ∫ (Sum g) dt dx dy = Sum ∫ g dt dx dy, so you can just sum, over all n points, ∫ g dt dx dy. The integral ∫ lambda(t,x,y) dt dx dy should be approximately n when theta gets close to the true parameter vector, because E[∫ lambda(t,x,y) dt dx dy] = the expected number of points in the observed space. So, it's a good idea to include in the C function a line like

Rprintf("The integral is %f.\n", int2);

or replace int2 with whatever you are calling your approximation of the integral term.

Also, note that the parameters in theta must be non-negative, and negative values might do weird things to your negative loglikelihood function. So, your code should start with a bunch of if() statements, so that if any of the parameters is negative you simply return some large positive number, like 99999, for the negative loglikelihood.

Create a function in R that simply calls neglog in C. Call your R function neglog2. Ideally, neglog2 should not take very long to run.

b) To run optim() in R to minimize neglog2, you need an initial guess at theta. What I'd like you to do is simply place a large number (you can decide the number -- maybe 100,000 if that doesn't take too long, but it's ok if you want to do fewer) of points theta = {mu, K, c, p, a, d, q} at random, uniformly, in [0,100] x [0,100] x [0,10] x [0,10] x [0,100] x [0,100] x [0,10]. Evaluate neglog2 at each of these thetas, and store the value of theta that minimizes neglog2. Call this value pstart.

c) Take the same 374 points you downloaded for problem 2 of hw3, and using your initial guess pstart from b) and your function neglog2 from a), run optim() and record the final parameter estimates.
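For part b), the random search can be organized roughly as follows. This is only a sketch: it assumes neglog2 is the R wrapper you write in part a), taking the 7-vector theta as its argument just as in the optim() call below, and the number of random points is up to you.

## Sketch of part b): random search for a starting value pstart.
nrand = 100000                                ## or fewer, if this is too slow
upper = c(100, 100, 10, 10, 100, 100, 10)     ## upper bounds for mu, K, c, p, a, d, q
best = Inf
for(i in 1:nrand){
    theta1 = runif(7) * upper                 ## uniform draw over the box above
    v = neglog2(theta1)
    if(v < best){
        best = v
        pstart = theta1                       ## keep the best theta seen so far
    }
}
pstart                                        ## starting value for optim() in part c)

Almost all of the time here is spent in the calls to neglog2, which is why it matters in part a) that neglog2 runs quickly.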
You can also get approximate standard errors from optim. For instance, if you do

fit2 = optim(pstart, neglog2, control=list(maxit=200), hessian=T)
pend = fit2$par                     ## these are the parameter estimates.
b3 = sqrt(diag(solve(fit2$hess)))   ## these are the standard errors.

optim() can be quite slow when calculating f is slow. Let's try out optim() a bit more. We saw optim in lec9.

f = function(theta){
    (theta[1]-2.1)^2 + (2.0+(theta[2]-2.2)^2)*(3.0+(theta[3]-2.3)^2) +
    (theta[4]-2.4)^2 + (theta[5]-2.5)^2 +
    (3.0+(theta[6]-2.6)^4)*(4.0+(theta[7]-2.7)^4) + (theta[8]-2.8)^2
}
guess1 = rep(10,8)
timea = Sys.time()
g = optim(guess1, f, method="BFGS", control=list(maxit=200), hessian=T)
Sys.time() - timea
g$par
sqrt(diag(solve(g$hess)))

So it's fast, but if f is slow, then it slows down a lot.

f = function(theta){
    a = runif(100000)
    b = sort(a)
    cat("a.")
    (theta[1]-2.1)^2 + (2.0+(theta[2]-2.2)^2)*(3.0+(theta[3]-2.3)^2) +
    (theta[4]-2.4)^2 + (theta[5]-2.5)^2 +
    (3.0+(theta[6]-2.6)^4)*(4.0+(theta[7]-2.7)^4) + (theta[8]-2.8)^2
}
timea = Sys.time()
f(guess1)
Sys.time() - timea

timea = Sys.time()
g = optim(guess1, f, method="BFGS", control=list(maxit=200), hessian=T)
Sys.time() - timea
g$par
g$counts

Notice the large number of times f is called. It calls f not only every iteration, but also to approximate the derivatives and second derivatives. It's calling f about 10 times per iteration.

Coming up.
## examples of defining a vector or matrix within C.
## examples of 2 C functions communicating.
## examples of C++.
## more information about project A.