0. Final projects and Rcpp.
1. MLE in general.
2. MLE for Hawkes point processes.
3. Newton-Raphson optimization for the MLE using optim().
4. Building R packages.

Final projects. There have been a couple of changes.

Thu, Dec 3
1. FISCHER, ERIC MERCADO.
2. SOUDA, NAVIN VARADARAJ.
3. BURTON, HENRY.
4. HWANGBO, NATHAN MIN.
5. HUANG, STELLA HONGYING.
6. SHINKRE, TANVI RAHUL.
7. ZHAO, YIJIA.
8. O'NEILL, ELIZABETH.
9. DONG, CHRIS YUANCHAO.
10. HOFFMANN, NATHAN ISAAC.
11. FAN, HAIBO.
12. WANG, KAIXIN.

Tue, Dec 8
13. CHU, HANQING
14. LEE, CHRISTY
15. AGRAWAL, SURABHI
16. PARIDAR, MAHSA
17. CHEN, ALEX
18. LI, BILL
19. YAN, GUANAO
20. JACOBSON, THOMAS ABRAM
21. ZHAI, XUFAN
22. MUELLER, SCOTT ALLEN
23. WONG, EMILY FRANCES
24. VINAS, LUCIANO

Thu, Dec 10
25. KIM, DOEUN
26. ZHANG, XINYUAN
27. ZHANG, ZHE
28. RESCH, JOSEPH
29. XU, CHAO
30. ZHOU, CARTLAND
31. NGUEN, CHUNG KYONG
32. TREJO, ALFREDO
33. LIU, VINCENT BOJIE
34. VARGAS, SANTIAGO

Rcpp.

For Object Oriented Programming examples see https://www3.ntu.edu.sg/home/ehchua/programming/cpp/cp3_OOP.html

library(Rcpp)
cppFunction('int add2(int n) {
  return n*2;
}')
add2(17)
## but try it with return n*2.5. The result gets truncated to an integer!

1. MLE.

Instead of minimizing the sum of squared residuals as in regression, one can estimate the parameters of a model by finding the values of the parameters that maximize the likelihood of the data. This is in some sense the reverse of what you want: given the data, you want the values of the parameters that are most likely. From a Bayesian point of view, the MLE is equivalent to putting a uniform (or uninformative) prior on your parameters and then taking the peak (or mode) of the posterior distribution.

For example, suppose you are interested in p = the probability that it rains on a given day in a particular place. You randomly select 1 day and find that it rained on that day. The MLE of p is 1. Indeed, the likelihood of your data is p, so the likelihood is maximized at p = 1. If you started with a uniform prior for p, then the posterior density would be triangular (proportional to p) and also maximized at 1. So the MLE roughly agrees with Bayesian analysis, though most Bayesians would advocate taking the mean of the posterior rather than its peak.

Sir Ronald Fisher wrote several important papers between 1912 and 1922 advocating the use of the MLE. He showed that, from a frequentist perspective, the MLE has nice asymptotic properties like consistency, asymptotic normality, asymptotic unbiasedness, and asymptotic efficiency, under quite general conditions. Several people have written about special cases, though, where the MLE is a lousy estimate. See e.g. the excellent review by Le Cam, L.M. (1990). Maximum likelihood: an introduction. International Statistical Review 58, 153--171.
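To make the rain example above concrete, here is a small, purely illustrative sketch in R. The counts x and n below are hypothetical, not from any real data: with x = 1 rainy day observed out of n = 1 sampled days, the binomial likelihood p^x (1-p)^(n-x) is just p, and it is maximized at the boundary p = 1.

x = 1
n = 1
lik = function(p) p^x * (1-p)^(n-x)   ## binomial likelihood of p
pgrid = seq(0, 1, length=100)
plot(pgrid, lik(pgrid), type="l", xlab="p", ylab="likelihood")
optimize(lik, interval=c(0,1), maximum=TRUE)  ## numerical maximum, essentially at p = 1

With, say, x = 7 out of n = 10, the same code would show the maximum at p = 0.7, matching the usual binomial MLE x/n.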
2. Hawkes point processes and MLE.

A point process is a collection of points in some space. Often the space is the real line and the points represent times; an example might be the origin times of major California earthquakes. Alternatively, you might have a spatial-temporal point process, where each point is the spatial-temporal location of the occurrence of some event, like a hurricane in the U.S.

A point process is SIMPLE if, with probability one, all its points are distinct, i.e. no 2 points overlap exactly. Simple point processes are uniquely characterized by their CONDITIONAL RATES.

For a space-time point process, the conditional rate lambda(t,x,y) is the limiting expected rate at which points are accumulating near (t,x,y), given everything that has occurred before time t:

lambda(t,x,y) = the limit, as delta -> 0 and |B_(x,y)| -> 0, of E[ number of points in [t, t+delta] x B_(x,y) | H_t ] / [ delta |B_(x,y)| ],

where B_(x,y) is a ball around (x,y), |B_(x,y)| is its area, and H_t is the entire history of what points have occurred before or at time t.

Point processes may be
-- clustered: points cause future points nearby to become more likely;
-- inhibitive: points cause future points to become less likely;
-- Poisson: the rate of future points is independent of what points have occurred previously.

An important model for clustered point processes is the Hawkes model, where, in the purely temporal case,

lambda(t) = mu + sum_{i: t_i < t} g(t - t_i).

Here the t_i are the times at which the points occurred, and g(u) is called the triggering density. Typically g(u) is large when u = 0, and g(u) decreases gradually to zero. An example is g(u) = exp(-beta u), for beta > 0.

An important example of a Hawkes process is the ETAS, or Epidemic-Type Aftershock Sequence, model of Ogata (1988), where g(u) = K/(u+c)^p, and mu, K, c, and p are nonnegative parameters to be estimated. A space-time version of ETAS was proposed by Ogata (1998); see equation (2.3) or (2.4) of ogata98.pdf, where

lambda(t,x,y) = mu + sum_{i: t_i < t} g(t - t_i, x - x_i, y - y_i)

and

(2.3) g(u,v,w) = K/(u+c)^p * exp(-a (M_i - M_0)) * {v^2 + w^2 + d}^{-q}, or
(2.4) g(u,v,w) = K/(u+c)^p * {(v^2 + w^2) exp(-a (M_i - M_0)) + d}^{-q},

where the parameter vector Theta = {mu, K, c, p, a, d, q}, (t_i, x_i, y_i, M_i) are the time, x-coordinate, y-coordinate, and magnitude of earthquake i, and M_0 is the minimum cutoff magnitude for the earthquake catalog being used. M_0 is typically known by the user, not estimated from the data. Note that (2.4) for some reason isn't labelled in Ogata (1998), on page 384. Also, Ogata writes g(t,x,y), which I find a little confusing, because g is really a function of the interevent time and distance from a previous earthquake, not of ordinary time and location.

Point process parameters may be estimated by maximum likelihood. That is, we find the parameter vector Theta that maximizes the likelihood of the data. In practice it's typically a bit easier computationally to minimize -log(likelihood). The loglikelihood is

sum_i log(lambda(t_i, x_i, y_i)) - integral of lambda(t,x,y) dt dx dy,

where the sum is over the observed points i, and the integral is over the whole observed space. Given values of the parameters, it is often relatively easy to calculate the sum, but calculating or approximating the integral can be tricky.

On the bottom of page 385 and top of page 386 of Ogata (1998), ogata98.pdf, which is on the course website, he does some calculus on the integral term, changing it to polar coordinates, for example. He finds that, if the study region is [0,T] x A, and the points are (t_i, x_i, y_i, M_i), for i = 1 to n, then

integral of lambda(t,x,y) dt dx dy is approximately mu T |A| + sum_i [ integral from 0 to (T - t_i) of K/(u+c)^p du ] [ sum_k S_k(x_i, y_i, M_i) D_k / (2 pi) ],

where S_k and D_k have somewhat complicated explanations given on pages 385-6 of Ogata (1998).
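To see the loglikelihood formula above written out in code, here is a minimal sketch in R under simplifying assumptions that are not part of the notes above: a purely temporal Hawkes process on [0, bigT] with triggering function g(u) = K exp(-beta u), so that the integral term is available in closed form as mu*bigT + sum_i (K/beta)(1 - exp(-beta (bigT - t_i))). The function name hawkesloglik and the argument names are just placeholders.

hawkesloglik = function(theta, t, bigT){
  ## theta = c(mu, K, beta); t = observed event times in [0, bigT]
  mu = theta[1]; K = theta[2]; beta = theta[3]
  if(any(theta <= 0)) return(-Inf)   ## parameters must be positive
  ## lambda(t_i) = mu + sum over earlier points j of K exp(-beta (t_i - t_j))
  lam = sapply(t, function(ti) mu + sum(K * exp(-beta * (ti - t[t < ti]))))
  ## integral of lambda over [0, bigT], computed analytically for this g
  intterm = mu * bigT + sum(K/beta * (1 - exp(-beta * (bigT - t))))
  sum(log(lam)) - intterm
}
## e.g. hawkesloglik(c(0.5, 0.8, 2.0), t = sort(runif(50, 0, 10)), bigT = 10)

For the space-time ETAS models above, the sum term works the same way, but the integral term generally has to be handled with calculus as in Ogata (1998), or else approximated numerically, which is the next topic.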
If you can't get the integral analytically using calculus, then another option is to approximate it by sampling thousands of space-time locations around each point (t_i, x_i, y_i), evaluating the g terms at these sampled locations, and using the approximation

integral of lambda(t,x,y) dt dx dy is approximately mu T |A| + sum(sampled values of g) * (the space-time volume associated with each sampled location).

For instance, suppose we wanted to approximate the integral from 0 to 3 of f(t) dt, where f(t) = 7 + exp(-2t) + 20exp(-4t). The integral is

[7t - exp(-2t)/2 - 20exp(-4t)/4] from 0 to 3
= 7*3 - exp(-6)/2 + exp(0)/2 - 5exp(-12) + 5exp(0)
= 7*3 - exp(-6)/2 + 1/2 - 5*exp(-12) + 5.

x = seq(0,5,length=100)
f = 7 + exp(-2*x) + 20*exp(-4*x)
plot(x,f,type="l",xlab="t",ylab="7 + exp(-2t) + 20exp(-4t)")
trueint = 7*3 - exp(-6)/2 + 1/2 - 5*exp(-12) + 5
m = 1000
sampgrid = seq(0,3,length=m)
sampf = exp(-2*sampgrid) + 20*exp(-4*sampgrid)
7*3 + sum(sampf) * 3/m     ## compare with trueint

Now suppose we wanted to approximate the integral from 0 to 3 of f(t) dt, where f(t) = 7 + exp(-2t) + dnorm(t-2). The true integral is 7*3 - exp(-6)/2 + 1/2 + pnorm(1) - pnorm(-2).

x = seq(0,5,length=100)
f = 7 + exp(-2*x) + dnorm(x-2)
plot(x,f,type="l",xlab="t",ylab="7 + exp(-2t) + dnorm(t-2)")
trueint = 7*3 - exp(-6)/2 + 1/2 + pnorm(1) - pnorm(-2)
m = 100000
sampgrid = seq(0,3,length=m)
sampf = exp(-2*sampgrid) + dnorm(sampgrid-2)
7*3 + sum(sampf) * 3/m     ## compare with trueint

Since C is so fast, it might be easier to just use the approximation method.

3. Newton-Raphson optimization.

With Newton-Raphson methods, or quasi-Newton methods, you start with an initial guess x_0 and try to find the minimum of a smooth function f(x). You start with x_0, then move toward x_1, then x_2, .... If f is smooth, then at the minimum, f' = 0. The Taylor expansion of f around the current guess x is

f(x + delta) = f(x) + f'(x) delta + f''(x) delta^2 / 2! + ....

Approximating f by just these first few terms gives a quadratic in delta, and setting its derivative with respect to delta equal to 0, to find the step delta that takes us to the approximate minimum, gives

f'(x) + f''(x) delta ~ 0, and therefore delta ~ -f'(x) / f''(x).

So, what quasi-Newton methods do is start with some guess x_0, and then, for i = 0, 1, 2, ..., let x_{i+1} = x_i + delta = x_i - f'(x_i) / f''(x_i). This extends readily to higher-dimensional functions.

The function optim() in R already does fast quasi-Newton optimization, because it calls compiled C and Fortran code, so instead of writing your own optimizer, what one could do is this.

a) First, write a C function called neglog to compute the negative loglikelihood. This function could take, as inputs, x, y, t, mag, n, xmin, xmax, ymin, ymax, tmin, tmax, mag0, theta, and res, where the first 4 are vectors of length n indicating the longitudes, latitudes, times, and magnitudes of the data, n is an integer indicating the number of observations, the next 6 entries are double-precision numbers indicating the range of the observed space, mag0 is a double-precision number giving the cutoff minimum magnitude of the data, theta = {mu, K, c, p, a, d, q} is the vector of 7 parameters, and res is the resulting negative loglikelihood. Have your function calculate the negative loglikelihood, and it's probably easiest to approximate the integral term numerically, using some large number of space-time locations around each point. Note that the integral term is

integral of lambda(t,x,y) dt dx dy = mu [tmax - tmin] [xmax - xmin] [ymax - ymin] + integral of (sum of g) dt dx dy,

and the integral of the sum of g equals the sum of the integrals of g, so you can just sum, over all n points i, the integral of g(t - t_i, x - x_i, y - y_i) dt dx dy.
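Before writing the C version, it may help to see the same computation sketched in R for a simplified, purely temporal model with g(u) = K/(u+c)^p and no magnitudes or spatial coordinates. Everything here (the name negloglR, the grid size m, the reduced parameter vector) is an illustrative assumption, not the required implementation; the C function would use the same loops, with each integral of g approximated on a grid exactly as in the examples above.

negloglR = function(theta, t, bigT, m = 1000){
  ## theta = c(mu, K, c, p); t = observed times in [0, bigT]
  mu = theta[1]; K = theta[2]; c1 = theta[3]; p = theta[4]
  if(any(theta <= 0)) return(99999)   ## guard against bad parameter values
  ## sum of log(lambda(t_i)) over the observed points
  lam = sapply(t, function(ti) mu + sum(K / (ti - t[t < ti] + c1)^p))
  sumlog = sum(log(lam))
  ## integral term: mu*bigT plus, for each point, the integral of g from 0 to bigT - t_i,
  ## approximated by evaluating g on a grid of m values and multiplying by the spacing
  intg = sapply(bigT - t, function(len){
    u = seq(0, len, length = m)
    sum(K / (u + c1)^p) * len / m
  })
  intterm = mu * bigT + sum(intg)
  intterm - sumlog   ## the negative loglikelihood
}

In the full space-time version, the grid of u values would be replaced by a grid (or random sample) of (t,x,y) locations around each point, with the sum of g over the grid multiplied by the space-time volume per sampled location.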
The integral of lambda(t,x,y) dt dx dy should be approximately n when theta gets close to the true parameter vector, because E[ integral of lambda(t,x,y) dt dx dy ] = the expected number of points in the observed space. So, it's a good idea to include in the C function a line like

Rprintf("The integral is %f .\n", int2);

(replacing int2 with whatever you are calling your approximation of the integral term). Also, note that the parameters in theta must be non-negative, and negative values might do weird things to your negative loglikelihood function. So your code should start with a bunch of if() statements, so that if any of the parameters is negative you simply return some large positive number, like 99999, for the negative loglikelihood.

Then create a function in R that simply calls neglog in C. Call your R function neglog2. Ideally, neglog2 should not take very long to run.

b) To run optim() in R to minimize neglog2, you need an initial guess at theta. What I'd do is simply place a large number of values of theta = {mu, K, c, p, a, d, q} -- maybe 100,000, if that doesn't take too long -- at random, uniformly, in [0,100] x [0,100] x [0,10] x [0,10] x [0,100] x [0,100] x [0,10]. Then evaluate neglog2 at each of these thetas, and store the value of theta that minimizes neglog2. Call this value pstart.

c) Then take a point process dataset and, using the initial guess pstart from b) and the function neglog2 from a), run optim() and record the final parameter estimates. You can also get approximate standard errors from optim(). For instance, if you do

fit2 = optim(pstart, neglog2, control=list(maxit=200), hessian=T)
pend = fit2$par                     ## these are the parameter estimates
b3 = sqrt(diag(solve(fit2$hess)))   ## these are the approximate standard errors

optim() can be quite slow when calculating f is slow. Let's try out optim() a bit more.

f = function(theta){
  (theta[1]-2.1)^2 + (2.0+(theta[2]-2.2)^2)*(3.0+(theta[3]-2.3)^2) +
  (theta[4]-2.4)^2 + (theta[5]-2.5)^2 +
  (3.0+(theta[6]-2.6)^4)*(4.0+(theta[7]-2.7)^4) + (theta[8]-2.8)^2
}
guess1 = rep(10,8)
timea = Sys.time()
g = optim(guess1,f,method="BFGS",control=list(maxit=200),hessian=T)
Sys.time() - timea
g$par
sqrt(diag(solve(g$hess)))

So it's fast, but if f is slow, then optim() slows down a lot.

num = 0
f = function(theta){
  a = runif(100000)       ## some pointless work, just to slow f down
  b = sort(a)
  num <<- num + 1         ## count how many times f gets called
  cat(num," ")
  (theta[1]-2.1)^2 + (2.0+(theta[2]-2.2)^2)*(3.0+(theta[3]-2.3)^2) +
  (theta[4]-2.4)^2 + (theta[5]-2.5)^2 +
  (3.0+(theta[6]-2.6)^4)*(4.0+(theta[7]-2.7)^4) + (theta[8]-2.8)^2
}
timea = Sys.time()
f(guess1)
Sys.time() - timea
timea = Sys.time()
g = optim(guess1,f,method="BFGS",control=list(maxit=200),hessian=T)
Sys.time() - timea
g$par
g$counts

Notice the large number of times f is called (watch the counter num printed by cat). optim() calls f not only at every iteration, but also to approximate the first and second derivatives. It's calling f about 10 times per iteration.
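One way to cut down on those extra calls, when the derivatives are available analytically, is to pass a gradient function to optim() through its gr argument, so that BFGS no longer has to approximate the gradient by finite differences. Below is a sketch for the f defined just above; the gradient gradf is worked out by hand here and is not part of the original notes.

gradf = function(theta){   ## analytic gradient of f above (the runif/sort lines don't affect it)
  c(2*(theta[1]-2.1),
    2*(theta[2]-2.2)*(3.0+(theta[3]-2.3)^2),
    2*(theta[3]-2.3)*(2.0+(theta[2]-2.2)^2),
    2*(theta[4]-2.4),
    2*(theta[5]-2.5),
    4*(theta[6]-2.6)^3*(4.0+(theta[7]-2.7)^4),
    4*(theta[7]-2.7)^3*(3.0+(theta[6]-2.6)^4),
    2*(theta[8]-2.8))
}
num = 0    ## reset the counter inside f
timea = Sys.time()
g = optim(guess1, f, gr=gradf, method="BFGS", control=list(maxit=200), hessian=T)
Sys.time() - timea
num        ## far fewer calls to f, since the derivatives now come from gradf

The same idea applies to the negative loglikelihood in a): if you can work out its partial derivatives with respect to theta, passing them to optim() can save many calls to neglog2.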