NO CLASS TUE 11/21 OR THU 11/23!

0. Final project schedule.
1. MLE in general.
2. MLE for Hawkes point processes.

Final projects.

Thu, Nov 30
1. FROELICH, AUSTIN
2. AGRAWAL, VANI
3. LIANG, SIYU
4. NANDY, SIDDHARTH
5. ARJON, ALEJANDRA
6. MEHTA, YASHVI MAYUR
7. QI, SHIRLEY
8. VO, DAVIS MINH
9. GOH, HAN GYUL
10. TANG, BRADLEY TIANYOU
11. CHEN, BAITING
12. CHEN, HAOTIAN
13. CHEN, GORDON

Tue, Dec 5
14. VIDAL, JAY
15. ANTONY, KEVIN J
16. CHAE, JE HOON
17. ZHANG, YUELONG
18. PARK, JENNIFER Y
19. JIANG, FLORA
20. CHIANG, JENNIFER
21. XU, LI
22. KANG, ANDREA
23. WANG, CHUNHE
24. TANG, CHENCHEN
25. ASKARI, MOHAMMAD
26. BATENI, BORNA
27. GOELTNER, KARL LIN DE HAN

Thu, Dec 7
28. LI, KATHERINE
29. XU, JIAMIN
30. ILANGO, DHAKSHINA PRIYA RAJESWARI
31. CHEN, ALEX LICHY
32. TROXELL, DAVID JOSEPH
33. KHATIB, RAMZEY
34. BOLOURANI, ANAHITA
35. DSOUZA, REBECCA AURELIA
36. LI, HANNAH
37. BHAGWAT, VAIDEHI PRASHANT
38. FENG, JUSTIN ADAM
39. LIU, SILVIA
40. HE, HENGZHI

-- Attendance is mandatory on your own day plus at least 1 of the 2 other days.
-- Email me your slides, ideally as pdf or ppt, the day before your presentation. You will use my computer for your presentation.

1. MLE.

Instead of minimizing the sum of squared residuals, as in regression, one can estimate the parameters of a model by finding the parameter values that maximize the likelihood of the data. This is in some sense the reverse of what you want: given the data, you want the parameter values that are most likely. From a Bayesian point of view, the MLE is equivalent to putting a uniform (or uninformative) prior on your parameters and then taking the peak (or mode) of the posterior distribution.

For example, suppose you are interested in p = the proportion of days it rains in a particular place. You randomly select 1 day and find that it rained on that day. The MLE of p is 1. Indeed, the likelihood of your data is p, which is maximized at p = 1. If you started with a uniform prior for p, then the posterior density would be triangular (proportional to p, i.e. 2p on [0,1]) and maximized at 1. So the MLE roughly agrees with Bayesian analysis, though most Bayesians would advocate taking the mean of the posterior rather than the peak.

Sir Ronald Fisher wrote several important papers between 1912 and 1922 advocating the use of the MLE. He showed that, from a frequentist perspective, the MLE has nice asymptotic properties under quite general conditions: consistency, asymptotic normality, asymptotic unbiasedness, and asymptotic efficiency. Several people have written about special cases, though, where the MLE is a lousy estimate. See e.g. the excellent review by Le Cam, L.M. (1990). Maximum likelihood: an introduction. International Statistical Review 58, 153--171.
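Returning to the rainy-day example, here is a quick R sketch of the likelihood and the posterior under a uniform prior:

## One day observed, and it rained: the likelihood is L(p) = p,
## and with a uniform prior the posterior density is 2p on [0,1].
p = seq(0, 1, length = 100)
plot(p, p, type = "l", xlab = "p", ylab = "", ylim = c(0, 2))
lines(p, 2*p, lty = 2)
legend("topleft", lty = 1:2, legend = c("likelihood p", "posterior 2p"))
## Both curves are maximized at p = 1, the MLE and posterior mode.
## The posterior mean, which many Bayesians would report instead,
## is the integral from 0 to 1 of p * 2p dp = 2/3.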
2. Hawkes point processes and MLE.

A point process is a collection of points in some space. Often the space is the real line and the points represent times; an example might be the origin times of major California earthquakes. Alternatively, you might have a spatial-temporal point process, where each point is the spatial-temporal location of the occurrence of some event, like a hurricane in the U.S. A point process is SIMPLE if, with probability one, all its points are distinct, i.e. no 2 points overlap exactly. Simple point processes are uniquely characterized by their CONDITIONAL RATES.

For a space-time point process, the conditional rate lambda(t,x,y) is the limiting expected rate at which points accumulate near (t,x,y), given everything that has occurred before time t:
lambda(t,x,y) = the limit, as delta -> 0 and |B_(x,y)| -> 0, of E[number of points in [t, t+delta] x B_(x,y) | Ht] / [delta |B_(x,y)|],
where B_(x,y) is a ball around (x,y) with area |B_(x,y)|, and Ht is the entire history of what points have occurred before time t.

Point processes may be
-- clustered: points cause future points nearby to become more likely.
-- inhibitive: points cause future points to become less likely.
-- Poisson: the rate of future points is independent of what points have occurred previously.

An important model for clustered point processes is the Hawkes model, where, in the purely temporal case,
lambda(t) = mu + sum_{i: t_i < t} g(t - t_i).
Here the t_i are the times at which the points occurred, and g(u) is called the triggering density. Typically g(u) is large when u = 0, and g(u) decreases gradually to zero. An example is g(u) = exp(-beta u), for beta > 0.

An important example of a Hawkes process is the ETAS, or Epidemic-Type Aftershock Sequence, model of Ogata (1988), where g(u) = K/(u+c)^p, and mu, K, c, and p are nonnegative parameters to be estimated. A space-time version of ETAS was proposed by Ogata (1998); see equation 2.3 or 2.4 of ogata98.pdf, where
lambda(t,x,y) = mu + sum_{i: t_i < t} g(t - t_i, x - x_i, y - y_i)
and
(2.3) g(u,v,w) = K/(u+c)^p * exp(-a(M_i - M_0)) * {v^2 + w^2 + d}^{-q}, or
(2.4) g(u,v,w) = K/(u+c)^p * {(v^2 + w^2) exp(-a(M_i - M_0)) + d}^{-q},
where the parameter vector Theta = {mu, K, c, p, a, d, q}, (t_i, x_i, y_i, M_i) are the time, x-coordinate, y-coordinate, and magnitude of earthquake i, and M_0 is the minimum cutoff magnitude for the earthquake catalog being used. M_0 is typically known by the user, not estimated from the data. Note that (2.4) for some reason isn't labelled in Ogata (1998), on page 384. Also, Ogata writes g(t,x,y), which I think is a little confusing, because g is really a function of the interevent time and distance from a previous earthquake, not of ordinary time and location.

Point process parameters may be estimated by maximum likelihood. That is, we find the parameter vector Theta that maximizes the likelihood of the data. In practice it's typically a bit easier computationally to minimize -log(likelihood). The loglikelihood is
sum_i log(lambda(t_i,x_i,y_i)) - integral of lambda(t,x,y) dt dx dy,
where the sum is over the observed points i, and the integral is over the whole observed space.
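For intuition, here is a minimal sketch (an illustration, not Ogata's ETAS code) of fitting a purely temporal Hawkes model by minimizing -log(likelihood) in R, using the exponential triggering function g(u) = exp(-beta u) from above. For this g, the integral of lambda over a window [0, Tend] has the closed form mu*Tend + sum_i (1 - exp(-beta(Tend - t_i)))/beta. The times below are fake, just to demonstrate the call.

negloglik = function(theta, t, Tend){
  ## theta = (mu, beta) for lambda(t) = mu + sum_{i: t_i < t} exp(-beta(t - t_i))
  mu = theta[1]; beta = theta[2]
  if(mu <= 0 || beta <= 0) return(Inf)   ## keep the parameters positive
  n = length(t)
  lam = rep(mu, n)                       ## lam[1] = mu: no earlier points
  for(i in 2:n) lam[i] = mu + sum(exp(-beta*(t[i] - t[1:(i-1)])))
  intlam = mu*Tend + sum((1 - exp(-beta*(Tend - t)))/beta)
  -(sum(log(lam)) - intlam)              ## negative loglikelihood
}
t = sort(runif(100, 0, 100))   ## fake, non-clustered times, just for the demo
fit = optim(c(.5, 1), negloglik, t = t, Tend = 100)
fit$par                        ## estimates of mu and beta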
Given values of the parameters, it is often relatively easy to calculate the sum, but calculating or approximating the integral can be tricky. If you can't get the integral analytically using calculus, then another option is to approximate it by sampling thousands of space-time locations around each point (t_i, x_i, y_i), evaluating sum g at these sampled locations, and using the approximation
integral of lambda(t,x,y) dt dx dy ~ mu T |A| + sum(sampled values of sum g) * (space-time volume associated with each sampled location),
where T is the length of the time window and |A| is the area of the observed spatial region.

For instance, suppose we wanted to approximate the integral from 0 to 3 of f(t) dt, where f(t) = 7 + exp(-2t) + 20 exp(-4t). The antiderivative is 7t - exp(-2t)/2 - 5 exp(-4t), so
the integral = [7t - exp(-2t)/2 - 5 exp(-4t)] from 0 to 3
= 7*3 - exp(-6)/2 + 1/2 - 5 exp(-12) + 5, which is about 26.4987.

x = seq(0, 5, length = 100)
f = 7 + exp(-2*x) + 20*exp(-4*x)
plot(x, f, type = "l", xlab = "t", ylab = "7 + exp(-2t) + 20exp(-4t)")
trueint = 7*3 - exp(-6)/2 + 1/2 - 5*exp(-12) + 5
m = 10000
sampgrid = seq(0, 3, length = m)   ## grid of m sampled locations in [0,3]
sampf = exp(-2*sampgrid) + 20*exp(-4*sampgrid)
7*3 + sum(sampf) * 3/m   ## the constant 7 integrates exactly; compare with trueint

Now suppose we wanted to approximate the integral from 0 to 3 of f(t) dt, where f(t) = 7 + exp(-2t) + dnorm(t-2). The true integral is 7*3 - exp(-6)/2 + 1/2 + pnorm(1) - pnorm(-2).

x = seq(0, 5, length = 100)
f = 7 + exp(-2*x) + dnorm(x-2)
plot(x, f, type = "l", xlab = "t", ylab = "7 + exp(-2t) + dnorm(t-2)")
trueint = 7*3 - exp(-6)/2 + 1/2 + pnorm(1) - pnorm(-2)
m = 10^4   ## change to 10^8 for higher accuracy?
sampgrid = seq(0, 3, length = m)
sampf = exp(-2*sampgrid) + dnorm(sampgrid - 2)
7*3 + sum(sampf) * 3/m   ## compare with trueint

Since C is so fast, it might be easier to just use the approximation method.
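The same grid idea applies directly to the triggering part of the Hawkes integral. Below is a small sketch, with made-up times, beta, and window [0, Tend], checking the grid approximation of the integral of sum_{i: t_i < t} exp(-beta(t - t_i)) against the closed-form answer used earlier.

beta = 2
Tend = 10
t = c(1.2, 3.5, 3.7, 8.1)    ## made-up event times
m = 10^5
grid = seq(0, Tend, length = m)
sumg = rep(0, m)
## sum of triggering contributions at each grid location;
## each event contributes only after its own time t[i]
for(i in 1:length(t)) sumg = sumg + ifelse(grid > t[i], exp(-beta*(grid - t[i])), 0)
sum(sumg) * Tend/m                        ## grid approximation
sum((1 - exp(-beta*(Tend - t)))/beta)     ## exact answer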