1. MLE. 2. MLE for Hawkes point processes. 3. MLE using optim().

1. MLE.

Instead of minimizing the sum of squared residuals as in regression, one can estimate the parameters of a model by finding the values of the parameters that maximize the likelihood of the data. This is in some sense the reverse of what you want: given the data, you want the values of the parameters that are most likely. From a Bayesian point of view, the MLE is equivalent to putting a uniform (or uninformative) prior on your parameters and then taking the peak (or mode) of the posterior distribution.

For example, suppose you are interested in p = the probability that it rains on a given day in a particular place. You randomly select 1 day and find that it rained on that day. The MLE of p is 1. In fact the likelihood of your data is p, so the likelihood is maximized at p = 1. If you started with a uniform prior for p, then the posterior distribution would be proportional to p (a triangular density on [0,1]), which is also maximized at 1. So the MLE roughly agrees with the Bayesian analysis, though most Bayesians would advocate taking the mean of the posterior rather than the peak.

Sir Ronald Fisher wrote several important papers between 1912 and 1922 advocating the use of the MLE. He showed that, from a frequentist perspective, the MLE has nice asymptotic properties like consistency, asymptotic normality, asymptotic unbiasedness, and asymptotic efficiency, under quite general conditions. Several people have written about special cases, though, where the MLE is a lousy estimate. See e.g. the excellent review by Le Cam, L.M. (1990). Maximum likelihood: an introduction. Int. Statist. Rev. 58, 153--171.

2. Hawkes point processes and MLE.

A point process is a collection of points in some space. Often the space is the real line and the points represent times; an example might be the origin times of major California earthquakes. Alternatively, you might have a spatial-temporal point process, where each point is the spatial-temporal location of the occurrence of some event, like a UFO sighting.

A point process is SIMPLE if, with probability one, all its points are distinct. That is, no 2 points overlap exactly. Simple point processes are uniquely characterized by their CONDITIONAL RATES. For a space-time point process, the conditional rate
lambda(t,x,y) = the limiting expected rate at which points are accumulating near (t,x,y), given everything that has occurred before time t
= the limit, as delta and |B_(x,y)| -> 0, of E[number of points in [t, t+delta] x B_(x,y) | H_t] / [delta |B_(x,y)|],
where B_(x,y) is a ball around (x,y), |B_(x,y)| is its area, and H_t is the entire history of which points have occurred up to time t.

Point processes may be
clustered -- points cause future points nearby to become more likely,
inhibitive -- points cause future points to become less likely, or
Poisson -- the rate of future points is independent of what other points have occurred.

An important model for clustered space-time point processes is the Hawkes model, where
lambda(t) = mu + sum_{i: t_i < t} g(t - t_i).
g(u) is called the triggering density. Typically g(u) is large when u = 0, and g(u) decreases gradually to zero. An example is g(u) = exp(-beta u), for beta > 0. In the formula above, the t_i are the times at which the points occurred. An important example of a Hawkes process is the ETAS, or Epidemic-Type Aftershock Sequence, model of Ogata (1988), where g(u) = K/(u+c)^p, and mu, K, c, and p are nonnegative parameters to be estimated.
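Before moving to the space-time version, here is a minimal R sketch (not part of the original notes) that evaluates the temporal Hawkes conditional intensity lambda(t) = mu + sum_{i: t_i < t} g(t - t_i) with the exponential triggering function g(u) = exp(-beta u) mentioned above. The event times and parameter values are made up purely for illustration.

## Hedged sketch: evaluate a temporal Hawkes conditional intensity with
## g(u) = exp(-beta u). The event times and parameters are invented.
hawkes.lambda = function(t, times, mu, beta){
  past = times[times < t]                  ## only points strictly before t contribute
  mu + sum(exp(-beta * (t - past)))
}
times = c(1.0, 1.3, 4.2, 4.25, 7.9)        ## hypothetical event times
mu = 0.2
beta = 1.5
tgrid = seq(0, 10, length=500)
lam = sapply(tgrid, hawkes.lambda, times=times, mu=mu, beta=beta)
plot(tgrid, lam, type="l", xlab="t", ylab="lambda(t)")
points(times, rep(mu, length(times)), pch=3)   ## mark the event times

The plot shows the intensity jumping up at each event time and decaying back toward the background rate mu, which is the clustering behavior described above.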
A space-time version of ETAS was proposed by Ogata (1998), where
lambda(t,x,y) = mu + sum_{i: t_i < t} g(t-t_i, x-x_i, y-y_i), and
g(u,v,w) = K/(u+c)^p * exp(a (M_i - M_0)) * {v^2 + w^2 + d}^{-q}, or
g(u,v,w) = K/(u+c)^p * {(v^2 + w^2) exp(-a (M_i - M_0)) + d}^{-q},
where the parameter vector Theta = {mu, K, c, p, a, d, q}, (t_i, x_i, y_i, M_i) are the time, x-coordinate, y-coordinate, and magnitude of earthquake i, and M_0 is the minimum cutoff magnitude for the earthquake catalog being used. M_0 is typically known by the user, not estimated from the data.

Point process parameters may be estimated by maximum likelihood. That is, we find the parameter vector Theta that maximizes the likelihood of the data. In practice it is typically a bit easier computationally to minimize -log(likelihood). The loglikelihood is
sum_i log(lambda(t_i, x_i, y_i)) - integral of lambda(t,x,y) dt dx dy,
where the sum is over the observed points i, and the integral is over the whole observed space. Given values of the parameters, it is often relatively easy to calculate the sum, but calculating or approximating the integral can be tricky. If you can't get the integral analytically using calculus, then another option is to approximate it numerically: sample thousands of space-time locations around each point (t_i, x_i, y_i), evaluate sum_i g(t-t_i, x-x_i, y-y_i) at these sampled locations, and use the approximation
integral of lambda(t,x,y) dt dx dy ~ mu T |A| + [sum of the sampled values of sum_i g] * [the space-time volume associated with each sampled location],
where T is the length of the time window and |A| is the area of the spatial observation region.

For instance, suppose we wanted to approximate the integral from 0 to 3 of f(t) dt, where f(t) = 7 + exp(-2t) + 20 exp(-4t). The true integral is
[7t - exp(-2t)/2 - 5 exp(-4t)] evaluated from 0 to 3
= 7*3 - exp(-6)/2 + exp(0)/2 - 5 exp(-12) + 5 exp(0)
= 7*3 - exp(-6)/2 + 1/2 - 5 exp(-12) + 5.

x = seq(0,5,length=100)
f = 7 + exp(-2*x) + 20*exp(-4*x)
plot(x,f,type="l",xlab="t",ylab="7 + exp(-2t) + 20exp(-4t)")
trueint = 7*3 - exp(-6)/2 + 1/2 - 5*exp(-12) + 5
m = 1000
sampgrid = seq(0,3,length=m)
sampf = exp(-2*sampgrid) + 20*exp(-4*sampgrid)
7*3 + sum(sampf) * 3/m

Suppose instead we wanted to approximate the integral from 0 to 3 of f(t) dt, where f(t) = 7 + exp(-2t) + dnorm(t-2). The true integral is 7*3 - exp(-6)/2 + 1/2 + pnorm(1) - pnorm(-2).

x = seq(0,5,length=100)
f = 7 + exp(-2*x) + dnorm(x-2)
plot(x,f,type="l",xlab="t",ylab="7 + exp(-2t) + dnorm(t-2)")
trueint = 7*3 - exp(-6)/2 + 1/2 + pnorm(1) - pnorm(-2)
m = 100000
sampgrid = seq(0,3,length=m)
sampf = exp(-2*sampgrid) + dnorm(sampgrid-2)
7*3 + sum(sampf) * 3/m

Since C is so fast, it might be easier to just use the approximation method. Note that the integral term satisfies
integral of lambda(t,x,y) dt dx dy = mu [tmax-tmin][xmax-xmin][ymax-ymin] + integral of (sum_i g) dt dx dy,
and the integral of (sum_i g) dt dx dy = sum_i of (integral of g dt dx dy), so you can just sum, over all n points, the integral of g dt dx dy. In some circumstances it might be reasonable to write g(t,x,y) = K f(t,x,y), where f is some density and K is a constant, because then
(*) integral of g dt dx dy ~ K,
which makes things WAY easier. The approximation (*) would be exact if all the aftershock activity of every earthquake were always observed within your observation window. Also, as a check, the integral of lambda(t,x,y) dt dx dy should be approximately n when Theta gets close to the true parameter vector, because E[integral of lambda(t,x,y) dt dx dy] = the expected number of points in the observed space.
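To tie these pieces together, here is a hedged R sketch (not part of the original notes) of a negative log-likelihood for the purely temporal Hawkes model, taking the triggering function to be g(u) = K exp(-beta u), a slight extension of the g(u) = exp(-beta u) example above. With this choice the integral term has the closed form mu*T + (K/beta) * sum_i (1 - exp(-beta*(T - t_i))), so no sampling approximation is needed; for triggering functions without a closed-form integral, the grid approximation above could be used instead. The data and starting values below are invented.

## Hedged sketch: negative log-likelihood of a temporal Hawkes process on (0, T]
## with g(u) = K exp(-beta u). Here theta = (mu, K, beta).
## loglik = sum_i log(lambda(t_i)) - integral of lambda(t) dt from 0 to T.
neg.loglik.hawkes = function(theta, times, T){
  mu = theta[1]; K = theta[2]; beta = theta[3]
  if(min(theta) < 1e-8 || K >= beta) return(9e20)  ## crude positivity and stability (K/beta < 1) constraints
  n = length(times)
  sumlog = 0
  for(i in 1:n){
    past = times[times < times[i]]
    lam = mu + K * sum(exp(-beta * (times[i] - past)))
    sumlog = sumlog + log(lam)
  }
  integral = mu * T + (K/beta) * sum(1 - exp(-beta * (T - times)))
  integral - sumlog
}
times = sort(runif(200, 0, 100))   ## stand-in data, just to show the call
neg.loglik.hawkes(c(0.5, 0.4, 1.0), times, T=100)

A function like this could then be handed directly to optim(), as in section 3 below, e.g. fit = optim(c(0.5, 0.4, 1.0), neg.loglik.hawkes, times=times, T=100, hessian=TRUE).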
3. Newton-Raphson optimization.

With Newton-Raphson methods, or quasi-Newton methods, you start with an initial guess x_0 for the minimum of a smooth function f(x), and then move to x_1, then x_2, ..., heading toward the minimum.

If f is smooth, then at the minimum x, f'(x) = 0. The Taylor expansion of f around the current point x is
f(x + delta) = f(x) + f'(x) delta + f''(x) delta^2 / 2! + ....
It is common to approximate f using just these first three terms, i.e. a quadratic in delta:
f(x + delta) ~ f(x) + f'(x) delta + f''(x) delta^2 / 2.
Setting the derivative of this with respect to delta to 0, so that x + delta approximates the minimum, gives
f'(x) + f''(x) delta ~ 0.
Therefore, delta ~ -f'(x) / f''(x). So what quasi-Newton methods do is start with some guess x_0, then for i = 0, 1, 2, ..., let
x_{i+1} = x_i + delta = x_i - f'(x_i) / f''(x_i).
This extends readily to higher-dimensional functions, where -f'(x_i)/f''(x_i) becomes minus the inverse Hessian matrix times the gradient.

The function optim() in R already does fast quasi-Newton optimization, because the underlying optimization routines are compiled code. You can also get approximate standard errors from optim(). For instance, if you do

fit2 = optim(pstart, neglog2, control=list(maxit=200), hessian=T)
pend = fit2$par ## these are the parameter estimates.
b3 = sqrt(diag(solve(fit2$hess))) ## these are the approximate standard errors.

optim() can be quite slow when calculating f is slow. Let's try out optim() a bit. We saw optim() in lec9.

f = function(theta){
  (theta[1]-2.1)^2 + (2.0+(theta[2]-2.2)^2)*(3.0+(theta[3]-2.3)^2) +
  (theta[4]-2.4)^2 + (theta[5]-2.5)^2 +
  (3.0+(theta[6]-2.6)^4)*(4.0+(theta[7]-2.7)^4) + (theta[8]-2.8)^2
}
guess1 = rep(10,8)
timea = Sys.time()
g = optim(guess1,f,method="BFGS",control=list(maxit=200),hessian=T)
Sys.time() - timea
g$par
sqrt(diag(solve(g$hess)))

So it's fast, but if f is slow, then it slows down a lot.

num = 0
f = function(theta){
  a = runif(100000)
  b = sort(a)
  num <<- num + 1
  cat(num," ")
  (theta[1]-2.1)^2 + (2.0+(theta[2]-2.2)^2)*(3.0+(theta[3]-2.3)^2) +
  (theta[4]-2.4)^2 + (theta[5]-2.5)^2 +
  (3.0+(theta[6]-2.6)^4)*(4.0+(theta[7]-2.7)^4) + (theta[8]-2.8)^2
}
timea = Sys.time()
f(guess1)
Sys.time() - timea
timea = Sys.time()
g = optim(guess1,f,method="BFGS",control=list(maxit=200),hessian=T)
Sys.time() - timea
g$par
g$counts

Notice the large number of times f is called. optim() calls f not only at every iteration, but also to approximate the first and second derivatives numerically; here it was calling f about 10 times per iteration.

If you're fitting a distribution to univariate data, you can just use fitdistr() in R's MASS package, which generally calls optim() unless there is a closed-form solution.
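As a quick illustration of fitdistr() (this example is not from the original notes, and the data are simulated), one might fit a gamma distribution by maximum likelihood and read off the estimates and their approximate standard errors:

## Hedged sketch: MLE for a univariate distribution via fitdistr() in MASS.
## The data are simulated, so the estimates should land near the true
## shape = 2 and rate = 0.5 used to generate them.
library(MASS)
set.seed(1)
w = rgamma(1000, shape=2, rate=0.5)
fit1 = fitdistr(w, "gamma")
fit1$estimate   ## MLEs of shape and rate
fit1$sd         ## their approximate standard errors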
Here's an example of fitting the spatial-temporal ETAS model, using the approximation (*). In the file myss2.c, put

#include <R.h>
#include <math.h>
void ss2 (double *x, double *y, double *t, double *hm, int *n, double *x1,
          double *y1, double *T, double *theta, double *a2){
  /* negative log-likelihood of the ETAS model, using the approximation (*)
     for the integral term; hm[i] = M_i - M_0 */
  int i, j;
  double sum1, sum2, r2, const1, mu, K, c, p, a, d, q, lam1;
  mu = theta[0]; K = theta[1]; c = theta[2]; p = theta[3];
  a = theta[4]; d = theta[5]; q = theta[6];
  sum1 = 0.0;
  for(i = 0; i < *n; i++){
    sum1 += exp(a*hm[i]);
  }
  /* integral term, minus log(lambda) at the first point */
  *a2 = mu * *T + K*sum1 - log(mu/(*x1 * *y1));
  const1 = K * (p-1) * pow(c,(p-1)) * (q-1) * pow(d,(q-1)) / 3.141593;
  for(i = 1; i < *n; i++){
    sum2 = 0.0;
    for(j = 0; j < i; j++){
      r2 = (x[i]-x[j])*(x[i]-x[j]) + (y[i]-y[j])*(y[i]-y[j]);
      sum2 += pow(t[i]-t[j]+c,(-p)) * exp(a*hm[j]) * pow(r2+d,(-q));
    }
    lam1 = mu / *x1 / *y1 + const1 * sum2;
    if(lam1 < 0.0000000001) *a2 = 999999999999;
    else *a2 -= log(lam1);
  }
}

Then in R,

x = read.table("scec2.txt") ## post Hector Mine 1999 earthquakes.
y0 = x[,1]
y1 = as.Date(y0,"%m/%d/%y")
y2 = julian(y1,origin = as.Date("1999-10-16"))
t = y2 + x[,2]/24 + x[,3]/60/24 + x[,4]/60/60/24  ## time in days since the origin date
m = x[,5]   ## magnitudes
y = x[,6]   ## latitudes
x = x[,7]   ## longitudes
plot(x,y,pch=1,cex=exp((m-2)/2.5),xlab="longitude",ylab="latitude")
points(x[2],y[2],cex=exp((m[2]-2)/2.5),col="red")
z1 = list(t=t, hm=m-3, lon=x+117, lat=y-34, n=length(t))
z = z1
X1 = 1
Y1 = 1
T = 434 ## length of the time window, in days

system("R CMD SHLIB myss2.c")
dyn.load("myss2.so")

sum3 = function(theta1){
  ## negative log-likelihood; returns a huge value unless all parameters are
  ## positive with p > 1 and q > 1, as required by the normalizing constant in the C code.
  if(min(min(theta1),theta1[4]-1,theta1[7]-1) < 0.00000001) return(9e20)
  a3 = .C("ss2", as.double(z$lon), as.double(z$lat), as.double(z$t),
          as.double(z$hm), as.integer(z$n), as.double(X1), as.double(Y1),
          as.double(T), as.double(theta1), a2=double(1))
  i <<- i+1
  cat(i, signif(theta1,3),"\n")
  a3$a2 ## or equivalently a3[[10]]
}

i = 0
theta = c(5.250, 0.200, 0.025, 1.500, 0.700, 0.025, 1.500) ## (mu, K, c, p, a, d, q)
sum3(theta)
i = 0
b1 = optim(theta,sum3,hessian=T,control=list(maxit=1000))
b1$par
se = sqrt(diag(solve(b1$hess)))
theta = b1$par
b1 = optim(theta,sum3,hessian=T)
b1$conv ## 0 means it converged; 1 means it hit maxit and might not have converged yet.
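Since optim() can stop at maxit before converging, one might wrap the refitting step above in a loop and then summarize the fit. A hedged sketch along those lines (not from the original notes), continuing from the objects b1 and sum3 defined above:

## Hedged sketch: restart optim() from the previous estimates until it reports
## convergence (b1$conv == 0), then display the estimates with approximate
## standard errors from the Hessian.
while(b1$conv != 0){
  theta = b1$par
  b1 = optim(theta, sum3, hessian=T, control=list(maxit=1000))
}
pend = b1$par
se = sqrt(diag(solve(b1$hess)))
cbind(estimate = pend, se = se)
rbind(pend - 1.96*se, pend + 1.96*se)  ## rough 95% confidence intervals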