1. Projects.
2. Sum of squared differences between observations.
3. Sum of squared differences from observations for each gridpoint.
4. Hawkes models and MLE.

No class or office hours Thur Nov 3, due to the mandatory Statistics faculty retreat.

From a student: "Some of us were working at getting R and C working, and I found a solution you may find useful. When we were trying to run R CMD SHLIB, an error came back to the effect of 'gcc-4.2 file not found'. In this case, the executable was just /usr/bin/gcc, so we fixed it with a symlink by running the command 'sudo ln -s /usr/bin/gcc /usr/bin/gcc-4.2' at the terminal."

Final project teams and oral report order.

Tues, Nov 29
1 MENON, TULASI.
2 MONROE, SCOTT.
3 LIU, XUENING. ERLIKHMAN, GENNADY. DABAGH, SAFAA.
4 BEYOR, ALEXA. BANSAL, RAHUL. JOHNSTON, RACHEL.
5 MATTHEWS, JERRID. YING, VICTOR. GARAND, JOSEPH.
6 CHEN, YU-CHING. WANG, ZHAO. LEE, JONGYOON.
7 BAI, HAN. BRUMBAUGH, STEPHEN. DEL VALLE, ARTURO.
8 LY, DIANA. DALININA, RUSLANA. JIANG, FEIFEI.
9 GONYEA, TIANQING. TONG, XIAOXIAO. CHEN, LICHAO.

Thur, Dec 1
10 MOLYNEUX, JAMES. YANG, HO-SHUN. YIN, KEVIN.
11 XIE, XIAOXI. PAVLOVSKAIA, MARIA. MORALES, EMMANUEL.
12 AGRAWAL, PALASH. CHEN, MENGNA. TSIOKOS, CHRISTOS.
13 CHANG, BO.
14 MCLAUGHLIN, KATHERINE. NG, LUNG. CRAWFORD, TIMOTHY.
15 CHEN, YUE. MCCARTHY, PATRICK. LAM, HO YEUNG.
16 LU, WEIPENG. WU, YI. LI, SHANSHAN.
17 WRIGHT, JENNIFER. CHEN, CHEN. SONG, XI.
18 EMBREE, JOSHUA.
19 GARCIA, EDUARDO. GORDON, JOSHUA. WONG, ALBERT.

For project A, start looking for a dataset and dividing up tasks. A regression-type dataset might be ideal: one where you have many independent observational units (e.g. people), and for each unit you have 3 or 4 variables recorded, one of which is the logical response variable. You can then do regression and also analyze each variable separately.

For project B, you can choose ETAS (2.3) or ETAS (2.4) in Ogata (1998). Submit a pdf of your project and the text files of your code, a .c file and a .r file, to me. After you submit your code, I will generate a simulation and check whether your code produces estimates that are close to the true parameter values.

2. Sum of squared differences between observations.

In sumsq.c,

#include <R.h>
#include <math.h>
void ss2 (double *x, int *n, double *y)
/* x will be the vector of data of length n,
   and y will be a vector of squared differences
   from obs i to the other n-1 observations. */
{
    int i,j;
    double a;
    for(i = 0; i < *n; i++){
        a = 0.0;
        for(j = 0; j < *n; j++){
            a += pow(x[i] - x[j], 2);
        }
        y[i] = a;
    }
}

Instead of the for loop, I could have said

    i = 0;
    while(i < *n){
        commands;
        i++;
    }

In R,

system("R CMD SHLIB sumsq.c")
dyn.load("sumsq.so")
sum3 = function(data2){
    n = length(data2)
    a = .C("ss2", as.double(data2), as.integer(n), y = double(n))
    a$y  ## or equivalently a[[3]]
}
b = c(1,3,4)
sum3(b)

n = c(100, 1000, 2000, 3000, 5000, 7000, 8000, 10000)
t2 = rep(0,8)
for(i in 1:8){
    b = runif(n[i])
    timea = Sys.time()
    d = sum3(b)
    timeb = Sys.time()
    t2[i] = timeb - timea
    cat(n[i]," ")
}
plot(n, t2, ylab="time (sec)")

## Now try the same thing in R, without C.
sum4 = function(data2){
    n = length(data2)
    x = rep(0,n)
    for(i in 1:n){
        for(j in 1:n){
            x[i] = x[i] + (data2[i] - data2[j])^2
        }
    }
    x
}
b = c(1,3,4)
sum4(b)

n = c(1:8)*100
t3 = rep(0,8)
for(i in 1:8){
    b = runif(n[i])
    timea = Sys.time()
    d = sum4(b)
    timeb = Sys.time()
    t3[i] = timeb - timea
    cat(n[i]," ")
}
plot(n, t3, ylab="time (sec)")
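As an aside, a third option for this particular problem is to vectorize the R code, which avoids the explicit double loop without calling C. A minimal sketch (the function name sum5 is arbitrary), using the identity ∑_j (x_i - x_j)^2 = n x_i^2 - 2 x_i ∑_j x_j + ∑_j x_j^2:

## Vectorized R version, no loops:
## sum_j (x_i - x_j)^2 = n*x_i^2 - 2*x_i*sum(x) + sum(x^2).
sum5 = function(data2){
    n = length(data2)
    n * data2^2 - 2 * data2 * sum(data2) + sum(data2^2)
}
b = c(1,3,4)
sum5(b)  ## should agree with sum3(b) and sum4(b)

Since the work here is done by R's internal vectorized arithmetic, this version should be far faster than the double loop in sum4, though vectorizing is not always possible for more complicated computations, which is where C comes in.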
3. Sum of squared differences from observations for each gridpoint.

In sumsgrid.c,

#include <R.h>
#include <math.h>
void ssg2 (double *x, int *n, double *g, int *m, double *y)
/* x will be the vector of data of length n,
   g will be the vector of m gridpoints,
   and y will be a vector of length m, of squared differences
   from gridpoint i to the n observations. */
{
    int i,j;
    double a;
    for(i = 0; i < *m; i++){
        a = 0.0;
        for(j = 0; j < *n; j++){
            a += pow(g[i] - x[j], 2);
        }
        y[i] = a;
    }
}

In R,

system("R CMD SHLIB sumsgrid.c")
dyn.load("sumsgrid.so")
sum4 = function(data2, grid2){
    n = length(data2)
    m = length(grid2)
    a = .C("ssg2", as.double(data2), as.integer(n),
           as.double(grid2), as.integer(m), y = double(m))
    a$y  ## or equivalently a[[5]]
}
b = c(1,3,4)
gr = c(0,1,2,3,4,5)
sum4(b, gr)

This function ssg2 will be similar to your kernel density code for hw3, except that you will also have a bandwidth as an additional input, and you will replace the line

    a += pow(g[i] - x[j], 2);

with the kernel density estimate.

## example of defining a vector or matrix within C.
## example of 2 C functions communicating.
## example of C++.

4. Hawkes models and MLE.

A point process is a collection of points in some space. Often the space is the real line and the points represent times; an example might be the origin times of major California earthquakes. Alternatively, you might have a spatial-temporal point process, where each point is the spatial-temporal location of the occurrence of some event, like a hurricane in the U.S.

A point process is SIMPLE if, with probability one, all its points are distinct. That is, no 2 points overlap exactly. Simple point processes are uniquely characterized by their CONDITIONAL RATES. For a space-time point process, the conditional rate

lambda(t,x,y) = the limiting expected rate at which points accumulate near (t,x,y), given everything that has occurred before time t
= the limit, as delta and |B_(x,y)| -> 0, of E[ number of points in [t, t+delta] x B_(x,y) | Ht ] / [delta |B_(x,y)|],

where B_(x,y) is a ball around (x,y), |B_(x,y)| is its area, and Ht is the entire history of what points have occurred before or at time t.

Point processes may be
clustered -- points cause future points nearby to become more likely,
inhibitive -- points cause future points to become less likely, or
Poisson -- the rate of future points is independent of what points have occurred previously.

An important model for clustered space-time point processes is the Hawkes model, where

lambda(t) = mu + ∑_{i: t_i < t} g(t - t_i).

g(u) is called the triggering density. Typically g(u) is large near u = 0 and decreases gradually to zero. An example is g(u) = exp(-beta u), for beta > 0. In the formula above, t_i represents the times at which the points occurred.

An important example of a Hawkes process is the ETAS or Epidemic-Type Aftershock Sequence model of Ogata (1988), where g(u) = K/(u+c)^p, and mu, K, c, and p are nonnegative parameters to be estimated. A space-time version of ETAS was proposed by Ogata (1998); see equations (2.3) and (2.4) of http://wildfire.stat.ucla.edu/pdflibrary/ogata98.pdf, where

lambda(t,x,y) = mu + ∑_{i: t_i < t} g(t - t_i, x - x_i, y - y_i)

and

(2.3) g(u,v,w) = K/(u+c)^p * exp(-a (M_i - M_0)) * (v^2 + w^2 + d)^{-q}, or
(2.4) g(u,v,w) = K/(u+c)^p * {(v^2 + w^2) exp(-a (M_i - M_0)) + d}^{-q},

where the parameter vector Theta = (mu, K, c, p, a, d, q), (t_i, x_i, y_i, M_i) are the time, x-coordinate, y-coordinate, and magnitude of earthquake i, and M_0 is the minimum cutoff magnitude for the earthquake catalog being used. M_0 is typically known by the user, not estimated from the data.
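To make these formulas concrete, here is a minimal sketch in R, not code from Ogata, of evaluating the purely temporal ETAS conditional intensity lambda(t) = mu + ∑_{i: t_i < t} K/(t - t_i + c)^p at a single time t. The function name lambdat, the point times, and the parameter values are all made up for illustration.

## Conditional intensity of the temporal ETAS model at time t,
## given the times ti of previously observed points.
lambdat = function(t, ti, mu, K, c, p){
    u = t - ti[ti < t]       ## interevent times from points before t
    mu + sum(K / (u + c)^p)
}
ti = c(0.5, 1.2, 3.1)        ## hypothetical point times
lambdat(4, ti, mu = 0.1, K = 0.8, c = 0.01, p = 1.5)

Each previous point contributes K/(u+c)^p to the rate, which is largest for the most recent points and decays as a power law in the elapsed time u.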
Note that (2.4) for some reason isn't labelled in Ogata (1998), on page 384. Also, Ogata writes g(t,x,y), which I think is a little confusing, because g is really a function of the interevent time and the distance from a previous earthquake, not of ordinary time and location.

Point process parameters may be estimated by maximum likelihood. That is, we find the parameter vector Theta that maximizes the likelihood of the data. In practice it's typically a bit easier computationally to minimize -log(likelihood). The loglikelihood is

∑_i log(lambda(t_i, x_i, y_i)) - ∫ lambda(t,x,y) dt dx dy,

where the sum is over the observed points i, and the integral is over the whole observed space. Given values of the parameters, it is often relatively easy to calculate the sum, but calculating or approximating the integral can be tricky. On the bottom of page 385 and top of page 386 of Ogata (1998), http://wildfire.stat.ucla.edu/pdflibrary/ogata98.pdf, he shows some calculus on the integral term, changing it to polar coordinates, for example. He finds that, if the study region is [0,T] x A, and the points are (t_i, x_i, y_i, M_i), for i = 1 to n, then

∫ lambda(t,x,y) dt dx dy is approximately mu T |A| + ∑_i [∫ from 0 to (T - t_i) of K/(u+c)^p du] [∑_k S_k(x_i, y_i, M_i) D_k/(2 pi)],

where S_k and D_k have somewhat complicated definitions, given on pages 385-386 of Ogata (1998).

If you can't get the integral analytically using calculus, another option is to approximate it by sampling thousands of space-time locations around each point (t_i, x_i, y_i), evaluating the triggering sum ∑_i g(t - t_i, x - x_i, y - y_i) at these sampled locations, and using the approximation

∫ lambda(t,x,y) dt dx dy is approximately mu T |A| + sum(sampled values of ∑ g) * (space-time volume associated with each sampled location).

For instance, suppose we wanted to approximate ∫ from 0 to 3 of f(t) dt, where f(t) = 7 + exp(-2t) + 20 exp(-4t). The integral
= [7t - exp(-2t)/2 - 5 exp(-4t)] evaluated from 0 to 3
= 7*3 - exp(-6)/2 + exp(0)/2 - 5 exp(-12) + 5 exp(0)
= 7*3 - exp(-6)/2 + 1/2 - 5 exp(-12) + 5.

x = seq(0,5,length=100)
f = 7 + exp(-2*x) + 20*exp(-4*x)
plot(x,f,type="l",xlab="t",ylab="7 + exp(-2t) + 20exp(-4t)")
trueint = 7*3 - exp(-6)/2 + 1/2 - 5*exp(-12) + 5
m = 1000
sampgrid = seq(0,3,length=m)
sampf = exp(-2*sampgrid) + 20*exp(-4*sampgrid)
7*3 + sum(sampf) * 3/m

Suppose instead we wanted to approximate ∫ from 0 to 3 of f(t) dt, where f(t) = 7 + exp(-2t) + dnorm(t-2). The true integral is 7*3 - exp(-6)/2 + 1/2 + pnorm(1) - pnorm(-2).

x = seq(0,5,length=100)
f = 7 + exp(-2*x) + dnorm(x-2)
plot(x,f,type="l",xlab="t",ylab="7 + exp(-2t) + dnorm(t-2)")
trueint = 7*3 - exp(-6)/2 + 1/2 + pnorm(1) - pnorm(-2)
m = 100000
sampgrid = seq(0,3,length=m)
sampf = exp(-2*sampgrid) + dnorm(sampgrid-2)
7*3 + sum(sampf) * 3/m

Since C is so fast, it might be easier to just use the approximation method.
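To see how the pieces fit together, here is a minimal sketch, not code from the notes or from Ogata, of -log(likelihood) for the temporal ETAS model lambda(t) = mu + ∑_{i: t_i < t} K/(t - t_i + c)^p on [0,T]. In this purely temporal case the integral term can be done analytically: ∫ from 0 to T of lambda(t) dt = mu T + ∑_i K [c^(1-p) - (T - t_i + c)^(1-p)] / (p - 1), assuming p != 1. The function name negloglik is arbitrary, and in practice you would also want to keep the parameters positive, e.g. by optimizing their logs or using optim with method="L-BFGS-B" and lower bounds; this sketch omits that.

negloglik = function(theta, t, T){
    mu = theta[1]; K = theta[2]; c = theta[3]; p = theta[4]
    n = length(t)
    ## sum of log(lambda(t_i)) over the observed points
    s1 = 0
    for(i in 1:n){
        u = t[i] - t[t < t[i]]                ## interevent times for point i
        s1 = s1 + log(mu + sum(K/(u + c)^p))
    }
    ## integral of lambda over [0,T], done analytically (p != 1)
    s2 = mu*T + sum(K * (c^(1-p) - (T - t + c)^(1-p)) / (p - 1))
    s2 - s1
}
t1 = sort(runif(100, 0, 10))                  ## fake data, just to run the function
negloglik(c(0.1, 0.8, 0.01, 1.5), t = t1, T = 10)
## then minimize, e.g. optim(c(0.1, 0.8, 0.01, 1.5), negloglik, t = t1, T = 10)

Coming up. Kernel regression. Newton-Raphson optimization.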