1. Hw and Nov 3.
2. Final projects.
3. R Cookbook ch. 11.
4. R Cookbook ch. 13.1 and 13.2.

1. Hw2, hw3, and no class Nov 3.

-- Hw2 and hw3 are on the course website http://www.stat.ucla.edu/~frederic/202a/F11 .
Hw2 is due Tue, Oct 25, 10:30am. Hw3 is due Tue, Nov 8, 10:30am.
Late homeworks will not be accepted!

-- No class Nov 3, due to the mandatory Statistics faculty retreat.

2. Final projects.

Random assignment to groups for option A.

Option A. Find some data on the web on a subject of your choice. If possible, choose something you are genuinely interested in. Analyze the data using the methods we have talked about in class, such as linear regression, univariate kernel density estimation, 2-d kernel density estimation, statistical testing, quantile plots, and kernel regression. At least one component of your data analysis should be done in C. Your group will write up the analysis in a written report, and the group will also give an oral presentation in which every member of the group speaks. Each group's oral presentation will be 5-7 minutes in total. You will be graded on the performance of your group rather than individually: each group will receive one overall grade based on the oral report and the written report combined. This combined grade will count for 15% of your overall grade in the class, and the other 85% will be based on your homework assignments.

Option B. You will work individually and write C (or C++) code to perform maximum likelihood estimation for Hawkes point processes using the Newton-Raphson optimization method, and use this function to fit a model to the same 374 points you downloaded for problem 2 of hw3. You will write up your analysis in a written report and will also give an oral presentation lasting 4-5 minutes. If you choose option B, your course grade will be based 50% on your homework assignments, 35% on your written project, which will include your C code, and 15% on your oral presentation.

Either way, your final project should be submitted to me by email to frederic@stat.ucla.edu by Thur, Dec 8, 11:59pm. Note the address: do not send it to stat202a@stat.ucla.edu.

We will now randomly assign people to groups. Chang, Bo; Embree, Josh; Menon, Tulasi; and Monroe, Scott are doing option B.

x = scan("roster.txt", what="char", sep="\n")   ## one name per line
j = c(6,16,33,35)                ## roster positions of the four option B students
x[j]
x1 = x[-j]                       ## everyone else, to be placed in groups of 3
y = matrix(sample(x1), ncol=3)   ## random groups of 3, one group per row
yb = sample(1:19)                ## yb[1:4] = presentation slots for the option B students
k = 1
for(i in c(1:19)){
    if(i == 1) cat(" Tues, Nov 29 \n")
    if(i == 10) cat("\n\n Thur, Dec 1 \n")
    cat(i, " ")
    if((i == yb[1]) || (i == yb[2]) || (i == yb[3]) || (i == yb[4])){
        if(i == yb[1]) cat(x[j[1]], "\n")
        if(i == yb[2]) cat(x[j[2]], "\n")
        if(i == yb[3]) cat(x[j[3]], "\n")
        if(i == yb[4]) cat(x[j[4]], "\n")
    } else {
        cat(y[k,], "\n")
        k = k + 1
    }
}

 Tues, Nov 29
1  MENON, TULASI.
2  MONROE, SCOTT.
3  LIU, XUENING. ERLIKHMAN, GENNADY. DABAGH, SAFAA.
4  BEYOR, ALEXA. BANSAL, RAHUL. JOHNSTON, RACHEL.
5  MATTHEWS, JERRID. YING, VICTOR. GARAND, JOSEPH.
6  CHEN, YU-CHING. WANG, ZHAO. LEE, JONGYOON.
7  BAI, HAN. BRUMBAUGH, STEPHEN. DEL VALLE, ARTURO.
8  LY, DIANA. DALININA, RUSLANA. JIANG, FEIFEI.
9  GONYEA, TIANQING. TONG, XIAOXIAO. CHEN, LICHAO.

 Thur, Dec 1
10  MOLYNEUX, JAMES. YANG, HO-SHUN. YIN, KEVIN.
11  XIE, XIAOXI. PAVLOVSKAIA, MARIA. MORALES, EMMANUEL.
12  AGRAWAL, PALASH. CHEN, MENGNA. TSIOKOS, CHRISTOS.
13  CHANG, BO.
14  MCLAUGHLIN, KATHERINE. NG, LUNG. CRAWFORD, TIMOTHY.
15  CHEN, YUE. MCCARTHY, PATRICK. LAM, HO YEUNG.
16  LU, WEIPENG. WU, YI. LI, SHANSHAN.
17  WRIGHT, JENNIFER. CHEN, CHEN. SONG, XI.
18  EMBREE, JOSHUA.
19  GARCIA, EDUARDO. GORDON, JOSHUA. WONG, ALBERT.
3. R Cookbook ch. 11.

lm, p269-285.

lm(y ~ x) fits y = b0 + b1 x + eps.
lm(y ~ u + v + w) fits y = b0 + b1 u + b2 v + b3 w + eps, p280.
lm(y ~ u + v + w + 0) for no intercept.
lm(y ~ u*v*w) for the main effects plus all interactions among u, v, and w, p279:
y = b0 + b1 u + b2 v + b3 w + b4 uv + b5 uw + b6 vw + b7 uvw + eps.
To specify a particular interaction, use ":". lm(y ~ u + v + w + u:v:w) includes the main effects plus only the u:v:w interaction, p280.
lm(y ~ (u+v+w)^i) gives the main effects plus all interactions involving up to i of the variables.

u = rnorm(10000)
v = rnorm(10000)
w = rnorm(10000)
x = rnorm(10000)
y = 10 + 20*u + 30*u*v + rnorm(10000)
j = lm(y ~ (u + v + w + x)^4)
summary(j)

Use - to eliminate a term.

j = lm(y ~ (u + v + w + x)^4 - v:w:x)
summary(j)

step, p281, to do stepwise regression, forward or backward.

k = step(j, direction="backward")

AIC = Akaike's Information Criterion; among the models compared, the one with the lowest AIC is considered the best fitting. Assuming eps is iid Normal with mean 0 and some variance sigma^2, one can calculate the likelihood associated with a particular model and its fitted parameters. Then AIC = -2 log(likelihood) + 2p, where p is the number of estimated parameters. Lower AIC is preferred. (A small check of this formula appears at the end of these notes.)

summary(k)
k2 = step(lm(y ~ 1), direction="forward", scope = ( ~ (u+v+w+x)^4), trace=0)
summary(k2)

p285, I(u+v) for a term that actually just represents u+v. Similarly, I(u^2) for a term that is u^2, or I(u*v) for a column that is u*v.

p292, confint for confidence intervals.

confint(k2)

p293, plot the residuals with plot(k2, which=1), and the other diagnostic plots with plot(k2).

p297, im = influence.measures(k2) outputs all 7 influence measures listed on p298 for each of the 10,000 data points. To get just the Cook's distances, use

im2 = im$infmat[,6]

Cook's distance for observation i = SUM from j=1 to 10000 of [^Y_j - ^Y_j(-i)]^2 / (p MSE), where
^Y_j = prediction of observation j based on the full model,
^Y_j(-i) = prediction of observation j using the model fit with observation i omitted,
p = number of parameters estimated, and MSE = mean squared error.
(See the sketch at the end of these notes for a safer way to extract the Cook's distances.)

p299, dwtest() in the lmtest package, and acf(), to check the residuals for autocorrelation. The Durbin-Watson test has been criticized, e.g. by Chatfield (1986), on the grounds that it basically just tests the first autocorrelation, although users often think it looks at more.

p300-1, prediction intervals and predictions using predict(). (A short example appears at the end of these notes.)

4. R Cookbook ch. 13.1 and 13.2.

optimize, p335, to minimize a function of one variable.

f = function(x){ exp(abs(x)-2.0) + (x-3)^2 }
x = seq(-10, 10, length=200)
y = f(x)
par(mfrow=c(1,2))
plot(x,y)
plot(x, y, ylim=c(min(y), min(y)+1))
c(1:200)[y == min(y)]
x[124]
x[123:125]
f(x[124])
g = optimize(f)     ## doesn't work: optimize() needs an interval (or lower and upper)
g = optimize(f, lower=-10, upper=10)
g

g$min is the location of the minimum and g$obj = f(g$min); these abbreviate g$minimum and g$objective.

optim(), p336, for functions of more than one variable. By default optim() uses the Nelder-Mead simplex method, which doesn't require derivatives or a range, but you do need an initial guess.

f = function(x) 3 + (x[1]-2.4)^2 + (x[2]-2.5)^2
guess1 = c(10,10)
g = optim(guess1, f)
g$par

Next few classes: C in R. Simple C functions. Kernel regression. Newton-Raphson optimization. MLE. Hawkes models.
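Below are a few small worked examples for the ch. 11 material above; these are quick sketches, not code from the R Cookbook. First, a check of the AIC formula AIC = -2 log(likelihood) + 2p, using the backward stepwise fit k from above. logLik() gives the Gaussian log-likelihood of an lm fit, and its "df" attribute counts the estimated parameters (the coefficients plus sigma).

ll = logLik(k)               ## Gaussian log-likelihood of the fitted model k
p = attr(ll, "df")           ## number of estimated parameters (coefficients plus sigma)
-2*as.numeric(ll) + 2*p      ## -2 log(likelihood) + 2p
AIC(k)                       ## should agree with the line above

Note that step() prints AIC values from extractAIC(), which omits an additive constant; those numbers differ from AIC() by a constant that is the same for every model fit to the same data, so the comparison between models is unaffected.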
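A tiny self-contained sketch of the I() notation from p285, using freshly simulated data (u0, v0, y0 are made up for illustration). In a formula, u0 + v0 means two separate predictors, while I(u0 + v0) means the single predictor whose value is the sum u0+v0.

u0 = rnorm(100)
v0 = rnorm(100)
y0 = 1 + 2*(u0 + v0) + rnorm(100)
coef(lm(y0 ~ u0 + v0))        ## two slopes, one for each predictor
coef(lm(y0 ~ I(u0 + v0)))     ## one slope for the combined predictor u0+v0
coef(lm(y0 ~ u0 + I(u0^2)))   ## quadratic polynomial in u0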
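For the Cook's distances mentioned above, the column number of cook.d in im$infmat depends on how many coefficients the model has (there is one dfbeta column per coefficient), so selecting the column by name, or calling cooks.distance() directly, may be safer. A short sketch, assuming k2 from the stepwise fit above:

im = influence.measures(k2)
im2 = im$infmat[, "cook.d"]    ## Cook's distances, selected by column name
cd = cooks.distance(k2)        ## the same quantities, computed directly
all.equal(as.numeric(im2), as.numeric(cd))
plot(cd, type="h")             ## large values flag influential observations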
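Finally, a short sketch of p300-1, predictions and prediction intervals with predict(), again assuming k2 from above. The values in newdat are made up for illustration; supplying all four predictors works regardless of which terms step() kept.

newdat = data.frame(u=1, v=2, w=0, x=0)              ## one hypothetical new observation
predict(k2, newdata=newdat)                          ## point prediction
predict(k2, newdata=newdat, interval="confidence")   ## confidence interval for the mean response
predict(k2, newdata=newdat, interval="prediction")   ## wider interval for a single new y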