1. Hw2 and final projects. 2. R Cookbook ch. 8. 3. R Cookbook ch. 9.

1. Hw2 and final projects.

-- Hw2 is on the course website http://www.stat.ucla.edu/~frederic/202a/F11 in word and pdf, and is due Tue, Oct 25, 10:30am. Late homeworks will not be accepted!

-- I'm giving you each a choice between two options for the final projects.

Option A. I will randomly assign you to groups of 3 or 4 students, and you will analyze some data using the methods we have talked about in class, including the methods used for the homeworks but also other methods we have discussed in class. Your group will write up your analysis in a written report, and the group will also make an oral presentation, where every member of your group will speak. Each group's oral presentation will be about 5-10 minutes in total. You will be graded on the performance of your group, rather than individually, and each group will receive one overall grade based on the oral report and written report, combined. This combined grade will count for 15% of your overall grade in the class, and the other 85% will be based on your homework assignments.

Option B. You will work individually and write C (or C++) code to perform maximum likelihood estimation for Hawkes point processes using the Newton-Raphson optimization method, and use this function to fit a model to some point process data that I will provide. You will write up your analysis in a written report and will also give an oral presentation lasting 5-10 minutes. If you choose option B, your overall grade in the class will be as described in the outline I discussed on day 1. That is, your grade will be based 50% on your homework assignments, 35% on your written project, which will include your C code, and 15% on your oral presentation.

I will randomly assign people to groups on Thursday, Oct 20. If you would like to do Option B, you MUST email me before 8pm on Wednesday, Oct 19, at frederic@stat.ucla.edu. If I do not hear from you by 8pm on Wednesday, Oct 19, I will assume you will do Option A.

-- Read up through the end of ch10 in R Cookbook for next class.

2. R Cookbook ch8.

dnorm(), pnorm(), qnorm(), rnorm() are the normal pdf, cdf, quantile function, and simulator, respectively.

dnorm(1.2)
1/(sqrt(2*pi))*exp(-1.2^2/2)
pnorm(1.96)
pnorm(0)
qnorm(.5)
qnorm(0.975) ## see p190
qnorm(0)
qnorm(1)
qnorm(1.1) ## NaN, with a warning, since 1.1 is not a valid probability
rnorm(3)

The default is mean = 0, sd = 1, but you can set these.

rnorm(3, mean = 50, sd = 10)
qnorm(0.975, mean = 50, sd = 10)
rnorm(3, 5, 40) ## mean = 5 and sd = 40
rnorm(3, 5, mean=40) ## mean = 40 and sd = 5

You can substitute "binom", "geom", etc., for "norm". See the tables on pp177-178. We've seen "unif" already.

See the warning about rexp() on p178. The parameter is the rate 1/beta, not the mean beta.

rexp(3, 10)
rexp(3, 0.1)

The arguments can be vectors instead of scalars, and R will cycle through them one at a time, as described on p182.

rexp(3, c(.01, 10, .02))
signif(rexp(3, c(.01, 10, .02)), 3)
signif(rexp(3, c(.01, 10)), 2) ## it cycles through, so the 3rd has mean 100.

The note about ?Normal and ?TDist on pp 178-179 is silly. You can use help() with a function as usual. For instance, the functions for simulating normal or t distributed random variables are rnorm() and rt(), and you can do help(rnorm) or help(rt). You just can't do help(norm) or help(t) to get these help files, because norm() and t() are matrix functions [norm and transpose, respectively].

set.seed(), p183.

p179, choose(n,k) gives the number of combinations.
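As a quick check of these last two items (this snippet is mine, not from the book): set.seed() makes your simulations reproducible, and choose(n,k) agrees with the usual factorial formula.

set.seed(42)     ## fix the random seed
rnorm(3)         ## three particular draws
set.seed(42)     ## reset to the same seed
rnorm(3)         ## the same three draws again
choose(5, 2)                                    ## 10
factorial(5) / (factorial(2) * factorial(3))    ## also 10, by n!/(k!(n-k)!)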
p180, combn(1:5,3) or combn(5,3) gives the actual combinations. See for instance

combn(4,3)
combn(c(8,-5,40,20),2)
combn(c(8,8,40,20),2)

p184, sample(1:10, 5) chooses 5 items randomly from the vector 1:10.

sample(1:10,5)

By default it is without replacement, but you can do

sample(1:10, 5, rep=T)
sample(1:10, c(5,3,1)) ## doesn't work: the size argument must be a single number.

If you want to generate 3 samples of different sizes from the vector 1:10, you could do

y = c(5,3,1)
x = list()
for(i in 1:3){
    x[[i]] = sample(1:10, y[i])
}
# for(i in y){
#     x[[i]] = sample(1:10, i)
# }

You can also have sample() choose elements with different probabilities, by giving it a vector of probabilities as an argument, as on p185.

sample(1:10, 5, rep=T, prob = (1:10)/55)

sample(5:10) just randomly reshuffles the elements in the vector 5:10 (or any vector). See p186.

sample(5:10)

p187, diff() calculates the differences between successive elements of a vector. It returns a vector of length n-1 if your initial vector had length n. Note that when your vector is a list of times, you often want to sort your vector before using diff().

t = rpois(10, lambda=20)
diff(t)
t1 = sort(t)
t1
diff(t1)

pp 191-192 show an interesting example of plotting densities. He calculates the density on a vector x of points all at once.

dnorm((1:10)/10)

We can do this with the exponential distribution.

x = seq(0,10,length.out=100)
y = dexp(x)
plot(x, y, main="Exponential distribution", type="l", ylab="density", xlab="quantile")

Note that he uses the polygon() function to shade in a region. region.x consists of all the x-coordinates of the region, and region.y consists of all the y-coordinates of the region.

x1 = x[1 <= x & x <= 3]
region.x = c(min(x1), x1, max(x1)) ## Teetor uses x1[1] and tail(x1,1) instead of min() and max()
y1 = y[1 <= x & x <= 3]
region.y = c(0, y1, 0)
polygon(region.x, region.y, col="red")

## End of ch8.

3. R Cookbook ch9.

p 197, summary().

summary(y)

p199, mean(x > 0) computes the relative frequency, because (x > 0) is a vector of TRUEs (1) and FALSEs (0).

mean(y > 0.5)

p200 describes table(), which we've seen before.

p201, summary() applied to a table does a chi square test for independence between 2 factors. A small p-value indicates statistically significant dependence, i.e. strong evidence against the assumption of independence.

shy = factor(c(rep("Y", 13), rep("N", 10)))
gpa = factor(c(rep("A",5), rep("B", 4), rep("C",4), rep("A", 2), rep("C", 3), rep("B", 5)))
table(shy)
table(gpa)
table(shy, gpa)
summary(table(shy, gpa))

p203, scale() normalizes a numerical vector, matrix, or data frame.

x = c(1,2,4,4,3)
y = scale(x)
y[1:5]

pp203-204, t.test() to test a null hypothesis about the mean, or to get a confidence interval for the population mean.

t.test(x, mu=1)
t.test(x, conf.level = 0.975)
y = t.test(x, conf.level = 0.975)
attributes(y)
y$conf.int
y$conf.int[1:2]

p206, wilcox.test(x, conf.int=TRUE) ## CI for the median

x = c(1:10, 7.1, 20)
wilcox.test(x, conf.int=TRUE)

p207, testing whether the population proportion of successes = some null hypothesis value. Say you have 70 successes out of 97 trials, and your null hypothesis is that p = 0.80.

prop.test(70, 97, 0.8)

You can also use this to get a CI for p.

prop.test(70, 97, 0.8, conf.level = 0.9)

Test for normality.

shapiro.test(x)

p210, the nortest package has more tests for normality.

p211, testing the total number of runs, or streaks, or equivalently changes, in binary time series data, using runs.test().
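Before running runs.test(), here is a small sketch of what a "run" is, counted by hand with rle(); this example is mine, not from the book.

z = c(1,1,0,1,0,0,0,1,1,1)   ## a short binary sequence
rle(z)                       ## run-length encoding: the length and value of each run
length(rle(z)$lengths)       ## the number of runs; here there are 5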
install.packages("tseries")
library(tseries)
x = runif(50)
y = (x > 0.4)
runs.test(as.factor(y))
x = sample(c(0,1,2), 50, rep=T)
runs.test(as.factor(x)) ## error, because the data are not binary.

p212, t.test() to test whether the difference between two sample means is statistically significant.

x = runif(400)
y = rnorm(700, mean=.57)
t.test(x, y)

By default, paired = FALSE, but if x and y have the same length you can say

t.test(x, y[1:400], paired = TRUE)

p214, an alternative is the nonparametric Wilcoxon-Mann-Whitney test.

wilcox.test(x, y)

p215, the Pearson test to see if a correlation is statistically significant. x and y must have the same length, of course.

x = runif(200)
y = 0.8*x + rnorm(200)
plot(x, y)
cor.test(x, y)
abline(lsfit(x, y))

Equivalently, you can fit a line by regression and use the t-test on the coefficient for x.

b = lm(y ~ x)
summary(b)
abline(b)
attributes(b)

The two p-values are the same.

summary(b)$coef[8] ## the p-value for the slope coefficient
cor.test(x, y)$p.value

The book also discusses prop.test() on p217 for comparing two proportions, pairwise.t.test() on p218 for comparing the means of more than 2 samples, and the Kolmogorov-Smirnov test, ks.test(), to see if two cumulative distribution functions might be equivalent.

## End of ch9.
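Here is a minimal sketch of the two-sample ks.test(), not from the book; the sample sizes and distributions below are just illustrative.

x = rnorm(100)     ## a sample from a standard normal
y = runif(100)     ## a sample from a uniform on [0,1]
ks.test(x, y)      ## typically a very small p-value: the two empirical cdfs look different
x2 = rnorm(100)    ## a second standard normal sample
ks.test(x, x2)     ## typically a large p-value, consistent with a common cdf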