1. Hw2 and final projects.
2. R Cookbook ch. 9.
3. R Cookbook ch. 10.

1. Hw2 and final projects.

-- Hw2 is on the course website, http://www.stat.ucla.edu/~frederic/202a/F11 . I have corrected a mistake in problem 2 since Tuesday. Also, I put a copy of the data for problem 2 on the course website, in the file housingdata.webarchive, just in case the other site, www.dqnews.com, crashes or changes. The two datasets are identical. Hw2 is due Tue, Oct 25, 10:30am. Late homeworks will not be accepted!

-- For the final projects, I will randomly assign people to groups on Thursday, Oct 20. If you would like to do Option B, you MUST email me at frederic@stat.ucla.edu before 8pm on Wednesday, Oct 19. If I do not hear from you by 8pm on Wednesday, Oct 19, I will assume you will do Option A. See 202alec6.txt for the options.

-- Read up through the end of ch. 11 in R Cookbook for next class.

2. R Cookbook ch. 9.

p197, summary().

summary(y)

p199, mean(x > 0) computes the relative frequency of positive values, because (x > 0) is a vector of TRUEs (1) and FALSEs (0).

mean(y > 0.5)

p200 describes table(), which we've seen before.

p201, summary(table) does a chi square test for independence between 2 factors. A small p-value indicates statistically significant dependence, i.e. strong evidence against the assumption of independence.

shy = factor(c(rep("Y", 13), rep("N", 10)))
gpa = factor(c(rep("A", 5), rep("B", 4), rep("C", 4), rep("A", 2), rep("C", 3), rep("B", 5)))
table(shy)
table(gpa)
table(shy, gpa)
summary(table(shy, gpa))

p203, scale() normalizes a numerical vector, matrix, or data frame.

x = c(1,2,4,4,3)
y = scale(x)
y[1:5]

p203-204, t.test() to test against a null hypothesis of the mean, or to get a confidence interval for the population mean.

t.test(x, mu=1)
t.test(x, conf.level = 0.975)
y = t.test(x, conf.level = 0.975)
attributes(y)
y$conf.int
y$conf.int[1:2]

p206, wilcox.test(x, conf.int=TRUE) ## CI for the median

x = c(1:10, 7.1, 20)
wilcox.test(x, conf.int=TRUE)

p207, testing if the population proportion of successes = some null hypothesis value. Say you have 70 successes out of 97 trials, and your null hypothesis is that p = 0.80.

prop.test(70, 97, 0.8)

You can also use this to get a CI for p.

prop.test(70, 97, 0.8, conf.level = 0.9)

Test for normality.

shapiro.test(x)

p210, the nortest package has more tests for normality.

p211, testing for the total number of runs, or streaks, or equivalently changes, in binary time series data, using runs.test().

install.packages("tseries")
library(tseries)
x = runif(50)
y = (x > 0.4)
runs.test(as.factor(y))
x = sample(c(0,1,2), 50, rep=T)
runs.test(as.factor(x)) ## error because the data are not binary.

p212, t.test() to test if the difference between two sample means is statistically significant.

x = runif(400)
y = rnorm(700, mean=.57)
t.test(x,y)

By default, paired = FALSE, but if x and y have the same length you can say

t.test(x, y[1:400], paired = TRUE)

p214, an alternative is the nonparametric Wilcoxon-Mann-Whitney test.

wilcox.test(x,y)

p215, Pearson test to see if a correlation is statistically significant. x and y must have the same length, of course.

x = runif(200)
y = 0.8*x + rnorm(200)
plot(x,y,main=expression(x[i]^-2))
cor.test(x,y)
abline(lsfit(x,y))

Equivalently, you can fit a line by regression and use the t-test on the coefficient for x.

b = lm(y ~ x) ## similar to b = lsfit(x,y)
summary(b)
abline(b)
attributes(b)
summary(b)$coef[8]
cor.test(x,y)$p.value

They're the same: the p-value for the slope equals the p-value from cor.test().

The book also discusses prop.test() on p217, for comparing two proportions, pairwise.t.test() on p218, for comparing the means of > 2 samples, and the Kolmogorov-Smirnov test ks.test(), to see if two cumulative distribution functions might be equivalent. Quick sketches of the last two follow.
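Here are small sketches of ks.test() and pairwise.t.test() with made-up data (the names a1, a2, grp, and v are just illustrative, not from the book):

## Two-sample Kolmogorov-Smirnov test: compares the empirical cdfs of two samples.
a1 = rnorm(100)
a2 = runif(100)
ks.test(a1, a2)          ## small p-value: strong evidence the two cdfs differ.
ks.test(a1, rnorm(100))  ## usually a large p-value: no evidence of a difference.

## pairwise.t.test() compares the means of more than 2 groups, adjusting the
## p-values for multiple comparisons (Holm's method by default).
grp = factor(rep(c("A","B","C"), each = 50))
v = rnorm(150) + (grp == "C")   ## group C has a larger mean.
pairwise.t.test(v, grp)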
## End of ch9.

3. R Cookbook ch. 10.

p222, list of plotting functions, such as plot(), boxplot(), and hist().

p225, the main argument in plot() creates a title.

x = runif(10)
y = rnorm(10)
plot(x, y, xlab="uniforms", ylab="normals", main="Figure 1")

p226, grid() makes the plot look like graph paper.

grid()
grid(col=grey(.2))

p227, pch is a useful argument to plot(). We've seen pch="." for small points. You can give pch a vector and it will plot each point differently. cex controls the size of the characters.

names = c("a","b","c","d","e","f","g","h","i","j")
plot(x, y, pch=names, cex=1.5)
plot(x, y, pch=".", cex=5)

When pch takes a number, it plots a symbol, with 1 = circle, 2 = triangle, 3 = plus, etc.

types2 = c(rep(1,5), rep(2,3), rep(3,2))
plot(x, y, pch=types2)

If you want it to plot the index of your data, you have to make the indices characters.

names = as.character(1:10)
plot(x, y, pch=names)

Use points() to overlay symbols.

points(x, y, pch=types2) ## It looks bad.
plot(x, y, pch=names)
points(x, y, pch=types2, col="red", cex=2)

Note that pch has a problem with 2-character names, like "10". Use text() instead.

plot(x, y, type="n")
text(x, y, names, col="blue")
points(x, y, pch=types2, col="red", cex=2)

## Last time someone asked about making the axes intersect at 0. Use the arguments xaxs = "i" and yaxs = "i" in plot().

x = c((0:10)/10)
y = x
plot(x, y, type="l")
plot(x, y, type="l", yaxs="i")
plot(x, y, type="l", xaxs="i", yaxs="i")

Legend, p229.

legend(0.8, 1, c("shoes", "socks", "hats"), pch=c(1:3), cex=.8)

You can use "lty" in place of "pch" above. See p230. Note that this is NOT the kind of legend I want you to do for hw2 in problem 1e.

Adding a regression line, p232.

z = lm(y ~ x)
abline(z)
plot(x, y, type="n")
text(x, y, names, col="blue")
abline(z, lty=2) ## dashed line
z$coef

lm() can also be used for multiple regression. p269-271, lm(y ~ x1 + x2 + x3).

x1 = runif(100)
x2 = rnorm(100)
x3 = rexp(100)
y = x1+x2+x3+rnorm(100)
z = lm(y ~ x1 + x2 + x3)
z$coef
z$coef[1] ## the estimated intercept
z$coef[3] ## the estimated slope corresponding to x2

barplot() p236, abline() p245, boxplot() p247, and hist() p249 are self-explanatory.

p244 shows a common mistake when plotting two curves on the same plot: you have to set the ranges first, e.g. by passing xlim and ylim to the first plot() call, so that both curves fit.

p250, density() is useful for kernel density estimation. Suppose your univariate data are x_1, x_2, ..., x_n. For any number x0, the kernel density estimate is

f^(x0) = Σ k((x0 - x_i)/bw) / (n bw),

where the sum is from i = 1 to n, k() is some density, usually centered at 0 and usually with a peak at 0, and bw is a bandwidth. If bw is small, then each point gives a lot of density nearby, whereas if bw is large, then each point gets smoothed out heavily. People have shown that the kernel, k, doesn't matter nearly as much as the choice of bandwidth. See help(density).

x = rnorm(100)
hist(x, nclass=40, prob=T, main="simulated N(0,1) random variables")
lines(density(x), lty=2)

By default, kernel = "gaussian". By default, bw = "nrd0", which is Silverman (1986)'s rule of thumb, 0.9 s n^(-.2). Here, s is an estimate of the sd, given by min{sample sd, IQR/1.34}. This is often too small, so Scott (1992) recommended 1.06 s n^(-.2). bw.nrd0() yields Silverman's rule, and bw.nrd() yields Scott's.
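Incidentally (a quick check of my own, not from the book), density() records the bandwidth it actually used in its bw component, so you can confirm which rule it applied to the x = rnorm(100) defined above:

density(x)$bw            ## the default, bw = "nrd0", is Silverman's rule
bw.nrd0(x)               ## same value
density(x, bw="nrd")$bw  ## bw = "nrd" requests Scott's rule
bw.nrd(x)                ## same value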
Checking these by hand (here n = 100):

bw.nrd0(x)
0.9 * min(sd(x), IQR(x)/1.34) * 100^(-.2)
bw.nrd(x)
1.06 * min(sd(x), IQR(x)/1.34) * 100^(-.2)

hist(x, nclass=40, prob=T, main="simulated N(0,1) random variables")
b2 = bw.nrd(x)
lines(density(x, bw=b2), col="red")
lines(density(x, bw=4*b2), col="blue")
lines(density(x, bw=.1*b2), col="brown")
legend(-1.5, 0.6, c("Scott's bw", "large bw", "small bw"), lty=1, col=c("red","blue","brown"), cex=0.8)

To compute kernel estimates on a grid, supply density() with n, from, and to. Or do it yourself. Let's take a really small example.

x = c(1, 2.8, 3)
z = density(x, bw=0.08, n=101, from=0, to=4)
## by default, n = 512 and the grid runs from min(x) - 3bw to max(x) + 3bw.
plot(c(0,4), c(0,2), type="n", xlab="x", ylab="density", yaxs="i")
lines(z, col="red")
z$x
z$y
plot(z$x, z$y)

Consider, for instance, the kernel density estimate f^(x0) at x0 = 1.20.

z$x[31] ## 1.20
z$y[31] ## f^(1.2) = 0.07465388.

But density() does some slightly weird things, especially with small datasets. From help(density): "The algorithm used in density.default disperses the mass of the empirical distribution function over a regular grid of at least 512 points and then uses the fast Fourier transform to convolve this approximation with a discretized version of the kernel and then uses linear approximation to evaluate the density at the specified points."

Alternatively, you can compute the kernel density estimate yourself.

density2 = function(x1, xgrid, bw2){
    ## x1 = data.
    ## xgrid = vector of values where you'll compute the kernel estimate.
    ## bw2 = bandwidth.
    n = length(xgrid)
    y = rep(0, n)
    for(i in 1:n){
        y[i] = sum(dnorm(x1 - xgrid[i], sd=bw2)) / length(x1)
    }
    y
}

x = c(1, 2.8, 3)
g = seq(0, 4, length=101)
y = density2(x, g, 0.08)
y[31] ## 0.07303459
plot(g, y, type="l", xlab="x", ylab="density")

Look at the difference between density() and density2().

plot(g, z$y - y)

## I ended here.

## 2-d kernel smoothing, using kernel2d() in splancs.

install.packages("splancs")
library(splancs)

## First, input 23 points using the mouse.

n = 23
plot(c(0,1), c(0,1), type="n", xlab="longitude", ylab="latitude", main="locations")
x1 = rep(0, n)
y1 = rep(0, n)
for(i in 1:n){
    z1 = locator(1)
    x1[i] = z1$x
    y1[i] = z1$y
    points(x1[i], y1[i])
}

bdw = sqrt(bw.nrd0(x1)^2 + bw.nrd0(y1)^2) ## default bandwidth
b1 = as.points(x1, y1)
bdry = matrix(c(0,0,1,0,1,1,0,1,0,0), ncol=2, byrow=T)
z = kernel2d(b1, bdry, bdw)
par(mfrow=c(1,2))
image(z, col=gray((64:20)/64), xlab="km E of origin", ylab="km N of origin")
points(b1)
x4 = (0:100)/100*(max(z$z)-min(z$z))+min(z$z)
plot(c(0,10), c(.8*min(x4), 1.2*max(x4)), type="n", axes=F, xlab="", ylab="")
image(c(-1:1), x4, matrix(rep(x4,2), ncol=101, byrow=T), add=T, col=gray((64:20)/64))
text(2, min(x4), as.character(signif(min(x4),2)), cex=1)
text(2, (max(x4)+min(x4))/2, as.character(signif((max(x4)+min(x4))/2,2)), cex=1)
text(2, max(x4), as.character(signif(max(x4),2)), cex=1)
mtext(side=3, line=-3, at=1, "Rate (pts/km^2)")

Next time, we will discuss simulating from a density by rejection sampling, and p252 on.
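As a preview, here is a small rejection sampling sketch of my own (not from the book): draw proposals from an easy density g, and accept a proposal x0 with probability f(x0)/(M g(x0)), where M is chosen so that f(x) <= M g(x) everywhere. For example, to simulate from f(x) = 2x on [0,1] using Unif(0,1) proposals, M = 2 works.

f = function(x) 2*x     ## target density on [0,1]
M = 2                   ## f(x) <= M * 1 on [0,1], with g = Unif(0,1)
n = 1000
samp = rep(0, n)
i = 1
while(i <= n){
    x0 = runif(1)             ## proposal from g
    if(runif(1) <= f(x0)/M){  ## accept with probability f(x0)/(M*g(x0))
        samp[i] = x0
        i = i + 1
    }
}
hist(samp, nclass=40, prob=T, main="rejection sampling from f(x) = 2x")
abline(0, 2, col="red")       ## overlay the true density f(x) = 2x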