## 1. No lecture Nov5.
## 2. game.
## 3. Final project order.
## 4. Note on hw4.
## 5. Compiling C.
## 6. Approximating pi in C.
## 7. dnorm in C.
## 8. Sum of squared differences between observations in C.

1. Faculty retreat.

No class Thu Nov5 because of the faculty retreat. Also no class Thu Nov26 for Thanksgiving.

2. game.

a. In the breakout room, exchange emails, and pick a leader.
b. Decide on a time to meet later, like maybe right after class if you like.
c. The leader should download the game from the course webpage, http://www.stat.ucla.edu/~frederic/202a/F20 . Download the whole folder called gamefor202a. This will take a few minutes because of all the pictures.
d. The leader should run the first 10 lines of game.txt, which will require installing and loading various libraries. This may also take a minute or two.
e. When the time comes to meet, the leader should start a zoom, invite the others, start R, share screen, set the R working directory to the images folder in gamefor202a, and copy and paste everything in game.txt into R.

3. Final projects.

For your final projects, you will analyze some data using the methods we have talked about in class, including the methods used for the homeworks and also other methods we have discussed in class. You will write up your analysis in a written report, and will also make an oral presentation. Each presentation will be only 5 minutes in total. No going over! I will cut you off at 5 minutes. However, I would like to take 1 quick question from the audience or me afterwards. You will use your own computer and share screen for your presentation.

For the final 3 lectures, attendance is mandatory. Please do not interrupt with difficult questions, but clarifying questions are fine. Deeper questions should be asked after the presentation.

Your dataset, which you will find yourselves on the web, can be anything you choose, but it should be:
a) regression style data, i.e.
a response variable, and for each observation, a bunch (at least 2 or 3) of explanatory variables. You should have at the very least n=30 observations. One of the variables should be a sensible response variable you can imagine wanting to predict.
b) something of genuine interest to you.

Analyze the data using the methods we have talked about in class, such as linear regression, univariate kernel density estimation, 2-d kernel density estimation, testing, quantile plots, and kernel regression. You can do regression and also analyze each variable individually. At least one component of your data analysis should be done in C.

Your final project should be submitted to me in pdf by email to frederic@stat.ucla.edu by Sun Dec 14, 11:59pm. Note the address. Do not send them via ccle and do not send them to stat202a@stat.ucla.edu. They are all due the same date, regardless of when your oral presentation is.

I will now randomly assign people to presentation times. If you want to change oral presentation dates and times with another person, feel free but let me know. I will use sample().

setwd("/Users/rickpaikschoenberg/Documents/2020/202a/done")
y = scan("roster.txt",what="char",sep="\n")
n = length(y)
w = sample(y)
for(i in 1:n){
    if(i == 1) cat("\n\n Thu, Dec 3\n")
    if(i == 13) cat("\n\n Tue, Dec 8 \n")
    if(i == 25) cat("\n\n Thu, Dec 10 \n")
    cat(i,". ",w[i],"\n",sep="")
}

Thu, Dec 3
1. FISCHER, ERIC MERCADO
2. SOUDA, NAVIN VARADARAJ
3. BURTON, HENRY
4. HWANGBO, NATHAN MIN
5. HUANG, STELLA HONGYING
6. SHINKRE, TANVI RAHUL
7. ZHAO, YIJIA
8. O'NEILL, ELIZABETH
9. DONG, CHRIS YUANCHAO
10. HOFFMANN, NATHAN ISAAC
11. FAN, HAIBO
12. WANG, KAIXIN

Tue, Dec 8
13. CHU, HANQING
14. LEE, CHRISTY
15. AGRAWAL, SURABHI
16. PARIDAR, MAHSA
17. CHEN, ALEX
18. LI, BILL
19. YAN, GUANAO
20. JACOBSON, THOMAS ABRAM
21. ZHAI, XUFAN
22. MUELLER, SCOTT ALLEN
23. WONG, EMILY FRANCES
24. VINAS, LUCIANO

Thu, Dec 10
25. KIM, DOEUN
26. ZHANG, XINYUAN
27. ZHANG, ZHE
28. RESCH, JOSEPH
29. XU, CHAO
30.
ZHOU, CARTLAND
31. NGUEN, CHUNG KYONG
32. TREJO, ALFREDO
33. LIU, VINCENT BOJIE
34. GABRIEL, CHRISTOPHER JOHN
35. VARGAS, SANTIAGO

I put some sample projects and presentation powerpoints from previous students on the course website in the folder sampleprojects.

Give us a sense of your data. Assume that the listener already knows the statistical methods you are using. Tell us what they say about your data. Emphasize the results more than the methods. Go slowly in the beginning so that the listener really understands what your data are.

Speculate and generalize but use careful language. Say "It seems" or "appears" rather than "is" when it comes to speculative statements or models. For example, you might say "The residuals appear approximately normal" or "a linear model seems to fit well" but not "The residuals are normal" or "The data come from a linear model".

Start with an introduction explaining what your data are, how you got them, and why they are interesting (1-2 minutes), then show your results as clearly as possible, with figures preferred (roughly 2 minutes), and then conclude (1 minute). In your conclusion, mention the limitations of your analysis and speculate about what might make a future analysis better, if you had infinite time. This might include collecting more data, or getting data on more variables, as well as more sophisticated statistical methods.

For your written reports, apply these same rules. Your project should be 5 pages or less of text, followed by as many figures or tables as you want. Have just the text in the beginning, and then the figures at the end. Do not worry about embedding the figures in the text. Email your pdf document to me, at frederic@stat.ucla.edu , by Sun, Dec 14, 11:59pm.

4. Note on hw4.

When I say in hw4 problem 2d "sample 100 pairs of observations (Xi, Yi) with replacement," I mean, if your dataset has length n, then let
b = sample(1:n, 100, rep=T)
and for each element i in b, take (Xi, Yi). The same index i is used for both X and Y, so the pairs stay together.

5. Compiling C.
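Before compiling the longer examples in these notes, it can help to confirm that the compiler and R can talk to each other at all. Here is a minimal sketch of a smoke test, assuming a hypothetical file called addone.c (the file and the function addone are made up for illustration and are not part of the course materials).

```c
/* Hypothetical smoke test for the R-to-C toolchain.
   Adds 1 to each of the n entries of x, in place.
   Both arguments are pointers because that is how .C() passes them. */
void addone(int *n, double *x)
{
    int i;
    for (i = 0; i < *n; i++) {
        x[i] += 1.0;
    }
}
```

In R, after system("R CMD SHLIB addone.c") and dyn.load("addone.so") (the file is addone.dll on Windows), the call .C("addone", as.integer(3), x = as.double(c(1,2,3)))$x should return 2 3 4. If that works, the toolchain is set up.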
For compiling C in Windows, these links might be useful:
http://www.stat.columbia.edu/~gelman/stuff_for_blog/AlanRPackageTutorial.pdf .
https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-include-compiled-C-code_003f .

From a former student:
"Some of us were working at getting R and C working, and I found a solution you may find useful. When we were trying to run R CMD SHLIB, an error came back to the effect of 'gcc-4.2 file not found'. In this case, the executable was just /usr/bin/gcc, so we fixed it with a symlink by running the command 'sudo ln -s /usr/bin/gcc /usr/bin/gcc-4.2' at the terminal."

6. Approximate pi in C.

In mypi.c,

#include <R.h>
#include <Rmath.h>
void pi2 (int *n, double *y){
    int i;
    double x[*n];  /* x[i] = partial sum 1 + 1/4 + ... + 1/(i+1)^2 */
    x[0] = 1.0;
    y[0] = sqrt(6.0);
    for(i = 1; i < *n; i++) {
        x[i] = x[i-1] + 1.0 / ((i+1.0)*(i+1.0));
        /* or x[i] = x[i-1] + 1.0 / pow(i+1.0,2.0); */
        y[i] = sqrt(6.0 * x[i]);  /* approximates pi, since the sum tends to pi^2/6 */
    }
}

In R,

## set working directory to the one containing mypi.c.
system("R CMD SHLIB mypi.c")
dyn.load("mypi.so")
pi3 = function(n){
    .C("pi2",as.integer(n), y = double(n))
}
b = pi3(1000000)
b$y[1000000]

Note that you have to be incredibly careful in C when doing arithmetic between integers and non-integers. If instead of
1.0 / ((i+1.0)*(i+1.0));
you use only integers, as in
1 / ((i+1)*(i+1));
then C does integer division, which gives 0 for every i >= 1, and (i+1)*(i+1) overflows an int for large i. Similarly,
1.0 / (i+1.0)^2;
does not mean what it means in R. In C, ^ is not exponentiation. It is a bitwise operator meaning "XOR", i.e. X^Y = 1 if X=1 or Y=1 but not both, applied bit by bit to integers. Use pow() or multiply explicitly instead.

7. dnorm in C.

You can access C versions of many basic R functions, including for instance dnorm(), rnorm(), etc. The syntax in C of dnorm is
double dnorm(double x, double mu, double sigma, int give_log)

In mydn.c,

#include <R.h>
#include <Rmath.h>
void norm2 (int *n, double *upper, double *bw, double *y){
    int i;
    double x, inc;
    x = -1.0 * *upper;         /* start of the grid, at -upper */
    inc = 2.0 * *upper / *n;   /* grid spacing */
    for(i = 0; i < *n; i++) {
        /* standard normal density at x/bw;
           divide by *bw to get the N(0, bw^2) density itself */
        y[i] = dnorm(x / *bw, 0,1,0);
        x += inc;
    }
}

## I stopped here last time.
In R,

system("R CMD SHLIB mydn.c")
dyn.load("mydn.so")
norm3 = function(n, u, b){
    d = .C("norm2", as.integer(n), as.double(u), as.double(b), y = double(n))
    d$y
}
b = 12.4
n = 100000
u = 5*b
a = norm3(n,u,b)
title2 = paste("normal density with sd ", as.character(b))
plot(seq(-u,u,length=n), a, type="l", main=title2,xlab="x", ylab="f(x)")

8. Sum of squared differences between observations in C.

In sumsq.c,

#include <R.h>
#include <Rmath.h>
void ss2 (double *x, int *n, double *y)
/* x will be the vector of data of length n,
   and y will be a vector of length n, where y[i] is the sum of
   squared differences from obs i to the other n-1 observations. */
{
    int i,j;
    double a;
    for(i = 0; i < *n; i++){
        a = 0.0;
        for(j=0; j < *n; j++){
            a += pow(x[i] - x[j], 2);  /* the j = i term contributes 0 */
        }
        y[i] = a;
    }
}

In R,

system("R CMD SHLIB sumsq.c")
dyn.load("sumsq.so")
sum3 = function(data2){
    n = length(data2)
    a = .C("ss2", as.double(data2), as.integer(n), y=double(n))
    a$y ## or equivalently a[[3]]
}
b = c(1,3,4)
sum3(b)

n = c(100, 1000, 2000, 3000, 5000, 7000, 8000, 10000)
t2 = rep(0,8)
for(i in 1:8){
    b = runif(n[i])
    timea = Sys.time()
    d = sum3(b)
    timeb = Sys.time()
    ## force the units to seconds, so the times are comparable
    t2[i] = as.numeric(difftime(timeb, timea, units="secs"))
    cat(n[i]," ")
}
par(mfrow=c(1,2))
plot(n,t2,ylab="time (sec)")

## Now try the same thing in R, without C.
sum4 = function(data2){
    n = length(data2)
    x = rep(0,n)
    for(i in 1:n){
        for(j in 1:n){
            x[i] = x[i] + (data2[i] - data2[j])^2
        }
    }
    x
}
b = c(1,3,4)
sum4(b)

## Note the much smaller values of n here than in the C version.
n = c(1:8)*100
t3 = rep(0,8)
for(i in 1:8){
    b = runif(n[i])
    timea = Sys.time()
    d = sum4(b)
    timeb = Sys.time()
    t3[i] = as.numeric(difftime(timeb, timea, units="secs"))
    cat(n[i]," ")
}
plot(n,t3,ylab="time (sec)")
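Both versions above use a double loop, so the running time grows like n^2. As a side note, the same quantity can be computed in time proportional to n by expanding the square: the sum over j of (x[i] - x[j])^2 equals n*x[i]^2 - 2*x[i]*S1 + S2, where S1 is the sum of the x[j] and S2 is the sum of the x[j]^2. Here is a minimal sketch of that idea in C. The name ss_fast is hypothetical and not part of the course code, and it is written in plain C with the same pointer arguments as ss2 so that it could be called through .C() in the same way. It is meant as an alternative for comparison rather than a replacement, and for data with very large magnitudes the expanded formula can lose some precision relative to the double loop.

```c
/* Hypothetical O(n) alternative to ss2: same arguments, same output.
   Uses sum_j (x[i]-x[j])^2 = n*x[i]^2 - 2*x[i]*S1 + S2,
   where S1 = sum of x[j] and S2 = sum of x[j]^2. */
void ss_fast(double *x, int *n, double *y)
{
    int i;
    double s1 = 0.0, s2 = 0.0;

    /* first pass: accumulate the two sums */
    for (i = 0; i < *n; i++) {
        s1 += x[i];
        s2 += x[i] * x[i];
    }
    /* second pass: fill in the answer for each observation */
    for (i = 0; i < *n; i++) {
        y[i] = (*n) * x[i] * x[i] - 2.0 * x[i] * s1 + s2;
    }
}
```

With the data c(1,3,4) used above, this gives 13 5 10, the same as the double loop: for the first observation, (1-3)^2 + (1-4)^2 = 13.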