## 1. Class will be on zoom Mon Nov 22.
## 2. Final project order.
## 3. Compiling C and calling it from R.
## 4. Approximating pi in C.
## 5. dnorm in C.
## 6. Sum of squared differences between observations in C.

1. Nov22.

Class will be on zoom Mon Nov 22. Use the zoom link
https://ucla.zoom.us/j/91509411456?pwd=aXNUMmhYRElBUzljeXdPMHExSkljZz09
Meeting ID: 915 0941 1456
Password: 235711

2. Final projects.

For your final projects, you will analyze some data using the methods we have talked about in class, including the methods used for the homeworks and also other methods we have discussed in class. You will write up your analysis in a written report, and will also make an oral presentation.

The presentation will be only 4 minutes each in total. No going over! I will cut you off at 4 minutes. However, I would like to take 1 quick question from the audience or me afterwards. You will use my computer for your presentation. Email me by 10pm the night before your presentation with your slides in pdf or ppt.

For the final 2 lectures, attendance is mandatory. Please do not interrupt with difficult questions, but clarifying questions are fine. Deeper questions should be asked after the presentation.

Your dataset, which you will find yourselves, on the web, can be anything you choose, but it should be:
a) regression style data, i.e. a response variable, and for each observation, a bunch (at least 2 or 3) of explanatory variables. You should have at the very least n=30 observations. One of the variables should be a sensible response variable you can imagine wanting to predict.
b) something of genuine interest to you and where you have some knowledge about the topic.

Analyze the data using the methods we have talked about in class, such as linear regression, kernel regression, univariate kernel density estimation, 2-d kernel density estimation, classification, quantile plots, or gam. You can do regression and also analyze each variable individually.
At least one component of your data analysis should be done in C.

Your final project should be submitted to me in pdf by email to frederic@stat.ucla.edu by Sat Dec11, 11:59pm. Note the address. Do not send them via ccle. They are all due the same date, regardless of when your oral presentation is.

I will now randomly assign people to presentation times. If you want to change oral presentation dates and times with another person, feel free but let me know. I will use sample().

setwd("/Users/rickpaikschoenberg/Desktop")
y = scan("roster.txt",what="char",sep="\n")
n = length(y)
w = sample(y)
for(i in 1:n){
    if(i == 1) cat("\n\n Mon, Nov29\n")
    if(i == 18) cat("\n\n Wed, Dec1 \n")
    cat(i,". ",w[i],"\n",sep="")
}

Mon, Nov29
1. ZHAI, ZHIQIAN
2. JEON, NANUM
3. WANG, QINGYANG
4. GUPTA, VIPUL
5. BAIERL, JOHN DUCHATEAU
6. ZHAO, MINGLU
7. SUNCHU, ROHIT
8. GAO, YINGQI
9. WANG, ANDREA
10. RODRIGUEZ SANCHEZ, SANTIAGO
11. ESCANDON VANEGAS, HECTOR
12. LACEY, ZACHARY
13. TSANG, DARREN
14. CHEONG, RYAN H
15. KIM, NICKLAUS JUN
16. GHODSI, SAEED
17. PARAB, SHARDUL

Wed, Dec1
18. MONDAL, RAJDEEP
19. SWOVELAND, JACOB MICHAEL
20. TAO, LAN
21. BERLIND, DAVIS S
22. FENG, TIANYING
23. O'DELL, RYAN JOSEPH
24. WANG, YUXIN
25. ALSAADOUN, ABDULAZIZ SAADOUN
26. STECHER, ELAYNE
27. CHO, SOONHONG
28. DAHAL, LAXMAN
29. MCEVOY, KYLE ROBERTS
30. PHILLIPS, SOPHIE
31. SUVARNA, ASHIMA
32. AVETISYAN, ROZETA
33. MITRA, ROUHIN
34. TOLEDO LUNA, JOSE RODRIGO
35. SINGH, ABHIMANYU

If you would like to switch let me know. Maybe someone will want to switch with you.

I put some sample projects and presentation powerpoints from previous students on the course website in the folder sampleprojects.

Give us a sense of your data. Assume that the listener knows what the statistical methods you are using are. Tell us what they say about your data. Emphasize the results more than the methods. Go slowly in the beginning so that the listener really understands what your data are. Speculate and generalize but use careful language.
Say "It seems" or "appears" rather than "is" when it comes to speculative statements or models. For example, you might say "The residuals appear approximately normal" or "a linear model seems to fit well" but not "The residuals are normal" or "The data come from a linear model".

Start with an introduction explaining what your data are, how you got them, and why they are interesting (1 minute), then show your results as clearly as possible, with figures preferred (roughly 2 minutes), and then conclude (1 minute). In your conclusion, you might mention what the main thing is you have learned about your data, and the limitations of your analysis, and speculate about what might make a future analysis better, if you had infinite time. This might include collecting more data, or getting data on more variables, as well as more sophisticated statistical methods.

For your written reports, apply these same rules. Your project should be 5 pages or less of text, followed by as many figures or tables as you want. Have just the text in the beginning, and then the figures at the end. Do not worry about embedding the figures in the text. Email your pdf document to me, at frederic@stat.ucla.edu , by Sat Dec11, 11:59pm.

3. Compiling C and calling it from R.

The base R comes with a C compiler if you have version 2.1.3 or later. There are also other C compilers you can use, like XCode, for Mac OSX. It includes a C and C++ compiler, among other tools. Another common one is GCC. There are many free ones for PCs. See for instance
https://www.thoughtco.com/list-of-free-c-compilers-958190 .
But for most of you, if you download R to your computer, it should be fine.

The first step to writing C code is opening a text editor. Write your C code, call it something.c, then compile it, to create an object file called something.so. Then you can load that into R.

Hello world.
Create a C function to print "Hello world!" and call this function n times in R.
In a text editor, create a C file.
Say it's called hello.c. The file looks like:

#include <R.h>
#include <Rmath.h>

/* Start a comment.
   Continue your comment. */

void hello (int *n)
{
    int i,j;
    double a2;
    a2 = 4.5;
    j = 3;
    for(i = 0; i < *n; i++)
        Rprintf("Hello world number %d . The integral is %f .\n", i, a2);
}

PUT THE FILE HELLO.C IN YOUR WORKING R DIRECTORY, OR MAKE YOUR CURRENT R DIRECTORY THE FOLDER CONTAINING HELLO.C.

In UNIX, in the same directory where hello.c is, type
R CMD SHLIB hello.c
or, in R, do
system("R CMD SHLIB hello.c")

In R, in the same directory, do
dyn.load("hello.so")
hello2 = function(n){
    .C("hello",as.integer(n))
}
y = hello2(10)

For compiling C in Windows, these links might be useful:
http://www.stat.columbia.edu/~gelman/stuff_for_blog/AlanRPackageTutorial.pdf .
https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-include-compiled-C-code_003f .

From a former student:
"Some of us were working at getting R and C working, and I found a solution you may find useful. When we were trying to run R CMD SHLIB, an error came back to the effect of 'gcc-4.2 file not found'. In this case, the executable was just /usr/bin/gcc, so we fixed it with a symlink by running the command 'sudo ln -s /usr/bin/gcc /usr/bin/gcc-4.2' at the terminal."

4. Approximate pi in C.

In mypi.c,

#include <R.h>
#include <Rmath.h>

void pi2 (int *n, double *y)
{
    int i;
    double x;   /* running sum of 1/k^2. A scalar is enough here, and it avoids putting a length-n array on the stack. */
    x = 1.0;
    y[0] = sqrt(6.0);
    for(i = 1; i < *n; i++) {
        x = x + 1.0 / ((i+1.0)*(i+1.0));
        /* or x = x + 1.0 / pow(i+1.0,2.0); */
        y[i] = sqrt(6.0 * x);
    }
}

In R,
## set working directory to the one containing mypi.c.
system("R CMD SHLIB mypi.c")
dyn.load("mypi.so")
pi3 = function(n){
    .C("pi2",as.integer(n), y = double(n))
}
b = pi3(1000000)
b$y[1000000]

Note that you have to be incredibly careful in C when doing arithmetic between integers and non-integers. If instead of
1.0 / ((i+1.0)*(i+1.0));
you do
1 / ((i+1)*(i+1));
the division is done entirely in integers, so the result is 0 for every i >= 1 (and the product can overflow for large i). And if you do
1.0 / (i+1.0)^2;
nothing gets squared, and C will not even compile it when the operand is a double. ^ is a bitwise operator meaning "XOR", i.e.
X^Y = 1 if X=1 or Y=1 but not both.

5. dnorm in C.

You can access C versions of many basic R functions, including for instance dnorm(), rnorm(), etc. The syntax in C of dnorm is
double dnorm(double x, double mu, double sigma, int give_log)

In mydn.c,

#include <R.h>
#include <Rmath.h>

void norm2 (int *n, double *upper, double *bw, double *y)
{
    int i;
    double x, inc;
    x = -1.0 * *upper;
    inc = 2.0 * *upper / *n;
    for(i = 0; i < *n; i++) {
        /* standard normal density evaluated at x/bw. Divide by *bw as well
           if you want the N(0, bw^2) density that integrates to 1. */
        y[i] = dnorm(x / *bw, 0,1,0);
        x += inc;
    }
}

In R,
system("R CMD SHLIB mydn.c")
dyn.load("mydn.so")
norm3 = function(n, u, b){
    d = .C("norm2", as.integer(n), as.double(u), as.double(b), y = double(n))
    d$y
}
b = 12.4
n = 100000
u = 5*b
a = norm3(n,u,b)
title2 = paste("normal density with sd ", as.character(b))
plot(seq(-u,u,length=n), a, type="l", main=title2,xlab="x", ylab="f(x)")

6. Sum of squared differences between observations in C.

In sumsq.c,

#include <R.h>
#include <Rmath.h>

void ss2 (double *x, int *n, double *y)
/* x will be the vector of data of length n,
   and y will be a vector of squared differences from obs i
   to the other n-1 observations. */
{
    int i,j;
    double a;
    for(i = 0; i < *n; i++){
        a = 0.0;
        for(j=0; j < *n; j++){
            a += pow(x[i] - x[j], 2);
        }
        y[i] = a;
    }
}

In R,
system("R CMD SHLIB sumsq.c")
dyn.load("sumsq.so")
sum3 = function(data2){
    n = length(data2)
    a = .C("ss2", as.double(data2), as.integer(n), y=double(n))
    a$y ## or equivalently a[[3]]
}
b = c(1,3,4)
sum3(b)

n = c(100, 1000, 2000, 3000, 5000, 7000, 8000, 10000)
t2 = rep(0,8)
for(i in 1:8){
    b = runif(n[i])
    timea = Sys.time()
    d = sum3(b)
    timeb = Sys.time()
    ## force seconds, since a bare difference of times can switch to minutes
    t2[i] = as.numeric(difftime(timeb, timea, units="secs"))
    cat(n[i]," ")
}
par(mfrow=c(1,2))
plot(n,t2,ylab="time (sec)",main="C")

## Now try the same thing in R, without C.
sum4 = function(data2){
    n = length(data2)
    x = rep(0,n)
    for(i in 1:n){
        for(j in 1:n){
            x[i] = x[i] + (data2[i] - data2[j])^2
        }
    }
    x
}
b = c(1,3,4)
sum4(b)

n = c(1:8)*100
t3 = rep(0,8)
for(i in 1:8){
    b = runif(n[i])
    timea = Sys.time()
    d = sum4(b)
    timeb = Sys.time()
    ## force seconds, since a bare difference of times can switch to minutes
    t3[i] = as.numeric(difftime(timeb, timea, units="secs"))
    cat(n[i]," ")
}
plot(n,t3,ylab="time (sec)",main="R")
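
A side note on the sum of squared differences. Both ss2 and sum4 loop over all n^2 pairs of observations, so the work grows like n^2 in either language. If m is the mean of the data, then the sum over j of (x[i] - x[j])^2 equals n*(x[i] - m)^2 plus the sum over j of (x[j] - m)^2, because the cross term vanishes. So the same vector can be computed in three passes over the data. Here is a rough sketch of that idea in C. The name ss_fast is made up for illustration and is not part of the code above; I kept the same argument layout as ss2 on the assumption that it would be called from R through .C() in the same way.

```c
/* Hypothetical O(n) alternative to ss2 (name and file are illustrative only).
   x is the data vector of length *n; on return y[i] holds the sum over j of
   (x[i] - x[j])^2, computed through the identity
       sum_j (x[i] - x[j])^2 = n * (x[i] - m)^2 + sum_j (x[j] - m)^2,
   where m is the mean of x. */
void ss_fast (double *x, int *n, double *y)
{
    int i;
    double m = 0.0, ss = 0.0, d;

    if (*n <= 0) return;          /* nothing to do for an empty vector */

    /* pass 1: the mean */
    for(i = 0; i < *n; i++) m += x[i];
    m /= (double)(*n);

    /* pass 2: total sum of squared deviations from the mean */
    for(i = 0; i < *n; i++) {
        d = x[i] - m;
        ss += d * d;
    }

    /* pass 3: fill in the answer for each observation */
    for(i = 0; i < *n; i++) {
        d = x[i] - m;
        y[i] = (double)(*n) * d * d + ss;
    }
}
```

If you save this in a file of your own, compile it with R CMD SHLIB and call it with .C() the way sum3 calls ss2, it should agree with sum3 up to rounding error. For b = c(1,3,4) the answer is 13, 5, 10. Centering at the mean first, rather than expanding the square around 0, is a choice I made to limit cancellation error when the data are far from zero.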