0. Announcements and projects.
1. Logistic regression in R.
2. Nonlinear regression in R.
3. C++ in R.
4. Big data in R.

0. Announcements and projects.

If anyone wants to switch times, let me know. See lecture8notes.txt for the order.

Some sample projects are on the course website, just to give you examples. Don't follow them closely, but notice the enthusiasm, the attention to detail, the identification of outliers, and the interpretation of results. A great idea for your project would be to take some regression-style data, fit a generalized additive model by kernel regression (and/or perhaps fit terms that look linear or quadratic by linear or quadratic functions), and use C to get confidence intervals for your fitted curves.

This helped people on Windows with compiling C.
Sys.setenv(BINPREF="C:/rtools40/mingw64/bin/")

On hw4, when you need to compute an approximation of an integral, you can simply add up the areas of tiny rectangles, or use Simpson's rule if you prefer. Either way is fine with me.

1. Logistic regression in R.

Suppose your response variable Y is 0 or 1 for all observations. Normally for a regression model, Y = X beta + epsilon, where the epsilons are iid normal with mean 0, which is obviously unreasonable here since Y is always 0 or 1. So instead it makes more sense to model P(Y=1) = X beta + epsilon, but that has problems too, because X beta + epsilon can be negative, or greater than 1.

By the way, by X beta I am thinking of X as a matrix whose first column is all 1's, where each other column is a covariate, there are p-1 covariates, and beta = (beta_0, beta_1, ..., beta_{p-1}). Thus X beta = beta_0 + beta_1 X_1 + beta_2 X_2 + ... + beta_{p-1} X_{p-1}.

So, instead of modeling P(Y=1) = X beta + epsilon, one models p = P(Y=1) via X beta = f(p), or equivalently p = f^{-1}(X beta), where f is called a link function. Note, by the way, that when Y is 0 or 1, E(Y) = P(Y=1), so we could also write this as E(Y) = f^{-1}(X beta).

Often people use the logistic function, f^{-1}(x) = exp(x)/[1+exp(x)], and this corresponds to logistic regression. Here E(Y) = e^{X beta} / [1 + e^{X beta}]. Regardless of X beta, f^{-1}(X beta) will then be between 0 and 1. The link f is then the logit function, i.e. X beta = f(p) = log{p/(1-p)}, and log(p/(1-p)) is called the log odds.

The logistic regression model is that the Y_i are independent Bernoulli(p_i) random variables, where the vector p = e^{X beta} / [1 + e^{X beta}]. Logistic regression is an example of Generalized Linear Modeling (GLM), where one models g(E(Y)) = X beta for some link function g, and where Y has some modelled distribution.

#install.packages("UsingR")
library(UsingR)
data(babies)
This dataset accompanies Stat Labs: Mathematical Statistics through Applications, Springer-Verlag (2001), by Deborah Nolan and Terry Speed; see www.stat.berkeley.edu/users/statlabs/labs.html .
babies[1,]
bab2 = subset(babies, subset = gestation < 999 & wt1 < 999 & ht < 99 &
    smoke < 9 & age < 99 & number < 98)
## A baby is considered premature if the gestation is < 37 weeks.
prem = as.numeric(bab2$gest < 37*7)
smoke = as.numeric(bab2$smoke > 0)
weight = bab2$wt1    ## mother's weight at conception
height = bab2$ht     ## mother's height
age = bab2$age       ## mother's age
num = bab2$number    ## baby number
bmi = weight/2.2 / (height*.0254)^2
hist(bmi, nclass=20)
fit1 = glm(prem ~ smoke + age + bmi, family = binomial)
summary(fit1)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.610895   0.820464  -4.401 1.08e-05 ***
smoke        0.205526   0.217915   0.943    0.346
age          0.008197   0.018269   0.449    0.654
bmi          0.038270   0.030897   1.239    0.215
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
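As a quick check of the link relationship E(Y) = e^{X beta} / [1 + e^{X beta}], here is a small sketch (eta, p1 and p2 are names introduced just for this illustration): the fitted probabilities from predict(fit1, type="response") should match the inverse logit of the fitted linear predictor X beta.

eta = predict(fit1)                  ## the fitted linear predictor, X beta-hat
p1 = predict(fit1, type="response")  ## fitted probabilities P(Y=1)
p2 = exp(eta) / (1 + exp(eta))       ## the inverse logit, computed by hand
range(p1 - p2)                       ## essentially zero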
## Nothing is significant, but the estimate for smoking is large.
## According to the fitted model, the log odds increase by 0.2055
## when the mother smokes.
mean(prem)           ## 0.082, so a typical birth has probability 8.2% of being premature.
mean(prem[smoke==0]) ## 7.4%
mean(prem[smoke>0])  ## 8.7%

How can we interpret this 0.2055?

p = 0.082
q = 1-p
log(p/q)   ## -2.415478, so the log odds for a typical birth are -2.415.
## Now add 0.2055.
r = -2.415 + .2055
exp(r)/(1+exp(r))   ## 9.9%.
## So a typical woman's chance of premature delivery is 8.2%, and,
## according to the model, changing her from a nonsmoker to a smoker
## would increase this to 9.9%.
## But beware of inferring causation. And the coefficient is not
## statistically significant.

You can also fit more complex models and pick the best-fitting one by minimizing AIC.

fit1 = glm(prem ~ (smoke + age + num + bmi)^2, family = binomial)
summary(fit1)
library(MASS)
stepAIC(fit1)

2. Nonlinear regression in R.

Instead of Y = X beta + eps, what if you want to fit Y = f(x) + eps? You might have, for instance, an exponential relationship between Y and X, like Y_i = beta_0 e^{beta_1 x_i} + epsilon_i. You can use nls() in R to fit such models:

fit1 = nls(formula, start = ..., trace = FALSE)

The model formula is a little different in nls() than in lm(). Again you use y ~ mean, but you express the mean using ordinary math notation. For instance, you can write y ~ n * exp(-b*(t-t0)), where t is your data and n, b, and t0 are parameters to be estimated. It's often nice to write your own function f to compute the mean, and then you can just write the formula as y ~ f(t,n,b,t0).

nls() works by starting at some initial guess of the parameter values, supplied in start = ..., ideally somewhat close to the true values. It then moves around in each direction using Newton-Raphson type methods until it finds the smallest sum of squared residuals it can. trace = TRUE is useful if you want to see each step; FALSE is the default. You can sometimes find reasonable starting values by overlaying trial curves on the data with curve(..., add=TRUE).

Yellowfin tuna example.

library(UsingR)
data(yellowfin)
number = yellowfin$count
yr = yellowfin$year
plot(yr, number)
f = function(t,n,r,d){ n * (exp(-r*(t-1952))*(1-d) + d) }
curve(f(x, n=6, r=1/10, d=.1), add=TRUE, col="BLUE")
fit1 = nls(number ~ f(yr,n,r,d), start=c(n=6, r=1/10, d=.1))
fit1
par1 = summary(fit1)$par[,1]
curve(f(x, par1[1], par1[2], par1[3]), add=TRUE, col="RED")
legend("top", legend=c("initial","fitted exp."), lty=c(1,1),
    col=c("BLUE","RED"))
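The project suggestion above mentions confidence intervals for fitted curves. As a small related sketch (not part of the original example), approximate 95% Wald intervals for the nls parameter estimates can be formed from the estimates and standard errors reported by summary(fit1); est and se are names introduced here.

est = summary(fit1)$parameters[,1]   ## parameter estimates
se  = summary(fit1)$parameters[,2]   ## standard errors
cbind(lower = est - 1.96*se, upper = est + 1.96*se)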
## install.packages("Rcpp") library(Rcpp) cppFunction('int add(int x, int y, int z) { int sum = x + y + z; return sum; }') add(1,5,2) As with C, You must declare the type of output the function returns. This function returns an int (a scalar integer). The classes for the most common types of R vectors are: NumericVector, IntegerVector, CharacterVector, and LogicalVector. Scalars and vectors are different. The scalar equivalents of numeric, integer, character, and logical vectors are double, int, String, and bool. Each vector type has a matrix equivalent, NumericMatrix, IntegerMatrix, CharacterMatrix, and LogicalMatrix. Using them is straightforward. Like C, you must use an explicit return statement to return a value from a function, and you have to use = for assignment, not <-. In C++, loops are much faster than in R. mymean = function(x){ y = 0 n = length(x) for(i in 1:n){ y = y + x[i] } y/n } Now here it is in C++. cppFunction('double mymeanC(NumericVector x) { int n = x.size(); double total = 0; for(int i = 0; i < n; ++i) { total += x[i]; } return total / n; }') x = runif(50000000) mymean(x) mymeanC(x) To find the length of the vector, we use the .size() method, which returns an integer. C++ methods are called with . Note that we could say int i inside the for loop declaration. You can't do this in traditional C, but in C++ it's ok. Now here is an example with vector input and vector output. Suppose we want a function to compute the squared difference from a particular number x and a vector y. In R it is simple. pdiffR = function(x, y) { (x - y) ^ 2 } In C++ it would be more complex but twice as fast. cppFunction('NumericVector pdiffC(double x, NumericVector y) { int n = y.size(); NumericVector out(n); for(int i = 0; i < n; ++i) { out[i] = pow(y[i] - x, 2.0); } return out; }') x = 3.2 y = 1:10 pdiffR(x,y) pdiffC(x,y) Matrices are dealt with just as in C, but in C++, you subset a matrix with (), not []. cppFunction('NumericVector rowSumsC(NumericMatrix x) { int nrow = x.nrow(), ncol = x.ncol(); NumericVector out(nrow); for (int i = 0; i < nrow; i++) { double total = 0; for (int j = 0; j < ncol; j++) { total += x(i, j); } out[i] = total; } return out; }') x = matrix(1:100, ncol=10) rowSums(x) rowSumsC(x) You can use .nrow() and .ncol() methods to get the dimensions of a matrix. ## SIMPLE EXAMPLE OF OBJECT ORIENTED SYNTAX IN C++. 
## SIMPLE EXAMPLE OF OBJECT ORIENTED SYNTAX IN C++.
From https://www3.ntu.edu.sg/home/ehchua/programming/cpp/cp3_OOP.html . The Circle class definition is passed to cppFunction through its includes argument, and the exported function add1() then constructs a few Circle instances. Rcout is Rcpp's version of cout, so the output appears in the R console.

cppFunction(includes = '
/* The Circle class (All source code in one file) (CircleAIO.cpp) */
#include <string>          // using string
using namespace std;

class Circle {
private:
   double radius;          // Data member (Variable)
   string color;           // Data member (Variable)
public:
   // Constructor with default values for data members
   Circle(double r = 1.0, string c = "red") {
      radius = r;
      color = c;
   }
   double getRadius() {    // Member function (Getter)
      return radius;
   }
   string getColor() {     // Member function (Getter)
      return color;
   }
   double getArea() {      // Member function
      return radius*radius*3.1416;
   }
};  // need to end the class declaration with a semi-colon
',
code = 'int add1(int n) {
   // Construct a Circle instance
   Circle c1(1.2, "blue");
   Rcout << "Radius=" << c1.getRadius() << " Area=" << c1.getArea()
         << " Color=" << c1.getColor() << endl;
   // Construct another Circle instance
   Circle c2(3.4);   // default color
   Rcout << "Radius=" << c2.getRadius() << " Area=" << c2.getArea()
         << " Color=" << c2.getColor() << endl;
   // Construct a Circle instance using the default no-arg constructor
   Circle c3;        // default radius and color
   Rcout << "Radius=" << c3.getRadius() << " Area=" << c3.getArea()
         << " Color=" << c3.getColor() << endl;
   return 0;
}')
add1(1)

4. Big data in R.

Much of this is thanks to help from Xiaojuan Hao and Ryan Rosario.

#install.packages("bigmemory")
#install.packages("biganalytics")
#install.packages("bigtabulate")
#install.packages("bigalgebra")
library(bigmemory)
library(biganalytics)
library(bigtabulate)

Regardless of the number of cores on your CPU, R will only use 1 on a default build. R reads data into memory by default. Other packages like SAS, and other programming languages, can read data from files on demand. It is easy to exhaust RAM by storing unnecessary data. The OS and system architecture can only address 2^32 bytes = 4GB of memory on a 32-bit system. It is not wise to use more memory than is available: the system will start swapping, which slows it to a crawl. Solutions are to add RAM, use a database, sample a bit of your data at a time rather than using all of it at once, or use packages designed especially for big data.

There are several R packages for High Performance Computing (HPC). Some do explicit parallelism, where the user controls the parallelization. Others do implicit parallelism, where the system abstracts it away. Large-memory packages allow working with large amounts of data without choking R. Map/Reduce is basically an abstraction for parallelism that divides data rather than architecture. For datasets too big to load into R, one can use the R packages bigmemory and ff.

The word RAM refers to physical memory, chips installed on the system motherboard. Virtual memory, or swap, is disk space used to store objects that do not fit into RAM and are less frequently accessed; it is slow. A cluster is a group of systems that communicate with each other to accomplish a computation.

R typically reads data into RAM all at once, and objects in R live entirely in memory, so keeping unnecessary data in RAM will eventually cause R to choke. On an older 32-bit system, a single process often cannot use more than about 2GB of memory. If you rely on virtual memory, the system will grind to a halt; this is called thrashing.
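To get a feel for how quickly ordinary R objects eat RAM, here is a small illustration using object.size(); the matrices are throwaway examples, not part of the analysis below.

x = matrix(0, 5000, 5000)    ## a 5000 x 5000 matrix of doubles
object.size(x)               ## about 200 million bytes, roughly 200 MB
y = matrix(0L, 5000, 5000)   ## the same matrix stored as integers
object.size(y)               ## about half that
rm(x, y)
gc()                         ## release the memory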
bigmemory is part of the "big" family of packages, which is useful for loading large datasets and doing simple manipulations on them. It is most useful when the bulk of your actual analysis is performed in C, and all you need to do in R is simple stuff, and perhaps more complex tasks on subsets of your data. ff allows file-based access to datasets that simply do not fit into memory. One can also use a database and read and write into the database for piecemeal analysis.

The big family consists of several packages for performing tasks on large datasets. I'll talk mostly about bigmemory.
- biganalytics provides analysis routines on big.matrix, such as GLM and bigkmeans.
- synchronicity adds Boost mutex functionality to R.
- bigtabulate adds table- and split-like support for R matrices and big.matrix, memory efficiently.
- bigalgebra provides BLAS and LAPACK linear algebra routines for native R matrices and big.matrix.
- bigvideo provides video camera streaming via OpenCV.

bigmemory implements several matrix objects. big.matrix is an R object that simply points to a data structure in C++. shared.big.matrix is similar, but can be shared among multiple R processes. filebacked.big.matrix does not point to a data structure in memory; it points to a file on disk containing the matrix, and the file can be shared across a cluster. One potential pitfall is that these matrices can contain only one type of data, and the data types for elements are dictated by C++, not R: double, integer, short, char.

First, construct a big.matrix object. Let us suppose we want to create a large matrix of 0s and 1s. We can construct a matrix of type int (4 bytes), short (2 bytes), double (8 bytes), or char (1 byte). Since all we need is 0 and 1, we use char, and we zero out the matrix. We have then created a pointer to a C++ matrix that is on disk. To share this matrix with another R process, we share its descriptor: we can open a second R session, load the location of the matrix from disk, and bind the matrix to an R variable. (A sketch of this appears at the end of this part.)

For a 5000 by 5000 matrix, the memory required is roughly:
char     24 MB
short    48 MB
int      96 MB
double  192 MB

read.big.matrix inherits from read.table, so the syntax and inputs are similar. It also adds a few more arguments: type, the C++ type to use for the matrix; separated, which separates the columns into individual files if TRUE; extraCols, which explicitly adds columns to the matrix that you may use later; and backingfile, backingpath and descriptorfile, which control where important data about the matrix is stored on disk.

Simulate a large dataset with 10 million rows and 5 columns.

mydata = matrix(c(NA), nrow=10^7, ncol=5)
set.seed(12345)
mydata[,1] = sample(c(1:170), 10^7, replace = TRUE)
mydata[,2] = sample(c(1:470179), 10^7, replace = TRUE)
mydata[,3] = sample(c(1:5), 10^7, replace = TRUE)
mydata[,4] = sample(c(1999:2005), 10^7, replace = TRUE)
mydata[,5] = rpois(10^7, lambda = 5 + 1.27*mydata[,1] + 2.0*(mydata[,4]==2002))
write.table(mydata, file = "example.txt", sep = " ", row.names = F, col.names = F)

The file is 213.9MB.

#x = read.table("example.txt", colClasses = "integer", header=F,
#    col.names = c("rating", "customer", "promotion", "year", "month"))
## This would take several minutes every time.

x = read.big.matrix("example.txt", header = F, type = "integer", sep = " ",
    backingfile = "data.bin", descriptorfile = "data.desc",
    col.names = c("month", "customer", "promotion", "year", "rating"),
    shared = TRUE)

The next time we read in the data it will be super fast.

datadesc = dget("data.desc")
x = attach.big.matrix(datadesc)
head(x)
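Here is the sketch promised above of constructing and sharing the 5000 by 5000 matrix of 0s and 1s. The filenames zeros.bin and zeros.desc are made up for the illustration.

A = big.matrix(5000, 5000, type = "char", init = 0,
    backingfile = "zeros.bin", descriptorfile = "zeros.desc")
A[1,1] = 1    ## modify an entry
A[1,1:5]

## In a second R session, load the descriptor and bind the same matrix
## to an R variable:
library(bigmemory)
Adesc = dget("zeros.desc")
A2 = attach.big.matrix(Adesc)
A2[1,1:5]     ## sees the 1 written by the first session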
biganalytics provides a wrapper to the biglm package.

library(biganalytics)
blm = biglm.big.matrix(rating ~ month + customer + promotion + year, data=x)
summary(blm)
another = biglm.big.matrix(rating ~ month + customer + promotion + year,
    data=x, fc="year")   ## year is treated as a factor now
summary(another)

Without bigmemory, this process would take several GB of RAM. Most systems do not have that much RAM, so the process would never complete; with bigmemory it takes less than a minute. Just as with lm, we get information about the fitted model with summary.

Here are some other features of bigmemory. You can write a big.matrix to an ASCII file using write.big.matrix. You can bin data for counting, creating histograms or visualizations using binit. You can create a copy of the contents using deepcopy. You can search a matrix using mwhich, which includes fast C comparison operators. Columns of a matrix can be stored separately in RAM rather than contiguously (the separated option). Note that passing a big.matrix to a function is call by reference, not call by value. For more info, see www.bigmemory.org .

A really useful function is mwhich. It lets you extract rows or columns from your big matrix.

cust.indices.inefficient <- which(x[, "customer"] == as.integer(6))
cust.indices <- mwhich(x, "customer", 6, "eq")
these <- mwhich(x, c("customer", "rating"), list(6, 100000), list("eq","le"), "AND")
x[these,]

This extracts the rows where customer = 6 and rating <= 100000. The comparison choices are eq, neq, le, lt, ge and gt; le means less than or equal to, lt means less than, and so on. You can replace AND with OR.

ff is for fast access to files. It provides data structures that are stored on disk but act as if they are in memory: only the necessary/active parts of the data on disk are mapped into main memory. The ff package supports R's standard atomic types (double, logical, raw, integer) as well as non-standard atomic types like boolean (1 bit), quad (2 bit unsigned), nibble (4 bit unsigned), byte (1 byte signed with NA), ubyte (1 byte unsigned), short and ushort (2 byte signed with NA and unsigned, respectively), and single (4 byte float with NA). It also accepts non-atomic types such as factor, ordered, POSIXct, Date, etc.

The ff package provides the following functionality. ffopen creates and opens flat files; using the parameters length or dim will create a new file. Input and output can be done using the usual brackets [ ] notation. Other functions for ff objects, such as the usual dim and length, as well as some other useful functions such as sample, are included.

#install.packages("ff")
library(ff)
n = 10^7
movie = ff(vmode="integer", length = n)
movie[1:n] = 1:n
month = ff(vmode="double", length = n)
month[1:n] = x[,1]
customer = ff(vmode="double", length = n)
customer[1:n] = x[,2]
promotion = ff(vmode="double", length = n)
promotion[1:n] = runif(n)
year = ff(vmode="double", length = n)
year[1:n] = x[,4]
rating = ff(vmode="double", length = n)
rating[1:n] = x[,5]
fftogether <- ffdf(movie=movie, customer=customer, rating=rating, year=year, month=month)
## model <- biglm(rating ~ month + customer + year, data=fftogether)
## summary(model)

ff, ffm and to some degree ffdf support some other options you may need; R is usually pretty good at picking good defaults for you.
initdata: value used to initialize the object at construction.
length: optional length of the vector, used for construction.
vmode: virtual storage mode (makes a big difference in memory overhead).
filename: the name of the file created for the object.
overwrite: if TRUE, allows overwriting of existing file objects.

If no filename is given, a new file is created. If a filename is given and the file already exists, the object will be loaded into R.
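As a small illustration of these options (the file name zvec.ff is made up for the example), one might write:

z = ff(initdata = 0, length = 10^6, vmode = "single",
    filename = "zvec.ff", overwrite = TRUE)
z[1:5]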
Use the ffsave function to save a series of objects to file. The specified objects are saved and given the extension .RData. The ff files related to the objects are saved and zipped using an external zip application, and given the extension .ffData. ffsave has some useful parameters:

ffsave(..., list = character(0L), file = stop("'file' must be specified"),
    envir = parent.frame(), rootpath = NULL, add = FALSE, move = FALSE,
    compress = !move, compression_level = 6, precheck = TRUE)

... refers to the objects to be saved. The file parameter specifies the root name for the files; it is best to use absolute paths. add=TRUE adds these objects to an existing archive. compress=TRUE saves and compresses. safe=TRUE writes a temporary file first for verification, then moves it to a persistent location.

ffsave.image allows the user to save the entire workspace to an ffarchive. ffinfo, when passed the path of an ffarchive, displays information about the archive. ffdrop allows the user to delete an ffarchive.

bigmemory vs. ff: which one to use is largely a matter of taste, and their performance is similar.

MapReduce is a way of dividing a large job into many smaller jobs, each producing an output, and then combining the individual outputs into one output. It is a classic divide-and-conquer approach that is parallel and can easily be distributed among many cores of a CPU, or among many CPUs and nodes. It is patented by Google, its biggest user.

There are two fundamental steps.
1. map step: perform some operation f, in parallel, on the data, and output a key/value pair for each record/row.
2. reduce step: group the common elements and compute some summary statistic, one per group.

The notion of map and reduce comes from Lisp and functional programming, where map performs some operation on every element in a collection, and reduce collapses the results of that operation into one result for the collection. One implementation of this is Hadoop, which can be integrated with R via the HadoopStreaming package. For datasets with sizes in the range of 10GB, bigmemory and ff are effective; for larger datasets, Hadoop is superior.

There are also bigglm.big.matrix, bigkmeans, and apply for big matrices, by the way.

m = 10^5   ## m was not defined above; pick some number of rows to fit on
g = bigglm(rating[1:m] ~ month[1:m], data=x, family=poisson())
summary(g)

Coming up.
## Maximum likelihood estimation in C and R.
## Building R Packages.