1. Big data in R.
2. Rcpp.
3. Running Python inside R.
4. Maximum likelihood estimation in C and R.
5. Hawkes point processes and MLE.

-- The order of oral reports is in revisedorder.txt.
-- Email me your slides, ideally as pdf or ppt, the day before your presentation. You will use my computer for your presentation.

1. Big data in R.

Much of this is thanks to help from Xiaojuan Hao and Ryan Rosario.

#install.packages("bigmemory")
#install.packages("biganalytics")
#install.packages("bigtabulate")
#install.packages("bigalgebra")
library(bigmemory)
library(biganalytics)
library(bigtabulate)
library(bigalgebra)

Regardless of the number of cores on your CPU, a default build of R will only use one of them. R also reads data into memory by default, whereas other packages such as SAS, and many programming languages, can read data from files on demand. It is easy to exhaust RAM by storing unnecessary data, and on a 32 bit system the OS and system architecture can only address 2^32 bytes = 4GB of memory. It is not wise to use more memory than is available: the system will start swapping, which slows your system to a crawl. Solutions are to add RAM, use a database, sample a bit of your data at a time rather than using all of it at once, or use packages designed especially for big data.

There are several R packages for High Performance Computing (HPC). Some do Explicit Parallelism, where the user controls the parallelization. Implicit Parallelism: the system abstracts the parallelization away. Large Memory: working with large amounts of data without choking R. Map/Reduce: basically an abstraction for parallelism that divides data rather than architecture. For datasets too big to load into R, one can use the R packages bigmemory or ff.

The word RAM refers to physical memory, via chips that are installed on the system motherboard. Virtual memory, or swap, is disk space that is used to store objects that do not fit into RAM and are less frequently accessed; this is slow. A cluster is a group of systems that communicate with each other to accomplish a computation.

R typically reads data into RAM all at once, and objects in R live entirely in memory, so keeping unnecessary data in RAM will eventually cause R to choke. On many systems a single 32-bit R process cannot use more than about 2GB of memory, and if you rely on virtual memory the system will grind to a halt. This is called thrashing.

bigmemory is part of the "big" family of packages, which is useful for loading large datasets and doing simple manipulations on them. It is useful when the bulk of your actual analysis is performed in C, and all you need to do in R is simple stuff, and perhaps more complex tasks on subsets of your data. ff allows file-based access to datasets that simply do not fit into memory. One can also use databases, reading from and writing to the database for piecemeal analysis.

The big family consists of several packages for performing tasks on large datasets. I'll talk mostly about bigmemory. biganalytics provides analysis routines on big.matrix objects, such as GLMs and bigkmeans. synchronicity adds Boost mutex functionality to R. bigtabulate adds table- and split-like support for R matrices and big.matrix objects, memory efficiently. bigalgebra provides BLAS and LAPACK linear algebra routines for native R matrices and big.matrix objects. bigvideo provides video camera streaming via OpenCV.

bigmemory implements several matrix objects. big.matrix is an R object that simply points to a data structure in C++. shared.big.matrix is similar, but can be shared among multiple R processes. filebacked.big.matrix does not point to a data structure in memory; instead it points to a file on disk containing the matrix, and the file can be shared across a cluster.
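For comparison with the big.matrix sizes quoted below, an ordinary in-memory R matrix of doubles costs 8 bytes per entry, so a 5000 by 5000 numeric matrix already takes roughly 190 MB of RAM. A quick check (the exact number printed can vary slightly by platform):

m = matrix(0, nrow = 5000, ncol = 5000)
print(object.size(m), units = "MB")   ## roughly 190 MB for plain doubles
rm(m)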
One potential pitfall is that a big.matrix can contain only one type of data, and the data types for its elements are dictated by C++, not R: double, integer, short, or char.

First, construct a big.matrix object. Let us suppose we want to create a large matrix of 0s and 1s. We can construct a matrix of type int (4 bytes), short (2 bytes), double (8 bytes), or char (1 byte). Since all we need is 0 and 1, we use char. We also zero out the matrix. We have now created a pointer to a C++ matrix that is on disk. But to share this matrix we need to share its descriptor. Then we can open a second R session, load the location of the matrix from disk, and bind the matrix to an R variable (see the short sketch at the end of this bigmemory discussion). For a 5000 by 5000 matrix: char uses 24 MB of RAM, short 48 MB, int 96 MB, and double 192 MB.

read.big.matrix inherits from read.table, so the syntax and inputs are similar. It also adds a few more arguments: type, the C++ type to use for the matrix; separated, which separates the columns into individual files if TRUE; extracols, which explicitly adds columns to the matrix that you may use later; and backingfile, backingpath and descriptorfile, which control where important data about the matrix is stored on disk.

Simulate a large dataset with 10 million rows and 5 columns.

mydata = matrix(0, nrow = 10^7, ncol = 5)
set.seed(12345)
mydata[,1] = sample(c(1:170), 10^7, replace = TRUE)
mydata[,2] = sample(c(1:470179), 10^7, replace = TRUE)
mydata[,3] = sample(c(1:5), 10^7, replace = TRUE)
mydata[,4] = sample(c(1999:2005), 10^7, replace = TRUE)
mydata[,5] = rpois(10^7, lambda = 5 + 1.27 * mydata[,1] + 2.0*(mydata[,4]==2002))
write.table(mydata, file = "example.txt", sep = " ", row.names = F, col.names = F)

The file is around 200MB.

#x = read.table("example.txt", colClasses = "integer", header=F,
#  col.names = c("rating", "customer", "promotion", "year", "month"))
## This would take several minutes every time.

x = read.big.matrix("example.txt", header = F, type = "integer", sep = " ",
  backingfile = "data.bin", descriptor = "data.desc",
  col.names = c("month", "customer", "promotion", "year", "rating"), shared = TRUE)

### The next time we read in the data it will be superfast!!!
datadesc = dget("data.desc")
x = attach.big.matrix(datadesc)
head(x)

biganalytics provides a wrapper to the biglm package.

library(biganalytics)
blm = biglm.big.matrix(rating ~ month + customer + promotion + year, data = x)
summary(blm)
another = biglm.big.matrix(rating ~ month + customer + promotion + year, data = x, fc = "year") ## year is a factor now
summary(another)

Without bigmemory, this process would take several GB of RAM. Most systems do not have that much RAM, so the process would never complete. With bigmemory the processing takes less than a minute. Just as with base lm, we can get information about the fitted model with summary.

Here are some other features of bigmemory. You can write a big.matrix to an ASCII file using write.big.matrix. You can bin data for counting, creating histograms or visualizations using binit. You can create a copy of the content using deepcopy. You can search a matrix using mwhich, which includes fast C comparison operators. Columns of matrices are separated in RAM, rather than contiguous. Note that passing a big.matrix to a function is call by reference, not call by value. For more info, see www.bigmemory.org .
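Here is the create-and-share workflow described above as a minimal sketch. The file names zeros.bin and zeros.desc are made up for illustration; the attach step mirrors the dget/attach.big.matrix pattern used above.

## a 5000 x 5000 matrix of 0s and 1s: char (1 byte per entry) is enough
A = big.matrix(nrow = 5000, ncol = 5000, type = "char", init = 0,
               backingfile = "zeros.bin", descriptorfile = "zeros.desc")
A[1, 1] = 1
## in a second R session (or on another node that can see the same files):
## library(bigmemory)
## A2 = attach.big.matrix(dget("zeros.desc"))
## A2[1, 1]   ## sees the same data; no copy is made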
A really useful function is mwhich. It lets you extract rows or columns from your big matrix.

cust.indices.inefficient <- which(x[, "customer"] == as.integer(6))
cust.indices <- mwhich(x, "customer", 6, "eq")
these <- mwhich(x, c("customer", "rating"), list(6, 100), list("eq","le"), "AND")
x[these,]

This extracts the rows where the customer = 6 and the rating <= 100. The comparison choices are eq, neq, le, lt, ge and gt, and you can replace AND with OR. le means less than or equal to, lt means less than, etc.

ff stands for fast access files. It provides data structures that are stored on disk but act as if they are in memory: only the necessary/active parts of the data on disk are mapped into main memory. The ff library supports R's standard atomic types (double, logical, raw, integer), as well as non-standard atomic types like boolean (1 bit), quad (2 bit unsigned), nibble (4 bit unsigned), byte (1 byte signed with NA), ubyte (1 byte unsigned), short and ushort (2 byte signed with NA and unsigned, respectively), and single (4 byte float with NA). It also accepts non-atomic types such as factor, ordered, POSIXct, Date, etc.

The package ff provides the following functionality to the user. ffopen creates and opens flat files; using the parameters length or dim will create a new file. Input and output can be done using the common bracket [ ] notation. Other functions for ff objects, such as the usual dim and length, as well as some other useful functions such as sample, are included.

#install.packages("ff")
library(ff)
n = 10^7
movie = ff(vmode="integer", length = n)
movie[1:n] = 1:n
month = ff(vmode="double", length = n)
month[1:n] = x[,1]
customer = ff(vmode="double", length = n)
customer[1:n] = x[,2]
promotion = ff(vmode="double", length = n)
promotion[1:n] = runif(n)
year = ff(vmode="double", length = n)
year[1:n] = x[,4]
rating = ff(vmode="double", length = n)
rating[1:n] = x[,5]
fftogether <- ffdf(movie=movie, customer=customer, rating=rating, year=year, month=month)
## model <- biglm(rating ~ month + customer + year, data=fftogether)
## summary(model)

ff, ffm, and to some degree ffdf, support some other options you may need; R is usually pretty good at picking good defaults for you. initdata: value used to initialize the object at construction. length: optional length of the vector, used for construction. vmode: virtual storage mode (makes a big difference in memory overhead). filename: gives a name for the file created for the object. overwrite: if TRUE, allows overwriting of file objects. If no filename is given, a new file is created; if a filename is given and the file exists, the object will be loaded into R.

Use the ffsave function to save a series of objects to file. The specified objects are saved and are given the extension .RData. The ff files related to the objects are saved and zipped using an external zip application, and given the extension .ffData. ffsave has some useful parameters:

ffsave(..., list = character(0L), file = stop("'file' must be specified"),
  envir = parent.frame(), rootpath = NULL, add = FALSE, move = FALSE,
  compress = !move, compression_level = 6, precheck = TRUE)

... refers to the objects to be saved. The file parameter specifies the root name for the files; it is best to use absolute paths. add=TRUE adds these objects to an existing archive. compress=TRUE saves and compresses. safe=TRUE writes a temporary file first for verification, then moves it to a persistent location. ffsave.image allows the user to save the entire workspace to an ffarchive. ffinfo, when passed the path of an ffarchive, displays information about the archive. ffdrop allows the user to delete an ffarchive.
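Here is a minimal sketch of that save/restore cycle, using the ffdf created above. The archive root name myff is made up for illustration, and (as noted above) ffsave relies on an external zip utility being available on your system.

ffsave(fftogether, file = "myff")   ## writes myff.RData and myff.ffData
ffinfo(file = "myff")               ## list what is stored in the archive
rm(fftogether)
ffload(file = "myff")               ## restores fftogether and reopens its ff files
ffdrop(file = "myff")               ## deletes the archive files when you are done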
bigmemory vs. ff: which one to use is largely a matter of taste, and the performance times are similar.

MapReduce is a way of dividing a large job into many smaller jobs, each producing an output, and then combining the individual outputs into one output. It is a classic divide-and-conquer approach that is parallel and can easily be distributed among many cores of a CPU, or among many CPUs and nodes. It is patented by Google, its biggest user. There are two fundamental steps. 1. Map step: perform some operation f, in parallel, on the data, and output a key/value pair for each record/row. 2. Reduce step: group the common elements and compute some summary statistic, one per group. The notion of map and reduce comes from Lisp and functional programming, where map performs some operation on every element in a collection, and reduce collapses the results of that operation into one result for the collection. One implementation of this is Hadoop, which can integrate with R via the HadoopStreaming package. For datasets with size in the range of 10GB, bigmemory and ff are effective; for larger datasets, Hadoop is superior.

There is also bigglm.big.matrix by the way, and bigkmeans, and apply for big matrices.

## g = bigglm.big.matrix(rating ~ month + customer + promotion + year, data=x, family=poisson())
## summary(g)
## The true parameter is 1.27 for month, 2 for year, and 0 for the others:
## mydata[,5] = rpois(10^7, lambda = 5 + 1.27 * mydata[,1] + 2.0*(mydata[,4]==2002))
h = bigglm.big.matrix(rating ~ month + customer + promotion + year, data=x, fc="year", family=poisson())
summary(h)

2. Rcpp.

For Object Oriented Programming examples see https://www3.ntu.edu.sg/home/ehchua/programming/cpp/cp3_OOP.html

library(Rcpp)
cppFunction('int add2(int n) {
  return n*2.0;
}')
add2(17)
## but try it with n*2.5. It rounds it to an integer!

As with C, you must declare the type of output the function returns. This function returns an int (a scalar integer). The classes for the most common types of R vectors are NumericVector, IntegerVector, CharacterVector, and LogicalVector. Scalars and vectors are different: the scalar equivalents of numeric, integer, character, and logical vectors are double, int, String, and bool. Each vector type also has a matrix equivalent: NumericMatrix, IntegerMatrix, CharacterMatrix, and LogicalMatrix. Using them is straightforward. As in C, you must use an explicit return statement to return a value from a function, and you have to use = for assignment, not <-.

In C++, loops are much faster than in R.

mymean = function(x){
  y = 0
  n = length(x)
  for(i in 1:n){
    y = y + x[i]
  }
  y/n
}

Now here it is in C++.

cppFunction('double mymeanC(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total / n;
}')
x = runif(80000000)
mymean(x)
mymeanC(x)

To find the length of the vector, we use the .size() method, which returns an integer. C++ methods are called with a dot. Note that we could say int i inside the for loop declaration. You can't do this in traditional C, but in C++ it's ok.
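To see the speed difference directly, you can time the two versions on the vector x created above. This is just a sketch: the exact timings depend on your machine, but the C++ loop should be dramatically faster.

system.time(mymean(x))    ## R loop: typically several seconds for 8*10^7 values
system.time(mymeanC(x))   ## C++ loop: typically a small fraction of a second
mean(x)                   ## sanity check against R's built-in mean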
Now here is an example with vector input and vector output. Suppose we want a function to compute the squared difference between a particular number x and each element of a vector y. In R it is simple.

pdiffR = function(x, y) {
  (x - y) ^ 2
}

In C++ it is more verbose but about twice as fast.

cppFunction('NumericVector pdiffC(double x, NumericVector y) {
  int n = y.size();
  NumericVector out(n);
  for(int i = 0; i < n; i++) {
    out[i] = pow(y[i] - x, 2.0);
  }
  return out;
}')
x = 3.2
y = 1:10
pdiffR(x,y)
pdiffC(x,y)

Matrices are dealt with much as in C, but in C++ you subset a matrix with (), not [].

cppFunction('NumericVector rowSumsC(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);
  for (int i = 0; i < nrow; i++) {
    double total = 0;
    for (int j = 0; j < ncol; j++) {
      total += x(i, j);
    }
    out[i] = total;
  }
  return out;
}')
x = matrix(1:100, ncol=10)
rowSums(x)
rowSumsC(x)

You can use the .nrow() and .ncol() methods to get the dimensions of a matrix.

## SIMPLE EXAMPLE OF OBJECT ORIENTED SYNTAX IN C++.
## From https://www3.ntu.edu.sg/home/ehchua/programming/cpp/cp3_OOP.html
## The Circle class (All source codes in one file) (CircleAIO.cpp).
## The includes and class definition go in the includes argument so they sit
## at file scope; the exported function add1 then uses the class.
cppFunction(includes = '
#include <iostream>      // using IO functions (Rcpp::Rcout also works)
#include <string>        // using string
using namespace std;

class Circle {
private:
   double radius;        // Data member (Variable)
   string color;         // Data member (Variable)

public:
   // Constructor with default values for data members
   Circle(double r = 1.0, string c = "red") {
      radius = r;
      color = c;
   }

   double getRadius() {  // Member function (Getter)
      return radius;
   }

   string getColor() {   // Member function (Getter)
      return color;
   }

   double getArea() {    // Member function
      return radius*radius*3.1416;
   }
};  // need to end the class declaration with a semi-colon
', code = 'int add1(int n) {
   // Construct a Circle instance
   Circle c1(1.2, "blue");
   cout << "Radius=" << c1.getRadius() << " Area=" << c1.getArea()
        << " Color=" << c1.getColor() << endl;

   // Construct another Circle instance
   Circle c2(3.4);   // default color
   cout << "Radius=" << c2.getRadius() << " Area=" << c2.getArea()
        << " Color=" << c2.getColor() << endl;

   // Construct a Circle instance using the default no-arg constructor
   Circle c3;        // default radius and color
   cout << "Radius=" << c3.getRadius() << " Area=" << c3.getArea()
        << " Color=" << c3.getColor() << endl;

   return 0;
}')
add1(1)

############################### 3. Running Python inside R.

# install.packages("reticulate")
library(reticulate)
## First, make a Python script outside R, called myprod.py, containing:
##    def pro(x, y):
##        return x * y * 2
source_python('myprod.py')
pro(3, 12)

4. MLE in R and C.

Instead of minimizing the sum of squared residuals as in regression, one can estimate the parameters of a model by finding the values of the parameters that maximize the likelihood of the data. This is in some sense the reverse of what you want: given the data, you want the values of the parameters that are most likely. From a Bayesian point of view, the MLE is equivalent to putting a uniform (or uninformative) prior on your parameters and then taking the peak (or mode) of the posterior distribution.

For example, suppose you are interested in p = the percentage of days it rains in a particular place. You randomly select 1 day and find that it rained on that day. The MLE of p is 1. Indeed, the likelihood of your data is p, so the likelihood is maximized at p = 1. If you started with a uniform prior for p, then the posterior distribution would be proportional to p, a triangular density on [0,1], which is also maximized at p = 1. So the MLE kind of agrees with Bayesian analysis, though most Bayesians would advocate taking the mean of the posterior rather than the peak.
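To make the rain example concrete numerically: with k rainy days out of n sampled days, the likelihood is p^k (1-p)^(n-k) and the MLE is k/n. The counts below are made up, and the one-day example above is the special case n = 1, k = 1.

n = 10; k = 7    ## made-up counts: 7 rainy days out of 10
negloglik = function(p) -(k*log(p) + (n-k)*log(1-p))
optimize(negloglik, interval = c(0.001, 0.999))$minimum   ## numerical MLE
k/n                                                       ## analytic MLE, same answer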
Sir Ronald Fisher wrote several important papers between 1912 and 1922 advocating the use of the MLE. He showed that, from a frequentist perspective, the MLE has nice asymptotic properties like consistency, asymptotic normality, asymptotic unbiasedness, and asymptotic efficiency, under quite general conditions. Several people have written about special cases, though, where the MLE is a lousy estimate. See e.g. the excellent review by Le Cam, L.M. (1990). Maximum likelihood: an introduction. International Statistical Review 58, 153--171.

With Newton-Raphson methods, or quasi-Newton methods, you start with an initial guess x_0 and try to find the minimum of a smooth function f(x). You start with x_0, then move to x_1, then x_2, .... If f is smooth, then at the minimum x, f'(x) = 0. The Taylor expansion of f around x is

f(x + delta) = f(x) + f'(x) delta + f''(x) delta^2 / 2! + ....

It is common to approximate f using just the terms through the quadratic one:

f(x + delta) ~ f(x) + f'(x) delta + f''(x) delta^2 / 2.

At the minimum x + delta, the derivative of this approximation with respect to delta should be 0, so f'(x) + f''(x) delta ~ 0, and therefore delta ~ -f'(x) / f''(x). So what quasi-Newton methods do is start with some guess x_0, and then for i = 0, 1, 2, ..., let

x_{i+1} = x_i + delta = x_i - f'(x_i) / f''(x_i).

This extends readily to higher dimensional functions. The function optim() in R already does fast quasi-Newton optimization, because it calls Fortran. Still, optim() can be quite slow when calculating f is slow. Let's try out optim() a bit more.

f = function(theta){
  (theta[1]-2.1)^2 + (2.0+(theta[2]-2.2)^2)*(3.0+(theta[3]-2.3)^2) +
  (theta[4]-2.4)^2 + (theta[5]-2.5)^2 +
  (3.0+(theta[6]-2.6)^4)*(4.0+(theta[7]-2.7)^4) + (theta[8]-2.8)^2
}
guess1 = rep(10,8)
timea = Sys.time()
g = optim(guess1, f, method="BFGS", control=list(maxit=200), hessian=T)
Sys.time() - timea
g$par
sqrt(diag(solve(g$hess)))

So it's fast, but if f is slow, then the whole optimization slows down a lot.

num = 0
f = function(theta){
  a = runif(100000)
  b = sort(a)
  num <<- num + 1
  cat(num," ")
  (theta[1]-2.1)^2 + (2.0+(theta[2]-2.2)^2)*(3.0+(theta[3]-2.3)^2) +
  (theta[4]-2.4)^2 + (theta[5]-2.5)^2 +
  (3.0+(theta[6]-2.6)^4)*(4.0+(theta[7]-2.7)^4) + (theta[8]-2.8)^2
}
timea = Sys.time()
f(guess1)
Sys.time() - timea
timea = Sys.time()
g = optim(guess1, f, method="BFGS", control=list(maxit=200), hessian=T)
Sys.time() - timea
g$par
g$counts

Notice the large number of times f is called. optim calls f not only at every iteration, but also to approximate the derivatives and second derivatives; it calls f roughly 10 times per iteration.
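To see the update rule x_{i+1} = x_i - f'(x_i)/f''(x_i) in action, here is a hand-rolled one-dimensional version using numerical derivatives. This is just a sketch: the function f1, the helper names numderiv and numderiv2, and the step sizes are made up for illustration, and in practice optim or optimize is far more robust.

f1 = function(x) (x - 2.1)^2 + 0.5*(x - 2.1)^4    ## smooth function with its minimum at 2.1
numderiv  = function(f, x, h = 1e-5) (f(x + h) - f(x - h)) / (2*h)           ## f'(x), central difference
numderiv2 = function(f, x, h = 1e-4) (f(x + h) - 2*f(x) + f(x - h)) / h^2    ## f''(x)
x0 = 10
for(i in 1:20){
  x0 = x0 - numderiv(f1, x0) / numderiv2(f1, x0)   ## the Newton update
  cat(i, x0, "\n")
}
## the iterates converge to 2.1; compare optimize(f1, c(-100, 100))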
5. Hawkes point processes and MLE.

A point process is a collection of points in some space. Often the space is the real line and the points represent times; an example might be the origin times of major California earthquakes. Alternatively, you might have a spatial-temporal point process, where each point is the spatial-temporal location of the occurrence of some event, like a hurricane in the U.S.

A point process is SIMPLE if, with probability one, all its points are distinct, i.e. no 2 points overlap exactly. Simple point processes are uniquely characterized by their CONDITIONAL RATES. For a space-time point process, the conditional rate lambda(t,x,y) is the limiting expected rate at which points accumulate near (t,x,y), given everything that has occurred before time t:

lambda(t,x,y) = the limit, as delta and |B_(x,y)| -> 0, of E[number of points in [t, t+delta] x B_(x,y) | H_t] / [delta |B_(x,y)|],

where B_(x,y) is a ball around (x,y), |B_(x,y)| is its area, and H_t is the entire history of what points have occurred before or at time t.

Point processes may be clustered -- points cause future points nearby to become more likely; inhibitive -- points cause future points to become less likely; or Poisson -- the rate of future points is independent of what points have occurred previously.

An important model for clustered space-time point processes is the Hawkes model, where

lambda(t) = mu + Σ_{i: t_i < t} g(t - t_i).

g(u) is called the triggering density. Typically g(u) is large when u = 0, and g(u) decreases gradually to zero. An example is g(u) = exp(-beta u), for beta > 0. In the formula above, the t_i are the times at which the points occurred.

An important example of a Hawkes process is the ETAS, or Epidemic-Type Aftershock Sequence, model of Ogata (1988), where g(u) = K/(u+c)^p, and mu, K, c, and p are nonnegative parameters to be estimated. A space-time version of ETAS was proposed by Ogata (1998); see equation 2.3 or 2.4 of ogata98.pdf, where

lambda(t,x,y) = mu + Σ_{i: t_i < t} g(t - t_i, x - x_i, y - y_i)

and

(2.3) g(u,v,w) = K/(u+c)^p * exp(-a (M_i - M_0)) * (v^2 + w^2 + d)^{-q}, or
(2.4) g(u,v,w) = K/(u+c)^p * {(v^2 + w^2) exp(-a (M_i - M_0)) + d}^{-q},

where the parameter vector is Theta = {mu, K, c, p, a, d, q}, (t_i, x_i, y_i, M_i) are the time, x-coordinate, y-coordinate, and magnitude of earthquake i, and M_0 is the minimum cutoff magnitude for the earthquake catalog being used. M_0 is typically known by the user, not estimated from the data. Note that (2.4) for some reason isn't labelled in Ogata (1998), on page 384. Also, Ogata writes g(t,x,y), which is a little confusing I think, because g is really a function of the interevent time and the distance from a previous earthquake, not of ordinary time and location.

Point process parameters may be estimated by maximum likelihood. That is, we find the parameter vector Theta that maximizes the likelihood of the data. In practice it is typically a bit easier computationally to minimize -log(likelihood). The loglikelihood is

Σ_i log(lambda(t_i, x_i, y_i)) - ∫ lambda(t,x,y) dt dx dy,

where the sum is over the observed points i, and the integral is over the whole observed space. Given values of the parameters, it is often relatively easy to calculate the sum, but calculating or approximating the integral can be tricky.

On the bottom of page 385 and top of page 386 of Ogata (1998), ogata98.pdf, which is on the course website, he shows some calculus on the integral term, changing it to polar coordinates, for example. He finds that, if the study region is [0,T] x A, and the points are (t_i, x_i, y_i, M_i) for i = 1 to n, then ∫ lambda(t,x,y) dt dx dy is approximately

mu T |A| + Σ_i [ ∫ from 0 to (T - t_i) of K/(u+c)^p du ] [ Σ_k S_k(x_i, y_i, M_i) D_k / (2 pi) ],

where S_k and D_k have somewhat complicated explanations given on pages 385-6 of Ogata (1998).

If you can't get the integral analytically using calculus, then another option is to approximate it by sampling thousands of space-time locations around each point (t_i, x_i, y_i), evaluating g at these sampled locations, and using the approximation

∫ lambda(t,x,y) dt dx dy is approximately mu T |A| + (the sum of the sampled g values) * (the space-time volume associated with each sampled location).

For instance, suppose we wanted to approximate ∫ from 0 to 3 of f(t) dt, where f(t) = 7 + exp(-2t) + 20 exp(-4t).
The integral is

[7t - exp(-2t)/2 - 20 exp(-4t)/4] from 0 to 3
= 7*3 - exp(-6)/2 + exp(0)/2 - 5 exp(-12) + 5 exp(0)
= 7*3 - exp(-6)/2 + 1/2 - 5 exp(-12) + 5.

x = seq(0,5,length=100)
f = 7 + exp(-2*x) + 20*exp(-4*x)
plot(x,f,type="l",xlab="t",ylab="7 + exp(-2t) + 20exp(-4t)")
trueint = 7*3 - exp(-6)/2 + 1/2 - 5*exp(-12) + 5
m = 1000
sampgrid = seq(0,3,length=m)
sampf = exp(-2*sampgrid) + 20*exp(-4*sampgrid)
7*3 + sum(sampf) * 3/m

Now suppose we wanted to approximate ∫ from 0 to 3 of f(t) dt, where f(t) = 7 + exp(-2t) + dnorm(t-2). The true integral is 7*3 - exp(-6)/2 + 1/2 + pnorm(1) - pnorm(-2).

x = seq(0,5,length=100)
f = 7 + exp(-2*x) + dnorm(x-2)
plot(x,f,type="l",xlab="t",ylab="7 + exp(-2t) + dnorm(t-2)")
trueint = 7*3 - exp(-6)/2 + 1/2 + pnorm(1) - pnorm(-2)
m = 100000
sampgrid = seq(0,3,length=m)
sampf = exp(-2*sampgrid) + dnorm(sampgrid-2)
7*3 + sum(sampf) * 3/m

Since C is so fast, it might be easier to just use the approximation method.
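Putting sections 4 and 5 together, here is a minimal sketch of fitting a purely temporal Hawkes model by maximum likelihood, with exponential triggering g(u) = alpha*beta*exp(-beta*u), so that the integral term has a closed form: ∫ from 0 to Tmax of lambda(t) dt = mu*Tmax + alpha*Σ_i (1 - exp(-beta*(Tmax - t_i))). The names neglogl, t, and Tmax are made up, and the event times below are just uniform placeholders rather than a simulated Hawkes process, so the fitted numbers only illustrate the mechanics. The double loop over events is the slow part that you would move to C or Rcpp in a real application.

neglogl = function(theta, t, Tmax){
  ## parameters optimized on the log scale to keep them positive
  mu = exp(theta[1]); alpha = exp(theta[2]); beta = exp(theta[3])
  n = length(t)
  lam = rep(mu, n)   ## lambda(t_j) = mu + sum over earlier events of alpha*beta*exp(-beta*(t_j - t_i))
  for(j in 2:n) lam[j] = mu + sum(alpha*beta*exp(-beta*(t[j] - t[1:(j-1)])))
  ## closed-form integral of lambda over [0, Tmax] for the exponential trigger
  intlam = mu*Tmax + alpha*sum(1 - exp(-beta*(Tmax - t)))
  -(sum(log(lam)) - intlam)    ## negative loglikelihood
}
set.seed(1)
t = sort(runif(200, 0, 100))   ## placeholder event times on [0, 100]
Tmax = 100
fit = optim(log(c(1, 0.5, 1)), neglogl, t = t, Tmax = Tmax, method = "BFGS")
exp(fit$par)   ## estimates of mu, alpha, beta; with uniform placeholder times alpha should come out near 0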