0. Misc.
1. Sum of squared differences from observations for each gridpoint.
2. Kernel regression.
3. Vectors and matrices in C.
4. Calling C functions from C.
5. Running C from terminal.
6. Reading in from a file.
7. Computing an integral in C.
8. Bias in the sample SD.
9. Bias in the sample variance.
10. GAM.

0. A couple of miscellaneous notes.

When I compile a .c file on Windows (with RStudio), a .dll file is generated instead of a .so file. I have to dynamically load this .dll file instead of a .so file.

For MySQL, a good reference is King, T., Reese, G., Yarger, R., and Williams, H.E. (2002). Managing and Using MySQL: Open Source SQL Databases for Managing Information and Web Sites, 2nd ed. O'Reilly and Associates, Inc., Sebastopol, CA. Another is Sams Teach Yourself SQL in 10 Minutes, http://www.amazon.com/dp/0672325675. Websites that you might like are
http://w3schools.com/php/php_mysql_intro.asp
http://www.tizag.com/mysqlTutorial

1. Sum of squared differences from observations for each gridpoint.

Previously we computed the sum of squared differences between observations in C. In sumsq.c,

#include <R.h>
#include <math.h>
void ss2 (double *x, int *n, double *y)
/* x will be the vector of data of length n, and y will be a vector of
squared differences from obs i to the other n-1 observations. */
{
    int i,j;
    double a;
    for(i = 0; i < *n; i++){
        a = 0.0;
        for(j = 0; j < *n; j++){
            a += pow(x[i] - x[j], 2);
        }
        y[i] = a;
    }
}

in R,

system("R CMD SHLIB sumsq.c")
dyn.load("sumsq.so")
sum3 = function(data2){
    n = length(data2)
    a = .C("ss2", as.double(data2), as.integer(n), y=double(n))
    a$y ## or equivalently a[[3]]
}
b = c(1,3,4)
sum3(b)

Often, in addition to your dataset, you also have a grid of points where you want to compute something, like the sum of squared differences. This is what we will do now: find the sum of squared differences from the observations for each gridpoint. In sumsgrid.c,

#include <R.h>
#include <math.h>
void ssg2 (double *x, int *n, double *g, int *m, double *y)
/* x will be the vector of data of length n, g will be the vector of m
gridpoints, and y will be a vector of length m, of squared differences
from gridpoint i to the n observations. */
{
    int i,j;
    double a;
    for(i = 0; i < *m; i++){
        a = 0.0;
        for(j = 0; j < *n; j++){
            a += pow(g[i] - x[j], 2);
        }
        y[i] = a;
    }
}

in R,

system("R CMD SHLIB sumsgrid.c")
dyn.load("sumsgrid.so")
sum4 = function(data2, grid2){
    n = length(data2)
    m = length(grid2)
    a = .C("ssg2", as.double(data2), as.integer(n), as.double(grid2),
        as.integer(m), y=double(m))
    a$y ## or equivalently a[[5]]
}
b = c(1,3,4)
gr = c(0,1,2,3,4,5)
sum4(b,gr)

This function ssg2 will be similar to your kernel density code, except that you will also have a bandwidth as an additional input, and you will replace the line
a += pow(g[i] - x[j], 2);
with the kernel density estimate.

2. Kernel regression.

Given vectors X and Y of length n, suppose you want to look at the relationship between X and Y without assuming a linear model or any other parametric model. You could just plot(x,y), but perhaps you want a smooth summary of the relationship between X and Y. Suppose Y_i = f(X_i) + eps_i, for i = 1 to n, where f is some smooth function, and where the eps_i are iid with mean 0. The Nadaraya-Watson or kernel regression estimate of f is

fhat(x) = Sum [Y_i k(x - X_i; b)] / Sum [k(x - X_i; b)],

where k is some kernel function, b is some bandwidth, and the sums are from i = 1 to n. Note that k(x - X_i; b) = k{(x - X_i)/b}/b. [For instance, compare dnorm(7.0/33.3)/33.3 with dnorm(7.0, sd=33.3).]
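As a quick illustration before the C version, here is a minimal pure-R sketch of fhat at a single point. The function name nw1 and the choice of a Gaussian kernel via dnorm are my own, for illustration only, not part of the course code:

## check the scaling identity above:
dnorm(7.0/33.3)/33.3
dnorm(7.0, sd=33.3)  ## same value

## Nadaraya-Watson estimate at a single point x0, Gaussian kernel:
nw1 = function(x0, x, y, b){
    w = dnorm((x0 - x)/b)/b  ## weights k{(x0 - X_i)/b}/b
    sum(w * y)/sum(w)
}
x = runif(200)
y = sin(2*pi*x) + rnorm(200, sd=.2)
nw1(0.25, x, y, 0.05)  ## should be near sin(pi/2) = 1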
A commonly used kernel is the Epanechnikov kernel,
k(u) = 3/4 (1-u^2), if |u| ≤ 1, and 0 otherwise.
This was proposed in Epanechnikov, V.A. (1969). "Non-parametric estimation of a multivariate probability density". Theory of Probability and its Applications 14: 153–158, and shown to be optimal, in the sense of minimizing the asymptotic mean integrated squared error, under certain conditions.

In epanech.c,

#include <R.h>
#include <math.h>
/* x and y will be the data vectors each of length n, b is the bandwidth,
g will be the vector of m gridpoints where we want to calculate the kernel
regression estimates, est, which is thus also of length m. */
void kernreg2 (double *x, double *y, int *n, double *b, double *g, int *m, double *est)
{
    int i,j;
    double a1,a2,c;
    for(i = 0; i < *m; i++){
        a1 = 0.0;
        a2 = 0.0;
        for(j = 0; j < *n; j++){
            if(fabs((x[j] - g[i])/ *b) <= 1.0){
                c = 0.75 * (1.0 - pow((x[j]-g[i])/ *b, 2.0))/ *b;
                a1 += y[j] * c;
                a2 += c;
            }
        }
        if(a2 > 0.0) est[i] = a1/a2;
        else est[i] = 0.0;
        Rprintf("I'm on gridpoint %f and index %d right now.\n", g[i], i);
    }
}
// Watch out for /* in your code. It is treated as a comment starter.
// Use / *b instead of /*b.
// These double slashes can be used in C to comment out an individual line,
// like # in R.

In R,

system("R CMD SHLIB epanech.c")
dyn.load("epanech.so")
n = 1000
x = runif(n)*4
y = 0.1*x^5 - 2*x^3 + 10*x + rnorm(n)
bw = bw.nrd(x)
m = 100
g2 = seq(0,4,length=m)
## This time, I'm going to call the C function directly,
## without making a separate R function to call it.
a = .C("kernreg2", as.double(x), as.double(y), as.integer(n), as.double(bw),
    as.double(g2), as.integer(m), est=double(m))
plot(c(0,4),c(-3,17),type="n",xlab="x",ylab="y",cex.axis=3)
points(x,y,col="red",cex=3)
lines(g2,a$est,col="black")
lines(g2,.1*g2^5 - 2*g2^3 + 10*g2,col="blue")
legend("topleft",lty=c(1,1),col=c("black","blue"),c("estimate","truth"))

Here, bw is about 0.3. Try it again with bw = .01 or bw = 2. The Nadaraya-Watson estimator tends to be biased at the edges, especially when the bandwidth is large.

Play around with a[[1]], a[[2]], etc. A C function called via .C() returns to R a list containing all the arguments that were passed to it; here the last component, a$est, has been changed by kernreg2. Note that, when calling your C function, the order of the arguments matters, and their types matter, but their names don't. For instance, here in R I called the bandwidth bw, but in C I called it b. Same with g2 and g.

3. Vectors and matrices in C.

You can just say double x[100] or double x[25][4], for example. In mypi2.c,

/* Calculate the first 100 terms in the approximation of pi. */
#include <R.h>
#include <math.h>
void pi100 (double *y){
    int i;
    double x[100];
    x[0] = 1.0;
    y[0] = sqrt(6.0);
    for(i = 1; i < 100; i++) {
        x[i] = x[i-1] + pow(i+1.0, -2.0);
        y[i] = sqrt(6.0 * x[i]);
    }
}

In R,

system("R CMD SHLIB mypi2.c")
dyn.load("mypi2.so")
a = .C("pi100",y=double(100))
plot(1:100,a$y,type="l",xlab="n",ylab="Approx. of pi")

In mypi3.c,

/* Calculate 1*2*3 + 4*5*6 + ... + 28*29*30. */
#include <R.h>
#include <math.h>
void weirdsum (double *y){
    int i,j;
    double x[10][3];
    *y = 0.0;
    for(i = 0; i < 10; i++) {
        for(j = 0; j < 3; j++){
            x[i][j] = 3*i + j + 1.0;
        }
        *y += x[i][0] * x[i][1] * x[i][2];
    }
}

In R,

system("R CMD SHLIB mypi3.c")
dyn.load("mypi3.so")
y = 1.0
a = .C("weirdsum",as.double(y))
a
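To check weirdsum, the same sum can be computed directly in R. This quick check is mine, in the same spirit as the checks used elsewhere in these notes:

x = matrix(1:30, ncol=3, byrow=TRUE)  ## rows are (1,2,3), (4,5,6), ..., (28,29,30)
sum(apply(x, 1, prod))                ## should match a[[1]]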
If you want x to be an n by m matrix, where n and m are inputs, you can just do, for instance, in mypi4.c,

/* Calculate 1*2*3*...*m + (m+1)(m+2)...(2m) + ... + [(n-1)m+1][(n-1)m+2]...(nm). */
#include <R.h>
#include <math.h>
void pi103 (double *y, int *n, int *m){
    int i,j;
    double x[*n][*m];
    double a;
    /* First define the matrix. */
    for(i = 0; i < *n; i++) {
        for(j = 0; j < *m; j++){
            x[i][j] = *m*i + j + 1.0;
        }
    }
    /* Now multiply each row and add the product to y. */
    *y = 0.0;
    for(i = 0; i < *n; i++){
        a = 1.0;
        for(j = 0; j < *m; j++){
            a *= x[i][j];
        }
        *y += a;
    }
}

In R,

system("R CMD SHLIB mypi4.c")
dyn.load("mypi4.so")
y = 1.0
a = .C("pi103",as.double(y),as.integer(10),as.integer(3))
a[[1]]
## to check,
prod(1:3)+prod(4:6)+prod(7:9)+prod(10:12)+prod(13:15)+
    prod(16:18)+prod(19:21)+prod(22:24)+prod(25:27)+prod(28:30)
a = .C("pi103",result=as.double(y),as.integer(8),as.integer(4))
a$result
prod(1:4)+prod(5:8)+prod(9:12)+prod(13:16)+
    prod(17:20)+prod(21:24)+prod(25:28)+prod(29:32)

Note that you have to name the argument, as in result = as.double(y), rather than just as.double(y), if you want to be able to extract the result with a$result. To do this in R, you could do

n = 8
m = 4
x = matrix(1:(n*m),byrow=T,ncol=m)
y = apply(x,1,prod)
sum(y)

You pretty much have to pass matrices from R to C as vectors. Matrices in R are actually stored as long vectors, with their dimensions as an attribute. So you can pass matrices just as you would vectors, with one caveat: the matrices are read column by column, not row by row!

a = matrix(1:6,ncol=3,byrow=T)
a
a[1]
a[2]
a[3]
a[1:6]

In mymatrix.c,

#include <R.h>
#include <math.h>
void matmult (double *a, int *n1, int *m1, double *b, int *n2, int *m2, double *y)
{
    int i,j,k;
    double x;
    double a1[*n1][*m1];
    double b1[*n2][*m2];
    /* unpack a and b, which arrive column by column, into 2-d arrays */
    for(j = 0; j < *m1; j++) {
        for(i = 0; i < *n1; i++) {
            a1[i][j] = a[j * *n1 + i];
        }
    }
    for(j = 0; j < *m2; j++) {
        for(i = 0; i < *n2; i++) {
            b1[i][j] = b[j * *n2 + i];
        }
    }
    for(i = 0; i < *n1; i++) {
        for(j = 0; j < *m2; j++) {
            x = 0.0;
            for(k = 0; k < *m1; k++) {
                x += a1[i][k] * b1[k][j];
            }
            y[i * *m2 + j] = x;  // this is y[i][j], stored row by row
        }
    }
}

In R,

system("R CMD SHLIB mymatrix.c")
dyn.load("mymatrix.so")
a = matrix(1:6,ncol=3,byrow=T)  ## or equivalently a = c(1,4,2,5,3,6)
b = matrix(1:12,ncol=4,byrow=T)
y = rep(0,8)
x = .C("matmult",as.double(a),as.integer(2),as.integer(3),as.double(b),
    as.integer(3),as.integer(4),as.double(y))
y = matrix(x[[7]], ncol=4, byrow=T)
y
a %*% b

What if you want to define a vector of length n, or a matrix with a dimension n, and n is not an input or a known integer but instead something that must be calculated inside your function, or something that changes as you go through your function? Then you need to do dynamic memory allocation. We might get to this later.

4. Calling C functions from C.

In mysd1.c,

/* Create a 10 x 4 matrix of integers and compute the sample SD of each row. */
/* Row 1 will be (1, 4, 9, 16). Row 2 will be (25, 36, 49, 64), etc. */
#include <R.h>
#include <math.h>
double sd2(double *x, int n){
    int i;
    double xbar, b;
    xbar = 0.0;
    for(i = 0; i < n; i++){
        xbar += x[i];
    }
    xbar = xbar/n;
    b = 0.0;  // b will be the sum of squared deviations
    for(i = 0; i < n; i++){
        b += pow(x[i] - xbar, 2);
    }
    return(pow(b/(n-1), 0.5));
}
void rowsd (double *y){
    int i,j;
    double x[10][4];
    for(i = 0; i < 10; i++) {
        for(j = 0; j < 4; j++){
            x[i][j] = pow(4*i + j + 1.0, 2);
        }
        y[i] = sd2(x[i], 4);
    }
}

In R,

system("R CMD SHLIB mysd1.c")
dyn.load("mysd1.so")
a = .C("rowsd",y=double(10))
a$y
## to check it,
sd(c(1,4,9,16))
sd(c(25,36,49,64))
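Or, to check all ten rows at once, here is a one-liner of mine using the same construction as rowsd:

x = matrix((1:40)^2, nrow=10, byrow=TRUE)  ## the same 10 x 4 matrix as in rowsd
apply(x, 1, sd)                            ## should match a$y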
"); scanf("%f", &x); printf("\n How long is the second side of your right triangle? "); scanf("%f", &y); h = sqrtf(x*x + y*y); printf("\n You are %d years old and the hypotenuse is %f . \n", i, h); return(i); } In terminal on mac (or powershell on PC), cd Desktop gcc mypi5.c -o mypi5.out ./mypi5.out 6. Reading in from a file. You first need a file pointer FILE *fp; FILE is a structure that holds information about the file. Use FILE * to pass the information around by reference. fp = fopen(filename, mode); filename is a string that holds the name of the file on disk (including a path like /cs/course if necessary). mode is a string representing how you want to open the file. Most often you'll open a file for reading ("r") or writing ("w"). In data.txt, 1 14 5 37 7 2 1 100 In readin1.c, #include #include #include void readssq (int *n){ /* Read in the first n terms in "data.txt" and compute the sum of squares. Write the output to "output.txt" */ FILE *myfile1, *myfile2; int i; float a1, a2; a2 = 0.0; myfile1 = fopen("data.txt", "r"); myfile2 = fopen("output.txt", "w"); for(i = 0; i < *n; i++){ fscanf(myfile1, "%f", &a1); a2 += pow(a1, 2); } fclose(myfile1); Rprintf("The sum of squares is %f \n", a2); fprintf(myfile2, "%f \n", a2); fclose(myfile2); } In R, system("R CMD SHLIB readin1.c") dyn.load("readin1.so") a = .C("readssq",as.integer(5)) The input file that we are opening for reading ("r") must already exist, but the output file we are opening for writing ("w") does not have to exist. If it doesn't, it will be created. If this output file does already exist, its previous contents will be thrown away and lost. Use "a" for append to write something to the end of a file without losing its contents. In fprintf, Integer values use %d, Floating point values, i.e. decimals, use %f, %lf for double precision numbers, Single characters use %c, Character strings use %s. 7. Computing an integral in C. A rough way to approximate an integral of some function, is to divide the interval along the x-axis into n pieces, and for each little piece, compute the area = width of the piece times f(x) at the left endpoint (or center, or right endpoint) of the piece, and add up those areas. ## In expofile.c #include #include double expo(double a, double b, double x){ // Computes a exp(bx) return(a * exp(b*x)); } void myintegral(double *start,double *stop,int *n, double *a, double *b, double *ans){ /* integrates aexp(bx) from start to stop numerically, using n rectangles. */ double x,t,inc; long i; inc = (*stop - *start) / (double) *n; x = *start; t = *a * exp(*b * x); for(i=1; i< *n; i++){ x += inc; t += *a * exp(*b * x); } *ans = t * inc; } system("R CMD SHLIB expofile.c") dyn.load("expofile.so") a = 2.5 b = 3.5 c = 1.0 d = 4.0 n = 100000000 y = .C("myintegral",as.double(c), as.double(d), as.integer(n), as.double(a), as.double(b),as.double(0.0)) ## compare with the integral of a exp(bx) from x = c to d ## = a/b * (exp(b*d) - exp(b*c)) ## = 858979.4. y[[6]] 8. Bias in the sample sd. Given n iid observations from some population with mean µ and variance sigma^2, it is well known (and not hard to show) that the sample variance s^2 = ∑ (Xi - Xbar)^2 / (n-1) is an unbiased estimate of sigma^2. But is s an unbiased estimate of sigma? What is its bias? Suppose the population is uniform (0,1). If we were writing code to answer this question in R, we might do the following. 
8. Bias in the sample SD.

Given n iid observations from some population with mean µ and variance sigma^2, it is well known (and not hard to show) that the sample variance s^2 = ∑ (Xi - Xbar)^2 / (n-1) is an unbiased estimate of sigma^2. But is s an unbiased estimate of sigma? What is its bias? (Note that since Var(s) = E(s^2) - [E(s)]^2 = sigma^2 - [E(s)]^2 > 0 whenever s is not constant, we must have E(s) < sigma, so s is biased downward; the question is by how much.) Suppose the population is uniform (0,1). If we were writing code to answer this question in R, we might do the following.

## R code to calculate bias in the sample sd of n uniforms on (0,1),
## where n goes from 2 to nmax=100, and for each such n,
## we will generate n uniforms,
## calculate the sample sd, and repeat m times, where m = 10000.
## The true variance of U(0,1) = E(X^2) - [E(X)]^2 = 1/3 - 1/4 = 1/12,
## so sigma = 1/sqrt(12).
nmax = 100
n = 2:nmax
m = 10000
bias2 = rep(0,99)  ## to be general, 99 should be changed to nmax-1
for(i in 1:99){
    s = rep(0,m)
    for(j in 1:m){
        x = runif(n[i])
        s[j] = sd(x)
    }
    bias2[i] = mean(s) - 1/sqrt(12)
    cat(i," ")
}
plot(n,bias2,xlab="n",ylab="bias", ylim=c(-.003,.001),
    main="bias in SD of n uniforms", type="l")
abline(h=0,lty=2)
## There's a lot of scatter and noise, because m is so small. It's mostly noise.

In sd2.c,

#include <R.h>
#include <Rmath.h>
#include <math.h>
double sd2(double *x, int j) {
    int i;
    double a1, xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++) {
        xbar += x[i];
    }
    xbar = xbar / j;
    a1 = 0.0;
    for(i = 0; i < j; i++) {
        a1 += pow(x[i] - xbar, 2);
    }
    a1 = pow(a1/(j-1), 0.5);
    return(a1);
}
double mean2(double *x, int j) {
    int i;
    double xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++) {
        xbar += x[i];
    }
    xbar = xbar / j;
    return(xbar);
}
void bias3 (int *nmax, int *m, double *b){
    /* The output will be b, the bias in the sd for each n,
    from n=2 up to nmax. */
    int i,j,k;
    double x[*nmax];
    double s[*m];
    GetRNGstate();
    for(i = 2; i < *nmax+1; i++){
        for(j = 0; j < *m; j++){
            for(k = 0; k < i; k++){
                x[k] = runif(0,1);
            }
            s[j] = sd2(x, i);
        }
        b[i-1] = mean2(s, *m) - pow(12, -.5);
        if(i % 1000 == 0) Rprintf("%i ",i);  // progress report every 1000th n
    }
    PutRNGstate();
}

In R,

system("R CMD SHLIB sd2.c")
dyn.load("sd2.so")
nmax = 100
m = 100000
a = .C("bias3",as.integer(nmax), as.integer(m), b = double(nmax))
bias1 = a$b
plot(c(2:nmax),bias1[2:nmax],xlab="n",ylab="bias",ylim=c(-.003,.001),
    main="bias in SD of n uniforms", type="l")
abline(h=0,lty=2)

9. Bias in the sample variance?

The sample variance s^2 = ∑ (Xi - Xbar)^2 / (n-1) is an unbiased estimate of sigma^2, regardless of n and the distribution of the Xi. Last time we investigated the bias in s as an estimate of sigma, supposing the random variables Xi are uniform (0,1). What would it look like if we repeated this, but looked at s^2 instead of s?

In sd3.c,

#include <R.h>
#include <Rmath.h>
#include <math.h>
double s2(double *x, int j) {
    int i;
    double a1, xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++) {
        xbar += x[i];
    }
    xbar = xbar / j;
    a1 = 0.0;
    for(i = 0; i < j; i++) {
        a1 += pow(x[i] - xbar, 2);
    }
    a1 = a1/(j-1);  // I changed this from a1 = pow(a1/(j-1), 0.5).
    return(a1);
}
double mean2(double *x, int j) {
    int i;
    double xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++) {
        xbar += x[i];
    }
    xbar = xbar / j;
    return(xbar);
}
void bias4 (int *nmax, int *m, double *b){
    /* The output will be b, the bias in s2 for each n,
    from n=2 up to nmax. */
    int i,j,k;
    double x[*nmax];
    double s[*m];
    GetRNGstate();
    for(i = 2; i < *nmax+1; i++){
        for(j = 0; j < *m; j++){
            for(k = 0; k < i; k++){
                x[k] = runif(0,1);
            }
            s[j] = s2(x, i);  // I changed this from sd2 to s2.
        }
        b[i-1] = mean2(s, *m) - 1.0/12.0;  // I changed pow(12,-.5) to 1.0/12.0.
        if(i % 1000 == 0) Rprintf("%i ",i);
    }
    PutRNGstate();
}

In R,

system("R CMD SHLIB sd3.c")
dyn.load("sd3.so")
nmax = 100
m = 100000
a = .C("bias4",as.integer(nmax), as.integer(m), b = double(nmax))
bias1 = a$b
plot(c(2:nmax),bias1[2:nmax],xlab="n",ylab="bias", ylim=c(-.003,.001),
    main="bias in sample var of n uniforms", type="l")
abline(h=0,lty=2)
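For a quick sanity check of both results without recompiling anything, here is a one-off simulation of mine in plain R, for a single n; replicate() is standard R:

mean(replicate(100000, var(runif(10)))) - 1/12       ## near 0: s^2 is unbiased
mean(replicate(100000, sd(runif(10)))) - 1/sqrt(12)  ## slightly negative: s is biased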
10. Generalized additive models with kernel regression.

Suppose you have a response Y, and 3 explanatory variables, X1, X2, and X3. A generalized additive model is of the form

Y = f1(X1) + f2(X2) + f3(X3) + eps,

where eps is noise (iid, mean 0, variance sigma^2). The idea is that the functions f1, f2, and f3 don't need to be linear or of some particular parametric form. You can estimate these functions and then see if they look linear, quadratic, exponential, or whatever. R has a function gam() in the mgcv package, which fits these functions using penalized regression splines. However, one could also fit these functions by kernel regression, using your C kernel regression function, or more simply using ksmooth(), which does basically the same thing.

Consider the following example. We will generate some fake data, and then fit Y = f1(x1) using kernel regression. Then we will take the residuals from that, where residual[i] = Y[i] - f1(x1[i]), fit the residuals as a function of x2 by kernel regression, and keep going. (This is essentially one pass of what is called backfitting.) You might want to do something like this for your projects.

x1 = runif(100)*10
x2 = runif(100)*20
x3 = runif(100)*30
y = x1^3 - 10*x1^2 + 5*x1 - .1*x2^3 + 2*x2^2 + 4*x2 + 12 + 7*x3 + 14*rnorm(100)
plot(x1,y)
c1 = ksmooth(x1,y, bandwidth = bw.nrd(x1), x.points=x1)
lines(c1)
fit1 = rep(0,100)
fit1[order(x1)] = c1$y  ## ksmooth returns its output sorted by x
y1 = y - fit1  ## these are the residuals from this first fit
## note that x1 is not ordered, so if you do lines(x1,y1), it won't look good,
## because lines() joins the first point to the second point to the third, etc.,
## but you can do points(x1,fit1,pch=2) to check that things are going ok.
plot(x2,y1,ylab="residuals after first kernel smoothing")
c2 = ksmooth(x2,y1, bandwidth = bw.nrd(x2), x.points=x2)
lines(c2)
fit2 = fit1  ## just to make a vector of length 100; every entry is replaced next
fit2[order(x2)] = c2$y
## to check, try points(x2,fit2,pch=2)
y2 = y1 - fit2  ## residuals after 2nd fit
plot(x3, y2, xlab="x3", ylab="residuals after 2 kernel smoothings")
## Now suppose we think these residuals look like a linear function of x3,
## so we decide to fit a line for this term.
c3 = lsfit(x3, y2)
abline(c3)
c3$coef[2]  ## compare with 7
plot(x3,c3$resid,ylab="residuals after 2 kernels and a line")
par(mfrow=c(1,2))
hist(y2,nclass=20,xlab="residuals after 2 kernels",main="",
    xlim=range(c(y2,c3$resid)))
hist(c3$resid, nclass=20, xlab="residuals after 2 kernels and a line",main="",
    xlim=range(c(y2,c3$resid)))
par(mfrow=c(1,1))
## or, we could keep going and fit the function of x3 by kernel regression.
plot(x3, y2, xlab="x3", ylab="residuals after 2 kernel smoothings")
c4 = ksmooth(x3, y2, bandwidth = bw.nrd(x3), x.points=x3)
lines(c4)
fit3 = fit1
fit3[order(x3)] = c4$y
y3 = y2 - fit3
plot(x3, y3, ylab="residuals after 3 kernel smoothings")
par(mfrow=c(1,2))
hist(c3$resid, nclass=20, xlab="residuals after 2 kernels and a line",main="",
    xlim=range(c(y3,c3$resid)))
hist(y3, breaks=20, xlab="residuals after 3 kernels",main="",
    xlim=range(c(y3,c3$resid)))
## It's interesting to look at, and interpret, the estimates of f1, f2,
## and f3, too.
par(mfrow=c(1,3))
plot(c1,type="l",xlab="x1",ylab="estimate of f1(x1)")
plot(c2,type="l",xlab="x2",ylab="estimate of f2(x2)")
plot(c4,type="l",xlab="x3",ylab="estimate of f3(x3)")
par(mfrow=c(1,1))

What might be a good idea for your project is to take some regression-style data, fit a generalized additive model by kernel regression (and/or perhaps fitting terms that look linear or quadratic by linear or quadratic functions), and use C to get confidence intervals for your fitted curves.
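For comparison with the hand-rolled backfit above, you could also fit the same model with mgcv's gam(), mentioned earlier. A minimal sketch, assuming the simulated x1, x2, x3, and y are still in the workspace (s() is mgcv's smooth-term constructor):

library(mgcv)
g1 = gam(y ~ s(x1) + s(x2) + s(x3))
summary(g1)
plot(g1, pages=1)  ## one panel per smooth; compare with the kernel estimates above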
Alternatively, if you have a natural "factor"-style response variable and would like to do classification using several explanatory variables, with the methods in Irizarry's book, that would also be good.