1. dnorm and change of variables. 2. Matrices in C. 3. Python and MySQL references. 4. Bias in the sample SD. 5. Bias in the sample variance. 6. Calling R from C. 7. GAM.

1. Brief review of dnorm and change of variables.

Why does dnorm(x, sd = s) = dnorm(x/s, sd = 1)/s?

dnorm(1.2, 0, 5.7, 0) = normal density with mean 0 and sd 5.7, evaluated at 1.2
                      = dnorm(1.2/5.7, 0, 1, 0) / 5.7,

by the change of variables formula from calculus. Let f be the standard normal density.

Integral from -inf to inf of f(x/b) dx = ?
Let y = x/b. Then dy = dx/b, so dx = b dy. So
Integral from -inf to inf of f(x/b) dx = b * Integral from -inf to inf of f(y) dy = b.

So, divide by b to get a density: g(x) = f(x/b)/b integrates to 1, and g is the normal density with mean 0 and sd b.

2. Matrices in C.

You pretty much have to pass matrices in R to C as vectors. Matrices in R are actually stored as long vectors, with their dimensions as two attributes. So you can pass matrices just as you would vectors, with one caveat: the matrices are stored column by column, not row by row!

a = matrix(1:6, ncol=3, byrow=T)
a
a[1]
a[2]
a[3]
a[1:6]

In mymatrix.c,

#include <R.h>
#include <Rmath.h>

void matmult (double *a, int *n1, int *m1, double *b, int *n2, int *m2, double *y)
{
    int i, j, k;
    double x;
    double a1[*n1][*m1];
    double b1[*n2][*m2];
    for(j = 0; j < *m1; j++) {
        for(i = 0; i < *n1; i++) {
            a1[i][j] = a[j * *n1 + i];   /* unpack a column by column */
        }
    }
    for(j = 0; j < *m2; j++) {
        for(i = 0; i < *n2; i++) {
            b1[i][j] = b[j * *n2 + i];
        }
    }
    for(i = 0; i < *n1; i++) {
        for(j = 0; j < *m2; j++) {
            x = 0.0;
            for(k = 0; k < *m1; k++) {
                x += a1[i][k] * b1[k][j];
            }
            y[i * *m2 + j] = x;  /* y[i][j] = x, stored row by row; hence byrow=T below */
        }
    }
}

In R,

system("R CMD SHLIB mymatrix.c")
dyn.load("mymatrix.so")
a = matrix(1:6, ncol=3, byrow=T)  ## or equivalently a = c(1,4,2,5,3,6)
b = matrix(1:12, ncol=4, byrow=T)
y = rep(0, 8)
x = .C("matmult", as.double(a), as.integer(2), as.integer(3), as.double(b),
       as.integer(3), as.integer(4), as.double(y))
y = matrix(x[[7]], ncol=4, byrow=T)
y
a %*% b

3. Python and MySQL references.

For Python, a nice introduction is http://wiki.python.org/moin/BeginnersGuide .
There are many nearly equivalent online references. For MySQL, a good reference is King, T., Reese, G., Yarger, R., and Williams, H.E. (2002). Managing and Using MySQL: Open Source SQL Databases for Managing Information and Web Sites, 2nd ed. O'Reilly and Associates, Inc., Sebastopol, CA. Or Sams Teach Yourself SQL in 10 Minutes, http://www.amazon.com/dp/0672325675 . Websites that you might like are
http://w3schools.com/php/php_mysql_intro.asp
http://www.tizag.com/mysqlTutorial .

4. Bias in the sample SD.

Given n iid observations from some population with mean µ and variance sigma^2, it is well known (and not hard to show) that the sample variance s^2 = Σ (Xi - Xbar)^2 / (n-1) is an unbiased estimate of sigma^2. But is s an unbiased estimate of sigma? What is its bias? Suppose the population is uniform(0,1). If we were writing code to answer this question in R, we might do the following.

## R code to calculate bias in the sample sd of n uniforms on (0,1),
## where n goes from 2 to nmax = 100, and for each such n,
## we will generate n uniforms,
## calculate the sample sd, and repeat m times, where m = 10000.
## The true variance of U(0,1) = E(X^2) - [E(X)]^2 = 1/3 - 1/4 = 1/12,
## so sigma = 1/sqrt(12).
nmax = 100
n = 2:nmax
m = 10000
bias2 = rep(0,99)  ## to be general, 99 should be changed to nmax-1
for(i in 1:99){
    s = rep(0,m)
    for(j in 1:m){
        x = runif(n[i])
        s[j] = sd(x)
    }
    bias2[i] = mean(s) - 1/sqrt(12)
    cat(i," ")
}
plot(n, bias2, xlab="n", ylab="bias", ylim=c(-.003,.001),
     main="bias in SD of n uniforms", type="l")
abline(h=0, lty=2)
## There's a lot of scatter and noise, because m is so small. It's mostly noise.
In sd2.c,

#include <R.h>
#include <Rmath.h>
#include <math.h>

double sd2(double *x, int j)
{
    int i;
    double a1, xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++) {
        xbar += x[i];
    }
    xbar = xbar / j;
    a1 = 0.0;
    for(i = 0; i < j; i++) {
        a1 += pow(x[i] - xbar, 2);
    }
    a1 = pow(a1/(j-1), 0.5);
    return(a1);
}

double mean2(double *x, int j)
{
    int i;
    double xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++) {
        xbar += x[i];
    }
    xbar = xbar / j;
    return(xbar);
}

void bias3 (int *nmax, int *m, double *b)
{
    /* The output will be b, the bias in the sd for each n, from n=2 up to nmax. */
    int i, j, k;
    double x[*nmax];
    double s[*m];
    GetRNGstate();
    for(i = 2; i < *nmax+1; i++){
        for(j = 0; j < *m; j++){
            for(k = 0; k < i; k++){
                x[k] = runif(0,1);
            }
            s[j] = sd2(x, i);
        }
        b[i-1] = mean2(s, *m) - pow(12, -.5);
        Rprintf("%i ", i);  /* progress report; the guard i/1000 == floor(i/1000)
                               was always true for ints, so it is dropped here */
    }
    PutRNGstate();
}

In R,

system("R CMD SHLIB sd2.c")
dyn.load("sd2.so")
nmax = 100
m = 100000
a = .C("bias3", as.integer(nmax), as.integer(m), b = double(nmax))
bias1 = a$b
plot(c(2:nmax), bias1[2:nmax], xlab="n", ylab="bias", ylim=c(-.003,.001),
     main="bias in SD of n uniforms", type="l")
abline(h=0, lty=2)

5. Bias in the sample variance?

The sample variance s^2 = Σ (Xi - Xbar)^2 / (n-1). s^2 is an unbiased estimate of sigma^2, regardless of n and the distribution of Xi. Last time we investigated the bias in s as an estimate of sigma, supposing the random variables Xi are uniform(0,1). What would it look like if we repeated this, but looked at s^2 instead of s?

In sd3.c,

#include <R.h>
#include <Rmath.h>
#include <math.h>

double s2(double *x, int j)
{
    int i;
    double a1, xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++) {
        xbar += x[i];
    }
    xbar = xbar / j;
    a1 = 0.0;
    for(i = 0; i < j; i++) {
        a1 += pow(x[i] - xbar, 2);
    }
    a1 = a1/(j-1);  // I changed this from a1 = pow(a1/(j-1), 0.5).
    return(a1);
}

double mean2(double *x, int j)
{
    int i;
    double xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++) {
        xbar += x[i];
    }
    xbar = xbar / j;
    return(xbar);
}

void bias4 (int *nmax, int *m, double *b)
{
    /* The output will be b, the bias in s2 for each n, from n=2 up to nmax.
    */
    int i, j, k;
    double x[*nmax];
    double s[*m];
    GetRNGstate();
    for(i = 2; i < *nmax+1; i++){
        for(j = 0; j < *m; j++){
            for(k = 0; k < i; k++){
                x[k] = runif(0,1);
            }
            s[j] = s2(x, i);  // I changed this from sd2 to s2.
        }
        b[i-1] = mean2(s, *m) - 1.0/12.0;  // I changed pow(12,-.5) to 1.0/12.0.
        Rprintf("%i ", i);  /* progress report, as in bias3 */
    }
    PutRNGstate();
}

In R,

system("R CMD SHLIB sd3.c")
dyn.load("sd3.so")
nmax = 100
m = 100000
a = .C("bias4", as.integer(nmax), as.integer(m), b = double(nmax))
bias1 = a$b
plot(c(2:nmax), bias1[2:nmax], xlab="n", ylab="bias", ylim=c(-.003,.001),
     main="bias in sample var of n uniforms", type="l")
abline(h=0, lty=2)

6. Calling R from C.

Notes on this come from Phil Spector's lecture notes at UC Berkeley.

The composite version of Simpson's rule is a rough way to approximate an integral of some function f. It goes back to the English mathematician Thomas Simpson (1710-1761), who approximated the integral from a to b of f(x) dx by

    [(b-a)/6] [f(a) + 4f((a+b)/2) + f(b)].

This is the approximation you'd get by dividing the interval (a,b) into 6 pieces and using the left endpoint for the first piece, the right endpoint for the last piece, and the middle, (a+b)/2, for the other 4 pieces.

For the composite Simpson's rule, you divide the interval not into 6 pieces but into n pieces, where n is even, and get

    integral from a to b of f(x) dx ~ h/3 [f(a) + 4f(a+h) + 2f(a+2h) + 4f(a+3h) + 2f(a+4h) + ... + 4f(b-h) + f(b)],

where h = (b-a)/n. Note that the coefficients 4 and 2 alternate. The error in Simpson's rule is bounded in absolute value by h^4 (b-a) max{|f''''(x)|: x in (a,b)} / 180.

Suppose we want to write this function f in R, but run the composite Simpson's rule in C, and call the C function from R. The key component is a C function called call_R. call_R takes as one of its arguments a pointer to a function, which is passed into the C environment as a list. The number and type of the arguments to f are also passed to call_R.
As usual, when calling C from R, the results of the C function are in the end passed back to R not as a returned value, but through the argument list. It is typically easiest to divide the C program into three parts.

1) simp is the algorithm implementing the composite Simpson's rule; it doesn't deal with issues pertaining to the interface between C and R.
2) sfunc is just a shell that uses call_R to have R evaluate the function f.
3) dosimp is just the interface, which is called by R and which calls simp.

In simp.c,

static char *sfunction;

double simp(double (*func)(), double start, double stop, long n)
{
    double mult, x, t, t1, inc;
    long i;
    inc = (stop - start) / (double)n;
    x = start;
    t = func(x);
    mult = 4.0;
    for(i=1; i