1. dnorm and change of variables.
2. Matrices in C.
3. Python and MySQL references.
4. Bias in the sample SD.
5. Bias in sample variance.
6. GAM.

1. Brief review of dnorm and change of variables.

Why does dnorm(x, sd = s) = dnorm(x/s, sd = 1)/s? For example,

dnorm(1.2, 0, 5.7)           ## normal density with mean 0 and sd 5.7, evaluated at 1.2
dnorm(1.2/5.7, 0, 1) / 5.7   ## the same number, by the change of variables formula from calculus

Let f be the standard normal density. What is the integral from -inf to inf of f(x/b) dx?
Let y = x/b. Then dy = dx/b, so dx = b dy, and
integral from -inf to inf of f(x/b) dx = b * integral from -inf to inf of f(y) dy = b * 1 = b.
So f(x/b) integrates to b rather than 1, and dividing by b gives a density: g(x) = f(x/b)/b.
This is exactly why dnorm(x, sd = b) = dnorm(x/b, sd = 1)/b.

2. Matrices in C.

You pretty much have to pass matrices from R to C as vectors. Matrices in R are actually stored as long vectors, with the dimensions kept as an attribute. So you can pass a matrix just as you would a vector, with one caveat: the elements are stored column by column, not row by row!

a = matrix(1:6, ncol=3, byrow=T)
a
a[1]
a[2]
a[3]
a[1:6]

In mymatrix.c,

#include <R.h>
#include <Rmath.h>

void matmult(double *a, int *n1, int *m1, double *b, int *n2, int *m2, double *y)
{
    /* multiply the n1 x m1 matrix a by the n2 x m2 matrix b, putting the result in y */
    int i, j, k;
    double x;
    double a1[*n1][*m1];
    double b1[*n2][*m2];
    for(j = 0; j < *m1; j++){        /* unpack a column by column */
        for(i = 0; i < *n1; i++){
            a1[i][j] = a[j * *n1 + i];
        }
    }
    for(j = 0; j < *m2; j++){        /* unpack b column by column */
        for(i = 0; i < *n2; i++){
            b1[i][j] = b[j * *n2 + i];
        }
    }
    for(i = 0; i < *n1; i++){
        for(j = 0; j < *m2; j++){
            x = 0.0;
            for(k = 0; k < *m1; k++){
                x += a1[i][k] * b1[k][j];
            }
            y[i * *m2 + j] = x;      // y[i][j] = x; y is filled row by row
        }
    }
}

In R,

system("R CMD SHLIB mymatrix.c")
dyn.load("mymatrix.so")
a = matrix(1:6, ncol=3, byrow=T)      ## or equivalently a = c(1,4,2,5,3,6)
b = matrix(1:12, ncol=4, byrow=T)
y = rep(0, 8)
x = .C("matmult", as.double(a), as.integer(2), as.integer(3),
       as.double(b), as.integer(3), as.integer(4), as.double(y))
y = matrix(x[[7]], ncol=4, byrow=T)   ## byrow=T because the C code filled y row by row
y
a %*% b
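Once the shared object is loaded, it can be convenient to hide the bookkeeping (the dimension arguments and the row-by-row reshaping) inside a small R wrapper. The following is only a sketch; the function name matmultC is made up for illustration, and it assumes mymatrix.so has already been compiled and loaded as above.

matmultC = function(a, b){
    ## a and b are numeric matrices with ncol(a) == nrow(b).
    ## Calls the C routine matmult from mymatrix.c and reshapes the answer.
    stopifnot(ncol(a) == nrow(b))
    out = .C("matmult",
             as.double(a), as.integer(nrow(a)), as.integer(ncol(a)),
             as.double(b), as.integer(nrow(b)), as.integer(ncol(b)),
             y = double(nrow(a) * ncol(b)))
    ## the C code fills y row by row, so read it back with byrow = TRUE
    matrix(out$y, nrow = nrow(a), ncol = ncol(b), byrow = TRUE)
}
matmultC(a, b)   ## should agree with a %*% b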
3. Python and MySQL references.

For Python, another nice introduction is http://wiki.python.org/moin/BeginnersGuide . There are many nearly equivalent online references.

For MySQL, a good reference is King, T., Reese, G., Yarger, R., and Williams, H.E. (2002). Managing and Using MySQL: Open Source SQL Databases for Managing Information and Web Sites, 2nd ed. O'Reilly and Associates, Inc., Sebastopol, CA. Another is Sams Teach Yourself SQL in 10 Minutes, http://www.amazon.com/dp/0672325675 . Websites that you might like are
http://w3schools.com/php/php_mysql_intro.asp
http://www.tizag.com/mysqlTutorial .

4. Bias in the sample SD.

Given n iid observations from some population with mean µ and variance sigma^2, it is well known (and not hard to show) that the sample variance s^2 = Σ (Xi - Xbar)^2 / (n-1) is an unbiased estimate of sigma^2. But is s an unbiased estimate of sigma? What is its bias? Suppose the population is uniform(0,1). If we were writing code to answer this question in R, we might do the following.

## R code to estimate the bias in the sample sd of n uniforms on (0,1),
## where n goes from 2 to nmax = 100. For each such n, we will generate n uniforms,
## calculate the sample sd, and repeat m times, where m = 10000.
## The true variance of U(0,1) is E(X^2) - [E(X)]^2 = 1/3 - 1/4 = 1/12,
## so sigma = 1/sqrt(12).
nmax = 100
n = 2:nmax
m = 10000
bias2 = rep(0, 99)   ## to be general, 99 should be changed to nmax - 1
for(i in 1:99){
    s = rep(0, m)
    for(j in 1:m){
        x = runif(n[i])
        s[j] = sd(x)
    }
    bias2[i] = mean(s) - 1/sqrt(12)
    cat(i, " ")
}
plot(n, bias2, xlab="n", ylab="bias", ylim=c(-.003,.001),
     main="bias in SD of n uniforms", type="l")
abline(h=0, lty=2)
## There's a lot of scatter because m is so small; the plot is mostly noise.

In sd2.c,

#include <R.h>
#include <Rmath.h>
#include <math.h>

double sd2(double *x, int j)
{
    /* sample standard deviation of x[0], ..., x[j-1] */
    int i;
    double a1, xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++){
        xbar += x[i];
    }
    xbar = xbar / j;
    a1 = 0.0;
    for(i = 0; i < j; i++){
        a1 += pow(x[i] - xbar, 2);
    }
    a1 = pow(a1/(j-1), 0.5);
    return(a1);
}

double mean2(double *x, int j)
{
    /* mean of x[0], ..., x[j-1] */
    int i;
    double xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++){
        xbar += x[i];
    }
    xbar = xbar / j;
    return(xbar);
}

void bias3(int *nmax, int *m, double *b)
{
    /* The output will be b, the bias in the sd for each n, from n = 2 up to nmax.
       b[0] is left at 0, since the sample sd requires n >= 2. */
    int i, j, k;
    double x[*nmax];
    double s[*m];
    GetRNGstate();
    for(i = 2; i < *nmax+1; i++){
        for(j = 0; j < *m; j++){
            for(k = 0; k < i; k++){
                x[k] = runif(0, 1);
            }
            s[j] = sd2(x, i);
        }
        b[i-1] = mean2(s, *m) - pow(12, -.5);
        Rprintf("%i ", i);   /* print progress, like cat(i, " ") in the R version */
    }
    PutRNGstate();
}

In R,

system("R CMD SHLIB sd2.c")
dyn.load("sd2.so")
nmax = 100
m = 100000
a = .C("bias3", as.integer(nmax), as.integer(m), b = double(nmax))
bias1 = a$b
plot(c(2:nmax), bias1[2:nmax], xlab="n", ylab="bias", ylim=c(-.003,.001),
     main="bias in SD of n uniforms", type="l")
abline(h=0, lty=2)

5. Bias in sample variance?

The sample variance s^2 = Σ (Xi - Xbar)^2 / (n-1) is an unbiased estimate of sigma^2, regardless of n and the distribution of the Xi. Last time we investigated the bias in s as an estimate of sigma, supposing the random variables Xi are uniform(0,1). What would it look like if we repeated this, but looked at s^2 instead of s?

In sd3.c,

#include <R.h>
#include <Rmath.h>
#include <math.h>

double s2(double *x, int j)
{
    /* sample variance of x[0], ..., x[j-1] */
    int i;
    double a1, xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++){
        xbar += x[i];
    }
    xbar = xbar / j;
    a1 = 0.0;
    for(i = 0; i < j; i++){
        a1 += pow(x[i] - xbar, 2);
    }
    a1 = a1/(j-1);   // I changed this from a1 = pow(a1/(j-1), 0.5).
    return(a1);
}

double mean2(double *x, int j)
{
    /* mean of x[0], ..., x[j-1] */
    int i;
    double xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++){
        xbar += x[i];
    }
    xbar = xbar / j;
    return(xbar);
}

void bias4(int *nmax, int *m, double *b)
{
    /* The output will be b, the bias in s^2 for each n, from n = 2 up to nmax. */
    int i, j, k;
    double x[*nmax];
    double s[*m];
    GetRNGstate();
    for(i = 2; i < *nmax+1; i++){
        for(j = 0; j < *m; j++){
            for(k = 0; k < i; k++){
                x[k] = runif(0, 1);
            }
            s[j] = s2(x, i);                  // I changed this from sd2 to s2.
        }
        b[i-1] = mean2(s, *m) - 1.0/12.0;     // I changed pow(12, -.5) to 1.0/12.0.
        Rprintf("%i ", i);   /* print progress */
    }
    PutRNGstate();
}

In R,

system("R CMD SHLIB sd3.c")
dyn.load("sd3.so")
nmax = 100
m = 100000
a = .C("bias4", as.integer(nmax), as.integer(m), b = double(nmax))
bias1 = a$b
plot(c(2:nmax), bias1[2:nmax], xlab="n", ylab="bias", ylim=c(-.003,.001),
     main="bias in sample var of n uniforms", type="l")
abline(h=0, lty=2)
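As a quick cross-check, the same bias estimates can be computed entirely in R for a single sample size, without any C code. This is only a sketch, and the particular choices below (n = 10, m = 100000, set.seed(1)) are arbitrary. By Jensen's inequality E(s) < sigma, since the square root is strictly concave and s^2 is not constant, so the estimated bias of s should come out slightly negative, while the estimated bias of s^2 should be near 0.

## pure-R sanity check of the bias in s and in s^2, for uniform(0,1) data
set.seed(1)
n = 10
m = 100000
sims = replicate(m, { x = runif(n); c(sd(x), var(x)) })   ## 2 x m matrix
mean(sims[1,]) - 1/sqrt(12)   ## estimated bias of s: slightly negative
mean(sims[2,]) - 1/12         ## estimated bias of s^2: near 0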
6. Generalized additive models with kernel regression.

Suppose you have a response Y and 3 explanatory variables, X1, X2, and X3. A generalized additive model is of the form
Y = f1(X1) + f2(X2) + f3(X3) + eps,
where eps is noise (iid, mean 0, variance sigma^2). The idea is that the functions f1, f2, and f3 don't need to be linear or of some particular parametric form. You can estimate these functions and then see if they look linear, quadratic, exponential, or whatever.

R has a function gam() in the mgcv package, which fits these functions using penalized regression splines. However, one could also fit these functions by kernel regression, using your C kernel regression function, or more simply using ksmooth(), which does basically the same thing.

Consider the following example. We will generate some fake data, and then fit Y = f1(x1) using kernel regression. Then we will take the residuals from that fit, where residual[i] = Y[i] - f1(x1[i]), fit the residuals as a function of x2 by kernel regression, and keep going. You might want to do something like this for your projects.

x1 = runif(100)*10
x2 = runif(100)*20
x3 = runif(100)*30
y = x1^3 - 10*x1^2 + 5*x1 - .1*x2^3 + 2*x2^2 + 4*x2 + 12 + 7*x3 + 14*rnorm(100)
plot(x1, y)
c1 = ksmooth(x1, y, bandwidth = bw.nrd(x1), x.points = x1)
lines(c1)
fit1 = rep(0, 100)
fit1[order(x1)] = c1$y   ## ksmooth returns its output sorted by x, so undo the sorting
y1 = y - fit1            ## these are the residuals from this first fit
## Note that x1 is not ordered, so if you do lines(x1,y1) it won't look good,
## because lines() joins the first point to the second, the second to the third, etc.
## But you can do points(x1,fit1,pch=2) to check that things are going ok.
plot(x2, y1, ylab="residuals after first kernel smoothing")
c2 = ksmooth(x2, y1, bandwidth = bw.nrd(x2), x.points = x2)
lines(c2)
fit2 = fit1              ## just to create a vector of the right length
fit2[order(x2)] = c2$y   ## to check, try points(x2,fit2,pch=2)
y2 = y1 - fit2           ## residuals after the 2nd fit
plot(x3, y2, xlab="x3", ylab="residuals after 2 kernel smoothings")

## Now suppose we think these residuals look like a linear function of x3,
## so we decide to fit a line for this term.
c3 = lsfit(x3, y2)
abline(c3)
c3$coef[2]   ## compare with 7
plot(x3, c3$resid, ylab="residuals after 2 kernels and a line")
par(mfrow=c(1,2))
hist(y2, nclass=20, xlab="residuals after 2 kernels", main="",
     xlim=range(c(y2, c3$resid)))
hist(c3$resid, nclass=20, xlab="residuals after 2 kernels and a line", main="",
     xlim=range(c(y2, c3$resid)))
par(mfrow=c(1,1))

## Or, we could keep going and fit the function of x3 by kernel regression too.
plot(x3, y2, xlab="x3", ylab="residuals after 2 kernel smoothings")
c4 = ksmooth(x3, y2, bandwidth = bw.nrd(x3), x.points = x3)
lines(c4)
fit3 = fit1
fit3[order(x3)] = c4$y
y3 = y2 - fit3
plot(x3, y3, ylab="residuals after 3 kernel smoothings")
par(mfrow=c(1,2))
hist(c3$resid, nclass=20, xlab="residuals after 2 kernels and a line", main="",
     xlim=range(c(y3, c3$resid)))
hist(y3, breaks=20, xlab="residuals after 3 kernels", main="",
     xlim=range(c(y3, c3$resid)))

## It's interesting to look at, and interpret, the estimates of f1, f2, and f3, too.
par(mfrow=c(1,3))
plot(c1, type="l", xlab="x1", ylab="estimate of f1(x1)")
plot(c2, type="l", xlab="x2", ylab="estimate of f2(x2)")
plot(c4, type="l", xlab="x3", ylab="estimate of f3(x3)")
par(mfrow=c(1,1))

What might be a good idea for your project is to take some regression-style data, fit a generalized additive model by kernel regression (perhaps fitting terms that look linear or quadratic with linear or quadratic functions), and use C to get confidence intervals for your fitted curves. Alternatively, if you have a natural "factor"-style response variable and would like to do classification using several explanatory variables, using the methods in Irizarry's book, that would also be good.
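If you do take the kernel-GAM route for your project, it may help to package the steps above into a function that can be reused on other data. The following is only a sketch of one possible way to do it; the function name kernel_gam and its interface are invented for illustration, and it simply repeats the fit-then-take-residuals steps above for each column of a predictor matrix, using ksmooth() with bw.nrd() bandwidths.

kernel_gam = function(y, X){
    ## X is a numeric matrix with one column per explanatory variable.
    ## Fits y additively, one ksmooth() term at a time, in the order of the columns,
    ## each term being fit to the residuals from the previous terms.
    n = length(y)
    p = ncol(X)
    fits = matrix(0, n, p)      ## fits[,k] will hold the estimate of fk at the data points
    r = y                       ## running residuals
    curves = vector("list", p)
    for(k in 1:p){
        ck = ksmooth(X[,k], r, bandwidth = bw.nrd(X[,k]), x.points = X[,k])
        curves[[k]] = ck                ## ck$x is sorted; ck$y is the fitted curve
        fk = rep(0, n)
        fk[order(X[,k])] = ck$y         ## put fitted values back in the original order
        fits[,k] = fk
        r = r - fk                      ## residuals after this term
    }
    list(fits = fits, resid = r, curves = curves)
}

## For example, with the fake data generated above:
g = kernel_gam(y, cbind(x1, x2, x3))
plot(x3, g$resid, ylab="residuals after 3 kernel smoothings")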