1. Reading in from a file.
2. Generalized additive models.
3. Bias in SD.

1. Reading in from a file.

You first need a file pointer:

FILE *fp;

FILE is a structure that holds information about the file. Use FILE * to pass that information around by reference.

fp = fopen(filename, mode);

filename is a string that holds the name of the file on disk (including a path like /cs/course if necessary). mode is a string representing how you want to open the file. Most often you'll open a file for reading ("r") or writing ("w").

In data.txt,

1 14 5 37 7 2 1 100

In readin1.c,

#include <stdio.h>
#include <math.h>
#include <R.h>

void readssq (int *n){
    /* Read in the first n terms in "data.txt" and compute the sum of
       squares. Write the output to "output.txt". */
    FILE *myfile1, *myfile2;
    int i;
    float a1, a2;
    a2 = 0.0;
    myfile1 = fopen("data.txt", "r");
    myfile2 = fopen("output.txt", "w");
    for(i = 0; i < *n; i++){
        fscanf(myfile1, "%f", &a1);
        a2 += pow(a1, 2);
    }
    fclose(myfile1);
    Rprintf("The sum of squares is %f \n", a2);
    fprintf(myfile2, "%f \n", a2);
    fclose(myfile2);
}

In R,

system("R CMD SHLIB readin1.c")
dyn.load("readin1.so")
a = .C("readssq", as.integer(5))

The input file that we are opening for reading ("r") must already exist, but the output file we are opening for writing ("w") does not have to exist. If it doesn't, it will be created. If it does already exist, its previous contents will be thrown away and lost. Use mode "a", for append, to write to the end of a file without losing its contents.

In fprintf,
integer values use %d,
floating point values, i.e. decimals, use %f,
single characters use %c,
character strings use %s.
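Two things the example above glosses over are error checking and reading a file of unknown length: fopen() returns NULL if the file cannot be opened, and fscanf() returns the number of items it successfully converted, so you can loop until that stops being 1. Here is a minimal standalone C sketch illustrating both, plus append mode; it is not part of the class code, and the file names data.txt and sums.txt are just placeholders.

#include <stdio.h>
#include <math.h>

int main(void){
    FILE *in, *out;
    float a1, a2 = 0.0;

    in = fopen("data.txt", "r");
    if(in == NULL){                       /* fopen returns NULL on failure */
        fprintf(stderr, "Cannot open data.txt.\n");
        return 1;
    }
    while(fscanf(in, "%f", &a1) == 1){    /* read until fscanf stops converting */
        a2 += pow(a1, 2);
    }
    fclose(in);

    out = fopen("sums.txt", "a");         /* "a" appends, keeping old contents */
    if(out == NULL){
        fprintf(stderr, "Cannot open sums.txt.\n");
        return 1;
    }
    fprintf(out, "%f\n", a2);
    fclose(out);
    return 0;
}

Since this one calls no R functions, you can compile it directly, with something like cc readall.c -lm -o readall, rather than with R CMD SHLIB.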
2. Generalized additive models with kernel regression.

Suppose you have a response Y and 3 explanatory variables, X1, X2, and X3. A generalized additive model is of the form

Y = f1(X1) + f2(X2) + f3(X3) + eps,

where eps = noise (iid, mean 0, variance sigma^2). The idea is that the functions f1, f2, and f3 don't need to be linear or of some particular parametric form. You can estimate these functions and then see if they look linear, quadratic, exponential, or whatever. R has a function gam() in the mgcv package, which fits these functions using penalized regression splines. However, one could also fit these functions by kernel regression, using your C kernel regression function, or more simply using ksmooth(), which does basically the same thing.

Consider the following example. We will generate some fake data and then fit Y = f1(x1) by kernel regression. Then we will take the residuals from that fit, where residual[i] = Y[i] - f1(x1[i]), fit those residuals as a function of x2 by kernel regression, and keep going. You might want to do something like this for your projects.

x1 = runif(100)*10
x2 = runif(100)*20
x3 = runif(100)*30
y = x1^3 - 10*x1^2 + 5*x1 - .1*x2^3 + 2*x2^2 + 4*x2 + 12 + 7*x3 + 14*rnorm(100)
plot(x1,y)
c1 = ksmooth(x1, y, bandwidth = bw.nrd(x1), x.points = x1)
lines(c1)
fit1 = rep(0,100)
fit1[order(x1)] = c1$y  ## ksmooth returns its fitted values sorted by x
y1 = y - fit1           ## these are the residuals from this first fit
## Note that x1 is not ordered, so if you do lines(x1,y1), it won't look good,
## because lines() joins the first point to the second point to the third, etc.
## But you can do points(x1,fit1,pch=2) to check that things are going ok.

plot(x2, y1, ylab="residuals after first kernel smoothing")
c2 = ksmooth(x2, y1, bandwidth = bw.nrd(x2), x.points = x2)
lines(c2)
fit2 = rep(0,100)
fit2[order(x2)] = c2$y  ## to check, try points(x2,fit2,pch=2)
y2 = y1 - fit2          ## residuals after 2nd fit

plot(x3, y2, xlab="x3", ylab="residuals after 2 kernel smoothings")
## Now suppose we think these residuals look like a linear function of x3,
## so we decide to fit a line for this term.
c3 = lsfit(x3, y2)
abline(c3)
c3$coef[2]  ## compare with the true coefficient, 7
plot(x3, c3$resid, ylab="residuals after 2 kernels and a line")
par(mfrow=c(1,2))
hist(y2, nclass=20, xlab="residuals after 2 kernels", main="",
    xlim=range(c(y2,c3$resid)))
hist(c3$resid, nclass=20, xlab="residuals after 2 kernels and a line", main="",
    xlim=range(c(y2,c3$resid)))
par(mfrow=c(1,1))

## Or, we could keep going and fit the function of x3 by kernel regression too.
plot(x3, y2, xlab="x3", ylab="residuals after 2 kernel smoothings")
c4 = ksmooth(x3, y2, bandwidth = bw.nrd(x3), x.points = x3)
lines(c4)
fit3 = rep(0,100)
fit3[order(x3)] = c4$y
y3 = y2 - fit3
plot(x3, y3, ylab="residuals after 3 kernel smoothings")
par(mfrow=c(1,2))
hist(c3$resid, nclass=20, xlab="residuals after 2 kernels and a line", main="",
    xlim=range(c(y3,c3$resid)))
hist(y3, nclass=20, xlab="residuals after 3 kernels", main="",
    xlim=range(c(y3,c3$resid)))

## It's interesting to look at, and interpret, the estimates of f1, f2,
## and f3, too.
par(mfrow=c(1,3))
plot(c1, type="l", xlab="x1", ylab="estimate of f1(x1)")
plot(c2, type="l", xlab="x2", ylab="estimate of f2(x2)")
plot(c4, type="l", xlab="x3", ylab="estimate of f3(x3)")

A great idea for your project might be to take some regression-style data, fit a generalized additive model by kernel regression (and/or fit terms that look linear or quadratic by linear or quadratic functions), and use C to get confidence intervals for your fitted curves.
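In case you want the C version rather than ksmooth(), here is a minimal sketch of a Nadaraya-Watson kernel smoother with a normal kernel, callable via .C() like the other examples in these notes. The function name kernreg_sketch and its argument layout are made up for this illustration, and since ksmooth() rescales its bandwidth internally (so that the kernel's quartiles fall at +/- 0.25*bandwidth), the two will not agree exactly for the same nominal bandwidth.

In kernreg_sketch.c,

#include <R.h>
#include <Rmath.h>

void kernreg_sketch(double *x, double *y, int *n, double *b,
                    double *g, int *m, double *est){
    /* Nadaraya-Watson estimate with a normal kernel: for each grid point
       g[j], est[j] = sum_i w_i y_i / sum_i w_i, w_i = dnorm((x_i - g_j)/b). */
    int i, j;
    double num, den, w;
    for(j = 0; j < *m; j++){
        num = 0.0;
        den = 0.0;
        for(i = 0; i < *n; i++){
            w = dnorm((x[i] - g[j]) / *b, 0.0, 1.0, 0);
            num += w * y[i];
            den += w;
        }
        est[j] = (den > 0.0) ? num/den : 0.0;  /* guard against 0 total weight */
    }
}

In R,

system("R CMD SHLIB kernreg_sketch.c")
dyn.load("kernreg_sketch.so")
g = seq(min(x1), max(x1), length=100)
a = .C("kernreg_sketch", as.double(x1), as.double(y), as.integer(100),
    as.double(bw.nrd(x1)), as.double(g), as.integer(100), est=double(100))
plot(x1, y)
lines(g, a$est, lty=2)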
3. Bias in SD.

Given n iid observations from some population with mean mu and variance sigma^2, it is well known that the sample variance

s^2 = Σ (Xi - Xbar)^2 / (n-1)

is an unbiased estimate of sigma^2. But is s an unbiased estimate of sigma? What is its bias? (Since the square root is concave, Jensen's inequality gives E(s) = E(sqrt(s^2)) < sqrt(E(s^2)) = sigma whenever s^2 is not constant, so we should expect the bias to be negative.) Suppose the population is uniform(0,1). If we were writing code to answer this question in R, we might do the following.

## R code to calculate the bias in the sample sd of n uniforms on (0,1),
## where n goes from 2 to nmax = 100. For each such n, we will generate
## n uniforms, calculate the sample sd, and repeat m times, where m = 1000.
## The true variance of U(0,1) = E(X^2) - [E(X)]^2 = 1/3 - 1/4 = 1/12,
## so sigma = 1/sqrt(12).
nmax = 100
n = 2:nmax
m = 1000
bias2 = rep(0, nmax-1)
for(i in 1:(nmax-1)){
    s = rep(0,m)
    for(j in 1:m){
        x = runif(n[i])
        s[j] = sd(x)
    }
    bias2[i] = mean(s) - 1/sqrt(12)
    cat(i," ")
}
plot(n, bias2, xlab="n", ylab="bias", ylim=c(-.003,.001),
    main="bias in SD of n uniforms", type="l")
abline(h=0, lty=2)
## There's a lot of scatter, because m is so small. It's mostly noise.

In sd2.c,

#include <R.h>
#include <Rmath.h>
#include <math.h>

double sd2(double *x, int j){
    /* sample sd of the first j entries of x */
    int i;
    double a1, xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++){
        xbar += x[i];
    }
    xbar = xbar / j;
    a1 = 0.0;
    for(i = 0; i < j; i++){
        a1 += pow(x[i] - xbar, 2);
    }
    a1 = pow(a1/(j-1), 0.5);
    return(a1);
}

double mean2(double *x, int j){
    /* mean of the first j entries of x */
    int i;
    double xbar;
    xbar = 0.0;
    for(i = 0; i < j; i++){
        xbar += x[i];
    }
    xbar = xbar / j;
    return(xbar);
}

void bias3(int *nmax, int *m, double *b){
    /* The output is b, the bias in the sd for each n, from n = 2 up to nmax. */
    int i, j, k;
    double x[*nmax];
    double s[*m];
    GetRNGstate();
    for(i = 2; i < *nmax+1; i++){
        for(j = 0; j < *m; j++){
            for(k = 0; k < i; k++){
                x[k] = runif(0,1);
            }
            s[j] = sd2(x, i);
        }
        b[i-1] = mean2(s, *m) - pow(12, -.5);
        Rprintf("%d ", i);  /* progress report, like cat(i," ") above */
    }
    PutRNGstate();
}

In R,

system("R CMD SHLIB sd2.c")
dyn.load("sd2.so")
nmax = 100
m = 100000
a = .C("bias3", as.integer(nmax), as.integer(m), b = double(nmax))
bias1 = a$b
plot(c(2:nmax), bias1[2:nmax], xlab="n", ylab="bias", ylim=c(-.003,.001),
    main="bias in SD of n uniforms", type="l")
abline(h=0, lty=2)

A good way to test simulation code like this is to run it on a case where the answer is known exactly; see the sketch after the notes below.

## Coming up:
## Making R packages.
## C++ examples?
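For a normal population, unlike the uniform case above, the bias of s has a closed form: E(s) = sigma * c4(n), where c4(n) = sqrt(2/(n-1)) Γ(n/2) / Γ((n-1)/2). The following standalone sketch (not part of the class code) tabulates the exact normal-theory bias (c4(n) - 1) sigma; if you change runif(0,1) to rnorm(0,1) in bias3 and subtract sigma = 1 instead of 1/sqrt(12), the simulated biases should match this table.

#include <stdio.h>
#include <math.h>

int main(void){
    /* Exact bias of the sample sd for n iid N(0, sigma^2) observations:
       E(s) = sigma * c4(n), c4(n) = sqrt(2/(n-1)) Gamma(n/2) / Gamma((n-1)/2).
       lgamma() is used to keep the Gamma ratio numerically stable. */
    int n;
    double c4, sigma = 1.0;
    for(n = 2; n <= 10; n++){
        c4 = sqrt(2.0/(n-1)) * exp(lgamma(n/2.0) - lgamma((n-1)/2.0));
        printf("n = %2d   bias = %f\n", n, (c4 - 1.0)*sigma);
    }
    return 0;
}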