Stat 404 day 4¶

notes:¶

hw2 due thursday

hw3 posted on the website (due Thurs Feb 15) no explanations needed, just output and code

question in C, a series that converges to ln(2)
kernel density estimation in C. same earthquake dataset but a slightly different sample.

Project dates and hw3.
Sum of squared differences between observations in C.
Sum of squared differences from observations for each gridpoint, in C.
Kernel regression.
Final project and oral report tips.
Vectors and matrices in C.
Calling C functions from C.
Running C from terminal.
Reading in from a file in C.
Pointers in C.

1. Dates for final projects and hw3.¶

I will now randomly assign people to presentation times. If you want to change oral presentation dates and times with another person, feel free but let me know.

I will use sample(), which randomly permutes the elements of a vector.

a = c("a","b","c","d")
  sample(a)
x = c("Paul", "Kaleb", "Jingyi",
"Lilian", "Liyu", "Chang", 
"John", "Yi", "Yu", 
"Anna", "Matthew", "Lihui", 
"Christopher", "Henry", "Cole", 
"Kjell", "Muxin", "Janella", 
"Annie", "Asa", "Victor", 
"Ah Sung")

n = length(x)
y = sample(x)
for(i in 1:n){
if(i == 1)  cat("\n\n  Tue Mar 6\n")
if(i == 12) cat("\n\n  Tue Mar 13 \n")
cat(i,". ",y[i],"\n")
}

Attendance will be mandatory the last 2 classes of the quarter. Hw2 is due Thu by email to me, and hw3 is currently on the course homepage. The RgoogleMaps package is nice for adding maps to plots, by the way.

For compiling C in Windows, these links might be useful:

2. Summing the squared differences between observations in C.¶

In sumsq.c:

#include <R.h>
	#include <Rmath.h>

	void ss2 (double *x, int *n, double *y)
	/* x will be the vector of data of length n,
	   and y will be a vector of squared differences from obs i
	   to the other n-1 observations.
	*/
	{
		int i,j;
		double a;
		for(i = 0; i < *n; i++){
			a = 0.0;
			for(j=0; j < *n; j++){
				a += pow(x[i] - x[j], 2);
			}
			y[i] = a;
		}
	}

in R:

system("R CMD SHLIB sumsq.c")
	dyn.load("sumsq.so")
	sum3 = function(data2){
		n = length(data2)
		a = .C("ss2", as.double(data2), as.integer(n),
			y=double(n))
		a$y  ## or equivalently a[[3]]
	}

	b = c(1,3,4)
        sum3(b)

	n2 = c(100, 1000, 2000, 3000, 5000, 7000, 8000, 10000)
	t2 = rep(0,8)
	for(i in 1:8){
	  b = runif(n[i])
	  timea = Sys.time()
	  d = sum3(b)
	  timeb = Sys.time()
	  t2[i] = timeb-timea
	  cat(n2[i]," ")
	}
	par(mfrow=c(1,2))
	plot(n2,t2,ylab="time (sec)")
	
	## Now try the same thing in R, without C.
	
	sum4 = function(data2){
	  n = length(data2)
	  x = rep(0,n)
	  for(i in 1:n){
	    for(j in 1:n){
	      x[i] = x[i] + (data2[i] - data2[j])^2
	      }
	    }
	  x
	  }

	b = c(1,3,4)
        sum4(b)

	n = c(1:8)*100
	t3 = rep(0,8)
	for(i in 1:8){
	  b = runif(n[i])
	  timea = Sys.time()
	  d = sum4(b)
	  timeb = Sys.time()
	  t3[i] = timeb-timea
	cat(n[i]," ")
	}
	plot(n,t3,ylab="time (sec)")

t3[8] = 1.173. t2[8] = 0.110.

These are for n = 800 and n = 10,000 respectively.

10000^2/6.51e-04 = 154 billion operations per sec in C.

800^2/1.151 = 556 thousand of the same exact operations per sec in R.

154000000000 / 556000 = 280000 times faster in C.

cat(signif(n2^2/t2/(n^2/t3),4))

3. Sum of squared differences from observations for each gridpoint.¶

In sumsgrid.c:

#include <R.h>
	#include <Rmath.h>

	void ssg2 (double *x, int *n, double *g, int *m, double *y)
	/* x will be the vector of data of length n,
	   g will be the vector of m gridpoints,
	   and y will be a vector of length m, of squared differences 
	   from gridpoint i to the n observations.
	*/
	{
		int i;
		int j;
		double a;
		for(i = 0; i < *m; i++){
			a = 0.0;
			for(j=0; j < *n; j++){
				a += pow(g[i] - x[j], 2);
			}
			y[i] = a;
		}
	}

in R:

system("R CMD SHLIB sumsgrid.c")
	dyn.load("sumsgrid.so")
	sum4 = function(data2, grid2){
		n = length(data2)
		m = length(grid2)
		a = .C("ssg2", as.double(data2), as.integer(n),
			as.double(grid2), as.integer(m),
			y=double(m))
		a$y  ## or equivalently a[[5]]
	}

	b = c(1,3,4)
	gr = c(0,1,2,3,4,5)
       sum4(b,gr)

This function ssg2 will be similar to your kernel density code in hw3, except that you will also have a bandwidth as an additional input, and you will replace the line a += pow(g[i] - x[j], 2) with the kernel density estimate.

4. Kernel regression.¶

Given vectors X and Y of length n, suppose you want to look at the relationship between X and Y, without assuming a linear model or other model. You could just plot(x,y), but perhaps you want to look at a smooth summary of the relationship between X and Y.

Suppose $Y_i = f(X_i) + eps_i$, for i = 1 to n, where f is some smooth function, and where $eps_i$ are iid with mean 0.

The Nadaraya-Watson or kernel regression estimate of f is $\hat{f}(x) = \sum [Y_i k(x-X_i; b)] / \sum [k(x-X_i; b)]$, where k is some kernel function, b is some bandwidth, and the sums are from i = 1 to n.

Note that $k(x-X_i; b) = k{(x-X_i)/b)}/b$. [for instance, compare dnorm(7.0/33.3)/33.3 with dnorm(7.0, sd=33.3).]

A commonly used kernel is the Epanechnikov kernel, $k(u) = 3/4 (1-u^2)$, if $|u| \le 1$, and 0 otherwise.

This was proposed in Epanechnikov, V.A. (1969). "Non-parametric estimation of a multivariate probability density". Theory of Probability and its Applications 14: 153â€“158. and shown to have minimal variance under certain conditions.

in epanech.c:

#include <R.h>
#include <Rmath.h>

/* x and y will be the data vectors each of length n, 
b is the bandwidth,
g will be the vector of m gridpoints where we want to calculate the 
kernel regression estimates, est, which is thus also of length m. 
*/
void kernreg2 (double *x, double *y, int *n, double *b,
	       double *g, int *m, double *est)
{
    int i,j;
    double a1,a2,c;
    for(i = 0; i < *m; i++){
	a1 = 0.0;
	a2 = 0.0;
	for(j=0; j < *n; j++){
	    if(fabs((x[j] - g[i])/ *b) <= 1.0){
		c = 0.75 * (1.0-pow((x[j]-g[i])/ *b,2.0))/ *b;
		a1 += y[j] * c;	
		a2 += c;
	    }
	}
	if(a2 > 0.0) est[i] = a1/a2; 
	else est[i] = 0.0;
	Rprintf("I'm on gridpoint %f and index %d right now.\n", g[i], i);
    }
}

// Watch out for /* in your code. This is treated as a comment started.
// Use / *b instead of /*b.
// These double slashes can be used in C to comment an individual line, 
// like # in R.

In R:

system("R CMD SHLIB epanech.c")
	dyn.load("epanech.so")
	n = 1000
	x = runif(n)*4
	y = 0.1*x^5 - 2*x^3 + 10*x + rnorm(n)
	bw = bw.nrd(x)
	m = 100
	g2 = seq(0,4,length=m)
## This time, I'm going to call the C function directly, 
## without making a separate R function to call it.
	a <- .C("kernreg2", as.double(x), as.double(y), as.integer(n), 
	as.double(bw), as.double(g2), as.integer(m), est = double(m))
	plot(c(0,4),c(-3,17),type="n",xlab="x",ylab="y")
	points(x,y,pch=".",col="red")
	lines(g2,a$est,col="orange")
	lines(g2,.1*g2^5 - 2*g2^3 + 10*g2,col="blue")
	legend("topleft",lty=c(1,1),col=c("orange","blue"),c("estimate","truth"))

Here, bw ~ 0.3. Try it again with bw = .01 or bw = 2. The Nadaraya-Watson estimator tends to be biased at the edges, especially when the bandwidth is large.

Play around with a[[1]], a[[2]], etc. C functions called by R output back to R a list containing all the arguments passed to them, though here the last argument, a$est, is changed of course after running the function kernreg2.

Note that, when calling your C function, the order of the arguments matters, and their type matters, but their names don't matter. For instance, here in R I called the bandwidth bw, but in C I called it b. Same with g2 and g.

5. Final project and oral report tips.¶

For the oral projects.

Rule 1: Do not look at me when you are talking!¶

Rule 2: 10-12 minutes per oral report, plus questions at the end.¶

I will cut you off if you go over 12 min. You can have someone in the audience help you with the time.

Rule 3: Everyone must be respectful and quiet during other people's talks.¶

You can ask clarifying questions but keep deep questions until the end.

Rule 4: Send me a pdf version of your slides by 6pm¶

the night before your talk, to frederic@stat.ucla.edu. That way, I can set up the talks in order ahead of time and we won't have to waste time in class waiting for each person to connect their laptop to the projector. About 8-10 slides seems right, though it's fine with me if you want fewer or more.

Rule 5: Speak very slowly in the beginning. Give us a sense of your data.¶

Assume that the listener knows what the statistical methods you are using are, but knows nothing about the subject matter. Tell us what the methods say about your data. Emphasize the results more than the methods. Make sure to go slowly and clearly in the start so that the listener really understands what your data are.

Rule 6: Speculate and generalize but use careful language.¶

Say "It seems" or "appears" rather than "is" when it comes to speculative statements or models. For example, you might say "The residuals appear approximately normal" or "a linear model seems to fit well" but not "The residuals are normal" or "The data come from a linear model".

Rule 7: Start with an introduction¶

explaining what your data are, how you got them, and why they are interesting (roughly 3 minutes), then show your results as clearly as possible, with figures preferred (roughly 5 minutes), and then conclude (roughly 2 minutes). In your conclusion, mention the limitations of your analysis and speculate about what might make a future analysis better, if you had infinite time. This might include collecting more data, or getting data on more variables, as well as more sophisticated statistical methods.

For your written reports, rules 5-7 apply again. Replace "minutes" with "pages". Have just the text in the beginning, and then the figures at the end. Email your pdf document to me, at frederic@stat.ucla.edu .

6. Vectors and matrices in C.¶

You can just say double x[100] or double x[25][4], for example. In mypi2.c,

/* Calculate the first 100 terms in the approximation of pi. */
  #include <R.h>
  #include <Rmath.h>
  void pi100 (double *y){
    int i;
    double x[100];

    x[0] = 1.0;
    y[0] = sqrt(6.0);
    for(i = 1; i < 100; i++) {
      x[i] = x[i-1] + pow(i+1.0,-2.0);
      y[i] = sqrt(6.0 * x[i]);
    }
  }

In R:

system("R CMD SHLIB mypi2.c")
    dyn.load("mypi2.so")
    a = .C("pi100",y=double(100))
    plot(1:100,a$y,type="l",xlab="n",ylab="Approx. of pi")

In mypi3.c:

/* Calculate 1*2*3 + 4*5*6 + ... + 28*29*30. */
  #include <R.h>
  #include <Rmath.h>
  void weirdsum (double *y){
    int i,j;
    double x[10][3];
    *y = 0.0;
    for(i = 0; i < 10; i++) {
      for(j = 0; j < 3; j++){
        x[i][j] = 3*i + j + 1.0;
      }
      *y += x[i][0] * x[i][1] * x[i][2];
    }
  }

In R:

system("R CMD SHLIB mypi3.c")
    dyn.load("mypi3.so")
    y = 1.0
    a = .C("weirdsum",as.double(y))
    a

If you want x to be an n by m matrix, where n and m are inputs, you can just do, for instance, in mypi4.c:

/* Calculate 1*2*3*...*m + (m+1)(m+2)...(2m) + ... + 
[(n-1)m+1][(n-1)m+2]...(nm). */
  #include <R.h>
  #include <Rmath.h>
  void pi103 (double *y, int *n, int *m){
    int i,j;
    double x[*n][*m];
    double a;
    
    /* First define the matrix. */
    for(i = 0; i < *n; i++) {
	for(j = 0; j < *m; j++){
	    x[i][j] = *m*i + j + 1.0;
	}
    }
    
    /* Now multiply each row and add the product to y. */
    *y = 0.0;
    for(i = 0; i < *n; i++){
	a = 1.0;
	for(j = 0; j < *m; j++){
	    a *= x[i][j];
	}
	*y += a;
      }
    }

In R:

system("R CMD SHLIB mypi4.c")
    dyn.load("mypi4.so")
    y = 1.0
    a = .C("pi103",as.double(y),as.integer(10),as.integer(3))
    a[[1]]
    ## to check, 
    prod(1:3)+prod(4:6)+prod(7:9)+prod(10:12)+
    prod(13:15)+prod(16:18)+prod(19:21)+prod(22:24)+
prod(25:27)+prod(28:30)
    a = .C("pi103",result75=as.double(y),as.integer(8),as.integer(4))
    a[[1]]
    prod(1:4)+prod(5:8)+prod(9:12)+prod(13:16)+
    prod(17:20)+prod(21:24)+prod(25:28)+prod(29:32)

Note that you have to say "y = as.double(y)" rather than just as.double(y) if you want to be able to call the result with a$y instead of a[[1]].

To do this in R, you could do

n = 8
  m = 4
  x = matrix(1:(n*m),byrow=T,ncol=m)
  y = apply(x,1,prod)
  sum(y)

What if you want to define a vector of length n, or a matrix with a dimension n, and n is not an input or a known integer but instead something that must be calculated inside your function, or something that changes as you go through your function?

Then you need to do dynamic memory allocation. We will hopefully get to this later.

7. Calling C functions from C.¶

In mysd1.c:

/* Create a 10 x 4 matrix of integers and compute the sample SD of each row. */
  /* Row 1 will be (1, 4, 9, 16). Row 2 will be (25, 36, 49, 64), etc. */

  #include <R.h>
  #include <Rmath.h>
  
  double sd2(double *x, int n){
    int i;
    double xbar, b;
    xbar = 0.0;
    for(i = 0; i < n; i++){
      xbar += x[i];
    }
    xbar = xbar/n;
    b = 0.0; // b will be the sum of squared deviations
    for(i = 0; i < n; i++){
      b += pow(x[i] - xbar, 2);
    }
    return(pow(b/(n-1),0.5));
    }

  void rowsd (double *y){
    int i,j;
    double a;
    double x[10][4];
    for(i = 0; i < 10; i++) {
      for(j = 0; j < 4; j++){
        x[i][j] = pow(4*i + j + 1.0, 2);
      }
      y[i] = sd2(x[i], 4);
    }
  }

In R:

system("R CMD SHLIB mysd1.c")
dyn.load("mysd1.so")
a = .C("rowsd",y=double(10))
a$y

## to check it,
sd(c(1,4,9,16))
sd(c(25,36,49,64))

8. Running C from terminal.¶

In mypi5.c:

#include <stdio.h>
  #include <math.h>

  int main ( ){
    int i;
    float x,y,h;
    printf("\n How many years old are you? ");
    scanf("%d", &i);
    printf("\n How long is the first side of your right triangle? ");
    scanf("%f", &x);
    printf("\n How long is the second side of your right triangle? ");
    scanf("%f", &y);
    h = sqrtf(x*x + y*y);
    printf("\n You are %d years old and the hypotenuse is %f . \n", i, h);
    return(i);
  }

In terminal on mac (or powershell on PC),
  cd Desktop ,
  gcc mypi5.c -o mypi5.out  
  ./mypi5.out

9. Reading in from a file.¶

You first need a file pointer

FILE *fp;

FILE is a structure that holds information about the file. Use FILE * to pass the information around by reference.

fp = fopen(filename, mode);

filename is a string that holds the name of the file on disk (including a path like /cs/course if necessary). mode is a string representing how you want to open the file. Most often you'll open a file for reading ("r") or writing ("w").

In data.txt:

1 14 5 37 7 2 1 100

In readin1.c:

#include <stdio.h>
  #include <R.h>
  #include <Rmath.h>

  void readssq (int *n){
    /* Read in the first n terms in "data.txt"
    and compute the sum of squares.
    Write the output to "output.txt" */
    FILE *myfile1, *myfile2;
    int i;
    float a1, a2;
    
    a2 = 0.0;
    myfile1 = fopen("data.txt", "r");
    myfile2 = fopen("output.txt", "w");
    for(i = 0; i < *n; i++){
	fscanf(myfile1, "%f", &a1);
	a2 += pow(a1, 2);
    }
    fclose(myfile1);
    Rprintf("The sum of squares is %f \n", a2);
    fprintf(myfile2, "%f \n", a2);
    fclose(myfile2);
  }

In R:

system("R CMD SHLIB readin1.c")
dyn.load("readin1.so")
a = .C("readssq",as.integer(5))

The input file that we are opening for reading ("r") must already exist, but the output file we are opening for writing ("w") does not have to exist.

If it doesn't, it will be created.

If this output file does already exist, its previous contents will be thrown away and lost.

Use "a" for append to write something to the end of a file without losing its contents.

In fprintf:

Integer values use %d
Floating point values, i.e. decimals, use %f
%lf for double precision numbers
Single characters use %c
Character strings use %s

10. Pointers in C.¶

Kernigan and Ritchie p93 defines a pointer as "a group of cells (often two or four) that can hold an address."

From Kernigan and Ritchie,

The unary operator & yields the address of an object.

p = &c assigns the address of c to the variable p.

p points to c.

The unary operator * is the indirection or dereferencing operator.

When x is a pointer, *x is the object x points to.

int x = 1, y = 2, z[10];
int *ip; /* ip is a pointer to int. */

ip = &x; /* ip now points to x. */
y = *ip; /* y is now 1. */ 
*ip = 0; /* x is now 0. */
ip = &z[0]; /* ip now points to z[0]. */

Every pointer points to a specific data type. An exception is a pointer to void, which can be used to hold any type of pointer but can't be dereferenced itself. If ip points to the integer x, then *ip can occur in any context where x could. For example, *ip = *ip + 10; increments *ip by 10.

Be careful because *ip += 1; adds one to whatever ip points to, as does (*ip)++; but *ip++ adds 1 to ip instead of what it points to because unary operators like * and ++ work from right to left.

Since pointers are variables, they can be used without dereferencing. For example, if ip is a pointer to int, then iq = ip; copies the contents of ip into iq, making iq point to whatever ip pointed to.

Coming up.¶

More nonparametric regression.
Generalized additive models.
inputting and outputting matrices.
Making R packages.
Calling R from C.