Mon, April 19, 2010.

1. Overview of kernel regression.
2. Histograms.
3. Kernel density estimates.
4. Kernel regression and the Nadaraya-Watson estimator.

1. Overview of kernel regression.

Suppose you are trying to estimate the relationship between Y and X:

Y = f(X) + eps, where E(eps) = 0 and cov(eps) = sigma^2 I.

One option is to assume that f(X) is linear, fit a line, and see how well it fits. Another option is to fit a curve, like a polynomial of degree 17, see how well that fits, and check whether it looks linear. A different idea is to fit a completely general curve and see if it looks linear.

How can you fit a general curve? One way is to take chunks of X at a time, or bins, and compute the average of Y for the observations whose X falls in each bin. This is similar to histogram estimation of a density. A better option is to use all of the data and smooth it to form f^(x), taking a weighted average of all the Y's, weighted by how far each x_i is from x. This is kernel regression.

2. Histograms.

Suppose that x_1, ..., x_n are independent draws from a density f(x). You can estimate the density using a histogram. In this case,

f^(x) = (proportion of observations x_i in B(x)) / (width of B(x)),

where B(x) is the bin containing x. The resulting estimates are not smooth. More importantly, if the bins are small, the estimates are highly variable; if the bins are large, the estimates are biased.

# Histogram of a two-component normal mixture, with the true density overlaid.
x = c(rnorm(1000, mean = 0, sd = 1), rnorm(1000, mean = 4, sd = 0.6))
plot(c(-3, 7), c(0, 0.4), type = "n", xlab = "x", ylab = "density")
hist(x, nclass = 20, probability = TRUE, add = TRUE)
x1 = seq(-3, 7, length = 100)
y = dnorm(x1)/2 + dnorm(x1, mean = 4, sd = 0.6)/2
lines(x1, y)

3. Kernel density estimation.

Smooth density estimates are often made using kernel density estimation:

f^(x) = (1/(nh)) * sum_{i=1}^n K[(x - x_i)/h].

See p371. K is usually a density, and h is the bandwidth. For instance, K(u) might be the standard normal density. K does not need to be similar to the density f being estimated. See p372.
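The kernel density estimate above can be sketched directly in R. This is a minimal version, assuming a standard normal kernel (the function name kde and the bandwidth value are illustrative choices, not from the notes):

```r
# Kernel density estimate f^(x) = (1/(n h)) * sum_i K[(x - x_i)/h],
# taking K to be the standard normal density.
kde = function(x, data, h) {
  sapply(x, function(x0) mean(dnorm((x0 - data) / h)) / h)
}

set.seed(1)
# Same two-component normal mixture as in the histogram example.
x = c(rnorm(1000, mean = 0, sd = 1), rnorm(1000, mean = 4, sd = 0.6))
x1 = seq(-3, 7, length = 100)
fhat = kde(x1, x, h = 0.3)
# lines(x1, fhat) would overlay this smooth estimate on the histogram plot.
```

Compare with R's built-in density(x), which computes the same kind of estimate with an automatically chosen bandwidth.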
If h is larger, more smoothing is done, so the bias is larger but the variance is smaller. A formula for the variance of kernel density estimates is on p372.

4. Kernel regression and the Nadaraya-Watson estimator.

An obvious analogue is to estimate f(x) using

(1/(nh)) * sum_{i=1}^n y_i K[(x - x_i)/h].

We are essentially taking a weighted average of the points, weighting them by how far they are from x. The problem is that these weights do not add up to 1. The Nadaraya-Watson estimator normalizes them:

f^(x) = sum_{i=1}^n w_i Y_i / sum_{i=1}^n w_i, where w_i = K[(x_i - x)/h].

Here K integrates to 1 (the integral of K(u) du from -infinity to infinity is 1), and h is the bandwidth: the larger h is, the more smoothing is done. The weights are now guaranteed to sum to one. With the optimal choice of h, MSE(x) = E[(f(x) - f^(x))^2] = O(n^{-4/5}). For the typical parametric regression estimate, MSE = O(n^{-1}), so the kernel estimator is less efficient, but it doesn't depend on the model. A good idea is to first fit a kernel regression estimate, see whether it looks linear or quadratic or some other simple form, and if it does, fit that curve.
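The Nadaraya-Watson estimator can be sketched in a few lines of R. This is an illustrative implementation, assuming a Gaussian kernel; the function name nw and the simulated data are made up for the example:

```r
# Nadaraya-Watson estimator: f^(x) = sum_i w_i Y_i / sum_i w_i,
# with Gaussian-kernel weights w_i = K[(x_i - x)/h].
nw = function(x, xdata, ydata, h) {
  sapply(x, function(x0) {
    w = dnorm((xdata - x0) / h)
    sum(w * ydata) / sum(w)
  })
}

set.seed(1)
# Simulated data from a smooth, clearly nonlinear f.
xdata = runif(200, 0, 10)
ydata = sin(xdata) + rnorm(200, sd = 0.3)
xgrid = seq(0, 10, length = 101)
fhat = nw(xgrid, xdata, ydata, h = 0.5)
# plot(xdata, ydata); lines(xgrid, fhat)  # the fitted curve should track sin(x)
```

Because the weights sum to one, the estimator reproduces a constant exactly: if all Y_i = c, then f^(x) = c everywhere. R's built-in ksmooth(xdata, ydata, kernel = "normal") computes the same kind of estimate.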