Wed, April 21, 2010.
1. More about kernel regression.
2. Choosing h.
3. B-spline regression.
4. M-estimation, LAD regression, and Huber's method.
5. LTS.

1. More about kernel regression.

In R, you can do ksmooth, or in splancs, see kernel2d and kernel3d.

95% confidence intervals for the kernel regression estimate m^_h(x) can be constructed using the formula

  m^_h(x) +/- 1.96 sqrt{ sigma^2(x) ||K||_2^2 / (n h g^(x)) },

where
  sigma^2(x) = sum (i = 1 to n) W_hi(x) (Y_i - m^_h(x))^2 / n,
  W_hi(x) = K((x - x_i)/h) / (h g^(x)),
  g^(x) = sum (i = 1 to n) K_h(x - x_i) / n,
  m^_h(x) is the Nadaraya-Watson estimate, i.e.
  m^_h(x) = [sum (i = 1 to n) K_h(x - x_i) Y_i] / [sum (i = 1 to n) K_h(x - x_i)],
  K_h(x) = K(x/h)/h, and
  ||K||_2^2 = integral from -infinity to infinity of K^2(u) du.
This formula comes from Hardle (1991).

2. Choosing h.

a) Minimizing mean integrated squared error. p373.
b) Plug-in methods. Start with an initial estimate of f, e.g. by assuming f is normal, and use that initial guess to estimate R(f'') and R(f'''). Using those estimates, find the h that minimizes the estimated mean integrated squared error. p374.
c) Silverman's rule of thumb. Silverman (1986). bw.nrd0(x) = 0.9 * min(sd, IQR/1.34) * n^(-1/5).
d) Scott (1992)'s rule of thumb. bw.nrd(x) = 1.06 * min(sd, IQR/1.34) * n^(-1/5).
e) Cross-validation: remove one observation i at a time and predict Y_i, or leave out a whole segment of x's at a time if you have overlapping x's.

3. B-spline regression.

The fitted curve is a sum of basis functions of X, where each basis function is a degree 3 polynomial in X between adjacent knots, each is continuous, each is continuous in its first 2 derivatives at the knots, and at any given x the basis functions sum to 1. Nonparametric estimates typically fit the data well, but they sometimes overfit, especially when you don't have much data.

4. M-estimation, LAD regression, and Huber's method.

Instead of minimizing the sum of squared residuals, sum e_i^2, you could choose beta to minimize sum rho(e_i), where rho is some other function. If you choose rho(x) = x^2, then you're back to least squares.
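The M-estimation idea can be sketched numerically: minimize sum rho(e_i) over beta for a chosen rho, and check that rho(x) = x^2 reproduces the ordinary least-squares fit. This is a minimal illustration (not from the notes), on simulated data, using scipy's general-purpose minimizer.

```python
# Sketch: M-estimation with rho(x) = x^2 recovers least squares.
# Simulated data and scipy.optimize are illustrative choices, not from the notes.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)  # true intercept 2, slope 0.5

X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept column

def m_objective(beta, rho):
    """Generic M-estimation criterion: sum of rho applied to residuals."""
    return np.sum(rho(y - X @ beta))

# rho(x) = x^2: numerical minimization of the sum of squared residuals
fit = minimize(m_objective, x0=[0.0, 0.0], args=(lambda e: e**2,))

# Closed-form least-squares solution for comparison
ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(fit.x)  # should agree closely with ols
print(ols)
```

Swapping in a different rho (e.g. |x|, or Huber's function) in the same `m_objective` is all that changes for the other M-estimators discussed below, though non-smooth choices of rho need a derivative-free optimizer.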
If rho(x) = |x|, then this is LAD (least absolute deviations) regression, also called L1 regression. With LAD, one big residual doesn't get squared, so it doesn't matter as much in determining beta.
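The robustness of LAD to a single large residual can be seen in a small comparison. A hedged sketch (simulated data, with Nelder-Mead used for the non-smooth |e| objective; these choices are mine, not from the notes): one gross outlier pulls the least-squares slope away from the truth much more than the LAD slope.

```python
# Sketch: LAD vs. least squares with one gross outlier (illustrative data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 40)  # true slope 2
y[-1] += 50.0  # one gross outlier at the right endpoint

X = np.column_stack([np.ones_like(x), x])

def sad(beta):
    """Sum of absolute deviations: the LAD criterion, rho(e) = |e|."""
    return np.sum(np.abs(y - X @ beta))

# |e| is not differentiable at 0, so use a derivative-free method
lad = minimize(sad, x0=[0.0, 0.0], method="Nelder-Mead").x
ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("LAD slope:", lad[1])
print("OLS slope:", ols[1])
# The squared outlier residual drags the OLS slope well above 2,
# while the LAD slope stays near the true value.
```

In R, the same fit is available via rq(y ~ x, tau = 0.5) in the quantreg package, since the median regression line minimizes the sum of absolute deviations.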