Monday, April 5, 2010.

Outline: 1. Maximum likelihood. 2. AIC and Corrected AIC. 3. BIC. 4. Comparing information criteria. 5. Inference after variable selection. 6. LASSO.

1. Maximum likelihood.

The idea: we observe the responses (Y), but we don't know beta. We can pick the beta that most agrees with the data, i.e., pick beta so that the density, or likelihood, of Y is maximized. This is not a logical justification for choosing beta, but it has been shown that choosing beta in this way often results in good estimates.

Suppose Y ~ ind. N(X beta, sigma^2 I). Given beta and sigma, what is the likelihood (density) of Y_i? [See bottom of p228.] Since the Y_i are independent, the likelihood of Y is just the product of their densities. [See top of p229.] If a particular beta maximizes the likelihood, then it also maximizes log(likelihood), since log is increasing. [See middle of p229.]

So in linear regression, the estimates of beta that you get by maximizing the likelihood (maximum likelihood estimates, or MLEs) are the same as the usual least squares estimates, (X^T X)^{-1} X^T Y, but the MLE of sigma is a bit different: it divides the residual sum of squares by n rather than by the residual degrees of freedom. [See bottom of p229.]

2. AIC and Corrected AIC.

You can compare models or choose a model by minimizing AIC. [See top of p230. Note K = p + 2 because you are estimating beta and sigma.] K serves as an approximately appropriate penalty for fitting more parameters. Note how R calculates AIC. [See middle of p231.]

If p/n is small, AIC works well. Otherwise, it can favor fitting too many parameters: K is not a big enough penalty. In that case you should use Corrected AIC instead. [See bottom of p231.]

3. BIC.

BIC also adjusts for the fact that K is not quite a big enough penalty. [See middle of p232.]

4. Comparing information criteria.

Prediction errors under models selected by AIC or Corrected AIC become optimal as n -> infinity. [See bottom of p232.] With BIC, P(true model is selected) -> 1 as n -> infinity. [See top of p233.]

5. Inference after variable selection.

Note that you can obtain different models depending on your choice of variable selection method.
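The quantities from topics 1 through 3 can be sketched in a short numerical example. The notes only point to textbook pages for the formulas, so the versions below are the standard Gaussian-likelihood forms (AIC = -2 log-likelihood + 2K, Corrected AIC = AIC + 2K(K+1)/(n-K-1), BIC = -2 log-likelihood + K log n); the simulated data and variable names are my own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design with intercept
beta_true = np.array([1.0, 2.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# Under the Gaussian model, the MLE of beta is the least squares estimate.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
rss = resid @ resid

sigma2_mle = rss / n                       # MLE of sigma^2: divides by n
sigma2_unbiased = rss / (n - X.shape[1])   # usual unbiased estimate: divides by n - (p+1)

# Gaussian log-likelihood evaluated at the MLEs
loglik = -0.5 * n * (np.log(2 * np.pi * sigma2_mle) + 1)

K = X.shape[1] + 1  # number of estimated parameters: coefficients plus sigma
aic = -2 * loglik + 2 * K
aicc = aic + 2 * K * (K + 1) / (n - K - 1)  # Corrected AIC: larger penalty when n is small
bic = -2 * loglik + K * np.log(n)           # log(n) > 2 here, so BIC penalizes more than AIC
print(sigma2_mle, sigma2_unbiased, aic, aicc, bic)
```

Comparing `sigma2_mle` with `sigma2_unbiased` shows the MLE of sigma^2 is always the smaller of the two, and comparing the three criteria on nested fits shows how the penalties diverge as K grows.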
For instance, if you go forward, you can easily add a variable that is significant given nothing else but insignificant given the other variables. If you go backward, you will remove that variable.

Inferences after variable selection are biased, because the typical inferences generating p-values from F-tests or t-tests assume that the model was fixed beforehand, not chosen using the data. [See top of p239.] What can be done? Test the model on data not used to pick the model. [See middle of p239.]

6. LASSO.

LASSO stands for Least Absolute Shrinkage and Selection Operator. Constrain the sum of the sizes (absolute values) of the parameter estimates to be less than some upper bound, s. [See top of p251.] Equivalently, minimize the sum of squares plus lambda times the sum of the sizes of the parameter estimates. [See middle of p251.]

Since some of the parameter estimates will often be set exactly to zero, LASSO deals with the problems of choosing a subset and estimating parameters at the same time. LASSO also deals with the problem that, if you have many explanatory variables, the parameter estimates will be highly variable. This is an example of shrinkage in regression estimation.
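The penalized form of LASSO can be sketched with plain coordinate descent, which makes the "set some estimates exactly to zero" behavior visible. This is a minimal illustration, not a production solver; the 0.5 factor on the sum of squares is a common convention that only rescales lambda, and the simulated data are my own.

```python
import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0): this is where exact zeros come from
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * sum_j |b_j| by coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with coordinate j's contribution removed
            r_j = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r_j, lam) / col_ss[j]
    return b

rng = np.random.default_rng(1)
n, p = 100, 8
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]          # only the first two variables matter
y = X @ beta_true + rng.normal(scale=0.5, size=n)

b = lasso_cd(X, y, lam=25.0)
print(np.round(b, 2))
```

The nonzero estimates are shrunk toward zero relative to least squares (shrinkage), while most of the irrelevant coefficients come out exactly zero (selection), which is the sense in which LASSO does both jobs at once.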