Homework


For all homework, please write your name and the homework number at the top of the first page.  Your homework MUST be stapled. You can buy a stapler for as little as $1.88, so there is no reason not to do so.



HW8  Due Wednesday, March 16  Solutions


A. Again, let's consider the fat data.  (You might want to re-read the previous homework assignment.)  This time we're going to evaluate how we might answer the question "which model makes the most accurate predictions?"

a) Last time you fit a model to predict percentage of fat.  If you remember that model, write it down here.  If not, re-fit to get a reasonable model.
We'll call this "Model 1"
b) Now we'll create "Model 2".  Fit a model using (i) weight, (ii) age, (iii) age^2, (iv) height and (v) abdomen minus wrist.  No need to take transforms.  Just fit the model.
NOTE:   R will react unexpectedly if you do lm(y ~ x + x^2)  to create a quadratic fit.  (This applies to the age and age^2 terms you're asked to fit.) The reason is that R treats the "^" character as a formula operator inside a model, not as arithmetic exponentiation as it would on the command line.  To get around this, type lm(y ~ x + I(x^2))   and you will get the correct model.
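A quick demonstration of the note above, using simulated data (the variable names here are made up for illustration, not from the fat data):

```r
# Simulated data with a true quadratic relationship
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3*x - 0.5*x^2 + rnorm(50)

m.wrong <- lm(y ~ x + x^2)      # "^" is a formula operator here: x^2 collapses to x
m.right <- lm(y ~ x + I(x^2))   # I() protects "^", so a genuine quadratic term is fit

length(coef(m.wrong))   # 2 coefficients: intercept and x only
length(coef(m.right))   # 3 coefficients: intercept, x, and x^2
```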
c)  Evaluate how well Model 2 fits the data, using standard diagnostics.
d) Calculate AIC and BIC for both models.  According to AIC, which model is best? Which model does BIC prefer?
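R's built-in AIC() and BIC() functions do the arithmetic for part (d).  A sketch with a simulated stand-in, since the fat data are not loaded here (with the real data you would pass your Model 1 and Model 2 fits instead):

```r
# Simulated stand-in data; substitute your real models for model1 and model2
set.seed(2)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2*d$x1 + rnorm(100)

model1 <- lm(y ~ x1, data = d)
model2 <- lm(y ~ x1 + x2, data = d)

AIC(model1); AIC(model2)   # smaller AIC is preferred
BIC(model1); BIC(model2)   # BIC penalizes extra terms more heavily than AIC
```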
e)  One method for comparing accuracy is called "validation".  You can read about it on page 121.  The idea is to randomly divide the data into two groups:  one group is the "training" group and one is the "test" group.  You use the training group to fit the model, make transformations, and decide which variables stay and which leave.  You then use the test group to compare the predicted values with the observed values, and calculate the mean squared error (or some other statistic -- could be r-squared, could be BIC).  This has been described as a "put your money where your mouth is" approach.  You say your model is good at predicting, so here's some new data you haven't seen before (the test data) -- see how well you can predict it.


A variation of this is called cross-validation.  Here you randomly divide the data into k groups.  We'll assume k = 3 here.  Groups 1 and 2 will be the training data.  You'll fit a model with the training data. Then Group 3 will produce a mean squared error (the residual sum of squares from predicting Group 3, divided by the size of Group 3).   Save this result.  Now we redo the whole thing: Groups 1 and 3 are the training data, Group 2 the test data.  Finally, we do this a third time, with Groups 2 and 3 as the training data.  This gives us three MSE statistics, and the average of these gives us an estimate of just how good our predictions are.

i) Divide the data into three random groups.  Assuming your data frame is called fat:
n <- nrow(fat)
rand <- sample(1:n)%%3 + 1
group1.index <- (1:n)[rand==1]
group2.index <- (1:n)[rand==2]
group3.index <- (1:n)[rand==3]
fat.1 <- fat[group1.index,]
fat.2 <- fat[group2.index,]
fat.3 <- fat[group3.index,]

ii)  Fit a model using weight, age, age^2, height, and abdomen minus wrist using just the first two groups.  (Don't worry about transforming the data.)  You'll need to combine the groups together:  training <- fat[c(group1.index, group2.index),]
iii) Find the predicted values for group 3, using predict() with the newdata argument.  Subtract the predicted values from the observed values to get the test residuals, and calculate the residual sum of squares.  Then calculate the mean square error (RSS/n), where n is the sample size of the test group.
iv)  Repeat steps (ii) and (iii), first with Group 2 as the test data and then with Group 1 as the test data.  What are the values for the mean square error?  What are the values for the RSS (residual sum of squares)?
v) You can estimate the overall mean square error by adding the residual sum of squares for each test set, and dividing by the total sample size.  What is this estimate?
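Steps (i)-(v) can be wrapped in a single loop.  This is one possible sketch; it uses simulated stand-in data with made-up column names (weight, age, height, abwrist, body.fat) so it runs on its own -- with the real fat data you would drop the simulation lines and use the actual variable names:

```r
# Simulated stand-in for the fat data frame (column names are hypothetical)
set.seed(3)
fat <- data.frame(weight = rnorm(252, 180, 25), age = rnorm(252, 45, 12),
                  height = rnorm(252, 70, 3), abwrist = rnorm(252, 75, 8))
fat$body.fat <- 0.2*fat$weight + 0.3*fat$abwrist + rnorm(252, 0, 4)

n <- nrow(fat)
rand <- sample(1:n) %% 3 + 1           # random group labels 1, 2, 3
rss <- numeric(3)
n.test <- numeric(3)
for (g in 1:3) {
  training <- fat[rand != g, ]         # two groups fit the model
  test     <- fat[rand == g, ]         # the remaining group tests it
  fit <- lm(body.fat ~ weight + age + I(age^2) + height + abwrist,
            data = training)
  pred <- predict(fit, newdata = test)
  rss[g] <- sum((test$body.fat - pred)^2)
  n.test[g] <- nrow(test)
}
rss / n.test             # the three per-fold mean square errors
sum(rss) / sum(n.test)   # overall estimate, as in step (v)
```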

In principle, you could then repeat the procedure for Model 1 and use this to help you decide which makes the better predictions.  (No need to do that here.)

In practice, using k=3 gives a crude (typically too large) measure of the error.  You can improve it by taking k larger.  In fact, one method (it requires that you write a program) uses k = n, so that each observation takes a turn as the test set.  This is also called the "leave one out" method or the "jackknife".

Read Sections 5.5.1 - 5.5.2  for an overview.

B)  Bootstrapping

When the residuals are not normal, our estimates of confidence intervals are only approximations.  While these approximations improve as the sample size increases, how do we know when the sample size is large enough?

An alternative method is to use bootstrapping. Bootstrapping can be used to provide confidence intervals for the slope (and intercept, but we'll focus on the slope) that do not require an assumption of normal distribution.  In this exercise, we'll calculate a bootstrap confidence interval for the slope in a simple linear regression.

a) Download the movie data.  Fit a model using the log of Friday gross to predict the log of the ultimate gross. What is the model?
b) What is a 95% confidence interval for the slope in your model?  Does it include 0?
c) Are the residuals normally distributed?  Explain.
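Parts (a)-(c) only need lm(), confint(), and a normal quantile plot of the residuals.  A sketch with simulated stand-ins for logfriday and loggross, since the movie data are not loaded here (substitute the real columns):

```r
# Simulated stand-ins; replace with the log-transformed movie variables
set.seed(4)
logfriday <- rnorm(120, 15, 1)
loggross  <- 1.5 + 1.1*logfriday + rnorm(120, 0, 0.4)

movie.fit <- lm(loggross ~ logfriday)
ci <- confint(movie.fit, "logfriday", level = 0.95)  # part (b): 95% CI for the slope
ci

# Part (c): check normality of residuals with a normal quantile plot
qqnorm(resid(movie.fit)); qqline(resid(movie.fit))
```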

Now we'll try this with a bootstrap.  The idea is that we make no assumptions about the distribution of the population, but we do assume that our sample is large enough to be a good reflection of the shape of the population's distribution.

d) The first step is to take "bootstrap sample 1" from the original data.  To do this, take a random sample WITH replacement from the original data:
    original.data <- data.frame(logfriday, loggross)
    n <- nrow(original.data)
    sample.index <- sample(1:n, n, replace=TRUE)
    bs1 <- original.data[sample.index,]

"Document" this code -- explain what each line does.

e) Now fit the model using bs1 (don't do transformations or diagnostics).  What is the slope?  Save this slope in a variable named slope.1
f) Repeat steps (d) and (e).  Call this slope slope.2.  You now have two different estimates of the slope.
g) Repeat (d) and (e) 998 times.  You'll now have 1000 estimates of the slope and can use this to understand the variability in the estimate.  I don't expect you to actually repeat this 1000 times, of course.   Instead, write a function that will do this.  The input to the function should be the data frame (what we called original.data) and a number called B which will represent the number of repetitions. (In this example, B = 1000.)  The output should be a vector of B estimates of the slope.
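One possible shape for such a function (the name boot.slopes is made up; it assumes the predictor is the first column and the response the second, as in the original.data built in part (d)):

```r
# Resample rows with replacement B times, refit, and collect the slopes
boot.slopes <- function(dat, B) {
  n <- nrow(dat)
  slopes <- numeric(B)
  for (b in 1:B) {
    bs <- dat[sample(1:n, n, replace = TRUE), ]   # one bootstrap sample
    slopes[b] <- coef(lm(bs[, 2] ~ bs[, 1]))[2]   # keep only the slope
  }
  slopes
}

# Hypothetical usage with simulated data in place of the movie data:
set.seed(5)
logfriday <- rnorm(100)
loggross  <- 2 + 1.2*logfriday + rnorm(100, 0, 0.3)
bootstraps <- boot.slopes(data.frame(logfriday, loggross), B = 200)
quantile(bootstraps, c(.025, .975))   # crude 95% interval, as in part (h)
```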
h) One crude way of getting a pretty good 95% confidence interval is to take as your left end point the .025 quantile of the output from your function in (g). The right end point is the .975 quantile.    So let's say that you have a vector called bootstraps that consists of 1000 estimates of the slope. Your estimate of the 95% confidence interval could be found using quantile(bootstraps, c(.025, .975)).  What is this interval?  How does it compare to the interval in (b)?

Note: R does all of this for you in a nice command called "boot".  boot is a little tricky to use, but if you want to try it, read 5.5.3. Pay careful attention to the footnotes, which have sample code for you to use.