Homework
For all homework, please write your name and the homework number at the top of the first page. Your homework MUST be stapled. You can buy a stapler for as little as $1.88, so there is no reason not to do so.
HW8 Due Wednesday, March 16 Solutions
A. Again, let's consider the fat data. (You might want to re-read the previous homework assignment.) This time we're going to evaluate how we might answer the question "which model has the best accuracy in making predictions?"
a) Last time you fit a model to predict percentage of fat. If you remember that model, write it down here. If not, re-fit to get a reasonable model. We'll call this "Model 1".
b) Now we'll create "Model 2". Fit a model using (i) weight, (ii) age, (iii) age^2, (iv) height, and (v) abdomen minus wrist. No need to take transforms. Just fit the model.
NOTE: R will react unexpectedly if you do lm(y ~ x + x^2) to create a quadratic fit. (This applies to the age and age^2 terms you're asked to fit.) The reason is that R treats the "^" character differently when it is part of a model formula than when it is typed on the command line. To get around this, type lm(y ~ x + I(x^2)) and you will get the correct model.
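To see the difference concretely, here is a minimal sketch with simulated data (the variables x and y are placeholders, not columns of the fat data):

```r
# Simulated data just to show what I() does; not the fat data.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x - 0.5 * x^2 + rnorm(50)

wrong <- lm(y ~ x + x^2)     # in a formula, x^2 collapses to x, so the
                             # quadratic term silently disappears
right <- lm(y ~ x + I(x^2))  # I() protects "^", so x^2 means squaring

length(coef(wrong))  # 2 coefficients: intercept and x only
length(coef(right))  # 3 coefficients: intercept, x, and I(x^2)
```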
c) Evaluate how well Model 2 fits the data, using standard diagnostics.
d) Calculate AIC and BIC for both models. According to AIC, which model is best? Which model does BIC prefer?
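If it helps, AIC() and BIC() each accept several fitted models at once. A sketch with simulated data (m1 and m2 here are stand-ins for your Model 1 and Model 2):

```r
# Simulated stand-ins for Model 1 and Model 2.
set.seed(2)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- 1 + 2 * x1 + rnorm(100)

m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)

AIC(m1, m2)  # one row per model: degrees of freedom and AIC; smaller is better
BIC(m1, m2)  # same layout; BIC penalizes extra parameters more heavily
```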
e) One method for comparing accuracy is called "validation". You can read about it on page 121. The idea is to randomly divide the data into two groups: one group is the "training" group and one is the "test" group. You use the training group to fit the model, make transformations, and decide which variables stay and which leave. You then use the test group to compare the predicted values with the observed values, and calculate the mean squared error (or some other statistic -- could be r-squared, could be BIC). This has been described as a "put your money where your mouth is" approach. You say your model is good at predicting, so here's some new data you haven't seen before (the test data); see how well you can predict it.
A variation of this is called cross-validation. Here you randomly divide the data into k groups. We'll assume k = 3 here. Groups 1 and 2 will be the training data. You'll fit a model with the training data. Then Group 3 will produce a mean squared error (the residual sum of squares divided by the degrees of freedom). Save this result. Now we redo the whole thing. Groups 1 and 3 are the training data, group 2 the test data. Finally, we do this a third time, with Groups 2 and 3 as the training data. This gives us three MSE statistics, and the average of these gives us an estimate of just how good our predictions are.
i) Divide the data into three random groups. Assuming your data frame is called fat:
n <- nrow(fat)
rand <- sample(1:n) %% 3 + 1
group1.index <- (1:n)[rand == 1]
group2.index <- (1:n)[rand == 2]
group3.index <- (1:n)[rand == 3]
fat.1 <- fat[group1.index, ]
fat.2 <- fat[group2.index, ]
fat.3 <- fat[group3.index, ]
ii) Fit a model using weight, age, age^2, height, and abdomen minus wrist using just the first two groups. (Don't worry about transforming the data.) You'll need to combine the groups together:
training <- fat[c(group1.index, group2.index), ]
iii) Find the predicted values using group 3. Calculate the residual sum of squares (the residuals() function may help). Calculate the mean squared error (RSS/n), where n is the sample size of the test group.
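Steps (ii) and (iii) together might look like the sketch below. The column names (fat.pct, weight, age, height, abdomen, wrist) are assumptions for illustration; substitute whatever your data frame actually uses. Note that abdomen minus wrist also needs an I() wrapper, since "-" is a formula operator too.

```r
# Simulated stand-in for the fat data; the column names are hypothetical.
set.seed(3)
n <- 252
fat <- data.frame(fat.pct = rnorm(n, 19, 8),  weight = rnorm(n, 179, 29),
                  age     = rnorm(n, 45, 13), height = rnorm(n, 70, 3),
                  abdomen = rnorm(n, 93, 11), wrist  = rnorm(n, 18, 1))
rand <- sample(1:n) %% 3 + 1
group1.index <- (1:n)[rand == 1]
group2.index <- (1:n)[rand == 2]
group3.index <- (1:n)[rand == 3]

training <- fat[c(group1.index, group2.index), ]
test     <- fat[group3.index, ]

# I() protects both the quadratic term and the difference from the
# formula meanings of "^" and "-".
fit  <- lm(fat.pct ~ weight + age + I(age^2) + height + I(abdomen - wrist),
           data = training)
pred <- predict(fit, newdata = test)
rss  <- sum((test$fat.pct - pred)^2)  # residual sum of squares on the test set
mse  <- rss / nrow(test)              # MSE = RSS / test-set size
```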
iv) Repeat steps (ii) and (iii) using Group 2 as the test data and then with Group 1 as the test data. What are the values for the mean squared error? What are the values for the RSS (residual sum of squares)?
v) You can estimate the overall mean squared error by adding the residual sum of squares for each test set, and dividing by the total sample size. What is this estimate?
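Steps (ii)-(v) can be wrapped in one loop. A minimal sketch with simulated data, where a single predictor x stands in for the full set of variables:

```r
set.seed(4)
n <- 90
d <- data.frame(x = rnorm(n))
d$y <- 2 + 3 * d$x + rnorm(n)
rand <- sample(1:n) %% 3 + 1  # assign each row to group 1, 2, or 3

total.rss <- 0
for (k in 1:3) {
  training <- d[rand != k, ]   # two groups fit the model
  test     <- d[rand == k, ]   # the held-out group is predicted
  fit      <- lm(y ~ x, data = training)
  pred     <- predict(fit, newdata = test)
  total.rss <- total.rss + sum((test$y - pred)^2)
}
cv.mse <- total.rss / n  # pooled RSS over all test sets / total sample size
```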
In principle, you could then repeat the procedure for Model 1 and use this to help you decide which makes the better predictions. (No need to do that here.)
In practice, using k = 3 gives a crude (typically too large) measure of the error. You can improve it by taking k larger. In fact, one method (which requires that you write a program) uses k = n, so each test group is a single observation. This is also called the "leave one out" method or the "jackknife".
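A sketch of the leave-one-out idea, written as a small function (again with a hypothetical one-predictor data frame, not the fat data):

```r
# Leave-one-out cross-validation: each row is held out once and predicted
# from a model fit to the remaining n - 1 rows.
loocv.mse <- function(d) {
  n <- nrow(d)
  sq.err <- numeric(n)
  for (i in 1:n) {
    fit <- lm(y ~ x, data = d[-i, ])
    sq.err[i] <- (d$y[i] - predict(fit, newdata = d[i, ]))^2
  }
  mean(sq.err)
}

set.seed(5)
d <- data.frame(x = rnorm(40))
d$y <- 1 + 2 * d$x + rnorm(40)
loocv.mse(d)  # estimated prediction MSE
```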
Read Section 5.5.1 - 5.5.2 for an overview.
B) Bootstrapping
When the residuals are not normal, then our estimates of confidence intervals are only approximations. While these approximations improve as the sample size increases, how do we know when the sample size is good enough? An alternative method is to use bootstrapping. Bootstrapping can be used to provide confidence intervals for the slope (and intercept, but we'll focus on the slope) that do not require an assumption of normal distribution. In this exercise, we'll calculate a bootstrap confidence interval for the slope in a simple linear regression.
a) Download the movie data. Fit a model using the log of Friday gross to predict the log of the ultimate gross. What is the model?
b) What is a 95% confidence interval for the slope in your model? Does it include 0?
c) Are the residuals normally distributed? Explain.
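Two standard checks, shown here on a simulated fit (with the movie data you would apply them to your own model object):

```r
set.seed(6)
x <- rnorm(80)
y <- 1 + 2 * x + rnorm(80)
fit <- lm(y ~ x)

r <- residuals(fit)
qqnorm(r); qqline(r)  # points close to the line suggest approximate normality
shapiro.test(r)       # formal test; a small p-value suggests non-normality
```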
Now we'll try this with a bootstrap. The idea is that we make no assumptions about the distribution of the population, but we do assume that our sample is large enough to be a good reflection of the shape of the distributions of the population.
d) The first step is to take "bootstrap sample 1" from the original data. To do this, take a random sample WITH replacement from the original data:
original.data <- data.frame(logfriday, loggross)
n <- nrow(original.data)
sample.index <- sample(1:n, n, replace=TRUE)
bs1 <- original.data[sample.index, ]
"Document" this code -- explain what each line does.
e) Now fit the model using bs1 (don't do transformations or diagnostics). What is the slope? Save this slope in a variable named slope.1.
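Continuing the resampling code from (d), fitting to bs1 and pulling out the slope might look like this sketch (with simulated stand-ins for logfriday and loggross):

```r
# Simulated stand-ins for the movie variables.
set.seed(7)
logfriday <- rnorm(100)
loggross  <- 1 + 1.1 * logfriday + rnorm(100, sd = 0.3)
original.data <- data.frame(logfriday, loggross)
n <- nrow(original.data)

sample.index <- sample(1:n, n, replace = TRUE)  # sample WITH replacement
bs1 <- original.data[sample.index, ]

fit     <- lm(loggross ~ logfriday, data = bs1)
slope.1 <- coef(fit)[2]  # the slope estimate from this bootstrap sample
```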
f) Repeat steps (d) and (e). Call this slope slope.2. You now have two different estimates of the slope.
g) Repeat (d) and (e) 998 times. You'll now have 1000 estimates of the slope and can use this to understand the variability in the estimate. I don't expect you to actually repeat this 1000 times, of course. Instead, write a function that will do this. The input to the function should be the data frame (what we called original.data) and a number called B which will represent the number of repetitions. (In this example, B = 1000.) The output should be a vector of B estimates of the slope.
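One possible shape for such a function, sketched with simulated data (the formula loggross ~ logfriday is hard-coded here for simplicity):

```r
boot.slopes <- function(d, B) {
  n <- nrow(d)
  slopes <- numeric(B)
  for (b in 1:B) {
    bs <- d[sample(1:n, n, replace = TRUE), ]  # one bootstrap sample
    slopes[b] <- coef(lm(loggross ~ logfriday, data = bs))[2]
  }
  slopes  # a vector of B slope estimates
}

set.seed(8)
original.data <- data.frame(logfriday = rnorm(100))
original.data$loggross <- 1 + 1.1 * original.data$logfriday + rnorm(100, 0, 0.3)
out <- boot.slopes(original.data, 200)
quantile(out, c(.025, .975))  # crude 95% bootstrap interval for the slope
```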
h) One crude way of getting a pretty good 95% confidence interval is to take as your left end point the .025 quantile of the output from your function in (g). The right end point is the .975 quantile. So let's say that you have a vector called bootstraps that consists of 1000 estimates of the slope. Your estimate of the 95% confidence interval could be found using quantile(bootstraps, c(.025, .975)). What is this interval? How does it compare to the interval in (b)?
Note: R does all of this for you in a nice command called "boot". boot is a little tricky to use, but if you want to try it, read 5.5.3. Pay careful attention to the footnotes, which have sample code for you to use.