Earlier Homeworks
R tutorials:
Intro to basic
commands
Intro
to writing functions
Official R
Introduction
HW 7 Due Friday, March 4 (note that this is in two weeks) Solutions
A.
i) Using the mussel data (mussels.short), fit a linear model using food level to predict the thickness of
the mussel beds. Comment on how well the data fit the model.
No need to do adjustments; just check the diagnostics and
comment. NOTE: to load this data into R, you should type
whatevernameyouwant <- read.table("mussels.short", header=T)
ii) Use an added variable plot to determine which of these variables:
temp, waves, human.use, will have a contribution to predicting the
thickness of the mussel bed if we already know the food level.
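An added variable plot can be built by hand from two sets of residuals. The sketch below uses simulated stand-in data; the column names (thickness, food, temp) are assumptions about what mussels.short contains:

```r
# Added-variable (partial regression) plot for one candidate predictor,
# say temp, given food. Simulated data keep the example self-contained;
# in the homework you would use the columns of mussels.short instead.
set.seed(1)
n <- 50
food <- runif(n, 0, 10)
temp <- runif(n, 10, 25)
thickness <- 2 + 0.8*food + 0.3*temp + rnorm(n, sd = 1)

res.y <- resid(lm(thickness ~ food))  # part of thickness not explained by food
res.x <- resid(lm(temp ~ food))       # part of temp not explained by food
plot(res.x, res.y, xlab = "temp | food", ylab = "thickness | food")
abline(lm(res.y ~ res.x))  # this slope equals temp's coefficient in the full model
```

A clear trend in this plot suggests temp still has something to add once food level is accounted for; repeat the idea for waves and human.use.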
B.
First, download the "fat" dataset. This dataset consists of a series of R commands, so
to download it, all you need to do is save the file in your working
directory, start up R, and type source("fat.html"). Next, type
ls() and you'll see a data object named fat. Type names(fat)
and you'll see the variables.
This homework exercise is taken from an article called "Fitting
Percentage of Body Fat to Simple Body Measurements", Roger W. Johnson,
Journal of Statistics Education v.4, n.1 (1996). In a
nutshell: there is a fairly accurate way of measuring the percentage
of body fat a person has, and it involves immersing that person in a
large tank of water. This is obviously not practical for home use, or
even the doctor's office, and so it is expedient to have another
method. This lab involves building a model that uses easy-to-measure
aspects of a person to predict their percentage of body fat.
From the article:
Abstract
Percentage of body fat, age, weight, height, and ten body circumference
measurements (e.g., abdomen) are recorded for 252 men. Body fat, one
measure of health, has been accurately estimated by an underwater
weighing technique. Fitting body fat to the other measurements using
multiple regression provides a convenient way of estimating body
fat for men using only a scale and a measuring
tape. This dataset can be used to show students the utility of multiple
regression
and to provide practice in model building.
1. Introduction
1 A variety of popular health books suggest that readers assess their
health, at least in part, by estimating their percentage of body
fat. Bailey (1994, pp. 179-186), for instance, presents tables of
estimates based on age,
gender, and various skinfold measurements obtained using a
caliper. Bailey (1991, p. 18) suggests that "15 percent fat for
men and 22 percent fat for women are maximums for good health."
Behnke and Wilmore (1974, pp. 66-67), Wilmore (1976, p. 247), Katch and
McArdle (1977, pp. 120-132), and Abdel-Malek, et al. (1985) are other
sources of predictive equations for
body fat. These predictive equations use skinfold measurements,
body
circumference measurements (e.g., abdominal circumference), and, in the
Abdel-Malek
article, simply height and weight. Gardner and Poehlman (1993,
1994)
supplement these body measurements with a measure of physical activity
to
predict body density from which, as we shall see below, body fat can be
estimated.
2 Such predictive equations for the determination of body fat can
be determined through multiple regression. A group of subjects is
gathered, and
various body measurements and an accurate estimate of the percentage of
body
fat are recorded for each. Then body fat can be fit to the other
measurements
using multiple regression, giving, we hope, a useful predictive
equation
for people similar to the subjects. The various measurements other than
body
fat recorded on the subjects are, implicitly, ones that are easy to
obtain
and serve as proxies for body fat, which is not so easily obtained.
3 In the dataset provided by Dr. A. Garth Fisher (personal
communication, October 5, 1994), age, weight, height, and 10 body
circumference measurements are recorded for 252 men. Each man's
percentage of body fat was accurately estimated by an underwater
weighing technique discussed below. A complete listing of the variables
in the dataset appears in the
Appendix.
2. Determination of the Percentage of Body Fat from Underwater
Weighing
4 The percentage of body fat for an individual can be estimated from
body density. As an approximation, assume that the body consists of two
components -- lean tissue and fat tissue. Letting
D = body density,
W = body weight,
A = proportion of lean tissue,
B = proportion of fat tissue (so A + B = 1),
a = density of lean tissue, and
b = density of fat tissue,
we have
D = weight/volume
  = W/[lean tissue volume + fat tissue volume]
  = W/[A*W/a + B*W/b]
  = 1/[(A/a) + (B/b)].
Solving for B we find
B = (1/D) * [ab/(a - b)] - [b/(a - b)].
5 Using the estimates a = 1.10 gm/cm^3 and b = 0.90 gm/cm^3 (see Katch
and McArdle 1977, p. 111, or Wilmore 1976, p. 123), we come up with
"Siri's equation" (Siri 1956):
Percentage of body fat (i.e., 100 * B) = 495/D - 450,
where D is in units of gm/cm^3. The dataset provided also gives a
second estimate of body fat due to Brozek, Grande, Anderson, and Keys
(1963, p. 137):
Percentage of body fat = 457/D - 414.2,
which is considered accurate for "individuals in whom the body weight
has been free from large, recent fluctuations." There does not seem to
be uniform agreement in the literature as to which of these two methods
is best.
6 Volume, and hence the body density D, can be accurately measured in a
variety of ways. The technique of underwater weighing "computes body
volume as the difference between body weight measured in air and weight
measured during water submersion. In other words, body volume is
equal to the loss of
weight in water with the appropriate temperature correction for the
water's density" (Katch and McArdle 1977, p. 113). Using this technique,
Body density = W/[(W - WW)/c.f. - LV],
where
W = weight in air (kg)
WW = weight in water (kg)
c.f. = water correction factor
(equal to 1 at 39.2 degrees F because one gram of
water occupies exactly one cm^3 at this temperature,
equal to .997 at 76-78 degrees F)
LV = residual lung volume (liters)
(Katch and McArdle 1977, p. 115). The dataset provided here contains
the weights of the subjects, but not the values of the three other
quantities. Other methods of determining body volume are given in
Behnke and Wilmore (1974,
p. 22 ff.).
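The two formulas above translate directly into R. The numbers plugged in below (weight in air, weight in water, correction factor, lung volume) are made up for illustration:

```r
# Body density from underwater weighing, then Siri's and Brozek's
# equations for percent body fat. Example inputs are invented.
body.density <- function(W, WW, c.f., LV) W / ((W - WW)/c.f. - LV)
siri   <- function(D) 495/D - 450
brozek <- function(D) 457/D - 414.2

D <- body.density(W = 80, WW = 3.5, c.f. = 0.997, LV = 1.2)  # gm/cm^3
siri(D)    # roughly 17 percent body fat for these made-up inputs
brozek(D)
```

Note that siri(1.10) is exactly 0, as it should be: a body made entirely of lean tissue (density a = 1.10) has 0% fat under the two-component model.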
Questions for problem B.
1. Examine the data and note any unusual cases. What should be
done about the unusual cases? (The "about" file mentions one case
for which it is not hard to figure out how to correct the error.)
2. Choose one of the two percentage of body fat estimates (Brozek or
Siri). Fit the percentage of body fat to some subset of the
provided variables, but do not use density (which is too hard to
measure). You need not describe every idea you try, but should justify
your final choice using appropriate diagnostic tools and the like.
3. September 14, 1995 articles in The New England Journal of Medicine
link high values of the adiposity index (weight/height^2), sometimes
called the body mass index, to increased risk of premature death. See
if this variable is useful in your model. Also try
weight^1.2/height^3.3 as suggested in Abdel-Malek, et al. (1985).
4. Comment on the predictive accuracy of your model. For
example, how far off is an individual likely to be?
5. Estimate the percentage of US men whose bodyfat is less than 15%
(which some experts say is the maximum for good health). What
assumptions must
you make?
HW 6 Due Friday, Feb 18 Solutions
for all but #1 (#1 coming soon -- needs to be scanned.)
1) Show that in the case of simple linear regression, the matrix
version of the formula for the parameter estimates, (X'X)^-1 X'Y,
produces the "right" formulas.
2) Show that the formulas on p. 740 of the handout for X'X and its
inverse are correct.
3) Using the couples data set, I found the least squares line used to
predict the husband's age using the wife's age. (Remember, this data
set consists of a random sample of married couples in Great Britain.)
The ANOVA table is below. Based on the ANOVA table, calculate the
r-squared and the residual standard error.
Source    | DF | Sum Sq | Mean Sq | F-value | Pr(>F)
W.Age     |  1 | 3938.0 |  3938.0 |  326.04 | 1.798e-15
Residuals | 24 |  289.9 |         |         |
You can verify your answers by loading the couples data (couples.txt in
the dataset directory), fitting H.age = a + b W.age, and looking at the
summary.
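The arithmetic behind #3 uses only the Sum Sq column and the residual degrees of freedom:

```r
# r-squared and residual standard error recovered from the ANOVA table
# quoted above (SSreg = 3938.0 on 1 df, RSS = 289.9 on 24 df).
ss.reg <- 3938.0
rss    <- 289.9
df.res <- 24

r.squared <- ss.reg / (ss.reg + rss)  # about 0.93
rse       <- sqrt(rss / df.res)       # about 3.48 (years)
c(r.squared, rse)
```

These should match the "Multiple R-squared" and "Residual standard error" lines of summary() when you fit the model yourself.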
4) Download the couples data (http://www.stat.ucla.edu/~rgould/120w05/datasets/couples.txt).
Our goal is to predict the age that men in Great Britain marry
using all other variables. The response variable is
H.AgeAtMarriage.
a) Look at a scatterplot matrix of the variables. Which
predictors seem related to each other? Describe the predictors'
relationships to the age at which men marry.
b) According to your graph, is there evidence of an "historic" trend in
the age at which people marry? That is, did people used to marry
younger than they do now? (This dataset was collected in the mid
1990's and is a random sample of married couples -- not recently
married couples. "age" is their age at the time of the survey.)
c) Fit a linear model without making any transformations.
Interpret the coefficient for husband's age.
d) Why do H.Age and W.age have different signs?
e) Is there a variable that does not seem to predict (or explain)
the age at which men marry? Which is it? Fit a new model
without it.
f) State the new model and check that it meets the necessary
assumptions. Report.
HW 5, Due Friday, Feb 11 Solutions
A. Download
the apiscores data. A full explanation of this data set is
given
below in the Optional Problem B. Our goal is to
understand
what factors influence or affect high-school performance. (Note
that
this is probably not possible since this is an observational study.
Still, we can gain some insight. A less ambitious and more attainable
goal would be to understand how schools with different compositions
vary in terms of their API scores. For example, do all schools with a high
percentage of teachers with emergency credentials score the same? If
not, what accounts for this difference?) One approach to
understanding data like this
is to ask: what explains the variability in y? For example, it
would
be nice if all schools got the same score on the API -- and preferably
this
score would be a good one. But type hist(api99) and you'll see that they
don't.
Why?
a) type plot(api.table). (I'm assuming you named the data api.table,
but if not, use whatever you did name it.) Notice that you
get a
large, messy graph. (You might want to make it larger to make any
sense
of it.) Notice that if you type school[1:2] you'll see that this
variable
contains the names of the schools. But for some reason R just
converted
these names to numbers and plotted them. We're not interested in
treating
school-name as a variable, so first we need to create a new dataframe
that
does not have the school variable:
api.small <- api.table[, -1]
This "subtracts" the first column (which contains the school variable)
and assigns it to api.small.
Now do plot(api.small). The first row contains what are called
"marginal plots". These are the same plots you would get if you
typed
plot(pct.meals, api99) and then plot(not.high.g, api99), etc. The
first one, for example, tells us what the relationship between api
scores and the percentage of students at a school who receive free
meals looks like, ignoring all of the other factors.
Describe this first row. How do each of these variables relate to
the API scores? What relationships look unusual?
b) Later we'll see that relationships between our predictor variables
can be problematic. Do you see any evidence that the predictor
variables are related? Give an example of two related predictor
variables and describe that relationship in the context of the data.
(In other words, don't just say "x has a linear relationship with
y", but instead explain
what these means in terms of schools/parents/teachers/ etc.)
c) You'll notice that the relationships between the API scores and the
various predictors are not terribly linear. Nonetheless, we'll
plow ahead and fit a linear model. There are two ways to do this:
The easy way:
fullfit <- lm(api99 ~ . , data=api.small)
The "." on the right hand side means "all of the variables except api99
that are in the dataframe api.small"
The hard way:
attach(api.small)
fullfit <- lm(api99 ~ pct.meals + not.high.g +
high.grad+some.coll+coll.grad+grad.schl+avg.ed+pct.cred+pct.emerg)
One thing to be concerned about: if in problem A you had typed
"attach(api.table)", before attaching a new table (api.small) you
should first detach the old
one. So type "detach(api.table)" and then you can type
"attach(api.small)". The reason for this is that both tables have
the same variable names, and this will lead to confusion later.
Write down the equation of the fitted model. (Type
"summary(fullfit)")
d) Note that the column of p-values ("Pr(>|t|)") has only one value
less than 5%. Strictly speaking, this means that almost none of the
slopes are significantly different from 0! This can't be correct! The
problem is that the violation of the assumption of linearity (we
assumed all relationships were linear, and they're not) has severe
repercussions. (There are other
problems, too, but we'll get to those much later.) But for now,
assume
that everything is fine and the relationships are linear.
What
does the model tell us about the role that emergency credentials play
in
school performance? (Be as detailed as you can.)
e) The least significant predictor is some.coll. Refit the model
without it. How does this change the summary?
f) Check the diagnostic plots of this new model.
Describe.
(If the new model was named fitwithoutcoll.lm, type
plot(fitwithoutcoll.lm))
OPTIONAL B. You may have heard about the accountability craze that has
hit education and has been increased by the implementation of the No
Child Left Behind policy. For some time, California has been assessing
the performance of its high schools using something called the Academic
Performance Index (API). This data set contains the API for all 806
public high schools in California in 1999. Well, all but one. My alma
mater (Whittier High School) was left out for reasons that will later
be apparent. In this problem we're going to explore the problem of
prediction.
One use of regression is to make predictions; given a set of x
values, what do we think y will be? Regression tries to answer
this by finding the (linear) model that comes closest to the response
values. But
still, this leaves an uneasy feeling: how do we know we will do
so
well if we collect more data (from the same population)? In the
context
of public schools, we might want to use data from 1999 to make
predictions
about the API of schools in future years. Of course, we would
have
to assume that the system was relatively stable from year to year
(which
means that although individual schools might change, the factors that
affect
performance and the way those factors affect performance does not).
An
iffy assumption, but one commonly made, and good for a first step.
Suppose you and I had two different models and both bragged that ours
was best. Maybe my model used the percentage of students
receiving
free meals, and yours included the percentage of teachers who are
credentialed. Maybe both models have the same r-squared.
(Recall that r-squared measures the percent of variation
explained. High r-squared means
points are close to the line and so therefore your predictions should
come
close to the actual observations.) How do we know which will
make
the best predictions? Well, we don't. But we'll find out
next
year when the new data come out. But what if we can't wait until
next year! (And, realistically, it takes more than a year!)
Cross-validation is one procedure that we'll explore.
First, a word about the data:
Variable   | Description
school     | name of school
api99      | API score
pct.meals  | % students receiving free meals
not.high.g | % who do not graduate
high.grad  | % who graduate but no college
some.coll  | % who go to college, no degree
coll.grad  | % who get college degrees, no higher
grad.schl  | % who go to grad school
avg.ed     | average years of higher ed
pct.cred   | % of teachers who are credentialed
pct.emerg  | % of teachers who hold emergency credentials
I think that the data on education (not.high.g and some.coll, for
example) refer to the parents of the children. (Otherwise, there hasn't
been enough time to collect this data.)
a) Using your intuition, choose two variables that you think would make
good predictors. Call these X1 and X2. We'll work with X1
first. Select half of the data at random. The best way to
do this is to create a variable called index:
index <- 1:length(api99)
Next, take a random sample of size round(length(api99)/2): sampleindex
<- sample(index, round(length(api99)/2), replace=F)
Now we divide the data into two random halves. The half indexed by
sampleindex is the "training set" and the other half the "test set".
x1.training <- X1[sampleindex]
api.training <- api99[sampleindex]
x1.test <- X1[!is.element(index, sampleindex)] # this selects
everything that is NOT in sampleindex (the ! means "not")
api.test <- api99[!is.element(index, sampleindex)]
Now, use the training data to find the best linear model you can to
predict api scores using X1. When you've found it (using any
transformations you want), call it output.training
What's the r-squared? (Make a note of this; we'll use it later)
b) Now along comes a "new" data set from the same population:
the test data. How good a job does your model do predicting the
new data points? What you need to do is try it. Use the x's
from the
test set and see what the model (from training data) predicts for the y
values. This also takes several steps.
Probably the easiest way to do this is to find out what your intercept
and slope turn out to be. Call them a and b. Then do
trained.pred <- a+b*x1.test
Now how well do these predictions (using our training data) compare to
what we actually saw? One way is compute the correlation between
the predicted y values, trained.pred, and the actual values,
api.test. Compute
this correlation and square it. What is it? This value is
called
the cross-validation correlation.
c) Calculate the shrinkage on cross-validation: the r-squared
from your fit with the training data minus the cross-validation
correlation. A small number is good --- it means your model is
reliable and gets about
the same r-squared with both sets of data. Large is not good.
Hard
to say how large things have to get. But it is useful to compare
models.
d) Repeat the above steps using X2. Which model (the one with X1
or the one with X2) do you think is better for predicting api scores?
e) Use whichever model you think is best to predict the API score of my
alma mater. Here's what you need to know:
pct.meals 34
not.high.g 27
high.grad 27
some.coll 23
coll.grad 18
grad.schl 5
avg.ed 2.47
pct.cred 85
pct.emerg 15
What you've done is an example of "two-fold" cross-validation because
you broke the data into two equal halves. It is more common to
use
k-fold, where k is a number between 3 and n-1. So if k = 3, you
break
the data into thirds. Two-thirds are your training set, and 1/3
you
use to test, and you repeat everything three times, so that each third
gets
a chance to be the test set. Then you can average your shrinkage
statistics
to get a sense of what it should be.
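Parts (a)-(c) above can be sketched end-to-end in R. The data below are simulated stand-ins (X1 and api99 here are invented, not the real apiscores columns), so the example is self-contained:

```r
# Two-fold cross-validation sketch for parts (a)-(c), on simulated data.
set.seed(120)
n     <- 800
X1    <- runif(n, 0, 100)
api99 <- 800 - 3*X1 + rnorm(n, sd = 40)

index       <- 1:length(api99)
sampleindex <- sample(index, round(length(api99)/2), replace = FALSE)

x1.training  <- X1[sampleindex]
api.training <- api99[sampleindex]
x1.test      <- X1[!is.element(index, sampleindex)]
api.test     <- api99[!is.element(index, sampleindex)]

output.training <- lm(api.training ~ x1.training)
r2.training     <- summary(output.training)$r.squared

a <- coef(output.training)[1]
b <- coef(output.training)[2]
trained.pred <- a + b*x1.test               # predictions for the held-out half

cv.r2     <- cor(trained.pred, api.test)^2  # cross-validation correlation, squared
shrinkage <- r2.training - cv.r2            # small is good
```

On the real data you would substitute your chosen predictor for X1 and apply whatever transformation you settled on before computing trained.pred.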
Homework #4, Due Friday, Feb. 4 Solutions
A. The probability model behind regression says that to determine the
response of a "system", first set the predictor to a value x.
Then the response will be (mean + error) where mean = a + bx and
error =
(random number, normally distributed, mean 0, sd =something.) In
this exercise, you'll generate fake data that follows this model and
do some exploring.
i) Write a function that creates fake data according to a simple
regression model. The inputs to the model should be intercept,
slope, standard deviation, x. The predictor values, x, should be
a vector of values, and the output of the function should be y: a
vector of "observations" for the x. So you should be able to do
something like this:
x <- -10:10
y <- yourfunctionhere(x, 3, 5, 2)
and y will be output for the model y = 3 + 5x + e where e is
N(0,2).
Show me your function, and generate some data using the definition of x
(-10, -9, ....9, 10) shown above.
Plot x vs. y.
ii) OPTIONAL: What is the population standard deviation
of the y's? (Note: I'm not asking you to calculate it from the
data -- that would be the sample standard deviation. I'm asking you to
determine what the model says it should be.) Compare this to the
sample standard deviation of the y's. (This is a little harder than I
had intended. It requires that you know the population standard
deviation of the x's. Feel free to think about it. In case you're
wondering, the population sd of the x's is 5.91)
iii) The model says that the RSS should be approximately equal to
4*(n-2) . Show why. Then compare this to the estimate from
the LS line.
iv) Check the residual plots to check for linearity and
normality. Of course you know these should be good because you made
them that way. But the idea is to get some experience seeing what
"good" plots look like.
v) Which values of x have the largest Cook's distance for your
fitted model? (If "myoutput" is the output of a call to the lm
command, then plot(myoutput) will give you a series of four plots
(just hit return to see the next one). The last plot shows Cook
distances.)
vi) Repeat (ii)-(v) using xnew <- rep(x,10). Make sure to plot
(xnew, ynew).
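A minimal version of the generator asked for in (i), assuming the argument order (x, intercept, slope, sd) used in the call above; the function name is a placeholder:

```r
# Generate fake data from a simple regression model:
# y = intercept + slope*x + e, with e ~ N(0, sd).
fake.regression <- function(x, intercept, slope, sd) {
  intercept + slope*x + rnorm(length(x), mean = 0, sd = sd)
}

x <- -10:10
y <- fake.regression(x, 3, 5, 2)  # y = 3 + 5x + e, e ~ N(0, sd = 2)
plot(x, y)
```

Note that with sd = 0 the function returns exactly the line 3 + 5x, which is a quick sanity check that the deterministic part is right.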
B. What makes one piece of real estate property more or
less valuable than another? The land.txt
datafile (tab-delimited, text) records values on several variables that
could affect the cost of several properties. This dataset
consists of 11 variables. The variable named "LandVal" gives the value
of the land alone -- without consideration of the structure on it.
"TotalVal" is the value of the property as a whole. "Acreage" is the
size of the lot, and the other eight variables describe key features of
the buildings on the parcels of land. "Height" is the number of floors,
"fl1Area" is the square footage of the first floor, and "Rooms",
"Bedrooms", "FullBath", "HalfBath", "Fireplce", and "Garage" are all
exactly what they seem to be.
i) Make a histogram of LandVal. What does this say about the
value of these parcels of land?
ii) Perform a regression to use Acreage to predict LandVal. Your
regression should include
a) an initial scatterplot
b) a final model
c) discussion of any needed transformations
d) discussion of the validity of any assumptions
e) An interpretation of what the model tells us
about this relationship and how well the model fits.
iii) Choose any other variable (say "height") that you think
might be related to the value of land. Plot the residuals from
(ii) against this variable. Describe what you see. Do you
think that this variable would be useful additional information for
predicting the value of land?
iv) The slope in (ii) relates the mean land value to the size of
the lot. What assumptions are needed in order to make inferences
on this slope? Do you think they are valid here? Find a
95% confidence interval for the slope and interpret.
v) Estimate the mean value of properties that sit on .80 acres.
Find a 95% confidence interval. Do you think the
assumptions required to make this inference are satisfied? Explain.
vi) Predict the value of a property that sits on 1.7 acres of
land. Find a 95% confidence interval. Again, discuss
whether
the data can support this inference. (That is, are the
assumptions
valid?)
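Parts (v) and (vi) are both jobs for predict() on an lm fit: interval = "confidence" gives an interval for the mean value at a given acreage, while interval = "prediction" gives the (wider) interval for a single new property. The data below are simulated stand-ins; the real exercise uses land.txt:

```r
# Confidence interval for a mean vs. prediction interval for one property.
# Simulated LandVal/Acreage data stand in for the land.txt file.
set.seed(2)
Acreage <- runif(60, 0.2, 3)
LandVal <- 20000 + 50000*Acreage + rnorm(60, sd = 15000)
fit <- lm(LandVal ~ Acreage)

predict(fit, newdata = data.frame(Acreage = 0.80), interval = "confidence")  # part (v)
predict(fit, newdata = data.frame(Acreage = 1.70), interval = "prediction")  # part (vi)
```

The prediction interval is always wider than the confidence interval at the same x, because it must cover the scatter of individual properties as well as the uncertainty in the fitted line.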
C. Do opposites attract? A random
sample of married couples in Great Britain might help answer this
question (at least with respect to a few variables.)
i) Use a regression to predict husband's height given his wife's
height. Is there a relationship?
ii) Interpret the slope and find a 95% confidence interval.
Interpret the confidence interval, and discuss whether the
assumptions that are required to make this interval interpretable are
true. (Heights are measured in millimeters.)
iii) Identify any influential points. What happens if these are
removed?
iv) According to the model, among all wives who are 1680 mm tall,
about 68% of the husbands are between ? and ? in height?
v) Play matchmaker: predict (with a confidence interval) how
tall the husband of a woman who is 1650 mm tall should be.
vi) A psychologist claims that short men feel the need to compensate by
marrying taller wives. As evidence, he looks only at the
"shorter" men in the sample: those less than the first quartile in
height (which is those less than 1692 mm). He finds that on
average, these men are 1.28 standard deviations shorter than average.
However, their wives are only .47 standard deviations shorter
than average. This, he claims, is evidence that short men
compensate by looking for taller wives. (Not wives who are taller
than them, mind you, but wives who are taller with respect to other
wives.) Critique this reasoning.
SAVE FOR HW 5 (DUE FEB 11):
D. (This problem appears in full as Optional Problem B under HW 5,
above; see that section for the complete description, variable table,
and parts a) through e).)
Homework #3, Due Friday, Jan. 28 Solutions
(let me know if you find a typo or mistakes)
A. (Review of hypothesis testing). It is a well-known fact (that
has been empirically verified) that a flipped coin lands heads
approximately 50% of the time. Did you know, however, that a spun
coin lands on its head less often than 50% of the time? The
reasoning is that if you spin a coin, the extra weight on the head side
pulls the coin so that it is slightly more likely to land with the
tails-side up.
a) Suppose you spin a coin 100 times. How many times would
it have to land Heads before you would believe my claim about spinning
coins? Why?
You
might want to review the binomial distribution. If you are
unfamiliar with it, you can look it up in any introductory statistics
book. It is somewhat difficult to typeset on a web page, but
essentially it goes like this:
Suppose X is a random variable that
represents the number of heads resulting from flipping (or spinning) a
coin. Suppose p represents the probability the coin will land
heads, and n is the number of flips. Now let x (that's lower-case
X) represent the observed number of heads. (A subtle
distinction: X (upper case) represents the unknown number of
heads before we do this experiment; x (lower case) is an
old-fashioned algebraic variable that is a place holder for whatever
value we happen to observe.) Now
P(X = x) = (n!/(n-x)!x!
) * p^x * (1-p)^(n-x)
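As a sanity check, the formula can be compared with R's built-in dbinom function, which computes exactly this probability:

```r
# Check the binomial formula against dbinom for n = 10, p = 0.5, x = 3.
n <- 10; p <- 0.5; x <- 3
manual <- choose(n, x) * p^x * (1 - p)^(n - x)  # choose(n, x) = n! / ((n-x)! x!)
manual
dbinom(x, n, p)  # same value
```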
Three very useful facts:
a) E(X) =
np that is, we expect to get np heads.
b) Var(X) =
np(1-p) so the SD is the square-root of this and represents
the random variability. That is, we expect to get np heads, give
or take sqrt(np(1-p)) heads.
c) If n is sufficiently large (a standard rule of thumb is that both np > 10 and n(1-p) > 10), then the probability distribution of X is closely
approximated by a normal distribution with mean np and standard
deviation sqrt(np(1-p)). (This is essentially a particular case
of the Central Limit Theorem.) So suppose n = 100 and p =
.5. If you want
to find P(X > 64) (the probability of getting more than 64 heads in
100 flips of a fair coin), you can laboriously calculate P(X = 65) + P(X = 66) + ... + P(X = 100) using the binomial distribution above.
Or you can get a pretty good approximation using a normal distribution.
Two useful tips:
a) Actually, with
R, it is not that laborious to use the binomial distribution.
Simply typing pbinom(x,n,p) will calculate P(X <=x ) using the
binomial distribution. Hence P(X > 64) = 1-pbinom(64,100,.5)
b) You need never
use a normal probability table again. pnorm(x,mu,sd)
will give you the P(X <= x) for a N(mu, sd) distribution.
Hence 1-pnorm(64,100*.5, sqrt(100*.5*.5)) will give approximately the
same answer as 1 - pbinom(64,100,.5) (with "approximately" being
the key word.)
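Tips (a) and (b) can be compared directly; the numbers below use the n = 100, p = .5 example from the text:

```r
# Exact binomial probability vs. the normal approximation for the
# n = 100, p = .5 example in the text.
n <- 100; p <- 0.5
exact  <- 1 - pbinom(64, n, p)                         # P(X > 64), exact
approx <- 1 - pnorm(64, n * p, sqrt(n * p * (1 - p)))  # normal approximation
c(exact = exact, approx = approx)  # close, but not identical
```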
Now back to the HW:
b) Suppose in (a) you had decided that you
would conclude that the probability of getting heads when spinning a
coin was NOT .5 if you spun it 100 times and observed x heads or
fewer. (x should be whatever value you used to answer (a)).
Let p = true probability of getting heads when spinning a coin.
Assume p = .5. Use pbinom to find the probability that you
will mistakenly conclude that p is less than .5. This
figure is called the
significance level and is famously represented by the lower-case greek
letter alpha. The significance level is the probability of making
the mistake of rejecting the null hypothesis (that p = .5) when in fact
the null hypothesis is true.
c) Now assume that in truth, p = .45. What's
the probability, using the value x you chose in part (a), that
you will (correctly) conclude that p is not .5 if you carry out the
experiment? This value is called the "power": the
probability
of correctly rejecting the null hypothesis when, in fact, it is false
and should be rejected. You want the power to be as high as
possible.
d) Choose another value of x so that the power is now higher than
before. (x , remember, is the number which, if you do the
experiment and get x heads or fewer, you'll decide that p does not
equal .5.)
e) Using the x you chose in (d), calculate the significance level.
It is a sad fact of life that power and significance level are
adversaries. You want the power to be high, and the significance
level to be low. But if you improve one, you make the other
worse. The only way to improve both is to increase the sample
size.
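Parts (b) and (c) each amount to a single pbinom call. The cutoff of 40 heads below is just a hypothetical choice for illustration, not the answer to part (a):

```r
# Significance level and power for a hypothetical cutoff of 40 heads
# (an illustration only, not the answer to part (a)).
n <- 100; cutoff <- 40
alpha <- pbinom(cutoff, n, 0.50)  # P(X <= 40 | p = .50): significance level
power <- pbinom(cutoff, n, 0.45)  # P(X <= 40 | p = .45): power
c(alpha = alpha, power = power)
```

Raising the cutoff raises both numbers, which is the power/significance trade-off described above.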
B. Load the survey
data set into R (use the command
read.table("survey.txt", header=T, sep="\t")). This data consists
of responses to a survey by a Stats 11 class.
a) Plot weight against height (weight should be on the
y-axis.) There are two outliers. To find out where they are
in the data set, do the following:
i) type identify(height, weight)
ii) place your cursor over one of the
points and click. Do the same for the other point.
iii) Hit the escape key.
If all went well, two numbers should appear, and
these numbers are the locations of these two points in the data
set.
b) Create two new variables, height2 and weight2 that have these two
outlier points removed. (For example, if you wanted to remove the
6th point, type height2 <- height[-6].)
c) Plot weight2 against height2 and comment on the relationship between
height and weight (which should be obvious without the graph). This
plot combines men and women. Can you see any evidence of two
distinct groups? (The identify command can be fun here. Type identify(height2, weight2, labels=gender) and click on a few points; then hit the escape key.)
d) Find the least-squares regression line and interpret the results.
(See the bottom of p. 108 and top of 109).
e) Plot the residuals against height2. Do you see any
deviations from the model? (If output.lm is the output of your least-squares fit, then resid <- residuals(output.lm) will give you the residuals. Then do plot(height2, resid) to put the residuals on the y-axis.)
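Parts (d) and (e) might look like this in R; the height2/weight2 values here are simulated stand-ins for the survey variables:

```r
# Sketch of fitting the line (d) and plotting residuals (e), with
# simulated stand-ins for the survey variables height2 and weight2.
set.seed(2)
height2 <- rnorm(50, 67, 3)
weight2 <- -200 + 5 * height2 + rnorm(50, sd = 15)

output.lm <- lm(weight2 ~ height2)
summary(output.lm)           # coefficients and p-values

res <- residuals(output.lm)  # one residual per observation
plot(height2, res)           # residuals on the y-axis, height2 on the x-axis
abline(h = 0, lty = 2)       # reference line at zero
```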
f) What assumptions must you make if we are to be able to
interpret the p-values provided in the output?
g) Algebraically solve the least squares regression line
to find the equation using weight to predict height.
h) Now fit a regression line using weight to predict
height. How does it compare to your algebraic solution in
(g)? Why do you think this is?
i) Now fit a least squares regression line using height to predict
weight but using the original variables (that include the
outliers). Do the outliers have a big effect on the fitted model?
C1. In this problem we'll turn things on their side and use height to
predict gender. This problem uses the same survey data as in B.
But first, we need to create a new variable. The gender
variable has two values, "m" and "f". We want to recode
this as 0 and 1. Do the following:
gender.code <- numeric(length(gender)) (creates a numeric vector, initially all zeroes, with the same length as gender.)
gender.code[gender=="m"] <- 0
gender.code[gender=="f"] <- 1
Now the men are coded with 0's and the women with 1's.
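A compact alternative to the recoding above is ifelse(), shown here with a tiny made-up gender vector:

```r
# One-line recoding with ifelse(), using a small made-up vector.
gender <- c("m", "f", "f", "m", "f")
gender.code <- ifelse(gender == "m", 0, 1)
gender.code  # 0 for men, 1 for women
```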
a) Make side-by-side boxplots of weight for men and
women. Compare.
b) Although there is considerable overlap in weight, still the means
are different. Suppose I randomly select a person from this class and
tell you their weight. Can you predict which gender they are most
likely to be? This is a bit unusual compared to other
problems we've seen: the response variable is binary and has
values 0/1 while the predictor is continuous. One way to think
about this is that you want to know the probability that a person is
female
given that you know their weight. So for each weight, we can
estimate
it by finding what percentage of people at that weight are women.
Plot gender.code against weight (weight on x axis) and fit a lowess
line.
What does the line tell you about the probability that a randomly
selected person is a woman if you know their weight?
c) Find the mean weight of men and of women.
d) Fit a regression line using weight to predict gender.
Interpret the coefficients and compare to your answer in (c).
C2. Now we reverse things.
a) Fit a regression using gender to predict weight.
Interpret the coefficients. (It might help to look at the plot of weight (y axis) vs. gender.)
b) Do a t-test to test whether the mean weight of men and the
mean weight of women are equal. Assume that the standard
deviations ARE EQUAL in these two populations. What assumptions
must you make to carry out this test?
c) Now compare the value of the t-statistic in part (b) with the
t-statistic that corresponds to the slope you got in part (a).
What does this tell you about the meaning of the t-test
associated with this slope? What assumptions must you make in
order for this
t-test to be valid?
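The equivalence that part (c) asks about can be previewed with simulated data: the pooled (equal-variance) two-sample t-statistic matches the t-statistic of the slope when you regress weight on a 0/1 gender code.

```r
# Simulated demonstration: the pooled two-sample t-statistic equals
# (up to sign) the slope t-statistic from a regression on a 0/1 code.
set.seed(3)
weight <- c(rnorm(30, 180, 20), rnorm(30, 140, 20))
gender.code <- rep(c(0, 1), each = 30)

tt  <- t.test(weight[gender.code == 0], weight[gender.code == 1],
              var.equal = TRUE)
fit <- summary(lm(weight ~ gender.code))
c(t.test = unname(tt$statistic),
  slope = fit$coefficients["gender.code", "t value"])  # equal in absolute value
```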
D.
a) Fit a regression line to Galton's
data of father/son heights. (Once you've downloaded the file
to the right directory, simply type source("filename").)
b) Use the regression line to find out: how many standard deviations above average are typical sons whose fathers are 1 standard deviation above average?
c) How many standard deviations below average are typical sons whose
fathers are 1 standard deviation below average?
Galton called this phenomenon (that taller than average and shorter
than average fathers produce more "mediocre" sons) "regression to the
mean."
d) Suppose a certain large class takes two midterms.
What does regression to the mean say about the people who get A's on
the first midterm? (For the sake of conversation, suppose that to
get an A you need to be 2 standard deviations above average).
Assume
that r = .4 and the standard deviations of both exams are the same.
E. Read #3 on p. 132. You can get the data very simply; just
type data(cars).
Type names(cars) to see the
variables. Type attach(cars) to use the variables by their first
name. The data consist of stopping distances for a car at various
speeds.
a) Make a plot to help us determine how much distance is needed to
stop for a given speed.
b) Plot the least-squares regression line and comment on
whether the model fits. Superimpose the lowess line and compare.
c) Make a plot of the residuals to confirm your answer to (b).
d) Now we'll try transforming the y-variable. Fit a curve
of distance^2 to speed. Repeat for sqrt(distance) and
log(distance). Which fits best and why?
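One way to organize part (d) with the built-in cars data is to fit all the candidate models at once. (R-squared values for differently transformed responses aren't strictly comparable, so the residual plots from (c) are the better guide; the numbers below are just a starting point.)

```r
# Fitting the transformations from part (d) to the built-in cars data.
# R-squared values across different responses are not strictly
# comparable; residual plots are the better guide.
data(cars)
fits <- list(
  raw    = lm(dist ~ speed, data = cars),
  square = lm(I(dist^2) ~ speed, data = cars),
  sqrt   = lm(sqrt(dist) ~ speed, data = cars),
  log    = lm(log(dist) ~ speed, data = cars)
)
sapply(fits, function(f) summary(f)$r.squared)
```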
F. Download the oldfaithful
data into R. This consists of information about eruptions of
the Old Faithful Geyser
in October 1980. The length variable is the number of minutes
that an eruption lasts and duration is the time in minutes to the next
eruption. Old Faithful has many tourists and this data (well,
data like these) are used to derive an equation that helps park rangers
tell
tourists when to expect the next eruption.
a) Make a histogram of duration. Describe the
distribution. What would you say is the typical waiting time for the
next eruption?
b) Make a graph to show the relation between the length of the
eruption and the time until the next eruption. Describe this
relationship in words. Does it make sense?
c) Find an equation that predicts the time to the next eruption based
on the length of the current eruption.
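R ships with a faithful data set (a different sample of Old Faithful eruptions than the October 1980 file, with variables eruptions and waiting), which can be used to rehearse part (c):

```r
# R's built-in faithful data (a different Old Faithful sample, with
# variables eruptions and waiting) can be used to rehearse part (c).
data(faithful)
fit <- lm(waiting ~ eruptions, data = faithful)
coef(fit)  # intercept and slope of the prediction equation
predict(fit, newdata = data.frame(eruptions = 4))  # wait after a 4-min eruption
```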
G. (optional). The EPA publishes fuel-efficiency data for a
variety of types of cars. http://www.epa.gov/fueleconomy/data.htm.
Download
the 2005 data here. It is a text file in which values are
separated by commas, and so you load it into R by typing
read.table("fuel.csv", header=T, sep=",").
You can view the variable names with the names() function.
You can see what the names mean by reading this file, although you'll
have to scroll down a bit.
In general, cars with bigger engines are less efficient.
One measure of the size of an engine is the displacement which
measures, in liters, the amount of fuel moved through the engine.
Come
up with the best equation you can to predict the highway mileage
(hwy)
based on displacement (displ). Evaluate your model.
Use the identify function to identify any unusual cars. (For
example, which car has the worst fuel efficiency? the best?)
Homework #2, Due Friday, January 21
Solution (pdf)
p. 103: 2
#A: (a) Using the data from my songlists: http://www.stat.ucla.edu/~rgould/120w05/datasets/songlist.txt,
do a statistical test to determine whether the mean song
length is different for rock than for classical. State all
assumptions, and in particular be clear about what the population is
that this hypothesis applies to. Talk about how you decided to
handle any outliers. Does it matter how the outliers are handled
here? Check any assumptions you can. For example, if you assumed
the times came from
a normal distribution, can you check whether that seems reasonable?
(There's
no need to get fancy with this problem and do a non-parametric or
bootstrap
-- but you can if you want. The point is to just explore and
practice
with R a bit.)
(b) Compute a 95% confidence interval for the mean difference in
lengths of songs between these two groups. Again, state any assumptions you make, but no need to check them.
#B: Upload
this dataset into a data frame:
http://www.stat.ucla.edu/~rgould/datasets/bloodpressure.dat.
(It's a tab-delimited file -- you'll need to save it to your hard drive.) It consists of the systolic and diastolic blood
pressure of 15 patients with moderate hypertension. Measurements
were taken immediately before and two hours after taking 25 mg of the
drug captopril. The drug is expected to lower blood pressures.
Does it work? Discuss. Explain.
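Since each patient is measured twice, a paired test is the natural tool. The numbers below are simulated, not the captopril data, so only the shape of the analysis carries over:

```r
# Paired t-test sketch; the numbers are simulated, NOT the captopril data.
set.seed(4)
before <- rnorm(15, 180, 15)          # blood pressure before the drug
after  <- before - rnorm(15, 10, 5)   # after: roughly 10 units lower
tt <- t.test(before, after, paired = TRUE, alternative = "greater")
tt$p.value  # small p-value here: the (simulated) drug lowers pressure
```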
#C: Upload http://www.stat.ucla.edu/~rgould/120w05/datasets/anscombe.txt
into a data frame. It's artificial data, so the context isn't
important. (a) Without making any graphs, find the correlations
between the following variables: X and Y1, X and Y2, X and
Y3, X4
and Y4. (Example, cor(X, Y1) will do it.) Comment.
(b) Now make scatterplots of these pairs. (One quick way is
to do "plot(dataframe)" where dataframe is whatever you named
it. Note that this gives
you too many plots (for one thing, it not only gives Y1 vs. X but also X vs. Y1, and it also gives X vs. X4, for example). Comment. The
moral of this activity is that you need to do plots before computing
correlations.
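R's built-in anscombe data frame is the same quartet (with variables x1-x4 and y1-y4 rather than X, Y1, ..., X4, Y4), so part (a) can be previewed:

```r
# R's built-in anscombe data is the same quartet (variables x1-x4, y1-y4).
data(anscombe)
r <- with(anscombe, c(cor(x1, y1), cor(x2, y2), cor(x3, y3), cor(x4, y4)))
round(r, 3)  # all four are approximately 0.816
```

Despite the nearly identical correlations, plot(anscombe) shows four completely different relationships, which is the moral of the exercise.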
#D. Upload http://www.stat.ucla.edu/~rgould/120w05/datasets/wine.txt
into a data frame. (Again, a tab-delimited, text file,
so do
wineframe <- read.table("wine.txt", header=T, sep="\t") to read it
in.) This data set contains heart disease mortality per 1000 people and wine consumption in liters per person per year for a number of
different countries. Make a scatterplot of mortality against wine
consumption and comment on the shape of the relationship and what this
says about wine consumption and health. Do you think the
relationship is causal? Why or why not?
Homework #1, Due Friday, January 14 Solutions
p. 26: 1,2,3 (state your observations about the relation
between temperature and failures), 4-10.
p. 50: 1,2,5
p. 69: 2-5
#A: Here is a link to
information collected from a Stats 11 class. Comment on the
distribution of men's heights. Do they seem normally distributed?
(Note: you'll have to react to an outlier before answering the
question.)
HINT: to download the data: save the file on your hard-drive in
the directory that you have identified as your "working directory."
Then
whatevernameyoulike <- read.table("survey.txt", header=T, sep="\t")
will read it into R. This is the standard format for
tab-delimited text files in which a header supplies variable names.
The sep="\t" option tells R that fields are separated by tabs.
The header=T option tells R that the first line contains
variable names.
#B: This text file contains Michelson's
measurements of the velocity of light in air. In 1879,
A.A. Michelson made 100 determinations of the velocity of light in
air using a modification of a method proposed by the French physicist
Foucault. The data are given here as reported by Stigler.
The
numbers are in km/sec, and have had 299,000 subtracted from them.
The
currently accepted "true" velocity of light in vacuum is 299,792.5
km/sec.
Stigler has applied corrections used by Michelson and reports
that
the "true" value appropriate for comparison to these measurements is
734.5.
Each trial may be a summary of several experimental observations.
a) Perform a t-test to check whether the data support the "true" value
(734.5). State any assumptions.
b) Check assumptions.
c) Construct side-by-side boxplots of the velocities based on the trial
number. Describe. How might this affect your conclusion?
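R's built-in morley data frame holds this same Michelson data set (Speed is in km/sec with 299,000 subtracted, and runs are grouped by experiment number Expt rather than a single trial column), so parts (a) and (c) can be rehearsed:

```r
# R's built-in morley data frame holds this same Michelson data
# (Speed is km/sec with 299,000 subtracted; runs grouped by Expt).
data(morley)
tt <- t.test(morley$Speed, mu = 734.5)  # part (a): test against 734.5
tt
boxplot(Speed ~ Expt, data = morley,    # part (c): boxplots by experiment
        xlab = "Experiment", ylab = "Speed - 299,000 (km/sec)")
```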