Earlier Homeworks

R tutorials:
Intro to basic commands
Intro to writing functions
Official R Introduction

HW 7 Due Friday, March 4 (note that this is in two weeks) Solutions

A.  
i) Using the mussel data (mussels.short), fit a linear model using food level to predict the thickness of the mussel beds.  Comment on how well the data fit the model.  No need to do adjustments; just check the diagnostics and comment.  NOTE: to load this data into R, you should type
whatevernameyouwant <- read.table("mussels.short", header=T)

ii) Use an added variable plot to determine which of these variables: temp, waves, human.use, will contribute to predicting the thickness of the mussel bed if we already know the food level.
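As a sketch of the mechanics, an added variable plot is a residuals-versus-residuals plot. The snippet below uses simulated data (the names food, temp, and thickness just mirror the mussel variables; the numbers are made up):

```r
# Sketch of an added variable plot on simulated data; the names food, temp,
# and thickness mirror the mussel variables but the numbers are made up.
set.seed(1)
n <- 50
food <- runif(n, 0, 10)
temp <- runif(n, 5, 25)
thickness <- 2 + 0.8 * food + 0.3 * temp + rnorm(n)

res.y <- resid(lm(thickness ~ food))   # what's left of thickness after food
res.x <- resid(lm(temp ~ food))        # what's left of temp after food

# A clear slope here means temp still contributes beyond food:
plot(res.x, res.y)
avp.slope <- coef(lm(res.y ~ res.x))[2]
```

The slope in the added variable plot equals the coefficient temp would get in lm(thickness ~ food + temp), which is why the plot answers the "given food level" question.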

B.
First, download the "fat" dataset.  This dataset consists of a series of R commands, and so to download it, all you need to do is save the file in your working directory, start up R, and type source("fat.html").  Next, type ls() and you'll see a data object named fat.  Type names(fat) and you'll see the variables.

This homework exercise is taken from an article called "Fitting Percentage of Body Fat to Simple Body Measurements", Roger W. Johnson, Journal of Statistics Education, v. 4, n. 1 (1996).  In a nutshell: there is a fairly accurate way of measuring the percentage of body fat a person has, and it involves immersing that person in a large tank of water.  This is obviously not practical for home use, or even the doctor's office.  And so it is expedient to have another method.  This lab involves building a model that uses easy-to-measure aspects of a person to predict their percentage of body fat.

From the article:

Abstract


Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) are recorded for 252 men. Body fat, one measure of health, has been accurately estimated by an underwater weighing technique. Fitting body fat to the other measurements using multiple regression  provides a convenient way of estimating body fat for men using only a scale and a measuring tape. This dataset can be used to show students the utility of multiple regression and to provide  practice in model building.

1. Introduction


1 A variety of popular health books suggest that readers assess their health, at least in part, by estimating their percentage of body fat.  Bailey (1994, pp. 179-186), for instance, presents tables of estimates based on age, gender, and various skinfold measurements obtained using a caliper.  Bailey (1991, p. 18) suggests that "15 percent fat for men and 22 percent fat for women are maximums for good health."  Behnke and Wilmore (1974, pp. 66-67), Wilmore (1976, p. 247), Katch and McArdle (1977, pp. 120-132), and Abdel-Malek, et al. (1985) are other sources of predictive equations for body fat. These predictive equations use  skinfold measurements, body circumference measurements (e.g., abdominal circumference), and, in the Abdel-Malek article, simply height and weight.  Gardner and Poehlman (1993, 1994) supplement these body measurements with a measure of physical activity to predict body density from which, as we shall see below, body fat can be estimated.

2 Such predictive equations for the determination of body fat  can be determined through multiple regression. A group of subjects is gathered, and various body measurements and an accurate estimate of the percentage of body fat are recorded for each. Then body fat can be fit to the other measurements using multiple regression, giving, we hope, a useful predictive equation for people similar to the subjects. The various measurements other than body fat recorded on the subjects are, implicitly, ones that are easy to obtain and serve as proxies for body fat, which is not so easily obtained.

3 In the dataset provided by Dr. A. Garth Fisher (personal communication, October 5, 1994), age, weight, height, and 10 body circumference measurements are recorded for 252 men. Each man's percentage of body fat was accurately estimated by an underwater weighing technique discussed below. A complete listing of the variables in the dataset appears in the Appendix.

2. Determination of the Percentage of Body Fat from Underwater Weighing


4 The percentage of body fat for an individual can be estimated from body density. As an approximation, assume that the body consists of two components -- lean tissue and fat tissue. Letting
     D = body density,
     W = body weight,
     A = proportion of lean tissue,
     B = proportion of fat tissue (so A + B = 1),
     a = density of lean tissue, and
     b = density of fat tissue,

 we have
     D = weight/volume
       = W/[lean tissue volume + fat tissue volume]
       = W/[A*W/a + B*W/b]
       = 1/[(A/a) + (B/b)]. 

Solving for B we find
     B = (1/D) * [ab/(a - b)] - [b/(a - b)].

5 Using the estimates a = 1.10 gm/cm^3 and b = 0.90 gm/cm^3 (see Katch and McArdle 1977, p. 111, or Wilmore 1976, p. 123), we come up with "Siri's equation" (Siri 1956):
Percentage of body fat (i.e., 100 * B) = 495/D - 450,

where D is in units of gm/cm^3. The dataset provided also gives a second estimate of body fat due to Brozek, Grande, Anderson, and Keys (1963, p. 137):
Percentage of body fat = 457/D - 414.2,

which is considered accurate for "individuals in whom the body weight has been free from large, recent fluctuations." There does not seem to be uniform agreement in the literature as to which of these two methods is best.
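As a quick numerical check, the algebra above can be verified in R; the body density D = 1.05 below is an arbitrary illustrative value:

```r
# Check that Siri's equation follows from the two-component model:
# B = (1/D)*[ab/(a-b)] - [b/(a-b)] with a = 1.10, b = 0.90.
a <- 1.10
b <- 0.90
D <- 1.05                                  # an arbitrary body density (gm/cm^3)
B <- (1/D) * (a * b / (a - b)) - b / (a - b)
siri <- 495 / D - 450                      # Siri's equation for 100*B
# ab/(a-b) = 4.95 and b/(a-b) = 4.5, so 100*B = 495/D - 450 exactly.
```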

6 Volume, and hence the body density D, can be accurately measured in a variety of ways. The technique of underwater weighing "computes body volume as the difference between body weight measured in air and weight measured during water submersion.  In other words, body volume is equal to the loss of weight in water with the appropriate temperature correction for the water's density" (Katch and McArdle 1977, p. 113). Using this technique,
Body density = W/[(W - WW)/c.f. - LV],

     where

     W = weight in air (kg)
     WW = weight in water (kg)
     c.f. = water correction factor
            (equal to 1 at 39.2 degrees F because one gram of
            water occupies exactly one cm^3 at this temperature,
            equal to .997 at 76-78 degrees F)
     LV = residual lung volume (liters)

(Katch and McArdle 1977, p. 115). The dataset provided here contains the weights of the subjects, but not the values of the three other quantities. Other methods of determining body volume are given in Behnke and Wilmore (1974, p. 22 ff.).
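The weighing formula can be wrapped up as a small R function; the example inputs below are invented for illustration, not taken from the dataset:

```r
# The underwater-weighing formula as a small R function; the example
# inputs are invented for illustration, not taken from the dataset.
body.density <- function(W, WW, cf, LV) {
  # W = weight in air (kg), WW = weight in water (kg),
  # cf = water correction factor, LV = residual lung volume (liters)
  W / ((W - WW) / cf - LV)
}

D <- body.density(W = 80, WW = 3.5, cf = 0.997, LV = 1.2)
fat.pct <- 495 / D - 450   # then convert density to body fat via Siri
```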


Questions for problem B.

1. Examine the data and note any unusual cases.  What should be done about the unusual cases?  (The "about" file mentions one unusual case for which it is not hard to figure out a correction.)

2. Choose one of the two percentage of body fat estimates (Brozek or Siri).  Fit the percentage of body fat to some subset of the provided variables, but do not use density (which is too hard to measure). You need not describe every idea you try, but should justify your final choice using appropriate diagnostic tools and the like.

3. September 14, 1995 articles in The New England Journal of Medicine link high values of the adiposity index (weight/height^2), sometimes called the body mass index, to increased risk of premature death. See if this variable is useful in your model. Also try weight^1.2/height^3.3 as suggested in Abdel-Malek, et al. (1985).

4.  Comment on the predictive accuracy of your model.  For example, how far off is an individual likely to be?

5. Estimate the percentage of US men whose bodyfat is less than 15% (which some experts say is the maximum for good health).  What assumptions must you make?

HW 6 Due Friday, Feb 18   Solutions for all but #1 (#1 coming soon -- needs to be scanned.)

1) Show that in the case of simple linear regression, the matrix version of the formula for the parameter estimates, (X'X)^-1 X'Y, produces the "right" formulas.
2) Show that the formulas on p. 740 of the handout for X'X and its inverse are correct.
3) Using the couples data set, I found the least squares line used to predict the husband's age using the wife's age.  (Remember, this data set consists of a random sample of married couples in Great Britain.)  The ANOVA table is below.  Based on the ANOVA table, calculate the r-squared and the residual standard error.

Source      DF   Sum Sq   Mean Sq   F-value   Pr(>F)
W.Age        1   3938.0    3938.0    326.04   1.798e-15
Residuals   24    289.9



You can verify your answers by loading the couples data (couples.txt in the dataset directory), fitting H.age = a + b W.age, and looking at the summary.
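As a reminder of the mechanics, here is how those two quantities come out of an ANOVA table, using made-up sums of squares (NOT the numbers in the table above):

```r
# Reading r-squared and the residual standard error off an ANOVA table,
# using made-up sums of squares (NOT the numbers in the table above):
ss.reg <- 500    # regression (model) sum of squares -- hypothetical
ss.res <- 100    # residual sum of squares -- hypothetical
df.res <- 20     # residual degrees of freedom -- hypothetical

r.squared <- ss.reg / (ss.reg + ss.res)   # share of total variation explained
rse <- sqrt(ss.res / df.res)              # sqrt of the residual mean square
```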
4) Download the couples data (http://www.stat.ucla.edu/~rgould/120w05/datasets/couples.txt).   Our goal is to predict the age that men in Great Britain marry using all other variables.  The response variable is H.AgeAtMarriage.
a) Look at a scatterplot matrix of the variables.  Which predictors seem related to each other?  Describe the predictors' relationships to the age at which men marry.
b) According to your graph, is there evidence of an "historic" trend in the age at which people marry?  That is, did people used to marry younger than they do now?  (This dataset was collected in the mid 1990's and is a random sample of married couples -- not recently married couples.  "age" is their age at the time of the survey.)
c)  Fit a linear model without making any transformations. Interpret the coefficient for husband's age.
d) Why do H.Age and W.age have different signs?
e)  Is there a variable that does not seem to predict (or explain) the age at which men marry?  Which is it?  Fit a new model without it.
f) State the new model and check that it meets the necessary assumptions.  Report.




HW 5, Due Friday, Feb 11  Solutions


A.  Download the apiscores data.  A full explanation of this data set is given below in Optional Problem B.  Our goal is to understand what factors influence or affect high-school performance.  (Note that this is probably not possible since this is an observational study. Still, we can gain some insight.  A less ambitious and more attainable goal would be to understand how schools with different compositions vary in terms of their API scores.  For example, do all schools with a high percentage of teachers with emergency credentials score the same? If not, what accounts for this difference?)  One approach to understanding data like this is to ask: what explains the variability in y?  For example, it would be nice if all schools got the same score on the API -- and preferably this score would be a good one. But type hist(api) and you'll see that they don't.  Why?

a) Type plot(api.table).  (I'm assuming you named the data api.table, but if not, type whatever you did name it.)  Notice that you get a large, messy graph.  (You might want to make the window larger to make any sense of it.)  Notice that if you type school[1:2] you'll see that this variable contains the names of the schools.  But for some reason R just converted these names to numbers and plotted them.  We're not interested in treating school-name as a variable, so first we need to create a new dataframe that does not have the school variable:
api.small <- api.table[, -1]
This "subtracts" the first column (which contains the school variable) and assigns it to api.small.

Now do plot(api.small).  The first row contains what are called "marginal plots".  These are the same plots you would get if you typed
plot(pct.meals, api99) and then plot(not.high.g, api99), etc.  The first one, for example, tells us what the relationship between api scores and the percentage of students at a school who receive free meals looks like, ignoring all of the other factors.

Describe this first row.  How do each of these variables relate to the API scores?  What relationships look unusual?

b) Later we'll see that relationships between our predictor variables can be problematic.  Do you see any evidence that the predictor variables are related?  Give an example of two related predictor variables and describe that relationship in the context of the data.  (In other words, don't just say "x has a linear relationship with y", but instead explain what this means in terms of schools/parents/teachers/etc.)

c) You'll notice that the relationships between the API scores and the various predictors are not terribly linear.  Nonetheless, we'll plow ahead and fit a linear model.  There are two ways to do this:

The easy way:
fullfit <- lm(api99 ~ .  , data=api.small)

The "." on the right hand side means "all of the variables except api99 that are in the dataframe api.small"

The hard way:
attach(api.small)
fullfit <- lm(api99 ~ pct.meals + not.high.g + high.grad+some.coll+coll.grad+grad.schl+avg.ed+pct.cred+pct.emerg)

One thing to be concerned about: if in problem A you had typed "attach(api.table)", before attaching a new table (api.small) you should first detach the old one.  So type "detach(api.table)"  and then you can type "attach(api.small)".  The reason for this is that both tables have the same variable names, and this will lead to confusion later.

Write down the equation of the fitted model.  (Type "summary(fullfit)")

d) Note that the column of p-values ("Pr(>|t|)") has only one value less than 5%.  Strictly speaking, this means that all but one of the slopes are not significantly different from 0!  This can't be correct! The problem is that the violation of the assumption of linearity (we assumed all relationships were linear, and they're not) has severe repercussions.  (There are other problems, too, but we'll get to those much later.)  But for now, assume that everything is fine and the relationships are linear.   What does the model tell us about the role that emergency credentials play in school performance?  (Be as detailed as you can.)

e) The least significant predictor is some.coll.  Refit the model without it.  How does this change the summary?

f) Check the diagnostic plots of this new model.   Describe.  (If the new model was named fitwithoutcoll.lm, type plot(fitwithoutcoll.lm).)

OPTIONAL B: You may have heard about the accountability craze that has hit education and has been intensified by the implementation of the No Child Left Behind policy. For some time, California has been assessing the performance of its high schools using something called the Academic Performance Index (API).   This data set contains the API for all 806 public high schools in California in 1999.  Well, all but one.  My alma mater (Whittier High School) was left out for reasons that will later be apparent.  In this problem we're going to explore the problem of prediction.

One use of regression is to make predictions: given a set of x values, what do we think y will be?  Regression tries to answer this by finding the (linear) model that comes closest to the response values.  But still, this leaves an uneasy feeling:  how do we know we will do so well if we collect more data (from the same population)?  In the context of public schools, we might want to use data from 1999 to make predictions about the API of schools in future years.  Of course, we would have to assume that the system was relatively stable from year to year (which means that although individual schools might change, the factors that affect performance and the way those factors affect performance do not).  An iffy assumption, but one commonly made, and good for a first step.

Suppose you and I had two different models and both bragged that ours was best.  Maybe my model used the percentage of students receiving free meals, and yours included the percentage of teachers who are credentialed.  Maybe both models have the same r-squared.  (Recall that r-squared measures the percent of variation explained.  High r-squared means points are close to the line, and so your predictions should come close to the actual observations.)   How do we know which will make the best predictions?  Well, we don't.  But we'll find out next year when the new data come out.   But what if we can't wait until next year?  (And, realistically, it takes more than a year!)

Cross-validation is one procedure that we'll explore.

First, a word about the data:
Variable      Description
school        name of school
api99         API score
pct.meals     % students receiving free meals
not.high.g    % who do not graduate
high.grad     % who graduate but no college
some.coll     % who go to college, no degree
coll.grad     % who get college degrees, no higher
grad.schl     % who go to grad school
avg.ed        average years of higher ed
pct.cred      % of teachers who are credentialed
pct.emerg     % of teachers who hold emergency credentials

I think that the data on education (not.high.g and some.coll, for example) refer to the parents of the children. (Otherwise, there hasn't been enough time to collect this data.)

a) Using your intuition, choose two variables that you think would make good predictors.  Call these X1 and X2.  We'll work with X1 first.  Select half of the data at random.  The best way to do this is to create a variable called index:
    index <- 1:length(api99)
Next, take a random sample of size round(length(api99)/2):  sampleindex <- sample(index, round(length(api99)/2), replace=F)

Now we divide the data into two random halves: the half picked out by sampleindex will be the "training set", and the rest will be the "test set".
x1.training <- X1[sampleindex]
api.training <- api99[sampleindex]
x1.test <- X1[!is.element(index, sampleindex)]   # this selects everything that is NOT in sampleindex (the ! means "not")
api.test <- api99[!is.element(index, sampleindex)]

Now, use the training data to find the best linear model you can to predict api scores using X1.  When you've found it (using any transformations you want), call it output.training

What's the r-squared?  (Make a note of this; we'll use it later)

b) Now along comes a "new" data set from the same population:   the test data.  How good a job does your model do predicting the new data points?  What you need to do is try it.  Use the x's from the test set and see what the model (from training data) predicts for the y values.  This also takes several steps.

Probably the easiest way to do this is to find out what your intercept and slope turn out to be. Call them a and b.  Then do
trained.pred <- a+b*x1.test

Now how well do these predictions (made from our training data) compare to what we actually saw?  One way is to compute the correlation between the predicted y values, trained.pred, and the actual values, api.test.  Compute this correlation and square it.  What is it?  This value is called the cross-validation correlation.

c) Calculate the shrinkage on cross-validation: the r-squared from your fit to the training data minus the cross-validation correlation.  A small number is good --- it means your model is reliable and gets about the same r-squared with both sets of data.  Large is not good.  It's hard to say how large is too large, but shrinkage is useful for comparing models.
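Steps (a)-(c) can be sketched end to end.  Since the api data aren't reproduced here, the snippet below simulates a stand-in for api99 and X1; the variable names match the instructions above:

```r
# Simulated stand-in for the API data -- the real X1 and api99 come from
# the apiscores dataset; these numbers are made up for illustration.
set.seed(1)
n <- 200
X1 <- runif(n, 0, 100)
api99 <- 800 - 3 * X1 + rnorm(n, sd = 40)

# (a) split the data into two random halves
index <- 1:length(api99)
sampleindex <- sample(index, round(length(api99)/2), replace = FALSE)
x1.training <- X1[sampleindex]
api.training <- api99[sampleindex]
x1.test <- X1[!is.element(index, sampleindex)]
api.test <- api99[!is.element(index, sampleindex)]

output.training <- lm(api.training ~ x1.training)
r2.training <- summary(output.training)$r.squared

# (b) predict the test set with the training-set intercept and slope
a <- coef(output.training)[1]
b <- coef(output.training)[2]
trained.pred <- a + b * x1.test
cv.corr <- cor(trained.pred, api.test)^2   # cross-validation correlation

# (c) shrinkage: training r-squared minus the cross-validation correlation
shrinkage <- r2.training - cv.corr
```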

d) Repeat the above steps using X2.  Which model (the one with X1 or the one with X2) do you think is better for predicting api scores?

e) Use whichever model you think is best to predict the API score of my alma mater.  Here's what you need to know:
pct.meals    34
not.high.g    27
high.grad    27
some.coll    23
coll.grad    18
grad.schl    5
avg.ed    2.47
pct.cred    85
pct.emerg    15

What you've done is an example of "two-fold" cross-validation because you broke the data into two equal halves.  It is more common to use k-fold, where k is a number between 3 and n-1.  So if k = 3, you break the data into thirds.  Two-thirds are your training set, and 1/3 you use to test, and you repeat everything three times, so that each third gets a chance to be the test set.  Then you can average your shrinkage statistics to get a sense of what it should be.
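A minimal 3-fold version of the same idea, again on simulated data, might look like this; each third takes a turn as the test set:

```r
# A 3-fold version: each third takes a turn as the test set.
# The data are simulated for illustration.
set.seed(2)
n <- 150
x <- runif(n, 0, 100)
y <- 10 + 2 * x + rnorm(n, sd = 20)

fold <- sample(rep(1:3, length.out = n))   # randomly assign each case to a fold
shrink <- numeric(3)
for (k in 1:3) {
  train <- fold != k                       # two-thirds train, one-third test
  fit <- lm(y[train] ~ x[train])
  pred <- coef(fit)[1] + coef(fit)[2] * x[!train]
  shrink[k] <- summary(fit)$r.squared - cor(pred, y[!train])^2
}
mean(shrink)   # average shrinkage over the three folds
```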






Homework #4, Due Friday, Feb. 4 Solutions

A. The probability model behind regression says that to determine the response of a "system", first set the predictor to a value x.  Then the response will be (mean + error), where mean = a + bx and error = (a random number, normally distributed, mean 0, sd = something).  In this exercise, you'll generate fake data that follows this model and do some exploring.

i) Write a function that creates fake data according to a simple regression model.  The inputs to the function should be intercept, slope, standard deviation, and x.   The predictor values, x, should be a vector of values, and the output of the function should be y: a vector of "observations" for the x's.  So you should be able to do something like this:

x <- -10:10
y <- yourfunctionhere(x, 3, 5, 2)
and y will be output from the model y = 3 + 5x + e, where e is N(0, 2).

Show me your function, and generate some data using the definition of x (-10, -9, ....9, 10) shown above.
Plot x vs. y.
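For reference, one possible shape for such a function (the name fake.data is just a placeholder; your version may differ):

```r
# One possible shape for the fake-data function: y = intercept + slope*x + e,
# with e drawn from N(0, sd). The argument order matches the call above.
fake.data <- function(x, intercept, slope, sd) {
  intercept + slope * x + rnorm(length(x), mean = 0, sd = sd)
}

x <- -10:10
y <- fake.data(x, 3, 5, 2)   # y = 3 + 5x + e with e ~ N(0, 2)
plot(x, y)
```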

ii) OPTIONAL: What is the population standard deviation of the y's?  (Note: I'm not asking you to calculate it from the data -- that would be the sample standard deviation. I'm asking you to determine what the model says it should be.)   Compare this to the sample standard deviation of the y's. (This is a little harder than I had intended.  It requires that you know the population standard deviation of the x's. Feel free to think about it.  In case you're wondering, the population sd of the x's is about 6.06.)

iii)  The model says that the RSS should be approximately equal to 4*(n-2) .  Show why.  Then compare this to the estimate from the LS line.

iv)  Examine the residual plots to check for linearity and normality. Of course you know these should be good because you made them that way.  But the idea is to get some experience seeing what "good" plots look like.

v)  Which values of x have the largest Cook's distance for your fitted model?  (If "myoutput" is the output of a call to the lm command, then plot(myoutput) will give you a series of four plots (just hit return to see the next one).  The last plot shows Cook distances.)

vi) Repeat (ii)-(v) using xnew <- rep(x, 10).  Make sure to plot(xnew, ynew).

B. What makes one piece of real estate property more or less valuable than another?  The land.txt datafile (tab-delimited text) records values on several variables that could affect the value of several properties.  This dataset consists of 11 variables. The variable named "LandVal" gives the value of the land alone -- without consideration of the structure on it. "TotalVal" is the value of the property as a whole. "Acreage" is the size of the lot, and the remaining variables describe key features of the buildings on the parcels of land. "Height" is the number of floors, "fl1Area" is the square footage of the first floor, and "Rooms", "Bedrooms", "FullBath", "HalfBath", "Fireplce", and "Garage" are all exactly what they seem to be.

i) Make a histogram of LandVal.  What does this say about the value of these parcels of land?
ii) Perform a regression to use Acreage to predict LandVal.  Your regression should include
    a) an initial scatterplot
    b) a final model
    c) discussion of any needed transformations
    d) discussion of the validity of any assumptions
    e) An interpretation of what the model tells us about this relationship and how well the model fits.
iii)  Choose any other variable (say "height") that you think might be related to the value of land.  Plot the residuals from (ii) against this variable.  Describe what you see.   Do you think that this variable would be useful additional information for predicting the value of land?
iv)   The slope in (ii) relates the mean land value to the size of the lot.  What assumptions are needed in order to make inferences on this slope?  Do you think they are valid here?   Find a 95% confidence interval for the slope and interpret.
v) Estimate the mean value of properties that sit on .80 acres.  Find a 95% confidence interval.  Do you think the assumptions required to make this inference are satisfied? Explain.
vi) Predict the value of a property that sits on 1.7 acres of land.  Find a 95% confidence interval.  Again, discuss whether the data can  support this inference. (That is, are the assumptions valid?)
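Parts (v) and (vi) lean on R's predict() function, which distinguishes a confidence interval for a mean from a (wider) prediction interval for an individual case.  A sketch on simulated data (the real numbers come from land.txt):

```r
# Simulated land data for illustration; the real values come from land.txt.
set.seed(3)
Acreage <- runif(100, 0.1, 3)
LandVal <- 50000 + 80000 * Acreage + rnorm(100, sd = 20000)
fit <- lm(LandVal ~ Acreage)

# (v) confidence interval for the MEAN value of lots on .80 acres:
predict(fit, newdata = data.frame(Acreage = 0.80), interval = "confidence")
# (vi) wider prediction interval for ONE property on 1.7 acres:
predict(fit, newdata = data.frame(Acreage = 1.7), interval = "prediction")
```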

C.  Do opposites attract? A random sample of married couples in Great Britain might help answer this question (at least with respect to a few variables.)
i) Use a regression to  predict husband's height given his wife's height.  Is there a relationship?
ii) Interpret the slope and find a 95% confidence interval.  Interpret the confidence interval, and discuss whether the assumptions that are required to make this interval interpretable are true.   (Heights are measured in millimeters.)
iii) Identify any influential points.  What happens if these are removed?
iv) According to the model, among all wives who are 1680 mm tall,  about 68% of the husbands are between ? and ? in height?
v) Play matchmaker:  predict (with a confidence interval) how tall the husband of a woman who is 1650 mm tall should be.
vi) A psychologist claims that short men feel the need to compensate by marrying taller wives.  As evidence, he looks only at the "shorter" men in the sample: those less than the first quartile in height (which is those less than 1692).   He finds that on average, these men are 1.28 standard deviations shorter than average.  However, their wives are only .47 standard deviations shorter than average.  This, he claims, is evidence that short men compensate by looking for taller wives.  (Not wives who are taller than them, mind you, but wives who are taller with respect to other wives.)   Critique this reasoning.

SAVE FOR HW 5 (DUE FEB 11):

Homework #3, Due Friday, Jan. 28  Solutions (let me know if you find any typos or mistakes)

A. (Review of hypothesis testing).  It is a well-known fact (that has been empirically verified) that a flipped coin lands heads approximately 50% of the time.  Did you know, however, that a spun coin lands on its head less often than 50% of the time?  The reasoning is that if you spin a coin, the extra weight on the head side pulls the coin so that it is slightly more likely to land with the tails-side up.

a)  Suppose you spin a coin 100 times.  How many times would it have to land Heads before you would believe my claim about spinning coins?  Why? 

You might want to review  the binomial distribution.  If you are unfamiliar with it, you can look it up in any introductory statistics book.  It is somewhat difficult to typeset on a web page, but essentially it goes like this:

Suppose X is a random variable that represents the number of heads resulting from flipping (or spinning) a coin.  Suppose p represents the probability the coin will land heads, and n is the number of flips.  Now let x (that's lower-case x) represent the observed number of heads.  (A subtle distinction:  X (upper case) represents the unknown number of heads before we do this experiment; x (lower case) is an old-fashioned algebraic variable that is a placeholder for whatever value we happen to observe.)  Now
P(X = x) =  ( n! / ( x! (n-x)! ) )  * p^x * (1-p)^(n-x)

Three very useful facts:
    a) E(X) = np   that is, we expect to get np heads.
    b) Var(X) = np(1-p)   so the SD is the square-root of this and represents the random variability.  That is, we expect to get np heads, give or take sqrt(np(1-p)) heads.
    c) if n is sufficiently large (a standard rule-of-thumb is both np > 10 and n(1-p) > 10), then the probability distribution of X is closely approximated by a normal distribution with mean np and standard deviation sqrt(np(1-p)).  (This is essentially a particular case of the Central Limit Theorem.)  So suppose n = 100 and p = .5.  If you want to find P(X > 64) (the probability of getting more than 64 heads in 100 flips of a fair coin), you can laboriously calculate P(X = 65) + P(X = 66) + ... + P(X = 100) using the binomial distribution up above.  Or you can get a pretty good approximation using a normal distribution.

Two useful tips:
    a) Actually, with R, it is not that laborious to use the binomial distribution.  Simply typing pbinom(x, n, p) will calculate P(X <= x) using the binomial distribution.  Hence P(X > 64) = 1 - pbinom(64, 100, .5).
    b) You need never use a normal probability table again.  pnorm(x, mu, sd) will give you P(X <= x) for a N(mu, sd) distribution.  Hence 1 - pnorm(64, 100*.5, sqrt(100*.5*.5)) will give approximately the same answer as 1 - pbinom(64, 100, .5)  (with "approximately" being the key word).
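To see just how close the approximation is, you can compute both tail probabilities directly:

```r
# Exact binomial tail probability: P(X > 64) for n = 100 flips, p = 0.5
exact <- 1 - pbinom(64, 100, 0.5)

# Normal approximation with mean np = 50 and SD sqrt(np(1-p)) = 5
approx <- 1 - pnorm(64, 100 * 0.5, sqrt(100 * 0.5 * 0.5))

exact    # about 0.0018
approx   # about 0.0026 -- close, but not identical
```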

Now back to the HW:
    b) Suppose  in (a) you had decided that you would conclude that the probability of getting heads when spinning a coin was NOT .5  if you spun it 100 times and observed x heads or fewer.  (x should be whatever value you used to answer (a)).  Let p = true probability of getting heads when spinning a coin.  Assume p = .5.   Use pbinom to find the probability that you will mistakenly conclude that p is less than .5.   This figure is called the significance level and is famously represented by the lower-case greek letter alpha.  The significance level is the probability of making the mistake of rejecting the null hypothesis (that p = .5) when in fact the null hypothesis is true.

c) Now assume that in truth, p = .45.   What's the probability, using the value x you chose in part (a), that you will (correctly) conclude that p is not .5 if you carry out the experiment?   This value is called the "power":  the probability of correctly rejecting the null hypothesis when, in fact, it is false and should be rejected.  You want the power to be as high as possible.

d)  Choose another value of x so that the power is now higher than before.  (x , remember, is the number which, if you do the experiment and get x heads or fewer, you'll decide that p does not equal .5.) 

e) Using the x you chose in (d), calculate the significance level.

It is a sad fact of life that power and significance level are adversaries.  You want the power to be high, and the significance level to be low.  But if you improve one, you make the other worse.  The only way to improve both is to increase the sample size.
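As a sketch of how parts (b) through (e) fit together, here is the calculation for a hypothetical cutoff of x = 40 (not necessarily the value you chose):

```r
x <- 40   # hypothetical decision rule: conclude p < .5 if we see 40 heads or fewer

# Significance level: P(X <= x) when the null hypothesis (p = .5) is true
alpha <- pbinom(x, 100, 0.5)

# Power: P(X <= x) when in truth p = .45
power <- pbinom(x, 100, 0.45)

alpha   # small: we rarely reject a fair coin by mistake
power   # larger than alpha, but still far from 1
```

Raising x raises the power, but it raises the significance level too; that is the trade-off described above.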

B.  Load the survey data set into R   (use the command read.table("survey.txt", header=T, sep="\t")).  This data consists of responses to a survey by a Stats 11 class.
a)  Plot weight against height (weight should be on the y-axis.)  There are two outliers.  To find out where they are in the data set, do the following:
    i)  type identify(height, weight)
    ii) place your cursor over one of the points and click.  Do the same for the other point.
    iii) Hit the escape key.
    If all went well, two numbers should appear, and these numbers are the locations of these two points in the data set. 
b) Create two new variables, height2 and weight2, that have these two outlier points removed.  (For example, if you wanted to remove the 6th point, type height2 <- height[-6] and weight2 <- weight[-6].)
c) Plot weight2 against height2 and comment on the relationship between height and weight (which should be obvious without the graph).  This plot combines men and women.  Can you see any evidence of two distinct groups?   (The identify command can be fun here.  Type identify(height2, weight2, labels=gender2), where gender2 has the same two points removed, and click on a few points; then hit the escape key.)
d) Find the least-squares regression line and interpret the results. (See the bottom of p. 108 and top of 109).
e)  Plot the residuals against height2.  Do you see any deviations from the model?  (If output.lm is the output of your least-squares fit, then resid <- residuals(output.lm) will give you the residuals.  Then do plot(height2, resid).)
f)  What assumptions must you make if we are to be able to interpret the p-values provided in the output?
g) Algebraically solve the least squares regression line to find the equation using weight to predict height.
h)  Now fit a regression line using weight to predict height.   How does it compare to your algebraic solution in (g)?  Why do you think this is?
i) Now fit a least squares regression line using height to predict weight but using the original variables (the ones that include the outliers).  Do the outliers have a big effect on the fitted model?


C1. In this problem we'll turn things on their side and use weight to predict gender.  This problem uses the same survey data as in B.  But first, we need to create a new variable.  The gender variable has two values, "m" and "f".  We want to recode this as 0 and 1.  Do the following:
gender.code <- numeric(length(gender))  (creates a vector of zeros with the same length as gender)
gender.code[gender=="m"] <- 0
gender.code[gender=="f"] <- 1

Now the men are coded with 0's and the women with 1's.
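Equivalently, the recoding can be done in one step with ifelse.  (The gender vector below is a made-up example; in the homework it comes from the survey data.)

```r
# hypothetical stand-in for the survey's gender variable
gender <- c("m", "f", "f", "m", "f")

gender.code <- ifelse(gender == "f", 1, 0)   # women -> 1, men -> 0
gender.code
```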
a)  Make side-by-side boxplots of weight for men and women.  Compare.
b) Although there is considerable overlap in weight, still the means are different. Suppose I randomly select a person from this class and tell you their weight.  Can you predict which gender they are most likely to be?  This is a bit unusual compared to other problems we've seen:  the response variable is binary and has values 0/1 while the predictor is continuous.  One way to think about this is that you want to know the probability that a person is female given that you know their weight.  So for each weight, we can estimate it by finding what percentage of people at that weight are women.   Plot gender.code against weight (weight on x axis) and fit a lowess line.  What does the line tell you about the probability that a randomly selected person is a woman if you know their weight?
c) Find the mean weight of men and of women.  
d)  Fit a regression line using weight to predict gender.  Interpret the coefficients and compare to your answer in (c).
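Here is a sketch of the plot-plus-lowess idea in (b), using simulated weights in place of the survey data (the group means and SDs are invented for illustration):

```r
set.seed(2)
# simulated weights: men (coded 0) heavier on average than women (coded 1)
men.wt   <- rnorm(60, mean = 170, sd = 20)
women.wt <- rnorm(60, mean = 135, sd = 18)
weight      <- c(men.wt, women.wt)
gender.code <- c(rep(0, 60), rep(1, 60))

plot(weight, gender.code)           # binary response against the predictor
smooth <- lowess(weight, gender.code)
lines(smooth)                       # estimated P(woman) at each weight

# lowess sorts by weight, so the curve falls from light to heavy:
smooth$y[1]                  # high: the lightest people are mostly women
smooth$y[length(smooth$y)]   # low: the heaviest people are mostly men
```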

C2.  Now we reverse things.  
a)  Fit a regression using gender to predict weight.  Interpret the coefficients. (It might help to look at the plot of weight (y axis) vs. gender.)
b)  Do a t-test to test whether the mean weight of men and the mean weight of women are equal.  Assume that the standard deviations ARE EQUAL in these two populations.  What assumptions must you make to carry out this test?
c) Now compare the value of the t-statistic in part (b) with the t-statistic that corresponds to the slope you got in part (a).  What does this tell you about the meaning of the t-test associated with this slope?  What assumptions must you make in order for this t-test to be valid?
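The point of (c) can be checked directly with simulated data (hypothetical numbers; the survey data would behave the same way): the t-statistic for the slope equals, up to sign, the pooled two-sample t-statistic, and the coefficients recover the group means.

```r
set.seed(3)
# simulated weights for two groups coded 0 (men) and 1 (women), as in C2
weight <- c(rnorm(60, mean = 170, sd = 20), rnorm(60, mean = 135, sd = 20))
gender.code <- c(rep(0, 60), rep(1, 60))

# (a) regression of weight on the 0/1 gender code
fit <- lm(weight ~ gender.code)
slope.t <- summary(fit)$coefficients["gender.code", "t value"]

# (b) two-sample t-test assuming equal standard deviations
tt <- t.test(weight ~ gender.code, var.equal = TRUE)

abs(slope.t)              # the two t-statistics agree (up to sign)
abs(unname(tt$statistic))
coef(fit)                 # intercept = mean of group 0; slope = difference in means
```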
D.
a) Fit a regression line to Galton's data of father/son heights.  (Once you've downloaded the file to the right directory, simply type source("filename").)
b) Use the regression line to find out how many standard deviations above average the typical son is whose father is 1 standard deviation above average.
c) How many standard deviations below average are typical sons whose fathers are 1 standard deviation below average?
Galton called this phenomenon (that taller than average and shorter than average fathers produce more "mediocre" sons) "regression to the mean."
d) Suppose a certain large class takes two midterms.  What does regression to the mean say about the people who get A's on the first midterm?  (For the sake of conversation, suppose that to get an A you need to be 2 standard deviations above average).  Assume that r = .4 and the standard deviations of both exams are the same.
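The key fact behind (b) through (d): when both variables are converted to standard units, the slope of the least-squares line equals the correlation r, so a father z SDs above average has a predicted son only r*z SDs above average.  A quick check with simulated heights (the numbers are invented, not Galton's):

```r
set.seed(4)
father <- rnorm(200, mean = 68, sd = 2.7)
son    <- 0.5 * father + rnorm(200, mean = 34, sd = 2.3)  # correlated heights

# regress son on father after converting both to standard units
fit   <- lm(scale(son) ~ scale(father))
slope <- unname(coef(fit)[2])

slope             # equals the correlation exactly...
cor(father, son)  # ...so the prediction "regresses" toward the mean
```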

E.  Read #3 on p. 132 . You can get the data very simply; just type data(cars).  Type names(cars) to see the variables.  Type attach(cars) to use the variables by their first name.  The data consist of stopping distances for a car at various speeds.
a) Make a plot to help us determine how much distance is needed to stop at a given speed.
b) Plot the least-squares regression line and comment on whether the model fits.  Superimpose the lowess line and compare.
c) Make a plot of the residuals to confirm your answer to (b).
d)  Now we'll try transforming the y-variable.  Fit a line using distance^2 as the response.  Repeat for sqrt(distance) and log(distance).  Which fits best and why?
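Since cars is built into R, the transformations in (d) can be compared side by side.  (Comparing r-squared across different response scales is only a rough guide; the residual plots from (c) matter too.)

```r
data(cars)   # built-in: speed (mph) and dist (stopping distance, ft)

fits <- list(
  raw  = lm(dist ~ speed, data = cars),
  sq   = lm(dist^2 ~ speed, data = cars),
  sqrt = lm(sqrt(dist) ~ speed, data = cars),
  log  = lm(log(dist) ~ speed, data = cars)
)

# r-squared of each fit, on its own (transformed) scale
rs <- sapply(fits, function(f) summary(f)$r.squared)
rs
```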

F. Download the oldfaithful data into R.  This consists of information about eruptions of the Old Faithful Geyser in October 1980.  The length variable is the number of minutes that an eruption lasts and duration is the time in minutes to the next eruption.  Old Faithful has many tourists and this data (well, data like these) are used to derive an equation that helps park rangers tell tourists when to expect the next eruption.

a)  Make a histogram of duration.  Describe the distribution. What would you say is the typical waiting time for the next eruption?
b)  Make a graph to show the relation between the length of the eruption and the time  until the next eruption. Describe this relationship in words.  Does it make sense?
c) Find an equation that predicts the time to the next eruption based on the length of the current eruption.

G. (optional).  The EPA publishes fuel-efficiency data for a variety of types of cars.  http://www.epa.gov/fueleconomy/data.htm.   Download the 2005 data here.  It is a text file in which values are separated by commas, and so you load it into R by typing read.table("fuel.csv", header=T, sep=",").
You can view the variable names with the names() function.  You can see what the names mean by reading this file, although you'll have to scroll down a bit.

In general, cars with bigger engines are less efficient.  One measure of the size of an engine is the displacement which measures, in liters, the amount of fuel moved through the engine.  Come up with the best equation you can to predict the highway mileage (hwy)  based on displacement (displ).  Evaluate your model.   Use the identify function to identify any unusual cars.  (For example, which car has the worst fuel efficiency?  the best?)


Homework #2, Due Friday, January 21    Solution (pdf)
p. 103: 2
#A:  (a) Using the data from my songlists: http://www.stat.ucla.edu/~rgould/120w05/datasets/songlist.txt,  do a statistical test to determine whether the mean song length is different for rock than for classical.  State all assumptions, and in particular be clear about what population this hypothesis applies to.  Talk about how you decided to handle any outliers.  Does it matter how the outliers are handled here?  Check any assumptions you can.  For example, if you assumed the times came from a normal distribution, can you check whether that seems reasonable?  (There's no need to get fancy with this problem and do a non-parametric test or a bootstrap -- but you can if you want.  The point is just to explore and practice with R a bit.)
(b) Compute a 95% confidence interval for the mean difference in lengths of songs between these two groups.  Again, state any assumptions you make, but no need to check them.

#B: Upload this dataset into a data frame: http://www.stat.ucla.edu/~rgould/datasets/bloodpressure.dat.  (It's a tab-delimited file -- you'll need to save it to your hard drive.)  It consists of the systolic and diastolic blood pressure of 15 patients with moderate hypertension.  Measurements were taken immediately before and two hours after taking 25 mg of the drug captopril.  The drug is expected to lower blood pressure.  Does it work?  Discuss.  Explain.

#C:  Upload  http://www.stat.ucla.edu/~rgould/120w05/datasets/anscombe.txt into a data frame.   It's artificial data, so the context isn't important. (a)  Without making any graphs, find the correlations between the following pairs of variables:  X and Y1, X and Y2, X and Y3, X4 and Y4.  (For example, cor(X, Y1) will do it.)  Comment.  (b) Now make scatterplots of these pairs.  (One quick way is to do plot(dataframe), where dataframe is whatever you named it.  Note that this gives you too many plots: for one thing, it gives not only Y1 vs. X but also X vs. Y1, and it also gives X vs. X4, for example.)  Comment.  The moral of this activity is that you need to look at plots before computing correlations.
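Anscombe's quartet also ships with R as the anscombe data frame (variables x1-x4 and y1-y4, which presumably match the course file's X, X4, and Y1-Y4).  The punchline is that all four correlations are essentially identical even though the scatterplots look nothing alike:

```r
data(anscombe)   # built-in version of Anscombe's quartet

# correlation of each x-y pair; all four come out to about 0.816
rs <- sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                  anscombe[[paste0("y", i)]]))
rs

plot(anscombe$x1, anscombe$y1)   # now look at the plots to see the differences
```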

#D.  Upload http://www.stat.ucla.edu/~rgould/120w05/datasets/wine.txt  into a data frame.  (Again, a tab-delimited text file, so do
wineframe <- read.table("wine.txt", header=T, sep="\t") to read it in.)   The data contain heart disease mortality (per 1,000 people) and wine consumption (liters per person per year) for a number of different countries.  Make a scatterplot of mortality against wine consumption and comment on the shape of the relationship and what it says about wine consumption and health.   Do you think the relationship is causal?  Why or why not?

Homework #1, Due Friday, January 14  Solutions
p. 26: 1,2,3 (state your observations about the relation between temperature and failures), 4-10.
p. 50: 1,2,5
p. 69: 2-5
#A:  Here is a link to information collected from a Stats 11 class.  Comment on the distribution of men's heights.  Do they seem normally distributed? (Note: you'll have to react to an outlier before answering the question.)

HINT: to download the data:  save the file on your hard-drive in the directory that you have identified as your "working directory."  Then
whatevernameyoulike <- read.table("survey.txt", header=T, sep="\t")
will read it into R.  This is the standard format for tab-delimited text files in which a header supplies variable names.  The sep="\t" option tells R that fields are separated by tabs.  The header=T  option tells R that the first line contains variable names.

#B: This text file contains Michelson's measurements of the velocity of light in air.  In 1879, A.A. Michelson made 100 determinations of the velocity of light in air using a modification of a method proposed by the French physicist Foucault.  The data are given here as reported by Stigler.  The numbers are in km/sec and have had 299,000 subtracted from them.  The currently accepted "true" velocity of light in vacuum is 299,792.5 km/sec.  Stigler has applied corrections used by Michelson and reports that the "true" value appropriate for comparison to these measurements is 734.5.  Each trial may be a summary of several experimental observations.
a) Perform a t-test to check whether the data support the "true" value (734.5).  State any assumptions.
b) Check assumptions.
c) Construct side-by-side boxplots of the velocities based on the trial number.  Describe.  How might this affect your conclusion?
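Michelson's 1879 measurements also ship with R as the morley data frame (Speed is km/sec minus 299,000, with Expt and Run identifying the 20 runs within each of 5 experiments), presumably the same 100 determinations as the course file.  A sketch of (a) and (c):

```r
data(morley)   # Michelson's 1879 speed-of-light measurements

# (a) one-sample t-test against Stigler's "true" value of 734.5
tt <- t.test(morley$Speed, mu = 734.5)
tt$p.value     # very small: the measurements do not support 734.5

# (c) side-by-side boxplots of the velocities by experiment number
boxplot(Speed ~ Expt, data = morley)
```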