Homework, Stats 11  Fall 2004


All HW is due at 11:30am on Fridays.  All questions are in Wild and Seber unless stated otherwise.  The *'d problems are graded 0-5.  You will receive one additional point if all other problems show serious attempts.

Resources

NOTE 1: More often than not, the data referred to in the exercises is available in digital form on-line.  Check here before doing lots of typing.

NOTE 2: Looking for a nice, free, stats package?  Try Statcrunch.  You need to have a live internet connection, and will be asked to provide your email address, and you'll also need to take a little (very little) bit of time figuring things out. But you can do all homework with this.

Solutions to "Extra" problems. Some of you have asked for solutions to the "extra" problems that have been assigned. Here are solutions to those since HW 4 that were not already included in the on-line solutions.

Assignments

HW 9 due Friday, December 10
Solutions to Extra Problems
p. 452: 1,2, 7ab,8
Extra #1* : In the mid-1980s, a team of epidemiologists did a study of children whose parents worked at a battery factory.  The purpose of the study was to determine if the parents -- whose daily work exposed them to lead dust -- were contaminating their children with lead.  The theory was that lead dust gets onto the parents' skin and clothes and then is brought home, where it gets into the air and is inhaled by the children. Lead poisoning in children can affect growth and also intelligence.  It has less effect on adults.  Once lead gets into the bloodstream, the body cannot clean it out, and so it gradually accumulates over time.  For all of these reasons, health officials are very concerned about childhood exposure to lead.  (A common source used to be household paint. Although laws now prevent the use of lead-based paint, some older structures with older paint can still contain lead.  It can be dangerous for children to eat the chips from this paint.  Recently, in Los Angeles, there has been an increase in lead poisoning due to certain candies imported from Mexico that have a high lead content.)

The epidemiologists measured the level of lead in the blood of 32 children whose parents worked at a battery factory in Oklahoma. These children will be called the "Exposed" group.  They also took lead levels from another 32 children whose parents' work did NOT expose them to lead.  These children are the "Control" group.   Lead is measured in units of micrograms of lead per milliliter of blood (µg/ml).

Two important facts about lead poisoning: most experts believe that lead levels higher than 40 are dangerous and levels over 60 require immediate hospitalization.

You can download and view the data at http://www.stat.ucla.edu/~rgould/11f04/lead.txt

a) Describe and compare the distributions of the lead levels of the exposed and control groups, using appropriate graphs.  Based on these graphs, what do you think about the theory that parents' workplace exposure to lead can result in high lead levels in their children?
b) Find the means and standard deviations of the lead levels of the exposed and control groups.
c) The researchers chose the control group in a very particular way: for each child in the exposed group, they found a child whose parents did not work around lead and who lived in the same neighborhood and who was the same age as the exposed child.  Explain why age of child and neighborhood are potential confounding variables if we're trying to determine whether the parents' workplace leads to lead contamination.
d) Suppose we wished to do a hypothesis test to test the researchers' theory.  Which would be the better test, and why:  a two-sample t-test or a paired t-test?

Extra #2*:  These questions refer to the same study of child lead poisoning.
a) Compute the "difference scores":  the difference between each exposed child and his or her "match".  (The data in the dataset are organized by pairs. Each Exposed child's score is reported beside his or her Control match. ) Describe the distribution of these scores.
b) Perform a hypothesis test to test whether parents' exposure to lead results in higher lead levels in children.  Be sure to give the significance level you're using, state the hypotheses, state your test statistic and give the observed value, compute the p-value, and give your decision.  
c) Find the 95% CI for the mean difference in lead levels.  Interpret what this confidence interval tells us about the effect of parents' workplace exposure to lead on their children.
d) An assumption of the t-test is that the exposed children were a random sample from the population of all similarly exposed children.  Probably this assumption is not true here.  We don't know for sure (and the original research article doesn't specify), but it is unlikely the children came from a random sample of the children at this one factory, much less a random sample of all children throughout the country who are similarly exposed.  If a t-test is inappropriate, the sign-test might be.  The reasoning is as follows:  According to the null hypothesis, if we select an exposed child and then find a control match, the control should be just as likely to have a lead level higher than the exposed child as he or she is to have a lower level.  So there's a 50% chance each control child should have a higher lead level.  Perform a sign test using a 5% significance level.
e) Is this an observational study or a controlled experiment? Why?  Does this fact affect your ability to state conclusively whether the parents' exposure caused the children's elevated lead levels?
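
The difference scores in (a), the paired t statistic in (b), and the sign test in (d) can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the six pairs are made-up illustrative values, NOT the data in lead.txt, and the function names are hypothetical. It uses only the standard library, so it computes the t statistic but leaves the t-based p-value to a table or a stats package.

```python
from math import comb, sqrt
from statistics import mean, stdev

# Hypothetical (exposed, control) pairs for illustration only; NOT the lead.txt data.
pairs = [(38, 16), (23, 18), (41, 18), (18, 24), (37, 19), (36, 11)]

# Difference scores: exposed child minus his or her matched control.
diffs = [exposed - control for exposed, control in pairs]

def paired_t_statistic(d):
    """t = mean(d) / (sd(d) / sqrt(n)), to be compared with a t table on n - 1 df."""
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n))

def sign_test_p_value(d):
    """One-sided sign test: the chance of seeing at least this many positive
    differences if each nonzero difference is positive with probability 1/2."""
    nonzero = [x for x in d if x != 0]  # ties carry no sign information
    n = len(nonzero)
    k = sum(1 for x in nonzero if x > 0)
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

t_stat = paired_t_statistic(diffs)
p_sign = sign_test_p_value(diffs)
print(f"differences: {diffs}")
print(f"t = {t_stat:.3f} on {len(diffs) - 1} df")
print(f"sign test p-value = {p_sign:.4f}")
```

The sign test uses only the direction of each difference, which is why it needs no assumption about the shape of the distribution of lead levels.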

Extra #3. A vending machine is meant to dispense precisely 12 ounces of (very bad) coffee into a styrofoam cup.  The cup holds no more than 12.2 ounces.  If it puts in too much, the owner of the vending machine loses money.  If it puts in too little, the customer complains.  To test whether or not it is dispensing 12 ounces of coffee, the owner tests it 10 times and records the number of ounces dispensed each time.
a) State the null and alternative hypotheses for a hypothesis test to test the claim that the machine is not dispensing the right amount of coffee.
b) It turns out that a 95% confidence interval for the mean amount of coffee dispensed is (11.9 oz, 12.1 oz.).  Suppose the owner now does a t-test.  True or false and explain: the p-value for the t-test will be less than .05.


HW 8 due Friday, December 3 (note that this is a two-week assignment)
Solutions


Note: all "extra" problems are required.

• Extra #1:   It is fairly well established that heights of adult women in the U.S. are normally distributed with a standard deviation of 3 inches.  A random sample of 10 women is selected and their heights are recorded as follows:

    63.6, 65.2, 62.2, 71.1, 65.8, 65.1, 71.0, 64.8, 63.3, 64.9

The five number summary is
(62.20, 63.90, 65.00, 65.65, 71.10)  and the average of the sample is 65.70.

a) Is the standard deviation of the population known or unknown in this problem?
b) Calculate the standard deviation of the sample.  How does this compare to the population standard deviation?
c) Is the mean of the population known or unknown?
d) Find a 95% confidence interval using a z-statistic as a multiplier.  (Why use the z-statistic?)
e) Find a 99% confidence interval.
f) Pretend that you were not told that the population standard deviation is 3".  Use your estimate in (b) to re-calculate the 95% confidence interval.  This time, rather than a z-statistic, use the appropriate t-statistic as a multiplier.
Click here for answers to the first problem.
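
For readers who want to check the arithmetic in (d) and (e) by computer, here is a minimal Python sketch of a z-interval with known population SD. The function name z_interval is hypothetical; the data and the SD of 3 inches come from the problem statement, and the multiplier is taken from the standard normal distribution rather than a printed table.

```python
from math import sqrt
from statistics import NormalDist, mean

heights = [63.6, 65.2, 62.2, 71.1, 65.8, 65.1, 71.0, 64.8, 63.3, 64.9]
SIGMA = 3.0  # population standard deviation in inches, given in the problem

def z_interval(data, sigma, confidence):
    """Return (low, high) for mean(data) +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # e.g. about 1.96 for 95%
    xbar = mean(data)
    margin = z * sigma / sqrt(len(data))
    return xbar - margin, xbar + margin

low95, high95 = z_interval(heights, SIGMA, 0.95)
low99, high99 = z_interval(heights, SIGMA, 0.99)
print(f"95% CI: ({low95:.2f}, {high95:.2f})")
print(f"99% CI: ({low99:.2f}, {high99:.2f})")
```

Part (f) would replace the z multiplier with a t multiplier on 9 degrees of freedom, which the standard library does not provide; a t table works for that step.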

• W&S, p. 355:  1, 3,4a

• Extra#2*:  A random sample of 9 overweight men tested an experimental diet. (The actual test also included a control group, but in this problem we consider only the treatment group.)  Their change in weight over a 2 week period is given below.  A negative value means they lost weight.   Would you conclude the diet is successful?  Use a 95% confidence interval to answer.
-2.9, -5.7, 1.3, 2.0, 0.0, 1.6, -9.1, 2.1, -4.2

• Extra #3:  A university is interested in knowing more about the demographics of its student body.  In particular, it wants to know the mean income of the parents of its students.  To determine this, it sends out a survey to a random sample of 500 students. (The university has about 25000 students.)  The average income reported by the sample is $55000 and the standard deviation of the sample is $25000.
a) The distribution of incomes in the population is almost certainly non-normal (and the mean and standard deviation of the sample support this notion).  Why, then, is it appropriate to use a z-statistic as a multiplier to compute an approximate 95% confidence interval?
b) Compute an approximate 95% confidence interval.
c) Suppose that the university knew as a fact that the standard deviation of incomes in the population was $27,500.  The university's administration wanted the survey to produce a 95% confidence interval that was just $2000 "wide".  (So the right end-point of the interval minus the left end-point is 2000.)   How many students would they have to sample to achieve this?
d) Suppose the university did the study again and got this interval:  ($53000, $55000).  For each of the following interpretations below, state whether the interpretation is "right" or "wrong" and explain your answer:
    i) 95% of the students' parents have incomes between $53000 and $55000.
    ii) There is a 95% chance that the true mean is between $53000 and $55000.
    iii) If we were to repeat this study infinitely many times, 95% of our means would fall between $53000 and $55000.
    iv) If we were to repeat this study infinitely many times, 95% of the time we would get an interval that contained the true mean.
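
The sample-size reasoning in part (c) can be sketched in Python. An interval of total width w has half-width w/2 = z * sigma / sqrt(n), so n = (z * sigma / (w/2))^2, rounded up. The function name below is hypothetical, and the sketch ignores any finite-population correction for the roughly 25000 students, which is an assumption on my part.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_width(sigma, width, confidence=0.95):
    """Smallest n for which the z-interval's total width is at most `width`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = width / 2  # the interval is xbar +/- half_width
    return ceil((z * sigma / half_width) ** 2)

n_needed = sample_size_for_width(sigma=27_500, width=2_000)
print(f"students to sample: {n_needed}")
```

Rounding up rather than to the nearest integer guarantees the interval is no wider than requested.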

• Extra #4*: If the mean speed on a road is faster than the posted speed limit, people can contest speeding tickets (at least in some cities).  Below are recorded speeds on a street in which the speed limit is 30 mph. We will examine whether there is evidence that the mean speed of all cars is faster than the posted speed limit.  The observed speeds were 29, 34, 34, 28, 30, 29, 38, 31, 29, 34, 32, 31, 27, 37, 29, 26, 24, 34, 36, 31, 34, 36, 21

The five number summary is (21.00, 29.00, 31.00, 34.00, 38.00). The average of the observed speeds was 31.04 and the standard deviation was 4.24.

a) Make an appropriate plot of the data and describe the distribution of speeds.
b) State the null and alternative hypotheses to test.
c) Compute the appropriate test statistic.
d) Find the p-value for the observed test statistic.
e) Would you conclude that the true mean speed is faster than 30 miles per hour?
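
Parts (c) and (d) can be checked with a short Python sketch. The function names are hypothetical, and because the standard library has no t distribution, the upper-tail p-value is obtained here by numerically integrating the t density with Simpson's rule; a t table or a stats package would do the same job.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

speeds = [29, 34, 34, 28, 30, 29, 38, 31, 29, 34, 32, 31,
          27, 37, 29, 26, 24, 34, 36, 31, 34, 36, 21]
MU_0 = 30  # posted speed limit in mph, used as the null-hypothesis mean

def t_statistic(data, mu0):
    """One-sample t statistic: (xbar - mu0) / (s / sqrt(n))."""
    return (mean(data) - mu0) / (stdev(data) / sqrt(len(data)))

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail_p(t, df, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area between 0 and |t|
    return 0.5 - area if t >= 0 else 0.5 + area

t_stat = t_statistic(speeds, MU_0)
df = len(speeds) - 1
p_value = upper_tail_p(t_stat, df)
print(f"t = {t_stat:.3f} on {df} df, one-sided p-value = {p_value:.3f}")
```

The one-sided tail is used because the alternative of interest is that the mean speed is faster than the limit.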

• Extra #5: In 1998, as an advertising campaign, the Nabisco Company announced a "1000 Chips Challenge", claiming that every 18-ounce bag of their Chips Ahoy cookies contained at least 1000 chocolate chips.  Dedicated Statistics students at the Air Force Academy purchased some randomly selected bags of cookies, and counted the chocolate chips.  Some of their data are given below:
1219, 1214, 1087, 1200, 1419, 1121, 1325, 1345, 1244, 1258, 1356, 1121, 1191, 1270, 1295, 1135.
a) Find a 95% confidence interval for the mean number of chocolate chips in all Chips Ahoy bags. What does this say about Nabisco's claim?
b) Perform a hypothesis test and state your conclusion about Nabisco's claim.
c) Hopefully you reached the same conclusion for parts (a) and (b).  Was this coincidence, or will it always happen?

• Extra #6: Consumer Reports tested 14 brands of vanilla yogurt and found the following numbers of calories per serving:
160, 200, 220, 230, 120, 180, 140, 130, 170, 190, 80, 120, 100, 170.

A diet guide claims that you will get 120 calories from a serving of vanilla yogurt.  What does this evidence indicate?  Do a hypothesis test and also compute a confidence interval.  Choose the significance level you think is appropriate.

HW 7 due Friday, Nov 19 (same day as midterm II)
W&S p.315:  1,12*,15
Extra* (but required):  In class we took a random sample of 7 serial numbers from the population {1, 2, ..., N}, where N was an unknown number.  Each serial number came from a "captured tank".
a) Make a sketch of the pdf of the population (obviously in terms of N).
b) Let Xi represent the serial number on the ith tank we capture.  Find an expression for the expected value of Xi. Find an expression for the standard deviation of Xi.
c) Suppose we calculate Y = (X1 + ... + X7)/7.  Find the mean of Y in terms of the mean of X.  Find the SD of Y in terms of the SD of X.
d) A popular choice for  an estimator for N was Xbar + 3* SD(X).  What's the bias of this estimator?  How does the bias change if we take a larger sample size?
e)  Another choice was Xbar + 3*SD(X)/sqrt(n) .  What's the bias of this estimator?  How does the bias change if we take a larger sample size?
Extra (and required) for review:  According to an urban legend, the Washington Redskins, a DC-based football team, are able to predict presidential elections.  The theory goes like this:  if the Redskins lose their last home game before the election, the incumbent party loses.  If they win the last home game, the incumbent party wins.  This method has been right 15 out of the last 16 times.   The only time it failed was this election.  On October 31, the Redskins lost to the Green Bay Packers and therefore the incumbent should have lost, according to this predictor.  However, he won.  Assume that the Redskins have no predictive ability whatsoever.  (In which case the string of successes is a coincidence.)  What's the probability of getting 15 or more correct predictions in 16 elections?
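
A binomial tail probability of this kind can be computed directly. The sketch below assumes, as one reading of "no predictive ability whatsoever", that each prediction is independently correct with probability 1/2; the function name is hypothetical.

```python
from math import comb

def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# 15 or more correct predictions in 16 elections, each a coin flip.
prob = binomial_upper_tail(n=16, k=15, p=0.5)
print(f"P(X >= 15) = {prob:.6f}")
```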

Solutions to HW 7

HW 6 Due Friday, Nov 12  Solutions
H&G  2.12, 2.14, 2.16 (see note at end).
W&S p. 270: 6e, 7,11,12,13,14*,16
W&S p. 315: 2,3,7*
Note: for #16, the problem is asking you to find the mean and SD of these "new" random variables (that are, in fact, transformations -- simply changes in units -- of the previous RVs.)  You do *not* need to show why the distribution is Normal.  However, it is not necessarily clear why the probability distribution of a random variable stays in the same family (here, normal) when you add and multiply by constants.  If you take a mathematical statistics course, this is the sort of thing you will worry about.  For now, just assume that the new distribution is still normal.

HW 5 Due Friday, Nov 5
  Solution NOTE: Some of you asked about an example of finding the covariance.  Click on the link to 2.11.
H&G, 2.10 , 2.11,2.20
W&S p. 226: 17a-e, 20,22
W&S: p. 268: 1,2 (note, for part a, don't use the data -- assume the data follow the normal distribution and use the normal curve), 4,5,6,8
Extra #1:  Suppose we toss a fair coin 100 times.  Let X represent the number of heads.
    a) Use the normal approximation to find the probability that you will get more than 60 heads.
    b) Use the normal approximation to find the probability you'll get between 45 and 55 heads.  Between 40 and 60. Between 35 and 65.
    c) Compare your answers in (b) with the exact probabilities obtained from the binomial distribution.
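
The comparison asked for in (c) can be sketched in Python. Two choices here are my assumptions rather than the problem's: "between 45 and 55" is read as inclusive of both endpoints, and the normal approximation uses a continuity correction of 0.5. The function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

N_TOSSES, P_HEADS = 100, 0.5

# Normal curve with the same mean and SD as the binomial count of heads.
approx = NormalDist(mu=N_TOSSES * P_HEADS,
                    sigma=sqrt(N_TOSSES * P_HEADS * (1 - P_HEADS)))

def exact_between(lo, hi, n=N_TOSSES, p=P_HEADS):
    """Exact binomial P(lo <= X <= hi)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))

def normal_between(lo, hi):
    """Normal approximation to P(lo <= X <= hi) with continuity correction."""
    return approx.cdf(hi + 0.5) - approx.cdf(lo - 0.5)

# "More than 60 heads" means 61 through 100.
for lo, hi in [(61, 100), (45, 55), (40, 60), (35, 65)]:
    print(f"{lo:>3} to {hi:>3}: exact {exact_between(lo, hi):.4f}, "
          f"normal {normal_between(lo, hi):.4f}")
```

Dropping the continuity correction (using lo and hi themselves) shows how much the correction improves the match for a discrete count.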
*Extra #2 Suppose that 48% of the population will vote for Bush.  (By the time you do this, you'll know what the truth is. Maybe.) A random sample of 1000 people is taken with replacement.  Let X be the number of people in the sample who will vote for Bush. (Note: "with replacement" means that when a person is selected, they are put back in the pool and can, theoretically, be selected again.)
    a) What distribution does X have?  Why?  Why did I specify that the sample of people is to be taken with replacement?
    b) Will the normal approximation provide a suitable approximation of the exact probabilities here?  Why?
    c) Find the probability that more than 50% of the sample will vote for Bush.
    d) Find the probability that less than 45% of the sample will vote for Bush.

*Extra #3
  In our in-class census, the results of a question asked about whether people support Bush came out as follows:

Bush?     No    Undecided    Yes
Female    25    17           10
Male      28    16           14

a) Suppose we select a student at random from this group.  Let X = -1 if the student does not support Bush, 0 if undecided, and 1 if they do support.  Let Y = 1 if the student is female, 0 if male.   Write a table of f(x,y), the joint distribution of X and Y.
b) Write  tables for f(x) and f(y),  the marginal distributions of X and Y.  (That's two tables; one for X, one for Y.)
c) Find P(X>0, Y=1)
d) Find E(X).
e) Are X and Y independent?
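
The bookkeeping for a joint distribution, its marginals, an expected value, and an independence check can be sketched in Python with exact fractions. The counts are those of the in-class census table, coded with X = -1, 0, 1 for No, Undecided, Yes and Y = 1 for female, 0 for male; the variable names are hypothetical.

```python
from fractions import Fraction

# Counts keyed by (x, y): x = -1 (No), 0 (Undecided), 1 (Yes); y = 1 female, 0 male.
counts = {
    (-1, 1): 25, (0, 1): 17, (1, 1): 10,  # female row
    (-1, 0): 28, (0, 0): 16, (1, 0): 14,  # male row
}
total = sum(counts.values())

# Joint distribution f(x, y): each cell count divided by the grand total.
joint = {xy: Fraction(c, total) for xy, c in counts.items()}

# Marginals f(x) and f(y): sum the joint over the other variable.
fx, fy = {}, {}
for (x, y), p in joint.items():
    fx[x] = fx.get(x, 0) + p
    fy[y] = fy.get(y, 0) + p

expected_x = sum(x * p for x, p in fx.items())

# X and Y are independent only if f(x, y) = f(x) * f(y) in every cell.
independent = all(joint[(x, y)] == fx[x] * fy[y] for (x, y) in joint)

print("f(x):", {x: str(p) for x, p in sorted(fx.items())})
print("f(y):", {y: str(p) for y, p in sorted(fy.items())})
print("E(X) =", expected_x)
print("independent:", independent)
```

Exact fractions avoid rounding issues in the cell-by-cell independence check.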

HW4 Due Friday, Oct 29  Solutions
Hill & Griffiths (H&G), Exercises 2.2 (p. 37), 2.3, 2.5*, 2.9a
W&S: p. 212:  2 (choose any four -- note that the answers are in the back)
W&S: p.226 13,14*, 16
Extra #1:  Suppose I want to go into the casino business with a simple gambling game.  In this game, you roll a fair, six-sided die.   You will win 1 dollar for each pip on the die if there are an odd number of pips.  You will lose one dollar for each pip if there is an even number of pips.   I'm going to charge some amount of money to play the game.  How much would I need to charge in order to make the game fair?  (A fair game is one in which neither side has an advantage -- so the expected value should be 0.)
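
The expected-value calculation behind a "fair game" can be sketched in Python. The helper name payoff is hypothetical; the payoffs follow the rules stated above, and the charge that makes the game fair is whatever amount brings the player's expected net winnings to 0.

```python
from fractions import Fraction

def payoff(pips):
    """Player's winnings in dollars: +pips if odd, -pips if even."""
    return pips if pips % 2 == 1 else -pips

# Each face of a fair die has probability 1/6.
expected_winnings = sum(Fraction(1, 6) * payoff(face) for face in range(1, 7))

# Charging c per play makes the player's net expectation expected_winnings - c,
# so the game is fair when c equals expected_winnings.
fair_charge = expected_winnings
print("expected winnings before any charge:", expected_winnings)
print("charge that makes the game fair:", fair_charge)
```

The sign of the result indicates which side the uncharged game favors, which is worth interpreting before settling on an answer.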


HW3 Due Friday, Oct 22 (same day as Midterm)  Solutions
p. 151 1,2,3
p. 157, 2,3
p. 166: 2,3*
p. 174, 2,3
p. 188, 8,9,10*,12
p. 192: 18,(20 is optional), 27

HW 2 Due Friday Oct. 15  Solutions
Click here for the pdf file (needs Adobe Acrobat Reader) that contains the HW.  If you have trouble viewing this file, save it to your hard drive (on a PC, right-click and choose "Save File to Hard Drive" or some such phrase) and open it with Acrobat.  Sorry for the pdf file, but it was the easiest way to include some of the graphs I wanted to include.
 
The census data used in the exercises is available here.  It's a fairly long file with 5000 entries and consists of a random sample from the 1980, 1990 and 2000 censuses of residents of Los Angeles-Long Beach and Anaheim.  It includes information on family income, race, sex, age, and commuting time. It is a tab-delimited text file.  From within Stata, you can download the data (assuming, of course, that you are on-line) by typing insheet using "http://web.stat.ucla.edu/~rgould/11f04/census.txt"
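
If you are not using Stata, a tab-delimited file like this one can also be read with a few lines of Python. This is a rough sketch under assumptions: it supposes the first row of the file holds column names, and the column names and rows in the offline demonstration are made up, not taken from census.txt. The download function needs a live connection and is defined but not called here.

```python
import csv
import io
from urllib.request import urlopen

CENSUS_URL = "http://web.stat.ucla.edu/~rgould/11f04/census.txt"

def parse_tab_delimited(text):
    """Parse tab-delimited text whose first row holds the column names."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def fetch_census(url=CENSUS_URL):
    """Download and parse the file (requires being on-line)."""
    with urlopen(url) as response:
        return parse_tab_delimited(response.read().decode("utf-8", errors="replace"))

# Offline demonstration with made-up rows and hypothetical column names.
sample = "age\tsex\tincome\n34\tF\t52000\n51\tM\t38000\n"
rows = parse_tab_delimited(sample)
print(len(rows), "rows; first row:", dict(rows[0]))
```

Every value comes back as a string, so numeric columns need converting (for example with int or float) before computing summaries.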

HW 1 Due Friday Oct. 8  For Solutions click here.

* #1 A dermatologist wants to test which of two products is better for removing a type of blemish that appears on the face. He instructs the receptionist that she is to assign each patient diagnosed with this condition to either Treatment A or Treatment B, and there should be roughly the same number of people in each group.  The dermatologist assigns each patient a severity score, and each patient is asked to return 4 weeks later.  At the followup visit, the dermatologist (who does not know which treatment the patient has received) assigns a new severity score.  There were 43 patients enrolled in the study.  The results were that, on average, the patients in Treatment A improved more than those receiving Treatment B.  
    a) Was this a controlled experiment or an observational study, and why?
    b) As you might have guessed, with 43 patients enrolled each treatment group did not have the same number of patients.  Is this a problem?  Why or why not?
    c) Note there was no placebo.  What information would be obtained by including a third group of patients who received a placebo?  Why should or should not this be done?  (Assume a placebo is a completely inactive product.)
    d) Suppose the dermatologist has not yet collected this data, but has proposed this to you as a design for this study. He has asked whether anything should be changed. What advice would you give him?

#2 Physical exercise is considered to increase the risk of spontaneous abortion. Furthermore, women who have had a spontaneous abortion are more likely to have another.  One observational study finds that women who exercise regularly have fewer spontaneous abortions than other women.   If you were a doctor and advised all of your pregnant patients to exercise regularly, would you expect to see a drop in the rate of spontaneous abortions?  If yes, explain why.  If no, suggest a confounding variable.

#3  There's considerable debate about the effects of the "Bush Tax Cuts" on the U.S. economy. At issue is the question of whether or not the tax cuts led to an improved economy.  Describe the data you would like to have in order to answer this question and explain what type of study you would have to use.  (Warning: it could be impossible to design the perfect study.)

p. 27: 1,2,
p. 33: 6,7,* 8