Homework, Stats 11 Fall 2004
All HW is due at 11:30am on Fridays. All questions are in Wild
and Seber unless stated otherwise. The *'d problems are graded 0-5.
You will receive one additional point if all other problems show
serious attempts.
Resources
NOTE 1: More often than not, the data referred to in the exercises
is available in digital form on-line. Check
here before doing lots of typing.
NOTE 2: Looking for a nice, free, stats package? Try Statcrunch. You need to have
a live internet connection, and will be asked to provide your email address,
and you'll also need to take a little (very little) bit of time figuring
things out. But you can do all homework with this.
Solutions to "Extra" problems. Some
of you have asked for solutions to the "extra" problems that have been assigned.
Here are solutions to those since HW 4 that were not already included in
the on-line solutions.
Assignments
HW 9 due Friday, December 10
Solutions to Extra Problems
p. 452: 1,2, 7ab,8
Extra #1* : In the mid-1980s, a team of epidemiologists did a study of
children whose parents worked at a battery factory. The purpose of
the study was to determine if the parents -- whose daily work exposed them
to lead dust -- were contaminating their children with lead. The theory
was that lead-dust gets onto the parents' skin and clothes and then is brought
home where it gets into the air and is inhaled by the children. Lead poisoning
in children can affect growth and also intelligence. It has less effect
on adults. Once lead gets into the bloodstream, the body cannot clean
it out, and so it gradually accumulates over time. For all of these
reasons, health officials are very concerned about childhood exposure to
lead. (A common source used to be household paint. Although laws prevent
using lead based paint, some older structures with older paint can still
contain lead. It can be dangerous for children to eat the chips from
this paint. Recently, in Los Angeles, there has been an increase in
lead poisoning due to certain candies imported from Mexico that have a high
lead content.)
The epidemiologists measured the level of lead in the blood of 32 children
whose parents worked at a battery factory in Oklahoma. These children will
be called the "Exposed" group. They also took lead levels from another
32 children whose parents' work did NOT expose them to lead. These
children are the "Control" group. Lead is measured in units of micrograms
of lead per milliliter of blood (µg/ml).
Two important facts about lead poisoning: most experts believe that lead
levels higher than 40 are dangerous and levels over 60 require immediate
hospitalization.
You can download and view the data at http://www.stat.ucla.edu/~rgould/11f04/lead.txt
a) Describe and compare the distributions of the lead levels of the
exposed and control groups. Based on these graphs, what do you think
about the theory that workplace exposure to lead can result in high lead levels
in their children?
b) Find the means and standard deviations of the lead levels of the exposed
and control groups.
c) The researchers chose the control group in a very particular way: for
each child in the exposed group, they found a child whose parents did not
work around lead and who lived in the same neighborhood and who was the
same age as the exposed child. Explain why age of child and neighborhood
are potential confounding variables if we're trying to determine whether
the parents' workplace leads to lead contamination.
d) Suppose we wished to do a hypothesis test to test the researchers' theory.
What would be a better test and why: A two-sample t-test or
a paired t-test?
Extra #2*: These questions refer to the same study of child lead
poisoning.
a) Compute the "difference scores": the difference between each exposed
child and his or her "match". (The data in the dataset are organized
by pairs. Each Exposed child's score is reported beside his or her Control
match. ) Describe the distribution of these scores.
b) Perform a hypothesis test to test whether parents' exposure to lead
results in higher lead levels in children. Be sure to give the significance
level you're using, state the hypotheses, state your test statistic and
give the observed value, compute the p-value, and give your decision.
c) Find the 95% CI for the mean difference in lead levels. Interpret
what this confidence interval tells us about the effect of parents' workplace
exposure to lead on their children.
d) An assumption of the t-test is that the exposed children were a random
sample from the population of all similarly exposed children. Probably
this assumption is not true here. We don't know for sure (and the
original research article doesn't specify), but it is unlikely the children
came from a random sample of the children at this one factory, much less
a random sample of all children throughout the country who are similarly
exposed. If a t-test is inappropriate, the sign-test might be. The
reasoning is as follows: According to the null hypothesis, if we select
an exposed child and then find a control match, the control should be just
as likely to have a lead level higher than the exposed child as he or she
is to have a lower level. So there's a 50% chance each control child
should have a higher lead level. Perform a sign test using a 5% significance
level.
e) Is this an observational study or a controlled experiment? Why? Does
this fact affect your ability to state conclusively whether the parents'
exposure caused the children's elevated lead levels?
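The sign-test reasoning in part (d) reduces to a binomial tail probability: under the null hypothesis, each pair is equally likely to favor either child. Here is a minimal Python sketch of that calculation. The helper name `sign_test_p_value` and the counts are hypothetical, chosen only for illustration; the real counts must come from lead.txt. It also assumes tied pairs (difference of exactly 0) are dropped before counting, which is a common convention but not something the problem states.

```python
from math import comb

def sign_test_p_value(n_positive: int, n_pairs: int) -> float:
    """One-sided sign test p-value.

    Probability of seeing at least n_positive pairs in which the exposed
    child is higher, if each non-tied pair is a fair coin flip (p = 1/2).
    """
    tail_count = sum(comb(n_pairs, k) for k in range(n_positive, n_pairs + 1))
    return tail_count / 2 ** n_pairs

# Hypothetical counts for illustration only; these are NOT taken from lead.txt.
# Suppose 8 of 10 non-tied pairs had the exposed child higher.
p_value = sign_test_p_value(8, 10)
print(round(p_value, 4))
```

Compare the resulting p-value with the 5% significance level to reach a decision.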
Extra #3. A vending machine is meant to dispense precisely 12 ounces of
(very bad) coffee into a styrofoam cup. The cup holds no more than 12.2
ounces. If it puts in too much, the owner of the vending machine loses
money. If it puts in too little, the customer complains. To test
whether or not it is dispensing 12 ounces of coffee, the owner tests it 10
times and records the number of ounces dispensed each time.
a) State the null and alternative hypotheses for a hypothesis test to test
the claim that the machine is not dispensing the right amount of coffee.
b) It turns out that a 95% confidence interval for the mean amount of coffee
dispensed is (11.9 oz, 12.1 oz). Suppose the owner now does a t-test.
True or false and explain: the p-value for the t-test will be less
than .05.
HW 8 due Friday, December 3 (note that this is a two-week assignment)
Solutions
Note: all "extra" problems are required.
• Extra #1: It is fairly well established that heights of adult
women in the U.S. are normally distributed with a standard deviation of
3 inches. A random sample of 10 women is selected and their heights
are recorded as follows:
63.6, 65.2, 62.2, 71.1, 65.8, 65.1, 71.0, 64.8,
63.3, 64.9
The five number summary is
(62.20, 63.90, 65.00, 65.65, 71.10) and the average of the sample
is 65.70.
a) Is the standard deviation of the population known or unknown in this
problem?
b) Calculate the standard deviation of the sample. How does this
compare to the population standard deviation?
c) Is the mean of the population known or unknown?
d) Find a 95% confidence interval using a z-statistic as a multiplier.
(Why use the z-statistic?)
e) Find a 99% confidence interval.
f) Pretend that you were not told that the population standard deviation
is 3". Use your estimate in (b) to re-calculate the 95% confidence
interval. This time, rather than a z-statistic, use the appropriate
t-statistic as a multiplier.
Click here for answers to the first problem.
• W&S, p. 355: 1, 3,4a
• Extra#2*: A random sample of 9 overweight men tested an experimental
diet. (The actual test also included a control group, but in this problem
we consider only the treatment group.) Their change in weight over
a 2 week period is given below. A negative value means they lost
weight. Would you conclude the diet is successful? Use a
95% confidence interval to answer.
-2.9, -5.7, 1.3, 2.0, 0.0, 1.6, -9.1, 2.1, -4.2
• Extra #3: A university is interested in knowing more about
the demographics of its student body. In particular, it wants to
know the mean income of the parents of its students. To determine
this, it sends out a survey to a random sample of 500 students. (The university
has about 25000 students.) The average income reported by the sample
is $55000 and the standard deviation of the sample is $25000.
a) The distribution of incomes in the population is almost certainly
non-normal (and the mean and standard deviation of the sample support this
notion). Why, then, is it appropriate to use a z-statistic as a multiplier
to compute an approximate 95% confidence interval?
b) Compute an approximate 95% confidence interval.
c) Suppose that the university knew as a fact that the standard deviation
of incomes in the population was $27,500. The university's administration
wanted the survey to produce a 95% confidence interval that was just $2000
"wide". (So the right end-point of the interval minus the left end-point
is 2000.) How many students would they have to sample to achieve
this?
d) Suppose the university did the study again and got this interval:
($53000, $55000). For each of the following interpretations
below, state whether the interpretation is "right" or "wrong" and explain
your answer:
i) 95% of the students' parents have incomes between
$53000 and $55000.
ii) There is a 95% chance that the true mean is
between $53000 and $55000.
iii) If we were to repeat this study infinitely
many times, 95% of our means would fall between $53000 and $55000.
iv) If we were to repeat this study infinitely many
times, 95% of the time we would get an interval that contained the true
mean.
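The sample-size question in part (c) of Extra #3 comes from setting the full interval width, 2·z·σ/√n, equal to the target and solving for n. A minimal Python sketch of that arithmetic follows. The function name `sample_size_for_width` is hypothetical, and the sketch assumes a known population standard deviation and ignores any finite-population correction for the 25000 students.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_width(sigma: float, width: float, confidence: float = 0.95) -> int:
    """Smallest n for which a z-based confidence interval is at most `width` wide.

    The width of the interval is 2 * z * sigma / sqrt(n), so
    n = (2 * z * sigma / width) ** 2, rounded up to a whole person.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return ceil((2 * z * sigma / width) ** 2)

# Values from part (c): sigma = $27,500 and a total width of $2,000.
print(sample_size_for_width(27_500, 2_000))
```

Rounding up rather than to the nearest integer guarantees that the interval is no wider than requested.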
• Extra #4*: If the mean speed on a road is faster than the posted
speed limit, people can contest speeding tickets (at least in some cities).
Below are recorded speeds on a street in which the speed limit is
30 mph. We will examine whether there is evidence that the mean speed of
all cars is faster than the posted speed limit. The observed speeds
were 29, 34, 34, 28, 30, 29, 38, 31, 29, 34, 32, 31, 27, 37, 29, 26, 24,
34, 36, 31, 34, 36, 21
The five number summary is (21.00, 29.00, 31.00, 34.00, 38.00). The
average of the observed speed was 31.04 and the standard deviation was 4.24.
a) Make an appropriate plot of the data and describe the distribution
of speeds.
b) State the null and alternative hypotheses to test.
c) Compute the appropriate test statistic.
d) Find the p-value for the observed test statistic.
e) Would you conclude that the true mean speed is faster than 30 miles
per hour?
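For part (c) of Extra #4, the test statistic can be computed directly from the summary values quoted in the problem (mean 31.04, standard deviation 4.24, and 23 observed speeds). A rough Python sketch follows; `one_sample_t` is a hypothetical helper name, and the p-value in part (d) would still have to be read from a t-table or software using n - 1 degrees of freedom.

```python
from math import sqrt

def one_sample_t(sample_mean: float, null_mean: float, sample_sd: float, n: int) -> float:
    """t statistic: distance of the sample mean from the null value,
    measured in estimated standard errors."""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - null_mean) / standard_error

# Summary values as stated in the problem; 23 speeds are listed.
n_speeds = 23
t_stat = one_sample_t(31.04, 30, 4.24, n_speeds)
print(f"t = {t_stat:.2f} on {n_speeds - 1} degrees of freedom")
```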
• Extra #5: In 1998, as an advertising campaign, the Nabisco Company
announced a "1000 Chips Challenge", claiming that every 18-ounce bag of
their Chips Ahoy cookies contained at least 1000 chocolate chips. Dedicated
Statistics students at the Air Force Academy purchased some randomly selected
bags of cookies, and counted the chocolate chips. Some of their data
are given below:
1219, 1214, 1087, 1200, 1419, 1121, 1325, 1345, 1244, 1258, 1356, 1121, 1191, 1270, 1295, 1135.
a) Find a 95% confidence interval for the mean number of chocolate
chips in all Chips Ahoy bags. What does this say about Nabisco's claim?
b) Perform a hypothesis test and state your conclusion about Nabisco's
claim.
c) Hopefully you reached the same conclusion for parts (a) and (b).
Was this coincidence, or will it always happen?
• Extra #6: Consumer Reports tested 14 brands of vanilla yogurt and
found the following numbers of calories per serving:
160, 200, 220, 230, 120, 180, 140, 130, 170, 190, 80, 120, 100, 170.
A diet guide claims that you will get 120 calories from a serving of
vanilla yogurt. What does this evidence indicate? Do a hypothesis
test and also compute a confidence interval. Choose the significance
level you think is appropriate.
HW 7 due Friday, Nov 19 (same day as midterm II)
W&S p.315: 1,12*,15
Extra* (but required): in class we took a random sample of 7
serial numbers from the population {1, 2, ..., N}, where N was an unknown number.
Each serial number came from a "captured tank".
a) Make a sketch of the pdf of the population (obviously in terms
of N)
b) Let Xi represent the serial number on the ith tank we capture.
Find an expression for the expected value of Xi. Find an expression for the
standard deviation of Xi.
c) Suppose we calculate Y = (X1 + ... + X7)/7. Find the
mean of Y in terms of the mean of X. Find the SD of Y in terms of
the SD of X.
d) A popular choice for an estimator for N was Xbar + 3* SD(X).
What's the bias of this estimator? How does the bias change if we take
a larger sample size?
e) Another choice was Xbar + 3*SD(X)/sqrt(n) . What's
the bias of this estimator? How does the bias change if we take
a larger sample size?
Extra (and required) for review: According to an urban legend,
the Washington Redskins, a DC-based football team, are able to predict
presidential elections. The theory goes like this: if the Redskins
lose their last home game before the election, the incumbent party loses.
If they win the last home game, the incumbent party wins. This method
has been right 15 out of the last 16 times. The only time it
failed was this election. On October 31, the Redskins lost to the Green Bay
Packers and therefore the incumbent should have lost, according to this
predictor. However, he won. Assume that the Redskins have no
predictive ability whatsoever. (In which case the string of successes
is a coincidence.) What's the probability of getting 15 or more correct
predictions in 16 elections?
Solutions to HW 7
HW 6 Due Friday, Nov 12 Solutions
H&G 2.12, 2.14, 2.16 (see note at end).
W&S p. 270: 6e, 7,11,12,13,14*,16
W&S p. 315: 2,3,7*
Note: for #16, the problem is asking you to find the mean and SD of
these "new" random variables (that are, in fact, transformations -- simply
changes in units -- of the previous RVs.) You do *not* need to show
why the distribution is Normal. However, it is not necessarily clear
why the probability distribution of a random variable stays the same when
you add and multiply by constants. If you take a mathematical statistics
course, this is the sort of thing you will worry about. For now,
just assume that the new distribution is still normal.
HW 5 Due Friday, Nov 5 Solution
NOTE: Some of you asked about an example of finding the covariance. Click
on the link to 2.11.
H&G, 2.10 , 2.11,2.20
W&S p. 226: 17a-e, 20,22
W&S: p. 268: 1,2 (note, for part a, don't use the data -- assume
the data follow the normal distribution and use the normal curve), 4,5,6,8
Extra #1: Suppose we toss a fair-coin 100 times. Let X
represent the number of heads.
a) Use the normal approximation to find the probability
that you will get more than 60 heads.
b) Use the normal approximation to find the probability
you'll get between 45 and 55 heads. Between 40 and 60. Between 35
and 65.
c) Compare your answers in (b) with the exact probabilities
obtained from the binomial distribution.
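The comparison asked for in part (c) of Extra #1 can be set up as below. This is a minimal Python sketch under two assumptions that the problem does not state: "between 45 and 55" is read as inclusive of both endpoints, and the normal approximation uses a continuity correction of 0.5. Drop the correction if your course does not use one. The function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

N_TOSSES, P_HEADS = 100, 0.5

# Normal curve with the same mean (np) and SD (sqrt(np(1-p))) as the binomial.
approx = NormalDist(
    mu=N_TOSSES * P_HEADS,
    sigma=sqrt(N_TOSSES * P_HEADS * (1 - P_HEADS)),
)

def exact_between(lo: int, hi: int) -> float:
    """P(lo <= X <= hi) for X ~ Binomial(100, 0.5), endpoints included."""
    return sum(comb(N_TOSSES, k) for k in range(lo, hi + 1)) / 2 ** N_TOSSES

def normal_between(lo: int, hi: int) -> float:
    """Normal approximation to the same probability, with a 0.5 continuity correction."""
    return approx.cdf(hi + 0.5) - approx.cdf(lo - 0.5)

for lo, hi in [(45, 55), (40, 60), (35, 65)]:
    print(lo, hi, round(exact_between(lo, hi), 4), round(normal_between(lo, hi), 4))
```

Each printed row shows the interval, the exact binomial probability, and the normal approximation side by side.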
*Extra #2 Suppose that 48% of the population will vote for Bush. (By
the time you do this, you'll know what the truth is. Maybe.) A random
sample of 1000 people is taken with replacement. Let X be the number
of people in the sample who will vote for Bush. (Note: "with replacement"
means that when a person is selected, they are put back in the pool and
can, theoretically, be selected again.)
a) What distribution does X have? Why? Why
did I specify that the sample of people is to be taken with replacement?
b) Will the normal approximation provide a suitable
approximation of the exact probabilities here? Why?
c) Find the probability that more than 50% of the
sample will vote for Bush.
d) Find the probability that less than 45% of the
sample will vote for Bush.
*Extra #3
In our in-class census, the results of a question asked about
whether people support Bush came out as follows:
Bush?  | No | Undecided | Yes
Female | 25 | 17        | 10
Male   | 28 | 16        | 14
a) Suppose we select a student at random from this group. Let
X = -1 if the student does not support Bush, 0 if undecided, and 1 if
they do support. Let Y = 1 if the student is female, 0 if male.
Write a table of f(x,y), the joint distribution of X and Y.
b) Write tables for f(x) and f(y), the marginal distributions
of X and Y. (That's two tables; one for X, one for Y.)
c) Find P(X>0, Y=1)
d) Find E(X).
e) Are X and Y independent?
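The bookkeeping for Extra #3 (turning counts into a joint distribution, summing out one variable for the marginals, and checking independence cell by cell) can be sketched in Python as follows. The variable names are illustrative, and exact fractions are used so that the independence check is not affected by rounding.

```python
from fractions import Fraction

# Counts from the in-class census table, keyed by (x, y):
# x = -1 (No), 0 (Undecided), 1 (Yes); y = 1 (Female), 0 (Male).
counts = {
    (-1, 1): 25, (0, 1): 17, (1, 1): 10,
    (-1, 0): 28, (0, 0): 16, (1, 0): 14,
}
total = sum(counts.values())

# Joint distribution f(x, y): each cell count divided by the grand total.
joint = {xy: Fraction(c, total) for xy, c in counts.items()}

# Marginals: sum the joint probabilities over the other variable.
f_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (-1, 0, 1)}
f_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# E(X) is the probability-weighted average of the values of X.
expected_x = sum(x * p for x, p in f_x.items())

# X and Y are independent only if f(x, y) = f(x) * f(y) in every cell.
independent = all(joint[(x, y)] == f_x[x] * f_y[y] for x, y in joint)

print(total, expected_x, independent)
```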
HW4 Due Friday, Oct 29 Solutions
Hill & Griffiths (H&G), Exercises 2.2 (p. 37), 2.3, 2.5*,
2.9a
W&S: p. 212: 2 (choose any four -- note that the answers
are in the back)
W&S: p.226 13,14*, 16
Extra #1: Suppose I want to go into the casino business with
a simple gambling game. In this game, you roll a fair, six-sided
die. You will win 1 dollar for each pip on the die if there are an
odd number of pips. You will lose one dollar for each pip if there is
an even number of pips. I'm going to charge some amount of money to
play the game. How much would I need to charge in order to make the
game fair? (A fair game is one in which neither side has an advantage
-- so the expected value should be 0.)
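The expected payout of the die game can be tabulated as below. This is a short Python sketch; `payout` is a hypothetical helper that encodes the rule "win the pips if odd, lose the pips if even" from the player's point of view. A fair entry fee is whatever amount brings the player's expected net gain to 0, so pay attention to the sign of the result when interpreting it.

```python
from fractions import Fraction

def payout(pips: int) -> int:
    """Dollars the player receives for a roll: +pips if odd, -pips if even."""
    return pips if pips % 2 == 1 else -pips

# Each face of a fair die has probability 1/6.
expected_payout = sum(Fraction(payout(p), 6) for p in range(1, 7))
print(expected_payout)
```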
HW3 Due Friday, Oct 22 (same day as Midterm) Solutions
p. 151 1,2,3
p. 157, 2,3
p. 166: 2,3*
p. 174, 2,3
p. 188, 8,9,10*,12
p. 192: 18,(20 is optional), 27
HW 2 Due Friday Oct. 15 Solutions
Click here
for the pdf file (needs Adobe Acrobat Reader) that contains the HW.
If you have trouble viewing this file, save it to your hard drive
(on a PC, right-click and choose "Save File to Hard Drive" or some such
phrase) and open it with Acrobat. Sorry for the pdf file, but it
was the easiest way to include some of the graphs I wanted to include.
The census data used in the exercises is
available here. It's a fairly long file with 5000 entries and consists
of a random sample from the 1980, 1990 and 2000 census of residents of Los
Angeles-Long Beach and Anaheim. It includes information on family
income, race, sex, age, and commuting time. It is a text file, tab-delimited.
From within Stata, you can download the data (assuming you're on-line)
by typing insheet using "http://web.stat.ucla.edu/~rgould/11f04/census.txt"
HW 1 Due Friday Oct. 8 For Solutions
click here.
* #1 A dermatologist wants to test which of two products is best for
removing a type of blemish that appears on the face. He instructs the receptionist
that she is to assign each patient diagnosed with this condition to
either Treatment A or Treatment B, and there should be roughly the same
number of people in each group. The dermatologist assigns each patient
a severity score, and each patient is asked to return 4 weeks later. At
the followup visit, the dermatologist (who does not know which treatment
the patient has received) assigns a new severity score. There were
43 patients enrolled in the study. The results were that, on average,
the patients in Treatment A improved more than those receiving Treatment
B.
a) Was this a controlled experiment or an observational
study, and why?
b) As you might have guessed, with 43 patients
enrolled each treatment group did not have the same number of patients.
Is this a problem? Why or why not?
c) Note there was no placebo. What information
would be obtained by including a third group of patients who received a placebo?
Why should this be done, or not be done? (Assume a placebo is
a completely inactive product.)
d) Suppose the dermatologist has not yet collected
this data, but has proposed this to you as a design for this study. He
has asked whether anything should be changed. What advice would you give
him?
#2 Physical exercise is considered to increase the risk of spontaneous
abortion. Furthermore, women who have had a spontaneous abortion are more
likely to have another. One observational study finds that women who
exercise regularly have fewer spontaneous abortions than other women.
If you were a doctor and advised all of your pregnant patients
to exercise regularly, would you expect to see a drop in the rate of spontaneous
abortions? If yes, explain why. If no, suggest a confounding
variable.
#3 There's considerable debate about the effects of the "Bush
Tax Cuts" on the U.S. economy. At issue is the question of whether or not
the tax cuts led to an improved economy. Describe the data you would
like to have to answer this question and explain what type of study you would have
to use. (Warning: it could be impossible to design the perfect study.)
p. 27: 1,2,
p. 33: 6,7,* 8