Solutions to *'d and all Extra problems
Extra #1* : In the mid-1980s, a team of epidemiologists did a study
of children whose parents worked at a battery factory. The purpose
of the study was to determine if the parents -- whose daily work exposed
them to lead dust -- were contaminating their children with lead. The
theory was that lead-dust gets onto the parents' skin and clothes and then
is brought home where it gets into the air and is inhaled by the children.
Lead poisoning in children can affect growth and also intelligence. It
has less effect on adults. Once lead gets into the bloodstream, the
body cannot clean it out, and so it gradually accumulates over time. For
all of these reasons, health officials are very concerned about childhood
exposure to lead. (A common source used to be household paint. Although
laws prevent using lead-based paint, some older structures with older paint
can still contain lead. It can be dangerous for children to eat the
chips from this paint. Recently, in Los Angeles, there has been an
increase in lead poisoning due to certain candies imported from Mexico that
have a high lead content.)
The epidemiologists measured the level of lead in the blood of 32 children
whose parents worked at a battery factory in Oklahoma. These children will
be called the "Exposed" group. They also took lead levels from another
32 children whose parents' work did NOT expose them to lead. These
children are the "Control" group. Lead is measured in units of micrograms
of lead per deciliter of blood (µg/dl).
Two important facts about lead poisoning: most experts believe that lead
levels higher than 40 are dangerous and levels over 60 require immediate
hospitalization.
You can download and view the data at http://www.stat.ucla.edu/~rgould/11f04/lead.txt
a) Describe and compare the distributions of the lead levels of the
exposed and control groups. Based on these graphs, what do you think
about the theory that workplace exposure to lead can result in high lead
levels in their children?
When comparing distributions, compare the center, spread, and shape. Here
the exposed children have a higher center score, have much greater spread,
and the shape could be described as right-skewed or, at the very least, has
two outliers. The control group, on the other hand, has a fairly symmetric
distribution of lead levels. Note that some of the children in the
exposed group have dangerously high levels of lead. These all suggest
that the exposed children's lead level is different from the control group
in a meaningful way.
b) Find the means and standard deviations of the lead levels of the
exposed and control groups.
Exposed: average is 31.8 and sd is 14.4 micrograms per deciliter
Control: average is 15.9 and sd is 4.5
Again we see the exposed group has a higher mean (twice as high) and greater
spread. This is consistent with the theory that the exposed children
are being contaminated by lead through an outside source.
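The calculation in part (b) can be sketched in Python as follows. This is a minimal sketch that assumes the lead levels have already been read into plain lists; the function name `summarize` is illustrative and not part of the assignment.

```python
import statistics

def summarize(levels):
    """Return (mean, sample standard deviation) of a list of lead levels."""
    # statistics.stdev divides by n - 1, i.e. it is the sample sd.
    return statistics.mean(levels), statistics.stdev(levels)
```

Calling `summarize` once on the Exposed column and once on the Control column of lead.txt should give values close to those reported above.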
c) The researchers chose the control group in a very particular way:
for each child in the exposed group, they found a child whose parents did
not work around lead and who lived in the same neighborhood and who was the
same age as the exposed child. Explain why age of child and neighborhood
are potential confounding variables if we're trying to determine whether
the parents' workplace leads to lead contamination.
Lead accumulates in the bloodstream, and so we expect older children to
have higher lead levels. If the factory workers' children turn out to
be older than the control group, then this difference in age might explain
the difference in lead levels. Also, lead varies with environment.
For example, older houses are more likely to have lead-based paint. If
the factory workers lived in older housing than the control group, this might
explain why the lead levels were different. Therefore, by making sure
the ages and neighborhoods are similar, the researchers can control for age
and environment.
d) Suppose we wished to do a hypothesis test to test the researchers'
theory. What would be a better test and why: A two-sample t-test
or a paired t-test?
The paired t-test is a better test because it takes into account the fact
that the two groups are DEPENDENT. This violates one of the assumptions
of the two-sample t-test, and so it would not be appropriate. (There
are numerous ways of saying why it is the wrong thing to use the two-sample
test, but this is a fairly direct way of saying why.)
Extra #2*: These questions refer to the same study of child lead
poisoning.
a) Compute the "difference scores": the difference between each
exposed child and his or her "match". (The data in the dataset are
organized by pairs. Each Exposed child's score is reported beside his or
her Control match. ) Describe the distribution of these scores.
The typical difference is somewhere around 20, with a range from -4 to 60.
The bulk of the data are over 0, which is important because it suggests
that in most pairs, the exposed children have higher lead levels than the
control children. There is a potential outlier at 60, which means in
one pair, the exposed child's lead level was 60 micrograms per deciliter higher than
the control child.
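A short Python sketch of the difference scores, assuming the two columns are already in lists with each Exposed child in the same position as his or her Control match. The helper names `difference_scores` and `count_signs` are hypothetical.

```python
def difference_scores(exposed, control):
    """Exposed minus Control for each matched pair, in order."""
    if len(exposed) != len(control):
        raise ValueError("each exposed child needs exactly one control match")
    return [e - c for e, c in zip(exposed, control)]

def count_signs(diffs):
    """Return (# positive, # zero, # negative) differences."""
    pos = sum(d > 0 for d in diffs)
    zero = sum(d == 0 for d in diffs)
    return pos, zero, len(diffs) - pos - zero
```

The count of positive differences is a quick check on the claim that most pairs have the exposed child higher.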
b) Perform a hypothesis test to test whether parents' exposure to lead
results in higher lead levels in children. Be sure to give the significance
level you're using, state the hypotheses, state your test statistic and give
the observed value, compute the p-value, and give your decision.
Significance level: 5%
H0: mean difference is 0
Ha: mean difference > 0
T = (15.9 - 0)/(15.9/sqrt(33)) = 5.74 (the average of the difference
scores is 15.9, which also turns out to be the standard deviation, so T
equals sqrt(33).)
P-value = P(T > 5.74) = about 0.000001 using a computer. Or, using
Appendix A6 in the book with the row for n-1 = 32 degrees of freedom,
we see that since 5.74 > 4.198 (the largest value in the table), the p-value
must be less than .0001.
Our decision is to reject the null hypothesis, since the p-value is less
than the significance level. We conclude that the mean level of lead
in the exposed children is higher than the mean level in the control children.
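As a check on the arithmetic, here is a minimal Python sketch that computes the paired t statistic from the rounded summary values quoted above (mean difference 15.9, standard deviation 15.9, 33 pairs) and compares it with the largest table value for 32 degrees of freedom. The function name `paired_t` is illustrative. Because the mean and standard deviation are equal after rounding, the statistic works out to sqrt(33), roughly 5.7.

```python
import math

def paired_t(mean_diff, sd_diff, n, null_value=0.0):
    """One-sample t statistic computed on the difference scores."""
    standard_error = sd_diff / math.sqrt(n)
    return (mean_diff - null_value) / standard_error

t_obs = paired_t(15.9, 15.9, 33)

# Largest table value quoted in the text for 32 df (one-sided p = .0001).
T_TABLE_MAX = 4.198
reject = t_obs > T_TABLE_MAX  # True: the p-value is below .0001, hence below .05
```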
c) Find the 95% CI for the mean difference in lead levels. Interpret
what this confidence interval tells us about the effect of parents' workplace
exposure to lead on their children.
Note that the question asks for a two-sided confidence interval. (We
didn't cover one-sided, and don't need to know.) Again, using the row
with 32 degrees of freedom, we see that the appropriate multiplier for a
95% CI is 2.037. Hence
15.9 +/- (2.037) * (15.9/sqrt(33))
15.9 +/- 5.6
(10.3, 21.5)
This confidence interval means we can be 95% confident that the true mean
difference is between 10.3 and 21.5 micrograms per deciliter. Since the whole
interval is positive, the mean lead level of the exposed children is higher
than for the control children.
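The interval arithmetic can be reproduced with a few lines of Python. This sketch takes the multiplier 2.037 from the table as given rather than computing it; `t_interval` is an illustrative name.

```python
import math

def t_interval(mean_diff, sd_diff, n, multiplier):
    """Two-sided interval: mean +/- multiplier * sd / sqrt(n)."""
    margin = multiplier * sd_diff / math.sqrt(n)
    return mean_diff - margin, mean_diff + margin

low, high = t_interval(15.9, 15.9, 33, 2.037)
# Rounded to one decimal place this gives (10.3, 21.5).
```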
d) An assumption of the t-test is that the exposed children were a
random sample from the population of all similarly exposed children. Probably
this assumption is not true here. We don't know for sure (and the original
research article doesn't specify), but it is unlikely the children came
from a random sample of the children at this one factory, much less a random
sample of all children throughout the country who are similarly exposed.
If a t-test is inappropriate, the sign-test might be. The reasoning
is as follows: According to the null hypothesis, if we select an exposed
child and then find a control match, the control should be just as likely
to have a lead level higher than the exposed child as he or she is to have
a lower level. So there's a 50% chance each control child should have
a higher lead level. Perform a sign test using a 5% significance level.
Let p represent the probability that, in a given pair, the exposed child
will have a higher lead level than the control child.
H0: p = .5
Ha: p > .5
Also okay to say, "let m be the median difference, then H0: m = 0 and Ha:
m > 0."
Also, if you did your difference as "Control - Exposed" instead of "Exposed
- Control", then your ">" should be a "<".
For our test statistic we'll use X = (# of pairs out of 33 in which the
exposed child has a higher lead level than the control child.)
If the null hypothesis is true, X is a binomial random variable with n =
33 and p = .5. NOTE: you might have dropped the 0 score, in which
case n = 32. I think it's bad practice to drop data but since the book
suggests doing this, it's okay to do it here.
Our observed value of the test statistic is x = 28.
p-value = P(X >= 28) which we can figure out using the binomial pdf =
.00003
Because the p-value is less than .05, we reject the null hypothesis and
conclude that the exposed children are more likely to have higher lead levels
than the control children.
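The binomial tail probability used in the sign test can be computed directly. The sketch below uses only the Python standard library (`math.comb` needs Python 3.8 or later); the function name `binomial_upper_tail` is illustrative.

```python
import math

def binomial_upper_tail(x, n, p=0.5):
    """P(X >= x) for X ~ Binomial(n, p), summed term by term."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(x, n + 1)
    )

# 28 of the 33 pairs have the exposed child higher.
p_value = binomial_upper_tail(28, 33)  # about 0.000033
```

If the zero difference is dropped, the same function with n = 32 also gives a p-value far below .05, so the decision does not depend on that choice.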
e) Is this an observational study or a controlled experiment? Why?
Does this fact affect your ability to state conclusively whether the
parents' exposure caused the children's elevated lead levels?
Observational: the researchers did not determine which child would
be in the exposed group and which in the control. (While they did hand-pick
the control group, they did not get to assign some children to exposed and
some to control. This assignment was determined by where the parents worked,
not by the researchers.) This means that there could be unaccounted-for
confounding variables. They controlled for age and neighborhood, but perhaps
there are others that might explain the difference. So we can't be certain
that the parents' working conditions led to the elevated levels.
(NOTE: in fact, the researchers collected more evidence that leads
one to conclude that the parents' working conditions were to blame, but that's
beyond this problem.)
Extra #3. A vending machine is meant to dispense precisely 12 ounces
of (very bad) coffee into a styrofoam cup. The cup holds no more than
12.2 ounces. If it puts in too much, the owner of the vending machine
loses money. If it puts in too little, the customer complains. To
test whether or not it is dispensing 12 ounces of coffee, the owner tests
it 10 times and records the number of ounces dispensed each time.
a) State the null and alternative hypothesis for a hypothesis test to
test the claim that the machine is not dispensing the right amount of coffee.
H0: mean = 12
Ha: mean <> 12
where "mean" means the mean amount of coffee dispensed by the machine.
b) It turns out that a 95% confidence interval for the mean amount
of coffee dispensed is (11.9 oz, 12.1 oz.). Suppose the owner now
does a t-test. True or false and explain: the p-value for the t-test
will be less than .05.
The null hypothesis mean, 12, is contained in the 95% confidence interval.
This means that we canNOT reject the null hypothesis with a (100% -
95%) = 5% significance level. If we did NOT reject, it must be the case
that the p-value > .05. So the statement is False.
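The reasoning in part (b) rests on the link between a two-sided test and the matching confidence interval: a test at level 5% rejects H0 exactly when the 95% interval, built from the same data, excludes the null value. A minimal Python sketch of that check, with a hypothetical function name:

```python
def rejects_at_matching_level(ci_low, ci_high, null_value):
    """A two-sided test at level alpha rejects H0 exactly when the
    (1 - alpha) confidence interval excludes the null value."""
    return not (ci_low <= null_value <= ci_high)

# 95% CI from the problem is (11.9, 12.1); the null value is 12 ounces.
reject_h0 = rejects_at_matching_level(11.9, 12.1, 12.0)  # False
```

Here 12 also sits at the midpoint of the interval, so the sample mean equals the null value and the t statistic is essentially 0, which makes the p-value much larger than .05.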