Homework 9 Solutions
Caveat Emptor
Due Friday, May 31
1) A poll by the LA Times asked 1167 residents of LA, "Do you think
Los Angeles is primarily a segregated city?" 350 of those polled identified
as White, and 50% of them said that they thought the city was segregated. 262
of the respondents identified as Black, and 47% of them said that
they thought the city was segregated. Do an appropriate hypothesis
test to see whether Blacks and Whites differ in the percentage who
believe LA is segregated.
For a fairly complete analysis of the poll, including all questions asked,
read this pdf file. Note that the sampling scheme was more complicated
than a simple random sample (which is why Blacks make up about 22% of the
sample but about 11% of the LA population). But assume, for the sake
of this problem, that sampling was done with replacement.
If the sample sizes are sufficiently large, then this statistic is approximately
N(0,1) when H0 is true:
Z = (phat1 - phat2)/SD(phat1 - phat2), where phat1 = X/n and
phat2 = Y/m. X = number in the sample of size n who say "segregated", Y
= number in the sample of size m who say it. Hence phat1 = .50 and phat2 = .47.
SD(phat1 - phat2) = sqrt(var(phat1) + var(phat2)) = sqrt(p1*(1 - p1)/n +
p2*(1 - p2)/m), but because we don't know p1 and p2, we replace them
by phat1 and phat2. So SD = sqrt(.5*.5/350 + .47*.53/262) = .0408.
Hence our observed value of Z is .03/.0408 = .7353.
H0: p1 - p2 = 0
Ha: p1 - p2 <> 0
p-value = P(Z > .7353) + P(Z < -.7353) = 1 - pnorm(.7353) + pnorm(-.7353)
= .4622
Therefore, do NOT reject H0. There is not enough evidence to conclude that the
proportions of Blacks and Whites in the POPULATION who would answer "segregated"
differ.
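The calculation above can be reproduced in a few lines of R. This is only a sketch: the variable names (n1, n2, phat1, phat2, se) are our own, and it uses the unpooled standard error exactly as in the solution, not the pooled estimate that some textbooks use under H0. The results should agree with the hand calculation up to rounding.

```r
# Two-sample z-test for a difference in proportions (unpooled SE, as above)
n1 <- 350; phat1 <- 0.50   # White respondents
n2 <- 262; phat2 <- 0.47   # Black respondents

se <- sqrt(phat1 * (1 - phat1) / n1 + phat2 * (1 - phat2) / n2)
z  <- (phat1 - phat2) / se
p_value <- 2 * pnorm(-abs(z))   # two-sided p-value

round(c(se = se, z = z, p_value = p_value), 4)
```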
2) Consider, again, the body temperature data. Perform a hypothesis test to test whether the
median body temperature is 98.6 degrees Fahrenheit. Use the bootstrap
technique. Follow these steps:
a) state the null and alternative hypothesis
H0: median = 98.6
Ha: median <> 98.6
b) As a test statistic, we'll use the sample median. What is the observed
value of the sample median?
> median(bodytemp)
[1] 98.321
c) Adjust the data so that the sample median is now 98.6. (Hint:
add or subtract the appropriate amount from every observation). Now
take a sample, with replacement, of 130 and calculate the median. Repeat
this 1000 times, and you'll have a "bootstrap sample" of medians for which
we know that the true value is 98.6.
> adjbodytemp <- bodytemp-98.321+98.6
> median(adjbodytemp)
[1] 98.6
> btstrap <- c()
> for (i in 1:1000){
+ bsmedian <- median(sample(adjbodytemp, length(adjbodytemp), replace=TRUE))
+ btstrap <- c(btstrap, bsmedian)}
d) Based on your bootstrap sample, what is your estimate of the probability
that a sample median will be as extreme as, OR MORE extreme than, the observed value of
your sample median (in step b)?
p-value = P(sample median < 98.321) + P(sample median > 98.879) (because
98.879 is as far above 98.6 as 98.321 is below it.)
> sum(btstrap < 98.321)
[1] 8
So 8/1000 is the "empirical probability" of getting a median less than 98.321
if the true median is 98.6.
> 98.6 - 98.321
[1] 0.279
> 98.6+.279
[1] 98.879
> sum(btstrap > 98.879)
[1] 4
Combining both tails, 8 + 4 = 12 of the 1000 bootstrap medians are at least this extreme:
> 12/1000
[1] 0.012
e) Based on your answer to (d), would you reject the null hypothesis?
Yes. The p-value (.012) is less than .05.
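Steps (b) through (d) can also be written as one short script. This is a sketch, not the official solution: the bodytemp vector is not reproduced here, so the data below are simulated stand-ins (hypothetical values), and the names null_median, obs_median, adj, and gap are our own. It counts bootstrap medians that are at least as far from 98.6 as the observed median, which handles both tails at once.

```r
# Stand-in data: the real bodytemp vector is not reproduced here, so we
# simulate 130 hypothetical temperatures purely to make the sketch runnable.
set.seed(1)
bodytemp <- rnorm(130, mean = 98.3, sd = 0.7)

null_median <- 98.6
obs_median  <- median(bodytemp)

# Shift the data so that its median equals the null value
adj <- bodytemp - obs_median + null_median

# 1000 bootstrap medians drawn (with replacement) from the shifted data
btstrap <- replicate(1000, median(sample(adj, length(adj), replace = TRUE)))

# Two-sided p-value: proportion of bootstrap medians at least as far
# from 98.6 as the observed median is
gap <- abs(obs_median - null_median)
p_value <- mean(abs(btstrap - null_median) >= gap)
p_value
```

Using replicate avoids growing the vector inside a loop, but it does the same resampling as the for loop above.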
3) Suppose you are shown a suspicious looking coin, and asked to
determine whether it is "fair". Because your time is limited, you decide
that you will flip the coin 10 times. If it lands on heads 0, 1, 9,
or 10 times, you will declare it "unfair". Otherwise, you will conclude
that there is no evidence to call it unfair.
a) What is the probability that you will
declare a fair coin unfair? (i.e. What's the significance level?)
Let X = number of heads in 10 tosses.
P(declare coin unfair when coin is fair) = P(X = 0,1,9, or 10 when p = .5)
=
> sum(dbinom(c(0,1,9,10),10,.5))
[1] 0.02148438
b) Suppose that the reality is that the coin is
unfair. In fact, assume that for this coin, p = .55. For this
coin, what is the probability that your procedure will correctly identify
it as unfair? This is the power at p = .55.
P(X = 0,1,9,10 when p = .55) =
> sum(dbinom(c(0,1,9,10),10,.55))
[1] 0.02775935
c) Redo part (b), but now assume p = .60. Note
that the power increases as p gets further from the null hypothesis value
of .50.
> sum(dbinom(c(0,1,9,10),10,.6))
[1] 0.04803512
Note that the power is still very small (4.8%).
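To see how the power of this rule changes as p moves away from .50, parts (a) through (c) can be wrapped in a small function. The function name power_at and the grid of p values are our own choices for illustration.

```r
# Power of the rule "reject if X is 0, 1, 9, or 10" for 10 flips,
# as a function of the true probability of heads p
power_at <- function(p, n = 10, reject = c(0, 1, 9, 10)) {
  sum(dbinom(reject, n, p))
}

p_grid <- c(0.50, 0.55, 0.60, 0.70, 0.80, 0.90)
data.frame(p = p_grid, power = sapply(p_grid, power_at))
```

The first row (p = .50) is the significance level from part (a); the next two rows repeat parts (b) and (c).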
d) Suppose that p = .55, but now you change your procedure.
You flip the coin 100 times, and you declare the coin unfair if it lands
heads 10% or fewer, or 90% or more. Now what's the probability
that you will correctly declare the coin unfair? Note that the power
increases as the sample size increases.
(Hint: if X is a binomial random variable with parameters n and p,
then the R command dbinom(x,n,p) gives you P(X = x).)
P(X <= 10) + P(X >=90) =
> sum(dbinom(0:10,100,.55)) + sum(dbinom(90:100,100,.55))
[1] 2.914801e-14
Oops. My mistake. This decision rule is not comparable to the
one used in parts (b) and (c).
(The power should be bigger. The reason it's not is that the significance
level is no longer .02148. You can calculate it by substituting .50 where
.55 appears above.)
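Here is one way to check this, and to build a 100-flip rule that IS comparable. This is a sketch under our own assumptions: we keep the symmetric form "reject if X <= k or X >= 100 - k" and pick the largest k whose significance level does not exceed the .02148 from part (a). The names alpha_100, alpha_target, k_grid, alpha_k, and power_k are ours.

```r
n <- 100

# Significance level of the rule "reject if X <= 10 or X >= 90"
alpha_100 <- pbinom(10, n, 0.5) + pbinom(89, n, 0.5, lower.tail = FALSE)

# Hypothetical alternative rule: reject if X <= k or X >= n - k, with k the
# largest cutoff whose significance level does not exceed that of part (a)
alpha_target <- sum(dbinom(c(0, 1, 9, 10), 10, 0.5))
k_grid  <- 0:49
alpha_k <- pbinom(k_grid, n, 0.5) +
  pbinom(n - k_grid - 1, n, 0.5, lower.tail = FALSE)
k <- max(k_grid[alpha_k <= alpha_target])

# Power of that rule at p = 0.55
power_k <- pbinom(k, n, 0.55) + pbinom(n - k - 1, n, 0.55, lower.tail = FALSE)
c(alpha_100 = alpha_100, k = k, alpha_k = alpha_k[k + 1], power_k = power_k)
```

With the significance level held at (or just under) the old value, one would expect the power at p = .55 to exceed the power from part (b), which is the point the problem was trying to make.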
e) Back to the old procedure of flipping the coin
10 times. Now we will declare the coin unfair if it lands heads 0,1,2,8,9,
or 10. What is the significance level now?
> sum(dbinom(c(0,1,2,8,9,10),10,.5))
[1] 0.109375
f) What is the power for this new procedure if we assume
that the truth is that p = .55? How has it changed from (b)? What
can you say about the relation between power and significance level?
> sum(dbinom(c(0,1,2,8,9,10),10,.55))
[1] 0.1269515
The significance level is higher (and therefore worse), but the power is
also higher (and therefore better). Raising the significance level raises
the power, because the rejection region gets larger.
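The trade-off is easy to see by lining up several rejection regions side by side. In this sketch the first region (reject only on 0 or 10 heads) is a hypothetical extra that we added for comparison; the other two are the rules from parts (a) and (e).

```r
# Significance level and power (at p = 0.55) for three nested two-sided
# rejection regions with n = 10 flips
regions <- list(
  "0,10"         = c(0, 10),               # hypothetical, stricter rule
  "0,1,9,10"     = c(0, 1, 9, 10),         # rule from parts (a)-(c)
  "0,1,2,8,9,10" = c(0, 1, 2, 8, 9, 10)    # rule from parts (e)-(f)
)
alpha <- sapply(regions, function(r) sum(dbinom(r, 10, 0.50)))
pow   <- sapply(regions, function(r) sum(dbinom(r, 10, 0.55)))
cbind(alpha, power = pow)
```

Each time the rejection region grows, both columns go up together.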
4) A special managed care program was implemented at 9 randomly chosen hospitals
in an HMO system. At 7 of those hospitals, the average cost per patient
decreased over the previous year. Is this evidence that the managed
care program saves money?
a) Let X represent the number of hospitals out of the 9 that spend less
money per patient. Before the study was conducted, this was a
random variable. According to the null hypothesis, what value should
we assign to p, the probability that a hospital will spend less money per
patient?
H0: p = .5
b) What's the alternative hypothesis?
Ha: p > .5
c) Let X be your test statistic. What's the sampling distribution
of X, according to the null hypothesis?
Binomial, with n = 9, p = .5
d) The observed value of X is 7. What's the p-value?
P-value = P(X >= 7) =
> sum(dbinom(7:9,9,.5))
[1] 0.08984375
e) If you use a significance level of 5%, do you reject the null hypothesis?
No. The p-value (.0898) is greater than .05.
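R's built-in binom.test carries out the same exact test, so it makes a handy check on the hand calculation. The variable name res is our own.

```r
# Exact one-sided binomial test: 7 "successes" out of 9, H0: p = 0.5
res <- binom.test(7, 9, p = 0.5, alternative = "greater")
res$p.value

# The same upper-tail probability computed directly: P(X >= 7) = P(X > 6)
pbinom(6, 9, 0.5, lower.tail = FALSE)
```

Both lines should reproduce the p-value found in part (d).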
5) A study published in 1982 in the Journal of Epidemiology (Morton, et
al.) examined the children of workers at a battery factory. These workers
were exposed to lead while at work, and there was concern that lead dust could
be brought home and contaminate the workers' children. Once ingested, lead
gets into the bloodstream and the body cannot remove it. Therefore,
over time, the level of lead accumulates. This is particularly dangerous
in children, because excessive lead levels cause developmental problems.
The file lead.dat contains the blood lead levels
for 33 children of workers at this factory. (Lead levels are measured in
units of micrograms of lead per deciliter of blood.) They are labeled
"Exposed". The other column, labeled "Control", consists of the blood
lead levels for 33 "matched" children. These were children of the same
age, living in the same neighborhood, but whose parents did not work around
lead. So for example, the 4th child in the Exposed column is the same
age and lives in the same neighborhood as the 4th child in the Control column.
a) Without looking at the data, what shape do you think the distribution
of Exposed children's lead levels will look like? Explain why.
You're welcome to whatever thoughts you want. You might think that
the levels will be more-or-less normally distributed. After all, a child
is just as likely to be some amount below the average as above the average.
And in fact, this is exactly what the Control children's lead distribution
looks like. The Exposed children, however, show a very skewed distribution.
This is some evidence that something is going on here, since the shapes
of the two distributions differ so radically.
b) Look at the distributions of the Exposed and Control lead scores. What
differences do you see in the shapes? Do you see evidence that the Exposed
children differ from the Control children?
See above.
c) Most toxicologists believe that lead levels over 50 micrograms per deciliter
require medical treatment, and levels over 60 require immediate hospitalization.
How does this information affect your comparison of the two groups?
2 children in the Exposed group should be in the hospital. No one in
the Control group is nearly this sick. This says that the levels in
the Exposed group are dangerously high.
d) Create a new variable called "diff" that is equal to the Exposed levels
minus the Control levels. Describe the distribution of this variable.
> diff <- Exposed - Control
> hist(diff)
The scores are somewhat skewed right, and the bulk of the scores are positive,
suggesting that most Exposed children had higher lead levels than their Control
counterparts, sometimes considerably higher.
e) What does the variable diff tell you about the difference between Exposed
and Control children?
In most cases, Exposed children have higher lead levels than children of
the same age living in the same neighborhood.
f) Perform the appropriate hypothesis test to test whether the mean of diff
is 0. State the null and alternative hypotheses, and your conclusion.
Also state any assumptions that you made regarding the population and
the sample.
The data are paired, and so the test we use is the paired t-test:
H0: mean difference = 0
Ha: mean difference > 0
T = (Avg(diff) - 0) / (SD(diff)/sqrt(n))
> mean(diff)/(sd(diff)/sqrt(length(diff)))
[1] 5.782966
p-value = P(T > 5.783), where T has a t-distribution with n - 1 = 32 degrees
of freedom:
> 1 - pt(5.783, 32)
[1] 1.018113e-06
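The p-value is considerably less than .05, suggesting that this would be a remarkable
outcome if, indeed, the mean difference were 0. We therefore reject the
null hypothesis and conclude that the mean difference is higher than can be explained
by chance variation. This assumes that the pairs are a random sample and that
the differences are roughly normal (or that n = 33 is large enough for the
t approximation to be reasonable).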
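The same test is available through t.test with paired = TRUE. The sketch below checks that the built-in function and the hand formula agree. Since lead.dat is not reproduced here, the Exposed and Control vectors are simulated stand-ins (hypothetical values, not the study data), and the names d, t_manual, p_manual, and res are our own.

```r
# Stand-in data: lead.dat is not reproduced here, so these 33 pairs are
# simulated and purely hypothetical
set.seed(2)
Control <- rnorm(33, mean = 16, sd = 4)
Exposed <- Control + rexp(33, rate = 1 / 15)

# Paired t statistic and one-sided p-value computed by hand
d <- Exposed - Control
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))
p_manual <- pt(t_manual, df = length(d) - 1, lower.tail = FALSE)

# t.test with paired = TRUE performs the same one-sided test
res <- t.test(Exposed, Control, paired = TRUE, alternative = "greater")
c(t_manual = t_manual, t_builtin = unname(res$statistic),
  p_manual = p_manual, p_builtin = res$p.value)
```

With the real data loaded in place of the simulated vectors, the same two calls should reproduce the statistic and p-value shown above.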
g) Why did the experimenters choose matches from the same neighborhood?
Of the same age?
They were controlling for two confounding factors. Lead might be in
the environment (in the dirt in the yard), and this would cause children to
have higher than normal lead levels. Also, coincidentally, these homes
might be closer to the factory, which means that children of the factory workers
might have high lead levels not because of what their parents do at work,
but because of where they live. Similarly, lead stays in the blood
stream and therefore accumulates over time. So children of the same
age should have about the same amount of lead in their blood. If the
factory workers' children were all of a similar age (maybe because of when
hiring was done or when the factory was opened), we might expect their lead
levels to be similar for that reason alone.