Homework 9 Solutions

Caveat Emptor

Due Friday, May 31


1) A poll by the LA Times asked 1167 residents of LA, "Do you think Los Angeles is primarily a segregated city?"  350 of those polled identified as White, and 50% of them said that they thought the city was segregated.  262 of the respondents identified as Black, and 47% of these people said that they thought the city was segregated.   Do an appropriate hypothesis test to see whether Blacks and Whites differ in the percentage of those who believe LA is segregated.

For a fairly complete analysis of the poll, including all questions asked, read this pdf file.   Note that the sampling scheme was more complicated than a simple random sample (which is why Blacks make up about 30% of the sample but about 11% of the LA population).  But assume, for the sake of this problem, that sampling was done with replacement.


If the sample sizes are sufficiently large, then under H0 this statistic is approximately N(0,1):
Z = (phat1 - phat2)/SD(phat1 - phat2),   where phat1 = X/n  and phat2 = Y/m.  Here X = number in the sample of size n who say "segregated", and Y = number in the sample of size m who say it. Hence phat1 = .50 and phat2 = .47.

SD(phat1 - phat2) = sqrt(var(phat1) + var(phat2)) = sqrt( p1*(1 - p1)/n  + p2 *(1-p2)/m)  but because we don't know p1 and p2, we replace them by phat1 and phat2.  So SD = sqrt(.5*.5/350 + .47*.53/262) = .0408.

Hence our observed value of Z is .03/.0408= .7353

H0: p1  - p2 = 0
Ha: p1 - p2 <> 0

p-value = P(Z > .7353) + P(Z < -.7353) = 1 - pnorm(.7353) + pnorm(-.7353) = .4622  
Therefore, do NOT reject.  There is no evidence to suggest that the proportions of Blacks and Whites in the POPULATION who would answer "segregated" are different.
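
The arithmetic above can be collected into a few lines of R.  This is a minimal sketch that uses only the rounded percentages quoted in the problem (.50 and .47); the variable names se, z, and p_value are chosen here for illustration.

```r
# Two-sample z-test for a difference in proportions,
# using the rounded sample proportions given in the problem.
phat1 <- 0.50; n <- 350   # Whites
phat2 <- 0.47; m <- 262   # Blacks

# Estimated SD of (phat1 - phat2), with phat1 and phat2 plugged in for p1 and p2
se <- sqrt(phat1 * (1 - phat1) / n + phat2 * (1 - phat2) / m)

# Observed test statistic and two-sided p-value
z <- (phat1 - phat2) / se
p_value <- 2 * pnorm(-abs(z))

print(c(se = se, z = z, p_value = p_value))
```

Printing these should reproduce the values found above, up to rounding.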

2) Consider, again, the body temperature data.  Perform a hypothesis test to test whether the median body temperature is 98.6 degrees Fahrenheit.  Use the bootstrap technique.  Follow these steps:
a) state the null and alternative hypotheses
H0: median = 98.6
Ha: median <> 98.6

b) As a test statistic, we'll use the sample median.  What is the observed value of the sample median?

> median(bodytemp)
[1] 98.321

c) Adjust the data so that the sample median is now 98.6.  (Hint: add or subtract the appropriate amount from every observation).  Now take a sample of size 130, with replacement, and calculate the median.  Repeat this 1000 times, and you'll have a "bootstrap sample" of medians for which we know that the true value is 98.6.
> adjbodytemp <- bodytemp-98.321+98.6
> median(adjbodytemp)
[1] 98.6

> btstrap <- replicate(1000, median(sample(adjbodytemp, replace = TRUE)))

d) Based on your bootstrap sample, what is your estimate of the probability that a sample median will be as extreme as, or more extreme than, the observed value of your sample median (in step b)?
p-value = P(sample median < 98.321) + P(sample median > 98.879)  (because 98.879 is as far above 98.6 as 98.321 is below it.)

> sum(btstrap < 98.321)
[1] 8
Dividing this count by 1000 gives the "empirical probability" (8/1000 = .008) of getting a median less than 98.321 if the true median is 98.6.

> 98.6 - 98.321
[1] 0.279
> 98.6+.279
[1] 98.879
> sum(btstrap > 98.879)
[1] 4
> (8 + 4)/1000
[1] 0.012

e) Based on your answer to (d), would you reject the null hypothesis?
Yes.  The p-value (.012) is less than .05.
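
Steps (b) through (d) can be wrapped into one function.  The sketch below is an illustration only: the function name bootstrap_median_test is made up here, and because the body temperature data are not reproduced in this document, the example call uses simulated values (fake_temps) rather than the real bodytemp vector.  It also counts bootstrap medians that are at least as far from the null value as the observed median (>=), whereas the transcript above uses strict inequalities.

```r
# Bootstrap test of H0: median = null_median (two-sided).
bootstrap_median_test <- function(x, null_median, reps = 1000) {
  observed <- median(x)
  # Shift the data so that its sample median equals the null value
  shifted <- x - observed + null_median
  # Resample with replacement (same size as x) and record each median
  boot <- replicate(reps, median(sample(shifted, replace = TRUE)))
  # Proportion of bootstrap medians at least as far from the null value
  # as the observed median is
  mean(abs(boot - null_median) >= abs(observed - null_median))
}

# Example with SIMULATED data (not the real body temperature data)
set.seed(1)
fake_temps <- rnorm(130, mean = 98.25, sd = 0.73)
p_boot <- bootstrap_median_test(fake_temps, 98.6)
print(p_boot)
```

With the real data, the call would be bootstrap_median_test(bodytemp, 98.6); the result will vary a little from run to run because the resampling is random.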

3) Suppose you are shown a suspicious looking coin, and asked to determine whether it is "fair".  Because your time is limited, you decide that you will flip the coin 10 times.  If it lands on heads 0, 1, 9, or 10 times, you will declare it "unfair".  Otherwise, you will conclude that there is no evidence to call it unfair.
    a) What is the probability that you will declare a fair coin unfair?  (i.e. What's the significance level?)
Let X = number of heads in 10 tosses.
P(declare coin unfair when coin is fair) = P(X = 0,1,9, or 10 when p = .5) =

> sum(dbinom(c(0,1,9,10),10,.5))
[1] 0.02148438

    b) Suppose that the reality is that the coin is unfair.  In fact, assume that for this coin, p = .55.  For this coin, what is the probability that your procedure will correctly identify it as unfair?  This is the power at p = .55.

P(X = 0,1,9,10 when p = .55) =
> sum(dbinom(c(0,1,9,10),10,.55))
[1] 0.02775935


    c) Redo part (b), but now assume p = .60.  Note that the power increases as p gets further from the null hypothesis value of .50.

> sum(dbinom(c(0,1,9,10),10,.6))
[1] 0.04803512

Note that the power is still very small (4.8%).

    d) Suppose that p = .55, but now you change your procedure.  You flip the coin 100 times, and you declare the coin unfair if it lands heads 10% or fewer, or 90% or more.  Now  what's the probability that you will correctly declare the coin unfair?  Note that the power increases as the sample size increases.
(Hint:  if X is a binomial random variable with parameters n and p, then the R command dbinom(x,n,p) gives you P(X = x).)
P(X <= 10) + P(X >=90) =
> sum(dbinom(0:10,100,.55)) + sum(dbinom(90:100,100,.55))
[1] 2.914801e-14

Oops.  My mistake.  This decision rule is not comparable to the one used in parts (b) and (c).
(The power should be bigger.  The reason it's not is that the significance level is no longer .02148.  You can calculate it by substituting .50 where .55 appears above.)

     e) Back to the old procedure of flipping the coin 10 times.  Now we will declare the coin unfair if it lands heads 0,1,2,8,9, or 10.  What is the significance level now?

> sum(dbinom(c(0,1,2,8,9,10),10,.5))
[1] 0.109375

    f) What is the power for this new procedure if we assume that the truth is that p = .55?  How has it changed from (b)?  What can you say about the relation between power and significance level?
> sum(dbinom(c(0,1,2,8,9,10),10,.55))
[1] 0.1269515

The significance level is higher (and therefore worse), but the power is also higher (and therefore better.)  Increasing significance level increases the power.
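
The calculations in parts (a), (b), (c), (e), and (f) all have the same form, so they can be done at once.  This is a small sketch; the helper name reject_prob and the labels narrow and wide are made up for this illustration.

```r
# Probability that X ~ Binomial(n, p) falls in the rejection region
reject_prob <- function(region, n, p) sum(dbinom(region, n, p))

narrow <- c(0, 1, 9, 10)         # rule from parts (a)-(c)
wide   <- c(0, 1, 2, 8, 9, 10)   # rule from parts (e)-(f)
p_true <- c(0.50, 0.55, 0.60)

# At p = .50 the value is the significance level; elsewhere it is the power
narrow_probs <- sapply(p_true, function(p) reject_prob(narrow, 10, p))
wide_probs   <- sapply(p_true, function(p) reject_prob(wide, 10, p))

print(rbind(p = p_true, narrow = narrow_probs, wide = wide_probs))
```

Reading across a row shows the power growing as p moves away from .50; reading down a column shows the wider region trading a higher significance level for higher power.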

4) A special managed care program was implemented at 9 randomly chosen hospitals in an HMO system.  At 7 of those hospitals, the average cost per patient decreased over the previous year.  Is this evidence that the managed care program saves money?
a) Let X represent the number of hospitals out of the 9 that spend less money per patient.   Before the study was conducted, this was a random variable.  According to the null hypothesis, what value should we assign to p = the probability that a hospital will spend less money per patient?
H0: p = .5

b) What's the alternative hypothesis?

Ha: p > .5

c) Let X be your test statistic.  What's the sampling distribution of X, according to the null hypothesis?
Binomial, with n = 9, p = .5

d) The observed value of X is 7.  What's the p-value?
P-value = P(X >= 7) =
> sum(dbinom(7:9,9,.5))
[1] 0.08984375

e) If you use a significance level of 5%, do you reject the null hypothesis?
No.
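
As a check, R's built-in binom.test function carries out the same exact binomial test.  The sketch below compares it with the direct calculation from part (d); the variable names are chosen here for illustration.

```r
# Direct calculation: P(X >= 7) when X ~ Binomial(9, .5)
p_direct <- sum(dbinom(7:9, 9, 0.5))

# Same test via binom.test, one-sided alternative p > .5
result <- binom.test(7, 9, p = 0.5, alternative = "greater")
p_builtin <- result$p.value

print(c(direct = p_direct, builtin = p_builtin))
```

Both numbers should agree with the .0898 found above.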

5) A study published in 1982 in the Journal of Epidemiology (Morton, et al.) examined the children of workers at a battery factory.  These workers were exposed to lead while at work, and there was concern that lead dust could be brought home and ingested by the workers' children.  Once ingested, lead gets into the bloodstream and the body cannot remove it.  Therefore, over time, the level of lead accumulates.  This is particularly dangerous in children, because excessive lead levels cause developmental problems.

The file lead.dat contains the blood lead levels for 33 children of workers at this factory. (Lead levels are measured in units of micrograms of lead per deciliter of blood.)  They are labeled "Exposed".  The other column, labeled "Control", consists of the blood lead levels for 33 "matched" children.  These were children of the same age, living in the same neighborhood, but whose parents did not work around lead.  So for example, the 4th child in the Exposed column is the same age and lives in the same neighborhood as the 4th child in the Control column.

a) Without looking at the data, what shape do you think the distribution of Exposed children's lead levels will have?  Explain why.
You're welcome to whatever thoughts you want.  You might think that the levels will be more-or-less normally distributed.  After all, a child is just as likely to be some amount below the average as above the average.  And in fact, this is exactly what the Control children's lead distribution looks like.  The Exposed children, however, show a very skewed distribution.   This is some evidence that something is going on here, since the shapes of the two distributions differ so radically.

b) Look at the distributions of the Exposed and Control lead scores.  What differences do you see in the shapes?  Do you see evidence that the Exposed children differ from the Control children?
See above.

c) Most toxicologists believe that lead levels over 50 micrograms per deciliter require medical treatment, and levels over 60 require immediate hospitalization.  How does this information affect your comparison of the two groups?
2 children in the Exposed group should be in the hospital.  No one in the Control group is nearly this sick.  This says that the levels in the Exposed group are dangerously high.
 
d) Create a new variable called "diff" that is equal to the Exposed levels minus the Control levels.  Describe the distribution of this variable.
> diff <- Exposed - Control
> hist(diff)

The differences are somewhat skewed right, and the bulk of them are positive, suggesting that most Exposed children had higher lead levels than their Control counterparts, sometimes considerably higher.


e) What does the variable diff tell you about the difference between Exposed and Control children?
In most cases, Exposed children have higher lead levels than children of same age living in same neighborhood.

f) Perform the appropriate hypothesis test to test whether the mean of diff is 0.  State the null and alternative hypotheses, and your conclusion.  Also state any assumptions that you made regarding the population and the sample.
The data are paired, and so the test we use is the paired t-test:
H0: mean difference = 0
Ha: mean difference > 0

T = (Avg(diff) - 0) / (SD(diff)/sqrt(n))
> mean(diff)/(sd(diff)/sqrt(length(diff)))
[1] 5.782966

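p-value = P(T > 5.783), where T has a t-distribution with n - 1 = 32 degrees of freedom:
> 1 - pt(5.783, 32)
[1] 1.018113e-06

The p-value is considerably less than .05, suggesting that this is a remarkable outcome if, indeed, the mean difference is 0.  We therefore reject the null hypothesis and conclude that the mean difference is higher than can be explained by chance variation.  (The t-test assumes that the pairs are a random sample and that the differences are roughly normal, or that n = 33 is large enough for the sample mean to be approximately normal.)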
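
The same test can be run with R's built-in t.test function.  Because lead.dat is not reproduced in this document, the sketch below uses a few MADE-UP lead levels (the vectors exposed and control are hypothetical, not the study data) just to show that the hand calculation and t.test agree.

```r
# HYPOTHETICAL paired measurements, for illustration only
exposed <- c(30, 25, 41, 18, 37, 36, 23, 45)
control <- c(16, 18, 20, 24, 19, 11, 10, 15)

# Hand calculation, as in the solution above
d <- exposed - control
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))
p_manual <- pt(t_manual, df = length(d) - 1, lower.tail = FALSE)

# Built-in paired t-test with the one-sided alternative (mean difference > 0)
result <- t.test(exposed, control, paired = TRUE, alternative = "greater")

print(c(t_manual = t_manual, t_builtin = unname(result$statistic)))
print(c(p_manual = p_manual, p_builtin = result$p.value))
```

With the real data, the call would be t.test(Exposed, Control, paired = TRUE, alternative = "greater").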

g) Why did the experimenters choose matches from the same neighborhood?  Of the same age?
They were controlling for two confounding factors.  Lead might be in the environment (in the dirt in the yard), and this would lead children to have higher than normal lead levels.  Also, coincidentally, these homes might be closer to the factory, which means that children of the factory workers might have high lead levels not because of what their parents do at work, but because of where they live.  Similarly, lead stays in the bloodstream and therefore accumulates over time.  So children of the same age should have about the same amount of lead in their blood.  If the factory workers' children were all of a similar age (maybe because of when hiring was done or when the factory was opened), we might expect their lead levels to be similar.