Outline

1. CI for proportion
a) exact (later)
b) approximate
2. CI for difference of means of two independent groups
3. CI for difference of two proportions
4. CI for difference of paired data (time permitting)

CI for Proportion

Setting: A large population of objects which, when measured, produce a 1 or 0. We take a random sample, with replacement, of n objects, and count X - the number of 1’s.

Model: X is binomial. E(X) = np where p is the proportion of 1’s in the population. Var(X) = np(1-p).

Goal: estimate p.

Estimator: phat = X/n

Note: E(phat) = p, Var(phat) = p(1-p)/n

What’s sampling distribution of phat?

If n is sufficiently large (np > 10 and n(1-p) > 10), then phat is approximately N(p, SD(phat)), where SD(phat) = sqrt(p(1-p)/n).

The problem with this is that we don’t know p, and therefore can’t compute SD(phat). If we did know p, the CI would be
phat +- z(a/2)sqrt(p(1-p)/n)

The solution is to add an approximation to the approximation: replace p everywhere with phat, giving
phat +- z(a/2)sqrt(phat(1-phat)/n)

Example:
A survey of 3000 people finds that 62% prefer white bread to wheat. Estimate a 95% CI for the proportion of people in the population who prefer white bread to wheat.

x = 1860. phat = .62
alpha = .05, z(a/2) = 1.96 (qnorm(.975)).
.62 +- 1.96*sqrt(.62*.38/3000)
.62 +- .017
So CI is roughly 60% to 64%.
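The computation above can be checked in R. Here is a minimal sketch; the function name prop_ci is ours, not a built-in:

```r
# Approximate (1 - alpha) CI for a proportion, using the plug-in
# standard error sqrt(phat * (1 - phat) / n).
prop_ci <- function(x, n, conf.level = 0.95) {
  phat <- x / n
  z <- qnorm(1 - (1 - conf.level) / 2)
  se <- sqrt(phat * (1 - phat) / n)
  c(lower = phat - z * se, upper = phat + z * se)
}

prop_ci(1860, 3000)  # the bread survey: roughly 0.603 to 0.637
```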

Example:
A politician wants to know if he’ll win tomorrow’s election. A survey of 1000 people says that 510 will vote for him. Assume he needs more than 50% to win. What’s a 95% CI?

phat = .51
.51 +- 1.96*sqrt(.51*.49/1000)
.51 +- .031
(roughly 48% to 54%). Since the interval contains 50%, the poll can’t tell him whether he’ll win.

We can also calculate an exact CI, particularly with the use of a computer. We’ll leave that until Friday.
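As a preview, R’s binom.test computes the exact (Clopper-Pearson) interval. A quick sketch on the election poll:

```r
# Exact 95% CI for the election poll: 510 successes out of 1000.
binom.test(510, 1000, conf.level = 0.95)$conf.int
# Roughly 0.48 to 0.54 -- very close to the normal approximation.
```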

Note: this is not what real surveys do. They sample without replacement, and often have more complex sampling schemes. But it’s a fairly good approximation.

II. Comparing two groups

Setting: We have two very large groups of objects. The two groups might have different mean values. We take a random sample from each group and measure our samples. We wish to compare the samples to see whether we learn anything about differences between the groups.

Example: Do SAT I scores differ between Math and Engineering students? Do cholesterol levels differ between those getting a placebo and those getting the treatment? Etc.

Model: X1, ..., Xn are N(mu1, sd1). Y1, ..., Ym are N(mu2, sd2).
All observations independent.
We allow sd1 and sd2 to differ. (See book.)

Goal: Obtain a CI for mu1 - mu2.

Estimator: Xbar - Ybar.

Note that
a) E(Xbar - Ybar) = mu1 - mu2
b) Var(Xbar - Ybar) = sd1^2/n + sd2^2/m

We base the interval on the statistic

T = ((Xbar - Ybar) - (mu1 - mu2)) / SD(Xbar - Ybar)
Problem: we don’t know sd1 or sd2, so we estimate them with s1 and s2.

This makes T a t-random variable. But what are the degrees of freedom?

It turns out this is a hard question to answer. An approximation is found on p. 264. It’s messy to do by hand but, of course, easy enough on a computer.
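That approximation (the Welch-Satterthwaite formula, which is what R’s t.test uses) takes only a few lines to code. A sketch; the function name welch_df is ours:

```r
# Welch-Satterthwaite approximate degrees of freedom for the
# two-sample t statistic when the variances may be unequal.
welch_df <- function(s1, n, s2, m) {
  a <- s1^2 / n   # estimated Var(Xbar)
  b <- s2^2 / m   # estimated Var(Ybar)
  (a + b)^2 / (a^2 / (n - 1) + b^2 / (m - 1))
}

# Equal sds and equal sample sizes recover the usual n + m - 2:
welch_df(1, 10, 1, 10)  # 18
```

Unequal sds pull the df below n + m - 2, which is why the t.test output below reports a fractional df.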

Often, people make a simplifying assumption: assume sd1 = sd2. In that case, you can derive an exact confidence interval, without approximations. But how do you know sd1 = sd2?

Common practice now is to assume they do NOT, and use this bizarre approximation.

R makes this easier.
x <- rnorm(10, 5, 1)
y <- rnorm(10, 5.5, 1.5)
t.test(x, y, alternative = "two.sided", conf.level = .95)

Welch Two Sample t-test

data: x and y
t = -1.1115, df = 15.586, p-value = 0.2832
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.6838414 0.5271344
sample estimates:
mean of x mean of y
5.295828 5.874181

We could also do
t.test(x, y, alternative = "two.sided", var.equal = TRUE)
which assumes sd1 = sd2 and gives the exact, pooled-variance interval.


III. Comparing two proportions
Again, there are exact methods, but here we’ll do an approximation.

Setting: two large groups, random samples of size n and m from each, and for each sample we record the number of “successes”.

We therefore have two binomial RVs: X with E(X) = n*p1, and Y with E(Y) = m*p2.
p1hat = X/n, p2hat = Y/m

p1hat - p2hat is approx. N(p1 - p2, sqrt(p1(1-p1)/n + p2(1-p2)/m))
and if we replace p1 and p2 with their estimates, the approximation just becomes a little more approximate, and
p1hat - p2hat +- z(a/2) sqrt(p1hat(1-p1hat)/n + p2hat(1-p2hat)/m) is the 100(1-a)% CI.

Example: A poll of 100 Democrats shows that 47% support Bush’s budget, while a poll of 110 Republicans shows that 55% support it. Find an approximate 95% CI for the difference. Do you think there really is a difference in the population?

SD = sqrt(.47*.53/100 + .55*.45/110) = 0.0689

-.08 +- 1.96*0.0689 = -.08 +- .135
-0.21 to 0.06
So the level of support might be the same in both groups.
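The arithmetic above can be checked in R. A sketch; the function name two_prop_ci is ours:

```r
# Approximate 95% CI for p1 - p2, plugging in the sample proportions.
two_prop_ci <- function(p1, n, p2, m, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  se <- sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / m)
  d <- p1 - p2
  c(lower = d - z * se, upper = d + z * se)
}

two_prop_ci(0.47, 100, 0.55, 110)  # about -0.215 to 0.055; covers 0
```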

IV. Paired Data
Consider this study. 15 patients have their blood pressure measured, are given a drug, and then have it measured again. We want to estimate the mean change. Why do we need this?

There’s great variability in the data. Blood pressure is measured inaccurately, and in addition there is a wide variety of responses in different people to drugs.

So we have two populations: one consists of all possible blood pressure readings on the 15 people before the drug. We have 15 random variables measuring this: X1, ..., X15.
The other represents all possible blood pressure readings after the drug, and we see Y1, ..., Y15.

Suppose E(X) = mu1 and E(Y) = mu2. We would hope, if the drug works, that mu2 < mu1, so mu1 - mu2 > 0.

How can we find a confidence interval to estimate mu1 - mu2?

There’s one difference between this and setting (2): here the two groups are NOT independent, because the same people provide both measurements. So if person 3 has high blood pressure before, that gives you some insight into what to expect after.

Data like this are called “paired” because each subject/object produces a pair of observations. To get the most out of this analysis, we need to treat them as pairs.

The solution is to examine not the two variables, but their differences:
Di = Xi - Yi
If the Xi and Yi are normal, then
Di is N(mu1 - mu2, sd_D)
for some standard deviation sd_D.

So we’re back to our original one-group problem: find a 95% CI for mu1 - mu2 using the single sample D1, D2, ..., Dn.
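In R, the paired analysis is t.test with paired = TRUE, which is the same as a one-sample t on the differences. A sketch on simulated blood-pressure data (the numbers are made up for illustration):

```r
set.seed(1)

# Simulated paired readings for 15 patients:
before <- rnorm(15, 140, 10)
after  <- before - 5 + rnorm(15, 0, 3)  # drug lowers pressure by about 5

# Two equivalent ways to get the 95% CI for mu1 - mu2:
t.test(before, after, paired = TRUE)$conf.int
t.test(before - after)$conf.int          # one-sample t on the differences
```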