Statistics 50 Lecture 16

Inferences for Counts and Proportions

A. Recall: binary variables

Usual setup
Think of a sample as n independent trials. Each trial results in either a "success" or a "failure", and the chance of a success on each trial is p (the parameter). The observed proportion of successes in the sample is called p-hat (the statistic, the sample proportion).
In Chapter 7, we make inferences about p based on the sampling distribution of the sample proportion, p-hat.
Recall (from Chapter 4)
The sample proportion p-hat is an unbiased estimator of the population proportion p.
And the standard deviation of p-hat is sqrt( p (1-p) / n).
And for large samples, the distribution of p-hat will be approximately normal.
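These facts can be checked with a small simulation. The sketch below is only an illustration: the values p = 0.3 and n = 200, the number of repetitions, and the function names are assumptions, not part of the lecture.

```python
import math
import random
import statistics

def sd_of_phat(p, n):
    """Theoretical standard deviation of p-hat: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def simulate_phats(p, n, reps, seed=0):
    """Draw `reps` samples of n independent trials; return the p-hat of each."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

# Illustrative values (assumptions, not from the lecture).
p, n = 0.3, 200
phats = simulate_phats(p, n, reps=5000)

print(statistics.mean(phats))    # should be close to p (unbiasedness)
print(statistics.pstdev(phats))  # should be close to the theoretical SD
print(sd_of_phat(p, n))
```

A histogram of `phats` would also look roughly bell-shaped, which is the large-sample normality claim.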

B. One population proportion

Method
Suppose a simple random sample of size n is drawn from a large population in which there are a proportion p of successes. Let p-hat be the observed proportion of successes in the sample.
Then, if the true p is unknown, the approximate standard error of p-hat is sqrt( p-hat x (1 - p-hat) / n), and this can be used for confidence intervals.
On the other hand, if the true p is given (such as by a null hypothesis), use the true p to compute the standard deviation.
Example - Confidence Interval
Suppose a simple random sample of 100 students was drawn from UCLA. In this sample, 37 were women. Find a 95% confidence interval for the percent of women at UCLA.
The estimated percentage of women at UCLA is 37%; the estimated standard error is sqrt(.37 x .63 / 100) x 100% = 4.83%. For a 95% confidence interval, z = 1.96. Thus, the 95% interval is 37% +/- (1.96)(4.83)% = (27.5%, 46.5%).
Note the conditions: the data are from an SRS; the population (UCLA students) is at least 10 times as large as the sample; and n*p-hat = 37 and n*(1 - p-hat) = 63 are both larger than 10.
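The confidence interval calculation can be sketched in a few lines of Python. The function name `proportion_ci` is made up for illustration; the numbers are those of the UCLA example.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Large-sample confidence interval for a proportion: p-hat +/- z * SE."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    margin = z * se
    return p_hat - margin, p_hat + margin

# UCLA example: 37 women in a simple random sample of 100 students.
low, high = proportion_ci(37, 100)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")  # 27.5% to 46.5%
```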
Example - Hypothesis test
A researcher wants to test the hypothesis that 50% of UCLA students are women. What is the resulting p-value?
Here, the null hypothesis is that 50% of UCLA students are women (and that the difference between 37% and 50% is due to chance). If the null were true, the expected percentage would be 50%, and the standard error would be sqrt(.50 x .50 / 100) x 100% = 5%. Then z = (37 - 50)/5 = -2.6, and the one-sided p-value (the area below -2.6 under the normal curve) is about 1/2 of 1 percent; the two-sided p-value is about 1 percent.
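The same test can be sketched in Python using the standard library's normal distribution. The function name `one_proportion_z` is an illustrative choice; note that the standard deviation uses the null value 0.50, not p-hat.

```python
import math
from statistics import NormalDist

def one_proportion_z(p_hat, p0, n):
    """z statistic for testing p = p0; the SD is computed from the null value p0."""
    sd = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / sd

# UCLA example: observed 37% in a sample of 100, null value 50%.
z = one_proportion_z(0.37, 0.50, 100)
lower_tail = NormalDist().cdf(z)            # one-sided p-value (area below z)
two_sided = 2 * NormalDist().cdf(-abs(z))   # two-sided p-value

print(round(z, 2))           # -2.6
print(round(lower_tail, 4))  # about 0.0047
print(round(two_sided, 4))   # about 0.0093
```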
Choosing the sample size
Formula : n = ((z*/m)^2) * p*(1-p*)
p* is a guessed value for the sample proportion. m is the margin of error expressed as a proportion (not a percentage).
Note, smaller margins of error require bigger samples.
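As a rough sketch, the formula can be coded directly. The margins of 3 and 1.5 percentage points are illustrative assumptions, as is the default guess p* = 0.5, which is the conservative choice because p*(1-p*) is largest there.

```python
import math

def sample_size(margin, p_guess=0.5, z=1.96):
    """n = (z/m)^2 * p*(1-p*), rounded up to a whole number of observations."""
    n = (z / margin) ** 2 * p_guess * (1 - p_guess)
    return math.ceil(n)

# Illustrative margins of error (assumptions, not from the lecture).
print(sample_size(0.03))   # 1068
print(sample_size(0.015))  # 4269
```

Halving the margin of error roughly quadruples the required sample size, since m enters the formula squared.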
Summary
a. Need SRS
b. Population >> sample
c. p-hat will be almost normally distributed when the sample is large
d. confidence interval: p-hat +/- z*SE
e. significance test: use z and the SD computed from the null value of p.

C. Two population proportions

Recall
In two sample problems, you are comparing two independent populations or two responses based on two independent samples.
Notation (see top of page 500)
Method
Suppose a simple random sample of size n₁ is drawn from a population having a proportion p₁ of successes. Let p₁-hat be the proportion of successes in sample 1. Suppose an independent simple random sample of size n₂ is drawn from a population having a proportion p₂ of successes. Let p₂-hat be the proportion of successes in sample 2.
Then, if the true p's are unknown, the standard error for the difference between p₁-hat and p₂-hat can be estimated by sqrt( p₁-hat x (1 - p₁-hat) / n₁ + p₂-hat x (1 - p₂-hat) / n₂). This is useful for finding confidence intervals.
Example -- Two Sample Confidence interval
In a simple random sample of 100 Democrats, 56% favored increased taxes, while in an independent simple random sample of 150 Republicans, only 41.3% favored new taxes. Find a 95% confidence interval for the difference. Does it look likely that the support rates are the same?
For the confidence interval, the observed difference is 56% - 41.3% = 14.7%, and the estimated standard error is sqrt( 0.56 x 0.44 / 100 + 0.413 x 0.587 / 150 ) x 100% = 6.4%.
Thus a 95% confidence interval for the difference is 14.7% +/- (1.96)(6.4)% = (2.2%, 27.2%). Because the interval does not include 0, it looks like there is a difference.
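The two-sample interval can be sketched the same way. The function name `two_proportion_ci` is made up for illustration; the inputs are the sample proportions and sizes from the Democrats/Republicans example.

```python
import math

def two_proportion_ci(p1_hat, n1, p2_hat, n2, z=1.96):
    """Large-sample CI for p1 - p2; the two variances add under the square root."""
    diff = p1_hat - p2_hat
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return diff - z * se, diff + z * se

# 56% of 100 Democrats versus 41.3% of 150 Republicans.
low, high = two_proportion_ci(0.56, 100, 0.413, 150)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")

# An interval that excludes 0 suggests the two support rates differ.
print(low > 0)
```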
Summary
a. Need two independent SRS
b. p₁-hat minus p₂-hat will be almost normally distributed when the samples are large.
c. The variances add; the standard deviations of the two samples do not.
d. confidence interval: (p₁-hat - p₂-hat) +/- z*SE
Example -- Two Sample Test of Significance
For a test of significance, you are simply looking at the difference in the population proportions. The null hypothesis is that there is really no difference (p₁ = p₂); the alternative is that there is a difference.
Standardize, using a z statistic for a difference in two proportions. The standard error is somewhat different here: under the null the two populations share a single proportion p, so you pool the samples to get one estimate of it. The pooled sample proportion is:
```
   p-hat = count of successes in both samples combined
           ___________________________________________
           count of observations in both samples combined
```
The z-statistic is then the difference in the two sample proportions divided by a standard error constructed from the pooled sample proportion: SE = sqrt( p-hat x (1 - p-hat) x (1/n₁ + 1/n₂) ).
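The lecture does not work a numerical example here, so the sketch below reuses the Democrats/Republicans data as an assumption. The count of 62 Republicans is inferred from 41.3% of 150, and the function name `two_proportion_z` is illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z test of p1 = p2; returns (z, two-sided p-value)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # successes combined / observations combined
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

# Assumed counts: 56 of 100 Democrats, 62 of 150 Republicans (about 41.3%).
z, p_value = two_proportion_z(56, 100, 62, 150)
print(round(z, 2), round(p_value, 3))
```

The pooled estimate is used only for the test, where the null says the two proportions are equal; the confidence interval keeps the two sample proportions separate.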

Last Update: 25 November 1996 by VXL