© 2004, S. D. Cochran. All rights reserved.

CONFIDENCE INTERVALS

  1. There is another important use that we can make of viewing observations in terms of
  1. Their use as estimates for population parameters or truth

  2. Their centers and spreads and our sense of uncertainty about the value

  1. Most of the time when we collect observations or data, we do so because we want to use the data to infer something about the target population. That is, we want to do more than describe our sample (sometimes called descriptive statistics); we want to make inferences about truth (called inferential statistics).

Example: In the last lecture when we conducted a survey of 100 UCLA students to find out their opinions about a planned tuition increase, the truth is we didn't really care about what these 100 people thought. We weren't going to go back to the Regents and say, "100 people at UCLA are against a tuition increase." No. We wanted to poll these 100 people as representatives of UC students in general. We wanted to use their 100 answers to say something about what over 100,000 UC students feel.

  1. The law of large numbers allows us to predict that, with reasonably large samples, the mean of what we observe will approximate the population mean.

  2. However, we also know that any statistic we generate, even if there is no bias, has some amount of chance error.

  3. With confidence intervals, we use the mean of our observed distribution and the estimated SE to generate a range of values that we estimate will contain the population mean. Even here we can be wrong simply due to chance, and so we decide how right we need to be:

  1. A 95% CI means that 95 out of 100 times we will be correct in creating an interval that includes the population parameter within the interval.

  2. A 99% CI means we will be correct 99% of the time in creating an interval that includes the population parameter.

  3. The interval we choose depends upon how certain we need to be. A 99% CI will be wider (include more possible values) than a 95% CI, reflecting our desire to be more certain that the range we report includes the parameter of interest, with less chance that we have made a mistake.
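The trade-off between confidence and width can be sketched in a few lines of Python. This is only an illustration: the estimate of 50% and the SE of 5% are made-up numbers, not from the lecture, and the helper names `z_multiplier` and `interval` are hypothetical. The multiplier comes from the normal curve.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided normal-curve multiplier for a confidence level such as 0.95."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

def interval(estimate, se, confidence):
    """Return (lower, upper) as estimate plus or minus z times SE."""
    z = z_multiplier(confidence)
    return estimate - z * se, estimate + z * se

# Made-up numbers for illustration only: an estimate of 50% with an SE of 5%.
estimate, se = 50.0, 5.0
ci95 = interval(estimate, se, 0.95)
ci99 = interval(estimate, se, 0.99)
width95 = ci95[1] - ci95[0]
width99 = ci99[1] - ci99[0]

print("95% CI:", ci95, "width:", width95)
print("99% CI:", ci99, "width:", width99)
```

With the same estimate and SE, the 99% interval always comes out wider, because its multiplier is larger.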

  1. A CI uses 4 concepts. The population mean, µ, from the population distribution, our statistics, the mean (that we observe in our sample) and the SE (that we estimate from our observed SD--but the SE is a parameter in the nonobserved sampling distribution of the means), and z-scores.

Example: Remember our random sample of 100 students from this campus that we were going to draw to estimate whether they were for or against a tuition increase? We knew from a previous survey of 150 students at Berkeley that 80% were against a tuition hike. And let's say magically we know the true population parameter for all UC students is 85% and that UCLA and Berkeley do not differ. The Berkeley observation of 80% in their sample is an estimate of the population parameter. Let's say we now go out and collect the data. That is, we now have real observations and we find that 82% of UCLA students we survey are against a tuition increase. We use the same equations we've already learned to calculate. What will our SE for the percentage we find in our sample be?

First we make the box:

                                        82…1's                     18…0's

The SD is square root (.82 X .18) = 0.38.

The SE of the box = square root (n or 100) X 0.38 = 3.8.

Next, we calculate the % that 3.8 is of 100 = 3.8%

So, the SE for the percentage is 3.8%.
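The box calculation above can be checked with a short Python sketch. The variable names are illustrative assumptions; the numbers are the ones from the UCLA example.

```python
import math

n = 100     # number of draws (students surveyed)
p = 0.82    # fraction of 1's in the box (against the increase)

# SD of a 0-1 box: square root of (fraction of 1's times fraction of 0's)
sd_box = math.sqrt(p * (1 - p))

# SE for the number of 1's among n draws: square root of n times the SD
se_sum = math.sqrt(n) * sd_box

# Express that SE as a percentage of the n draws
se_percent = se_sum / n * 100

print(round(sd_box, 2), round(se_sum, 1), round(se_percent, 1))
```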

We could also do this with the formal equation:

SE for the percentage = square root ((.82 X .18)/100) X 100% = 3.8%

We can use our statistics, mean and S.E., and now z-scores to say something about how often we think our mean and an interval around it will cover the population mean.

If we want a 95% CI, it is

82% ± 1.96 (3.8%), or we can be 95% certain that we have created an interval between 74.6% and 89.4% that contains the population mean.

For the Berkeley sample, the S.E. = square root ((.2 X .8)/150) = 3.3%, and the 95% CI is:

80% ± 1.96 (3.3%), or 95% of the time we would be correct that if we drew a sample of 150 students from the population of UC students and observed that 80% were against a tuition increase, the interval between 73.5% and 86.5% would cover the population mean.
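Both intervals can be reproduced with one small function. This is a sketch, not a definitive method: the name `percent_ci` is hypothetical, and the SE is rounded to one decimal place before multiplying only so that the arithmetic follows the hand calculation in these notes.

```python
import math

def percent_ci(percent, n, z=1.96):
    """Interval for a sample percentage: percent plus or minus z times SE.

    The SE is rounded to one decimal first, mirroring the hand calculation.
    """
    p = percent / 100
    se = round(math.sqrt(p * (1 - p) / n) * 100, 1)
    return percent - z * se, percent + z * se

ucla_low, ucla_high = percent_ci(82, 100)   # UCLA sample: 82% of 100
ucb_low, ucb_high = percent_ci(80, 150)     # Berkeley sample: 80% of 150

mu = 85  # the "magically known" population percentage from the example
for name, low, high in [("UCLA", ucla_low, ucla_high), ("UCB", ucb_low, ucb_high)]:
    print(name, round(low, 1), round(high, 1), "covers mu:", low <= mu <= high)
```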

To draw it:

%       70        75        80        85        90

µ                                     |

UCLA             |-----------------------------|

UCB            |-------------------------|

Notice that in this case, as we would expect in 95 out of 100 cases, the intervals do cover the population parameter. Notice also that they are centered around the mean of the individual sample, not µ.

Be careful. We are not saying that 95% of the time the population parameter is between the lower and upper bounds. The population parameter is where it is. That's fixed. The chance lies in our behavior: 95% of the time we create an interval that includes it.
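One way to see that the chance lies in our behavior is to simulate it: keep the population parameter fixed at 85%, draw many samples of 100, build an interval from each, and count how often the interval covers 85%. The sketch below assumes simple random draws; the function name `ci_covers`, the seed, and the number of trials are arbitrary choices. The share that covers should come out roughly at 95%, though possibly a little below, since the normal approximation is not exact for samples of this size.

```python
import math
import random

def ci_covers(true_p, n, rng, z=1.96):
    """Draw one sample of size n and report whether its interval covers true_p."""
    against = sum(1 for _ in range(n) if rng.random() < true_p)
    p_hat = against / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se <= true_p <= p_hat + z * se

rng = random.Random(2004)   # fixed seed so the run is repeatable
trials = 2000               # number of repeated surveys (arbitrary)
hits = sum(ci_covers(0.85, 100, rng) for _ in range(trials))
coverage = hits / trials

print(f"{coverage:.1%} of the intervals covered the true 85%")
```

The parameter never moves in this simulation. Only the intervals change from sample to sample.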

  1. The final point made in the chapter is given with very little detail because its complexity is far beyond this class. The purpose, I think, is that when you see SE estimates reported in the newspaper or elsewhere, you will have some warning that the true SE is larger than you might think.
  1. The formulas that we have learned are for calculating an SE when sampling has been done by SRS.

  2. However, most surveys that you'll see reported use cluster sampling. Cluster sampling requires different formulas for calculating the SE.

  1. The reason is that elements in a cluster are more alike than elements drawn at random. Therefore, each element provides less information than a single element under SRS, effectively reducing the sample size. A smaller sample size means a larger SE.

  2. Example: Let's say you surveyed homes in Thailand to find out if the home had indoor plumbing. To save time you surveyed every one of 60 inhabitants in a small village and every one of 60 inhabitants in another village. Well, if the first village has plumbing capabilities, probably everyone or nearly everyone has indoor plumbing. The inhabitants exist in a cluster (a village), and though 60 people are polled you have effectively learned less than if you polled 60 random people from throughout Thailand (who probably live in 60 different locations). Many of your subjects are redundant and provide no information. This is another way of saying they provide no information that reduces our uncertainty about the population parameter. So even though we have 60 subjects, we need to calculate an S.E. that reflects the truth that we only really have one village--our sample size is smaller than we think. So our S.E. will be bigger.
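The village example can be mimicked with a small simulation. Everything in it is a made-up assumption for illustration: 1,000 villages of 60 people each, about half the villages with plumbing, and the extreme case where everyone in a village has the same plumbing status. It compares how much the estimated percentage bounces around when we survey 2 whole villages (120 people) versus 120 people drawn at random.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

VILLAGES = 1000          # hypothetical number of villages
VILLAGE_SIZE = 60        # inhabitants per village

# Extreme assumption: plumbing status is shared by everyone in a village.
villages = [rng.random() < 0.5 for _ in range(VILLAGES)]
people = [status for status in villages for _ in range(VILLAGE_SIZE)]

def cluster_estimate(k=2):
    """Survey every inhabitant of k randomly chosen villages."""
    chosen = rng.sample(villages, k)
    return 100 * sum(chosen) / k   # each village counts 60 identical answers

def srs_estimate(n=120):
    """Survey n people drawn at random from the whole population."""
    chosen = rng.sample(people, n)
    return 100 * sum(chosen) / n

reps = 2000
se_cluster = statistics.pstdev([cluster_estimate() for _ in range(reps)])
se_srs = statistics.pstdev([srs_estimate() for _ in range(reps)])

print("spread of cluster estimates:", round(se_cluster, 1))
print("spread of SRS estimates:", round(se_srs, 1))
```

Both designs interview 120 people, but the cluster estimates vary far more from sample to sample, which is what a larger SE means.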