© 2004, S. D. Cochran. All rights reserved.

ESTIMATING CHANCE ERROR IN SAMPLE SURVEYS

  1. We have already learned methods of estimating chance error
  1. In a sample of observations, the standard deviation gives an estimate of the variation among the values. When we pair our estimates of center and spread, via z-scores, with the normal curve as the probability density distribution for the population, we can estimate how far to the left and right of the mean we have to go to contain a given percentage of the sample. The sample of observations is a known, observed distribution.

  2. We have also seen that we can calculate estimates of spread in hypothetical distributions of discrete random variables. Here we found that the accuracy of the estimate from our anticipated sampling distribution is a function of the square root of the sample size: the bigger the sample, theoretically, the more accurate our estimate and the closer we get to the true value. This is a distribution of chance error. We don't actually observe it; we create it from our knowledge of the probability of the event occurring. (A quick simulation below illustrates the square root law.)
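As a rough illustration (a sketch, not part of the original notes), the short simulation below draws repeated samples from a 50-50 zero-one box and compares the observed spread of the sample percentages with the square root law; the sample sizes and number of repetitions are arbitrary choices:

```python
# Simulate the chance-error distribution of a sample percentage and check that
# its spread shrinks like 1/sqrt(n): quadrupling n should halve the spread.
import random
import statistics

def sample_percentage(n, p=0.5):
    """Draw n tickets from a 0-1 box with chance p of a 1; return the percent of 1's."""
    return 100 * sum(random.random() < p for _ in range(n)) / n

for n in (100, 400, 1600, 6400):
    percents = [sample_percentage(n) for _ in range(2000)]
    simulated = statistics.pstdev(percents)
    theory = (0.5 / n ** 0.5) * 100   # SE for the percentage with a 50-50 box
    print(f"n = {n:5d}   simulated spread = {simulated:4.2f}%   theory = {theory:4.2f}%")
```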

  1. Remember, also, that the point of doing a survey is to get a statistic which we will use to estimate a population parameter.

statistic value = parameter value + chance error + bias

  1. We use sampling procedures and measurement procedures to minimize bias, thus attempting to remove it from this equation.

  2. We can minimize chance error by changing our sample size

  1. Dealing with chance error is a balancing act between cost and precision
  1. Each element (subject) that provides a data point costs money, time, and effort.
  1. In an ideal world, this is irrelevant--right--we want to know the truth
  2. In the real world, this acts as a limiting factor in survey research.
  1. Precision is enhanced by an increase in sample size, but only up to a point--only about 2,500 people sampled by SRS are needed to estimate a population parameter with 1% chance error, no matter how large the population. Why? The largest standard deviation possible in a two-outcome box occurs when there is a 50-50 chance of drawing either outcome. The SD for that box is 0.5. The SE for the percentage of that box, as we are about to learn, is:

SE for the percentage = (SD of box / √n) × 100% = (0.5 / √2500) × 100% = 1%

We figure any estimate that is accurate to within 1% is good enough. (The sketch below works through this worst-case calculation.)
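A minimal sketch of that worst-case arithmetic, assuming a 50-50 box (SD = 0.5), which gives the largest possible SE for the percentage; the helper name se_percentage is just for this illustration:

```python
# Worst-case chance error for a simple random sample: a 50-50 box has SD = 0.5,
# the largest SD a two-outcome (0-1) box can have.
def se_percentage(sd_of_box, n):
    """SE for the percentage, as a percent: (SD of box / sqrt(n)) * 100%."""
    return (sd_of_box / n ** 0.5) * 100

print(se_percentage(0.5, 2500))   # 1.0  -> about 1% chance error, however big the population
print(se_percentage(0.5, 100))    # 5.0
print(se_percentage(0.5, 10000))  # 0.5  -> quadrupling n only halves the SE
```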

  1. To justify cost, we have to specify how precise we need to be

  2. In some situations, only a gross estimate is needed.

  3. In other situations, we need to know the estimate with great accuracy

  4. Example: If we wanted to conduct a survey to find out how much tuition the Regents could charge before more than half of you transfer elsewhere, and let's say the true number is $5,011.73, then whether we come up with an estimate of $5,010 or $5,050 would probably be just as good at predicting your behavior as a more precise figure. In other words, there is some limit to how precise we need to be. On the other hand, if we want to estimate the winner of a close election where the true difference in preference between the two candidates is less than a percentage point, then being less accurate than that is too imprecise.

  5. These decisions require judgment calls

  1. Surveys start with estimating the desired precision (or amount of chance error) and, from this, estimate the needed sample size for the study.
  1. Calculating the SE for a percentage

    To calculate the standard error of a percentage, we need to go through two steps:

  1. Compute the SE for the box

  2. Calculate the SE for the percentage = (SE for the box / sample size) × 100%

Example: Let's say we wanted to know if UCLA students are against a tuition increase. We could poll each and every student, but that would be very expensive and time-consuming. Instead we decide to draw a random sample of 100 students from campus and estimate whether they are for or against a tuition increase. Later we intend to use that estimate to make a statement about what UCLA students in general believe. We know from a previous survey at Berkeley that 80% of students are against a tuition increase. What do we think the SE for the percentage we find in our sample will be, if UCLA students are as likely as Berkeley students to be against it?

First we make the anticipated box based on our estimate of population parameter from the Berkeley data:

80 1's  and 20 0's

The SD of the box = √(0.8 × 0.2) = 0.4

The SE of the box = √n × SD = √100 × 0.4 = 4

Next, we calculate what percent 4 is of 100: (4/100) × 100% = 4%

So, the SE for the percentage is 4%.

We could also do this with the formal equation:

SE for the percentage = (√(p × q) / √n) × 100%

Notice that √(p × q) is the SD of the box, so this is the same two-step calculation written in one line. (A short sketch of both versions follows below.)
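Here is a short Python sketch of both the two-step calculation and the formal equation, assuming the 80-20 box and the simple random sample of n = 100 from the example above:

```python
# SE for the percentage of students against a tuition increase,
# assuming an 80-20 box (p = 0.8 against, q = 0.2 for) and n = 100 draws.
n = 100
p, q = 0.8, 0.2

# Step 1: the SE for the box (expected chance error in the sum of draws)
sd_of_box = (p * q) ** 0.5           # 0.4
se_of_box = (n ** 0.5) * sd_of_box   # 4.0

# Step 2: express that as a percentage of the sample size
se_percent = se_of_box / n * 100     # 4.0%

# The formal equation in one line: SE% = (sqrt(p*q) / sqrt(n)) * 100%
se_percent_formula = (p * q) ** 0.5 / n ** 0.5 * 100

print(round(se_percent, 2), round(se_percent_formula, 2))  # 4.0 4.0
```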

  1. What this says is that if we survey 100 students and 80% of students are in truth against the increase, we have a 68% chance of finding between 76% and 84% against the tuition increase in our particular sample, and 95% of the time we will find between 72% and 88% against.

  2. But let's rethink the question. What if the Regents said, "We are going through with the fee increase unless 75% or more of students are against it."

Are you confident enough in the design? Using the normal approximation, you have about a 16% chance of finding 76% or fewer against, and roughly an 11% chance of finding 75% or fewer against the tuition increase, even though the true value is 80%, as best we can estimate it from the Berkeley data. Is that a risk you are willing to take? If not, increase your sample size until the risk drops to a level you can tolerate. You can never be absolutely certain, but you can be more certain than you are. (The sketch below works through this risk calculation.)
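A rough sketch of that risk calculation under the normal approximation, assuming the true value is 80% and the SE for the percentage is 4% as computed above; the normal_cdf helper built on math.erf is just a convenience for this illustration:

```python
# Chance that a sample of 100 shows 75% or less against, when the truth is 80%
# and the SE for the percentage is 4%, using the normal approximation.
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

true_pct, se_pct, threshold = 80, 4, 75

z = (threshold - true_pct) / se_pct                     # -1.25
print(f"risk with n = 100: about {normal_cdf(z):.0%}")  # about 11%

# Doubling the sample size to n = 200 shrinks the SE by a factor of sqrt(2)
se_200 = se_pct / sqrt(2)
print(f"risk with n = 200: about {normal_cdf((threshold - true_pct) / se_200):.0%}")  # about 4%
```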

  1. Correcting for the size of the sample
  1. Simple random sampling is sampling without replacement. For example, if four random numbers are drawn to select 4 subjects from a population of twenty, we really don't select the four numbers independently at random--we select 4 without replacement.

  2. If the number of draws is small in relation to the population, then this has very little effect on the selection probabilities. What is the difference between choosing 1 in 10,000 vs. 1 in 9,999?

  3. But technically, there is a difference and we can correct for it.

 

The correction factor is:

correction factor = √((N − n) / (N − 1))

where N is the size of the population and n is the number of draws.

If you sample with replacement, no correction is needed, and with only a single draw the correction factor equals exactly 1. With each additional draw taken without replacement there is a small penalty, but if the population size is large relative to the number of draws, the penalty is trivial. (A short sketch of the correction factor follows below.)
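A small sketch of the correction factor, assuming the formula above with N the population size and n the number of draws:

```python
# Finite population correction applied to the with-replacement SE when
# drawing without replacement: sqrt((N - n) / (N - 1)).
def correction_factor(N, n):
    """Multiply the with-replacement SE by this factor when sampling without replacement."""
    return ((N - n) / (N - 1)) ** 0.5

print(correction_factor(10_000, 1))    # 1.0    one draw: no penalty
print(correction_factor(10_000, 100))  # ~0.995 trivial penalty when the population is large
print(correction_factor(20, 4))        # ~0.918 noticeable penalty when the population is small
```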