Statistics 10
Lecture 14


The Accuracy of Percentages

  1. Inference
    A. Overview

    In Chapters 17-20, the parameters are known, and we estimate the chance of various outcomes. In Chapter 21 we begin STATISTICAL INFERENCE, where the parameters are unknown and we use sample outcomes to draw conclusions about them.

    Statistical inference is related to probability as follows: we make assumptions about the parameters, and then test to see if those assumptions could have led to the sample outcome we observed. We then use statements of confidence to express the strength of our conclusions.

    In this chapter we will examine CONFIDENCE INTERVALS (21.2) for estimating the value of the population parameter. The confidence intervals are based on the sampling distribution of statistics from Chapter 20.1.

    B. Remark

    Remember, parameters, although unknown, are fixed (they do not change). It is the OUTCOME (the statistic from a sample) that is random. Randomness is an important prerequisite: your data must come either from a random sample or from a randomized experiment.

  2. Information from Samples (21.1)
    A sample is drawn and statistics are calculated from it. An example is the LA Times Poll taken prior to the election.

    883 people said they would probably vote. 468 or 53% said they would vote for Gray Davis.

    Up until now, we have had information on the "box", but in reality we almost never do, as in this election example.

    So what do statisticians do? They substitute sample fractions for parameter fractions.

    In other words, the composition of the "box" is unknown, so we use our "best guess" -- a sample which is drawn from the "box".

    From the sample, we can calculate an SD (the square root of .53 x .47, which is about .5) and then the SE. For the LA Times poll the standard error is square root(883) x .5, or about 15 voters, which is 1.7% (15/883) of the sample. They rounded it to about 2 percentage points.
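
    The arithmetic above can be checked with a short sketch. The variable names are illustrative and not from the text; the figures are the poll numbers quoted above.

```python
import math

n = 883              # likely voters in the sample
davis_count = 468    # of those, the number who said they would vote for Davis

p_hat = davis_count / n                      # sample fraction, about .53
sd_box = math.sqrt(p_hat * (1 - p_hat))      # SD of the 0-1 box, estimated from the sample
se_count = math.sqrt(n) * sd_box             # SE for the number of Davis voters
se_percent = se_count / n * 100              # SE as a percentage of the sample

print(round(sd_box, 2), round(se_count), round(se_percent, 1))
```

    The SD comes out near .5, the SE is about 15 voters, and as a percentage it is about 1.7%.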

    On October 25th then, the LA Times was fairly certain that the percentage of votes Davis would receive would be between 49% and 57%. In other words, their sample statistic of 53% was telling them that it was highly likely (95% confident) that the true percentage would be captured in the range 49% to 57%.

  3. Confidence Interval Basics (21.2)
    A. Definition

    A CONFIDENCE INTERVAL is a range of values (i.e. values derived from sample information) which we think covers the true parameter.

    The article suggests that the "margin of error" is about 4% for the likely voters, which gives a range around the sample statistic of 49% to 57%. This interval is meant to cover Davis' true share of the vote. It is about plus or minus 2 S.E. and is a standard way of expressing results from opinion polls. What they are saying is that they were 95% confident that the interval 49% to 57% covers the true percentage of the vote for Davis.

    Actually, Davis got about 58% of the vote, which is outside the 2 S.E. range but inside the 3 S.E. range (47% to 59%). Misses larger than 2 standard errors do happen, but rarely.
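
    As a rough check, the size of the miss can be expressed in standard errors. This sketch assumes the unrounded SE of 1.7% from the poll calculation; the variable names are illustrative.

```python
sample_percent = 53.0    # Davis' share in the LA Times poll
actual_percent = 58.0    # Davis' approximate share of the actual vote
se_percent = 1.7         # standard error of the sample percentage

# How many standard errors separate the estimate from the actual result?
miss_in_se = (actual_percent - sample_percent) / se_percent

print(round(miss_in_se, 1))
```

    The miss is between 2 and 3 standard errors.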

    The figures I have been giving you are confidence intervals for the population percentage, and they are calculated from sample percentages and sample standard deviations. Up until now, we've been in a situation where we know exactly what the "box" looks like. Now we don't, but we have samples.

    B. Properties

    (1) In about 68% of all samples, the sample percentage will be within one standard error of the population percentage.

    From the Times poll, we would say that we were 68% confident that Davis' percentage of the vote is in the interval 51% to 55%.

    (2) In about 95% of all samples, the sample percentage will be within two standard errors of the population percentage.

    From the Times poll, we would say that we were 95% confident that Davis' percentage of the vote is in the interval 49% to 57%.

    (3) In about 99.7% of all samples, the sample percentage will be within three standard errors of the population percentage.

    From the Times poll, we would say that we were 99.7% confident that Davis' percentage of the vote is in the interval 47% to 59%.

    (4) You can never be 100% confident. There is always the chance that you could have a very bad sample and know nothing about the true population parameter.

    Also remember a property of the normal curve: it never crosses or touches the horizontal axis, so even at 10 S.E. there is a non-zero chance, but it is very, very small.
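
    The areas behind these properties can be reproduced from the normal curve. This is a minimal sketch using the error function from Python's standard library; the function names are illustrative and not from the text.

```python
import math

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return math.erf(k / math.sqrt(2))

def area_outside(k):
    """Area under the standard normal curve beyond -k and +k (both tails)."""
    return math.erfc(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, "S.E.:", round(area_within(k) * 100, 1), "% of samples")

# Even at 10 S.E. the tail area is tiny but not zero.
print("beyond 10 S.E.:", area_outside(10))
```

    The tail area is computed with erfc rather than as 1 minus the central area, because the latter rounds to exactly zero in floating point at 10 S.E.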

  4. Method
    Constructing a confidence interval for a population parameter involves five steps:

    (1) Find the sample statistic. This is our ESTIMATE of the population parameter. From the example above, 53% of the likely voters in the survey said they would vote for Davis.

    (2) Compute the standard error for the sample statistic. For simple random samples involving percentages, the standard error is square root(sample size) multiplied by square root(fraction with the thing of interest x fraction without it), divided by the sample size and multiplied by 100.

    For example, from the Times Poll:

        SQRT(883) x SQRT( .53 x .47)      
        ---------------------------- x 100 = 1.7%
                  883
    
    (3) Choose the level of confidence you are interested in and find the corresponding z from a normal table using the area percentages. For an approximate 95% confidence interval, z=2.00.

    (4) Multiply (2) and (3).

    2 x 1.7% = 3.4%

    (5) Add and subtract (4) from (1). The amount in (4) is your "margin of error", that is, how accurate you believe your statistic is based on the variability of the estimate.

    53% + or - 3.4%
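
    The five steps can be sketched as a small function. The function name and its arguments are illustrative, not from the text, and the sketch assumes a simple random sample and the normal approximation.

```python
import math

def percentage_confidence_interval(count, n, z=2.0):
    """Return (low, high), in percent, for a population percentage."""
    # Step 1: the sample statistic is the estimate.
    fraction = count / n
    estimate = fraction * 100
    # Step 2: standard error of the sample percentage.
    se_percent = math.sqrt(n) * math.sqrt(fraction * (1 - fraction)) / n * 100
    # Steps 3 and 4: z for the chosen confidence level, times the SE.
    margin = z * se_percent
    # Step 5: add and subtract the margin of error.
    return estimate - margin, estimate + margin

low, high = percentage_confidence_interval(468, 883)   # LA Times poll figures
print(round(low, 1), "to", round(high, 1))
```

    With the poll figures this gives roughly 49.6% to 56.4%, which agrees with 53% plus or minus 3.4%.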

  5. Remarks

    a. A typical confidence interval has the form "estimated value, plus or minus Z times the SE of the sample". In other words, an estimate plus and minus some margin of error.

    b. If the original population is normally distributed with a known standard deviation, or if the sample size is "large", then the distribution of the sample percentage is approximately normal, and the appropriate multiplier is thus z from the normal table. (If the original distribution is normal with an unknown standard deviation, the multiplier is different.)

    c. Your margin of error will depend on the choice of a confidence level. A lower confidence will give you a smaller margin of error. A higher confidence will give you a larger margin of error.

    d. If your standard deviation is small, it is easier to get a more precise fix on the parameter. Your margin of error is smaller for populations with smaller standard deviations.

    e. If your n increases in size, it will reduce your margin of error. If your n gets smaller, it will increase your margin of error.
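
    Remarks c, d and e can be illustrated numerically. This sketch uses a hypothetical helper, margin_of_error, which applies the same standard error formula as above written in the equivalent form square root(fraction x (1 - fraction) / n).

```python
import math

def margin_of_error(fraction, n, z):
    """Margin of error, in percentage points, for a sample percentage."""
    return z * math.sqrt(fraction * (1 - fraction) / n) * 100

# c. Higher confidence (larger z) gives a larger margin of error.
for z in (1, 2, 3):
    print("z =", z, "margin =", round(margin_of_error(0.53, 883, z), 1))

# d. A smaller SD (fraction far from .5) gives a smaller margin of error.
for fraction in (0.5, 0.3, 0.1):
    print("fraction =", fraction, "margin =", round(margin_of_error(fraction, 883, 2), 1))

# e. A larger n gives a smaller margin of error; 4 times the n halves it.
for n in (883, 4 * 883):
    print("n =", n, "margin =", round(margin_of_error(0.53, n, 2), 1))
```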

  6. Summary (21.3)
    1. The CORRECT interpretation for a confidence interval is as follows: "We did a procedure of drawing a sample, computing a percentage, standard error, etc. This procedure will give us a correct interval X% of the time and an incorrect interval 100-X% of the time. We hope this is one of the correct times." Thus, for about X% of all samples, the interval "sample percentage + or - Z standard errors" covers the true population percentage.

    2. It is WRONG to talk about the chance a particular confidence interval contains the parameter. For example, you can't say "there is an X% chance that the parameter is in the confidence interval" because these confidence intervals vary with samples and the parameter never varies.

    Any single confidence interval either covers the true parameter or it does not.

    3. Another way you might think about this. When you KNOW the TRUE POPULATION PARAMETER, you can make a statement like: there is a 95% CHANCE that the SAMPLE STATISTIC will be in the range of the parameter plus or minus two standard errors.

    Example: if you know the parameter is 40% and the SE is 2.5%, then there is a 95% chance that the sample percentage will be in the range of 40% plus or minus 5%.

    But when you DO NOT KNOW THE TRUE POPULATION PARAMETER, you are forced to make statements like this: I am 95% confident that the POPULATION PARAMETER is in the range of the statistic plus or minus two standard errors.

    Example: if you don't know the parameter and the sample statistic is 40% and the SE is 2.5%, then you are 95% confident that the parameter is covered by the range of 40% plus or minus 5%.
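
    The "X% of all samples" interpretation can be illustrated with a simulation. This is a sketch under stated assumptions: the true parameter is set to 40% and the sample size to 400 (which gives an SE of about 2.5%), and the function name, number of trials and seed are arbitrary choices, not from the text.

```python
import math
import random

def coverage_rate(true_fraction, n, trials, z=2.0, seed=0):
    """Fraction of simulated samples whose interval covers the true parameter."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(trials):
        # Draw one simple random sample from a 0-1 box and count the 1's.
        count = sum(rng.random() < true_fraction for _ in range(n))
        fraction = count / n
        se = math.sqrt(fraction * (1 - fraction) / n)
        # Each single interval either covers the parameter or it does not.
        if fraction - z * se <= true_fraction <= fraction + z * se:
            covered += 1
    return covered / trials

rate = coverage_rate(true_fraction=0.40, n=400, trials=2000)
print(rate)
```

    The parameter stays fixed at 40% throughout. Only the intervals change from sample to sample, and the proportion of them that cover the parameter should come out close to 95%.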

  7. Warnings

    If the sample is known to be biased, the confidence interval can be calculated, but it is meaningless.

  8. Homework Chapter 21 (DUE 11/13/98)


Last Update: 4 November 1998 by VXL