Stat 10: Making the leap from Chapters 16.4, 17.1-17.3, 20.1 - 20.4 to Chapter 21.

Suppose we're supreme beings and we know exactly what percentage of Californians over age 18 will buy their next car via the Internet. Let's just say that percentage is 10%.

This could be represented by a "box" of tickets standing for the whole population of Californians 18 and over. Let's say there are 20 million of them. Then 10% (or 2 million) of the tickets carry a "1", representing "plans to buy via the internet", and the remaining 90% (or 18 million) carry a "0", representing "does not plan to buy via the internet".

  1. A question for you regarding the above "box": if you picked one person at random from the "box", what's the CHANCE of picking someone who plans to buy a car from the internet? What is the CHANCE of picking someone who does not plan to buy a car from the internet?
  2. Another question for you. If you picked 100 people at random, knowing nothing else, what is your best guess about the number of people out of this 100 who plan to buy? And what would the percentage of buyers be?

First, some quick review about chances again (see Chapter 13's summary):

  1. The chance of something gives the percentage of time it is expected to happen when the basic process is done over and over again independently and under the same conditions.
  2. Chances are between 0% and 100%
  3. The chance of something equals 100% minus the chance of the opposite thing.
  4. Chances are like the proportion of time you expect to see a given outcome.
  5. Remember, "selected at random" or "drawn at random" implies that everyone (in this case Californians age 18 and over) had the same chance to be picked.

In Chapters 16.4, 17.1 - 17.3, and 20.1 - 20.4 this is one of the main ideas. Questions of interest can be simplified into a "box model": in this case, what % of people plan to buy a car via the internet (instead of more traditional ways), or what % of people plan to vote for Gore, or what % of people think the world is going to end, etc. The type of box model illustrated here has two outcomes (binary, "one-zero").
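
A minimal Python sketch of this kind of one-zero box may help make it concrete. It rests on assumptions that are not in the text: the box is modeled by its 10% share of "1" tickets rather than by 20 million literal tickets, the draws are treated as independent (reasonable when only 100 people are drawn from 20 million), and the names P_BOX, N_DRAWS and draw_sample are illustrative only.

```python
import random

P_BOX = 0.10      # assumed share of "1" tickets in the box (10%)
N_DRAWS = 100     # people drawn per sample

def draw_sample(rng, n=N_DRAWS, p=P_BOX):
    """Return the number of "1" tickets in n random draws from the box."""
    # Each draw is a "1" with chance p, otherwise a "0".
    return sum(1 for _ in range(n) if rng.random() < p)

rng = random.Random(10)  # fixed seed so the run is repeatable
counts = [draw_sample(rng) for _ in range(1000)]

print("first ten sample counts:", counts[:10])
print("average count over 1000 samples:", sum(counts) / len(counts))
```

The individual counts bounce around from sample to sample, while their long-run average should settle near 10 out of 100.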

The type of questions (that you would see on a midterm) that grow out of this kind of situation run along the lines of:

"What is the CHANCE (or what percentage of time) of drawing or selecting or surveying 100 people and discovering that 20% (or 20 people) said they were planning to buy a car via the internet?"

 

The ideas floating around are:

  1. sample outcomes (or sample statistics) can vary from sample to sample…in other words, it is unreasonable to expect that every time you select 100 from "the box" you are going to get 10 people who say they will buy a car from the internet. Sometimes you will get 9, 11, 20 etc.
  2. understanding the basic behavior of samples can help us calculate the CHANCE (see the question quoted above) of getting 20% when we were expecting 10%.
  3. The ability to do this comes from Chapter 17.1 - 17.3 and the discussion in 20.1. The most difficult concept to understand is this: sample outcomes (statistics; in this case, a percentage) have a distribution too. This distribution is approximately normal when the sample is large (remember Chapter 5). Its center (or mean) is the parameter p, which is the population percentage (or proportion). This distribution also has a spread (called the Standard Error), and for a percentage it is given by the formula on page 360:

    SE for percentage = (SE for number / size of sample ) * 100

    Remember the formula for a SE for a number (p. 291):

    SE for a number = (square root of the number of draws) x (standard deviation of the box)

    What is the standard deviation for this "box"? (1 - 0) * sqrt(.1 * .9) = .30

    So the SE for the number of "1"s drawn would be 10 * .30 = 3.

    The SE for the percentage of "1"s drawn would be (3 / 100) * 100 = 3%

  4. To actually make a statement of the CHANCE (remember from above: chances are like the proportion of time you expect to see a given outcome) you need to convert your given outcome (for example, 20%) into a Z score and read the percentage (or chance or probability) from the normal table, A 105.
  5. Specifically (see 17.3 and 20.3)

Z = (20% - 10%) / 3% = 3.33 (round down to a Z of 3.30). The table entry for a Z of 3.30 is 99.903%.

Interpretation -- 99.903% of the sample outcomes (sample statistics) should fall between -3.30 and +3.30 standard errors of the expected 10%. That leaves 100% - 99.903% = .097% outside this range, half of it in each tail, so only .0485% of all sample outcomes are expected to be as large as 20% or larger. Or you could say THERE WAS ONLY A .0485% CHANCE OF GETTING AN OUTCOME AS LARGE AS 20% IF WE WERE EXPECTING 10%
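
The arithmetic in steps 3 to 5 can be retraced with a short Python sketch. This is only an illustration under stated assumptions: the normal curve area is computed from the standard library's math.erf instead of being read from the table, and the helper name normal_cdf is made up for this example. Because the printed table rounds its entries, the computed tail can differ from the table-based .0485% in the last digit.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p = 0.10          # fraction of "1" tickets in the box
n = 100           # number of draws

sd_box = (1 - 0) * math.sqrt(p * (1 - p))      # about .30
se_number = math.sqrt(n) * sd_box              # about 3 people
se_percent = se_number / n * 100               # about 3 percentage points

z = (20 - 10) / se_percent                     # about 3.33
z_table = 3.30                                 # rounded down, as in the text

# Area between -z and +z, then half of what is left over for the upper tail.
middle_pct = (normal_cdf(z_table) - normal_cdf(-z_table)) * 100
upper_tail_pct = (100 - middle_pct) / 2

print(f"SD of box = {sd_box:.2f}, SE for percentage = {se_percent:.2f}%")
print(f"z = {z:.2f}")
print(f"area between -3.30 and +3.30 = {middle_pct:.3f}%")
print(f"chance of 20% or more = {upper_tail_pct:.4f}%")
```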

This was an example using percentages. Sometimes we are interested in SUMS (see 16.4, 17.1 - 17.3)

 

Chapter 21 is a departure from Chapters 17.1 - 17.3 and 20.1 - 20.4, but it is more like "real life" in terms of statistics.

Let's continue the car purchase idea…

An enterprising auto dealer (who is not a supreme being) wants to know what percentage of Californians over the age of 18 plan to buy their next car via the Internet. If it turns out there are at least 10% willing to buy, he'll set up a web site. But if it's less than 10%, he doesn't think it's worth his trouble. Suppose the auto dealer selected (at random) and surveyed 100 Californians age 18 and over.

Suppose the auto dealer's sample of 100 yielded the following results:

8% said they would buy their next car via the Internet

92% said they would not buy their next car via the Internet

Never having taken statistics and not being a supreme being, the auto dealer might interpret these results in the following manner: "It looks like 8% of Californians plan to buy their next car via the Internet. That's too low for me, I won't have a web site created."

  1. After this course, you should do a little better than the auto dealer in terms of interpreting statistical information….even if you are not a supreme being.
  2. In the auto buying example above -- this is like having a "box" of tickets representing the whole population of Californians, 18 and over, let's say there are 20 million of them, BUT NOW he doesn't know the underlying chances (probabilities).
  3. In practice, what is done is this: THE SAMPLE INFORMATION, that is, the 8% and the 92%, is used as a "substitute" for the unknown parameters. In other words, in Chapter 21, the sample result of 8% is now treated as if it were the population parameter (but it is not -- it's just an estimate), and we calculate the standard deviation and the standard error from the sample information yielded by the 100 people selected.

So…the "new" population percentage is 8%,

the new standard deviation for this "box" is

(1 - 0) * sqrt(.08 * .92) = .2713

So the SE for the number of "1"s drawn would be 10 * .2713 = 2.713.

The SE for the percentage of "1"s drawn would be (2.713 / 100) * 100 = 2.713%

We might calculate Z scores at this point, but what is done in Chapter 21 (specifically 21.2) is an introduction to the CONFIDENCE INTERVAL. You can think of it as a "margin of error". Here, a range of values derived from sample information is given and within this range, we think the true parameter is "covered". We use a combination of the sample estimate of 8% and the Standard error for that percentage (2.713%) to make statements of confidence about where we think the true parameter is.

We know from Chapters 17 and 20 that in about 68% of all samples, the sample percentage should be within one standard error of the population percentage. So in this Chapter 21 example:

8% plus or minus 2.713% is approximately a 68% confidence interval. We would say that we are 68% confident that the interval 5.287% to 10.713% covers the "truth", that is, the unknown population parameter p.

8% plus or minus (2 * 2.713%) is approximately a 95% confidence interval. We would say that we are 95% confident that the interval 2.574% to 13.426% covers the "truth", that is, the unknown population parameter p.

8% plus or minus (3 * 2.713%) is approximately a 99.7% confidence interval. We would say that we are 99.7% confident that the interval -0.139% to 16.139% covers the "truth", that is, the unknown population parameter p. (A percentage cannot be negative, so in practice the lower end of this interval is 0%.)
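
The three intervals above can be reproduced with a short Python sketch. It is a sketch under the same assumption the text makes, namely that the sample's 8% stands in for the unknown population percentage when the SD of the box is computed. The function name percent_interval is made up for this illustration.

```python
import math

def percent_interval(sample_pct, n, num_se):
    """Return (low, high, se) for sample_pct +/- num_se standard errors, in percent."""
    p_hat = sample_pct / 100
    sd_box = math.sqrt(p_hat * (1 - p_hat))     # SD of the box, estimated from the sample
    se_pct = math.sqrt(n) * sd_box / n * 100    # SE for the percentage
    return sample_pct - num_se * se_pct, sample_pct + num_se * se_pct, se_pct

# 1, 2 and 3 standard errors on either side of the sample percentage.
for num_se, level in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    low, high, se_pct = percent_interval(8, 100, num_se)
    print(f"about {level}: {low:.3f}% to {high:.3f}%  (SE = {se_pct:.3f}%)")
```

Note that the dealer's cutoff of 10% lies inside both the 1 SE and the 2 SE intervals, so a sample of 100 with 8% buyers does not rule out a true percentage of 10% or more.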

We can never be 100% confident. There is always the chance, remote as it may be, that you did everything correctly when drawing a sample and still got a bad sample.

Notes

If the original population is normally distributed with a known standard deviation, or if the sample size is "large", then the distribution of the sample statistic (in this example a percentage -- but it could be a sum or an average) is normal, and the appropriate test statistic is thus z from the normal table. (If the original distribution is normal with an unknown standard deviation, the test statistic is different.)

Your margin of error will depend on the choice of a confidence level. A lower confidence will give you a smaller margin of error. A higher confidence will give you a larger margin of error.
If your standard deviation is small, it is easier to get a more precise fix on the parameter. Your margin of error is smaller for populations with smaller standard deviations.
If your n (sample size) increases in size, it will reduce your margin of error. If your n (sample size) gets smaller, it will increase your margin of error.
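
These three notes can be checked numerically with a small Python sketch. It assumes the margin of error is taken to be a chosen number of standard errors for a percentage, as in the intervals earlier. The function name margin_of_error and the sample sizes 400 and 1600 are illustrative choices, not from the text.

```python
import math

def margin_of_error(p_hat, n, num_se):
    """Margin of error in percentage points: num_se times the SE for a percentage."""
    return num_se * math.sqrt(p_hat * (1 - p_hat) / n) * 100

# Same sample percentage (8%), different sample sizes and confidence multipliers.
for n in (100, 400, 1600):
    row = ", ".join(f"{k} SE: {margin_of_error(0.08, n, k):.2f}" for k in (1, 2, 3))
    print(f"n = {n:5d} -> {row}")
```

Reading across a row shows the margin growing with the confidence level. Reading down a column shows it shrinking as n grows. Because n sits under a square root, the sample size has to be multiplied by four to cut the margin in half.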

The parameter is FIXED, UNCHANGING but your confidence intervals will vary from sample to sample because your statistic varies.
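
A simulation can illustrate this last point. The Python sketch below is hypothetical and plays "supreme being": it fixes the true percentage at 10%, draws many samples of 100, builds a 2 SE interval from each sample, and counts how often the interval covers the fixed parameter. The constants TRUE_P, N and TRIALS and the seed are assumptions made for the illustration. With a sample of only 100 and a percentage near 10%, the normal approximation is rough, so the share of covering intervals can come out somewhat below 95%.

```python
import math
import random

TRUE_P = 0.10     # the fixed parameter, known only to the simulation
N = 100           # sample size
TRIALS = 2000     # number of repeated samples

rng = random.Random(21)  # fixed seed so the run is repeatable
covered = 0
for _ in range(TRIALS):
    ones = sum(1 for _ in range(N) if rng.random() < TRUE_P)
    p_hat = ones / N
    se = math.sqrt(p_hat * (1 - p_hat) / N)     # SE estimated from the sample
    low, high = p_hat - 2 * se, p_hat + 2 * se  # the interval changes every time
    if low <= TRUE_P <= high:
        covered += 1

coverage = covered / TRIALS
print(f"{covered} of {TRIALS} intervals covered the true 10% ({coverage:.1%})")
```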