1. Definitions again
Review the definitions of POPULATION, SAMPLE, PARAMETER, and STATISTIC. In STATISTICAL INFERENCE the parameters are unknown, and we draw conclusions from sample outcomes (those are statistics) to make statements about the population parameters.
From your textbook: statistical inference produces answers to specific questions, along with a statement (prof's note: using probability) of how confident we can be that the answer is correct. The conclusions of statistical inference are usually intended to apply beyond the individuals (prof's note: the particular sample) actually studied.
When random samples are drawn from a population of interest to represent the whole population, they are generally unbiased and representative. The statistics calculated from the samples will obey the laws of probability discussed in Chapter 4.
The key to understanding why this is so is a difficult concept: THE SAMPLING DISTRIBUTION. The sampling distribution is a theoretical/conceptual/ideal probability distribution of a statistic, such as the ones seen in Chapter 4.4. A theoretical probability distribution is what the outcomes (i.e., statistics) of some random process would look like if you could repeatedly sample or repeat the experiment. Your Lab 3 has you simulate thousands upon thousands of "coin tosses" or thousands of "market days" so you can see the resulting probability distribution.
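If you are curious what that kind of simulation looks like outside of Stata, here is a minimal Python sketch (not part of Lab 3; the seed and the 10,000 repetitions are just illustrative) that repeats the "toss a fair coin 10 times and count the heads" experiment many times and tabulates the resulting distribution of counts:

    import numpy as np

    rng = np.random.default_rng(seed=1)   # any seed; it just makes the run repeatable

    n_tosses = 10        # tosses per sample
    p_heads = 0.50       # a fair coin
    n_samples = 10_000   # "thousands upon thousands" of repetitions

    # Each entry is the count of heads in one sample of 10 tosses.
    counts = rng.binomial(n_tosses, p_heads, size=n_samples)

    # Tabulate the simulated sampling distribution of the count.
    for k in range(n_tosses + 1):
        print(k, round((counts == k).mean(), 4))
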
Note that a sampling distribution is the probability distribution of a statistic. In Chapter 1.3 you saw a probability distribution for normal variables; you could think of 1.3 as describing samples of size one and how that one observation "fit" with respect to the rest of the population. In Chapter 5, we now consider how sample size, represented by n, affects the sampling distribution.
2. Basic Definitions
a. BINOMIAL DISTRIBUTION -- the simplest situation (p. 376)
i. You have some fixed number n of things to observe.
ii. They are independent -- that is, the probabilities do not change from one thing to the next.
iii. Each observation falls into one of only two categories. These are thought of as "success/failure", "yes/no", "zero/one".
iv. The probability of a "success" is p and is, in theory, the same for each thing observed (therefore the probability of failure is 1-p).
b. A fair coin behaves in this manner: a 50% chance of showing "heads" on a toss and a 50% chance of showing "tails". We write that the proportion of heads = .50, or p = .50. Here p is a parameter representing the probability of seeing "heads".
c. The POPULATION in this example is the counts of heads from an infinite number of repetitions of tossing a coin 10 times. Here is a table which describes the sampling distribution (see handout); a sketch after item g below shows how such a table can be computed. These probabilities are derived from knowing how a single coin behaves. (I don't expect you to calculate something as complicated as this, but if you want to know, it's in the optional section on page 387.)
d. Suppose I drew (or threw) 10 SAMPLES (a part of the population) at random (using Stata). (see handout)
e. In the first 10 tosses, I got 6 heads (and therefore 4 tails). What were we EXPECTING? 5 heads. Remember "expected values"? For samples taken from a binomial, the expected value is n*p and the variance is np(1-p).
f. I got 6 heads by the luck of the draw. If I had drawn, say, hundreds of samples instead of drawing one single sample, I'd have a sampling distribution. And for this particular situation, the sampling distribution would look like a normal curve centered around 5. (See the sketch after item g.)
g. THE COIN TOSS SITUATION COULD BE EXTENDED TO ANY QUESTION THAT INVOLVES THE IDEA OF CATEGORIZING OUTCOMES INTO "SUCCESS/FAIL" or "YES/NO". For example, the question might be "Will you vote for George Bush?" or "Will you buy from our company again?"
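As referenced in items c, e, and f above, here is a minimal Python sketch (purely illustrative; the handout and Stata do this work for you) that computes the exact sampling distribution of the count of heads in 10 tosses of a fair coin, along with its expected value and standard deviation:

    from math import sqrt
    from scipy.stats import binom

    n, p = 10, 0.50   # 10 tosses of a fair coin

    # P(exactly k heads in 10 tosses), for k = 0, 1, ..., 10 -- a table like the
    # one described on the handout (item c).
    for k in range(n + 1):
        print(k, round(binom.pmf(k, n, p), 4))

    # Expected value and standard deviation of the count (items e and f):
    # the distribution is centered at n*p = 5, with SD sqrt(n*p*(1-p)), about 1.58.
    print(n * p, round(sqrt(n * p * (1 - p)), 2))
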
The underlying population is not normal, but the sampling distribution is approximately normal.
The sample size (n) stays the same from sample to sample. If you used a different sample size, you would have a different sampling distribution.
For these kinds of yes/no, success/failure, positive/negative populations, we are interested in two parameters: n (the sample size) and p (the probability or proportion of "successes" in the population). The definition of "success" is arbitrary.
Another example: suppose you know that for the stock market as a whole (all 7,500 or so publicly traded issues on the NYSE, AMEX, and NASD), the proportion with a positive return year-to-date is .72; that is, 72% of the market has experienced a positive return since Jan 1, 2001.
Let's suppose we want to buy securities in such a way that our portfolio has a return similar to the market's, but we don't want to own as many stocks, so we, say, select 100 at random from the 7,500.
Stop: which one is the population, the parameter, the sample, the statistic?
We could treat the count of stocks with positive returns in our single portfolio of 100 stocks as a random variable.
The mean (expected value) would be μ_X = n*p = 100 * .72 = 72.
The variance is σ²_X = np(1-p) = (100)(.72)(.28) = 20.16, and the standard deviation (more useful) is approximately 4.49.
This is saying that if we sample from a binomial population using samples of size 100, we expect a COUNT of 72 positives, plus or minus 4.49 (nearly 5) stocks.
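The same arithmetic as a sketch, using the numbers from the example above (a sample of 100 stocks and a population proportion of .72):

    from math import sqrt

    n, p = 100, 0.72               # sample size and population proportion

    mean_count = n * p             # expected count of positive-return stocks: 72.0
    var_count = n * p * (1 - p)    # variance of the count: 20.16
    sd_count = sqrt(var_count)     # standard deviation of the count: about 4.49

    print(mean_count, round(var_count, 2), round(sd_count, 2))
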
For sample proportions (page 382), that is, for the percentage or proportion of stocks with positive returns:
The mean would be μ_p̂ = p = .72, or 72%.
The standard deviation is σ_p̂ = √(p(1-p)/n) = √((.72)(.28)/100) ≈ .0449 (approximately 4.5%).
Note: If we increase the size of the sample (assuming it is representative), for proportions, our standard deviation gets smaller. Think about what this property means in terms of accuracy when comparing sample statistics to population parameters.
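One quick way to see the note above is to compute the standard deviation of the sample proportion, √(p(1-p)/n), for a few sample sizes and watch it shrink as n grows. A sketch (the sample sizes 100, 400, and 1,600 are just illustrative):

    from math import sqrt

    p = 0.72   # population proportion of positive returns

    # Standard deviation of the sample proportion for progressively larger samples.
    # Notice that quadrupling n cuts the standard deviation in half.
    for n in (100, 400, 1600):
        print(n, round(sqrt(p * (1 - p) / n), 4))
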
4. The Normal Curve Again: p. 383, The Normal Approximation for Counts and Proportions
Here is a statement: "The percentage of stocks in the sample which had a positive change in price over the last year will be around 72%, give or take 4.5%."
We can convert these to standard units (Z scores) as in Chapter 1. One standard deviation of the sampling distribution in this example is 4.5%, so +1 standard deviation would be 72 + 4.5 = 76.5%, and -1 standard deviation would be 72 - 4.5 = 67.5%.
The chance that between 67.5% and 76.5% of any given sample of 100 stocks from the major exchanges will have a positive change in price since the new year is about 68%.
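That 68% figure (and the chance for any other range) can be checked directly from the normal curve. A sketch, assuming the sampling distribution of the sample proportion is approximately normal with mean .72 and standard deviation .045:

    from scipy.stats import norm

    mu, sigma = 0.72, 0.045   # mean and SD of the sampling distribution of the proportion

    # P(0.675 <= p-hat <= 0.765), i.e. within one standard deviation of the mean.
    prob = norm.cdf(0.765, loc=mu, scale=sigma) - norm.cdf(0.675, loc=mu, scale=sigma)
    print(round(prob, 3))   # about 0.68
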
We can move from using the normal curve to figure percentages to using the normal curve to make statements about chances. This is because, if we had an infinite number of samples of any given size from a binomial population, the distribution of their counts (or proportions) would approximate a normal. This then allows us to make a statement about a single sample or a set of samples, or to get a sense of ranges.
The chance that between 63.0% and 81.0% of any given sample of 100 stocks will have a positive change is about 95%.
And 99.7%?
What is being said is the following: under certain conditions (when n*p is greater than or equal to 10 and when n*(1-p) is greater than or equal to 10), the sampling distribution for a binomial approximates a normal.
This will ultimately allow us to use the normal to make probability statements about sample statistics.
Note that the sampling distributions are not perfectly normal; there is a correction for this (on page 386), but you will not be tested on the continuity correction. It will be enough for you to assume that if the conditions n*p ≥ 10 and n*(1-p) ≥ 10 are met, the sampling distribution will be approximately normal.
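To tie the conditions back to the stock example, a final sketch: check that n*p and n*(1-p) are both at least 10, then compare the exact binomial probability of a range of counts with the normal approximation (the range 63 to 81 is the plus-or-minus-two-standard-deviation range from above):

    from math import sqrt
    from scipy.stats import binom, norm

    n, p = 100, 0.72
    print(n * p >= 10, n * (1 - p) >= 10)   # both True, so the approximation is reasonable

    mu = n * p                      # 72
    sigma = sqrt(n * p * (1 - p))   # about 4.49

    # P(63 <= count <= 81): exact binomial versus the normal approximation.
    exact = binom.cdf(81, n, p) - binom.cdf(62, n, p)
    approx = norm.cdf(81, loc=mu, scale=sigma) - norm.cdf(63, loc=mu, scale=sigma)
    print(round(exact, 3), round(approx, 3))   # the two should be close
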