1. Definitions again
Review the definitions of POPULATION, SAMPLE, PARAMETER, and STATISTIC. In STATISTICAL INFERENCE the parameters are unknown, and we draw conclusions from sample outcomes (those are statistics) to make statements about the population parameters.
From your textbook: statistical inference produces answers to specific questions, along with a statement (prof's note: using probability) of how confident we can be that the answer is correct. The conclusions of statistical inference are usually intended to apply beyond the individuals (prof's note: the particular sample) actually studied.
When random samples are drawn from a population of interest to represent the whole population, they are generally unbiased and representative. The statistics calculated from the samples will obey the laws of probability discussed in Chapter 4.
The key to understanding why this is so is a difficult concept: THE SAMPLING DISTRIBUTION. The sampling distribution is a theoretical/conceptual/ideal probability distribution of a statistic, such as the ones seen in Chapter 4.4. A theoretical probability distribution is what the outcomes (i.e., statistics) of some random process would look like if you could repeatedly sample or repeat the experiment. Your Lab 3 has you simulate thousands upon thousands of "coin tosses" or thousands of "market days" so you can see the resulting probability distribution.
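If you are curious what that kind of simulation looks like outside of Stata, here is a minimal Python sketch (not part of Lab 3; the seed and the 10,000 repetitions are just illustrative) that repeats the "toss a fair coin 10 times and count the heads" experiment many times and tabulates the resulting distribution of counts:

    import numpy as np

    rng = np.random.default_rng(seed=1)   # any seed; it just makes the run repeatable

    n_tosses = 10        # tosses per sample
    p_heads = 0.50       # a fair coin
    n_samples = 10_000   # "thousands upon thousands" of repetitions

    # Each entry is the count of heads in one sample of 10 tosses.
    counts = rng.binomial(n_tosses, p_heads, size=n_samples)

    # Tabulate the simulated sampling distribution of the count.
    for k in range(n_tosses + 1):
        print(k, round((counts == k).mean(), 4))
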
Note that a sampling distribution is the probability distribution of a statistic. In Chapter 1.3 you saw a probability distribution for normal variables; you could think of 1.3 as describing samples of size one and how that one observation "fit" with respect to the rest of the population. In Chapter 5, we now consider how sample size, represented by n, affects the sampling distribution.
2. Basic Definitions
a. BINOMIAL DISTRIBUTION -- the simplest situation (p. 376)
i. You have some fixed number n of things to observe.
ii. They are independent -- that is, the probabilities do not change from one thing to the next.
iii. Each observation falls into one of only two categories. These are thought of as "success/failure", "yes/no", "zero/one".
iv. The probability of a "success" is p and is, in theory, the same for each thing observed (therefore the probability of failure is 1-p).
b. A fair coin behaves in this manner: a 50% chance of showing "heads" on a toss and a 50% chance of showing "tails". We write that the proportion of heads = .50, or p = .50. Here p is a parameter representing the probability of seeing "heads".
c. The POPULATION in this example is the counts of heads from an infinite number of repetitions of tossing a coin 10 times. Here is a table which describes the sampling distribution (see handout); a sketch after item g below shows how such a table can be computed. These probabilities are derived from knowing how a single coin behaves. (I don't expect you to calculate something as complicated as this, but if you want to know, it's in the optional section on page 387.)
d. Suppose I drew (or threw) 10 SAMPLES (a part of the population) at random (using Stata). (see handout)
e. In the first 10 tosses, I got 6 heads (and therefore 4 tails). What were we EXPECTING? 5 heads. Remember "expected values"? For samples taken from a binomial, the expected value is n*p and the variance is np(1-p).
f. I got 6 heads by the luck of the draw. If I had drawn, say, hundreds of samples instead of drawing one single sample, I'd have a sampling distribution. And for this particular situation, the sampling distribution would look like a normal curve centered around 5. (See the sketch after item g.)
g. THE COIN TOSS SITUATION COULD BE EXTENDED TO ANY QUESTION THAT INVOLVES THE IDEA OF CATEGORIZING OUTCOMES INTO "SUCCESS/FAIL" or "YES/NO". For example, the question might be "Will you vote for George Bush?" or "Will you buy from our company again?"
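As referenced in items c, e, and f above, here is a minimal Python sketch (purely illustrative; the handout and Stata do this work for you) that computes the exact sampling distribution of the count of heads in 10 tosses of a fair coin, along with its expected value and standard deviation:

    from math import sqrt
    from scipy.stats import binom

    n, p = 10, 0.50   # 10 tosses of a fair coin

    # P(exactly k heads in 10 tosses), for k = 0, 1, ..., 10 -- a table like the
    # one described on the handout (item c).
    for k in range(n + 1):
        print(k, round(binom.pmf(k, n, p), 4))

    # Expected value and standard deviation of the count (items e and f):
    # the distribution is centered at n*p = 5, with SD sqrt(n*p*(1-p)), about 1.58.
    print(n * p, round(sqrt(n * p * (1 - p)), 2))
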
The underlying population is not normal, but the sampling distribution is approximately normal.
The sample size (n) stays the same from sample to sample. If you used a different sample size, you would have a different sampling distribution.
For these kinds of yes/no, success/failure, positive/negative populations, we are interested in two parameters: n (the sample size) and p (the probability or proportion of "successes" in the population). The definition of "success" is arbitrary.
Another example: suppose you know that for the stock market as a whole (all 7,500 or so publicly traded issues on the NYSE, AMEX, and NASD), the proportion with a positive return year-to-date is .72; that is, 72% of the market has experienced a positive return since Jan 1, 2001.
Let's suppose we want to buy securities in such a way that our portfolio has a return similar to the market's, but we don't want to own as many stocks, so we, say, select 100 at random from the 7,500.
Stop: which one is the population, the parameter, the sample, the statistic?
We could treat the count of stocks with positive returns in our single portfolio of 100 stocks as a random variable.
The mean (expected value) would be μ_X = n*p = 100 * .72 = 72.
The variance is σ²_X = np(1-p) = (100)(.72)(.28) = 20.16, and the standard deviation (more useful) is approximately 4.49.
This is saying that if we sample from a binomial population using samples of size 100, we expect a COUNT of 72 positives, plus or minus 4.49 (nearly 5) stocks.
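The same arithmetic as a sketch, using the numbers from the example above (a sample of 100 stocks and a population proportion of .72):

    from math import sqrt

    n, p = 100, 0.72               # sample size and population proportion

    mean_count = n * p             # expected count of positive-return stocks: 72.0
    var_count = n * p * (1 - p)    # variance of the count: 20.16
    sd_count = sqrt(var_count)     # standard deviation of the count: about 4.49

    print(mean_count, round(var_count, 2), round(sd_count, 2))
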
For sample proportions (page 382), that is, for the percentage or proportion of stocks with positive returns:
The mean would be μ_p̂ = p = .72, or 72%.
The standard deviation is σ_p̂ = √(p(1-p)/n) = √((.72)(.28)/100) ≈ .0449 (approximately 4.5%).
Note: If we increase the size of the sample (assuming it is representative), for proportions, our standard deviation gets smaller. Think about what this property means in terms of accuracy when comparing sample statistics to population parameters.
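One quick way to see the note above is to compute the standard deviation of the sample proportion, √(p(1-p)/n), for a few sample sizes and watch it shrink as n grows. A sketch (the sample sizes 100, 400, and 1,600 are just illustrative):

    from math import sqrt

    p = 0.72   # population proportion of positive returns

    # Standard deviation of the sample proportion for progressively larger samples.
    # Notice that quadrupling n cuts the standard deviation in half.
    for n in (100, 400, 1600):
        print(n, round(sqrt(p * (1 - p) / n), 4))
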
4. The Normal Curve Again: p. 383, The Normal Approximation for Counts and Proportions
Here is a statement: "The percentage of stocks in the sample which had a positive change in price over the last year will be around 72%, give or take 4.5%."
We can convert these to standard units (Z scores) as in Chapter 1. One standard deviation of the sampling distribution in this example is 4.5%, so +1 standard deviation would be 72 + 4.5 = 76.5%, and -1 standard deviation would be 72 - 4.5 = 67.5%.
The chance that between 67.5% and 76.5% of any given sample of 100 stocks from the major exchanges will have a positive change in price since the new year is about 68%.
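That 68% figure (and the chance for any other range) can be checked directly from the normal curve. A sketch, assuming the sampling distribution of the sample proportion is approximately normal with mean .72 and standard deviation .045:

    from scipy.stats import norm

    mu, sigma = 0.72, 0.045   # mean and SD of the sampling distribution of the proportion

    # P(0.675 <= p-hat <= 0.765), i.e. within one standard deviation of the mean.
    prob = norm.cdf(0.765, loc=mu, scale=sigma) - norm.cdf(0.675, loc=mu, scale=sigma)
    print(round(prob, 3))   # about 0.68
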
We can move from using the normal curve to figure percentages to using the normal curve to make statements about chances. This is because, if we had an infinite number of samples of any given size from a binomial population, the distribution of their counts (or proportions) would approximate a normal. This then allows us to make a statement about a single sample or a set of samples, or to get a sense of ranges.
The chance that between 63.0% and 81.0% of any given sample of 100 stocks will have a positive change is about 95%.
And 99.7%?
What is being said is the following: under certain conditions (when n*p is greater than or equal to 10 and when n*(1-p) is greater than or equal to 10), the sampling distribution for a binomial approximates a normal.
This will ultimately allow us to use the normal to make probability statements about sample statistics.
Note that the sampling distributions are not perfectly normal; there is a correction for this (on page 386), but you will not be tested on the continuity correction. It will be enough for you to assume that if the conditions n*p ≥ 10 and n*(1-p) ≥ 10 are met, the sampling distribution will be approximately normal.
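To tie the conditions back to the stock example, a final sketch: check that n*p and n*(1-p) are both at least 10, then compare the exact binomial probability of a range of counts with the normal approximation (the range 63 to 81 is the plus-or-minus-two-standard-deviation range from above):

    from math import sqrt
    from scipy.stats import binom, norm

    n, p = 100, 0.72
    print(n * p >= 10, n * (1 - p) >= 10)   # both True, so the approximation is reasonable

    mu = n * p                      # 72
    sigma = sqrt(n * p * (1 - p))   # about 4.49

    # P(63 <= count <= 81): exact binomial versus the normal approximation.
    exact = binom.cdf(81, n, p) - binom.cdf(62, n, p)
    approx = norm.cdf(81, loc=mu, scale=sigma) - norm.cdf(63, loc=mu, scale=sigma)
    print(round(exact, 3), round(approx, 3))   # the two should be close
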