1. Sample Design and Statistical Inference
Goal: to make generalizations from collected data and to draw conclusions about
summarized information (e.g. means, percentages) on variables.
2. Sampling and Inference
a. The POPULATION is the entire set of people (or animals, or things) we
wish to study.
b. A SAMPLE is a part of the population. Why sample? Most of the time it is
not feasible to query the entire population (e.g. too costly, would take too
long), so researchers select (sample) a subgroup for questioning or analysis.
c. A numerical fact about a sample is a STATISTIC. It is a number
that is used to describe a sample.
d. A numerical fact about a population is a PARAMETER. A STATISTIC is
to a sample what a PARAMETER is to the population. If 48% is the resulting
statistic from a sample, the people who conducted the study hope that it is a
close approximation of the true population PARAMETER (i.e. they hope the
parameter is also close to 48%).
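The statistic/parameter distinction can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the population of 100,000 people, the 48% figure applied to it, the sample size of 1,000, and the helper name sample_proportion are all hypothetical.

```python
import random

def sample_proportion(population, n, rng):
    """Proportion of 1s in a simple random sample of size n."""
    sample = rng.sample(population, n)
    return sum(sample) / n

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: exactly 48% hold some opinion (coded 1).
population = [1] * 48_000 + [0] * 52_000

parameter = sum(population) / len(population)           # fact about the population
statistic = sample_proportion(population, 1_000, rng)   # fact about one sample

print(f"parameter = {parameter:.3f}, statistic = {statistic:.3f}")
```

In practice the parameter line is exactly what we cannot compute, because we never see the whole population; the statistic is all we have.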
3. Problems: less than perfect samples, less than perfect subjects
Bias: If a sample is "representative", then a statistic can be a good
estimate of the parameter; but if the sample systematically includes or excludes
certain people, the sample is BIASED. This CNN "sample" is a bad sample (at
least they are willing to issue a disclaimer at the bottom).
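The effect of systematically excluding people can be sketched with a small simulation. Everything here is an assumption made for illustration: a hypothetical population split into a "reachable" group (40% in favor) and an "unreachable" group (80% in favor), and a flawed sampling method that can only ever draw from the reachable group.

```python
import random

def proportion(sample):
    """Share of 1s in a list of 0/1 answers."""
    return sum(sample) / len(sample)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population of 100,000 people (1 = in favor, 0 = not).
reachable = [1] * 28_000 + [0] * 42_000     # 70,000 people, 40% in favor
unreachable = [1] * 24_000 + [0] * 6_000    # 30,000 people, 80% in favor
population = reachable + unreachable

parameter = proportion(population)          # true value: 0.52

# Fair method: simple random sample from the whole population.
srs_estimate = proportion(rng.sample(population, 2_000))

# Biased method: the unreachable group can never be selected.
biased_estimate = proportion(rng.sample(reachable, 2_000))

print(f"parameter       = {parameter:.3f}")
print(f"SRS estimate    = {srs_estimate:.3f}")
print(f"biased estimate = {biased_estimate:.3f}")
```

Taking a larger sample with the biased method would not help: the estimate would settle near 40%, not near the true 52%.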
Types of bias
(i) Voluntary-response bias --- only certain people decide to talk to you.
(ii) Selection bias --- you systematically exclude certain subjects (undercoverage).
(iii) Non-response bias --- people don't bother to answer you.
(iv) Response bias --- people answer, but they lie to you.
(v) Wording of question --- phrasing may not be neutral (e.g. a loaded question).
In-class example: questions on legal abortion. The level of support (in
percent) differs depending on how the question is worded. From a 1992 series of
Time/CNN polls:
Question | In Favor (%) | Opposed (%) | Unsure (%)
"When the mother's life is at stake:" | 84 | 11 | 5
"When the mother's health is in danger:" | 82 | 12 | 6
"In cases of rape or incest:" | 79 | 16 | 5
"If the fetus will be born seriously deformed:" | 70 | 22 | 8
"For any reason during the first trimester:" | 47 | 44 | 9
"For any reason while the fetus cannot survive outside the womb:" | 67 | 23 | 10
"At no time during the pregnancy:" | 25 | 65 | 10
4. Design Issues
Sampling designs -- a design is a method used to choose a sample.
a. Simple random sample (SRS): at each draw, every person in the population
who has not yet been chosen has an equal chance of getting into the sample.
b. Not every sampling scheme is simple random sampling; other sampling schemes
include MULTISTAGE SAMPLING. (You don't need to know this for any test.)
Probability methods work well because
they are impartial and they give each member of the population a known
(calculable) chance of being selected.
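A simple random sample can be sketched with Python's standard random.sample, which draws without replacement. The population of 20 ID numbers, the sample size of 5, and the helper name simple_random_sample are assumptions made for illustration. Repeating the draw many times shows that each member's chance of selection is known: here it should be close to 5/20 = 0.25 for everyone.

```python
import random
from collections import Counter

def simple_random_sample(population, n, rng):
    """Draw n distinct members; no member is favored over another."""
    return rng.sample(population, n)

rng = random.Random(2)            # fixed seed so the run is repeatable
population = list(range(20))      # hypothetical ID numbers 0..19
trials = 20_000

# Count how often each person ends up in a sample of 5.
counts = Counter()
for _ in range(trials):
    counts.update(simple_random_sample(population, 5, rng))

inclusion_rates = {person: counts[person] / trials for person in population}
print(f"lowest rate  = {min(inclusion_rates.values()):.3f}")
print(f"highest rate = {max(inclusion_rates.values()):.3f}")
```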
5. Statistical Inference and how to think about the remainder of the course
STATISTICAL INFERENCE: here the parameters are unknown, and we use sample
outcomes (i.e. statistics) to draw conclusions about the values of the parameters.
Why are we interested in sampling, bias, and sampling methods? Because for
inference to work, we need that element of random chance to help us make
generalizations from samples to the larger population. In other words, with a
relatively small sample we would like to make statements (inferences) about the
much larger population it came from, but for those statements to be accurate,
we cannot allow bias.
Statistical inference is related to and relies on probability as follows: we
make assumptions about the parameters, and then test whether those assumptions
could plausibly have led to the sample outcomes (i.e. statistics) we observed.
We then make statements of chance or confidence to express the strength of our
conclusions about the value of the parameter.
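The "assume a parameter, then ask whether it could have produced our sample" logic can be sketched by simulation. All numbers are hypothetical: a sample of 1,000 in which 480 people (48%) said yes, and an assumed parameter of 50%. The sketch also assumes respondents answer independently, which is reasonable only when the population is much larger than the sample.

```python
import random

def simulate_count(n, p, rng):
    """Number of 'yes' answers among n independent respondents, each 'yes' with chance p."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(3)    # fixed seed so the run is repeatable
n = 1_000                 # hypothetical sample size
observed_count = 480      # hypothetical sample outcome (48% yes)
assumed_p = 0.50          # the assumption about the parameter being tested
expected_count = n * assumed_p

# Draw many samples from a world where the assumption is true and count how
# often they land at least as far from the expected count as our real sample.
simulations = 2_000
at_least_as_extreme = 0
for _ in range(simulations):
    count = simulate_count(n, assumed_p, rng)
    if abs(count - expected_count) >= abs(observed_count - expected_count):
        at_least_as_extreme += 1

chance = at_least_as_extreme / simulations
print(f"Share of simulated samples at least as extreme as ours: {chance:.3f}")
```

A large share means the assumed 50% could easily have produced a 48% sample; a very small share would cast doubt on the assumption.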
There are two important concepts that must be understood:
SAMPLING VARIABILITY -- this simply means that if additional samples are drawn,
the statistics generated from each of the samples will differ from one another
(for example, in the Wall Street Journal handout you might treat all of the
different surveys discussed as additional samples). Note that statistics from
larger samples should have less variability than those from small samples
(e.g. consider the effect of outliers on a small sample versus a large sample),
but variability is not closely related to the size of the population.
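Both claims about sampling variability can be checked with a rough simulation. The two populations (10,000 and 1,000,000 people, each exactly 50% ones), the sample sizes 25 and 400, and the helper name spread_of_sample_proportions are assumptions chosen for illustration.

```python
import random
import statistics

def spread_of_sample_proportions(population, n, repetitions, rng):
    """Standard deviation of the sample proportion across repeated simple random samples."""
    proportions = [sum(rng.sample(population, n)) / n for _ in range(repetitions)]
    return statistics.stdev(proportions)

rng = random.Random(4)  # fixed seed so the run is repeatable

# Two hypothetical populations with the same parameter but very different sizes.
small_population = [1] * 5_000 + [0] * 5_000
large_population = [1] * 500_000 + [0] * 500_000

spreads = {}
for label, population in [("N=10,000", small_population),
                          ("N=1,000,000", large_population)]:
    for n in (25, 400):
        spreads[(label, n)] = spread_of_sample_proportions(population, n, 500, rng)
        print(f"{label}, n={n}: spread = {spreads[(label, n)]:.3f}")
```

The spread should drop sharply when n goes from 25 to 400, while changing the population size by a factor of 100 should make little difference.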
The collection of all possible samples is a very important theoretical idea,
and the statistics associated with this collection have SAMPLING DISTRIBUTIONS.
Figure 3.6 (page 270) in your book is important for understanding this concept
and its properties:
a. It looks NORMAL (and for our purposes, is normal).
b. Its CENTER is equal to the value of the PARAMETER.
c. Its SPREAD, or variability, is very predictable; for now, let's say it gets
smaller as your samples get larger in size.
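The first two properties can be illustrated by simulating many samples and looking at the resulting collection of statistics. The parameter (48%), the sample size (400), and the number of repeated samples (3,000) are hypothetical choices, and each respondent is modeled as an independent "yes" with that chance. For a normal curve, roughly 68% of values fall within one standard deviation of the center and roughly 95% within two, so the last two numbers printed should be near those figures.

```python
import random
import statistics

rng = random.Random(5)   # fixed seed so the run is repeatable
parameter = 0.48         # hypothetical population proportion
n = 400                  # size of each sample
repetitions = 3_000      # number of repeated samples

# One sample proportion per repeated sample.
proportions = [
    sum(rng.random() < parameter for _ in range(n)) / n
    for _ in range(repetitions)
]

center = statistics.mean(proportions)
spread = statistics.stdev(proportions)

# Share of sample proportions within one and two spreads of the center.
within_1 = sum(abs(p - center) <= spread for p in proportions) / repetitions
within_2 = sum(abs(p - center) <= 2 * spread for p in proportions) / repetitions

print(f"center = {center:.3f} (parameter = {parameter})")
print(f"spread = {spread:.3f}")
print(f"within 1 spread: {within_1:.2f}, within 2 spreads: {within_2:.2f}")
```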
To sum it up, if you want to make inferences from samples to the population as a
whole, you want a situation where you have low bias and low variability (see
page 274). Your single sample is one sample from the collection of all possible
samples. Your hope is that your sample is "right on" the target and that its
statistics are very close, if not equal, to the parameters of the larger
population from which it was drawn.