Statistics 10

1. Sample Design and Statistical Inference

Goal: to make generalizations from collected data and to draw conclusions about summarized information (e.g. means, percentages) on variables.

2. Sampling and Inference

a. The POPULATION is the entire set of people (or animals, things) we wish to study.

b. A SAMPLE is a part of the population. Why sample? It is usually not feasible to query the entire population (e.g. it would cost too much or take too long), so researchers select (sample) a subgroup for questioning or analysis.

c. A numerical fact about a sample is a STATISTIC. It is some number that is used to describe a sample.

d. A numerical fact about a population is a PARAMETER. A STATISTIC is to a sample what a PARAMETER is to the population. If 48% is the resulting statistic from a sample, the people who conducted the study hope that it is a close approximation of the true population PARAMETER (i.e. they hope the parameter is also close to 48%).
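
The distinction between a statistic and a parameter can be sketched with a small simulation. Everything here is made up for illustration: the population size (100,000), the sample size (1,000), and the choice to set the true proportion at 48% to echo the example above. In real studies the parameter is unknown, which is the whole point of sampling.

```python
import random

random.seed(10)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 people, 48% of whom answer "yes" (coded 1).
N = 100_000
population = [1] * 48_000 + [0] * 52_000

# PARAMETER: a numerical fact about the whole population.
parameter = sum(population) / N

# STATISTIC: the same numerical fact, computed from a sample of 1,000 people.
sample = random.sample(population, 1_000)
statistic = sum(sample) / len(sample)

print(f"parameter (population proportion): {parameter:.3f}")
print(f"statistic (sample proportion):     {statistic:.3f}")
```

The statistic should land near the parameter but will rarely match it exactly.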

3. Problems: less than perfect samples, less than perfect subjects

Bias: If a sample is "representative", then a statistic can be a good estimate of the parameter; but if the sample systematically includes or excludes certain people, the sample is BIASED. This CNN "sample" is a bad sample (at least they are willing to issue a disclaimer at the bottom).
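
To see what systematic exclusion does to a statistic, here is a rough simulation. All of the numbers are assumptions chosen for illustration: 30% of a hypothetical population has no internet access, and support for some proposal is 40% among people online and 70% among people offline. An online poll can only reach the first group.

```python
import random

random.seed(19)  # fixed seed so the sketch is reproducible

# Hypothetical population of 50,000 people, stored as (online, supports) pairs.
population = []
for _ in range(50_000):
    online = random.random() < 0.70                        # 70% have internet access
    supports = random.random() < (0.40 if online else 0.70)
    population.append((online, supports))

def proportion(people):
    """Share of the people in the list who support the proposal."""
    return sum(supports for _, supports in people) / len(people)

parameter = proportion(population)

# Representative: a random sample drawn from everyone.
random_sample = random.sample(population, 1_000)

# Biased: a sample drawn only from people who can be reached online.
online_only = [person for person in population if person[0]]
biased_sample = random.sample(online_only, 1_000)

print(f"parameter:               {parameter:.3f}")
print(f"random-sample statistic: {proportion(random_sample):.3f}")
print(f"online-only statistic:   {proportion(biased_sample):.3f}")
```

Both samples have the same size. The online-only statistic misses because of who was left out, not because the sample was too small.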

Types of bias

( i) Voluntary-response bias --- certain people decide to talk to you.

( ii) Selection bias --- you systematically exclude certain subjects (undercoverage)

(iii) Non-response bias --- people don't bother to answer you

( iv) Response bias --- people answer, but they lie to you

( v) Wording of question --- phrasing may not be neutral (e.g. a loaded question).

In-class example: questions on legal abortion. The level of support (in percentages) differs depending on how the question is worded. From a 1992 series of Time/CNN polls:

Question	In Favor	Opposed	Unsure
"When the mother's life is at stake:"	84	11	5
"When the mother's health is in danger:"	82	12	6
"In cases of rape or incest:"	79	16	5
"If the fetus will be born seriously deformed:"	70	22	8
"For any reason during the first trimester:"	47	44	9
"For any reason while the fetus cannot survive outside the womb:"	67	23	10
"At no time during the pregnancy:"	25	65	10

4. Design Issues

Sampling designs -- a design is a method used to choose a sample

a. Simple random sample (SRS): every person in the population has an equal chance of getting into the sample with each draw, so every possible sample of a given size is equally likely to be chosen.
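
A minimal sketch of drawing an SRS in Python, assuming the population can be listed as ID numbers (the 500-person frame and the sample size of 10 are made up). The second half repeats the draw many times on a tiny hypothetical population to check that each member is included about equally often.

```python
import random
from collections import Counter

random.seed(48)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: ID numbers for a population of 500 people.
population_ids = list(range(1, 501))

# random.sample draws without replacement; every group of 10 IDs is equally likely.
srs = random.sample(population_ids, k=10)
print("one SRS of size 10:", sorted(srs))

# Check the "equal chance" property on a tiny population of 20, sample size 5.
small_population = list(range(20))
draws = 20_000
counts = Counter()
for _ in range(draws):
    counts.update(random.sample(small_population, k=5))

# Each member should be included in about 5/20 = 25% of the samples.
inclusion_rates = [counts[i] / draws for i in small_population]
print("lowest inclusion rate: ", min(inclusion_rates))
print("highest inclusion rate:", max(inclusion_rates))
```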

b. Not every sampling scheme is simple random sampling; other sampling schemes include MULTISTAGE SAMPLING. (You don't need to know this for any test)

Probability methods work well because they are impartial and they give each member of the population a known (calculable) chance of being selected.

5. Statistical Inference and how to think about the remainder of the course

STATISTICAL INFERENCE: the parameters are unknown, and we use sample outcomes (i.e. statistics) to draw conclusions about the value of the parameters.

Why are we interested in sampling and bias and sampling methods? Because for inference to work, we need that element of random chance to help us make generalizations from samples to the larger population. In other words, with a relatively small sample we would like to make statements (inferences) about the much larger population it came from, but in order for us to make accurate statements, we cannot allow bias.

Statistical inference is related to and relies on probability as follows: we make assumptions about the parameters, and then test to see if those assumptions could have led to the sample outcomes (i.e. statistics) we observed. We then make statements of chance or confidence to express the strength of our conclusions about the value of the parameter.
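
The logic in the paragraph above can be sketched as a simulation. The numbers are assumptions for illustration only: we suppose the parameter is 50%, and we suppose the observed statistic of 48% came from a sample of 1,000 people (the notes do not give a sample size). We then ask how often chance alone would produce a statistic at least that far from 50%.

```python
import random

random.seed(60)  # fixed seed so the sketch is reproducible

n = 1_000            # assumed sample size (not given in the notes)
assumed_p = 0.50     # the assumption about the parameter that we want to check
observed_yes = 480   # observed statistic: 48% of 1,000 said "yes"

simulations = 2_000
as_extreme = 0
for _ in range(simulations):
    # Draw one sample of size n from a population where the assumption is true.
    yes = sum(random.random() < assumed_p for _ in range(n))
    # Count it if the result is at least as far from 500 as the observed 480.
    if abs(yes - n * assumed_p) >= abs(observed_yes - n * assumed_p):
        as_extreme += 1

chance = as_extreme / simulations
print(f"share of simulated samples at least as extreme as 48%: {chance:.3f}")
```

If this share is large, the observed 48% is consistent with the assumed 50%. If it were very small, we would doubt the assumption. This is only one way to set up such a check; the course develops the formal versions later.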

There are two important concepts that must be understood:

SAMPLING VARIABILITY -- this simply means that if additional samples are drawn, the statistics generated from each of the samples will differ from one another (for example, in the Wall Street Journal handout you might treat all of the different surveys discussed as additional samples). Note that statistics from larger samples should have less variability than statistics from small samples (e.g. consider the effect of outliers on a small sample versus a large sample), but variability is not closely related to the size of the population.
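
Both claims can be checked with a rough simulation. The setup is hypothetical: a population in which exactly 50% would say "yes", sample sizes of 25 and 400, and population sizes of 10,000 and 1,000,000. For each combination we draw 1,000 simple random samples and measure how much the sample proportion varies.

```python
import random
import statistics

random.seed(64)  # fixed seed so the sketch is reproducible

def spread_of_sample_proportions(pop_size, n, repeats=1_000):
    """Standard deviation of the sample proportion across many simple random samples."""
    half = pop_size // 2
    population = [1] * half + [0] * (pop_size - half)   # 50% "yes"
    proportions = [sum(random.sample(population, n)) / n for _ in range(repeats)]
    return statistics.stdev(proportions)

results = {}
for pop_size in (10_000, 1_000_000):
    for n in (25, 400):
        results[(pop_size, n)] = spread_of_sample_proportions(pop_size, n)
        print(f"population {pop_size:>9,}  sample size {n:>3}  "
              f"spread of statistic {results[(pop_size, n)]:.4f}")
```

The spread should drop sharply when the sample size goes from 25 to 400, and stay almost the same when the population grows by a factor of 100.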

The collection of all possible samples is a very important theoretical idea and the statistics associated with this collection have SAMPLING DISTRIBUTIONS. Figure 3.6 (page 270) in your book is important for understanding this concept and its properties:

a. It looks NORMAL (and for our purposes, is normal)

b. Its CENTER is equal to the value of the PARAMETER

c. Its SPREAD, or variability, is very predictable and for now, let's say it gets smaller as your samples get larger in size.
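
We cannot list every possible sample, but we can approximate a sampling distribution by drawing many samples. The sketch below assumes a hypothetical parameter of 48% and a sample size of 400 (both made up), and checks properties a to c. The comparison value sqrt(p(1-p)/n) is the standard formula for the spread of a sample proportion; it is not derived in these notes.

```python
import math
import random
import statistics

random.seed(72)  # fixed seed so the sketch is reproducible

p = 0.48         # hypothetical parameter
n = 400          # hypothetical sample size
repeats = 3_000  # number of simulated samples

# One sample proportion (statistic) per simulated sample.
stats = [sum(random.random() < p for _ in range(n)) / n for _ in range(repeats)]

center = statistics.mean(stats)
spread = statistics.stdev(stats)
theory = math.sqrt(p * (1 - p) / n)   # standard formula for the spread

# For a normal curve, about 68% of values fall within 1 SD and about 95% within 2 SD.
within_1 = sum(abs(s - p) <= theory for s in stats) / repeats
within_2 = sum(abs(s - p) <= 2 * theory for s in stats) / repeats

print(f"center of the statistics: {center:.4f}  (parameter is {p})")
print(f"spread of the statistics: {spread:.4f}  (formula gives {theory:.4f})")
print(f"share within 1 SD: {within_1:.3f}   share within 2 SD: {within_2:.3f}")
```

The shares will not match 68% and 95% exactly, because a sample proportion can only take values in steps of 1/400, but they should be close.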

To sum it up, if you want to make inferences from samples to the population as a whole, you want a situation where you have low bias and low variability (see page 274). Your single sample is one sample from a collection of all possible samples. Your hope is that your sample is "right on" the target and that its statistics are very close if not equal to the parameters for the larger population from which it was drawn.