Sampling

© 2004, S. D. Cochran. All rights reserved.

SAMPLING

Populations

A population is the universe of elements that we are interested in making an inference about

A population distribution can be characterized by its center, the population mean (µ, read "mu"), and its average spread (σ, read "sigma")

Both µ and σ are called parameters

Parameters are numerical facts about a population. They are real, true, and without error or bias contributing to the value.

parameter = true value (in population)

Example: Let's say we are interested in what college students think of snow. Does it keep them from getting to class on time? What is our population? We could define it as all college students in the nation; we could define it as all college students who attend universities where it snows. What would be the effects of these definitions on what we think the parameters are?

Samples

It is generally impractical and unnecessary to poll a population to find out the population parameters.

Instead we estimate population parameters from statistics we generate in samples

Samples are subsets of populations

A sample distribution can be characterized by its center (the mean, x̄) and its average spread (S.D.)

Both the mean and the S.D. are called statistics

Statistics are numerical facts about a sample. They are a mixture (in quantities we cannot be precisely sure of) of true score, chance error, and bias

statistic = true value + chance error + bias

or, by substitution

statistic = parameter + chance error + bias

We use statistics to estimate parameters. Obviously, if both error and bias are very small, then this approach can be very effective.
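The equation above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population of 10,000 values, the sample size of 100, and the constant bias of 3.0 are all hypothetical numbers chosen for illustration. It shows that chance error tends to average out over many samples, while bias does not.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 values; its mean is the parameter.
population = [random.gauss(50, 10) for _ in range(10_000)]
parameter = statistics.mean(population)

def sample_statistic(n, bias=0.0):
    """Mean of a simple random sample of size n, plus an optional constant bias."""
    sample = random.sample(population, n)
    return statistics.mean(sample) + bias

# Draw many samples, once without bias and once with an assumed bias of 3.0.
unbiased = [sample_statistic(100) for _ in range(2_000)]
biased = [sample_statistic(100, bias=3.0) for _ in range(2_000)]

# Averaged over many samples, chance error cancels out; bias does not.
unbiased_gap = statistics.mean(unbiased) - parameter
biased_gap = statistics.mean(biased) - parameter
print(round(unbiased_gap, 2), round(biased_gap, 2))
```

Any single statistic in either list still differs from the parameter by its own chance error; only the average across many samples settles down.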

Selecting our sample

Techniques for selecting samples affect both the chance error and the bias portion of the equation.

The parameter of the population is real, whether or not we can actually know what it is.

Example: The percentage of college students in the country who believe that snow impairs their ability to get to class on time is a real, though precisely unknowable percentage.

In the last 2 lectures we’ve been talking about estimating chance error. Chance error has been defined as random differences in measurement. Every time we measure something, the result will vary slightly, moving the value obtained to one side or the other by some amount. In the next lecture, we will learn how to take this into account in our estimates. But this involves yet a third distribution, the sampling distribution. It is not the population distribution, and it is generally not the distribution of values we observe in our sample. It is a hypothetical distribution of chance that we create using the estimated probability that an event will occur.

The key to sampling techniques is to try to select a sample where bias is minimized (as close to zero as possible)

Simple random sampling

Each element in the population has an equal probability of being selected.

Example: Imagine a population of 20 individuals, half men, half women. We want to estimate the prevalence of a disease called mocus. If we were to select a sample of four people to study using simple random sampling, then the selection probability is 4/20, or the chances of being selected are .20.

We could do this by listing all elements in the population and then randomly (without bias) selecting the elements to include in the sample

Researchers use many techniques to approximate random selection, including random number tables, coin flips, and programming computers to start at some random spot in a list and select every 10th subject

In our population we could list all 20 elements and select every 5th subject. But what would happen if we had some bias in how we listed our 20 subjects?
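One way to see the danger is a small sketch. The listing below is hypothetical: it is an invented ordering of 10 men and 10 women in which a man happens to occupy every 5th position. Selecting every 5th subject from such a list gives a sample that looks nothing like the population.

```python
def every_kth(listing, k):
    """Systematic selection: take the k-th, 2k-th, 3k-th, ... entries of a list."""
    return listing[k - 1::k]

# Hypothetical listing of the 20 people (10 men, 10 women) in which
# a man happens to sit in every 5th position.
block = ["F", "F", "M", "F", "M", "F", "M", "F", "M", "M"]
patterned_listing = block * 2

sample = every_kth(patterned_listing, 5)
share_men_in_listing = patterned_listing.count("M") / len(patterned_listing)
share_men_in_sample = sample.count("M") / len(sample)

print(sample)                 # the four people selected
print(share_men_in_listing)   # share of men in the full listing
print(share_men_in_sample)    # share of men in the sample
```

If the order of the list is unrelated to anything we care about, selecting every 5th subject behaves much like random selection. The problem arises only when the ordering has a pattern that lines up with the selection interval.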

Example: Here, I programmed the computer to randomly select four numbers between 1 and 20 to create 3 independent samples:

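Element	Gender	Has mocus? (population)	Sample 1: Sampled?	Sample 1: Has mocus?	Sample 2: Sampled?	Sample 2: Has mocus?	Sample 3: Sampled?	Sample 3: Has mocus?
1	F	-	-	-	-	-	-	-
2	M	-	-	-	-	-	-	-
3	F	-	-	-	-	-	-	-
4	M	-	-	-	YES	-	-	-
5	M	-	-	-	-	-	YES	-
6	M	-	-	-	-	-	YES	-
7	M	-	YES	-	-	-	-	-
8	M	-	-	-	-	-	-	-
9	M	YES	-	-	-	-	-	-
10	F	-	-	-	-	-	-	-
11	F	YES	-	-	YES	YES	-	-
12	M	YES	-	-	-	-	-	-
13	M	YES	YES	YES	-	-	-	-
14	F	-	-	-	-	-	-	-
15	F	-	YES	-	-	-	YES	-
16	F	-	-	-	-	-	-	-
17	F	YES	-	-	-	-	-	-
18	F	-	YES	-	YES	-	-	-
19	M	-	-	-	-	-	-	-
20	F	-	-	-	YES	-	YES	-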

These are the numbers selected by the computer, so elements with these numbers were selected:

Random # selected in Sample 1	Random # selected in Sample 2	Random # selected in Sample 3
18	20	5
7	18	20
13	11	6
15	4	15



In the population the prevalence of mocus is 5 per 20 people = .25

In sample 1: 1 per 4 = .25

In sample 2: 1 per 4 = .25

In sample 3: 0 per 4 = .00

This example demonstrates that simple random sampling from a population can reproduce a population parameter fairly well, but it is not perfect.
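The mocus example can be rerun as a short sketch. It takes elements 9, 11, 12, 13, and 17 as the five people with mocus, and it assigns the computer's random numbers to the three samples by reading that table row by row; both readings come from the tables above. The 10,000 extra samples and the seed are additions for illustration only.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

population = list(range(1, 21))      # elements 1 through 20
has_mocus = {9, 11, 12, 13, 17}      # the five cases, as read from the table

def prevalence(elements):
    """Proportion of the given elements that have mocus."""
    return sum(e in has_mocus for e in elements) / len(elements)

# The three samples drawn in the example above.
example_samples = {
    "Sample 1": [18, 7, 13, 15],
    "Sample 2": [20, 18, 11, 4],
    "Sample 3": [5, 20, 6, 15],
}
example_prevalences = {name: prevalence(s) for name, s in example_samples.items()}

# Many more simple random samples of size 4, drawn without replacement.
estimates = [prevalence(random.sample(population, 4)) for _ in range(10_000)]
average_estimate = sum(estimates) / len(estimates)

print(prevalence(population))   # the population parameter
print(example_prevalences)      # the three statistics from the example
print(round(average_estimate, 3), min(estimates), max(estimates))
```

With only four people per sample, a single estimate can only be 0, .25, .50, .75, or 1, so individual samples often miss the parameter. Across many samples the estimates center close to .25.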

Sampling can be very complicated with sampling conducted in stages and clusters. What is key is knowing the selection probability for each element. As long as this can be estimated, weights can be used later to ensure that the sample as composed reflects the composition of the population. In multistage sampling, there are two important concepts: clusters and stages

Clusters refer to the fact that elements exist within units. If elements are people they exist within social units. So, people live in households, households are located in blocks, blocks are in neighborhoods, neighborhoods are in counties, and counties are in states. We don't generally sample people directly, but reach them through their social units.

Stages refer to the steps of unit selection.

Multi-stage designs might select first a random sample of states. The second stage might be a random sample of counties within the selected states. The third stage might be a random sample of phone numbers within selected counties. The final stage would be the selection of a person within a household that is reachable by the phone numbers randomly chosen to be called.

Example: Imagine a sorority in which 31 women live, two or three to each room. We want to select a sample of 5 women to ask their opinion of the sorority's food service. With simple random sampling the probability of selection is 5/31 = .16. But we are going to select these women in a multistage design. There are 12 rooms in all. In a multistage sampling design we might select the 5 women:

	First Stage	Second Stage
Task:	Select Room	Select Respondent
How:	Pick 5 of 12 rooms randomly	Pick 1 woman from each room randomly

Remember the multiplication rule where there is independence. We can calculate the probability of selection for each woman:

P(A and B) = P(A)P(B) = (5/12)(1/number of women in room)

So a woman in a 2-person room has a probability of selection of (5/12)(1/2) = .21. A woman in a 3-person room is less likely to be selected, (5/12)(1/3) = .14.

Every woman has a chance to be selected, so there is no selection bias, but her probability of selection varies depending upon the size of the room cluster she lives in. Later when we analyze the data we collect, we would compensate for the fact that women in 2-person rooms reflect the opinions of 2 people, while those from 3-person rooms reflect the opinions of 3 people by using weights, a topic we won’t cover in this class.
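The selection probabilities for the sorority example can be checked with exact fractions. Two things in this sketch are assumptions rather than part of the notes. First, the weight shown is the inverse of the selection probability, which is one common choice. Second, the split into 5 two-person rooms and 7 three-person rooms is inferred from the totals (31 women in 12 rooms, two or three to a room).

```python
from fractions import Fraction

ROOMS_TOTAL = 12
ROOMS_PICKED = 5

def selection_probability(women_in_room):
    """P(room is picked) times P(woman is picked, given her room was picked)."""
    p_room = Fraction(ROOMS_PICKED, ROOMS_TOTAL)
    p_woman_given_room = Fraction(1, women_in_room)
    return p_room * p_woman_given_room

for size in (2, 3):
    p = selection_probability(size)
    weight = 1 / p  # assumed weighting scheme: inverse of selection probability
    print(size, p, round(float(p), 2), weight)

# Inferred room mix: 5 two-person rooms (10 women) and 7 three-person rooms (21 women).
expected_sample_size = 10 * selection_probability(2) + 21 * selection_probability(3)
print(expected_sample_size)
```

The two weights stand in the ratio 2 to 3, which matches the idea that a woman from a 2-person room speaks for 2 people and a woman from a 3-person room speaks for 3. The selection probabilities add up to 5 across all 31 women, the planned sample size.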

Bias could be introduced, of course, in our selection of women within each room if we aren't careful. How might that occur?

The introduction of bias

Selection bias

This occurs when some elements in the population are more likely to end up in the sample than others, for reasons that are unmeasured and/or unplanned. In this case, the probability of being selected into the sample is not estimated correctly. Selection bias refers to a problem in our invitations to the elements in the population to be in our sample.

Quota sampling where interviewers have the option of choosing or not choosing a subject

Most sampling is quota sampling at its most basic--we set our desired sample size or quota and recruit subjects until we fill it.

In true quota sampling, the cells (male, White, not married, between the ages of 18 and 25, living in a suburban setting, and working full time) that we are trying to fill are highly articulated.

The basic principle used to fill cells is ease of recruitment. If the factor we are interested in is confounded with ease of recruitment, selection bias will occur

Example: Many marketing surveys in shopping malls let the interviewer select individuals who meet certain characteristics. Another example is offers of tickets to screenings at the studios.

Example: Imagine that, in our population of 20 elements, we selected the 4 who were closest to us.
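What happens when ease of recruitment is confounded with the factor of interest can be sketched with a simulation. Everything here is hypothetical: a population of 1,000 people, a trait held by about 25% of them, and an assumed rule that people with the trait are harder to recruit. The quota is filled with the 100 easiest people and compared with a simple random sample of the same size.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1,000 people; about 25% have the trait of interest.
# Assumption for illustration: people with the trait are harder to recruit.
population = []
for _ in range(1000):
    has_trait = random.random() < 0.25
    ease = random.random() * (0.5 if has_trait else 1.0)
    population.append((has_trait, ease))

true_rate = sum(t for t, _ in population) / len(population)

# Quota of 100 filled with the people who are easiest to recruit.
easiest = sorted(population, key=lambda person: person[1], reverse=True)[:100]
quota_rate = sum(t for t, _ in easiest) / len(easiest)

# Simple random sample of the same size, for comparison.
srs = random.sample(population, 100)
srs_rate = sum(t for t, _ in srs) / len(srs)

print(round(true_rate, 2), quota_rate, srs_rate)
```

In this setup the quota sample badly understates the trait because the people who carry it are rarely among the easiest to recruit. The simple random sample is off only by chance error.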

Other forms of sampling also have selection bias--including recruitment for subjects in psychology experiments. The bias here is that researchers generally define their population as people in general, but not all people are college students. So most have no chance of being selected into a study

Example: What if we tried to examine the effects of snow on class attendance by only surveying UCLA students?

Response bias

Response bias is the general term for the fact that people cannot or will not always report facts accurately

Distortions occur because of how questions are worded, the order in which they are administered, social desirability, interviewer characteristics, etc.

Memory recall

Non-response bias

This is bias that occurs because some subjects are harder to reach than others. While selection bias refers to differences in selection probability (often with an effectively zero probability for many elements), nonresponse bias is introduced by the failure to sample an element that has been selected. What if the door selected at the sorority had no one home at the time? Selection bias is a failure to send an invitation; nonresponse bias is a failure of response to an invitation.

Examples: Young males who are not home when the phone rings. Busy people who do not agree to be interviewed
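The first example can be sketched as a simulation. All of the numbers are invented for illustration: a population in which 30% are young men, and assumed response rates of 30% for young men and 70% for everyone else. Selection is done by simple random sampling, so any distortion comes from nonresponse alone.

```python
import random

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical population: 30% are young men (an assumed figure).
population = ["young man"] * 3000 + ["other"] * 7000

# Assumed chances that a selected person actually responds.
response_rate = {"young man": 0.3, "other": 0.7}

# Selection is a simple random sample, so there is no selection bias.
selected = random.sample(population, 1000)

# Only some of the selected people respond.
respondents = [p for p in selected if random.random() < response_rate[p]]

share_selected = selected.count("young man") / len(selected)
share_respondents = respondents.count("young man") / len(respondents)

print(round(share_selected, 2), round(share_respondents, 2))
```

The people invited resemble the population, but the people who answer do not. Any opinion that differs between young men and everyone else would be misestimated from the respondents alone.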

There are also many other types of bias--anything that alters our estimates away from the population parameter of interest (apart from chance) is bias.
