Statistics 50
Lecture 1


OVERVIEW, COLLECTING DATA, AND SAMPLING ISSUES

A. Overview

Statistics is the science of collecting, presenting, and interpreting data to answer questions. Thus, there are four primary issues:

  1. Determining the question and what data are of interest.
  2. Collecting the data.
  3. Summarizing the data graphically and numerically.
  4. Making generalizations from the data and drawing conclusions.

B. Determining the Question

  1. The first issue in statistics is always, "What question do we want to answer?" If there is no question, there can be no answer.
  2. Data do not speak for themselves; they must be gathered purposefully.
  3. Not all questions can be answered!
    a. "Data as they are" questions can be answered:

    If the U.S. Presidential elections were held today, what percent of Americans would vote for Bill Clinton as president?

    b. "What-if questions under replicable circumstances" can be answered:

    Among all American school children age 6 to 12, would giving Vitamin C prevent colds? (You can imagine testing this out on more and more children.)

    c. Questions about nonreplicable events, in general, can NOT be answered:

    "How many American Indians would be alive today if the American Revolution had failed?"

C. Collecting Data

  1. GUESSING ("data-free analysis"): the researcher makes up numbers.
  2. ANECDOTES ("stories"): the researcher collects stories.
  3. SAMPLING and OBSERVATIONAL STUDIES:
    a. In SAMPLING, the researcher looks at a part of the group, and makes inferences about the whole group.

    b. Typically, sampling is part of an OBSERVATIONAL STUDY, in which the researcher collects the data as they currently are.

  4. CONTROLLED EXPERIMENTS: the researcher ASSIGNS a treatment and observes the outcomes.

D. Examples

  1. Guessing -- "The population of Native Americans in 1670 was over 100 million."
    Ask yourself: is this a reasonable guess? Is it a bad guess? Ask what sources of data are being used. Is the researcher doing his or her job well?
  2. Anecdotes -- Journalists LOVE anecdotes.
    Los Angeles Times article on family stress, divorce and foreclosures in the Palmdale area of Los Angeles County. The author interviewed 5 families and they shared their commuting stories.

    If the author is trying to make the case that commuting time causes family stress, anecdotes, though amusing, may not be representative of the larger population of commuting Los Angelenos, or for that matter persons who commute from Palmdale.

  3. Observational Study -- Suppose you want to know who will win the US Presidential Election in November. Survey houses and the news media want to know how voters are likely to cast their ballots, so they choose a random sample and ask.
    Suppose you want to know whether drinking wine can lower your chance of Coronary Heart Disease (CHD). You could survey people, ask them if they drink wine and then measure their cholesterol. There are problems with answering questions in this way instead...
  4. Controlled Experiment -- Better than an observational study in that a researcher can begin to eliminate confounding and pin down cause and effect. Here researchers assign subjects to groups at random and impose a treatment on them.
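
    To see why random assignment helps with confounding, here is a minimal simulation sketch. Everything in it is made up for illustration: the confounder ("wealthy"), the risk levels, and the rule that wealthier people drink wine more often are assumptions, not findings. Wine is given NO effect on CHD in the simulation, so any gap between drinkers and non-drinkers comes from confounding alone.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable

def chd_risk(wealthy):
    # Assumption for illustration: wine has no effect; only wealth lowers risk.
    return 0.10 if wealthy else 0.20

def simulate(assign_wine, n=20_000):
    """Return CHD rates as (wine drinkers, non-drinkers)."""
    counts = {True: [0, 0], False: [0, 0]}  # drinks -> [cases, people]
    for _ in range(n):
        wealthy = rng.random() < 0.5
        drinks = assign_wine(wealthy)
        sick = rng.random() < chd_risk(wealthy)
        counts[drinks][0] += sick
        counts[drinks][1] += 1
    return tuple(counts[d][0] / counts[d][1] for d in (True, False))

# Observational study: people choose for themselves, and (by assumption)
# the wealthy choose wine more often.
observational = simulate(lambda wealthy: rng.random() < (0.8 if wealthy else 0.2))

# Controlled experiment: a coin flip assigns wine, regardless of wealth.
experimental = simulate(lambda wealthy: rng.random() < 0.5)

print("observational CHD rates (wine, no wine):", observational)
print("experimental  CHD rates (wine, no wine):", experimental)
```

    In the observational run the wine drinkers look healthier only because more of them are wealthy. With random assignment the two groups have about the same mix of wealthy and non-wealthy people, so the spurious gap should largely disappear.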

E. Sampling

  1. Basic Definitions
    a. The POPULATION is the entire set of people (or animals, things) we wish to study. Examples: All Americans. All parking meters in New York. All blue whales in the sea.

    b. A SAMPLE is a part of the population. See the Gallup handout on the Dole vs. Clinton voter poll taken over Labor Day. Why sample? Most times it is not feasible to query the entire population (e.g. too costly, would take too long) so researchers select (sample) a subgroup for questioning or analysis.

    c. A numerical fact about a sample is a STATISTIC. It is a number used to describe a sample. Example, from the Gallup handout -- of the 623 Americans surveyed in the telephone poll, 49% said they would vote for Clinton. The 49% is a statistic which describes the sample.

    d. A numerical fact about a population is a PARAMETER. A STATISTIC is to a sample what a PARAMETER is to the population. If 49% is what the telephone survey revealed, the people who conducted the survey hope that it is a close approximation of the true population PARAMETER.
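
    The difference between a statistic and a parameter can be sketched in a few lines of Python. The population below is hypothetical: its size (100,000) and its 50% support level are invented for the illustration, and only the sample size of 623 is taken from the poll mentioned above. In real polling the parameter is unknown; here we know it only because we built the population ourselves.

```python
import random

def sample_proportion(population, n, rng):
    """Draw a simple random sample of size n and return the share of 1s (the statistic)."""
    sample = rng.sample(population, n)
    return sum(sample) / n

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 100,000 voters; 1 = would vote for the candidate.
# The 50% support level is an assumption made up for this illustration.
population = [1] * 50_000 + [0] * 50_000
parameter = sum(population) / len(population)

# The statistic is computed from the sample only.
statistic = sample_proportion(population, 623, rng)

print(f"parameter (population proportion): {parameter:.3f}")
print(f"statistic (sample proportion):     {statistic:.3f}")
```

    Rerunning with a different seed gives a slightly different statistic each time, while the parameter stays fixed.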

  2. Sampling designs
    a. Simple random sample (SRS): every person in the population has an equal chance of getting into the sample (more precisely, every possible group of the chosen size is equally likely to be the sample).

    b. Not every sampling scheme is simple random sampling; other sampling schemes include STRATIFIED RANDOM SAMPLING and MULTISTAGE CLUSTER SAMPLING.

    There is a good example of stratified random sampling on p. 187, Example 3.5.
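
    As a rough sketch of how the two designs differ, the Python below draws a simple random sample and a stratified random sample from the same population. These notes do not define stratified sampling, so the code assumes the usual textbook meaning: divide the population into groups (strata), then take a simple random sample within each group. The urban/rural population and all function names are hypothetical.

```python
import random
from collections import defaultdict

def simple_random_sample(units, n, rng):
    """Every subset of size n is equally likely to be chosen."""
    return rng.sample(units, n)

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Split units into strata, then take a simple random sample inside each one."""
    strata = defaultdict(list)
    for u in units:
        strata[stratum_of(u)].append(u)
    chosen = []
    for name in sorted(strata):
        chosen.extend(rng.sample(strata[name], n_per_stratum[name]))
    return chosen

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: 900 urban and 100 rural residents (made-up numbers).
population = [("urban", i) for i in range(900)] + [("rural", i) for i in range(100)]

srs = simple_random_sample(population, 50, rng)
strat = stratified_sample(population, lambda u: u[0], {"urban": 45, "rural": 5}, rng)

rural_in_srs = sum(1 for region, _ in srs if region == "rural")
rural_in_strat = sum(1 for region, _ in strat if region == "rural")
print("rural residents in SRS:", rural_in_srs)                  # varies from draw to draw
print("rural residents in stratified sample:", rural_in_strat)  # fixed by the design
```

    The simple random sample may contain few or many rural residents by chance; the stratified design fixes that count in advance.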

  3. Bias
    a. Idea

    If a sample is "representative", then a statistic can be a good estimate of the parameter; but if the sample includes or excludes certain people systematically, the sample is BIASED.

    b. Types of bias

    ( i) Self-selection bias --- certain people decide to talk to you

    Read question 3.16 in your text (p. 195). Certain people responded to Shere Hite's survey.

    ( ii) Selection bias --- you include or exclude certain people

    Question 3.16 again: Shere Hite sent out 100,000 surveys to women who belonged to women's groups. Ask yourself, is her sample of women representative?

    (iii) Nonresponse bias --- people don't bother to answer you

    Still 3.16: only 4.5% of the surveys were returned. It would appear that the respondents were "unusual" in some manner, since the vast majority chose not to give their opinions.

    ( iv) Response bias --- people answer, but they lie to you

    In-class example: in some surveys of sexual behavior in the United States, men tend to overstate the number of sexual partners they have had since age 18, while women tend to understate it. Mathematically, the totals suggest that members of one or the other group -- or both -- are lying.

    ( v) Wording of question --- people answer based on phrasing

    In-class example: questions on legal abortion... interviewers receive different responses depending on how the question is worded.
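
    To get a feel for how much a low response rate can distort a statistic, here is a small simulation sketch. It is NOT a model of any actual survey: the population, the 30% "dissatisfied" share, and the two response probabilities are all invented, chosen only so that the overall response rate comes out near 4.5%.

```python
import random

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical population of 100,000 people; opinion 1 = "dissatisfied",
# 0 = "satisfied". The 30% figure is invented for the illustration.
N = 100_000
opinions = [1 if rng.random() < 0.30 else 0 for _ in range(N)]
parameter = sum(opinions) / N

def responds(opinion):
    # Assumed response behaviour: dissatisfied people return the survey
    # far more often than satisfied people do.
    p = 0.11 if opinion == 1 else 0.017
    return rng.random() < p

returned = [o for o in opinions if responds(o)]
response_rate = len(returned) / N
biased_estimate = sum(returned) / len(returned)

print(f"true share dissatisfied (parameter): {parameter:.3f}")
print(f"response rate:                       {response_rate:.3f}")
print(f"share among respondents (statistic): {biased_estimate:.3f}")
```

    Even with thousands of returned surveys, the statistic lands far from the parameter, because who answers depends on the very opinion being measured. A larger mailing would not fix this.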



Last Update: 26 September 1996 by VXL