Statistics M11/Economics 40
Lecture 7


FINISHING UP POPULATIONS AND SAMPLES

A. Design Issues

Statisticians are well aware of the problem of bias. Only in the last 50 years have survey organizations used probability methods (methods that rely on chance) to draw their samples. Funny, we think of random chance as a nuisance -- unpredictability -- but it is this randomness that gives sampling the power it has.

Sampling designs

a. Simple random sample (SRS): every person in the population has an equal chance of getting into the sample with each draw. In practice this is drawing at random without replacement (because it would not make sense to select the same person or measure the same animal/thing twice).
b. Not every sampling scheme is simple random sampling; other sampling schemes include MULTISTAGE SAMPLING.
The idea here is that a large population (e.g. the US) is broken down into increasingly smaller areas, and at each stage a single unit is drawn randomly until the unit of interest (e.g. households) is reached.
Note: these methods can be applied to things other than households. Examples might be estimating the corn harvest, sampling firms on hiring expectations, etc.
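The multistage idea above can be sketched in a few lines of Python. The population below is entirely hypothetical (made-up regions, cities, and household labels, just for illustration): each stage draws one unit at random from the level above.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population: regions -> cities -> households
population = {
    "West": {"CityA": ["hh1", "hh2", "hh3"], "CityB": ["hh4", "hh5"]},
    "East": {"CityC": ["hh6", "hh7"], "CityD": ["hh8", "hh9", "hh10"]},
}

# Stage 1: draw a region; Stage 2: draw a city within it;
# Stage 3: draw a household within that city.
region = random.choice(list(population))
city = random.choice(list(population[region]))
household = random.choice(population[region][city])
print(region, city, household)
```

Each draw is random, so the final household is reached by chance at every stage -- the impartiality that makes probability methods work.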
Probability methods work well because they are impartial. This is the key. Bias is a systematic distortion -- deliberate or unintended -- a misrepresentation of the issue in front of us. Think of the researcher from the previous lecture -- her question regards the attitudes of ALL U.S. women on love and marriage -- but did she really get a good sample of "all"?

 

MODELING AND FINDING EXPECTED VALUES (4.1)

A. Expected Value or the mean of a Random Variable (4.1)

  1. Definition. RANDOM VARIABLE. Your book doesn't clearly define this, but it is an important concept. It roughly says a random variable is a "number or amount of a chance quantity". Maybe a better definition is "a random variable is a variable whose value is a numerical outcome of a random phenomenon. The outcome of any given trial is not predictable, but over very many repetitions the outcomes have a regular distribution."
  2. Examples: (from your book) The number of cereal boxes you must buy before you "collect all 6". The number of games it takes on average to win the world series. The number of girls in a 3 child family. (but also) Will the Dow Jones Industrial Average close up or close down tomorrow? What is the starting salary for a graduate of UCLA? How many widgets will my company sell next year?
  3. Definition. The MEAN OR EXPECTED VALUE of a random variable is its most typical value. For example: you toss a coin ten times. How many heads do you expect? Suppose you got 4 heads and 6 tails -- is that unreasonable? What about 6 heads and 4 tails? Or 9 heads and 1 tail? You get a sense that there is a distribution: 5 is common, 4 and 6 are quite possible but not as common as 5, and 1 and 9 are rare (but they could happen!)
  4. This "typical value" or "mean" or "average" is a "long-run" concept. That is, for any single trial/experiment, anything is possible. But over many trials, we expect a typical value.
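The "long-run" idea is easy to see by simulation, very much in the spirit of the book's approach. A minimal sketch (the function name and trial count are my own choices): any single run of 10 tosses can give anything from 0 to 10 heads, but the average over many runs settles near 5.

```python
import random

random.seed(42)  # reproducible illustration

def heads_in_10_tosses():
    """One trial: count the heads in 10 fair coin tosses (1 = head, 0 = tail)."""
    return sum(random.randint(0, 1) for _ in range(10))

# A single trial is unpredictable...
print(heads_in_10_tosses())

# ...but the long-run average over many trials settles near 5.
trials = [heads_in_10_tosses() for _ in range(10_000)]
long_run_average = sum(trials) / len(trials)
print(round(long_run_average, 2))  # close to 5
```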

 

B. The approach of this text: Experimental, Data Collecting

  1. Your book relies heavily on simulation experiments to solve problems or answer these questions. It's a novel approach because many introductory statistics texts give you formulas to calculate expected values. The result of relying on simulations is a "gut" or "intuitive" feel for statistics -- which is not bad at all. (I'll give you formulas later for those of you who would rather have them).
  2. You can use the random number tables to answer some of the homework questions in the book. Or you can use Excel or some other spreadsheet program that can generate "discrete distributions". See handouts.
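If you'd rather not use the random number tables or a spreadsheet, Python's standard library can generate draws from a discrete distribution directly. A sketch (the 30/70 split is just an example, not from the book):

```python
import random

random.seed(0)  # reproducible illustration

# A discrete distribution with unequal chances:
# e.g. a box with tickets marked 0 (30% of them) and 1 (70% of them).
outcomes = [0, 1]
weights = [0.30, 0.70]

# 10,000 draws with replacement from that box
draws = random.choices(outcomes, weights=weights, k=10_000)
share_of_ones = sum(draws) / len(draws)
print(round(share_of_ones, 2))  # near 0.70
```

`random.choices` draws with replacement, which matches the box-model requirement discussed below.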

 

C. The foundation: THE BOX MODEL

  1. They call this "step 1" (we'll get to the steps later). Basically, what they are trying to get you to do is conceptualize any question in the form of a probability model. A box model is the simplest. There are also "die" models (e.g. a six-sided die), but those could be rewritten as "box" models. As your book points out, they are versatile.
  2. Box models can contain uniformly distributed outcomes (i.e. equally likely outcomes), but they can also be non-uniform. More important, sampling from the box must be done with replacement, and the "draws" from the box must be independent (we'll talk more about this in the next lecture); for now, independence just means "unrelated".
  3. Examples: from problem #4 in section 4.1 in your text, a student takes a true-false quiz by guessing. Assume his chance of getting a question correct is 50%. How many answers can he expect to get right by guessing on a 10-question test? The "right model" is a coin toss (heads, tails) or a box (with tickets marked 0 or 1). Suppose he had a 70% chance of guessing correctly…how would this change the model?
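The quiz example above translates directly into a box-model simulation. A sketch (function name and repetition count are my own): each question is a draw from a box of tickets marked 1 (correct) and 0 (wrong), and changing the guessing chance only changes the ticket mix in the box.

```python
import random

random.seed(7)  # reproducible illustration

def quiz_score(n_questions=10, p_correct=0.5):
    """Simulate guessing on a true-false quiz: each draw is a ticket
    marked 1 (correct, chance p_correct) or 0 (wrong)."""
    return sum(1 if random.random() < p_correct else 0
               for _ in range(n_questions))

# Repeat many times; the average score settles near 10 x 0.5 = 5.
scores_50 = [quiz_score() for _ in range(10_000)]
print(round(sum(scores_50) / len(scores_50), 1))  # near 5

# A 70% guessing chance changes only the box: the average settles near 7.
scores_70 = [quiz_score(p_correct=0.7) for _ in range(10_000)]
print(round(sum(scores_70) / len(scores_70), 1))  # near 7
```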

D. More traditionally, a formula (optional)

  1. The mean or expected value of a random variable X, denoted μ_X, is the long-run average value of X. For a discrete random variable X taking on values x1, x2, x3, ..., xn with probabilities p1, p2, ..., pn, the mean of X is

μ_X = Σ (xi × pi),   summing over i = 1 to n.

In English: the mean is found by summing the products of each outcome and its associated probability/chance.

Associate the mean of a random variable with the idea of a "most likely outcome".
A basic example: a coin toss has 2 outcomes, heads or tails. Suppose we're interested in the count of heads (so a head = 1 and a tail = 0) in some number of tosses. We expect 50% for each outcome (i.e. half heads, half tails). The mean per toss is .50 or 1/2 or 50%.
In a situation of 10 tosses (draws), you wind up with a mean (or expected value) of 5 (10 times 1/2). Or think of it as: 5 heads is the most likely outcome.
A table might illustrate this idea more clearly:

Outcome                 0 (tail)    1 (head)
Probability             .50         .50
Outcome × probability   0           .5
 
If this was repeated 10 times, .5 x 10 = 5. I expect 5 heads.
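The formula version of this calculation takes only a few lines of Python (a sketch; the outcomes and probabilities are the coin-toss values from the table above):

```python
# Expected value by the formula: sum of each outcome times its probability.
outcomes = [0, 1]             # tail, head
probabilities = [0.5, 0.5]

mean_per_toss = sum(x * p for x, p in zip(outcomes, probabilities))
print(mean_per_toss)          # 0.5

# Over 10 tosses, the expected number of heads is 10 x 0.5 = 5.
expected_heads = 10 * mean_per_toss
print(expected_heads)         # 5.0
```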