© 2000, S. D. Cochran. All rights reserved.

REVIEW FOR MIDTERM 2

A. Topics we have covered in this section

1. The law of averages, the central limit theorem

a. The law of averages states that the chance error observed over repeated samplings with replacement from a population is likely to grow in absolute terms but shrink relative to the number of samplings. So if we toss a coin 4 times, the deviation from what we expect (half heads) will be small in absolute value but quite likely large as a percentage of the tosses (we would not be surprised to observe 3 heads in four tosses, 75% as opposed to 50% of tosses). However, if we toss the same coin 1000 times, although the deviation (in numbers of heads) from 500 heads will be larger than in the 4-toss situation, the percentage of heads is likely to be closer to 50% than when there were fewer tosses.

1. One implication of this is that although chance error pushes what we observe to one side or the other of the true population parameter, with increasing sample size this error becomes a smaller and smaller percentage of our observed value.

2. So one way to decrease the influence of chance error in the equation (observed value = true value + chance error + bias) is to increase our sample size.

3. The average size of this chance error is referred to as a standard error (SE). A small simulation sketch follows.
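The following is a minimal Python simulation of the law of averages, not part of the original notes; the toss counts and number of trials are arbitrary choices. It shows the absolute error in the number of heads growing while the percentage error shrinks.

```python
import random

# Simulate n tosses of a fair coin many times; track the average
# absolute chance error in the count of heads and in the percentage.
def average_errors(n, trials=2000):
    abs_err = pct_err = 0.0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n))
        abs_err += abs(heads - n / 2)
        pct_err += abs(100 * heads / n - 50)
    return abs_err / trials, pct_err / trials

for n in (4, 100, 1000):
    a, p = average_errors(n)
    print(f"{n:>5} tosses: error in heads ~ {a:5.1f}, in percent ~ {p:4.2f}%")
```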

b. The Central Limit Theorem states that if random samples of a fixed size N are drawn from any population (regardless of the form of the population distribution), then as N becomes large, the distribution of sample means approaches normality, with the overall mean approaching µ and the standard error approaching sigma/(the square root of N).

1. The implication of this is that if we repeatedly draw samples from a distribution, the means of these samples will have an approximately normal distribution centered on the population mean.

2. Because our one sample mean is an element in this distribution, using it to estimate the population mean is smarter than choosing any number off the top of our heads.

3. It also means that no matter what form the underlying population distribution or our sample distribution takes, we can use our estimate of the SE of this sampling distribution of the means, together with the normal distribution, to estimate the size of deviation away from the population mean that we would expect to see due to chance alone (see the simulation sketch below).
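A hedged illustration in Python (the exponential population, seed, and sample size are arbitrary choices, not from the notes): even for a skewed population, the means of repeated samples pile up in a roughly normal shape around µ, with spread close to sigma/(the square root of N).

```python
import random
import statistics

# Draw repeated samples from a skewed (exponential) population and
# collect their means; the CLT says these means are roughly normal
# with mean mu and standard error sigma / sqrt(N).
random.seed(1)
N = 50                  # fixed sample size
mu, sigma = 1.0, 1.0    # mean and SD of an exponential(1) population

means = [statistics.mean(random.expovariate(1.0) for _ in range(N))
         for _ in range(5000)]

print("mean of sample means:", round(statistics.mean(means), 3))   # ~ mu
print("SD of sample means:  ", round(statistics.stdev(means), 3))  # ~ sigma/sqrt(N)
print("predicted SE:        ", round(sigma / N ** 0.5, 3))
```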

2. Expected values of a discrete random variable

a. We introduced the notion of a discrete random variable, noting that we can assign a number to each outcome and then treat that number numerically: multiplying, dividing, adding, and subtracting.

b. We learned that we can calculate the expected or average value of an outcome both in terms of one draw from the population (or box) and in terms of repeated draws with replacement.

3. How to create boxes with counts and draws

a. We learned how to specify the box or population: in one situation, we assigned numbers to all possible outcomes so that we could calculate a sum over repeated draws from the box.

b. In a second situation, we recoded the possible values in the box into 0’s and 1’s to keep count of how many times over the repeated draws the outcome matched our criterion.

4. Expected value of one draw vs. net gain or sum of the draws

a. The expected value of one draw is simply the average of the values in the box.

b. The expected net gain or sum of the draws is the running tally over the number of samplings, which on average is simply the average of the box times the number of draws.
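A small worked sketch of these two quantities in Python (the box values are just an example, not from the notes):

```python
# A box model: the expected value of one draw is the average of the
# box; the expected value of the sum of n draws is n times that.
box = [1, 2, 3, 4, 5, 6]           # e.g., the faces of a fair die
n_draws = 4

ev_one_draw = sum(box) / len(box)  # 3.5
ev_sum = n_draws * ev_one_draw     # 14.0

print("expected value of one draw:", ev_one_draw)
print("expected value of the sum of", n_draws, "draws:", ev_sum)
```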

5. Standard errors vs. standard deviations

a. SE’s live in hypothetical distributions, created either from our knowledge of the underlying probability of different outcomes or estimated from what we observe in samples; either way, an SE describes spread due to chance error.

b. So I can calculate the SE of 4 tosses of a fair die without ever actually tossing the die.

1. The expected value of the average of four tosses is 3.5 per toss. The expected value of the sum of four tosses is 4*average of the box or 14.

2. The SD of the box or population is the square root of (((1-3.5)² + (2-3.5)² + (3-3.5)² + (4-3.5)² + (5-3.5)² + (6-3.5)²)/6) = 1.71.

3. The SE of the sum is the square root of (4)*SD of the box = 2*1.71 = 3.42.

4. So in 4 tosses I would expect to see a sum of 14, give or take about 3.4, 68 percent of the time, if the die is fair. Notice here that I am invoking a statement that involves the normal distribution (the 68% of the time comment). I am relating what I expect to observe in terms of chance back to the normal probability distribution because I know that chance error is distributed like a bell-shaped curve. That's what the central limit theorem tells me.

5. This deviation from 14 is consistent with chance error inflating or deflating the actual value I might observe if I were to toss a die 4 times.

c. In contrast, standard deviations in samples are simply descriptions of average spread in what we observe, not what is possible or predicted. So if I had a weighted die where only the number 6 came up and I actually tossed it four times, I would observe four 6’s. The mean is 6; the sum of the 4 tosses is 24. The SD of those 4 elements is 0--there is no variation. The SE, if I calculated it, would be zero. My expectation of the population the die came from is one where 6 will always come up--a good guess. My estimate of the SE suggests there is no opportunity for chance error. It’s a sure thing.

d. I could also, in reality, toss a fair die 4 times and, let's say, just as an example, get 1, 4, 5, 2. The sum is 12 (not the 14 we expected) and the average value per toss is 3 (not the 3.5 we expected). The SD of the sample containing 1, 4, 5, and 2 is 1.58. Notice this is not the SD of the box. It is the square root of (((1-3)² + (4-3)² + (5-3)² + (2-3)²)/4). I could use these two observed statistics to estimate the size of the SE, which would be the square root of the number of draws * the SD = 2*1.58 = 3.16, close to but not the SE we predicted from knowing the box alone (where it was 3.42). The central limit theorem allows us to do this. A sketch of this comparison follows.
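A brief Python sketch of this comparison (for illustration; the sample values come from the example above):

```python
import math

box = [1, 2, 3, 4, 5, 6]
n = 4

# SE of the sum predicted from the box alone.
mean_box = sum(box) / len(box)
sd_box = math.sqrt(sum((x - mean_box) ** 2 for x in box) / len(box))
se_predicted = math.sqrt(n) * sd_box            # ~3.42

# SE of the sum estimated from one observed sample of 4 tosses.
sample = [1, 4, 5, 2]
mean_s = sum(sample) / len(sample)
sd_s = math.sqrt(sum((x - mean_s) ** 2 for x in sample) / len(sample))
se_estimated = math.sqrt(n) * sd_s              # ~3.16

print(f"SE from the box:    {se_predicted:.2f}")
print(f"SE from the sample: {se_estimated:.2f}")
```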

6. Binomial variables (two outcomes)

a. These are situations where the population has only two values (yes/no for example)

b. Here we learned to calculate the SE for the count. The formula is the same as the SE for the sum: both are the square root of (the number of draws)*the SD of the box.

c. If the box contains only 1’s and 0’s, then the SD is simply the square root of (probability of 1*probability of 0). For a die, if we recoded it to count the number of 1’s and 2’s that appeared, the SD of the box would be the square root of (2/6*4/6).

d. If the two values in the box are not 1 and 0, then we have to multiply by the difference between the two values. For example, if we had a population where there were 2 10’s in the box and 3 5’s, then the SD would be (10 - 5)*the square root of (2/5*3/5).

e. The SD is the spread we can observe in the box. The SE for the count is the average deviation, due to chance, of the observed count from the expected count over the number of draws from the box. A short sketch follows.
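A minimal Python sketch of these two SD shortcuts (the numbers reuse the examples above):

```python
import math

# SE of the count for a 0-1 box: sqrt(draws) * sqrt(p1 * p0).
def se_count(draws, p_one):
    sd_box = math.sqrt(p_one * (1 - p_one))
    return math.sqrt(draws) * sd_box

# Counting 1's and 2's in 60 rolls of a fair die (p = 2/6).
print(round(se_count(60, 2 / 6), 2))

# For a two-value box that is not 0/1, multiply the 0-1 SD by
# (big value - small value): e.g., a box with 2 tens and 3 fives.
sd_box = (10 - 5) * math.sqrt(2 / 5 * 3 / 5)
print(round(sd_box, 2))   # SD of the box, ~2.45
```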

7. Using the normal curve to estimate chance

a. Because chance error is distributed normally, if we have a large number of samplings with replacement from our population (or, to put it another way, a large number of draws from the box), then we can use the normal table in the back of the book to estimate the probability, or chances, that we would observe a particular outcome in a hypothetical distribution.

b. For example, if I toss a coin 1000 times, I expect to see 500 heads, plus or minus an SE of the square root of (1000)*the square root of (.5*.5) = 15.8. So observing 532 heads or more is about 2 SE’s above what I expect, which by the Table is less than a 2.5% chance. Notice this is the SE for the count.

c. I could ask a different question and think of the last example in terms of percentages. I expect to see 50% heads, plus or minus an SE for the percentage of 15.8/1000*100% = 1.6%. I observe 532 heads, or 53.2% heads. This is about two SE’s more than I expect, again less than a 2.5% chance.

d. You should notice here that although you have done different calculations, you are dealing with the same deviation--in one instance in terms of the sum or count, in the other in terms of the percentage. They describe the same thing, just looked at in different ways (see the sketch below).
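A Python sketch of the coin example (the tail-probability helper uses the exact normal curve rather than the book's Table; the numbers come from the example above):

```python
import math

# P(Z >= z) for a standard normal, via the complementary error function.
def upper_tail(z):
    return 0.5 * math.erfc(z / math.sqrt(2))

n, p = 1000, 0.5
expected = n * p                                  # 500 heads
se_count = math.sqrt(n) * math.sqrt(p * (1 - p))  # ~15.8

observed = 532
z = (observed - expected) / se_count
print(f"z = {z:.2f}, chance of {observed}+ heads = {upper_tail(z):.1%}")

# The percentage version gives the identical z-score:
se_pct = se_count / n * 100                       # ~1.6%
z_pct = (100 * observed / n - 50) / se_pct
print(f"z from percentages = {z_pct:.2f}")
```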

8. Parameters vs. statistics

a. Parameters exist in distributions that are never observed. They are true values, absolute, without chance error or bias.

b. Statistics are what we observe

c. Standard errors live in a grey zone--when they are estimated from our statistics, they are not exact true values but include all the slop that happens in real data; when they are computed from parameters alone, they can be thought of as exact and true. In the book the author notes that sometimes we calculate an SE using a known box, where we can generate the SD exactly, and sometimes we work from a sample where we don’t know the exact distribution of elements in the population, so we use our sample statistics to estimate it (what we are estimating is conceptually identical to the SD of the box). For example, when we calculate the SE of the roll of a die, we know the elements in the population exactly; the SE here is a parameter. In contrast, in class we did not know exactly how many UC students were against a tuition hike, but in one instance we used 80%, derived from a sample of Berkeley students, to estimate that our population box contained 80 1’s and 20 0’s. Here the SE is an estimate (a statistic) of the parameter.

9. Types of bias (selection, nonresponse, response bias)

a. We talked extensively about how bias can enter a sample through many routes.

b. Part of developing sophistication as a consumer of statistics is to think about the possible sources of bias in the statistics

c. Controlling and minimizing bias is done via good research design

d. Estimating the size of bias is generally done by judgment, not by mathematics (unlike estimating the size of chance error).

10. Sample distributions, sampling distributions of the mean, population distributions

a. Sample distributions are the elements we observe

b. Population distributions are the elements that exist in the population as defined (the population distribution of a die is 1, 2, 3, 4, 5, 6).

c. The sampling distribution of the means refers to a distribution that is created by repeatedly drawing samples from a population, calculating the means, and using the means as elements in a new distribution.

1. We learned that the expected value of the mean of this new distribution is the mean of the population

2. The average spread in this distribution is referred to as an SE.

a. In most stat books it is called the Standard Error of the Mean, or SEM. In our book, it is called the SE of the average of the draws.

b. It is simply the square root of (number of draws)*SD of the box, divided by the number of draws--which simplifies to the SD of the box divided by the square root of the number of draws.

c. Notice the similarity of this equation to the SE for the percentage. The SE for the average is an estimate of the expected deviation of a sample mean from the expected value of the mean.

d. The SE for average is always smaller than the SE for the sum because it is the SE for the sum divided by the number of samplings.

e. The SE says how far, on average, the sample average is from the population average. In contrast, the SD of the sample says how far, on average, an element within the sample is from the average of the sample.

f. In the book, the author uses SE for the average to answer questions like this:

1. The average educational level in a sample of 400 25-year-olds is 14.3 years, SD = 3.4. What is your best estimate of the average educational level of all 25-year-olds in the population? You are using your sample mean to estimate a population mean and then attaching an uncertainty to that. The SE for the average can be estimated as the square root of (400)*3.4/400 = 0.17. So you say that you think the average years of education in the population of 25-year-olds is 14.3, plus or minus about 0.17. This means the mean is likely to be between about 14.1 and 14.5, and if you were to say this you would be right about 68% of the time.

2. This is not the same as asking what educational range 68% of all 25-year-olds fall in. Clearly that is not 14.3 years plus or minus 0.17 years; it is more like 14.3 years plus or minus 3.4 years (see the sketch below).
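A tiny Python sketch of this distinction (numbers from the example above):

```python
import math

# SE of the average from one sample: SD of the sample / sqrt(n).
n, sample_mean, sample_sd = 400, 14.3, 3.4

se_average = sample_sd / math.sqrt(n)   # 0.17
print(f"population mean estimate: {sample_mean} +/- {se_average:.2f}")

# Contrast: the range covering ~68% of individual 25-year-olds is
# roughly the sample mean +/- one SD.
print(f"~68% of individuals: {sample_mean - sample_sd:.1f} "
      f"to {sample_mean + sample_sd:.1f}")
```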

11. Simple random sampling vs. multistage sampling

a. We talked about several concepts in complex sampling, including stages and clusters.

b. We specified that SRS (simple random sampling) means each element in the population has exactly the same chance of being selected into the sample.

12. Estimating the accuracy of statistics from survey samples

a. This is a combination of considering bias

b. And the effects of sample size (which influences the size of chance error)

13. Confidence intervals

a. We learned how to use SE’s to calculate confidence intervals

b. Confidence intervals are a technique where we specify a range around our sample statistic that we believe will include the corresponding population parameter. If we go out about 2 SE’s (1.96 to be precise), then we will have created an interval where 95 out of 100 times we will be correct.

c. CI’s say nothing about the probability that the population parameter is in the interval or not. That is absolute truth--it either is or it isn’t inside the interval. Our uncertainty, and conversely our confidence, is tied to our statistics and our estimation of chance. With a 95% CI, we will be correct 95 out of 100 times in creating an interval that includes the population parameter. But on this one try, we don’t know whether it is one of the times we are wrong. About that we have no certainty; we just know that over repeated tries we’ll be wrong, on average, 5 out of 100 times. A short sketch follows.
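A minimal Python sketch of a 95% confidence interval for a percentage (reusing the 532-heads-in-1000-tosses example; the 1.96 multiplier is from the notes):

```python
import math

n, count = 1000, 532
p_hat = count / n
se_pct = math.sqrt(p_hat * (1 - p_hat)) / math.sqrt(n)

low, high = p_hat - 1.96 * se_pct, p_hat + 1.96 * se_pct
print(f"95% CI for the percentage of heads: {low:.1%} to {high:.1%}")
```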

14. Sampling with or without replacement

a. All of the techniques we’ve been talking about over the last several chapters involve sampling with replacement.

b. When the population is large relative to the sample, the fact that our sampling often is not done with replacement (for example, if we interview 100 students on campus, once we select a student to interview, he or she is not returned to the population, so we are sampling without replacement) does not really matter much.

c. When the sample is a substantial fraction of the population, the with-replacement techniques we learned are biased: they do not estimate the SE well (they overstate it). They are not accurate.

d. We can adjust for this by multiplying the SE we have calculated by a correction factor: the square root of ((population size - number sampled)/(population size - 1)). A short sketch follows.
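A small Python sketch of the correction factor (the population and sample sizes are hypothetical):

```python
import math

# Finite-population correction: adjust a with-replacement SE when
# sampling is actually done without replacement.
def corrected_se(se, pop_size, n_sampled):
    factor = math.sqrt((pop_size - n_sampled) / (pop_size - 1))
    return se * factor

# 100 students sampled from a population of 400: the SE shrinks noticeably.
print(round(corrected_se(15.8, 400, 100), 2))      # ~13.7
# The same sample from a population of 100000: almost no change.
print(round(corrected_se(15.8, 100000, 100), 2))   # ~15.79
```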

15. Correlation and joint distributions

a. Finally, we learned about joint distributions of two variables.

b. The correlation coefficient summarizes both the strength of association between two variables (the extent to which each element in the sample has the same percentile standing on both variables) and the direction of the association (high percentile rank on both variables--a positive correlation--or high on one and low on the other--a negative correlation).

c. If a correlation shows a modest, positive association (such as an r of .30) then what that means is that elements in the sample (subjects) who score high on one variable tend to score high on the other, but there is still a lot of variability. For example, if the association between depression and stress is .30 then people who are more depressed also tend to be more stressed, but it is not a perfect prediction by a long shot. If the correlation were .80 (a very strong correlation), then we would be quite certain that depressed people report more stress and stressed people report more depression.

d. In contrast, if a correlation is negative and fairly strong (like -0.60), then people who score high on one measure will tend to score lower on the other. For example, if pizza preference ratings (I LIKE IT) show a correlation of -0.60 with nonfat cottage cheese ratings (I LIKE IT), then people who say they like pizza tend not to say that they like nonfat cottage cheese. A computational sketch follows.
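A short Python sketch of computing r (the ratings below are invented for illustration; r is the average product of the two variables in standard units):

```python
import math

# Correlation coefficient: convert each variable to standard units
# and average the products.
def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sdx * sdy)

# Hypothetical ratings: pizza vs. nonfat cottage cheese preference.
pizza  = [9, 8, 7, 5, 4, 2]
cheese = [2, 4, 3, 6, 7, 8]
print(round(correlation(pizza, cheese), 2))   # strongly negative, ~ -0.96
```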

B. Calculation skills needed

1. You should be able to calculate the expected value of a discrete random variable

2. To calculate expected values of sums and proportions

3. To calculate standard errors of sums and proportions

4. To create confidence intervals for proportions and means

5. To use means and standard errors, the normal table, and z-scores to answer questions

6. To calculate correlations

7. I will put the formula table and the Normal Table on the exam for you

C. Rules for the exam

1. You may bring your hand calculator, pencils, pens, and student ID; nothing else.

2. If you forget your calculator

a. We will not supply any

b. You cannot share a calculator with anyone

3. Show all of your work.

a. Points will be assigned for each component of an answer.

b. Getting the correct final answer will not result in full credit. You must show your work.