Midterm 1 Solutions

Summary of class conversation

In case you missed it, we discussed some of the difficulties of the midterm in class today.

1) The SD question (Question #4) came out of the blue.
2) Many questions were worded in unfamiliar ways.
3) The simulation question: it was hard to match the response to the different questions.
4) The test emphasized definitions/concepts more than calculations and "doing".

I think these are all valid observations. One thing I want you to be able to do is solve problems in a novel setting or circumstance. For that reason, I always try to put a problem on the exam that is solvable given what you've learned, but unfamiliar. Also, questions are worded differently because, well, because I word them, and not the authors of the book. Still, I try to use the same terminology. But I'll try to give more quizzes so you'll see more questions asked differently.

Number (4) above is an interesting observation. To help you study, here's how I think of this class. The most important thing you should know is how to interpret results. This means I place an emphasis on your being able to explain graphs and statistics and analyses (we'll see more of these later) in the context of the problem, and to be able to explain consequences. Second most important are "Concepts/Definitions". You should know what words mean and understand what concepts they represent and how those concepts are used. Question 4 on the MT is a good example of testing the concept of the standard deviation. The least important are calculations. You are expected to be able to get the correct numbers when called on, but you will get more of the credit on a problem for interpreting and explaining than you will for the number. For that reason, I try to de-emphasize purely number-oriented questions. (This is why putting a "normal curve" question on the exam was less important to me than putting on a question about interpreting 2-way tables (#1c) or a simulation -- which has both calculation and interpretation.)

So I'm going to continue to put "new" problems on the exam, and I can't really change the way I write questions. Our problem, then, is to see what we (me and the TAs) can do to help you prepare for this. Below are some general suggestions.

How to improve your score next time:

1. Read the book. I'm trying hard to use the same language in the tests as in the book, and also the same structure. So for example, when so many people wonder how to answer the separate questions on the simulation, I suspect that part of the problem (at least for some people) is they haven't read that section of the book.
2. Attend lecture. The lecture emphasizes some points, and de-emphasizes others, and you can use this to focus your studies. For example, given that we spent 2 days on simulations, you should expect them to play an important role in the exam.
3. Pay attention to "vocabulary" words. Statistics, like many disciplines, has lots of "jargon". Learning the subject means learning what its words mean. The "Key Concepts" section of the book emphasizes important terms and concepts, and you should know these cold. For example, questions 1b&c on the midterm were straightforward vocabulary questions. We were also testing vocabulary/terminology in the simulation problem. Do you know what a response variable is? If so, then you should be able to answer that question correctly.
4. Read the exam all the way through before answering questions. Some people had to write a LOT more for the simulation problem because they didn't read the whole question (on both pages) before beginning.

The most commonly missed questions on the exam were 1c and the simulation. Question 4 was next, and Question 2 seemed to be answered pretty well. Question 1c tests your ability to use two-way tables. If you missed this, or got it right but don't understand it, you should study this some more. This means ASK QUESTIONS. Both TAs have reported that their office hours are not well attended, and I always have time in my office hours.

The simulation was missed for a variety of reasons, but the most common was that people gave the right answer to the wrong question. Too many people thought that the question was asking you to estimate the chance of being bitten by a malarial mosquito. But you weren't asked to estimate this; you were told that it was 10%. Your job was to estimate the number of bites until the first malarial mosquito. Another common mistake was made in the "State your conclusion" section. Some people suggested that, since the outcome was variable, the chance of being bitten by a malarial mosquito was changing. In fact, the chance is always 10%, but the point of doing simulations is to see that even if the chance stays the same, the actual outcomes vary. This is what Statistics is all about: variable outcomes. Another common mistake in the analysis section was to make up your own, sometimes rather bizarre, analysis. The entire first week of the class was spent on what to do to summarize data. Now you've got data: 4 observations. Summarize them. Don't invent something new -- stick to the tried and true.

Solutions

1. The following table consists of a random sample of 2,002 Los Angeles residents who filled out their census forms in the 2000 census. The variables are self-reported race and highest level of educational attainment. The K-8th category, for example, includes people whose highest level of attainment was some grade between kindergarten and 8th grade (inclusive). Although the racial categories were self-reported, I have combined several categories together, and so these are not the names used in the census. The sample is not representative of all LA, but of only a few census tracts in the Long Beach area.

	White	African Amer.	Asian/Pac Isld.	Two or more	Other	Total
None or preschool	54	5	16	11	54	140
K-8th grade	207	38	51	29	59	384
9-12	315	60	57	40	165	637
at least some college	501	79	143	44	74	841
Total	1077	182	267	124	352	2002

a) What type of variables are Educational Attainment and Race?

They are categorical variables.

b) Find the marginal distribution of educational attainment.

For full credit you need to include the values and their frequencies:

None       140/2002 = .07
K-8        384/2002 = .19
9-12       637/2002 = .32
College    841/2002 = .42
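
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the counts are copied from the Total column of the table.

```python
# Row totals copied from the "Total" column of the table in Question 1.
row_totals = {
    "None or preschool": 140,
    "K-8th grade": 384,
    "9-12": 637,
    "at least some college": 841,
}

grand_total = sum(row_totals.values())  # should match the table's 2002

# Marginal distribution: each category's share of the whole sample.
marginal = {level: count / grand_total for level, count in row_totals.items()}

for level, share in marginal.items():
    print(f"{level:<22} {row_totals[level]:>4}/{grand_total} = {share:.2f}")
```

The four shares add up to 1, which is a quick way to catch a slip in the division.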

c) (5) A sociologist wishes to compare the educational attainment of Whites and Asians. Which should he compute: row percentages or column percentages or cell percentages? Why?

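Column percentages. Each column is a race, so column percentages give the distribution of educational attainment within that race. The races have different numbers of members, and if we're comparing educational attainment we don't want the fact that there are simply more Whites than Asians to affect our conclusions.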
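
To see what the sociologist would actually look at, here is a rough sketch that computes the column percentages for the two groups. The function and variable names are hypothetical; the counts are copied from the White and Asian/Pac Isld. columns of the table.

```python
# Counts copied from the White and Asian/Pac Isld. columns of the table.
levels = ["None or preschool", "K-8th grade", "9-12", "at least some college"]
counts = {
    "White": [54, 207, 315, 501],
    "Asian/Pac Isld.": [16, 51, 57, 143],
}

def column_percentages(column):
    """Return each cell as a percentage of its column total."""
    total = sum(column)
    return [100 * cell / total for cell in column]

col_pct = {race: column_percentages(col) for race, col in counts.items()}

for i, level in enumerate(levels):
    white = col_pct["White"][i]
    asian = col_pct["Asian/Pac Isld."][i]
    print(f"{level:<22} White {white:5.1f}%   Asian/Pac Isld. {asian:5.1f}%")
```

Because each column sums to 100%, the two columns can be compared directly even though the sample has 1077 Whites and only 267 Asians.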

2. Again drawn from the 2000 census for the LA area. Shown are the reported personal incomes (dollars), excluding incomes of 0 or less.
a) (5) Briefly compare the distributions of men and women in one or two sentences. The vertical axis shows relative frequency. The vertical line in the histogram indicates the median value for each group.

(I can't get the picture to copy into this document, so you'll have to examine your tests.)

Your answer should include references to center, spread, and shape of the distributions.

The men have a slightly higher median income. Although both distributions are right-skewed, the men have more outliers and, it appears, a greater range of incomes than do the women.

b) YELLOW: (5) True or false and explain: a majority of people have above average incomes.
False: The right-skewed distribution means that the average income is greater than the median. This means that less than half of the people are above the average. You get 2 points for getting the "false", 3 points for the explanation.

b) BLUE: True or false and explain: a majority of people have less than average incomes.
True: the right-skewed distribution means that the median is less than the average. So more than half of the people are below the average. You get 2 points for "true" and 3 for the explanation.

c)(5) A student claims that we should expect about 68% of all people in this sample to have incomes within one standard deviation of the mean. Do you agree? Explain why or why not.

No. This would be the case if the distribution were approximately symmetric and bell-shaped (roughly normal). But this distribution is very skewed.
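
Parts (b) and (c) both come down to what skew does to the mean and the standard deviation. The sketch below uses a small made-up list of right-skewed incomes (in thousands of dollars); it is NOT the census data from the exam, just an illustration of the idea.

```python
from statistics import mean, median, stdev

# Hypothetical right-skewed incomes in thousands of dollars (NOT the exam data).
incomes = [10, 12, 15, 18, 20, 22, 25, 30, 60, 200]

avg = mean(incomes)
med = median(incomes)
sd = stdev(incomes)  # sample standard deviation

# Part (b): what share of people fall below the mean?
share_below_avg = sum(x < avg for x in incomes) / len(incomes)

# Part (c): what share fall within one SD of the mean?
within_one_sd = sum(avg - sd <= x <= avg + sd for x in incomes) / len(incomes)

print(f"mean = {avg}, median = {med}")
print(f"share below the mean      = {share_below_avg:.0%}")
print(f"share within 1 SD of mean = {within_one_sd:.0%}")
```

In this made-up sample the one very large income pulls the mean well above the median, so most people are below average, and the share within one standard deviation is nowhere near 68%.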

3. Malaria is a serious illness transmitted by mosquitos that can cause death. Suppose in a certain part of the world that 10% of all mosquitos carry malaria, and that if a malarial mosquito bites you, you will get malaria. About how many mosquitos would have to bite you before you get malaria? Design a simulation:

a)(2)   Identify the component to be repeated.

The component to be repeated is a single mosquito bite.

b) (2) Using the random numbers provided below, explain how you would model the outcome.

Let a 0 represent a malarial mosquito and 1-9 a non-malarial one, so that each digit has a 10% chance of meaning malaria. Any similar coding will work just as well.

c) (2) Explain how you will simulate the trial.

Select a random number to represent a mosquito bite. Continue until the first 0. This constitutes 1 trial.

d) (2) State the response variable.
The number of random numbers selected until (and including) the first 0.

YELLOW
e) (1) Using the random numbers provided, run 4 trials. Start by using the first random number provided. Write the current value of the response variable below each random digit.

36254 44136 69138 65665 0 *** 1st trial = 21 bites *** 6335 490 *** 2nd trial = 7 bites *** 97 81683 81153 30 *** 3rd trial = 14 bites *** 667 63846

25339 45818 98380 *** 4th trial = 23 bites *** 61014 88448 26114 20167 69682 84572 64490

f) (5) Analyze the response variable.

Our outcomes were 7, 14, 21, 23. The average is 16.25 and the median is 17.5.

g) (5) State your conclusions.
It takes about 16 to 17 bites, on average.
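
The Yellow trials can be re-run mechanically as a check. This is a minimal sketch, assuming the coding from part (b) where 0 means malaria; the function name `run_trials` is made up, and the digits are the ones printed above with the trial notes removed.

```python
from statistics import mean, median

# The Yellow exam's random digits with the trial annotations removed.
digits = (
    "36254 44136 69138 65665 06335 49097 81683 81153 30667 63846 "
    "25339 45818 98380 61014 88448 26114 20167 69682 84572 64490"
).replace(" ", "")

def run_trials(digit_string, n_trials, malaria_digit="0"):
    """Count bites up to and including each malarial bite."""
    results = []
    bites = 0
    for d in digit_string:
        bites += 1
        if d == malaria_digit:
            results.append(bites)  # response variable for this trial
            bites = 0              # the next digit starts a new trial
            if len(results) == n_trials:
                break
    return results

trials = run_trials(digits, 4)
print("trials:", trials)
print("mean:", mean(trials), "median:", median(trials))
```

The four counts come out in the order the trials were run (21, 7, 14, 23), and the mean and median match the analysis in part (f).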

BLUE
e) (1) Using the random numbers provided, run 4 trials. Start by using the first random number provided. Write the current value of the response variable below each random digit.

0 *** 1st trial = 1 bite *** 3674 31651 72812 66486 16663 63846 69287 34278 31735 5980 *** 2nd trial = 48 bites *** 6

84771 0 *** 3rd trial = 7 bites *** 4691 67681 74357 21924 22359 76580 *** 4th trial = 29 bites *** 63157 68254 85323

f) (5) Analyze the response variable.

Outcomes were 1, 7, 29, 48.
The average is 21.25, the median is 18.

g) (5) State your conclusions.

It takes about 18 bites, although there's quite a bit of variability.
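
Four trials is a very small sample, which is part of why the Yellow and Blue answers differ. As a rough sketch, the same design can be repeated many more times with Python's pseudo-random generator standing in for the printed digits. The function name, the seed, and the number of trials are arbitrary choices for illustration.

```python
import random
from statistics import mean, median

def bites_until_malaria(rng, p_malaria=0.10):
    """Simulate one trial: count bites up to and including the first malarial one."""
    bites = 1
    # Each draw is one bite; a draw below p_malaria means a malarial mosquito.
    while rng.random() >= p_malaria:
        bites += 1
    return bites

rng = random.Random(2000)  # arbitrary seed so the run is repeatable
results = [bites_until_malaria(rng) for _ in range(10_000)]

print("mean  :", mean(results))
print("median:", median(results))
print("range :", min(results), "to", max(results))
```

With a 10% chance per bite, the long-run average is 1/0.10 = 10 bites, so a large simulation should land near 10, while any single set of four trials can land well above or below that. The chance per bite never changes; only the outcomes do.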

YELLOW (BLUE is same question, but different order)
4. Choose 4 digits from among 0,1,2,3,4,5,6,7,8,9 (repeats allowed) so that
a) (2) the standard deviation is as large as possible
0, 0, 9, 9

b) (2) the standard deviation is as small as possible

2, 2, 2, 2, or any other set of four identical digits

c) (2) Is your answer to (a) unique? Yes or no; no explanation needed.
Yes.
(Any other combination will result in a smaller SD. This is the only one that gives the largest.)

d) (2) Is your answer to (b) unique? Yes or no; no explanation needed.

No.
(Any set of 4 repetitions will provide the same SD of 0 -- and this is the smallest possible SD.)
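
All four answers to Question 4 can be confirmed by brute force, since there are only 715 ways to choose 4 digits when order doesn't matter. This is a sketch, not something you'd do on the exam. The helper name `spread_score` is made up; it returns 16 times the population variance so that all comparisons use whole numbers. Dividing by n-1 instead of n would not change which sets are largest or smallest.

```python
from itertools import combinations_with_replacement

def spread_score(digits):
    """Return n^2 times the population variance, computed with integers only."""
    n = len(digits)
    return n * sum(d * d for d in digits) - sum(digits) ** 2

# Every unordered choice of 4 digits from 0-9, repeats allowed.
choices = list(combinations_with_replacement(range(10), 4))
scores = {c: spread_score(c) for c in choices}

largest = max(scores.values())
smallest = min(scores.values())
max_sets = [c for c, s in scores.items() if s == largest]
min_sets = [c for c, s in scores.items() if s == smallest]

print("largest SD :", max_sets)
print("smallest SD:", min_sets)
```

Only one set reaches the largest score, while ten different sets (0,0,0,0 through 9,9,9,9) tie for the smallest, which matches the Yes and No answers to (c) and (d).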