E-mails Prior to Midterm 2

(These are real questions asked by other students just like you

--only the names have been changed to protect the innocent

--look here for answers to common questions.)

Student's Question Professor's Response
I know it's a little late, but if you have time can you please explain to me how the SE can be equal to both sqrt(N) * S.D. and sqrt(pq)/sqrt(n) * 100%? Isn't pq the SD also? Are N and n the same thing? I'm really confused.

The first is the SE for the sum or the count; the second is the SE for the percentage (the first is the deviation due to chance in terms of amount, and the second is the deviation due to chance in terms of percentage). The standard deviation of a two-outcome (0-1) situation is equal to sqrt(p*q), not p*q itself. N and n are the same thing: the number of draws. So there are several SEs. They are conceptually the same in terms of deviation due to chance, but calculated differently depending upon what we are interested in.
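To see that the two formulas in this exchange describe the same chance variation on different scales, here is a minimal Python sketch. The function names and the example values (400 draws, p = .5) are assumptions made up for illustration; they do not come from the course.

```python
import math

def se_for_count(n_draws, p):
    """SE for the number of 1's in n_draws from a 0-1 box: sqrt(n) * SD."""
    sd_box = math.sqrt(p * (1 - p))      # SD of a 0-1 box is sqrt(p*q)
    return math.sqrt(n_draws) * sd_box

def se_for_percent(n_draws, p):
    """SE for the percentage of 1's: sqrt(p*q) / sqrt(n) * 100%."""
    sd_box = math.sqrt(p * (1 - p))
    return sd_box / math.sqrt(n_draws) * 100

# Hypothetical example values, chosen only to keep the arithmetic easy.
n, p = 400, 0.5
count_se = se_for_count(n, p)        # 20 * 0.5 = 10 tickets
percent_se = se_for_percent(n, p)    # 0.5 / 20 * 100 = 2.5 percentage points

# The SE for the percent is just the SE for the count rescaled by n.
print(count_se, percent_se, count_se / n * 100)
```

Dividing the SE for the count by the number of draws and multiplying by 100% gives the SE for the percent, which is why both formulas can be called "the SE".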
Professor Cochran,

I was watching a program on Dateline last night and they conducted a poll on people's opinion concerning the issue of DNA cloning. The top of the screen showed a +/- factor of 4.5. Is this value the standard error, or is it the standard deviation? what exactly is the difference between the two?!?

I'm really frustrated with my progress in class. When I'm in lecture I feel like I have a full understanding of what is going on. But when I try to apply the theories in real life or when I'm using them to solve questions on our homework assignments, all my knowledge seems to go out the window.

All the theories seem so abstract to me. I know that sounds strange because statistics deals with tangible things like data, percentages, and numbers.

But i can't seem to make the connection between theory and application.

Watching the program last night made me realize just how lost I am in this class. I can't even tell the difference between the SE and the SD!!!!!!

Please help.

a drowning student desperately trying to stay afloat,

Ms. Student,

I'm glad you asked. Actually you are closer to knowing the answer than you think. You know there is a difference between SD and SE (and you may not have even known that SD and SE existed a couple of months ago). SD is an average of the variation in our observations. SE is an estimate of the average variation in the mean of our observations.

The answer to your question is that it is SE--but here is the reasoning that you can use. When the information they are trying to convey to you involves the mean (and the percent of people who feel a certain way can be thought of as an expected value or mean--like I toss a coin 50 times and expect to see 25 heads or 50% heads--it's the mean of my expected distribution) and the concern is with how accurate that estimate is, it will be bracketed with an SE. If instead they are trying to communicate to you the average range of values (like 68% of Americans have 14 years of education + or - 3.4 years) then it will be an SD--or an estimate of the expected average variation in the population.

With polls, as a rule of thumb, it is always an SE, or should be. Sometimes though the analyst, like you, is unsure and will forward to their boss the wrong number. The boss, which might be you one day, needs to know which is which or risk severe (:)) embarrassment. So another way to know for sure that it is SE and not SD is to quickly calculate a rough estimate of the SD (which is the SD for the box--the squareroot of the percent of yeses * percent of no's)--if the reported number is much smaller than that, then it is an SE for the percentage (because the SE is the SD divided by the squareroot of the sample size).

But if you do the math with the Dateline show, your calculations won't give you the 4.5. The reason is that polls are generally cluster samples and the real estimate of the variance is a little more complicated than that. The rule of thumb is that the 4.5 should be a little bigger than what you calculate, but still much much smaller than the SD.

Dr. C.

Hello Professor Cochran....

I have a humble request...and a terrible story to accompany it..It all began in lecture today, when you spoke of your red sports car and the almighty bruin pedestrian who did not receive a ticket from the long awaiting policeman...that was when it dawned on me that I had forgotten to display my parking permit....knowing the efficiency of the UCLA parking enforcement..I bolted out of class (I apologize for the disruption) and dashed toward lot 11...fortunately I did not find a ticket on my car...(unusual...considering I have been blessed with more than my share of parking tickets)...unfortunately...in my panic I had forgotten to turn in my homework.....that brings me to my MOST humble request...is there ANY POSSIBLE WAY that I could turn in my homework????

Please consider and thank you for your time...

 

Dear Student,

Great story. But, sorry, no. Like I said, when the homework is due it gets put on a hopper and is rapidly processed, so there isn't any latitude. But 1> you get to drop one homework, so that can be the one, and 2> look at it this way--you traded 6 pts for not getting a ticket...isn't that worth something to you?

I'm still chuckling. Thanks for the early morning laugh. See ya in class.

The student responds:

Thanks anyhow Professor Cochran...

I'm glad you enjoyed my honestly terrible story....but I guess in it all...it was a good trade...at least I'm not in deeper debt =) thanks for responding....I will definitely see you in class.....keep chuckling....PS....if you could..please stay away from cars, cops and tickets on days that homework is due! =) They make a great story...

Hello, I have a question concerning homework assignment #4. Question 6 of chapter 17:... it is not part of the question asked but I was wondering how you find out the "standard error" for the number of 1's (or of any number)? Hence, how would you determine question 2(b) of chapter 18?

Thank you..

Dear Student,

Ah, there you are. Your brain is off to the next possibility. Yes. There is a way to think about an SE in relation to the number of 1's. The first SE we learn is an SE for the sum of the draws from the box. But what if, instead of tracking the sum, we wanted to track the count--that is, we have a box that contains 1's and everything else (2's and 3's) and we want to know over repeated draws how many 1's we can expect to see, and--here's the SE for the count--on average how much variation from that expected count of 1's we can expect to see. The math is only slightly altered but the idea behind it is the same.

You've started with a box that looks like this:

| 1 | 1 | 2 | 3 |

Now you restructure it to look like this (two tickets that are 1, two tickets that are not 1):

| 1 | 1 | not 1 | not 1 |

Then we recode it with numbers:

| 1 | 1 | 0 | 0 |

The zero stands for not drawing a 1.

It's clear from this that you would expect, on any one draw, to have a 50-50 chance of drawing a 1, so the mean for the box is (1 + 1 + 0 + 0)/4 = .5

The SD of the box is (1 - 0) * squareroot of (prob of drawing 1 * probability of not drawing 1) = 1 * squareroot of (.5 * .5) = .5

And now the SE for the count (or the number of times we draw a 1) = squareroot of (the number of draws) * SD of the box (or .5)

So if we draw 100 times from this box we would expect to see

E(number of 1's in 100 draws) = np = 100 * .5 = 50 ones drawn

with an average deviation from that of

SE for the count = sqrt(100) * .5 = 10 * .5 = 5

So about two thirds of the time (68%) we would expect to see 50 plus or minus 5 1's drawn.
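The classify-and-count arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation described in the reply; the variable names are made up for illustration.

```python
import math

# The original box, and the same box recoded as 1 (drew a 1) / 0 (did not).
box = [1, 1, 2, 3]
coded = [1 if ticket == 1 else 0 for ticket in box]   # [1, 1, 0, 0]

p = sum(coded) / len(coded)                  # chance of drawing a 1: 0.5
sd_box = (1 - 0) * math.sqrt(p * (1 - p))    # shortcut SD for a 0-1 box: 0.5

n_draws = 100
expected_count = n_draws * p                 # 100 * .5 = 50
se_count = math.sqrt(n_draws) * sd_box       # 10 * .5 = 5

print(f"expect {expected_count:.0f} ones, give or take {se_count:.0f} or so")
```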

To do 2b in Chap 18--you have to mix what you know about the normal distribution with this setup above:

1> First restructure the box to be 3 vs. not 3

2> Calculate your mean and SD of the box

3> Estimate your expected count and SE for the count in 400 draws

4> Find how many SEs a count of 90 3's is from the expected count (that's your z)

5> Convert that to area in the normal curve

Let me know if you get stuck and need more help.

Dr. C.

Professor Cochran,

I was having some difficulty with the homework tonight for Stats 10, so I was wondering if you could explain how to do number #2 on chapter 18. I know you went over it in lecture today but I am still having trouble figuring it out.

Thanks a lot.

Ms. ,

So, you can see that (a) is a situation where you are concerned about the sum and (b) is a situation where you are concerned about the count. Let's set it up and then see if you can take it the rest of the way.

(a) He's asking what is the 'chance' of a sum of 1500 or more. This is a reference to area, right? In the normal curve the area under the curve is the same as 'chance'. That's why it's called a probability density curve. So you've got to find the area between 1500 and the most it can ever be. The lower boundary will be a z score that corresponds to 1500 (the other boundary is the end of the curve, which is undefined). To get a z we need three things: a mean, an SE, and the score, which in this instance is 1500. The mean is the expected sum after 400 draws, which is just the expected sum after 1 draw multiplied by 400. The mean of one draw is the average of the box.

1 chance each of 1, 3, 5, 7

so it's (1 + 3 + 5 + 7)/4 =

And then we multiply that by 400.

The SE for the sum is the squareroot of the number of draws * SD of the box. Well, the SD of the box--that you oughta be able to do. And getting the SE is then just a slide downhill. So you take these two components and put them into your equation. This gives you a z, and you use the normal table to find the area you are interested in.

(b) This is the same except now you have to classify and count. Your box changed to

1 chance of 3 and 3 chances of not 3

Your expected count after 400 draws is = 400 * (1 in 4 chances or .25)

The SD of the box is (1 - 0) * squareroot of (1/4 * 3/4)

And the SE is the squareroot of 400 * SD of the box

Notice I'm assuming that you coded drawing a 3 as 1 and not drawing a 3 as 0

From here the rest should be relatively easy.

Let me know if you're still having trouble.

Dr. C.
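For anyone who wants to check their own work on this setup, here is a hedged Python sketch of both parts. Two things are assumptions rather than statements from the replies: that part (b) asks for the chance of fewer than 90 3's (the 90 is mentioned in the earlier reply, but the direction is not spelled out), and the helper names, which are made up for illustration. It uses the plain normal approximation with no continuity correction, and math.erf stands in for the normal table.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def box_mean_sd(box):
    """Average and SD of the tickets in a box."""
    mean = sum(box) / len(box)
    sd = math.sqrt(sum((t - mean) ** 2 for t in box) / len(box))
    return mean, sd

box = [1, 3, 5, 7]
n_draws = 400

# (a) Sum of the draws: chance of 1500 or more.
mean, sd = box_mean_sd(box)
expected_sum = n_draws * mean                 # 400 * 4 = 1600
se_sum = math.sqrt(n_draws) * sd              # 20 * SD of the box
z_a = (1500 - expected_sum) / se_sum
chance_a = 1 - normal_cdf(z_a)                # area to the right of z_a

# (b) Count of 3's: recode 3 as 1 and everything else as 0.
coded = [1 if t == 3 else 0 for t in box]
p, sd_coded = box_mean_sd(coded)              # p = .25, SD = sqrt(1/4 * 3/4)
expected_count = n_draws * p                  # 100
se_count = math.sqrt(n_draws) * sd_coded
z_b = (90 - expected_count) / se_count        # assumed threshold: fewer than 90
chance_b = normal_cdf(z_b)                    # area to the left of z_b

print(f"(a) z = {z_a:.2f}, chance = {chance_a:.1%}")
print(f"(b) z = {z_b:.2f}, chance = {chance_b:.1%}")
```

The only difference between the two parts is the box: the original tickets for the sum, the recoded 0-1 tickets for the count.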

Professor Cochran,

I have a question as to how to figure out the average for a box model with large #'s. For example if you have a box that looks something like this 30,000 (1) 12,000 (0). How do you find the average so that then you could calculate the expected value of # of draws x box average. I understand how to do the steps, but I do not understand the math on getting an average for sample surveys.

A Student

Ms. Student,

The size of the numbers doesn't matter (they just make things harder to calculate so we don't tend to use them in examples). The average of the box above would be (30000*1) + (12000*0) divided by the number of elements in the box, or 42000. Or another way to calculate it is 30000/42000. Either way it's .71. The math looks about right conceptually, right? 'Cause (lopping off 3 zeros) we've got 30 1's and 12 0's and that should average to around 3/4 of the value of the 1 (because 30 is about 3/4 of 42). If we wanted to know what a random sample (with replacement) of 100 would sum to out of this box it's 100*average or 100*.71 = about 71 1's being drawn out of the box (more precisely 71.43). There are two numbers to think about--the number of tickets in the box (which is the size of your population, as in 'in a population of 50,000 high school seniors') and the size of the sample (or number of draws from the box), as in 'a simple random sample of 300 high school seniors'.

Putting it all together (if I can make it look right in an email) the diagram goes like this:

A researcher polled 300 hs seniors drawn randomly from the 50,000 who g...
(300 = # of draws from a box containing 50,000 tickets)

She found that 60% supported the rights of students to choose their own..
(60% = the average of the sample, used to create our expectation of the structure of the box)

So the box, we predict should contain 30,000 1's and 20,000 0's (we can make this prediction because of the law of large numbers which states that with repeated samples from a box the average we observe will move closer and closer to the average of the box--and up past around 100 draws we will be so close we can expect them to be the same if everything is done fairly (no bias, no changing probabilities in the box, no hanky panky in our sampling, etc))

But there is still some chance that we are not exactly on the mark of what the real box looks like. We've just created an expected box. So we can take this a step further and estimate a range of values for what the real box might be.

We can do this by invoking the central limit theorem. That says that if we take repeated samples of the same size from a box, the means of all those samples will be distributed normally with a mean equal to the mean of the box and a standard error (se) that we can link to the normal distribution.

Well, we only took 1 sample of 300 students. But that is an element in the sampling distribution of the means (the repeated samplings from the box I just described). Because it is an element of this distribution, it's as good as any for the average of the distribution. (Just like if you have money in your pocket, and I pull one coin or bill out of your pocket, I can use that to estimate the average of your pocket without knowing anything else--if it's a quarter I'll estimate 25 cents--if it's a $5,000 bill (do they exist???) I would estimate $5,000--it's better than guessing off the top of my head). So we set .60 as the estimated average of the sampling distribution of the means (which by the theorem would also be the average of the box).

Then we have to calculate the se of the sampling distribution. We do this in three steps (according to the book):

1> estimate the sd of the box: (1 - 0) * square root [(30000/50000)*(20000/50000)] = .49

2> estimate what the se of the count would be out of this box with 300 draws: square root (300) * sd of box = 17.32*.49= 8.49

3> estimate the se for the percent: se for count/number of draws * 100% = 8.49/300 * 100% = 2.8%

Now we have an estimate of the average of the sampling distribution of the means (.6) and an estimate of the average spread of this distribution (se= .028). And this distribution is normally distributed. So if we go out 2 SE in either direction we will include 95% of all possible values in this distribution.

So we can say with 95% confidence that the percent of high school seniors (in the box) who support the rights of students... is 60% plus or minus 5.6%.

Now, we never saw the box. We only drew 300 times out of it. We used the mean we observed and our estimate of how much we think that result might vary due to chance to place a bet that the box really has somewhere between 54.4% and 65.6% in favor of the rights of students. We can be wrong in our bet--we WILL be wrong in a perfect world 5% of the time. The actual, real % in the box is either in that interval or it isn't. The population in the box does not change--what changes is the values we draw from it (just like if I drew a coin or bill from your pocket and made a guess--the amount in your pocket doesn't change--what does change is the outcome of my draw).

This is pretty powerful stuff. We went from observing values in 300 people and by two theorems linked that to making a prediction about a population (box) we can never see. It may not seem like a big thing, because intuitively all of us already do that (we meet 3 people on a floor in the dorm and rather quickly assume that everyone on that dorm floor is the same sort of person). But here is the mathematical basis for why we can do that. And also a caution to us to temper our sweeping generalizations about things around us--sometimes what we see is not exactly what the whole population is like, and there is spread or diversity within populations, and within our samples of populations.

But you can see, the size of the numbers doesn't change the formulas. And the key is to figure out what is an element of the sample, what is an element of the population, and how to construct the box. Good luck in your studying.

Dr. C.
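The three-step recipe in this reply (SD of the box, SE for the count, SE for the percent) and the 2-SE interval can be sketched in Python as follows. The variable names are made up for illustration; the numbers are the ones from the high school seniors example.

```python
import math

population = 50_000        # tickets in the box
n_draws = 300              # size of the simple random sample
sample_fraction = 0.60     # 60% of the sample supported the rights of students

# Expected box: 30,000 1's and 20,000 0's.
ones = sample_fraction * population
zeros = population - ones

# Step 1: estimate the SD of the box.
sd_box = (1 - 0) * math.sqrt((ones / population) * (zeros / population))

# Step 2: SE for the count in 300 draws.
se_count = math.sqrt(n_draws) * sd_box

# Step 3: SE for the percent.
se_percent = se_count / n_draws * 100

# Go out 2 SE in either direction for roughly 95% confidence.
center = sample_fraction * 100
low, high = center - 2 * se_percent, center + 2 * se_percent

print(f"SD of box = {sd_box:.2f}, SE count = {se_count:.2f}, SE % = {se_percent:.1f}%")
print(f"95% interval: {low:.1f}% to {high:.1f}%")
```

Because this sketch carries full precision instead of rounding at each step, the endpoints can differ from the hand calculation by about a tenth of a percentage point.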

Hi Professor Cochran. I was reviewing for the midterm and I came across some questions. If you have time, would you mind answering them?

1) in the example on pg 296, how did they get the SD to be $1? I worked out the SD to be .001 and SE to be .1. Isn't the SD the square root of 20/10000 x 18/10000?

2) on pg 294 #5, why do you multiply by the square root of 4 for 100 draws for the SE?

3) on pg 299 #3b, i don't understand where the .35 in the back of the book came from. I understand how to get EX and SE, but from that how do you determine the chance is 36%?

4) on pg. 303, do you have any suggestion in where to start for number 5? I know I have to find EX, and SE but how do I determine how many counts?

Thanks for the help.

Ms. Student,

Well..some of your learning from one situation is slopping over into another.

1> I can see above that you are treating the 20 tickets as counts, not sums. But even if they were counts, the elements for solving the problem would look like this: N or number of draws = 10000 (not the denominator you show above), and the SD would be (biggest - smallest) * squareroot of (20/38 * 18/38) (that's 20 out of 38 tickets or outcomes, and 18 out of 38), or (1 - (-1)) * approximately .5, or 1.

If you did it as sums the SD would be squareroot of ((20*(1 - approx. 0) squared + 18*(-1 - approx. 0) squared)/38), or 1 also. It comes out the same only because we are working with a box that has just two kinds of tickets here.

But you can also conceptually see that your answer above has to be incorrect just by the feel of it. Freedman tries to convey that on the next page by saying, hey, if the mean is about zero in a distribution and every element varies from that by 1, then the average variation from the mean is 1 (right?--38 outcomes all about 1 deviation from the mean is an average of 1 deviation or 38*1/38), so the SD must be around 1, because that's what an SD is--it's an average of the variations.

2> Ah, it's tricky. In 25 draws you see a sum of 50 with an SE of 10. Now the question is what happens in 100 draws, or 4 times as many draws? Well, it's not 4 times as much; that would be too big. Instead the SE or estimate of the deviation increases at the rate of the squareroot of the number of draws (see the squareroot law). So, first, get a grip in your mind that the SD of the box is the same whether you draw 25 times or 100 times, right? Now, imagine this: 100 draws vs. 25 draws is like (squareroot of 100 * SD)/(squareroot of 25 * SD). The SD is the same so it drops out and this comparison becomes 10/5 or 2, which is the squareroot of 4. The shortcut is to realize that a sample that is 4 times bigger is going to have an SE that is the squareroot of 4 times bigger.

3> Ah, tricky again. OK, you have the expected value and you have the SE and you want to know what the chances are that you walk away from the table with $0 or more (right? if you win a penny, that's winning, so the limit on winning, the great divide, is $0). Well, the number comes from the z formula, which is used to figure the chances:

z = (0 or winning - (-$8) or your expected outcome) / ($24 or your expected average variation from -$8 in playing 100 times) = .35 approximately

That's a z value and you use it to figure the area or chances in the right hand tail, or 36%

4> Well, this is a prelude to the central limit theorem. Each one of the groups is a fixed size (100 tosses) drawn randomly from the box and what is recorded is the mean. So you would expect by the theorem that on average the means should cluster in a normal curve centered on the actual probability of a coin toss outcome or 50%. The SD of the box, that is of a single coin flip, is squareroot of (1/2 * 1/2) or .5. The SE for the count is going to be the SD of the box times the squareroot of the number of tosses in each group. He had 10,000 flips in 100 groups of 100 tosses each, so your SE should be around .5 * squareroot of 100 = .5*10 = 5. So, going out 5 in each direction from the center of 50 heads is like plus or minus 1 SE or 68%
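Points 1> through 3> above can be checked numerically. The sketch below is an illustration with made-up variable names, assuming a box of 20 tickets marked +1 and 18 marked -1 as in the SD formula in point 1>, and using math.erf in place of the normal table.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# 1> SD of the box, the long way and with the two-value shortcut.
box = [1] * 20 + [-1] * 18
mean = sum(box) / len(box)                                   # close to 0
sd_long = math.sqrt(sum((t - mean) ** 2 for t in box) / len(box))
sd_short = (1 - (-1)) * math.sqrt((20 / 38) * (18 / 38))     # (big - small) * sqrt(p*q)

# 2> Square root law: 4 times as many draws gives sqrt(4) times the SE.
se_25 = 10
se_100 = se_25 * math.sqrt(100 / 25)                         # 10 * 2 = 20

# 3> Chance of coming out at $0 or better: expected value -$8, SE $24.
z = (0 - (-8)) / 24
chance_of_winning = 1 - normal_cdf(z)                        # right-hand tail

print(f"SD long way = {sd_long:.3f}, shortcut = {sd_short:.3f}")
print(f"SE for 100 draws = {se_100:.0f}")
print(f"z = {z:.2f}, chance = {chance_of_winning:.0%}")
```

With unrounded inputs, point 3> gives z of about .33 and a tail of about 37%, a point or so off the .35 and 36% quoted from the back of the book; rounding in the table lookup accounts for that size of difference.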

Yer welcome

Dr. C.

I had a question regarding an example that was given in class (Lecture #11). It was regarding wanting to know if UCLA students were against a tuition increase. Polling everyone would be too costly and too time-consuming, so we decided to take a random sample of 100 students. We knew from a previous study done at Berkeley that 80% of students were against the increase. So, our box is 80 1's 20 0's

I understand how to calculate the S.D. (since it is a binomial distribution) = square root (0.2 X 0.8) = 0.4

I also understand how to get the S.E. = square root (n = 100) X S.D. = 10 X 0.4 = 4

and know how to get S.E. as a percentage = S.E. divided by sample size = 4%

My question is:

The Regents say that they are going through with the fee increase unless 75% or more of the students are against it....so to find the % chance that the regents are going to be incorrect in raising the tuition (when they shouldn't, because the true percentage of those against the tuition increase is 80%) we need to find the chance that our sample shows 75% or less being against the tuition increase, correct? In order to do that, we can convert it to a z-score with (x - mean)/S.E. = (75% - 80%)/4%. The z-score comes out -1.25. So, from -1.25 to 1.25, the area under the normal curve is 78.87%. This leaves two tails totaling roughly 21%. Which means that there is about an 11% chance of finding 75% or less against the tuition increase, and that the Regents will raise tuition, when there should not be an increase. Is this correct?

I would appreciate it if you could get back to me on this as soon as possible. Thank you for your help.

That's sort of a way to say it...let me reword it a bit...The regents say, "unless your survey shows that 75% of students are against it we are raising tuition!" That means anything above 75% is okay. We want to know, what are the chances that our survey will show 75% or less against it, even though the truth is that 80% are against it.

But yeah, you bet, you're right. Now the issue is, are we willing to take about a 10% chance of failing. Maybe not...then we've got to increase our sample size.
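Here is a hedged sketch of the calculation in this exchange, plus what a bigger sample would buy. The sample size of 400 is a hypothetical value picked only to illustrate the last point; it is not from the lecture. The function name is made up, and math.erf stands in for the normal table.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def chance_survey_at_or_below(threshold_pct, true_pct, n):
    """Chance that a simple random sample of size n shows threshold_pct or
    less when the box really holds true_pct 1's (normal approximation)."""
    p = true_pct / 100
    se_percent = math.sqrt(p * (1 - p)) / math.sqrt(n) * 100
    z = (threshold_pct - true_pct) / se_percent
    return normal_cdf(z)

# The example from lecture: 100 students, SE = 4%, z = -1.25.
chance_100 = chance_survey_at_or_below(75, 80, 100)

# Hypothetical larger sample: 4 times the students halves the SE to 2%.
chance_400 = chance_survey_at_or_below(75, 80, 400)

print(f"n = 100: {chance_100:.1%} chance the survey shows 75% or less")
print(f"n = 400: {chance_400:.1%} chance the survey shows 75% or less")
```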

Dear Professor Cochran,

I know that homework #6 is not to be turned in, but I am having a little trouble.

1> on #8 from Chapter 8. From the scatter plot, how can you tell the S.D. at age 18 (y-axis)? The points range from 66-76, and the mean looks to be approximately 71, so can you just use those numbers to plug into the formula for S.D. = square root (sum of (x - mean) squared / n) to get 2.5 inches?

2> Also, to find out the correlation coefficient, do you just "eyeball" it?

3> Finally, how do you know that the S.D. line is the solid line and not the dashed line? Thanks for your time.

Sorry to be bothering you with this...

 

1> you don't have to calculate anything...think in terms of density or area. look to see what interval includes about 68% of the points. your use of the formula above to do this will lead you astray most times.

2> in this instance, yes. He wants you to develop a feel for what a .8 correlation looks like

3> well first clue is that it crosses the bottom axis, but that may be obscure to you at this point. The second is that the solid line bisects the scatter of points. The dashed line is a regression line (we get to that next week after the midterm) and it takes into account how many elements are located at each point shown. Right now, you don't know if a point refers to one boy or several at that particular value of height at age 4 and 18. The SD line simply moves by the slope of the standard deviations so it perfectly bisects the cluster.

Happy studying.

Dr. C.

Professor,

I have a question dealing with distributions according to the normal curve. Let's say you have a confidence interval that was something like 5 +- 10. Let's say that in the context of the question, it is impossible to have -5 as a data point. Would you still say that the confidence interval is valid, because if you look at it on the normal curve, it is ok to go into the negative range? However, in the problem context, it would be impossible to have negative 5. I'm having problems interpreting this type of situation. Thanks.

Ms. Student,

Well, yes, this is not a happy situation you ask about. But let's think about it. Here you have a center of your distribution at 5 but a huge confidence interval around it. This suggests that you have a very small sample size (that's in your denominator) or a large SD in your box (like winning or losing 15 vs 0 or something). If it is a loss, then a negative value shouldn't trouble you (right? can you see that?). But if this is a percent situation, then it's probably something related to greater uncertainty in the deviations due to chance (caused by a small sample size and a lopsided box). So if you are wildly uncertain in your estimate, the estimate is not real precise, and the fact that it slops over 0 (into the realm of the impossible) may just be due to your inexactness. So, that would be my interpretation.

In the real world of research when that happens, scientists set 0 as the bottom limit but mostly we try to design our studies so that that doesn't happen. I'm looking right now at a table from a paper I'm writing where 4.3% + or - 2.0% for a 95% CI of 0.3%-8.3% comes real close to creating the problem that troubles you. But I know that the SE is so big relative to the center estimate because I have a very small size (about 96 women vs about 5800 in the whole sample) that I wouldn't have been surprised to have transgressed into the minus territory.

What we are learning in class is the rules by which to think about these things. Your brain has found a 'what if' limit--that's good. Practically though, it only happens when the sample is very small, too small for the rules to work well.

Hope you had a good holiday.

Dr. C.
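The practice mentioned in this reply, setting 0 as the bottom limit when an interval slops over into impossible values, can be sketched like this. The helper name and its floor parameter are made up for illustration; the numbers are the student's 5 plus or minus 10 and the 4.3% with an SE of 2.0% from the reply.

```python
def clipped_interval(center, half_width, floor=0.0):
    """Interval center +/- half_width, with the lower end cut off at floor."""
    low = max(center - half_width, floor)
    high = center + half_width
    return low, high

# The student's example: 5 +/- 10 would reach down to -5, so the floor applies.
student_low, student_high = clipped_interval(5, 10)

# The reply's example: 4.3% with SE 2.0%, so a 95% CI goes out 2 SE = 4.0%.
paper_low, paper_high = clipped_interval(4.3, 2 * 2.0)

print(student_low, student_high)                  # lower end cut off at 0
print(round(paper_low, 1), round(paper_high, 1))  # stays just above 0
```

Clipping only tidies the report. As the reply says, the underlying issue is a sample too small for the normal approximation to work well.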

Prof. Cochran,

I have a question about exercise Set B in Chapter 20 #2c. The question asks to find how many Democrats are between 39% and 41% of the registered voters. I know that 40% are Democrats with a SE of about 1.5%. So then wouldn't they be 1SE away and have a chance of 68%? Please help. The correct answer is 48%?

Ms. Student,

40% + 1.5% = 41.5% right? 41% -40% = 1% right? ok...the SE is 1.5% but the question asks what happens if we go out 1% (not 1 SE). so if we go to the left of the mean (which is 40%) 1 percent, that is about 2/3 of a SE (2/3's of 1.5). The question is asking what is the chance that the sample will show a mean that is within 2/3 of an SE of the population mean.

Well, looking at the normal table about 48% of the time when we draw a simple random sample of 1000 from a population where 40% have a certain characteristic we will observe a value within 2/3's of an se (between 39% and 41%).

The trick here where you are getting confused, I think, is that there are several uses of the word 'percent'. There is the percent in the sample, the se of the percent, the percent to the left and right you are concerned with, and then the percent in the normal distribution. Walk through each of them and be sure that you differentiate the concepts (how do you do this?--ask yourself questions about each one of them and answer those questions, preferably aloud as if you were explaining it to another person). Of course if you have a roommate...make sure they've gone out, or they'll think you're nuts!

Good luck with your studying,

Dr. C.
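The "within 1%, not within 1 SE" point can be checked with a short Python sketch. It assumes a simple random sample of 1000 from a population that is 40% Democrats, as in the reply; the variable names are made up, and math.erf stands in for the normal table.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n = 1000      # size of the simple random sample
p = 0.40      # fraction of Democrats in the population

se_percent = math.sqrt(p * (1 - p)) / math.sqrt(n) * 100   # about 1.5%

# 39% to 41% is 1 percentage point either side of 40%, i.e. about 2/3 of an SE.
z = (41 - 40) / se_percent
chance = normal_cdf(z) - normal_cdf(-z)      # area between -z and +z

print(f"SE = {se_percent:.2f}%, z = {z:.2f}, chance = {chance:.0%}")
```

Going out a full SE (to 38.5% and 41.5%) would give the familiar 68%; going out only 1 percentage point gives roughly 48%.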

Hi Prof. Cochran,

In the course reader you wrote that we need to know how to calculate standard errors of the means. Is this the SE for the average or something else?

Thanks for your help,

It's the SE for the average (the average is the mean--so standard error of the means is the SD of the sampling distribution of the means, the theoretical distribution we would observe if we randomly sampled with replacement a large number of equal size samples from the box and made a new distribution out of the means of each sample--it's the estimate of variation in the mean we expect to see due to chance)

good luck

Dr. C.
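The "theoretical distribution we would observe if we randomly sampled" can be imitated by simulation. The sketch below is an illustration under assumed values (a 0-1 box with 60% 1's, samples of 100, 2000 repetitions, a fixed random seed); none of these numbers come from the course reader.

```python
import math
import random
import statistics

rng = random.Random(0)     # fixed seed so the run is repeatable

p = 0.6                    # assumed fraction of 1's in the box
n = 100                    # draws per sample (with replacement)
n_samples = 2000           # how many equal-size samples we take

# Draw many samples and keep the mean of each one.
sample_means = []
for _ in range(n_samples):
    draws = [1 if rng.random() < p else 0 for _ in range(n)]
    sample_means.append(sum(draws) / n)

# The SD of all those means is what the SE for the average estimates.
observed_se = statistics.pstdev(sample_means)
theoretical_se = math.sqrt(p * (1 - p)) / math.sqrt(n)   # SD of box / sqrt(n)

print(f"SD of the sample means: {observed_se:.4f}")
print(f"SD of box / sqrt(n):    {theoretical_se:.4f}")
```

The two numbers should land close together, which is the sense in which the SE for the average is the SD of the sampling distribution of the means.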