Student question:

> Hello Dr. Cochran,

> I would like to know the difference between the correction factor and
> the SD+? When do we use the standard error for the percentage?

Professor response:

I presume you are referring to the correction factor from earlier in the course, which was used in calculating an SE... this corrects for sampling without replacement from a small population.

The SD+ is a correction to the SD of a sample when that sample is small (< 100). That is, it is an unbiased estimate of the size of the average deviation from the mean in a sample.

Using the SE for %: well... in a lot of situations... that's a pretty big question. We use it whenever we are concerned with the average amount of chance error in our estimate of the mean (a percentage can be thought of as a mean--50% is the mean for the number of heads in a coin flip, for example). So we would use the SE for the percentage in confidence intervals and in test statistics.
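To make these three quantities concrete, here is a minimal sketch with made-up numbers (the formulas are the standard ones: SD+ multiplies the SD by sqrt(n/(n-1)), the SE for a percentage is sqrt(p(1-p)/n) x 100%, and the correction factor is sqrt((N-n)/(N-1)) for drawing without replacement):

    import math

    # SD+ : small-sample correction to the SD of a sample
    scores = [2, 4, 4, 4, 5, 5, 7, 9]                    # a small sample, n = 8
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)
    sd_plus = sd * math.sqrt(n / (n - 1))

    # SE for a percentage: a percentage is just the mean of 0/1 draws
    p, draws = 0.5, 400                                   # e.g. a fair-coin box, 400 draws
    se_pct = math.sqrt(p * (1 - p) / draws) * 100

    # Correction factor: drawing WITHOUT replacement from a small population
    population = 1000
    corrected_se_pct = se_pct * math.sqrt((population - draws) / (population - 1))

    print(sd_plus, se_pct, corrected_se_pct)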

 

Student question:

> Dr. Cochran:

> I am still having trouble with the regression method, both as explained
> in the book and in class (using the formula for the slope of the
> regression line). I think I am confused by the relationship between the
> estimate for the difference between the average of the score and its new
> value, and the z-score. These seem to be the same thing (?), but I still
> get confused about the method to use when working out a predicted value for
> y from a given value of x. I have gone through the examples in the book
> and the notes from lecture. Could you briefly outline the method for me,
> or refer me to a place where this is located?

>

> Thank you,

 

Professor response:

Let's start with where I think you 'know' something. You know there is a linear equation, and you know how to solve for unknowns if you know all the other pieces of information. So if I give you the correlation and the SDs of both y and x, from those you can calculate the slope. And if I give you the means of x and y, you can calculate the intercept. And if I give you a value of x, you can calculate (now having 3 + 2 + 1 = 6 pieces of information) the predicted value of y. That's my assumption about what you know. If I'm wrong, then what I say next might confuse you--so read through that again and be sure you know that stuff (it's high school math with a little new stat thrown in, but that's all).

Now in lecture, a way to do all this differently was presented. This method made use of percentiles and ranks. Here, we said we can look at a value of x and think of it in terms of its distance from the mean of x. So a score is a certain number of SDs away from the mean, and we 'know' that, given a value of x that is some number of SDs from the mean of x, the associated value of y is r times that many SDs of y from the mean of y (p. 151 in the text). So given a value of x we can calculate the estimated y value without ever developing the linear equation as we had to do above. The advantage? I could, using only knowledge of x's deviation in SDs (never knowing the value of x or the mean of x), calculate an expected value of y. We can also use knowledge of a person's ranking in a distribution and convert that into SDs, or use SDs to convert it into percentiles (here we know nothing about the score or the mean of either x or y). Look at the web--lecture 16.

You're right. These are all the same thing in many ways, because it is simply one equation. But the first method requires all 6 pieces of information to generate an answer. These other methods can be used when you have fewer than 6 pieces of information. In the world of being coddled by textbooks and classes, you've always had all 6 pieces of information. These latter methods simply show you that you now have ways to jury-rig answers with less--this is more likely to happen in the real world--and the skill to develop now is to think about what method to use when I know all 6, when I only know these 3, when I only know 1, and so on.
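A small numeric sketch of both routes, with made-up summary statistics, just to show they land on the same prediction:

    # Hypothetical summary statistics (not from the text or lecture)
    r = 0.6                          # correlation between x and y
    mean_x, sd_x = 70.0, 3.0
    mean_y, sd_y = 150.0, 20.0
    x = 73.0                         # a given value of x

    # Route 1: build the linear equation, then plug in x
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    y_from_equation = slope * x + intercept

    # Route 2: work in standard units -- x is (x - mean_x)/sd_x SDs up,
    # so the predicted y is r times that many SDs of y above the mean of y
    z_x = (x - mean_x) / sd_x
    y_from_units = mean_y + r * z_x * sd_y

    print(y_from_equation, y_from_units)   # both give 162.0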

Dr. C.

 

Student question:

> we have a few questions:

>

> 1. what is a regression line?

> 2. what is the difference between a regression line and a SD line?

> 3. when do you use them?

> 4. could you please explain the problem on page 7 of lecture 17 in the

> "class notes" handbook?

>

> we are very confused.

> thank you!

The TA responds:

Hi,

These are pretty broad questions to answer over email, and I'm not sure where your exact questions about these concepts are, but we'll try...

For 1-3 (note: SD(y) is the standard deviation of y, and similarly for SD(x)):

The idea is that we try to fit a line to a scattergram, or we want to try to summarize the points using a line. There are different ways of doing this. One is to use the SD line, which has slope SD(y)/SD(x). Another option is to use the regression line, which has slope r*(SD(y)/SD(x)). Both lines go through the point (Xbar, Ybar). It turns out that the regression line does a better job of summarizing the scattergram.

There are a lot of aspects to this question... I don't know where the confusion is, so I'm not sure what part to talk about. Have you looked at the notes? She explains some of these concepts in more detail around lecture 16. As far as when to use them, it might be helpful to look at homework problems and problems done in lecture to see some examples.
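For example, a quick sketch with made-up summary statistics shows how the two slopes differ and what each line would predict for a given x:

    # Hypothetical summary statistics for a scattergram
    r = 0.5
    mean_x, sd_x = 20.0, 4.0       # e.g. x = age
    mean_y, sd_y = 180.0, 30.0     # e.g. y = cholesterol level

    # SD line: slope SD(y)/SD(x), passing through (Xbar, Ybar)
    sd_slope = sd_y / sd_x
    sd_intercept = mean_y - sd_slope * mean_x

    # Regression line: slope r*(SD(y)/SD(x)), through the same point
    reg_slope = r * sd_y / sd_x
    reg_intercept = mean_y - reg_slope * mean_x

    x = 24.0
    print(sd_slope * x + sd_intercept)     # SD line prediction: 210.0
    print(reg_slope * x + reg_intercept)   # regression prediction: 195.0, closer to Ybar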

For 4, I think you are talking about the question "What % of 24.5-year-olds have a cholesterol level of 100 or less?"

Here's another interpretation of the question. First, think of the scattergram with cholesterol going up the vertical axis and age going along the horizontal axis. Now, think of the vertical strip of points corresponding to people of age 24.5. We want to know what percent of these people (or points) have a cholesterol level of 100 or less. We use a normal approximation. For the mean (Xbar), we use the estimated mean cholesterol level for 24.5-year-olds; this is computed in the notes at the top of the page. For the SE, we use the r.m.s. error, which is computed on that page as well. We get the z-score as usual: (X - Xbar)/SE. From there, use the normal table to get the percent that lies below the z-score (since we are looking for a level of 100 OR LESS). For more on this, see Chapter 11, section 5 of your book.
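The arithmetic for that strip, with hypothetical numbers standing in for the ones computed in the notes, looks like this:

    # Hypothetical stand-ins for the values computed at the top of that page
    predicted_mean = 120.0   # estimated mean cholesterol for 24.5-year-olds
    rms_error = 15.0         # r.m.s. error of the regression line (the SD inside the strip)
    level = 100.0

    z = (level - predicted_mean) / rms_error   # (X - Xbar)/SE, here about -1.33

    # From the normal table, roughly 9% of the area lies below z = -1.33,
    # so about 9% of 24.5-year-olds would have a cholesterol level of 100 or less.
    print(z)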

I hope this helps. If you have more specific questions, I'll check my mail again at 6, but I'm leaving soon after that.

Thanks, and hope it goes well tomorrow!

 

Student question:

Dr. Cochran,

I have a few questions for you.

1. When do you make a chi-square test and when do you make a z-test? I don't know how to figure out which one to use just by looking at the word problem.

2. On page 485 of the book, the authors mention something about a left-handed tail. I don't understand that whole section. Can you please explain it to me?

3. In chapter 29 the authors say that large samples can be bad and can skew the data. Is that actually what they are saying, or have I interpreted it wrong?

Please answer these questions if you have time.

Thanks,

Professor response:

1. Chi-square analyzes counts (each person or subject is thrown into one and only one cell), and z and t analyze differences in means of the sample or samples (each person contributes a score, but you are only analyzing the deviation of the mean of the group from some standard). A small numeric sketch of this distinction follows point 3 below.

2. Here the issue was that you 'expect' chance variation to happen. The left tail of a chi-square (very, very rarely used--I've never seen it before in 20 years of work, but as the authors point out you _can_ do that) evaluates departures toward an absence of chance error: the data are too good to be true. You can think of it as flipping a coin 10 times, getting 5 heads, repeating this process 100 more times, and always getting exactly the same result--it is too unlikely to be true.

3. The issue with large samples is that N is in the denominator, so the math appears to be very, very precise. But one of the ways to think about it is this: there is a certain amount of bias (our design tries to make it zero, but it won't be exactly that), and when we can get chance to be very, very small (remember early in the course you learned that if you repeat a chance process many times, the deviation you observe as a sum is large but the percent away from the expected is very, very small--well, it's that latter quantity that ends up in our test statistic's denominator), then we can find significance with even trivial differences. It's like using too powerful an instrument to do something. For example, I'm analyzing some data now where my sample is 9400 women. When I divide them into two groups, the difference between 33.2% and 33.6% is 'significant'--that is, unlikely to be due to chance... but my hunch is that bias could easily account for a percent or two. If my sample instead were 200, then I'd have to see 25% vs. 35% or so to find significance. That would make me feel more comfortable that the difference was real and also meaningful.
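A rough sketch of the distinction in point 1, with made-up counts and scores:

    import math

    # Chi-square works on counts: each subject falls in exactly one cell,
    # and we compare observed counts to the counts the null expects.
    observed = [55, 45]              # e.g. heads and tails in 100 flips
    expected = [50, 50]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # A z (or t) works on scores: each subject contributes a score, and we
    # analyze the deviation of the group mean from some standard.
    scores = [3.1, 2.8, 3.5, 3.0, 2.9, 3.3]
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)
    z = (mean - 3.0) / (sd / math.sqrt(n))   # standard of 3.0, made up

    print(chi_square, z)

And for point 3, the mechanics of why sample size matters (generic made-up percentages, not the actual study, using a simple two-sample z for the difference between percentages): the SE shrinks like 1/sqrt(n), so the same observed gap yields a much larger z with a huge sample.

    import math

    def z_for_gap(p1, p2, n_per_group):
        # z for the difference between two sample percentages (as proportions)
        p_pooled = (p1 + p2) / 2
        se_diff = math.sqrt(2 * p_pooled * (1 - p_pooled) / n_per_group)
        return (p1 - p2) / se_diff

    # The same 2-percentage-point gap at two very different sample sizes
    print(z_for_gap(0.34, 0.32, 100))     # z is about 0.3 -- nowhere near significant
    print(z_for_gap(0.34, 0.32, 10000))   # z is about 3.0 -- 'significant'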

good luck tomorrow

Dr. C.

 

Student question:

> hi, professor cochran,

> i do not understand significance tests. why do you make a null
> hypothesis? how do you determine whether or not the null hypothesis can
> be rejected on the basis of 'p'? what *is* 'p' in relation to the
> normal curve? i just don't understand.

Professor response:

Why the null... first, do you know what the null is? In general it is a hypothesis of no difference, no deviation except for chance (I'm assuming that you have down the idea that chance adds or subtracts some amount to each observation, always, without fail, and that each observation is made up of its true value + chance + bias). So the null states that whatever deviation we observe from what we expect is due to chance alone (not a real difference in true value, not bias). If and only if the null is correct, then a z-test or a t-test has a known distribution, and the area under the curve is known. This area has a known percent of the distribution--the percentile is P. So if the null is correct, then the z we obtain from a z-test can be compared to the table in the back of the book to find out what percent of the distribution is cut off. For example, if you get a z of about 2, that is associated with an area of roughly 95 (for z = 2.00 the table gives 95.45), meaning that about 5% is outside that area in the two tails. If we have a two-tailed alternative hypothesis (for example, there is a difference and we don't specify whether it is bigger or smaller), then we choose P = .05, or a z value of about 2, as a cutoff for deciding whether to reject our null hypothesis. If we reject our null (that what we observe is what we expect), then we must accept the logical alternative (what we observe is not what we expect). The logical alternative, or the alternative hypothesis, is our research hypothesis in statistical form. So today, we thought that giving people a drug would make their ESP ability improve so that it was different from those who did not get the drug.

We have to test the null and not the alternative because only the null is testable. A z-test, for example, divides the difference between what we observe and what we expect by a weight (the SE) for how much deviation we expect on average. Under the null the numerator is 0 except for chance deviation, and since we divide by an estimate of our chance deviation, the z should be close to 0, plus or minus some small number (generally < 2 or so). Under the alternative the numerator is not zero (think about why this is so...) and it contains both true difference and difference due to chance. We just don't know what this number should be, so we can't evaluate the problem.
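A bare-bones numeric sketch of that machinery, using a made-up coin-flip example (math.erf stands in for the normal table in the back of the book):

    import math

    # Null hypothesis: the coin is fair, so in 100 tosses we expect 50 heads.
    observed_heads = 61
    expected_heads = 50
    se = math.sqrt(100 * 0.5 * 0.5)             # SE for the number of heads

    z = (observed_heads - expected_heads) / se  # (observed - expected) / SE

    # P: the chance, if the null were true, of a z at least this far from 0 (two tails).
    area_below = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - area_below)

    print(z, p_value)   # z = 2.2, P is about 0.03, which is below .05, so reject the null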

Hang in there. Keep reading the book; keep coming to class. This stuff is hard to learn, but it is the heart of the statistical technique.

Dr. C.

Student question:

Dr. Cochran

On question 4 of the practice final questions, as far as I can tell, I am doing everything correctly. My answer does not make sense to me, and I have looked at many other examples to see what I am doing wrong. I can't see what I am doing wrong. I got that the z-score equals 29.74. Maybe that is correct, but it seems high. Should we be getting z-scores like that?

Professor response:

Sure, why not? The difference is a mixture of chance error and true difference (high school vs. college). If the chance error in estimating the mean is small, as it would be with such a large sample, but the true difference is large, then z can be huge. Like I said in class, these questions were cast-offs and so were not edited by the TAs and myself--normally we would remove a result like that (by changing the numbers around) because the large z would throw students on a test. But there is no reason why you can't obtain a value like that (imagine timing the 100-yard dash in a group of college sprinters vs. 4th graders--you'd get a real whopping big z value).
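Putting the sprinter analogy into numbers (all made up) shows how a big true difference divided by a tiny SE gives an enormous z:

    import math

    # Hypothetical 100-yard-dash times: college sprinters vs. 4th graders
    sprinter_mean, sprinter_sd, n1 = 11.0, 0.5, 50
    grader_mean, grader_sd, n2 = 18.0, 2.0, 50

    # SE of the difference between the two sample means
    se_diff = math.sqrt(sprinter_sd ** 2 / n1 + grader_sd ** 2 / n2)
    z = (grader_mean - sprinter_mean) / se_diff

    print(z)   # roughly 24 -- a whopping big z, with nothing wrong in the arithmetic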

Dr. C.

 

Student question:

> I am having trouble understanding percentile ranks for joint
> distributions (in the regression chapters). I get confused about where the
> final percent comes from. The examples in the book leave out a couple
> of steps, because I think they assume that the reader understands where
> the numbers are coming from. Could you please give me an example that
> includes all the steps?
>
> Thank you,
>
> a student

Professor response:


Ms. ,

That's a very broad question; much of a long answer from me would probably not address what is confusing you. Can you give me a problem in the book where you are unclear?

The basic idea is that regression (like correlation) reflects the extent to which scores share similar percentiles. So if I am 1 SD up on height, and height and weight perfectly correlate (r = 1.0), and we know by definition the slope predicting weight from height is r*SD_weight/SD_height, then I will be 1 SD up on weight. The key here is to remember that 'given my position in SDs on the predictor variable, if I multiply that by the correlation, it gives me what I expect my position in SDs to be on the new variable of interest'. Now SDs you can easily translate into percentiles and vice versa. So the steps would be:

1> Take the percentile for the variable you are using to predict the percentile on the other variable, and convert it to an SD using the normal table.
2> Multiply the SD by the correlation to get the SD you expect for the new variable.
3> Convert the new SD back to a percentile using the normal table.
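A small worked version of those three steps, with a made-up correlation and percentile (Python's NormalDist stands in for the normal table; values are approximate):

    from statistics import NormalDist

    norm = NormalDist()      # standard normal, playing the role of the table
    r = 0.60                 # hypothetical correlation between the two variables

    # Step 1: percentile on the predictor -> SDs above the mean
    percentile_x = 90
    z_x = norm.inv_cdf(percentile_x / 100)    # about 1.28 SDs up

    # Step 2: multiply by the correlation to get the expected SDs on the new variable
    z_y = r * z_x                             # about 0.77 SDs up

    # Step 3: convert back to a percentile using the table
    percentile_y = norm.cdf(z_y) * 100        # about the 78th percentile

    print(z_x, z_y, percentile_y)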

If the problem is that you don't know how to convert from SD to percentile, then the place you got lost was in Chap. 5. Go back and work through that again, and then face the regression material.

Let me know how you're doing...

Dr. C.