© 2004, S. D. Cochran. All rights reserved.
REGRESSION CONTINUED
Remember: the regression equation for predicting y from x is y = bx + a (the intercept a is sometimes also written as "e")
b, or the slope, is simply (r_{xy} * S.D._{y})/S.D._{x}
a, or the intercept, is simply the predicted value of y when x is 0:
a = (mean of y) - b * (mean of x)
[Why?: the point, a, where the line crosses the Y axis (X = 0) sits a distance Δy from the mean of Y, where Δy is the change predicted in moving from the mean of x down to an X value of 0:
Remember: Δy = b * Δx, and here Δx = 0 - (mean of x) = -(mean of x)
a = Δy + (mean of y)
so:
a = (mean of y) - b * (mean of x)
]
Example: Let's say we knew that the average UCLA student experiences a moderate level of anxiety on a 100-point scale, mean = 36.8, S.D. = 12.2. Also, that students average a course load of about 13 or so units, mean = 13.4, S.D. = 3.7. And finally, that the correlation between units taken and anxiety levels is a stunning r = .4.
You might ask as you plan your schedule for next quarter, how much anxiety can I expect to experience if I take 20 units? Treat units as x and anxiety as y. Then
The slope of the line predicting anxiety from units taken is (.4 * 12.2)/3.7 = (4.88)/3.7 = 1.32
The intercept is 36.8 - 1.32*13.4 = 36.8 - 17.67 = 19.13
So the predicted anxiety score when taking 20 units is:
y (or anxiety) = 1.32 * (20 units) + 19.13 = 45.53
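The arithmetic above can be checked with a short script. This is a minimal sketch built only from the summary statistics in the example (the function name is just illustrative):

```python
def regression_line(r_xy, mean_x, sd_x, mean_y, sd_y):
    """Return (slope, intercept) of the line predicting y from x."""
    b = r_xy * sd_y / sd_x   # slope: b = (r * S.D. of y) / S.D. of x
    a = mean_y - b * mean_x  # intercept: a = (mean of y) - b * (mean of x)
    return b, a

# Units (x) and anxiety (y) example from the notes
b, a = regression_line(r_xy=0.4, mean_x=13.4, sd_x=3.7, mean_y=36.8, sd_y=12.2)
predicted = b * 20 + a  # predicted anxiety for a 20-unit load
print(round(b, 2), round(a, 2), round(predicted, 2))
```

Carrying full precision gives a prediction of about 45.50 rather than 45.53; the small difference comes from rounding the slope to 1.32 before multiplying by 20.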
The method of least squares
The r.m.s. error (or standard error of the estimate) for the regression line of y on x is: the square root of (1 - r^{2}), times S.D._{y}
The regression equation is the equation for the line that produces the least r.m.s. error or standard error of the estimate
If x and y are perfectly related, that is, all points lie on the regression line, the standard error of the estimate is zero (the square root of 1 - 1^{2} is 0, and 0 times S.D._{y} is 0); there is no deviation from the line.
If x and y are not associated at all, the standard error of the estimate is the S.D. of y (the square root of 1 - 0^{2} is 1) and the slope is 0. So the regression line is simply a line parallel to the x axis that crosses the y axis at the mean of y.
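Putting the pieces together, the standard error of the estimate is the square root of (1 - r^2) times the S.D. of y. A minimal sketch checking the anxiety example and the two boundary cases described above (the function name is just illustrative):

```python
import math

def rms_error(r_xy, sd_y):
    """Standard error of the estimate for the regression of y on x."""
    return math.sqrt(1 - r_xy ** 2) * sd_y

# Anxiety example: r = .4, S.D. of y = 12.2
print(round(rms_error(0.4, 12.2), 2))  # typical size of prediction errors

# Boundary cases
print(rms_error(1.0, 12.2))  # perfect correlation -> 0, no deviation from the line
print(rms_error(0.0, 12.2))  # no association -> equals the S.D. of y
```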
Interpretation
Regression is appropriate when the relationship between two variables is linear
Although we commonly think of x as causing y, this is dependent upon the research design and logic
GIGO--garbage in-garbage out--you can always create regression lines predicting one variable from another. The math is the same whether or not the analysis is appropriate
Example: Calculate a regression line predicting height of the surf at Venice beach from the number of floors in the math building.
HYPOTHESIS TESTING
So far we have learned how to take raw data, combine it, and create statistics that allow us to describe the data in a brief summary form.
We have used statistics to describe our samples. These are called descriptive statistics.
We have used our statistics to say something about the population that our samples were drawn from--this is inferential statistics.
Now we are going to learn another way in which statistics can be used inferentially--hypothesis testing
At the beginning of this course, we said that an important aspect of doing research is to specify our research question
The first step in conducting research is to translate our inclinations, hunches, suspicions, beliefs into a precise question.
Example: Is this drug effective? Does lowering the interest rate cause inflation?
The second step is to look closely at the question we have asked and assure ourselves that we know what an answer to the question would look like
Example: Is this drug effective? Do we know exactly what drug we are referring to, how big a dose, given to whom? Can we define what we mean by effective? Do we mean effective for everyone? Is it a cure? What about side effects?
Now, we are going to add one more layer to this--the third step is to translate our question into a hypothesis that we can test by using statistical methods.
Example: Is this drug effective? Does it reduce symptoms? Do people report higher average pain before they take the drug than after they have taken it for a while?
Statistically, what we are saying is, perhaps, that the mean pain at time 1 is greater than the mean pain at time 2. But how much greater does it have to be?
Remember every observation is potentially made up of three components: true or expected score + bias + chance error. Things vary from being exactly the same every time we measure them for one of three possible reasons:
The true score could in fact be different from what we expect
There is bias
Random variation or chance
Generally, we are interested only in whether the true score is different. We design our studies to minimize bias as much as possible. But no matter what we do, there is always random variation
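The decomposition above can be illustrated with a toy simulation. Everything here is invented purely for illustration (a hypothetical true score of 50, chance error with an S.D. of 5, and bias designed down to zero): even so, repeated measurements still scatter, and only their average settles near the true score.

```python
import random

random.seed(1)  # make the chance error reproducible

TRUE_SCORE = 50.0  # hypothetical true score
BIAS = 0.0         # a good design aims for zero bias

# Each observation = true score + bias + chance error
observations = [TRUE_SCORE + BIAS + random.gauss(0, 5) for _ in range(1000)]

mean_obs = sum(observations) / len(observations)
print(round(mean_obs, 1))  # close to 50, though individual values vary widely
```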
This means that whenever we evaluate a change or difference between two things, we have, even with a perfect design eliminating bias, two possible causes. This is like trying to solve a problem with two unknowns: if I tell you x + y = 5, you cannot tell me what x is or what y is.
There are two strategies to solving this dilemma
Set one of the unknowns to a value, such as 0 by use of logic
Get two estimates of one of the unknowns from two different sources and divide one by the other; on average, this ratio should equal 1.
Combine these two strategies
Statistical tests use these approaches to try to evaluate how much of the difference between two things can be attributed to a difference in the true score.
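The second strategy can be illustrated with a small simulation: draw two samples from the same population, so that any disagreement between their two estimates of the chance variation is itself only chance, and check that the ratio of the estimates hovers near 1. All numbers here are invented for illustration.

```python
import random

random.seed(2)  # reproducible simulation

def sample_variance(xs):
    """Ordinary sample variance with n - 1 in the denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Two samples drawn from the SAME population: any difference between their
# variance estimates is chance alone, so the ratio should average near 1.
ratios = []
for _ in range(2000):
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    ratios.append(sample_variance(a) / sample_variance(b))

mean_ratio = sum(ratios) / len(ratios)
print(round(mean_ratio, 2))  # close to 1
```

This "two estimates, take the ratio" idea is essentially what statistics like F do; a true difference between groups pushes one estimate up and drives the ratio well above 1.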
Now for the mind twist
To evaluate a research question, we translate the question into logical alternatives
One is a mathematical statement that says there is no difference. Or essentially, all the difference that we observe is due to chance alone.
This is called the null hypothesis. Null meaning nothing. And the hypothesis is that nothing is there in our data, no differences from what we expect except chance variation or chance error.
Example: Does this drug reduce pain?
The null hypothesis is that any change in mean levels of pain from time 1 to time 2 is simply random (explained by chance error) and the true score does not vary from time 1 to time 2.
Or mathematically, the truth is: μ_{1} = μ_{2} in the population
Because the hypothesis does not refer to what we observe in our sample, but rather what is true in the population, the null hypothesis is typically written:
H_{0}: μ_{1} = [some value such as 0, or any number we expect the true score to be]
There are two other possible alternatives.
That pain is in fact reduced at time 2
Or mathematically: μ_{2} < μ_{1} in the population
That pain is in fact increased at time 2
Or mathematically: μ_{2} > μ_{1} in the population
Each one of these is referred to as a tail (for reasons we'll find out later).
If we only predict that time 2 pain will be less than time 1 pain, then our alternative hypothesis (which is our research hypothesis) is considered one-tailed
With one-tailed hypotheses, the other tail is simply added to the original null hypothesis, giving the statement: μ_{2} ≥ μ_{1}
If either possibility is consistent with our research hypothesis, then our statistical hypothesis that restates the research hypothesis is two-tailed, or: μ_{1} ≠ μ_{2}
Again, our hypothesis refers to what is true in the population and so is formally written:
H_{1}: μ_{1} ≠ [the same value as we specified above for our null hypothesis]
Notice that if we combine the two hypotheses we have logically included all possibilities (they are mutually exclusive and exhaustive)
So if one is absolutely correct, the other must be false
If one is highly unlikely to be true, the other just might possibly be true
If one is perhaps correct, we have not really reduced our uncertainty at all about the other.
Because of the problem of too many unknowns, we end up being able to evaluate only the possible truth of the null hypothesis. We are not interested in the null hypothesis for its own sake. But because it is related by logic to the alternative hypothesis, which is a statistical restatement of our research hypothesis, if we can conclude something definitive about the null hypothesis, then we can make a judgment about the possibility of the alternative being true.
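As a preview of how this plays out, the drug example can be sketched as a paired comparison: compute the mean change in pain, estimate the chance variation in that change, and see how large the ratio is. The pain scores below are invented, and the cutoff of about 2.26 (for n - 1 = 9 degrees of freedom, two-tailed, at the conventional 5% level) is simply quoted for the sketch.

```python
import math

# Hypothetical pain scores for 10 patients, before and after the drug
before = [72, 65, 80, 74, 68, 77, 71, 66, 79, 70]
after  = [60, 62, 70, 69, 66, 70, 68, 64, 71, 65]

diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
mean_diff = sum(diffs) / n
sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
se = sd_diff / math.sqrt(n)  # chance variation expected in the mean change

# Ratio of the observed difference to its chance variation (a t statistic)
t = mean_diff / se
print(round(mean_diff, 1), round(t, 2))

# If t falls far from 0 (beyond about 2.26 here), the null hypothesis
# (no true change, mu_1 = mu_2) starts to look unlikely.
```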