© 2004, S. D. Cochran. All rights reserved.
REGRESSION
So far with joint distributions we've been talking about linear relationships between two variables. This suggests that if we could find the line that best summarizes the data, we might be able to reduce the amount of information we need to communicate our observations to others.
Remember, at the beginning of this course we started with a sample distribution of observed scores and reduced it to an estimate of the center, or mean, and the S.D. Those two pieces of information made it possible to communicate to others a picture of our observations without having to provide all the individual data points.
Now we might try to use the equation for a line to do the same thing: to summarize the relationship between two variables and to communicate it succinctly to others. If we had a line that summarized the information, all we would need are the slope, the intercept, and an X value to determine the corresponding value of Y.
Remember from high school math: the formula for a straight line is:
y = mx + b
where m is the slope and b is the intercept. The slope tells us how much y changes for a unit change in x; the intercept tells us where the line crosses the y axis--or, to put it another way, what y is when x is 0.
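As a minimal sketch, the straight-line formula can be written as a one-line function (the slope and intercept values below are made up for illustration):

```python
def line(x, m, b):
    """Return y for a given x, slope m, and intercept b (y = mx + b)."""
    return m * x + b

# With slope m = 2 and intercept b = 1, x = 3 gives y = 2*3 + 1 = 7,
# and x = 0 gives y = b = 1 (where the line crosses the y axis).
print(line(3, 2, 1))  # 7
print(line(0, 2, 1))  # 1
```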
That's a straight line where all joint values of X and Y fall exactly on the line--but in a joint distribution we do not have a straight line exactly; we have a scattergram of (x,y) points, and we try to fit a line that best describes what those points look like.
Fitting the line: we have three options to think about
Use the SD line
The SD line is a line with slope SD_{Y}/SD_{X}
It is insensitive to the relationship between x and y--it bisects the center of the joint distribution, but for it to reflect the relationship between x and y it would have to be weighted for the number of (x,y) points at each slice
Example of the problem: it does not necessarily cross the y axis at a place that makes sense when x is 0
Plot the averages at each joint occurrence of x and y
This does take into account the distributions of x at any point of y, and the distributions of y at any point of x, so it does weight for the relationship between x and y.
But--it may not be exactly a straight line, and therefore does not simplify the many details into two or three things to communicate
Plot a straight line that tries to position itself as close to the averages of the joint occurrences of x and y as possible
In effect, this line is the two dimensional average of the scattergram
Along the line for y on x is the smoothed version of the average value for y corresponding to a particular value of x
Unlike the SD line, the slope of this line is (r * SD_{Y})/SD_{X}--that is, it takes the relationship between x and y into account
But as we have learned before with distributions of single variables, there will be spread around the line. In regression, a residual is the difference between the actual data and our prediction. Residuals are defined as:
residual = actual value - predicted value
In deciding precisely where to fit this line, our decision will be based on minimizing the sum of the squared residuals, thus the name for the procedure "least squares regression"
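A minimal sketch of this fit in code: compute the slope as (r * SD_y)/SD_x, run the line through the point of averages, and check that the residuals of the fitted line sum to essentially zero (the data values below are made up for illustration):

```python
import math

def least_squares(xs, ys):
    """Fit the least-squares line for y on x.
    slope = r * SD_y / SD_x; the line passes through (mean_x, mean_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)
    slope = r * sd_y / sd_x
    intercept = my - slope * mx
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2, 2, 4, 4, 6]
m, b = least_squares(xs, ys)
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
print(m, b)            # slope 1.0, intercept 0.6 for these data
print(sum(residuals))  # essentially zero
```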
Regression method for individual scores
We can use means and SD's of x and y, and the correlation, r, to calculate estimated values of either x or y.
Example: Given the following
average depression score = 14, SD =2
average stress score = 30, SD = 5
r = .2
Then if someone scores 17 on the depression inventory our best guess for their stress score would be:
17 is 1.5 S.D.s above the depression mean: (17 - 14)/2 = 1.5
Estimate of stress score in S.D. units = (.2)(1.5) = 0.3
Estimate of stress score = 30 + (.3)(5) = 31.5
And if they scored 20 on the stress inventory, our best guess for their depression score would be:
20 is 2.0 S.D.s below the stress mean: (20 - 30)/5 = -2.0
Estimate of depression score in S.D. units = (.2)(-2.0) = -.4
Estimate of depression score = 14 + (-.4)(2) = 13.2
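The two estimates above follow the same three steps, which can be sketched as one prediction function (the function name is ours):

```python
def regression_predict(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predict y from x: convert x to standard units, multiply by r,
    then convert back into y's units."""
    z_x = (x - mean_x) / sd_x          # how many SDs x is from its mean
    return mean_y + (r * z_x) * sd_y   # shrink by r, rescale, recenter

# Depression 17 -> stress: (17-14)/2 = 1.5; 30 + (.2)(1.5)(5) = 31.5
print(round(regression_predict(17, 14, 2, 30, 5, 0.2), 1))  # 31.5
# Stress 20 -> depression: (20-30)/5 = -2.0; 14 + (.2)(-2.0)(2) = 13.2
print(round(regression_predict(20, 30, 5, 14, 2, 0.2), 1))  # 13.2
```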
We can also use percentile ranks and our knowledge of the correlation, without ever knowing the mean and SD of the sample for either variable. This means we can deal with hypothetical samples.
Example: Let's say the correlation between current interest rates and inflation is .4. If interest rates are currently 2 SD’s lower than their average over the past one hundred years, what is our estimate of the inflation level?
For interest rates, -2.0 S.D. is roughly the 2nd percentile (the area below z = -2.0 is about 2.3%).
To get inflation's percentile: the estimated z-score is (.4)(-2.0) = -0.8. From the normal table, the area between -0.8 and +0.8 is about 57.63%, so the percentile rank = 100 - ((57.63/2) + 50) = 21.2, i.e., roughly the 21st percentile.
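A sketch of the same calculation using the standard normal CDF (via math.erf) instead of a printed table; the numbers follow the interest-rate example above:

```python
import math

def percentile_from_z(z):
    """Percentile rank (0-100) of a z-score under the normal curve."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

r = 0.4
z_interest = -2.0             # interest rates are 2 SDs below their average
z_inflation = r * z_interest  # estimated z for inflation = -0.8

print(round(percentile_from_z(z_interest), 1))   # about the 2nd percentile
print(round(percentile_from_z(z_inflation), 1))  # about the 21st percentile
```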
Some points about regression
Regression fallacy--test-retest
Observed values are a combination of true score and chance error.
Chance is bidirectional--sometimes pushing a score one way and sometimes another
If you measure something and the score includes a large negative chance error, chances are that the second time you measure it the chance error will be smaller, and the observed score will fall closer to the mean
This implies that in test-retest situations, individuals who are outliers in the first testing will simply by chance tend to score closer to the mean on second testing.
The book refers to it as the regression effect; elsewhere it is called regression to the mean
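A small simulation sketching the regression effect, assuming a hypothetical group in which everyone has the same true score and observed scores differ only by chance error:

```python
import random

random.seed(0)
true_score = 100   # hypothetical common true score
sd_error = 15      # hypothetical spread of the chance error

# Test-retest: each observed score = true score + a fresh chance error
n = 100_000
pairs = [(true_score + random.gauss(0, sd_error),
          true_score + random.gauss(0, sd_error)) for _ in range(n)]

# Select people who scored at least 1 SD above the mean on the first test;
# because the retest draws a fresh, independent chance error, their
# retest average falls back toward the mean.
high_first = [(t1, t2) for t1, t2 in pairs if t1 > true_score + sd_error]
avg_first = sum(t1 for t1, _ in high_first) / len(high_first)
avg_retest = sum(t2 for _, t2 in high_first) / len(high_first)
print(avg_first)   # well above 100
print(avg_retest)  # close to 100
```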
The equation only describes the relationship between x and y within the group that was observed.
Remember the regression line is the two-dimensional average of the scatterplot
The scatterplot comes from a sample
When you have a different sample, the line may not represent the average of the new sample
Example: Suppose we collected data from college students in Maine (where it snows a lot). Suppose we calculate a line predicting minutes late to class from day during the quarter. Predictably this line shows that as winter turns to spring, students are less likely to be late (because improving weather no longer impairs their ability to get to class). If we then use that line to predict the behavior of UCLA students we will be completely inaccurate. In fact it may be that as weather gets better in L.A., students are later and later to class as they enjoy the sunshine more.
There are actually two possible regression lines
One line, regressing y on x, uses x to predict values of y
The other line, regressing x on y, uses y to predict values of x
These lines are related but their slopes and intercepts differ
Slopes
For predicting y, the slope = (r * SD_{Y})/SD_{X}
For predicting x, the slope = (r * SD_{X})/SD_{Y}
Intercepts
For predicting y, the intercept is the value of y when x = 0--the point at which the line crosses the y axis
For predicting x, the intercept is the value of x when y = 0--the point at which the line crosses the x axis
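A quick numerical sketch of the two slopes, with made-up values for r and the SDs:

```python
# Two regression lines from the same joint distribution (illustrative values)
r, sd_x, sd_y = 0.5, 2.0, 4.0

slope_y_on_x = r * sd_y / sd_x   # predicting y from x: 0.5 * 4 / 2 = 1.0
slope_x_on_y = r * sd_x / sd_y   # predicting x from y: 0.5 * 2 / 4 = 0.25

# The product of the two slopes is r**2; the lines coincide only when r = ±1
print(slope_y_on_x, slope_x_on_y, slope_y_on_x * slope_x_on_y)  # 1.0 0.25 0.25
```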
A regression line does not imply causality
Example: Miasma theory--bad smells were correlated with cholera outbreaks, but the cause was contaminated water, not the air.