© 2004, S. D. Cochran. All rights reserved.
REGRESSION
So far with joint distributions we've been talking about linear relationships between two variables. This suggests that if we could find the line that best summarizes the data, we might be able to reduce the amount of information we need to communicate our observations to others.
Remember, at the beginning of this course we started with a sample distribution of observed scores and reduced it to an estimate of the center, or mean, and the S.D. Those two pieces of information made it possible to communicate to others a picture of our observations without having to provide all the individual data points.
Now we might try to use the equation for a line to do the same thing: to summarize the relationship between two variables and to communicate it succinctly to others. If we had a line that summarized the information, all we would need are the slope, the intercept, and an X value to determine the corresponding value of Y.
Remember from high school math: the formula for a straight line is:
y = mx + b
where m is the slope and b is the intercept. The slope tells us how much y changes for a unit change in x; the intercept tells us where the line crosses the y axis--or, to put it another way, what y is when x is 0.
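As a minimal sketch, the straight-line formula can be written as a one-line function (the slope and intercept values below are made up for illustration):

```python
def line(x, m, b):
    """Return y for a given x, slope m, and intercept b (y = mx + b)."""
    return m * x + b

# With slope m = 2 and intercept b = 1, x = 3 gives y = 2*3 + 1 = 7,
# and x = 0 gives y = b = 1 (where the line crosses the y axis).
print(line(3, 2, 1))  # 7
print(line(0, 2, 1))  # 1
```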
That's a straight line where all joint values of X and Y fall exactly on the line--but in a joint distribution we do not have a straight line exactly; we have a scattergram of (x,y) points, and we try to fit a line that best describes what those points look like.
Fitting the line: we have three options to think about
Use the SD line
The SD line is a line with slope SD_{Y}/SD_{X}
It is insensitive to the relationship between x and y--it bisects the center of the joint distribution, but for it to reflect the relationship between x and y it would have to be weighted for the number of (x,y) points at each slice
Example of the problem: it does not necessarily cross the y axis at a place that makes sense when x is 0
Plot the averages at each joint occurrence of x and y
This does take into account the distributions of x at any point of y, and the distributions of y at any point of x, so it does weight for the relationship between x and y.
But--it may not be exactly a straight line, and therefore does not simplify the many details into two or three things to communicate
Plot a straight line that tries to position itself as close to the averages of the joint occurrences of x and y as possible
In effect, this line is the two dimensional average of the scattergram
Along the line for y on x is the smoothed version of the average value for y corresponding to a particular value of x
Unlike the SD line, the slope of this line is (r * SD_{Y})/SD_{X}--that is, it takes the relationship between x and y into account
But as we have learned before with distributions of single variables, there will be spread around the line. In regression, a residual is the difference between the actual data and our prediction. Residuals are defined as:
residual = actual value - predicted value
In deciding precisely where to fit this line, our decision will be based on minimizing the sum of the squared residuals, thus the name for the procedure "least squares regression"
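A minimal sketch of this fit in code: compute the slope as (r * SD_y)/SD_x, run the line through the point of averages, and check that the residuals of the fitted line sum to essentially zero (the data values below are made up for illustration):

```python
import math

def least_squares(xs, ys):
    """Fit the least-squares line for y on x.
    slope = r * SD_y / SD_x; the line passes through (mean_x, mean_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)
    slope = r * sd_y / sd_x
    intercept = my - slope * mx
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2, 2, 4, 4, 6]
m, b = least_squares(xs, ys)
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
print(m, b)            # slope 1.0, intercept 0.6 for these data
print(sum(residuals))  # essentially zero
```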
Regression method for individual scores
We can use means and SD's of x and y, and the correlation, r, to calculate estimated values of either x or y.
Example: Given the following
average depression score = 14, SD =2
average stress score = 30, SD = 5
r = .2
Then if someone scores 17 on the depression inventory our best guess for their stress score would be:
17 is 1.5 S.D.s above the depression mean: (17 - 14)/2 = 1.5
Estimate of stress score in S.D. units = (.2)(1.5) = 0.3
Estimate of stress score = 30 + (.3)(5) = 31.5
And if they scored 20 on the stress inventory, our best guess for their depression score would be:
20 is 2.0 S.D.s below the stress mean: (20 - 30)/5 = -2.0
Estimate of depression score in S.D. units = (.2)(-2.0) = -.4
Estimate of depression score = 14 + (-.4)(2) = 13.2
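The two estimates above follow the same three steps, which can be sketched as one prediction function (the function name is ours):

```python
def regression_predict(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predict y from x: convert x to standard units, multiply by r,
    then convert back into y's units."""
    z_x = (x - mean_x) / sd_x          # how many SDs x is from its mean
    return mean_y + (r * z_x) * sd_y   # shrink by r, rescale, recenter

# Depression 17 -> stress: (17-14)/2 = 1.5; 30 + (.2)(1.5)(5) = 31.5
print(round(regression_predict(17, 14, 2, 30, 5, 0.2), 1))  # 31.5
# Stress 20 -> depression: (20-30)/5 = -2.0; 14 + (.2)(-2.0)(2) = 13.2
print(round(regression_predict(20, 30, 5, 14, 2, 0.2), 1))  # 13.2
```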
We can also use percentile ranks and our knowledge of the correlation, without ever knowing the mean and SD of the sample for either variable. This means we can deal with hypothetical samples.
Example: Let's say the correlation between current interest rates and inflation is .4. If interest rates are currently 2 SD’s lower than their average over the past one hundred years, what is our estimate of the inflation level?
For interest rates, -2.0 S.D. is roughly the 2nd percentile (the area below z = -2.0 is about 2.3%).
To get inflation's percentile: the estimated z-score is (.4)(-2.0) = -0.8. From the normal table, the area between -0.8 and +0.8 is about 57.63%, so the percentile rank = 100 - ((57.63/2) + 50) = 21.2, i.e., roughly the 21st percentile.
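A sketch of the same calculation using the standard normal CDF (via math.erf) instead of a printed table; the numbers follow the interest-rate example above:

```python
import math

def percentile_from_z(z):
    """Percentile rank (0-100) of a z-score under the normal curve."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

r = 0.4
z_interest = -2.0             # interest rates are 2 SDs below their average
z_inflation = r * z_interest  # estimated z for inflation = -0.8

print(round(percentile_from_z(z_interest), 1))   # about the 2nd percentile
print(round(percentile_from_z(z_inflation), 1))  # about the 21st percentile
```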
Some points about regression
Regression fallacy--test-retest
Observed values are a combination of true score and chance error.
Chance is bidirectional--sometimes pushing a score one way and sometimes another
If you measure something and the score includes a large negative chance error, chances are that the second time you measure it the chance error will be smaller, and the observed score will fall closer to the mean
This implies that in test-retest situations, individuals who are outliers in the first testing will simply by chance tend to score closer to the mean on second testing.
The book refers to it as the regression effect; elsewhere it is called regression to the mean
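A small simulation sketching the regression effect, assuming a hypothetical group in which everyone has the same true score and observed scores differ only by chance error:

```python
import random

random.seed(0)
true_score = 100   # hypothetical common true score
sd_error = 15      # hypothetical spread of the chance error

# Test-retest: each observed score = true score + a fresh chance error
n = 100_000
pairs = [(true_score + random.gauss(0, sd_error),
          true_score + random.gauss(0, sd_error)) for _ in range(n)]

# Select people who scored at least 1 SD above the mean on the first test;
# because the retest draws a fresh, independent chance error, their
# retest average falls back toward the mean.
high_first = [(t1, t2) for t1, t2 in pairs if t1 > true_score + sd_error]
avg_first = sum(t1 for t1, _ in high_first) / len(high_first)
avg_retest = sum(t2 for _, t2 in high_first) / len(high_first)
print(avg_first)   # well above 100
print(avg_retest)  # close to 100
```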
The equation only describes the relationship between x and y within the group that was observed.
Remember the regression line is the two-dimensional average of the scatterplot
The scatterplot comes from a sample
When you have a different sample, the line may not represent the average of the new sample
Example: Suppose we collected data from college students in Maine (where it snows a lot). Suppose we calculate a line predicting minutes late to class from day during the quarter. Predictably this line shows that as winter turns to spring, students are less likely to be late (because improving weather no longer impairs their ability to get to class). If we then use that line to predict the behavior of UCLA students we will be completely inaccurate. In fact it may be that as weather gets better in L.A., students are later and later to class as they enjoy the sunshine more.
There are actually two possible regression lines
One line, regressing y on x, uses x to predict values of y
The other line, regressing x on y, uses y to predict values of x
These lines are related but their slopes and intercepts differ
Slopes
For predicting y, the slope = (r * SD_{Y})/SD_{X}
For predicting x, the slope = (r * SD_{X})/SD_{Y}
Intercepts
For predicting y, the intercept is the value of y when x = 0--the point at which the line crosses the y axis
For predicting x, the intercept is the value of x when y = 0--the point at which the line crosses the x axis
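A quick numerical sketch of the two slopes, with made-up values for r and the SDs:

```python
# Two regression lines from the same joint distribution (illustrative values)
r, sd_x, sd_y = 0.5, 2.0, 4.0

slope_y_on_x = r * sd_y / sd_x   # predicting y from x: 0.5 * 4 / 2 = 1.0
slope_x_on_y = r * sd_x / sd_y   # predicting x from y: 0.5 * 2 / 4 = 0.25

# The product of the two slopes is r**2; the lines coincide only when r = ±1
print(slope_y_on_x, slope_x_on_y, slope_y_on_x * slope_x_on_y)  # 1.0 0.25 0.25
```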
A regression line does not imply causality
Example: Miasma theory--bad smells were correlated with cholera outbreaks, but the cause was contaminated water, not the air.