Statistics 50
Lecture 6


A. Recap of Regression

    1.  Idea:  given a set of data, find the line that best summarizes the 
        relationship between X and Y.

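        The method of least squares picks the line that makes the
        sum of the squared vertical distances between the observed
        Y values and the line as small as possible.  It is popular
        partly because it is simple: the line can be found from just
        knowing the means and standard deviations of the two
        variables and their correlation.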


    2.  Formula:  the coefficients for the least squares line are

                     (sum of xy) - (1/n)(sum x)(sum y)
                b = -----------------------------------             [**]
                     (sum of x^2) - (1/n)(sum x)^2

                a = (average of y's) - b * (average of x's)


    3.  Example.  For the data below, find the regression coefficients:
	  	 	 (1,10), (3,6), (5,5), (7,1), (9,4).

  	Answer:	 predicted y = 9.45 - 0.85 X

        (you can do this two ways, either with the formula
         above or by knowing that b = r*(Sy/Sx)) 
   
        Note: What the correlation-slope formula suggests is
        that a change of one standard deviation in X corresponds
        to a predicted change of r standard deviations in Y.  So
        when r=1 or -1, a one-SD change in X goes with a full
        one-SD change in Y.  As r moves away from +1 or -1 toward
        zero, Y moves less in response to X, and at r=0 knowing X
        does not help predict Y at all.

        Remember b (the slope) is the amount y is *predicted* to
        change when x changes by one unit.  When X=0, the predicted
        Y is 9.45 (the intercept); when X=1, the predicted Y
        decreases by 0.85 to 8.60.
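
        The two ways of getting the coefficients in item 3 can be
        checked with a short script.  This is only a sketch: the
        variable names and the predict() helper are made up for
        illustration, and the SDs use the n - 1 divisor (the divisor
        cancels in the ratio Sy/Sx, so the n divisor gives the same
        slope).

```python
from math import sqrt

# Example data from item 3.
xs = [1, 3, 5, 7, 9]
ys = [10, 6, 5, 1, 4]
n = len(xs)

# Way 1: the sums formula [**].
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)
b = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
a = sum_y / n - b * (sum_x / n)

# Way 2: b = r * (Sy / Sx).
mean_x, mean_y = sum_x / n, sum_y / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
r = sxy / sqrt(sxx * syy)
sd_x = sqrt(sxx / (n - 1))  # assumption: n - 1 divisor for the SDs
sd_y = sqrt(syy / (n - 1))
b_from_r = r * sd_y / sd_x


def predict(x):
    """Predicted y on the fitted line (hypothetical helper name)."""
    return a + b * x


print(f"b = {b:.2f}, a = {a:.2f}")
print(f"b from r = {b_from_r:.2f}, r = {r:.3f}")
print(f"predicted y at X=0: {predict(0):.2f}, at X=1: {predict(1):.2f}")
```

        Both routes give the same slope, and the predictions at X=0
        and X=1 match the values 9.45 and 8.60 quoted above.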

B. Correlation and Regression

  1. Interpretation of regression in the context of correlation: a one-SD change in X is ASSOCIATED WITH a change of r SDs in Y
  2. Math fact:
    the slope of the regression line is b = r * (SD of Y)/(SD of X)

    (Recall: the slope b measures the average observed change in Y when there is a unit change in X.)

  3. Also beware of "lurking" variables: variables that can have an important effect on the outcome but have been excluded from the study.

  4. Beware of extrapolation: predicting outside the range of the data. Example: Suppose data were collected on the education and incomes of men aged 25-34, and you try to predict the income of someone with 50 years of education from a regression equation calculated from these data. Is this reasonable?

    No, for a number of reasons, mostly that men aged 25-34 can't have 50 years of education. Even if you had 75-year-olds in the dataset, 50 years is nonsensical.

    (Use the scatter diagram to make sure you stay within the data.)

  5. Beware of outliers: isolated points that distort the fit. (Remark: check the scatter diagram for outliers.) Outliers in the X dimension are frequently INFLUENTIAL as well: removing one can change the fitted line substantially.

  6. Beware of attributing causation to an association found in observational data! Example: people with college degrees (16 years of education) earn an average of $24,600. True or false: If we made high school dropouts go to school for 16 years, they would earn an average of $24,600.
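
    Items 4 and 5 above can be illustrated with the small example from section A. The sketch below is only an illustration under stated assumptions: fit_line() and predict_within_data() are hypothetical helper names, X=50 is an arbitrary value far outside the data, and the extra point (30, 10) is invented to play the role of an outlier in the X dimension.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Example data from section A.
xs = [1, 3, 5, 7, 9]
ys = [10, 6, 5, 1, 4]
a, b = fit_line(xs, ys)


def predict_within_data(x):
    """Predict y, refusing to extrapolate beyond the observed X range."""
    if not min(xs) <= x <= max(xs):
        raise ValueError(f"X={x} is outside the observed range {min(xs)} to {max(xs)}")
    return a + b * x


# Extrapolation: the equation happily produces a number for X=50,
# even though the largest observed X is 9.
far_out = a + b * 50
print(f"naive prediction at X=50: {far_out:.2f}")

# Outlier: refit after adding one invented point far out in X.
a_out, b_out = fit_line(xs + [30], ys + [10])
print(f"slope without the extra point: {b:.2f}")
print(f"slope with the extra point:    {b_out:.2f}")
```

    With these numbers the naive prediction at X=50 is negative, far below anything observed, and the single added point is enough to change the sign of the slope.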

C. Studying Residuals

1. BUT WAIT! You might say, hey! the original data has (1,10) where X=1 and Y=10, how come when I substitute 1 for X in the regression equation (predicted y = 9.45 - 0.85 X), I get 8.60?

The difference (10 - 8.60) = 1.40 is the residual: the observed value minus the predicted value. It's the ERROR.

The predicted response is rarely exactly the same as the observed response.

2. Studying residuals is a diagnostic tool. It is a good thing to do, especially when you have computer software that makes it easy. It can help you determine whether there are problems with using a least squares line to describe the relationship between two variables.
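
As a rough sketch of what such software does, the residuals for the example data can be computed directly. The coefficients 9.45 and -0.85 are taken from the answer in section A; the variable names are made up for illustration.

```python
# Example data and fitted coefficients from section A.
xs = [1, 3, 5, 7, 9]
ys = [10, 6, 5, 1, 4]
a, b = 9.45, -0.85

# Residual = observed y minus predicted y.
predicted = [a + b * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

for x, y, p, e in zip(xs, ys, predicted, residuals):
    print(f"X={x}  observed={y:5.2f}  predicted={p:5.2f}  residual={e:+5.2f}")

print(f"sum of residuals = {sum(residuals):.2f}")
```

The first residual is the 1.40 worked out above. Residuals from a least squares line with an intercept add up to zero (apart from rounding), so it is their pattern against X, not their total, that is worth examining.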

D. R-square

This is the square of the correlation r. It is the fraction of the variation in the values of Y that is explained by the regression of Y on X. Think of it this way: Y has some variability, and if Y and X change together, some of that variation in Y is accounted for by X. R-square tells you how much of the variation in Y is accounted for by the linear relationship with X; it does not by itself show that X causes Y.

So if you have a correlation of .5, you have an r-square of .25: 1/4 of the variation in Y can be accounted for by knowing the value of X.
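
The "fraction of variation explained" reading can be checked on the example data from section A. The sketch below (variable names are illustrative) computes r-square once as the square of r and once as 1 minus the ratio of the leftover (residual) variation to the total variation in Y; for a least squares line the two agree.

```python
from math import sqrt

# Example data from section A.
xs = [1, 3, 5, 7, 9]
ys = [10, 6, 5, 1, 4]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)  # total variation in Y
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least squares line and correlation.
b = sxy / sxx
a = mean_y - b * mean_x
r = sxy / sqrt(sxx * syy)

# Variation in Y left over after fitting the line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared = r ** 2
explained_fraction = 1 - sse / syy

print(f"r = {r:.3f}")
print(f"r squared = {r_squared:.3f}")
print(f"1 - (residual variation)/(total variation) = {explained_fraction:.3f}")
```

For these five points about two thirds of the variation in Y is accounted for by the line.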



Last Update: 14 October 1996 by VXL