Statistics 50 Lecture 5

CORRELATION AGAIN

A. The correlation coefficient

Idea: the CORRELATION COEFFICIENT, denoted r, measures how close the data are to a straight line. It gives you an idea of the direction and strength of the relationship between two quantitative variables.

Formula

r = (1/(n-1)) sum{((x - xbar)/SDx)((y - ybar)/SDy)}	[Book formula]
  = ((sum xy) - (1/n)(sum x)(sum y))/((n-1) SDx SDy)	[Calculator]
(know how to use calculator formulas)
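The two formulas above can be checked against each other with a short sketch. This is only an illustration: the function names are made up for this note, the SD is assumed to be the sample SD (dividing by n-1, to match the n-1 in the formulas), and the data are the five points from the first example in Section C.

```python
import math

def sd(values):
    """Sample standard deviation (assumed here to divide by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def r_book(xs, ys):
    """Book formula: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sdx, sdy = sd(xs), sd(ys)
    total = sum(((x - xbar) / sdx) * ((y - ybar) / sdy) for x, y in zip(xs, ys))
    return total / (n - 1)

def r_calculator(xs, ys):
    """Calculator formula: needs only the sums, n, and the two SDs."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - sum(xs) * sum(ys) / n) / ((n - 1) * sd(xs) * sd(ys))

# Five points from the first example in Section C.
xs = [2, 3, 5, 8, 13]
ys = [7, 3, 1, 4, 2]
print(round(r_book(xs, ys), 2), round(r_calculator(xs, ys), 2))  # -0.47 -0.47
```

Both versions give the same r; the calculator form is simply easier to evaluate by hand because it avoids standardizing each point.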

An Internet Correlation Calculator

B. Properties of r

Basic facts
a. -1 <= r <= 1. A positive r means the variables are positively associated and a negative r means the variables are negatively associated.
b. If r is close to 1 or -1, the data are close to a line
c. If r is close to 0, the data are not close to a line
d. Pictures! (see page 115)
Interpreting correlation as a measure of linearity
a. The correlation r measures how close the data are to a line. A scatterplot can give a general idea of strength and direction, but the correlation gives a more precise measure.
b. r does NOT work for categorical variables, and for non-linear relationships it can be misleading.
Advanced mathematical facts
a. The correlation between X and Y is the same as the correlation between Y and X. In other words it makes no distinction between explanatory and response variables.
b. Invariant under addition: if some constant "a" is added to ALL of the X or Y values, the correlation is unchanged.
c. Invariant under multiplication: if all of the X or the Y values are multiplied by some positive constant "b", the correlation is unchanged.
d. Nonrobustness: the correlation can change very dramatically if only ONE of the data points is changed.
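Facts a, b, and c can be checked numerically. The sketch below is one way to do it, under these assumptions: pearson_r is a helper written for this note, the constants 100, 3, and -3 are arbitrary choices, and the data are the five points from Section C. The last line also shows what happens with a negative multiplier, which is why fact c asks for a positive constant.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [2, 3, 5, 8, 13]
ys = [7, 3, 1, 4, 2]
base = pearson_r(xs, ys)

swapped = pearson_r(ys, xs)                      # fact a: X and Y exchanged
shifted = pearson_r([x + 100 for x in xs], ys)   # fact b: add a constant to X
scaled = pearson_r(xs, [3 * y for y in ys])      # fact c: multiply Y by a positive constant
flipped = pearson_r(xs, [-3 * y for y in ys])    # negative constant: sign of r reverses

print(math.isclose(base, swapped), math.isclose(base, shifted), math.isclose(base, scaled))
print(math.isclose(flipped, -base))
```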

C. Examples

Given the five points {(2,7), (3,3), (5,1), (8,4), (13,2)}, find r.
Answer: r = -0.47.
A minor change in the data can change r by a lot. Given {(2,7), (3,3), (5,2), (8,4), (13,1)}, find r.
Answer: r = -0.67.
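Both answers can be reproduced with a few lines of code. This is a sketch only; pearson_r is a helper name invented for this note. Only two y values differ between the two lists: (5,1) becomes (5,2) and (13,2) becomes (13,1).

```python
import math

def pearson_r(points):
    """Correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    syy = sum((y - ybar) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

original = [(2, 7), (3, 3), (5, 1), (8, 4), (13, 2)]
modified = [(2, 7), (3, 3), (5, 2), (8, 4), (13, 1)]

r_original = pearson_r(original)
r_modified = pearson_r(modified)
print(round(r_original, 2), round(r_modified, 2))  # -0.47 -0.67
```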

The dataset on the left has a correlation of r = 0.415. Find the correlation for the dataset on the right.

	 X	 Y		 X	 Y
	---	---		---	---
	 1	 2		 4	-12
	 1	 3		 6	-12
	 2	 6		12	-11
	 3	 5		10	-10
	 5	 9		18	 -8
	 7	 8		16	 -6
	11	 8		16	 -2
	13	 4		 8	  0
	13	 7		14	  0

Since the new list is just a transformation of the old list (i.e.,
the "new" X = 2Y, and the "new" Y = X-13), the correlation is
the same as in the previous list:  r=0.415.

Adding a constant to one of the lists (X or Y, or both), or
multiplying it by a positive constant, does not change the
correlation. Swapping the roles of X and Y does not change
it either.
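The claim about the two tables can be verified directly. In this sketch (helper and variable names are made up for this note), the first two checks confirm that the right-hand table really is "new X = 2Y" and "new Y = X-13", and the last line compares the two correlations.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Left-hand and right-hand tables, copied from above.
left_x = [1, 1, 2, 3, 5, 7, 11, 13, 13]
left_y = [2, 3, 6, 5, 9, 8, 8, 4, 7]
right_x = [4, 6, 12, 10, 18, 16, 16, 8, 14]
right_y = [-12, -12, -11, -10, -8, -6, -2, 0, 0]

# The right-hand table is a transformation of the left-hand one.
print(right_x == [2 * y for y in left_y])    # True
print(right_y == [x - 13 for x in left_x])   # True

r_left = pearson_r(left_x, left_y)
r_right = pearson_r(right_x, right_y)
print(round(r_left, 3), round(r_right, 3))   # 0.415 0.415
```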

D. Regression (intuitive)

1. Idea: given a set of data, we might try to find the line that best summarizes the relationship between X and Y. This line will tell us how much Y changes with a change in X. Note that regression requires us to have explanatory and response variables.
2. Math fact: straight lines can be represented in the form
y = bx + a
where b is the slope and a is the intercept. The slope tells how much y tends to be different when x changes by one unit; the intercept tells what we expect to get for y when x=0. The method of drawing a line through a scattering of points tries to make a line that is as close as possible to the points in the VERTICAL direction. These differences between the line and the points are called residuals. Note that they refer to the Y variable.
3. Definition: a RESIDUAL is the difference between the actual data and our prediction; the residual is defined as
residual = actual value - predicted value
4. Method: minimize the sum of squared residuals! (Hence the name "least squares regression")
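To see what "minimize the sum of squared residuals" means in practice, here is a small sketch. It borrows the eight data points from the worked example in Section E and compares the least squares line with two other lines. The two alternative lines (slope raised by 0.5, intercept raised by 1) are arbitrary choices made up for this illustration.

```python
# Data from Example 2 in Section E of these notes.
points = [(1, 4), (2, 6), (4, 8), (3, 10), (4, 10), (5, 14), (6, 12), (7, 16)]

def sum_squared_residuals(a, b):
    """Sum of (actual y - predicted y)^2 for the line y = b*x + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in points)

# Least squares coefficients for this data, as worked out in Section E.
b_best = 52 / 28
a_best = 10 - b_best * 4

best = sum_squared_residuals(a_best, b_best)
steeper = sum_squared_residuals(a_best, b_best + 0.5)   # arbitrary alternative line
higher = sum_squared_residuals(a_best + 1.0, b_best)    # arbitrary alternative line

print(best < steeper and best < higher)  # True
```

Any other choice of slope and intercept gives a larger sum than the least squares line does.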

E. Regression (mathematical)

    1.  Formula:  the coefficients for the least squares line (using X
        to predict Y) are

		     (sum of xy) - (1/n)(sum x)(sum y)
		b = ----------------------------------- 		    [**]
                     (sum of x^2) - (1/n)(sum x)^2

                a = average of y's - (slope) x (average of x's)

		* Note:  you have to solve for the slope first!


    2.  Example.  Find the regression coefficients for the following data:
	          (1,4), (2,6), (4,8), (3,10), (4,10), (5,14), (6,12), (7,16)

		          sum x = 32, mean x = 4;

		          sum y = 80, mean y = 10;

		          sum x^2 = 156

			  sum xy = 372


	          Then, slope = ((372 - 1/8 (32)(80)) / (156 - 1/8 (32^2))) 
  			      = 1.8571

		  and intercept = 10 - 1.8571 * 4 = 2.5714 


    3.  Example.  For 637 California men age 25--29 in 1988, the regression
            equation for predicting income from number of years of education 
            was:

		predicted income = ($1,400 per year) x education + $2,200


            a.  Predict the income of someone with 10 years of education.

			predicted income = 1,400 x 10 + 2,200 = $16,200.


            b.  Predict the income of someone with 0 years of education.

			predicted income = $2,200.
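The hand calculation in Example 2 can be reproduced with formula [**]. The sketch below assumes nothing beyond that formula; the function name least_squares is made up for this note.

```python
def least_squares(points):
    """Return (intercept a, slope b) using formula [**] from the notes."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    # Solve for the slope first, then the intercept.
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * (sum_x / n)
    return a, b

# Data from Example 2.
points = [(1, 4), (2, 6), (4, 8), (3, 10), (4, 10), (5, 14), (6, 12), (7, 16)]
a, b = least_squares(points)
print(round(b, 4), round(a, 4))  # 1.8571 2.5714
```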

Do a regression on the internet. See if you get the same results.

F. Recap of Regression

    1.  Idea:  given a set of data, find the line that best summarizes the 
        relationship between X and Y.


    2.  Formula:  the coefficients for the least squares line are

		     (sum of xy) - (1/n)(sum x)(sum y)
		b = ----------------------------------- 		    [**]
                     (sum of x^2) - (1/n)(sum x)^2

                a = average of y's - (slope) x (average of x's)


    3.  Example.  For the data below, find the regression coefficients:
	  	 	 (1,10), (3,6), (5,5), (7,1), (9,4).

  	Answer:	 predicted y = 9.45 - 0.85 X

        Remember b (the slope) is the amount y is predicted to
        change when x changes by one unit.  When X=0, the predicted
        Y is 9.45; when X=1, the predicted Y changes by -0.85
        (a decrease of 0.85) to 8.60.

        BUT WAIT!  You might say, hey! the original data has
        (1,10) where X=1 and Y=10, how come when I substitute
        1 for X in this formula, I get 8.60?

        The difference (10-8.60)= 1.40 is the residual.  It's
        the ERROR.
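The residuals for all five points in this example can be listed the same way. This is a sketch with made-up variable names: it recomputes the coefficients from formula [**] and then takes actual minus predicted for each point.

```python
# Data from Example 3 above.
points = [(1, 10), (3, 6), (5, 5), (7, 1), (9, 4)]

n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_xy = sum(x * y for x, y in points)
sum_x2 = sum(x * x for x, _ in points)

# Formula [**]: slope first, then intercept.
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * (sum_x / n)

# residual = actual value - predicted value
residuals = [y - (a + b * x) for x, y in points]

print(round(a, 2), round(b, 2))            # 9.45 -0.85
print([round(e, 2) for e in residuals])    # [1.4, -0.9, -0.2, -2.5, 2.2]
```

The residuals of a least squares line always add up to zero (apart from rounding), so the positive and negative errors balance.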

G. Interpretation

    1.  Example.  For 637 California men age 25--29 in 1988, the regression 
        equation for predicting income from number of years of education was

  	   predicted income = ($1,400 per year) x education + $2,200

        Predict the income of someone with 10 years of education;
        is this reasonable?  ($16,200; yes)


    2.  Beware of extrapolation:  predicting outside of the data

        Predict the income of someone with 50 years of education;
        is this reasonable?  ($72,200; no: the sample is men age
        25--29, and none of them can possibly have 50 years of
        education.)

        (Use the scatter diagram to make sure you stay within the data.)


    3.  Beware of outliers:  isolated points that distort the fit
        (Remark:  check the scatter diagram for outliers.)


    4.  *** Beware of attributing causation to observational data! ***

       Example:  people with college degrees (16 years of education)
       earn an average of $24,600.  True or false:  If we
       made high school dropouts go to school for 16 years, they would
       earn an average of $24,600.
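The extrapolation warning in item 2 can be built into a prediction function. The sketch below is only an illustration. The notes do not give the actual range of education in the data, so the guard uses a loose assumption: years of education cannot exceed 29, the oldest age in the sample. The names predicted_income and MAX_AGE are made up for this note.

```python
SLOPE = 1400       # dollars per year of education (from the regression equation)
INTERCEPT = 2200   # dollars
MAX_AGE = 29       # assumption: nobody in the sample is older than 29

def predicted_income(education_years):
    """Predict income, refusing values that are clearly outside the data."""
    if not 0 <= education_years <= MAX_AGE:
        raise ValueError("extrapolation: outside the range of the data")
    return SLOPE * education_years + INTERCEPT

print(predicted_income(10))   # 16200
print(predicted_income(16))   # 24600

try:
    predicted_income(50)
except ValueError as err:
    print(err)                # extrapolation: outside the range of the data
```

A better guard would use the smallest and largest education values actually seen in the scatter diagram.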

H. Correlation and Regression

Interpretation: a one SD change in X is ASSOCIATED WITH a change of r SDs in Y (note: less than 1 SD of Y unless r = 1 or -1!)
Math fact:
the slope of the regression line is b = r * (SD of Y)/(SD of X)
(Recall: the slope b measures the average observed change in Y when there is a unit change in X.)
Again, warning: r and the regression line measure ASSOCIATION, not causation!
Also beware of "lurking" variables...
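The math fact b = r * (SD of Y)/(SD of X) can be checked on the data from Example 2 in Section E. This is a sketch with made-up helper names, and it assumes the SDs are sample SDs (dividing by n-1).

```python
import math

# Data from Example 2 in Section E.
points = [(1, 4), (2, 6), (4, 8), (3, 10), (4, 10), (5, 14), (6, 12), (7, 16)]
xs = [x for x, _ in points]
ys = [y for _, y in points]
n = len(points)

def sd(values):
    """Sample standard deviation (assumed to divide by n - 1)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

# Correlation from the calculator formula.
sum_xy = sum(x * y for x, y in points)
r = (sum_xy - sum(xs) * sum(ys) / n) / ((n - 1) * sd(xs) * sd(ys))

# Slope two ways: from r and the SDs, and from formula [**].
b_from_r = r * sd(ys) / sd(xs)
b_direct = (sum_xy - sum(xs) * sum(ys) / n) / (sum(x * x for x in xs) - sum(xs) ** 2 / n)

print(round(r, 4))                                # 0.9286
print(round(b_from_r, 4), round(b_direct, 4))     # 1.8571 1.8571
```

Here SD of X is 2 and SD of Y is 4, so a one SD increase in X (2 units) goes with a predicted increase of r * 4, or about 3.71, in Y, which is less than one SD of Y.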

Last Update: 9 October 1996 by VXL