r = (1/(n-1)) sum{((x - xbar)/SDx)((y - ybar)/SDy)}   [Book formula]
  = ((sum xy) - (1/n)(sum x)(sum y)) / ((n-1) SDx SDy)   [Calculator]
(Know how to use the calculator formula.)
a. -1 <= r <= 1. A positive r means the variables are positively associated, and a negative r means the variables are negatively associated.
b. If r is close to 1 or -1, the data are close to a line.
c. If r is close to 0, the data are not close to a line.
d. Pictures! (see page 115)
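The book formula and the calculator formula for r should always give the same number. The sketch below checks this in Python on the eight points from the regression example in these notes. The function names (sample_sd, r_book, r_calculator) are made up for this illustration, and SD is assumed to be the sample standard deviation (dividing by n - 1), which is what the (n-1) in the formulas suggests.

```python
import math

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def r_book(xs, ys):
    """Book formula: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sdx, sdy = sample_sd(xs), sample_sd(ys)
    total = sum(((x - xbar) / sdx) * ((y - ybar) / sdy) for x, y in zip(xs, ys))
    return total / (n - 1)

def r_calculator(xs, ys):
    """Calculator formula: needs only sum xy, sum x, sum y and the two SDs."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - sum(xs) * sum(ys) / n) / ((n - 1) * sample_sd(xs) * sample_sd(ys))

# Eight points taken from the regression example in these notes.
xs = [1, 2, 4, 3, 4, 5, 6, 7]
ys = [4, 6, 8, 10, 10, 14, 12, 16]

print(round(r_book(xs, ys), 4))        # 0.9286
print(round(r_calculator(xs, ys), 4))  # 0.9286
```

Here SDx = 2 and SDy = 4, so the calculator formula reduces to 52 / (7 x 2 x 4) = 0.9286, which can be checked by hand.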
a. The correlation r measures how close the data are to a line. A scatterplot can give a general idea of strength and direction, but the correlation puts a number on it.
b. r does NOT work for categorical variables or for non-linear relationships; in those cases it can be misleading.
a. The correlation between X and Y is the same as the correlation between Y and X. In other words, it makes no distinction between explanatory and response variables.
b. Invariant under addition: if some constant "a" is added to ALL of the X or Y values, the correlation is unchanged.
c. Invariant under multiplication: if all of the X or the Y values are multiplied by some positive constant "b", the correlation is unchanged.
d. Nonrobustness: the correlation can change very dramatically if only ONE of the data points is changed.
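Nonrobustness is easy to see numerically. The sketch below takes the eight points from the regression example in these notes and changes a single Y value. The altered point is hypothetical (chosen only to make the effect obvious), and corr is a helper written for this illustration.

```python
import math

def corr(xs, ys):
    """Correlation coefficient r computed from deviations about the means."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Eight points from the regression example in these notes.
xs = [1, 2, 4, 3, 4, 5, 6, 7]
ys = [4, 6, 8, 10, 10, 14, 12, 16]
r_before = corr(xs, ys)

# HYPOTHETICAL change: the last point (7, 16) becomes (7, -16).
ys_changed = ys[:-1] + [-16]
r_after = corr(xs, ys_changed)

print("r with original data:    ", round(r_before, 3))
print("r with one point changed:", round(r_after, 3))
```

With the original data r is strongly positive; after changing just one of the eight points, r becomes negative.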
Answer: r = -0.47.
Answer: r = -0.67.
  Old list        New list
  X     Y         X     Y
 ---   ---       ---   ---
   1     2         4   -12
   1     3         6   -12
   2     6        12   -11
   3     5        10   -10
   5     9        18    -8
   7     8        16    -6
  11     8        16    -2
  13     4         8     0
  13     7        14     0

Since the new list is just a transformation of the old list (i.e., the "new" X = 2Y, and the "new" Y = X - 13), the correlation is the same as in the previous list: r = 0.415. Adding a constant to one of the lists (either X or Y), or multiplying it by a positive constant, does not change the correlation; neither does swapping X and Y.
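The claim about the transformed list can be checked directly. The sketch below rebuilds the new list from the old one (new X = 2Y, new Y = X - 13) and compares the two correlations; corr is a helper written for this illustration. The final line is an extra check, not part of the original exercise: it shows why the multiplying constant must be positive.

```python
import math

def corr(xs, ys):
    """Correlation coefficient r computed from deviations about the means."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

old_x = [1, 1, 2, 3, 5, 7, 11, 13, 13]
old_y = [2, 3, 6, 5, 9, 8, 8, 4, 7]

# The transformation described in the text.
new_x = [2 * y for y in old_y]   # new X = 2Y
new_y = [x - 13 for x in old_x]  # new Y = X - 13

r_old = corr(old_x, old_y)
r_new = corr(new_x, new_y)
print(round(r_old, 3), round(r_new, 3))  # 0.415 0.415

# Multiplying one list by a NEGATIVE constant flips the sign of r.
r_flipped = corr([-x for x in old_x], old_y)
print(round(r_flipped, 3))  # -0.415
```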
1. Idea: given a set of data, we might try to find the line that best summarizes the relationship between X and Y. This line will tell us how much Y changes with a change in X. Note that regression requires us to have explanatory and response variables.
2. Math fact: straight lines can be represented in the form
y = bx + a
where b is the slope and a is the intercept. The slope tells how much y tends to be different when x changes by one unit; the intercept tells what we expect to get for y when x=0. The method of drawing a line through a scattering of points tries to make a line that is as close as possible to the points in the VERTICAL direction. These differences between the line and the points are called residuals. Note that they refer to the Y variable.
3. Definition: a RESIDUAL is the difference between the actual data and our prediction; the residual is defined as
residual = actual value - predicted value
4. Method: minimize the sum of squared residuals! (Hence the name "least squares regression")
1. Formula: the coefficients for the least squares line (using X to predict Y) are

        (sum of xy) - (1/n)(sum x)(sum y)
   b = -----------------------------------   [**]
         (sum of x^2) - (1/n)(sum x)^2

   a = average of y's - (slope) x (average of x's)

   * Note: you have to solve for the slope first!

2. Example. Find the regression coefficients for the following data:

   (1,4), (2,6), (4,8), (3,10), (4,10), (5,14), (6,12), (7,16)

   sum x = 32, mean x = 4; sum y = 80, mean y = 10; sum x^2 = 156; sum xy = 372

   Then,
   slope = (372 - (1/8)(32)(80)) / (156 - (1/8)(32^2)) = 1.8571
   and
   intercept = 10 - 1.8571 * 4 = 2.5714

3. Example. For 637 California men age 25--29 in 1988, the regression equation for predicting income from number of years of education was:

   predicted income = ($1,400 per year) x education + $2,200

   a. Predict the income of someone with 10 years of education.
      predicted income = 1,400 x 10 + 2,200 = $16,200.
   b. Predict the income of someone with 0 years of education.
      predicted income = $2,200.

Do a regression on the internet. See if you get the same results.
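As a cross-check on the hand calculation, formula [**] can be written out in a few lines of Python. This is only a sketch; the function name least_squares is made up for this illustration.

```python
def least_squares(xs, ys):
    """Return (slope b, intercept a) for the least squares line y = bx + a."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Formula [**]: solve for the slope first...
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # ...then the intercept: mean of y's minus slope times mean of x's.
    a = sum_y / n - b * (sum_x / n)
    return b, a

# Data from Example 2.
xs = [1, 2, 4, 3, 4, 5, 6, 7]
ys = [4, 6, 8, 10, 10, 14, 12, 16]

b, a = least_squares(xs, ys)
print(round(b, 4), round(a, 4))  # 1.8571 2.5714
```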
1. Idea: given a set of data, find the line that best summarizes the relationship between X and Y.

2. Formula: the coefficients for the least squares line are

        (sum of xy) - (1/n)(sum x)(sum y)
   b = -----------------------------------   [**]
         (sum of x^2) - (1/n)(sum x)^2

   a = average of y's - (slope) x (average of x's)

3. Example. For the data below, find the regression coefficients:

   (1,10), (3,6), (5,5), (7,1), (9,4)

   Answer: predicted y = 9.45 - 0.85 x

   Remember, b (the slope) is the amount y is predicted to change when x changes by one unit. When x = 0, the predicted y is 9.45; when x = 1, the predicted y changes by -0.85 (a decrease) to 8.60.

   BUT WAIT! You might say: the original data has (1,10), where x = 1 and y = 10, so how come I get 8.60 when I substitute 1 for x in this formula? The difference (10 - 8.60) = 1.40 is the residual. It's the ERROR.
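The residuals for this example can be listed point by point. The sketch below uses the fitted line from the answer (slope -0.85, intercept 9.45) and also compares its sum of squared residuals with two nearby lines. The two comparison lines are arbitrary choices made for this illustration, not part of the exercise.

```python
xs = [1, 3, 5, 7, 9]
ys = [10, 6, 5, 1, 4]

b, a = -0.85, 9.45  # least squares slope and intercept from the answer

def predict(x, slope=b, intercept=a):
    """Predicted y on the line y = slope * x + intercept."""
    return slope * x + intercept

def sse(slope, intercept):
    """Sum of squared residuals (actual - predicted) for a given line."""
    return sum((y - predict(x, slope, intercept)) ** 2 for x, y in zip(xs, ys))

# residual = actual value - predicted value
residuals = [y - predict(x) for x, y in zip(xs, ys)]
for x, y, res in zip(xs, ys, residuals):
    print(f"x={x}  actual={y}  predicted={predict(x):.2f}  residual={res:+.2f}")

print("SSE, least squares line: ", round(sse(b, a), 2))
print("SSE, slope nudged by 0.1:", round(sse(b + 0.1, a), 2))
print("SSE, intercept raised 0.5:", round(sse(b, a + 0.5), 2))
```

The first row reproduces the residual of 1.40 at the point (1,10). The residuals add up to zero (apart from floating-point rounding), and either nudged line gives a larger sum of squared residuals than the least squares line.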
1. Example. For 637 California men age 25--29 in 1988, the regression equation for predicting income from number of years of education was

   predicted income = ($1,400 per year) x education + $2,200

   Predict the income of someone with 10 years of education; is this reasonable? ($16,200; yes)

2. Beware of extrapolation: predicting outside the range of the data.
   Predict the income of someone with 50 years of education; is this reasonable? ($72,200; no: the sample is men age 25--29, and none of them can possibly have 50 years of education.)
   (Use the scatter diagram to make sure you stay within the data.)

3. Beware of outliers: isolated points that distort the fit.
   (Remark: check the scatter diagram for outliers.)

4. *** Beware of attributing causation to observational data! ***
   Example: people with college degrees (16 years of education) earn an average of $24,600. True or false: if we made high school dropouts go to school for 16 years, they would earn an average of $24,600.
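One way to guard against extrapolation is to check every prediction against the range of X values actually observed. The sketch below does this for the income equation. The notes do not give the observed range of education, so OBSERVED_MIN and OBSERVED_MAX are HYPOTHETICAL placeholders; the real limits would have to be read off the scatter diagram. The function names are also made up for this illustration.

```python
def predicted_income(education_years):
    """Regression equation from the notes: $1,400 per year of education + $2,200."""
    return 1400 * education_years + 2200

def predict_with_range_check(education_years, observed_min, observed_max):
    """Return (prediction, within_data); within_data is False when extrapolating."""
    within_data = observed_min <= education_years <= observed_max
    return predicted_income(education_years), within_data

# HYPOTHETICAL range of education in the sample (not given in the notes).
OBSERVED_MIN, OBSERVED_MAX = 0, 20

for years in (10, 16, 50):
    income, ok = predict_with_range_check(years, OBSERVED_MIN, OBSERVED_MAX)
    note = "within the data" if ok else "EXTRAPOLATION, do not trust"
    print(f"{years} years: ${income:,} ({note})")
```

The equation happily returns $72,200 for 50 years of education; only the range check flags that the number is meaningless.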
(Recall: the slope b measures the average observed change in Y when there is a unit change in X.)
Last Update: 9 October 1996 by VXL