1. Idea: given a set of data, find the line that best summarizes the relationship between X and Y. The method of least squares is used because it is simple: the line can be found from just the means and standard deviations of the two variables and their correlation.

2. Formula: the coefficients for the least squares line, predicted y = a + bX, are

        (sum of xy) - (1/n)(sum x)(sum y)
   b = -----------------------------------   [**]
        (sum of x^2) - (1/n)(sum x)^2

   a = (average of y's) - b * (average of x's)

3. Example. For the data below, find the regression coefficients:

   (1,10), (3,6), (5,5), (7,1), (9,4).

   Answer: predicted y = 9.45 - 0.85 X
   (you can do this two ways, either with the formula [**] above or by knowing that b = r*(Sy/Sx))

Note: What the correlation-slope formula suggests is that a change of one standard deviation in X corresponds to a predicted change of r standard deviations in Y. So when r=1 or -1, a change of one standard deviation in X corresponds to a change of one standard deviation in Y. As r moves away from +1 or -1 toward zero, Y moves less in response to X, and at r=0 the value of X doesn't matter at all: the line is flat.

Remember b (the slope) is the amount y is *predicted* to change when x changes by one unit. When X=0, the predicted Y is 9.45; when X=1, the predicted Y decreases by .85, to 8.60.
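To check the example by hand or by machine, here is a minimal Python sketch of both routes to the slope: formula [**] and b = r*(Sy/Sx). The function names `least_squares` and `slope_from_correlation` are made up for this illustration, and Sx and Sy are taken to be sample standard deviations (dividing by n-1); that choice is an assumption, though the ratio Sy/Sx comes out the same if you divide by n instead.

```python
from math import sqrt

# Example data from the notes
points = [(1, 10), (3, 6), (5, 5), (7, 1), (9, 4)]

def least_squares(points):
    """Return (a, b) for the line predicted y = a + b*x, using formula [**]."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n  # average of y's minus slope times average of x's
    return a, b

def slope_from_correlation(points):
    """Return the slope computed as b = r * (Sy / Sx)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    r = sxy / sqrt(sxx * syy)      # correlation
    s_x = sqrt(sxx / (n - 1))      # standard deviation of x (assumed n-1 divisor)
    s_y = sqrt(syy / (n - 1))      # standard deviation of y (assumed n-1 divisor)
    return r * s_y / s_x

a, b = least_squares(points)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print(f"slope via r*(Sy/Sx) = {slope_from_correlation(points):.2f}")
```

Both routes should agree with the answer given in the example, a = 9.45 and b = -0.85.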
(Recall: the slope b measures the average observed change in Y when there is a unit change in X.)
No, for a number of reasons, mostly that men aged 25-34 can't have 50 years of education. Even if you had 75-year-olds in the dataset, 50 years of education is nonsensical.
(Use the scatter diagram to make sure you stay within the data.)
1. BUT WAIT! You might say, hey! The original data has (1,10), where X=1 and Y=10, so how come when I substitute 1 for X in this formula, I get 8.60? The difference (10 - 8.60) = 1.40 is the residual: the observed Y minus the predicted Y. It's the ERROR.
The predicted response is rarely ever the same as the observed response.
2. Studying residuals is a diagnostic tool. In short, it is a good thing to do, especially when you have computer software that makes it easy. It can help you determine whether there are problems with using a least squares line to describe the relationship between two variables.
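As a small illustration of what "studying residuals" involves, the sketch below lists the residual (observed minus predicted) for each point of the example data, using the coefficients 9.45 and -0.85 from the worked example. The helper name `residuals` is invented for this sketch. One standard fact worth checking: for a least squares line with an intercept, the residuals add up to zero, apart from rounding.

```python
# Example data and the least squares coefficients from the worked example
points = [(1, 10), (3, 6), (5, 5), (7, 1), (9, 4)]
a, b = 9.45, -0.85

def residuals(points, a, b):
    """Return observed y minus predicted y for each data point."""
    return [y - (a + b * x) for x, y in points]

res = residuals(points, a, b)
for (x, y), e in zip(points, res):
    print(f"x={x}  observed={y}  predicted={a + b * x:.2f}  residual={e:+.2f}")

# For a least squares line the residuals should cancel out
print("residuals sum to zero (within rounding):", abs(sum(res)) < 1e-9)
```

The first row reproduces the residual of 1.40 at X=1. In practice you would plot the residuals against X and look for curvature or a fan shape, either of which suggests a straight line is not a good description.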
R-square is the square of the correlation r. It is defined as the fraction of the variation in the values of Y that is explained by the regression of Y on X. In a sense, you can think of it this way: Y has some variability, and if Y and X change together, some of that variation in Y is accounted for by X. All r-square does is give you an idea of how much of the variation in Y can be accounted for by X (which is not the same as saying X causes it). So if you have a correlation of .5, you have an r-square of .25, or about 1/4 of the variation in Y can be accounted for by knowing the value of X.
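To make the "fraction of variation explained" idea concrete, here is a rough Python sketch that computes it two ways for the example data (1,10), (3,6), (5,5), (7,1), (9,4): once as 1 minus (leftover variation after fitting the line) / (total variation in Y), and once by squaring the correlation. The variable names are illustrative only, and the coefficients 9.45 and -0.85 are the ones from the worked example.

```python
# Example data and the least squares coefficients from the worked example
points = [(1, 10), (3, 6), (5, 5), (7, 1), (9, 4)]
a, b = 9.45, -0.85

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Total variation in Y around its mean
total_ss = sum((y - mean_y) ** 2 for _, y in points)
# Variation left over after fitting the line (sum of squared residuals)
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in points)
r_square_explained = 1 - residual_ss / total_ss

# Square of the correlation r, computed directly
sxx = sum((x - mean_x) ** 2 for x, _ in points)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
r_square_from_r = sxy ** 2 / (sxx * total_ss)

print(f"fraction of variation explained = {r_square_explained:.3f}")
print(f"square of the correlation       = {r_square_from_r:.3f}")
```

The two numbers should match, at roughly 0.675: about two thirds of the variation in Y in this small dataset is accounted for by the line.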
Last Update: 14 October 1996 by VXL