Idea: given a set of data, regression can summarize the relationship between two variables X and Y. Here, there is a strong sense that one of the variables (Y) depends on the other (X).
A line which summarizes the relationship can be found from just knowing the means and standard deviations of the two variables and their correlation.
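As a rough sketch of how those five numbers pin down a line: the usual formulas are slope = r * SD of Y / SD of X and intercept = mean of Y - slope * mean of X. These notes leave the equation for Chapter 12, so treat the formulas here as a preview. The function name regression_line and the small data set are made up for illustration.

```python
import statistics

def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) of the regression line of Y on X."""
    slope = r * sd_y / sd_x                # Y units gained per one unit of X
    intercept = mean_y - slope * mean_x    # forces the line through (mean_x, mean_y)
    return slope, intercept

# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# The five summary numbers: two means, two SDs, and the correlation.
mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) * sd_x * sd_y)

slope, intercept = regression_line(mean_x, sd_x, mean_y, sd_y, r)
print(f"predicted y = {intercept:.2f} + {slope:.2f} * x")
```

Once the five summary numbers are computed, the individual data points are no longer needed to write down the line.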
This line is called a "regression" line. It's called regression because the man who developed this statistical method was working with fathers' heights and the heights of their sons and noticed something he called "regression to the mean (or average)". Basically, he noticed that tall fathers tended to have sons who were closer to average in height than they were, and short fathers also tended to have sons who were closer to average in height.
And that is what the regression line does -- it passes through the average of the Y-variable (the vertical axis) for each given value of the X-variable (the horizontal axis).
Handout.
The length and weight graph has a regression line drawn through it (for now, don't worry about how to draw the line or its equation; that's in Chapter 12). At each value of length, the line passes through roughly the average weight for that particular value of length.
You could think of this line as the "best fitting" line through all of the points on the graph. Note that the line doesn't pass exactly through the points -- in fact, it misses most of them -- but it is placed so that the vertical distances of the points from the line are, overall, as small as possible (strictly speaking, the sum of their squares is minimized).
From the previous lecture: one of the properties of the correlation coefficient (r) is that it can be used to give us a rough idea of the relationship between two variables.
Values of r near +1 or -1 mean that the two variables are very closely associated. Values of r near 0 suggest that there is almost no linear relationship between the two.
If you look at the length vs. weight graph again, r = .9460, which suggests a very strong positive relationship between the two.
A positive value means the relationship is positive -- that is, as one variable increases so does the other. A negative value means that the relationship is negative -- as one variable increases, the other decreases.
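To make the sign of r concrete, here is a small sketch that computes r as the average product of the two variables in standard units. That is one common way to define it, and it is assumed here rather than taken from these notes. The function name correlation and the numbers are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r: the average product of standard units."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
               for x, y in zip(xs, ys)) / n

# Made-up numbers, for illustration only.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]     # tends to go up as x goes up
falling = [6, 4, 5, 4, 2]    # the same values reversed, so it tends to go down

print(round(correlation(xs, rising), 3))    # a positive r
print(round(correlation(xs, falling), 3))   # a negative r of the same size
```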
In Chapter 10, there is another use for the correlation coefficient. For any two variables it can give you an idea of the rate of change between them.
That is, for every one standard deviation increase in the X variable, we can expect an increase of r * (standard deviation of Y) in the Y variable, on average.
Back to the length and weight graph: the standard deviation of length (the X variable) is 22.266 inches. Applying the previous idea, for a 22.266 inch increase in length we can expect about a 735.225 pound increase in weight (the Y variable), on average. That is .9460 * 777.1936 = 735.225, where 777.1936 is the SD of weight.
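The same arithmetic as a few lines of Python, in case it helps to check it. The variable names are just labels chosen here; the three numbers come from the length and weight example.

```python
r = 0.9460             # correlation between length and weight
sd_length = 22.266     # SD of length, in inches
sd_weight = 777.1936   # SD of weight, in pounds

# One SD more length goes with r SDs more weight, on average.
expected_weight_change = r * sd_weight

print(f"{sd_length} inches longer -> about {expected_weight_change:.3f} pounds heavier")
```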
r is a measure of association and it gives you a sense of the direction of the relationship and how much change is expected in the y variable given a change in the x variable.
Suppose you work for an automobile manufacturer and its cars are going to be lengthened by 2 inches next year. How much will the weight increase?
The length's standard deviation is 22.266 inches. An increase of 2 inches is about .0898 SD (that is, 2/22.266 = .08982). To predict the change in weight for a .08982 SD change in length, multiply the correlation coefficient, .9460, by .08982, which gives .08497 -- this is the corresponding fraction of an SD for weight. Then convert this fraction of an SD to pounds: .08497 * 777.1936 = 66.04 pounds. So in this case, weight is expected to increase by about 66.04 pounds for a 2 inch increase in length.
In other words, should the manufacturer increase the length of the car by 2 inches, we'd expect the weight to increase by about 66.04 pounds, on average, in response.
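The three steps (convert the change in X to SD units, multiply by r, convert back to Y units) can be wrapped in one small function. This is only a sketch; the name expected_change_in_y is made up here, and the inputs are the numbers from the car example.

```python
def expected_change_in_y(change_in_x, sd_x, sd_y, r):
    """Average change in Y that goes with a given change in X."""
    change_in_sd_units = change_in_x / sd_x    # step 1: X change as a fraction of an SD
    y_change_in_sd_units = r * change_in_sd_units   # step 2: multiply by r
    return y_change_in_sd_units * sd_y         # step 3: convert back to Y units

extra_pounds = expected_change_in_y(change_in_x=2, sd_x=22.266, sd_y=777.1936, r=0.9460)
print(round(extra_pounds, 2))   # 66.04
```

Doing the division inside the function avoids rounding the .08982 partway through, which is why the answer matches 66.04 rather than a slightly smaller figure.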
In combination with a basic understanding of regression, the correlation coefficient can give us a sense of the amount of change which occurs between two variables.
It can also be used to predict values of Y given a value of X.
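Here is a rough sketch of that kind of prediction, working in standard units. The SDs and r are the ones from the car example, but these notes do not give the average length or average weight, so the two means below are HYPOTHETICAL placeholders, and the function name predict_y is made up as well.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predict Y for a given X by working in standard units."""
    z_x = (x - mean_x) / sd_x    # how many SDs x sits above (or below) its mean
    z_y = r * z_x                # the prediction is only r times as many SDs from its mean
    return mean_y + z_y * sd_y   # convert back to Y units

# ASSUMED values, not from the notes: placeholder means for length and weight.
mean_length = 190.0    # inches (hypothetical)
mean_weight = 3000.0   # pounds (hypothetical)

predicted = predict_y(x=192.0, mean_x=mean_length, sd_x=22.266,
                      mean_y=mean_weight, sd_y=777.1936, r=0.9460)
print(round(predicted, 2))
```

A car of exactly average length is predicted to have exactly average weight, and one that is 2 inches longer than average is predicted to weigh about 66 pounds more than average, which matches the earlier calculation.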