The following data relate the number of violent crimes (defined as murder, rape, robbery, and assault) at a sampling of college campuses in 1991 and 1994. The data are listed in terms of the number of crimes per 1,000 students.
SCHOOL | 1991 | 1994 |
---|---|---|
Berkeley | 2.88 | 0.83 |
UCLA | 1.21 | 1.43 |
Davis | 0.32 | 0.58 |
Colorado State | 0.48 | 0.94 |
University of Florida | 0.83 | 0.94 |
University of Minnesota | 0.26 | 0.39 |
University of Iowa | 0.19 | 0.39 |
Penn State | 0.24 | 0.37 |
Northeastern | 0.41 | 1.54 |
Boston College | 0.54 | 3.15 |
1. Compute the sample correlation coefficient r for the data.
1991 Average is .7360 and the SD is .7736
1994 Average is 1.056 and the SD is .8013
The average of the products of each pair in 1991 and 1994 is
.81344
The correlation coefficient is:
$$r = \frac{.81344 - (.7360 \times 1.056)}{.7736 \times .8013} = .0584$$
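The same calculation can be sketched in a few lines of Python. The function and variable names here are illustrative, not from the handout, and the SD divides by n (not n - 1), since that is what reproduces the .7736 and .8013 quoted above.

```python
from math import sqrt

# Violent crimes per 1,000 students, in the same school order as the table.
crimes_1991 = [2.88, 1.21, 0.32, 0.48, 0.83, 0.26, 0.19, 0.24, 0.41, 0.54]
crimes_1994 = [0.83, 1.43, 0.58, 0.94, 0.94, 0.39, 0.39, 0.37, 1.54, 3.15]

def mean(values):
    return sum(values) / len(values)

def sd(values):
    # Root-mean-square deviation from the average (divides by n).
    m = mean(values)
    return sqrt(mean([(v - m) ** 2 for v in values]))

def correlation(xs, ys):
    # (average of products - product of averages) / (product of SDs)
    mean_of_products = mean([x * y for x, y in zip(xs, ys)])
    return (mean_of_products - mean(xs) * mean(ys)) / (sd(xs) * sd(ys))

r = correlation(crimes_1991, crimes_1994)
print(round(r, 4))  # about .0584
```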
2. Suppose the Berkeley pair is wrong and it should have been
2.88 and 3.83 instead of 2.88 and 0.83. Can a single error
change the correlation coefficient?
Absolutely. The new coefficient is .7655...a big change.
Note that the correlation coefficient is sensitive to a change in a single number.
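A quick way to see the effect of the single changed number is to compute r both ways. This is only a sketch: `correlation` is a helper written for this example, and it uses `statistics.pstdev` (the divide-by-n SD) to match the figures in these notes.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    # (average of products - product of averages) / (product of SDs)
    mean_of_products = mean([x * y for x, y in zip(xs, ys)])
    return (mean_of_products - mean(xs) * mean(ys)) / (pstdev(xs) * pstdev(ys))

crimes_1991 = [2.88, 1.21, 0.32, 0.48, 0.83, 0.26, 0.19, 0.24, 0.41, 0.54]
crimes_1994 = [0.83, 1.43, 0.58, 0.94, 0.94, 0.39, 0.39, 0.37, 1.54, 3.15]

# Same data, but with Berkeley's 1994 value changed from 0.83 to 3.83.
crimes_1994_corrected = [3.83] + crimes_1994[1:]

r_original = correlation(crimes_1991, crimes_1994)
r_corrected = correlation(crimes_1991, crimes_1994_corrected)
print(round(r_original, 4), round(r_corrected, 4))  # about .0584 and .7655
```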
3. Suppose I didn't want to work with decimal points so I
multiplied all of the numbers in the first column (1991)
by 100 and all of the numbers in the second column by
1000. Would the correlation coefficient change?
No. It would still be .7655 (using the corrected Berkeley pair from question 2). The averages and SDs for 1991 and 1994 would change in predictable ways (100 times larger for the first and 1000 times larger for the second), but as long as the relationship between the two years remains unchanged, the correlation is unchanged.
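If you'd rather check than take it on faith, here is a small sketch that rescales both columns and recomputes r. It assumes the corrected Berkeley pair (2.88 and 3.83), and the helper name `correlation` is made up for this example.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    # (average of products - product of averages) / (product of SDs)
    mean_of_products = mean([x * y for x, y in zip(xs, ys)])
    return (mean_of_products - mean(xs) * mean(ys)) / (pstdev(xs) * pstdev(ys))

crimes_1991 = [2.88, 1.21, 0.32, 0.48, 0.83, 0.26, 0.19, 0.24, 0.41, 0.54]
crimes_1994 = [3.83, 1.43, 0.58, 0.94, 0.94, 0.39, 0.39, 0.37, 1.54, 3.15]

# Multiply the first column by 100 and the second by 1000.
scaled_1991 = [100 * x for x in crimes_1991]
scaled_1994 = [1000 * y for y in crimes_1994]

r_before = correlation(crimes_1991, crimes_1994)
r_after = correlation(scaled_1991, scaled_1994)
print(round(r_before, 4), round(r_after, 4))  # both about .7655
```

The averages and SDs scale by the same factors as the data, so the factors cancel in the ratio that defines r.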
4. Why might knowing (3) be useful? Think about yesterday's handout with the Microsoft vs. AT&T and Sony stocks. Suppose you are told that actually in that time period Microsoft stock was split 2-for-1. Would that change the correlations? No.
It might save you time to know the behavior of a correlation coefficient rather than recalculating all of the numbers.
Idea: given a set of data, regression can summarize the relationship between two variables X and Y. Here, there is a strong sense that one of the variables (Y) depends on the other (X).
A line which summarizes the relationship can be found from just knowing the means and standard deviations of the two variables and their correlation.
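In symbols, that line can be written as below. The notation is an assumption made here, not something from the handout: $\bar{x}$ and $\bar{y}$ are the averages of X and Y, $SD_x$ and $SD_y$ are their standard deviations, $r$ is the correlation, and $\hat{y}$ is the value of Y the line gives for a particular $x$.

$$\hat{y} = \bar{y} + r \cdot \frac{SD_y}{SD_x} \, (x - \bar{x})$$

So the slope is $r \cdot SD_y / SD_x$, and the line passes through the point of averages $(\bar{x}, \bar{y})$.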
This line is called a "regression" line. It's called regression because the man who developed this statistical method was working with fathers' heights and the heights of their sons and noticed something he called "regression to the mean (or average)". Basically, he noticed that tall fathers tended to have sons who were closer to average in height, and short fathers tended to have sons who were closer to average in height.
And that is what the regression line does -- it is going through the average of the Y-variable (the vertical axis) for a given value of the X-variable (the horizontal axis).
Handout.
The America Online and Amazon.com graph has a regression line drawn through it (for now, don't worry about how to draw the line; that's in Chapter 12). At each value of America Online's stock, the line is going through the average of Amazon.com for that particular value of America Online.
You could think of this line as the "best fitting" line between all of the points on the graph. Note that the line doesn't exactly pass through the points -- in fact, it misses most of them, but the typical vertical distance of the points from the line (measured as the average of the squared distances) is as small as it can be for any straight line.
From the previous lecture: one of the properties of the correlation coefficient (r) is that it can be used to give us a rough idea of the relationship between two variables.
Values of r near +1 and -1 mean that the two variables are very closely associated. Values of r near 0 suggest that there is almost no relationship between the two.
If you look at the America Online vs. Amazon.com graph again, r = .9047, which is quite close to +1. It suggests a strong positive relationship between the two stocks.
A positive value means the relationship is positive -- that is, as one variable increases so does the other. A negative value means that the relationship is negative -- as one variable increases, the other decreases.
In Chapter 10, there is another use for the correlation coefficient. For any two variables it can give you an idea of the rate of change between them.
That is, for every one standard deviation increase in the
X variable, we can expect an increase of
r * (standard deviation of Y)
in the Y variable, on average.
Back to AOL and Amazon, the standard deviation for AOL (the X variable) is $9.38. Going back to the previous idea, for a $9.38 increase in AOL's stock price, we can expect about a $16.90 increase in Amazon stock (the Y variable). It's (.9047*$18.6847) = $16.90.
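The arithmetic above generalizes to any change in X, not just a change of exactly one SD. A rough sketch follows; the function name `expected_change_in_y` is invented for this example, and the summary figures are the ones quoted in these notes.

```python
def expected_change_in_y(r, sd_x, sd_y, change_in_x):
    # Express the change in X in SD units, then move r times that many SDs in Y.
    change_in_sd_units = change_in_x / sd_x
    return r * change_in_sd_units * sd_y

r = 0.9047           # correlation between AOL and Amazon
sd_aol = 9.3851      # SD of AOL's price (X), in dollars
sd_amazon = 18.6847  # SD of Amazon's price (Y), in dollars

# A one-SD rise in AOL ($9.3851) goes with roughly a $16.90 rise in Amazon.
one_sd_rise = expected_change_in_y(r, sd_aol, sd_amazon, change_in_x=sd_aol)
print(round(one_sd_rise, 2))
```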
r is a measure of association and it gives you a sense of the direction of the relationship and how much change is expected in the y variable given a change in the x variable.
Suppose you thought AOL was going to cost $50/share tomorrow. What would a share of Amazon cost given what we know about the relationship between the two stocks?
AOL's average is $55.6736 and its standard deviation is $9.3851. A price of $50 is $5.6736 below average, or .6045 SD below average (that is, 5.6736/9.3851). To predict Amazon's price when AOL is .6045 SD below average, multiply the correlation coefficient, .9047, by .6045, which gives .5469 -- this is the fraction of an SD by which we expect Amazon to be below its own average. Then convert this to dollar amounts for Amazon -- .5469 * $18.6847 = $10.2185 -- and, because AOL is below average, subtract it from Amazon's average of $104.7286. So Amazon's predicted price is $104.7286 - $10.2185 = $94.51.
In other words...when America Online is trading at $50/share, given the data we have, we'd expect Amazon.com to trade at about $95 per share.
It makes sense: when America Online is rising, we expect Amazon to rise, and when AOL is down, so is Amazon.
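The prediction steps above can be sketched as a small Python function. The name `predict_y` and its parameter names are made up for this example; the averages, SDs, and r are the figures quoted in these notes.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
    # Step 1: how many SDs is x above (+) or below (-) its average?
    x_in_sd_units = (x - mean_x) / sd_x
    # Step 2: Y is expected to be only r times as many SDs from its average.
    y_in_sd_units = r * x_in_sd_units
    # Step 3: convert back to Y's own units.
    return mean_y + y_in_sd_units * sd_y

predicted_amazon = predict_y(
    x=50.0,            # hypothesized AOL price
    mean_x=55.6736,    # AOL average
    sd_x=9.3851,       # AOL SD
    mean_y=104.7286,   # Amazon average
    sd_y=18.6847,      # Amazon SD
    r=0.9047,
)
print(round(predicted_amazon, 2))  # about 94.51
```

Plugging in AOL's own average for `x` returns Amazon's average, which is another way of seeing that the line goes through the point of averages.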
In combination with a basic understanding of regression, the correlation coefficient can give us a sense of the amount of change which occurs between two variables.
It can also be used to predict values of Y given a value of X.