1.
Reviewing the Correlation Coefficient
The
following data related the number of violent crimes (defined as murder, rape,
robbery, and assault) at a sampling of college campuses in 1991 and 1994. The
data listed in terms of the number of crimes per 1,000 students.
SCHOOL |
1991 |
1994 |
Berkeley |
2.88 |
0.83 |
UCLA |
1.21 |
1.43 |
Davis |
0.32 |
0.58 |
Colorado
State |
0.48 |
0.94 |
University
of Florida |
0.83 |
0.94 |
University
of Minnesota |
0.26 |
0.39 |
University
of Iowa |
0.19 |
0.39 |
Penn
State |
0.24 |
0.37 |
Northeastern |
0.41 |
1.54 |
Boston
College |
0.54 |
3.15 |
A. Compute the
sample correlation coefficient r for the data.
1991 Average is
.7360 and the SD is .7736 and 1994 Average is 1.056 and the SD is .8013
The average of the products of each pair in 1991 and 1994 is .81344. The
correlation coefficient is:
.81344 - (.7360 x 1.056)
----------------------- = .0584
.7776 x .8013
B.
Suppose the Berkeley pair is wrong and it should have been 2.88 and
3.83 instead of 2.88 and 0.83. Can a single error change the correlation
coefficient?
Absolutely.
The new coefficient is .7655...a big change.
Note
that the correlation coefficient is sensitive to a change in a single number.
C.
Suppose I didn't want to work with decimal points so I multiplied all
of the numbers in the first column (1991) by 100 and all of the numbers in the
second column by 1000. Would the correlation coefficient change?
No.
It would still be .7655. The average for 1991 and 1994 would change in
predictable ways (100 times larger for the first and 1000 times larger for the
second) but as long as the relationship between the two years remains
unchanged, the correlation is unchanged.
D.
Why might knowing (3) be useful? Suppose the government was going to
change the way it measured crimes, suppose from per thousand as above to per
100,000? So the Berkeley number would
go from 2.88 to 288, the UCLA number would go from 1.21 to 12, etc1. Would that
change the correlations? No.
It might save you time to know the behavior of a correlation coefficient rather than recalculating all of the numbers.
2.
Properties Summarized again
3. Where the correlation can fail you or deceive you
Different SDs
Appearances can
deceive, the overall appearance of a scatter diagram depends on the Standard
Deviations of the individual X and Y variables. Smaller standard deviations make the scatter diagram look
"tighter" or more closely placed together. This happens because r is defined by how far individual points
deviate from their means divided by their SDs.
Clustering is relative. So
beware, your eye can be fooled. Be sure
to:
(a)
examine the ranges of the X and Y variables when comparing two sets of
data
(b)
look at the standard deviation for the X variable and the Y
variable. Are most points within a
standard deviation or not?
(c) calculate the correlation coefficient (if it hasn't been done yet) or review them (if they have been calculated for you)
Outliers
The correlation
is useful when your data points are football shaped. But sometimes r will mislead.
A single outlier can counterbalance a strong linear relationship
(example)
Non-Linear relationships
Two variables can be related, but their relationship is not well
described by a straight line.
Correlations are good for straight-line relationships, the are terrible
for curved relationships. The
correlation coefficient for a curved relationship will be near zero even though
there is clearly a relationship. (example)
Solutions? Always take a look at the scatter diagram if possible.
Ecological Correlations
Correlations for aggregated information (e.g. information by state) are
almost always stronger than correlations for individuals. Often times people calculate correlations
based on rates or averages which are based on many individuals combined. This tends to make a relationship look
stronger than it is. Beware when people
do this.
Association is NOT causation
Remember: a correlation tells you how strongly two variables are related
in a linear fashion. It doesn't mean
one variable causes another to happen.
The problem here is one of CONFOUNDING, that is, a third variable may be
present that you haven't accounted for.