© 2004, S. D. Cochran. All rights reserved.

CORRELATION

  1. We are now going to deal with the situation where we have two variables and we want to ask questions about their relationship.
  1. You do this now intuitively--if you devote all weekend to studying (more hours) will you do better on your midterms (more points on each exam)? This is a positive association--as the values of one variable increase, so do the values of the other.

  2. You also might ask a question that reflects a hypothesis about a negative correlation. If I lose weight, will I feel happier? This is a negative relationship between weight and happiness, because as weight decreases, you hypothesize that happiness increases.

  3. Finally, you may also intuitively experience a situation in which two variables are not associated at all. For example, take effort expended as one variable and extent of understanding of a very complicated concept in one of your classes as the other: you might have had the experience that no matter how hard you tried, it didn't seem to influence your degree of understanding. This is a situation of no association--and the human emotion attached to it is the perception of helplessness. Statisticians think of it as a lack of covariance.

  2. When we have two variables, it is quite natural to think of one as influencing the other.
  1. Example: If I lose weight (weight loss is a causal variable) then I will feel more energy (energy is a consequence)

  2. Variables that we think of as influencing factors are called independent variables

  1. Some independent variables are causal

  2. Some independent variables are not--an example in the book is son's height predicting father's height. Or we might predict height from foot size, but clearly foot size is not causal of height

  3. Variables that we think of as consequential are called dependent variables

  4. Example: If I lose weight (independent variable or IV) then I feel better (dependent variable or DV)

  5. The if-then sentence structure of the logical statement is the clue: If IV, then DV.

  3. Joint distributions
  1. We have learned that a sample distribution has a center, the mean, and a spread, the S.D.
  1. The S.D., when squared, is called the variance

  2. With two variables, each of them separately has a mean and a variance in unidimensional space

Example: We collect data from 16 high school students taking a history class. We ask each of them two questions: how many hours did they study for their last quiz, and what was their score? We have to keep track of each subject's two responses because later we will have to multiply these responses by each other.

The graph showing hours studied has a center and spread. The mean is 5.2 hours, the S.D. is 3.3 hours. Notice that the S.D. is wide in relation to the distribution. Why is that so?

We can also graph the distribution of scores on the quiz. The mean is 72.3 points and the S.D. is 17.2 points.
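As a quick check on those numbers, here is a minimal Python sketch (standard library only) that recomputes the mean, S.D., and variance of the hours variable from the raw data tabulated later in these notes; note that the book's S.D. divides by n, not n - 1:

```python
# Minimal sketch: mean, S.D., and variance of one variable,
# dividing by n (the book's convention), not n - 1.
hours = [1.2, 1.2, 1.2, 1.7, 2.1, 3.1, 4.0, 4.3,
         6.3, 6.4, 6.9, 7.2, 7.5, 8.9, 10.9, 11.0]

n = len(hours)
mean = sum(hours) / n
variance = sum((x - mean) ** 2 for x in hours) / n
sd = variance ** 0.5

print(round(mean, 1))      # 5.2 hours
print(round(sd, 1))        # 3.3 hours
print(round(variance, 1))  # 10.7, the variance
```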

  2. We can also plot the joint distribution of the set of scores
  1. In a joint distribution, each observation contains information on two variables. This can be plotted in two dimensions as a point.

  2. Notice that in our example, the joint distribution has the appearance of a line. But it too has a center and a spread. The center is two dimensional, so instead of being a point it is a line, called in the book the SD line. The spread is the vertical distance of the points away from that line.
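A minimal sketch of such a plot, assuming matplotlib is available; each student contributes one (hours, score) point:

```python
# Plot the joint distribution: one point per student.
import matplotlib.pyplot as plt

hours = [1.2, 1.2, 1.2, 1.7, 2.1, 3.1, 4.0, 4.3,
         6.3, 6.4, 6.9, 7.2, 7.5, 8.9, 10.9, 11.0]
scores = [50, 50, 51, 54, 56, 62, 63, 76,
          74, 80, 80, 82, 83, 93, 101, 102]

plt.scatter(hours, scores)
plt.xlabel("Hours studied")
plt.ylabel("Quiz score")
plt.title("Joint distribution: hours studied vs. quiz score")
plt.show()
```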

  4. Pearson correlation coefficient
  1. The Pearson r is a summary of the extent of association between two variables

  2. Basic facts

  1. -1 ≤ r ≤ 1, that is, r ranges from -1 to 1

  2. When r is 1 or -1, it means that

  1. The joint distribution of x and y forms a straight line, and all the points of the joint distribution are on that line

  2. If we know x and r=1 or r = -1, then we can perfectly predict the value of y

  3. If r is 0, the points of the joint distribution do not form a straight line, but rather a cloud, like buckshot. There can also be other shapes, which I will get to later.

  3. Pearson r represents the extent to which the same individuals or events occupy the same relative position on two variables.

  1. Another way of stating this is to think of z-scores.

  2. Z-scores act like ranks: when we convert a score to a z-score, we are locating that score's relative position within the distribution

  3. r calculates the extent to which, across two variables, an individual's z-scores, or ranks, are the same (see the sketch below)
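To see these facts concretely, here is a small Python sketch; the zscores and r helpers are mine, with r computed as the average product of paired z-scores (the book method described later). A perfect rising line gives r = 1, a perfect falling line gives r = -1, and unrelated noise gives r near 0:

```python
import random

def zscores(xs):
    """Convert raw scores to z-scores (S.D. divides by n, the book's convention)."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return [(x - m) / sd for x in xs]

def r(xs, ys):
    """Pearson r: the average product of paired z-scores."""
    return sum(zx * zy for zx, zy in zip(zscores(xs), zscores(ys))) / len(xs)

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(r(x, [2 * v + 1 for v in x]))   # 1.0 (up to float rounding): all points on a rising line
print(r(x, [-3 * v + 7 for v in x]))  # -1.0: all points on a falling line

random.seed(1)
noise = [random.random() for _ in x]
print(r(x, noise))                    # near 0: a cloud; small samples bounce around zero
```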

  5. The correlation coefficient, r, is two pieces of information
  1. It is a statement of the direction of association between two variables
  1. When positive, it means the values of the two variables move in the same direction: as one variable increases, so does the value of the other; and as one decreases, the other decreases

  2. When negative, it means that as the values of the two variables go in opposite directions: as one increases, the other decreases

  3. When r is zero exactly, it means that changes in one variable do not predict changes in the other. The variables are uncorrelated--not linearly associated.

  2. It is also a statement of the strength of association between x and y
  1. As the spread in the joint distribution away from the line decreases, the size of r increases

  2. This is another way of saying that as r increases, knowing x provides more information about what the value of y might be

  3. Because a correlation contains two pieces of information, it is not enough to say two variables are correlated. You have to convey both pieces of information by saying they are positively correlated or negatively correlated.
  6. Calculating the correlation coefficient
  1. In the book, the calculation involves converting both x and y into standard units (z-scores) and then taking the average of the product of the paired standard scores.

r = Σ(z_x)(z_y) / n,  where n is the number of elements in the sample

  2. More typically, you will do this automatically on your hand calculator

  3. The formula for r introduces the notion of covariance.

  1. Covariance refers to the extent to which changes in x are accompanied by changes in y

  2. There are several formulas for r, all of which are just ways of stating the same thing. The definitional formula divides the covariance by the product of the standard deviations:

r = cov(x, y) / (SD_x · SD_y),  where cov(x, y) = Σ(x - x̄)(y - ȳ) / n

or a computational formula for raw data:

r = [n·Σxy - (Σx)(Σy)] / sqrt{ [n·Σx² - (Σx)²] · [n·Σy² - (Σy)²] }

or another computational formula, making use of the mean and standard deviation, which we probably have already calculated or may know:

r = (Σxy/n - x̄·ȳ) / (SD_x · SD_y)
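To confirm these really are the same thing, here is a minimal Python sketch (standard library only; the mean and sd helper names are mine) computing r all three ways on the studying/score data tabulated below. All three print the same value, about 0.99:

```python
# Three equivalent routes to Pearson r, applied to the study data.
hours = [1.2, 1.2, 1.2, 1.7, 2.1, 3.1, 4.0, 4.3,
         6.3, 6.4, 6.9, 7.2, 7.5, 8.9, 10.9, 11.0]
scores = [50, 50, 51, 54, 56, 62, 63, 76,
          74, 80, 80, 82, 83, 93, 101, 102]
n = len(hours)

def mean(v):
    return sum(v) / len(v)

def sd(v):
    m = mean(v)
    return (sum((x - m) ** 2 for x in v) / len(v)) ** 0.5

mx, my, sx, sy = mean(hours), mean(scores), sd(hours), sd(scores)

# 1. Book method: average product of paired z-scores.
r1 = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(hours, scores)) / n

# 2. Computational formula for raw data.
sxy = sum(x * y for x, y in zip(hours, scores))
sxx = sum(x * x for x in hours)
syy = sum(y * y for y in scores)
r2 = (n * sxy - sum(hours) * sum(scores)) / (
    ((n * sxx - sum(hours) ** 2) * (n * syy - sum(scores) ** 2)) ** 0.5)

# 3. Using the means and standard deviations.
r3 = (sxy / n - mx * my) / (sx * sy)

print(r1, r2, r3)  # all three agree: about 0.99
```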

  7. So, to use our studying and test score data:
  1. Book method:

 

Hours   Score   Z(hours)   Z(score)   Product
 1.2     50      -1.24      -1.30       1.61
 1.2     50      -1.24      -1.30       1.61
 1.2     51      -1.24      -1.24       1.53
 1.7     54      -1.08      -1.07       1.16
 2.1     56      -0.96      -0.95       0.91
 3.1     62      -0.66      -0.60       0.39
 4.0     63      -0.38      -0.54       0.21
 4.3     76      -0.29       0.21      -0.06
 6.3     74       0.32       0.10       0.03
 6.4     80       0.35       0.45       0.16
 6.9     80       0.51       0.45       0.23
 7.2     82       0.60       0.56       0.34
 7.5     83       0.69       0.62       0.43
 8.9     93       1.12       1.20       1.35
10.9    101       1.73       1.67       2.89
11.0    102       1.76       1.73       3.04

Sum     83.9    1157                   15.81
Mean     5.2    72.3
S.D.     3.3    17.2

r = 15.81/16 = 0.99   (n = 16 elements)

z for the first element's hours = (1.2 - 5.2)/3.3 ≈ -1.21; the table shows -1.24 because it carries the unrounded mean (5.24) and S.D. (3.27) through the calculation.

  2. Computational formula method:

 

Hours   Score   Product
 1.2     50       60.0
 1.2     50       60.0
 1.2     51       61.2
 1.7     54       91.8
 2.1     56      117.6
 3.1     62      192.2
 4.0     63      252.0
 4.3     76      326.8
 6.3     74      466.2
 6.4     80      512.0
 6.9     80      552.0
 7.2     82      590.4
 7.5     83      622.5
 8.9     93      827.7
10.9    101     1100.9
11.0    102     1122.0

Sum     83.9    1157    6955.3
Mean     5.2    72.3
S.D.     3.3    17.2
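
Plugging the column totals into the third computational formula, carrying the unrounded means and standard deviations (5.2438, 72.3125, 3.2705, 17.1745):

r = (Σxy/n - x̄·ȳ) / (SD_x · SD_y)
  = (6955.3/16 - 5.2438 × 72.3125) / (3.2705 × 17.1745)
  = (434.71 - 379.19) / 56.17
  ≈ 0.99

the same value the book method gave. (If you plug in the rounded S.D.s of 3.3 and 17.2, you get about 0.98, so carry the unrounded values through the arithmetic.)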