Statistics 10
Lecture 22


Yesterday's correlation problem had an error in the calculation of the standard deviations and the correlation coefficient. Here are the correct numbers:

1991 Average is .7360 and the SD is .7736
1994 Average is 1.056 and the SD is .8013
The average of the products of each pair in 1991 and 1994 is .81344
The correlation coefficient is:

.81344 - (.7360 x 1.056)
------------------------    = .0584
     .7736 x .8013
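
The arithmetic can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and it uses the 1991 SD of .7736 from the list of corrected numbers.

```python
# Numbers copied from the corrected summary above.
avg_1991, sd_1991 = 0.7360, 0.7736
avg_1994, sd_1994 = 1.056, 0.8013
avg_of_products = 0.81344

# r = (average of products - product of averages) / (product of SDs)
r = (avg_of_products - avg_1991 * avg_1994) / (sd_1991 * sd_1994)
print(round(r, 4))
```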

A. Predicting Individual Values (Chapter 10.3 and 12.1)

The correlation coefficient can also be used to predict values of Y given a value of X. To predict values, it's easiest to use regression.

1. Idea: given a set of data, we might try to find the line that best summarizes the relationship between X and Y. This line will tell us how much Y changes with a change in X. Note that regression requires us to have explanatory and response variables.

2. Math fact: straight lines can be represented in the form

y = slope*x + intercept

The slope tells how much y tends to be different when x changes by one unit; the intercept tells what we expect to get for y when x=0.

The method of drawing a line through a scattering of points tries to make the line as close as possible to the points in the VERTICAL direction: it makes the sum of the squared vertical distances as small as possible. The resulting line is called the "least squares line," and the method is called "ordinary least squares."
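
To see what "as close as possible in the vertical direction" means, here is a rough sketch that tries many candidate lines and keeps the one with the smallest sum of squared vertical errors. The five data points are hypothetical (made up for illustration), and the brute-force search over a grid is only a teaching device, not how the line is found in practice.

```python
# HYPOTHETICAL data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def sum_squared_vertical_errors(slope, intercept):
    """Add up the squared vertical distances from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Candidate lines: slopes 0.0 to 2.0 and intercepts 0.0 to 5.0, in steps of 0.1.
candidates = [(i / 10, j / 10) for i in range(21) for j in range(51)]

# Keep the candidate whose squared vertical errors add up to the least.
best_slope, best_intercept = min(
    candidates, key=lambda line: sum_squared_vertical_errors(*line)
)
print(best_slope, best_intercept)
print(sum_squared_vertical_errors(best_slope, best_intercept))
```

Any other line on the grid, such as one with a slightly steeper slope, gives a larger sum of squared vertical errors for these points.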

3. Formula: the coefficients for the least squares line (using X to predict Y) are

               slope = r * SD of y
                       -----------
                          SD of x
intercept = average of y's - (slope) * (average of x's)

* Note: you must solve for the slope first to calculate intercept using these formulas.
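
The two formulas translate directly into a small function. This is a minimal sketch: the function name `regression_line` and the numbers in the usage lines are hypothetical, chosen only to show the order of the calculation.

```python
def regression_line(avg_x, sd_x, avg_y, sd_y, r):
    """Return (slope, intercept) of the least squares line for predicting y from x."""
    slope = r * sd_y / sd_x              # slope comes first
    intercept = avg_y - slope * avg_x    # the intercept needs the slope
    return slope, intercept

# HYPOTHETICAL summary statistics, made up for illustration only.
slope, intercept = regression_line(avg_x=10.0, sd_x=2.0, avg_y=50.0, sd_y=8.0, r=0.5)
print(slope, intercept)
```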

B. Example

Example page 202. For 555 California men age 25--29 in 1993, statistics given are:

average education = 12.5 years, SD = 4 years
average income = $21,500, SD = $16,000
r = 0.35

The regression equation for predicting income from number of years of education is:

First calculate the slope 

        0.35 * 16,000
slope = ------------- = $1,400
              4

Then the intercept

intercept = 21,500 - (1,400)*12.5 = 4,000

And put it all together

predicted income = ($1,400 per year) x education + $4,000

  1. Predict the income of someone with 10 years of education.

    predicted income = 1,400 x 10 + 4,000 = $18,000.

  2. Predict the income of someone with 0 years of education.

    predicted income = $1,400x0 + 4,000 = $4,000

  3. Beware of extrapolation: predicting outside of the data

    Predict the income of someone with 50 years of education; is this reasonable? (The line gives $74,000, but no, this is not reasonable: the sample is men age 25--29, and none of these people can possibly have 50 years of education.)

    (Hint: Use the scatter diagram to make sure you stay within the data.)
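
The three predictions above can be reproduced with a short sketch. The summary statistics come from the example; the range of education values (`EDUC_RANGE`) is NOT given in the example and is a hypothetical stand-in for what you would read off the scatter diagram.

```python
# Summary statistics from the example (555 California men age 25-29).
avg_educ, sd_educ = 12.5, 4.0                # years of education
avg_income, sd_income = 21_500.0, 16_000.0   # dollars
r = 0.35

slope = r * sd_income / sd_educ              # dollars per year of education
intercept = avg_income - slope * avg_educ    # dollars

# ASSUMPTION: hypothetical range of education values in the sample.
# In practice, read the range off the scatter diagram.
EDUC_RANGE = (0, 20)

def predicted_income(years):
    """Predicted income in dollars from the regression line."""
    return slope * years + intercept

def is_extrapolation(years, data_range=EDUC_RANGE):
    """True when years falls outside the assumed range of the data."""
    low, high = data_range
    return not (low <= years <= high)

for years in (10, 0, 50):
    note = " (extrapolation, do not trust)" if is_extrapolation(years) else ""
    print(f"{years} years: ${predicted_income(years):,.0f}{note}")
```

The line will happily produce a number for any input, so the range check has to come from knowing the data, not from the equation.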

C. Correlation and Regression Summarized

  1. Interpretation: a one SD change in X is ASSOCIATED WITH a change of r SDs in Y, that is, r * (SD of Y)
  2. The slope of the regression line is = r * (SD of Y)/(SD of X)

    The intercept is = average of y - slope*(average of x)

    (Recall: the slope measures the average observed change in Y when there is a unit change in X. You MUST calculate slope first to derive the intercept)

  3. Again a warning: r and the regression line measure ASSOCIATION, not causation. There is a desire to predict Y from X but we can't always be certain that X is causing Y to happen.
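
Point 1 of the summary can be checked numerically with the education and income figures from section B. This is a sketch with made-up variable names: moving one SD of X above the average of X should raise the prediction by r * (SD of Y), and the prediction at the average of X should be the average of Y.

```python
# Summary statistics from the education/income example in section B.
avg_x, sd_x = 12.5, 4.0            # education, years
avg_y, sd_y = 21_500.0, 16_000.0   # income, dollars
r = 0.35

slope = r * sd_y / sd_x
intercept = avg_y - slope * avg_x

def predict(x):
    """Predicted y from the regression line."""
    return slope * x + intercept

at_average = predict(avg_x)            # prediction at the average of x
one_sd_above = predict(avg_x + sd_x)   # prediction one SD of x higher

# Prints the prediction at the average, the change for one SD of x, and r * SD of y.
print(round(at_average), round(one_sd_above - at_average), round(r * sd_y))
```

The last two printed numbers agree, which is the "one SD in X goes with r SDs in Y" statement in dollar terms.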

D. Some extra practice

A large bank is planning on introducing a new computer program to its staff. To learn about the optimal amount of training that is needed to effectively implement the new program, the bank randomly chose 10 employees of roughly equal skill. These workers were trained for different amounts of time and were then individually put to work on a given project.

The following data indicate the training times and the resulting times (both in hours) that it took each worker to complete the project.

Worker      Training Time       Time to Complete Project
1              22               18.4
2              18               19.2
3              30               14.5
4              16               19.0
5              25               16.6
6              20               17.7
7              10               24.4
8              14               21.0
9              28               15.0
10             24               16.0

A. Calculate the correlation coefficient for training time and project completion time.

B. What is the estimated regression equation for predicting Project Completion time from Training Time?

C. Predict the project completion time for someone who was given 19 hours of training.

D. Suppose someone was given 40 hours of training. Can you predict their project completion time?

And just to challenge you:
E. Test the hypothesis that the average project completion time for all bank employees is 19 hours using the information contained in your sample of 10 employees. Use a 5% level of significance as your decision rule.

Answers:
(a) -.9623
(b) project completion time = -.4523*training time + 27.5418
(c) about 18.95 hours or 19 hours
(d) No. 40 hours is outside the range of the available training times (10 to 30 hours), so extrapolation is not advisable. You wouldn't know if the line continues to be straight.
(e) Use a t-test. The null hypothesis is that the average is 19 hours; the alternative is that it is less than 19 hours. You could instead test the idea that it isn't 19, but I haven't taught you how to do that (go ahead if you think you know how; it won't be counted against you).

      18.18 - 19 
t = ------------------------ = -.87 with 9 degrees of freedom
    (SQRT(10) * 2.9728) / 10

I would not reject the null hypothesis that the average completion time for all bank employees is 19 hours. A t of -.87 is well short of the 5% cutoff (about -1.83 for a one-sided test with 9 degrees of freedom), so the difference between the sample average of 18.18 and the claim of 19 hours for all employees can be explained by chance error.
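
If you want to check answers (a), (b), (c) and (e) by computer, the following sketch uses only Python's standard library. The variable names are made up for illustration. The SDs in the regression use the divide-by-n form (`pstdev`), while the t-test uses the divide-by-(n-1) form (`stdev`), which is where the 2.9728 above comes from.

```python
import math
from statistics import mean, pstdev, stdev

# Data from the table: training time and project completion time, in hours.
training = [22, 18, 30, 16, 25, 20, 10, 14, 28, 24]
completion = [18.4, 19.2, 14.5, 19.0, 16.6, 17.7, 24.4, 21.0, 15.0, 16.0]
n = len(training)

avg_x, avg_y = mean(training), mean(completion)
sd_x, sd_y = pstdev(training), pstdev(completion)

# (a) r = (average of products - product of averages) / (product of SDs)
avg_products = mean([x * y for x, y in zip(training, completion)])
r = (avg_products - avg_x * avg_y) / (sd_x * sd_y)

# (b) slope first, then intercept
slope = r * sd_y / sd_x
intercept = avg_y - slope * avg_x

# (c) prediction for 19 hours of training (inside the 10 to 30 hour range)
predicted_19 = slope * 19 + intercept

# (e) t statistic for the null hypothesis that the average is 19 hours
t = (avg_y - 19) / (stdev(completion) / math.sqrt(n))

print(f"r = {r:.4f}")
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"prediction at 19 hours = {predicted_19:.2f}")
print(f"t = {t:.2f} with {n - 1} degrees of freedom")
```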
Last Update: 3 December 1998 by VXL