Statistics 10
Lecture 22


A. Predicting Individual Values (Chapter 10.3 and 12.1)

The correlation coefficient can also be used to predict values of Y given a value of X. To predict values, it's easiest to use regression.

1. Idea: given a set of data, we might try to find the line that best summarizes the relationship between X and Y. This line will tell us how much Y changes with a change in X. Note that regression requires us to decide which variable is explanatory (X) and which is the response (Y).
2. Math fact: straight lines can be represented in the form
y = slope*x + intercept
The slope tells how much y tends to be different when x changes by one unit; the intercept tells what we expect to get for y when x=0.
The method of drawing a line through a scattering of points tries to make a line that is as close as possible to the points in the VERTICAL direction: it minimizes the sum of the squared vertical distances from the points to the line. The resulting line is called the "least squares line," and the method is called "ordinary least squares."
3. Formula: the coefficients for the least squares line (using X to predict Y) are
               slope = r * SD of y
                       -----------
                          SD of x
intercept = average of y's - (slope) * (average of x's)
* Note: you must solve for the slope first to calculate intercept using these formulas.
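
The formulas above can be sketched in Python as a rough illustration. The function names (`correlation`, `least_squares_line`, `sum_squared_vertical`) and the five data points are made up for this sketch; they do not come from the text. SDs are computed by dividing by n, which is an assumption about the convention in use; the slope comes out the same either way as long as r and the SDs use the same convention.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """r = average of the products of x and y in standard units."""
    avg_x, avg_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    return mean([((x - avg_x) / sd_x) * ((y - avg_y) / sd_y)
                 for x, y in zip(xs, ys)])

def least_squares_line(xs, ys):
    """Slope first, then intercept, exactly as in the formulas above."""
    slope = correlation(xs, ys) * pstdev(ys) / pstdev(xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept

def sum_squared_vertical(xs, ys, slope, intercept):
    """Sum of squared VERTICAL distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
best = sum_squared_vertical(xs, ys, slope, intercept)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, squared error = {best:.3f}")

# Nudging the slope or the intercept should never reduce the squared error.
for d_slope, d_intercept in [(0.1, 0), (-0.1, 0), (0, 0.5), (0, -0.5)]:
    nudged = sum_squared_vertical(xs, ys, slope + d_slope, intercept + d_intercept)
    print(f"nudged by ({d_slope}, {d_intercept}): squared error = {nudged:.3f}")
```

Every nudged line should report a squared error at least as large as the least squares line, which is what "as close as possible in the vertical direction" means.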

B. Example

Example, page 202. For 555 California men age 25-29 in 1993, the statistics given are:

average education = 12.5 years, SD = 4 years
average income = $21,500, SD = $16,000
r = 0.35

To find the regression equation for predicting income from number of years of education, first calculate the slope:


         0.35 * $16,000
slope =  -------------- = $1,400 per year of education
            4 years

Then the intercept:

intercept = $21,500 - ($1,400 per year) * (12.5 years) = $4,000

And put it all together:

predicted income = ($1,400 per year) x education + $4,000
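
As a quick check of the arithmetic, the same calculation can be written as a small Python sketch. The function name `regression_from_summary` and its parameter names are chosen here for illustration; only the five summary statistics come from the example.

```python
def regression_from_summary(avg_x, sd_x, avg_y, sd_y, r):
    """Least squares coefficients from summary statistics (X predicts Y)."""
    slope = r * sd_y / sd_x             # the slope must be computed first
    intercept = avg_y - slope * avg_x   # the intercept uses the slope
    return slope, intercept

# Summary statistics from the example: education (years) predicts income ($).
slope, intercept = regression_from_summary(
    avg_x=12.5, sd_x=4, avg_y=21_500, sd_y=16_000, r=0.35
)
print(f"predicted income = {slope:,.0f} x education + {intercept:,.0f}")
# prints: predicted income = 1,400 x education + 4,000
```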

  1. Predict the income of someone with 10 years of education.
     predicted income = $1,400 x 10 + $4,000 = $18,000

  2. Predict the income of someone with 0 years of education.
     predicted income = $1,400 x 0 + $4,000 = $4,000

  3. Beware of extrapolation: predicting outside the range of the data.

Predict the income of someone with 50 years of education; is this reasonable? (The equation gives $1,400 x 50 + $4,000 = $74,000, but no, this is not reasonable: the sample is men age 25-29, and none of them can possibly have 50 years of education.)

(Hint: Use the scatter diagram to make sure you stay within the data.)
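
One way to make the extrapolation warning concrete is to have the prediction refuse any X outside the observed range. The sketch below is only an illustration: the function name `predict_income` is made up, and the range of 0 to 18 years of education is a placeholder assumption, since the notes do not give the actual range (read it off the scatter diagram).

```python
SLOPE = 1_400       # dollars per year of education, from the example
INTERCEPT = 4_000   # dollars, from the example

def predict_income(education_years, data_min, data_max):
    """Predict income, but only for education values inside the data range."""
    if not data_min <= education_years <= data_max:
        raise ValueError(
            f"{education_years} years is outside the data range "
            f"[{data_min}, {data_max}]; extrapolation is not advisable"
        )
    return SLOPE * education_years + INTERCEPT

# ASSUMPTION: 0 to 18 years is a hypothetical range, not taken from the notes.
DATA_MIN, DATA_MAX = 0, 18

for years in (10, 0, 50):
    try:
        print(years, "years ->", predict_income(years, DATA_MIN, DATA_MAX))
    except ValueError as err:
        print("refused:", err)
```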

C. Correlation and Regression Summarized

  1. Interpretation: a one SD change in X is ASSOCIATED WITH a change of r SDs in Y, on average (that is, r * (SD of Y) in the units of Y)
  2. The slope of the regression line is r * (SD of Y)/(SD of X)
  3. The intercept is (average of y) - slope*(average of x)

    (Recall: the slope measures the average observed change in Y when there is a unit change in X. You MUST calculate slope first to derive the intercept)

  4. Again a warning: r and the regression line measure ASSOCIATION, not causation. We may want to predict Y from X, but being able to predict Y from X does not show that X causes Y.
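
Point 1 of this summary can be checked numerically with the income example from section B. This is only a sketch; the variable names are chosen here for illustration.

```python
# Summary statistics from the income example in section B.
r = 0.35
sd_x = 4         # SD of education, in years
sd_y = 16_000    # SD of income, in dollars

slope = r * sd_y / sd_x

# Moving one SD in X along the regression line changes the prediction by:
change_in_y = slope * sd_x
change_in_sd_units = change_in_y / sd_y

print(f"one SD of X ({sd_x} years) goes with a change of about "
      f"${change_in_y:,.0f} in predicted Y")
print(f"in SD units of Y that is about {change_in_sd_units:.2f}, i.e. r")
```

The step in predicted Y is r SDs of Y, not a full SD. This is an average association across the data, not a statement about what would happen to any one person.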

D. Some extra practice

A large bank is planning on introducing a new computer program to its staff. To learn about the optimal amount of training that is needed to effectively implement the new program, the bank randomly chose 10 employees of roughly equal skill. These workers were trained for different amounts of time and were then individually put to work on a given project.

The following data indicate the training times and the resulting times (both in hours) that it took each worker to complete the project.

Worker      Training Time       Time to Complete Project
1              22               18.4
2              18               19.2
3              30               14.5
4              16               19.0
5              25               16.6
6              20               17.7
7              10               24.4
8              14               21.0
9              28               15.0
10             24               16.0

(a) Calculate the correlation coefficient for training time and project completion time.

(b) What is the estimated regression equation for predicting project completion time from training time?

(c) Predict the project completion time for someone who was given 19 hours of training.

(d) Suppose someone was given 40 hours of training. Can you predict their project completion time?

Answers:
(a) r = -0.9623
(b) project completion time = -0.4523*(training time) + 27.5418
(c) about 18.95 hours, or roughly 19 hours
(d) No. 40 hours is outside the available data range for training time (10 to 30 hours), so extrapolation is not advisable. You wouldn't know if the line continues to be straight.
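
For anyone who wants to check these answers by computer, here is a minimal Python sketch that applies the lecture's formulas to the table above. The variable names are chosen for illustration, and SDs are computed by dividing by n (an assumption about the convention; r and the slope do not depend on that choice as long as it is applied consistently).

```python
from statistics import mean, pstdev

# Data from the table: training time and time to complete project (hours).
training = [22, 18, 30, 16, 25, 20, 10, 14, 28, 24]
completion = [18.4, 19.2, 14.5, 19.0, 16.6, 17.7, 24.4, 21.0, 15.0, 16.0]

avg_x, avg_y = mean(training), mean(completion)
sd_x, sd_y = pstdev(training), pstdev(completion)

# (a) r = average of the products of the values in standard units
r = mean([((x - avg_x) / sd_x) * ((y - avg_y) / sd_y)
          for x, y in zip(training, completion)])

# (b) slope first, then intercept
slope = r * sd_y / sd_x
intercept = avg_y - slope * avg_x

# (c) prediction for 19 hours of training
prediction_19 = slope * 19 + intercept

# (d) is 40 hours inside the range of the observed training times?
in_range_40 = min(training) <= 40 <= max(training)

print(f"(a) r = {r:.4f}")
print(f"(b) completion time = {slope:.4f} * training time + {intercept:.4f}")
print(f"(c) predicted completion time at 19 hours = {prediction_19:.2f}")
print(f"(d) 40 hours within observed range? {in_range_40}")
```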