1991 Average is .7360 and the SD is .7736
1994 Average is 1.056 and the SD is .8013
The average of the products of each pair in 1991 and 1994 is
.81344
The correlation coefficient is:
     .81344 - (.7360 x 1.056)
r = -------------------------- = .0584
          .7736 x .8013
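As a quick check on the arithmetic, here is a minimal Python sketch of the same calculation. The function and variable names are made up for illustration. It uses the 1991 SD of .7736 stated above, and it assumes the SDs were computed by dividing by n, which is what this shortcut formula requires.

```python
# Summary statistics quoted in the notes above.
mean_1991, sd_1991 = 0.7360, 0.7736
mean_1994, sd_1994 = 1.056, 0.8013
mean_of_products = 0.81344  # average of (1991 value x 1994 value) over all pairs


def correlation_from_summaries(mean_xy, mean_x, mean_y, sd_x, sd_y):
    """r = (average of products - product of averages) / (SD of x * SD of y).

    Assumes both SDs were computed with divisor n (not n - 1).
    """
    return (mean_xy - mean_x * mean_y) / (sd_x * sd_y)


r = correlation_from_summaries(
    mean_of_products, mean_1991, mean_1994, sd_1991, sd_1994
)
print(round(r, 4))
```

A value this close to zero says there is almost no linear association between the 1991 and 1994 figures.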
The correlation coefficient can also be used to predict values of Y given a value of X. To predict values, it's easiest to use regression.
1. Idea: given a set of data, we might try to find the line that best summarizes the relationship between X and Y. This line will tell us how much Y changes with a change in X. Note that regression requires us to have explanatory and response variables.
2. Math fact: straight lines can be represented in the form
y = slope*x + intercept
The slope tells how much y tends to be different when x changes by one unit; the intercept tells what we expect to get for y when x=0.
The method of drawing a line through a scattering of points tries to make the line as close as possible to the points in the VERTICAL direction: it minimizes the sum of the squared vertical distances between the points and the line. The resulting line is called the "least squares line," and the method is called "ordinary least squares."
3. Formula: the coefficients for the least squares line (using X to predict Y) are
              SD of y
slope = r * -----------
              SD of x

intercept = average of y's - (slope) * (average of x's)

* Note: you must solve for the slope first to calculate the intercept using these formulas.
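The two formulas can be tried out on a small data set. The sketch below is only an illustration: the five (x, y) pairs and the function names are hypothetical, not taken from the text. It computes the slope and intercept from r and the SDs, then checks that nudging the line in any direction increases the sum of squared vertical distances.

```python
from statistics import mean, pstdev


def least_squares_line(xs, ys):
    """Slope and intercept for predicting y from x, via r and the SDs."""
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)  # SDs with divisor n
    mean_xy = mean([x * y for x, y in zip(xs, ys)])
    r = (mean_xy - mean_x * mean_y) / (sd_x * sd_y)
    slope = r * sd_y / sd_x                # slope must come first
    intercept = mean_y - slope * mean_x    # intercept uses the slope
    return slope, intercept


def sum_squared_vertical_errors(xs, ys, slope, intercept):
    """Total squared vertical distance between the points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to keep the arithmetic small.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
best = sum_squared_vertical_errors(xs, ys, slope, intercept)

# Any other line should fit worse in the vertical direction.
nudged = [
    sum_squared_vertical_errors(xs, ys, slope + ds, intercept + di)
    for ds, di in [(0.1, 0), (-0.1, 0), (0, 0.5), (0, -0.5)]
]
print(round(slope, 3), round(intercept, 3), round(best, 3))
print(all(err > best for err in nudged))
```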
Example page 202. For 555 California men age 25--29 in 1993, statistics given are:
average education = 12.5 years, SD = 4 years
average income = $21,500, SD = $16,000
r = 0.35
The regression equation for predicting income from number of years of education is:
First calculate the slope
         0.35 * 16,000
slope = --------------- = $1,400 per year of education
               4
Then the intercept
intercept = 21,500 - (1,400)*12.5 = 4,000
And put it all together
predicted income = ($1,400 per year) x education + $4,000
For someone with 10 years of education:
predicted income = $1,400 x 10 + $4,000 = $18,000
For someone with 0 years of education:
predicted income = $1,400 x 0 + $4,000 = $4,000
Predict the income of someone with 50 years of education. Is this reasonable? (The prediction is $74,000. No: the sample is men age 25--29, and none of these people can possibly have 50 years of education.)
(Hint: Use the scatter diagram to make sure you stay within the data.)
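The income example can be reproduced in a few lines of Python. This is a sketch using only the summary statistics quoted above; the function names are made up for illustration. Note that the formula returns a number for 50 years of education just as readily as for 10. Nothing in the arithmetic warns you that you have left the range of the data, so that check is up to you.

```python
def line_from_summaries(r, mean_x, sd_x, mean_y, sd_y):
    """Least squares slope and intercept from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# x = years of education, y = income in dollars (values from the example).
slope, intercept = line_from_summaries(
    r=0.35, mean_x=12.5, sd_x=4, mean_y=21_500, sd_y=16_000
)


def predict_income(years_of_education):
    """Predicted income in dollars; no check that the input is in range."""
    return slope * years_of_education + intercept


for years in (10, 0, 50):
    print(years, "years ->", round(predict_income(years)), "dollars")
```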
The intercept = average of y - slope*(average of x)
(Recall: the slope measures the average observed change in Y when there is a unit change in X. You MUST calculate the slope first to derive the intercept.)
A large bank is planning on introducing a new computer program to its staff. To learn about the optimal amount of training that is needed to effectively implement the new program, the bank randomly chose 10 employees of roughly equal skill. These workers were trained for different amounts of time and were then individually put to work on a given project.
The following data indicate the training times and the resulting times (both in hours) that it took each worker to complete the project.
Worker   Training Time   Time to Complete Project
   1          22                  18.4
   2          18                  19.2
   3          30                  14.5
   4          16                  19.0
   5          25                  16.6
   6          20                  17.7
   7          10                  24.4
   8          14                  21.0
   9          28                  15.0
  10          24                  16.0

A. Calculate the correlation coefficient for training time and project completion time.
B. What is the estimated regression equation for predicting Project Completion time from Training Time?
C. Predict the project completion time for someone who was given 19 hours of training.
D. Suppose someone was given 40 hours of training. Can you predict their project completion time?
And just to challenge you:
E. Test the hypothesis that the average project
completion time for all bank employees is 19 hours
using the information contained in your sample of 10 employees. Use a 5% level
of significance as your decision rule.
Answers:
(a) -.9623
(b) project completion time = -.4523*training time + 27.5418
(c) about 18.95 hours or 19 hours
(d) No, 40 hours is outside of the available data range for
training time, so extrapolation is not advisable. You wouldn't
know if the line continues to be straight.
(e) Use a t-test. The null hypothesis is 19, the alternative
is less than 19 or you could test the idea that it isn't
19, but I haven't taught you how to do that (but go ahead
if you think you know, it won't be counted against you).
            18.18 - 19
t = -------------------------- = -.87 with 9 degrees of freedom
     (SQRT(10) * 2.9728) / 10

I would not reject the null hypothesis that the average completion time for all bank employees is 19 hours. The difference between the sample average of 18.18 and the claim of 19 hours for all employees is just due to chance error.
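If you want to check your answers by computer, the sketch below works through parts (a) to (e) from the raw data in the table. The variable names are made up for illustration. The correlation uses SDs with divisor n, while the t-test uses the SD with divisor n - 1 (the 2.9728 above). The cutoff of -1.833 is the one-sided 5% value for 9 degrees of freedom from a standard t table; it is not given in these notes.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

training = [22, 18, 30, 16, 25, 20, 10, 14, 28, 24]
completion = [18.4, 19.2, 14.5, 19.0, 16.6, 17.7, 24.4, 21.0, 15.0, 16.0]

# (a) correlation coefficient, using SDs with divisor n
mean_x, mean_y = mean(training), mean(completion)
sd_x, sd_y = pstdev(training), pstdev(completion)
mean_xy = mean([x * y for x, y in zip(training, completion)])
r = (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

# (b) regression line: slope first, then intercept
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

# (c) prediction for 19 hours of training
predicted_19 = slope * 19 + intercept

# (d) is 40 hours inside the range of observed training times?
in_range_40 = min(training) <= 40 <= max(training)

# (e) one-sample t statistic for the claim "average completion time = 19"
n = len(completion)
standard_error = stdev(completion) / sqrt(n)  # SD with divisor n - 1
t_stat = (mean_y - 19) / standard_error
critical_value = -1.833  # one-sided 5% cutoff, 9 df, from a t table
reject_null = t_stat < critical_value

print("r =", round(r, 4))
print("slope =", round(slope, 4), "intercept =", round(intercept, 4))
print("prediction at 19 hours =", round(predicted_19, 2))
print("40 hours within data range:", in_range_40)
print("t =", round(t_stat, 2), "reject null:", reject_null)
```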