© 2004, S. D. Cochran. All rights reserved.

REGRESSION CONTINUED

  1. Scatter in a scattergram
  1. There are four ways of thinking about distributions in a scattergram
  1. The distribution of x

  2. The distribution of y

  3. The joint distribution of x and y

  4. The distribution of y at a point on x

  1. For each of these distributions we can calculate an index of the center and the spread

  1. When there is a single variable, the center is commonly the mean, the spread is the S.D.

  2. When it is a joint distribution of two variables, the center is a regression line and the spread is the r.m.s. error around that line--this involves the use of residuals. Notice here that the estimate of center and spread is two-dimensional
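These two-dimensional estimates can be sketched numerically. A minimal Python sketch, using made-up paired data (not the course's actual age/cholesterol sample), computes the regression line of y on x as the center and the r.m.s. error as the spread:

```python
import statistics

# Made-up paired data for illustration (not the course sample)
x = [25, 31, 40, 47, 55, 62]          # e.g. ages
y = [110, 125, 150, 172, 205, 230]    # e.g. cholesterol levels

n = len(x)
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sd_x, sd_y = statistics.pstdev(x), statistics.pstdev(y)   # population S.D.s

# Correlation as the average product of z-scores
r = sum((xi - mean_x) / sd_x * (yi - mean_y) / sd_y for xi, yi in zip(x, y)) / n

# Center of the joint distribution: the regression line of y on x
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

# Spread around that center: the r.m.s. error
rms_error = sd_y * (1 - r ** 2) ** 0.5

print(round(r, 3), round(slope, 3), round(intercept, 1), round(rms_error, 2))
```

The line always passes through the point of averages (mean_x, mean_y), and the r.m.s. error can never exceed the S.D. of y.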

  1. Residuals

  1. Calculating residuals

  1. The term residual refers to that part of the actual score that we failed to predict.

  2. Example: In our joint distribution of age and cholesterol, if we had to guess the cholesterol level of a subject, knowing nothing else, our best guess would be the average cholesterol level, 170. This is not just because it is the center of the distribution, but also because in a normal distribution more observations lie within a narrow slice at the center (the mean) than within a slice of the same width anywhere else. So by picking the center we give ourselves the best odds of being correct of any single number we could choose

  3. Now, if we knew the age of the subject we could make an even more accurate estimate. We would pick the center of the joint distribution at the point on the regression line that corresponds to the subject's age. If the subject were 31 years old, we would predict a cholesterol level of 120--the mean for all 31-year-olds. This is more accurate than 170, because this individual's age is not average for the age distribution.

  4. But we are off by the vertical distance from the line to the point that represents our subject--we are off by the residual. This variation around the center of the distribution can be thought of as a standard deviation: it is the deviation from the average in our observed distribution.

  5. We can calculate this standard deviation around the regression line, but here it is called the standard error of the estimate. The estimate is the point on the least squares regression line in the joint distribution.
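The prediction-and-residual step can be sketched with the summary statistics quoted later in these notes (means 45.9 and 169.8, S.D.s 13.6 and 56.3, r = .86); the subject's observed cholesterol of 135 is made up for illustration. (With these summary figures the line predicts about 117 for a 31-year-old; the 120 above presumably reflects the actual group mean in the data.)

```python
# Summary statistics from these notes; the observed value below is made up
mean_age, sd_age = 45.9, 13.6
mean_chol, sd_chol = 169.8, 56.3
r = 0.86

def predict_chol(age):
    """Point on the regression line of cholesterol on age."""
    z_age = (age - mean_age) / sd_age
    return mean_chol + r * z_age * sd_chol

observed = 135                      # hypothetical subject, age 31
predicted = predict_chol(31)        # about 117 from these summary statistics
residual = observed - predicted     # vertical distance: actual minus predicted
print(round(predicted, 1), round(residual, 1))
```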

Remember, however, there are two possible regression lines so we must specify which one we are talking about.

The formula for the r.m.s. error for the regression line of y on x is:

r.m.s. error = √(1 − r²) × (S.D. of y)

So in the age and cholesterol data, where r = .86 and the S.D. of cholesterol is 56.3, the standard error of the estimate for the line predicting cholesterol from age = 56.3 × √(1 − .86²) = 28.7 units of cholesterol.

This means that, on average, if we take the predicted score plus or minus 28.7 units, that range will include an estimated 68% of the observations
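The arithmetic above can be checked directly:

```python
# The r.m.s. error from the notes' figures: r = .86, S.D. of cholesterol = 56.3
r, sd_chol = 0.86, 56.3
rms_error = sd_chol * (1 - r ** 2) ** 0.5
print(round(rms_error, 1))  # 28.7 units of cholesterol, matching the notes
```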

  1. Plotting residuals

  a. Commonly, statisticians plot the residuals in a distribution to evaluate how good a fit a straight line is to their data.

  b. Here I have plotted the residuals from the age and cholesterol data

  1. If the regression is a good fit to the data across the whole joint distribution, then at any point on the line the spread in the residuals should be the same.

  1. When the joint distribution approximates the normal bivariate distribution then the residuals demonstrate homoscedasticity

  2. When the average size of the residuals varies along the regression line, then the distribution is said to exhibit heteroscedasticity
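One common diagnostic can be sketched as follows: slice the x-axis into bands and compare the spread of the residuals band by band (made-up residuals for illustration; roughly equal S.D.s across bands suggest homoscedasticity):

```python
import statistics

# Made-up (x, residual) pairs; in homoscedastic data the spread of the
# residuals is roughly the same in every vertical slice of the scattergram
pairs = [(22, -5), (24, 7), (26, -3), (41, 6), (44, -8), (46, 4),
         (61, -6), (63, 5), (65, -4)]

def spread(lo, hi):
    """S.D. of the residuals whose x falls in [lo, hi)."""
    res = [e for x, e in pairs if lo <= x < hi]
    return statistics.pstdev(res)

for lo, hi in [(20, 30), (40, 50), (60, 70)]:
    print(f"x in [{lo}, {hi}): residual SD = {spread(lo, hi):.2f}")
```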

  1. Using the normal curve at a point on the regression line

  1. We can also, at any point on the regression line, take a slice and examine the univariate distribution. That is, holding the value of one variable constant, we can examine the distribution of the other variable.

  2. For example, we could choose all 20- to 25-year-olds and ask questions about their cholesterol level.

  3. As with any other univariate distribution we can calculate percentiles, only this time we use:

  1. For the average: the predicted regression score instead of the mean of the univariate distribution

  2. For the S.D.: the r.m.s. error (or standard error of the estimate) instead of the SD of the univariate distribution

  1. Notice this allows us, if we know the correlation between two variables and their means and standard deviations (only five pieces of information), to calculate an expected percentile for a value on one variable within a joint distribution. The value does not even need to be in our actual distribution. For example, we could ask what percent of 24.5-year-olds have a cholesterol level of 100 or less?

The mean age in the sample is 45.9, with SD of 13.6

The mean cholesterol level is 169.8, with SD of 56.3

The correlation is .86

So, 24.5-year-olds are (45.9 - 24.5)/13.6 = 1.57 S.D. below the average

Their cholesterol levels can be estimated as 0.86 × (-1.57) = -1.35 S.D., that is, 1.35 S.D. below the average

So their average cholesterol should be -1.35 × 56.3 = -76.1 points, or 169.8 - 76.1 = 93.6

So, the z score for this = (cholesterol of 100 - predicted average cholesterol)/(standard error of the estimate) = (100 - 93.6)/28.7 = .22

This z score (the nearest table entry is z = 0.20) corresponds to approximately 15.85 percent of expected observations falling between -z and +z, that is, 7.93 percent on each side of the mean, so

The percent of 24.5-year-olds who have a cholesterol level of 100 or less is estimated as 50 + 7.93 = 57.93%
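The whole worked example can be reproduced in a few lines. Using the exact normal CDF instead of a table lookup (which rounds z down to 0.20) gives about 58.8% rather than 57.93%:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Summary statistics from the notes
mean_age, sd_age = 45.9, 13.6
mean_chol, sd_chol = 169.8, 56.3
r = 0.86

age = 24.5
z_age = (age - mean_age) / sd_age                 # about -1.57
predicted_chol = mean_chol + r * z_age * sd_chol  # about 93.6
se = sd_chol * (1 - r ** 2) ** 0.5                # about 28.7

z = (100 - predicted_chol) / se                   # about 0.22
pct_below_100 = 100 * normal_cdf(z)
print(round(pct_below_100, 1))
```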