Economics 40/Statistics M11
Lecture 20


Regression (3.6, 13.1, 13.2)

A. Regression and Correlation again

  1.  Interpretation of regression in the context of correlation: a one SD change in X is ASSOCIATED WITH an r*SD change in Y
  2. Math facts:
    the slope of the regression line is b = r * ((SD of Y)/(SD of X))
    (Recall: the slope b measures the average observed change in Y when there is a unit change in X.)
    the intercept is the predicted value of Y when X = 0
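The slope formula can be run in reverse to recover r from a fitted slope. The sketch below uses the IBM / T-bill summary numbers quoted in Section D of these notes (slope -48.357, SD of Y 27.36662, SD of X .3623254); the variable names are illustrative only.

```python
# Summary statistics quoted in Section D for the IBM / T-bill data.
sd_y = 27.36662     # SD of Y (IBM price, in dollars)
sd_x = 0.3623254    # SD of X (90-day T-bill yield, in percentage points)
slope = -48.357     # fitted slope b

# b = r * (SD of Y / SD of X), so r = b * (SD of X / SD of Y).
r = slope * sd_x / sd_y

# A one-SD change in X is associated with an r * SD change in Y.
change_in_y = r * sd_y

print(round(r, 2))            # about -0.64
print(round(change_in_y, 2))  # dollars of change in Y per one-SD change in X
```

The sign of r always matches the sign of the slope, because both SDs are positive.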

     

B. Using the Regression Line again

    1. Prediction. What is of interest with a regression line is answering questions like "what would the price of IBM be if the yield on a 90-day T-Bill was 4.75% (currently about 4.5%)?" In other words, given the equation y = -48.357x + 361.39, what would the price of IBM be if the yield on the 90-day Treasury Bill were 4.75 instead of the 4.48 it was on March 1st (when IBM was trading at $171)? Solution: (-48.357*4.75)+361.39 = about 132 dollars/share, a drop of about $39.
    2. Interpretation. The slope tells you how much change on average to expect in Y if X changes: if the Treasury yield increased by a point, IBM would be expected to drop by 48.357 dollars. The intercept tells you what Y would be if X were equal to zero (sometimes this is nonsense): if the Treasury yield were zero, IBM would be expected to trade at $361.39 a share (there are problems with this result). Most applications of regression are interested in the slope.
    3. R-square. This is the square of the correlation r. It is defined as the fraction of the variation in values of Y that is explained by the regression of Y on X. In a sense, you could think of it this way: Y has some natural variability and now you are examining this variability in light of X. If Y and X change together, some of the variation in Y is accounted for by X. All r-square does is give you an idea of how much of the variation seen in Y is accounted for by X. So if you have a correlation of -.64 (like IBM and the T-bill yield; the sign is negative because the slope is negative), this means you have an r-square of .41, or about 41% of the variation in IBM (Y) can be accounted for by knowing the value of the current T-bill yield (X). In your text, r-square = 1 - (Se^2 / Sy^2) (page 121), where Sy^2 is the variance of Y. Se^2 is called the error variance or residual variance of Y: the error left over AFTER (think, residual error) a line is fit to the data.
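The prediction and r-square arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name predict_price is made up for illustration, and the correlation is entered as -0.64, with the sign inferred from the negative slope.

```python
SLOPE = -48.357      # dollars per percentage point of T-bill yield
INTERCEPT = 361.39   # predicted price at a yield of zero

def predict_price(yield_pct):
    """Predicted IBM price from the fitted line y = SLOPE * x + INTERCEPT."""
    return SLOPE * yield_pct + INTERCEPT

predicted = predict_price(4.75)
drop = 171 - predicted   # IBM was trading at $171

# Raising the yield by one point changes the prediction by exactly the slope.
one_point_effect = predict_price(5.75) - predict_price(4.75)

r = -0.64                # correlation; sign inferred from the negative slope
r_squared = r ** 2       # fraction of variation in Y accounted for by X

print(round(predicted, 2))   # about 131.69
print(round(drop, 2))        # about 39.31
print(round(r_squared, 2))   # about 0.41
```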

       

C. Residuals and potential problems

      Idea: The regression line is a summary. It is not expected to pass through every point, but it is the best fitting line: the one that minimizes the sum of the squared deviations in the vertical (y) direction.

      These deviations are called residuals. A residual is simply the difference between the actual y value for a given x value and the predicted y (based on the regression) for that x value. You will frequently see the residuals summarized as "the error variance", which is built from the squared residuals.

      1. Residuals can help you assess the "fit" of the regression line. In Chapter 13 the notation changes just a bit: the text introduces Y', which is just the actual value plus some residual.
      2. An advanced note: a discernible pattern in your residuals is a warning sign. Residuals plotted against the original X variable are supposed to show no clear pattern and should look like a random scattering of points; a pattern suggests the line is missing something systematic in the data.
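To make the idea of a residual concrete, here is a minimal sketch that fits a least-squares line by hand and computes the residuals. The five (x, y) pairs are invented purely for illustration and are not the IBM data. The same divisor is used for both variances, which is an assumption about convention; the text may divide differently.

```python
from math import sqrt
from statistics import mean, pvariance

# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Least-squares slope and intercept.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual = actual y minus predicted y, one per data point.
predicted = [intercept + slope * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

# r-square two ways: 1 - (residual variance / variance of Y), and r squared.
r_squared_from_residuals = 1 - pvariance(residuals) / pvariance(ys)
r = sxy / sqrt(sxx * syy)

print([round(e, 3) for e in residuals])
print(round(r_squared_from_residuals, 4), round(r ** 2, 4))
```

The residuals of a least-squares line always sum to zero (up to rounding), so what you look for in a residual plot is their pattern, not their total.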
      Outliers
        1. Outliers in the X variable (Treasury yields here) can strongly affect the regression line. An influential observation can radically change a regression equation when it is removed. The second graph of Monday's handout shows the regression equation and residual plot with the outliers removed.
        2. The outliers can show up in the residuals too.
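A small sketch of how one influential observation can change the regression line. The data are invented for illustration (they are not the handout data), and fit_line is a hypothetical helper written just for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data: five points on a rising line, plus one point
# that is far out in the X direction and does not follow the trend.
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 4, 6, 8, 10, 3]

slope_with, intercept_with = fit_line(xs, ys)
slope_without, intercept_without = fit_line(xs[:-1], ys[:-1])

print(round(slope_with, 3))     # slope with the outlier included
print(round(slope_without, 3))  # slope with the outlier removed
```

In this made-up example the single outlying point is enough to flip the sign of the slope, which is what makes an observation "influential".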
 
Confounding Variables
1. A confounder (or a lurking variable) is a variable that has an important effect on the relationship between two variables but, for whatever reason, is not being studied. This is a bad thing.
2. Time is usually a good candidate for a lurking variable. Examine the residual plot when the time dimension is brought in. Finding a lurking variable is more of an "art" than a science.
Extrapolation
Remember that the intercept answers the question "what would the value of Y be if X=0?" A problem with asking this question is that X = 0 may fall outside the range of our data, where we have no evidence that the line still holds. Beware of extrapolation.
Association and Causation
Just because a correlation is high (or an r-square is high) does not mean that X necessarily causes Y. There is always the possibility that a "lurking variable" exists that could completely change the relationship (remember the idea about firemen and fire damage -- they have a high association but attributing cause is a different matter entirely).
 
D. A Z-test for a regression
Once again, the interest is in testing hypotheses about a relationship between two variables. In general, the null hypothesis is that the slope m = 0 (the slope was called b above), meaning that there is no linear relationship between X and Y (if m is zero then r is zero). Your text uses a "5 step method" to produce an experimental mean of all possible slopes (by permuting the data, at the beginning of Chapter 13). For us, we will work with the theoretical mean of all possible slopes, which is m = 0, and then test our particular sample slope against this.
There is a basic form of the test:
 
Z = (sample slope - 0) / slope SD
 
where slope SD is (Sy/Sx) * (1/root(n)), n is the number of pairs, and Sy and Sx are the standard deviations of the Y and X variables. Again, we have to rely on sample statistics to get the information we need to perform this test. Once you obtain Z, it's a straight look-up from the table.
From the handout, the slope of the original data (with 4 outliers included) was -48.357 (what would the interpretation of this number be?). This can be used as the sample slope. There are 54 pairs of numbers. The slope SD is:
(27.36662/.3623254) * (1/root(54))
which is about 10.28. The resulting Z is -48.357/10.28, or about -4.7. This is way off the charts, and the interpretation is that the slope, the change in Y for every unit change in X, is significantly different from zero. This suggests that there is a relationship between the two; whether it is truly meaningful or useful is a different question.
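The Z-test arithmetic can be checked with a short Python sketch. NormalDist from the standard library stands in for the Z table here; the two-sided p-value is an added assumption, since the notes only say to look Z up in the table.

```python
from math import sqrt
from statistics import NormalDist

sample_slope = -48.357   # slope from the handout (outliers included)
sd_y = 27.36662          # Sy, SD of the Y variable
sd_x = 0.3623254         # Sx, SD of the X variable
n = 54                   # number of (X, Y) pairs

# slope SD = (Sy / Sx) * (1 / root(n)), as defined in these notes.
slope_sd = (sd_y / sd_x) * (1 / sqrt(n))

# Z = (sample slope - 0) / slope SD, testing against a null slope of 0.
z = (sample_slope - 0) / slope_sd

# Two-sided tail area under the standard normal curve (assumed two-sided).
p_two_sided = 2 * NormalDist().cdf(-abs(z))

print(round(slope_sd, 2))   # about 10.28
print(round(z, 1))          # about -4.7
print(p_two_sided)          # far below any conventional cutoff
```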