Introduction to Statistical Methods for the Life and Health Sciences
This lab involves simple and multiple regression.
Simple Linear Regression
Data: We will use the same data on the S & P 500 that we used previously; it is at http://www.stat.ucla.edu/~darlene/datasets/sp500.dta
First, use the correlate function to obtain the correlation matrix for the continuous variables (you will have to figure out which variables those are). For example, if there are 3 variables called var1, var2, and var3, you would type:
correlate var1 var2 var3
The correlation of any variable with itself is 1; all other correlations are given in the matrix. The TA will help you to read the correlation matrix.
You should find that the largest correlation is between netincom and marketca. Make a scatterplot with netincom on the y-axis (in the graph command, the last variable listed goes on the x-axis):
graph netincom marketca
Describe the trend in the plot: how are the two variables related? Would you say this is (roughly) a linear relationship?
We can quantify the linear relationship with a (least squares) regression. (We can compute this whether or not the relationship is really linear, but if it is not linear, the regression line may describe the data very poorly.) We will start to cover this topic in lecture tomorrow, but you will get a feel for it in the lab today. Don't worry that some of the commands may seem a bit mysterious, or that Stata gives us a lot more information than we are ready for. We would like to predict netincom from marketca. To do the regression, type:
regress netincom marketca
Note that the first variable is the response variable (the one we want to predict); the second is the predictor, or explanatory variable. Look in the column headed 'Coef' to find the least squares intercept and slope: the row labeled _cons gives the intercept, and the row labeled marketca gives the slope. The TA will help you to write the equation of the line from this information. To graph the line on top of the scatterplot, type:
quietly regress netincom marketca
predict pnetinc
graph netincom pnetinc marketca, s(oi) c(.l)
Here is how that is supposed to work. The first command performs a regression that computes the slope and intercept of the regression line. The next command calculates the predicted values for netincom. These predicted values all fall on the regression line. The last command does the actual graphing. The command plots netincom and pnetinc versus marketca. The s(oi) sets the symbols for the plot so that netincom versus marketca is done with circles (the o option) and pnetinc versus marketca uses no symbol (the i for invisible option). The c(.l) option controls how the points are connected, so that netincom versus marketca is not connected (the . option) and pnetinc versus marketca is connected with a line (the l option).
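The graph syntax above is the one used by Stata 7 and earlier. As a rough sketch for anyone working in Stata 8 or later, where graph takes a different syntax: the old command is still available under the name graph7, and twoway should give an equivalent picture. The five observations below are invented for illustration and are not from sp500.dta; only the variable names follow the lab.

```stata
* Toy data, invented for illustration (NOT from sp500.dta)
clear
input marketca netincom
10 1.0
20 2.1
30 2.9
40 4.2
50 4.8
end

* Fit the line and store the predicted values
quietly regress netincom marketca
predict pnetinc

* Old-style command, run under its Stata 8+ name
graph7 netincom pnetinc marketca, s(oi) c(.l)

* Same idea in the newer syntax: points for the data, a line for the fit
twoway (scatter netincom marketca) (line pnetinc marketca, sort)
```

The sort option draws the fitted line from left to right even if the observations are not ordered by marketca.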
What is the interpretation of this regression line? Is marketca a good predictor of netincom? You may also use the value of the correlation coefficient to help answer this.
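One way to connect the correlation coefficient to the regression is sketched below. It is only an illustration under stated assumptions: the five observations are invented (not taken from sp500.dta), and only the variable names follow the lab. With a single predictor, the square of the correlation equals the R-squared that regress reports, and the fitted line can be printed from the stored coefficients _b[_cons] and _b[marketca].

```stata
* Toy data, invented for illustration (NOT from sp500.dta)
clear
input marketca netincom
10 1.0
20 2.1
30 2.9
40 4.2
50 4.8
end

* Save the correlation between the two variables
quietly correlate netincom marketca
scalar rho = r(rho)

* Fit the least squares line
quietly regress netincom marketca

* Write out the equation of the line from the stored coefficients
display "netincom = " _b[_cons] " + " _b[marketca] " * marketca"

* With one predictor, correlation squared matches R-squared
display "correlation squared = " scalar(rho)^2 "   R-squared = " e(r2)
```

An R-squared near 1 means the line accounts for most of the variation in netincom; a value near 0 means it accounts for very little.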
Multiple Linear Regression
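For this part, we will use car leasing data. Load it with the command below (the clear option discards the S & P 500 data currently in memory):
use "http://www.stat.ucla.edu/~darlene/datasets/honda.dta", clear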
There should be 2,849 observations. Use the "summarize" command to generate summary statistics for each of the 20 variables in the dataset.
The variables are, for your information:
    Name      Type   Format  Label
 1. serv_bnk  byte   %8.0g   Region of the country
 2. full_mdl  int    %8.0g   Model Year
 3. mdl_seri  str8   %9s     Car Model
 4. mdl_id    str6   %9s     Car Model Identifier
 5. vhcl_mfg  str5   %9s     Manufacturer
 6. orig_msr  long   %12.0g  Manufacturer's Suggested Retail Price
 7. ls_trmn_  long   %12.0g  Lease Residual Value
 8. ins_rsdu  long   %12.0g  Insured Residual Value
 9. ls_cntr_  str8   %9s     Lease Start Date
10. vhcl_sol  str7   %9s     Date Vehicle Sold
11. ls_mtry_  str8   %9s     Date Contract Terminated
12. ls_term_  byte   %8.0g   Lease Term (in months)
13. pur_cat_  str15  %15s    Purchaser Type
14. mile_lim  long   %12.0g  Mileage Allowed Per Year
15. vhcl_sls  long   %12.0g  Final Vehicle Sales Price
16. mile_end  long   %12.0g  Actual Mileage
17. wear_tea  int    %8.0g   Dollar cost of wear and tear
18. excs_mil  int    %8.0g   Excessive Mileage Charge
19. abbr_sta  str2   %9s     State
20. rmn_ls_m  byte   %8.0g   Remaining months on lease
                             (< 0 exceeded lease, 0 = ended on time, > 0 returned early)
If you tabulate or list a variable, you will get a fairly good idea of its interpretation. Ask for assistance if you cannot figure it out from the description given above and from a listing of the variable itself.
The codebook command in Stata is also useful. Try it.
The goal of this part is to model the variable vhcl_sls (in other words, treat this variable as the y-variable or the dependent variable) using the other variables as predictors. There is no one right way of doing this. What is important is to treat this as a learning "adventure" and to put scatterplots, correlations, and regression to work for you.
1. Generate some scatterplots. You can do this with the graph command, specifically:
gr vhcl_sls orig_msr
This will generate a scatterplot with the original manufacturer's suggested retail price of the car on the x-axis and the final sales price of the automobile (after the lease expired) on the y-axis. A scatterplot can give you a sense of the relationship between two variables.
Please note that you can only generate scatterplots for numeric variables in Stata. You can't create a scatterplot of sales price by date of contract unless you first convert the date variable, which is stored as text, to a numeric measure. (That is an advanced topic outside the scope of this course, but feel free to try it if you really wish.)
2. Generate some correlations. Ask yourself: what is the interpretation of the correlation coefficient, and how is it related to a regression model? Again, vhcl_sls must be included in the correlations. You can generate a correlation matrix in Stata by issuing a command like:
correlate ins_rsdu orig_msr ls_trmn_ ls_term_ mile_lim vhcl_sls
This will produce the lower triangle of a matrix. Each value is the correlation between the variable named on its row and the variable named on its column. The main diagonal will have "1.0000" in each cell, representing a variable's correlation with itself. You learned how to read the correlation matrix in lab last week; this should serve as a review.
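To see why only the lower triangle is printed, here is a small sketch. The six rows are invented for illustration and are not from honda.dta; only the variable names follow the lab. It relies on correlate leaving the full matrix behind in r(C), and the matrix name C is just a label chosen for this example.

```stata
* Toy data, invented for illustration (NOT from honda.dta)
clear
input vhcl_sls orig_msr ls_term_
15200 21000 24
14100 20500 36
12900 19000 36
11800 18500 48
13500 19800 24
12200 19200 48
end

* Print the correlation matrix (lower triangle only)
correlate vhcl_sls orig_msr ls_term_

* Copy the full matrix that correlate leaves behind
matrix C = r(C)
matrix list C

* Row 2, column 1 and row 1, column 2 hold the same correlation
display C[2,1] "   " C[1,2]
```

Because the correlation of vhcl_sls with orig_msr is the same as the correlation of orig_msr with vhcl_sls, the upper triangle would only repeat the lower one.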
3. Construct a regression model which predicts vhcl_sls from at least two of the other variables in the dataset. For example:
regress vhcl_sls ls_term_ wear_tea
This will give you regression coefficients and other statistics. Make sure (hint hint) that you know how to write down the regression equation from the Stata output. You also need to be able to interpret the coefficients. What do the other variables tell you about the vehicle sales price? In other words, what is the relationship (using the example above) between the length of the lease term and the final sales price of the car? What about wear and tear?
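As a hedged sketch of how the regression equation can be read off the output, the commands below fit the example model and rebuild the predicted values by hand. The six rows are invented for illustration and are not from honda.dta; pvhcl and byhand are hypothetical variable names chosen for this example.

```stata
* Toy data, invented for illustration (NOT from honda.dta)
clear
input vhcl_sls ls_term_ wear_tea
15200 24 100
14100 36 250
12900 36 600
11800 48 400
13500 24 800
12200 48 150
end

* Fit the two-predictor model
regress vhcl_sls ls_term_ wear_tea

* Write out the equation from the stored coefficients
display "vhcl_sls = " _b[_cons] " + " _b[ls_term_] "*ls_term_ + " _b[wear_tea] "*wear_tea"

* Predicted values from Stata, and the same values computed from the equation
predict double pvhcl
generate double byhand = _b[_cons] + _b[ls_term_]*ls_term_ + _b[wear_tea]*wear_tea
list vhcl_sls pvhcl byhand
```

The coefficient on ls_term_ is the estimated change in vhcl_sls for one additional month of lease term with wear_tea held fixed; the coefficient on wear_tea is read the same way, with ls_term_ held fixed.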
Come up with your own model and your own interpretation. Remember, there is no "correct" answer. This is just a learning exercise, so try other variables, more variables, and so on. Good luck!