Introduction to Statistical Methods for the Life and Health Sciences
Purpose: This lab is a review that gives you practice reading Stata output.
Data: We will use some car leasing data. To get this data, you'll need to issue the following command in Stata:
There should be 2,849 observations. Use the summarize command to generate summary statistics for each of the 20 variables in the dataset. For the rest of the lab, use only the variables for which summarize reports a valid mean and standard deviation; string variables show zero observations and no statistics.
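As a quick check on the step above, the following is a minimal sketch, assuming the leasing data are already loaded into memory (the loading command itself is not repeated here). It confirms the number of observations and shows which variables are numeric.

```stata
* Assumes the leasing data are already in memory.

* Number of observations; this handout expects 2,849.
count

* Storage types and labels; str# variables are strings.
describe

* Summary statistics; string variables report 0 observations.
summarize

* List only the numeric variables (candidates for the questions below).
ds, has(type numeric)
```

If count reports something other than 2,849, the data probably did not load completely, so reload before going on.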
The variables are, for your information:
 1. serv_bnk   byte     %8.0g    Region of the country
 2. full_mdl   int      %8.0g    Model Year
 3. mdl_seri   str8     %9s      Car Model
 4. mdl_id     str6     %9s      Car Model Identifier
 5. vhcl_mfg   str5     %9s      Manufacturer
 6. orig_msr   long     %12.0g   Manufacturer's Suggested Retail Price
 7. ls_trmn_   long     %12.0g   Lease Residual Value
 8. ins_rsdu   long     %12.0g   Insured Residual Value
 9. ls_cntr_   str8     %9s      Lease Start Date
10. vhcl_sol   str7     %9s      Date Vehicle Sold
11. ls_mtry_   str8     %9s      Date Contract Terminated
12. ls_term_   byte     %8.0g    Lease Term (in months)
13. pur_cat_   str15    %15s     Purchaser Type
14. mile_lim   long     %12.0g   Mileage Allowed Per Year
15. vhcl_sls   long     %12.0g   Final Vehicle Sales Price
16. mile_end   long     %12.0g   Actual Mileage
17. wear_tea   int      %8.0g    Dollar cost of wear and tear
18. excs_mil   int      %8.0g    Excessive Mileage Charge
19. abbr_sta   str2     %9s      State
20. rmn_ls_m   byte     %8.0g    Remaining months on lease
                                 (< 0 = exceeded lease, 0 = ended on time, > 0 = returned early)
If you tabulate or list a variable, you will get a fairly good idea of how to interpret it. The codebook command in Stata is also useful (try it).
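For example, the commands below sketch one way to explore a few variables. They assume the leasing data are already in memory; the particular variables are only illustrations, so substitute any from the list above.

```stata
* Assumes the leasing data are already in memory.

* Frequency table for a categorical variable (purchaser type).
tabulate pur_cat_

* Print the first 10 observations of a few variables.
list vhcl_mfg full_mdl orig_msr vhcl_sls in 1/10

* Type, range, number of missing values, and example values.
codebook rmn_ls_m
```

tabulate is most readable for variables with few distinct values; for something like mile_end, which can take thousands of different values, summarize or codebook is usually the better first look.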
Choose 3 or 4 of the numeric variables, and make sure you know how to answer the following questions for each of your chosen variables:
1. Create a histogram (graph) for the variable, and say whether the mean is larger than, smaller than, or about the same as the median.
2. Create a boxplot (graph variable, box) for the variable, and estimate the median and quartiles. Are there any outliers?
3. What are the mean and standard deviation? (summarize or summarize variable, detail) Give a five-number summary for each variable (recall that a five-number summary is the minimum, Q1, median, Q3, and maximum). What is the IQR? What are the 10th and 90th percentiles?
4. Would it be better to use more robust measures (e.g. median, interquartile range) to summarize some of the variables? If yes, say which variables are good candidates for more robust measures and explain why. If no, explain why the mean and standard deviation are appropriate.
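The sketch below walks through questions 1 to 3 for a single variable. It assumes the leasing data are already in memory and uses the histogram and graph box commands of Stata 8 or later (older versions use the graph syntax shown in the questions). The choice of vhcl_sls, the scalar names beginning with s_, and the 1.5 x IQR cutoff for flagging outliers are illustrative assumptions, not requirements of the lab.

```stata
* Assumes the leasing data are already in memory and Stata 8 or later.
* vhcl_sls (final vehicle sales price) is just one possible choice.

* Question 1: shape of the distribution.
histogram vhcl_sls, frequency

* Question 2: box plot showing median, quartiles, and outside values.
graph box vhcl_sls

* Question 3: mean, SD, and percentiles (1, 5, 10, 25, 50, 75, 90, 95, 99).
summarize vhcl_sls, detail

* summarize leaves its results in r(); copy them into scalars
* before another command overwrites them.
scalar s_min = r(min)
scalar s_q1  = r(p25)
scalar s_med = r(p50)
scalar s_q3  = r(p75)
scalar s_max = r(max)
scalar s_iqr = s_q3 - s_q1
scalar s_gap = r(mean) - r(p50)

display "Five-number summary: " s_min ", " s_q1 ", " s_med ", " s_q3 ", " s_max
display "IQR: " s_iqr
display "Mean minus median: " s_gap

* Count values beyond the 1.5 x IQR fences (a common outlier rule).
* Stata treats missing values as larger than any number, so exclude them.
scalar s_lo = s_q1 - 1.5*s_iqr
scalar s_hi = s_q3 + 1.5*s_iqr
count if (vhcl_sls < s_lo | vhcl_sls > s_hi) & !missing(vhcl_sls)
```

A clearly positive mean minus median usually goes with a long right tail in the histogram, and a clearly negative one with a long left tail. Either pattern, or a large count from the last command, is a reason to consider the median and IQR when answering question 4.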