Homework: Regression
We'll practice on a pre-packaged data set. Type
data(airquality)
attach(airquality)
help(airquality)
from within R. The first command loads the dataset, the second lets you easily refer to variables by name, the third describes the data set.
For the time being, we'll ignore the "Day" and "Month" variable. These variables suggest we apply a time-series approach to these data, and we'll do this later.
For the first part of this exercise, we'll focus just on the relationship between Temp and Ozone.
1. Plot Ozone v. Temp and describe the relationship. It is helpful to add a "smoother" that picks up the general trend. The lowess(x,y) command will do so, but if you try it you'll note a problem: there are missing values in the Ozone variable.
Missing values in R are coded with NA. They are approximately harmless if they are missing at random (which, loosely speaking, means that the reason for their missing has nothing to do with the phenomenon being studied). R usually just drops them from calculations (unless you specify otherwise). Some R functions, such as "plot", handle missing values naturally and invisibly. Others, such as "lowess", return error messages.
Here are some useful functions for dealing with missing values:
is.na(x) returns a vector of T's and F's (true and false). T if the value is missing, F if not.
!is.na(x) is the negation of this. It returns T if present, F if missing. Sometimes easier to think this way.
To put a smoother ontop of your plot, then, type
lines(lowess(Temp[!is.na(Ozone)], Ozone[!is.na(Ozone)]))
Awkward, but it gets the job done.
Let's parse this: !is.na(Ozone) returns T if Ozone is NOT missing. Thus Temp[!is.na(Ozone)] selects only those values for Temp for which Ozone is not missing. (If Temp were also missing values, we would need an "and" to remove missing values from both variables.) This causes a problem, in that Temp[!is.na(Ozone)] is now shorter than Ozone. So we must also include Ozone[!is.na(Ozone)] in the command to make sure both vectors have the same length.
lowess returns the smoothed x and y values. The "lines" command super-imposes a line on an already existing scatterplot. The "points" command superimposes points on an already existing scatterplot.
2. Regress Ozone on Temp. Examine the residuals and report on how well the assumptions of the model hold. (You'll have missing value problems again.) The function resid(lm object) returns the residuals with missing values removed. So to plot the resids against Temp, type plot(Temp[!is.na(Ozone)], resid(lmobject)).
3. Fit a quadratic: Ozone = a + b*Temp + c*Temp^2. Do the residuals look better?
To do this, you need to create a new data-frame that includes the square of Temp as a variable:
Tempsq <- Temp^2
new <- data.frame(Ozone, Temp, Tempsq)
quadratic <- lm(Ozone ~ Temp + Tempsq, new)
4. Explore transforms of the predictor and response. Stick to square-root and log. Which best satisfies the assumptions of the model?
5. Once you've found a satisfactory fit (or at least as good as you can), interpret the model. What does it say about the relation between Temperature and Ozone?
update: a useful feature
Suppose you've fit a full model like so
full <- lm(y ~ x1 + x2 + x3)
And now you wish to add or subtract or transform a variable. Here are some commands that do this fairly quickly:
first <- update(full, .~.+x4)
This fits y ~ x1 + x2 + x3 + x4 that is, it adds a variable.
second <- update(full, .~.-x2)
removes x2: y ~ x1 + x3
third <- update(full, log(.) ~ .)
transforms the LHS: log(y) ~ x1 + x2 + x3
6. We now wish to fit a model including Solar Radiation, Temperature and Wind. Use the pairs command to see all of these at once.
a) Which variables do you think will be the best predictors?
b) Which pairs of predictors are related?
Note on pairs command:
You can add smoothers by including an extra call within the function:
pairs(airquality, panel= panel.smooth)
If you want to make a panel including functions of variables, you need to put them into a matrix. The command cbind(x,y,z) "binds" these vectors column-wise, and so you get an nX3 matrix with the first column equal to x, second to y, so on. You'll need this because "pairs" requires a matrix or a data-object as its input.
7. Fit the log of the Ozone against the three predictors. Examine the residuals. How well does the model fit? Are there any candidates for influential points?
8. Are there are any influential points?
9. A reporter from the local weather station calls and asks you some questions. Answer them using your model:
It's going to be 10 degrees warmer tomorrow. How will the ozone change?
But what if the wind blows, will it still change by that amount?
Suppose that tomorrow the Solar Radiation will be 100 units, The Temperature 68 degrees, and the Wind 7.9 mph. What ozone levels do you predict? What size error will there be in this prediction?
10. Download the big mac data from
http://www.stat.ucla.edu/~rgould/datasets/bigmac.dat
and load it into R. (The first line has variable names, so set header = T in the read.table function.)
A short description appears below. The basic idea is that economic security of a country can be fairly well summarized by the price of a Big Mac. Based on this data, what do you think?
Data taken from Rudolf Enz, "Prices and Earnings Around
the Globe",
1991 edition, Published by the Union Bank of Switzerland. The data
give average values in 1991 on several economic indicators for 45
world cities. All prices are in US dollars, using currency conversion
at the time of publication.
Name Type n Info
BigMac Variate 45 Min labor to buy a BigMac and fries
Bread Variate 45 Min labor to buy 1 kg bread
BusFare Variate 45 Lowest cost of 10k public transit
EngSal Variate 45 Electrical eng annual salary, 1000s
EngTax Variate 45 Tax rate paid by engineer
Service Variate 45 Annual cost of 19 services
TeachSal Variate 45 Primary teacher salary, 1000s
TeachTax Variate 45 Tax rate paid by primary teacher
VacDays Variate 45 Ave days vacation per year
WorkHrs Variate 45 Ave hours worked per year
City Text 45 Name of
city