Due Friday, February 23
I.
Return to the Ozone data:
Below are the sample correlation and sample covariance
matrices for the ozone data predictor variables. They are in table form
here.
CORRELATIONS (Numeric Variables)
VARIABLES | TEMP | INVERSIONHT | PRESSURE | VISIBILITY | HEIGHT | HUMIDTY | TEMP2 | WINDSPEED |
TEMP | 1.00 | -0.58 | 0.41 | -0.27 | 0.76 | 0.46 | 0.84 | 0.29 |
INVERSIONHT | -0.58 | 1.00 | -0.05 | 0.37 | -0.54 | -0.33 | -0.84 | 0.07 |
PRESSURE | 0.41 | -0.05 | 1.00 | -0.07 | 0.03 | 0.71 | 0.07 | 0.42 |
VISIBILITY | -0.27 | 0.37 | -0.07 | 1.00 | -0.19 | -0.44 | -0.34 | 0.00 |
HEIGHT | 0.76 | -0.54 | 0.03 | -0.19 | 1.00 | 0.10 | 0.80 | 0.09 |
HUMIDTY | 0.46 | -0.33 | 0.71 | -0.44 | 0.10 | 1.00 | 0.31 | 0.35 |
TEMP2 | 0.84 | -0.84 | 0.07 | -0.34 | 0.80 | 0.31 | 1.00 | 0.10 |
WINDSPEED | 0.29 | 0.07 | 0.42 | 0.00 | 0.09 | 0.35 | 0.10 | 1.00 |
COVARIANCES (Numeric Variables)
VARIABLES | TEMP | INVERSIONHT | PRESSURE | VISIBILITY | HEIGHT | HUMIDTY | TEMP2 | WINDSPEED |
TEMP | 160.98 | -13410.14 | 165.61 | -269.87 | 780.78 | 127.30 | 130.55 | 8.25 |
INVERSIONHT | -13410.14 | 3344832.43 | -2716.72 | 53281.29 | -81108.33 | -13240.81 | -18889.12 | 290.19 |
PRESSURE | 165.61 | -2716.72 | 1029.90 | -180.01 | 87.17 | 490.83 | 27.97 | 29.71 |
VISIBILITY | -269.87 | 53281.29 | -180.01 | 6146.79 | -1200.22 | -753.59 | -329.02 | 0.46 |
HEIGHT | 780.78 | -81108.33 | 87.17 | -1200.22 | 6631.37 | 184.34 | 799.23 | 15.98 |
HUMIDTY | 127.30 | -13240.81 | 490.83 | -753.59 | 184.34 | 468.12 | 81.81 | 16.49 |
TEMP2 | 130.55 | -18889.12 | 27.97 | -329.02 | 799.23 | 81.81 | 149.74 | 2.68 |
WINDSPEED | 8.25 | 290.19 | 29.71 | 0.46 | 15.98 | 16.49 | 2.68 | 4.86 |
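For reference, matrices like these come straight from the raw data matrix. Here is a minimal Python sketch (Python is used for illustration only; the array below is a placeholder, since the raw ozone data are not reproduced in this handout):

import numpy as np

# Hypothetical stand-in for the raw ozone predictors: an n-by-8 array with columns in the
# order TEMP, INVERSIONHT, PRESSURE, VISIBILITY, HEIGHT, HUMIDTY, TEMP2, WINDSPEED.
rng = np.random.default_rng(0)
X = rng.normal(size=(330, 8))

cov = np.cov(X, rowvar=False)        # sample covariance matrix (divisor n-1)
corr = np.corrcoef(X, rowvar=False)  # sample correlation matrix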
a) Which matrix, correlation or covariance, should
you perform your principal components analysis on, and why?
b) Suppose you base your PCA (Principal Components
Analysis) on the covariance matrix. Which variable will have the largest weight
in the first principal component? Put differently, the first principal
component is a linear combination of these 8 variables. Which variable
will have the biggest coefficient in the linear combination? Why?
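One way to check your reasoning for (b): the principal components are the eigenvectors of whichever matrix you analyze, so a PCA based on the covariance matrix can be sketched as follows (Python used for illustration, not the course software; the matrix below is a placeholder):

import numpy as np

# S: the 8-by-8 sample covariance matrix tabulated above. A placeholder is used here;
# type in the tabulated values to reproduce the analysis.
S = np.eye(8)

eigvals, eigvecs = np.linalg.eigh(S)              # eigendecomposition of a symmetric matrix
order = np.argsort(eigvals)[::-1]                 # sort components by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvecs[:, 0])                              # coefficients of the first principal component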
Here is the output of a PCA on the correlation matrix:
FIT MEASURES
COMPONENT | Eigenvalue | Proportion | Cumulative Proportion |
PC1 | 3.68398 | 0.46050 | 0.46050 |
PC2 | 1.82334 | 0.22792 | 0.68841 |
PC3 | 1.07382 | 0.13423 | 0.82264 |
PC4 | 0.63570 | 0.07946 | 0.90210 |
PC5 | 0.42500 | 0.05313 | 0.95523 |
PC6 | 0.17152 | 0.02144 | 0.97667 |
PC7 | 0.15307 | 0.01913 | 0.99580 |
PC8 | 0.03358 | 0.00420 | 1.00000 |
COMPONENTS
VARIABLES | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
TEMP | 0.4751 | 0.0064 | 0.2366 | 0.0620 | -0.2652 | 0.0560 | 0.6718 | 0.4355 |
INVERSIONHT | -0.4056 | 0.2655 | 0.1807 | -0.1991 | -0.6500 | -0.2584 | 0.2481 | -0.3766 |
PRESSURE | 0.2097 | 0.5906 | 0.0501 | 0.4405 | -0.2067 | 0.5354 | -0.2240 | -0.1764 |
VISIBILITY | -0.2505 | -0.0100 | 0.7143 | 0.5183 | 0.3004 | -0.2604 | 0.0166 | 0.0025 |
HEIGHT | 0.3951 | -0.2880 | 0.3234 | -0.0762 | -0.4644 | -0.2077 | -0.6171 | 0.1043 |
HUMIDTY | 0.3180 | 0.4670 | -0.3239 | 0.1815 | 0.1092 | -0.7273 | -0.0406 | 0.0191 |
TEMP2 | 0.4740 | -0.2602 | 0.0899 | -0.0046 | 0.1680 | -0.0086 | 0.2154 | -0.7904 |
WINDSPEED | 0.1455 | 0.4602 | 0.4256 | -0.6746 | 0.3440 | 0.0588 | -0.0878 | 0.0367 |
c) Make a scree plot. How many principal components
should be retained according to this plot?
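If you would rather draw the scree plot with software than by hand, here is a minimal matplotlib sketch using the eigenvalues from the FIT MEASURES table above:

import matplotlib.pyplot as plt

# Eigenvalues of the correlation matrix, copied from the FIT MEASURES table.
eigenvalues = [3.68398, 1.82334, 1.07382, 0.63570, 0.42500, 0.17152, 0.15307, 0.03358]

plt.plot(range(1, 9), eigenvalues, marker="o")
plt.xlabel("Component number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()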
d) Can you place any physical interpretation on the
first PC? The second?
e) Verify that the length of PC1 is 1 (allowing for round-off error, of course).
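To check (e) numerically, sum the squares of the PC1 column from the loadings table; for example:

# PC1 coefficients, copied from the loadings table above.
pc1 = [0.4751, -0.4056, 0.2097, -0.2505, 0.3951, 0.3180, 0.4740, 0.1455]
print(sum(c ** 2 for c in pc1))  # should be close to 1, up to round-off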
f) Calculate the first PC loadings vector. (In other
words, rescale the eigenvector so that its length is sqrt(first eigenvalue).)
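A sketch of that rescaling, with the numbers taken from the tables above:

import math

eigenvalue_1 = 3.68398   # first eigenvalue, from the FIT MEASURES table
pc1 = [0.4751, -0.4056, 0.2097, -0.2505, 0.3951, 0.3180, 0.4740, 0.1455]

loadings_1 = [math.sqrt(eigenvalue_1) * c for c in pc1]  # eigenvector rescaled to length sqrt(eigenvalue)
print(loadings_1)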
g) Suppose we retain just the first two dimensions. The first observation in this data set was
80, 1298, 32, 40, 5860, 80, 75.2, 3, in the order
given in the tables above (i.e., TEMP is the first variable and WINDSPEED the last).
What is the score of this observation for the first two principal
components?
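Because this PCA was done on the correlation matrix, the observation must first be standardized (subtract each variable's sample mean, divide by its sample standard deviation) before taking its inner product with the PC coefficient vectors. A sketch, with the means and standard deviations left as placeholders since they are not reproduced in this handout:

import numpy as np

obs = np.array([80, 1298, 32, 40, 5860, 80, 75.2, 3])   # first observation, in the order of the tables

# Sample means and standard deviations of the eight predictors: placeholders here,
# compute them from the raw ozone data.
means = np.zeros(8)
sds = np.ones(8)

z = (obs - means) / sds                                   # standardized observation

pc1 = np.array([0.4751, -0.4056, 0.2097, -0.2505, 0.3951, 0.3180, 0.4740, 0.1455])
pc2 = np.array([0.0064, 0.2655, 0.5906, -0.0100, -0.2880, 0.4670, -0.2602, 0.4602])
print(z @ pc1, z @ pc2)                                   # scores on the first two principal components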
h) Make part of a bi-plot: plot each variable as
a vector in the space defined by the first two PCs. Which variables contribute
most to the first PC? Which to the second?
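A sketch of the variable half of a biplot, drawing each variable as an arrow from the origin to its (PC1, PC2) coefficients from the table above:

import matplotlib.pyplot as plt

names = ["TEMP", "INVERSIONHT", "PRESSURE", "VISIBILITY", "HEIGHT", "HUMIDTY", "TEMP2", "WINDSPEED"]
pc1 = [0.4751, -0.4056, 0.2097, -0.2505, 0.3951, 0.3180, 0.4740, 0.1455]
pc2 = [0.0064, 0.2655, 0.5906, -0.0100, -0.2880, 0.4670, -0.2602, 0.4602]

fig, ax = plt.subplots()
for name, x, y in zip(names, pc1, pc2):
    ax.arrow(0, 0, x, y, head_width=0.02, length_includes_head=True)
    ax.text(x, y, name)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
plt.show()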
II. For the following, assume Z(t) is a discrete white-noise time series with E(Z(t)) = 0, Var(Z(t)) = sigma-squared, and Cov(Z(t), Z(t+k)) = 0 for all k not equal to 0.
a) Find the auto-correlation function of the MA process given by
X(t) = Z(t) + 0.7 Z(t-1) - 0.2 Z(t-2)
b) Find the auto-correlation function -- rho(k) -- of the first-order AR process defined by
X(t) = 0.7 X(t-1) + Z(t)
Plot rho(k) for k = -6, -5, ..., -1, 0, 1, ..., 6.
c) Use the time-series
package in xlispstat to simulate the processes in (a) and (b). Make
plots of these processes against time. (Choose the menu item Series:
Create series.) See the note at the bottom of the page. A Python sketch of
the same simulations is given below the download links.
Mac version (a .sit file)
PC or Unix (a .zip file)
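If you cannot get the xlispstat package running, the same simulations can be sketched in Python (this is only a substitute for the assignment's intended software):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n + 2)                        # white noise Z(t) with sigma = 1

# (a) MA(2): X(t) = Z(t) + 0.7 Z(t-1) - 0.2 Z(t-2)
x_ma = z[2:] + 0.7 * z[1:-1] - 0.2 * z[:-2]

# (b) AR(1): X(t) = 0.7 X(t-1) + Z(t)
x_ar = np.zeros(n)
for t in range(1, n):
    x_ar[t] = 0.7 * x_ar[t - 1] + z[t + 2]

fig, axes = plt.subplots(2, 1, sharex=True)
axes[0].plot(x_ma)
axes[0].set_title("MA(2) simulation")
axes[1].plot(x_ar)
axes[1].set_title("AR(1) simulation")
axes[1].set_xlabel("t")
plt.show()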
d) Make your own time series! Get a coin and some graph paper. Put the origin in the lower-left corner of the page. The x-axis is "time", which in this case will really be "number of coin flips". Put a mark at (0, 0). Flip a coin. If it lands "heads", then y increases by 1. If it lands "tails", then y stays the same. So if I throw a heads on the first toss, I put a mark at (1,1). If I throw tails, I put the mark at (1,0). Connect the points. Continue for 30 flips or so.
a) What is the probability that your time series will reach (x,20) or beyond? (x is any integer > 0).
b) Suppose you flip many, many times. What's the probability you will reach (x,20) or beyond?
c) This is a "random walk" time series:
X(t) = X(t-1) + Z(t). What type of random variable is Z(t)?
d) Use the xlisp software to simulate this time series. Suppose you were to model it with a regression. What would be the values of the parameters?
e) Suppose, now, that your coin lands heads with
p = .25. Now what would be the values of the parameters in the regression?
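For those not using the xlisp software, here is a rough Python sketch of the coin-flip walk (p is the probability of heads; the function name is my own):

import numpy as np
import matplotlib.pyplot as plt

def coin_walk(n_flips, p, seed=0):
    """Random walk X(t) = X(t-1) + Z(t), where Z(t) is 1 with probability p (heads) and 0 otherwise."""
    rng = np.random.default_rng(seed)
    steps = rng.binomial(1, p, size=n_flips)
    return np.concatenate(([0], np.cumsum(steps)))   # start the walk at (0, 0)

plt.plot(coin_walk(30, 0.5), drawstyle="steps-post", label="p = 0.5")
plt.plot(coin_walk(30, 0.25), drawstyle="steps-post", label="p = 0.25")
plt.xlabel("number of flips")
plt.ylabel("X(t)")
plt.legend()
plt.show()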
Note: I admit, this is a bit risky. I'm not sure
how this will work on a non-Mac machine. If necessary, I can email
it to you, but the file is big. Still, if you click, here is what
should happen:
a) You should download a file called timeseries.sit
(Macintosh) or timeseries.zip
(other).
b) You should "unstuff" it with Stuffit Expander
or some other expanding software. (Note: some browsers will unstuff
automatically, in which case you can skip this step.)
c) Place it in the Vista folder.
d) You don't need to run Vista. Just double-click
on the xlstsp.lsp icon. This will load all of the appropriate
files.
These are just text files. So if worse comes to
worst, you can open up xlispstat (or ViSta) and load them in by cutting
and pasting. But there are many of them, and that would be tedious.
If worse comes to worst, skip those exercises; they are primarily to help give you some experience "seeing" the models.
1) A random walk is given by X(t) = X(t-1) + Z(t), with X(0) = 0. Show that X(t) = Z(1) + Z(2) + ... + Z(t).
2) Define the difference operator of order d to be Dd. (This is what was
written with an upside-down delta in class.) Hence
D1(X(t)) = X(t) - X(t-1) (first-order difference), and
D2(X(t)) = D1(D1(X(t))) = D1(X(t) - X(t-1)) = [X(t) - X(t-1)] - [X(t-1)
- X(t-2)] = X(t) - 2X(t-1) + X(t-2)
a) What is D3(X(t))?
b) Define W(t) = Dd(X(t)). Then W(t) = alpha1 W(t-1) + alpha2 W(t-2)
+ ... + alphap W(t-p) + Z(t) + Beta1 Z(t-1) + Beta2 Z(t-2) + ... + Betaq Z(t-q)
is the definition of an ARIMA(p,d,q) process. Show that a random
walk is an ARIMA(0,1,0) process.
c) What is another name for an ARIMA(p,0,0) process?
d) What is another name for an ARIMA(0,0,q) process?
3) Suppose X(t) = Z(t) + Beta Z(t-1). This is a moving average (MA) process. As we said in class, moving average processes can be difficult to estimate (there is no closed-form solution). As a result, some analysts prefer to rewrite MA processes as AR processes. Rewrite this as an AR process.