Homework 6

Due Friday, February 23

I.

Return to the Ozone data:

Below are the sample correlation and sample covariance matrices for the ozone data predictor variables. They are in table form here.
 
 

CORRELATIONS (Numeric Variables)

VARIABLES
 
OBSERVATIONS  TEMP  INVERSIONHT  PRESSURE  VISIBILITY  HEIGHT  HUMIDTY  TEMP2  WINDSPEED
TEMP  1.00  -0.58  0.41  -0.27  0.76  0.46  0.84  0.29
INVERSIONHT  -0.58  1.00  -0.05  0.37  -0.54  -0.33  -0.84 0.07
PRESSURE  0.41  -0.05  1.00  -0.07  0.03  0.71  0.07  0.42
VISIBILITY  -0.27  0.37  -0.07  1.00  -0.19  -0.44  -0.34  0.00
HEIGHT  0.76 -0.54  0.03  -0.19  1.00  0.10  0.80  0.09
HUMIDTY  0.46 -0.33  0.71 -0.44  0.10  1.00  0.31  0.35
TEMP2  0.84  -0.84  0.07 -0.34 0.80  0.31  1.00  0.10
WINDSPEED  0.29  0.07  0.42  0.00  0.09  0.35  0.10  1.00

 

COVARIANCES (Numeric Variables)

VARIABLES
 
OBSERVATIONS  TEMP INVERSIONHT  PRESSURE  VISIBILITY  HEIGHT  HUMIDTY  TEMP2  WINDSPEED 
TEMP  160.98  -13410.14  165.61  -269.87  780.78  127.30  130.55  8.25
INVERSIONHT  -13410.14  3344832.43  -2716.72  53281.29  -81108.33  -13240.81  -18889.12  290.19
PRESSURE  165.61 -2716.72  1029.90  -180.01  87.17  490.83  27.97  29.71
VISIBILITY  -269.87  53281.29  -180.01  6146.79  -1200.22  -753.59  -329.02  0.46
HEIGHT  780.78  -81108.33  87.17  -1200.22  6631.37  184.34  799.23  15.98
HUMIDTY  127.30  -13240.81  490.83  -753.59  184.34  468.12  81.81  16.49
TEMP2  130.55  -18889.12  27.97  -329.02  799.23  81.81  149.74  2.68
WINDSPEED  8.25  290.19  29.71  0.46  15.98 16.49  2.68  4.86

 

a) Which matrix, correlation or covariance, should you perform your principal components analysis on, and why?
 
 

b) Suppose you base your PCA (Principal Components Analysis) on the covariance matrix. In the first principal component, which variable will have the biggest value? Put differently, the first principal component is a linear combination of these 8 variables. Which variable will have the biggest coefficient in the linear combination? Why?
 
 

Here's the output of a PCA on the Correlations:

FIT MEASURES
 
COMPONENTS  E-Value  Prop.  CumProp 
PC1  3.68398  0.46050  0.46050
PC2  1.82334  0.22792  0.68841
PC3  1.07382  0.13423  0.82264
PC4  0.63570  0.07946  0.90210
PC5  0.42500  0.05313  0.95523
PC6  0.17152  0.02144  0.97667
PC7  0.15307  0.01913  0.99580
PC8  0.03358  0.00420  1.00000

COMPONENTS
 
VARIABLES  PC1 PC2  PC3  PC4  PC5 PC6  PC7  PC8 
TEMP  0.4751  0.0064  0.2366  0.0620  -0.2652 0.0560  0.6718  0.4355
INVERSIONHT  -0.4056 0.2655 0.1807  -0.1991  -0.6500  -0.2584  0.2481 -0.3766
PRESSURE  0.2097  0.5906  0.0501 0.4405  -0.2067  0.5354 -0.2240  -0.1764
VISIBILITY  -0.2505  -0.0100  0.7143  0.5183  0.3004  -0.2604  0.0166  0.0025
HEIGHT  0.3951 -0.2880  0.3234  -0.0762  -0.4644  -0.2077  -0.6171  0.1043
HUMIDTY  0.3180  0.4670  -0.3239  0.1815 0.1092  -0.7273 -0.0406  0.0191
TEMP2  0.4740  -0.2602  0.0899 -0.0046 0.1680  -0.0086  0.2154 -0.7904
WINDSPEED  0.1455  0.4602  0.4256 -0.6746  0.3440 0.0588  -0.0878 0.0367
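
The Prop. and CumProp columns in the FIT MEASURES table follow directly from the eigenvalues: each proportion is an eigenvalue divided by the sum of all eigenvalues, which is approximately 8 here because the analysis is on the correlation matrix of 8 variables. A minimal Python sketch of that arithmetic, using only the printed eigenvalues (the function name is illustrative, not part of the course software):

```python
from itertools import accumulate

# Eigenvalues copied from the E-Value column of the FIT MEASURES table.
EIGENVALUES = [3.68398, 1.82334, 1.07382, 0.63570,
               0.42500, 0.17152, 0.15307, 0.03358]

def variance_proportions(eigenvalues):
    """Return (proportions, cumulative proportions) for a list of eigenvalues."""
    total = sum(eigenvalues)
    props = [ev / total for ev in eigenvalues]
    return props, list(accumulate(props))

props, cum_props = variance_proportions(EIGENVALUES)
for i, (ev, p, c) in enumerate(zip(EIGENVALUES, props, cum_props), start=1):
    print(f"PC{i}  {ev:.5f}  {p:.5f}  {c:.5f}")
```

Plotting the eigenvalues against the component number (1 through 8) gives the scree plot.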

 

c) Make a scree plot. How many principal components should be retained according to this plot?
 
 

d) Can you place any physical interpretation on the first PC? The second?
 
 
 
 

e) Verify that the length of PC1 is 1 (with allowances for round-off error of course).
 
 

f) Calculate the first PC Loadings vector. (In other words, find the eigenvector of length sqrt(first eigenvalue).)
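
The two operations in (e) and (f), measuring the length of a component vector and rescaling it to length sqrt(eigenvalue), can be sketched as follows. To leave the exercise intact, the sketch uses the PC2 column rather than PC1; the helper names are made up for illustration.

```python
import math

# PC2 coefficients copied from the COMPONENTS table (TEMP ... WINDSPEED).
PC2 = [0.0064, 0.2655, 0.5906, -0.0100, -0.2880, 0.4670, -0.2602, 0.4602]
EIGENVALUE_2 = 1.82334  # E-Value for PC2 in the FIT MEASURES table

def vector_length(v):
    """Euclidean length: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for x in v))

def loadings(eigenvector, eigenvalue):
    """Rescale a unit-length eigenvector to length sqrt(eigenvalue)."""
    scale = math.sqrt(eigenvalue)
    return [scale * x for x in eigenvector]

print(round(vector_length(PC2), 3))
print(round(vector_length(loadings(PC2, EIGENVALUE_2)), 3))
```

The first printed length should be 1 up to round-off in the table, and the second should match the square root of the eigenvalue.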
 
 
 
 

g) Suppose we retain just the first two dimensions. The first observation in this data set was

80, 1298, 32, 40, 5860, 80, 75.2, 3 in the order given in the matrix above. (i.e. Temp is the first, and windspeed the last, variable.) What is the score of this observation for the first two principal components?
 
 
 
 

h) Make part of a bi-plot: plot each variable as a vector in the space defined by the first two PCs. Which variables contribute most to the first PC? Which to the second?
 
 

II. For the following, assume Z(t) is a discrete white-noise time series with E(Z(t)) = 0, Var(Z(t)) = sigma-squared, and Cov(Z(t), Z(t + k)) = 0 for all k not equal to 0.

a) Find the auto-correlation function of the MA process given by

X(t) = Z(t) + 0.7 Z (t-1) - 0.2Z(t-2)
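
For checking hand calculations, here is a hedged Python sketch of the general MA(q) autocorrelation. It assumes the standard result that, for X(t) = Z(t) + beta1 Z(t-1) + ... + betaq Z(t-q), the autocovariance at lag k is sigma-squared times the sum of products of coefficients k apart, and zero for k > q. The function name is hypothetical, and the example uses a different process from the one in the exercise.

```python
def ma_acf(betas, max_lag):
    """Theoretical autocorrelations rho(0..max_lag) of an MA(q) process.

    betas are the coefficients on Z(t-1), ..., Z(t-q); the coefficient on
    Z(t) is taken to be 1. sigma-squared cancels in the ratio.
    """
    coeffs = [1.0] + list(betas)
    q = len(coeffs) - 1

    def gamma_over_sigma2(k):
        # Autocovariance at lag k divided by sigma-squared; zero beyond lag q.
        if k > q:
            return 0.0
        return sum(coeffs[i] * coeffs[i + k] for i in range(q - k + 1))

    gamma0 = gamma_over_sigma2(0)
    return [gamma_over_sigma2(k) / gamma0 for k in range(max_lag + 1)]

# Illustrative MA(1) with coefficient 0.5 (not the process in the exercise).
print(ma_acf([0.5], 3))
```

Since rho(-k) = rho(k), the negative lags need no separate calculation.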
 
 

b) Find the auto-correlation function -- rho(k) -- of the first-order AR process defined by

X(t) = 0.7 X (t - 1) + Z(t)

Plot rho(k) for k = -6, -5, ... -1, 0, 1, ..., 6

c) Use the time-series package in xlispstat to simulate the processes in (a) and (b). Make plots against time of these processes. (Choose the menu item Series: Create series.) See the note at the bottom of the page.
    Mac version (a .sit file)
    PC or unix (a "zip" file)
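
If the xlispstat package will not run on your machine, the same two processes can be simulated with a few lines of Python. This is only a sketch under stated assumptions: the white noise is taken to be Gaussian (the problem does not require that), and the function names are invented for illustration.

```python
import random

def white_noise(n, sigma=1.0, seed=None):
    """n independent Normal(0, sigma^2) draws (Gaussian noise is an assumption)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]

def simulate_ma(betas, z):
    """X(t) = Z(t) + betas[0] Z(t-1) + betas[1] Z(t-2) + ...

    The first len(betas) values of z serve only as start-up noise, so the
    result is len(betas) shorter than z.
    """
    q = len(betas)
    return [z[t] + sum(b * z[t - i - 1] for i, b in enumerate(betas))
            for t in range(q, len(z))]

def simulate_ar1(alpha, z, x0=0.0):
    """X(t) = alpha X(t-1) + Z(t), started from x0."""
    xs, prev = [], x0
    for zt in z:
        prev = alpha * prev + zt
        xs.append(prev)
    return xs

z = white_noise(202, seed=1)
ma_series = simulate_ma([0.7, -0.2], z)   # process in (a)
ar_series = simulate_ar1(0.7, z[2:])      # process in (b)
```

Each series has 200 values; plot them against the index 0, 1, 2, ... with whatever plotting tool you have at hand.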

d) Make your own time series! Get a coin and some graph paper. Put the origin in the lower-left corner of the page. The x-axis is "time", which in this case will really be "number of coin flips". Put a mark at (0, 0). Flip a coin. If it lands "heads", then y increases by 1. If it lands "tails", then y stays the same. So if I throw a heads on the first toss, I put a mark at (1,1). If I throw tails, I put the mark at (1,0). Connect the points. Continue for 30 flips or so.

a) What is the probability that your time series will reach (x,20) or beyond? (x is any integer > 0).

b) Suppose you flip many many times. What's the probability you will reach (x,20) or beyond?

c) This is a "random walk" time series:

X(t) = X(t-1) + Z(t). What type of random variable is Z(t)?

d) Use the xlisp software to simulate this time series. Suppose you were to model this series with a regression. What would be the values of the parameters?

e) Suppose, now, that your coin lands heads with p = .25. Now what would be the values of the parameters in the regression?
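
The coin-and-graph-paper procedure can also be mimicked on a computer. The Python sketch below is one possible way to do it, not the xlisp route the exercise asks for; `coin_flip_path` and its parameters are hypothetical names.

```python
import random

def coin_flip_path(n_flips, p_heads=0.5, seed=None):
    """Return y-values at times 0..n_flips: y rises by 1 on heads, stays put on tails."""
    rng = random.Random(seed)
    path = [0]  # the mark at (0, 0)
    for _ in range(n_flips):
        step = 1 if rng.random() < p_heads else 0
        path.append(path[-1] + step)
    return path

path = coin_flip_path(30, seed=42)
for t, y in enumerate(path):
    print(t, y)
```

Changing `p_heads` to 0.25 gives the biased coin, and dropping the seed gives a fresh series on every run.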
 
 

Download Software

I admit, this is a bit risky.  I'm not sure how this will work on a non-Mac machine.  If necessary, I can email it to you, but the file is big.  Still, if you click, here is what should happen:
a) you should download a file called timeseries.sit (Macintosh) or timeseries.zip (other).
b) You should "unstuff" it with Stuffit Expander or some other expanding software.  (Note: some browsers will unstuff automatically and you can skip this step.)
c) Place it in the Vista folder.
d) You don't need to run Vista.  Just double-click on the xlstsp.lsp  icon.  This will load all of the appropriate files.
These are just text files. So if worst comes to worst, you can open up xlispstat (or ViSta) and load them in by cutting and pasting.  But there are many of them, and that would be tedious.

If worst comes to worst, skip those exercises.  They are primarily to help give you some experience "seeing" the models.



III. Additional time series problems.
 

1) A random walk is given by X(t) = X(t-1) + Z(t).  Show that, if X(0) = 0, then X(t) = Z(1) + Z(2) +...+ Z(t)

2) Define the difference operator to be Dd.  (This is what was written with an upside down delta in class.)  Hence
D1(X(t)) = X(t) - X(t-1)  (First order difference). And
D2(X(t)) = D1(D1(X(t))) = D1(X(t) - X(t-1)) = [X(t) - X(t-1)] - [X(t-1) - X(t-2)] = X(t) - 2X(t-1) + X(t-2)
a) What is D3(X(t))?
b) Define W(t) = Dd(X(t)), the d-th order difference of X(t).  Then W(t) = alpha1 W(t-1) + alpha2 W(t-2) +...+alphap W(t-p) + Z(t) + Beta1 Z(t-1) + Beta2 Z(t-2) +...+Betaq Z(t-q)
is the definition of an ARIMA(p,d,q) process.  Show that a random walk is an ARIMA(0,1,0) process.
c) What is another name for an ARIMA(p,0,0) process?
d) What is another name for an ARIMA(0,0,q) process?
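
To experiment with the difference operator numerically, the following Python sketch applies the first-order difference repeatedly and compares the second difference with its term-by-term expansion x(t) - 2x(t-1) + x(t-2). The function name and the sample values are illustrative only.

```python
def difference(xs, d=1):
    """Apply the first-order difference d times; each pass shortens the series by one."""
    for _ in range(d):
        xs = [b - a for a, b in zip(xs, xs[1:])]
    return xs

series = [1, 4, 9, 16, 25, 36]   # illustrative values, X(t) = t^2
d1 = difference(series, 1)
d2 = difference(series, 2)
# Second difference written out term by term, for comparison with d2.
d2_direct = [series[t] - 2 * series[t - 1] + series[t - 2]
             for t in range(2, len(series))]
print(d1, d2, d2_direct)
```

The two versions of the second difference should agree, and calling `difference(series, 3)` is one way to check an answer to (a) on a particular series.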

3) Suppose X(t) = Z(t) + Beta Z(t-1).  This is a moving average (MA) process.  As we said in class, moving average processes can be difficult to estimate  (there is no closed-form solution).  As a result, some analysts prefer to rewrite MA processes as AR processes.  Rewrite this as an AR process.