Let the sample size tend to infinity.
The sample covariance matrix then converges to the population covariance matrix (by the law of large numbers).
The rest of the steps remain the same.
We shall use the population version for the theoretical discussion.
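A quick numerical check of this convergence (a minimal numpy sketch; the 2x2 population matrix Sigma below is an arbitrary choice for illustration):

import numpy as np

rng = np.random.default_rng(0)

# An arbitrary population covariance matrix (illustration only).
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

for n in (10, 1000, 100000):
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=n)
    S = np.cov(X, rowvar=False)        # sample covariance matrix
    print(n, np.abs(S - Sigma).max())  # entrywise error shrinks as n grows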
Basic facts:
1. Variance of a linear combination of random variables:
var(ax + by) = a^2 var(x) + b^2 var(y) + 2ab cov(x,y)
2. Life is easier with the matrix representation:
(B.1) var(m'X) = m' Cov(X) m
here m is a p-vector and X consists of p random variables, X = (x_1, …, x_p)'
From (B.1), it follows that
maximizing var(m'X) subject to ||m|| = 1 is the same as maximizing m' Cov(X) m subject to ||m|| = 1
(here ||m|| denotes the length of the vector m)
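A simulation check of (B.1), sketched in numpy (Sigma and the unit vector m below are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # arbitrary population covariance
m = np.array([0.6, 0.8])         # a unit vector: ||m|| = 1

X = rng.multivariate_normal([0.0, 0.0], Sigma, size=200000)
print(np.var(X @ m))             # sample variance of the projection m'X
print(m @ Sigma @ m)             # population value m' Cov(X) m; they agree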
3. Eigenvalue decomposition:
(B.2) M v_i = λ_i v_i, where λ_1 ≥ λ_2 ≥ … ≥ λ_p
Basic linear algebra tells us that the first eigenvector will do:
the solution of max m' M m subject to ||m|| = 1 must satisfy M m = λ_1 m,
i.e., m is the first eigenvector v_1, and the maximum value attained is λ_1
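To see this numerically, compare the first eigenvector against random unit vectors (a numpy sketch; M is an arbitrary symmetric matrix):

import numpy as np

rng = np.random.default_rng(2)
M = np.array([[2.0, 0.8],
              [0.8, 1.0]])             # arbitrary symmetric matrix

# eigh returns eigenvalues of a symmetric matrix in ascending order
eigvals, eigvecs = np.linalg.eigh(M)
v1, l1 = eigvecs[:, -1], eigvals[-1]   # first (largest) eigenpair
print(l1, v1 @ M @ v1)                 # v1' M v1 equals lambda_1

# No random unit vector does better than v1.
u = rng.standard_normal((1000, 2))
u /= np.linalg.norm(u, axis=1, keepdims=True)
print((np.einsum('ij,jk,ik->i', u, M, u) <= l1 + 1e-12).all())  # True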
4. The covariance matrix is degenerate (i.e., some eigenvalues are zero) if the data are confined to a lower-dimensional subspace S
Rank of the covariance matrix = number of non-zero eigenvalues = dim(S)
This explains why PCA works for our first example
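Fact 4 in action (a sketch; the 3-dimensional data below are constructed to lie in a 2-dimensional plane):

import numpy as np

rng = np.random.default_rng(3)

# 3-d data confined to a 2-d plane: x3 is a linear function of x1 and x2.
Z = rng.standard_normal((5000, 2))
X = np.column_stack([Z[:, 0], Z[:, 1], Z[:, 0] + 2.0 * Z[:, 1]])

S = np.cov(X, rowvar=False)
print(np.round(np.linalg.eigvalsh(S), 6))  # one eigenvalue is (numerically) zero
print(np.linalg.matrix_rank(S))            # rank 2 = dim of the plane S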
Why can small errors be tolerated?
Small errors perturb the covariance matrix only slightly, and (as long as the leading eigenvalues are well separated) the leading eigenvectors move only slightly with it.
Large i.i.d. errors are fine too: they add σ²I to the covariance matrix, which shifts every eigenvalue by σ² but leaves the eigenvectors unchanged.
Heterogeneous errors are harmful, and so are correlated errors: both distort the eigenvectors.
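The σ²I argument at the population level, sketched in numpy (Sigma and σ² are arbitrary choices):

import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # arbitrary population covariance
sigma2 = 5.0                              # large i.i.d. error variance

vals, vecs = np.linalg.eigh(Sigma)
vals_n, vecs_n = np.linalg.eigh(Sigma + sigma2 * np.eye(2))
print(vals_n - vals)                      # every eigenvalue shifts by sigma2
print(np.abs(vecs_n) - np.abs(vecs))      # eigenvectors unchanged (up to sign)

# Heterogeneous errors (unequal variances) do move the eigenvectors.
print(np.linalg.eigh(Sigma + np.diag([5.0, 0.5]))[1])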
Further discussion:
No guarantee of finding nonlinear structure such as clusters, curves, etc.
In fact, sampling properties for PCA are mostly developed for normal data
(Mardia, Kent, and Bibby 1979, Multivariate Analysis. London: Academic Press)
Still useful
Scaling problem (use the correlation matrix or not?): options are available in PCA-model.lsp
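Why scaling matters, in a hypothetical numpy sketch (this is not the PCA-model.lsp code): a variable recorded in large units dominates covariance-based PCA, while correlation-based PCA treats the variables on an equal footing.

import numpy as np

rng = np.random.default_rng(4)
x1 = rng.standard_normal(5000)
x2 = 0.5 * x1 + rng.standard_normal(5000)
X = np.column_stack([1000.0 * x1, x2])   # x1 rescaled (e.g. grams, not kg)

S = np.cov(X, rowvar=False)              # covariance: the big-unit variable wins
R = np.corrcoef(X, rowvar=False)         # correlation: scale-free
print(np.linalg.eigh(S)[1][:, -1])       # first PC ≈ (1, 0): essentially all x1
print(np.linalg.eigh(R)[1][:, -1])       # first PC weights both variables equally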
Projection pursuit: the search over projection directions can be guided (optimize an interestingness index) or random; see the sketch below
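A toy version of the random variant (a sketch only; it uses excess kurtosis as the projection index, whereas a guided search would optimize the index directly):

import numpy as np

def kurtosis(z):
    # Excess kurtosis as a crude index of non-normality.
    z = (z - z.mean()) / z.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(5)

# Two clusters: the interesting direction is the first coordinate axis.
X = np.vstack([rng.normal([-3.0, 0.0], 1.0, (500, 2)),
               rng.normal([+3.0, 0.0], 1.0, (500, 2))])

best_dir, best_score = None, -np.inf
for _ in range(2000):                 # random search over unit directions
    m = rng.standard_normal(2)
    m /= np.linalg.norm(m)
    score = abs(kurtosis(X @ m))      # most non-normal projection wins
    if score > best_score:
        best_dir, best_score = m, score
print(best_dir)                       # ≈ (±1, 0), the cluster direction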