Population version of PCA
 

Let the sample size tend to infinity.
The sample covariance matrix then converges to the population covariance matrix (by the law of large numbers).
The rest of the steps remain the same.
We shall use the population version for theoretical discussion.

Basic facts:

1.  Variance of a linear combination of random variables:
  var(a x + b y) = a^2 var(x) + b^2 var(y) + 2 a b cov(x, y)

2.  Life is easier with the matrix representation:

    (B.1)    var(m'X) = m' Cov(X) m

  Here m is a p-vector and X consists of p random variables (x_1, ..., x_p)'.
 

From (B.1), it follows that

Maximizing var(m'X) subject to ||m|| = 1 is the same as maximizing m' Cov(X) m subject to ||m|| = 1
(here ||m|| denotes the length of the vector m).
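A quick numerical check of (B.1) (a sketch; the mixing matrix A and the vector m below are arbitrary illustrative choices). Note that the identity holds for sample moments as well, up to the ddof convention:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 3
A = rng.normal(size=(p, p))
X = rng.normal(size=(n, p)) @ A.T        # n draws of a p-variate X, Cov(X) ~ A A'

m = np.array([0.5, -1.0, 2.0])           # an arbitrary p-vector
lhs = np.var(X @ m)                      # variance of the linear combination m'X
rhs = m @ np.cov(X, rowvar=False) @ m    # m' Cov(X) m, as in (B.1)
print(lhs, rhs)                          # the two agree up to the ddof convention
```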

3.  Eigenvalue decomposition:

(B.2)  M v_i = λ_i v_i,   where
       λ_1 >= λ_2 >= ... >= λ_p

Basic linear algebra tells us that the first eigenvector will do:
The solution of max m' M m subject to ||m|| = 1 must satisfy M m = λ_1 m.
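This can be verified numerically (a sketch; M below is an arbitrary symmetric positive semi-definite matrix): the top eigenvector attains λ_1 in the quadratic form, and no other unit vector does better.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
M = B @ B.T                        # symmetric positive semi-definite "covariance"

vals, vecs = np.linalg.eigh(M)     # eigh returns eigenvalues in ascending order
v1, lam1 = vecs[:, -1], vals[-1]   # top eigenpair

# The top eigenvector attains the maximum of m' M m over unit vectors:
print(v1 @ M @ v1, lam1)
# No random unit vector does better (Rayleigh quotient is bounded by lam1):
for _ in range(1000):
    m = rng.normal(size=4)
    m /= np.linalg.norm(m)
    assert m @ M @ m <= lam1 + 1e-10
```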

4.  The covariance matrix is degenerate (i.e., some eigenvalues are zero) if the data are confined to a lower-dimensional space S.
Rank of covariance matrix = number of non-zero eigenvalues = dimension of the space S.
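A small illustration (a sketch; the dimensions n, p, d and the basis are arbitrary choices): data generated inside a d-dimensional subspace of R^p yield a covariance matrix with exactly d numerically non-zero eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, d = 5000, 5, 2
basis = rng.normal(size=(p, d))          # spans a d-dim subspace S of R^p
X = rng.normal(size=(n, d)) @ basis.T    # data confined to S

C = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(C)
# Count eigenvalues above a relative tolerance: this is the numerical rank
numerical_rank = int(np.sum(eigvals > 1e-8 * eigvals.max()))
print(numerical_rank)                    # 2: rank = dim(S) = d
```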

This explains why PCA works for our first example.

Why can small errors be tolerated?

Large i.i.d. errors are fine too.

Heterogeneity is harmful, and so are correlated errors.
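A simulation sketch of these claims (all dimensions and noise levels are illustrative choices): i.i.d. noise adds σ²I to the covariance, which shifts eigenvalues but leaves eigenvectors untouched, so the leading direction stays aligned with the true structure; strongly heterogeneous noise tilts it away.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
t = rng.normal(size=n)
true_dir = np.array([1.0, 1.0]) / np.sqrt(2)
signal = np.outer(t, true_dir)           # points on a line in R^2

def top_direction(X):
    """Leading eigenvector of the sample covariance matrix."""
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return vecs[:, -1]

# i.i.d. noise: covariance becomes Cov(signal) + sigma^2 I,
# so the leading eigenvector stays (close to) the true direction
X_iid = signal + 0.1 * rng.normal(size=(n, 2))
print(abs(top_direction(X_iid) @ true_dir))    # near 1

# Heterogeneous noise (one coordinate much noisier) tilts the answer
X_het = signal + rng.normal(size=(n, 2)) * np.array([0.0, 3.0])
print(abs(top_direction(X_het) @ true_dir))    # clearly below 1
```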

Further Discussion:

There is no guarantee of finding nonlinear structure such as clusters, curves, etc.

In fact, sampling properties for PCA are mostly developed for normal data
(Mardia, Kent, and Bibby 1979, Multivariate Analysis. New York: Academic Press).

Still useful.

Scaling problem (use the correlation matrix or not?). Options are available in PCA-model.lsp.
 

Projection pursuit: guided or random.