Let the sample size tend to infinity:
the sample covariance matrix converges to the population covariance matrix (by the law of large numbers).
The rest of the steps remain the same.
We shall use the population version for the theoretical discussion.
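A quick numerical illustration (a minimal numpy sketch; the 2-by-2 population covariance Sigma below is an arbitrary choice for the demo):

    import numpy as np

    # Hypothetical population covariance, chosen only for this illustration.
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])
    rng = np.random.default_rng(0)

    for n in (100, 10_000, 1_000_000):
        X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=n)
        S = np.cov(X, rowvar=False)          # sample covariance matrix
        print(n, np.max(np.abs(S - Sigma)))  # entrywise error shrinks with n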
Basic facts:
1. Variance of a linear combination of random variables:
   var(a x + b y) = a^2 var(x) + b^2 var(y) + 2 a b cov(x, y)
2. Life is easier with matrix representation:
   (B.1)   var(m'X) = m' Cov(X) m,
   where m is a p-vector and X consists of the p random variables (x_1, ..., x_p)'.
 
From (B.1), it follows that
maximizing var(m'X) subject to ||m|| = 1 is the same as maximizing m' Cov(X) m subject to ||m|| = 1
(here ||m|| denotes the length of the vector m).
3. Eigenvalue decomposition:
   (B.2)   M v_i = lambda_i v_i,   where lambda_1 >= lambda_2 >= ... >= lambda_p.
Basic linear algebra tells us that the first eigenvector will do:
   the solution of max m' M m subject to ||m|| = 1 must satisfy M m = lambda_1 m
(checked numerically in the sketch after these facts).
4. The covariance matrix is degenerate (i.e., some eigenvalues are zero) if the data are confined to a lower-dimensional space S:
   rank of the covariance matrix = number of non-zero eigenvalues = dimension of the space S.
This explains why PCA works for our first example.
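A minimal numerical check of facts 1-4 (a numpy sketch; the covariance matrix Sigma, the vector m, and the subspace below are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Facts 1-2: var(m'X) = m' Cov(X) m, checked by simulation.
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])            # arbitrary population covariance
    m = np.array([0.6, -0.3])
    X = rng.multivariate_normal([0.0, 0.0], Sigma, size=200_000)
    print(np.var(X @ m), m @ Sigma @ m)       # the two numbers agree closely

    # Fact 3: the unit vector maximizing m' Sigma m is the first eigenvector.
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
    v1 = eigvecs[:, -1]                       # eigenvector for lambda_1
    print(v1 @ Sigma @ v1, eigvals[-1])       # maximized variance equals lambda_1

    # Fact 4: data confined to a plane in R^3 give a rank-2 covariance matrix.
    A = rng.standard_normal((3, 2))           # columns span a 2-d subspace S
    Y = rng.standard_normal((50_000, 2)) @ A.T
    C = np.cov(Y, rowvar=False)
    print(np.linalg.eigvalsh(C))              # one eigenvalue is (numerically) zero
    print(np.linalg.matrix_rank(C))           # rank 2 = dim(S)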
Why can small errors be tolerated?
Large i.i.d. errors are fine too.
Heterogeneity is harmful, and so are correlated errors.
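The reason, in one line: i.i.d. noise with variance sigma^2 adds sigma^2 I to the covariance matrix, which shifts every eigenvalue by sigma^2 but leaves the eigenvectors unchanged; unequal or correlated noise variances break this. A small numpy sketch (the subspace and the noise levels are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((3, 1))              # data confined to a line in R^3
    signal = (3.0 * rng.standard_normal((100_000, 1))) @ A.T

    for label, scales in [("small i.i.d.", [0.1, 0.1, 0.1]),
                          ("large i.i.d.", [1.0, 1.0, 1.0]),
                          ("heteroscedastic", [0.1, 0.1, 5.0])]:
        noise = rng.standard_normal(signal.shape) * scales
        vals, vecs = np.linalg.eigh(np.cov(signal + noise, rowvar=False))
        print(label, np.round(vals[::-1], 2), np.round(vecs[:, -1], 2))
    # With i.i.d. noise the top eigenvector stays aligned with the line A;
    # with heteroscedastic noise it can swing toward the noisy coordinate.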
Further Discussion:
There is no guarantee of finding nonlinear structure such as clusters, curves, etc.
In fact, sampling properties for PCA are mostly developed for normal data
(Mardia, Kent, and Bibby 1979, Multivariate Analysis, New York: Academic Press).
Still useful in practice.
Scaling problem (use the correlation matrix or not?): options are available
in PCA-model.lsp (a small sketch of the effect follows).
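Why the choice matters (a numpy sketch, not the PCA-model.lsp code; the scale factor of 100 is an arbitrary illustration): covariance-based PCA is dominated by whichever variable has the largest variance, while correlation-based PCA standardizes the variables first.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.standard_normal(10_000)
    y = 0.5 * x + rng.standard_normal(10_000)
    data = np.column_stack([x, 100.0 * y])   # second variable on a much larger scale

    # Covariance PCA: first PC points almost entirely along the rescaled variable.
    _, vec_cov = np.linalg.eigh(np.cov(data, rowvar=False))
    print(vec_cov[:, -1])

    # Correlation PCA: after standardizing, both variables get an equal say.
    _, vec_cor = np.linalg.eigh(np.corrcoef(data, rowvar=False))
    print(vec_cor[:, -1])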
 
Projection pursuit: a guided or a random search for interesting projections.
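A toy version of the random variant (a hypothetical sketch, not the course's code; the |excess kurtosis| index and the function name are my own choices for illustration):

    import numpy as np

    def random_projection_pursuit(X, n_dirs=1000, rng=None):
        """Try random unit directions and keep the one whose 1-d projection
        looks least normal, scored here by |excess kurtosis|."""
        if rng is None:
            rng = np.random.default_rng()
        X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the variables
        best_dir, best_score = None, -np.inf
        for _ in range(n_dirs):
            d = rng.standard_normal(X.shape[1])
            d /= np.linalg.norm(d)                 # random unit direction
            z = X @ d
            z = (z - z.mean()) / z.std()
            score = abs(np.mean(z**4) - 3.0)       # excess kurtosis of projection
            if score > best_score:
                best_dir, best_score = d, score
        return best_dir, best_score

A guided search would replace the random draws with steps that improve the chosen index directly.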