We now consider the situation when the data have been collected and are in the form described in section 1. One of the interesting features of microarray experiments is that they gather information on a large number of genes (six to ten thousand). If, in statistical lingo, we consider genes as variables, this means that our observations lie in a 6000-dimensional space and p >> n. The expression ``curse of dimensionality'' is due to Bellman; in statistics it refers to the fact that the convergence of any estimator to the true value of a smooth function defined on a high-dimensional space is very slow. In terms of microarrays, this means that, a priori, we need an ``enormous'' number of observations (hybridizations to different cell lines) to obtain a ``good'' estimate of a function of the genes (one that identifies, for example, which genes have altered expression patterns in a specific tumor type). A pretty dim scenario.
Fortunately, however, there are ``blessings'' associated with dimensionality (see Donoho). One of them has to do with what mathematicians call ``concentration of measure''. To try to get a commercial slogan out of it, we could say that in many cases, there are really ``few things that matter'' and that the function will be constant on most of the space. This opens up the possibility of doing statistics in a meaningful and novel way.
Suppose we are interested in classifying our n cell lines into two groups (cancer vs. non-cancer) based on the expression levels of p genes. If p is bigger than n, we can certainly find a rule based on n genes that classifies our data perfectly. The problem, however, is that when the expression levels of a new cell line are observed, our rule will do extremely poorly in predicting the cancer status of that cell line. This is because our rule was constructed ``ad hoc'' for our data set. To make sure that the classification rule we produce is meaningful for cell lines whose expression pattern has not yet been observed, we need to select a classification rule that is based on few genes. In this way, we can hope to identify relations between genes and tumor types that are real and not merely present by accident in the particular data set at hand. ``For this reason, statisticians have, for a long time, considered model selection by searching among subsets of possible explanatory variables [genes], trying to find just a few variables among the many which adequately explain the dependent variable [tumor type]. The history of this approach goes back to the early 1960's, when computers began to be used for data analysis and automatic variable selection became a distinct possibility.'' (from Donoho)
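This overfitting phenomenon can be illustrated with a small simulation (the numbers n = 20, p = 100 and the use of a least-squares linear rule are our illustrative choices, not taken from the text): with p >> n, even pure noise can be fit perfectly, yet the fitted rule is worthless on new cell lines.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                          # n cell lines, p genes, p >> n
X = rng.standard_normal((n, p))         # pure-noise "expression levels"
y = rng.choice([-1.0, 1.0], size=n)     # arbitrary cancer / non-cancer labels

# Because rank(X) = n < p, least squares finds weights with zero residual.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_acc = np.mean(np.sign(X @ w) == y)

# On fresh (equally noisy) cell lines the same rule is coin flipping.
X_new = rng.standard_normal((1000, p))
y_new = rng.choice([-1.0, 1.0], size=1000)
test_acc = np.mean(np.sign(X_new @ w) == y_new)

print(train_acc)   # perfect on the data the rule was built from
print(test_acc)    # near 0.5 on new data
```

The point is not the particular classifier: any sufficiently flexible rule built from more parameters than observations can reproduce the training labels exactly while learning nothing generalizable.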
There is an important warning, however, implicit in any variable selection procedure, and it is particularly serious when the number of variables is very large: if we search long enough and among numerous enough variables, we will find a pattern, even if all the variables are noise. How do we protect ourselves from this phenomenon?
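A quick simulated sketch of this danger (the sizes n = 20 and p = 5000 are hypothetical): among thousands of noise ``genes'', the best single-gene correlation with a completely unrelated outcome will look impressive.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 5000                     # few arrays, many noise "genes"
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)          # outcome unrelated to every gene

# Sample correlation of each gene with the outcome.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n
best = np.abs(corr).max()

print(best)   # typically well above 0.5, though everything is noise
```

A correlation of this size found by searching 5000 candidates means nothing; the multiplicity of the search has to enter any assessment of significance.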
Typically, variable selection amounts to searching for a set of variables with which we can construct a model that minimizes an error criterion. To take into account the fact that the more variables are included in the model, the higher the variability of the prediction, a penalized form of the criterion is minimized:
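In schematic form (the notation err(S) and the exact shape of the penalty are our reconstruction, not a formula taken from the original text):

```latex
\min_{S \subseteq \{1,\dots,p\}} \;\; \mathrm{err}(S) \; + \; c\,|S|
```

where S ranges over subsets of the p candidate variables, err(S) is the fitting error of the model built on S, |S| is the number of variables it uses, and c is the penalty charged per variable (classical choices such as AIC and BIC correspond to c = 2 and c = log n, in units of the noise variance).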
In situations where the number of variables to search among is very big (p >> n, as in microarrays), proposals have appeared in the literature suggesting c = const · log(p), to account for the effect of the search.
Another way of doing statistics in high dimension is to change the way of doing asymptotics. It is indeed realistic to consider situations where p goes to infinity together with n. A successful example of this is the work of Johnstone on principal components (which could be applied to microarray data) and, in a strictly microarray context, the work of Mark van der Laan.
A simpler, but sometimes very effective, way of dealing with high-dimensional data is to reduce the number of dimensions by eliminating coordinates that seem irrelevant. In the case of microarray data this can often be done effectively and simply, by eliminating from consideration all those genes whose expression value varies little across hybridization experiments. There are a variety of threshold rules that can be employed.
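One such threshold rule, sketched on simulated data (the sizes, the fraction of flat genes, and the cutoff 0.1 are all illustrative assumptions): keep only the genes whose variance across hybridizations exceeds a cutoff.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_arrays = 6000, 12
expr = rng.standard_normal((n_genes, n_arrays))
expr[:4000] *= 0.05          # most genes barely vary across the arrays

# Keep only genes whose variance across hybridizations exceeds a threshold
# (the cutoff 0.1 is an arbitrary illustrative choice).
keep = expr.var(axis=1) > 0.1
filtered = expr[keep]

print(expr.shape, filtered.shape)   # roughly the 2000 varying genes survive
```

In practice the cutoff would be tuned to the noise level of the platform, or set so that a manageable number of genes survives.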
The statistical literature contains a whole set of techniques to identify the ``relevant'' coordinates of a dataset (see principal components analysis, independent component analysis, SIR, etc.). Some of these may be applicable to microarrays, even though certainly not in a blind fashion.
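As one instance, principal components can be computed from the singular value decomposition of the centered data matrix; the sketch below (simulated data with one planted strong direction; all numbers are illustrative assumptions) shows the first component absorbing most of the variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 500
# Simulated expression matrix: one strong hidden direction plus noise.
signal = rng.standard_normal((n, 1)) @ rng.standard_normal((1, p))
X = 3 * signal + rng.standard_normal((n, p))

# Principal components via the SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                       # coordinates of the arrays on the PCs
explained = s**2 / np.sum(s**2)      # fraction of variance per component

print(explained[0])   # the planted direction dominates
```

The caveat in the text applies: whether the leading components of a microarray dataset correspond to biology or to experimental artifacts has to be checked, not assumed.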