We now consider the situation in which data have been collected and are in
the form described in section 1.
One of the interesting features of microarray experiments
is the fact that they gather information on a large number of genes
(six to ten thousand).
If we consider, in the statistical lingo, genes as variables,
this means that our observations live in a roughly 6000-dimensional space and
*p* ≫ *n*.
The expression ``curse of dimensionality'' is due to Bellman; in
statistics it refers to the fact that the convergence of any
estimator to the true value of a smooth function defined on a space of high
dimension is very slow. In terms of microarrays, this means that, a priori,
we would need an ``enormous'' number of observations (hybridizations to
different cell lines)
to obtain a ``good'' estimate of a function of the genes (one that identifies,
for example, which genes have altered expression patterns in a
specific tumor type). A pretty dim scenario.

Fortunately, however, there are ``blessings'' associated with dimensionality (see Donoho). One of them has to do with what mathematicians call ``concentration of measure''. To try to get a commercial slogan out of it, we could say that in many cases, there are really ``few things that matter'' and that the function will be constant on most of the space. This opens up the possibility of doing statistics in a meaningful and novel way.

Suppose we are interested in classifying our *n* cell lines in two groups
(cancer vs non cancer) based on the expression levels of *p* genes.
If *p* is larger than *n*, we can certainly find a rule based on *n*
genes that classifies our data perfectly. The problem, however, is
that when the expression levels of a new cell line are observed,
our rule will do extremely poorly in predicting the cancer status of
that cell line. This is because our rule was constructed ``ad hoc'' for
our data set. To make sure that the classification rule we produce is
meaningful for cell lines whose expression pattern has not yet been
observed, we need to select a classification rule that is based on
few genes. In this way, we can hope to identify relations between
genes and tumor types that are real and not merely present by accident
in the particular data set at hand.
``For this reason, statisticians have, for a long time, considered
model selection by searching among subsets of possible explanatory
variables [genes], trying to find just a few variables among the many which
adequately explain the dependent variable [tumor type]. The history of
this approach goes back to the early 1960's when computers began to be
used for data analysis and automatic variable selection became a
distinct possibility.'' (from Donoho)

There is, however, an important warning implicit in any variable selection procedure, and it is particularly serious when the number of variables is very large: if we search long enough and among enough variables, we will find a pattern, even if all the variables are noise. How do we protect ourselves from this phenomenon?
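A small simulation makes the danger concrete (an illustrative sketch, not part of the original argument; the sample sizes and the least-squares rule are arbitrary choices): when *p* > *n*, a rule fit to pure noise classifies the training cell lines perfectly, yet does no better than chance on new ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 1000                    # 20 "cell lines", 1000 pure-noise "genes"
X = rng.normal(size=(n, p))        # expression values: all noise
y = np.where(np.arange(n) < n // 2, -1.0, 1.0)   # arbitrary cancer labels

# Least-squares "classifier": with p > n an interpolating solution
# exists, so the rule reproduces the training labels exactly.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_acc = np.mean(np.sign(X @ w) == y)         # perfect on training data

# On fresh noise data the same rule is no better than coin flipping.
X_new = rng.normal(size=(n, p))
y_new = np.where(np.arange(n) < n // 2, -1.0, 1.0)
test_acc = np.mean(np.sign(X_new @ w) == y_new)
print(train_acc, test_acc)
```

The training accuracy is 1.0 by construction, while the accuracy on the new data hovers around chance level.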

Typically, variable selection amounts to searching for a set of variables with which we can construct a model that minimizes an error criterion. To take into account that the more variables are included in the model, the higher the variability of the prediction, a penalized form of the criterion is minimized:

min ERR(model) + *c* (Model Complexity),

where *c* is a penalty constant. In situations where the number of
variables to search among is really big (*p* ≫ *n*, as in microarrays),
proposals have appeared in the literature suggesting a constant of the
form *c* = const · log(*p*), to take the search effect into account.
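As a sketch of how such a penalized criterion might be minimized (the greedy forward search, the residual sum of squares as ERR, and the value of the multiplicative constant are all illustrative assumptions, not a method prescribed by the text):

```python
import numpy as np

def penalized_forward_selection(X, y, cost=2.0):
    """Greedily add the variable that most reduces
    RSS + cost * log(p) * (number of selected variables);
    stop when no addition improves the criterion."""
    n, p = X.shape
    selected = []
    best_crit = np.sum((y - y.mean()) ** 2)      # criterion with no variables
    improved = True
    while improved:
        improved = False
        for j in set(range(p)) - set(selected):
            cols = X[:, selected + [j]]
            beta, *_ = np.linalg.lstsq(cols, y, rcond=None)
            crit = (np.sum((y - cols @ beta) ** 2)
                    + cost * np.log(p) * (len(selected) + 1))
            if crit < best_crit:
                best_crit, best_j = crit, j
                improved = True
        if improved:
            selected.append(best_j)
    return selected

# Simulated check: only genes 0 and 1 actually drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 200))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=50)
selected = penalized_forward_selection(X, y)
print(selected)
```

With the log(*p*) penalty the search recovers the two informative genes while keeping the number of spuriously selected noise genes small.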

Another way of doing statistics in high dimensions is to change the
way of doing asymptotics. It is indeed realistic to consider
situations where *p* goes to infinity together with *n*.
A successful example of this is the work of Johnstone on principal
components (which could be applied to microarray data) and, in a
strictly microarray context, of Mark van der Laan.

A simpler, but sometimes very effective, way of dealing with high-dimensional data is to reduce the number of dimensions by eliminating coordinates that seem irrelevant. In the case of microarray data this can often be done effectively and simply by eliminating from consideration all those genes whose expression value doesn't vary across hybridization experiments. There are a variety of threshold rules that can be employed.
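One such threshold rule can be sketched as follows (the cutoff value and the use of the standard deviation as the variability measure are arbitrary illustrative choices):

```python
import numpy as np

def filter_low_variance_genes(X, min_sd=0.5):
    """X: n hybridizations x p genes. Keep only the genes (columns)
    whose standard deviation across hybridizations exceeds min_sd."""
    keep = X.std(axis=0) > min_sd
    return X[:, keep], np.flatnonzero(keep)

# Simulated check: 500 genes that barely vary, 50 with real variation.
rng = np.random.default_rng(2)
flat = rng.normal(scale=0.1, size=(30, 500))
varying = rng.normal(scale=1.0, size=(30, 50))
X = np.hstack([flat, varying])
X_kept, idx = filter_low_variance_genes(X)
print(X_kept.shape)   # the near-constant genes have been dropped
```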

The statistical literature contains a whole set of techniques to identify the ``relevant'' coordinates of a dataset (principal components analysis, independent component analysis, SIR, etc.). Some of these may be applicable to microarrays, though certainly not in a blind fashion.
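To illustrate the first of these techniques, principal components can be computed from the singular value decomposition of the centered data matrix (a generic sketch on simulated data, not an analysis tuned to microarrays):

```python
import numpy as np

def principal_components(X, k):
    """Project n x p data onto its first k principal components and
    report the fraction of variance each component explains."""
    Xc = X - X.mean(axis=0)                     # center each coordinate
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                      # n x k reduced coordinates
    explained = s[:k] ** 2 / np.sum(s ** 2)
    return scores, explained

# Simulated check: 300 coordinates driven by only 2 latent directions.
rng = np.random.default_rng(3)
latent = rng.normal(size=(40, 2))
loadings = rng.normal(size=(2, 300))
X = latent @ loadings + 0.1 * rng.normal(size=(40, 300))
scores, explained = principal_components(X, 2)
print(scores.shape, explained.sum())
```

Here the first two components recover essentially all of the variation, exactly the situation in which the ``relevant'' coordinates are few.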

**References**

- Donoho, D., "High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality".
- M.J. van der Laan, J. Bryan (1999), Gene Expression Analysis with the Parametric Bootstrap.
- Dudoit
- FDR