Next: Design
Up: Basic Statistical references for
Previous: Contents
We are not going to provide an accurate description of the experimental
setting, which can be found in numerous other sources.
We are concerned here with the formal
description of the data produced by an array experiment, which is more
useful for the purpose of statistical reasoning. As a general rule, we
will consider cDNA microarrays, but what follows can in many cases be
adapted to the Affymetrix array or to the Rosetta array.
By describing the gene-chip data
in traditional statistical language, we will be able to identify a
series of specific issues that we will discuss in the following sections.
For an overview of statistical issues involved in determining which
genes change expression across two experiments, see the lecture
. For an overview of pattern recognition techniques, see the
lecture
To a first approximation,
the data from gene-chip experiments can be summarized in a matrix
where each row
corresponds to a specific spot on all the arrays and each column corresponds
to a different ``variety'', that is, to the
hybridization of a new array with the cDNA coming from a different cell
line.
Possible sources of the cell lines
whose cDNA is
hybridized with the arrays include: cell lines of
different tumors or of different tissues, cell lines of the same
tissue in different individuals, and identical cell lines under
different shocks or at different points of the cell cycle.
The data themselves consist of a measurement of the (relative) expression
level recorded for a spot in a particular
hybridization.
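Formally, then, we have
the matrix
\[
X = \begin{pmatrix}
X_{11} & \cdots & X_{1p} \\
\vdots & \ddots & \vdots \\
X_{n1} & \cdots & X_{np}
\end{pmatrix}
\]
where the element $X_{ij}$ is the measured expression for spot $i$
($i = 1, \ldots, n$) in hybridization $j$ ($j = 1, \ldots, p$).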
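As a minimal sketch of this layout, assuming NumPy and entirely made-up numbers (the values and variable names below are illustrative and do not come from any real experiment), the matrix can be held as a two-dimensional array. Note that the code uses zero-based indices, while the text counts spots and hybridizations from one.

```python
import numpy as np

# Hypothetical toy data: 4 spots (rows) measured in 3 hybridizations (columns).
X = np.array([
    [0.8, 1.2, -0.3],
    [2.1, 1.9, 2.4],
    [-1.0, -0.7, 0.1],
    [0.0, 0.4, -0.2],
])

n, p = X.shape           # n spots, p hybridizations
i, j = 1, 2              # zero-based: second spot, third hybridization
x_ij = X[i, j]           # measured expression for that spot in that hybridization

spot_profile = X[i, :]   # one spot across all hybridizations (length p)
array_profile = X[:, j]  # all spots within one hybridization (length n)
```

Reading along a row gives the behavior of one spot across experiments; reading down a column gives the snapshot produced by one array.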
In the literature, there is a tendency to identify each row with a
gene and each column with a variable corresponding to an experimental
condition (e.g., tumor type or cell cycle phase). In what follows we
will also adopt this convention. Initially, however, we
wanted to emphasize the correspondence of rows to spots and of
columns to experiments, in order to account for possible
replication: the same gene can be spotted in multiple locations of the
same array (so that we have more rows than the total number of genes
considered) and the same cDNA can be hybridized to different arrays
(so that we have more columns than the total number of conditions
considered). Indeed, a microarray experiment includes a number of
other variables that we are not considering for the moment and that we
will discuss in the sections Design and Measurements; whether it is
appropriate to take these other variables into account depends on
the nature of the question addressed.
For the purpose of a general introduction, unless otherwise specified,
we will assume that there is one row per gene and one column per
distinct experimental condition.
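To make the passage from a spot-by-hybridization matrix to a gene-by-condition matrix concrete, here is a rough sketch that simply averages replicates. Averaging is an assumption made for illustration only, not a recommendation from these notes; the labels `spot_gene` and `array_condition` and the helper `collapse` are hypothetical.

```python
import numpy as np

# Hypothetical labels: the gene carried by each spot, the condition of each array.
spot_gene = np.array(["g1", "g1", "g2", "g3"])
array_condition = np.array(["tumorA", "tumorA", "tumorB"])

# Toy spot-by-hybridization matrix (4 spots, 3 arrays).
X = np.array([
    [1.0, 3.0, 5.0],
    [3.0, 5.0, 7.0],
    [0.0, 2.0, 4.0],
    [6.0, 6.0, 6.0],
])

def collapse(X, row_labels, col_labels):
    """Average replicate rows (same gene) and replicate columns (same condition)."""
    genes = np.unique(row_labels)
    conditions = np.unique(col_labels)
    # Average the spots that carry the same gene.
    by_gene = np.vstack([X[row_labels == g].mean(axis=0) for g in genes])
    # Average the arrays hybridized under the same condition.
    collapsed = np.column_stack(
        [by_gene[:, col_labels == c].mean(axis=1) for c in conditions]
    )
    return collapsed, genes, conditions

X_collapsed, genes, conditions = collapse(X, spot_gene, array_condition)
# X_collapsed has one row per gene and one column per condition.
```

Averaging discards the information about variability between replicates, which is exactly the kind of additional structure that a more careful analysis of the design would keep.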
Most statistical frameworks distinguish between two separate notions,
variables and observations (columns and rows of the data matrix). In
the data matrix described above, this distinction is not
immediate. In fact, gene arrays are an example of what Art Owen calls
``transposable data''.
If the goal of the analysis is to formulate a classification rule for
tumor types on the basis of the expression levels of different genes,
the variables are the genes and the different experiments are
different observations. However, if the question is to identify groups
of genes that have similar regulatory mechanisms, the variables are
the different experiments and the genes are the observations.
Indeed, the gene-array context helps underscore the fact
that the labels ``variable'' and ``observation'' depend more on the
kind of question one intends to ask of the data
than on any essential characteristic of the object.
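A small sketch may help fix ideas: the same array serves both questions, and only its orientation changes. The data are random numbers, and the helper `pairwise_distances` with its Euclidean distance is an arbitrary choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # hypothetical: 6 genes (rows), 3 experiments (columns)

def pairwise_distances(obs_by_var):
    """Euclidean distances between the rows (observations) of the input."""
    diff = obs_by_var[:, None, :] - obs_by_var[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Question about genes (e.g. similar regulation):
# genes are the observations, experiments are the variables.
gene_distances = pairwise_distances(X)        # shape (6, 6)

# Question about samples (e.g. tumor classification):
# experiments are the observations, genes are the variables.
sample_distances = pairwise_distances(X.T)    # shape (3, 3)
```

Nothing in the data themselves dictates which of the two calls is the right one; the choice follows from the question.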
Furthermore, it should be noted that in many cases ``covariates''
are available for the rows or the columns of the data matrix, or for
both. For example, for each gene, we may have sequence information on
the regulatory region upstream of the coding region, information that we
may want to relate to the expression pattern of the gene (see ). On
the other hand, for each cell line, we may have information about
the particular tumor type it belongs to, the survival time
of the affected individual from whom it was taken, or the effect of a
particular therapy (see ). This underscores the fact that a priori
there is no particular ``direction'' in which the analysis should
take place.
If we take the ``transposable data'' approach, we should
somehow work on both rows and columns to gain information in both
directions. This is particularly true for exploratory analysis, whose
goal is to find a convenient graphical representation of the data.
Examples are Hartigan, double clustering, gene shaving, plaid, paf. We
will discuss some of these in the sections graphical representation and
clustering.
In general, it is a statistician's prejudice (quite well backed up, by the
way) that you should formulate a question before trying to answer it
by looking at data. If you do so, in most situations it will be
clear what should be used as variables, and classical statistical
approaches will help with the analysis. If you then change the
question and want to use the same data set, you may need to think
about how far you can reuse your data without
``fishing'' for answers, but this will be discussed in the section tests.
Here are examples of questions that one may want to answer
using the above data matrix. Each calls for a specific use of rows
and columns, and we discuss them in what follows:
- Can I diagnose tumor type based on gene expression levels?
- Can I predict survival time using gene expression levels?
- Can I identify distances between different tumors based on the
expression levels of various genes?
- Can I identify genes that are coregulated using the values of
their expression at different points of the cell cycle and/or under
different shocks, in the presence or absence of sequence information on
their promoter region?
- Here is a new gene that has a given expression profile: who are
its buddies?
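The last question in the list can be sketched in a few lines. The example below ranks known genes by the Pearson correlation between their profile and that of the new gene; correlation is only one possible notion of similarity and is assumed here for illustration, as are the toy profiles and the helper name `closest_genes`. The sketch also assumes that no profile is constant, since a constant row would make the correlation undefined.

```python
import numpy as np

def closest_genes(X, profile, k=2):
    """Return the indices of the k rows of X most correlated with `profile`,
    together with the correlation of every row."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center each gene profile
    pc = profile - profile.mean()            # center the new profile
    corr = (Xc @ pc) / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(pc))
    order = np.argsort(-corr)[:k]            # largest correlations first
    return order, corr

# Hypothetical gene-by-condition matrix (4 genes, 4 conditions).
X = np.array([
    [1.0, 2.0, 3.0, 4.0],   # rises steadily
    [4.0, 3.0, 2.0, 1.0],   # falls steadily
    [2.0, 4.0, 6.0, 8.0],   # rises, with twice the amplitude
    [1.0, 3.0, 1.0, 3.0],   # oscillates
])
new_profile = np.array([0.5, 1.0, 1.5, 2.0])

buddies, corr = closest_genes(X, new_profile, k=2)
```

Here the genes are the observations and the conditions are the variables, so the comparison runs along the rows of the matrix.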
As in traditional notation, our data matrix X has n rows and p
columns. Typically, n is on the order of thousands while p is on
the order of tens. This is a great setting if we consider the columns as
variables, but quite different from the classical statistical paradigm
if we consider the rows as variables, or if we want to investigate
both rows and columns. In general, one has to come to terms with the
fact that gene-chip data are high-dimensional: the usual kind of
asymptotics will not apply, and the number of variables can be far
larger than the number of observations. This requires some additional
steps, but there are indeed some precise answers that statistics has
to offer. We will discuss them in the
section Curse of Dimensionality?
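One concrete symptom of having more variables than observations can be checked numerically. In the sketch below, with simulated data and hypothetical sizes, the genes are treated as variables and the samples as observations; the sample covariance matrix of the genes then has rank at most p - 1, so it cannot be inverted, and classical procedures that rely on its inverse cannot be applied directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, p_samples = 200, 10   # hypothetical sizes, with n much larger than p
X = rng.normal(size=(n_genes, p_samples))

# Genes as variables: the observations are the p samples,
# so the observation-by-variable matrix is X.T (samples x genes).
S = np.cov(X.T, rowvar=False)      # 200 x 200 sample covariance of the genes
rank = np.linalg.matrix_rank(S)    # cannot exceed p_samples - 1
```

With only ten samples, the 200 by 200 covariance matrix is far from full rank, whatever the data happen to be.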
2001-09-19