Next: Design Up: Basic Statistical references for Previous: Contents   Contents


Data and data organization

We are not going to provide an accurate description of the experimental setting, which can be found in numerous other sources. Here we are concerned with the formal description of the data produced by an array experiment that is most useful for statistical reasoning. As a general rule, we will consider cDNA microarrays, but what follows can in many cases be adapted to the Affymetrix array or to the Rosetta array. By describing gene-chip data in traditional statistical language, we will be able to identify a series of specific issues that we discuss in the following sections. For an overview of the statistical issues involved in determining which genes change expression across two experiments, see the lecture . For an overview of pattern recognition techniques, see the lecture

Data matrix

To a first approximation, the data from gene-chip experiments can be summarized in a matrix where each row corresponds to a specific spot on all the arrays and each column corresponds to a different ``variety'', that is, the hybridization of a new array with the cDNA coming from a different cell line. Possible sources of the cell lines whose cDNA is hybridized with the arrays include cell lines of different tumors or of different tissues, cell lines of the same tissue in different individuals, and identical cell lines under different shocks or at different moments of the cell cycle. Each entry of the matrix is a measurement of the (relative) expression level registered for a spot in a particular hybridization. Formally, then, we have the matrix

$\displaystyle X = \left( \begin{array}{cccc}
X_{11} & X_{12} & \cdots & X_{1p}\\
X_{21} & X_{22} & \cdots & X_{2p}\\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{np}
\end{array} \right)$

where the element $X_{ij}$ is the measured expression for spot i in hybridization j. In the literature, there is a tendency to identify each row with a gene and each column with a variable corresponding to an experimental condition (e.g. tumor type or cell cycle phase). In the following we will also use this convention. Initially, however, we wanted to emphasize the correspondence between rows and spots and between columns and experiments. This takes into account possible repetitions: the same gene can be spotted in multiple locations of the same array (so that we have more rows than the total number of genes considered) and the same cDNA can be hybridized to different arrays (so that we have more columns than the total number of conditions considered). Indeed, a microarray experiment includes a number of other variables that we are not considering for the moment and that we will discuss in the sections Design and Measurements: whether it is appropriate to take these other variables into account depends on the nature of the question addressed.

For the purpose of a general introduction, unless otherwise specified we will assume that there is one row per gene and one column per experimental condition.
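As a rough illustration of how a spot-by-hybridization table might be reduced to the one-row-per-gene, one-column-per-condition matrix assumed from here on, the sketch below averages replicate spots and replicate arrays. The averaging rule, the toy values, and the names `spot_gene`, `array_condition` and `collapse` are all assumptions made for this example. Whether replicates should be collapsed at all depends on the question asked.

```python
import numpy as np

# Hypothetical spot-level measurements: 4 spots (rows) x 3 hybridizations (columns).
spot_values = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [0.5, 0.5, 1.5],
    [2.0, 2.0, 2.0],
])
# Gene printed on each spot (gene g1 is spotted twice).
spot_gene = np.array(["g1", "g1", "g2", "g3"])
# Condition hybridized on each array (tumorA is hybridized twice).
array_condition = np.array(["tumorA", "tumorA", "tumorB"])


def collapse(values, row_labels, col_labels):
    """Average replicate rows per gene, then replicate columns per condition."""
    genes = np.unique(row_labels)
    conditions = np.unique(col_labels)
    # One row per gene: mean over all spots carrying that gene.
    by_gene = np.vstack([values[row_labels == g].mean(axis=0) for g in genes])
    # One column per condition: mean over all arrays for that condition.
    X = np.column_stack(
        [by_gene[:, col_labels == c].mean(axis=1) for c in conditions]
    )
    return X, genes, conditions


X, genes, conditions = collapse(spot_values, spot_gene, array_condition)
print(X.shape)  # n genes by p conditions
```

Averaging is only one possible choice. Keeping the replicates as separate rows or columns preserves information about measurement variability that the collapsed matrix discards.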

Variables or observations?

Most statistical frameworks rely on two separate notions, variables and observations (the columns and rows of the data matrix). In the data matrix described above, this distinction is not immediate. In fact, gene arrays are an example of what Art Owen calls ``transposable data''. If the goal of the analysis is to formulate a classification rule for tumor types on the basis of the expression levels of different genes, the variables are the genes and the different experiments are the observations. However, if the question is to identify groups of genes that have similar regulatory mechanisms, the variables are the different experiments and the genes are the observations. Indeed, the gene-array context helps underscore the fact that the labels ``variable'' and ``observation'' depend more on the kind of question one intends to ask of the data than on an essential characteristic of the object. Furthermore, it should be noted that in many cases ``covariates'' are available for the rows of the data matrix, for the columns, or for both. For example, for each gene we may have sequence information on the regulatory region upstream of the coding region, information that we may want to relate to the expression pattern of the gene (see ). On the other hand, for each cell line we may have information about the particular tumor type it belongs to, the survival time of the affected individual from whom it was taken, or the effect of a particular therapy (see ). This underscores the fact that a priori there is no particular ``direction'' in which the analysis should take place.
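The two readings of the same matrix can be made concrete with a small sketch. The data here are simulated, and the sizes `n_genes` and `n_experiments` are arbitrary choices for illustration. The only point is that the roles of rows and columns are fixed by the question, not by the matrix itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_experiments = 6, 4          # illustrative sizes only
X = rng.normal(size=(n_genes, n_experiments))

# Classifying tumor types: experiments are the observations, genes the variables.
# An observations-by-variables layout is then the transpose of X.
experiments_as_observations = X.T      # shape (p, n)

# Grouping co-regulated genes: genes are the observations, experiments the variables.
genes_as_observations = X              # shape (n, p)

# np.corrcoef treats each ROW of its argument as a variable.
gene_gene_corr = np.corrcoef(X)        # similarities between genes, shape (n, n)
expt_expt_corr = np.corrcoef(X.T)      # similarities between experiments, shape (p, p)

print(gene_gene_corr.shape, expt_expt_corr.shape)
```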

If we take the ``transposable data'' approach, we should work on both rows and columns to gain information in both directions. This is particularly true for exploratory analysis, whose goal is to find a convenient graphical representation of the data. Examples are Hartigan, double clustering, gene shaving, plaid, paf. We will discuss some of these in the sections graphical representation and clustering. In general, it is a statistician's prejudice (quite well backed up, by the way) that you should formulate a question before trying to answer it by looking at data. If this is the case, in most situations it will be clear what should be used as variables, and classical statistical approaches will help with the analysis. If you then change the question and want to use the same data set, some thought may be needed about how far you can reuse your data without ``fishing'' for answers, but this will be discussed in the section tests. Here are some examples of questions that one may want to answer using the above data matrix, each of which calls for a specific use of rows and columns, and which we discuss in the following:

  1. Can I diagnose tumor type based on gene expression levels?
  2. Can I predict survival time using gene expression levels?
  3. Can I identify distances between different tumors based on the expression levels of various genes?
  4. Can I identify genes that are coregulated using the values of their expression in different moments of the cell cycle and/or under different shocks, in presence or absence of sequence information in their promoter region?
  5. Here is a new gene that has a given expression profile: who are its buddies?
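To give a flavor of the last question, the sketch below ranks the genes (rows) by the Pearson correlation between their expression profile and a new profile. The choice of Pearson correlation as the similarity measure, the toy numbers, and the function name `buddies` are assumptions made for illustration. The sketch also assumes that no profile is constant across experiments, since a constant profile has an undefined correlation.

```python
import numpy as np


def buddies(X, profile, k=2):
    """Return indices and correlations of the k rows of X most correlated with profile."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center each gene's profile
    pc = profile - profile.mean()            # center the new profile
    corr = (Xc @ pc) / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(pc))
    order = np.argsort(-corr)[:k]            # largest correlation first
    return order, corr[order]


# Hypothetical data: 4 genes measured in 4 experiments.
X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
])
new_profile = np.array([0.0, 1.0, 2.0, 3.0])

idx, r = buddies(X, new_profile, k=2)
print(idx, r)
```

Here the genes are the observations and the experiments are the variables. Question 1 would instead use the experiments as observations.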

Dimension

As in traditional notation, our data matrix X has n rows and p columns. Typically, n is on the order of thousands while p is on the order of tens. This is a great setting if we consider the columns as variables, but it is quite different from the classical statistical paradigm if we consider the rows as variables, or if we want to investigate both rows and columns. In general, one has to come to terms with the fact that gene-chip data are high-dimensional: the usual kind of asymptotics will not apply, and the number of variables can be far larger than the number of observations. This requires some additional steps, but statistics does have some precise answers to offer. We will discuss them in the section Curse of Dimensionality?
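One concrete symptom of having many more variables than observations can be checked numerically. When the n genes are treated as variables and the p experiments as observations, the n-by-n sample covariance matrix has rank at most p - 1, so it cannot be inverted. The sketch below uses simulated data, and the sizes n = 200 and p = 10 are arbitrary, chosen only to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 10                      # illustrative: many genes, few experiments
X = rng.normal(size=(n, p))         # rows = genes, columns = experiments

# Treat genes as variables: center each gene across the p experiments.
Xc = X - X.mean(axis=1, keepdims=True)

# Sample covariance between genes, an n x n matrix (np.cov treats rows as variables).
S = np.cov(X)

# S equals Xc @ Xc.T / (p - 1), so it has the same rank as Xc.
rank = np.linalg.matrix_rank(Xc)
print(S.shape, rank)
```

With only p - 1 nonzero eigenvalues out of n, any method that needs the inverse of this covariance matrix requires some form of dimension reduction or regularization first.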



2001-09-19