

Some statistical techniques and their use in microarrays

The goal of this section is to clarify which statistical technique is best suited to some of the questions that have appeared in the microarray literature. We start by proposing a non-exhaustive table that suggests a technique for each of several questions, and then give a brief description of the techniques with references.

I have observations from cell lines of different colon cancers: which genes are differentially expressed in the various types? Classification with genes as variables and cell lines as observations is a possibility.

I have expression levels of genes in different experiments (heat shock, etc.): which genes behave similarly? Clustering of the genes, using experiments as variables, is a possibility.

I have expression levels of genes at different moments of the cell cycle: which genes behave similarly? One can try clustering again, but there is a clear dependency among the variables (times in the cell cycle), and one should be careful with the definition of distances.

Also, one may discretize the stages of the cell cycle and use classification to answer the question: which genes, when I see them on (off), tell me that the cell is about to divide?

I have observations on different tumors in different organs: a few colon cells, a few lung cells, etc. I believe that some of these tumors are closer together (same biological mechanism) than others: how can I find an ordering of the tumors? Multidimensional scaling could help, as could clustering. In both cases, though, one should use the available information to keep close together tumors that are known to be the same.

ANOVA

ANOVA is the classical statistical technique for analyzing the outcome of experiments. It can be very valuable in microarray studies for assessing the sources of variation in the experiment and the portion of the variance that each of them explains.
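As an illustration of what "portion of the variance explained" means, here is a minimal sketch of a one-way decomposition for a single gene. The function name, the grouping by array batch, and the numbers are all hypothetical; a real microarray ANOVA would typically include several factors (array, dye, treatment, gene) at once.

```python
from statistics import mean

def one_way_anova(groups):
    """Split the total sum of squares of one gene's measurements into a
    between-group part and a within-group (residual) part."""
    all_values = [x for g in groups.values() for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2
                     for g in groups.values())
    # Variation of each measurement around its own group mean.
    ss_within = sum((x - mean(g)) ** 2
                    for g in groups.values() for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "fraction_explained": ss_between / (ss_between + ss_within),
        "f": (ss_between / df_between) / (ss_within / df_within),
    }

# Hypothetical log-ratios of one gene, grouped by the batch of the array.
log_ratios = {
    "batch_A": [1.0, 1.2, 0.8],
    "batch_B": [2.0, 2.2, 1.8],
}
result = one_way_anova(log_ratios)
print(result)
```

In this made-up example most of the variation is between batches, which would suggest that the batch, rather than the biology, drives the measurements of this gene.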

Discriminant Analysis / Classification

Suppose that one has a vector of labels y, with one label for each experiment. Examples are the tumor type in an experiment that analyzes different tumor cells; the tumor status in an experiment that compares tumor cells with normal ones; the phase of the cell cycle in an experiment comparing the same cell lines at different moments; the type of shock that the cell underwent. Then an interesting question is: which genes are most useful to discriminate between the various values of y, and how do I formulate such a classification rule? Answering it helps us identify genes that may be involved in the process that underlies the classification (for example, I could identify a set of genes that, when turned on, clearly signal that the cell is in meiosis, and so learn that these genes are implicated in meiosis). It also helps to predict the value of y for new cell lines for which we know only the expression pattern. This is particularly interesting in the case of tumor cells. The current criteria for classifying a tumor type in a given organ may require observing the response of the cancer to different treatments. It would be useful to be able to classify the tumor type of a new patient by looking at the gene expression pattern of the patient's cancer cells. If this is done successfully, the patient could immediately receive the best treatment.

Discriminant analysis and classification are classical problems in statistics, and a variety of tools are available. Fisher's linear or quadratic discriminant, generalized linear models, and trees are examples of basic techniques. Boosting and bagging are methods to improve the precision of a given classifier. Cross-validation is an important tool to assess the performance of any given discriminant rule.

Because of the special nature of array experiments, one may additionally need to resort to some of the techniques we discussed in the section on the curses of dimensionality. What we intend to stress here is the usefulness of this type of analysis and the opportunity to conduct it whenever the data are in the form described.
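To make the role of cross-validation concrete, here is a minimal sketch of leave-one-out cross-validation. The classifier is a simple nearest-centroid rule, chosen only because it fits in a few lines; it is a stand-in for, not an implementation of, the discriminant methods listed above. All names and expression values are hypothetical.

```python
from math import dist
from statistics import mean

def class_centroids(samples, labels):
    """Mean expression profile of each class in the training data."""
    out = {}
    for label in set(labels):
        members = [s for s, l in zip(samples, labels) if l == label]
        out[label] = [mean(column) for column in zip(*members)]
    return out

def predict(profile, centroids):
    """Assign a profile to the class whose centroid is nearest."""
    return min(centroids, key=lambda label: dist(profile, centroids[label]))

def leave_one_out_accuracy(samples, labels):
    """Hold out each sample in turn, fit the rule on the remaining
    samples, and check the prediction for the held-out one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        centroids = class_centroids(train_x, train_y)
        correct += predict(samples[i], centroids) == labels[i]
    return correct / len(samples)

# Hypothetical expression levels: rows are cell lines, columns are genes.
samples = [
    [2.0, 0.1, 1.0], [2.2, 0.0, 0.9], [1.9, 0.2, 1.1],   # tumor
    [0.1, 1.9, 1.0], [0.0, 2.1, 1.2], [0.2, 2.0, 0.8],   # normal
]
labels = ["tumor"] * 3 + ["normal"] * 3
accuracy = leave_one_out_accuracy(samples, labels)
print(accuracy)
```

Note that any step that uses the labels, such as selecting the most discriminating genes, would have to be repeated inside the loop on the training samples only; otherwise the estimated accuracy is optimistic.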

References

Multidimensional Scaling

It is often the case that experimenters would like to derive from microarray data an ordering of some variables of interest. For instance, the hybridizations that produced the data may involve cells from different tumors, and it may be believed that some of these cancer types are closer to each other than others. It is then of interest to gather information about such similarity from the expression patterns exhibited by genes in these different cancer cells. There are many ways of answering this question, but if the goal is a reordering of the cell lines that represents the distances among them, multidimensional scaling is one possibility worth exploring. If there is previous knowledge of some tumor types, this information should be taken into account in the analysis.

It is possible to obtain a similar reordering from clusters and trees, but these methods only ensure that observations that are adjacent in the proposed order are close; no meaning can be attached to the relative position of distant observations.
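The idea of an ordering that represents distances can be sketched with classical (metric) multidimensional scaling reduced to a single dimension. The sketch below double-centers the squared distances and extracts the leading eigenvector by power iteration. It assumes the distances are roughly Euclidean, so that the leading eigenvalue is positive; the tumor names and the distance matrix are hypothetical.

```python
import random
from math import sqrt

def mds_one_dimension(names, distance, iterations=200):
    """Classical MDS onto one axis: returns a coordinate per name."""
    n = len(names)
    sq = [[distance[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering; the matrix is symmetric, so column means
    # equal row means.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration from a fixed pseudo-random start vector.
    rng = random.Random(0)
    v = [rng.random() for _ in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm
    return {name: sqrt(eigenvalue) * x for name, x in zip(names, v)}

# Hypothetical dissimilarities between four tumors.
names = ["A", "B", "C", "D"]
distance = [
    [0, 3, 3, 2],
    [3, 0, 6, 1],
    [3, 6, 0, 5],
    [2, 1, 5, 0],
]
coords = mds_one_dimension(names, distance)
order = sorted(names, key=coords.get)
print(order)
```

The sign of an eigenvector is arbitrary, so the ordering may come out reversed. Unlike an ordering read off a cluster tree, the gaps between coordinates are meaningful here: distant tumors are placed far apart.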

References

Clustering

Clustering is the one statistical technique that needs no advertising in the microarray context, as it is the single most used instrument of analysis.

It is appropriate to cluster data when the observations are believed to be non-homogeneous (coming from different populations), there is no precise notion of how they could be separated into homogeneous groups, and one wishes to find such groups (that is, if there are labels identifying different classes, there is no need to cluster per se). There is a variety of clustering methods, based on different notions of distance between the observations and on different algorithms. In the case of microarrays, particular attention should be devoted to the definition of distance. For example, suppose that we want to cluster genes on the basis of the variation of their expression values during the cell cycle. Then one needs to decide whether two genes that have opposite behavior (one is on while the other is off, and vice versa) should be in the same cluster (as they probably have a similar regulatory mechanism) or in different clusters, and define the distance accordingly. Also, what about a gene that behaves exactly like another, but with a lag?
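The choices discussed above can be made concrete with three candidate distances, sketched here under the assumption that each gene is a list of expression values at equally spaced time points covering one full cycle. The function names and the profiles are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    """Opposite behavior counts as far apart (distance up to 2)."""
    return 1 - pearson(x, y)

def absolute_correlation_distance(x, y):
    """Opposite behavior counts as close (distance 0)."""
    return 1 - abs(pearson(x, y))

def lagged_distance(x, y, max_lag):
    """Smallest correlation distance over circular shifts of y by
    0..max_lag time points, treating the cell cycle as periodic."""
    return min(correlation_distance(x, y[k:] + y[:k])
               for k in range(max_lag + 1))

# Hypothetical profiles over eight time points of one cycle.
gene_a = [0, 1, 2, 1, 0, -1, -2, -1]
gene_b = [-v for v in gene_a]           # exactly opposite to gene_a
gene_c = gene_a[-2:] + gene_a[:-2]      # gene_a delayed by two time points

d_opposite = correlation_distance(gene_a, gene_b)
d_opposite_abs = absolute_correlation_distance(gene_a, gene_b)
d_lag_ignored = correlation_distance(gene_a, gene_c)
d_lag_allowed = lagged_distance(gene_a, gene_c, max_lag=2)
print(d_opposite, d_opposite_abs, d_lag_ignored, d_lag_allowed)
```

The same pair of genes is thus maximally distant or identical depending on the definition, which is why the distance should be chosen to match the biological question before any clustering algorithm is run.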

Inference with clusters is not easy: it is hard to determine how many clusters there are in the data and how likely it is that the assignment of each observation to a given cluster is correct. One may use bootstrap techniques to answer these questions.
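One possible bootstrap scheme, sketched below, resamples the experiments (columns) with replacement, reclusters the genes each time, and records how often each pair of genes lands in the same cluster. This is only one of several ways to bootstrap a clustering and is not taken from the references. The average-linkage routine, the function names, and the data are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from math import dist
from statistics import mean

def average_linkage(profiles, k):
    """Agglomerative clustering with average linkage, stopped at k clusters.
    Returns a list of clusters, each a list of row indices."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mean(dist(profiles[i], profiles[j])
                         for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)   # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return clusters

def co_clustering_frequency(profiles, k, n_boot=100, seed=0):
    """Fraction of bootstrap replicates in which each pair of genes
    falls in the same cluster. Pairs never together are absent."""
    rng = random.Random(seed)
    n_cols = len(profiles[0])
    together = Counter()
    for _ in range(n_boot):
        # Resample the experiments (columns) with replacement.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        resampled = [[p[c] for c in cols] for p in profiles]
        for members in average_linkage(resampled, k):
            for pair in combinations(sorted(members), 2):
                together[pair] += 1
    return {pair: count / n_boot for pair, count in together.items()}

# Hypothetical data: rows are genes, columns are experiments.
profiles = [
    [5.1, 4.9, 5.2, 5.0], [4.8, 5.0, 5.1, 5.2], [5.0, 5.1, 4.9, 4.8],
    [0.1, -0.1, 0.2, 0.0], [-0.2, 0.0, 0.1, 0.2], [0.0, 0.1, -0.1, -0.2],
]
freq = co_clustering_frequency(profiles, k=2)
print(freq)
```

Pairs with a frequency near 1 are assigned together stably, while intermediate values flag genes whose cluster membership depends on which experiments happen to be in the sample. Repeating the exercise for several values of k gives an informal view of how many clusters the data support.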

Microarrays are special in that, as we described in the introduction, the statistical notions of "observations" and "variables" are interchangeable. With reference to the data matrix introduced in section 1, one may want to cluster with respect to both rows and columns. To do this, one may cluster each independently (as done in Eisen) or cluster them jointly. The latter is more complicated: the following section contains references to a method of Hartigan for doing so and to two methods explicitly developed for the microarray context (see Lazzeroni and Hastie).

References


2001-09-19