English Pages 13 Year 2003
Biostatistics (2003), 4, 4, pp. 555–567 Printed in Great Britain
Multivariate exploratory tools for microarray data analysis

ANIKO SZABO†, KENNETH BOUCHER, DAVID JONES, ALEXANDER D. TSODIKOV
Huntsman Cancer Institute and Department of Oncological Sciences, University of Utah, 2000 Circle of Hope, Salt Lake City, UT 84112-5550, USA
[email protected]

LEV B. KLEBANOV
Department of Probability and Statistics, Charles University, Sokolovska 83, Praha-8, CZ-18675, Czech Republic

ANDREI Y. YAKOVLEV
Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Avenue, Box 630 Rochester, NY 14642, USA and Huntsman Cancer Institute and Department of Oncological Sciences, University of Utah, 2000 Circle of Hope, Salt Lake City, UT 84112-5550, USA

SUMMARY

The ultimate success of microarray technology in basic and applied biological sciences depends critically on the development of statistical methods for gene expression data analysis. The most widely used tests for differential expression of genes are essentially univariate. Such tests disregard the multidimensional structure of microarray data. Multivariate methods are needed to utilize the information hidden in gene interactions and hence to provide more powerful and biologically meaningful methods for finding subsets of differentially expressed genes. The objective of this paper is to develop methods of multidimensional search for biologically significant genes, considering expression signals as mutually dependent random variables. To attain these ends, we consider the utility of a pertinent distance between random vectors and its empirical counterpart constructed from gene expression data. The distance furnishes exploratory procedures aimed at finding a target subset of differentially expressed genes. To determine the size of the target subset, we resort to successive elimination of smaller subsets resulting from each step of a random search algorithm based on maximization of the proposed distance. Different stopping rules associated with this procedure are evaluated. The usefulness of the proposed approach is illustrated with an application to the analysis of two sets of gene expression data.

Keywords: Cross-validation; Differential expression; Permutation test; Probability distance; Random search; Sets of genes.
† To whom correspondence should be addressed.
© Oxford University Press; all rights reserved. Biostatistics 4(4)

1. INTRODUCTION

With its potential to quantitatively measure expression levels of a large number of genes in parallel, microarray technology holds the promise of becoming an extremely valuable tool in basic biological
sciences and clinical diagnostics. However, the ultimate usefulness of the technology will depend critically on whether or not the search for efficient statistical methods meets with success. One of the most popular uses of microarray data is the identification of those genes that may be responsible for differences between cell types (genotyping cell lines) or functional states of a cell. The genetic profile of a tissue determines its properties, which is why different tissues are expected to have different gene expression patterns. As demonstrated by various clustering methods of gene expression vectors (Khan et al., 1998), tissues of the same histological origin tend to cluster together. Ross et al. (2000) came to a similar conclusion when clustering 60 NCI cell lines. Alizadeh et al. (2000) used hierarchical clustering to demonstrate the existence of two previously unknown genetically different subtypes of B-cell lymphoma that carry significantly different prognosis in terms of patients’ survival. A closely related application is the identification of subtypes of known diseases. This includes both improving the existing classification into known classes and the discovery of new/unknown sub-classes that are clinically significant. While clustering techniques are useful in providing insights into interactions between different genes or finding genetically similar patterns, these techniques do not answer the question many researchers are interested in: which genes are expressed differently in the tissues under comparison? One typical example of such a setting is represented by the well-known leukemia data set (Golub et al., 1999) that includes 27 ALL (acute lymphoblastic leukemia) and 11 AML (acute myeloid leukemia) samples processed using Affymetrix oligonucleotide microarrays. A test set of 34 samples is also available. The problem in question is to find those genes that are responsible for the distinction between the two types of leukemia. 
Multiple methods for selection of differentially expressed genes have been proposed, from using a fixed cutoff for the ratio to using various parametric and non-parametric measures of differential expression (Kerr and Churchill, 2001; Kerr et al., 2000; Newton et al., 2000; van der Laan and Bryan, 2001; Ben-Dor et al., 2000, 2001). A characteristic feature of these methods is the univariate nature of the decision to include a particular gene in the target set. The well-known complexity of interactions between gene functions in a cell strongly suggests that a search for methods that utilize multivariate information on gene expression signals is warranted. In a recent paper (Szabo et al., 2002), we introduced a multivariate method for finding a subset of differentially expressed genes of a given size. In this paper, we further develop this methodology to approach the problem of determining the size of a target set of differentially expressed genes. The first step for developing multivariate methods is to define a measure of distance between two tissue types that is based on a set of genes; in Section 3 we review a novel measure proposed by Szabo et al. (2002) and contrast it with classical alternatives. Once a measure of differential expression of gene sets is selected, sets of genes that exhibit an ‘unusually large’ distance will be considered to be differentially expressed. Thus one needs a method to find the gene-sets for which the distance is large and a method to determine whether this distance is larger than expected by chance. The random search methodology designed for finding the set with the maximal distance is described in Section 4 and a resampling based approach for finding a cutoff for ‘significant’ differential expression is developed in Section 5. The performance of the method is demonstrated via computer simulations and in Section 6 we re-analyze our motivating data that is described in detail in the next section. 
Finally, Section 7 presents an example where large marginal differences dominate the data and thus univariate approaches are expected to perform well.
2. MOTIVATING EXAMPLE: INHOMOGENEOUS DATA SET

Fig. 1. An artificial example of heterogeneous groups (scatter plot with axes Gene A and Gene B). Gray and white circles represent samples from the two groups, respectively.

In many applications the groups to be separated are heterogeneous, whether it is known to the investigator or not. In such cases the goal of the investigator is twofold: find genes that explain the separation of the data into the known categories and also discover the unknown sub-categories. While the
second objective might appear unrelated to the first, a simple hypothetical example presented in Figure 1 shows how knowledge of subgroups can enhance classification. In this example the ‘white’ group has two subgroups each of which is similar to the ‘gray’ group in some respect: one has the same expression of Gene A, the other has the same expression of Gene B, thus neither gene separates the ‘white’ group from the ‘gray’ one by itself. However, jointly they can not only separate the two groups, but also indicate the two ‘white’ subgroups. The case in point is the leukemia data set of Golub et al. (1999) described in the Introduction. The ALL group is a mixture of two clinically recognizably different T-cell and B-cell subtypes. When analyzing this data set, the authors selected those genes that are individually highly correlated with the known classification and then used a voting procedure for classification of new samples. These techniques are essentially univariate. The authors produced a list of candidate genes to which the difference between the two types of leukemia can be attributed. Our recent analysis (Szabo et al., 2002) allowing for a more general multivariate data structure has resulted in a dissimilar list of ten differentially expressed genes for the same data set. Some of these genes were T-cell specific, thus they segregate only a part of the inhomogeneous ALL group; they were found due to the multivariate approach. Another example discussed in the present paper shows that the multivariate and univariate approaches may produce quite similar results in some settings. The method used by Szabo et al. (2002) is described in Sections 3 and 4; briefly, it attempts to find the gene-set of a pre-specified size that best separates two tissues. Two shortcomings of that method are that the number of the selected genes has to be specified in advance and no assessment of statistical significance was proposed. 
Thus while the selected genes ‘made sense’ and provided reasonable test-set classification rates, it is unclear how many other genes are differentially expressed. The present paper offers a solution to this problem.

3. QUANTITATIVE MEASURES OF DIFFERENTIAL EXPRESSION

The set of microarray data on p distinct genes represents a random vector $X = (X_1, \ldots, X_p)$ with mutually dependent components. The dimension of X is extremely high relative to the number of observations (replicates of experiments). This unusual feature of microarray data prevents variable selection based on conventional discriminant analysis techniques using X as the feature vector. Therefore, one needs to limit the analysis of gene expression to subvectors of much lower dimension than that of X. In addition to this ‘curse of dimensionality’, the problem of microarray data analysis is complicated by the presence of experimental noise in the data. There are many sources of the observed noise so that some of the errors are likely to be additive (background), some others are multiplicative (dye incorporation, fluorescence efficiency, spot size), while saturation effects have a non-linear form. This hampers the use
of statistical models with an explicitly specified noise structure for data adjustment (or normalization). Another method of noise reduction in microarray data consists in data categorization (see Tsodikov et al., 2002; Szabo et al., 2002; Chilingaryan et al., 2002 for discussion). One way of doing this is to replace the raw expression measurements with their fractional rank within each slide; this idea will be employed in Sections 6 and 7. Briefly, let $X_{ij}$ represent measurements of fluorescent intensity for gene $i = 1, \ldots, p$ on slide $j = 1, \ldots, n$. Then the fractional rank of gene $i$ is defined as $X_{ij}^{(r)} = (\mathrm{rank}_j X_{ij})/p$, where $\mathrm{rank}_j u_{ij}$ is the place (counted from the left) of $u_{ij}$ in the sequence $u_{kj}$, $k = 1, \ldots, p$, arranged in decreasing order for each $j$. Notwithstanding inevitable loss of information (see Szabo et al., 2002, for more discussion), categorical adjustments have proven to be very useful in the analysis of differential expression of individual genes (Tsodikov et al., 2002) and sets of genes (Szabo et al., 2002; Chilingaryan et al., 2002). For comparison, we also used the cube-root adjustment recommended for use with Affymetrix data with SAM (Tusher et al., 2001).

3.1 A distance between vectors of expression signals

To compare expression signals in two different tissues (or states) we need a pertinent distance between two random vectors. This distance is expected to satisfy the following requirements: (1) its empirical counterpart should allow for combining information from different slides; (2) it should accommodate ranks and categorical data (thus should not necessarily assume normality); (3) its estimate should be stable to random fluctuations and numerical errors; (4) its computation should not be too time consuming. Szabo et al. (2002) proposed a new distance and its nonparametric estimate to measure differential expression for sets of genes. This distance meets the above requirements.
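The fractional rank adjustment defined earlier in this section can be illustrated with a short Python sketch. It is only an illustration under assumptions that are not part of the text: the function name `fractional_ranks` is hypothetical, a slide is represented as a plain list of intensities, and ties are broken by gene index.

```python
def fractional_ranks(slide):
    """Return rank_j(X_ij) / p for every gene i on one slide.

    Ranks are places in the sequence sorted in decreasing order, as in the
    definition in the text; breaking ties by gene index is an assumption.
    """
    p = len(slide)
    # Gene indices ordered from the largest to the smallest intensity.
    order = sorted(range(p), key=lambda i: slide[i], reverse=True)
    ranks = [0.0] * p
    for place, gene in enumerate(order, start=1):
        ranks[gene] = place / p
    return ranks


# Hypothetical intensities for p = 4 genes on a single slide.
print(fractional_ranks([120.0, 15.5, 980.2, 47.0]))
```

Applying the function to each slide separately gives values in (0, 1] that depend only on the ordering of intensities within that slide.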
Let $\mu$ and $\nu$ be two probability measures defined on the Euclidean space $\mathbb{R}^d$. Let $L(x, y)$ be a real-valued measurable function, and introduce the following expression:

$$N(\mu, \nu) = 2\int_{\mathbb{R}^d}\int_{\mathbb{R}^d} L(x, y)\,d\mu(x)\,d\nu(y) - \int_{\mathbb{R}^d}\int_{\mathbb{R}^d} L(x, y)\,d\mu(x)\,d\mu(y) - \int_{\mathbb{R}^d}\int_{\mathbb{R}^d} L(x, y)\,d\nu(x)\,d\nu(y). \tag{3.1}$$

It can be shown (Zinger et al., 1989) that $\sqrt{N(\mu, \nu)}$ is a metric in the space of all probability measures on $\mathbb{R}^d$ if and only if the kernel $L(x, y)$ is strictly negative definite, that is, $\sum_{i,j=1}^{s} L(x_i, x_j) h_i h_j \le 0$ for any $x_1, \ldots, x_s$ and $h_1, \ldots, h_s$ with $\sum_{i=1}^{s} h_i = 0$, with equality if and only if all $h_i = 0$. Therefore, to obtain a universally applicable kernel (not dependent on the unknown measures $\mu$ and $\nu$), $L$ has to be strictly negative definite.

Consider two independent samples, consisting of $n_1$ and $n_2$ observations respectively, represented by the d-dimensional vectors $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$, and introduce an empirical counterpart of $N(\mu, \nu)$ as follows:

$$\hat{N} = N(\hat{\mu}_{n_1}, \hat{\nu}_{n_2}) = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} 2L(x_i, y_j) - \frac{1}{n_1^2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_1} L(x_i, x_j) - \frac{1}{n_2^2}\sum_{i=1}^{n_2}\sum_{j=1}^{n_2} L(y_i, y_j). \tag{3.2}$$
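As a rough illustration of the estimate (3.2), the following Python sketch evaluates it for two small samples. The function names and the toy data are hypothetical, and the Euclidean distance is assumed as the kernel L; the sketch is not the authors' implementation.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors (assumed kernel L)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_kernel(A, B, kernel):
    """Average of kernel(a, b) over all pairs a in A, b in B."""
    return sum(kernel(a, b) for a in A for b in B) / (len(A) * len(B))


def n_hat(X, Y, kernel=euclidean):
    """Empirical distance (3.2) between two samples of d-dimensional vectors."""
    return (2.0 * mean_kernel(X, Y, kernel)
            - mean_kernel(X, X, kernel)
            - mean_kernel(Y, Y, kernel))


# Two hypothetical one-dimensional samples.
X = [[0.0], [2.0]]
Y = [[1.0], [3.0]]
print(n_hat(X, Y))   # distance between the two samples
print(n_hat(X, X))   # a sample compared with itself
```

The double sums include the pairs with i = j, exactly as written in (3.2), so the cost grows quadratically with the number of slides but only linearly with the number of genes in the subset.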
When using the distance $\sqrt{N(\mu, \nu)}$ one needs to choose a pertinent (strictly negative definite) kernel $L$. Szabo et al. (2002) discussed techniques for constructing such kernels in detail; in this paper we use only the Euclidean kernel $L(x, y) = \sqrt{\sum_{g \in S} (x_g - y_g)^2}$, where $S$ denotes the set of the genes considered.

3.2 Comparison with other distance measures

Fig. 2. The sampling distribution of the Mahalanobis distance and N between two multivariate normal distributions (p = 4, n = 5, ρ = 0.3, see text); density curves for the distance N and the Mahalanobis distance. ‘N’ and ‘M’ denote the true value of the distances for distance N and the Mahalanobis distance, respectively.

Mahalanobis distance. The estimate of the distance $\sqrt{N}$ is nonparametric and does not involve numerically unstable high-dimensional parts, thus it is expected to be numerically stable even for small
sample sizes. A commonly used parametric measure of separation of two samples is the Mahalanobis distance

$$R^2_{\mathrm{Mah}} = (\bar{x} - \bar{y})' \left(\frac{\Sigma_x + \Sigma_y}{2}\right)^{-1} (\bar{x} - \bar{y}), \tag{3.3}$$

where $\bar{x}$ and $\bar{y}$ are the sample means and $\Sigma_x$, $\Sigma_y$ are the two sample variance–covariance matrices. We used simulation to compare the stability of the Mahalanobis distance and N when estimating the distance between two p-variate normal distributions with means $(0, \ldots, 0)'$ and $(1, \ldots, 1)'$ and exchangeable correlation ρ based on two samples of size n. The coefficient of variation of N is smaller than that of the Mahalanobis distance for all tested values of p = 2, . . . , 10, n = p + 1, . . . , 100 and ρ = 0, 0.3, 0.8. Typical sampling distributions are shown in Figure 2. We also found that both estimators are biased upwards; however, the bias of the Mahalanobis distance is much larger.

Distance to the nearest neighbor. The following example demonstrates the superior stability of the distance $\sqrt{N}$ as compared to another non-parametric distance measure based on nearest neighbors. We consider classifying samples into one of two bivariate normal distributions $N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}\right)$ and $N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}\right)$. Note that any distance measure based on the sample mean, such as the Mahalanobis distance, will fail in this situation. Thus we compared N with the nearest-neighbor distance and with the ‘true state’ based on the density functions. We generated 300 data points from both distributions, denoting these samples $D_1$ and $D_2$ respectively, and 2500 test points forming a uniform 50 × 50 grid. For each test point we calculated the distance to $D_1$ and $D_2$ and classified the point into the ‘closest’ group; as a measure of the certainty of this decision we considered the difference between the two distances. Plots of these differences are shown in Figure 3.
Comparing panels (a) and (b) to the difference of the true densities shown in panel (c), we see that the distance N produces results quite close to the fully informed method, without actually assuming normality. The nearest-neighbor approach is very unstable, as demonstrated by the roughness of the surface, and provides highly certain classification far from the actual support of the distributions (shown as two thick contours).

Fig. 3. (a) Difference between distances to $D_2$ and $D_1$ for the proposed distance N; (b) difference between the distances to nearest neighbor from $D_2$ and $D_1$; (c) difference of the ‘true’ density functions. Positive values (that is points classified into $D_1$) are shaded gray. The thick contour lines are the projections of the 95% ellipsoid quantiles of $D_1$ and $D_2$.

4. RANDOM SEARCH FOR THE TARGET SUBSET OF A GIVEN SIZE

Suppose two samples of microarray data are available, each sample having been drawn from a distinct class of biological specimens such as tissues of two different types. Once a multivariate distance between
expression signals has been selected, it can be employed in a search for differentially expressed genes with the target subset of genes being defined as a subset for which the distance between the two classes attains its maximum. Ideally, all subsets of a given size should be evaluated in terms of the adopted distance and the one that provides a maximum should be chosen. However, the number of possible subsets exponentially increases with the total number of genes, and subsequently the exhaustive search procedures as well as the branch-and-bound method (Fukunaga, 1990) become computationally prohibitive. In such a situation, stepwise procedures seem to be an indispensable aid to variable selection. For all practical purposes, the issue of computational complexity can be resolved by applying random search methodology. Random search can be designed in a number of various ways. A simple algorithm for finding a subset of predetermined size k with the largest distance between two classes (tissues) (Szabo et al., 2002; Chilingaryan et al., 2002) is presented in box (A1) and used for data analysis in Sections 6 and 7. In our study, the number M was set at 100 000.
Random search (A1)
1. Randomly select k genes to form the initial approximation; calculate the distance between the two classes for this subset (cluster) of genes.
2. Replace at random one gene from the current cluster by a gene from outside the cluster; calculate the distance for this new cluster.
3. If the distance for the new cluster is larger than for the original cluster (improvement), keep the change, otherwise revert to the previous cluster.
4. Repeat the process until a predetermined number, M, of steps is reached.
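The steps of box (A1) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names, the deterministic toy data and the small number of steps are assumptions made for the example, samples are stored as lists of per-gene values, and k is assumed to be smaller than the number of genes.

```python
import math
import random


def n_hat(X, Y, genes):
    """Empirical distance (3.2), Euclidean kernel restricted to `genes`."""
    def avg(A, B):
        return sum(math.sqrt(sum((a[g] - b[g]) ** 2 for g in genes))
                   for a in A for b in B) / (len(A) * len(B))
    return 2.0 * avg(X, Y) - avg(X, X) - avg(Y, Y)


def random_search(X, Y, k, n_steps, rng):
    """Box (A1): accept a random one-gene swap only if the distance grows."""
    p = len(X[0])
    current = rng.sample(range(p), k)            # step 1: random initial subset
    best = n_hat(X, Y, current)
    for _ in range(n_steps):                     # step 4: fixed number of steps
        outside = [g for g in range(p) if g not in current]
        candidate = list(current)
        candidate[rng.randrange(k)] = rng.choice(outside)   # step 2: swap
        d = n_hat(X, Y, candidate)
        if d > best:                             # step 3: keep improvements only
            current, best = candidate, d
    return sorted(current), best


def toy_data(n=6, p=8, shifted=(0, 1), shift=5.0):
    """Deterministic toy samples: only the `shifted` genes differ between classes."""
    X = [[0.1 * ((7 * i + 3 * g) % 5) for g in range(p)] for i in range(n)]
    Y = [[X[i][g] + (shift if g in shifted else 0.0) for g in range(p)]
         for i in range(n)]
    return X, Y


X, Y = toy_data()
genes, distance = random_search(X, Y, k=2, n_steps=500, rng=random.Random(1))
print(genes, round(distance, 3))
```

In this toy setting the two shifted genes are expected to be recovered; real data would need a far larger number of steps, such as the M = 100 000 used in the paper.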
When selecting a subset of genes to provide the best discrimination between two classes, it is easy to come up with over-optimistic conclusions as a result of over-fitting, that is, finding overly specific patterns that do not extend to new samples. Whenever a small number of variables is selected from a large set, one should expect a selection bias associated with choosing the optimal of a large number of subsets, regardless of the criterion used. Cross-validation techniques provide a powerful tool for reducing this selection bias. Ganeshanandam and Krzanowski (1989) suggested that cross-validation should precede the variable selection itself. Resorting to this idea and the well-known methodology of v-fold cross-validation
(see, for example, Breiman et al., 1984), we use a ‘cross-validated search’ procedure that checks for reproducibility of its results. The basic structure of the algorithm is presented in box (A2).
Cross-validated search for differentially expressed genes (A2)
1. Randomly divide the data into v groups of nearly equal size.
2. Drop one of the parts and find the optimal (in accordance with the chosen criterion) subset of genes using only the data from v − 1 groups. Algorithm (A1) can be used, for example.
3. Repeat step 2 in succession for each of the groups, obtaining v ‘optimal’ sets.
4. Combine these sets by selecting the genes with the highest frequencies of occurrence.
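A possible reading of box (A2) is sketched below in Python. Several details are assumptions rather than part of the description: each class is split into v groups separately, the k most frequent genes are returned, ties in frequency are broken by gene index, and all function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def n_hat(X, Y, genes):
    """Empirical distance (3.2), Euclidean kernel restricted to `genes`."""
    def avg(A, B):
        return sum(math.sqrt(sum((a[g] - b[g]) ** 2 for g in genes))
                   for a in A for b in B) / (len(A) * len(B))
    return 2.0 * avg(X, Y) - avg(X, X) - avg(Y, Y)


def random_search(X, Y, k, n_steps, rng):
    """Random search of box (A1) over all genes."""
    p = len(X[0])
    current = rng.sample(range(p), k)
    best = n_hat(X, Y, current)
    for _ in range(n_steps):
        outside = [g for g in range(p) if g not in current]
        candidate = list(current)
        candidate[rng.randrange(k)] = rng.choice(outside)
        d = n_hat(X, Y, candidate)
        if d > best:
            current, best = candidate, d
    return sorted(current), best


def split_folds(n, v, rng):
    """Step 1: randomly split indices 0..n-1 into v groups of nearly equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [set(idx[f::v]) for f in range(v)]


def cross_validated_search(X, Y, k, v, n_steps, rng):
    folds_x = split_folds(len(X), v, rng)
    folds_y = split_folds(len(Y), v, rng)
    counts = Counter()
    for drop_x, drop_y in zip(folds_x, folds_y):     # steps 2 and 3
        X_train = [x for i, x in enumerate(X) if i not in drop_x]
        Y_train = [y for i, y in enumerate(Y) if i not in drop_y]
        genes, _ = random_search(X_train, Y_train, k, n_steps, rng)
        counts.update(genes)
    # Step 4: most frequently selected genes first (ties by index, an assumption).
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:k], counts


# Deterministic toy data: genes 0 and 1 are shifted in the second class.
X = [[0.1 * ((7 * i + 3 * g) % 5) for g in range(8)] for i in range(6)]
Y = [[x[g] + (5.0 if g < 2 else 0.0) for g in range(8)] for x in X]
selected, counts = cross_validated_search(X, Y, k=2, v=3, n_steps=300,
                                          rng=random.Random(2))
print(selected, dict(counts))
```

The frequency table returned alongside the selected genes gives a simple view of how reproducible each gene is across the v reduced data sets.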
The performance of this algorithm was evaluated by Szabo et al. (2002). An alternative idea (Chilingaryan et al., 2002) is to search for multiple local maxima by stopping the random search (A1) before it could find the global maximum; in practice, it can be done by selecting a relatively small M = 500 or 1000. Then the genes are listed in the order of the frequency of their occurrence in these suboptimal sets.

5. FORMING LARGER TARGET SUBSETS OF GENES

Technically, the estimate $\hat{N}$ can be computed for vectors of any dimension. However, it is natural to limit the dimension of the sought-for subset of genes by the number of available training samples (microarray slides). The reason for this limitation is that the variance of $\hat{N}$ increases with dimension of the vectors under comparison. In many settings, the above-mentioned restriction is difficult to meet, because the number of differentially expressed genes is expected to be quite large, as in the example discussed in Section 7. Consider a typical study of differences in genetic profiles of two tissues or cell types. Since algorithm (A2) described in Section 4 is aimed at finding a subset of differentially expressed genes of a given size, it would be desirable to find a way of extending the set of separating genes, while maintaining all the benefits of the multivariate approach to the problem. We propose a successive selection procedure that eliminates groups of genes from the data after each run of the search algorithm and keeps doing so until no more subsets of differentially expressed genes can be found. Then the removed genes are the ones responsible for the difference between the tissues, so we declare all of them to be differentially expressed. The idea of discarding ‘interesting’ genes obtained at each step of a selection procedure and then repeating the selection procedure to find additional genes is somewhat similar to the approach by Hastie et al. (2000).
To formulate a stopping rule, one needs to determine the properties of an optimal set of genes in a ‘no-difference’ data set. Since our optimality criterion is based on the multivariate distance (3.1), the covariance structure of the particular data is expected to influence the selection process. Thus the ‘no-difference’ baseline data have to be generated in a way that preserves this structure as closely as possible. In box (A3) we describe an algorithm that achieves this goal: the first step of the algorithm ensures that the two hypothetical tissues have the same marginal means, and the second step mimics the biological variability through permutation. We denote the adjusted fluorescence level for gene $i$, $i = 1, \ldots, p$, in the two tissues by $X_{ij}$, $j = 1, \ldots, n_1$, and $Y_{ij}$, $j = 1, \ldots, n_2$, respectively. The average adjusted expression levels for each gene are $\bar{X}_{i\cdot} = \sum_{j=1}^{n_1} X_{ij}/n_1$ and $\bar{Y}_{i\cdot} = \sum_{j=1}^{n_2} Y_{ij}/n_2$.

Resampling (A3)
1. For each gene $i$, $i = 1, \ldots, p$, shift the values from the tissues so they are centered at the overall mean for this gene, that is
$$X^*_{ij} = X_{ij} - \bar{X}_{i\cdot} + \frac{n_1 \bar{X}_{i\cdot} + n_2 \bar{Y}_{i\cdot}}{n_1 + n_2}, \qquad Y^*_{ij} = Y_{ij} - \bar{Y}_{i\cdot} + \frac{n_1 \bar{X}_{i\cdot} + n_2 \bar{Y}_{i\cdot}}{n_1 + n_2}.$$
2. Randomly permute the resulting $n_1 + n_2$ vectors. The first $n_1$ and the last $n_2$ vectors provide a random sample from the null-distribution.

Based on the permutation resampling scheme (A3), the null-distributions of various quantitative characteristics of the optimal gene-set can be estimated. In particular, we will focus on the associated distance, leave-one-out cross-validated classification rate and test set classification rate. In applications, the test set classification rate is probably the gold standard. However, due to the scarcity of samples an
appropriate test set is often difficult to provide. We conducted a simulation study and demonstrated that the between-tissue distance associated with gene sets is a good and stable proxy for the classification rate. A description of the simulations can be found as a Supplemental Material on the journal’s website. In box (A4) we describe a distance-based procedure for selecting a subset of genes that are differentially expressed in two tissues; the successive selection procedures based on cross-validation or test sample classification rates have been designed along similar lines. The procedure requires the size k of the building-block clusters and significance level α as inputs.
Main algorithm: Successive selection of subsets of genes (A4)
1. Form (using e.g. (A3)) m independent permutation samples of sizes $n_1$ and $n_2$, respectively, from $n_1 + n_2$ observations (slides). For each of the m permutation samples find (using e.g. (A2)) an optimal k-element set for which the associated distance attains its maximum. Estimate from the permutation samples the top αth percentile $D_\alpha$ of the baseline distribution of the optimal distance (referred to as the null-distribution).
2. Returning to the original two-sample setting, find (using e.g. (A1)) the k-element optimal set of genes and denote it by $G_1$. If the associated distance $D(G_1) > D_\alpha$, then continue, otherwise declare no differentially expressed genes.
3. In the ℓth iteration, discard sets $G_1, \ldots, G_{\ell-1}$ and find the k-element optimal set $G_\ell$ from the remaining genes. If the associated distance $D(G_\ell) > D_\alpha$, then continue with this step (next iteration), otherwise proceed to step 4.
4. The union $\bigcup_{j=1}^{\ell-1} G_j$ defines the set of those genes that are differentially expressed in the two tissues.
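The resampling scheme (A3) and the successive selection (A4) can be combined in a compact Python sketch. It is a simplified illustration under stated assumptions: the plain random search (A1) is used for the permutation samples as well as for the original data, $D_\alpha$ is taken as the empirical (1 − α) quantile of the m null maxima, and the function names, toy data and small values of m and the step count are hypothetical.

```python
import math
import random


def n_hat(X, Y, genes):
    """Empirical distance (3.2), Euclidean kernel restricted to `genes`."""
    def avg(A, B):
        return sum(math.sqrt(sum((a[g] - b[g]) ** 2 for g in genes))
                   for a in A for b in B) / (len(A) * len(B))
    return 2.0 * avg(X, Y) - avg(X, X) - avg(Y, Y)


def random_search(X, Y, pool, k, n_steps, rng):
    """Random search (A1) restricted to the candidate genes in `pool`."""
    current = rng.sample(pool, k)
    best = n_hat(X, Y, current)
    for _ in range(n_steps):
        outside = [g for g in pool if g not in current]
        if not outside:                      # nothing left to swap in
            break
        candidate = list(current)
        candidate[rng.randrange(k)] = rng.choice(outside)
        d = n_hat(X, Y, candidate)
        if d > best:
            current, best = candidate, d
    return sorted(current), best


def null_sample(X, Y, rng):
    """Box (A3): center each gene at its overall mean, then permute the slides."""
    n1, n2, p = len(X), len(Y), len(X[0])
    xbar = [sum(x[i] for x in X) / n1 for i in range(p)]
    ybar = [sum(y[i] for y in Y) / n2 for i in range(p)]
    mean = [(n1 * xbar[i] + n2 * ybar[i]) / (n1 + n2) for i in range(p)]
    pooled = ([[x[i] - xbar[i] + mean[i] for i in range(p)] for x in X]
              + [[y[i] - ybar[i] + mean[i] for i in range(p)] for y in Y])
    rng.shuffle(pooled)
    return pooled[:n1], pooled[n1:]


def successive_selection(X, Y, k, alpha, m, n_steps, rng):
    """Box (A4): remove optimal k-sets while their distance exceeds D_alpha."""
    genes = list(range(len(X[0])))
    null_best = []
    for _ in range(m):                                       # step 1
        Xs, Ys = null_sample(X, Y, rng)
        null_best.append(random_search(Xs, Ys, genes, k, n_steps, rng)[1])
    null_best.sort()
    idx = min(m - 1, max(0, math.ceil((1.0 - alpha) * m) - 1))
    d_alpha = null_best[idx]                 # assumed quantile convention
    selected, remaining = [], genes
    while len(remaining) >= k:                               # steps 2 and 3
        subset, d = random_search(X, Y, remaining, k, n_steps, rng)
        if d <= d_alpha:
            break
        selected.extend(subset)
        remaining = [g for g in remaining if g not in subset]
    return sorted(selected), d_alpha                         # step 4


# Deterministic toy data: genes 0 and 1 are shifted in the second class.
X = [[0.1 * ((7 * i + 3 * g) % 5) for g in range(8)] for i in range(6)]
Y = [[x[g] + (5.0 if g < 2 else 0.0) for g in range(8)] for x in X]
selected, d_alpha = successive_selection(X, Y, k=2, alpha=0.05, m=10,
                                         n_steps=200, rng=random.Random(0))
print(selected, round(d_alpha, 3))
```

Because the cutoff is computed once from the full gene set and reused in every iteration, the loop stops as soon as the best remaining k-set is no more separated than what permutation alone tends to produce.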
6. APPLICATION TO LEUKEMIA DATA

Finally, we return to our motivating example, the leukemia data set of Golub et al. (1999). While we consider the ALL vs AML comparison, we know that the ALL group is not homogeneous: it is a mixture of the clinically recognizably different T-cell and B-cell subtypes. In this data set this sub-classification is known for each sample; however, a similar situation with inhomogeneous disease with unknown subtypes can easily be imagined. Thus in the ALL vs AML comparison the difference between the two groups has a more complicated structure than marginal difference.

Fig. 4. Properties of the subsequent optimal sets $G_i$ of size 5 in the leukemia data set (horizontal axis: sets of size 5). On the left axis: Dist, associated distance between the tissues. On the right axis: CV, cross-validation rate; Class, test set classification rate. The thin dashed lines represent the estimated rates; the solid lines represent their isotonically smoothed counterparts; the thick dashed lines show the cutoff based on the estimated 99th percentile of the corresponding null-distribution. The last set above the cutoff is marked with a gray diamond.

With the rank-adjusted leukemia data we carried out successive searches for five-member optimal gene sets (k = 5) using the 10-fold (v = 10) cross-validated search algorithm (A2). The Euclidean distance was chosen for the kernel L(x, y) in the distance measure (3.1). For each of the successive optimal sets $G_i$, we recorded the corresponding optimal distance (denoted by Dist) and estimated the tissue classification
rate using both leave-one-out cross-validation based on the selected gene set (CV) and the independent test set (Class). The results are presented in Figure 4. Both estimates of the classification rate (CV and Class, thin dashed lines in the figure) are highly variable, while the distance (Dist) is decreasing monotonically. As the optimal sets were selected according to the distance, the observed monotonicity confirms the ability of the basic algorithm to find an optimal subset. To reduce the observed variability of the classification rate estimates we assume the true rates to be non-increasing and apply isotonic regression (Robertson et al., 1988) to smooth the corresponding curves (solid lines in the figure). The dotted lines represent the level of the 99th percentile of the null-distribution of the corresponding measure; they were estimated by generating m = 300 random permutation samples that mimic ‘no-difference’ data in accordance with algorithm (A3). The iterative search procedure (A4) with sets of size k = 5 at α = 0.01 selects 16 groups for a total of 80 genes. The first three sets of five genes are listed in Table 1; the entire list can be viewed at the journal’s web site. If the stopping were based on cross-validation or test-set classification rate, the procedure would have stopped earlier, after five or seven sets. However, the high variability of these measures and the reported problems with the test set (it was obtained from different institutions and some of them are pediatric cases) makes stopping according to the classification rate less desirable. The results obtained with the above-described multivariate procedures are worth comparing with a univariate selection of differentially expressed genes. 
For the first comparison we have sorted the genes according to the value of the corresponding (marginal) t-statistics and selected the top 16 × 5 = 80 of them, so that the number of genes coincides with that selected by the multivariate distance-based cutoff criterion. We found that the two lists have only 34 (42%) genes in common. The Significance Analysis of Microarrays (SAM) of Tusher et al. (2001) is another commonly used univariate approach, so we have run SAM (Chu et al., 2002) on both the rank-adjusted leukemia data and on the data adjusted according to the recommendation of Tusher et al. (2001). The latter approach uses linear calibration on a cube-root scatter plot of each sample against the average of all samples, followed by transformation to the original scale. As SAM focuses on estimating the false discovery rate (FDR) for each cutoff of the score and not on controlling type I error, no well-defined procedure exists for the selection of a final gene set. Our rule for selecting the cutoff was to minimize the estimated false discovery rate: the minimum was at 0.47% with 143 genes selected and 0.65% with 105 genes for the rank and cube-
564
A. S ZABO ET AL. Table 1. List of the top 15 genes selected by the iterative search and their order according to SAM. The ordering within the sets of the iterative search is random Rank according to SAM Iterative search (k = 5) Cube-root adjustment Rank adjustment M28130 Interleukin 8 gene 13* 56* M89957 Immunoglobulin-associated beta (B29) 465 283 M27891 Cystatin C 7* 24* M27783 Elastase 2, neutrophil 611 8* M84526 Adipsin 18* 1* U46499 Glutathion S-transferase 31* 15* M63438 Glutamine synthase 633 427 M83667 NF-IL6-beta protein 30* 19* X82240 T cell leukemia gene 566 276 U89922 Lymphotoxin-beta 937 422 X52056 Oncogene spi1 39* 143 D88422 Cystatin A 93* 6* M57731 GRO2 oncogene 73* 28* D87076 KIAA0239 gene 252 50* M20203 Neutrophil elastase 529 44* * the gene is selected by SAM when the smallest FDR rate is chosen as cutoff
root adjusted data, respectively† . The proportion of the genes found by the iterative search that were also selected by SAM was 47% and 41% for the rank and cube-root adjusted data, respectively. Table 1 also contains the ranking of the genes selected by our iterative search procedure according to SAM. The selection of some genes, e.g. M27783 and M20203, seems to be determined by the adjustment procedure. These genes have almost no expression in ALL (average cube-root adjusted expression levels of 522 and 110, respectively) and moderately high levels in AML (average cube-root adjusted expression levels of 5210 and 4511). However, there are also genes that are ranked very low by SAM regardless of the adjustment: e.g. M89957, M63438, X82240 and U89922. These genes are specific to one of the ALL subtypes, individually they separate only T-cell ALL from non-T-cell (e.g. X82240—T-cell leukemia gene) or B-cell-ALL from non-B-cell-ALL (e.g. M63438, Glutamine synthase), however jointly they also separate ALL from AML. M89957 and X82240 are ranked 2 and 3 by SAM if genes informative for the three-way classification into AML, T-cell and B-cell are sought. The multidimensional approach was able to find them without investing the additional information of the ALL subtypes. 7. A PPLICATION TO COLON CANCER CELL LINE DATA To evaluate the performance of our method in the presence of large marginal differences, we also applied our methodology to two commonly studied colon cancer cell lines. HT29 cells represent advanced, highly aggressive colon tumors. They contain mutations in both the APC gene and p53 gene, two tumor suppressor genes that frequently mutate during colon tumorigenesis. As another cell type, we selected HCT116 cells. This cell line models less aggressive colon tumors and harbors functional p53 and APC. However, they show a deficiency of those genetic systems that are responsible for the repair of mismatched regions of DNA. 
† The lists of genes selected by the univariate approaches are also available through the journal’s website.

To generate the data, three samples of each mRNA (1 µg each) were labeled by production of first-strand cDNA in the presence of Cy3-dCTP (green) or Cy5-dCTP (red). Six identical
comparison sets of samples were labeled with the dye orientation reversed between them: in the first three sets Cy3 was used to label HCT116 cells and Cy5 was used for HT29 cells, while in the next three sets Cy5 was used with HCT116 and HT29 was labeled using Cy3. Each comparison set was hybridized against two microarray facing slides containing 4608 minimally redundant cDNAs spotted in duplicate (unpublished data). A rank adjustment (Tsodikov et al., 2002) was applied for both dyes within each sub-slide and the adjusted values were averaged across the four dependent replicates obtained from the same original sample (the two sides of a two-face slide). Thus six independent replicates were obtained for both the HT29 and HCT116 cell lines. Data from an earlier (lower quality) experiment, in which each sample was hybridized only to the two sides of one slide (resulting in two dependent replicates instead of four), were similarly adjusted and used as an independent test set. This set contained eight replicates of each of the two cell lines.

The number of permutation samples was m = 300 in this analysis, and the cluster size k = 5 was used in algorithm (A4). Using the 99th percentile of the null distribution as the cutoff, the cross-validation rate and the distance-based criteria would stop close to each other (at the 57th and 56th sets of size 5, respectively), while the smoothed test set classification rate drops below the cutoff much earlier, at the 12th set. However, when the 95th percentile was used, the stopping points were much closer to each other (sets number 57, 63 and 67 for cross-validation, classification and distance, respectively). The extremely high variability of the test set classification rate is probably responsible for this discrepancy. One possible reason for this variability is the imperfect nature of our test data set: it was obtained a year earlier, and the expertise of the Microarray Core Facility that produced the data has greatly increased since that time.
In addition, the test set was formed using only two replicates for each sample compared to four in the training set.

The comparison with univariate procedures was performed similarly to the previous section. For the first comparison we sorted the genes according to the value of the corresponding (marginal) t-statistics and selected the top 56 × 5 = 280 of them, so that the number of genes coincides with that selected by the multivariate distance-based cutoff criterion. We found that the two lists have only 94 (33%) genes in common.

SAM with rank-adjusted data has the minimum FDR at 0.23% with 283 genes selected. This number is very close to the 280 genes selected by the distance-based cutoff. These two sets of genes greatly overlap: 260 genes are common to the two lists, and the two orderings of the genes also show high agreement, with Kendall’s τ = 0.78. Thus in a data set with a high number of marginally differentially expressed genes our method gives results very close to SAM. The cube-root adjustment is not applicable to two-color arrays, thus it was not applied to this data set.
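The univariate comparison described above (ranking genes by marginal t-statistics, counting the genes shared by two selected lists, and measuring the agreement of two orderings by Kendall's τ) can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the analysis: the text does not say which form of the t-statistic was used, so the unequal-variance (Welch) form is assumed here; Kendall's τ is computed in its simplest tau-a form on the genes common to both orderings; and all function names and the toy expression values are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance


def welch_t(x, y):
    """Two-sample t-statistic with unequal variances (assumed variant).

    Requires at least two replicates per group and a non-zero standard error.
    """
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def rank_by_abs_t(group_a, group_b):
    """Return gene ids sorted by decreasing |t|.

    group_a, group_b: dicts mapping gene id -> list of replicate values.
    """
    scores = {g: abs(welch_t(group_a[g], group_b[g])) for g in group_a}
    return sorted(scores, key=scores.get, reverse=True)


def overlap(list_1, list_2):
    """Number of genes common to two selected lists."""
    return len(set(list_1) & set(list_2))


def kendall_tau_a(order_1, order_2):
    """Kendall's tau-a between two orderings, restricted to their common genes."""
    common = sorted(set(order_1) & set(order_2))
    if len(common) < 2:
        raise ValueError("need at least two common genes")
    pos_1 = {g: i for i, g in enumerate(order_1)}
    pos_2 = {g: i for i, g in enumerate(order_2)}
    concordant = discordant = 0
    for g, h in combinations(common, 2):
        # Positions within one ordering are distinct, so the product is never 0.
        if (pos_1[g] - pos_1[h]) * (pos_2[g] - pos_2[h]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical toy data: four genes, three replicates per cell line.
hct116_toy = {"g1": [1.0, 1.2, 0.9], "g2": [5.0, 5.5, 4.8],
              "g3": [2.0, 2.1, 1.9], "g4": [3.0, 3.4, 2.7]}
ht29_toy = {"g1": [4.0, 4.3, 3.8], "g2": [5.1, 5.3, 4.9],
            "g3": [2.6, 2.4, 2.5], "g4": [3.1, 2.9, 3.3]}

ranking = rank_by_abs_t(hct116_toy, ht29_toy)
top_two = ranking[:2]  # analogous to keeping the top 280 genes in the text
```

With real data, `rank_by_abs_t(...)[:280]` would give the univariate list to compare against the multivariate selection through `overlap`, and `kendall_tau_a` would compare the SAM ordering with the order in which the iterative search selected the genes. Ties and zero-variance genes are not handled in this sketch.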
8. DISCUSSION

In this paper, we propose a successive selection procedure designed to identify a set of differentially expressed genes from microarray data. The three stopping rules explored in Sections 6 and 7 in conjunction with this procedure appear to produce similar final subsets of genes. The only discrepancy observed was between the test-set-based selection and the other two rules in the application to actual data. This discrepancy can be attributed to the poor quality of the test sample available in this study. A distinct advantage of the distance-based stopping rule is its stability with respect to random fluctuations in the course of eliminating groups of differentially expressed genes. However, it is always wise to check the distance-based procedure against its test-set-based counterpart whenever possible.

The proposed procedure is computationally expensive and probably would have been infeasible not so long ago. The most time-intensive resampling part (A4, step 1) is easily parallelizable, so multi-processor systems can be utilized to speed up the computation. All the calculations for this paper were performed on a computer with two 1000 MHz processors, and a typical analysis lasted 1–3 hours depending on the sample size and the exact choice of the parameters. The speed of the random search procedure effectively
limits what can be done by resampling. Considerable room remains for improvement in designing the random search algorithm. In algorithm (A4), all genes have equal probability of being selected at each step of the random search. Other methods can be explored for improving the speed of the random search, from simple ones, such as penalizing previously rejected genes, to more sophisticated methods, such as simulated annealing or a genetic algorithm (Li et al., 2001).

In the last step of algorithm (A4), the subsets of size k obtained at each step of the successive selection procedure are used as ‘building blocks’ for a finally selected set of differentially expressed genes. A value of k is conventionally chosen to meet the important practical constraint imposed by the number of available replicates (Section 5). It is natural to expect that the size and composition of the final set of genes depend on the choice of k: larger values of k typically lead to larger final sets. Since the problem has no formal solution, the choice of k is left to the investigator.

One of the most obvious deficiencies of the techniques proposed for finding differentially expressed genes is that they are essentially univariate and are frequently based on the (sometimes implicit) assumption that expression signals are stochastically independent. Our example in Section 7 shows that a pertinent univariate method, properly adjusted for multiple testing, and a more general multivariate method may yield similar results. As the example suggests, this is likely to be true whenever an abundance of differentially expressed individual genes is an inherent feature of the biological systems under comparison. In such settings, the virtues of multivariate approaches become less obvious.
But there may be numerous other situations where gene-to-gene interactions are especially important, so that multivariate methodology must be brought to the fore: our analysis of the leukemia data in Section 6 provides a good example.

It should be kept in mind that the ultimate goal of microarray data analysis is not just correct classification of unknown samples but selection of biologically relevant (however vague the definition of relevance) genes, although the two problems are closely related. For example, a very good separation between classes can sometimes be provided by looking at a single gene, so that the classification error rate is difficult to reduce further by including other differentially expressed genes. However, one would like to keep the chance of missing other interesting genes to a minimum. This is one of the reasons why more robust methods based on distances between random vectors may perform better than error-based multivariate methods of gene selection (Li et al., 2001).

REFERENCES

ALIZADEH, A. A., EISEN, M. B., DAVIS, R. E., MA, C., LOSSOS, I. S., ROSENWALD, A., BOLDRICK, J. C., SABET, H., HUDSON, J. J. AND LU, L., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
BEN-DOR, A., BRUHN, L., FRIEDMAN, N., NACHMAN, I., SCHUMMER, M. AND YAKHINI, Z. (2000). Tissue classification with gene expression profiles. Journal of Computational Biology 7, 559–584.
BEN-DOR, A., FRIEDMAN, N. AND YAKHINI, Z. (2001). Scoring genes for relevance. Technical Report. Agilent Laboratories.
BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A. AND STONE, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks/Cole Advanced Books and Software.
CHILINGARYAN, A., GEVORGYAN, N., VARDANYAN, A., JONES, D. AND SZABO, A. (2002). Multivariate approach for selecting sets of differentially expressed genes. Mathematical Biosciences 176, 59–69.
CHU, G., NARASIMHAN, B., TIBSHIRANI, R. AND TUSHER, V. (2002). SAM ‘Significance Analysis of Microarrays’, Users guide and technical document. Stanford University, http://www-stat.stanford.edu/~tibs/SAM/, 1.21 edition.
DUDOIT, S., FRIDLYAND, J. AND SPEED, T. P. (2000). Comparison of discrimination methods for the classification of tumors using gene expression data. Technical Report, 576. Berkeley, CA: University of California.
FUKUNAGA, K. (1990). Introduction to Statistical Pattern Recognition, 2nd edition. London: Academic.
GANESHANANDAM, S. AND KRZANOWSKI, W. (1989). On selecting variables and assessing their performance in linear discriminant analysis. Australian Journal of Statistics 32, 443–447.
GOLUB, T. R., SLONIM, D. K., TAMAYO, P., HUARD, C., GAASENBEEK, M., MESIROV, J., COLLER, H., LOH, M. L., DOWNING, J. R., CALIGIURI, M. A., BLOOMFIELD, C. D. AND LANDER, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
HASTIE, T., TIBSHIRANI, R., EISEN, M. B., ALIZADEH, A., LEVY, R., STAUDT, L., CHAN, W. C., BOTSTEIN, D. AND BROWN, P. (2000). ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 1, 0002.1–0003.21.
KERR, M. K. AND CHURCHILL, G. A. (2001). Experimental design for gene expression microarrays. Biostatistics 2, 183–201.
KERR, M. K., MARTIN, M. AND CHURCHILL, G. A. (2000). Analysis of variance for gene expression microarray data. Journal of Computational Biology 7, 819–837.
KHAN, J., SIMON, R., BITTNER, M., CHEN, Y., LEIGHTON, S. B., POHIDA, T., SMITH, P. D., JIANG, Y., GOODEN, G. C., TRENT, J. M. AND MELTZER, P. S. (1998). Gene expression profiling of alveolar rhabdomyosarcoma with cDNA microarrays. Cancer Research 58, 5009–5013.
LI, L., DARDEN, T. A., WEINBERG, C. R., LEVINE, A. J. AND PEDERSEN, L. G. (2001). Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. Combinatorial Chemistry & High Throughput Screening 4, 727–739.
NEWTON, M. A., KENDZIORSKI, C. M., RICHMOND, C. S., BLATTNER, F. R. AND TSUI, K. W. (2000). On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 8, 37–52.
ROBERTSON, T., WRIGHT, F. T. AND DYKSTRA, R. L. (1988). Order Restricted Statistical Inference. London: Wiley.
ROSS, D. T., SCHERF, U., EISEN, M. B., PEROU, C. M., REES, C., SPELLMAN, P., IYER, V., JEFFREY, S. S., VAN DE RIJN, M., WALTHAM, M., PERGAMENSCHIKOV, A., LEE, J. C. F., LASHKARI, D., SHALON, D., MYERS, T. G., WEINSTEIN, J. N., BOTSTEIN, D. AND BROWN, P. O. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 24, 227–235.
SZABO, A., BOUCHER, K., CARROLL, W., KLEBANOV, L., TSODIKOV, A. AND YAKOVLEV, A. (2002). Variable selection and pattern recognition with gene expression data generated by the microarray technology. Mathematical Biosciences 176, 71–98.
TSODIKOV, A., SZABO, A. AND JONES, D. (2002). Adjustments and measures of differential expression for microarray data. Bioinformatics 18, 251–260.
TUSHER, V. G., TIBSHIRANI, R. AND CHU, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, USA 98, 5116–5121.
VAN DER LAAN, M. J. AND BRYAN, J. F. (2001). Gene expression analysis with the parametric bootstrap. Biostatistics 2, 445–461.
ZINGER, A. A., KLEBANOV, L. B. AND KAKOSYAN, A. V. (1989). Characterization of distributions by mean values of statistics in connection with some probability metrics. In Stability Problems for Stochastic Models. Moscow: VNIISI, pp. 47–55.

[Received December 28, 2001; first revision August 6, 2002; second revision December 18, 2002; accepted for publication February 5, 2003]