Biostatistics (2005), 6, 1, pp. 157–169 doi: 10.1093/biostatistics/kxh026
Sample size calculation for multiple testing in microarray data analysis

SIN-HO JUNG
Department of Biostatistics and Bioinformatics, Duke University, Box 2716, Durham, NC 27705, USA
[email protected]

HEEJUNG BANG
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA

STANLEY YOUNG
National Institute of Statistical Sciences, Research Triangle Park, NC 27709, USA

SUMMARY
Microarray technology is rapidly emerging for genome-wide screening of differentially expressed genes between clinical subtypes or different conditions of human diseases. Traditional statistical testing approaches, such as the two-sample t-test or Wilcoxon test, are frequently used for evaluating statistical significance of informative expressions but require adjustment for large-scale multiplicity. Due to its simplicity, Bonferroni adjustment has been widely used to circumvent this problem. It is well known, however, that the standard Bonferroni test is often very conservative. In the present paper, we compare three multiple testing procedures in the microarray context: the original Bonferroni method, a Bonferroni-type improved single-step method and a step-down method. The latter two methods are based on nonparametric resampling, by which the null distribution can be derived with the dependency structure among gene expressions preserved and the family-wise error rate accurately controlled at the desired level. We also present a sample size calculation method for designing microarray studies. Through simulations and data analyses, we find that the proposed methods for testing and sample size calculation are computationally fast and control error and power precisely.

Keywords: Adjusted p-value; Bonferroni; Multi-step; Permutation; Simulation; Single-step.
© Oxford University Press 2005; all rights reserved. Biostatistics Vol. 6 No. 1

1. INTRODUCTION

DNA microarray is a biotechnology for performing genome-wide screening and monitoring of expression levels in cells for thousands of genes simultaneously, and has been extensively applied to a broad range of problems in biomedical fields (Golub et al., 1999; Alizadeh and Staudt, 2000; Sander, 2000). A primary aim is often to reveal the association of the expression levels and an outcome or other risk factor of interest. Golub et al. (1999) explored about 7000 genes extracted from bone marrow in 38 patients, 27 with acute lymphoblastic leukemia (ALL) and 11 with acute myeloid leukemia (AML), in order to identify the susceptible genes with potential clinical heterogeneity in the two subclasses of leukemia. Genes useful to distinguish ALL from AML may provide insight into cancer pathogenesis and patient treatment.
Relying on what they called neighborhood analysis, the authors concluded that roughly 1100 genes were more highly correlated with the AML–ALL class distinction; they then selected the top 50 genes arbitrarily for intensive research. This data set has been referred to and reanalyzed by many other researchers (Thomas et al., 2001; Pan, 2002; Dudoit et al., 2003; Ge et al., 2003). Because different methods and assumptions were adopted, the statistical inference obtained from the same data set has varied widely with respect to observed significance and the number of genes declared significant (Pan, 2002; Dudoit et al., 2003). Traditional statistical testing procedures, such as two-sample t-tests or Wilcoxon rank sum tests, are frequently used to determine statistical significance of the difference in gene expression patterns. These approaches, however, face serious multiplicity because a very large number of hypotheses (possibly 10 000 or more) are to be tested, while the number of studied experimental units is relatively small, from tens to a few hundred (West et al., 2001). If we use a per comparison type I error rate α in each test, the probability of rejecting any null hypothesis when all null hypotheses are true, which is called the family-wise error rate (FWER), will be greatly inflated. To avoid this pitfall, the Bonferroni test is used most commonly in this field despite its well-known conservativeness. Although Holm (1979) and Hochberg (1998) improved upon such conservativeness by devising multi-step testing procedures, they did not exploit the dependency of the test statistics and consequently the resulting improvement is often minor. Later, Westfall and Young (1989, 1993) proposed adjusting p-values in a state-of-the-art step-down manner using a simulation or resampling method, by which dependency among test statistics is effectively incorporated. Westfall and Wolfinger (1997) derived exact adjusted p-values for a step-down method for discrete data.
Recently, Westfall and Young’s permutation-based test was introduced to microarray data analyses and strongly advocated by Dudoit and her colleagues. Troendle et al. (2004) favor the permutation test over bootstrap resampling because the latter converges slowly in high dimensional data. Various multiple testing procedures and error control methods applicable to microarray experiments are well documented in Dudoit et al. (2003, pp. 73–75). Which test to use among a bewildering variety of choices should be judged by relevance to research questions, validity (of underlying assumptions), type of control (strong or weak), and computability. The Bonferroni-type single-step procedure, however, is still attractive due to its easy calculation and interpretation. Comparisons between single-step and multi-step testing procedures have been briefly discussed in several papers, but there has been little attempt to compare their theoretical and numerical properties, especially in the microarray framework. A stepwise procedure does not offer a critical value, while the Bonferroni critical value is fixed by the number of comparisons. Neither provides a simple way to calculate the minimal sample size for a designated power. Sample size estimation in this area is also an important problem, as indicated in Golub et al. (1999), where the authors called for larger studies because they were uncertain about the statistical power. In this article, we compare the Bonferroni, resampling-based single-step and step-down multiple testing procedures through simulation and a real data example. The null distribution of the test statistics is approximated by permutation, which is nonparametric in that it does not require specification of the joint distribution of the test statistics and hence of the p-values. Adjusted p-values are also derived as better-suited summaries of the evidence against the null.
Most importantly, we show that the single-step test provides a simple and accurate method for sample size determination that can also be used for multi-step tests.

2. MULTIPLE TESTING PROCEDURES: REVIEW

2.1 Single-step vs. multi-step

Suppose that there are $n_1$ subjects in group 1 and $n_2$ subjects in group 2. Gene expression data for $m$ genes are measured from each subject. We want to identify the informative genes, i.e. those that are differentially expressed between the two groups. Let $(X_{1i1}, \ldots, X_{1im})$ denote the gene expression levels obtained from subject $i$ $(= 1, \ldots, n_1)$ in group 1 and $(X_{2i1}, \ldots, X_{2im})$ similarly for subject $i$ $(= 1, \ldots, n_2)$ in group 2. Let $\mu_1 = (\mu_{11}, \ldots, \mu_{1m})$ and $\mu_2 = (\mu_{21}, \ldots, \mu_{2m})$ represent the respective mean vectors. In order to test the null hypothesis that gene $j$ $(= 1, \ldots, m)$ is not differentially expressed between the two conditions, i.e. $H_j : \mu_{1j} - \mu_{2j} = 0$, we may use the t-test statistic

$$T_j = \frac{\bar X_{1j} - \bar X_{2j}}{S_j \sqrt{n_1^{-1} + n_2^{-1}}}$$

where $\bar X_{kj}$ is the sample mean in group $k$ $(= 1, 2)$ and

$$S_j^2 = \frac{\sum_{i=1}^{n_1} (X_{1ij} - \bar X_{1j})^2 + \sum_{i=1}^{n_2} (X_{2ij} - \bar X_{2j})^2}{n_1 + n_2 - 2}$$

is the pooled sample variance for the $j$th gene.

Suppose that our interest lies in identifying any genes overexpressed in group 1. This question can be stated as multiple one-sided tests of $H_j$ vs. $\bar H_j : \mu_{1j} > \mu_{2j}$ for $j = 1, \ldots, m$. Two-sided tests, as a simple extension, will be discussed briefly later and in Appendix 1.

A single-step procedure adopts a common critical value $c$ to reject $H_j$, in favor of $\bar H_j$, when $T_j > c$. In this case, FWER fixed at $\alpha$ is defined as

$$\alpha = P(T_1 > c \text{ or } T_2 > c, \ldots, \text{ or } T_m > c \mid H_0) = P\Big(\max_{j=1,\ldots,m} T_j > c \,\Big|\, H_0\Big) \tag{1}$$

where $H_0 : \mu_{1j} = \mu_{2j}$ for all $j = 1, \ldots, m$, or equivalently $H_0 = \cap_{j=1}^{m} H_j$, is the complete null hypothesis and the relevant alternative hypothesis is $H_a = \cup_{j=1}^{m} \bar H_j$. In order to control FWER below the nominal level $\alpha$, Bonferroni uses $c = c_\alpha = t_{n_1+n_2-2,\,\alpha/m}$, the upper $\alpha/m$-quantile of the t-distribution with $n_1 + n_2 - 2$ degrees of freedom, which imposes normality on the expression data, or $c = z_{\alpha/m}$, the upper $\alpha/m$-quantile of the standard normal distribution, which relies on asymptotic normality. If gene expression levels are not normally distributed, the assumption of a t-distribution may be violated. Furthermore, $n_1$ and $n_2$ usually may not be large enough to warrant a normal approximation. Even if the assumed conditions are met, the Bonferroni procedure is conservative for correlated data. In fact, microarray data are collected from the same individuals and experience co-regulation, so they are expected to be correlated.

Motivated by these limitations together with the relationship in (1), we derive the distribution of $W = \max_{j=1,\ldots,m} T_j$ under $H_0$ using permutation. There are $B = \binom{n}{n_1}$ different ways of partitioning the pooled sample of size $n = n_1 + n_2$ into two groups of sizes $n_1$ and $n_2$. In order to maintain the dependence structure and distributional characteristics of the gene expression measures within each subject, the sampling unit is the subject, not the gene. Recently, this type of resampling has become popular in multiple testing because it avoids specifying the true distribution of the gene expression data (Dudoit et al., 2002, 2003; Mutter et al., 2001; Ge et al., 2003). Note that the number of possible permutations $B$ can be very large even with a small sample size. For instance, with $n_1 = n_2 = 10$, there exist 184 756 distinct permutations. A reasonable number of random permutations, say $B = 10\,000$, can be chosen for feasible computation.
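The quantities introduced above can be illustrated with a small, dependency-free Python sketch. It computes the pooled-variance statistic $T_j$ for one gene, the asymptotic Bonferroni cut-off $z_{\alpha/m}$, and the number of distinct partitions for $n_1 = n_2 = 10$. The function names (`pooled_t`, `bonferroni_z`) and the toy expression values are hypothetical and serve only as an illustration; they are not part of the paper.

```python
import math
from statistics import NormalDist, mean


def pooled_t(x1, x2):
    """Pooled-variance two-sample t-statistic T_j for a single gene.

    x1, x2: expression values of that gene in group 1 and group 2.
    """
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    s = math.sqrt(ss / (n1 + n2 - 2))            # pooled standard deviation S_j
    return (m1 - m2) / (s * math.sqrt(1 / n1 + 1 / n2))


def bonferroni_z(alpha, m):
    """Upper alpha/m quantile of N(0, 1): the asymptotic Bonferroni cut-off."""
    return NormalDist().inv_cdf(1 - alpha / m)


# Number of distinct ways to split n = 20 subjects into groups of 10 and 10.
n_splits = math.comb(20, 10)

# Hypothetical toy data for one gene, used only to exercise the function.
t_example = pooled_t([3.0, 4.0, 5.0], [1.0, 2.0, 3.0])

# Normal-approximation Bonferroni cut-off for m = 1000 one-sided tests.
z_example = bonferroni_z(0.05, 1000)
```

The t-quantile version of the cut-off needs a t-distribution inverse CDF, which the standard library does not provide, so only the normal-approximation variant is shown here.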
For the observed test statistic $t_j$ of $T_j$ from the original data, the unadjusted (or raw) p-values can be approximated by

$$p_j \approx B^{-1} \sum_{b=1}^{B} I(t_j^{(b)} \ge t_j)$$

where $I(A)$ is an indicator function of event $A$. For gene-specific inference, an adjusted p-value quantifying the significance of each gene relative to FWER is more realistic. Toward this end, we define an adjusted p-value for gene $j$ as the minimum FWER for which $H_j$ will be rejected, i.e. $\tilde p_j = P(\max_{j'=1,\ldots,m} T_{j'} \ge t_j \mid H_0)$. In what follows, this probability is estimated from the permutation distribution:

Algorithm 1 (Single-step procedure)
(A) Compute the test statistics $t_1, \ldots, t_m$ from the original data.
(B) For the $b$th permutation of the original data $(b = 1, \ldots, B)$, compute the test statistics $t_1^{(b)}, \ldots, t_m^{(b)}$ and $w_b = \max_{j=1,\ldots,m} t_j^{(b)}$.
(C) Estimate the adjusted p-values by $\tilde p_j = \sum_{b=1}^{B} I(w_b \ge t_j)/B$ for $j = 1, \ldots, m$.
(D) Reject all hypotheses $H_j$ $(j = 1, \ldots, m)$ such that $\tilde p_j < \alpha$.

Alternatively, with steps (C) and (D) replaced, the cut-off value $c_\alpha$ can be determined:

Algorithm 1′
(C′) Sort $w_1, \ldots, w_B$ to obtain the order statistics $w_{(1)} \le \cdots \le w_{(B)}$ and compute the critical value $c_\alpha = w_{([B(1-\alpha)+1])}$, where $[a]$ is the largest integer no greater than $a$. If there exist ties, $c_\alpha = w_{(k)}$ where $k$ is the smallest integer such that $w_{(k)} \ge w_{([B(1-\alpha)+1])}$.
(D′) Reject all hypotheses $H_j$ $(j = 1, \ldots, m)$ for which $t_j > c_\alpha$.

Below is a step-down analog suggested by Dudoit et al. (2002, 2003), originally proposed by Westfall and Young (1989, 1993, see Algorithms 2.8 and 4.1 in their book):

Algorithm 2 (Step-down procedure)
(A) Compute the test statistics $t_1, \ldots, t_m$ from the original data.
(A1) Sort $t_1, \ldots, t_m$ to obtain the ordered test statistics $t_{r_1} \ge \cdots \ge t_{r_m}$, where $H_{r_1}, \ldots, H_{r_m}$ are the corresponding hypotheses.
(B) For the $b$th permutation of the original data $(b = 1, \ldots, B)$, compute the test statistics $t_{r_1}^{(b)}, \ldots, t_{r_m}^{(b)}$ and $u_{b,j} = \max_{j'=j,\ldots,m} t_{r_{j'}}^{(b)}$ for $j = 1, \ldots, m$.
(C) Estimate the adjusted p-values by $\tilde p_{r_j} = \sum_{b=1}^{B} I(u_{b,j} \ge t_{r_j})/B$ for $j = 1, \ldots, m$.
(C1) Enforce monotonicity by setting $\tilde p_{r_j} \leftarrow \max(\tilde p_{r_{j-1}}, \tilde p_{r_j})$ for $j = 2, \ldots, m$.
(D) Reject all hypotheses $H_{r_j}$ $(j = 1, \ldots, m)$ for which $\tilde p_{r_j} < \alpha$.

Note that two-sided tests can be fulfilled by replacing $t_j$ by $|t_j|$ in steps (B) and (C) in Algorithm 1. Finally, it can be shown that a single-step procedure, controlling the FWER weakly as in (1), also controls the FWER strongly under the condition of subset pivotality (see p. 42 in Westfall and Young, 1993).

2.2 A simulation study
Through a simulation study, we investigate how well three multiple testing procedures control the FWER and what power they achieve: the Bonferroni procedure (BON), the single-step procedure (SSP) and the step-down procedure (SDP) presented in this section. To evaluate FWER empirically, 1000-dimensional artificial gene expression profiles in each group were generated from a multivariate Gaussian distribution with zero means (i.e. $\mu_1 = \mu_2 = 0$) and unit marginal variances. A block exchangeable correlation structure was assumed with the correlation coefficient $\rho$ (= 0, 0.4 or 0.8) and block size 100, i.e. genes are correlated within blocks and uncorrelated between blocks. We used balanced allocation ($n_1 = n_2 = n/2$) with $n = 20$ or 50 subjects. With one-sided FWER $\alpha = 0.05$, $c_\alpha$ was approximated from $B = 1000$ random permutations and the empirical FWER was estimated by the proportion of replications, out of $N = 1000$, in which $H_0$ was rejected. As Table 1(A) displays, BON is precise with mild correlation ($\rho \le 0.4$), but becomes highly conservative as correlation increases ($\rho = 0.8$). The conservatism becomes more prominent with a larger sample ($n = 50$). The estimates from both SSP and SDP are slightly anticonservative with $n = 20$ and $\rho = 0$, but accurate overall. Also reported are the average of $c_\alpha$ values for SSP over simulation along with
Table 1. Simulation results

(A) Average FWER (critical value)

| n | ρ | BON | SSP | SDP |
|---|---|---|---|---|
| 20 | 0 | 0.064 (4.966) | 0.066 (4.950) | 0.066 |
| 20 | 0.4 | 0.046 | 0.054 (4.898) | 0.054 |
| 20 | 0.8 | 0.017 | 0.037 (4.384) | 0.037 |
| 50 | 0 | 0.046 (4.244) | 0.046 (4.233) | 0.046 |
| 50 | 0.4 | 0.047 | 0.054 (4.177) | 0.054 |
| 50 | 0.8 | 0.009 | 0.044 (3.767) | 0.044 |

(B) Average true rejection rate (global power)

| n | D | ρ | BON, δ = 1 | SSP, δ = 1 | SDP, δ = 1 | BON, δ = 1.5 | SSP, δ = 1.5 | SDP, δ = 1.5 |
|---|---|---|---|---|---|---|---|---|
| 20 | 10 | 0 | 0.022 (0.237) | 0.022 (0.245) | 0.022 (0.245) | 0.116 (0.702) | 0.117 (0.706) | 0.117 (0.706) |
| 20 | 10 | 0.4 | 0.020 (0.190) | 0.024 (0.208) | 0.024 (0.208) | 0.106 (0.536) | 0.113 (0.554) | 0.113 (0.554) |
| 20 | 10 | 0.8 | 0.023 (0.097) | 0.055 (0.210) | 0.055 (0.210) | 0.116 (0.339) | 0.215 (0.517) | 0.217 (0.517) |
| 20 | 50 | 0 | 0.019 (0.625) | 0.020 (0.627) | 0.020 (0.627) | 0.115 (0.999) | 0.119 (0.999) | 0.119 (0.999) |
| 20 | 50 | 0.4 | 0.020 (0.395) | 0.022 (0.421) | 0.022 (0.421) | 0.120 (0.856) | 0.127 (0.866) | 0.127 (0.866) |
| 20 | 50 | 0.8 | 0.016 (0.185) | 0.042 (0.314) | 0.042 (0.314) | 0.117 (0.507) | 0.211 (0.688) | 0.214 (0.688) |
| 50 | 10 | 0 | 0.265 (0.949) | 0.268 (0.949) | 0.268 (0.949) | 0.842 (1.00) | 0.844 (1.00) | 0.845 (1.00) |
| 50 | 10 | 0.4 | 0.265 (0.810) | 0.286 (0.834) | 0.287 (0.834) | 0.840 (0.997) | 0.855 (0.997) | 0.855 (0.997) |
| 50 | 10 | 0.8 | 0.241 (0.516) | 0.393 (0.695) | 0.394 (0.695) | 0.845 (0.969) | 0.929 (0.990) | 0.929 (0.990) |
| 50 | 50 | 0 | 0.265 (1.00) | 0.267 (1.00) | 0.268 (1.00) | 0.842 (1.00) | 0.844 (1.00) | 0.846 (1.00) |
| 50 | 50 | 0.4 | 0.272 (0.947) | 0.289 (0.956) | 0.291 (0.956) | 0.842 (1.00) | 0.859 (1.00) | 0.859 (1.00) |
| 50 | 50 | 0.8 | 0.274 (0.692) | 0.426 (0.836) | 0.429 (0.836) | 0.841 (0.984) | 0.925 (0.996) | 0.925 (0.996) |

BON = Bonferroni, SSP = single-step procedure, and SDP = step-down procedure. n denotes sample size and D denotes the number of genes with non-zero effect size δ out of m = 1000 genes tested. A block diagonal matrix with block size 100 and correlation ρ was used for the correlation structure. Nominal α is set at 0.05. B = N = 1000 permutations and simulations were used. Average false rejection rates (among genes with zero effect size) range from 0.001 to 0.0015 and are omitted from this table.
ones for BON, $t_{n-2,\alpha/m}$. As expected, the estimated critical value $c_\alpha$ increases in $m$ (result not shown) and decreases in $n$ and is always smaller than the critical value of BON.

For power analysis, the first $D$ genes in group 1 have a non-zero effect size $\delta$, i.e. $\mu_1 = (\delta \mathbf{1}_D, \mathbf{0}_{m-D})$, where $\mathbf{1}_a$ and $\mathbf{0}_a$ are $a$-dimensional row vectors with components of all 1 and 0, respectively. Effect size as well as correlation vary: $\delta = 1$ or 1.5; $\rho = 0$, 0.4 or 0.8. Three different rejection rates were assessed: (1) global power (i.e. the probability of rejecting at least one null hypothesis); (2) false rejection rate (FRR) (i.e. the probability of declaring the genes with a null effect as predictive); and (3) true rejection rate (TRR) (i.e. the probability of declaring the predictive genes as predictive). This distinction is important because high global power does not mean a high rate of rejecting individual (true or false) hypotheses, as Table 1(B) makes clear. For different concepts of power in the multiple testing context, see Dudoit et al. (2003, p. 74). The FRRs are omitted from the table, being similarly very low (maximum 0.15%) for all entries. All three procedures show that the TRR and global power increase in $n$, $\delta$ or $D$. Interestingly, $\rho$ is associated inversely with global power but positively with TRR both for SSP and SDP. However, for BON, the TRR is virtually constant in $\rho$. SSP and SDP exhibit almost the same performance although SDP has slightly higher (by 0.5% at most) TRR than SSP, particularly with $D = 50$ and $n = 50$. SSP and SDP show identical global power (and FWER under the composite null) in all cases. This is obvious because global power
Table 2. Average rejection rate and global power in a classical setting

| D | Procedure | TRR | FRR | Global power |
|---|---|---|---|---|
| 0 | SDP | | 0.011 | 0.048 |
| 0 | SSP | | 0.010 | 0.048 |
| 1 | SDP | 0.411 | 0.013 | 0.422 |
| 1 | SSP | 0.410 | 0.011 | 0.422 |
| 2 | SDP | 0.431 | 0.018 | 0.619 |
| 2 | SSP | 0.414 | 0.014 | 0.619 |
| 3 | SDP | 0.452 | 0.023 | 0.724 |
| 3 | SSP | 0.416 | 0.013 | 0.724 |
| 4 | SDP | 0.483 | 0.032 | 0.806 |
| 4 | SSP | 0.426 | 0.013 | 0.806 |
| 5 | SDP | 0.520 | | 0.853 |
| 5 | SSP | 0.428 | | 0.853 |

SDP = step-down procedure and SSP = single-step procedure. TRR and FRR denote true rejection rate (among genes that are differentially expressed) and false rejection rate (among genes that are not differentially expressed), respectively. D is the number of genes with non-zero effect size δ. m = 5 genes and B = N = 10 000 permutations and simulations were used. Compound symmetry with the correlation coefficient of 0.3 and a total sample size n of 20 (n_1 = n_2 = 10) were employed.
is governed by the smallest adjusted p-value, $\min_{j=1,\ldots,m} \tilde p_j$, which is common to the two procedures. We conclude that Algorithms 1 (SSP) and 2 (SDP) behave very similarly in situations typically arising in microarray experiments, where the number of genes is very large but the proportion of genes differentially expressed is small.

To examine possible differences between the two procedures, we simulated a typical multiple testing situation with a small number of tests and report our findings in Table 2. We set $n_1 = n_2 = 10$ and $m = 5$, among which $D = 0, \ldots, 5$ test hypotheses have effect size $\delta = 1$. Raw data are generated from a multivariate Gaussian distribution with a compound symmetry (CS) structure and mild correlation coefficient ($\rho = 0.3$). For each $D$, $B = 10\,000$ permutations were conducted within each simulation and this process was repeated $N = 10\,000$ times. As $D$ increases, the TRR and FRR are relatively constant in SSP but sharply increase in SDP. Both TRR and FRR are higher in SDP and the difference becomes more pronounced as $D$ increases.
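The two permutation procedures compared above (Algorithms 1 and 2) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation or PROC MULTTEST: it handles one-sided tests only, draws random permutations of the group labels, and uses hypothetical names (`adjusted_pvalues`, `t_stats`) and a hypothetical toy data set that mimics the Table 2 setting ($m = 5$, $n_1 = n_2 = 10$, compound symmetry with $\rho = 0.3$, $D = 2$ genes with $\delta = 1$).

```python
import math
import random
from statistics import mean


def pooled_t(x1, x2):
    """Pooled-variance t-statistic T_j for a single gene (one-sided use)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    s = math.sqrt(ss / (n1 + n2 - 2))
    return (m1 - m2) / (s * math.sqrt(1 / n1 + 1 / n2))


def t_stats(data, labels):
    """t-statistics for all genes; data[i][j] is subject i, gene j."""
    stats = []
    for j in range(len(data[0])):
        g1 = [row[j] for row, g in zip(data, labels) if g == 1]
        g2 = [row[j] for row, g in zip(data, labels) if g == 2]
        stats.append(pooled_t(g1, g2))
    return stats


def adjusted_pvalues(data, labels, n_perm=1000, seed=0):
    """Return (single_step, step_down) adjusted p-values, one per gene."""
    rng = random.Random(seed)
    t = t_stats(data, labels)                       # step (A)
    m = len(t)
    order = sorted(range(m), key=lambda j: -t[j])   # step (A1): r_1, ..., r_m
    ss_count, sd_count = [0] * m, [0] * m
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)                           # permute subjects, not genes
        tb = t_stats(data, perm)                    # step (B)
        w = max(tb)                                 # w_b for Algorithm 1
        for j in range(m):
            ss_count[j] += w >= t[j]
        u = -math.inf
        for k in reversed(range(m)):                # u_{b,k}: max over r_k, ..., r_m
            u = max(u, tb[order[k]])
            sd_count[order[k]] += u >= t[order[k]]
    ss = [c / n_perm for c in ss_count]             # step (C), Algorithm 1
    sd = [c / n_perm for c in sd_count]             # step (C), Algorithm 2
    for k in range(1, m):                           # step (C1): monotonicity
        sd[order[k]] = max(sd[order[k - 1]], sd[order[k]])
    return ss, sd


# Hypothetical toy data: shared subject effect gives correlation 0.3 across genes.
data_rng = random.Random(1)
labels = [1] * 10 + [2] * 10
data = []
for g in labels:
    b = data_rng.gauss(0, 1)
    row = [math.sqrt(0.7) * data_rng.gauss(0, 1) + math.sqrt(0.3) * b
           for _ in range(5)]
    if g == 1:                                      # D = 2 genes with delta = 1
        row[0] += 1.0
        row[1] += 1.0
    data.append(row)

ss_p, sd_p = adjusted_pvalues(data, labels, n_perm=500)
```

By construction, each step-down adjusted p-value is no larger than its single-step counterpart, and the smallest adjusted p-value is the same for both procedures, which is why the two share the same global power.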
3. SAMPLE SIZE CALCULATION

In this section, we derive a sample size calculation method using the single-step procedure. The calculated sample size also applies to the step-down procedure since the two procedures have the same global power. Our discussion is focused on one-sided testing, but the two-sided case can be derived similarly. Recall that the multiple testing procedures discussed in this paper do not require a large sample assumption. However, we derive our sample size formula based on the large sample approximation and then show through simulations that the formula also works well with moderate sample sizes.
3.1 Algorithms for sample size calculation

We wish to determine sample size for a designated global power $1 - \beta$. Suppose that the gene expression data $\{(X_{ki1}, \ldots, X_{kim}),\ i = 1, \ldots, n_k,\ k = 1, 2\}$ are random samples from an unknown distribution with $E(X_{kij}) = \mu_{kj}$, $\mathrm{var}(X_{kij}) = \sigma_j^2$ and $\mathrm{corr}(X_{kij}, X_{kij'}) = \rho_{jj'}$. Let $R = (\rho_{jj'})_{j,j'=1,\ldots,m}$ be the $m \times m$ correlation matrix. Under $H_a$, we specify the effect size as $\delta_j = (\mu_{1j} - \mu_{2j})/\sigma_j$. In the design stage of a microarray study, we usually project the number of predictive genes $D$ and set an equal effect size among them, i.e.

$$\delta_j = \begin{cases} \delta & \text{for } j = 1, \ldots, D \\ 0 & \text{for } j = D + 1, \ldots, m. \end{cases} \tag{2}$$

Appendix 2A shows that, for large $n_1$ and $n_2$, $(T_1, \ldots, T_m)$ has approximately the same distribution as $(e_1, \ldots, e_m) \sim N(0, R)$ under $H_0$ and $(e_j + \delta_j \sqrt{npq},\ j = 1, \ldots, m)$ under $H_a$, where $p = n_1/n$ and $q = 1 - p$. Hence, at FWER $= \alpha$, the common critical value $c_\alpha$ is given as the upper $\alpha$ quantile of $\max_{j=1,\ldots,m} e_j$ from (1). Similarly, the global power as a function of $n$ is

$$h_a(n) = P\Big\{\max_{j=1,\ldots,m} (e_j + \delta_j \sqrt{npq}) > c_\alpha\Big\}.$$

Thus, given FWER $= \alpha$, the sample size $n$ to detect the specified effect sizes $(\delta_1, \ldots, \delta_m)$ with a global power $1 - \beta$ will be calculated as the solution to $h_a(n) = 1 - \beta$. Analytic calculation of $c_\alpha$ and $h_a(n)$ will be feasible only when the distributions of $\max_j e_j$ and $\max_j (e_j + \delta_j \sqrt{npq})$ are available in simple forms. With a large $m$, however, it is almost impossible to derive the distributions. We avoid the difficulty by using simulation.

Our simulation method is to approximate $c_\alpha$ and $h_a(\cdot)$ by generating random vectors $(e_1, \ldots, e_m)$ from $N(0, R)$. For easy generation of the random numbers, we have to assume a simple, but realistic, correlation structure for the gene expression data. Recall that $R$ is the correlation matrix among the gene expression data $(X_{ki1}, \ldots, X_{kim})$. A reasonable correlation structure would be block compound symmetry (BCS) or CS (i.e. with only 1 block). Suppose that $m$ genes are partitioned into $L$ blocks, and $B_l$ denotes the set of genes belonging to block $l$ $(l = 1, \ldots, L)$. We assume that $\rho_{jj'} = \rho$ if $j, j' \in B_l$ for some $l$, and $\rho_{jj'} = 0$ otherwise. Under the BCS structure, we can generate $(e_1, \ldots, e_m)$ as a function of i.i.d. standard normal random variates $u_1, \ldots, u_m, b_1, \ldots, b_L$:

$$e_j = u_j \sqrt{1 - \rho} + b_l \sqrt{\rho} \quad \text{for } j \in B_l. \tag{3}$$

Finally, the entire procedure can be summarized as follows:
(a) Specify FWER ($\alpha$), global power ($1 - \beta$), effect sizes ($\delta_1, \ldots, \delta_m$) and correlation structure ($R$).
(b) Generate $K$ (say, 10 000) i.i.d. random vectors $\{(e_1^{(k)}, \ldots, e_m^{(k)}),\ k = 1, \ldots, K\}$ from $N(0, R)$. Let $\bar e_k = \max_{j=1,\ldots,m} e_j^{(k)}$.
(c) Approximate $c_\alpha$ by $\bar e_{[(1-\alpha)K+1]}$, the $[(1 - \alpha)K + 1]$th order statistic of $\bar e_1, \ldots, \bar e_K$.
(d) Calculate $n$ by solving $\hat h_a(n) = 1 - \beta$ by the bisection method (Press et al., 1996), where $\hat h_a(n) = K^{-1} \sum_{k=1}^{K} I\{\max_{j=1,\ldots,m} (e_j^{(k)} + \delta_j \sqrt{npq}) > c_\alpha\}$.

Mathematically put, step (d) is equivalent to finding $n^* = \min\{n : \hat h_a(n) \ge 1 - \beta\}$.
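Steps (a) to (d) can be sketched in plain Python as below. This is an illustrative sketch of the approach just described, assuming the equal-effect-size design (2) with $D \ge 1$ and $\delta > 0$, block compound symmetry generated through (3), and an integer bisection on $n$. The function names (`draw_bcs`, `naive_sample_size`), the default $K = 2000$ and the seed are hypothetical choices made here to keep the run short.

```python
import math
import random


def draw_bcs(m, block_size, rho, rng):
    """One vector (e_1, ..., e_m) ~ N(0, R) under block compound symmetry, eq. (3)."""
    a, c = math.sqrt(1 - rho), math.sqrt(rho)
    e = []
    for start in range(0, m, block_size):
        b = rng.gauss(0, 1)                           # shared block effect b_l
        size = min(block_size, m - start)
        e.extend(a * rng.gauss(0, 1) + c * b for _ in range(size))
    return e


def naive_sample_size(delta, D, m, block_size, rho,
                      alpha=0.05, power=0.8, p=0.5, K=2000, seed=0):
    """Smallest n with estimated global power >= `power` (steps (a)-(d))."""
    if delta <= 0 or D < 1:
        raise ValueError("this sketch assumes delta > 0 and D >= 1")
    rng = random.Random(seed)
    vecs = [draw_bcs(m, block_size, rho, rng) for _ in range(K)]   # step (b)

    # Step (c): the [(1 - alpha) K + 1]-th order statistic of the row maxima.
    maxima = sorted(max(e) for e in vecs)
    k = min(K, math.floor((1 - alpha) * K + 1 + 1e-9))
    c_alpha = maxima[k - 1]

    # Only the first D genes are shifted, so two maxima per vector suffice.
    top = [max(e[:D]) for e in vecs]
    rest = [max(e[D:]) if D < m else -math.inf for e in vecs]

    def h_hat(n):
        shift = delta * math.sqrt(n * p * (1 - p))
        return sum(max(t + shift, r) > c_alpha for t, r in zip(top, rest)) / K

    # Step (d): bracket the solution, then bisect on integers.
    # The same random numbers are reused, so h_hat is non-decreasing in n.
    lo, hi = 0, 1
    while h_hat(hi) < power:
        lo, hi = hi, 2 * hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if h_hat(mid) >= power:
            hi = mid
        else:
            lo = mid
    return hi, c_alpha


# Setting taken from the first row of Table 3: m = 100, block size 10, D = 5.
n_star, c_alpha = naive_sample_size(delta=1.0, D=5, m=100, block_size=10, rho=0.1)
```

Because the random vectors are shared between the calculation of $c_\alpha$ and every evaluation of $\hat h_a(n)$, each bisection step costs only a pass over $K$ precomputed maxima. The result depends on the seed and on $K$, so it should be expected to land near, rather than exactly on, the naive value reported for this setting in Table 3.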
In Appendix 2A, the asymptotic distribution of $(T_1, \ldots, T_m)$ is derived without resort to the use of permutations in testing. In this sense, the above algorithm using (3) will be called a ‘naive’ method. Appendix 2B shows that the permutation procedure alters the correlation structure among the test statistics under $H_a$. Suppose that there are $m_1$ genes in block 1, among which the first $D$ are predictive. Then, under (2) and BCS, we have

$$\mathrm{corr}(T_j, T_{j'}) \approx \begin{cases} (\rho + pq\delta^2)/(1 + pq\delta^2) \equiv \rho_1 & \text{if } 1 \le j < j' \le D \\ \rho/\sqrt{1 + pq\delta^2} \equiv \rho_2 & \text{if } 1 \le j \le D < j' \le m_1 \\ \rho & \text{if } D < j < j' \le m_1 \text{ or } j, j' \in B_l \text{ for } l \ge 2 \end{cases} \tag{4}$$

where the approximation is with respect to large $n$.

Let $\tilde R$ denote the correlation matrix with these correlation coefficients. Note that $\tilde R = R$ under $H_0 : \delta = 0$, so that calculation of $c_\alpha$ is the same as in the naive method. However, $h_a(n)$ should be modified to

$$\tilde h_a(n) = P\Big\{\max_{j=1,\ldots,m} (\tilde e_j + \delta_j \sqrt{npq}) > c_\alpha\Big\}$$

where random samples of $(\tilde e_1, \ldots, \tilde e_m)$ can be generated using

$$\tilde e_j = \begin{cases} u_j \sqrt{1 - \rho_1} + b_1 \sqrt{\rho_2} + b_{-1} \sqrt{\rho_1 - \rho_2} & \text{if } 1 \le j \le D \\ u_j \sqrt{1 - \rho} + b_1 \sqrt{\rho_2} + b_0 \sqrt{\rho - \rho_2} & \text{if } D < j \le m_1 \\ u_j \sqrt{1 - \rho} + b_l \sqrt{\rho} & \text{if } j \in B_l \text{ for } l \ge 2 \end{cases}$$

with $u_1, \ldots, u_m, b_{-1}, b_0, b_1, \ldots, b_L$ independently from $N(0, 1)$. Then $\{(\tilde e_1^{(k)}, \ldots, \tilde e_m^{(k)}),\ k = 1, \ldots, K\}$ are i.i.d. random vectors from $N(0, \tilde R)$, and

$$\hat{\tilde h}_a(n) = K^{-1} \sum_{k=1}^{K} I\Big\{\max_{j=1,\ldots,m} (\tilde e_j^{(k)} + \delta_j \sqrt{npq}) > c_\alpha\Big\}.$$
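The correlation adjustment in (4) and the three-case generator for $\tilde e_j$ can be checked numerically with a short Python sketch. It assumes, as in the text, that the $D$ predictive genes all sit in the first block (so $1 \le D \le$ block size); the names `modified_correlations`, `draw_e_tilde` and `corr`, and the small example dimensions, are hypothetical and chosen only for illustration.

```python
import math
import random


def modified_correlations(rho, delta, p=0.5):
    """rho_1 and rho_2 from equation (4), with q = 1 - p."""
    k = p * (1 - p) * delta ** 2
    return (rho + k) / (1 + k), rho / math.sqrt(1 + k)


def draw_e_tilde(m, block_size, D, rho, delta, rng, p=0.5):
    """One vector (e~_1, ..., e~_m); requires 1 <= D <= block_size."""
    rho1, rho2 = modified_correlations(rho, delta, p)
    b_m1, b_0 = rng.gauss(0, 1), rng.gauss(0, 1)          # b_{-1} and b_0
    n_blocks = math.ceil(m / block_size)
    b = [rng.gauss(0, 1) for _ in range(n_blocks)]        # b_1, ..., b_L
    e = []
    for j in range(m):
        u = rng.gauss(0, 1)
        block = j // block_size
        if j < D:                                          # predictive genes
            e.append(u * math.sqrt(1 - rho1) + b[0] * math.sqrt(rho2)
                     + b_m1 * math.sqrt(rho1 - rho2))
        elif block == 0:                                   # rest of block 1
            e.append(u * math.sqrt(1 - rho) + b[0] * math.sqrt(rho2)
                     + b_0 * math.sqrt(rho - rho2))
        else:                                              # blocks 2, ..., L
            e.append(u * math.sqrt(1 - rho) + b[block] * math.sqrt(rho))
    return e


def corr(xs, ys):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rho1, rho2 = modified_correlations(rho=0.4, delta=1.0)    # p = q = 0.5
rng = random.Random(0)
draws = [draw_e_tilde(m=6, block_size=3, D=2, rho=0.4, delta=1.0, rng=rng)
         for _ in range(5000)]
cols = list(zip(*draws))
r_pred = corr(cols[0], cols[1])    # two predictive genes: expected near rho_1
r_mixed = corr(cols[0], cols[2])   # predictive vs. null in block 1: near rho_2
r_null = corr(cols[3], cols[4])    # two genes in block 2: near rho
r_cross = corr(cols[0], cols[3])   # different blocks: near 0
```

Each $\tilde e_j$ has unit variance because the squared coefficients in every case sum to one, and the shared terms reproduce the three correlations in (4). Replacing the generator of $(e_1, \ldots, e_m)$ in the naive calculation with this one, while keeping $c_\alpha$ from the null vectors, would give the estimate $\hat{\tilde h}_a(n)$.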
The sample size calculation solving $\hat{\tilde h}_a(n) = 1 - \beta$ will be named a ‘modified’ method. Note that the methods discussed here differ from a pure simulation method in that they do not require generating the raw data and then calculating test statistics. Thus, the computing time is not of an order of $n \times m$, but of $m$. Furthermore, we can share the random numbers $u_1, \ldots, u_m, b_{-1}, b_0, b_1, \ldots, b_L$ in the calculation of $c_\alpha$ and $n$. We do not need to generate a new set of random numbers at each replication of the bisection procedure either.

If the target $n$ is not large, the large sample approximation may not perform well. In our simulation study, we examine how large $n$ needs to be for an adequate approximation. If the target $n$ is so small that the approximation is questionable, then we have to use a pure simulation method by generating raw data.

3.2 A simulation study
We conducted numerical experiments to investigate the accuracy of our sample size estimation. First, sample size was computed under one-sided FWER = 0.05; 80% global power; $p = q = 0.5$; $\delta = 0.5$ or 1; $\rho = 0.1$, 0.4 or 0.8; $m$, $D$ and block size varied as shown in Table 3. A simulated sample of the calculated size was generated from the same parameter setting as in the sample size calculation. $B = N = 1000$ samples were generated, and global power was calculated empirically. Sample size increases in $\rho$ (assuming there is no variable reduction technique involved) and decreases in $\delta$. Given $D$, intuitively, a larger number of tests ($m$) demands a larger sample size. The sample sizes given by the naive method are underpowered, especially with $\delta = 1$ and large $m$. The modified method remarkably improves the accuracy except when $\delta = 1$ and $m = 10\,000$. With large $m$ and $\rho$, the large sample convergence will be slow, resulting in a poor approximation, especially with a large effect size, which yields a small $n$. These results show that power and sample size depend not only on the study design but also on the proposed method for analyzing the data.
Table 3. Sample size (empirical power) for 80% global power

| m (block size) | D | Correlation formula | δ = 0.5, ρ = 0.1 | δ = 0.5, ρ = 0.4 | δ = 0.5, ρ = 0.8 | δ = 1, ρ = 0.1 | δ = 1, ρ = 0.4 | δ = 1, ρ = 0.8 |
|---|---|---|---|---|---|---|---|---|
| 100 (10) | 5 | naive | 119 (0.79) | 150 (0.79) | 179 (0.82) | 30 (0.68) | 38 (0.75) | 45 (0.74) |
| 100 (10) | 5 | modified | 127 (0.79) | 152 (0.82) | 183 (0.80) | 35 (0.79) | 40 (0.80) | 47 (0.76) |
| 1000 (100) | 10 | naive | 139 (0.76) | 168 (0.78) | 199 (0.76) | 35 (0.65) | 42 (0.70) | 51 (0.75) |
| 1000 (100) | 10 | modified | 145 (0.81) | 176 (0.80) | 204 (0.81) | 41 (0.79) | 48 (0.81) | 53 (0.75) |
| 10 000 (100) | 10 | naive | 183 (0.70) | 233 (0.75) | 284 (0.79) | 45 (0.53) | 59 (0.70) | 71 (0.70) |
| 10 000 (100) | 10 | modified | 188 (0.77) | 239 (0.79) | 288 (0.81) | 53 (0.74) | 64 (0.77) | 74 (0.75) |
| 10 000 (1000) | 1000 | naive | 41 (0.64) | 86 (0.82) | 152 (0.77) | 10 (0.21) | 22 (0.68) | 39 (0.71) |
| 10 000 (1000) | 1000 | modified | 57 (0.83) | 113 (0.87) | 185 (0.82) | 20 (0.87) | 34 (0.85) | 49 (0.85) |

m is the total number of genes tested and D is the number of genes with non-zero effect size δ. ‘Naive’ and ‘modified’ represent the original and modified correlation matrix before and after permutation, respectively. Sample size n was estimated from K = 5000 simulated samples. B = N = 1000 permutations and simulations were used.
4. APPLICATION TO LEUKEMIA DATA

In this section, the leukemia data from Golub et al. (1999) are reanalyzed. There are $n_{\mathrm{ALL}} = 27$ patients with ALL and $n_{\mathrm{AML}} = 11$ patients with AML in the training set, and expression patterns in $m = 6810$ human genes are explored. Note that, in general, such expression measures are subject to preprocessing steps such as image analysis and normalization, and also to a priori quality control. Supplemental information and the dataset are located on the authors’ website (http://www.genome.wi.mit.edu/MPR).

Gene-specific significance was ascertained for alternative hypotheses $\bar H_{1,j} : \mu_{\mathrm{ALL},j} \ne \mu_{\mathrm{AML},j}$, $\bar H_{2,j} : \mu_{\mathrm{ALL},j} < \mu_{\mathrm{AML},j}$, and $\bar H_{3,j} : \mu_{\mathrm{ALL},j} > \mu_{\mathrm{AML},j}$ by SDP and SSP. We implemented our algorithm as well as PROC MULTTEST in SAS with $B = 10\,000$ permutations (Westfall et al., 2001). Because the results were essentially identical, we report those from SAS. Table 4 lists 41 genes with two-sided adjusted p-values smaller than 0.05. Although adjusted p-values by SDP are slightly smaller than those by SSP, the results are extremely similar, confirming the findings from our simulation study. Note that Golub et al. and we identified 1100 and 1579 predictive genes, respectively, without accounting for multiplicity. A Bonferroni adjustment declared 37 significant genes. This is not so surprising because relatively low correlations among genes were observed in these data. We do not show the results for $\bar H_{3,j}$; only four hypotheses are rejected. Note that the two-sided p-value is smaller than twice the smaller one-sided p-value, as theory predicts (see Appendix 1), and that the difference is not often negligible (Shaffer, 2002).

Suppose that we want to design a prospective study to identify predictive genes overexpressing in AML based on observed parameter values. So we assume $m = 6810$, $p = 0.3\ (\approx 11/38)$, $D = 10$ or 100, $\delta = 0.5$ or 1, and BCS with block size 100 or CS with a common correlation coefficient of $\rho = 0.1$ or 0.4.
We calculated the sample size using the modified formula under each parameter setting for FWER $\alpha = 0.05$ and a global power $1 - \beta = 0.8$ with $K = 5000$ replications. For $D = 10$ and $\delta = 1$, the minimal sample sizes required for BCS/CS are 59/59 and 74/63 for $\rho = 0.1$ and 0.4, respectively. If a larger number of genes, say $D = 100$, are anticipated to overexpress in AML with the same effect size, the respective sample sizes reduce to 34/34 and 49/41 in order to maintain the same power. With $\delta = 0.5$, the required sample size becomes nearly 3.5 to 4 times that for $\delta = 1$. Note that, with the same $\rho$, BCS tends to require a larger sample size than CS.

One of the referees raised a question about the accuracy of our sample size formula when the gene expression data have distributions other than the multivariate normal. We considered the setting $\alpha = 0.05$, $1 - \beta = 0.8$, $\delta = 1$, $D = 100$, $\rho = 0.1$ with CS structure, which results in the smallest
166
Table 4. Reanalysis of the leukemia data from Golub et al. (1999)

Alternative hypothesis                                    µALL ≠ µAML       µALL < µAML
Gene index (description)                                  SDP      SSP      SDP      SSP
1701 (FAH Fumarylacetoacetate)                            0.0003   0.0003   0.0004   0.0004
3001 (Leukotriene C4 synthase)                            0.0003   0.0003   0.0004   0.0004
4528 (Zyxin)                                              0.0003   0.0003   0.0004   0.0004
1426 (LYN V-yes-1 Yamaguchi)                              0.0004   0.0004   0.0005   0.0005
4720 (LEPR Leptin receptor)                               0.0004   0.0004   0.0005   0.0005
1515 (CD33 CD33 antigen)                                  0.0006   0.0006   0.0006   0.0006
402 (Liver mRNA for IGIF)                                 0.0010   0.0010   0.0009   0.0009
3877 (PRG1 Proteoglycan 1)                                0.0012   0.0012   0.0010   0.0010
1969 (DF D component of complement)                       0.0013   0.0013   0.0011   0.0011
3528 (GB DEF)                                             0.0013   0.0013   0.0010   0.0010
930 (Induced Myeloid Leukemia Cell)                       0.0016   0.0016   0.0013   0.0013
5882 (IL8 Precursor)                                      0.0016   0.0016   0.0013   0.0013
1923 (PEPTIDYL-PROLYL CIS-TRANS Isomerase)                0.0017   0.0017   0.0014   0.0014
2939 (Phosphotyrosine independent ligand p62)             0.0018   0.0018   0.0014   0.0014
1563 (CST3 Cystatin C)                                    0.0026   0.0026   0.0021   0.0021
1792 (ATP6C Vacuolar H+ ATPase proton channel subunit)    0.0027   0.0027   0.0023   0.0023
1802 (CTSD Cathepsin D)                                   0.0038   0.0038   0.0032   0.0032
5881 (Interleukin 8)                                      0.0041   0.0041   0.0036   0.0036
6054 (ITGAX Integrin)                                     0.0056   0.0055   0.0042   0.0041
6220 (Epb72 gene exon 1)                                  0.0075   0.0075   0.0062   0.0062
1724 (LGALS3 Lectin)                                      0.0088   0.0088   0.0071   0.0071
2440 (Thrombospondin-p50)                                 0.0091   0.0091   0.0073   0.0073
6484 (LYZ Lysozyme)                                       0.0101   0.0100   0.0081   0.0080
1355 (FTL Ferritin)                                       0.0107   0.0106   0.0086   0.0085
2083 (Azurocidin)                                         0.0107   0.0106   0.0086   0.0085
1867 (Protein MAD3)                                       0.0114   0.0113   0.0092   0.0091
6057 (PFC Properdin P factor)                             0.0143   0.0142   0.0108   0.0107
3286 (Lysophospholipase homolog)                          0.0168   0.0167   0.0126   0.0125
6487 (Lysozyme)                                           0.0170   0.0169   0.0127   0.0126
1510 (PPGB Protective protein)                            0.0178   0.0177   0.0133   0.0132
6478 (LYZ Lysozyme)                                       0.0193   0.0191   0.0144   0.0142
6358 (HOX 2.2)                                            0.0210   0.0208   0.0160   0.0158
3733 (Catalase EC 1.11.1.6)                               0.0216   0.0214   0.0162   0.0160
1075 (FTH1 Ferritin heavy chain)                          0.0281   0.0279   0.0211   0.0209
6086 (CD36 CD36 antigen)                                  0.0300   0.0298   0.0224   0.0222
189 (ADM)                                                 0.0350   0.0348   0.0260   0.0258
1948 (CDC25A Cell division cycle)                         0.0356   0.0354   0.0263   0.0261
5722 (APLP2 Amyloid beta precursor-like protein)          0.0415   0.0413   0.0306   0.0304
5686 (TIMP2 Tissue inhibitor of metalloproteinase)        0.0425   0.0423   0.0314   0.0312
5453 (C-myb)                                              0.0461   0.0459   1.000    1.000
6059 (NF-IL6-beta protein mRNA)                           0.0482   0.0480   0.0350   0.0348

Adjusted p-values from two-sided hypothesis less than 0.05 are listed in increasing order among total m = 6810 genes investigated. The total number of studied subjects n was 38 (n_ALL = 27 and n_AML = 11). B = 10 000 times of permutation were used. Note that C-myb gene has p-value of 0.015 against the hypothesis µALL > µAML. Although some gene descriptions are identical, gene accession numbers are different.
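The SSP and SDP columns of Table 4 are permutation-based adjusted p-values. As a rough illustration only, the following Python sketch computes single-step and step-down adjusted p-values of the max-|T| type from B random relabelings of the subjects. The pooled-variance t-statistic, the function names and the data layout (`data[j][i]` for gene j and subject i, `labels[i]` equal to 1 or 2) are assumptions made for this example; they are not taken from the authors' implementation.

```python
import random
from statistics import mean, variance


def t_stat(x, y):
    """Two-sample t-statistic with pooled variance (an assumed choice)."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / (pooled * (1.0 / n1 + 1.0 / n2)) ** 0.5


def abs_t_all_genes(data, labels):
    """Return |t| for every gene under the given group labels."""
    out = []
    for row in data:
        g1 = [v for v, lab in zip(row, labels) if lab == 1]
        g2 = [v for v, lab in zip(row, labels) if lab == 2]
        out.append(abs(t_stat(g1, g2)))
    return out


def maxt_adjusted_pvalues(data, labels, B=1000, seed=0):
    """Return (single_step, step_down) adjusted p-values, two-sided tests."""
    rng = random.Random(seed)
    m = len(data)
    obs = abs_t_all_genes(data, labels)
    # Gene indices sorted from most to least significant.
    order = sorted(range(m), key=lambda j: -obs[j])
    ss_count = [0] * m
    sd_count = [0] * m
    perm = list(labels)
    for _ in range(B):
        rng.shuffle(perm)  # relabel subjects, keeping each subject's genes together
        t_b = abs_t_all_genes(data, perm)
        # Single-step: compare every observed |t| with the overall maximum.
        overall_max = max(t_b)
        for j in range(m):
            if overall_max >= obs[j]:
                ss_count[j] += 1
        # Step-down: maximum over genes that are no more significant than gene j.
        running = 0.0
        for rank in range(m - 1, -1, -1):
            j = order[rank]
            running = max(running, t_b[j])
            if running >= obs[j]:
                sd_count[j] += 1
    single = [c / B for c in ss_count]
    step = [c / B for c in sd_count]
    # Enforce monotonicity of the step-down p-values along the ordering.
    for rank in range(1, m):
        step[order[rank]] = max(step[order[rank]], step[order[rank - 1]])
    return single, step
```

Permuting the group labels rather than individual expression values keeps each subject's gene vector intact, which is how the dependence among genes is preserved in the reference distribution. A call such as `maxt_adjusted_pvalues(data, labels, B=10000)` would mirror the number of permutations quoted in the table note, at a correspondingly higher running time in pure Python.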
sample size, $n = 34$, in the above sample size calculation. Gene expression data were generated from a correlated asymmetric distribution:

$$X_{kj} = \mu_{kj} + (e_{kj} - 2)\sqrt{(1 - \rho)/4} + (e_{k0} - 2)\sqrt{\rho/4}$$

for $1 \le j \le m$ and $k = 1, 2$. Here, $\mu_{1j} = \delta_j$ and $\mu_{2j} = 0$, and $e_{k0}, e_{k1}, \ldots, e_{km}$ are i.i.d. random variables from a $\chi^2$ distribution with two degrees of freedom. Because such a variable has mean 2 and variance 4, the gene-specific term contributes variance $1 - \rho$ and the shared term $e_{k0}$ contributes variance $\rho$. Note that $(X_{k1}, \ldots, X_{km})$ have means $(\mu_{k1}, \ldots, \mu_{km})$, marginal variances 1, and a compound symmetry correlation structure with $\rho = 0.1$. In this case, we obtained an empirical FWER of 0.060 and an empirical global power of 0.832, which are close to the nominal $\alpha = 0.05$ and $1 - \beta = 0.8$, respectively, from a simulation with $B = N = 1000$.

5. DISCUSSION

Genomic scientists are using DNA microarrays as a major high-throughput assay to display DNA or RNA abundance for a large number of genes concurrently; this has rekindled interest in statistical issues such as multiple testing and poses methodological and computational challenges. Endeavors to identify the informative genes should take multiplicity into account, but should also have enough power to discover important genes successfully. This problem differs from classical multiple testing situations in that the number of truly effective genes is often very small compared to the number of candidate genes under investigation. Moreover, only a small sample size is often available, so large sample theory is not justified for standard statistical inference. An underpowered study is no service to the investigator or to science; results that are ‘significant’ without assurance will often fail to replicate, and time will be wasted and resources needlessly expended. In this paper, we compared three popular testing procedures and developed a new fast algorithm for determining sample size, with a particular emphasis on the microarray context.
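As a companion to the simulation check reported at the end of Section 4, the following Python sketch draws data from a shared-factor model of the kind described there. It assumes that the shared term e_k0 carries the weight sqrt(ρ/4) and the gene-specific term e_kj the weight sqrt((1 − ρ)/4), so that the pairwise correlation equals ρ; the function names, the seed and the parameter values are hypothetical choices for illustration. One shared draw per subject is enough to induce compound symmetry, so the cost is linear in the number of genes.

```python
import random
from statistics import mean


def simulate_group(mu, n, rho, rng):
    """Draw n subjects for one group; returns n lists of len(mu) expressions.

    Assumed model: X_j = mu_j + b * (e_j - 2) + a * (e_0 - 2), where e_0 is
    shared by all genes of a subject and every e is chi-square with 2 d.f.
    """
    a = (rho / 4.0) ** 0.5          # weight on the shared term e_0
    b = ((1.0 - rho) / 4.0) ** 0.5  # weight on the gene-specific term e_j
    subjects = []
    for _ in range(n):
        # Gamma(shape=1, scale=2) is the chi-square distribution with 2 d.f.
        shared = rng.gammavariate(1.0, 2.0) - 2.0
        subjects.append(
            [m_j + b * (rng.gammavariate(1.0, 2.0) - 2.0) + a * shared for m_j in mu]
        )
    return subjects


def sample_corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    mu_u, mu_v = mean(u), mean(v)
    cov = sum((x - mu_u) * (y - mu_v) for x, y in zip(u, v))
    var_u = sum((x - mu_u) ** 2 for x in u)
    var_v = sum((y - mu_v) ** 2 for y in v)
    return cov / (var_u * var_v) ** 0.5


if __name__ == "__main__":
    rng = random.Random(2005)
    draws = simulate_group(mu=[0.5, 0.0, 0.0], n=20000, rho=0.1, rng=rng)
    gene1 = [s[0] for s in draws]
    gene2 = [s[1] for s in draws]
    print("mean of gene 1:", round(mean(gene1), 3))
    print("corr(gene 1, gene 2):", round(sample_corr(gene1, gene2), 3))
```

With a large number of draws, the sample mean of a gene should be close to its µ and the sample correlation between any two genes close to ρ, which gives a quick sanity check on the weights before running a full FWER and power simulation.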
We suggest using exact permutation-based tests, but we also argue for the utility of the single-step procedure, which is often undervalued. Permutation tests do not require specification of the joint distribution or the true correlation structure of the gene expression data. In the circumstances typical of microarray studies, we verified that the actual advantage of the step-down procedure is minimal, and that its improvement is more relevant in classical testing situations dealing with a small number of hypotheses. The single-step method is fast, easy to understand, computes critical values as well as adjusted p-values and, most importantly, offers a simple way to calculate sample size. Generating high-dimensional (say, 10 000) multivariate (normal) data many times (say, 5000) is not a simple undertaking even with a fast computer. To the best of our knowledge, there is no fast numerical algorithm for generating high-dimensional random vectors from a general correlation structure. Under such technical constraints, some simplifying assumptions (e.g. BCS or CS correlation structure, common effect size and normal test statistics) may be more realistic in microarray analysis. However, further simulation under more varied conditions would be extremely useful. Our method for sample size determination is implemented efficiently using a novel and fast algorithm, and it is accurate, as reflected in the empirical evaluation. Although there have been several publications on sample size estimation in the microarray context, none have examined the accuracy of their estimates.
Furthermore, all focused on exploratory and approximate relationships among statistical power, sample size (or the number of replicates) and effect size (often, in terms of fold-change), and used the most conservative Bonferroni adjustment without any attempt to incorporate underlying correlation structure (Witte et al., 2000; Wolfinger et al., 2001; Black and Doerge, 2002; Lee and Whitmore, 2002; Pan et al., 2002; Simon et al., 2002; Cui and Churchill, 2003). By comparing empirical power resulting from naive and modified methods, we show that an ostensibly similar but incorrect choice of sample size ascertainment could cause considerable underestimation of
required sample size. We recommend that the assessment of bias in empirical power (compared to nominal power) be a conventional step in the publication of all sample size papers.
Recently, some researchers have proposed new error concepts such as the false discovery rate (FDR) and the positive FDR (pFDR), which control the expected proportion of Type I errors among the rejected hypotheses (Benjamini and Hochberg, 1995; Storey, 2002). In general, controlling these quantities relaxes the multiple testing criterion compared to controlling the FWER and increases the number of genes declared significant. In particular, pFDR is motivated by a Bayesian perspective and inherits the single-step idea in constructing q-values, which are the counterpart of the adjusted p-values in this case (Ge et al., 2003). It would be useful to do a sample size comparison for FDR, pFDR and FWER. FWER is important as a benchmark because the reexamination of Golub et al.’s data tells us that classical FWER control (along with global power) may not be as exceedingly conservative as many researchers thought, and it carries clear conceptual and practical interpretations.
Appendices are available online at http://biostatistics.oupjournals.org/.
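To make the contrast between FWER and FDR control drawn in the discussion concrete, the short Python sketch below counts rejections under the Bonferroni rule and under the step-up rule of Benjamini and Hochberg (1995) for the same list of raw p-values. The function names and the ten p-values are made up for illustration; the sketch is not part of the method proposed in this paper and ignores the correlation structure that the permutation procedures exploit.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """FWER control: reject H_j when p_j <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg_reject(pvals, alpha=0.05):
    """FDR control: reject the k smallest p-values, where k is the largest
    rank with p_(k) <= k * alpha / m (step-up rule)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    # Hypothetical raw p-values for ten genes.
    raw = [0.001, 0.008, 0.012, 0.019, 0.03, 0.2, 0.5, 0.7, 0.9, 0.95]
    print("Bonferroni rejections:", sum(bonferroni_reject(raw)))
    print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg_reject(raw)))
```

Every hypothesis rejected by Bonferroni is also rejected by the step-up rule at the same level, so the FDR-based list of genes is never shorter; how much longer it is depends on how many small p-values are present.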
ACKNOWLEDGEMENTS
The authors are grateful to the reviewers for their careful and speedy reviews of this paper. Their comments greatly improved this paper without a doubt.
REFERENCES
ALIZADEH, A. A. AND STAUDT, L. M. (2000). Genomic-scale gene expression profiling of normal and malignant immune cells. Current Opinions in Immunology 12, 219–225.
BENJAMINI, Y. AND HOCHBERG, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57, 289–300.
BLACK, M. A. AND DOERGE, R. W. (2002). Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments. Bioinformatics 18, 1609–1616.
CUI, X. AND CHURCHILL, G. A. (2003). How many mice and how many arrays? Replication in mouse cDNA microarray experiments. In Johnson, K. F. and Lin, S. M. (eds), Methods of Microarray Data Analysis II, Norwell, MA: Kluwer Academic Publishers, pp. 139–154.
DUDOIT, S., SHAFFER, J. P. AND BOLDRICK, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science 18, 71–103.
DUDOIT, S., YANG, Y. H., CALLOW, M. J. AND SPEED, T. P. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12, 111–139.
GE, Y., DUDOIT, S. AND SPEED, T. P. (2003). Resampling-based multiple testing for microarray data analysis. TEST 12, 1–44.
GOLUB, T. R., SLONIM, D. K., TAMAYO, P., HUARD, C., GAASENBEEK, M., MESIROV, J. P., COLLER, H., LOH, M. L., DOWNING, J. R., CALIGIURI, M. A., BLOOMFIELD, C. D. AND LANDER, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
HOCHBERG, Y. (1998). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800–802.
HOLM, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65–70.
LEE, M. L. T. AND WHITMORE, G. A. (2002). Power and sample size for DNA microarray studies. Statistics in Medicine 21, 3543–3570.
MUTTER, G. L., BAAK, J. P. A., FITZGERALD, J. T., GRAY, R., NEUBERG, D., KUST, G. A., GENTLEMAN, R., GALLANS, S. R., WEI, L. J. AND WILCOX, M. (2001). Global express changes of constitutive and hormonally regulated genes during endometrial neoplastic transformation. Gynecologic Oncology 83, 177–185.
PAN, W. (2002). A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18, 546–554.
PAN, W., LIN, J. AND LE, C. T. (2002). How many replicated of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biology 3, 1–10.
PRESS, W. H., TEUKOLSKY, S. A., VETTERLING, W. T. AND FLANNERY, B. P. (1996). Numerical Recipes in Fortran 90. New York: Cambridge University Press.
SANDER, C. (2000). Genomic medicine and the future of health care. Science 287, 1977–1978.
SHAFFER, J. P. (2002). Multiplicity, directional (Type III) errors, and the null hypothesis. Psychological Methods 7, 356–369.
SIMON, R., RADMACHER, M. D. AND DOBBIN, K. (2002). Design of studies with DNA microarrays. Genetic Epidemiology 23, 21–36.
STOREY, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society B 64, 479–498.
THOMAS, J. G., OLSON, J. M., TAPSCOTT, S. J. AND ZHAO, L. P. (2001). An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Research 11, 1227–1236.
TROENDLE, J. F., KORN, E. L. AND MCSHANE, L. M. (2004). An example of slow convergence of the bootstrap in high dimensions. American Statistician 58, 25–29.
WEST, M., BLANCHETTE, C., DRESSMAN, H., HUANG, E., ISHIDA, S., SPRANG, R., ZUZAN, H., OLSON, J., MARKS, J. AND NEVINS, J. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences USA 98, 11462–11467.
WESTFALL, P. H. AND YOUNG, S. S. (1989). P-value adjustments for multiple tests in multivariate binomial models. Journal of the American Statistical Association 84, 780–786.
WESTFALL, P. H. AND YOUNG, S. S. (1993). Resampling-based Multiple Testing: Examples and Methods for P-value Adjustment. New York: Wiley.
WESTFALL, P. H. AND WOLFINGER, R. D. (1997). Multiple tests with discrete distributions. American Statistician 51, 3–8.
WESTFALL, P. H., ZAYKIN, D. V. AND YOUNG, S. S. (2001). Multiple tests for genetic effects in association studies: methods in molecular biology. In Looney, S. (ed.), Biostatistical Methods, Toloway, NJ: Humana Press, pp. 143–168.
WITTE, J. S., ELSTON, R. C. AND CARDON, L. R. (2000). On the relative sample size required for multiple comparisons. Statistics in Medicine 19, 369–372.
WOLFINGER, R. D., GIBSON, G., WOLFINGER, E. D., BENNETT, L., HAMADEH, H., BUSHEL, P., AFSHARI, C. AND PAULES, R. S. (2001). Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology 8, 625–637.
[Received 27 April 2004; revised 4 August 2004; accepted for publication 23 September 2004]