[Article] A note on joint versus gene-specific mixed model analysis of microarray gene expression data


286 49 31KB

English Pages 4 Year 2005

Report DMCA / Copyright

DOWNLOAD PDF FILE

Recommend Papers

[Article] A note on joint versus gene-specific mixed model analysis of microarray gene expression data

  • 0 0 0
  • Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up
File loading please wait...
Citation preview

Biostatistics (2005), 6, 2, pp. 183–186 doi:10.1093/biostatistics/kxi001

A note on joint versus gene-specific mixed model analysis of microarray gene expression data INA HOESCHELE∗ , HUA LI Virginia Bioinformatics Institute and Department of Statistics, Virginia Tech, Blacksburg, VA 24061-0477, USA [email protected]

S UMMARY Currently, linear mixed model analyses of expression microarray experiments are performed either in a gene-specific or global mode. The joint analysis provides more flexibility in terms of how parameters are fitted and estimated and tends to be more powerful than the gene-specific analysis. Here we show how to implement the gene-specific linear mixed model analysis as an exact algorithm for the joint linear mixed model analysis. The gene-specific algorithm is exact, when the mixed model equations can be partitioned into unrelated components: One for all global fixed and random effects and the others for the gene-specific fixed and random effects for each gene separately. This unrelatedness holds under three conditions: (1) any gene must have the same number of replicates or probes on all arrays, but these numbers can differ among genes; (2) the residual variance of the (transformed) expression data must be homogeneous or constant across genes (other variance components need not be homogeneous) and (3) the number of genes in the experiment is large. When these conditions are violated, the gene-specific algorithm is expected to be nearly exact. Keywords: Differential gene expression; Microarrays; Mixed model analysis.

1. I NTRODUCTION Microarray gene expression experiments produce expression profiles of thousands or ten thousands of genes simultaneously. The focus of this paper is on the application of linear mixed model methodology to the detection and estimation of differential gene expression or, more generally, to the identification of which factors of interest influence the expression of which of the arrayed genes and to the estimation of their effects. Linear mixed model analysis (LMMA) of microarray data is either gene-specific (Wolfinger et al., 2001) or global (e.g. Kerr and Churchill, 2001), i.e. the analysis is performed separately for each gene or jointly for all genes, respectively. Gene-specific analyses can have low power as noticed, e.g. by Wu et al. (2003) and Pfister-Genskow et al. (2004). Reasons for lower power of gene-specific analyses, relative to the joint analysis, include the difference in degrees of freedom, joint versus gene-specific estimation of the error variance and other variance components, and the use of different contrasts. Conditions for equivalence between gene-specific and joint analyses have been stated in previous contributions for the case of fixed linear models (e.g. Kerr, 2003; Wu et al., 2003). Here we discuss the use of the gene-specific analysis as an algorithm for the joint analysis under a mixed linear model. ∗ To whom correspondence should be addressed.

c The Author 2005. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected]. 

184

I. H OESCHELE AND H. L I 2. M ETHODS

We write a general linear mixed model for microarray data as follows y = X1 β1 + Z1 u1 + X2 β2 + Z2 u2 + e,

(2.1)

where β1 and u1 contain the effects of the fixed (e.g. treatment) and random global factors, respectively, β2 and u2 contain the effects of the fixed and random gene-specific factors, respectively, and X1 , Z1 and X2 , Z2 are design matrices for the global and gene-specific factors, respectively. Gene-specific effects include the gene main effects (fixed) and interactions between the global fixed and random factors with the gene factor. It is usually appropriate to treat array effects as random, so e.g. let u1 = A be the vector of random global array effects and let u2 = A × G be the vector of random gene-specific array effects. For the mixed model analysis, we need to specify the variance–covariance structure of the random factors. We typically 2 specify that E(A) = 0, Var(A) = G11 = Iσ A2 , E(A × G) = 0, E(e) = 0, Var(A × G) = G22 = Iσ AG 2 and Var(e) = R = Iσe , under the assumption that gene-specific are constant across genes, or g variances g 2 , under the assumption that gene2 and Var(e) = R = i=1 Ini σe(i) Var(A × G) = G22 = i=1 In a σ AG(i) specific variances differ among genes, where n a is the number of arrays, n i is the number of observations 2 and σ 2 are the gene-specific array variance or variance for gene i, σ A2 is the global array variance and σ AG e due to array-by-gene interaction and the residual variance, respectively. Inferences about the unknown parameters (fixed effects and variance components) are obtained by utilizing the mixed model equations (MME) (Goldberger, 1962; Henderson, 1963). The MME for the uncentered mixed model in (2.1) are ⎡ ⊤ −1 ⎤ ⎡ ⎤ ⎡ ⊤ −1 ⎤ −1 −1 −1 X1 R X1 X⊤ X⊤ X⊤ X1 R y βˆ1 1 R X2 1 R Z1 1 R Z2 ⎢X⊤ R−1 X X⊤ R−1 X ⎥ ⎢ ⎥ ⎢X⊤ R−1 y⎥ ⊤ ⊤ −1 −1 ˆ R Z R Z X X β 2 1 2 1 ⎢ 2 ⎥ ⎥ ⎢ 2⎥ ⎢ 2 2 2 2 (2.2) ⎢ ⊤ −1 ⎢ ⎥ = ⎢ ⊤ −1 ⎥ , ⊤ R−1 Z + G11 Z⊤ R−1 Z + G12 ⎥ −1 X ⎣Z1 R X1 Z⊤ ⎦ ⎦ ⎣ ⎦ ⎣ R Z R y Z ˆ u 2 1 2 1 1 1 1 1 ⊤ −1 ⊤ −1 −1 21 Z⊤ R−1 Z + G22 −1 Z⊤ Z⊤ uˆ 2 2 2 R X1 Z2 R X2 Z2 R Z1 + G 2 2R y where −1

G



G11 = G21

G12 G22

−1

G11 = G21 

G12 , G22

ˆ where G12 = Cov(A, A × G⊤) = 0. Given known values of the variance components, β-solutions to the ˆ MME are best linear unbiased estimates and u-solutions are best linear unbiased predictions. Wolfinger et al. (2001) proposed to implement the gene-specific analysis with a two-step procedure. First, the data y are analyzed with a sub-model (normalization model) containing only global effects (β1 and u1 ), and then the predicted residuals from this model are analyzed with the second sub-model (gene-model) containing only gene-specific effects (β2 and u2 ). Fitting model (2.1), with centering of the columns of X2 and Z2 (Rencher, 2000) or without β2 and u2 , implies a reparameterization from β1 to β1∗ and u1 to u∗1 , where an element in β1∗ or u∗1 now contains the average of the interactions of this factor with the gene factor (G). We denote the centered matrices by X∗2 and Z∗2 . After centering, or equivalently, after fitting a mixed (normalization) model containing only global factors, the vector of global array effects becomes A∗ with elements A∗k = Ak + (A × G)k· and with Var(A∗ ) = G∗11 = Iσ A2∗ and Cov(A∗ , A × G⊤) = G∗12 . The array variance component now is σ A2∗ = Var(A∗k ) = Var(Ak + A × G k· ), which in the simplest case of equal replicate or probe numbers across all genes simplifies to σ A2∗ = 2 , where n is the number of genes. Matrix G∗ has non-zero elements between any A∗ and σ A2 + n1G σ AG G k 12 2 . The MME for the (A × G)kg for all g and a given k, which in the simplest case are equal to n1G σ AG centered mixed model are obtained by replacing X2 and Z2 by X∗2 and Z∗2 and replacing G−1 with −1  ∗11  G G∗12 G11 G∗12 −1 ∗ . = G = G∗21 G∗22 G∗21 G∗22

Joint model analysis

185

It can be shown that there are non-zero elements in matrix G∗12 between A∗k and (A × G)kg for all g, and that there are non-zero elements in matrix G∗22 between (A × G)kg and (A × G)kg′ for all g, g ′ with g = g ′ . Therefore, the offdiagonal block in the coefficient matrix of the centered MME between u∗1 = A∗ and u∗2 = A × G is no longer strictly zero, and the same holds for offdiagonal blocks between any sub-vectors of u∗2 corresponding to two different genes. As a consequence, the global and the genespecific MME are not exactly unrelated so that the gene-specific analysis is no longer an exact algorithm to perform the joint analysis. The degree of approximation approaches exactness when the number of genes becomes large so that G∗ approaches G. For the centered MME, all offdiagonal sub-matrices of ∗ −1 the coefficient matrix in (2.2) involving global and gene-specific design matrices (e.g. X⊤ 1 R X2 ) are 2 zero when R = Iσe , but are not (exactly) zero for other R. Separating the MME (2.2) into n G + 1 (approximately) unrelated components greatly improves the efficiency of LMMA with joint estimation of global and gene-specific variances. 3. R ESULTS The gene-specific algorithm for the joint analysis is illustrated with a worked example using a small, artificial data set and fixed and mixed models, which is presented as supplementary data available at Biostatistics online. We also performed the joint analysis on a larger real data set and on data sets simulated with structure and parameters similar to those of the real data. The real data included 2640 cDNAs printed in duplicate on 100 arrays, two cell populations and 10 biological replicates from each cell population [the analysis of this data set is described in detail by Pfister-Genskow et al. (2004)]. We analyzed the real and simulated data with the ASReml software (Gilmour et al., 2002) and with our gene-specific algorithm for the joint analysis and confirmed that the estimates of the variance components and the test statistics were identical up to numerical accuracy [differences in the results between the joint analysis and the original gene-specific analysis of Wolfinger et al. (2001) were reported in Pfister-Genskow et al. (2004)]. 4. D ISCUSSION Joint analysis of the expression data on all genes in a microarray experiment has multiple advantages over the common practice of analyzing the data on individual genes separately: (1) There is great flexibility on how the variance–covariance structure of the data is modeled, and different formulations can be compared (homogeneous within-gene variance components, heterogeneous within-gene variance components estimated by shrinkage methods or by grouping of genes, etc.). (2) It allows us to evaluate the significance of global treatment and technical factors and to evaluate gene-specific treatments contrasts which include main effects (Black and Doerge, 2002). (3) As pointed out by Kerr (2003), joint analysis also permits model evaluation and residual analysis. We note, however, that while for linear regression models, there is well-established theory on residuals and influence, and outlier detection statistics or deletion diagnostics are commonly available in regression software packages, residual analysis and deletion diagnostics are not yet performed routinely in mixed model analysis and are still underdeveloped for mixed models despite recent progress (e.g. Haslett and Dillane, 2004). We do not consider the joint and gene-specific analyses as alternative methods, but we view the genespecific analysis merely as an efficient algorithm for performing the joint analysis. We have shown that the gene-specific algorithm provides (essentially) exact inferences for the joint LMMA under three conditions: (1) equal number of replicate spots or probes for any gene across all arrays (these numbers can differ among genes), (2) homogeneous residual variance across genes (other variance components may be heterogeneous) and (3) sufficiently large numbers of genes in the experiment (this condition is needed for mixed but not for fixed models).

186

I. H OESCHELE AND H. L I ACKNOWLEDGMENTS

This work was supported by NSF Plant Genome Cooperative Agreement DBI-0211863. R EFERENCES B LACK , M. A. AND D OERGE , R. W. (2002). Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments. Bioinformatics 18, 1609–1616. G ILMOUR , A. R., C ULLIS , B. R., W ELHAM , S. J. AND T HOMPSON , R. (2002). ASREML Reference Manual. NSW, Australia: NSW Agriculture, Orange Agricultural Institute. G OLDBERGER, A. S. (1962). Best linear unbiased prediction in the generalized linear regression model. Journal of the American Statistical Association 57, 369–375. H ASLETT, J. AND D ILLANE , D. (2004). Application of ‘delete = replace’ to deletion diagnostics for variance component estimation in the linear mixed model. Journal of the Royal Statistical Society, Series B 66, 131–143. H ENDERSON, C. R. (1963). Selection index and expected genetic advance. In Statistical Genetics and Plant Breeding (NRC publication 982). Washington, DC: National Academy of Sciences, pp. 141–163. K ERR, M. K. (2003). Linear models for microarray data analysis: hidden similarities and differences. Journal of Computational Biology 10, 891–901. K ERR , M. K. 183–201.

AND

C HURCHILL , G. (2001). Experimental design for gene expression microarrays. Biostatistics 2,

P FISTER -G ENSKOW, M., C HILDS , L., M YERS , C., L ACSON , J., B ETTHAUSER , J., G OULEKE , P., Z HENG , Y., L ENO , G., F ORSBERG , E., YANG , X., H OESCHELE , I. AND E ILERTSEN , K. J. (2004). Analysis of individual pre-implantation cattle embryos using cDNA microarrays: comparison of NT and IVF produced embryos using an interwoven loop design and linear mixed model analysis. Biology of Reproduction 72. In press. R ENCHER, A. C. (2000) Linear Models in Statistics. New York: John Wiley & Sons. W OLFINGER , R. D., G IBSON , G., W OLFINGER , E. D., B ENNETT, L., H AMADEH , H., B USHEL , P., A FSHARI , C. AND PAULES , R. S. (2001). Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology 8, 625–637. W U , H., K ERR , M. K., C UI , X. Q. AND C HURCHILL , G. A. (2003). MAANOVA: A Software Package for the Analysis of Spotted cDNA Microarray Experiments. The Analysis of Gene Expression Data: Methods and Software. New York: Springer. http://www.jax.org/research/churchill/. [Received February 18, 2004; first revision June 17, 2004; second revision September 14, 2004; accepted for publication October 11, 2004]