BGX: a fully Bayesian integrated approach to the analysis of Affymetrix GeneChip data



English Pages 25 Year 2005


Biostatistics (2005), 6, 3, pp. 349–373 doi:10.1093/biostatistics/kxi016 Advance Access publication on April 14, 2005

BGX: a fully Bayesian integrated approach to the analysis of Affymetrix GeneChip data

ANNE-METTE K. HEIN∗
Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London W2 1PG, UK
[email protected]

SYLVIA RICHARDSON
Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London W2 1PG, UK

HELEN C. CAUSTON
Microarray Centre, MRC Clinical Sciences Centre, Imperial College, Hammersmith Hospital, London W12 0NN, UK

GRAEME K. AMBLER, PETER J. GREEN
School of Mathematics, University of Bristol, University Walk, Bristol BS8 1TW, UK

SUMMARY

We present Bayesian hierarchical models for the analysis of Affymetrix GeneChip data. The approach we take differs from other available approaches in two fundamental aspects. Firstly, we aim to integrate all processing steps of the raw data in a common statistically coherent framework, allowing all components and thus associated errors to be considered simultaneously. Secondly, inference is based on the full posterior distribution of gene expression indices and derived quantities, such as fold changes or ranks, rather than on single point estimates. Measures of uncertainty on these quantities are thus available. The models presented represent the first building block for integrated Bayesian analysis of Affymetrix GeneChip data: the models take into account additive as well as multiplicative error, gene expression levels are estimated using perfect match and a fraction of mismatch probes and are modeled on the log scale. Background correction is incorporated by modeling true signal and cross-hybridization explicitly, and a need for further normalization is considerably reduced by allowing for array-specific distributions of nonspecific hybridization. When replicate arrays are available for a condition, posterior distributions of condition-specific gene expression indices are estimated directly, by a simultaneous consideration of replicate probe sets, avoiding averaging over estimates obtained from individual replicate arrays. The performance of the Bayesian model is compared to that of standard available point estimate methods on subsets of the well-known GeneLogic and Affymetrix spike-in data. The Bayesian model is found to perform well and the integrated procedure presented appears to hold considerable promise for further development.

∗ To whom correspondence should be addressed.

© The Author 2005. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected].


Keywords: Affymetrix; Bayesian; Differential expression; GeneChip; Gene expression; Markov chain Monte Carlo; Probe-level analysis.

1. INTRODUCTION

Microarrays are one of the new technologies that have developed in line with the sequencing of human and other genomes and the developments in miniaturization and robotics. They permit the expression profiles of tens of thousands of genes to be measured in a single experiment and promise to revolutionize the biomedical and life sciences. Affymetrix is one of the leading manufacturers of microarrays. Their ‘GeneChips’ differ from many other array types in that a single labeled extract is hybridized to each array and because they contain multiple ‘match’ and ‘mismatch’ sequences for each transcript. This presents particular challenges for low-level data analysis.

1.1 Affymetrix oligonucleotide arrays

The oligonucleotide array technology exploits two fundamental biological properties: (a) mRNA is an intermediate product between genes encoded in DNA and their protein products, so mRNA abundance can be used as a measure of gene expression, and (b) single stranded RNA molecules have a high affinity to form double stranded structures. Pairing between RNA strands is highly specific and complementary strands have particularly high binding affinities. The arrays contain hundreds of thousands of features (probes), at each of which a different oligonucleotide sequence is represented in many copies. A measure of the abundances of the transcript RNAs in a biological sample can be obtained by hybridizing a fluorescently labeled, fragmented version of the RNA to an array and scanning it (Schena et al., 1995). On GeneChip arrays oligonucleotides of length 25 are used. However, many genes are similar, sharing common motifs or subsequences, and cannot, in general, be uniquely identified by a single sequence of length 25. Therefore, each gene is represented by a probe set, consisting of a number of probe pairs. A probe pair consists of a perfect match (PM) probe and a mismatch (MM) probe. At each PM probe, an oligonucleotide which perfectly matches part of the transcript is represented. The detection of transcripts at the PMs of a probe set indicates that the gene is expressed, and the level of detection indicates the degree of expression. However, although complementary RNA sequences have particularly high affinities, sequences that are complementary over only part of the length of the sequence, or shorter sequence fragments, may also hybridize. We refer to the hybridization of noncomplementary transcripts to the probes as nonspecific hybridization. This is the motivation for including MM probes. The oligonucleotides represented at an MM probe are identical to those at the corresponding PM probe, except that the middle nucleotide is that of the complementary base. 
The intention is that, since PM and MM probes are almost identical, equal amounts of nonspecific hybridization will occur at these probes. Excess hybridization to the PM probe, relative to the MM probe will be due to specific hybridization, that is, the hybridization of complementary transcripts. A probe set for a gene typically consists of 11–20 PM and MM probe pairs, and these represent the information available about the expression of the gene.

1.2 Gene expression experiments and analysis

The generation of gene expression data is a multistep process, and variability may be introduced at a number of experimental stages. The variability of interest is that of biological origin, e.g. variability in gene expression between experimental conditions, individuals, or tissue types. Variability of nonbiological origin may arise due to differences in the preparation of the biological samples to be hybridized, in the manufacture of the arrays, or in the process of scanning the arrays (see Hartemink et al., 2001, for a more detailed discussion). The replicability of raw gene expression data is low and gene expression data are notoriously noisy.

The analysis of gene expression data is usually also treated as a multistep process consisting of correcting the intensities for background noise, normalization between arrays, estimation of gene expression indices, assessment of which genes are differentially expressed, and clustering of genes or conditions with similar expression profiles or patterns. A drawback of splitting up the analysis of gene expression data into separate steps that are dealt with independently is that the error associated with each step is ignored in the downstream analysis. The probable consequences of this include overoptimistic estimates of precision of estimates and, in nonlinear models like this, systematic bias. In assessing differential expression, it is clearly of interest to know how reliable the expression index of a gene is. In turn, in the estimation of the gene expression index, it is of interest to quantify the variability in the background corrected intensities, on which the estimation is based. A primary aim of the work presented here is to develop a statistically coherent framework for the analysis of Affymetrix arrays, in which the splitting up of the analysis into separate steps is avoided, and thus more realistic measures of credibility are obtained.

1.3 Bayesian hierarchical modeling of Affymetrix gene expression data

In this paper we present Bayesian hierarchical models for the analysis of gene expression data, where all steps in the process that generates the data, and associated errors, are modeled simultaneously. For clarity, we first set out a model for estimating the expression of genes using data obtained from a single array. In the model, background correction for nonspecific hybridization and calculation of gene expression indices are considered simultaneously. Next, we extend the model to encompass situations in which different experimental conditions are considered, and where replicate arrays may be available. Here all information is used simultaneously to make the relevant inferences: condition-specific gene expression levels are obtained from a simultaneous consideration of probe sets on replicate arrays. We base the inference on the full posterior distributions for the parameters, and functions thereof. Credibility intervals as well as point estimates of gene expression levels, fold changes, and ranks of the genes with respect to the degree of differential expression are thus available. The integration of all steps of the analysis into one framework comes at a price: the Bayesian hierarchical approach is computationally more demanding than other available approaches. In our opinion, the advantage outweighs the computational cost. The computational time scales approximately with the product of the number of arrays, genes, and probe pairs per gene. For an analysis of 10 arrays of 12 500 genes, each represented by 16 probe pairs, 10 000 sweeps can be obtained in 25 h on a 2.2 GHz AMD Opteron machine (a standard desktop PC). Thus, for many gene expression experiments the Bayesian analysis is feasible, and relative to the time, cost, and labor invested in generating the data, the computational time requirements are acceptable. 
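The scaling statement above can be turned into a rough extrapolation. The sketch below assumes strict proportionality to the product of arrays, genes, probe pairs, and sweeps, calibrated on the single timing quoted in the text; the function name and the proportionality to the number of sweeps are illustrative assumptions, not part of the BGX software.

```python
# Reference figures quoted in the text: 10 arrays, 12 500 genes,
# 16 probe pairs per gene, 10 000 sweeps in 25 h.
REF_ARRAYS, REF_GENES, REF_PROBE_PAIRS = 10, 12_500, 16
REF_SWEEPS, REF_HOURS = 10_000, 25.0


def estimated_hours(arrays, genes, probe_pairs, sweeps=REF_SWEEPS):
    """Extrapolate run time from the reference timing.

    Assumes run time is exactly proportional to
    arrays * genes * probe_pairs * sweeps (a simplifying assumption;
    the text only says the scaling is approximate).
    """
    ref_work = REF_ARRAYS * REF_GENES * REF_PROBE_PAIRS * REF_SWEEPS
    work = arrays * genes * probe_pairs * sweeps
    return REF_HOURS * work / ref_work
```

Under these assumptions, halving the number of arrays halves the estimate; hardware differences are not accounted for.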
The paper is organized as follows: In Section 2 we briefly describe the well-known data sets that are used to illustrate and evaluate the performance of the Bayesian hierarchical models proposed. In Section 3 we motivate and formulate a simple Bayesian hierarchical model for estimating gene expression levels from a single GeneChip array, and examine the performance of the model in a detailed analysis of one of the data sets. We compare the performance of the Bayesian hierarchical model to those of two alternative methods for obtaining measures of gene expression from Affymetrix GeneChip arrays. The section also includes a discussion of the sensitivity to assumptions on the prior distributions chosen in the hierarchical model. In Section 4 we present the model for the analysis of arrays under different conditions, and compare the results under this model to those obtained using other methods. A description of the Markov chain Monte Carlo (MCMC) implementation of the models is given. We conclude with a discussion of the methods presented and future directions. Supplementary material for this publication consisting of additional diagrams is available at http://www.biostatistics.oupjournals.org.

2. DATA SETS

We consider three data sets: Data Set A, Data Set B, and Data Set C, which are subsets of the well-known GeneLogic and Affymetrix spike-in data sets. These were generated under controlled experimental settings and form a good basis for evaluating and comparing methods of analysis. Data from the arrays and further details on the experiments are available from GeneLogic (http://www.genelogic.com/media/studies/index.cfm) and Affymetrix (http://www.affymetrix.com/index.affx).

For processing Data Sets A and B, GeneLogic prepared replicate samples (technical replicates) of cRNA from an acute myeloid leukemia (AML) tumor cell line. Eleven exogenous cRNAs were spiked into each sample at specific known concentrations and each sample was subsequently hybridized to an array. A similar procedure was used by Affymetrix to produce the arrays in Data Set C, with a complex human background cRNA in place of the AML tumor cell line, and 42 spike-ins, a number of which were cloned human genes, believed not to be expressed in the complex background cRNA. GeneLogic used HGU95A arrays, on which the spike-in genes were represented by 20 probe pairs, and most of the remaining genes by 16. Affymetrix used HGU133A GeneChips. On these most genes are represented by 11 probe pairs. Details of the data sets are as follows.

Data Set A: We include data from 14 of the arrays in the GeneLogic varying concentration series data set (arrays 91–96, 53–54, 56, 58, 60, 62, 64, and 66, with prefixes and suffixes 924, and hgu95a11). On array $i$ all 11 spike-in genes were spiked in at concentration $c_i$, with $c_i$, $i = 1, \ldots, 14$, equal to 0.0, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 5.0, 12.5, 25.0, 50.0, 75.0, 100.0, and 150.0 pM.

Data Set B: We include data from six of the arrays in the GeneLogic Latin Square data set (arrays 11, 21, and 31, with prefix 92557hgu95a, and 11, 21, and 31, with prefix 92561hgu95a).
The 11 spike-in genes were spiked in at two different sets of concentrations on the two triplets of arrays, with fold changes ranging from 3 to 200. We refer to the two triplets of arrays as replicates under conditions 1 and 2, respectively.

Data Set C: Data Set C consists of data from the 42 arrays in the Affymetrix U133A data set (14 experiments, each with three technical replicates). Affymetrix divided the 42 spike-in genes into 14 groups of three and the genes were spiked in so that a Latin Square was formed for the 14 gene groups in 14 experiments, with the 14 concentrations, $c_1, \ldots, c_{14}$, being 0.0, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, and 512 pM.

For simplicity, we restrict analysis to a randomly chosen subset of the data on the arrays. For the arrays in Data Sets A and B, we consider the PM and MM measurements from the 11 spike-in genes in addition to, in the case of Data Set A, the set of 500 genes having numbers 6503–7002, and for Data Set B the genes with numbers 6000–7002, excluding 6030, 6367, and 6463. All the non-spiked-in genes in these subsets have 16 probe pairs. In Data Set C we consider PM and MM measurements from every 20th gene, starting with gene 1, and all the 42 spike-in genes, resulting in a total of 1154 genes. In both cases the numbering of the genes is that obtained when reading CEL-files in R (http://www.R-project.org) using the affy package (Gautier et al., 2004). Judged from distributions of PM, MM, and PM−MM intensities the sets of genes appear representative for the full set represented on the arrays.

3. SINGLE ARRAY ANALYSIS

3.1 Motivation

A considerable effort has recently been invested in the study of Affymetrix probe-level characteristics. We present a simple model which is motivated by the following four key findings. The model serves to illustrate the main idea behind the integrated Bayesian hierarchical approach to Affymetrix expression array data analysis, and may be extended to incorporate additional features of probe behavior (see Section 5).

(1) The MM intensities generally increase with the PM intensities (Naef et al., 2002b). Both PM and MM intensities increase with concentration, and the increase in the MM intensities is less than that of the PM intensities (Irizarry et al., 2003a).
(2) The spread of replicate PM intensities increases with the average intensities (Irizarry et al., 2003a).
(3) There is considerable variability in the response of PM probes (and MM probes) representing the same gene (Li and Wong, 2001). For a substantial minority of probe pairs MM > PM (Naef et al., 2002a).
(4) Probe effects are approximately additive on the log scale (Irizarry et al., 2003a), that is, log(PM) ≈ α + β log(c).

Observation (1) indicates that transcripts which perfectly match the PM probes also hybridize to the corresponding MM probes, but that the hybridization is stronger to the PM probes than the MM probes. Observation (2) suggests that there is a multiplicative error and that it is thus necessary to transform the data to obtain a variance that is independent of the mean. The standard deviation does not approach zero as the mean does, but has a positive intercept, suggesting an additive error in addition to the multiplicative error (Geller et al., 2003). From (3), it is likely that the expression of some genes can be more reliably estimated than others. Finally, (4) suggests that calculation/estimation of gene expression should be performed on a log scale.

3.2 Model formulation for a single array

At the first level of our model, we assume that the intensity observed for the PM measurement for probe $j$ of gene $g$, $\mathrm{PM}_{gj}$, is due partly to binding of labeled cRNA that perfectly matches the sequence on the array, $S_{gj}$, and partly to hybridization of labeled cRNA that does not perfectly match the sequence, $H_{gj}$. We refer to $S_{gj}$ as ‘true signal’, and to $H_{gj}$ as ‘nonspecific hybridization’, and assume that both are nonnegative. For the corresponding mismatch intensity, $\mathrm{MM}_{gj}$, we assume that the intensity observed is due partly to binding of a fraction, $\phi \in (0, 1)$, of the true signal and partly due to nonspecific hybridization. We thus allow true signal and nonspecific hybridization to be gene and probe specific ($S_{gj}$ and $H_{gj}$, respectively), but assume that the same fraction of true signal binds to the MM probes for all genes and probes. We assume that, conditional on $S_{gj}$, $H_{gj}$, and $\phi$, $\mathrm{PM}_{gj}$ and $\mathrm{MM}_{gj}$ are normally distributed with means $S_{gj} + H_{gj}$ and $\phi S_{gj} + H_{gj}$, respectively, and variance $\tau^2$. The first level of the model is summarized in the set of equations (3.1) given below:

$$\mathrm{PM}_{gj} \sim N(S_{gj} + H_{gj}, \tau^2), \qquad \mathrm{MM}_{gj} \sim N(\phi S_{gj} + H_{gj}, \tau^2). \tag{3.1}$$
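To make the first level of the model concrete, the following minimal sketch draws probe pairs from (3.1) for fixed values of the true signal, nonspecific hybridization, fraction, and error standard deviation, and counts how often MM exceeds PM. The function names and all numerical values used with it are illustrative assumptions; this is a forward simulation only, not the inference procedure of the paper.

```python
import random


def simulate_probe_pair(S, H, phi, tau, rng):
    """Draw one (PM, MM) pair from the first-level model (3.1).

    PM ~ N(S + H, tau^2) and MM ~ N(phi * S + H, tau^2),
    where tau is the standard deviation of the additive error.
    """
    pm = rng.gauss(S + H, tau)
    mm = rng.gauss(phi * S + H, tau)
    return pm, mm


def fraction_mm_exceeds_pm(S, H, phi, tau, n=10_000, seed=0):
    """Monte Carlo estimate of P(MM > PM) for fixed S, H, phi, tau."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n):
        pm, mm = simulate_probe_pair(S, H, phi, tau, rng)
        if mm > pm:
            count += 1
    return count / n
```

Because PM − MM has mean $(1 - \phi) S_{gj}$ and variance $2\tau^2$ under (3.1), the simulated fraction is close to one half when the true signal is zero and falls toward zero as the signal grows relative to $\tau$.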

We thus allow for an additive error on the normal scale, which accommodates probe pairs with MM > PM. In this and subsequent displays of distributions, for simplicity we omit to state that distributions are conditional on variables appearing on the right-hand side, and are independent over all values of indexing suffices.

For the true signal and nonspecific hybridization terms, $S_{gj}$ and $H_{gj}$, we wish to model them on the log scale, but also allow them to be zero. We therefore consider $\log(S_{gj} + 1)$ and $\log(H_{gj} + 1)$ as opposed to $\log(S_{gj})$ and $\log(H_{gj})$, respectively. Information about the expression of gene $g$ is contained in the true signals $S_{gj}$, $j = 1, \ldots, J_g$. We assume that $\log(S_{gj} + 1)$, $j = 1, \ldots, J_g$, are truncated normally distributed (denoted by TN, and obtained from a normal distribution by conditioning the variable to be nonnegative, since by assumption $S_{gj} \geq 0$) with gene-specific parameters $(\mu_g, \sigma_g^2)$. We use the medians of the truncated normal distributions, $\theta_g$ (that is, $\theta_g = \mu_g - \sigma_g \Phi^{-1}\left(\tfrac{1}{2} \Phi(\mu_g / \sigma_g)\right)$), $g = 1, \ldots, G$, as measures of gene expression, and refer to them as the Bayesian gene expression indices. For the nonspecific hybridization terms, we do not expect $H_{gj}$, $j = 1, \ldots, J_g$, to depend on the expression of gene $g$ in particular, but assume that $\log(H_{gj} + 1)$, $g = 1, \ldots, G$, $j = 1, \ldots, J_g$, come from a common, arraywide, truncated normal distribution of nonspecific hybridization. The modeling of the true signal and the nonspecific hybridization is thus specified by

$$\log(S_{gj} + 1) \sim \mathrm{TN}(\mu_g, \sigma_g^2), \qquad \log(H_{gj} + 1) \sim \mathrm{TN}(\lambda, \eta^2). \tag{3.2}$$
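The expression index is the median of a normal distribution conditioned to be nonnegative, which has a closed form in terms of the standard normal distribution function $\Phi$. The sketch below evaluates that median and checks it against the truncated-normal distribution function. It uses only Python's standard `statistics.NormalDist`; the function names are illustrative assumptions.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal, provides Phi and its inverse


def tn_median(mu, sigma):
    """Median of N(mu, sigma^2) conditioned to be nonnegative.

    theta = mu - sigma * Phi^{-1}(0.5 * Phi(mu / sigma)).
    For very negative mu / sigma, Phi underflows to 0 and inv_cdf
    raises an error; that regime is not handled in this sketch.
    """
    return mu - sigma * _STD.inv_cdf(0.5 * _STD.cdf(mu / sigma))


def tn_cdf(x, mu, sigma):
    """Distribution function of the same truncated normal, for x >= 0."""
    lower = _STD.cdf(-mu / sigma)  # mass removed by truncation at 0
    return (_STD.cdf((x - mu) / sigma) - lower) / (1.0 - lower)
```

When $\mu_g / \sigma_g$ is large the truncation is negligible and the median is essentially $\mu_g$; when $\mu_g$ is near zero the median lies above $\mu_g$, since only the nonnegative part of the distribution is retained.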

To account for outlying probes and stabilize the gene-specific variance parameters, we assume that the $\sigma_g^2$, $g = 1, \ldots, G$, parameters are exchangeable with distribution

$$\log(\sigma_g^2) \sim N(a, b^2). \tag{3.3}$$

We conclude the full specification of our model by assuming a uniform prior, $U(0, 15)$, on $\mu_g$, which comfortably covers the range of possible log intensities, a $B(1, 1)$ prior on $\phi$, a flat normal prior on $\lambda$ [$N(0, 1000)$], and flat gamma priors on $(\tau^2)^{-1}$ and $(\eta^2)^{-1}$ [$\Gamma(0.001, 0.001)$]. The parameters $a$ and $b^2$ in (3.3) are fixed at values obtained by an empirical procedure in the same spirit as Empirical Bayes approaches. We obtain estimates, $\tilde{S}_{gj}$, of the true signals $S_{gj}$, calculate the sample variance, $\tilde{\sigma}_g^2$, of the $\log(\tilde{S}_{gj} + 1)$ for each probe set $g$, and fix $a$ and $b^2$ at the empirical mean and variance, respectively, of the resulting set of $\log(\tilde{\sigma}_g^2)$ values ($g = 1, \ldots, G$). An obvious choice for $\tilde{S}_{gj}$ is obtained by assuming $\phi = 0$ in (3.1) and setting

$$\tilde{S}_{gj} = \max\{\mathrm{PM}_{gj} - \mathrm{MM}_{gj}, 0\}. \tag{3.4}$$
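The empirical procedure for fixing $a$ and $b^2$ can be sketched as follows. The first estimator implements (3.4); the second implements the weighted variant that the following paragraph goes on to prefer. All function names are hypothetical, and three choices are assumptions of this sketch rather than statements from the paper: the use of the sample (n − 1) variance throughout, the fallback to zeros when no probe pair has PM > MM, and the skipping of probe sets whose log-scale variance is zero.

```python
import math
from statistics import mean, variance


def naive_signal(pm, mm):
    """Estimate (3.4): max(PM - MM, 0) for each probe pair."""
    return [max(p - m, 0.0) for p, m in zip(pm, mm)]


def shrunken_signal(pm, mm):
    """Weighted variant: pairs with PM < MM get (k / J) times the
    smallest positive PM - MM difference in the probe set."""
    diffs = [p - m for p, m in zip(pm, mm)]
    positive = [d for d in diffs if d > 0]
    if not positive:  # assumption: no pair with PM > MM, use zeros
        return [0.0] * len(diffs)
    floor = len(positive) / len(diffs) * min(positive)
    return [d if d >= 0 else floor for d in diffs]


def empirical_hyperparameters(probe_sets, estimator=shrunken_signal):
    """Return (a, b2) from a list of (pm, mm) probe sets, one per gene."""
    log_vars = []
    for pm, mm in probe_sets:
        logs = [math.log(s + 1.0) for s in estimator(pm, mm)]
        v = variance(logs)  # sample variance within the probe set
        if v > 0:  # assumption: skip degenerate sets, log(0) is undefined
            log_vars.append(math.log(v))
    return mean(log_vars), variance(log_vars)
```

With (3.4), every pair with MM > PM contributes log(0 + 1) = 0, which inflates the within-set spread; the weighted variant replaces those zeros by a small positive value.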

However, using the minimum possible value, 0, as empirical estimates of the true signals for all probe pairs with MM > PM gives an extreme spread of $\log(\tilde{S}_{gj} + 1)$ within probe sets. We prefer to shrink the true signals within probe sets by using a weighted value of the minimum positive PM − MM difference over the probe set:

$$\tilde{S}_{gj} = \begin{cases} \mathrm{PM}_{gj} - \mathrm{MM}_{gj}, & \text{if } \mathrm{PM}_{gj} \geq \mathrm{MM}_{gj}, \\ \dfrac{k}{J_g} \min_{j = 1, \ldots, J_g} \{\mathrm{PM}_{gj} - \mathrm{MM}_{gj} : \mathrm{PM}_{gj} > \mathrm{MM}_{gj}\}, & \text{if } \mathrm{PM}_{gj} < \mathrm{MM}_{gj}, \end{cases}$$

with $k = |\{j : \mathrm{PM}_{gj} > \mathrm{MM}_{gj}\}|$. An examination of the sensitivity to the choice of priors, including the values of $a$ and $b^2$, and that of adding an extra level of hierarchy by treating $a$ and $b$ as random, is given in Section 3.5.

3.3 Performance of the single array model

Four genes at the same concentration. In Figure 1, the raw data relating to four different genes (spike-in genes 6–9) on array 8 of Data Set A are shown, along with summaries of posterior distributions of parameters relating to the expression of the genes. The spike-in concentration of each of these genes on this array is 5.0 pM (see Section 2). The following observations are made.

Raw data (Figure 1, upper panel). The probe response plots illustrate the high variability in probe intensities (for PMs and MMs) within a probe set as reported by Li and Wong (2001), as well as high variability in response between probe sets for different genes. Gene 7 is at one extreme, with both PM and MM values consistently low, gene 9 at the other extreme, with a majority of high PM values and low MM values (see Section 3.4 for further comments on gene 7 in this data set). Genes 6 and 8 fall in between the two extremes, with the probe response pattern of gene 8 being more consistent. These observations highlight that (1) gene expression can be more reliably determined for some genes than for others and (2) gene expression measurements for different genes may not be on the same scale. These characteristics reflect properties of the probes chosen to represent the genes (in particular, related to hybridization kinetics), a choice that is restricted by the DNA sequence representing the gene.

Posterior distributions (Figure 1, lower panel). The posterior distributions of the log-scale true signals $\log(S_{gj} + 1)$, $j = 1, \ldots, 20$, and $\mathrm{TN}(\hat{\mu}_g, \hat{\sigma}_g^2)$ (where $\hat{\mu}_g$ and $\hat{\sigma}_g^2$ are the posterior mean values of $\mu_g$ and $\sigma_g^2$, a convention we will use throughout this paper) consistently indicate negligible expression of gene 7, and high expression of gene 9. The posterior distributions of $\log(S_{gj} + 1)$, $j = 1, \ldots, 20$, have less overlap, and cover a wider range for gene 6 than gene 8, as should be expected from the plots in the upper panel. The truncated normal distributions, specified in (3.2), capture the overall variability of the log-scale true signals well. A slightly higher mode is obtained for gene 8 than gene 6, but more noticeable is the wider spread of the distribution for gene 6 than gene 8. With respect to outliers, gene 8 has two probes for which MM > PM and gene 6 has four. For gene 8, the two probes have little impact on the truncated normal distributions: the posterior distributions of the signals being drawn toward the right by the remaining probes. For gene 6, a substantially more dispersed truncated normal distribution is obtained. In summary, the characteristics of the probe set responses for the four genes are reflected well in the posterior distributions of the parameters relating to the genes.

Fig. 1. Plots illustrating probe-level data and posterior distributions of parameters, for four different genes, each spiked in at concentration 5.0 pM. Upper panel: Each plot shows $\mathrm{PM}_{gj}$, $j = 1, \ldots, 20$, broken lines; $\mathrm{MM}_{gj}$, $j = 1, \ldots, 20$, dotted lines; and $\mathrm{PM}_{gj} - \mathrm{MM}_{gj}$, $j = 1, \ldots, 20$, full lines, for each of four genes. Lower panel: Each plot shows posterior distributions obtained for the parameters corresponding to one of the four genes described in the upper panel. Crosses and horizontal lines: posterior means and 2.5–97.5% credibility intervals (on x-axis) for $\log(S_{gj} + 1)$, $j = 1, \ldots, 20$, with each cross (line) representing a probe. Circles: $\log(\mathrm{PM}_{gj} - \mathrm{MM}_{gj})$ where defined, zero else (x-axis), $j = 1, \ldots, 20$. Curves: truncated normal distribution, $\mathrm{TN}(\hat{\mu}_g, \hat{\sigma}_g^2)$, with $\hat{\mu}_g$ and $\hat{\sigma}_g^2$ equal to the posterior mean values of $\mu_g$ and $\sigma_g^2$. The bold lines: posterior 2.5–97.5% credibility intervals for the BGX gene expression indices $\theta_g$. The y-axis has meaning for the TN densities only. The plots demonstrate (a) high variability among PM and MM intensities within and between probe sets, for genes spiked-in at the same concentration and (b) that the posterior distributions of parameters and quantities related to the true signals reflect this variability.

A single gene at 10 concentrations. Figures 2 and 3 summarize the results obtained for spike-in gene 1 on 10 of the arrays of Data Set A. In the figures, each graph shows results for one array. Note that the gene has been spiked in at different concentrations on the arrays (see Section 2).

Gene expression level (Figure 2). In the figure, summaries of posterior distributions of parameters related to the expression of spike-in gene 1 on the 10 arrays are shown. The credibility intervals of the log-scale true signals, $\log(S_{gj} + 1)$, $j = 1, \ldots, 20$, and gene expression index, $\theta_g$, are shifted to the right as the spike-in concentration is increased. The increase in spike-in concentration is reflected in the levels of expression detected.

Nonspecific hybridization (Figure 3). The figure shows 2.5–97.5% credibility intervals of the nonspecific hybridization terms, $\log(H_{gj} + 1)$, $j = 1, \ldots, 20$, along with $\mathrm{TN}(\hat{\mu}_g, \hat{\sigma}_g^2)$, for ease of comparison. As would be expected, the nonspecific hybridization terms do not reflect the concentration of the spike-in genes.
There is a small increase for some nonspecific hybridization terms as the concentration increases. This is likely to be a consequence of saturation of the PM probes, causing MM intensities to ‘catch up’ with PM intensities. In summary, the detailed analysis given above shows that the gene expression indices reflect the spike-in concentrations, that the nonspecific hybridization terms do not correlate with spike-in concentrations, and that the model accommodates outliers.

3.4 Comparison with other methods

Since Affymetrix launched the MAS4.0 Micro Array Suite package for estimating gene expression levels from GeneChips, a number of alternative methods have been developed. These address various shortcomings of MAS4.0, in particular sensitivity to outliers and the generation of negative expression indices (e.g. Hubbell et al., 2002). Two of the most popular alternative methods are MAS5.0 (Hubbell et al., 2002; Affymetrix, 2001) and robust multi-array average (RMA; Irizarry et al., 2003a). In common with MAS4.0, both produce a point estimate of expression for each gene on each array. AffyComp (Cope et al., 2004) is a graphical tool for evaluating, comparing, and displaying the performance of expression level estimators from Affymetrix GeneChips on Benchmark data sets. The MAS5.0 and RMA methods are generally found to be superior to MAS4.0 (Irizarry et al., 2003b). BGX is fundamentally different from other methods in providing a posterior distribution of gene expression, rather than a single point estimate. Comparisons with existing methods are thus not straightforward: virtues do not necessarily transfer from point estimates to posterior distributions. We nevertheless find it informative and useful to compare a summary version of the BGX measure, pBGX, the posterior mean BGX value, to the most popular existing point estimators. Below we present such a comparison of pBGX with MAS5 and RMA using elements from AffyComp.


Fig. 2. Illustration of posterior distributions of parameters relating to the true signal of spike-in gene 1, obtained in the analysis of 10 of the arrays of Data Set A. The posterior distributions, summarized by the 2.5–97.5% credibility intervals, of the parameters $\log(S_{gj} + 1)$, $j = 1, \ldots, J_g$, and $\theta_g$, and the distribution $\mathrm{TN}(\hat{\mu}_g, \hat{\sigma}_g^2)$ move to the right as the spike-in concentration increases. The notation is as for Figure 1.


Fig. 3. Illustration of posterior distributions of parameters relating to the nonspecific hybridization to the probes of spike-in gene 1, obtained in the analysis of 10 of the arrays in Data Set A. Posterior 2.5–97.5% credibility intervals of $\log(H_{gj} + 1)$, $j = 1, \ldots, 20$, are shown as horizontal lines. Overlaid in each plot is the $\mathrm{TN}(\hat{\mu}_g, \hat{\sigma}_g^2)$ distribution of the log-scale true signals. The nonspecific hybridization terms do not reflect the change in spike-in concentration.


Overall comparison of gene expression measures and fold changes. The methods give measures of gene expression on different scales, and must be transformed in order to be compared. We consider here transformations to the natural logarithmic scale (the scale of BGX). Whereas the methods behave similarly at the higher end, there are nonnegligible differences at the lower end. Density plots show that the sets of expression indices obtained with pBGX and RMA are left skewed, whereas those obtained with MAS5 are closer to being symmetric (see supplementary material, http://www.biostatistics.oupjournals.org). Figure 4 refers to pairwise comparisons involving experiment 1 in Data Set C. It displays for each method an MA plot in which the differences in expression levels are plotted against the means for each gene. The numbers give the fold changes in spike-in concentrations of the spike-in genes. These are generally higher and more accurate for pBGX and MAS5 than for RMA. The gray circles show points for non-spiked-in genes and should fall on the x-axis. pBGX is more precise than MAS5.0 but less precise than RMA. Gene expression indices versus concentrations. For each of the three methods, Figure 5 shows mean gene expression indices over three replicate arrays obtained for the 42 spike-in genes in the 14 experiments of Data Set C, plotted against log(spike-in concentration). Mean expression measures at concentration 0 are not included as a log scale is used. Ideally the points should fall on a line with slope 1, so that fold differences in concentration give rise to the same fold differences in gene expression. This is not the case for any of the measures: all show some curvature. Increases in log-concentrations above 0 are generally picked up by all methods. The slope in this range is highest, and close to 1, for pBGX and lowest for RMA. In the range below 0, pBGX and RMA estimate very weak increases, if any. The performance of MAS5 in this range is more variable. 
In Figure 6, the 2.5–97.5% credibility intervals obtained with BGX are plotted against log(spike-in concentration) for the 11 spike-in genes on 13 of the 14 arrays of Data Set A (results for array 1, with spike-in concentration 0 pM, are not included). For low expression levels, the credibility intervals are generally wider, reflecting the relatively larger influence of noise and nonspecific hybridization. One gene

Fig. 4. Rotated scatter plots (MA plots) of the (e¯1,g , e¯2,g ) values obtained by plotting the points (0.5(e¯1,g + e¯c,g ), e¯1,g − e¯c,g ), g = 1, . . . , G, c = 2, . . . , 14, obtained with each method in analyses of the arrays in Data Set C. The range of A values for pBGX and RMA depicted is 1–9 and for MAS5 0–8. e¯c,g is the mean gene expression level for gene g over the three replicate arrays in experiment c. Values for spike-in genes are given by numbers (indicating the fold change), those for non-spiked-in genes by gray circles. For pBGX the sets of intensities on the arrays were prescaled so that all arrays had a trimmed mean intensity equal to that of the median trimmed mean intensity among arrays. Each array was then analyzed separately and the mean of the three replicate posterior gene expression indices obtained. The plot shows that whereas pBGX and MAS5 are more accurate than RMA, RMA is more precise than pBGX, which is more precise than MAS5.

360

A.-M. K. H EIN ET AL .

Fig. 5. Mean gene expression measures obtained over three replicate arrays using pBGX, MAS5, and RMA for the 42 spike-in genes in the 14 experiments in Data Set C (excluding expression measures at concentration 0) as a function of log(concentration). Each curve connects expression measures for a gene. Increases above log-concentrations of 0 pM are reflected in the gene expression measures for all methods.

Fig. 6. Credibility intervals (2.5–97.5%) for spike-in gene expression measures obtained using BGX on the arrays of Data Set A (excluding array 1, for which the spike-in concentration is 0 pM) as a function of log(concentration). To allow credibility intervals for all genes to be depicted in one plot, the credibility intervals for the expression measures of each gene have been shifted by a small amount on the x-axis. Credibility intervals are generally wider for low than for high concentrations. The expression of gene 7 is estimated as low on all arrays (broken line). This is confirmed by other methods (results not shown) and probe responses consistently indicate a spike-in failure for this gene.

(gene 7) is not responding, and a closer examination of the data for this gene suggests that the gene may not have been correctly spiked in. This is supported by results using other methods (not shown). Reproducibility: SDs versus means of replicate point estimates. Figure 7 refers to analyses of the three replicate Experiment 1 arrays of Data Set C. It shows for each method the standard deviation plotted

BGX: integrated Bayesian gene expression analysis

361

Fig. 7. SD versus mean over three replicate point estimate expression measures for each gene obtained for each of three methods. Numbers refer to row numbers of genes when reading CEL-files in R. For all methods genes with high mean levels of expression have low SD. At lower levels RMA has lower SD than BGX, which has lower SD than MAS5. SDs obtained with different methods are not correlated (see Supplementary Material).

against the mean (calculated over the three replicate point estimates) for each gene. For genes with the highest expression levels, the three methods give SDs of similar magnitude, but at lower levels, RMA has considerably smaller SD than pBGX, which in turn has smaller SD than MAS5. For pBGX and MAS5 the magnitude of the SD varies with the mean: least reproducible estimates are obtained for the lowest levels of expression with MAS5, and for low to medium levels for pBGX. Importantly, whereas the mean expression levels obtained with the three methods are correlated, the SDs are not (see Supplementary Material). This is expected as the methods interpret data differently: RMA uses PM intensities only to estimate gene expression levels, pBGX uses PMs and a fraction of MMs for this task, while interpreting a probe pair with PM < MM as evidence for no signal, and MAS5 uses PM − MM, disregarding extreme values. Focussing on pBGX, the genes with high SDs are typically genes which have conflicting probe responses, that is, probe sets with a substantial minority of probe pairs with PM ≫ MM > 0 and the remaining PM ≈ MM. For such genes the posterior distribution of BGX is often bimodal, resulting in a high SD on the pBGX values obtained for replicate arrays. Note that with replicate arrays, point estimate methods typically estimate a condition-specific gene expression level by averaging across replicates. Using BGX we would not do this, but instead consider the arrays simultaneously using a multiple array model (see Section 4). Also note that we do not generally expect reproducibility of posterior distributions of gene expression levels for the same gene on two replicate arrays: if one array is of worse quality, e.g. subject to higher levels of nonspecific hybridization, we would expect the true signals on that array to be relatively less strong. Appropriately, the posterior BGX distribution for the gene on that array will be more dispersed. 3.5

Sensitivity to the priors

We examined the sensitivity of the results obtained under the model in Section 3.2 to the structural and quantitative specifications of the priors by analyzing the data from array 1 of Data Set B, using a number of modified versions of the model. Sensitivity was assessed by comparing posterior mean parameter values obtained under the different models. The results for the φ, τ 2 , λ, and η2 parameters are summarized in Table 1, and for the sets of log-scale true signals, log-scale nonspecific hybridization terms, gene expression indices, and σg2 parameters in Table 2. Detailed graphical displays of the results are available as Supplementary Material. Quantitative prior specifications. For the quantitative specifications of the priors, we anticipate that the values of the a and b2 parameters and the prior assumption for the µg (g = 1, . . . , G) are most likely to be influential, and focus the sensitivity analysis on assumptions for these.


Table 1. Sensitivity to the priors. Posterior mean (standard deviation) of four parameters obtained under five different models (first column) in an analysis of array 1 of Data Set B. The models are as follows. Base: model of Section 3.2 with $a$ and $b^2$ fixed by the empirical procedure detailed in Section 3.2. $(a, b^2) = (0, 0.5)$ and $(a, b^2) = (0.8, 0.3)$: alternative choices of values for $a$ and $b^2$. $\mu_g \sim TN(0, 16)$: as Base but with $\mu_g \sim TN(0, 16)$ in place of $\mu_g \sim U(0, 15)$. Hierarchical: as Base but with hyperprior distributions on $a$ and $b^2$ [see (3.3′) in Section 3.5]. The mean posterior values of the parameters under the different priors are very similar.

Model                 φ                     τ²          λ                     η²
Base                  0.177 (2.68 × 10⁻³)   1930 (21)   4.548 (5.94 × 10⁻³)   0.325 (4.79 × 10⁻³)
(a, b²) = (0, 0.5)    0.177 (2.74 × 10⁻³)   1940 (21)   4.543 (6.19 × 10⁻³)   0.329 (5.02 × 10⁻³)
(a, b²) = (0.8, 0.3)  0.177 (2.69 × 10⁻³)   1919 (20)   4.554 (5.71 × 10⁻³)   0.320 (4.72 × 10⁻³)
µg ∼ TN(0, 16)        0.176 (2.64 × 10⁻³)   1930 (20)   4.554 (5.92 × 10⁻³)   0.323 (4.74 × 10⁻³)
Hierarchical          0.177 (2.82 × 10⁻³)   1941 (20)   4.543 (6.68 × 10⁻³)   0.329 (4.85 × 10⁻³)

For array 1 in Data Set B, using the model and estimation procedure of Section 3.2 we obtain (a, b2 ) = (0.341, 0.335). To explore the sensitivity of the analysis to the particular values of (a, b2 ), we repeated the analysis assuming two different pairs of values: (0.0, 0.5) and (0.8, 0.3). The values of the first pair are approximately equal to the posterior mean values obtained assuming a hierarchical prior with weakly informative hyperpriors on a and b2 for the σg2 , the hierarchical prior model [see (3.3′ ) below]. The second pair, (0.8, 0.3), is obtained by using (3.4) in Section 3.2. The two pairs of values chosen to illustrate the sensitivity thus represent extremes, within reason, with respect to borrowing information between genes and shrinking the distributions of log-scale true signals within probe sets. As expected, the posterior mean values of the σg2 parameters are lower under the model with (a, b2 ) = (0.0, 0.5) and higher under the model with (a, b2 ) = (0.8, 0.3), relative to those obtained using the recommended values (0.341, 0.335). However, this variability has little impact on the remaining parameters: posterior mean values of the φ, τ 2 , λ, and η2 parameters obtained under the three models are almost identical, as are the log-scale nonspecific hybridization term parameters. For the parameters of primary interest {θg : g = 1, . . . , G}, the ranges are highly similar, there is a strong correlation between values obtained under the models, and the root mean squared errors are small. The findings for the log-scale true signals reiterate those for the gene expression indices (Tables 1 and 2). To assess sensitivity to the choice of prior on µg we compared results obtained under the model of Section 3.2 to those obtained when µg ∼ TN(0, 16). Changing the prior on the µg has no noticeable impact—posterior mean parameter values obtained under the models are very similar (Tables 1 and 2). 
From this comparison we conclude that the particular choice of prior distributions for $a$ and $b^2$ has only a weak effect on the posterior distributions of the parameters.

Hierarchical prior for the variances. We explored the effect of replacing the assumption in Section 3.2 of an exchangeable prior (3.3) with fixed values of $a$ and $b^2$ for the $\sigma_g^2$ parameters by the assumption of a hierarchical prior for $\sigma_g^2$, $g = 1, \ldots, G$, with hyperprior distributions on the $a$ and $b^2$ parameters, the hierarchical prior model,

$$\log(\sigma_g^2) \sim N(a, b^2), \qquad a \sim N(a_0, \sigma_0^2), \qquad b^{-2} \sim \Gamma(\alpha_0, \beta_0). \tag{3.3'}$$


Table 2. Sensitivity to the priors. Range: 2.5–97.5% range for the set of posterior mean parameter values. $R^2$: Pearson's correlation coefficient of posterior mean values obtained under the model and those obtained under the base model. RMSE: root mean squared error, $\sqrt{\sum_{k=1}^{K} (y_k^{\mathrm{base}} - y_k^{\mathrm{model}})^2 / K}$, between values obtained under base model ($y_k^{\mathrm{base}}$) and alternative model ($y_k^{\mathrm{model}}$) for the full set of values, and restricted to values lying in the lower and upper quarter of the base range, RMSE_LQ and RMSE_UQ. For model descriptions see Table 1.

Parameter       Model         Range       R²      RMSE    RMSE_LQ  RMSE_UQ
log(Sgj + 1)    Base          1.29–6.39   -       -       -        -
                (0, 0.5)      1.23–6.39   0.996   0.137   0.148    0.002
                (0.8, 0.3)    1.35–6.39   0.995   0.152   0.144    0.002
                TN(0, 16)     1.27–6.39   0.998   0.104   0.114    0.002
                Hierarchical  1.32–6.39   0.997   0.120   0.127    0.002
log(Hgj + 1)    Base          4.13–5.93   -       -       -        -
                (0, 0.5)      4.12–5.93   1.000   0.014   0.014    0.001
                (0.8, 0.3)    4.15–5.93   1.000   0.016   0.016    0.001
                TN(0, 16)     4.14–5.93   1.000   0.011   0.011    0.001
                Hierarchical  4.12–5.94   1.000   0.014   0.014    0.001
θg              Base          1.28–5.69   -       -       -        -
                (0, 0.5)      1.22–5.70   0.996   0.128   0.140    0.006
                (0.8, 0.3)    1.35–5.66   0.996   0.143   0.113    0.009
                TN(0, 16)     1.25–5.65   0.998   0.100   0.106    0.025
                Hierarchical  1.30–5.71   0.997   0.110   0.116    0.005
σg²             Base          0.90–2.74   -       -       -        -
                (0, 0.5)      0.69–2.75   0.985   0.307   0.331    0.395
                (0.8, 0.3)    1.21–3.53   0.977   0.646   0.602    0.770
                TN(0, 16)     0.90–2.81   0.994   0.068   0.051    0.181
                Hierarchical  0.69–2.71   0.985   0.311   0.335    0.494
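The summary measures reported in Table 2 can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and the reading of "lower and upper quarter of the base range" as the outer quarters of the 2.5–97.5% interval of the base values is an assumption, not a definition taken from the text.

```python
import numpy as np

def sensitivity_summary(y_base, y_model):
    """Compare posterior means under an alternative model with the base model."""
    y_base = np.asarray(y_base, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    lo, hi = np.percentile(y_base, [2.5, 97.5])   # 2.5-97.5% range of base values

    def rmse(mask):
        # root mean squared difference over the selected values
        return float(np.sqrt(np.mean((y_base[mask] - y_model[mask]) ** 2)))

    width = hi - lo
    lower = y_base <= lo + 0.25 * width           # assumed "lower quarter"
    upper = y_base >= hi - 0.25 * width           # assumed "upper quarter"
    return {
        "range": (float(lo), float(hi)),
        "r": float(np.corrcoef(y_base, y_model)[0, 1]),   # Pearson correlation
        "rmse": rmse(np.ones(y_base.shape, dtype=bool)),
        "rmse_lq": rmse(lower),
        "rmse_uq": rmse(upper),
    }

# Hypothetical example: the alternative model shifts every value by 0.1.
y_base = np.linspace(0.0, 10.0, 101)
summary = sensitivity_summary(y_base, y_base + 0.1)
```

In this toy example the correlation is 1 and all three RMSE values equal the constant shift, which illustrates why correlation alone would not reveal a systematic offset between models.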

For model (3.3′), we wish to choose hyperpriors on $a$ and $b^2$ that give support to any realistic values of $\log(\sigma_g^2)$. A consideration of hypothetical extreme values of log-transformed variances of log-scale true signals (which are nonnegative and bounded from above by the maximum scanner intensity) suggests that the range of possible values of the $\log(\sigma_g^2)$ can be amply covered by taking $a \sim N(0, 16)$. To ensure flexibility in the dispersion of the values, we choose an Inverse Gamma distribution for $b^2$, that is, $b^{-2} \sim \Gamma(1.2, 0.3)$. Slow mixing of the MCMC sampler was detected for the hierarchical prior model, resulting in problems with achieving convergence. As a result, very long runs (∼1 000 000 sweeps) were necessary for reliable inference to be made. Posterior mean values of $(a, b^2)$ under the hierarchical prior model are (−0.009, 0.513). The results for the remaining parameters are almost identical to those obtained for the model with fixed $(a, b^2) = (0, 0.5)$, and lead to the same conclusions. Although the hierarchical prior model is intellectually appealing, as it allows in principle a full propagation of uncertainties in the values of $a$ and $b^2$, in light of the weak influence of the prior specification shown, and the long computing times for the hierarchical prior model, we favor the model with fixed $a$ and $b^2$.

4. MULTIPLE ARRAY ANALYSIS

The single array model is readily extended to situations where typically multiple conditions are considered, and where replicate arrays may be available under some or all of the conditions. The flexibility of the Bayesian hierarchical framework allows a structural choice to be made on how information is combined between different probes, genes, replicates, and conditions. Below we present a multiple array model, in which different ‘true signal’ terms for each gene, probe, and array are assumed at the first level of the model. At the second level, the full set of true signals on replicate arrays is combined under each condition (over probes and replicate arrays) to estimate condition-specific gene expression indices. Note that alternative structural choices can be made, by combining information at different levels (see Section 5).

4.1 Multiple array model formulation

We let $c = 1, \ldots, C$ refer to the conditions, and $r = 1, \ldots, R_c$ refer to the replicates under condition $c$. At the first level of the model, we allow for different additive errors on the arrays. We assume gene-, probe-, condition-, and replicate-specific true signals and nonspecific hybridization terms, and generalize (3.1) to

$$\mathrm{PM}_{gjcr} \sim N(S_{gjcr} + H_{gjcr},\ \tau_{cr}^2), \qquad \mathrm{MM}_{gjcr} \sim N(\phi S_{gjcr} + H_{gjcr},\ \tau_{cr}^2). \tag{4.1}$$

Information about the expression of gene $g$ under condition $c$ is contained in the true signals derived from the probe sets representing this gene on the available replicate arrays under this condition: $S_{gjcr}$, $j = 1, \ldots, J_g$, $r = 1, \ldots, R_c$. We thus base the estimation of the gene expression level of gene $g$ under condition $c$ on all these true signals, by assuming that $\log(S_{gjcr} + 1)$, for all $j$ and $r$, are from a gene- and condition-specific truncated normal distribution, and we use the median of this distribution, $\theta_{gc}$, as the gene expression index, BGX. To allow for differences in the processing of the biological sample (e.g. in the labeling, fragmentation, or hybridization steps), in the arrays or the scanning process, we assume array-specific (that is, sample-specific) distributions of the nonspecific hybridization terms. Thus, the second level of the model is

$$\log(S_{gjcr} + 1) \sim TN(\mu_{gc},\ \sigma_{gc}^2), \qquad \log(H_{gjcr} + 1) \sim TN(\lambda_{cr},\ \eta_{cr}^2). \tag{4.2}$$

As in the single array model we assume that for each condition $c$ the $\sigma_{gc}^2$, $g = 1, \ldots, G$, parameters are exchangeable, with condition-specific distributions

$$\log(\sigma_{gc}^2) \sim N(a_c,\ b_c^2). \tag{4.3}$$
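Read generatively, levels (4.1)–(4.3) describe how PM and MM intensities arise for one condition. The sketch below simulates from them under loudly stated assumptions: TN is taken to be a normal distribution truncated below at 0, all genes have the same number of probes, and every numerical value (including φ, the $\mu_{gc}$ and the array-specific parameters) is hypothetical and chosen only for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

def truncated_normal(mean, sd, shape, rng):
    """Draw N(mean, sd^2) variates truncated below at 0 (assumed) by rejection."""
    mean = np.broadcast_to(np.asarray(mean, dtype=float), shape)
    sd = np.broadcast_to(np.asarray(sd, dtype=float), shape)
    out = rng.normal(mean, sd)
    bad = out < 0
    while bad.any():                       # redraw only the negative values
        out[bad] = rng.normal(mean[bad], sd[bad])
        bad = out < 0
    return out

G, J, R = 5, 11, 3                         # genes, probes per gene, replicate arrays
phi = 0.2                                  # hypothetical fraction of signal seen by MM
a_c, b_c = 0.3, 0.6                        # hypothetical mean and sd in (4.3)
mu = rng.uniform(2.0, 8.0, size=G)         # hypothetical mu_gc values
sigma2 = np.exp(rng.normal(a_c, b_c, size=G))   # (4.3): log(sigma^2_gc) ~ N(a_c, b_c^2)
lam = np.array([4.4, 4.6, 4.5])            # hypothetical array-specific lambda_cr
eta = np.array([0.55, 0.60, 0.50])         # hypothetical array-specific eta_cr
tau = np.array([40.0, 45.0, 42.0])         # hypothetical array-specific tau_cr

shape = (G, J, R)
# (4.2): log-scale true signals and nonspecific hybridization terms
log_S = truncated_normal(mu[:, None, None], np.sqrt(sigma2)[:, None, None], shape, rng)
log_H = truncated_normal(lam, eta, shape, rng)
S, H = np.exp(log_S) - 1.0, np.exp(log_H) - 1.0

# (4.1): observed perfect match and mismatch intensities
PM = rng.normal(S + H, tau)
MM = rng.normal(phi * S + H, tau)

def tn_median(m, s):
    """Median of N(m, s^2) truncated below at 0."""
    std = NormalDist()
    p0 = std.cdf(-m / s)                   # mass removed by the truncation
    return m + s * std.inv_cdf(0.5 * (1.0 + p0))

# Gene expression index theta_gc: median of the truncated normal in (4.2)
theta = np.array([tn_median(m, s) for m, s in zip(mu, np.sqrt(sigma2))])
```

Because the truncation removes mass below 0, the median $\theta_{gc}$ is never smaller than $\mu_{gc}$; the two coincide when $\mu_{gc}$ is large relative to $\sigma_{gc}$.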

The last level of the model is identical to the single array model formulation, with the following prior and hyperpriors assumed (independently for each $g$, $c$, and $r$): $(\tau_{cr}^2)^{-1} \sim \Gamma(0.001, 0.001)$, $\phi \sim B(1, 1)$, $\mu_{gc} \sim U(0, 15)$, $\lambda_{cr} \sim N(0, 1000)$, and $(\eta_{cr}^2)^{-1} \sim \Gamma(0.001, 0.001)$. The $a_c$ and $b_c^2$ parameters in (4.3) are fixed at values obtained by the procedure described in Section 3.2, here obtaining empirical signals, $\tilde{S}_{gjcr}$, for each probe set [that is, for each triplet $(g, c, r)$], calculating the sample variance $\tilde{\sigma}_{gc}^2$ over each set of $\log(\tilde{S}_{gjcr})$, $r = 1, \ldots, R_c$, $j = 1, \ldots, J_g$, and fixing $a_c$ as the sample mean and $b_c^2$ as the sample variance of the values of $\log(\tilde{\sigma}_{gc}^2)$, $g = 1, \ldots, G$.
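The last step of this procedure, going from empirical signals to fixed values of $a_c$ and $b_c^2$, can be sketched as below. The empirical signals themselves come from the procedure of Section 3.2 and are simply assumed to be given here; the function name is hypothetical, and the sketch assumes the same number of probes for every gene and strictly positive empirical signals.

```python
import numpy as np

def empirical_a_b2(S_tilde):
    """S_tilde: array of shape (G, J, R) with positive empirical signals for
    one condition (genes x probes x replicate arrays). Returns (a_c, b_c^2)."""
    log_S = np.log(np.asarray(S_tilde, dtype=float))
    G = log_S.shape[0]
    # sample variance over all probes and replicates of each gene
    var_g = log_S.reshape(G, -1).var(axis=1, ddof=1)
    log_var = np.log(var_g)
    a_c = float(log_var.mean())            # sample mean of log variances
    b2_c = float(log_var.var(ddof=1))      # sample variance of log variances
    return a_c, b2_c

# Hypothetical example with two genes, two probes, one array:
# log-signals (0, 2) and (0, 4) have sample variances 2 and 8.
S_tilde = np.exp(np.array([[[0.0], [2.0]], [[0.0], [4.0]]]))
a_c, b2_c = empirical_a_b2(S_tilde)
```

In the real data the number of probes $J_g$ varies between genes, so a per-gene loop or a masked array would be needed in place of the reshape used here.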


Fig. 8. Summary of data and posterior distributions related to the expression of spike-in genes 3 and 4, in Data Set B, obtained using the single array model (upper panel) and the multiple array model (lower panel). Shown are 2.5–97.5% posterior credibility intervals for the BGX gene expression indices (horizontal lines), truncated normal distributions for the sets of log-scale true signals (using mean posterior values µˆ g and σˆ g2 ) (curves), and log(PM−MM) values (circles), for spike-in genes 3 and 4 under each of the two conditions. Under the single array model there are three truncated normal distributions of gene expression indices (one for each array). Under the multiple array model there is one truncated distribution and one gene expression index per gene per condition only. The plot illustrates the effect of combining information available in replicate probe sets in the multiple array model.

4.2 Performance of the multiple array model

We illustrate the performance of the multiple array model by presenting results of analyses of Data Sets B and C.

Single versus multiple array analysis. The effect of considering replicate probe sets simultaneously when inferring gene expression indices under the multiple array model, relative to the single array analyses, is illustrated in Figure 8, by presenting comparative results for spike-in genes 3 and 4. Results using separate single array analyses of the six arrays are given in the upper panel, and those obtained by the multiple array model (which pools information over replicate arrays) are given in the lower panel. For gene 3 under condition 1 (first column), the three probe sets on the replicate arrays give rise to highly similar distributions of the log-scale true signals and gene expression indices. This is manifested in the multiple array model by the tighter truncated normal distribution for the log-scale true signals and the considerably shortened credibility interval for the (relatively low) level of expression for this gene. For gene 3 under condition 2 and gene 4 under condition 1 (second and third column) there is less reduction in the size of the credibility interval, due to the less consistent information contained in the probe sets on the replicate arrays. For gene 4 under condition 2 (last column), the data and posterior distributions of the log-scale true signals on the three replicate arrays differ substantially. As a consequence, the truncated normal distribution summarizing the distribution of the log-scale true signals under this condition is more dispersed, and, keeping the log scale in mind, the credibility interval for the (high) expression index of the gene is relatively wide. From this detailed analysis, it is apparent that when a measure of gene expression is extracted from multiple arrays and probes, inference is strengthened when the replicate arrays are consistent.

Fig. 9. Upper panels: Density plots (one for each array) for the sets of log(PM)s (full lines, left) and log(MM)s (broken lines, left; note: the curves have been shifted by −1.5 on the x-axis) and log(PM − MM) (right), for the latter only positive PM − MM considered, for Data Set B. There are noticeable differences between the sets of log(PM)s and log(MM)s on the arrays, and differences are smaller, but persist, for the sets of log(PM − MM) values. Lower panels: For each array ($c = 1, 2$, $r = 1, 2, 3$) a density plot is shown for the sets of posterior mean values of $\log(H_{gjcr} + 1)$, $g = 1, \ldots, G$, $j = 1, \ldots, J_g$ (left), and $\log(S_{gjcr} + 1)$, $g = 1, \ldots, G$, $j = 1, \ldots, J_g$ (right). There are considerable differences between the distributions of the log-scale nonspecific hybridization terms for different arrays. The density plots for the log-scale true signals on different arrays coincide.

4.3 Comparison with other methods

Normalization. Considerable differences are often observed in the raw data for replicate hybridizations with GeneChips. This is clearly seen in Data Set B (Figure 9, upper panel). Differences are generally more pronounced for the density plots for sets of log(PM)s and log(MM)s than for those of the log(PM − MM) values. Apart from the spike-in genes, aliquots of the same material have been hybridized to the arrays, so the differences represent variability of nonbiological origin, introduced at some stage in the experiment. In spite of differences in raw data between arrays, similar gene expression measures should be obtained for the majority of the genes, and only for spike-in genes should the expression levels differ. Scatter plots of gene expression measures under conditions 1 and 2 in Data Set B obtained with each of the methods (Figure 10) indicate that this is the case for all methods: the points for the spike-in genes fall outside the clouds of points for non-spiked-in genes, which are scattered around the x-axis. The similarity between expression measures obtained for the non-spiked-in genes under the two conditions is expected for the MAS5 and RMA methods, given the explicit normalization procedures these use. For BGX, the log-scale true signals on different arrays are brought into line by assuming gene- and condition-, rather than gene- and array-, specific distributions of log-scale true signals, and by allowing for array-specific distributions of the nonspecific hybridization terms. This is illustrated in the lower panel of Figure 9: the differences between the sets of log(PM) and log(MM) values on different arrays are reflected in the nonspecific hybridization rather than the true signals, with substantially larger differences between the density plots for the sets of $\log(H_{gj} + 1)$ for different arrays than for those of $\log(S_{gj} + 1)$. Similar observations are made for the other data sets (see supplementary material).

Fig. 10. Scatter plots (MA versions) of gene expression measures obtained for each condition in Data Set B with different methods. For pBGX the mean of the posterior condition-specific indices was used. MAS5 and RMA use the mean gene expression measures over the three replicate arrays for each condition. For all methods, similar measures of expression are obtained for the non-spiked-in genes under the two conditions.

Ranking the genes with respect to differential expression. For each method, we obtained condition-specific gene expression indices for each experiment in Data Set C. For MAS5 and RMA, the mean of the three replicate gene expression measures under each condition was used. For pBGX, we used the posterior mean values of the experiment-specific gene expression indices obtained in separate analyses of the 14 experiments. To avoid influence of systematic differences in overall intensities between conditions, analyses were performed on prescaled arrays (see caption of Figure 4). The MA plot of the point estimates thus obtained is highly similar to that shown in Figure 4. Absolute differences of gene expression indices were ranked for each method and each pairwise comparison. Table 3 gives the number of spike-in genes correctly ranked among the top 42 most differentially expressed genes in each pairwise comparison among the 14 experiments for each of the methods. All methods perform best when experiments with large differences in spike-in concentrations of the spike-in genes are compared (entries in the interiors of the triangles). RMA generally performs best, followed by pBGX, and MAS5, with the most noticeable difference in performance being for small fold changes.
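The count reported in each cell of Table 3 can be sketched as follows for one pairwise comparison. This is a minimal illustration with hypothetical names and toy values; it assumes the condition-specific point estimates for the two experiments are given as arrays on the log scale and that the indices of the spike-in genes are known.

```python
import numpy as np

def spikeins_in_top(e1, e2, spike_idx, k=42):
    """Number of spike-in genes among the k genes with the largest
    absolute difference in expression index between two experiments."""
    diff = np.abs(np.asarray(e1, dtype=float) - np.asarray(e2, dtype=float))
    top = np.argsort(-diff, kind="stable")[:k]     # indices of the k largest differences
    return int(np.isin(top, spike_idx).sum())

# Hypothetical toy example: five genes, genes 1 and 3 are spike-ins.
e1 = [1.0, 5.0, 2.0, 9.0, 3.0]
e2 = [1.0, 1.0, 2.0, 2.0, 3.5]
n_top2 = spikeins_in_top(e1, e2, spike_idx=[1, 3], k=2)
n_top1 = spikeins_in_top(e1, e2, spike_idx=[1, 3], k=1)
```

With 42 spike-in genes and k = 42, a perfect method would score 42 in every comparison in which all spike-in concentrations differ.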

Table 3. Number of spike-in genes correctly ranked among the top 42 genes with respect to degree of differential expression for each method and each pairwise comparison of experiments in Data Set C. For MAS5, MBEI, and RMA mean gene expression indices over the three replicates were used, for pBGX the posterior mean values of the gene expression indices obtained in 14 separate analyses of each experiment, that is, of each set of three replicate arrays. All methods do better for comparisons with large differences in spike-in concentrations. RMA generally does best, followed by pBGX and MAS5. The biggest differences are seen for small differences in spike-in concentrations.

pBGX
Exp.   1   2   3   4   5   6   7   8   9  10  11  12  13
2     26   -   -   -   -   -   -   -   -   -   -   -   -
3     33  22   -   -   -   -   -   -   -   -   -   -   -
4     37  35  32   -   -   -   -   -   -   -   -   -   -
5     39  37  33  28   -   -   -   -   -   -   -   -   -
6     39  37  36  32  27   -   -   -   -   -   -   -   -
7     40  40  40  36  34  21   -   -   -   -   -   -   -
8     40  40  41  39  37  31  23   -   -   -   -   -   -
9     41  41  41  41  41  37  35  22   -   -   -   -   -
10    41  41  41  41  41  39  37  30  25   -   -   -   -
11    39  39  40  41  41  40  38  35  33  25   -   -   -
12    38  40  40  40  41  41  40  38  37  32  28   -   -
13    34  38  38  40  40  40  40  39  38  34  32  17   -
14    22  34  35  38  40  40  40  40  40  38  36  32  27

MAS5
Exp.   1   2   3   4   5   6   7   8   9  10  11  12  13
2      3   -   -   -   -   -   -   -   -   -   -   -   -
3     20   9   -   -   -   -   -   -   -   -   -   -   -
4     37  31  18   -   -   -   -   -   -   -   -   -   -
5     40  37  30  14   -   -   -   -   -   -   -   -   -
6     40  38  37  26   9   -   -   -   -   -   -   -   -
7     40  40  38  33  24   9   -   -   -   -   -   -   -
8     40  40  40  36  35  27  14   -   -   -   -   -   -
9     38  39  41  40  38  35  27   4   -   -   -   -   -
10    38  40  40  41  41  39  35  21  10   -   -   -   -
11    37  39  41  41  41  40  39  35  28  13   -   -   -
12    33  37  40  41  41  41  39  37  32  29   6   -   -
13    24  33  37  40  40  40  39  38  36  35  21   4   -
14     5  21  32  39  40  40  40  38  37  37  29  20   7

RMA
Exp.   1   2   3   4   5   6   7   8   9  10  11  12  13
2     33   -   -   -   -   -   -   -   -   -   -   -   -
3     35  33   -   -   -   -   -   -   -   -   -   -   -
4     39  40  39   -   -   -   -   -   -   -   -   -   -
5     39  40  38  39   -   -   -   -   -   -   -   -   -
6     41  41  40  34  36   -   -   -   -   -   -   -   -
7     41  41  42  41  40  38   -   -   -   -   -   -   -
8     41  41  42  41  41  38  37   -   -   -   -   -   -
9     41  41  41  41  41  41  37  33   -   -   -   -   -
10    41  41  41  41  41  40  37  35  34   -   -   -   -
11    40  40  41  41  41  40  40  39  35  36   -   -   -
12    40  40  40  40  41  40  40  40  40  38  33   -   -
13    40  40  40  40  40  40  40  40  40  40  34  33   -
14    34  39  40  40  40  40  40  40  40  40  39  39  39

Detection of differential expression. The multiple array BGX model allows the posterior distributions of any function of the gene expression measures to be obtained. Of particular interest are those of the differences: $\mathrm{diff}_g = \theta_{g1} - \theta_{g2}$. This is illustrated in Figure 11 (upper panels), summarizing the posterior distributions of the differences for the 1011 genes in Data Set B, in terms of posterior means and 5–95% and 10–90% credibility intervals. As the gene expression measures are obtained from log-scale true signals, the differences can be interpreted as fold changes on the log scale. Fold changes in concentrations of the spike-in genes for this data set are in the range of 3–200. With BGX, for all the spike-in genes (genes 1001–1011) except gene 1004 the posterior 5–95% credibility intervals do not include 0, and the posterior probabilities of these genes being differentially expressed are high. Gene number 1004 should have been spiked in at a concentration of 25 pM under condition 1 and 100 pM under condition 2. The PM and MM probe values on the six arrays suggest that the gene was not correctly spiked in (see Figure 8). For all of the 1000 non-spiked-in genes, the posterior credibility intervals include 0, and the posterior probability of these genes being differentially expressed is thus small. Thus, the inferred differences in log-scale gene expression indices under BGX reflect the actual differences well.

Credibility intervals for the ranks of the genes. For Data Set B, ranking the genes with respect to degree of differential expression under the two conditions (rank 1 indicating the most different) results in 10 of the spike-in genes being correctly ranked in the top 11 by RMA and pBGX. Due to its larger cloud of points for the non-spiked-in genes (Figure 10), MAS5 performs less well, succeeding in ranking only eight of the spike-in genes in the top 11.
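The posterior summaries used in this subsection, credibility intervals for the differences and for the ranks of the absolute differences, can be computed directly from MCMC output. The following is a sketch under the assumption that posterior draws of $\theta_{g1}$ and $\theta_{g2}$ are stored as arrays with one row per retained sweep; the function name and the simulated draws are hypothetical and do not reproduce the BGX software.

```python
import numpy as np

def posterior_diff_and_rank_summaries(theta1, theta2, probs=(5, 95)):
    """theta1, theta2: arrays (n_samples, G) of posterior draws of theta_g1, theta_g2."""
    diff = np.asarray(theta1, dtype=float) - np.asarray(theta2, dtype=float)
    # Within each posterior draw, rank 1 goes to the largest absolute difference.
    order = np.argsort(-np.abs(diff), axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(diff.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, diff.shape[1] + 1)
    return {
        "diff_mean": diff.mean(axis=0),
        "diff_ci": np.percentile(diff, probs, axis=0),    # shape (2, G)
        "rank_median": np.median(ranks, axis=0),
        "rank_ci": np.percentile(ranks, probs, axis=0),   # shape (2, G)
    }

# Hypothetical draws: gene 0 differs between conditions, genes 1 and 2 do not.
rng = np.random.default_rng(1)
theta1 = rng.normal([5.0, 2.0, 2.0], 0.1, size=(200, 3))
theta2 = rng.normal([1.0, 2.0, 2.0], 0.1, size=(200, 3))
out = posterior_diff_and_rank_summaries(theta1, theta2)
```

Ranking within each draw, rather than ranking the posterior means, is what yields a posterior distribution (and hence credibility intervals) for the rank of each gene.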
For the BGX multiple array model, we obtained, in addition to point estimates, the posterior distributions of the ranks of the absolute differences $|\theta_{g1} - \theta_{g2}|$, $g = 1, \ldots, G$, between the two conditions. These are summarized in Figure 11 (lower panels), by their posterior means, 5–95% and 10–90% credibility intervals. The intervals are ordered by median rank [e.g. the credibility interval for the gene with the highest (second highest) median rank is plotted with ordinate 1 (2)]. All of the spike-in genes except number 4 stand out as having much narrower intervals than the 1000 non-spiked-in (and non-differentially expressed) genes. The top six ranked genes have narrow credibility intervals, and achieve high ranks with respect to differential expression under the two conditions with high credibility. The high rank of spike-in gene 10 is less certain, as reflected by the wide credibility interval of this gene. With 90% credibility, six of the spike-in genes fall among the 50 most differentially expressed genes. The non-spiked-in genes generally have flat rank distributions (wide credibility intervals), although the genes with the lowest median ranks (highest ordinates) have narrower credibility intervals. Thus, in addition to providing point estimates, BGX allows credibility intervals for the ranks of the genes with respect to the degree of differential expression to be obtained. This additional information is easy to interpret and can be presented visually.

4.4 MCMC implementation

During the development phase, the models were implemented using the WinBUGS software (http://www.mrc-bsu.cam.ac.uk/bugs/classic/coda04/readme.shtml), which provides a user-friendly and easily modifiable implementation of Bayesian hierarchical models. This ease of implementation is balanced by a lack of efficiency in dealing with the large data sets encountered in genomic applications. In order to be able to process data from typical microarray experiments in a realistic time frame, purpose-built Fortran 77 and C++ codes were developed. Reproducibility of the results between all three codes was checked for all of the low-dimensional model parameters (those not indexed by g), as well as for an arbitrarily chosen subset of the high-dimensional parameters. The C++ and Fortran codes implemented the same sampler, and their results and performance were essentially identical; we do not discuss the Fortran version further. The WinBUGS code gave similar results, but differed somewhat in detail. In the C++ and Fortran implementations a uniform prior was used for φ in place of the Beta prior for computational convenience; since φ did not vary outside the (0, 1) interval in any of the MCMC runs reported here, our results continue to be valid for the Beta prior.

Fig. 11. Posterior mean (crosses) and credibility intervals (5–95%, full line, and 10–90%, broken line) for the differences in gene expression measures, $\mathrm{diff}_g = \theta_{g1} - \theta_{g2}$ (upper panel) and the ranks of the differences (lower panel). Left column: Results for all 1011 genes in Data Set B. Upper right figure: Results for genes 961–1011 only (the spike-in genes are genes 1001–1011). Lower right figure: Results for the genes with the 50 highest median ranks only; the interval for the gene with the kth highest median rank has ordinate k. Posterior distributions for differences and ranks of differences obtained with BGX reflect the true patterns of expressions under the conditions.

The C++ implementation performs single-variable updating using Gibbs sampler steps for parameters $\phi$, $\tau_{cr}$, $a_c$, and $b_c$. Simple random walk Metropolis updates were used for the remainder of the parameters. These were tuned using pilot runs so that acceptance rates fell in the range (0.2, 0.3). The sampler used by WinBUGS also relies on single-site updating, but with some more sophisticated individual updates, based on Neal’s overrelaxation method. This gave improved performance by reducing the chain’s autocorrelation on a sweep-by-sweep basis; however, sweeps were so much slower than those of the C++ code that this advantage was easily outweighed.

Point estimate versions of BGX expression indices for genes whose expression levels are in the top 25% were reliably estimated using relatively short runs of the C++ code. These genes generally displayed small posterior standard deviation of $\theta_{gc}$ and had small Monte Carlo standard error of the posterior mean of $\theta_{gc}$ for moderate run lengths. We found that using 5000 sweeps after a burn-in of 5000 sweeps was sufficient to produce reliable results for this subset of genes. If, on the other hand, one is interested in reliable estimation for genes whose expression level is likely to be much lower, and in particular that have less consistent probe set responses, we found that considerably longer runs were needed, due to substantial autocorrelations. A figure illustrating the reproducibility of posterior mean values for runs of different lengths is given in the Supplementary Material. For 10 000 sweeps, the C++ code for the multiple array model takes about 90 min on Data Set B described in Section 2 (1011 genes on six arrays) on a 2.2 GHz AMD Opteron machine.
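The random walk Metropolis updates with pilot-run tuning mentioned above can be illustrated on a one-dimensional toy target. This is a generic sketch, not the BGX C++ sampler: the target density, the function names, and in particular the tuning rule (scaling the step by a factor 1.5 between pilot runs) are assumptions made for illustration; the text only states that pilot runs were used to reach acceptance rates in (0.2, 0.3).

```python
import math
import random

def rw_metropolis(log_target, x0, step, n_sweeps, rng):
    """Random walk Metropolis for a scalar parameter; returns draws and acceptance rate."""
    x, lp = x0, log_target(x0)
    draws, accepted = [], 0
    for _ in range(n_sweeps):
        prop = x + rng.gauss(0.0, step)            # symmetric Gaussian proposal
        lp_prop = log_target(prop)
        # accept with probability min(1, target(prop) / target(x))
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = prop, lp_prop
            accepted += 1
        draws.append(x)
    return draws, accepted / n_sweeps

def tune_step(log_target, x0, rng, step=1.0, pilot=2000, max_rounds=50):
    """Rescale the proposal step between pilot runs until the acceptance
    rate falls in (0.2, 0.3). The factor 1.5 is an arbitrary choice."""
    rate = float("nan")
    for _ in range(max_rounds):
        _, rate = rw_metropolis(log_target, x0, step, pilot, rng)
        if 0.2 < rate < 0.3:
            break
        step *= 1.5 if rate >= 0.3 else 1.0 / 1.5  # too many accepts: larger steps
    return step, rate

# Toy target: standard normal log-density up to a constant.
rng = random.Random(1)
step, rate = tune_step(lambda x: -0.5 * x * x, 0.0, rng)
draws, _ = rw_metropolis(lambda x: -0.5 * x * x, 0.0, step, 5000, rng)
```

In a single-site sampler of the kind described, one such step size would be tuned per parameter (or per group of parameters), which is why pilot runs are needed before the production run.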
For the results presented on Data Sets A and B, a burn-in phase of 16 384 sweeps followed by 65 536 post-burn-in sweeps was used. Longer runs were used for analyses of Data Set C, with 32 768 sweeps as burn-in followed by 262 144 post-burn-in sweeps. In all cases, equispaced subsampling was used to obtain posterior samples of size 1024, to economize on output file space. The C++ and WinBUGS programs are available from http://www.bgx.org.uk/.

5. DISCUSSION

The Bayesian hierarchical models for the analysis of GeneChip data presented are a first step toward a complete integrated approach to the analysis of Affymetrix gene expression data. It is our long-term goal to further extend the models to include other common steps in gene expression data analysis, such as clustering procedures. In the shorter term a more elaborate study is planned of how to exploit optimally the full posterior in connection with tests for differential expression (see e.g. Lewin et al., 2005). The simultaneous consideration of background correction, gene expression index estimation, assessment of differential expression, etc. should lead to improved estimates of uncertainty on each of these entities and thus to more reliable inference.

The BGX method is fundamentally different from other available approaches in that it provides posterior distributions of gene expression indices and other quantities of interest rather than point estimates. Although simple, the point estimate version of the model performs well in comparison with MAS5 and RMA. The performance of the model is likely to improve with a more detailed and careful modeling of the individual steps in the model. Of particular importance are the parts of the models related to background correction/normalization.
The current version of the model normalizes the signals within condition partly by assumption (a common gene- and condition-specific distribution is assumed for replicate signals for the same gene) and partly by allowing for array-specific distributions of nonspecific hybridization. Although we found that this to a large extent accommodated overall differences for the data sets studied, it is clear that this part of the model can be refined. Refinements are likely to be of importance particularly in cases where biological rather than technical replicates are considered, and where systematic differences of nonbiological variability persist between conditions.


There are two immediate extensions of the models which we are pursuing. First, because of the different binding energies of nucleotide pairs, the fraction of the true signal binding to an MM probe is likely to depend on the nucleotide content of the probe (Schadt et al., 2000; Naef and Magnasco, 2003). This can be easily incorporated in the model by assuming nucleotide-content-specific φ. Similarly, the propensity for nonspecific hybridization to a probe may well be linked to the nucleotide composition of the probe (Wu et al., 2004). This could be accounted for in our model by assuming nucleotide-composition-specific distributions of the log-scale nonspecific hybridization terms. Refinements along these lines have been found to lead to improved performance (see e.g. GCRMA, Wu et al., 2004).

For the multiple array model presented in Section 4, a particular structural choice was made for combining information at the different levels of the model. The flexibility of the Bayesian hierarchical framework allows different structural choices to be made: rather than assuming gene-, probe-, condition-, and replicate-specific true signals in (4.1), one could choose to collapse over replicate probes at an earlier level by assuming gene-, condition-, and probe-specific true signals only [dropping the j index on S_{gjcr} in (4.1) and (4.2)]. The effect of this on the partitioning of intensities into true signals and nonspecific hybridization terms and thus background correction and normalization remains to be explored. The structural choice of collapsing over replicate probes for each gene, replicate array and experimental condition at the first level, is in line with that adopted by the logit-t procedure (Lemon et al., 2003). Given the impressive performance of this procedure with respect to positive predictive value for detecting differential expression, this option may be worth exploring.

ACKNOWLEDGMENTS

We would like to thank Alex Lewin, Clare Marshall, and Natalia Bochkina for helpful discussions, members of the Microarray Centre for their ongoing observations and insights into the behavior of Affymetrix GeneChips, Dave Lunn and Andrew Thomas for help with WinBUGS, and the referees for many useful comments. This work was supported by BBSRC ‘Exploiting Genomics’ grants 28EGM16093 and 7EGM16096.

REFERENCES

AFFYMETRIX. (2001). Statistical Algorithm Reference Guide. Santa Clara, CA: Affymetrix.

COPE, L. M., IRIZARRY, R. A., JAFFEE, H. A., WU, Z. AND SPEED, T. (2004). A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 20, 323–331.

GAUTIER, L., COPE, L., BOLSTAD, B. M. AND IRIZARRY, R. A. (2004). Affy - analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 307–315.

GELLER, S. C., GREGG, J. P., HAGERMAN, P. AND ROCKE, D. M. (2003). Transformation and normalization of oligonucleotide microarray data. Bioinformatics 19, 1817–1823.

HARTEMINK, A., GIFFORD, D., JAAKKOLA, T. AND YOUNG, R. (2001). Maximum likelihood estimation of optimal scaling factors for expression array normalization. SPIE International Symposium on Biomedical Optics 2001 (BiOS01). In Bittner, M., Chen, Y., Dorsel, A. and Dougherty, E. (eds), Microarrays: Optical Technologies and Informatics. Proceedings of SPIE 4266, 132–140.

HUBBELL, E., LIU, W.-M. AND MUI, R. (2002). Robust estimators for expression analysis. Bioinformatics 18, 1585–1592.

IRIZARRY, R., HOBBS, B., COLLIN, F., BEAZER-BARCLAY, Y., ANTONELLIS, K., SCHERF, U. AND SPEED, T. (2003a). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264.


IRIZARRY, R. A., BOLSTAD, B. M., COLLINS, F., COPE, L. M., HOBBS, B. AND SPEED, T. P. (2003b). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31, 1–8.

LEMON, W. J., LIYANARACHCHI, S. AND YOU, M. (2003). A high performance test of differential gene expression for oligonucleotide arrays. Genome Biology 4, R67.

LEWIN, A., RICHARDSON, S., MARSHALL, C., GLAZIER, A. AND AITMAN, T. (2005). Bayesian modelling of differential gene expression. Biometrics. http://www.bgx.org.uk/ (under revision).

LI, C. AND WONG, W. (2001). Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proceedings of the National Academy of Sciences of the United States of America 98, 31–36.

NAEF, F., LIM, D. A., PATIL, N. AND MAGNASCO, M. (2002a). Characterization of the expression ratio noise structure in high-density oligonucleotide arrays. Genome Biology 3, RESEARCH0018.

NAEF, F., LIM, D. A., PATIL, N. AND MAGNASCO, M. (2002b). DNA hybridization to mismatched templates: a chip study. Physical Review E 65, 040902.1–040902.4.

NAEF, F. AND MAGNASCO, M. (2003). Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Physical Review E 68, 011906.1–011906.4.

SCHADT, E. E., LI, C., SU, C. AND WONG, W. H. (2000). Analyzing high-density oligonucleotide gene expression array data. Journal of Cellular Biochemistry 80, 192–202.

SCHENA, M., SHALON, D., DAVIS, R. W. AND BROWN, P. O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470.

WU, Z., IRIZARRY, R. A., GENTLEMAN, R., MURILLO, F. M. AND SPENCER, F. (2004). A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association 99, 909–918.

[Received March 17, 2004; first revision October 4, 2004; second revision December 9, 2004; accepted for publication January 6, 2005]