COMPUTATIONAL AND STATISTICAL APPROACHES TO GENOMICS, SECOND EDITION
edited by
Wei Zhang, University of Texas M. D. Anderson Cancer Center
and
Ilya Shmulevich, Institute for Systems Biology, Seattle, WA
Library of Congress Cataloging-in-Publication Data

Computational and statistical approaches to genomics / edited by Wei Zhang and Ilya Shmulevich. – 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN-13: 978-0-387-26287-1   ISBN-10: 0-387-26287-3 (alk. paper)
ISBN-13: 978-0-387-26288-8 (e-book)   ISBN-10: 0-387-26288-1 (e-book)
1. Genomics—Mathematical models. 2. Genomics—Statistical methods. 3. Genomics—Data processing. 4. DNA microarrays. I. Zhang, Wei, 1963– II. Shmulevich, Ilya, 1969–
QH438.4.M3C65 2005
572.8′6′015118—dc22   2005049761
© 2006 Springer Science+Business Media, Inc.

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, Inc., 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America.

9 8 7 6 5 4 3 2 1

springeronline.com
SPIN 11419631
This book is dedicated to our families and colleagues
Contents

Preface  ix

1  Microarray Image Analysis and Gene Expression Ratio Statistics
   Yidong Chen, Edward R. Dougherty, Michael L. Bittner, Paul Meltzer, and Jeffery Trent  1

2  Statistical Considerations in the Assessment of cDNA Microarray Data Obtained Using Amplification
   Jing Wang, Kevin R. Coombes, Keith Baggerly, Limei Hu, Stanley R. Hamilton, and Wei Zhang  21

3  Sources of Variation in Microarray Experiments
   Kathleen F. Kerr, Edward H. Leiter, Laurent Picard, and Gary A. Churchill  37

4  Studentizing Microarray Data
   Keith A. Baggerly, Kevin R. Coombes, Kenneth R. Hess, David N. Stivers, Lynne V. Abruzzo, and Wei Zhang  49

5  Exploratory Clustering of Gene Expression Profiles of Mutated Yeast Strains
   Merja Oja, Janne Nikkilä, Garry Wong, Eero Castrén, Petri Törönen, and Samuel Kaski  61

6  Selecting Informative Genes for Cancer Classification Using Gene Expression Data
   Tatsuya Akutsu and Satoru Miyano  75

7  Finding Functional Structures in Glioma Gene-Expressions Using Gene Shaving Clustering and MDL Principle
   Ciprian D. Giurcaneanu, Cristian Mircean, Gregory N. Fuller, and Ioan Tabus  89

8  Design Issues and Comparison of Methods for Microarray-Based Classification
   Edward R. Dougherty and Sanju N. Attoor  119

9  Analyzing Protein Sequences Using Signal Analysis Techniques
   Karen M. Bloch and Gonzalo R. Arce  137

10 Scale-Dependent Statistics of the Numbers of Transcripts and Protein Sequences Encoded in the Genome
   Vladimir A. Kuznetsov  163

11 Statistical Methods in Serial Analysis of Gene Expression (SAGE)
   Ricardo Z. N. Vêncio and Helena Brentani  209

12 Normalized Maximum Likelihood Models for Boolean Regression with Application to Prediction and Classification in Genomics
   Ioan Tabus, Jorma Rissanen, and Jaakko Astola  235

13 Inference of Genetic Regulatory Networks via Best-Fit Extensions
   Harri Lähdesmäki, Ilya Shmulevich, Olli Yli-Harja, and Jaakko Astola  259

14 Regularization and Noise Injection for Improving Genetic Network Models
   Eugene van Someren, Lodewyk Wessels, Marcel Reinders, and Eric Backer  279

15 Parallel Computation and Visualization Tools for Codetermination Analysis of Multivariate Gene Expression Relations
   Edward B. Suh, Edward R. Dougherty, Seungchan Kim, Michael L. Bittner, Yidong Chen, Daniel E. Russ, and Robert L. Martino  297

16 Single Nucleotide Polymorphisms and Their Applications
   Rudy Guerra and Zhaoxia Yu  311

17 The Contribution of Alternative Transcription and Alternative Splicing to the Complexity of Mammalian Transcriptomes
   Mihaela Zavolan and Christian Schönbach  351

18 Computational Imaging and Statistical Analysis of Tissue Microarrays: Quantitative Automated Analysis of Tissue Microarrays
   Aaron J. Berger, Robert L. Camp, and David L. Rimm  381

Index  405
Preface
During the three years since the publication of the first edition of this book, computational and statistical research in genomics has become increasingly important and indispensable for understanding cellular behavior under a variety of environmental conditions and for tackling challenging clinical problems. In the first edition, the organizational structure was: data → analysis → synthesis → application. In the second edition, we have kept the same structure but have decided to eliminate chapters that primarily focused on applications. Our decision was motivated by several factors. Firstly, the main focus of this book is computational and statistical approaches in genomics research; thus, the main emphasis is on methods rather than on applications. Secondly, many of the chapters already include numerous examples of applications of the discussed methods to current problems in biology. We have also tried to further broaden the range of topics, to which end we have included newly contributed chapters on topics such as alternative splicing, tissue microarray image and data analysis, single nucleotide polymorphisms, serial analysis of gene expression, and gene shaving. Additionally, a number of chapters have been updated or revised. We thank all the contributing authors and hope that you enjoy reading this book.

Wei Zhang, Houston, TX
Ilya Shmulevich, Seattle, WA
Chapter 1

MICROARRAY IMAGE ANALYSIS AND GENE EXPRESSION RATIO STATISTICS

Yidong Chen¹, Edward R. Dougherty², Michael L. Bittner¹, Paul Meltzer¹, and Jeffery Trent¹

¹Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
²Department of Electrical Engineering, Texas A&M University, College Station, Texas, USA
1. Introduction
A cell relies on its protein components for a wide variety of its functions, including energy production, biosynthesis of component macro-molecules, maintenance of cellular architecture, and the ability to act upon intra- and extra-cellular stimuli. Each cell in an organism contains the information necessary to produce the entire repertoire of proteins the organism can specify. Since a cell's specific functionality is largely determined by the genes it is expressing, it is logical that transcription, the first step in the process of converting the genetic information stored in an organism's genome into protein, would be highly regulated by the control network that coordinates and directs cellular activity. A primary means for regulating cellular activity is the control of protein production via the amounts of mRNA expressed by individual genes. The tools required to build an understanding of genomic regulation of expression must therefore reveal the probability characteristics of these expression levels. Complementary DNA microarray technology provides a powerful analytical tool for human genetic research (Schena et al., 1995; Schena et al., 1996; DeRisi et al., 1996; DeRisi et al., 1997; Duggan et al., 1999). It combines robotic spotting of small amounts of individual, pure nucleic acid species on a glass surface, hybridization to this array with multiple fluorescently labeled nucleic acids, and detection and quantitation of the resulting fluor-tagged hybrids with a scanning confocal microscope (Fig. 1.1). A basic application is quantitative analysis of fluorescence signals representing
Figure 1.1. Illustration of a microarray system.
the relative abundance of mRNA from distinct tissue samples. cDNA microarrays are prepared by printing thousands of cDNAs in an array format on glass microscope slides, which provide gene-specific hybridization targets. Distinct mRNA samples can be labeled with different fluors and then co-hybridized onto each arrayed gene. Ratios of gene-expression levels between the samples can be used to detect meaningfully different expression levels between the samples for a given gene. Given an experimental design with multiple tissue samples, microarray data can be used to cluster genes based on expression profiles (Eisen et al., 1998; Khan et al., 1998), to characterize and classify disease based on the expression levels of gene sets (Golub et al., 1999; Ben-Dor et al., 2000; Bittner et al., 2000; Hedenfalk et al., 2001; Khan et al., 2001), and for the many statistical methods presented in this book. When using cDNA microarrays, the signal must be extracted from the background. This requires image processing to extract signals arising from tagged mRNA hybridized to arrayed cDNA locations (Chen et al., 1997; Schadt et al., 2000; Kim et al., 2001), as well as variability analysis and measurement quality control assessment (Bittner et al., 2001; Newton et al., 2001; Wang et al., 2001). This chapter discusses an image processing environment whose components have been specially designed for cDNA microarrays. It measures signals and ratio statistics to determine whether a ratio is significantly high or low, in order to conclude whether the gene is up- or down-regulated, and provides related tools, such as those for quality assessment.
2. Microarray Image Analysis
A typical glass-substrate, fluorescence-based cDNA microarray detection system is based on a scanning confocal microscope, where two monochrome images are obtained from laser excitations at two different wavelengths. Monochrome images of the fluorescent intensity for each fluor are combined by placing each image in the appropriate color channel of an RGB image (Fig. 1.2). In this composite image, one can visualize the differential expression of genes in the two cell types: the test sample is typically placed in the red channel and the reference sample in the green channel. Intense red fluorescence at a spot indicates a high level of expression of that gene in the test sample with little expression in the reference sample. Conversely, intense green fluorescence at a spot indicates relatively low expression of that gene in the test sample compared to the reference. When both test and reference samples express a gene at similar levels, the observed array spot is yellow. We generally assume that specific DNA products from the two samples have an equal probability of hybridizing to the specific target. Thus, the fluorescent intensity measurement is a function of the amount of specific RNA available within each sample, provided the samples are well-mixed and there is sufficiently abundant cDNA deposited at each target location. The objective of microarray image analysis is to extract probe intensities or ratios at each cDNA target location and then cross-link the printed clone information so that biologists can easily interpret the outcomes and perform further high-level analysis. A block diagram of the image analysis system is shown in Fig. 1.3. A microarray image is first segmented into individual cDNA targets, either by manual interaction or by an
Figure 1.2. An example of a microarray image.
Figure 1.3. Block diagram of cDNA microarray image analysis.
automated algorithm. For each target, the surrounding background fluorescent intensity is estimated, along with the exact target location, fluorescent intensity, and expression ratios. Microarray image sources do not come from a single print mode (e.g., different printing-tip arrangements or different arrayers (Bowtell, 1999)) or a single hybridization method (e.g., fluorescent, radioactive, and other probes); nevertheless, to simplify the presentation, we take microarray images from two fluorescent probes as the main processing input.
2.1 Target Segmentation and Clone Information Assignment
Since each element of an array is printed automatically at a predefined pattern and position, we assume the final probe signals from a regular array can be automatically aligned to a predefined grid overlay. The initial position of the grid (Fig. 1.4) can be determined manually if no particular orientation markers are printed or no visible signals can be used as orientation markers, or automatically if orientation markers are present in the final image and the entire array has no obvious missing row or column signals (Bittner et al., 2001). Due to the complication of customized print procedures and various hybridization protocols, we assume the initial grid overlay is manually determined. Usually, the initial target segmentation achieved by grid overlaying does not need to be precise; an automatic refinement of the grid position following the manual grid overlaying is preferable. For each sub-array, after the initial alignment of the grid pattern, we utilize the following procedure:
Figure 1.4. Manual grid alignment.
1. Adjust the four corners' horizontal coordinates by [−1, 0, 1].

2. Calculate the column average along the vertical edges of the bounding boxes for all columns in the sub-array, and then sum the average intensities, I, as illustrated in Fig. 1.5 by the red open circles.

3. Calculate the centers of gravity between consecutive valleys, as marked in Fig. 1.5 by the dark "+" symbols. Evaluate the average difference, D, between the calculated grid centers and the centers of gravity.

4. Select the corner coordinates that produce the minimum value of I × D for overlaying the grid pattern between the cDNA targets. The minimum value of I × D ensures that the grid lines will mostly lie within the background region and that, when the background space between target columns is large, the centers of gravity will coincide with the centers of the grid cells.

5. Repeat Steps 1–4 for the horizontal direction of the grid pattern.

6. Repeat Steps 1–5 once, to limit the adjustment of each corner to no more than 2 pixels (more repeats are optional, to allow a larger adjustment).

This is a semi-automatic grid alignment procedure that depends on the initial user interaction. A post-refinement of the grid alignment may be activated if the image needs to be processed again. Statistics derived from the prior detection procedure, such as spot measurement quality, intensity, or size, are available to further qualify the spots for precise grid alignment.
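To make the I × D criterion concrete, the following is a minimal Python sketch of how one candidate set of vertical grid lines could be scored; the names grid_score and col_edges are illustrative assumptions, not from the chapter, and the full procedure additionally perturbs the corner coordinates and repeats the search in both directions.

```python
import numpy as np

def grid_score(img, col_edges):
    """Score one candidate set of vertical grid lines for a sub-array.

    I: summed column-average intensity under the grid lines (small when
       the lines fall on background between spot columns).
    D: average distance between each grid cell's intensity center of
       gravity and its geometric center (small when spots are centered).
    The best candidate minimizes I * D.
    """
    profile = img.mean(axis=0)                    # column-average projection
    I = sum(profile[int(e)] for e in col_edges)   # intensity under grid lines
    dists = []
    for left, right in zip(col_edges[:-1], col_edges[1:]):
        xs = np.arange(int(left), int(right))
        w = profile[xs]
        cog = (xs * w).sum() / w.sum()            # center of gravity between valleys
        dists.append(abs(cog - (left + right) / 2.0))
    return I * float(np.mean(dists))
```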
Figure 1.5. Semi-automatic grid alignment. (+: center of gravity between two valleys; o: projection average through the background.)
After segmentation, clone information is linked to array locations before further processing. Information regarding non-printed, duplicated, negative-control, or housekeeping genes is useful for normalization and other calibration purposes.
2.2 Background Detection
Typically, the background of a microarray image is not uniform over the entire array, and extraction of the local background intensity is necessary. Changes of fluorescent background across an array are usually gradual and smooth, owing to many technical reasons. Abrupt changes are rare; when they happen, the actual signal intensities of array elements near these changes are not reliable. Conventionally, pixels near the bounding box edge are taken to be background pixels, and the mean gray level of these pixels provides an estimate of the local background intensity (Schena et al., 1995; Schena et al., 1996; Eisen et al., 1998). This method becomes inaccurate when the bounding box size is close to 10 × 10 pixels or the target fills the entire bounding box. Alternative methods require some knowledge of the target location, such that the corner regions between targets may be chosen as candidate regions for background estimation. Fluorescent background is typically modeled by a Gaussian process (Wang and Herman, 1996). If we choose a large area that covers the centers of the neighboring targets, then the gray-level histogram within the box is usually unimodal, since the majority of the background pixel values are
Figure 1.6. Background estimation.
similar, while the target pixel values spread up to very high gray levels. The histogram mode provides the mean local background intensity μ_b, and the left tail of the histogram provides the spread (standard deviation σ_b) of the background intensity. A unimodal histogram derived from a region containing a bright target, with no effort made to eliminate the target region, is shown in Fig. 1.6. The background statistics derived here provide essential information for target detection. In practice, the histogram is not always unimodal, so care must be taken. We first use the minimal value, v, of the average intensities from the four sides of the target's bounding box to estimate the mean background level μ_b. We then take the gray level, g, at the 1st percentile within the bounding box, the intention being to eliminate possible dark pixels due to scratches (random or in the center of the target), and form an initial estimate σ_b = (v − g)/3. We then refine the estimates μ_b and σ_b within the initial range (μ_b − 3σ_b, μ_b + 3σ_b), and finally estimate the mode within the range (μ_b − σ_b, μ_b + σ_b). Finding exact background intensities is problematic. Some hybridizations yield much higher or lower intensities than those from the peripheral region. It is highly recommended that microarray designs include randomly distributed negative-control locations; the background intensity for a cDNA target can then be assessed from its nearest negative-control locations. Background subtraction can produce negative intensities. We set a negative intensity to 1.0 to avoid an unattainable ratio quotient. Alternatives include global background subtraction, conservative estimation of the background level by the morphological opening operator, or no subtraction at all.
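As a rough illustration of the procedure just described, here is a minimal Python sketch of local background estimation for a single bounding box; estimate_background and its default border width are assumptions for illustration, not the chapter's actual code.

```python
import numpy as np

def estimate_background(patch, border=2):
    """Estimate (mode, sigma) of the local background for one bounding box."""
    sides = [patch[:border].mean(), patch[-border:].mean(),
             patch[:, :border].mean(), patch[:, -border:].mean()]
    v = min(sides)                        # initial mean background level
    g = np.percentile(patch, 1)           # 1st percentile drops dark scratches
    sigma = max((v - g) / 3.0, 1e-6)      # initial spread estimate
    # refine the estimates within (v - 3*sigma, v + 3*sigma)
    window = patch[(patch > v - 3 * sigma) & (patch < v + 3 * sigma)]
    mu, sigma = window.mean(), window.std()
    # locate the histogram mode within (mu - sigma, mu + sigma)
    core = window[(window > mu - sigma) & (window < mu + sigma)]
    if core.size == 0:
        return mu, sigma
    hist, edges = np.histogram(core, bins=50)
    mode = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])
    return mode, sigma
```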
Figure 1.7. Pen-mask extraction illustration.
2.3 Target Detection
Identifying the target region within a bounding box is a difficult image-processing task. Targets are somewhat annular (Fig. 1.7), owing to how the robot print-tip places the cDNA on the slide and how the slide is treated; however, the final target image may be a collection of subregions within the nominal circular target region, owing to variability caused by cDNA deposition or the hybridization process. To reduce the noise-to-signal interaction, the final signal intensity should be measured over regions corresponding to the probe-hybridized-to-target area. Conventional adaptive thresholding segmentation techniques are unsatisfactory when the signal is weak, because there is no marked transition between foreground and background. Standard morphological methods also fail because, for weak signals, there is no consistent shape information for the target area. Given the knowledge that each sub-array in a microarray slide is produced by a single print-tip, resulting in a consistent spot morphology, the target detection algorithm has two steps: (1) target shape (mask) extraction; and (2) target detection within each optimally matched target mask region.

Target Mask Detection. For a given sub-array and local background statistics (μ_b, σ_b), we can easily identify strong targets, for which I_avg > μ_b + 4σ_b, where I_avg is the mean intensity over the entire bounding box. Each detected strong target is aligned by correlation to the first detected strong target, S, and added to it. After summing all strong targets into S, the target mask is obtained by thresholding S at a value, t, determined by Otsu's histogram method (see Eq. 1.1 below) (Otsu, 1979), where t is limited by (μ + 4σ) ≤ t ≤ 2μ. If the pixel count in the bounding box is less than 200, the threshold value is μ + 4σ, where μ and σ are the average mean background and average standard deviation over all
detected strong targets, respectively. We keep the largest connected component in the mask as the final mask, M, for each print-tip.

Target Detection – Mann-Whitney Method (Chen et al., 1997). For this method, a target site is segmented from the target patch by the following procedure. The predefined target mask M is used to identify the portion of the target patch that contains the target site. Eight sample pixels are randomly selected from the known background (outside the target mask) as Y_1, Y_2, ..., Y_8. The eight lowest samples from within the target mask are selected as X_1, X_2, ..., X_8. The rank-sum statistic W is calculated and, for a given significance level α, compared to w_{α,8,8}. Eight samples are chosen because the Mann-Whitney statistic is approximately normal when m = n ≥ 8 (Gibbons and Chakraborti, 1996). If the null hypothesis is not rejected, then we discard some predetermined number (perhaps only 1) of the 8 samples from the potential target region and select the lowest 8 remaining samples from the region. The Mann-Whitney test is repeated until the null hypothesis is rejected. When H0 is rejected, the target site is taken to be the 8 pixels causing the rejection, together with all pixels in the target mask whose values are greater than or equal to the minimum value of the eight. The resulting site is said to be a target site of significance level α. If the null hypothesis is never rejected, then it is concluded that there is no appreciable probe at the target site. One can require that the Mann-Whitney target site contain some minimum number of pixels for the target site to be considered valid and measured for fluor intensity. Figures 1.8 and 1.9 show the detection results of target sites at α = 0.0001 and α = 0.05, respectively, where detected site boundaries are superimposed on the original images.
Figure 1.8. Detection result at α = 0.0001.
Figure 1.9. Detection result at α = 0.05.
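The iterative Mann-Whitney procedure can be sketched in a few lines of Python using scipy; this is a minimal illustration under the stated choices (eight samples, discarding one sample per iteration), with hypothetical function and argument names, not the authors' implementation.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def detect_target_site(patch, mask, alpha=1e-4, n=8, drop=1, seed=0):
    """Mann-Whitney target detection for one spot.

    patch: 2-D intensity array; mask: boolean array marking the
    predefined target mask M. Returns a boolean target-site mask,
    or None if no appreciable probe is found.
    """
    rng = np.random.default_rng(seed)
    background = patch[~mask].ravel()
    inside = np.sort(patch[mask].ravel())
    Y = rng.choice(background, size=n, replace=False)  # 8 background samples
    lo = 0
    while lo + n <= inside.size:
        X = inside[lo:lo + n]              # 8 lowest remaining target samples
        # one-sided test: target samples stochastically larger than background?
        _, p = mannwhitneyu(X, Y, alternative='greater')
        if p < alpha:                      # H0 rejected: site found
            return mask & (patch >= X.min())
        lo += drop                         # discard the lowest sample(s)
    return None                            # H0 never rejected
```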
Target Detection – Fixed Threshold Method. First, match the mask M to the target by correlation (this step is also performed in the Mann-Whitney method). The threshold value t is chosen to maximize

q_1(t) q_2(t) [μ_1(t) − μ_2(t)]²   over all t,   (1.1)

where q_1(t) and q_2(t) are the numbers of pixels whose gray levels are below and above the threshold value t, respectively, and μ_1(t) and μ_2(t) are the mean gray levels of the pixels below and above t, respectively. We limit the threshold by (μ_b + 2.3σ_b) ≤ t ≤ (μ_b + 4σ_b) in case the histogram method fails to find an appropriate threshold value. Finally, we take the union of the detected target pixels from both fluorescent channels as T_i. The target is deleted if the number of pixels within T_i is too small (default: 10 pixels).
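A minimal sketch of the clamped threshold search of Eq. 1.1 in Python; the brute-force grid over candidate thresholds and the name clamped_threshold are illustrative choices, not the chapter's code.

```python
import numpy as np

def clamped_threshold(pixels, mu_b, sigma_b, steps=128):
    """Maximize q1(t)*q2(t)*(mu1(t) - mu2(t))**2 over a clamped range."""
    lo, hi = mu_b + 2.3 * sigma_b, mu_b + 4.0 * sigma_b
    best_t, best_score = lo, -np.inf
    for t in np.linspace(lo, hi, steps):
        below, above = pixels[pixels < t], pixels[pixels >= t]
        if below.size == 0 or above.size == 0:
            continue
        score = below.size * above.size * (below.mean() - above.mean()) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```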
2.4 Intensity Measurement and Ratio Calculation
Intensity measurements are carried out after the target regions are determined. For a two-color system, the target regions detected from the red and green channels are combined: both probes have been hybridized to the same target, so if we observe either one of them, the underlying region must belong to the original target, unless the scanner has an obvious mechanical channel-alignment problem. Upon obtaining the correct target site, the measurements of fluorescent intensities from both channels for a target region are obtained in the following manner:

1. Trim 5% of the pixels from the top and 5% from the bottom of the intensity-sorted lists for both the red and green channels.

2. Select the pixels not trimmed from either channel. This step may effectively trim up to 10% of the pixels from the top and up to 10% from the bottom.
3. Calculate the mean intensities, standard deviations, and other measurement statistics from both fluorescent channels, and the corresponding expression ratios, based on the trimmed pixel collection.

4. Perform background intensity subtraction; if the background-subtracted intensity is less than 1, set it to 1.

5. The ratio is calculated as T = R/G, where R and G are the background-subtracted intensities from the red and green channels, respectively.

The trimmed mean is used for estimating the average intensity to lessen the interference of spike noise or possible mis-registration between the two channels. Trimming suppresses most noise spikes while maintaining the robustness of the averaging process and compatibility with the background level estimation. We take the background-subtracted mean intensity, which is approximately normally distributed when the number of pixels is large (central limit theorem), as the estimate of the signal intensity.
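The trimming and ratio steps translate directly into code; the following Python sketch assumes red/green pixel vectors over one detected target site and scalar background estimates (the names are illustrative, not the chapter's).

```python
import numpy as np

def spot_ratio(red, green, bg_red, bg_green, trim=0.05):
    """Trimmed-mean intensities and expression ratio for one target site."""
    n = red.size
    k = int(np.floor(trim * n))            # pixels to trim per tail
    keep = np.ones(n, dtype=bool)
    for order in (np.argsort(red), np.argsort(green)):
        keep[order[:k]] = False            # bottom 5% of either channel
        keep[order[n - k:]] = False        # top 5% of either channel
    R = max(red[keep].mean() - bg_red, 1.0)      # clip at 1 to avoid an
    G = max(green[keep].mean() - bg_green, 1.0)  # unattainable quotient
    return R, G, R / G
```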
3. Ratio Statistics
In many practical microarray analyses, the ratio statistic has been chosen to quantitate relative expression levels between two biological samples. It is well known that working with ratio distributions can be problematic (Feldman and Fox, 1991; Sokal and Rohlf, 1995), and recent research on the matter is generally confined to normality studies of the ratio distribution and numerical calculations (Shanmugalingam, 1982; Schneeberger and Fleischer, 1993; Korhonen and Narula, 1989). However, as we now discuss, a special situation arises for gene expression that permits a more detailed statistical analysis, as well as hypothesis tests and confidence intervals based on a single microarray. The unique conditions in gene expression analysis are: (1) the level of a transcript depends roughly on the concentration of the related factors which, in turn, govern the rate of production and degradation of the transcript; (2) the random fluctuation of any particular transcript is normally distributed; and (3) as a fraction of abundance, the variation of any transcript is constant relative to most of the other transcripts in the genome, which means that the coefficient of variation, cv, can be taken as constant across the genome. These observations provide the statistical foundation for many of our routine analyses.
3.1 Constant Coefficient of Variation
Consider a microarray having n genes, with red and green expression values R_1, R_2, ..., R_n and G_1, G_2, ..., G_n, respectively. Regarding the expression ratios T_i = R_i/G_i: even if the red and green measurements are
identically distributed, the mean of the ratio distribution will not be 1. Moreover, a hypothesis test needs to be performed on expression levels from a single microarray. Letting μ_Rk and σ_Rk denote the mean and standard deviation of R_k (and similarly for G_k), the equal-cv assumption means that

σ_Rk = c μ_Rk,   σ_Gk = c μ_Gk,   (1.2)

where c denotes the common coefficient of variation (cv). Previously, we have proposed a ratio-based hypothesis test for determining whether R_k is over- or under-expressed relative to G_k, assuming constant cv for all genes in the microarray (Chen et al., 1997). This assumption facilitates the pooling of statistics on gene expression ratios across the microarray. The desired hypothesis test using the ratio test statistic T_k = R_k/G_k is

H0: μ_Rk = μ_Gk   versus   H1: μ_Rk ≠ μ_Gk.   (1.3)

Under the null hypothesis H0, Eq. 1.3 implies that σ_Rk = σ_Gk. Assuming R_k and G_k to be normally and identically distributed, T_k has the density function

f_Tk(t; c) = [(1 + t) √(1 + t²)] / [√(2π) c (1 + t²)²] · exp( −(t − 1)² / (2c²(1 + t²)) ).   (1.4)

Since the subscript k does not appear on the right-hand side, the density applies to all genes, which enables us to pool expression ratios across the entire microarray to perform statistical analyses, such as normalization and construction of confidence intervals.
Figure 1.10. Ratio density function.
Figure 1.10 depicts the probability density given in Eq. 1.4 for c = 0.05, 0.1, and 0.2. The density function of Eq. 1.4 is asymmetric and its peak is close to 1 under the null hypothesis. In most practical microarray applications, the evidence that genes, regardless of their expression levels, are equally distributed around the 45° diagonal line in a log-log scatter plot shows that the constant-coefficient-of-variation assumption is not overly violated, and it has been demonstrated that some variation of cv is tolerable (Chen et al., 1997). However, even if the assumption intrinsically holds for expression levels, a detected expression level may not satisfy the assumption, particularly when the gene expresses weakly. Expression-level variation increases when the levels approach the background fluorescence, even though the aforementioned image processing techniques are capable of detecting the cDNA target reliably. This problem is addressed in Section 3.4.
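For reference, Eq. 1.4 is easy to evaluate numerically; the short Python sketch below codes the density and checks that its mass over t > 0 is approximately 1 (the grid bounds are arbitrary illustrative choices).

```python
import numpy as np

def ratio_density(t, c):
    """Null-hypothesis ratio density of Eq. 1.4."""
    t = np.asarray(t, dtype=float)
    num = (1 + t) * np.sqrt(1 + t**2)
    den = np.sqrt(2 * np.pi) * c * (1 + t**2) ** 2
    return num / den * np.exp(-(t - 1) ** 2 / (2 * c**2 * (1 + t**2)))

# sanity check: the density mass over t > 0 should be ~1 for small c
t = np.linspace(1e-4, 20.0, 200_000)
print((ratio_density(t, c=0.1) * (t[1] - t[0])).sum())   # ~1.0
```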
3.2 Ratio Confidence Interval
If a set of internal control genes whose expression levels are assumed to be stable between the reference and test samples (satisfying the null hypothesis in Eq. 1.3), such as housekeeping genes, can be identified, then we can pool their ratios to estimate the parameter of Eq. 1.4 by maximum likelihood. We select the parameter c to maximize the likelihood function

L(c) = ∏_{i=1}^{n} [(1 + t_i) √(1 + t_i²)] / [√(2π) c (1 + t_i²)²] · exp( −(t_i − 1)² / (2c²(1 + t_i²)) ),   (1.5)
where t_1, t_2, ..., t_n are ratio samples taken from a single collection of expression values, for example, all housekeeping-gene ratios in a microarray. The maximum-likelihood criterion requires that d[log L(c)]/dc = 0. Hence, the estimator for c is

ĉ = √[ (1/n) Σ_{i=1}^{n} (t_i − 1)² / (t_i² + 1) ].   (1.6)
The 99% confidence interval for a given c can be evaluated from Eq. 1.4 and used to identify over- or under-expressed genes, relative to the variation of their internal control genes. A set of confidence intervals for different cv values is shown in Fig. 1.11. For two identical mRNA samples co-hybridized on a slide (a self-self experiment), c (the cv of the fluorescent intensity) quantifies the variation of the assay. However, it is not possible to guarantee the null hypothesis condition implied in Eq. 1.5 for an arbitrary experiment. One alternative is to duplicate some or all clones, for which the same expression ratio is expected.
Figure 1.11. Confidence intervals for different cv.
The ratio of expression ratios, T = t/t′, satisfies the null hypothesis required by Eq. 1.3. It can be shown (Chen et al., 2001) that c = σ_log T / 2, where log T is the natural log-transform of T, under the assumption that the measurement of the log-transformed expression level is approximately normally distributed. For any given experiment with some duplicated clones, σ²_log T is easily calculated. For a given microarray experiment, one can use either the parameter derived from pre-selected housekeeping genes (Eq. 1.6) or, if duplicated genes are available in the array, c = σ_log T / 2. The former confidence intervals contain some level of variation from the fluctuation of the biological system, which also affects the housekeeping genes, while the latter contain no biological fluctuation but do contain possible spot-to-spot variation. This spot-to-spot variation is unavoidable if one wishes to repeat the experiment. Thus, the confidence interval derived from the duplicated genes is termed the confidence interval of the assay.
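A minimal sketch of the estimator of Eq. 1.6 and a numerical equal-tail confidence interval computed from the density (reusing ratio_density from the sketch above); the grid resolution and function names are illustrative assumptions.

```python
import numpy as np

def estimate_c(ratios):
    """Maximum-likelihood estimate of c (Eq. 1.6) from H0 ratio samples,
    e.g. housekeeping-gene ratios on one microarray."""
    t = np.asarray(ratios, dtype=float)
    return float(np.sqrt(np.mean((t - 1) ** 2 / (t**2 + 1))))

def ratio_interval(c, level=0.99):
    """Equal-tail interval for the H0 ratio, from the Eq. 1.4 density."""
    grid = np.linspace(1e-4, 20.0, 400_000)
    cdf = np.cumsum(ratio_density(grid, c)) * (grid[1] - grid[0])
    lo = grid[np.searchsorted(cdf, (1 - level) / 2)]
    hi = grid[np.searchsorted(cdf, 1 - (1 - level) / 2)]
    return lo, hi
```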
3.3 Ratio Normalization
In practical applications, a constant amplification gain m may apply to one signal channel, in which case the null hypothesis becomes μ_Rk = m μ_Gk. Under this uncalibrated signal detection setting, the ratio density is modified to

f_T(t; c, m) = (1/m) f_T(t/m; c, 1),   (1.7)

where f_T(·; c, 1) is given by Eq. 1.4. In Chen et al. (1997), an estimation procedure for the parameter m is proposed that has proven to be very efficient.
Other studies utilize the approximate normality of the distribution of the log-transformed ratio; the normalization procedure is then to move the mean of this log-ratio distribution to 0 (i.e., a ratio of 1) (Schena et al., 1995; Schena et al., 1996; DeRisi et al., 1996; DeRisi et al., 1997). More recent proposals include non-linear regression (Bittner et al., 2001) and rank-order-based methods (Schadt et al., 2000).
3.4 Ratio Statistics for Low Signal-to-Noise Ratio
The constant-cv condition is stated under the assumption that R_k and G_k are the expression levels detected from the two fluorescent channels. The assumption is based on image processing that suppresses the background noise relative to the true signal. It is quite accurate for strong signals, but problems arise for weak signals. Even with image processing, the actual expression intensity measurement is of the form

R_k = (S_Rk + B_Rk) − μ_BRk,   (1.8)

where S_Rk is the expression intensity measurement of gene k, B_Rk the fluorescent background level, and μ_BRk the mean background level. The null hypothesis of interest is μ_SRk = μ_SGk. The constant-cv assumption applies to S_Rk and S_Gk, not to R_k and G_k, and the density of Eq. 1.4 is not applicable. The situation is treated in detail in Chen et al. (2001); here we provide a brief summary. The matter is quantified by defining an appropriate signal-to-noise ratio. Assuming S_Rk and B_Rk are independent,

c_Rk² = c² + SNR_Rk⁻²,   (1.9)

where the signal-to-noise ratio is defined as SNR = μ_SR / σ_BR. If SNR ≫ 1 for gene k, meaning the expression signal is strong, then the measured cv is close to a constant, namely c_Rk ≈ c, but if the signal is weak, the constant-cv condition is violated. Figure 1.12 shows how, in most practical applications, the weaker expression signals (at the lower left corner of the scatter plot) produce a larger spread of gene placement. Given the complication introduced by Eq. 1.9 into Eq. 1.4, we can only evaluate the ratio distribution by a Monte Carlo method. Using this numerical method, the 99% confidence interval at different SNRs is shown in Fig. 1.13.
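In the same spirit, the Monte Carlo evaluation can be sketched as follows; the signal model, the clipping at 1, and all parameter values here are illustrative assumptions meant only to reproduce the qualitative behavior of Fig. 1.13 (a widening interval as SNR falls).

```python
import numpy as np

def mc_ratio_quantiles(c=0.1, snr=5.0, mu=1000.0, n=200_000,
                       level=0.99, seed=0):
    """Simulate the null ratio distribution with background noise
    (Eqs. 1.8-1.9) and return the equal-tail quantiles."""
    rng = np.random.default_rng(seed)
    sigma_b = mu / snr                       # SNR = mu_S / sigma_B

    def channel():
        s = rng.normal(mu, c * mu, n)        # true signal, constant cv
        b = rng.normal(0.0, sigma_b, n)      # residual background noise
        return np.maximum(s + b, 1.0)        # background-subtracted, clipped

    t = channel() / channel()
    q = (1 - level) / 2
    return np.quantile(t, [q, 1 - q])

print(mc_ratio_quantiles(snr=30.0))  # tight interval for strong signals
print(mc_ratio_quantiles(snr=2.0))   # much wider interval for weak signals
```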
3.5 Measurement Quality Assessment
Before submitting expression ratios from an experiment for further analysis, it is useful to forward a quality measurement with each ratio. For instance, if a cDNA target has very low SNR, then we may discard it.
Figure 1.12. Background interference: scatter plot showing expression intensity of Cy5 vs. Cy3.
Figure 1.13. Background interference: effect on the confidence interval.
A quality metric must be designed at an early processing stage, because information is lost through image processing and when ratios are extracted and forwarded to higher-level processing. We briefly describe the qualitative considerations behind a proposed quality metric for cDNA expression ratio measurement that combines local and global statistics to describe the quality of each ratio measurement (Chen et al., 2001).
The single quality metric enables unified data filtering, and it can be used directly in higher-level data analysis. For a given cDNA target, the following factors affect ratio measurement quality:

(1) Weak fluorescent intensities from both channels result in less stable ratios. This quality problem has been addressed in the last section via the confidence interval, if over- or under-expression is the only concern; however, when comparing one experiment to another via a common reference sample, low intensities provide less reliable ratios.

(2) A smaller than normal detected target area indicates possible poor quality in clone preparation, printing, or hybridization.

(3) A very high local background level may suggest a problematic region in which any intensity measurement may be inaccurate.

(4) A high standard deviation of target intensity is usually caused by contamination with strong fluorescent crud sitting within the target region.

Our image processing package extracts all of this information, and for each factor a quality metric is defined taking a value from 0 (lowest measurement quality) to 1 (highest measurement quality). The final reported quality is the minimum of all individual measurement quality assessments.
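The combination rule itself is a one-liner; the sketch below assumes each factor has already been mapped to [0, 1] by some scoring function (the factor names are illustrative, not the package's).

```python
def spot_quality(q_intensity, q_area, q_background, q_uniformity):
    """Single reported quality: the minimum of the per-factor qualities,
    each on a 0 (worst) to 1 (best) scale."""
    return min(q_intensity, q_area, q_background, q_uniformity)

print(spot_quality(0.9, 0.8, 1.0, 0.4))   # 0.4: one bad factor dominates
```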
4. Conclusions
Various image analysis issues have been addressed: target segmentation, background detection, target detection, and intensity measurement. Since microarray technology is still under development and image quality varies considerably, a robust and precise image analysis algorithm that reduces background interference and extracts precise signal intensity and expression ratios for each gene is critical to the success of further statistical analysis. The overall methodology discussed in this chapter has been developed and enhanced through five years of experience working with cDNA microarray images. It continues to be expanded and revised as new issues arise.
References

Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., and Yakhini, Z. (2000). "Tissue Classification with Gene Expression Profiles." J Comput Biol 7:559–583.

Bittner, M., Chen, Y., and Dougherty, E. R. (Eds.) (2001). Microarrays: Optical Technologies and Informatics, Proceedings of SPIE 4266:1–12.

Bittner, M., Meltzer, P., Khan, J., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., Dougherty, E., Wang, E., Marincola, F., Gooden, C., Lueders, J., Glatfelter, A., Pollock, P., Gillanders, E., Leja, A., Dietrich, K., Beaudry, C., Berrens, M., Alberts, D., Sondak, V., Hayward, N., and Trent, J. (2000). "Molecular Classification of Cutaneous Malignant Melanoma by Gene Expression Profiling." Nature 406:536–540.

Bowtell, D. (1999). "Options Available – from Start to Finish – for Obtaining Expression Data by Microarray." Nat Genet 21:25–32.

Chen, Y., Dougherty, E. R., and Bittner, M. (1997). "Ratio-based Decisions and the Quantitative Analysis of cDNA Microarray Images." J Biomedical Optics 2:364–374.

Chen, Y., Kamat, V., Dougherty, E. R., Bittner, M. L., Meltzer, P. S., and Trent, J. M. "Ratio Statistics of Gene Expression Levels and Applications to Microarray Data Analysis." (submitted to Bioinformatics).

DeRisi, J. L., Iyer, V. R., and Brown, P. O. (1997). "Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale." Science 278:680–686.

DeRisi, J., Penland, L., Brown, P. O., Bittner, M. L., Meltzer, P. S., Ray, M., Chen, Y., Su, Y. A., and Trent, J. M. (1996). "Use of a cDNA Microarray to Analyze Gene Expression Patterns in Human Cancer." Nat Genet 14:457–460.

Duggan, D. J., Bittner, M. L., Chen, Y., Meltzer, P. S., and Trent, J. M. (1999). "Expression Profiling Using cDNA Microarrays." Nat Genet 21:10–14.

Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). "Cluster Analysis and Display of Genome-wide Expression Patterns." Proc Natl Acad Sci USA 95:14863–14868.

Feldman, D. and Fox, M. (1991). Probability, the Mathematics of Uncertainty. New York: Marcel Dekker.

Gibbons, J. D. and Chakraborti, S. (1996). Nonparametric Statistical Inference, 3rd ed. New York: Marcel Dekker.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science 286:531–537.

Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Raffeld, M., Yakhini, Z., Ben-Dor, A., Dougherty, E., Kononen, J., Bubendorf, L., Fehrle, W., Pittaluga, S., Gruvberger, S., Loman, N., Johannsson, O., Olsson, H., Wilfond, B., Sauter, G., Kallioniemi, O. P., Borg, A., and Trent, J. (2001). "Gene Expression Profiles Distinguish Hereditary Breast Cancers." N Engl J Med 344:539–548.

Khan, J., Simon, R., Bittner, M., Chen, Y., Leighton, S. B., Pohida, T., Smith, P. D., Jiang, Y., Gooden, G. C., Trent, J. M., and Meltzer, P. S. (1998). "Gene Expression Profiling of Alveolar Rhabdomyosarcoma with cDNA Microarrays." Cancer Res 58:5009–5013.

Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., and Meltzer, P. S. (2001). "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks." Nat Med 7:673–679.

Kim, J. H., Kim, H. Y., and Lee, Y. S. (2001). "A Novel Method Using Edge Detection for Signal Extraction from cDNA Microarray Image Analysis." Exp Mol Med 33:83–88.

Korhonen, P. J. and Narula, S. C. (1989). "The Probability Distribution of the Ratio of the Absolute Values of Two Normal Variables." J Statistical Computation and Simulation 33:173–182.

Newton, M. A., Kendziorski, C. M., Richmond, C. S., Blattner, F. R., and Tsui, K. W. (2001). "On Differential Variability of Expression Ratios: Improving Statistical Inference about Gene Expression Changes from Microarray Data." J Comput Biol 8:37–52.

Otsu, N. (1979). "A Threshold Selection Method from Gray-level Histograms." IEEE Trans Syst Man Cyber SMC-9:62–66.

Schadt, E. E., Li, C., Su, C., and Wong, W. H. (2000). "Analyzing High-density Oligonucleotide Gene Expression Array Data." J Cell Biochem 80:192–202.

Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995). "Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray." Science 270:467–470.

Schena, M., Shalon, D., Heller, R., Chai, A., Brown, P. O., and Davis, R. W. (1996). "Parallel Human Genome Analysis: Microarray-based Expression Monitoring of 1000 Genes." Proc Natl Acad Sci USA 93:10614–10619.

Schneeberger, H. and Fleischer, K. (1993). "The Distribution of the Ratio of Two Variables." J Statistical Computation and Simulation 47:227–240.

Shanmugalingam, S. (1982). "On the Analysis of the Ratio of Two Correlated Normal Variables." The Statistician 31:251–258.

Sokal, R. R. and Rohlf, F. J. (1995). Biometry: The Principles and Practice of Statistics in Biological Research, 3rd ed. New York: W. H. Freeman and Co.

Wang, X., Ghosh, S., and Guo, S. W. (2001). "Quantitative Quality Control in Microarray Image Processing and Data Acquisition." Nucleic Acids Res 29:E75.

Wang, X. F. and Herman, B. (Eds.) (1996). Fluorescence Imaging Spectroscopy and Microscopy. New York: John Wiley & Sons.
Chapter 2

STATISTICAL CONSIDERATIONS IN THE ASSESSMENT OF CDNA MICROARRAY DATA OBTAINED USING AMPLIFICATION

Jing Wang¹, Kevin R. Coombes¹, Keith Baggerly¹, Limei Hu², Stanley R. Hamilton², and Wei Zhang²

¹Department of Biostatistics and ²Department of Pathology, The University of Texas M. D. Anderson Cancer Center, Houston, Texas, USA
1. Introduction
Data acquisition is the critical first step in microarray studies to profile gene expression. Successful data acquisition depends on obtaining consistent signals above the level of the background noise on the microarray. Many factors influence the quality and quantity of the fluorescent signals obtained from microarray experiments, including the strength of the fluorescent dyes, the sensitivity of the scanners that detect the signals, and the amount of labeled cDNA target applied to the microarray. In order to increase signal intensity, substantial efforts have been made to develop stronger dyes and to improve the sensitivity of the scanners. Both laser- and CCD-based scanners have been developed, but most commercially available scanners have similar performance (Ramdas et al., 2001). Technological improvements in dyes and scanners, by themselves, have provided incomplete solutions to the problem of generating strong, reproducible signals from microarray experiments using small amounts of starting material. The single most important factor in microarray data acquisition is the amount of RNA required. In order to obtain consistent hybridizations and adequate signals, conventional protocols require more than 50 μg of total RNA (or 1–2 μg of mRNA) as starting material, from which labeled cDNA targets are generated by incorporating Cy3-dNTP or Cy5-dNTP during reverse transcription (Fig. 2.1). To harvest this amount of RNA, several
Figure 2.1. Conventional protocol for generating fluorescence-labeled cDNA target for microarrays. cDNA is generated by reverse transcription using an oligo dT primer. The Cy-dye labeled dNTP is incorporated into the cDNA during reverse transcription.
million cultured cells or several hundred milligrams of tissue are needed. Unfortunately, this requirement cannot realistically be met in many clinical experiments. For clinical specimens obtained by core biopsy or fine needle aspiration, only a small amount of total RNA (around 2 μg) can be obtained. To study gene expression regulation during early embryogenesis, one may want to push the limit to a single cell, which would yield only 20–40 pg of total RNA (Roozemond et al., 1976; Uemura et al., 1980). Developing amplification methods that can reduce the amount of RNA required for successful microarray experiments has become an intensive area of research. However, it is necessary to determine whether an amplification protocol produces consistent, reliable results. We begin this chapter by briefly reviewing some amplification methods that have been proposed. Taking the results of experiments performed with a conventional protocol as "ground truth", we then describe a data analysis method for evaluating how well an amplification protocol works. Finally, we illustrate this analytical method with a set of data obtained in our laboratory.
2. Amplification Methods
Several methods have been developed to acquire strong signals from minute amounts of RNA. The current amplification methods for microarray experiments fall into two main categories: (1) amplification of the starting RNA material before hybridization; and (2) amplification of the fluorescence signals after hybridization for better detection, mainly involving the dendrimer (3DNA) technique (Stears et al., 2000) or an enzyme catalysis approach, such as the TSA™ assay (Alder et al., 2000).
2.1 RNA Amplification
A straightforward method for reducing the amount of required starting material is to amplify the RNA, much as DNA is amplified using the
polymerase chain reaction (PCR). RNA amplification is based on cDNA synthesis and a template-directed in vitro transcription reaction. An RNA polymerase promoter is incorporated into each cDNA molecule by priming cDNA synthesis with a synthetic oligonucleotide containing the phage T7 RNA polymerase promoter sequence. The second strand of cDNA, which will serve as the template for the RNA polymerase, can be generated either by second-strand cDNA synthesis or by a combination of a switch mechanism at the 5′ end of RNA templates and PCR (Van Gelder et al., 1990; Phillips et al., 1996; Wang et al., 2000). After synthesis of double-stranded cDNA, T7 RNA polymerase is added and anti-sense RNA (aRNA) is transcribed from the cDNA template. The processive synthesis of multiple aRNA molecules from a single cDNA template results in amplification of aRNA. The aRNA is then labeled with Cy-dye-dNTP by reverse transcription using a random hexamer primer (Fig. 2.2). In theory, the lowest limit for the total amount of RNA usable for representative amplification is at the level of 20–40 pg, which is the average amount of RNA from a single cell (Roozemond et al., 1976; Uemura et al., 1980). Gene expression profiling results become unrepresentative when the amount of RNA is below that limit, since some low-copy-number genes would be lost at the very beginning.
2.2 Fluorescent Signal Amplification
RNA amplification can increase the amount of labeled cDNA applied to the microarray. However, if the signal can be amplified at the detection step so that one molecule can yield signals equivalent to 10–100 unamplified molecules, then fewer hybridized molecules will be needed. Similar strategies have been used to detect proteins on western blots by chemiluminescence assay. The MICROMAX-TSA Labeling and Detection System (NEN Life Science Products, Inc.) is based on such a principle. The detailed protocol can be obtained from the manufacturer. Basically, tyramide signal amplification (TSA) achieves signal amplification by catalyzing reporter deposition of labeled tyramide at the site of probe binding (Bobrow et al., 1989). In this system, labeled cDNA is prepared from experimental and control samples by incorporating fluorescein-dNTP and biotin-dNTP, respectively, during reverse transcription. The two populations of cDNA are then co-hybridized to a microarray. After hybridization, stringent washes are carried out to eliminate excess and non-specifically bound cDNA. Either anti-fluorescein antibody or streptavidin is conjugated with horseradish peroxidase (HRP), an enzyme that catalyzes the deposition of Cy-dye labeled tyramide amplification reagent. The reaction results in the deposition of numerous Cy-dye labels immediately adjacent to the immobilized HRP. Using this method, a minimum starting
Figure 2.2. Illustration of the RNA amplification approach. The oligo dT primer is modified by a flanking bacteriophage T7 promoter sequence at its 5′ end, and the sequence is incorporated into the cDNA during reverse transcription. After second-strand cDNA synthesis, anti-sense RNA is amplified by in vitro transcription using RNA polymerase. Cy-dye labeled dNTP is incorporated into cDNA by reverse transcription using a random hexamer primer.
material of 0.5 μg of total RNA yields adequate signal intensities, according to the manufacturer (http://lifesciences.perkinelmer.com). The dendrimer (3DNA) detection technique, commercialized by Genisphere, is another signal amplification method (Fig. 2.3). This technique does not rely on modified nucleotide incorporation in the fluorescent labeling reaction. The dendrimers are complexes of partially double-stranded oligonucleotides, forming stable, spherical structures with a determined number of free ends where fluorescent dyes can be coupled. One dendrimer molecule carries more than 100 fluorescent molecules. A modified oligo dT primer containing dendrimer binding sequence at its 5′ end is used during reverse transcription to generate cDNA targets. Then the cDNA targets are hybridized with the dendrimer and the DNA spots on the microarray (Stears et al., 2000). From the published results, it is clear that these amplification methods can lead to amplified signals, thus reducing the requirement for large amounts of biological starting material. However, as with the well-known concern about the faithfulness of PCR amplification, there is concern about the reliability of the data obtained from amplification methods.
3. Data Analysis Strategy
The goal of an amplification protocol is to enhance the microarray signals obtained from a small amount of RNA. Initial evaluation of the protocol
Figure 2.3. Illustration of the dendrimer technique. Oligo dT primer modified with the dendrimer binding sequence is used to generate cDNA by reverse transcription. Then the cDNA is hybridized with the dendrimer through binding the complementary sequence on the dendrimer and the primer. Finally, the complexes are hybridized with the microarray slide. There can be more than 200 Cy dye molecules on each dendrimer molecule and therefore, the fluorescence signal is amplified more than 200 times.
can be achieved by visual inspection of the scanned microarray images. For example, if we use a regular protocol starting with 1 μg of total RNA, we would have an image containing only a few spots representing the most highly expressed genes. When we use the amplification protocol based on the T7 RNA polymerase (RNA amplification), we normally obtain an image with as many bright spots as we would see from the regular protocol with 100 μg of total RNA. Such an image clearly indicates successful amplification of the RNA. In a similar fashion, a good image is always the starting point for evaluating whether an indirect labeling method worked. The danger is that a bright image, by itself, is not an adequate indicator of the faithfulness of the amplification. A similar situation occurs when evaluating gene expression with PCR, which amplifies the signal but often loses quantitative power due to the rapid speed (exponential amplification) at which saturation is reached in the PCR reaction. Even in the absence of formal data analysis, it is not difficult to envision that nonlinear amplification of different RNA molecules can occur for a variety of reasons. First, different double-strand cDNA molecules have different sizes, and thus it takes different amounts of time to complete one round of RNA synthesis during the T7 RNA polymerase reaction. This problem is alleviated to some extent by the low processivity associated with T7 RNA polymerase, although variation among cDNAs of different lengths is still a factor. Consequently, the amplification procedure can change the relative amounts of transcripts in the cells, favoring short genes and genes whose secondary structure allows an early drop-off of the T7 polymerase. Second, even when amplification is performed during the detection step,
nonlinear signal enhancement can still occur. For example, when too many fluorescent dendrimers are deposited at one spot because of a high expression level, fluorescent quenching can occur. This phenomenon may adversely affect the dynamic range of the signal detection. Consequently, some criteria must be set to assess the outcome of an amplification procedure based on the goal of the intended project. In many of our projects, we want to identify genes that are differentially expressed between tumors and normal tissues or between tumors before and after exposure to drug treatment. Focusing on the ability to detect differentially expressed genes, we have developed criteria in three areas.

1. Enhancement of signal intensity for low copy number genes. A good amplification method should increase the signal intensity of low-expressing genes; i.e., the number of genes exhibiting measurable signal intensity above background levels using the amplification protocol should be greater than that using the regular protocol. To evaluate this property, we use the following criteria:

a) If a spot can consistently be seen above background using the regular protocol, then it can also be seen using the amplification protocol (a measure of consistency).

b) Some spots can be seen above background using amplification that are not seen using the regular protocol (a measure of signal enhancement).

To convert these criteria into tools we can use quantitatively, we need a precise definition of what it means to "see" a spot. We use the signal-to-noise (S/N) ratio for this assessment; a minimal sketch is given below. The S/N ratio for microarray data is computed, in most software packages, by dividing the background-corrected intensity by the standard deviation (SD) of the background pixels. We typically require that S/N > 2.0 in order to say that a gene has measurable signal intensity on an array. In other words, if the difference between signal intensity and background intensity is greater than 2.0 SD of the local background, then the gene gives adequate signal intensity.
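As a minimal illustration of this rule (the intensity values are made up for the example):

```python
def signal_to_noise(signal_mean, background_mean, background_sd):
    """S/N ratio: background-corrected intensity over background SD."""
    return (signal_mean - background_mean) / background_sd

# a spot is "seen" when S/N exceeds 2.0; the numbers here are illustrative
print(signal_to_noise(812.0, 530.0, 95.0) > 2.0)   # True: (812-530)/95 ~ 2.97
```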
2. Reproducibility and reliability of signal intensity. The reproducibility of microarray results, whether produced using the same protocol or different protocols, is important for evaluating amplification protocols. Replicate experiments performed using the amplification protocol should be as reproducible as experiments performed using the regular protocol. In other words, given an appropriate metric of the reproducibility of microarray experiments, the value of this metric when comparing two amplification experiments should be comparable to the value when comparing two regular experiments. Ideally, we should also see comparable values when comparing an amplification experiment to a regular experiment. To measure reproducibility, we use the concordance correlation coefficient, which is analogous to the Pearson correlation coefficient but measures how well a set of points matches the identity line (the 45° line). In contrast to other measures such as the Pearson correlation coefficient or the paired t-test, concordance correlation is specifically designed to measure agreement (Lin, 1989, 1992). The concordance correlation coefficient (r_c) is computed as

    r_c = r_p × (2 s_1 s_2) / (s_1^2 + s_2^2 + (μ_1 − μ_2)^2),

where r_p is the Pearson correlation coefficient. This formula expresses the concordance correlation coefficient as a product of two factors: a measure of precision (r_p, which measures the deviation from the best-fit line) and a measure of accuracy (which determines how far the best-fit line deviates from the identity line).
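As a minimal sketch (our illustration, assuming NumPy), the coefficient can be computed directly from two vectors of paired measurements:

import numpy as np

def concordance_correlation(x, y):
    # 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2),
    # which equals r_p scaled by the accuracy factor in the formula above.
    mx, my = x.mean(), y.mean()
    sxy = ((x - mx) * (y - my)).mean()
    return 2.0 * sxy / (x.var() + y.var() + (mx - my) ** 2)

# Example: points close to the identity line give a value near 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])
print(concordance_correlation(x, y))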
3. Preservation of patterns of differential gene expression. One of the major goals of microarray experiments is to find differentially expressed genes. Ideally, the same set of differentially expressed genes should be identified using microarrays produced from both the regular and amplification protocols. In other words, an ideal amplification protocol would satisfy the following criteria:
a) If a spot is identified as differentially expressed using the regular protocol, then it is also identified as differentially expressed using the amplification protocol.
b) If a spot is identified as differentially expressed using the amplification protocol and if the spot can be seen using the regular protocol, then it is also identified as differentially expressed using the regular protocol.
It is, however, overly optimistic to expect to see identical lists of differentially expressed genes generated from the two protocols. We may not even achieve this goal using the same protocol twice, due to the variability associated with microarray experiments (Lockhart et al., 1996; Hess et al., 2001; Baggerly et al., 2001). Nonetheless, we do expect most differentially expressed genes to appear on both lists.
In order to proceed quantitatively, we must be more precise about what it means for “most” genes to behave as expected. The critical step is to estimate the power π of the statistical test used to identify differentially expressed genes. For example, suppose we follow the common practice of computing a t-statistic for each gene and then declaring a gene to be differentially expressed if its t-statistic exceeds some threshold. Based on the number N of genes on the microarray, we can adjust the significance level α assigned to a single t-test to ensure that the number of false positives is small (preferably less than one). We then use α and the number of experiments to set the threshold. By making some assumptions about the intrinsic variability of the measurements and the size of the effect we are hoping to detect, we can then compute the power of this test. Suppose, to take a specific example, that 100 genes on the microarray exhibit twofold differential expression, and that both the regular protocol and the amplification protocol allow us to perform a test with power π = 0.7 for detecting these differences. Then each protocol should correctly identify about 70 differentially expressed genes. However, there is no reason for them to identify the same set of 70 genes. In fact, it is likely that they will identify different sets of genes, and that only about 49 genes (since 0.7 × 0.7 = 0.49) will be identified by both protocols. In other words, we would only expect 49 of the 70 genes identified by the regular protocol to also be identified by the amplification protocol. The “rate of agreement” in this example is equal to 49/70 = 0.7. In general, the rate of agreement is expected to equal the power of the test and so we merely require this level of agreement to conclude that the amplification protocol is consistent with the conventional protocol. After first weakening our criteria to require only that the fraction of genes identified by both tests should approximately equal the power of the test, we can ask for something slightly stronger. Genes that are identified by only one protocol should at least tend in the correct direction. More precisely, the “sign of the log ratio” should be the same with both protocols whenever a gene is identified as differentially expressed by one of the protocols.
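To make the arithmetic concrete, here is a minimal sketch of the expected-agreement calculation with the numbers used above (illustrative values only):

# Two protocols test the same 100 truly differentially expressed genes,
# each with power 0.7 and independent errors.
n_true = 100
power = 0.7

found_by_each = power * n_true        # about 70 genes per protocol
found_by_both = power ** 2 * n_true   # about 49 genes found by both

rate_of_agreement = found_by_both / found_by_each   # = power = 0.7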
4. An Example
In this section, we describe an example involving nine microarrays that illustrates how we evaluated a particular amplification method. Each microarray contains 2304 genes spotted in duplicate, along with positive and negative controls. Among the nine experiments, five used the regular protocol with varying amounts of total RNA (100 μg to 300 μg) and four used an RNA amplification protocol with 1.0 μg total RNA. Each ex-
periment compared two glioma cell lines, U251 (in the Cy5 channel) and LN229 (in the Cy3 channel).
4.1 Data Preprocessing
Normalization and threshold. Standard methods to identify differentially expressed genes typically involve taking the ratio between the signals from the two fluorescent dye channels. One problem is that different channels often have different overall intensities. These differences can occur because of different amounts of labeled target hybridized to the array, different rates of Cy3 and Cy5 degradation in storage (Cy5 is less stable than Cy3), or different gains used when scanning images. Therefore, we cannot directly take ratios between the two background-corrected values without going through a normalization process to remove systematic biases from the data. Normalization can be achieved in various ways. One method is to normalize to a "housekeeping" gene or a mixture of housekeeping genes. One significant drawback to this approach is that housekeeping genes are not always stable in all samples. We prefer a normalization method based on the assumption that the overall expression level of all genes in a high-density array is more stable than any single gene or any small set of genes. Based on this assumption, our normalization approach is to adjust the values on any array so that the median (or another fixed percentile) ratio between channels becomes equal to 1. Because some arrays contain many low expressors, we usually normalize each channel separately by rescaling the background-corrected values to set the 75th percentile of the spots equal to 1000. Under assumptions similar to those described above, this normalization method brings the median ratio between reliably expressed spots close to 1, thus giving results similar to those obtained by normalizing ratios. An advantage of this approach is that it allows us to extract individual channels from different experiments and compare them directly. After normalization, we replace all negative and all small positive normalized values by a "detection threshold" value. The setting of the detection threshold is data-driven. In our microarray experiments, we have found that a normalized value of 150 corresponds approximately to a spot whose S/N ratio equals 1. Any spot whose background-corrected intensity is below this threshold cannot be reliably distinguished from background noise. Thus, we typically use 150 for the detection threshold.
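A minimal sketch of this per-channel normalization, assuming NumPy; the percentile, target value, and detection threshold are the ones quoted above:

import numpy as np

def normalize_channel(svol, percentile=75, target=1000.0, threshold=150.0):
    # Rescale so the chosen percentile of the channel equals the target,
    # then replace values below the detection threshold.
    scaled = svol * (target / np.percentile(svol, percentile))
    return np.maximum(scaled, threshold)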
Smooth t-statistics. Microarrays contain genes that exhibit a wide spectrum of expression levels. It is, by now, well known that the detection of low signal intensities is much more variable than the detection of high signal values. Taking ratios involving low expressing genes is highly unstable. Therefore, a measure of fold difference based simply on the ratios provides biologists with incomplete (and potentially inaccurate) information. Instead, we apply a statistical approach to determine the gene expression profile on each array. Briefly, we compute "smooth t-statistics" or "studentized log ratios" by rescaling the estimates of the log ratios of gene expression levels between samples to account for the observed variability. In this approach, we estimate both the mean log intensity and the standard deviation of the spots within a channel (using the fact that all genes are spotted in duplicate on our microarrays). We then fit a smooth curve describing the standard deviation as a function of the mean log intensity. After estimating the variability within a channel, the smooth curves that estimate the within-channel standard deviation are pooled to calculate a single curve estimating the standard deviation (σ) across the entire range of intensities on the microarray. We use the pooled estimates to compute a "smooth t-statistic" for each gene by the formula

    t-score = [log_2(A) − log_2(B)]/σ,

where A and B are the geometric means of the expression levels in the two channels. (We use the term "smooth t-statistics" to distinguish these scores from the usual t-statistics that one would obtain by using the raw estimates of the standard deviation on an independent, gene-by-gene basis.) A detailed mathematical description of the application of smooth t-statistics to microarray data has been addressed in Baggerly et al. (2001), Baldi and Long (2001), and in a chapter by Baggerly et al. in this book (Chapter 4).
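The following sketch illustrates the idea, with a low-order polynomial standing in for the smooth curve fit (the published method's smoother may differ); the input vectors of duplicate-spot intensities are simulated for illustration.

import numpy as np

def smooth_t_scores(red1, red2, grn1, grn2, degree=3):
    # Duplicate-spot log intensities, one row per replicate spotting.
    lr = np.log2(np.vstack([red1, red2]))
    lg = np.log2(np.vstack([grn1, grn2]))
    # Within-channel mean and SD per gene, pooled across the two channels.
    means = np.concatenate([lr.mean(axis=0), lg.mean(axis=0)])
    sds = np.concatenate([lr.std(axis=0, ddof=1), lg.std(axis=0, ddof=1)])
    # Smooth curve for SD as a function of mean log intensity.
    coef = np.polyfit(means, sds, degree)
    sigma = np.polyval(coef, (lr.mean(axis=0) + lg.mean(axis=0)) / 2)
    # Studentized log ratio of the geometric means of the two channels.
    return (lr.mean(axis=0) - lg.mean(axis=0)) / sigma

rng = np.random.default_rng(0)
red1 = rng.lognormal(6.0, 1.5, 1152)
red2 = red1 * rng.lognormal(0.0, 0.2, 1152)
grn1 = rng.lognormal(6.0, 1.5, 1152)
grn2 = grn1 * rng.lognormal(0.0, 0.2, 1152)
t = smooth_t_scores(red1, red2, grn1, grn2)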
4.2 Data Analysis
In this section, we apply the analytical method described in section 3 to the normalized log intensities and smooth t-statistics computed from our microarray experiments. Enhancement of signal intensity for low copy number genes. To evaluate whether the amplification protocol preserves signals that are seen using the regular protocol, we evaluated each channel (U251 labeled with Cy5 and LN229 labeled with Cy3) separately. In each channel, we located all the spots that consistently exhibited adequate signal strength (S/N > 2.0) in all experiments conducted using the regular protocol (Fig. 2.4). We then computed the percentage of those genes that also have S/N > 2.0 in all microarray experiments conducted using the amplification protocol. High percentage agreements are expected if the amplification protocol succeeded. We found that 95% to 100% of the genes that were consistently
[Figure 2.4 shows two panels, "U251 Channels (of 2304 Genes)" and "LN229 Channels (of 2304 Genes)", each plotting the number of genes > 2 SD above background (roughly 500 to 2000) against the total RNA used (1 to 300 μg).]
Figure 2.4. Number of genes exhibiting adequate signal (S/N > 2) in experiments comparing the glioma cell lines U251 and LN229. Each point on the graph corresponds to a separate microarray experiment, identified by the amount of total RNA used to prepare the labeled cDNA targets. Arrays produced with 100 to 300 μg total RNA used the regular protocol; arrays produced with 1.0 μg total RNA used the amplification protocol.
found to exhibit adequate signal using the regular protocol were also found to exhibit adequate signal using the amplification protocol. Reproducibility and reliability of signal intensity. We assessed the reproducibility of microarray experiments by computing the concordance correlation coefficient between the smooth t-scores. In our study, the concordance correlation coefficient between experiments using the regular protocol ranged from 0.574 to 0.835 across 5 microarrays, with a median value of 0.739. The concordance correlation coefficient between experiments using the amplification protocol ranged from 0.843 to 0.894 across 4 microarrays, with a median value of 0.873. The concordance correlation coefficient between experiments that used different protocols ranged from 0.562 to 0.796, with a median value of 0.652. As measured by the concordance correlation coefficient in this study, replicate microarray experiments using the amplification protocol were at least as reproducible as microarray experiments using the regular protocol. In addition, the concordance coefficients when comparing a microarray experiment using the amplification protocol to an experiment using the regular protocol were in the same range as those comparing two experiments that both used the regular protocol. In other words, the results obtained using the amplification protocol were consistent with the results obtained using the regular protocol. Figure 2.5 illustrates the concordance of smooth t-scores in three typical pairwise comparisons of experiments.
[Figure 2.5 shows three scatterplots of smooth t-scores, each axis running from −10 to 10: "Regular vs Regular" (R3 vs R1, concordance = 0.799), "Amplified vs Amplified" (A7 vs A6, concordance = 0.889), and "Regular vs Amplified" (A6 vs R1, concordance = 0.796).]
Figure 2.5. Demonstration of concordance correlation of ‘smooth t-scores’ between (left) two arrays produced using the regular protocol, (middle) two arrays produced using the amplification protocol, and (right) arrays produced using different protocols.
Assessment of differentially expressed genes. To identify differentially expressed genes, we computed a combined set of smooth t-statistics using all 5 microarrays produced using the regular protocol. Similarly, we computed a set of t-statistics using the data from all 4 experiments performed using the amplification protocol. Genes are considered to be differentially expressed if the smooth t-statistic exceeds a threshold value. We chose |t-score| > 4.0 as the cutoff for our analysis. Because we measured each gene in duplicate in each channel of each microarray, we can assume that the t-statistics follow a t-distribution with at least 14 degrees of freedom. (There are at least 16 observations of each gene: 4 arrays, 2 channels per array, and 2 observations per channel.) Thus, the cutoff value of 4.0 corresponds to a significance level on an individual two-sided t-test of α = 0.0013. Since there are 2304 genes on our array, this test should find fewer than 3 false positives from each set of experiments. In order to compute the power of this test to identify genes that exhibit twofold differential expression, we must estimate the standard deviation. For most genes expressed at moderate to high levels (i.e., with S/N > 2.0), we have observed within-group standard deviations of the base-two log intensities on the order of 0.5 or 0.6. Taking α as above, with at least 8 observations of each gene in each channel, this test has a power between 0.55 and 0.78. As described earlier, we would expect the rate of agreement between the two protocols to equal the power. In our experiments, we identified 26 differentially expressed genes using the amplification protocol and 20 genes using the regular protocol, with 14 genes in common. These observations yield agreement rates of 14/20 = 0.70 and 14/26 = 0.54,
[Figure 2.6 shows a scatterplot of the t-scores computed from all amplified arrays against the t-scores computed from all unamplified arrays, each axis running from −10 to 10; concordance = 0.785.]
Figure 2.6. Differentially expressed genes identified using two different protocols. Genes identified by both protocols are indicated by squares. Genes identified as differentially expressed on amplified arrays, but not on unamplified (regular) arrays, are indicated by triangles. Genes identified as differentially expressed on unamplified arrays but not on amplified arrays are indicated by circles. The vertical and horizontal dotted lines correspond to the cutoff values ±4.
just as predicted. Moreover, all the genes identified as differentially expressed by one of the protocols “tend in the correct direction”, in the sense that the sign of the log ratio is the same (Fig. 2.6). In addition, seven of the differentially expressed genes were tested by northern blotting assay and were confirmed (data not shown).
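The significance arithmetic quoted above is easy to verify; a quick check assuming SciPy (not part of the original analysis):

from scipy import stats

df = 14
alpha = 2 * stats.t.sf(4.0, df)            # two-sided p at |t| = 4.0, about 0.0013
expected_false_positives = 2304 * alpha    # about 3 genes across the whole array
print(alpha, expected_false_positives)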
5. Discussion
Although amplification protocols can produce a multitude of bright spots in the images of microarray experiments, this success does not necessarily mean that the information so acquired is reliable. Systematic biases and nonlinear amplification effects have the potential to distort the profile of gene expression in individual samples. In spite of these risks, there is tremendous pressure to reduce the amount of RNA required to perform microarray experiments. Before deciding to use one of the many amplification protocols that have been proposed, it is important to decide if the protocol can be used to produce consistent, reliable results. In this chapter, we have described a statistical method to evaluate the faithfulness of amplification protocols used to produce microarray data.
Our evaluation is based on the idea that microarray experiments conducted using a conventional protocol provide a level of "ground truth" or a "gold standard" to which the amplification results can be compared. From this perspective, we require successful amplification protocols to satisfy three conditions. First, most spots (more than 95%, say) that are visible on microarrays produced using a conventional protocol should remain visible on arrays produced using the amplification protocol. The utility of this criterion depends on the definition of visibility; we require S/N > 2.0. There is no particular reason to prefer the value 2.0; one can set more or less stringent requirements on the signal-to-noise ratio. Alternatively, one can define visibility directly in terms of the background-corrected signal intensity, by finding a level that most effectively separates the intensities of positive and negative controls on the array. Second, replicated microarrays produced using the amplification protocol should be as reproducible as replicate arrays using the conventional protocol. This reproducibility can be assessed by computing the concordance correlation between normalized or studentized intensities or log ratios. Other measures of agreement can also be used, but we feel that measures of how well the data fit the identity line are more appropriate than statistics that measure more general linear trends. Third, the set of genes identified as differentially expressed using the amplification protocol should be consistent with the set identified using the conventional protocol. The method described here can be applied whenever it is possible to estimate the power of the statistical test being used to identify differentially expressed genes; the expected level of agreement only depends on the power. We have illustrated these three criteria by applying them successfully to a set of replicate experiments performed in our laboratory to compare two cell lines using both an amplification protocol and a conventional protocol. The results indicated that the amplification protocol produced reproducible, reliable microarray data that was consistent with the regular protocol. Fundamental to our approach is the assumption that the conventional protocol provides a gold standard to which the amplification protocol can be compared. An alternative view is that an amplification protocol can be evaluated strictly on its own terms. From this perspective, the critical issue is the reproducibility of the results obtained using the amplification protocol, which can only be assessed by performing numerous replicate experiments with the amplification protocol. One advantage to our analysis is that it allows for the possibility of comparing two different amplification
methods to determine if one of them more faithfully reproduces results consistent with those obtained using a conventional protocol.
Acknowledgments

This work was partially supported by the Tobacco Settlement Fund appropriated by the Texas Legislature, and by a generous donation from the Kadoorie Foundation to the Cancer Genomics Core Program.
References

Adler, K., Broadbent, J., Garlick, R., Joseph, R., Khimani, A., Mikulskis, A., Rapiejko, P., and Killian, J. (2000). "MICROMAX™: A Highly Sensitive System for Differential Gene Expression on Microarray." In M. Schena, ed., Microarray Biochip Technology, 221–230.

Baggerly, K. A., Coombes, K. R., Hess, K. R., Stivers, D. N., Abruzzo, L. V., and Zhang, W. (2001). "Identifying Differentially Expressed Genes in cDNA Microarray Experiments." J Comp Biol 8:639–59.

Baldi, P. and Long, A. D. (2001). "A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized T-test and Statistical Inferences of Gene Changes." Bioinformatics 17:509–19.

Bobrow, M. N., Harris, T. D., Shaughnessy, K. J., and Litt, G. J. (1989). "Catalyzed Reporter Deposition, a Novel Method of Signal Amplification. Application to Immunoassays." J Immunol Methods 125:279–285.

Hess, K. R., Zhang, W., Baggerly, K. A., Stivers, D. N., and Coombes, K. R. (2001). "Microarrays: Handling the Deluge of Data and Extracting Reliable Information." Trends Biotechnol 19:463–8.

Lin, L. (1989). "A Concordance Correlation Coefficient to Evaluate Reproducibility." Biometrics 45:255–268.

Lin, L. (1992). "Assay Validation Using the Concordance Correlation Coefficient." Biometrics 48:599–604.

Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., and Brown, E. L. (1996). "Expression Monitoring of Hybridization to High-density Oligonucleotide Arrays." Nat Biotechnol 14:1675–80.

Phillips, J. and Eberwine, J. H. (1996). "Antisense RNA Amplification: A Linear Amplification Method for Analyzing the mRNA Population from Single Living Cells." Methods 10:283–8.

Ramdas, L., Wang, J., Hu, L., Cogdell, D., Taylor, E., and Zhang, W. (2001). "Comparative Evaluation of Laser-based Microarray Scanners." BioTechniques 31:546–552.
Roozemond, R. C. (1976). "Ultramicrochemical Determination of Nucleic Acids in Individual Cells Using the Zeiss UMSP-I Microspectrophotometer. Application to Isolated Rat Hepatocytes of Different Ploidy Classes." Histochem J 625–638.

Stears, R. L., Getts, R. C., and Gullans, S. R. (2000). "A Novel, Sensitive Detection System for High-density Microarrays Using Dendrimer Technology." Physiol Genomics 3:93–99.

Uemura et al. (1980). "Age-related Changes in Neuronal RNA Content in Rhesus Monkeys (Macaca mulatta)." Brain Res Bull 5:117–119.

Van Gelder, R. N., Von Zastrow, M. E., Yool, A., Dement, W. C., Barchas, J. D., and Eberwine, J. H. (1990). "Amplified RNA Synthesized from Limited Quantities of Heterogeneous cDNA." Proc Natl Acad Sci USA 87:1663–7.

Wang, E., Miller, L. D., Ohnmacht, G. A., Liu, E. T., and Marincola, F. M. (2000). "High-fidelity mRNA Amplification for Gene Profiling." Nat Biotechnol 18:457–459.
Chapter 3
SOURCES OF VARIATION IN MICROARRAY EXPERIMENTS
Experimental design and the analysis of variance for microarrays

Kathleen F. Kerr (1), Edward H. Leiter (2), Laurent Picard (3), and Gary A. Churchill (2)
(1) Department of Biostatistics, University of Washington, Seattle, Washington, USA; (2) The Jackson Laboratory, Bar Harbor, Maine, USA; (3) Corning, Inc., Corning, New York, USA
Introduction

Multiple factors and interactions of those factors contribute to variation in gene expression microarray data. For example, certain kinds of systematic variation are most commonly dealt with via data "normalization." The ultimate goal in the analysis of microarray data is to account for different sources of variation in order to "isolate" the effects of the factors of interest. In this chapter we use a small experiment to illustrate some principles of experimental design and scientific inference for microarray studies. Our focus is on statistical and analytic issues rather than the biological results of the particular data example. Brown and Botstein (1999) give an introduction to microarray technology. In brief, DNA sequences are spotted and immobilized onto a glass slide or other surface, the array. Each spot contains a single sequence, although the same sequence may be spotted multiple times per array. RNA is extracted from cell populations of interest. Purified mRNA is reverse-transcribed into cDNA and a dye-label is incorporated. The standard fluors are Cy3 and Cy5, commonly referred to as "green" and "red." Green- and red-labeled pools of cDNA are mixed together and washed over an array. Single strands of dye-labeled cDNA hybridize to their complementary sequences spotted on the array, and unhybridized cDNA is washed off. The array is scanned, producing a "red" and "green" image for each slide.
Analysis of this image (Brown, Goodwin, and Sorger, 2001) produces a “red” and “green” reading for each spot. There is weak control over the amount of DNA probe available for hybridization in each spot. This means it is impossible to tell whether a large fluorescent intensity is the result of high abundance of the transcript in the RNA or simply due to a particularly large or dense spot. However, the relative red and green intensity contains information about the relative expression of a gene in the two samples based on the principle that the sample that contained more transcript should produce the higher signal. This principle provides the rationale for the typical analysis of microarray data, which is based on using the ratio of red and green signal as the starting point for the analysis. The primary alternative to a purely ratio-based approach is to use ANOVA models with random spot effects (Wolfinger et al., 2001). There is a mathematical relationship between such ANOVA models and the usual ratio-based analysis, as explained by Kerr (2003a).
1. The Experiment
Spotted cDNA microarrays were used to study the expression of genes in the liver related to type II diabetes. NZO/HlLt male mice represent a model of maturity-onset type II diabetes. Mice of this strain are large at weaning, and gain weight rapidly thereafter. The rapid development of post-weaning obesity leads to the development of insulin resistance, and eventually type II diabetes in males (Leiter et al., 1998). If the diet fed to weaning mice is supplemented with 0.001% CL316,243, a beta-3 adrenergic receptor agonist, the metabolic defects are in large part normalized such that obesity is blunted and diabetes is averted. Prototype mouse arrays were printed on CMT-GAPS coated slides using Corning proprietary high-throughput printing technology. Seventy-six genes associated with growth and metabolism were spotted four times on each array and two additional genes were spotted 16 times per array, for a total of seventy-eight unique sequences. In order to screen for genes with dysregulated expression in mice whose diet is unsupplemented with CL316,243, we compared gene expression in livers of treated and control NZO/HlLt male mice. All mice were fed for 4 weeks from weaning on NIH 31 diet (4% fat). The chow of the treated mice was supplemented with 0.001% CL316,243.
2. Experimental Design
Two arrays were used to compare the treated and control RNA samples. The dye-labeling was switched between the arrays. The design of this experiment is most commonly referred to as a “dye swap,” although it has
also been described as a "flip fluor" or Latin Square design (Kerr, Martin, and Churchill, 2000). Table 3.1 shows the experimental layout.

Table 3.1. Experimental design.

              Dye 1/"Green"    Dye 2/"Red"
   Array 1    Treatment        Control
   Array 2    Control          Treatment

It turns out that the dye-swap design has a structure that is conducive to efficient estimation of various effects in the data. Following Kerr, Martin, and Churchill (2000) and Kerr and Churchill (2001a and 2001b), we identify four basic experimental factors: arrays (A), dyes (D), the treated and control RNA varieties (V), and genes (G). With four binary factors there are 2^4 = 16 possible factorial effects. However, not all of these effects are identifiable. Table 3.2 gives the confounding structure inherent in microarray data arising from this design (Cochran and Cox, 1992). Each experimental effect is confounded with one other effect, its alias. Non-aliased effects are orthogonal.

Table 3.2. Confounding structure of the dye-swap experiment. The symbol ∼ identifies aliased effects. A = Array, D = Dye, V = Variety of mRNA, G = Gene.

   mean ∼ ADV        G  ∼ ADVG
   A    ∼ DV         AG ∼ DVG
   D    ∼ AV         DG ∼ AVG
   V    ∼ AD         VG ∼ ADG

The effects of interest are expression levels of genes specifically attributable to varieties, i.e. variety×gene (VG) interactions. In order to measure VG effects it is necessary to assume that ADG effects are 0. A three-way interaction of arrays, dyes, and genes would mean that every measurement depends on the channel and array where it is taken, and that this dependence is different for every gene. In effect, this says there is no consistency across arrays and that the technology is ineffective. An advantage of the dye-swap design is that VG effects are orthogonal to all other effects. This means that other factorial effects will not bias estimates of VG effects, and accounting for other sources of variation does not reduce the precision of VG estimates. The orderly structure of the dye-swap design is a consequence of the balance among the factors. In detail:
1. Varieties and dyes are balanced. Each variety is labeled with both dyes and each dye-labeled sample is used equally often in the experiment.
2. Varieties and arrays are balanced. Each variety appears on each array.
3. Arrays and dyes are balanced. Each array has a red channel and a green channel.
4. Genes and arrays are balanced. The same set of genes is spotted on each array.
5. Genes and dyes are balanced. For every gene there are an equal number of red signals and green signals.
6. Genes and varieties are balanced. Associated with every variety are fluorescent intensities for the same set of genes.
Which of these properties of balance can be expected to hold for a general microarray design, other than the simple dye-swap design considered here? It is reasonable to assume that the same set of genes is spotted on every array in an experiment because of the automated printing process. It follows that genes are balanced with respect to the other factors and properties (4)-(6) hold. Also, by the nature of the technology, each array has a red channel and a green channel, so property (3) holds. Property (1) may or may not hold — this is determined by the design. If each sample is labeled with both the red and green dyes and the design uses each equally often in the experiment, then varieties and dyes will be balanced. Kerr and Churchill (2001a) call such designs even designs. In general microarray experiments, property (2) will not hold. This is because each array is strictly limited to two varieties since there are two fluorescent dyes. When there are more than two varieties, every variety cannot be assayed on every array, so arrays and varieties will not be balanced. In this case variety effects and array effects are not orthogonal, but partially confounded. Consequently, variety×gene effects (the effects of interest) are partially confounded with array×gene effects. As discussed in the next section, array×gene effects correspond to variation due to spots and are expected to be an important source of variation in microarray data. Therefore this partial confounding means that the experimental design is a key determinant in how precisely the effects of interest, VG, are estimated (Kerr and Churchill, 2001a).
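One compact way to reproduce the aliasing pattern of Table 3.2 in code: because variety is determined jointly by array and dye in the dye swap, ADV acts as a defining word, and each effect is aliased with its symmetric difference against ADV. The following sketch is our illustration of that observation, not part of the original chapter:

ORDER = "ADVG"
DEFINING = set("ADV")   # the array-dye-variety combination fixed by the design

def alias(effect):
    partner = set(effect) ^ DEFINING    # symmetric difference against ADV
    return "".join(c for c in ORDER if c in partner) or "mean"

for effect in ["", "A", "D", "V", "G", "AG", "DG", "VG"]:
    print((effect or "mean").ljust(4), "~", alias(effect))
# prints: mean ~ ADV, A ~ DV, D ~ AV, V ~ AD,
#         G ~ ADVG, AG ~ DVG, DG ~ AVG, VG ~ ADG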
3. Data Analysis
We began the analysis of the experiment with simple plots to get an overview of the data. Following the recommendation of Yang et al. (2002),
we plotted the log-ratio of the red and green signal for each spot against the mean log intensity. These plots, which we refer to as RI plots (ratio-intensity plots), are a simple but powerful diagnostic tool (Cui, Kerr, and Churchill, 2003). Ideally, this plot should appear as a horizontal band of points representing genes that are not differentially expressed, with some points above or below the band representing genes that are differentially expressed. Instead, we see curvature in these plots (Fig. 3.1 (a) and (b)). Consequently, we applied the "shift-log" data transformation (Kerr et al., 2002), which assumes there is a difference in the dyes that is additive on the raw data scale. The transformation requires estimating a single parameter, the "shift," for each array. Fix an array i and let R be the vector of red signal intensities and G be the vector of green signal intensities. We estimate a constant shift s_i for array i = 1, 2 to minimize the absolute deviation from the median of log(G + s_i) − log(R − s_i). The estimated shifts are s_1 = 1728.8 and s_2 = 1801.6. In other words, for array 1 the transformed data are log(G + 1728.8), log(R − 1728.8), and similarly for array 2. Figure 3.1 (c) and (d) show the RI plots for the shift-log data. Comparing (a) with (c) and (b) with (d), we see the shift-log transform straightens these plots.
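A minimal sketch of the shift estimation, assuming SciPy and one reading of the criterion (total absolute deviation from the median, with the shift constrained to keep R − s positive):

import numpy as np
from scipy.optimize import minimize_scalar

def estimate_shift(R, G):
    # Total absolute deviation from the median of log(G + s) - log(R - s).
    def objective(s):
        d = np.log(G + s) - np.log(R - s)
        return np.abs(d - np.median(d)).sum()
    upper = R.min() - 1.0               # keep R - s positive
    return minimize_scalar(objective, bounds=(0.0, upper),
                           method="bounded").x

# Hypothetical intensities for one array.
rng = np.random.default_rng(0)
R = rng.lognormal(8.0, 1.0, 500) + 2000.0
G = rng.lognormal(8.0, 1.0, 500)
s_hat = estimate_shift(R, G)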
[Figure 3.1 shows four RI plots, panels (a)-(d): Array 1 and Array 2 for the log data (top) and the shift-log data (bottom), with Mean Intensity (roughly 8 to 14) on the x-axis and Difference on the y-axis.]
Figure 3.1. Plots of the (adjusted) log ratio against the mean intensity of every spot for the log data and the shift-log data. Like symbols represent the four or sixteen spots for the same sequence.

Like symbols were used in Fig. 3.1 to plot the four (or sixteen) points corresponding to the multiple spots of each gene. Consider the four squares above the mass of the data in the plots for array 1. Notice that these squares
are below the mass of points in array 2. It is logical that the points are in opposite positions in the two plots because the dyes were switched. In contrast, consider the diamonds in the upper-left corner of the plots for array 1. Under close inspection we see these points stay in approximately the same position for array 2. This gene, Igf2, which encodes the insulin-like growth factor-2 (IGF-2), consistently produced higher signal in the green fluor, regardless of which sample was labeled green. Note that if only one array had been used in this experiment instead of two, there would have been no way to see that the differential signal was an artifact of the dyes and not due to differential expression between the varieties. This is a clear demonstration of the problems presented with confounding – on a single array, dye and variety are completely confounded. If only one array had been used, one would have been tempted to conclude that Igf2 is differentially expressed based on its interaction with dye. In our particular data example we could identify this gene-specific dye effect by inspecting simple scatterplots because there are only 78 genes. However, it is more typical for many thousands of genes to be spotted. With such a large amount of data it is impractical to expect to identify such anomalies simply by inspecting scatterplots. Analysis of variance (ANOVA) methods (Kerr, Martin, and Churchill, 2000) provide a tool to automatically detect and quantify such artifacts in the data. Let y_{ijkgr} be the shift-log transformed data value from array i = 1, 2, dye j = 1, 2, variety k = 1, 2, and spot r = 1, ..., s_g of gene g = 1, ..., 78, where s_g = 4 or 16 depending on how many times gene g was spotted. Consider the ANOVA model

    y_{ijkgr} = μ + A_i + D_j + V_k + G_g + (AG)_{igr} + (VG)_{kg} + (DG)_{jg} + ε_{ijkgr},

where μ is the overall mean, A_i is the overall array effect, D_j is the overall dye effect, V_k is the overall variety effect, and G_g is the overall gene effect across the other factors. The (AG)_{igr} terms capture spot effects, which arise because there is variation in the amount of DNA deposited in each spot during the printing process. The (DG)_{jg} terms capture gene-specific dye effects such as those observed for Igf2. The (VG)_{kg} effects represent levels of signal intensity for genes that can specifically be attributed to the treatment and control varieties. These are the effects of interest. The errors ε_{ijkgr} are assumed to be independent and have mean 0 – we will return to discuss the independence assumption, which is doubtful. We note that all sixteen possible experimental effects of array, dye, variety, and gene are directly or indirectly accounted for in this model because of the confounding described in Table 3.2. Table 3.3 gives the analysis of variance for the transformed data.
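Before turning to the table, note that the balanced dye swap makes the effects of interest easy to estimate directly: the per-gene treatment-minus-control contrast can be computed as an average of within-array differences, in which dye and array effects cancel. A minimal sketch with hypothetical data (our illustration, not the authors' code):

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(2, 2, 78))   # array x dye x gene, shift-log scale

# Dye swap: on array 1 the treatment is green (dye 1); on array 2 it is red.
treated_minus_control = 0.5 * ((y[0, 0] - y[0, 1]) + (y[1, 1] - y[1, 0]))
# Each entry estimates the overall variety difference plus the per-gene
# VG contrast; dye and gene-specific dye effects cancel because the
# treatment is labeled green on one array and red on the other.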
Figure 3.2. Histograms of estimated effects for differences in (a) dye-by-gene interactions and (b) variety-by-gene interactions. Vertical lines in (b) mark the limits for a gene to be found differentially expressed according to bootstrap 99.9% confidence bounds. Thus the proportion of the histogram outside these bounds indicates the proportion of genes for which there was evidence of differential expression at the 99.9% confidence level.
Because genes were spotted multiple times on the arrays and we are assuming independence for every data value, degrees of freedom remain to estimate error even though all factorial effects are represented. The dye×gene effect has the smallest mean square, but it is 11 times larger than the residual mean square. Figure 3.2(a) shows a histogram of the estimated dye×gene effects for the 78 genes. There is one outlier in the distribution, corresponding to Igf2. This demonstrates how ANOVA automatically detects effects in the data. The experimental objective is to identify genes that are differentially expressed between the treated and control samples. In statistical language, we are interested in genes with significant variety×gene interactions.

Table 3.3. Analysis of variance. Notation: SS = sum of squares; df = degrees of freedom; MS = mean square.

   Source           SS        df     MS
   Array            0.11      1      0.11
   Dye              3.35      1      3.35
   Variety          45.69     1      45.69
   Gene             917.53    77     11.92
   Spot             145.19    593    0.46
   Variety*Gene     66.46     77     0.86
   Dye*Gene         21.30     77     0.28
   Residual         19.72     516    0.0382
   Adjusted Total   1219.35   1343

In order to decide which genes show convincing evidence of differential ex-
pression, we need to put error-bars on the estimates of differential expression. The residuals from this analysis are heavier-tailed than the normal distribution, so standard confidence intervals based on the normal distribution are not appropriate. Instead, we use a bootstrapping procedure (Efron and Tibshirani, 1986; Kerr, Martin, and Churchill, 2000). Figure 3.2(b) is a histogram of the estimates of (VG)_{2g} − (VG)_{1g} for the 78 genes. The vertical lines on the histogram show the threshold for differential expression determined by the bootstrapping procedure for 4-fold spot replication. Ten thousand bootstrap simulations were used. Figure 3.3 plots the 78 estimates of (VG)_{2g} − (VG)_{1g} in ascending order with 99.9% bootstrap confidence bounds. The two genes spotted with 16-fold replication instead of 4-fold replication are easily identified since their confidence intervals are half as wide. There is strong evidence that five genes, which are labeled in the plot, are differentially expressed. The gene whose expression shows the highest increase in expression in response to drug treatment is Mt2, encoding the metallothionein-2 protein involved in detoxification and metabolism of heavy metals. The gene whose expression shows the greatest decrease is Pparα, encoding the peroxisome proliferator-activated receptor alpha, a transcription factor regulating genes associated with fatty acid oxidation in the liver. In addition to the five labeled genes, a handful of other genes have confidence bounds that do not contain 0. These genes are candidates for further study, but the evidence of differential expression is much less compelling.
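A sketch of how such bootstrap thresholds can be obtained, under one plausible reading of the procedure (resample residuals, recompute the per-gene contrast, and take the 0.05 and 99.95 percentiles); the published procedure may differ in detail:

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(2, 2, 78))            # array x dye x gene, as before

def vg_contrast(a):
    return 0.5 * ((a[0, 0] - a[0, 1]) + (a[1, 1] - a[1, 0]))

resid = y - y.mean(axis=(0, 1))            # crude gene-wise residuals (sketch)
boot = np.empty((10_000, 78))
for b in range(boot.shape[0]):
    e = rng.choice(resid.ravel(), size=y.shape, replace=True)
    boot[b] = vg_contrast(e)               # contrast under the null
lo, hi = np.percentile(boot, [0.05, 99.95])    # symmetric 99.9% thresholds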
[Figure 3.3 plots the estimated log relative expression for all 78 genes in ascending order (roughly −3 to 3), with the labeled genes Ppara and Igf2r at the low end and Gh, Igfbp1, and Mt2 at the high end.]
Figure 3.3. Estimated log-relative expression with 99.9% bootstrap confidence bounds.
4. Discussion
The analysis of variance provides a statistical foundation for analyzing microarray data that can be seen as an extension of a simple ratio-based analysis (Kerr, 2003a). In contrast to using a simple two- or three-fold threshold in the log ratios to select genes, the ANOVA approach uses the data to both estimate relative expression and also to learn the magnitude of the error in the data. This allows one to put error-bars on estimates of relative expression and gauge the strength of the evidence for differential expression. The importance of replication in microarray experiments has been stressed by many authors (Kerr and Churchill, 2001b; Lee et al., 2000; Dobbin and Simon, 2002; Kerr, 2003b). In this experiment there were two kinds of replication: two arrays were used instead of one, and genes were spotted multiple times per array. Both kinds of replication are purely technical and provide no information about the biological variability of gene expression (Dobbin and Simon, 2002; Kerr, 2003b). That is, inferences are limited to the two particular RNA samples that were assayed. Any statistical inference is about these particular RNAs, and the inference is made in the context of the variability inherent in the technology. An additional, important aspect that is beyond the scope of this chapter is that replicate arrays and replicate spots represent different levels of technical replication. Replicate arrays demonstrate greater variability than replicate spots because they represent separate hybridizations and therefore additional layers of variability. In the simple analysis presented here we did not distinguish these two levels of replication, but this could be handled with a more sophisticated mixed model. For purposes of illustration we treated all data as independent, which gave us error degrees of freedom to illustrate the bootstrapping procedure. However, by ignoring correlations in the data, our analysis was anti-conservative – confidence bounds should be wider than those presented. A more appropriate analysis would include random effects for spots (nested within arrays), which would reflect the fact that data from repeated spots on the same array are correlated. In addition to replication, another fundamental aspect of good design is balance. In the dye swap design there is total balance among arrays, dyes, varieties, and genes. The effects of interest, VG, are orthogonal to all other factorial effects except one (ADG), which is reasonably assumed to be zero. In particular, VG effects are orthogonal to gene-specific dye effects DG. In this experiment we uncovered a striking example of DG effects. We have seen DG effects in other data, and thus advocate even designs: designs where RNA varieties and dyes are balanced (Kerr and Churchill, 2001a). Such balance ensures that DG and VG ANOVA ef-
fects are orthogonal so that accounting for the former does not alter the latter. By minding the principles of good design, balance and replication, experimenters can conduct statistically sound microarray experiments and get the most out of their microarray data.
Acknowledgments

The authors thank Valérie Lemée and Bruno Caubet for assistance printing and testing these microarrays, Peter Reifsnyder for running the experiment, and Chunfa (Charles) Jie for critical reading of a previous version of this manuscript. This work was supported by a Burroughs-Wellcome post-doctoral fellowship from the Program in Mathematics and Molecular Biology to KK, NIH grants DK56853 and RR08911 to EHL, and NIH grants CA88327, DK58750, and CA34196 to GAC.
References

Brown, P. O. and Botstein, D. (1999). "Exploring the New World of the Genome with DNA Microarrays." Nature Genetics 21:33–37.

Brown, C. S., Goodwin, P. C., and Sorger, P. K. (2001). "Image Metrics in the Statistical Analysis of DNA Microarray Data." Proceedings of the National Academy of Sciences of the USA 98:8944–8949.

Cochran, W. G. and Cox, G. M. (1992). Experimental Design, 2nd edition. New York: John Wiley & Sons.

Cui, X., Kerr, M. K., and Churchill, G. A. (2003). "Transformations for cDNA Microarray Data." Statistical Applications in Genetics and Molecular Biology 2: http://www.bepress.com/sagmb/vol2/iss1/art4

Dobbin, K. and Simon, R. (2002). "Comparison of Microarray Designs for Class Comparison and Class Discovery." Bioinformatics 18:1438–1445.

Efron, B. and Tibshirani, R. (1986). "Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy." Statistical Science 1:54–77.

Kerr, M. K. (2003a). "Linear Models for Microarray Data Analysis: Hidden Similarities and Differences." Journal of Computational Biology 10:891–901.

Kerr, M. K. (2003b). "Design Considerations for Efficient and Effective Microarray Studies." Biometrics 59:822–828.

Kerr, M. K., Afshari, C. A., Bennett, L., Bushel, P., Martinez, J., Walker, N., and Churchill, G. A. (2002). "Statistical Analysis of a Gene Expression Microarray Experiment with Replication." Statistica Sinica 12:203–217.
Kerr, M. K. and Churchill, G. A. (2001a). "Experimental Design for Gene Expression Microarrays." Biostatistics 2:183–201.

Kerr, M. K. and Churchill, G. A. (2001b). "Statistical Design and the Analysis of Gene Expression Microarray Data." Genetical Research 77:123–128.

Kerr, M. K., Martin, M., and Churchill, G. A. (2000). "Analysis of Variance for Gene Expression Microarray Data." Journal of Computational Biology 7:819–837.

Lee, M.-L. T., Kuo, F. C., Whitmore, G. A., and Sklar, J. (2000). "Importance of Replication in Microarray Gene Expression Studies: Statistical Methods and Evidence from Repetitive cDNA Hybridizations." Proceedings of the National Academy of Sciences of the USA 97:9834–9839.

Leiter, E. H., Reifsnyder, P. C., Flurkey, K., Partke, H.-J., Junger, E., and Herberg, L. (1998). "NIDDM Genes in Mice: Deleterious Synergism by Both Parental Genomes Contributes to Diabetogenic Thresholds." Diabetes 47:1287–1295.

Wolfinger, R. D., Gibson, G., Wolfinger, E. D., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C., and Paules, R. S. (2001). "Assessing Gene Significance from cDNA Microarray Expression Data via Mixed Models." Journal of Computational Biology 8:625–637.

Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., and Speed, T. P. (2002). "Normalization for cDNA Microarray Data: A Robust Composite Method Addressing Single and Multiple Slide Systematic Variation." Nucleic Acids Research 30:e15.
Chapter 4
STUDENTIZING MICROARRAY DATA

Keith A. Baggerly (1), Kevin R. Coombes (1), Kenneth R. Hess (1), David N. Stivers (1), Lynne V. Abruzzo (2), and Wei Zhang (3)
(1) Section of Bioinformatics, Department of Biostatistics, (2) Department of Hematopathology, (3) Cancer Genomics Laboratory, Department of Pathology, University of Texas M. D. Anderson Cancer Center, Houston, Texas, USA
1. Introduction
Microarrays let us measure relative expression levels of thousands of genes simultaneously. While there are a variety of microarray platforms (e.g., radioactively labeled cDNA on nylon membranes, Affymetrix gene chips) we will focus on spotted cDNA arrays. These involve cDNA samples labeled with fluorescent dyes, hybridized to probes spotted by a robotic arrayer onto a glass substrate. Two samples are hybridized to each array, with the two samples labeled with different dyes. The most commonly used dyes are Cy5 (red) and Cy3 (green). Scanning the array produces a greyscale intensity image for each sample; we will refer to these as the two "channels" of the array. While these two samples may both be of interest, it is often the case that there will be one sample of primary interest and the other sample will be from a "control" or "reference" substance (typically a mixture of cell lines). The question addressed by microarrays is whether the expression level of a given gene in the sample of interest is different from the level in the control sample. This difference is measured as a ratio or log ratio. Data from multi-array experiments is often summarized in matrix form, with rows corresponding to genes, columns corresponding to arrays, and entries corresponding to log ratios. Clustering and "gene hunting" are then performed on this matrix. One problem with such a summary of the data is that it treats all measurements symmetrically, and thus fails to take into account the fact that microarray data is of widely varying quality. Within an array, one gene
(or our measurement of it) may be inherently more variable than another. One array may be visibly "dirtier" than another for a variety of reasons (the quality of the RNA used comes immediately to mind). This problem can be addressed if we can measure this changing variability, and couple a matrix of standard deviations with the matrix of measurements. Rescaling the fold differences by their associated standard deviations "studentizes" the data, and provides a more robust way of assessing which differences are in fact significant. In the remainder of this chapter, we will describe how the scale of the standard deviation varies as a smooth function of the underlying intensity of the spots, and how we can estimate this function on a single array by using replicate spottings of the same gene at different positions on the array. These replicates can further be used for quality control checks, and can provide visual evidence of the "effectiveness" of a reference channel. A detailed example is given; theoretical details of the form of the error model are available elsewhere (Baggerly et al., 2001).
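In matrix terms, the proposal is elementwise division of the two summaries; a minimal sketch with hypothetical genes-by-arrays values:

import numpy as np

log_ratios = np.array([[1.2, 0.9],     # rows: genes, columns: arrays
                       [0.1, -0.2],
                       [2.5, 2.1]])
sds = np.array([[0.3, 0.35],           # matching standard deviations
                [0.8, 0.70],
                [0.4, 0.50]])
studentized = log_ratios / sds         # studentized log ratios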
2. Fold Differences and Error Models
The first few papers on spotted cDNA microarrays (e.g., Schena et al., 1995; Schena et al., 1996; DeRisi et al., 1996) flagged genes as interesting if the detected fold difference in expression exceeded a fixed arbitrary threshold. Later (Chen et al., 1997), this threshold was made less arbitrary by using replicated housekeeping genes to assess the standard deviation of replicates on an array. Both of these methods implicitly assume that the coefficient of variation for the gene expression values is constant. More recent work (Baggerly et al., 2001; Hughes et al., 2000; Newton et al., 2001; Rocke and Durbin, 2001) has shown that this is not the case; the standard deviation associated with a log ratio decreases with increasing intensity. An idealized version of the situation is shown in Fig. 4.1. We can measure fold differences associated with bright spots quite precisely, while the same is not true for faint spots. Faint spots can undergo large fold changes that are almost entirely due to random variation. Consequently, the fold change associated with significance varies with intensity. This trend in variation with intensity can be made visually apparent on a single array through the use of replicate spots.
3. A Case Study

3.1 Array Layout and Preprocessing
On our arrays, every spot has been printed in duplicate; one chip design involves a 4 by 12 grid of patches, where each patch is an 8 by 8 grid of
[Figure 4.1 shows a plot titled "Idealized variances of noise (dashed), signal (dotted), and sum", with average log intensity (2 to 12) on the x-axis and the standard deviation of the log ratio (0 to 1) on the y-axis.]
Figure 4.1. Idealized standard deviation of log ratios as a function of spot intensity. There are different factors contributing to the error. A multiplicative component is associated with a lognormal or binomial model for target counts; this contributes a constant amount of variance to the log ratio (dotted line). An additive component is associated with the estimation of background; on the log ratio scale this contributes a large amount to the variance of weak signals and a negligible amount to the variance of strong signals (dashed line). The observed variance is a sum of these. This phrasing (additive plus multiplicative error) is given in Rocke and Durbin, 2001. Note that the vertical scale was chosen arbitrarily here, and should not be expected to match that observed in an actual experiment.
spots, and where the top 4 rows of each patch are replicated in the bottom 4 rows. All told, this yields 1152 genes spotted in duplicate, together with 96 positive controls (32 each of GAPDH, beta actin, and tubulin), 96 negative controls (all yeast DNA), and 576 blanks (buffer solution). At each spot, intensity is measured as the sum of all pixel intensities within a fixed radius circle; local background is assessed as the median pixel intensity in the four diamond-shaped interstices between a spot and its neighbors. This background value is subtracted from each pixel in the circle, giving a subtracted volume, or "sVol". The raw image files have pixel intensities that range from 0 to 2^16 − 1; these are scaled in processing so that the range of sVol values is from 0 to about 2^12. Values of sVol less than 1 are set to 1 (our threshold) and the data are log-transformed (base 2). Replicate agreement can be assessed by plotting the log sVol value for one replicate against another, and looking for concentration about the 45 degree identity line. A better display is achieved by taking this plot and rotating it 45 degrees clockwise, as advocated by Dudoit et al. (2000), so
that we are plotting the difference between replicates on the y-axis and the average of the replicates on the x-axis. This corresponds to plotting the standard deviation of the replicates as a function of average intensity, letting us see how the variability changes. We have chosen a fairly faint image to use for our examples below to illustrate the points more clearly.
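A minimal sketch of this rotated display, assuming NumPy and matplotlib; the duplicate sVol vectors are simulated for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
v1 = rng.lognormal(6.0, 2.0, 1152)                 # replicate 1 sVols
v2 = v1 * rng.lognormal(0.0, 0.2, 1152)            # replicate 2, with noise
x1 = np.log2(np.maximum(v1, 1.0))                  # threshold at 1, then log2
x2 = np.log2(np.maximum(v2, 1.0))

plt.scatter((x1 + x2) / 2.0, x1 - x2, s=4)
plt.xlabel("(log2(sVol1) + log2(sVol2))/2")
plt.ylabel("log2(sVol1) - log2(sVol2)")
plt.show()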
3.2 Single Channel Images
We tend to produce plots first for each channel separately. We begin here with the Cy5 channel, and check replicate agreement for the genes, Fig. 4.2, the positive and negative controls, Fig. 4.3, the blanks, Fig. 4.4, and all of them superimposed, Fig. 4.5. Plotting the genes illustrates the “spread effect” as we move from high to low intensities (right to left); the V-shape at the left end of the plots is due to the threshold at 1. Plotting the controls gives us some idea of the stability of the array; the ranges of the negative controls and the blanks give us some idea of the range of intensity values that are believable. A gene must be higher in intensity than either blanks or negative controls before we pay attention to it. Next, we check the green channel (which tends to be a bit brighter) and then the replicate ratios. The fan shape is more apparent in the green channel, Fig. 4.6.
[Figure 4.2 shows a scatterplot titled "Replicate Gene Agreement, Cy5", with (log2(sVol1) + log2(sVol2))/2 on the x-axis (0 to 12) and log2(sVol1) − log2(sVol2) on the y-axis (−6 to 6).]
Figure 4.2. Agreement (y-axis) between replicate spottings of the same gene as a function of intensity (x-axis). Each spot on the plot corresponds to a pair of spots on the array, with the identity of the spots the same. Note that agreement is poorer when intensity is low.
[Figure 4.3 shows a scatterplot titled "Replicate Control Agreement, PC Star NC Triangle, Cy5", with the same axes as Fig. 4.2.]
Figure 4.3. Just as we can check agreement between replicate spottings of a gene, we can look at the agreement between positive controls (PCs) and negative controls (NCs) paired following the same geographic pattern (top half to bottom half). There should be clear separation between the two, and the positive controls give us some feel for the variation in intensity possible on an array. Variation in intensity is corrected for by using ratios.
[Figure 4.4 shows a scatterplot titled "Blank Agreement, Cy5", with the same axes as Fig. 4.2.]
Figure 4.4. We can also check agreement at the paired blank values. These are at the same level as the negative controls in this example, but that is not always the case. Paying attention only to those genes that are present at levels higher than either blanks or negative controls is a conservative rule.
[Figure 4.5 shows a scatterplot titled "Replicate Agreement, Superposition Cy5", with the same axes as Fig. 4.2.]
Figure 4.5. Superimposing the results shows where we can focus on potentially usable results.
[Figure 4.6 shows a scatterplot titled "Replicate Agreement, Superposition, Cy3", with the same axes as Fig. 4.2.]
Figure 4.6. Superimposing the results shows where we can focus on potentially usable results. The flare is more apparent in the green channel for this array.
3.3
Replicate Ratios
The fan shape is even more evident on the replicate ratio plot, Fig. 4.7, than it was on either of the individual channel plots. Note that the variance at the high end decreases markedly as we move from the single channel replicates to the ratio replicates. This is visual
[Figure 4.7 plot: Replicate Ratio Agreement. Axes: y = log2(sVolR1/sVolG1) − log2(sVolR2/sVolG2); x = log2(sVolR1·sVolR2·sVolG1·sVolG2)/4.]
Figure 4.7. The change in variance with intensity is more apparent on the ratio plot than on either of the individual channel plots. The most marked change is that the variance at the high end has decreased on the ratio plot. The variance here is associated with multiplicative errors (a slight preference for one replicate spot over another) and these errors are correlated between channels and hence cancel. Noise at the low end is more random, and cancels to a much lesser degree.
evidence of the fact that ratios are more stable than intensities at the high end. This implies that there are trends in hybridization efficiency across the array, but that these trends are paralleled in the two channels, so that taking ratios cancels them out. Looking for a decrease in variance at the high end is one way of checking the “efficiency” of a reference channel. The story at the left end of the plot is different; there the variability is driven by nonsystematic background “white noise”, which does not cancel as well across channels. Zooming in on the high end of the ratio plot (Fig. 4.8) shows some noticeable outliers. Note that outliers on replicate agreement plots, either single channel or ratio, are not indicative of features of biological interest! Rather, they correspond to data artifacts on the array such as dust specks or fibers. Checking the two outliers at the bottom left of Fig. 4.8 confirms their artifactual nature; one is due to a fiber, the other to a dust speck. These are shown in detail in Fig. 4.9.
3.4
Variance Fitting and Studentization
Thus far, we have exploited the plots primarily for quality control. However, the variance estimable from the replicate ratios can be used to define
[Figure 4.8 plot: Replicate Ratio Agreement, High Intensity Zoom; axes as in Fig. 4.7.]
Figure 4.8. Zooming in on the high intensity replicate ratios. Agreement here is quite good, and this lets us estimate the variance associated with a log ratio measurement. Assessment must use robust methods so as to take into account the presence of occasional artifacts such as the two in the lower left; these correspond to replicate spots where the ratio is quite different even though the gene is the same.
[Figure 4.9 panels with pixel intensity color scales: Left outlier, rep 1, Cy3; Right outlier, rep 1, Cy3; Left outlier, rep 2, Cy3; Right outlier, rep 2, Cy3.]
Figure 4.9. Visually checking the outliers from the replicate ratio plot (just looking at the green channel here) confirms that these are artifacts. Note the pixel intensity scales are different in the different subimages.
the precision with which a ratio measurement can be made, and this in turn will allow us to scale the differences between channels, which are “average ratios”. We estimate the standard deviation by fitting a smooth curve to the spread of the replicate ratios, capturing the precision with which a ratio is
measured as a function of the average channel intensity. This is done using a robust loess fit to the absolute values of the differences between replicate log ratios. Loess (Cleveland, 1979) uses only a windowed fraction of the data, so that the standard deviation is “local”; here the fraction is 40%. This fit is applied to the genes only, not to the positive controls, negative controls, or blanks. Further, so as not to be caught by the “V” at the left end, we impose a monotonicity condition: the variance cannot decrease as the signal gets weaker. The results are shown in Fig. 4.10. Note that the outliers are clearly visible. We then shift to plotting the average difference between channels. There is a vertical displacement corresponding to a normalization effect, indicated by the horizontal line. We assess the normalization offset using the median displacement for expressed genes; “expressed” here corresponds to an average signal value greater than 3. The results are shown in Fig. 4.11. To assess the degree to which we believe a gene to be differentially expressed, we take the difference between the observed channel difference and the normalization line and divide it by our estimate of the local standard deviation. This produces locally studentized values. While the extreme case in the lower right (transfected p53) would have been caught even without adjustment, many false positives would have been identified using a fixed fold difference. It is interesting to note that this image illustrates a potential conflict with the positive controls. Zooming in on the high-intensity region and labeling the different types of positive controls with different symbols shows that GAPDH and tubulin
[Figure 4.10 plot: Replicate Ratio Agreement, with variance fit ±4 s.d.; axes as in Fig. 4.7.]
Figure 4.10. Loess fit to the standard deviation of the replicate ratios. We fit the mean absolute replicate difference, which corresponds to roughly 0.8 s.d. with normal data. The bands shown correspond to ±4 s.d.
[Figure 4.11 plot: Average Cy5 − Cy3 log difference, with smooth. Axes: y = log2((sVolR1·sVolR2)/(sVolG1·sVolG2))/2; x = log2(sVolR1·sVolR2·sVolG1·sVolG2)/4.]
Figure 4.11. Average differences in log intensities for the Cy5 and Cy3 channels, with local standard deviation bounds. Spots outside the bands correspond to differentially expressed genes. There is a clear extreme case at the bottom, corresponding to p53, which had been transfected into one of the two samples used in this experiment. Note that the positive controls appear to have some differences.
[Figure 4.12 plot: Average Cy5 − Cy3 log difference, Positive Control Zoom; axes as in Fig. 4.11.]
Figure 4.12. Channel difference zoom on the positive controls. Crosses are GAPDH, stars are beta actin, and squares are tubulin. GAPDH is expressed at a different relative level than beta actin, but replicates within each type are fairly consistent.
are essentially on the normalization line, whereas beta actin was expressed at different levels in the two samples. This is shown in Fig. 4.12.
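The fitting and studentization recipe of this section can be sketched in a few lines. The sketch below is ours, not the authors' code: it uses the lowess smoother from statsmodels in place of their robust loess, and the inputs avg_int, log_ratio_diff, and channel_diff (average log intensity, replicate log-ratio differences, and Cy5 − Cy3 channel differences, all for genes only) are illustrative names.

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def studentize(avg_int, log_ratio_diff, channel_diff, frac=0.4):
    # Fit the mean absolute replicate difference as a function of intensity;
    # with normal noise this is roughly 0.8 standard deviations.
    fit = lowess(np.abs(log_ratio_diff), avg_int, frac=frac, return_sorted=False)
    sd = fit / 0.8
    # Monotonicity: the s.d. may not decrease as the signal gets weaker.
    order = np.argsort(avg_int)[::-1]               # from high intensity to low
    sd[order] = np.maximum.accumulate(sd[order])
    # Normalization offset: median channel difference among expressed genes.
    offset = np.median(channel_diff[avg_int > 3])
    return (channel_diff - offset) / sd             # locally studentized values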
4.
Discussion
Studentizing microarray data allows us to incorporate the inherently different scales of variation into our assessment of differential expression. This makes our results more stable. Further, for comparing different arrays, the studentized values incorporate the overall levels of “noisiness” of the arrays. This scaling of variation with intensity arises from the combination of different types of measurement error, which can be assessed using replicate spots. Replicating entire arrays would further allow the partitioning of between-array and within-array variation. Efficiently identifying and removing sources of variation that are not of interest lets us focus more clearly on the biological questions our experiments address.
Acknowledgments Research at the MDACC Cancer Genomics Core Laboratory is partially supported by Tobacco Settlement Funds as appropriated by the Texas State Legislature and by a generous donation from the Michael and Betty Kadoorie Foundation.
References
Baggerly, K. A., Coombes, K. R., Hess, K. R., Stivers, D. N., Abruzzo, L. V., and Zhang, W. (2001). “Identifying Differentially Expressed Genes in cDNA Microarray Experiments.” J Comp Biol 8:639–659.
Chen, Y., Dougherty, E. R., and Bittner, M. L. (1997). “Ratio-based Decisions and the Quantitative Analysis of cDNA Microarray Images.” J Biomed Optics 2:364–374.
Chapter 5
EXPLORATORY CLUSTERING OF GENE EXPRESSION PROFILES OF MUTATED YEAST STRAINS
Merja Oja1, Janne Nikkilä1, Petri Törönen2, Garry Wong2, Eero Castrén2, and Samuel Kaski1
1 Neural Networks Research Centre, Helsinki University of Technology, Helsinki, Finland
2 A. I. Virtanen Institute, University of Kuopio, Kuopio, Finland
1.
Introduction
Genome projects are continuously revealing novel genes, and the functions of many previously discovered genes remain unknown. Clarifying the function of these genes will be a great challenge and will require coordinated work by many different disciplines. Because many genes share relatively high homology across organisms, simple model organisms have been exploited in the analysis of gene function. For example, projects are underway to mutate every single gene in baker’s yeast, Saccharomyces cerevisiae (Winzeler et al., 1999). One of the problems that has hampered functional analysis is that simple organisms display only a limited number of observable phenotypes, and many mutations do not produce any phenotype at all. Gene expression analysis has recently been proposed as an alternative means to analyze phenotypes of mutated yeast strains. Hughes et al. (2000) used a large mutation data set in an exploratory fashion to suggest inferences about gene function, for example about which pathway an uncharacterized gene belongs to. The global transcriptional response of a cell after a mutation, i.e., the expression level of all the genes, can be regarded as a detailed molecular phenotype. Tentative inferences on similarity of function can be made based on similarity of the gene expression profiles, and the hypotheses can then be tested experimentally. Hughes et al. (2000) clustered the mutations according to similarities in their expression profiles, to make similarities hidden within the large
data set explicit. Hierarchical agglomerative clustering, a popular method for grouping gene expression data, was used. Hierarchical clustering methods visualize a tree of subclusters of increasing detail. The problem with the trees is that they are sensitive to small variations in the data that may affect the agglomeration process, and for large data sets the trees become huge. It is then hard to extract the essential cluster structures from them. We have been using Self-Organizing Map (SOM)-based clustering and information visualization for the same purpose (Törönen et al., 1999; Kaski et al., 2001a). The SOM (Kohonen, 1982; Kohonen, 1995) projects the genes or mutations onto a two-dimensional display which is ordered according to the similarity of the expression profiles. Perhaps the most immediate advantage over alternative methods is the easily understandable visualizability of the result: the SOM display provides an overview of the similarity relationships within a large genome-wide data set. Clusters are visualized with gray shades on the display, and, for example, mutation clusters can be characterized in terms of which kinds of genes the mutations affect. This paper has two interrelated goals. We show how to use Self-Organizing Map-based data exploration tools to analyze a genome-wide expression data set and visualize its essential properties. By directly comparing the results with those obtained with hierarchical clustering we hope to convince experimentalists of the utility of the tools in other gene expression analyses as well. More specifically, we show that the same things that have been found using hierarchical clustering can be found easily with the SOM. The second goal is to demonstrate methods for exploring the gene expression data.
2.
The Data
We used the data set analyzed earlier by Hughes et al. (2000). The set consists of expression measurements for all yeast (Saccharomyces cerevisiae) genes in 300 different treatments, in which one gene was either removed or its function was blocked. The expression of each gene in the mutated yeast is compared to the expression in a wild-type yeast. The expression ratio and a noise estimate for the ratio are measured for each gene-mutation pair. Thus the amount of data is 300 × 6297 ratios, and for each ratio there is an estimate of the variance of the measurement noise. Hughes et al. first analyzed the probability that a given expression measurement deviates from pure background noise, in order to concentrate on expressed genes. They call this process significance-cutting. They
constructed a gene-specific error model to calculate the P-value that a given measurement (for a gene-mutation pair) deviates from noise. Genes for which the P-value and the actual measured expression ratio did not both exceed the specified thresholds were left out of the analysis, and likewise for the mutations. In our work the significance threshold was chosen to be P < 0.01, and two different thresholds were used for the expression ratios: when choosing the metric, a looser test was used to leave enough data to make inferences, and the ratio had to exceed 2. When clustering the data the threshold was stricter, 3.2, in accordance with the study of Hughes et al. The size of the resulting larger set was 179 × 1482, and the size of the smaller set was 127 × 578. Each yeast strain was assigned a MIPS1 functional class according to the class of the gene that had been mutated. The MIPS classification is hierarchical and allows a gene to belong to several functional classes. The original classification (249 classes) was reduced to 70 by picking the top level of homogeneous main classes and the second level of more heterogeneous classes. The functional classes which we considered heterogeneous were metabolism, cell growth, transcription, cellular transport, cellular biogenesis and cellular organization. For example, in the cellular organization class the second level distinguishes the genes based on which cellular components they affect, and for the metabolism class the second level subclassifies the genes into the metabolism of different biomolecules such as fatty acids or amino acids.
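The significance-cutting step can be written as a simple filter. The code below is a hedged reconstruction rather than the original pipeline; it assumes mutation-by-gene matrices of ratios and P-values and keeps a gene (or mutation) only if at least one of its measurements passes both thresholds.

import numpy as np

def significance_cut(ratios, pvalues, p_thr=0.01, fold_thr=3.2):
    # A measurement passes if it is both significant and strong;
    # the fold-change test is applied symmetrically in log space.
    passes = (pvalues < p_thr) & (np.abs(np.log2(ratios)) >= np.log2(fold_thr))
    keep_mutations = passes.any(axis=1)
    keep_genes = passes.any(axis=0)
    return ratios[np.ix_(keep_mutations, keep_genes)]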
3.
Choosing the Metric
Clustering methods group together similar data items, here gene expression profiles, while maximizing dissimilarity between clusters. Hence, the results depend heavily on the metric. The most common distance or similarity measures for real-valued data, such as gene expression profiles, are based on either the Euclidean distance or the correlation coefficient. In this paper we confine ourselves to variants of these two measures. To our knowledge, there are no comparisons of different metrics for gene expression data in the literature. The suitability of different measures depends both on the accuracy of the data and on the goal of the analysis. The questions that need to be answered include: (i) Is the zero level of the measurements reliable enough, i.e., is there possibly a (large) shift in the baseline? (ii) Which is interesting, the absolute magnitude of expression ratios or the relative values? Some suitable measures for each case are listed in Fig. 5.1.
[Figure 5.1 layout: a two-by-two grid indexed by whether the zero level is reliable and whether absolute magnitudes are interesting. Reliable and interesting: Euclidean metric. Unreliable and interesting: Euclidean with mean subtracted. Reliable and not interesting: inner product. Unreliable and not interesting: correlation.]
Figure 5.1. Distance or similarity measures for gene expression profiles. Suitability of the measures depends on whether the zero level of the expressions is reliable or not, and whether the absolute or relative magnitudes are important. Inner product is taken of normalized vectors.
If the baseline is not very reliable, then the measures should be invariant to it. In the correlation coefficient the average of the measurements is effectively subtracted before the analysis, and the same can in principle be done for the Euclidean measure as well. The Euclidean distance measures both the absolute and relative differences in gene expression. If one is only interested in the latter, then the expression profiles can be normalized to unit Euclidean length. It is easy to show that, for normalized profiles, similarity orderings according to increasing Euclidean distance and decreasing inner products $x^T y$ are equivalent. Here x and y are expression profiles considered as real-valued vectors. The correlation coefficient, on the other hand, is equivalent to the inner product of vectors from which the average has been subtracted. In summary, correlation is invariant to both the zero level and the scale of gene expressions. The inner product (of normalized vectors) is invariant to scale. The Euclidean distance takes into account differences in both.2 Note that the absolute value of the correlation is sometimes used. This is very sensible within one time series, assuming that there is some common underlying factor that inhibits the activity of one gene and enhances the other. In this paper the advantages are not as salient, however, since the multitude of factors may cause different directions for each gene/treatment combination. Hence we decided to avoid potential interpretation difficulties in the later stages of analysis, and did not use the absolute value. Plain Euclidean or correlation-based distances weight the expression of each gene at each treatment equally. A standard way of preprocessing variables in the absence of prior information is to scale their variance to unity. Hughes et al. (2000) have additionally measured the noise of each
gene-mutation pair individually, and the expression values can be downscaled by the standard deviation of the noise. In this section we measure empirically which distance measures are most appropriate for gene expression data. As a criterion we use the ability of a simple non-parametric classifier to distinguish between functional classes of the genes.
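For concreteness, the candidate measures can be written down directly. The following sketch assumes plain NumPy vectors and is meant only to make the invariances above explicit.

import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

def inner_product(x, y):
    # Inner product of length-normalized vectors: invariant to scale.
    return np.dot(x / np.linalg.norm(x), y / np.linalg.norm(y))

def correlation(x, y):
    # Subtracting the mean first makes the measure invariant
    # to both the zero level and the scale.
    return inner_product(x - x.mean(), y - y.mean())

For normalized profiles the orderings agree because $\|x - y\|^2 = 2 - 2 x^T y$ when $\|x\| = \|y\| = 1$, so ranking by decreasing inner product is the same as ranking by increasing Euclidean distance.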
3.1
Methods
The k-nearest neighbor (KNN) classifier stores its teaching data set, and classifies a new expression profile x according to majority voting among the classes of the K closest samples in the teaching set. That is, if there are more samples from class C than from the other classes, then the prediction is that x belongs to class C. We used K = 5. The different metrics were compared by cross-validation: the data set was divided into 27 subsets, and the classification result was computed 27 times for each metric. Each subset was used in turn for evaluating the accuracy of the classifier, while the teaching set consisted of the remaining 26 subsets. Since the genes may belong to several functional classes we had to modify the basic KNN classifier slightly. The new expression profile is considered to be randomly assigned to one of its classes, and the classifier chooses the predicted class randomly from a distribution it has estimated. The percentage of correct classifications is then estimated by the probability that the two classes coincide. The classifier estimates the distribution by searching for the K closest samples and calculating the proportions of the classes among them. A gene x belonging to $N_C(x)$ functional classes contributes to each of its classes with a proportion of $1/N_C(x)$. The classification error is estimated by the average of $1 - \hat{p}(C(x))$ over x and its correct classes within the test set. Here C(x) is a correct class of x, and $\hat{p}(C)$ is the proportion of samples of class C within the neighborhood of K samples.
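A sketch of this modified KNN error estimate is given below. It is an illustration under our own naming (test, teach, distance), not the authors' implementation, and it assumes the chosen measure is expressed as a distance (smaller means more similar).

import numpy as np
from collections import defaultdict

def knn_error(test, teach, distance, K=5):
    # test and teach are lists of (profile, set_of_classes) pairs.
    errors = []
    for x, classes in test:
        nearest = sorted(teach, key=lambda s: distance(x, s[0]))[:K]
        # Each neighbor spreads a unit vote evenly over its own classes.
        p = defaultdict(float)
        for _, cs in nearest:
            for c in cs:
                p[c] += 1.0 / (len(cs) * K)
        # Average (1 - p(correct class)) over the correct classes of x.
        errors.extend(1.0 - p[c] for c in classes)
    return np.mean(errors)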
3.2
Results
The classification errors for the three metrics, with and without the different preprocessing methods, are shown in Table 5.1. The best results were obtained with the correlation of expression profiles normalized with respect to the measurement noise. A paired t-test was used to measure the significance of the difference between the best result and the best results of the other two metrics. Only the difference from the Euclidean metric was significant (P < 0.01).
Table 5.1. Comparison of different metrics for classifying mutations. The figures are estimated classification errors. For the metrics see Fig. 5.1. Normalized noise: each expression value is scaled by the inverse of the estimated standard deviation of the noise. Normalized variance: the variance of the expression of each gene is normalized to unity. Note that the number of classes is very large compared to the size of the data set; the classification error of the naive classifier that predicts according to prior class probabilities is 0.9566.

Metric          None     Normalized noise   Normalized variance   Normalized noise & variance
Euclidean       0.9172   0.9138             0.9152                0.9186
Inner product   0.8898   0.8878             0.8899                0.8899
Correlation     0.8880   0.8866             0.8904                0.8905
The effects of preprocessing were additionally tested for the correlation measure, and they were not significant. Although the correlation measure was slightly better than the inner product distance, we chose the latter for the further analysis in the next section to keep our results comparable to those of Hughes et al. (2000).3
4.
Self-Organizing Map-Based Exploratory Clustering
4.1
Self-Organizing Maps
The Self-Organizing Map (SOM; Kohonen, 1982; Kohonen, 1995) is a discrete grid of map units. Each map unit represents certain kinds of data, in the present paper different mutations of yeast, the genes of which are expressed in similar ways. The input data is represented in an ordered fashion on the map: map units close-by on the grid represent more similar expression profiles, and units farther away progressively more different profiles. The map is a similarity diagram that presents an overview of the mutual similarity of a large amount of high-dimensional expression profiles. The mapping is defined by associating an n-dimensional model vector $m_i$ with each map unit i, and by mapping each expression profile $x \in \mathbb{R}^n$ to the map unit having the closest model vector. Here we use the inner product metric: each expression measurement is first normalized by the estimate of the noise standard deviation, after which the length of the vectors is normalized to unity. The inner product is then used as the
similarity measure, and the closest model vector is the vector $m_c$ fulfilling the condition
$$x^T m_c \geq x^T m_i \qquad (5.1)$$
for all i. The mapping becomes ordered and represents the data after the values of the model vectors are computed in an iterative process. In “online” computation, at step t one expression profile x(t) is selected at random, the closest model vector $m_c$ is found by (5.1), and the model vectors are adapted according to
$$m_i(t+1) = m_i(t) + h_{ci}(t)\, x(t), \qquad (5.2)$$
and then normalized to unit length. Here $h_{ci}(t)$ is the neighborhood function, a decreasing function of the distance of the units c and i on the map grid. We have used a Gaussian function that decreases in width and height during the iterative computation. For more details, variants, and different methods of computing SOMs see (Kohonen, 1995; Kaski et al., 1998). The SOM can be used both as a nonlinear projection method for dimensionality reduction and as a clustering method. According to recent evidence (Venna and Kaski, 2001), the similarity diagrams formed by the SOM are more trustworthy than multidimensional scaling-based alternatives, in the sense that if two data points are close-by on the display they are more likely to be close-by in the input space as well. Compared with most clustering methods, the SOM has the advantage that instead of extracting a set of clusters it visualizes the cluster structures. The SOM display is an overview of the similarity relationships and cluster structures of the data set. Such visualizations will be used below to study clusters in gene expression patterns. Several measurements are missing from the mutation data set, as is common for other real-world data sets as well. For the SOM there exists a simple method for computing with missing values: in each step of the iteration, simply neglect the values missing from the current input vector. This approach works well at least if only a small percentage of the values is missing (Samad and Harp, 1992). For the inner product metric we have carried out all computations within the subspace defined by the available components.
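One online training step of this inner-product SOM might look as follows. This is a minimal sketch under the assumption that the input profile x has already been noise-scaled and length-normalized; M (unit-length model vectors, units × dimensions) and grid (two-dimensional unit coordinates) are illustrative names.

import numpy as np

def som_step(M, grid, x, h0, sigma):
    c = np.argmax(M @ x)                         # winning unit, Eq. (5.1)
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)   # squared grid distances to c
    h = h0 * np.exp(-d2 / (2 * sigma ** 2))      # Gaussian neighborhood function
    M = M + h[:, None] * x                       # adaptation step, Eq. (5.2)
    return M / np.linalg.norm(M, axis=1, keepdims=True)  # renormalize rows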
4.2
Overview of the Cluster Structure of the Data
The self-organizing map used in this work was the inner-product version of the SOM, consisting of 13 × 9 map units arranged in a regular hexagonal grid. The map was computed in two phases. In the first, ordering phase of 120,000 iterations, the standard deviation of the Gaussian-shaped neighborhood function decreased linearly from 5 to 2 while its height decreased
from 0.2 to 0.02. In the second, fine-tuning phase of 1,200,000 iterations, the standard deviation decreased to 1 and the height to zero. The SOM overview of the yeast gene mutation data is presented in Fig. 5.2. Dense cluster-like areas are indicated in light shades, and the sparser areas separating them in data space in dark shades. The lightness is proportional to the distances between the model vectors of neighboring map units in the data space.
[Figure 5.2 display: the 13 × 9 SOM grid, with each map unit labeled by the mutated yeast strains and treatments mapped to it.]
Figure 5.2. Overview of the cluster structure of the data: Light shades denote dense areas in the data space and dark shades sparse areas, respectively. Dots denote the map units without any data vectors. Clusters found by Hughes et al. (2000) with hierarchical clustering have been encircled with the thick lines. Yeast strains that did not belong to the original clusters have been encircled with the thin lines, and boxed strains have moved away from their original clusters.
Because the density of the model vectors in the data space reflects the density of the data, the distances reflect the density structure of the data. The distance matrix is called the U-matrix (Ultsch, 1993).
The clusters formed by hierarchical clustering of gene expression patterns in mutated yeast strains by Hughes et al. (2000) have been manually drawn on the U-matrix in Fig. 5.2. Overall, the clusters derived from hierarchical clustering fall nicely into dense areas of the SOM-based display, indicating that the two clustering methods in general produce very comparable results.
Closer analysis of the SOM-based U-matrix reveals features which highlight the usefulness of the SOM's ability to visualize the spatial relationships between neighboring clusters. First, mutations in genes related to mitochondrial respiration are clustered in the upper middle region of the SOM display, perfectly agreeing with the cluster of mutations affecting mitochondrial function derived by Hughes et al. (2000). At the lower right corner of this cluster is a subcluster which includes cup5 and vma8, genes which affect mitochondrial function indirectly by affecting iron homeostasis. Whereas these genes were placed in a cluster separate from the other mitochondrial mutants by hierarchical clustering (Hughes et al., 2000), the SOM places them in a separate group, yet in the close vicinity of the main cluster of mutations affecting mitochondrial respiration. To the right of the mitochondrial respiration cluster is an area which in the SOM analysis appears as a single cluster, but was divided into two distinct albeit spatially related clusters by hierarchical clustering (Hughes et al., 2000). All the mutations in this cluster are related to DNA damage or a block of DNA replication (Zhou and Elledge, 1992), which indicates that clustering into a single cluster is functionally justified. Finally, the SOM places the sir2, sir3 and sir4 mutants in neighboring map units within a single cluster at the lower right end of the SOM display. These three genes directly interact to form a protein complex involved in telomeric silencing. Whereas sir2 and sir3 were clustered together also by hierarchical clustering, sir4 was not clustered together with them.
In conclusion, in contrast to methods which simply extract a set of clusters, the SOM-based display offers more information in the form of hints about the neighbor relations and topography of the data space.
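A U-matrix-style value for each unit is straightforward to compute from the model vectors. The sketch below is a simplification (the average dissimilarity between a unit and its grid neighbors) rather than the exact display of Fig. 5.2, and neighbors is an assumed precomputed adjacency list of the hexagonal grid.

import numpy as np

def umatrix_values(M, neighbors):
    # Model vectors in M are unit length, so 1 - m_i . m_j serves as
    # the local dissimilarity; large values mark sparse (dark) areas.
    return np.array([
        np.mean([1.0 - M[i] @ M[j] for j in neighbors[i]])
        for i in range(len(M))
    ])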
4.3
Interpretation of the Clusters
The overview display (Fig. 5.2) reveals clusters and visualizes their similarity relationships: clusters close-by on the map are similar. Ultimately the clusters need to be analyzed based on biological knowledge and further
experimentation, but there also exist SOM-based exploration tools that may help in understanding the properties of the clusters. Perhaps the simplest and most intuitive method for characterizing the clusters is to compute the average expression profiles in each interesting map area, and then to visualize their differences across map areas. It is of particular interest to measure what changes when moving along the map to neighboring map units; note that the map is a non-linear description of the data space and hence the coordinates are non-linear. We have previously (Kaski et al., 2002) characterized differences between neighboring map units in terms of the original data variables, here the expression of different genes, using local factors. The Self-Organizing Map can be regarded as an approximative principal surface, a non-linear generalization of principal components, and measuring changes along the map thus reveals the local principal variation in the data.
In Fig. 5.3 we have characterized two areas of the map. About the cluster area in the top right corner of the map we already have some biological knowledge based on the functional similarity of the mutated genes. In contrast, the cluster area at the lower end of the map is a previously unknown cluster structure found with the SOM-based overview in Fig. 5.2. We characterize the clusters by analyzing which genes have the most different expression values in two neighboring clusters, and how those genes behave in other mutations. In this manner the similar, or dissimilar, behavior of the genes in the previously identified cluster structures may clarify the functionality of the mutated genes in the still unknown clusters.
The results of the analysis of the original variables, the expressions of different genes, in the top right corner of the map are in accordance with the previous biological identification of these cluster structures. In Fig. 5.3, panel A reveals that the components having the largest positive differences, when moving into the upper cluster from the lower cluster, characterize specifically the lower cluster. Indeed, the expression of these genes, many of which are related to the metabolism of lipids and carbohydrates, is much lower in this cluster than anywhere else in the data space. Conversely, the genes which are downregulated (panel B) when moving from the lower to the upper cluster are generally expressed at a lower level in the upper cluster than in other parts of the data space. Many of these downregulated genes are related to mating and pheromone response, which is consistent with the clustering of mating-related mutations in the upper right corner.
Figure 5.2 also reveals a new, possibly interesting cluster in the middle of the lower end of the map. This cluster was not analyzed in Hughes et al. (2000). The mutated genes in this cluster do not form a clear functional group, but many of them are related to transcriptional control, including transcription factors, histone deacetylases and cell-cycle regulated genes.
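The characterization itself reduces to ranking the component-wise differences of cluster means. A minimal sketch, with data, idx_a, and idx_b as assumed inputs (the expression matrix and the indices of the samples mapped to the two map areas):

import numpy as np

def characterize(data, idx_a, idx_b, n=20):
    # Change in average expression when moving from area a to area b.
    diff = data[idx_b].mean(axis=0) - data[idx_a].mean(axis=0)
    order = np.argsort(diff)
    up = order[::-1][:n]    # the n largest positive differences
    down = order[:n]        # the n largest negative differences
    return up, down, diff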
[Figure 5.3 panels: A, “20 largest differences” and B, “20 smallest differences” for the known cluster pair; C, “20 largest differences” and D, “20 smallest differences” for the unknown cluster pair; each panel pairs a bar-plot with a gray-shaded map grid.]
Figure 5.3. Characterization of a sample cluster (panels A and B) and a previously unknown cluster (panels C and D). The bar-plots represent changes in average gene expressions when moving according to the arrow from one cluster to another. The 20 largest positive differences and the 20 largest negative differences between the average expression vectors of the clusters are shown. The gray-shaded map grids in the same panels show the average expression of the same 20 components plotted over the whole map. A light shade denotes a high average value and a dark shade a low average value.
In Fig. 5.3, panels C and D reveal the components that are up- or downregulated when moving from the mid-left to the mid-right cluster at the lower part of the map. The upregulated genes are diffusely localized not only to the right cluster, but also to a large area above it, further indicating that the right cluster does not form a functionally uniform group. The downregulated genes highlight the left cluster, where the genes are expressed on average at a higher level than in the other parts of the data space. Many of the regulated genes encode heat shock proteins, indicating the activation
of the heat shock reaction in response to the mutations clustered in the lower left region.
5.
Conclusions
We have shown that cluster displays constructed with self-organizing maps correspond well with the clusterings formed earlier by hierarchical clustering (Hughes et al., 2000). The main difference between the methods seems to be in the ways they visualize the data: trees vs. two-dimensional maps. Map displays enable focusing of the further analysis and provide a framework for intuitive exploration of the data. We demonstrated such exploration by investigating known and still unknown cluster structures. We compared the performance of different distance measures, the Euclidean metric, the inner product, and the correlation coefficient, in functional classification of the expression profiles. The correlation coefficient was the best, although only the difference from the Euclidean metric was significant. This suggests that the zero level in the signals may not be reliable, and that only the relative values of the expression ratios are important. Scaling the expressions by the inverse of a noise estimate improved the results, but not significantly. The metrics we studied in this paper were global, i.e., the same everywhere in the data space. We will later apply metrics that learn to concentrate on the most important kinds of differences (Kaski et al., 2001b; Sinkkonen and Kaski, 2001). The importance is derived from auxiliary data.
Acknowledgments This work was supported by the Academy of Finland, in part by the grants 50061 and 1164349.
Notes
1. http://www.mips.biochem.mpg.de/proj/yeast.
2. Correlation can be viewed as a method of reducing both additive and multiplicative noise, whereas the inner product reduces only multiplicative noise. Note, however, that all noise reduction models may also remove essential information.
3. Note that our terminology is different; they call the inner product measure correlation.
References
Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., Stoughton, R., Armour, C. D., Bennett, H. A., Coffrey, E., Dai, H., He, Y. D., Kidd, M. J., King, A. M., Meyer, M. R., Slade, D., Lum, P. Y., Stepaniants, S. B., Shoemaker, D. D., Gachotte, D., Chakraburtty, K., Simon, J., Bard, M., and Friend, S. H. (2000). “Functional Discovery via a Compendium of Expression Profiles.” Cell 102:109–126.
Kaski, S., Kangas, J., and Kohonen, T. (1998). “Bibliography of Self-Organizing Map (SOM) Papers: 1981–1997.” Neural Computing Surveys 1:1–176.
Kaski, S., Nikkilä, J., and Kohonen, T. (2002). “Methods for Exploratory Cluster Analysis.” In: eds. P. S. Szczepaniak, J. Segovia, J. Kacprzyk, and L. A. Zadeh, Intelligent Exploration of the Web. Berlin: Springer, forthcoming.
Kaski, S., Nikkilä, J., Törönen, P., Castrén, E., and Wong, G. (2001a). “Analysis and Visualization of Gene Expression Data Using Self-organizing Maps.” In: Proceedings of NSIP-01, IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing.
Kaski, S., Sinkkonen, J., and Peltonen, J. (2001b). “Bankruptcy Analysis with Self-organizing Maps in Learning Metrics.” IEEE Transactions on Neural Networks 12:936–947.
Kohonen, T. (1982). “Self-organized Formation of Topologically Correct Feature Maps.” Biological Cybernetics 43:59–69.
Kohonen, T. (2001). Self-Organizing Maps, 3rd edition. Berlin: Springer.
Samad, T. and Harp, S. A. (1992). “Self-organization with Partial Data.” Network: Computation in Neural Systems 3:205–212.
Sinkkonen, J. and Kaski, S. (2001). “Clustering Based on Conditional Distributions in an Auxiliary Space.” Neural Computation, forthcoming.
Törönen, P., Kolehmainen, M., Wong, G., and Castrén, E. (1999). “Analysis of Gene Expression Data Using Self-organizing Maps.” FEBS Letters 451:142–146.
Ultsch, A. (1993). “Knowledge Extraction from Self-organizing Neural Networks.” In: eds. Opitz, O., Lausen, B., and Klar, R., Information and Classification, pp. 301–306. Heidelberg: Springer Verlag.
Venna, J. and Kaski, S. (2001). “Neighborhood Preservation in Nonlinear Projection Methods: An Experimental Study.” In: eds. Dorffner, G., Bischof, H., and Hornik, K., Artificial Neural Networks—ICANN 2001, pp. 485–491. Berlin: Springer.
Winzeler, E. A., Shoemaker, D. D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito, R., Boeke, J. D., Bussey, H., et al.
(1999). “Functional Characterization of the S. cerevisiae Genome by Gene Deletion and Parallel Analysis.” Science 285:901–906.
Zhou, Z. and Elledge, S. J. (1992). “Isolation of CRT Mutants Constitutive for Transcription of the DNA Damage Inducible Gene RNR3 in Saccharomyces cerevisiae.” Genetics 131:851–866.
Chapter 6
SELECTING INFORMATIVE GENES FOR CANCER CLASSIFICATION USING GENE EXPRESSION DATA
Tatsuya Akutsu1 and Satoru Miyano2
1 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan
2 Human Genome Center, Institute of Medical Science, University of Tokyo, Japan
1.
Introduction
Accurate classification of tumor types is very important in order to apply adequate therapies to patients. Therefore, many studies have been done on cancer classification. Most such studies were based primarily on the morphological appearance of the tumor. Recently, a new approach based on global gene expression analysis using DNA microarrays (Brown, 1997) has been proposed (Cole, 1999; Golub, 1999; Kononen, 1998; Tsunoda, 2000). Although the effectiveness of such an approach has already been demonstrated, the proposed information processing methods were rather heuristic (Golub, 1999; Tsunoda, 2000). Therefore, further studies should be done to make classification more accurate. Golub et al. divided cancer classification into two problems (Golub, 1999): class discovery and class prediction. Class discovery is to define previously unrecognized tumor subtypes, whereas class prediction is to assign particular tumor samples to already-defined classes. Although class discovery is more challenging, we consider class prediction in this chapter because class prediction seems to be more basic, and thus information processing methods for class prediction should be established earlier. In the previous methods (Golub, 1999; Tsunoda, 2000), predictions were made by means of weighted voting. Not all genes were used for the weighted votes; rather, several tens of genes relevant to the class distinction were selected and used. Use of selected genes seems better for several reasons. For example, the computational cost of determining parameters and the cost of measuring gene expression levels are much lower
                    (A)                                  (B)
          gene 1  gene 2  gene 3  gene 4    gene 5  gene 6  gene 7  gene 8
Class 1:
sample 1   high    high    high    high      high    high    high    high
sample 2   high    high    high    high      low     high    high    high
sample 3   low     low     low     low       high    low     high    high
sample 4   high    high    high    high      high    high    low     high
sample 5   high    high    high    high      high    high    high    low
Class 2:
sample 6   low     low     low     low       high    low     low     low
sample 7   low     low     low     low       low     high    low     low
sample 8   high    high    high    high      low     low     high    low
sample 9   low     low     low     low       low     low     low     high
Figure 6.1. Selection of informative genes. Set of genes {gene 5, gene 6, gene 7, gene 8} is better than {gene 1, gene 2, gene 3, gene 4} because wrong predictions might be made for sample 3 and sample 8 if the latter set is used.
if the selected genes are used. Golub et al. called these selected genes informative genes. However, the selection methods for informative genes were rather heuristic. Indeed, Golub et al. wrote that the choice to use 50 informative genes in the predictor was somewhat arbitrary. Tsunoda et al. selected informative genes based on the dependency between every pair of genes and on clustering results. Moreover, various other techniques from AI, machine learning and statistics were recently applied to the selection of informative genes (see http://bioinformatics.duke.edu/CAMDA/ and (Bagirov, 2003; Ding, 2003; Li, 2003)), since this problem can be treated as a feature selection problem (Blum, 1997), which is well-studied in these areas. However, most of these methods are still heuristic. Therefore, this chapter focuses on the selection of informative genes. In order to demonstrate the importance of the selection of informative genes, consider the extreme example shown in Fig. 6.1. In this figure, “high” means that the expression level of a gene is high and “low” means that the expression level is low. If informative genes were to be selected based only on the correlation between the classification result and the expression of each gene, every gene would have the same significance and thus (A) and (B) would be equivalent. However, if class prediction were made by majority voting using set (A), sample 3 and sample 8 would be classified into wrong classes. On the other hand, all samples would be classified into correct classes if class prediction were made using set (B). Therefore, informative genes should be determined by considering dependencies among multiple genes. Although the use of pairwise dependencies is helpful (Tsunoda, 2000), there may be a difficult case: the correlation between gene 1
and gene 2 and the correlation between gene 1 and gene 3 are low, whereas the correlation between gene 2 and gene 3 is high. In this chapter, we treat this selection problem as an inference problem of threshold functions over Boolean variables. We treat class prediction as the problem of deciding whether or not a given sample belongs to the target class. Note that class prediction with multiple classes can be treated by making class prediction for each class independently. We do not use real values because it is difficult to give an appropriate mathematical definition using real values, and it is widely recognized that gene expression data obtained by DNA microarrays contain substantial noise. Instead, each value is simplified to either 1 (high expression level) or 0 (low expression level), where the method of simplification is described in Section 2. In particular, we consider an inference problem of r-of-k threshold functions (Littlestone, 1988; Littlestone, 1994). Such a function outputs 1 if at least r variables among k selected variables are 1. For the example of Fig. 6.1, we can make correct predictions using set (B) by letting r = 3 or r = 2, where k = 4 in this case. The inference problem is to select k genes so that the r-of-k function makes correct predictions for all samples when the Boolean values of n genes (usually n ≫ k) for m samples are given as an input (i.e., a training data set). Since threshold functions are useful and important for classification, many studies have been done on them in the field of machine learning. Among them, the WINNOW algorithm (Littlestone, 1988) and its variants (Littlestone, 1994) are famous and widely used. For example, a variant of WINNOW was applied to protein-coding region prediction (Furukawa, 1996) and to the inference of genetic networks (Noda, 1998). We applied WINNOW to the selection of informative genes. However, as shown in Section 4, the results were not satisfactory. Therefore, we developed a new algorithm. We compared the algorithm with WINNOW and with a very simple algorithm, using gene expression data (Golub, 1999) obtained from human acute leukemia patients. The results show that the greedy algorithm is as good as the other two algorithms on the test data set and is much better on the training data set.
2.
Selection of Informative Genes
As mentioned in Section 1, we do not use real values. Instead, each value is rounded to either 1 (high expression level) or 0 (low expression level) by a simple comparison with a threshold, where the threshold values are determined by the following very simple method. Assume that expression data for n genes are available. Let $\{g_1, \ldots, g_n\}$ denote the set of genes. Let $\{s_1, \ldots, s_m\}$ denote the set of samples from
patients. Assume that it is known whether each sample $s_j$ belongs to the target cancer class. We let $class(s_j) = 1$ if $s_j$ belongs to the class, otherwise we let $class(s_j) = 0$. Let M be the number of samples that belong to the target class (i.e., $M = \#\{s_j \mid class(s_j) = 1\}$). Let $e_{i,j}$ denote the observed expression level of gene $g_i$ for sample $s_j$. For each gene i, $e_{i,1}, e_{i,2}, \ldots, e_{i,m}$ are sorted in ascending order. Let $e_{i,p}$ and $e_{i,q}$ denote the M-th largest value and the (M+1)-th largest value among $e_{i,1}, e_{i,2}, \ldots, e_{i,m}$, respectively. We define the threshold value $\hat{e}_i$ for gene $g_i$ by $\hat{e}_i = (e_{i,p} + e_{i,q})/2$. Then we round $e_{i,j}$ to a Boolean value $x_{i,j}$ by
$$x_{i,j} = \begin{cases} 1, & \text{if } e_{i,j} > \hat{e}_i,\\ 0, & \text{otherwise.} \end{cases}$$
Note that, in an ideal case (i.e., $e_{i,j} > e_{i,j'}$ holds for all $j, j'$ such that $class(s_j) = 1$ and $class(s_{j'}) = 0$), the following property holds: $x_{i,j} = 1$ if and only if $class(s_j) = 1$. The symmetric case, in which the expression levels for samples belonging to the target class are lower than those for the other samples, can be treated in an analogous way.
Next, we formalize the selection problem using threshold functions over Boolean variables. In particular, we use r-of-k threshold functions (Littlestone, 1988; Littlestone, 1994). An r-of-k threshold function $f(z_1, \ldots, z_n)$ is defined by selecting a set of k significant variables. The value of f is 1 whenever at least r of these k variables are 1. If the k selected variables are $z_{i_1}, \ldots, z_{i_k}$, then f is 1 exactly when $z_{i_1} + \cdots + z_{i_k} \geq r$. For example, consider the case of n = 5, k = 3, r = 2 and $i_1 = 1$, $i_2 = 2$, $i_3 = 5$. Then f(1,1,1,1,1) = 1, f(0,0,0,0,0) = 0, f(1,0,1,1,1) = 1, f(1,0,1,1,0) = 0, and f(1,1,0,0,0) = 1. An r-of-k threshold function can be considered a special case of weighted voting functions or linearly separable functions. These functions are usually defined by: $f(z_1, \ldots, z_n) = 1$ if and only if
$$\sum_{i=1}^{n} \mu_i z_i \geq 1,$$
where $\mu_i$ and $z_i$ can take real values. If we let $\mu_i = 1/r$ for each of $z_{i_1}, \ldots, z_{i_k}$ and let $\mu_i = 0$ for the other variables, the linearly separable function becomes an r-of-k threshold function. Recall that weighted voting was successfully applied to the classification of cancer genes (Golub, 1999). Thus, it is reasonable to define the selection problem of informative genes using r-of-k functions when Boolean values are used. We define the selection problem of informative genes as follows. Assume that expression data and k are given as an input. Then, the problem is to determine a set of k genes $\{g_{i_1}, \ldots, g_{i_k}\}$ which maximizes r under the
condition that
$$x_{i_1,j} + \cdots + x_{i_k,j} \geq r \quad \text{if } class(s_j) = 1,$$
$$(1 - x_{i_1,j}) + \cdots + (1 - x_{i_k,j}) \geq r \quad \text{otherwise.}$$
The latter case means that at least r variables must be 0 if the corresponding sample does not belong to the target class. It should be noted that the above problem can be transformed into the inference problem of r-of-k functions by inverting $x_{i,j}$ for all i and for all j such that $class(s_j) = 0$. It is expected that predictions can be made more robustly if r is larger, because the prediction result for each $s_j$ by majority voting does not change even if the values of $r - \lceil k/2 \rceil - 1$ input variables are changed.
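The rounding rule of this section is easy to state in code. The following is a small sketch, not the authors' implementation; e is the genes × samples expression matrix and cls holds the 0/1 class labels of the samples.

import numpy as np

def booleanize(e, cls):
    M = int(np.sum(cls == 1))             # number of target-class samples
    srt = np.sort(e, axis=1)[:, ::-1]     # each gene's values, descending
    # Threshold halfway between the M-th and (M+1)-th largest values.
    e_hat = (srt[:, M - 1] + srt[:, M]) / 2.0
    return (e > e_hat[:, None]).astype(int)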
3.
Algorithms for the Selection Problem
In this section, we describe algorithms for the selection of informative genes. We review the WINNOW algorithm first, and then we present a new algorithm. Before reviewing the WINNOW algorithm, we define a very simple algorithm (denoted by SIMPLE) because it is also compared in Section 4. For each gene $g_i$, SIMPLE computes the number $err(g_i)$ defined by $err(g_i) = \#\{s_j \mid x_{i,j} \neq class(s_j)\}$. Then, SIMPLE sorts the $g_i$ in increasing order of $err(g_i)$ and selects the first k genes in the sorted list. That is, the genes with the k smallest $err(g_i)$ values are selected by SIMPLE.
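SIMPLE amounts to two lines of NumPy; the sketch below assumes the Boolean matrix x produced by the rounding rule of Section 2.

import numpy as np

def simple_select(x, cls, k):
    err = np.sum(x != cls[None, :], axis=1)   # err(g_i): disagreements with the labels
    return np.argsort(err)[:k]                # the k genes with the smallest err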
3.1
The WINNOW Algorithm
The WINNOW algorithm is simple too. The algorithm maintains non-negative real-valued weights $w_1, \ldots, w_n$. At the beginning, each $w_i$ is set to 1. For each sample from the training data set, it makes a prediction by the following rule:
$$pred(s_j) = 1 \ \text{if } \sum_{i=1}^{n} w_i x_{i,j} > \theta, \quad pred(s_j) = 0 \ \text{otherwise},$$
where this prediction is made only for learning the parameters. The weights are changed only if the prediction result is not correct (i.e., $pred(s_j) \neq class(s_j)$). The amount by which the weights are changed depends on a fixed parameter $\alpha$, where $\alpha = 1 + \frac{1}{2r}$ and $\theta = n$ are suggested for inferring r-of-k threshold functions (Littlestone, 1988).
WINNOW algorithm {
  let $w_i \leftarrow 1$ for all i = 1, ..., n;
  for j = 1 to m {
    if ($class(s_j) = 0$ and $pred(s_j) = 1$)
      { let $w_i \leftarrow w_i/\alpha$ for all i such that $x_{i,j} = 1$ };
    else if ($class(s_j) = 1$ and $pred(s_j) = 0$)
      { let $w_i \leftarrow \alpha w_i$ for all i such that $x_{i,j} = 1$ }.
  }
}
WINNOW has a very nice property: the number of times that wrong predictions are made is bounded. It is proven (Littlestone, 1988) that the number is at most $8r^2 + 5k + 14kr \ln n$ when we let $\alpha = 1 + \frac{1}{2r}$ and $\theta = n$. Since WINNOW does not select variables but only outputs the weights of the variables, we must select the k informative genes using the weights $w_i$. To do so, we employ a very simple method: we simply select the k genes with the k highest weights.
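A direct transcription of this pseudocode, together with the top-k weight selection, might look as follows (a single pass over the samples; the variable names are ours):

import numpy as np

def winnow_select(x, cls, k, r):
    n_genes, m = x.shape
    alpha, theta = 1.0 + 1.0 / (2 * r), float(n_genes)   # suggested parameters
    w = np.ones(n_genes)
    for j in range(m):
        pred = 1 if w @ x[:, j] > theta else 0
        if cls[j] == 0 and pred == 1:
            w[x[:, j] == 1] /= alpha      # demote on a false positive
        elif cls[j] == 1 and pred == 0:
            w[x[:, j] == 1] *= alpha      # promote on a false negative
    return np.argsort(w)[::-1][:k]        # the k genes with the highest weights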
3.2
A Simple Greedy Algorithm
Although WINNOW has a nice mathematical property, the number of mistakes is not low when the number of samples is small. It is difficult to obtain many samples because of costs (microarrays are expensive) and other reasons. Indeed, as shown in Section 4, the performance of WINNOW was not so good even on the training data set. Therefore, we developed an algorithm whose performance is good for a small number of samples. This algorithm is a kind of greedy algorithm; greedy algorithms are used as heuristics for various problems. It should be noted that inferring an r-of-k function consistent with the training data is known to be computationally hard (NP-hard) (Akutsu, 2000). Therefore, the development of heuristic algorithms is a reasonable choice. The new algorithm is denoted by GREEDY in this chapter. GREEDY maintains non-negative real-valued weights $w_1, \ldots, w_m$. It should be noted that, in this case, the weights are not assigned to genes, but to samples. We say that gene $g_i$ covers sample $s_j$ if $class(s_j) = x_{i,j}$. First, genes which do not cover most samples are removed and are not considered as candidates for informative genes. Precisely, any gene $g_i$ such that $\#\{s_j \mid class(s_j) \neq x_{i,j}\} > \theta_0$ is removed, where $\theta_0$ is a threshold and we are currently using $\theta_0 = 7$–$9$. Next, GREEDY selects informative genes iteratively (i.e., one gene is selected per iteration). At the h-th iteration, the gene $g_{i_h}$ which maximizes the score is
selected. The score is defined by
$$\sum_{j=1}^{m} \beta^{\,\delta(class(s_j),\, x_{i_h,j}) \times (h - w_j)},$$
where $\delta(x, y) = 1$ if $x = y$ and $\delta(x, y) = 0$ otherwise. $\beta$ is a constant chosen empirically; we are currently using $\beta = 1.5$. The weight $w_j$ is increased by 1 if the selected gene $g_{i_h}$ contributes to the classification of sample $s_j$. Precisely, $w_j$ is updated by $w_j \leftarrow w_j + 1$ if $x_{i_h,j} = class(s_j)$. Thus, $w_j$ represents the number of genes (among $g_{i_1}, \ldots, g_{i_h}$) which cover $s_j$. GREEDY tries to cover each sample as many times as possible. Although there is no mathematical proof, GREEDY showed good performance, as mentioned in Section 4. The following is a description of the GREEDY algorithm.
GREEDY algorithm {
  remove all $g_i$ such that $err(g_i) > \theta_0$;
  let $w_j \leftarrow 0$ for all j = 1, ..., m;
  for h = 1 to k {
    select $g_{i_h} \notin \{g_{i_1}, \ldots, g_{i_{h-1}}\}$ which maximizes the score $\sum_{j=1}^{m} \beta^{\,\delta(class(s_j),\, x_{i_h,j}) \times (h - w_j)}$;
    let $w_j \leftarrow w_j + 1$ for all j such that $x_{i_h,j} = class(s_j)$.
  }
}
It is easy to see that GREEDY runs in O(kmn) time. It is fast enough since the total size of the training data set is O(mn) and k is not so large (usually less than 100).
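A sketch of GREEDY with the parameters stated above ($\theta_0$ around 8, $\beta$ = 1.5) is given below. It is an illustration rather than the original code; coverage is represented as a Boolean matrix so that the score becomes a one-line NumPy expression.

import numpy as np

def greedy_select(x, cls, k, theta0=8, beta=1.5):
    n_genes, m = x.shape
    covers = (x == cls[None, :])           # covers[i, j]: gene i covers sample j
    # Remove genes with err(g_i) > theta0, i.e., keep genes covering
    # at least m - theta0 samples.
    candidates = set(np.where(covers.sum(axis=1) >= m - theta0)[0])
    w = np.zeros(m)                        # w[j]: how often sample j is covered
    selected = []
    for h in range(1, k + 1):
        # beta^(delta * (h - w_j)) summed over samples; uncovered samples
        # contribute beta^0 = 1, and under-covered samples dominate the score.
        best = max(candidates,
                   key=lambda i: np.sum(beta ** (covers[i] * (h - w))))
        selected.append(best)
        candidates.remove(best)
        w += covers[best]
    return selected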
3.3
Prediction by the Majority Voting
For each of the three algorithms mentioned above, predictions for samples are made by majority voting. Let $s_j$ be a sample for which class prediction is to be made. Let $g_{i_1}, g_{i_2}, \ldots, g_{i_k}$ be the informative genes selected by one of SIMPLE, WINNOW and GREEDY. We define $V_{win}$ (resp. $V_{lose}$) to be the number of genes $g_{i_h}$ for which $x_{i_h,j} = 1$ (resp. $x_{i_h,j} = 0$). We define the prediction strength (PS) by $\frac{V_{win} - V_{lose}}{V_{win} + V_{lose}}$ (Golub, 1999). If PS is less than some threshold value $\theta_{PS}$, we do not make a prediction (i.e., $s_j$ is assigned as uncertain). If $PS \geq \theta_{PS}$ and $V_{win} > V_{lose}$, we predict that $s_j$
belongs to the target class. Otherwise, we predict that $s_j$ does not belong to the target class.
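A sketch of the voting rule with prediction strength follows. Since the text also predicts “not in the target class” when the zeros win, we take the absolute value of $(V_{win} - V_{lose})/(V_{win} + V_{lose})$ so that PS measures the winner's margin; that reading is our assumption.

import numpy as np

def predict(xg, ps_threshold=0.30):
    # xg: Boolean values of the selected informative genes for one sample.
    v_win, v_lose = int(np.sum(xg == 1)), int(np.sum(xg == 0))
    ps = abs(v_win - v_lose) / (v_win + v_lose)  # winner's margin (assumed)
    if ps < ps_threshold:
        return None                    # uncertain
    return 1 if v_win > v_lose else 0  # in / not in the target class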
4.
Computational Results
In this section, we show the results of computational experiments on SIMPLE, WINNOW and GREEDY. For implementation and comparison, we used a PC with a 700 MHz AMD Athlon processor. In each case, the inference could be done within ten seconds. We used the data set obtained by Golub et al. (Golub, 1999), who chose acute leukemias as a test case. Acute leukemias are basically classified into two classes: those arising from lymphoid precursors (acute lymphoblastic leukemia, ALL) and those arising from myeloid precursors (acute myeloid leukemia, AML). They used two data sets: one for training and the other for testing. The training data set (TR) consisted of 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients. The test data set (TS) consisted of 24 bone marrow and 10 peripheral blood samples. Note that the test data set included a much broader range of samples than the training data set because it included samples from peripheral blood and from childhood AML patients. For each sample, expression levels of 6817 genes were measured with microarrays produced by Affymetrix.
4.1 Comparison of Prediction Methods
Before comparing the selection methods, we compared prediction methods using the set of 50 informative genes given in (Golub, 1999). This comparison is necessary because the prediction accuracy may depend heavily on the prediction method rather than on the informative genes. We implemented the following methods:

Simple Majority Voting: As mentioned in Section 3.3, V_win and V_lose are computed and compared.

Support Vector Machines (SVM): Support vector machines (Cortes, 1995) are considered one of the most powerful methods for estimating the parameters of weighted voting. In this method (using a simple kernel function), the weighted sum a_0 + a_1·e_{1,j} + a_2·e_{2,j} + ··· + a_50·e_{50,j} is computed for each sample s_j, and s_j is assigned to the target class if the sum is positive. The coefficients a_0, a_1, ..., a_50 are determined from the training data set TR by quadratic programming. In order to apply SVM to the prediction, the input expression levels were normalized.

We compared the prediction results on the test data set TS with those of Golub et al.
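For illustration, the SVM-based prediction can be reproduced along the following lines with scikit-learn; this is a modern stand-in, since the chapter does not name a particular implementation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def svm_predict(X_train, y_train, X_test):
    """Linear-kernel SVM on normalized expression levels of the 50 genes.

    X_train : (38, 50) training expressions, y_train : 0/1 class labels
    X_test  : (34, 50) test expressions
    """
    scaler = StandardScaler().fit(X_train)    # "input expression levels were normalized"
    clf = SVC(kernel="linear")                # simple kernel; fit solves the quadratic program
    clf.fit(scaler.transform(X_train), y_train)
    # the decision is the sign of a_0 + a_1 e_{1,j} + ... + a_50 e_{50,j}
    return clf.predict(scaler.transform(X_test))
```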
Table 6.1. Comparison of prediction methods.

                                    Correct   Incorrect   Uncertain
  Golub et al. (θ_PS = 0.30)           29          0           5
  Golub et al. (θ_PS = 0.0)            32          2           0
  Majority Voting (θ_PS = 0.30)        25          0           9
  Majority Voting (θ_PS = 0.0)         31          3           0
  SVM                                  32          2           0
We counted the number of samples for which correct predictions were made, the number for which incorrect predictions were made, and the number assigned as "uncertain" (i.e., PS was less than θ_PS). Since we did not know how to define PS for SVM, we did not consider PS for SVM. Table 6.1 shows the results. From the table, it is seen that SVM and the voting method of Golub et al. were better than simple majority voting. However, the differences were not large: even simple majority voting made only three wrong predictions in the case θ_PS = 0.0. It should be noted that the definition of PS for majority voting differs from that of Golub et al., so the comparison for the cases with θ_PS = 0.30 is not very meaningful. It is also seen that SVM is as good as the method of Golub et al. This suggests that the prediction method of Golub et al. is not necessarily the best, although further computational experiments are needed for a detailed comparison of the two methods. It should also be noted that all algorithms (in the cases with θ_PS = 0.0) made a wrong prediction for the same sample (sample No. 66). All algorithms assigned this sample to ALL (with low PS) although it belonged to AML. Thus, this sample may be a very special case of AML.
4.2 Comparison of Selection Methods
In this subsection, we show the results of the comparison of the selection methods. We compared three algorithms: SIMPLE, WINNOW and GREEDY. Since majority voting with Boolean variables was not so bad, we used majority voting as the prediction method. As for data sets, we used the following three pairs.

(A) The training data set (TR) and the test data set (TS) of Golub et al. were used. TR consisted of 38 samples, among which 27 samples (resp. 11 samples) were classified into ALL (resp. AML). TS consisted of 34 samples, among which 20 samples (resp. 14 samples) were classified into ALL (resp. AML).
(B) TR was used as the test data set and TS was used as the training data set.

(C) It is known that ALL samples are further classified into T-cell ALL and B-cell ALL, and Golub et al. made successful predictions on this distinction. The 27 ALL samples in TR were used as the training data set, among which 19 samples (resp. 8 samples) were classified into B-cell ALL (resp. T-cell ALL). The 20 ALL samples in TS were used as the test data set, among which 19 samples (resp. 1 sample) were classified into B-cell ALL (resp. T-cell ALL).

Since the rounded Boolean values were biased when the number of training samples belonging to the target class differed from the number of the other training samples, we equalized the numbers by duplicating samples not belonging to the target class. Thus, 27+27 training samples were used in case (A), 20+20 in case (B), and 19+19 in case (C). For each of SIMPLE, WINNOW and GREEDY, we examined four cases: k = 20, 30, 40 and 50. Recall that k is the number of informative genes to be selected. We used r = 0.8·k as the parameter of WINNOW; we examined several other values of r and obtained similar results. It should be noted that WINNOW does not necessarily select k genes satisfying the condition of the r-of-k threshold function. First we measured the quality of the sets of informative genes by means of r. Table 6.2 shows the maximum r computed from each set of informative genes. Recall that we defined the selection problem as a maximization problem on r. From this table, it is seen that GREEDY is much better than SIMPLE and WINNOW, while there is no significant difference between SIMPLE and WINNOW. Therefore, GREEDY is confirmed to be the best of the three algorithms under the definition of Section 2. However, it is unclear whether the selected informative genes are useful for prediction, so we performed computational experiments on prediction. Using the informative genes computed by each algorithm, we made predictions for the samples in the training data set and for those in the test data set by majority voting. As in (Golub, 1999), samples with PS < 0.30 were assigned as uncertain (i.e., θ_PS = 0.30). The results of the predictions are shown in Table 6.3. In this table, the number of samples assigned as uncertain is shown for each case, followed (after the symbol '+') by the number of samples classified into the wrong class. For example, 5+1 means that 5 samples were assigned as uncertain and 1 sample was classified into the wrong class.
Table 6.2. Qualities of sets of informative genes measured by r.

                 k = 20   k = 30   k = 40   k = 50
  (A)  SIMPLE       15       19       24       29
       WINNOW       14       20       24       29
       GREEDY       16       24       33       40
  (B)  SIMPLE       13       17       22       29
       WINNOW       12       18       22       27
       GREEDY       17       25       34       43
  (C)  SIMPLE       10       15       18       21
       WINNOW        9       15       20       27
       GREEDY       15       23       31       38
From this table, it is seen that GREEDY always made correct predictions for the samples in the training data set, which again confirms that GREEDY is the best on the training data. However, for the test data set there was no significant difference among SIMPLE, WINNOW and GREEDY, so we cannot draw conclusions about the relative performance of the algorithms on test data.
5. Discussion
We have studied the problem of selecting informative genes for molecular classification of cancer. We defined this problem using threshold functions on Boolean variables, and we proposed a new algorithm (GREEDY) for the selection problem. The effectiveness of GREEDY on the training data set was confirmed using real expression data from leukemia patients; therefore, GREEDY might be useful when the training data set covers a wide range of samples. For the test data set, there was no significant difference among SIMPLE, WINNOW and GREEDY. It is surprising that SIMPLE was not worse than WINNOW and GREEDY on the test data, because SIMPLE does not take correlations between genes into account. Of course, if errors (items such that x_{i,j} ≠ class(s_j)) are distributed uniformly, SIMPLE will work well because all genes will be correlated similarly. Moreover, recent theoretical studies show that simple algorithms perform well for most classes of Boolean functions under the uniform distribution of samples (Arpe, 2003; Fukagawa, 2003; Mossel, 2003). This may be one reason why there was no significant difference. Another possible reason is that the
Table 6.3. Comparison of selection algorithms.

                        TRAINING                   TEST
                 k = 20   30   40   50    k = 20    30    40    50
  (A)  SIMPLE        0     1    1    1      6+1      7   5+1     7
       WINNOW        0     0    1    1      5+1      7   6+1     7
       GREEDY        0     0    0    0      7+1      9   6+1     7
  (B)  SIMPLE        2     2    2    2        5      6     7     9
       WINNOW        2     2    2    2        5      7     7    11
       GREEDY        0     0    0    0        5      7     7     8
  (C)  SIMPLE        0     1    1    1        1      1     1     1
       WINNOW        1     1    1    1        1      1     1     1
       GREEDY        0     0    0    0        1      1     1     1
test data set included a broader range of samples than the training data set (Golub, 1999), so the informative genes selected from the training data were not very useful for the test data. We did examine the case (case (B)) in which the training and test data sets were exchanged; but there the number of training samples might not have been sufficient, because this data set covered a broad range of samples. Overfitting is another possible reason. Since we do not have other appropriate data sets at present, further comparison using additional data sets is left as future work. In this chapter, we defined the selection problem using Boolean variables. However, definitions and algorithms using real-valued variables should also be important and may be more useful in practice. Indeed, many methods have been proposed for the selection of informative genes based on various techniques from AI, machine learning and statistics (see, for example, (Bagirov, 2003; Ding, 2003; Li, 2003)). Further studies on these approaches might lead to practical methods for the cancer classification problem.
Acknowledgments

This work was partially supported by the Grant-in-Aid for Scientific Research on Priority Areas, "Genome Science" and "Genome Information Science", and by the Grant-in-Aid for Scientific Research (B)(1) (No. 12480080) of the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan.
References

Akutsu, T., Miyano, S., and Kuhara, S. (2000). "Algorithms for Identifying Boolean Networks and Related Biological Networks Based on Matrix Multiplication and Fingerprint Function." Journal of Computational Biology 7:331–343.

Arpe, J. and Reischuk, R. (2003). "Robust Inference of Relevant Attributes." Lecture Notes in Artificial Intelligence 2842:9–113.

Bagirov, A. M., Ferguson, B., Ivkovic, S., Saunders, G., and Yearwood, J. (2003). "New Algorithms for Multi-class Cancer Diagnosis Using Tumor Gene Expression Signatures." Bioinformatics 19:1800–1807.

Blum, A. and Langley, P. (1997). "Selection of Relevant Features and Examples in Machine Learning." Artificial Intelligence 97:245–271.

Cole, K. A., Krizman, D. B., and Emmert-Buck, M. R. (1999). "The Genetics of Cancer - A 3D Model." Nature Genetics (Supplement) 21:38–41.

Cortes, C. and Vapnik, V. (1995). "Support-vector Networks." Machine Learning 20:273–297.

DeRisi, J. L., Iyer, V. R., and Brown, P. O. (1997). "Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale." Science 278:680–686.

Ding, C. H. Q. (2003). "Unsupervised Feature Selection Via Two-way Ordering in Gene Expression Analysis." Bioinformatics 19:1259–1266.

Fukagawa, D. and Akutsu, T. (2003). "Performance Analysis of a Greedy Algorithm for Inferring Boolean Functions." Lecture Notes in Artificial Intelligence 2843:114–127.

Furukawa, N., Matsumoto, S., Shinohara, A., Shoudai, T., and Miyano, S. (1996). "HAKKE: a Multi-strategy Prediction System for Sequences." Genome Informatics 7:98–107.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science 286:531–537.

Kononen, J., Bubendorf, L., Kallioniemi, A., Barlund, M., Schraml, P., Leighton, S., Torhorst, J., Mihatsch, M. J., Sauter, G., and Kallioniemi, O-P. (1998). "Tissue Microarrays for High-throughput Molecular Profiling of Tumor Specimens." Nature Medicine 4:844–847.

Li, J., Liu, H., Downing, J. R., Yeoh, A. E-J., and Wong, L. (2003). "Simple Rules Underlying Gene Expression Profiles of More Than Six Subtypes of Acute Lymphoblastic Leukemia (ALL) Patients." Bioinformatics 19:71–78.
Littlestone, N. (1988). "Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm." Machine Learning 2:285–318.

Littlestone, N. and Warmuth, M. K. (1994). "The Weighted Majority Algorithm." Information and Computation 108:212–261.

Mossel, E., O'Donnell, R., and Servedio, R. A. (2003). "Learning Juntas." Proc. 35th ACM Symposium on Theory of Computing, pp. 206–212. ACM Press.

Noda, K., Shinohara, A., Takeda, M., Matsumoto, S., Miyano, S., and Kuhara, S. (1998). "Finding Genetic Network from Experiments by Weighted Network Model." Genome Informatics 9:141–150.

Tsunoda, T., Hojo, Y., Kihara, C., Shiraishi, N., Kitahara, O., Ono, K., Tanaka, T., Takagi, T., and Nakamura, Y. (2000). "Diagnosis System of Drug Sensitivity of Cancer Using cDNA Microarray and Multivariate Statistical Analysis." In: Currents in Computational Molecular Biology, pp. 16–17. Universal Academy Press.
Chapter 7

FINDING FUNCTIONAL STRUCTURES IN GLIOMA GENE-EXPRESSIONS USING GENE SHAVING CLUSTERING AND MDL PRINCIPLE

Ciprian D. Giurcaneanu(1), Cristian Mircean(1,2), Gregory N. Fuller(2), and Ioan Tabus(1)

(1) Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
(2) Cancer Genomics Core Laboratory, Department of Pathology, The University of Texas M. D. Anderson Cancer Center, USA
1. Introduction
The recent technological advances in genomics have made it possible to measure simultaneously, under similar experimental conditions, the expression of thousands of genes, allowing unprecedentedly wide probing at the transcriptome level. The huge amount of data thus available calls for improved methods of data analysis, well grounded in classical statistics and providing fast and reliable processing. Finding clusters can be a preliminary step of data analysis, and is a valuable task in itself: the obtained clusters can convey useful information regarding similar expression patterns for a number of genes, thus allowing a dimensionality reduction of the data set; furthermore, a cluster of similar genes induces hypotheses regarding genes that act in synergy and lie on the same pathway influencing the studied disease. Traditional clustering methods are very appealing in gene data analysis, since they do not rely on any a priori knowledge or prior models of the data set. Clustering has traditionally been a method of choice for biological data; many existing clustering algorithms were developed for use in taxonomy, for the classification of plants and animals. Therefore the first option was to employ classical clustering methods on the increasingly large amounts of gene expression data, e.g. partitioning or hierarchical algorithms derived and well studied for other types of biological data. The importance of studying gene expression data for
understanding, diagnosing, and possibly influencing the course of a disease made it necessary to further develop clustering methods that provide multiple options for finding valuable structures in data. The general aim of clustering is to group together objects that are similar according to a certain criterion. This implies that each algorithm will identify a specific structure in a data set, based on the chosen similarity criterion. Different algorithms are sensitive to different types of structure, and will reveal different facets of the dependencies between the genes in the studied process. Understanding the inner workings of various clustering algorithms increases the ability to choose the proper algorithm for a given application. There are a number of surveys discussing classical clustering algorithms in the context of gene expression data. We concentrate on presenting a recent method, in two of its variants, since this method was specifically designed for clustering gene expression data. The original algorithm is called "Gene Shaving" (GS) (Hastie et al., 2000a), and we present it in a separate section. GS makes clever use of the geometrical structure of large matrices, and also requires a series of statistical decisions, which originally were made using statistics obtained by extensive sampling from random distributions; later, these decisions could be based on the simpler solutions offered by the minimum description length (MDL) principle (Giurcaneanu et al., 2004a). We also treat the problem of estimating the number of gene clusters in genomic data sets by analyzing two solutions that lead to different versions of the GS algorithm: GS-Gap and GS-MDL (Hastie et al., 2000a; Giurcaneanu et al., 2004a). The motivations of the two estimation approaches differ; the latter is derived using the minimum description length principle (Rissanen, 1978). Since most clustering algorithms require the user to provide the number of clusters as input, accurate estimation of the number of clusters is extremely important. Many traditional estimators are analyzed and compared in Milligan et al. (1985) in a generic context, while studies closer to genomic problems can be found in Giurcaneanu et al. (2004a); Giurcaneanu et al. (2004b); Tabus et al. (2003), including the principled MDL method for deciding the number of clusters in gene expression data. We apply the two versions of the GS algorithm to the glioma data set studied in Fuller et al. (submitted), who propose to alleviate the subjectivity in glioma classification using gene expression profiling. Their voting algorithm is able to provide more subtle information regarding the molecular similarities to neighboring classes, specifically for mixed gliomas. A number of computational algorithms have been used to perform molecular classification based on groups of genes (Fuller et al., submitted; Nutt et al., 2003).
Gliomas are usually classified into two lineages (astrocytic and oligodendrocytic) and three grades: low-grade (astrocytoma and oligodendroglioma), mid-grade (anaplastic astrocytoma and anaplastic oligodendroglioma), and high-grade (glioblastoma multiforme). The class label is based on morphological features assessed by pathologists, such as cellularity, mitotic index, and the presence of necrosis. This classification scheme has limitations because a specific glioma can be at any stage of the continuum of cancer progression and may contain mixed features. In this work, we focus on the gene expression differences that underlie the morphological classification of glioma subtypes. We seek to capture the feature-genes that mark the transition from lower grades to higher grades (see Caskey et al., 2000 for a review), the increased aggressiveness of the glioblastoma multiforme (GM) subtype, and the lineage differentiation of gliomas. The initial data contain 2303 different genes (see Taylor et al., 2001) observed over p = 49 different samples (experiments). The number of feature-genes was reduced in several steps to N = 121, for which an optimal classifier was obtained for the four glioma subtypes on quantized values (Fuller et al., submitted). In the next section we describe the procedures used. Grouping the genes of the obtained classifier into clusters that express similar patterns over patients is a way to provide insight into the molecular basis of gliomas and to reveal intrinsic interactions at the molecular level between components with common function. If a cluster is enriched for a certain function, we expect with higher probability to find genes that act together in a possible chained path. Further efforts can concentrate on these genes for finding molecular pathways and validating them with biological procedures.
2. Description of Processing the Glioma Data Set
Patient samples and microarray experiments. The cDNA microarray used for this experiment included 2,300 genes printed in duplicate. The sequences were verified before printing in the Cancer Genomics Core Laboratory of M. D. Anderson Cancer Center; details are described in (Taylor et al., 2001). The glioma tissues were obtained from the Brain Tumor Tissue Bank of M. D. Anderson Cancer Center with the approval of the Institutional Review Board. The original pathological assessment of the tissues was reviewed by a pathologist (GNF). Procedural details related to RNA isolation, microarray experiments, and imaging analysis are described in (Fuller et al., 1999; Shmulevich et al., 2002; Kobayashi et al., 2003). The 49 cases include 27 glioblastomas (GM), 7 anaplastic astrocytomas (AA),
6 anaplastic oligodendrogliomas (AO), 3 oligodendrogliomas (OL), 5 mixed gliomas, and one case of atypical meningioma.

Gene selection in a leave-one-out cross-validation (LOO-CV) procedure. The gene selection process is in itself very important, revealing information about the genes important for glioma diagnosis and, ultimately, for the performance of the classifier. As a first step, we select genes bearing high discrimination power for the four glioma types. These genes are determined such that the performance of a given classifier is optimal in a cross-validation experiment, so that the genes perform well on sets different from the ones used for training. We ranked the genes according to the Fisher discriminant, and then fed sets of r top-ranked genes to a κ-nearest-neighbor (κ-NN) classifier. The parameters, r and the number of neighbors κ, are selected in a leave-one-out cross-validation (LOO-CV) procedure that leaves, in turn, one patient outside the training set. The resulting errors have a granularity of 1/43. To overcome this granularity, we could use a scheme that randomizes the training and test sets, but the small number of patients does not permit estimating the classifier parameters in subsequent splits inside the cross-validation loop with large test sets. We group the expressions of genes in a matrix, such that M_unq(i, j) is the expression of gene i in sample j. In the preprocessing step, we removed the genes with inaccurately replicated spots (see Mircean et al., 2004 for a description of the method) and left out the mixed cases; the algorithm uses only 43 patients out of the total p = 49 samples. After flagging the spots considered to be incorrectly measured, we estimate the expression of gene M_unq(i, j) as the mean of its replicates; we quantize to four levels using the Lloyd algorithm separately for each patient, and remove genes with low variance over patients. We denote by M_q the resulting set of genes, reduced from 2303 to 1826 genes. In cross-validation, the following procedure is applied for each j ∈ {1, 2, ..., p}: the j-th column of the matrix M_q is chosen as the test set, and the rest of the columns of M_q form the training set, denoted M_q,j^training. For simplicity we drop the index j, and the notation becomes M_q^training. On the training set only, the genes are ranked using the Fisher discriminant, and we select the number of retained features in steps r ∈ {10, 20, ..., 70} and the number of neighbors of κ-NN, κ ∈ {1, 2, ..., 7}, splitting M_q^training again with the LOO-CV procedure. For each selection we observe the misclassification errors on M_q^training, collected in a table of size r × κ × 41. We use the r* and κ* that give the lowest observed error over all patients in M_q^training.
For each j-th sample, the vote vector (see Fuller et al., submitted) is saved (the figures show these votes in the second column). We further use this set of N = 121 genes because we consider them very likely to play a special role in the discrimination of the four glioma subtypes, since they minimize the discrimination errors. We continue the gene shaving experiments using the N = 121 genes selected by the LOO-CV procedure from the set M_unq, observed in p = 49 different patients (samples). We denote by X the retained matrix, of size N × p. Our goal is to identify groups of genes that operate similarly in the considered experiments.
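A compact sketch of the inner loop (Fisher-type ranking followed by κ-NN with LOO-CV) is given below, using scikit-learn as a stand-in; in the actual procedure this is applied only to the training fold M_q^training of each outer split, so that the ranking never sees the held-out patient. The function names are ours.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def fisher_ratio(X, y):
    """Per-gene multi-class Fisher-type score: between- over within-class variance."""
    overall = X.mean(axis=0)
    classes = np.unique(y)
    bss = sum((y == c).sum() * (X[y == c].mean(axis=0) - overall) ** 2 for c in classes)
    wss = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0) for c in classes)
    return bss / (wss + 1e-12)

def loo_error(X, y, r, kappa):
    """LOO-CV misclassification error of kappa-NN on the r top-ranked genes."""
    top = np.argsort(fisher_ratio(X, y))[::-1][:r]
    pred = cross_val_predict(KNeighborsClassifier(n_neighbors=kappa),
                             X[:, top], y, cv=LeaveOneOut())
    return float((pred != y).mean())

# grid over r in {10, ..., 70} and kappa in {1, ..., 7}, as in the text:
# best = min(((r, k) for r in range(10, 80, 10) for k in range(1, 8)),
#            key=lambda rk: loo_error(X_train, y_train, *rk))
```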
3. A Brief Review of "Gene Shaving" (GS)
We briefly revisit the strategy of the GS algorithm. Each cluster S_n found by GS is a set of n genes (out of N), where n and the set of genes are selected to satisfy the following requirements: (a) all n genes have a similar profile over the p experiments; (b) the variance of the cluster center (mean) of S_n is the largest among all possible groups of n genes; (c) the value of n is such that the cluster obtained is the largest cluster containing only similar genes. These requirements correspond to the intuition of the biologist looking for genes that operate similarly and whose measurements vary significantly from one patient to another. There is no interest in genes with a flat profile over the various conditions, since their measurements are almost the same for all types of analyzed tumors, and consequently they cannot be used in diagnosis. To keep the computational complexity at a reasonable level, GS uses the largest principal component of the rows of X to select the genes according to the above requirements. After all rows of the matrix X are centered to have zero mean, a sequence of nested clusters S_N ⊃ S_{n_1} ⊃ S_{n_2} ⊃ ··· ⊃ S_1 is yielded by an iterative procedure. The subscript of the set S_i denotes the number of genes contained in it; thus the largest cluster in the chain contains all N genes, and the smallest cluster contains just one gene. The nested clusters are generated as follows: first the principal component of the rows of X is computed, then α = 10% of the rows of X having the smallest inner product with this principal component are eliminated, and the remaining genes are grouped in the cluster S_{n_1}. The procedure continues iteratively, shaving off a proportion α of the genes kept at the previous iteration, until the cluster S_1 is found. Each set S_i has an associated i × p matrix containing the expressions of the genes kept in the cluster, and the principal component of that matrix gives the preferred direction of the genes to be kept after each shaving-off stage. This process ensures that the candidate clusters, in their decreasing size sequence,
are becoming more and more "pure", having the genes better and better aligned to the principal component of the matrices associated with the candidate clusters. Once proper candidates are generated by the consecutive shaving process, one needs to choose the optimal cluster from the nested sets S_N, S_{n_1}, S_{n_2}, ..., S_1. There are two ways to choose a suitable set: by using the Gap statistic (as proposed in the original GS) or by an MDL selection. Both are reviewed in this survey. After the selection of the most appropriate set as a cluster, a new cluster can be obtained by "banning off" the direction of the principal component of the previously extracted cluster from the following stages. That can be achieved either by orthogonalizing all the rows of X with respect to the direction of the principal component previously extracted (Hastie et al., 2000a), or by simply removing from the matrix X the genes found in all previous clusters (Giurcaneanu et al., 2004a). Depending on the "banning-off" procedure used, the iterative process of extracting clusters continues for p successive clusters when using orthogonalization, or until only p genes are left in the matrix X when the genes of the found clusters are iteratively removed.

The Gap statistic selection of the optimal cluster size is a simple process of discriminating the structure (regularity) of a cluster by checking it against the statistics of randomly permuted data. The nested clusters have been selected such that the variance of the cluster mean is high, the cluster center being almost aligned to the first principal component, which by definition has the highest possible variance. Gap then chooses the cluster S_{n*} containing genes that are very similar to each other and simultaneously have a high degree of dissimilarity with the genes not included in S_{n*}. Based on the analogy with the analysis of variance (Mardia et al., 1979), the criterion used in cluster selection relies on computing for S_n the within variance

$$V_n^W = \frac{1}{p} \sum_{j=1}^{p} \left[ \frac{1}{n} \sum_{i \in S_n} \left( x_{ij} - \mathrm{ave}_j(S_n) \right)^2 \right],$$

and the between variance

$$V_n^B = \frac{1}{p} \sum_{j=1}^{p} \left( \mathrm{ave}_j(S_n) - \mathrm{ave}(S_n) \right)^2,$$

where ave_j(S_n) = (1/n) Σ_{i∈S_n} x_{ij} is the average of the measurements in cell line j that correspond to genes in S_n, and ave(S_n) = (1/(np)) Σ_{i∈S_n} Σ_{j=1}^{p} x_{ij}. The cluster S_n is recommended for selection when the criterion

$$D_n = 100 \, \frac{V_n^B / V_n^W}{1 + V_n^B / V_n^W}$$
takes large values. To check that D_n is larger than one could expect by chance, the value of the Gap statistic is evaluated. The procedure requires generating a number of matrices, each obtained by applying a different random permutation to the columns of X. For example, assume that 20 such matrices are generated, denoted X_1, ..., X_20. Nested clusters are identified for every matrix X_i, 1 ≤ i ≤ 20, using the same algorithm employed for the original matrix X; let D_n(X_i) be the criterion computed for the cluster of size n of the matrix X_i. The Gap statistic is given by

$$\mathrm{Gap}(n) = D_n - \mathrm{ave}(D_n),$$

where ave(D_n) = (1/20) Σ_{i=1}^{20} D_n(X_i), and the optimal cluster size is

$$n^* = \arg\max_n \mathrm{Gap}(n).$$
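The following Python sketch implements the shaving iteration, the variance ratio D_n, and the Gap selection (our naming; the column randomization follows the description above):

```python
import numpy as np

def shave(X, alpha=0.10):
    """Yield the nested candidate clusters S_N ⊃ S_{n_1} ⊃ ... ⊃ S_1 as row subsets."""
    rows = np.arange(X.shape[0])
    while True:
        yield X[rows]
        if len(rows) == 1:
            return
        Xc = X[rows] - X[rows].mean(axis=1, keepdims=True)
        pc = np.linalg.svd(Xc, full_matrices=False)[2][0]   # leading principal component
        align = np.abs(Xc @ pc)                             # inner product with the PC
        keep = max(1, int(np.floor(len(rows) * (1 - alpha))))
        rows = rows[np.argsort(align)[-keep:]]              # shave off the least aligned

def d_n(Xc):
    """Criterion D_n = 100 * (V_B/V_W) / (1 + V_B/V_W) for one candidate cluster."""
    col_mean = Xc.mean(axis=0)                              # ave_j(S_n)
    v_w = ((Xc - col_mean) ** 2).mean()                     # within variance V_n^W
    v_b = ((col_mean - Xc.mean()) ** 2).mean()              # between variance V_n^B
    r = v_b / (v_w + 1e-12)
    return 100.0 * r / (1.0 + r)

def gap_select(X, n_perm=20, seed=0):
    """Index (into the nested sequence) of the cluster maximizing Gap(n)."""
    rng = np.random.default_rng(seed)
    d_real = np.array([d_n(Xc) for Xc in shave(X)])
    d_perm = np.zeros((n_perm, len(d_real)))
    for t in range(n_perm):
        Xp = np.column_stack([rng.permutation(col) for col in X.T])  # permute each column
        d_perm[t] = [d_n(Xc) for Xc in shave(Xp)]
    return int(np.argmax(d_real - d_perm.mean(axis=0)))
```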
Once the optimal cluster is found, each row of X is orthogonalized with respect to the vector representing the mean of S_{n*}, and the procedure continues until an a priori indicated number of clusters is found. GS proved to be a successful algorithm for clustering gene expression arrays (Hastie et al., 2000b), but we have to note the computational burden due to the need to cluster not only the matrix X, but also the other twenty matrices {X_i, 1 ≤ i ≤ 20}. Moreover, the Gap statistic is mostly heuristically motivated. In Giurcaneanu et al. (2004a), a principled method was introduced for the selection of the optimal cluster. The method relies on the celebrated MDL principle (Minimum Description Length; Rissanen, 1978) and leads to a fast algorithm for optimal cluster selection. To distinguish between the two forms of the GS algorithm, we call GS-Gap the algorithm proposed by Hastie et al. (2000a) and GS-MDL the modified version introduced by Giurcaneanu et al. (2004a). In the next section we briefly present the GS-MDL algorithm in connection with probability models for gene clustering.
4. The GS-MDL Clustering Algorithm
In this section we show how the MDL principle can lead to a method for choosing the optimal cluster from the set S_N, S_{n_1}, S_{n_2}, ..., S_1 without resorting to the evaluation of the Gap statistic. The idea is very simple: we just need a wand that points at S_N and shows us how many clusters it contains. If S_N contains one single cluster, then we decide that S_N is the
optimal choice; otherwise, we use the wand again until finding an index n* such that S_{n*+1} contains more than one cluster and S_{n*} contains exactly one cluster. We conclude that S_{n*} is the optimal choice. We explain in the sequel how MDL can provide the desired wand. The roots of MDL are in information theory; it allows selecting the best parametric model for a data set using as criterion the minimum description length or, equivalently, the minimum code length. We emphasize that all the considered models belong to a finite family that will be defined later. The MDL principle does not search for the "true" model of the observed data, but selects the best model from an a priori defined family. The selection relies on a scenario of transmitting the whole data set from a hypothesized encoder to a decoder, where the encoder is constrained to use only models from the given family. In the most celebrated form of MDL, the code length is evaluated as the sum of two parts:
- Choose a model from the family and tune its parameters to obtain the best fit to the given data set. Practically, this step corresponds to finding the maximum likelihood estimators of the model parameters. The estimated values are then sent to the decoder at a cost of (1/2) log N for each parameter, where N is the number of samples in the data set. The symbol log(·) stands for the natural logarithm, so the code length is expressed in nats; taking base-two logarithms would express it in bits. Therefore the first term of the code length is (q/2) log N, where q denotes the number of parameters of the model.

- Once both the encoder and the decoder know the values of the parameters, it remains only to encode the samples in the data set according to the chosen model. This leads to a code length that equals the minus logarithm of the maximum likelihood.
Remark that the code length is not constrained to be an integer. This is not a difficulty, since we are not interested in realizable code lengths, but rather use code lengths to compare various models. To compute the two-part code length criterion described above, we have to resort to probabilistic models, which appear frequently in clustering.
4.1 Background on Mixture Models for Gene Expression Data and Traditional Estimation Methods
The effort of developing mathematical models for clustering gene expression data obtained with various technologies has not been very extensive so far. Most of the recent papers (Dougherty et al., 2002; Giurcaneanu et al., 2004a; Hastie et al., 2000a; Yeung et al., 2001) that treat model-based clustering for gene expression arrays have shown that the finite mixture model can be successfully applied. In a finite mixture model, a multivariate distribution is associated with each gene cluster. To fix ideas, let us assume that the number of gene clusters does not exceed p, the total number of cell lines. Following the approach of Hastie et al. (2000a), we make the hypothesis that the number of "signal" clusters is K, and that there also exists a "noise" cluster. The last cluster groups all genes with flat profile over all conditions, and consequently its mean is hypothesized to be a 1 × p row vector with all entries zero. The mean vectors of the "signal" clusters are denoted by b_1^T, b_2^T, ..., b_K^T. Additionally, we assume that the matrix whose rows are b_1^T, b_2^T, ..., b_K^T is full-rank. Moreover, the covariance matrix is assumed to be the same for all clusters and equal to σ²I, where σ² is a parameter of the model and I denotes the p × p identity matrix. Under mixture sampling, the rows x_1^T, x_2^T, ..., x_N^T of the matrix X are drawn at random from the mixture density, such that the number of observations from each cluster has a multinomial distribution with sample size N and probability parameters p_1, p_2, ..., p_{K+1}. The results reported in (Hastie et al., 2000a; Hastie et al., 2000b; Yeung et al., 2001) encourage us to use such simple parametric models. Finite mixture models have been extensively studied in statistics (Redner et al., 1984), and they are suitable for the application of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Various instances of the use of the EM algorithm for clustering based on finite mixture models, generically named Classification-Expectation-Maximization (CEM), are investigated in Celeux et al. (1995). The EM algorithm is designed for incomplete data, which recommends it for clustering. The aim of gene clustering, when the data set is recorded in matrix X, is to assign to each row x_i^T of X a cluster label. Following the approach of Celeux et al. (1995), we assign to x_i^T a row vector v_i^T whose length equals the number of clusters, K + 1. If x_i^T is assigned to cluster k, then the k-th entry of v_i^T takes value one and all other entries are zeros. One can easily observe that the "complete" data are given by the pairs (x_i^T, v_i^T). To illustrate how the EM algorithm can be employed in gene clustering, we consider the well-known case where the distribution of each cluster is Gaussian with covariance matrix σ²I, the parameter σ² being unknown. Therefore the set of parameters is

Θ = { V, p_1, p_2, ..., p_{K+1}, b_1^T, b_2^T, ..., b_{K+1}^T, σ² },
where b_k^T are the mean vectors of the clusters, and p_k are the mixing proportions, 0 < p_k < 1, Σ_{k=1}^{K+1} p_k = 1. The CEM algorithm is an iterative procedure that searches for the values of the parameters maximizing the log-likelihood function

$$L(x_1^T, x_2^T, \ldots, x_N^T; \Theta) = \sum_{i=1}^{N} \sum_{k=1}^{K+1} v_{ik} \log \left[ p_k \, g_k\!\left(x_i^T \mid b_k^T, \sigma^2 I\right) \right],$$

where the notation g_k(·|·) is used for the Gaussian distribution. We briefly revisit each step of the algorithm (Celeux et al., 1995):
M step. Assume that every gene is assigned to a cluster, or equivalently that the entries of the matrix V are fixed. Denote by η_k the total number of genes assigned to cluster k. Elementary calculations show that the log-likelihood function is maximized by selecting the parameter values

$$\hat p_k = \frac{\eta_k}{N}, \qquad \hat b_k^T = \frac{1}{\eta_k} \sum_{i=1}^{N} v_{ik} x_i^T, \qquad \hat\sigma^2 = \frac{\mathrm{tr}(W)}{Np},$$

where tr(W) is the trace of the within-cluster scattering matrix $W = \sum_{k=1}^{K+1} \sum_{i=1}^{N} v_{ik} (x_i - \hat b_k)(x_i - \hat b_k)^T$.

E step. For every row x_i^T, the probability of x_i^T conditional on cluster k is estimated as

$$t_k(x_i^T) = \frac{\hat p_k \, g_k(x_i^T \mid \hat b_k^T, \hat\sigma^2 I)}{\sum_{l=1}^{K+1} \hat p_l \, g_l(x_i^T \mid \hat b_l^T, \hat\sigma^2 I)}.$$

C step. For every index i ∈ {1, 2, ..., N}, the entries of the row v_i^T are modified so that

$$v_{ik} = \begin{cases} 1, & k = k^* \\ 0, & \text{otherwise,} \end{cases}$$
where k* = arg max_k t_k(x_i^T). The algorithm is initialized with an arbitrary matrix V, and the iterations stop when a convergence criterion is fulfilled. In practical applications CEM is started from many different random points. Each initialization can lead to a different convergence point, and the one that maximizes
the log-likelihood function is chosen as the final result. Observe that we have described the algorithm for the spherical model, which is particularly interesting for our application; more complex models are studied in Celeux et al. (1995). We have assumed that the true number of clusters is given as input to the algorithm. When the number of clusters is not known a priori, it is estimated using the Bayesian Information Criterion (BIC; Schwarz, 1978). A discussion on the use of CEM in conjunction with BIC can be found, for example, in Fraley et al. (1998); we note that BIC is not the only criterion applied for estimating the number of clusters. In the general case, BIC and the two-part code MDL have very similar expressions, even though their theoretical motivations are rather different. For the reader interested in information-theoretic criteria, we refer to two tutorials, Barron et al. (1998) and Stoica et al. (2004). We emphasize that the use of the MDL principle in GS-MDL represents a totally different approach than the tandem CEM-BIC. It is well known that the convergence of the EM algorithm can be very slow, and EM also has the drawback that it can converge to local maxima. The interested reader can find in Giurcaneanu et al. (2004a) a comparison of the performances of GS-MDL and the CEM algorithm on artificial data that mimic gene expression arrays: GS-MDL is significantly faster than CEM at the same level of accuracy in assigning genes to clusters.
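For concreteness, here is a minimal Python sketch of the CEM iteration for the spherical model above; initialization, empty-cluster handling, and multiple restarts are simplified, and the names are ours.

```python
import numpy as np

def cem(X, K, n_iter=100, seed=0):
    """Classification EM for a (K+1)-component spherical Gaussian mixture.

    X : (N, p) matrix of gene profiles; returns hard cluster labels in {0, ..., K}.
    """
    rng = np.random.default_rng(seed)
    N, p = X.shape
    labels = rng.integers(0, K + 1, size=N)        # arbitrary initial assignment matrix V
    for _ in range(n_iter):
        # M step: mixing proportions, cluster means, common spherical variance
        pi = np.array([(labels == k).mean() for k in range(K + 1)])
        B = np.array([X[labels == k].mean(axis=0) if (labels == k).any() else np.zeros(p)
                      for k in range(K + 1)])
        sigma2 = np.mean((X - B[labels]) ** 2)     # tr(W) / (N p)
        # E step: log t_k(x_i) up to an additive constant, under N(b_k, sigma2 I)
        sq = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        log_t = np.log(pi + 1e-12) - sq / (2.0 * sigma2)
        # C step: hard assignment to the most probable cluster
        new_labels = log_t.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # convergence criterion fulfilled
        labels = new_labels
    return labels
```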
4.2 MDL Estimation of the Number of Clusters
Let us investigate how to apply the MDL principle for estimating the number of gene clusters under the particular finite mixture hypothesis discussed above, without the need to run the slowly convergent EM algorithm. For every possible number of clusters k, we have to compute the best code length, which implicitly means that we have to label each gene with a cluster name. Considering all possible partitions of N genes into k clusters and then evaluating the code length for every case is not a practical way to solve the problem. We emphasize that we just need a fast method to estimate the number of clusters; we do not need to decide which gene belongs to which cluster, since this is successfully solved by the GS algorithm. The key observation that leads to the design of the GS-MDL method relates to the particular structure of the mixture covariance matrix, denoted here Σ. It was shown in Giurcaneanu et al. (2004a) that the eigenvalues of Σ can be sorted such that

$$\lambda_1(\Sigma) > \lambda_2(\Sigma) > \cdots > \lambda_K(\Sigma) > \lambda_{K+1}(\Sigma) = \lambda_{K+2}(\Sigma) = \cdots = \lambda_p(\Sigma) = \sigma^2.$$
Therefore it is enough to count how many eigenvalues of Σ are larger than σ², and we obtain at once the number of clusters. Unfortunately, in practice we do not know the "true" mixture covariance matrix Σ, but only the covariance matrix estimated from the data, denoted S. The eigenvalues of S are all different with probability one, so we can write without loss of generality: λ_1(S) > λ_2(S) > ··· > λ_p(S). The problem has similarities with the estimation of the number of sources, which is well known in the signal processing community and for which an elegant MDL-based solution was proposed in Wax et al. (1985). The GS-MDL method is inspired by this solution. Since the application of the MDL principle strongly requires a coding scenario, let us assume that each row x_i^T of the matrix X is encoded independently. Moreover, to perform the encoding, we assume that each row x_i^T is Gaussian distributed with mean b^T and covariance matrix Σ, where b^T = Σ_{k=1}^{K} p_k b_k^T. This coding procedure is not optimal in terms of compression results, but it allows us to estimate the number of gene clusters in a fast way. The likelihood function depends only on b^T and Σ, and Σ can furthermore be expressed using only its eigenvalues and eigenvectors. The ML estimator for the mean is b̂ = (1/N) Σ_{i=1}^{N} x_i. The ML estimators for the eigenvalues λ_i(Σ) and the eigenvectors u_i are given by a famous result of Anderson (1963):

$$\hat\lambda_i(\Sigma) = \frac{N-1}{N} \lambda_i(S), \quad i \in \{1, 2, \ldots, K\},$$
$$\hat\sigma^2 = \frac{N-1}{N(p-K)} \sum_{i=K+1}^{p} \lambda_i(S),$$
$$\hat u_i = c_i, \quad i \in \{1, 2, \ldots, K\},$$

where λ_1(S), λ_2(S), ..., λ_p(S) and c_1, c_2, ..., c_K are, respectively, the eigenvalues and the eigenvectors of the sample covariance matrix S. The log-likelihood function reduces to

$$\log \left( \frac{\left( \prod_{i=K+1}^{p} \lambda_i(S) \right)^{1/(p-K)}}{\frac{1}{p-K} \sum_{i=K+1}^{p} \lambda_i(S)} \right)^{\frac{(p-K)N}{2}} \qquad (7.1)$$
We now focus on the second term of the two-part MDL criterion, which is mainly given by the number of parameters that must be sent from the encoder to the decoder. In the analyzed case, we have to transmit the estimated entries of the mean vector b and of the matrix Σ. Instead of sending all entries of Σ one by one, we transmit the eigenvectors and the eigenvalues of Σ, relying on the following identity implied by the spectral representation theorem:

$$\Sigma = \sigma^2 I + \sum_{k=1}^{K} \left( \lambda_k(\Sigma) - \sigma^2 \right) u_k u_k^T,$$

where u_k, 1 ≤ k ≤ K, are the eigenvectors of Σ and I is the p × p identity matrix. We can easily count the number of parameters:

- The mean vector b^T has length p, therefore the number of parameters is p. Since p does not depend on K, we ignore the cost of transmitting the mean vector to the decoder.

- The eigenvalues of Σ are all real-valued. Recall that K of them are distinct and the remaining p − K are equal to σ², therefore the number of parameters is K + 1.

- The eigenvectors of Σ have all entries real. Since there are K eigenvectors, each of length p, this gives Kp parameters. Each eigenvector is constrained to have norm 1, so only p − 1 entries of each eigenvector are independent parameters. Similarly, the orthogonality constraints on the eigenvectors reduce the number of parameters by K(K−1)/2.

The total number of parameters is therefore $1 + K\frac{2p - K + 1}{2}$, which together with equation (7.1) leads to the following MDL criterion:

$$\hat K = \arg\min_k \left[ (p-k) N \log \frac{A_k}{G_k} + \frac{k(2p-k+1)}{2} \log N \right], \qquad (7.2)$$

where $A_k = \frac{1}{p-k} \sum_{i=k+1}^{p} \lambda_i(S)$ and $G_k = \left( \prod_{i=k+1}^{p} \lambda_i(S) \right)^{1/(p-k)}$. Observe that A_k is the arithmetic mean of the last p − k eigenvalues of S, and G_k is the geometric mean of the same eigenvalues. We know from an elementary inequality that A_k ≥ G_k, with equality if and only if λ_{k+1}(S) = λ_{k+2}(S) = ... = λ_p(S), which leads to the conclusion that the first term of the criterion is nonnegative. The second term is obviously positive and penalizes the number of parameters. The criterion has an intuitive form and is easy to compute. The question remains whether such a criterion can lead to an accurate estimation of the number of clusters despite the various approximations made in deriving it. The answer is given by a result from Giurcaneanu et al. (2004a), where the consistency of the MDL criterion
(7.2) is proven. Therefore, if the number of samples N is large enough, the estimated number of clusters is guaranteed to be exact. The result is interesting since Giurcaneanu et al. (2004a) also prove that another famous criterion, Akaike's criterion (Akaike, 1974), is not consistent even though its form is close to the MDL criterion. In fact, a larger family of consistent estimators is discussed in Giurcaneanu et al. (2004a), and among them criterion (7.2) is shown to be the best in experiments with simulated data. To conclude this section on the use of MDL for finding the number of clusters, we mention several related methods. In the derivation of the GS-MDL criterion, the crucial role is played by the eigenvectors and the eigenvalues of the covariance matrix. Heuristic methods have long been applied to determine how many eigenvalues are small and correspond just to noise; in this category we mention graphical methods (Cattell, 1966) and comparisons of the eigenvalues with heuristically chosen thresholds (Everitt et al., 2001). All these methods have recently become popular in the bioinformatics community, especially in connection with the application of Singular Value Decomposition (SVD) to microarray data analysis. In this approach, the left singular vectors ('eigenassays'), the right singular vectors ('eigengenes'), and the singular values are used for the biological interpretation of the data recorded in matrix X (Wall et al., 2002). An implementation of this method, named SVDMAN, is described in Wall et al. (2001), where a parallel is also drawn between GS-Gap and SVDMAN.
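Criterion (7.2) is straightforward to implement; the sketch below (our naming) assumes N > p so that the sample covariance has strictly positive eigenvalues.

```python
import numpy as np

def mdl_num_clusters(X):
    """Estimate the number of "signal" clusters with criterion (7.2).

    X : (N, p) data matrix (genes x samples), N > p assumed.
    """
    N, p = X.shape
    S = np.cov(X, rowvar=False)                   # sample covariance matrix S
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]    # lambda_1(S) >= ... >= lambda_p(S)
    crit = np.empty(p - 1)
    for k in range(p - 1):                        # candidate numbers of clusters
        tail = lam[k:]                            # the p - k smallest eigenvalues
        a_k = tail.mean()                         # arithmetic mean A_k
        g_k = np.exp(np.log(tail).mean())         # geometric mean G_k
        crit[k] = (p - k) * N * np.log(a_k / g_k) \
                  + 0.5 * k * (2 * p - k + 1) * np.log(N)
    return int(np.argmin(crit))
```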
5. Functional Insights in Clustering Glioma Gene-Expression
In the original GS-Gap version, the clusters found are non-exclusive, in the sense that the same gene can potentially be assigned to several different clusters. The same non-exclusive property occurs in the case of the SVDMAN algorithm. Our aim is to partition the gene set into non-overlapping clusters; consequently, once a cluster is identified by the GS-MDL algorithm, all the corresponding genes are eliminated and a new matrix X is generated. The rows of X are not orthogonalized with respect to the average gene of the last found cluster. Therefore, at each step the number of rows of matrix X decreases, while the number of columns remains constant. We stop the procedure when the number of genes in X becomes smaller than the number of columns of X. After running the GS-MDL algorithm, 72 genes are grouped in 16 different clusters, and the remaining 49 genes form the last big cluster. In our
analysis we focus on the 16 clusters, labeled with numbers from 1 to 16 according to the order in which they were found by the algorithm. We also ran the GS-Gap algorithm, which groups the 121 genes in 11 clusters. GS groups together highly correlated genes, either positively or negatively correlated. Clusters are approximately aligned to the principal components of the data; e.g., the first cluster is aligned to the first principal component. We are interested in the following three molecular-biological discrimination problems:
- The transition from low grades to the most aggressive grade (i.e., Oligodendroglioma Low Grade, Anaplastic Oligodendroglioma and Anaplastic Astrocytoma vs. Glioblastoma Multiforme);

- The key differences between the two lineages of glioma (Anaplastic Astrocytoma vs. Anaplastic Oligodendroglioma and Oligodendroglioma Low Grade);

- The transition from the lowest glioma grades to high grades (Oligodendroglioma Low Grade vs. all others).
For each of the problems we evaluate the clusters generated by GS-Gap and GS-MDL using the discrimination power of the average gene, expressed by its BSS/WSS statistic (see Fig. 7.1). We selected the cluster with the highest discrimination power for each of the three molecular-biological discrimination problems and compared the functions of the genes in the clusters selected by the two algorithms. Then, we evaluated the implications of each gene with GoMiner (Zeeberg et al., 2003) and the NCBI Entrez database (http://www.ncbi.nlm.nih.gov). The biological knowledge revealed by the two algorithms is comparable. Both algorithms suggest that the transition from low grades to the most aggressive grade (Oligodendroglioma Low Grade, Anaplastic Oligodendroglioma and Anaplastic Astrocytoma vs. Glioblastoma Multiforme) is predominantly due to the "Cell growth" function (see Fig. 7.2). The SRPR (signal recognition particle receptor) gene is present in the clusters generated by both GS-Gap and GS-MDL. In mammalian cells, the SRP receptor is required for the targeting of nascent secretory proteins to the ER (endoplasmic reticulum) membrane; among cellular physiological processes, the SRP receptor is involved in intracellular protein transport of growing cells. We also identify the implication of members of the POU transcription factor family together with highly preponderant "Cell growth" genes such as
Figure 7.1. The power of discrimination between classes (AA, AO, OL and GM) for the three important discrimination problems. Left: the clusters obtained from Gene Shaving with Minimum Description Length (GS-MDL); right: the clusters of Gene Shaving with Gap (GS-Gap). In each panel, for every cluster, the power of discrimination is measured by the ratio of the Between Sum of Squares to the Within Sum of Squares (BSS/WSS) for: 1) the transition from the lowest glioma grades to high grades (Oligodendroglioma Low Grade vs. all others); 2) the transition from low grades to the most aggressive grade (Oligodendroglioma Low Grade, Anaplastic Oligodendroglioma and Anaplastic Astrocytoma vs. Glioblastoma Multiforme); 3) discrimination between the two lineages of glioma (Anaplastic Astrocytoma vs. Anaplastic Oligodendroglioma and Oligodendroglioma Low Grade).
Figure 7.2. The transition from low-medium grades to the highest grade (all vs. GM) is influenced by genes from the "Cell growth" function family. Gene Shaving with Minimum Description Length (GS-MDL) and Gene Shaving with Gap (GS-Gap) collect genes of the same functional groups in the clusters (Cluster 3 and Cluster 4) that have the highest discriminatory power for this problem. SRPR (signal recognition particle receptor, a 'docking protein') is present in both selections. Other relevant genes are FADD (Fas-TNFRSF6 associated via death domain), POU6F1 (POU domain, class 6, transcription factor 1), and IGFBP1 (insulin-like growth factor binding protein 1). For the four-type classification problem, Cluster 3 (GS-MDL) ranks first and Cluster 4 (GS-Gap) second. We consider the genes in this cluster to be related to the evolution of glioma aggressiveness. (See the description of the color map and table content in the caption of Fig. 7.3.)

[Figure 7.2 table: genes and properties selected by GS-MDL, with the "average gene" profile and patient labels; the listed genes include CLCN6 (chloride channel 6), SRPR (signal recognition particle receptor), and MYO6 (myosin VI), with "Cell growth" functions and per-class mean/standard deviation of expression.]
IGFBP1 (insulin-like growth factor binding protein 1), MYO6 (myosin VI), MATN2 (matrilin 2), and DCLRE1 (DNA cross-link repair 1A). The POU domain transcription factors are a subgroup of homeodomain proteins that appear to control cellular phenotypes. POU-domain factor 1 (POU6F1) is expressed in dispersed populations of neurons in the dorsal hypothalamus, in the medial habenula, and in subsets of ganglion and amacrine cells in the retina (Andersen et al., 1993). No specific function has been proposed, and a knockout study has not been reported. In epithelial cells, Myosin VI (MYO6) functions to move endocytic vesicles out of actin-rich regions. In the absence of this motor activity, the uncoated vesicles exhibit Brownian-like motion and are trapped in actin (Aschenbrenner et al., 2004). The walking mechanism is controversial because of its very large steps of 36 nm (Okten et al., 2004). Myosin VI is known to be part of cytoskeleton organization in growing cells. Although gene shaving is not supervised clustering, in our experiment MYO6 is specifically over-expressed in GM compared to the other subtypes.

The differences between the two lineages of glioma (Anaplastic Astrocytoma vs. Anaplastic Oligodendroglioma and Oligodendroglioma Low Grade) are underlined by "Transcription and DNA binding" functions (see Fig. 7.4). The results are consistent and overlap for both Gene Shaving algorithms (GS-Gap and GS-MDL). As can be seen from Fig. 7.1a, Cluster 6 has the highest discrimination power for lineage differentiation compared to the other groupings. GS-MDL (marked with a red-shaded background) shows higher selectiveness compared to the genes clustered by GS-Gap (light-yellow background). In both cases, the genes predominantly have the "Transcription and DNA binding" function. Leger et al. (1995) suggested Translin to be involved in the process of translocation, and HMGA1 in the activation of viral gene expression (together with the POU domain protein Tst-1/Oct-6, because of its endogenous expression in myelinating glia). Also, the differential expression of HMGA1 could be a result of treatment with morphine, which may trigger independent reaction pathways affecting transcription, regulation, and/or post-synthetic phosphorylation (Chakarov et al., 2000).

For the transition from the lowest glioma grades to high grades (Oligodendroglioma Low Grade vs. all others), any classification algorithm could have limitations because of the small number of Oligodendroglioma Low Grade samples. In our case, most of the genes selected in the two winning clusters have a metabolic function. Figures 7.7 and 7.8 show the expressions of Myosin light polypeptide kinase (MYLK), Endothelin receptor type B (EDNRB), Heat shock protein 75 (TRAP1), Sialyltransferase 7 (SIAT7B), and RNTRE (USP6NL).
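The BSS/WSS discrimination power used in Fig. 7.1 can be computed for a cluster's average gene as follows (a sketch with our naming):

```python
import numpy as np

def bss_wss(avg_gene, labels):
    """BSS/WSS of an average-gene profile for a given grouping of the patients.

    avg_gene : (p,) cluster-mean expression over patients
    labels   : (p,) group labels, e.g. "GM" vs. "all others"
    """
    overall = avg_gene.mean()
    bss = wss = 0.0
    for c in np.unique(labels):
        grp = avg_gene[labels == c]
        bss += grp.size * (grp.mean() - overall) ** 2   # between sum of squares
        wss += ((grp - grp.mean()) ** 2).sum()          # within sum of squares
    return bss / wss
```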
Figure 7.3. The transition from low-medium grades to the highest grade (all vs. GM) is influenced by genes from the "Cell growth" function family. The figure shows details of Cluster 4 found by the GS-Gap algorithm, which is enriched for the same function as Cluster 3 found by GS-MDL (represented in Figure 7.2). (Please refer to the color section of this book to view this figure in color).
[Figure 7.3, continued: gene names and properties for the GS-Gap grouping, including DCLRE1A (DNA cross-link repair 1A), SRPR (signal recognition particle receptor), EWSR1 (Ewing sarcoma breakpoint region 1), MATN2 (matrilin 2), and IGFBP1 (insulin-like growth factor binding protein), with votes and patient labels.]
[Figure 7.4 (table detail): genes and properties selected by both GS-MDL and GS-Gap: ATRX, TRAP1, STHM, NFYB, REG1A, MBD1, RNTRE, INPP5D, TBP, KIAA0268, TSN, HOXD1, and HMGA1, with votes, functions (seven genes carry the Transcription function), and per-subtype expression means and standard deviations.]
Figure 7.4. "Transcription and DNA binding" functions are characteristic of the lineage-differentiation separation. The left figure presents genes: columns are gene expression profiles; the 49 patients are grouped by glioma labels: AA, AO, OL, and GM. For positively correlated genes, the color green is used for low expression values and the color red for high expression values. The gene shaving algorithm groups together positively and negatively correlated genes. If a gene is negatively correlated, we represent its expression as a photographic "negative". In the column "+/-" we labeled positively correlated genes with "+" and negatively correlated genes with "-". On the right side of the table we show, for each class, the mean (upper row, left aligned) and the standard deviation of expression values (lower row, right aligned). (Please refer to the color section of this book to view this figure in color).
Figure 7.4. Continued (Please refer to the color section of this book to view this figure in color).
[Figure 7.5 (table detail): Cluster 1 selected by GS-MDL: SULT1A3, EFNB2, OPRK1, NAPG, MEF2C, and DCLRE1A, with votes, functions, the "average gene", and per-subtype expression means and standard deviations. Cluster 1 significances: 0.65937 (OL vs. all = 0.11673; GM vs. all = 0.044389; AA vs. AO+OL = 0.77953).]
Figure 7.5. Cluster 1 groups the genes aligned to the first principal component. In both GS-MDL and GS-Gap, the first clusters contain genes that have the highest vote-rate (43/43) from the LOO-CV algorithm, as shown in Fuller et al. (submitted). Five genes are represented in red; they belong to the first cluster for both GS-MDL and GS-Gap. The first clusters are composed of genes with a mixture of metabolic functions: DNA-binding, transduction, and cell-cycle. These genes discriminate well in the problem of transition from the lowest grade to medium-high glioma grades (OL vs. all). We can observe low expression in the AO and OL subtypes versus over-expression in the AA glioma subtype. (Please refer to the color section of this book to view this figure in color).
[Figure 7.6 (table detail): Cluster 1 selected by GS-Gap: SULT1A3, OPRK1, MEF2C, MPI, NAPG, UROS, SGCD, TERF1, and TSPY, among others, with votes, functions, and per-subtype expression means and standard deviations.]
Figure 7.6. Cluster 1 groups the genes aligned to the first principal component. The figure shows the list of genes grouped by the GS-Gap algorithm in Cluster 1. (Please refer to the color section of this book to view this figure in color).
[Figure 7.7 (table detail): Cluster 10 selected by GS-MDL: TRAP1 (heat shock protein 75), SIAT7B (sialyltransferase 7), and USP6NL (RNTRE), with votes, functions, and per-subtype expression means and standard deviations. Cluster 10 significances: 0.38475 (OL vs. all = 0.24703; GM vs. all = 0.0018932; AA vs. AO+OL = 0.0036034).]
Figure 7.7. The winners in the discriminatory problem of transition from the lowest grade (OL) to the other grades are different in GS-MDL and GS-Gap. Cluster 10 of GS-MDL creates a better structure than GS-Gap. The genes agree in metabolic function. TRAP1 (heat shock protein 75) is intensively studied.
[Figure 7.8 (table detail): cluster selected by GS-Gap: MYLK (myosin, light polypeptide kinase), EDNRB (endothelin receptor type B), and clone B368A9 map 4q25, with votes, functions, and per-subtype expression means and standard deviations. Cluster significances: 0.28159 (OL vs. all = 0.23609; GM vs. all = 0.011157; AA vs. AO+OL = 0.012255).]
Figure 7.8. The winners in the discriminatory problem of transition from the lowest grade (OL) to the other grades are different in GS-MDL and GS-Gap. Cluster 9 of GS-Gap is composed of MYLK (myosin, light polypeptide kinase), EDNRB (endothelin receptor type B), and clone B368A9 map 4q25.
Note 1. The "average gene" is defined as the mean of the genes from one cluster; negatively correlated genes are represented with the reversed sign.
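As a small illustration (not part of the original chapter), this definition can be written as a short Python sketch. The centering convention is an assumption: the note does not specify whether the sign reversal is applied to raw or mean-centered expression values, so the sketch reverses around each gene's own mean.

```python
import numpy as np

def average_gene(expr, signs):
    """The "average gene" of one cluster, per Note 1.

    expr  : array of shape (genes, samples) for the cluster
    signs : +1 or -1 per gene, from the figure's "+/-" column
    Assumption: the sign reversal acts on expression centered at each
    gene's own mean; the note leaves the centering unspecified.
    """
    centered = expr - expr.mean(axis=1, keepdims=True)
    flipped = centered * np.asarray(signs, dtype=float)[:, None]
    return flipped.mean(axis=0)   # one profile across the patients
```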
References

Akaike, H. (1974). "A New Look at the Statistical Model Identification." IEEE Trans. Autom. Control AC-19:716–723. Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, Jr. J., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O., and Staudt, L. M. (2000). "Distinct Types of Diffuse Large B-cell Lymphoma Identified by Gene Expression Profiling." Nature 403:503–11. Andersen, B., Schonemann, D., Pearse II, V., Jenne, K., Sugarman, J., and Rosenfeld, G. (1993). "Brn-5 is a Divergent POU Domain Factor Highly Expressed in Layer IV of the Neocortex." J Biol Chem 268:23390–23398. Anderson, T. W. (1963). "Asymptotic Theory for Principal Component Analysis." Ann Math Stat 34:122–148. Aschenbrenner, L., Naccache, S. N., and Hasson, T. (2004). "Uncoated Endocytic Vesicles Require the Unconventional Myosin, Myo6, for Rapid Transport Through Actin Barriers." Mol Biol Cell 15:2253–63. Barron, A., Rissanen, J., and Yu, B. (1998). "The Minimum Description Length Principle in Coding and Modeling." IEEE Trans Info Theory IT-44:2743–2760. Borg, I., and Groenen, P. (1997). Modern Multidimensional Scaling: Theory and Applications. New York: Springer. Caskey, L. S., Fuller, G. N., Bruner, J. M., Yung, W. K., Sawaya, R. E., Holland, E. C., and Zhang, W. (2000). "Toward a Molecular Classification of the Gliomas: Histopathology, Molecular Genetics, and Gene Expression Profiling." Histol Histopathol 15:971–981. Cattell, R. B. (1966). "The 'Scree' Test for the Number of Factors." Multivariate Behavioral Research 1:245–276. Celeux, G. and Govaert, G. (1995). "Gaussian Parsimonious Clustering Models." Pattern Recognit 28:781–793. Chakarov, S., Chakalova, L., Tencheva, Z., Ganev, V., and Angelova, A. (2000). "Morphine Treatment Affects the Regulation of High Mobility Group I-type Chromosomal Phosphoproteins in C6 Glioma Cells." Life Sci 66:1725–31.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). “Maximum Likelihood from Incomplete Data Via the EM Algorithm.” J R Stat Soc Ser B Stat Methodol 39:1–38. Dougherty, E. R., Barrera, J., Brun, M., Kim, S., Cesar, R. M., Chen, Y., Bittner, M., and Trent, J. M. (2002). “Inference from Clustering with Application to Gene-expression Microarrays. J Comput Biol 9: 105–126. Entrez Database Website. http://www.ncbi.nlm.nih.gov/. National Center for Biotechnology. Everitt, B. S. and Dunn, G. (2001). “Applied Multivariate Data Analysis.” London: Arnold. Fix, E. and Hodges, J. (1951). “Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties.” Technical Report, Randolph Field, Texas: USAF School of Aviation Medicine. Fraley, C. and Raftery, A. E. (1998). “How Many Clusters? Which Clustering Method? Answers Via Model-based Cluster Analysis.” Comput J 41: 578–588. Fuller, G. N., Hess, K. R., Rhee, C. H., Yung, W. K., Sawaya, R. A., Bruner, J. M., and Zhang, W. (2002). “Molecular Classification of Human Diffuse Gliomas by Multidimensional Scaling Analysis of Gene Expression Profiles Parallels Morphology-based Classification, Correlates with Survival, and Reveals Clinically-relevant Novel Glioma Subsets.” Brain Pathol 12:108–16. Fuller, G. N., Mircean, C., Tabus, I., Taylor, E., Sawaya, R., Bruner, J., Shmulevich, I., and Zhang, W. Molecular Voting for Glioma Classification Reflecting Heterogeneity in the Continuum of Cancer Progression, submitted. Fuller, G. N., Rhee, C. H., Hess, K. R., Caskey, L. S., Wang, R., Bruner, J. M., Yung, W. K., and Zhang, W. (1999). “Reactivation of Insulin-like Growth Factor Binding Protein 2 Expression in Glioblastoma Multiforme: A Revelation by Parallel Gene-expression Profiling.” Cancer Res 59:4228–32. Giurcaneanu, C. D., Tabus, I., Astola, J., Ollila, J., and Vihinen, M. (2004a). “Fast Iterative Gene Clustering Based on Information Theoretic Criteria for Selecting the Cluster Structure.” J Comput Biol 11:660–682. Giurcaneanu, C. D., Tabus, I., Shmulevich, I., and Zhang, W. (2004b). “Clustering Genes and Samples from Glioma Microarray Data.” In: R. Dobrescu and C. Vasilescu, eds. Interdisciplinary Applications of Fractal and Chaos Theory, pp. 157–171. Bucharest: The Publishing House of the Romanian Academy. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999). “Molecular Classification
of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring.” Science 286:531–7. Hastie, T., Tibshirani, R., Eisen, M. B., Alizadeh, A., Levy, R., Staudt, L., Chan, W. C., Botstein, D., and Brown, P. (2000a). “ ‘Gene Shaving’ as a Method for Identifying Distinct Sets of Genes with Similar Expression Patterns.” Genome Biol 1. Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000b). “Gene Shaving: A New Class of Clustering Methods.” http://www.stat. stanford.edu/∼hastie/Papers/ Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O. P., Wilfond, B., Borg, A., and Trent, J. (2001). “Gene-expression Profiles in Hereditary Breast Cancer.” N Engl J Med 344:539–48. Huber, P. J. (1981). Robust Statistics. p. 107. New York: John Wiley & Sons. Kim, S., Dougherty, E. R., Shmulevich, I., Hess, K. R., Hamilton, S. R., Trent, J. M., Fuller, G. N., and Zhang, W. (2002). “Identification of Combination Gene Sets for Glioma Classification.” Mol Cancer Ther 13:1229–36. Kleihues, P. and Cavenee, W. K. (2000). Pathology and Genetics of Tumours of the Nervous System. Lyon: IARC Press. Kobayashi, T., Yamaguchi, M., Kim, S., Morikawa, J., Ogawa, S., Ueno, S., Suh, E., Dougherty, E., Shmulevich, I., Shiku, H., and Zhang, W. (2003). “Microarray Reveals Differences in Both Tumors and Vascular Specific Gene Expression in De Novo CD5+ and CD5− Diffuse Large B-cell Lymphomas.” Cancer Res 63:60–6. Leger, H., Sock, E., Renner, K., Grummt, F., and Wegner, M. (1995). “Functional Interaction Between the POU Domain Protein Tst-1/ Oct-6 and the High-mobility-group Protein HMG-I/Y.” Mol Cell Biol 15:3738–47. Lloyd, S. P. (1982). “Least Squares Quantization in PCM.” IEEE Transactions on Information Theory IT-28:129–137. Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis. London: Academic Press. Milligan, G. W. and Cooper, M. C. (1985). “An Examination of Procedures for Determining the Number of Clusters in a Data Set.” Psychometrika 50:159–179. Mircean, C., Tabus, I., and Astola, J. (2002). “Quantization and Distance Function Selection for Discrimination of Tumors Using Gene Expression Data.” Proceedings of SPIE Photonics West 2002, San Jose, CA.
Mircean, C., Tabus, I., Astola, J., Kobayashi, T., Shiku, H., Yamaguchi, M., Shmulevich, I., and Zhang, W. (2004). “Quantization and Similarity Measure Selection for Discrimination of Lymphoma Subtypes Under κ - nearest Neighbor Classification.” SPIE Photonics West 2004, BiOS 2004 Symposium, San Jose, CA. Nutt, C. L., Mani, D. R., Betensky, R. A., Tamayo, P., Cairncross, J. G., Ladd, C., Pohl, U., Hartmann, C., McLaughlin, M. E., Batchelor, T. T., Black, P. M., Deimling, A., Pomeroy, S. L., Golub, T. R., and Louis, D. N. (2003). “Gene Expression-based Classification of Malignant Gliomas Correlates Better with Survival than Histological Classification.” Cancer Research 63:1602–1607. Okten, Z., Churchman, L. S., Rock, R. S., and Spudich, J. A. (2004). “Myosin VI Walks Hand-over-hand Along Actin.” Nat Struct Mol Biol. Epub 2004: Aug 01. Redner, R. A. and Walker, H. F. (1984). “Mixture Densities, Maximum Likelihood and the EM Algorithm.” SIAM Rev 26:195–239. Rissanen, J. (1978). “Modeling by Shortest Data Description.” Automatica J IFAC 14:465–471. Schwarz, G. (1978). “Estimating the Dimension of a Model.” Ann Stat 6:461–464. Shmulevich, I. and Zhang, W. (2002). “Binary Analysis and OptimizationBased Normalization of Gene Expression Data.” Bioinformatics 18: 555–565. Shmulevich, I., Hunt, K., El-Naggar, A., Taylor, E., Ramdas, L., Laborde, P., Hess, K. R., Pollock, R., and Zhang, W. (2002). “Tumor Specific Gene Expression Profiles in Human Leiomyosarcoma: an Evaluation Of Intratumor Heterogeneity.” Cancer 94:2069–2075. Stoica, P. and Selen, Y. (2004). “Model-order Selection.” Signal Processing Mag 21:36–47. Stone, C. J. (1977). “Consistent Nonparametric Regression (With Discussion).” Ann Statist 5:595–645. Tabus, I. and Astola, J. (2003). “Clustering the Non-uniformly Sampled Time Series of Gene Expression Data.” In: Proc. ISSPA 2003, EURASIPIEEE Seventh Int. Symp. on Signal Processing and its Applications, pp. 61–64, Paris, France. Taylor, E., Cogdell, D., Coombes, K., Hu, L., Ramdas, L., Tabor, A., Hamilton, S., and Zhang, W. (2001). “Sequence Verification as Quality Control Step for Production of cDNA Microarray.” BioTechniques 31:62–65. Wall, M. E., Dick, P. A., and Brettin, T. S. (2001). “SVDMAN-singular Value Decomposition of Microarray Data.” Bioinformatics 17:566–568.
Wall, M. E., Rechtsteiner, A., and Rocha, L. M. (2002). “Singular Value Decomposition and Principal Component Analysis.” In: D. P. Berrar, W. Dubitzky, and M. Granzow, eds. A Practical Approach to Microarray Data Analysis, pp. 91–109. Boston: Kluwer Academic Publishers. Wax, M. and Kailath, T. (1985). “Detection of Signals by Information Theoretic Criteria.” IEEE Trans. Acoustics Speech Signal Proc., 33: 387–392. Yeung, K. Y., Fraley, C., Murua, A., Raftery, A. E., and Ruzzo W. L. (2001). “Model-based Clustering and Data Transformations for Gene Expression Data.” Bioinformatics 17:977–987. Zeeberg, B. R., Feng, W., Wang, G., Wang, M. D., Fojo, A. T., Sunshine, M., Narasimhan, S., Kane, D. W., Reinhold, W. C., Lababidi, S., Bussey, K. J., Riss, J., Barrett, J. C., and Weinstein, J. N. (2003). “GoMiner: A Resource for Biological Interpretation of Genomic and Proteomic Data.” Genome Biol 4(4):R28. Zhou, X., Wang, X., and Dougherty, E. R. (2003). “Binarization of Microarray Data on the Basis of a Mixture Model.” Mol Cancer Ther 2: 679–84.
Chapter 8
DESIGN ISSUES AND COMPARISON OF METHODS FOR MICROARRAY-BASED CLASSIFICATION
Edward R. Dougherty and Sanju N. Attoor
Department of Electrical Engineering, Texas A&M University, College Station, Texas, USA
1. Introduction
A key goal for microarray-based analysis of gene expressions is to perform classification via different expression patterns (for instance, cancer classification) (Golub et al., 1999; Ben-Dor et al., 2000; Bittner et al., 2000; Hedenfalk et al., 2001; Khan et al., 2002; Kim et al., 2002). This requires designing a classifier that takes a vector of gene expression levels as input, and outputs a class label. Classification can be between different kinds of cancer, different stages of tumor development, or a host of such differences. Classifiers are designed from a sample of expression vectors. This requires assessing expression levels from RNA obtained from the different tissues with microarrays, determining genes whose expression levels can be used as classifier variables, and then applying some rule to design the classifier from the sample microarray data. Three critical issues arise: (1) Given a set of variables, how does one design a classifier from the sample data that provides good classification over the general population? (2) How does one estimate the error of a designed classifier when data are limited? (3) Given a large set of potential variables, such as the large number of expression level determinations provided by microarrays, how does one select a set of variables as the input vector to the classifier? The problem of error estimation impacts variable selection in a devilish way. An error estimator may be unbiased but have a large variance, and therefore often be low. This can produce a large number of variable sets and classifiers with low error estimates. For the near future, small samples will remain a critical issue for microarray-based classification (Dougherty, 2001).
2. Classification Rules
Classification involves a classifier ψ, a feature vector X = (X_1, X_2, . . . , X_d) composed of random variables, and a binary random variable Y to be predicted by ψ(X). The values, 0 or 1, of Y are treated as class labels. The error, ε[ψ], of ψ is the probability, P(ψ(X) ≠ Y), that the classification is erroneous. It equals the expected (mean) absolute difference, E[|Y − ψ(X)|], between the label and the classification. X_1, X_2, . . . , X_d can be discrete or real-valued. In the latter case, the domain of ψ is d-dimensional Euclidean space ℜ^d. An optimal classifier, ψ•, is one having minimal error, ε•, among all binary functions on ℜ^d. ψ• and ε• are called the Bayes classifier and Bayes error, respectively. Classification accuracy, and thus the error, depends on the probability distribution of the feature-label pair (X, Y): how well the labels are distributed among the variables being used to discriminate them, and how the variables are distributed in ℜ^d. The Bayes classifier is defined in a natural way: for any specific vector x, ψ•(x) = 1 if the expected value of Y given x, E[Y | x], exceeds 1/2, and ψ•(x) = 0 otherwise. Formulated in terms of probabilities, ψ•(x) = 1 if the conditional probability of Y = 1 given x exceeds the conditional probability of Y = 0 given x, and ψ•(x) = 0 otherwise; that is, ψ•(x) = 1 if and only if P(Y = 1|x) > P(Y = 0|x). The label 1 is predicted upon observation of x if the probability that x lies in class 1 exceeds the probability that x lies in class 0. Since the sum of the probabilities is 1, P(Y = 1|x) > P(Y = 0|x) if and only if P(Y = 1|x) > 1/2. The problem is that we do not know these conditional probabilities, and therefore must design a classifier from sample data. Supervised classifier design uses a sample S_n = {(X_1, Y_1), (X_2, Y_2), . . . , (X_n, Y_n)} of feature-label pairs and a classification rule to construct a classifier ψ_n whose error is hopefully close to the Bayes error. The Bayes error ε• is estimated by the error ε_n of ψ_n. Because ε• is minimal, ε_n ≥ ε•, and there is a design cost (estimation error), Δ_n = ε_n − ε•. Since it depends on the sample, ε_n is a random variable, as is Δ_n. We are concerned with the expected value of Δ_n, E[Δ_n] = E[ε_n] − ε•. Hopefully, E[Δ_n] gets closer to 0 as the sample size grows. This will depend on the classification rule and the distribution of the feature-label pair (X, Y). A classification rule is said to be consistent for the distribution of (X, Y) if E[Δ_n] → 0 as n → ∞, where the expectation is relative to the distribution of the sample. The expected design error goes to zero as the sample size goes to infinity. This is equivalent to P(Δ_n > τ) → 0 as n → ∞ for any τ > 0, which says that the probability of the design error exceeding τ goes to 0. As stated, consistency depends upon the relation between the classification rule and the joint feature-label distribution. If
E[Δ_n] → 0 for any distribution, then the classification rule is said to be universally consistent. Since we often lack an estimate of the distribution, universal consistency is desirable.
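To make these definitions concrete, the following Python sketch (an illustration added here, not part of the original text) fixes a small invented discrete feature-label distribution, computes the Bayes classifier and Bayes error exactly, and estimates the expected design cost E[Δ_n] of the plug-in rule discussed in the next section by repeated sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented feature-label distribution on two binary features x = (x1, x2):
# p[x] = P(X = x), eta[x] = P(Y = 1 | X = x).
p   = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}
eta = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.4, (1, 1): 0.9}

# Bayes classifier and Bayes error, computed exactly from the distribution.
psi_bayes = {x: int(eta[x] > 0.5) for x in p}
bayes_err = sum(p[x] * min(eta[x], 1 - eta[x]) for x in p)

keys = list(p.keys())

def sample(n):
    idx = rng.choice(len(keys), size=n, p=list(p.values()))
    X = [keys[i] for i in idx]
    Y = [int(rng.random() < eta[x]) for x in X]
    return X, Y

def plug_in(X, Y):
    """Plug-in rule: majority label per cell; unseen or tied cells get 0."""
    votes = {x: [0, 0] for x in keys}
    for x, y in zip(X, Y):
        votes[x][y] += 1
    return {x: int(v[1] > v[0]) for x, v in votes.items()}

def error(psi):
    # Exact error of a classifier psi under the known distribution.
    return sum(p[x] * (eta[x] if psi[x] == 0 else 1 - eta[x]) for x in p)

print("Bayes error:", bayes_err)
for n in (10, 50, 500):
    # E[Delta_n]: average design cost over repeated samples of size n.
    delta = np.mean([error(plug_in(*sample(n))) - bayes_err
                     for _ in range(1000)])
    print(f"n = {n:3d}:  E[Delta_n] ~ {delta:.4f}")
```

As expected for a consistent rule, the printed values of E[Δ_n] shrink toward 0 as n grows.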
3. Some Specific Classification Rules
Since the Bayes classifier is defined by ψ•(x) = 1 if and only if P(Y = 1|x) > 1/2, an obvious way to proceed is to obtain an estimate P_n(Y = 1|x) of P(Y = 1|x) from the sample S_n. The plug-in rule designs a classifier by ψ_n(x) = 1 if and only if P_n(Y = 1|x) > 1/2. If the data is discrete, then there is a finite number of vectors and P_n(Y = 1|x) can be defined to be the number of times the pair (x, 1) is observed in the sample divided by the number of times x is observed. Unfortunately, if x is observed very few times, then P_n(Y = 1|x) is not a good estimate. Even worse, if x is never observed, then ψ_n(x) must be defined by some convention. The rule is consistent, but depending on the number of variables, may require a large sample to have E[Δ_n] close to 0, or equivalently, E[ε_n] close to the Bayes error. Consistency alone is of little consequence for small samples. For continuous data, P(y|x) is a conditional probability density estimated by P_n(y|x). Many classification rules partition ℜ^d into a disjoint union of cells. P_n(Y = 1|x) is the number of 1-labeled sample points in the cell containing x divided by the total number of points in the cell. A histogram rule is defined by the plug-in rule: ψ_n(x) is 0 or 1 according to which is the majority label in the cell. If x is a point in cell C, then ψ_n(x) = 1 if there are more 1-labeled sample points in C and ψ_n(x) = 0 if there are more 0-labeled sample points in C. The cells may change with n and may depend on the sample points. They do not depend on the labels. To obtain consistency for a distribution, two conditions are sufficient when stated with the appropriate mathematical rigor: (1) the partition should be fine enough to take into account local structure of the distribution, and (2) there should be enough labels in each cell so that the majority decision reflects the decision based on the true conditional probabilities. The cubic-histogram rule partitions ℜ^d into same-size cubes. On each cube, the designed classifier is defined to be 0 or 1, according to which is the majority among the labels of the points in the cube. If the cube edge length approaches 0 and n times the common volume approaches infinity as n → ∞, then the rule is universally consistent (Devroye et al., 1996; Gordon and Olshen, 1978). For discrete data, the cubic histogram rule reduces to the plug-in rule if the cubes are sufficiently small. The cubic-histogram rule partitions the space without reference to the actual data. One can instead partition the space based on the data, either with or without reference to the labels. Tree classifiers are a common way
of performing data-dependent partitioning. Since any tree can be transformed into a binary tree, we need only consider binary classification trees. A tree is constructed recursively based on some criteria. If S represents the set of all data, then it is partitioned according to some rule into S = S1 ∪S2 . There are then four possibilities:
1. S_1 is partitioned into S_1 = S_11 ∪ S_12 and S_2 is partitioned into S_2 = S_21 ∪ S_22.
2. S_1 is partitioned into S_1 = S_11 ∪ S_12 and partitioning of S_2 is terminated.
3. S_2 is partitioned into S_2 = S_21 ∪ S_22 and partitioning of S_1 is terminated.
4. Partitioning of both S_1 and S_2 is terminated.
In the last case, the partition is complete; in any of the others, it proceeds recursively until all branches end in termination, at which point the leaves on the tree represent the partition of the space. On each cell (subset) in the final partition, the designed classifier is defined to be 0 or 1, according to which is the majority among the labels of the points in the cell. As an example of a classification tree, we consider the median tree. Based on one of the coordinates, say the first, split the observation vectors into two groups based on the median value of the selected coordinate in the sample. If the first coordinate is used, then the x 1 -axis is split at the median, x˜1 , thereby partitioning the data into S = S1 ∪ S2 : for x = (x1 , x2 , . . . , xd ) ∈ S, x ∈ S1 if x1 ≤ ˜1 , and x ∈ S2 if x1 > x˜1 . The second level of the tree is constructed by partitioning S1 and S2 according to the median values of the second coordinate among those points in S1 and S2 , respectively, where the point selected for the first level is not considered. The partitioning is continued to some chosen level k. If the level goes beyond the dimension d of the space, then one recycles through coordinates again. Note that the median tree does not depend on the labels of the sample data. Median-tree classifiers are consistent if k → ∞ in such a way that n/k2k → ∞, so long as X possesses a density (Devroye et al., 1996). Figure 8.1 illustrates a median-based tree classifier, both the data partitioning and the binary tree. The nearest-neighbor (NN) rule forms a partition by defining ψn (x) as the label of the sample point closest to x. The result is a data-dependent partition having a cell for each sample point. That cell consists of all points in the space closest to the sample point. The partition is known as a Voronoi partition and the cells are called Voronoi cells. The nearestneighbor rule is simple, but not consistent. An extension of this rule is
Design Issues and Comparison of Methods
123
Figure 8.1. Median tree: (a) data partition; (b) tree.
the k-nearest-neighbor (kNN) rule (see also Chapter 14). For k odd, the k points closest to x are selected and ψ_n(x) is defined to be 0 or 1 according to which is the majority among the labels of the chosen points. A slight adjustment is made if k is even. The kNN rule is universally consistent if k → ∞ in such a way that k/n → 0 as n → ∞ (Stone, 1977). Instead of taking the majority label among a pre-decided number of nearest neighbors as with the NN rule, the moving-window rule pre-sets a distance and takes the majority label among all sample points within that distance of x. The moving-window rule can be "smoothed" by giving more weight to sample points closer to x. A histogram rule based on the majority between 0 and 1 labels is equivalent to the median of the labels, where in case of a tie a 0 is given. More generally, a weighted median of binary values is computed by adding up the weights associated with the 0- and 1-labeled points separately, and defining the output to be the larger sum. A kernel rule is constructed by defining a weighting kernel based on the distance of a sample point from x, in conjunction with a smoothing (scaling) factor. The Gaussian and Epanechnikov kernels are given by

$$K_h(\mathbf{x}, \mathbf{x}_k) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}_k\|^2}{h^2}\right) \qquad (8.1)$$

$$K_h(\mathbf{x}, \mathbf{x}_k) = \begin{cases} 1 - \|\mathbf{x} - \mathbf{x}_k\|^2/h^2, & \text{if } \|\mathbf{x} - \mathbf{x}_k\| \le h \\ 0, & \text{if } \|\mathbf{x} - \mathbf{x}_k\| > h \end{cases} \qquad (8.2)$$

respectively, where ‖·‖ denotes Euclidean distance, x is the point at which the classifier is being defined, and x_k is a sample point. Since the Gaussian kernel is never 0, all sample points get some weight. The Epanechnikov
kernel is 0 for sample points more than h from x, so that, like the moving-window rule, only sample points within a certain radius contribute to the definition of ψ_n(x). The moving-window rule is a special case of a kernel rule with the weights being 1 within a specified radius. The three kernel rules mentioned are universally consistent if h → 0 in such a way that nh^d → ∞ as n → ∞ (Devroye and Kryzak, 1989).
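The kernel rules translate directly into code. The sketch below (an illustration; 0/1 labels and Euclidean distance are assumed, with the kernel forms of Eqs. 8.1 and 8.2 as reconstructed above) implements the weighted-majority decision:

```python
import numpy as np

def gaussian_kernel(d, h):
    # Eq. 8.1: smooth weights; never exactly zero, so every point contributes.
    return np.exp(-(d / h) ** 2)

def epanechnikov_kernel(d, h):
    # Eq. 8.2: weights vanish for points farther than h from x.
    return np.where(d <= h, 1.0 - (d / h) ** 2, 0.0)

def moving_window_kernel(d, h):
    # Weight 1 within radius h, 0 outside: a special case of a kernel rule.
    return (d <= h).astype(float)

def kernel_classify(x, X, Y, kernel, h):
    """Weighted majority among 0/1 labels; ties resolve to 0."""
    d = np.linalg.norm(X - x, axis=1)       # ||x - x_k|| for all sample points
    w = kernel(d, h)
    return int(w[Y == 1].sum() > w[Y == 0].sum())

# Tiny usage example with invented points:
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [0.1, -0.2]])
Y = np.array([0, 1, 1, 0])
print(kernel_classify(np.array([0.8, 0.9]), X, Y, epanechnikov_kernel, h=0.5))
```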
4. Constrained Classifiers
To reduce design error, one can restrict the functions from which an optimal classifier must be chosen to a class C. This leads to trying to find an optimal constrained classifier, ψ_C ∈ C, having error ε_C. Constraining the classifier can reduce the expected design error, but at the cost of increasing the error of the best possible classifier. Since optimization in C is over a subclass of classifiers, the error, ε_C, of ψ_C will typically exceed the Bayes error, unless the Bayes classifier happens to be in C. This cost of constraint (approximation) is Δ_C = ε_C − ε•. A classification rule yields a classifier ψ_{n,C} ∈ C with error ε_{n,C}, and ε_{n,C} ≥ ε_C ≥ ε•. Design error for constrained classification is Δ_{n,C} = ε_{n,C} − ε_C. For small samples, this can be substantially less than Δ_n, depending on C and the rule. The error of the designed constrained classifier is decomposed as ε_{n,C} = ε• + Δ_C + Δ_{n,C}. The expected error of the designed classifier from C can be decomposed as

$$E[\varepsilon_{n,C}] = \varepsilon_\bullet + \Delta_C + E[\Delta_{n,C}] \qquad (8.3)$$

The constraint is beneficial if and only if E[ε_{n,C}] < E[ε_n], which means Δ_C < E[Δ_n] − E[Δ_{n,C}]. If the cost of constraint is less than the decrease in expected design cost, then the expected error of ψ_{n,C} is less than that of ψ_n. The dilemma: strong constraint reduces E[Δ_{n,C}] at the cost of increasing ε_C. The matter can be graphically illustrated. For the discrete-data plug-in rule and the cubic histogram rule with fixed cube size, E[Δ_n] is nonincreasing, meaning that E[Δ_{n+1}] ≤ E[Δ_n]. This means that the expected design error never increases as sample sizes increase, and it holds for any feature-label distribution. Such classification rules are called smart. They fit our intuition about increasing sample sizes. The nearest-neighbor rule is not smart because there exist distributions for which E[Δ_{n+1}] ≤ E[Δ_n] does not hold for all n. Now consider a consistent rule, constraint, and distribution for which E[Δ_{n+1}] ≤ E[Δ_n] and E[Δ_{n+1,C}] ≤ E[Δ_{n,C}]. Figure 8.2 illustrates the design problem. The axes correspond to sample size and error. The horizontal solid and dashed lines represent ε• and ε_C, respectively; the decreasing solid and dashed lines represent E[ε_n] and
E[ε_{n,C}], respectively. If n is sufficiently large, then E[ε_n] < E[ε_{n,C}]; however, if n is sufficiently small, then E[ε_n] > E[ε_{n,C}]. The point N at which the decreasing lines cross is the cut-off: for n > N, the constraint is detrimental; for n < N, it is beneficial. When n < N, the advantage of the constraint is the difference between the decreasing solid and dashed lines.

Figure 8.2. Relation between sample size and constraint.
5. Perceptrons and Neural Networks
A classical way of constructing classifiers is to use parametric representation. The classifier is postulated to have a functional form ψ(x_1, x_2, . . . , x_d; a_0, a_1, . . . , a_r), where the parameters a_0, a_1, . . . , a_r are to be determined by some estimation procedure based on the sample data. For parametric representation, we assume the labels to be −1 and 1. The most basic functional form involves a linear combination of the coordinates of the observations. A binary function is obtained by thresholding. A perceptron has the form

$$\psi(\mathbf{x}) = T\left(a_0 + \sum_{i=1}^{d} a_i x_i\right) \qquad (8.4)$$
where x = (x1 , x2 , . . . , xd ) and T thresholds at 0 and yields −1 or 1. A perceptron divides the space into two half-spaces determined by the
hyperplane defined by the parameters a_0, a_1, . . . , a_d. The hyperplane is determined by the equation formed from setting the linear combination equal to 0. Using the dot product, a · x, which is equal to the sum in the preceding equation absent the constant term a_0, the hyperplane is defined by a · x = −a_0. An optimal perceptron minimizes the error ε[ψ] = P(ψ(X) ≠ Y). A specific design procedure finds parameters that hopefully define a perceptron whose error is close to the optimal perceptron. Often, analysis of the design procedure depends on whether the sample data is separable by a hyperplane. The sample is linearly separable if there exists a hyperplane such that points with label −1 lie on one side of the hyperplane and the points with label 1 lie on the other side. The classical Rosenblatt perceptron algorithm uses an iterative procedure to find a separating hyperplane in the case of a linearly separable sample (Rosenblatt, 1962). Let {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} be the sample-data set, a be a parameter vector defining the perceptron, and {(x_t, y_t)} be a sequence of observation-label pairs formed by cycling repeatedly through the sample data. Proceed iteratively by forming a sequence a(0), a(1), a(2), . . .. Let a(0) = (0, 0, . . . , 0). Given a(t), define a(t + 1) by

$$\mathbf{a}(t+1) = \begin{cases} \mathbf{a}(t), & \text{if } \psi_t(\mathbf{x}_{t+1}) = y_{t+1} \\ \mathbf{a}(t) + y_{t+1}\mathbf{x}_{t+1}, & \text{if } \psi_t(\mathbf{x}_{t+1}) \ne y_{t+1} \end{cases} \qquad (8.5)$$
where ψt is the perceptron formed using the parameter vector a(t). The parameter vector changes if and only if the label for time t + 1 is not correctly given by applying the perceptron for time t to the observation vector for time t + 1. In finite time, the procedure produces a hyperplane that correctly separates the data. The time can be bounded in terms of the maximum norm of the observation vectors and the degree to which the −1 and 1-labeled vectors are separated. For data sets that are not linearly separable, another method has to be employed to estimate the optimal perceptron. Gradient-based methods lead to consistent design under very general conditions. The Wiener filter provides non-iterative computationally efficient perceptron design. It has been incorporated into a massively parallel method for finding promising very small gene sets for microarray-based classification that has been applied to breast cancer (Kim et al., 2002). Given the joint feature-label distribution, the optimal mean-square-error linear estimator of Y based on X is determined by a weight vector a = (a0 , a1 , . . . , ad )t , ‘t’ denoting transpose. The autocorrelation matrix for X and the
cross-correlation vector for X and Y are given by

$$R_X = \begin{pmatrix} E[X_1 X_1] & E[X_1 X_2] & \cdots & E[X_1 X_d] \\ E[X_2 X_1] & E[X_2 X_2] & \cdots & E[X_2 X_d] \\ \vdots & \vdots & \ddots & \vdots \\ E[X_d X_1] & E[X_d X_2] & \cdots & E[X_d X_d] \end{pmatrix} \qquad (8.6)$$

$$E[XY] = (E[X_1 Y], E[X_2 Y], \ldots, E[X_d Y])^t \qquad (8.7)$$
respectively. If R_X is nonsingular, then the optimal weight vector is given by a = R_X^{−1} E[XY]. If R_X is singular, then R_X^{−1} is replaced by the pseudo-inverse of R_X. An approximation of the optimal perceptron is given by T[a^t X_0], where X_0 = (1, X^t)^t. The sample-based classification rule for the weight vector is determined by estimating R_X and E[XY] from the sample data. The support vector machine (SVM) provides another method for designing perceptrons (Vapnik et al., 1997; Vapnik, 1998). It has been used with microarray data to identify gene sets possessing common functionality (Brown et al., 2000). Figure 8.3 shows a linearly separable data set and three hyperplanes (lines). The outer lines pass through points in the sample data, and the third, called the maximal-margin hyperplane (MMH), is equidistant between the outer lines. It has the property that the distance from it to the nearest −1-labeled vector is equal to the distance from it
to the nearest 1-labeled vector. The vectors closest to it are called support vectors. The distance from the MMH to any support vector is called the margin.

Figure 8.3. Support-vector hyperplane.

The matter is formalized by recognizing that differently labeled sets are separable by the hyperplane u · x = c, where u is a unit vector and c is a constant, if u · x_k > c for y_k = 1 and u · x_k < c for y_k = −1. For any unit vector u, define

$$c_1(\mathbf{u}) = \min_{\{\mathbf{x}_k : y_k = 1\}} \mathbf{u} \cdot \mathbf{x}_k, \qquad c_0(\mathbf{u}) = \max_{\{\mathbf{x}_k : y_k = -1\}} \mathbf{u} \cdot \mathbf{x}_k \qquad (8.8)$$
Let u_0 be the unit vector that maximizes ρ(u) = (1/2)[c_1(u) − c_0(u)] over all unit vectors u. The MMH is given by the vector u_0 and the constant c_0 = ρ(u_0). This hyperplane is unique. A method is needed to find the MMH. It can be found by solving the following quadratic optimization problem: among the set of all vectors v for which there exists a constant b such that

$$\mathbf{v} \cdot \mathbf{x}_k + b \ge 1, \quad \text{if } y_k = 1; \qquad \mathbf{v} \cdot \mathbf{x}_k + b \le -1, \quad \text{if } y_k = -1 \qquad (8.9)$$
find the vector v_0 of minimum norm, ‖v_0‖. Then the vector defining the MMH and the margin are given by u_0 = v_0/‖v_0‖ and ρ(u_0) = ‖v_0‖^{−1}, respectively. The SVM methodology can be extended to the situation in which the sample is not linearly separable. We leave a detailed explanation to the literature. It can also be extended to other functions besides perceptrons. We will also not pursue that issue here. Instead, we turn to the complex parametric classifiers defined by neural networks. A (feed-forward) two-layer neural network has the form

$$\psi(\mathbf{x}) = T\left(c_0 + \sum_{i=1}^{k} c_i\, \sigma[\psi_i(\mathbf{x})]\right) \qquad (8.10)$$

where σ is a sigmoid function, and

$$\psi_i(\mathbf{x}) = \sum_{j=0}^{d} a_{ij} x_j \qquad (8.11)$$
where x0 is the constant 1. A sigmoid function is nondecreasing with limits −1 and +1 at −∞ and ∞, respectively. Each operator in the sum of Eq. 8.11 is called a neuron. These form the hidden layer. We consider
neural networks with the threshold sigmoid, σ(x) = −1 if x ≤ 0 and σ(x) = 1 if x > 0. Other sigmoid functions include the logistic sigmoid, σ(x) = (1 − e^(−x))/(1 + e^(−x)), and the arctan sigmoid, σ(x) = (2/π) arctan x. Increasing the complexity of the neural network by placing more functions in the hidden layer provides increasing approximation to the Bayes classifier. This approximation can be obtained to any desired degree (Cybenko, 1989; Funahashi, 1989). Greater accuracy of the best classifier for a network structure results in greater design cost. Not only does the increase in network complexity result in the need for larger data sets, it also makes estimation of the weights more problematic. Typically, the method of steepest descent on the error surface (as a function of the weights) is used (Bishop, 1995). In the case of linear filters, the mean-square error is a quadratic and the resulting error surface (as a function of the weights) has a single local minimum that is also the global minimum; however, the error surface for a neural network can have many local minima, so that a gradient method such as steepest descent can get stuck at a local minimum, making the matter much more problematic. Finally, a neural network is consistent if k → ∞ such that (k log n)/n → 0 as n → ∞ (Farago and Lugosi, 1993).
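As an illustration of the perceptron designs of this section (a sketch, not the authors' implementation), the following Python code implements the Rosenblatt update of Eq. 8.5 and the Wiener-filter weights built from sample estimates of Eqs. 8.6 and 8.7, with labels in {−1, +1}:

```python
import numpy as np

def rosenblatt(X, y, max_epochs=1000):
    """Perceptron training by the update of Eq. 8.5; y in {-1, +1}.

    For a linearly separable sample the loop terminates in finite time;
    max_epochs guards against non-separable input.
    """
    X0 = np.hstack([np.ones((len(X), 1)), X])   # prepend 1 for the term a0
    a = np.zeros(X0.shape[1])                   # a(0) = (0, 0, ..., 0)
    for _ in range(max_epochs):
        changed = False
        for xk, yk in zip(X0, y):               # cycle through the sample
            if np.sign(a @ xk) != yk:           # misclassified
                a = a + yk * xk                 # a(t+1) = a(t) + y x
                changed = True
        if not changed:                         # separating hyperplane found
            return a
    return a

def wiener_weights(X, y):
    """Non-iterative perceptron design: a = R_X^{-1} E[XY], with sample
    estimates of Eqs. 8.6-8.7 and a pseudo-inverse if R_X is singular."""
    X0 = np.hstack([np.ones((len(X), 1)), X])
    R = X0.T @ X0 / len(X)                      # sample autocorrelation matrix
    r = X0.T @ y / len(X)                       # sample cross-correlation
    return np.linalg.pinv(R) @ r

def predict(a, X):
    """T[a^t X0]: threshold at 0 to labels -1/+1."""
    X0 = np.hstack([np.ones((len(X), 1)), X])
    return np.where(X0 @ a > 0, 1, -1)
```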
6. Error Estimation
The error of a designed classifier needs to be estimated. If there is an abundance of data, then it can be split into training and test data. A classifier is designed on the training data. Its estimated error is the proportion of errors it makes on the test data. The estimate is unbiased and its variance tends to zero as the amount of test data goes to infinity. If the test-data error estimate is ε̃_n and there are m sample pairs in the test data, then

$$E[|\tilde{\varepsilon}_n - \varepsilon_n|^2] \le \frac{1}{4m} \qquad (8.12)$$
It is necessary to use 25 test sample pairs to get the corresponding standard-deviation bound down to 0.1. The problem is that, for small samples, one would like to use all the data for design. One small-sample approach is to use all the sample data to design a classifier ψ_n, and estimate ε_n by applying ψ_n to the same data. The resubstitution estimator, ε̄_n, is the fraction of errors made by ψ_n. It is typically quite low-biased. Here we specifically consider histogram rules for fixed (non-data-dependent) partitions. For these, ε̄_n is biased low, meaning E[ε̄_n] ≤ E[ε_n]. For small samples, the bias can be severe. It improves for large samples. The variance of the resubstitution estimator satisfies the
bound Var[ε̄_n] ≤ 1/n. If the number of cells in the partition is bounded by K, then an upper bound for the mean-square error of ε̄_n as an estimator of ε_n is given by

$$E[|\bar{\varepsilon}_n - \varepsilon_n|^2] \le \frac{K}{n} \qquad (8.13)$$

(see Devroye et al., 1996, for the theoretical details of the bounds in this section). In the case of discrete binary features, K = 2^d and the bound is exponential in terms of the number of variables. Figure 8.4 shows a generic situation for the inequality E[ε̄_n] ≤ ε• ≤ E[ε_n] for increasing sample size.

Figure 8.4. Relation between expected design error and resubstitution error.

Another small-sample approach is cross-validation. Classifiers are designed from parts of the sample, each is tested on the remaining data, and ε_n is estimated by averaging the errors. For leave-one-out estimation, n classifiers are designed from sample subsets formed by leaving out one sample pair, each is applied to the left-out pair, and the estimator ε̂_n is 1/n times the number of errors made by the n classifiers. Since the classifiers are designed on sample sizes of n − 1, ε̂_n actually estimates the error ε_{n−1}. It is an unbiased estimator of ε_{n−1}, meaning that E[ε̂_n] = E[ε_{n−1}]. Unbiasedness is important, but of critical concern is the variance of the estimator for small n. For a sample of size n, ε̂_n estimates ε_n based on the same sample. Performance depends on the classification rule. For histogram rules with fixed
partitions,

$$E[|\hat{\varepsilon}_n - \varepsilon_n|^2] \le \frac{1 + 6/e}{n} + \frac{6}{\sqrt{\pi(n-1)}} \qquad (8.14)$$
From Eqs. 8.13 and 8.14, E[|ε̂_n − ε_n|²] is of the order n^{−1/2} for leave-one-out estimation, as opposed to only n^{−1} for resubstitution. Unbiasedness comes with the cost of increased variance. There is a certain tightness to the bound of Eq. 8.14. For any partition, there exists a distribution for which

$$E[|\hat{\varepsilon}_n - \varepsilon_n|^2] \ge \frac{1}{e^{1/12}\sqrt{2\pi n}} \qquad (8.15)$$

Performance can be very bad for small n. To appreciate the difficulties inherent in the leave-one-out bounds, we will simplify them in a way that makes them more favorable to precise estimation. The performance of ε̂_n guaranteed by Eq. 8.14 becomes better if we lower the bound. A lower bound than the one in Eq. 8.14 is (1.8)/√(n − 1). The corresponding standard-deviation bounds for n = 50 and 100 exceed 0.5 and 0.435, respectively. These are essentially useless. The minimum worst-case-performance bound of Eq. 8.15 would be better if it were lower. A lower bound than the one given is (0.35)/√n. The corresponding standard-deviation bounds for n = 50 and 100 exceed 0.22 and 0.18, respectively.
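The behavior of these estimators is easy to observe by simulation. The following sketch (illustrative only; the discrete distribution is invented) designs a histogram classifier on a fixed partition and compares its true error with the resubstitution and leave-one-out estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

K = 8                                    # number of cells (fixed partition)
p = np.full(K, 1 / K)                    # P(X = cell)
eta = rng.random(K)                      # P(Y = 1 | cell), invented

def sample(n):
    X = rng.choice(K, size=n, p=p)
    Y = (rng.random(n) < eta[X]).astype(int)
    return X, Y

def design(X, Y):
    """Histogram rule: majority label per cell; empty or tied cells get 0."""
    counts = np.zeros((K, 2))
    for x, y in zip(X, Y):
        counts[x, y] += 1
    return (counts[:, 1] > counts[:, 0]).astype(int)

def true_error(psi):
    return float(np.sum(p * np.where(psi == 1, 1 - eta, eta)))

def resubstitution(X, Y):
    return float(np.mean(design(X, Y)[X] != Y))

def leave_one_out(X, Y):
    errs = sum(design(np.delete(X, i), np.delete(Y, i))[X[i]] != Y[i]
               for i in range(len(X)))
    return errs / len(X)

X, Y = sample(30)
print("true error     :", round(true_error(design(X, Y)), 3))
print("resubstitution :", round(resubstitution(X, Y), 3))
print("leave-one-out  :", round(leave_one_out(X, Y), 3))
```

Repeating the experiment over many samples exhibits the low bias of resubstitution and the higher variance of the leave-one-out estimate for small n.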
7. Feature Selection
Given a large set of potential features, such as the set of all genes on a microarray, it is necessary to find a small subset with which to classify. There are various methods of choosing feature sets, each having advantages and disadvantages. The typical intent is to choose a set of variables that provides good classification. The basic idea is to choose nonredundant variables. A critical problem arises with small samples. Given a large set of variables, every subset is a potential feature set. For v variables, there are 2^v − 1 possible feature vectors. Even for choosing from among 200 variables and allowing at most 20 variables, the number of possible vectors is astronomical. One cannot apply a classification rule to all of these; nonetheless, even if the classes are moderately separated, one may find many thousands of vectors for which ε̂_n ≈ 0. It would be wrong to conclude that the Bayes errors of all the corresponding classifiers are small. Adjoining variables stepwise to the feature vector decreases the Bayes error but can increase design error. For fixed sample size n and different numbers of variables d, Fig. 8.5 shows a generic situation for the Bayes error ε•(d) and the expected error E[ε_n(d)] of the designed classifier as functions of d. ε•(d) decreases; E[ε_n(d)] decreases and then increases. Were E[ε_n(d)] known, then we could conclude that ε•(d) is no worse than E[ε_n(d)]; however, we have only an estimate of ε_n(d), which for small samples can be well below (or above) ε•(d). Thus, the estimate curve ε̂_n(d) might drop far below the Bayes-error curve ε•(d), even being 0 over a fairly long interval.

Figure 8.5. Effect of increasing numbers of variables.

Regarding the general issue of the number of variables, the expected design error is written in terms of n and C in Eq. 8.3, but Δ_C depends on d. A celebrated theorem of pattern recognition provides bounds for E[Δ_{n,C}] (Vapnik and Chervonenkis, 1974; Vapnik and Chervonenkis, 1971). The empirical-error rule chooses the classifier in C that makes the least number of errors on the sample data. For this rule, E[Δ_{n,C}] satisfies the bound

$$E[\Delta_{n,C}] \le 4\sqrt{\frac{V_C \log n + 4}{2n}} \qquad (8.16)$$

where V_C is the VC (Vapnik-Chervonenkis) dimension of C. Details of the VC dimension are outside the scope of this chapter. Nonetheless, it is clear from Eq. 8.16 that n must greatly exceed V_C for the bound to be small. The VC dimension of a perceptron is d + 1. For a neural network with an even number k of neurons, the VC dimension has the lower bound V_C > dk. If k is odd, then V_C ≥ d(k − 1). To appreciate the implications, suppose d = k = 10. Setting V_C = 100 and n = 5000 in Eq. 8.16 yields a bound exceeding 1, which says nothing. Admittedly, the bound of Eq. 8.16 is
worst-case because there are no distributional assumptions. The situation may not be nearly so bad. Still, one must proceed with care, especially in the absence of distributional knowledge. Adding variables and neurons is often counterproductive unless there is a large sample available. Otherwise, one could end up with a very bad classifier whose error estimate is very small!
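The bound of Eq. 8.16 is simple to evaluate; the sketch below (illustrative, with the natural logarithm assumed) reproduces the observation that V_C = 100 and n = 5000 yield a vacuous bound:

```python
import numpy as np

def vc_bound(vc_dim, n):
    """Worst-case bound of Eq. 8.16 on E[Delta_{n,C}] for the
    empirical-error rule (natural log assumed)."""
    return 4 * np.sqrt((vc_dim * np.log(n) + 4) / (2 * n))

print(vc_bound(100, 5000))          # ~1.17: exceeds 1, so it says nothing
for n in (10**4, 10**5, 10**6):     # n must greatly exceed V_C to help
    print(n, round(vc_bound(100, n), 3))
```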
8. Illustration of Classification Techniques on Microarray Data
This section illustrates classification and error estimation using cDNA-microarray gene-expression data that have previously been used to classify childhood small, round blue cell tumors (SRBCTs) into four cancer types: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing tumor family (EWS) (Khan et al., 2002). We will break down the data in the same manner as the original study. There are 6,567 genes on each microarray, and omitting genes that do not satisfy a threshold level of expression reduces the number to 2,308. There is a total of 83 microarrays, 63 for training and 20 for testing. Principal component analysis has been performed using the training samples and the 10 dominant components have been selected as features. To this point, the data have been handled exactly as in the original study. Henceforth, we will illustrate classification rules discussed in the present chapter. The leftmost column of Table 8.1 lists the classifier models applied to the data: (1) perceptron using the thresholded Wiener filter, (2) neural network with 3 hidden nodes and the logistic sigmoid activation function for both layers, (3) nearest-neighbor, (4) k-nearest neighbor with k = 5, (5) linear support vector machine, (6) polynomial nonlinear support vector machine, (7) moving window with a Gaussian kernel with h = 0.5, (8) moving
window, and (9) moving window with the Epanechnikov kernel. Four-way classification has been achieved by first grouping NB (class 1) with RMS (class 2) and EWS (class 3) with NHL (class 4), and building a classifier for the two groups, after which classifiers are built to separate the classes within the groups. As reported in the table, three error measures have been computed: test-data error, leave-one-out error, and resubstitution error. We note that the leave-one-out error has been computed using the 10 principal components found from the full training set. This has been done for two reasons: (1) to keep throughout the same components used in the original study; and (2) to ease the computational burden. In fact, since feature reduction is part of the training, to be fully rigorous, principal component analysis should be redone on each of the 63 sample subsets used for computing the leave-one-out error.

Table 8.1. Classifier errors.

Model               | Test Error | Leave-one-out | Resubstitution
--------------------|------------|---------------|---------------
Perceptron          | 0.05       | 0             | 0
Neural Network      | 0.1        | 0.0469        | 0.0156
Nearest Neighbor    | 0.1        | 0.0313        | 0
5-Nearest Neighbor  | 0          | 0.0156        | 0.0156
Linear SVM          | 0.05       | 0.0313        | 0
Nonlinear SVM       | 0.15       | 0.0156        | 0
Gaussian Kernel     | 0.1        | 0.0313        | 0
Moving Window       | 0.15       | 0.0625        | 0.0313
Epanechnikov Kernel | 0.05       | 0.0156        | 0

The relationship between small samples and classifier complexity is illustrated by the results. The perceptron (using a very simple training method) outperforms the more complex neural network for all error measurements. The differing conclusions that can be drawn according to which error estimate is used can also be seen. The linear support vector machine has a lower test error than the nonlinear support vector machine, but the situation is reversed for leave-one-out error. A similar situation occurs with respect to the nearest-neighbor and 5-nearest-neighbor classifiers. Moreover, with respect to the latter, the test error is less than the resubstitution error. Owing to the excellent results achieved by the perceptrons (Wiener and support vector machine), it is clear that the classes (in principal-component representation) are readily separated by a very small number of hyperplanes.
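For readers who wish to reproduce the flavor of this comparison, the following sketch outlines the pipeline with scikit-learn. It is an illustration only: the random matrices stand in for the SRBCT expression data, which must be obtained separately; direct multiclass classification replaces the hierarchical two-group scheme described above; and, as in the chapter, the principal components are computed once from the full training set rather than within each leave-one-out fold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Placeholder data standing in for the filtered 2,308-gene SRBCT matrices:
# 63 training and 20 test arrays, labels in {0, 1, 2, 3}.
rng = np.random.default_rng(3)
X_train, y_train = rng.normal(size=(63, 2308)), rng.integers(0, 4, 63)
X_test,  y_test  = rng.normal(size=(20, 2308)), rng.integers(0, 4, 20)

# 10 dominant principal components computed from the training samples only.
pca = PCA(n_components=10).fit(X_train)
Ztr, Zte = pca.transform(X_train), pca.transform(X_test)

models = {
    "nearest neighbor":   KNeighborsClassifier(n_neighbors=1),
    "5-nearest neighbor": KNeighborsClassifier(n_neighbors=5),
    "linear SVM":         SVC(kernel="linear"),
    "nonlinear SVM":      SVC(kernel="poly", degree=3),
}
for name, clf in models.items():
    clf.fit(Ztr, y_train)
    test  = np.mean(clf.predict(Zte) != y_test)          # test-data error
    resub = np.mean(clf.predict(Ztr) != y_train)         # resubstitution
    loo   = 1 - cross_val_score(clf, Ztr, y_train,
                                cv=LeaveOneOut()).mean() # leave-one-out
    print(f"{name:20s} test={test:.3f} LOO={loo:.3f} resub={resub:.3f}")
```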
9. Conclusion
Except in situations where the amount of data is large in comparison to the number of variables, classifier design and error estimation involve subtle issues. This is especially so in applications such as cancer classification where there is no prior knowledge concerning the vector-label distributions involved. It is clearly prudent to try to achieve classification using small numbers of genes and rules of low complexity (low VC dimension), and to use cross-validation when it is not possible to obtain large independent samples for testing. Even when one uses a cross-validation method such as leave-one-out estimation, one is still confronted by the high variance of the estimator. In many applications, large samples are impossible owing to either cost or availability. Therefore, it is unlikely that a statistical approach alone will provide satisfactory results. Rather, one can use the results of
classification analysis to discover gene sets that potentially provide good discrimination, and then focus attention on these. In the same vein, one can utilize the common engineering approach of integrating data with human knowledge to arrive at satisfactory systems.
References

Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., and Yakhini, Z. (2000). "Tissue Classification with Gene Expression Profiles." Computational Biology 7:559–583. Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford University Press. Bittner, M., Meltzer, P., Khan, J., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., Dougherty, E., Wang, E., Marincola, F., Gooden, C., Lueders, J., Glatfelter, A., Pollock, P., Gillanders, E., Leja, A., Dietrich, K., Beaudry, C., Berrens, M., Alberts, D., Sondak, V., Hayward, N., and Trent, J. (2000). "Molecular Classification of Cutaneous Malignant Melanoma by Gene Expression Profiling." Nature 406:536–540. Brown, M. P. S., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, Jr. M., and Haussler, D. (2000). "Knowledge-Based Analysis of Microarray Gene Expression Data by Using Support Vector Machines." Proc National Academy Science 97:262–267. Cybenko, G. (1989). "Approximation by Superposition of Sigmoidal Functions." Mathematics Control Signals Systems 2:303–314. Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag. Devroye, L. and Kryzak, A. (1989). "An Equivalence Theorem for L1 Convergence of the Kernel Regression Estimate." Statistical Planning and Inference 23:71–82. Dougherty, E. R. (2001). "Small Sample Issues for Microarray-Based Classification." Comparative and Functional Genomics 2:28–34. Farago, A. and Lugosi, G. (1993). "Strong Universal Consistency of Neural Network Classifiers." IEEE Trans on Information Theory 39:1146–1151. Funahashi, K. (1989). "On the Approximate Realization of Continuous Mappings by Neural Networks." Neural Networks 2:183–192. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science 286:531–537. Gordon, L. and Olshen, R. (1978). "Asymptotically Efficient Solutions to the Classification Problem." Annals of Statistics 6:525–533.
136
COMPUTATIONAL GENOMICS
Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon. R., Meltzer, P., Gusterson, B., Esteller, M., Raffeld, Yakhini, Z., Ben-Dor, A., Dougherty, E., Kononen, J., Bubendorf, L., Fehrle, W., Pittaluga, S., Gruvverger, S., Loman, N., Johannsson, O., Olsson, H., Wifond, B., Sauter, G., Kallioniemi, O. P., Borg, A., and Trent, J. (2001). “Gene Expression Profiles Distinguish Hereditary Breast Cancers.” New England J Medicine 34:539–548. Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., and Meltzer, P. S. (2002). “Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks.” Nature Medicine 7:673–679. Kim, S., Dougherty, E. R., Barrera, J., Chen, Y., Bittner, M., and Trent, J. M. (2002). “Strong Feature Sets From Small Samples.” Journal of Computational Biology 9. Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington DC: Spartan. Stone, C. (1977). “Consistent Nonparametric Regression.” Annals of Statistics 5:595–645. Vapnik, V. N., Golowich, S. E., and Smola, A. (1997). “Support Vector Method for Function Approximation, Regression, and Signal Processing.” In: Advances In Neural Information Processing Systems 9. Vapnik, V. N. (1998). Statistical Learning Theory. New York: John Wiley. Vapnik, V. and Chervonenkis, A. (1974). Theory of Pattern Recognition. Moscow: Nauka. Vapnik, V. and Chervonenkis, A. (1971). “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.” Theory of Probability and its Applications 16:264–280.
Chapter 9
ANALYZING PROTEIN SEQUENCES USING SIGNAL ANALYSIS TECHNIQUES
Karen M. Bloch¹ and Gonzalo R. Arce²
¹DuPont Company, Wilmington, Delaware, USA; ²University of Delaware, Department of Electrical & Computer Engineering, Newark, Delaware, USA 19716
Abstract

This chapter discusses the use of frequency and time-frequency signal processing methods for the analysis of protein sequence data. The amino acid sequence of a protein may be considered as a sequence over a twenty-symbol alphabet, or it may be considered as a sequence of numerical values reflecting various physicochemical aspects of the amino acids such as hydrophobicity, bulkiness, or electron-ion interaction potential. When primary protein sequence information is mapped into numerical values, it is possible to treat the sequence as a signal and apply well-known signal processing methods for analysis. These methods allow proteins to be clustered into functional families and can also lead to the identification of biologically active sites. This chapter discusses frequency and time-frequency methods for protein sequence analysis and illustrates these concepts using various protein families. In addition, a method for selecting appropriate numerical mappings of amino acids is introduced.

Keywords: Wigner-Ville, time-frequency, amino acid, protein sequence, Fourier transform, information theory, entropy

1. Introduction
Genomes carry all information of life from one generation to the next for every organism on earth. Each genome, which is a collection of DNA molecules, can be represented as a series of strings comprised of four
letter symbols. Less than 10 years ago, determining the sequence of these letters to read a single gene was a slow, tedious process. But today, through the use of new strategies, genome sequencing is a billion-dollar worldwide effort in both industry and academia. At the end of 1998, researchers had completely read the genome of only one multicellular organism, a worm known as C. elegans. Now, sequences exist for the fruit fly, the human, and Arabidopsis, a weed important to plant geneticists. Researchers have also been working on simpler organisms. Several dozen microbial genomes are now available, including those of the organisms that cause cholera and meningitis. Most of these data are accessible free of charge, encouraging their exploration. However, it is not the genes, but the proteins they code for, that actually do all the work. The search for protein function has led to the era of proteomics: the identification and characterization of each protein and its structure, and of every protein-protein interaction (Pennisi, 2000). Proteins are the molecules that accomplish most of the functions of living cells. All proteins are constructed from linear sequences of smaller molecules called amino acids. There are twenty naturally occurring amino acids, and they can be represented in a protein sequence as a string of alphabetic symbols. Protein molecules fold to form specific three-dimensional shapes which specify their particular chemical function (Hunter, 1993). Analysis of protein sequences can provide insights into function and can also lead to knowledge regarding biologically active sites of the protein. While analysis of protein sequences is often performed directly on the symbolic representation of the amino acid sequence, patterns in the sequence are often too weak to be detected as patterns of symbols. Alternative sequence analysis techniques can be performed by assigning numerical values to the amino acids in a protein. The numerical values are derived from physicochemical properties of the amino acids that are relevant to biological activity. It has been shown that the electron-ion interaction potential (EIIP), as one such measure, correlates with certain biological properties (Veljkovic et al., 1985). Once a numerical mapping for a protein sequence is achieved, the sequence can be treated as a signal. From a mathematical point of view, a signal can be described in a variety of ways. For example, a signal can be represented as a function of time which shows how the signal magnitude changes over time. Alternatively, a signal can be written as a function of frequency by performing a Fourier transform; this tells how quickly a signal's magnitude changes (Qian and Chen, 1996). For many real world applications, it is useful to characterize a signal in the time and frequency domains simultaneously. Such signal analysis methods can provide fingerprints which indicate the existence of
some event of importance. In the case of protein sequences represented as numerical signals, such an event might be the existence of a binding site. This chapter illustrates the use of frequency and time-frequency signal analysis techniques with two classes of proteins, fibroblast growth factors and homeodomain proteins. Fibroblast growth factors constitute a family of proteins that affect the growth, migration, differentiation, and survival of certain cells. Homeodomain proteins contain a single 60-amino acid DNA binding domain. It is the numerical representation of these amino acid sequences, along with various frequency and time-frequency analysis methods, that we describe herein. Another aspect addressed in this chapter is the selection of appropriate numerical mappings. A new method, based on information theory, for selecting the appropriate representation of an amino acid sequence in numerical space is presented.
2. Frequency Analysis of Proteins

The Resonant Recognition Model (RRM) (Cosic, 1994) is a physico-mathematical model that analyzes the interaction of a protein and its target using signal processing methods. One application of this model involves prediction of a protein's biological function. In this technique, a Fourier transform is applied to a numerical representation of a protein sequence and a peak frequency is determined for a protein's particular function. The discrete Fourier transform (DFT) is defined as follows:

$$X(n) = \sum_{m=0}^{N-1} x(m)\, e^{-j(2\pi/N)nm}, \qquad n = 1, 2, \ldots, N/2 \tag{9.1}$$
where x(m) is the mth member of a numerical series, N is the total number of points in the series, and X(n) are the coefficients of the DFT. The coefficients describe the amplitude, phase, and frequency of the sinusoids which make up the original signal. In the RRM, the protein sequences are treated as discrete signals, and it is assumed that the points are equidistant with distance d = 1. In this case, the maximal frequency in the spectrum is F = 1/(2d) = 0.5. The aim of this method is to determine a single parameter that correlates with a biological function expressed by a set of genetic sequences. To determine such a parameter, it is necessary to find common characteristics of sequences with the same biological function. The cross-spectral function determines common frequency components of two signals. For a discrete series, the cross-spectral function is defined as:
$$S(n) = X(n)\, Y^{*}(n), \qquad n = 1, 2, \ldots, N/2 \tag{9.2}$$
where X(n) are the DFT coefficients of the series x(n) and Y*(n) are the complex conjugate DFT coefficients of the series y(n). Peak frequencies in the cross-spectral function define common frequency components for the analyzed sequences. The common frequency components for a group of protein sequences can be defined as follows:

$$|S(n)| = |X_1(n)|\,|X_2(n)| \cdots |X_M(n)|, \qquad n = 1, 2, \ldots, N/2 \tag{9.3}$$
where M is the number of sequences. This methodology can be illustrated via an example. We have chosen to study fibroblast growth factors (FGF), which constitute a family of proteins that affect the growth, differentiation, and survival of certain cells. The symbolic representations of two FGF amino acid sequences are shown below:

> Basic bovine FGF
PALPEDGGSGAFPPGHFKDPKRLYCKNGGFFLRIHPDGRVDGVREKSDPH
IKLQLQAEERGVVSIKGVCANRYLAMKEDGRLLASKCVTDECFFFERLES
NNYNTYRSRKYSSWYVALKRTGQYKLGPKTGPGQKAILFLPMSAKS

> Acidic bovine FGF
FNLPLGNYKKPKLLYCSNGGYFLRILPDGTVDGTKDRSDQHIQLQLCAES
IGEVYIKSTETGQFLAMDTDGLLYGSQTPNEECLFLERLEENHYNTYISK
KHAEKHWFVGLKKNGRSKLGPRTHFGQKAILFLPLPVSSD
Symbolic representations such as these can be translated into numerical sequences using the EIIP index (Tomii and Kanehisa, 1996). It has been shown that the EIIP correlates with certain biological properties (Veljkovic et al., 1985). The graphical representation of the corresponding numerical sequences for the FGF proteins, obtained by replacing every amino acid with its EIIP value, can be seen in Fig. 9.1. A DFT is performed on each numerical sequence. The resulting spectra are shown in Fig. 9.2. The cross-spectral function of the two FGF spectra generates the consensus spectrum shown in Fig. 9.3.

[Figure 9.1. Numerical EIIP representations of FGF proteins. Panels: EIIP of Acidic Bovine FGF and EIIP of Basic Bovine FGF; x-axis: Amino Acid Position; y-axis: EIIP.]

[Figure 9.2. FFT representations of FGF proteins. Panels: Acidic Bovine FGF and Basic Bovine FGF; x-axis: Frequency (0 to 0.5); y-axis: normalized intensity.]

For the spectra, the x-axis represents the RRM frequencies and the y-axis represents the normalized intensities. The prominent peak denotes the common frequency component for this family of proteins. The presence of a peak frequency in a consensus spectrum implies that all the analyzed sequences have one frequency component in common. This frequency is related to the biological function provided the following conditions are met (a code sketch of the consensus-spectrum computation follows the list):
- One peak only exists for a group of protein sequences sharing the same biological function.
- No significant peak exists for biologically unrelated protein sequences.
- Peak frequencies are different for different biological functions.
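To make the procedure concrete, here is a minimal Python/NumPy sketch of the consensus-spectrum computation of Eqs. 9.1-9.3 (a sketch, not the authors' implementation). The EIIP dictionary uses the values tabulated later in Table 9.1; zero-padding all sequences to a common FFT length and removing the mean (DC) component are our own assumptions for handling sequences of unequal length.

```python
import numpy as np

# EIIP value per amino acid (values from Table 9.1 later in this chapter).
EIIP = {
    'A': 0.37100, 'R': 0.95930, 'N': 0.00359, 'D': 0.12630, 'C': 0.08292,
    'Q': 0.07606, 'E': 0.00580, 'G': 0.00499, 'H': 0.02415, 'I': 0.00000,
    'L': 0.00000, 'K': 0.37100, 'M': 0.08226, 'F': 0.09460, 'P': 0.01979,
    'S': 0.08292, 'T': 0.09408, 'W': 0.05481, 'Y': 0.05159, 'V': 0.00569,
}

def consensus_spectrum(sequences, n_fft=512):
    """Multiply the normalized DFT magnitude spectra of EIIP-encoded
    protein sequences (Eq. 9.3) to expose common RRM frequency components."""
    consensus = np.ones(n_fft // 2)
    for seq in sequences:
        signal = np.array([EIIP[aa] for aa in seq])
        signal = signal - signal.mean()               # suppress the DC peak
        spectrum = np.abs(np.fft.fft(signal, n_fft))[:n_fft // 2]
        consensus *= spectrum / spectrum.max()        # normalized intensities
    freqs = np.arange(n_fft // 2) / n_fft             # RRM frequencies in [0, 0.5)
    return freqs, consensus
```

The family's RRM peak frequency is then freqs[np.argmax(consensus)].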
What is lacking in this technique is the ability to reliably identify the individual amino acids that contribute to the resonant recognition peak frequency. Like signals in general, signals associated with biological sequences are often non-stationary, meaning that the frequency components change along the primary protein sequence chain. These issues can be addressed with time-frequency analysis techniques.
[Figure 9.3. Consensus spectrum of the two bovine FGF proteins shown in Figure 9.2; x-axis: Frequency (0 to 0.5); y-axis: normalized intensity.]
3. Time-Frequency Analysis
For many real world applications, it is useful to characterize a signal in the time and frequency domains simultaneously. Such signal analysis methods can provide fingerprints which indicate the existence of some event of importance. In the case of protein sequences represented as numerical signals, such an event might be the existence of a binding site. Frequency analysis alone cannot handle the transitory nature of non-stationary signals. A time-frequency (or space-frequency) representation of a signal provides information about how the spectral content of the signal evolves with time (space), and therefore provides a tool to analyze non-stationary signals.
3.1 Non-Stationary Signals

Let us consider as a first example the sum of two synthetic signals with constant amplitude and linear frequency modulation. This type of signal is called a chirp, and since its frequency content varies in time, it is a non-stationary signal. An example of such a signal can be seen in Fig. 9.4. It consists of the sum of two linear chirps with frequencies varying over [0, 0.5] and [0.2, 0.5]. From this time-domain representation, it is difficult to determine what kind of modulation is contained in this signal. If we now consider the energy spectrum of this signal by squaring the modulus of
its Fourier transform, as illustrated in Fig. 9.5, we still cannot determine anything about the evolution in time of the frequency components. In order to describe such signals more accurately, it is better to directly represent their frequency content while still keeping the time (or spatial) parameter. This is the aim of the time-frequency analysis methods discussed in the next sections.

[Figure 9.4. Sum of 2 chirps ("Signal X = Sum of 2 Linear Chirps Varying from 0.2-0.5 and 0.0-0.5"); x-axis: Time; y-axis: Real Component.]

[Figure 9.5. FFT of the signal ("Fourier Transform of Chirps"); x-axis: Normalized Frequency; y-axis: Squared Modulus.]
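A minimal NumPy sketch of such a two-chirp test signal follows; the sample count and the cosine convention are our assumptions, chosen only to mirror Fig. 9.4.

```python
import numpy as np

def linear_chirp(f0, f1, n):
    """Real linear chirp whose instantaneous frequency sweeps from f0 to f1
    (in cycles/sample) over n samples."""
    t = np.arange(n)
    # Integrated phase of a linear sweep: 2*pi*(f0 + (f1 - f0)*t/(2*(n - 1)))*t
    return np.cos(2 * np.pi * (f0 + (f1 - f0) * t / (2 * (n - 1))) * t)

x = linear_chirp(0.0, 0.5, 128) + linear_chirp(0.2, 0.5, 128)
# The energy spectrum discards all timing of the frequency content (Fig. 9.5):
energy_spectrum = np.abs(np.fft.fft(x)) ** 2
```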
3.2 Wavelet Transform

A method for protein analysis proposed by Fang and Cosic (Fang and Cosic, 1999) uses a continuous wavelet transform to analyze the EIIP representations of protein sequences. The continuous wavelet transform (CWT) is one example of a time-frequency or space-frequency representation. Because the CWT provides the same time/space resolution at each scale, it can be chosen to localize individual events such as active sites. The amino acids that comprise the active site(s) are identified as the set of local extrema of the coefficients in the wavelet transform domain. These energy-concentrated local extrema are the locations of sharp variation points of the EIIP and are proposed as the most critical locations for a protein's biological function (Fang and Cosic, 1999). Experiments have shown that the potential cell attachment sites of FGFs
(fibroblast growth factors) are between residues 46-48 and 88-90. Figure 9.6 is a CWT spectrogram of the basic bovine FGF protein. It can be observed that there are two bright regions at the higher frequencies, from scale 1.266 to scale 2.062, which correspond to the amino acids at the active sites. The use of wavelet transforms shows promise for identifying amino acids at potential biologically active sites, but does not reveal the characteristic frequency component of the Resonant Recognition Model. An additional weakness exists in that the spectrogram of the CWT can often be difficult to interpret. These weaknesses can be addressed by the use of a different time-frequency transform.

[Figure 9.6. Continuous wavelet spectrogram of basic bovine FGF protein (absolute values of the C(a,b) coefficients over scales a = 1, 1.0156, 1.0312, ..., 5.7812; x-axis: Position - FGF). The red boxes indicate the active sites. (Please refer to the color section of this book to view this figure in color).]
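A brief sketch of this CWT analysis, assuming the third-party PyWavelets package and the EIIP dictionary from the earlier sketch; the Morlet mother wavelet and the scale grid are our own choices, since the description above does not fix them.

```python
import numpy as np
import pywt  # PyWavelets

def cwt_magnitudes(seq, scales=np.arange(1, 65)):
    """CWT coefficient magnitudes of an EIIP-encoded protein sequence;
    energy-concentrated local extrema flag candidate active-site positions."""
    signal = np.array([EIIP[aa] for aa in seq])
    coeffs, _ = pywt.cwt(signal, scales, 'morl')  # Morlet mother wavelet
    return np.abs(coeffs)                         # shape: (n_scales, len(seq))
```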
3.3 Wigner-Ville Distribution

Quadratic time-frequency representations are powerful tools for the analysis of non-stationary signals. The Wigner-Ville distribution (WVD),

$$W_x(t, f) = \int_{\tau} x(t + \tau/2)\, x^{*}(t - \tau/2)\, e^{-j 2\pi f \tau}\, d\tau, \tag{9.4}$$

for example, satisfies a number of desirable mathematical properties and possesses optimal resolution in time-frequency space (Arce and Hasan, 2000).
Application of the Wigner-Ville distribution allows more subtle signal features to be detected, such as those having short length and high frequency variation. However, the use of the Wigner-Ville distribution has been limited because of the presence of cross or interference terms. The Wigner-Ville distribution of the sum of two signals x(t) + y(t),

$$W_{x+y}(t, f) = W_x(t, f) + 2\Re\{W_{xy}(t, f)\} + W_y(t, f), \tag{9.5}$$

has a cross term $2\Re\{W_{xy}(t, f)\}$ in addition to the two auto components. Because the cross term usually oscillates and its magnitude is twice as large as that of the auto components, it can obscure useful time-dependent spectral patterns. Figure 9.7 shows the Wigner-Ville transform¹ of the sum of the two chirps presented in Fig. 9.4. One can observe the individual components of this signal, but there is also a great deal of cross-term interference evident in the figure. Reduction of the cross-term interference without destroying the useful properties of the WVD is very important to time-frequency analysis (Qian and Chen, 1996).
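The chapter's WVD figures were generated with the MATLAB Time-Frequency Toolbox (see the Notes at the end of the chapter). Purely as an illustration of the discrete form of Eq. 9.4, a naive pseudo Wigner-Ville distribution can be sketched as follows; the FFT-based frequency convention is our assumption.

```python
import numpy as np

def wigner_ville(x):
    """Naive discrete (pseudo) Wigner-Ville distribution: for each time t,
    Fourier-transform the lag product x(t + tau) * conj(x(t - tau))."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    W = np.zeros((n, n), dtype=complex)
    for t in range(n):
        tau_max = min(t, n - 1 - t)          # lags that stay inside the signal
        tau = np.arange(-tau_max, tau_max + 1)
        kernel = np.zeros(n, dtype=complex)
        kernel[tau % n] = x[t + tau] * np.conj(x[t - tau])
        W[:, t] = np.fft.fft(kernel)
    return np.abs(W[:n // 2, :])             # rows: frequency bins; columns: time
```

Applied to the two-chirp signal above, the cross terms appear as oscillating energy midway between the two auto components, as in Fig. 9.7.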
[Figure 9.7. Wigner-Ville transform of the sum of 2 chirps ("Wigner-Ville of X"); x-axis: Time [s]; y-axis: Frequency [Hz].]
3.4 Interference Terms

Minimal interference terms are necessary for appropriate interpretation and discrimination of the signals being analyzed (Auger and Flandrin, 1995). The interference terms of the Wigner-Ville distribution are due to the WVD's quadratic structure. Interference terms occur in the case of multicomponent signals and can be represented mathematically with quadratic cross terms. While filtering of the interference terms is desirable, the use of linear filters can distort the resolution and concentration of the auto component terms. However, the nonlinear center affine filter² as described in (Arce and Hasan, 2000) can be applied to effectively filter out the cross terms while leaving the auto component terms relatively unaffected. For the previous example of the sum of two chirps, the center affine filter was applied to the Wigner-Ville transform. A window size of 13 was used in this example, and the resulting image can be seen in Fig. 9.8. In another example, we take the cross Wigner-Ville transform of two signals which are each the sum of two chirps, but share a common component, a chirp varying over [0.2, 0.5]. In Fig. 9.9 we can detect the common component, but once again the interference terms are strongly present. Application of the center affine filter to this figure produces the
result in Fig. 9.10, allowing the common frequency component to be readily discerned. In the next section, the Wigner-Ville time-frequency representation (TFR), along with the center affine filtering method, is used to analyze two protein families.

[Figure 9.8. Affine filter applied to a Wigner-Ville transform ("Affine smoothed WV L=13"); x-axis: Time [s]; y-axis: Frequency [Hz].]

[Figure 9.9. Cross Wigner-Ville of 2 chirp signals with a common component ("Cross WV of X and Y"); x-axis: Time [s]; y-axis: Frequency [Hz].]

[Figure 9.10. Cross Wigner-Ville affine filtered ("Affine smoothed Cross WV L=13"); x-axis: Time [s]; y-axis: Frequency [Hz].]
4. Application of Time-Frequency Analysis to Protein Families

4.1 Fibroblast Growth Factors
We illustrate the use of the Wigner-Ville time-frequency distribution using the same FGF proteins and EIIP representation that were discussed in Sections 2 and 3.2. As stated previously, experiments have shown that the potential cell attachment sites of FGFs are between residues 46-48 and residues 88-90, and the characteristic frequency has been reported in the literature as 0.4512 (Fang and Cosic, 1999). The discrete Wigner-Ville time-frequency representation of human basic FGF is shown in Fig. 9.11. Cross terms make interpretation of this representation difficult. After application of the center affine filter as described in (Arce and Hasan, 2000), one can see in Fig. 9.12 that the bright regions in the lowest frequency range correspond to experimentally
proven activation sites and the bright region around frequency 0.45 corresponds to the characteristic frequency component.

[Figure 9.11. Wigner-Ville TFR of basic human FGF; x-axis: Amino Acid Position; y-axis: Frequency [Hz]. (Please refer to the color section of this book to view this figure in color).]

[Figure 9.12. Activation sites of basic human FGF ("Filtered Wigner-Ville Representation"); x-axis: Amino Acid Position; y-axis: Frequency [Hz].]
4.2 Homeodomain Proteins

Homeodomain proteins contain a 60-amino acid DNA binding domain found in numerous eukaryotic transcription factors. The homeodomain family is a useful system for studying sequence-structure and sequence-function relationships because several hundred sequences are known and the structures of several homeodomains have been determined (Clarke, 1995). Application of the Wigner-Ville TFR to a homeodomain B6 protein that has been represented via the EIIP mapping results in the plot shown in Fig. 9.13. There does not appear to be any clear signature in the plot which one could relate to active regions on the protein. However, if an alternative mapping is used, such as one measuring the hydrophobicity of the amino acids, a clear indication of the binding region is detected at amino acid positions 146-205 (Fig. 9.14). This example illustrates the importance of the particular numerical representation being used to describe the protein family being studied. This is the primary motivation of Section 5 on the selection of amino acid mappings.
5. Selection of Amino Acid Mappings

As explained in previous sections, the amino acid sequence of a protein is typically represented as a string of 20 different letters indicating the 20 naturally occurring amino acids. An amino acid index is a set of 20 numerical values representing specific physicochemical or biochemical properties of the amino acids. We have shown in Sections 2 and 3 that mapping the symbolic representation of an amino acid sequence into a numerical representation allows signal analysis methods to be used to study the structural or functional patterns within protein sequences. As was also shown in Section 4, the selection of a particular amino acid index is important in being able to detect signatures related to the biological activity of a protein. In this section we develop a method for selecting amino acid indices which yield biologically relevant signatures.
5.1 Amino Acid Indices

Each of the twenty amino acids has various properties that are responsible for the specificity and diversity of protein structure and function. Experimental and theoretical research has characterized different kinds of
[Figure 9.13. TFR of homeodomain proteins using EIIP mapping; x-axis: Amino Acid Position; y-axis: Frequency [Hz].]

[Figure 9.14. TFR of homeodomain proteins using hydrophobicity mapping; x-axis: Amino Acid Position; y-axis: Frequency [Hz].]
properties of individual amino acids and represented them in terms of a numerical index (Kawashima and Kanehisa, 2000). In 1996, Tomii and Kanehisa (Tomii and Kanehisa, 1996) collected 402 published amino acid indices and studied their interrelationships by performing hierarchical cluster analysis. The cluster analysis revealed six major clusters: α and turn propensities; β propensity; amino acid composition; hydrophobicity; physicochemical properties; and other properties such as the frequency of left-handed helices (Tomii and Kanehisa, 1996). Examples of some of the indices are listed in Table 9.1. EIIP is the electron-ion interaction potential, which is the energy of delocalized electrons of each amino acid. This mapping represents the distribution of the free electron energies along the protein sequence. Kyte-Doolittle is a measure of hydrophobicity (Kyte and Doolittle, 1982); a score of 4.5 is the most hydrophobic and a score of −4.5 is the most hydrophilic. Eisenberg (Eisenberg, 1984) is an average hydrophobicity score calculated from 5 other hydrophobicity scales. Figures 9.15 and 9.16 show the results of using two amino acid mappings on four members of the epidermal growth factor (EGF) protein family. Note the more structured pattern in the signals which were obtained from the index which is a measure of the width of the side chains on the amino acids (Fig. 9.16), whereas the charge transfer signal in Fig. 9.15 appears more random.
Table 9.1. Examples of amino acid indices.

Amino Acid   EIIP      Kyte-Doolittle   Eisenberg
A            0.37100    1.8              0.62
R            0.95930   -4.5             -2.53
N            0.00359   -3.5             -0.78
D            0.12630   -3.5             -0.90
C            0.08292    2.5              0.29
Q            0.07606   -3.5             -0.85
E            0.00580   -3.5             -0.74
G            0.00499   -0.4              0.48
H            0.02415   -3.2             -0.40
I            0.00000    4.5              1.38
L            0.00000    3.8              1.06
K            0.37100   -3.9             -1.50
M            0.08226    1.9              0.64
F            0.09460    2.8              1.19
P            0.01979   -1.6              0.12
S            0.08292   -0.8             -0.18
T            0.09408   -0.7             -0.05
W            0.05481   -0.9              0.81
Y            0.05159   -1.3              0.26
V            0.00569    4.2              1.08
[Figure 9.15. Four different epidermal growth factors represented by an amino acid mapping reflecting charge transfer donor capability (Index 31); x-axis: residue position 0-60.]
[Figure 9.16. Four different epidermal growth factors represented by an amino acid mapping reflecting the width of a side chain (Index 82); x-axis: residue position 0-60.]
The amino acid index database currently contains nearly 500 indices (GenomeNet). We will now discuss how to select an appropriate index based on the protein family being studied.
5.2 Information Theory

We guide the selection of amino acid indices for a protein sequence using principles derived from communication theory. The information metrics are based on the work of Shannon (Shannon, 1948), who used the thermodynamic notion of entropy to measure the information capacity of communication channels. In much of physics and chemistry, the description of the state of a given collection of molecules (the "system") involves energy and how it is distributed. This energy is partitioned into translational motions, vibrations and rotations within the molecules, and into motions of electrons in the atoms and bonds. For a given level of total energy (the "macrostate") there is a range of energies associated with the translational motions, vibrations, rotations, and electronic structures. For each of these energy components there is a range of microstates over which the energy is dispersed. In this context, entropy is an indicator of the dispersal or dissipation of energy within a system (or between the system and its surroundings). An entropy increase in a system involves energy dispersal among more microstates; to say this another way, the entropy of a macrostate measures the number of ways in which a system can differ microscopically (Lambert, 2002). Statistical mechanics interprets an increase in entropy as a decrease in order. In communication theory, entropy is measured from a communication source which may produce any one of a number of possible messages. The amount of information expressed in the message increases as the amount of uncertainty as to which message the source produced increases. The entropy of communication theory is a measure of this uncertainty (Pierce, 1980). Entropy is the measurement of the average number of bits required to encode a message. The entropy of an IID random variable X is

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x). \tag{9.6}$$

H(X) ranges in value from 0, when p(x) = 1, to log n, when p(x) = 1/n, where n is the number of possible symbols. By considering a protein sequence as the message, we can apply the concepts of entropy to the problem of selecting an amino acid index.
To calculate the information content, I, of a protein sequence with N unique symbols, we begin by defining

$$F = \{f_1, f_2, \ldots, f_N\}, \qquad f_{tot} = \sum_{i=1}^{N} f_i \tag{9.7}$$
where $f_i$ is the frequency count of symbol i, and $f_{tot}$ is the number of symbols in the sequence. The information value of a protein sequence can then be defined as

$$I = \frac{\sum_{i=1}^{N} p_i \log p_i}{\log \frac{1}{N}}, \qquad 0 \le I \le 1 \tag{9.8}$$
where

$$p = \left\{ \frac{f_1}{f_{tot}}, \frac{f_2}{f_{tot}}, \ldots, \frac{f_N}{f_{tot}} \right\} = \{p_1, p_2, \ldots, p_N\}, \qquad \sum_{i=1}^{N} p_i = 1. \tag{9.9}$$
The entropy measure is normalized by dividing by log(1/N), which bounds I between 0 and 1. An example of calculating I is given in Fig. 9.17. In the examples shown in Fig. 9.17, the number of symbols in the alphabet was 2. In the case of amino acid sequences, we could fix the value of N = 20 for the different amino acids; however, we want to calculate I for a given amino acid index, which may contain fewer than 20 discrete values. By examining the indices listed in Table 9.1 it can be seen that some indices have a many-to-one relationship of amino acids to index values, and the range of the values varies considerably. To take into consideration the dynamic range of the amino acid indices, we developed a method to determine N when calculating I for a given index mapping. We break the dynamic range of an index into 20 equally sized bins, the size of which is given by

$$binsize = \frac{a_{max} - a_{min}}{20} \tag{9.10}$$
where $a_{max}$ and $a_{min}$ are the maximum and minimum values in the specific index. For each amino acid measure i, the bin that measure i occupies is determined by

$$B_i = \left\lceil \frac{a_i - a_{min}}{binsize} \right\rceil. \tag{9.11}$$
[Figure 9.17. Examples of calculating I on 2 simple sequences where the number of symbols is N = 2: for AAAAAAAA, F = {8, 0} and I = 0; for APAPAPAP, F = {4, 4} and I = 1.]
An example of determining N using the Kyte-Doolittle hydrophobicity index is presented in Fig. 9.18. By placing each of the values into bins using the formulas defined in Equations 9.10 and 9.11, we discover that the amino acids with equal or approximately equal property values fall into the same bin (binsize = 0.45). By counting the number of occupied bins out of a possible 20, we determine N = 12 for this index. Now we can use this value of N = 12 to calculate I for a hypothetical 16-mer protein sequence:

HPDGRVDGVREKSDPH

First, count the frequency of occurrence of each amino acid, keeping track of the tallies in the 12 bins. This yields the frequency vector

$$F = \{2, 1, 6, 2, 0, 1, 2, 0, 0, 0, 0, 2\}, \qquad f_{tot} = 16 \tag{9.12}$$
Therefore,

$$p = \left\{ \frac{2}{16}, \frac{1}{16}, \frac{6}{16}, \frac{2}{16}, 0, \frac{1}{16}, \frac{2}{16}, 0, 0, 0, 0, \frac{2}{16} \right\} \tag{9.13}$$

and

$$I = \frac{\sum_{i=1}^{12} p_i \log p_i}{\log \frac{1}{12}} = 0.7059. \tag{9.14}$$
[Figure 9.18. Bin the values of the amino acid index to determine the value of N. For the Kyte-Doolittle index, binsize = (4.5 − (−4.5))/20 = 0.45; e.g., B_W = (−0.9 − (−4.5))/0.45 = 8. The 20 amino acids occupy 12 of the 20 bins (R; K; D,E,H,N,Q; P; W,Y; S,T; G; A,M; C; F; L; I,V), so N = 12.]
This is the entropic measure of this amino acid segment if it is mapped into the Kyte-Doolittle representation.
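For illustration, here is a Python sketch of the complete I computation (Eqs. 9.8-9.11). The handling of values falling exactly on a bin edge is our assumption, since the chapter does not specify a tie-breaking rule; depending on that convention, N for the Kyte-Doolittle index comes out as 12 or 13, so the worked example lands near, but not necessarily exactly at, 0.7059.

```python
import numpy as np

KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5, 'E': -3.5,
    'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8,
    'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def information_value(seq, index):
    """Normalized entropy I of a protein sequence under an amino acid index
    (a dict mapping one-letter codes to numerical values)."""
    values = np.array(list(index.values()))
    amin, amax = values.min(), values.max()
    binsize = (amax - amin) / 20.0                                  # Eq. 9.10
    bin_of = lambda v: max(int(np.ceil((v - amin) / binsize)), 1)   # Eq. 9.11
    occupied = sorted({bin_of(v) for v in values})
    n = len(occupied)                              # effective alphabet size N
    seq_bins = [bin_of(index[aa]) for aa in seq]
    p = np.array([seq_bins.count(b) for b in occupied], float) / len(seq)
    p = p[p > 0]                                   # define 0 * log(0) = 0
    return float(np.sum(p * np.log(p)) / np.log(1.0 / n))          # Eq. 9.8

# Worked example from the text:
print(information_value('HPDGRVDGVREKSDPH', KYTE_DOOLITTLE))
```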
5.3 Analysis

To continue our study of amino acid index selection, let us examine a plot of the I values for each amino acid index using a homeodomain protein sequence. This result can be seen in Fig. 9.19. One can observe that the majority of the indices have an I value greater than 0.8. Since we are interested in representations that reveal structure associated with the protein sequence, we can examine the indices which yield a lower value of I, characterizing the less random nature of these representations. By examining the information values for several protein families, we determined that a good "rule of thumb" is to select the amino acid index whose I value is the median between I_min and 0.8. We also determined that the values of I for protein sequences from the same family are very similar. Therefore, one can use the same index representation for each member of the protein family being studied. Figure 9.20 is the consensus spectrum
obtained by applying the time-frequency transform discussed in Section 3 to the homeodomain family. The amino acid mapping was chosen by using the index which yielded the median value of I for I < 0.8. The binding domain known to exist in these proteins at amino acid locations 146-205 is clearly highlighted in the spectrum.

[Figure 9.19. Sorted values of I for a homeodomain protein; x-axis: Amino Acid Index; y-axis: Information Content I.]

[Figure 9.20. Consensus time-frequency spectrum of two homeodomain proteins using an amino acid index which is a measure of hydrophobicity; x-axis: Time [s]; y-axis: Frequency [Hz].]

As another example, we examine epidermal growth factors (EGFs). EGFs have a variety of functions, such as stimulating RNA, DNA, and protein production (Mroczkowski and Hall, 1990). By calculating I for each amino acid index on a set of four EGFs from human, mouse, pig, and rat, we determined that an index first published by Fauchere et al. (Fauchere et al., 1988), which measures the minimum width of each amino acid side chain, yielded the median of the I values ≤ 0.8. Figure 9.21 reflects the Wigner-Ville transform using this index. The regions with the largest coefficients are located along the sequence where functionally important amino acids have been identified (Cosic, 1997).
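A compact sketch of this selection rule, reusing the hypothetical information_value() helper above; the indices argument is assumed to be a dictionary mapping index names to amino-acid-value dictionaries (e.g., parsed from the AAindex database):

```python
def select_index(seq, indices):
    """Among indices whose information value I for seq is below 0.8,
    return the name of the index whose I is the median of that subset."""
    scored = sorted((information_value(seq, idx), name)
                    for name, idx in indices.items())
    low = [item for item in scored if item[0] < 0.8]
    candidates = low if low else scored   # fall back if none score below 0.8
    return candidates[len(candidates) // 2][1]
```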
6. Conclusions

This chapter illustrates the usefulness of time-frequency signal processing for the analysis of protein sequence data. Time-frequency representations such as the Wigner-Ville distribution, when appropriately filtered for interference terms, provide frequency as well as spatial information and lead to
the ability to identify biologically active sites of certain proteins.

[Figure 9.21. Consensus spectrum of the Wigner-Ville transform of EGF proteins ("EGF - Index 82") using an amino acid index which measures the minimum width of each amino acid side chain; x-axis: Time [s]; y-axis: Frequency [Hz].]

However, as illustrated in the homeodomain example, not all mappings are capable of correlating with biological properties. The selection of the numerical mapping is problem-specific, since different mappings may highlight different structural properties of proteins which are directly related to a particular protein's function. We have demonstrated that by using an entropic measure of a protein sequence representation, incorporating both the amino acid composition of the protein sequence and the dynamic range of the amino acid indices, one can select an amino acid index which yields biologically relevant signatures when applying time-frequency analysis to the numerical representation.
Notes

1. The figures in this chapter illustrating the Wigner-Ville transform were generated using the Time-Frequency Toolbox for MATLAB found at http://crttsn.univ-nantes.fr/auger/tftb.html.
2. A MATLAB implementation of the center affine filter can be found at http://www.ece.udel.edu/~bloch/downloads.
References

Arce, G. R. and Hasan, S. R. (2000). "Elimination of Interference Terms of the Discrete Wigner Distribution Using Nonlinear Filtering." IEEE Transactions on Signal Processing 48:2321–2331.
Auger, F. and Flandrin, P. (1995). "Improving the Readability of Time-frequency and Time-scale Representations by the Reassignment Method." IEEE Transactions on Signal Processing 43:1068–1089.
Clarke, N. D. (1995). "Covariation of Residues in the Homeodomain Sequence Family." Protein Science 4:2269–2278.
Cosic, I. (1994). "Macromolecular Bioactivity: Is It Resonant Interaction Between Macromolecules? - Theory and Applications." IEEE Transactions on Biomedical Engineering 41:1101–1114.
Cosic, I. (1997). The Resonant Recognition Model of Macromolecular Bioactivity. Basel: Birkhauser Verlag.
Eisenberg, D. (1984). "Three-dimensional Structure of Membrane and Surface Proteins." Annual Reviews in Biochemistry 53:595–623.
Fang, Q. and Cosic, I. (1999). "Prediction of Active Sites of Fibroblast Growth Factors Using Continuous Wavelet Transforms and the Resonant Recognition Model." In: Proceedings of The Inaugural Conference of the Victorian Chapter of the IEEE EMBS, pp. 211–214.
Fauchere, J. L., Charton, M., Kier, L. B., Verloop, A., and Pliska, V. (1988). "Amino Acid Side Chain Parameters for Correlation Studies in Biology and Pharmacology." International Journal of Peptide Protein Research 32:269–278.
GenomeNet. http://www.genome.ad.jp/dbget/aaindex.html. Kyoto University: Institute for Chemical Research.
Hunter, L. (1993). "Molecular Biology for Computer Scientists." In: Hunter, L., ed. Artificial Intelligence and Molecular Biology, pp. 1–46. Menlo Park: AAAI Press.
Kawashima, S. and Kanehisa, M. (2000). "AAindex: Amino Acid Index Database." Nucleic Acids Research 28:374–375.
Kyte, J. and Doolittle, R. F. (1982). "A Simple Method for Displaying the Hydropathic Character of a Protein." Journal of Molecular Biology 157:105–132.
Lambert, F. L. (2002). "Disorder - A Cracked Crutch for Supporting Entropy Discussions." Journal of Chemical Education 79:187–192.
Mroczkowski, B. and Hall, R. (1990). "Epidermal Growth Factor: Biology and Properties of Its Gene and Protein Precursor." In: Habenicht, A., ed. Growth Factors, Differentiation Factors and Cytokines. Berlin: Springer-Verlag.
Pennisi, E. (2000). "Genomics Comes of Age." Science 290:2220–2221.
Pierce, J. R. (1980). An Introduction to Information Theory, Symbols, Signals and Noise. New York: Dover Publications, Inc.
Qian, S. and Chen, D. (1996). Joint Time-Frequency Analysis: Methods and Applications. Upper Saddle River, NJ: Prentice-Hall PTR.
Shannon, C. E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal 27:379–423.
Tomii, K. and Kanehisa, M. (1996). "Analysis of Amino Acids and Mutation Matrices for Sequence Comparison and Structure Prediction of Proteins." Protein Engineering 9:27–36.
Veljkovic, V., Cosic, I., Dimitrijevic, B., and Lalovic, D. (1985). "Is It Possible to Analyze DNA and Protein Sequences by the Methods of Digital Signal Processing?" IEEE Transactions on Biomedical Engineering 32:337–341.
Chapter 10
SCALE-DEPENDENT STATISTICS OF THE NUMBERS OF TRANSCRIPTS AND PROTEIN SEQUENCES ENCODED IN THE GENOME*
Vladimir A. Kuznetsov
Information and Mathematical Sciences, Genome Institute of Singapore, Singapore
* In loving memory of my mother, Mariya M. Zazulina
1. Introduction
Recent achievements in genome sequencing, coupled with the ability of large-scale gene-expression technologies to simultaneously measure the numbers of copies of distinct mRNAs and the numbers of distinct proteins corresponding to thousands of genes in a given cell type, have revolutionized biological research by allowing statistical analysis of data sets related to varying structural and functional properties of biological systems. These studies have provided a great deal of information on genes, proteins, and genetic and biochemical networks and, potentially, allow the discovery of general regularities that integrate this information into organizational principles governing complex biological systems and their evolution. A basic topic of any statistical inference about complex systems is the characterization of the distribution of object frequencies for a population (for example, the distribution of allele frequencies in a population (Huang and Wier, 2001) or the distribution of homologous families by family size ratios (Friedman and Hughes, 2001)) based on statistical analysis of samples. In this chapter we will analyze the class of skew distributions that appear in samples provided by large-scale gene expression experiments and by proteome data sets. The observed distributions have the following characteristic in common: there are few frequent, and many rare, classes. Such distributions appear so frequently in natural and artificial large-scale complex systems, and in phenomena so diverse (physics, biology, economics, sociology, the internet, linguistics, etc. (Yule, 1924; Simon and Van Wormer, 1963;
Mandelbrot, 1982; Borodovsky and Gusein-Zade, 1989; Adami, 1998; Ramsden and Vohradsky, 1998; Jeong et al., 2000; Kuznetsov, 2001b; Kuznetsov, 2003b; Newman et al., 2001)), that one may be led to conjecture that all these phenomena are similar in terms of the structure of the underlying probability mechanisms. The empirical probability distributions to which we shall refer specifically are: (A) the distribution of mRNA transcripts by their frequencies of occurrence in gene-expression databases; (B) the distribution of the numbers of promoters regulated by regulatory DNA-binding proteins; (C) the distribution of identifiable protein clusters by cluster size; (D) the distribution of identifiable protein domains by their appearance values in the proteome. For such data sets, the true underlying probability distribution functions have not been previously quantitatively characterized. In particular, RNA-DNA hybridization experiments (Bishop et al., 1974) and many more recent large-scale gene-expression profiling studies of mRNA populations have demonstrated a very broad range of transcript copy numbers for genes expressed in a given cell type (0.1-30,000 copies per mammalian cell). Many thousands of expressed genes in eukaryotic cells are "rare" (less than 1 copy per cell) in the numbers of associated transcripts. However, the true distribution of gene expression levels (the numbers of mRNAs at each possible level of occurrence) in a cell has not been previously identified, due to experimental errors, undersampling, and non-reliable detection of low-abundance transcripts. The determination of expressed genes at all biologically significant levels in eukaryotic cells is a challenging biological problem (Bishop et al., 1974; Velculescu et al., 1999). A simpler and practically important goal of genome projects is to systematically identify genes. Last year, two papers announced drafts of the human genome sequence (Venter et al., 2001; IHGSC, 2001), but estimates of the number of human genes continue to fluctuate. Current estimates center around 30,000-40,000, with occasional excursions to 100,000 or more (Pennisi, 2000; Hogenesch et al., 2001). One reason for the continuing ambiguity is that many genes are neither well-defined nor easily recognizable. These estimates are based on four methods: cDNA sequencing and cloning and EST (expressed sequence tag) sequencing of mRNAs (Emmert-Buck et al., 2000; Ewing and Green, 2000); the Serial Analysis of Gene Expression (SAGE) method, using short nucleotide sequence tags that match different polyadenylated mRNAs (Velculescu et al., 1997; Velculescu et al., 1999); conserved coding exon and protein domain coding sequence identification by comparative genome analysis (IHGSC, 2001; Crollius et al., 2000); and computational gene prediction (Venter et al., 2001; IHGSC, 2001). These methods work better for large,
highly expressed, evolutionarily conserved protein-coding genes, and they predict other genes ambiguously. For example, although the large-scale gene expression technologies have become increasingly powerful and are widely used, frequent errors in the identification of hundreds and thousands of rarer transcripts create considerable uncertainty in fundamental questions such as the total number of genes expressed in an organism and the biological significance of rare transcripts. Many of the low-abundance transcripts may be essential for determining normal and pathological cell phenotypes (Guptasarma, 1995; Ohlsson et al., 2001). However, thousands of rarely expressed genes have not been characterized, and a rationale for the extreme number of low-abundance transcripts has remained unresolved. Thus, an important issue for gene identification in the postgenome era is determining the true statistical distributions of the number of genes expressed at all possible expression levels, both in a single cell and in a population of cells. To better understand the general statistical principles governing biological complexity, we also need to analyze the distributions of regulatory proteins in protein-DNA networks, of proteins in protein clusters, and of protein domains in proteomes. In this chapter we will analyze the statistical distributions arising in large-scale gene-expression and protein data sets. This analysis allows us to identify the true underlying probability functions and to estimate the number of evolutionarily conserved protein-coding genes in the human and mouse genomes, and the number of protein-coding genes expressed in a human cell. We will also show how these distributions help in understanding the general principles governing the functional adaptation of cells to environmental stress factors and the growth of biological complexity during evolution.
2. Distributions of the Gene Expression Levels

2.1 Empirical Distributions

Gene expression within a cell is a complex process involving chromatin remodeling, transcription, and export of RNA from the nucleus to the cytoplasm, where mRNA molecules are translated into proteins. Each of these steps is carried out by a highly specialized, elaborate machinery, typically consisting of tens to hundreds of components. Currently, large-scale gene-expression profiling methods, e.g., SAGE (Velculescu et al., 1997; Velculescu et al., 1999) and cDNA or oligonucleotide microarrays (Holstege et al., 1998; Jelinsky and Samson, 1999), involve making cDNA copies of the less stable mRNA molecules and then using short nucleotide sequence tags that match different mRNAs to quantify their
relative abundance in the cell sample. These methods can measure mostly highly and moderately abundant transcripts from large numbers of cells (i.e., not a single cell). However, thousands of genes expressed at very low copy number (≤1 copy per cell) cannot be reliably and unambiguously detected (Kuznetsov, 2005). The complete gene expression profile for a given cell type is the list of all expressed genes, together with each gene's expression level, defined as the average number of cytoplasmic mRNA transcripts per cell. For each gene-expression data set, we define the library as a list of sequenced tags that match mRNAs associated with genes, together with the number of occurrences of each specific tag. The statistics of expressed genes in such libraries can be characterized by gene expression level probability functions (GELPF), which for each possible gene expression level give the probability of that value occurring for a randomly chosen gene. The empirical GELPF specifies the proportions of distinct tags which have 1, 2, etc. transcripts present in an associated mRNA sample (i.e., a normalized histogram of gene expression levels). Analysis of such empirical histograms can lead to models of the underlying GELPF in a cell and in a population of cells. Let M be the size of the library, i.e., the total number of tags in it, and let $n_m$ be the observed number of distinct tags that have the expression level m (occurring m times) in a given library of size M. The discrete value $n_m$ roughly reflects the number of expressed genes with expression level m in the cell sample, up to experimental errors, non-unique tag-gene matching, and incorrect annotation of genes (see below). Let J denote the observed expression level for the most abundant tag in the library. Then $\sum_{m=1}^{J} n_m = N$ is the number of distinct tags in the library. The points (m, g(m)) for m = 1, ..., J, where $g(m) = n_m/N$, form the histogram corresponding to the empirical GELPF. In late 1998, analyzing more than 50 human cDNA and SAGE libraries presented in the Cancer Genome Anatomy Project (CGAP) database (http://www.ncbi.nlm.nih.gov/CGAP; http://www.ncbi.nlm.nih.gov/SAGE), we observed that the empirical relative frequency distributions of expressed genes constructed for all analyzed libraries exhibited remarkably similar monotonically-skewed shapes, with a greater abundance of rare transcripts and more gaps among the higher-occurrence expression levels (Kuznetsov and Bonner, 1999; Emmert-Buck et al., 2000). Moreover, histograms for yeast SAGE libraries, as well as for mouse and human SAGE or cDNA libraries, also exhibit remarkably similar behavior (Figure 10.1, Table 10.1). Several classes of distribution functions (Poisson, exponential, logarithmic series, and power-law Pareto-like (Johnson et al., 1992)) were fit to the histograms.
[Figure 10.1. Fitting of the empirical frequency distributions of the gene expression levels. Log-log plots. (a) ◦: G2/M-arrested yeast cell library with 20,096 SAGE tags; solid step-function line: GDP model; ✷: three pooled yeast cell libraries with 59,494 SAGE tags (see Table 10.1); dotted step-function line: GDP model for ✷-data. Insert plot. ◦: empirical cumulative fraction function $R_e(m) = (\sum_{j=m}^{J} j \cdot n(j, M))/(NM)$, where n(j, M) is the observed number of distinct tags represented exactly by j copies in a library of size M and N is the total number of distinct tags in the library (main plot); solid line: the corresponding theoretical model $R(m) = (\sum_{j=m}^{J} j \cdot f(j))/(\sum_{j=1}^{J} j \cdot f(j))$ computed for the ◦-histogram from the GDP model f(j) (main plot). (Cumulative data reduces the apparent "noise" in the histogram data.) (b) ✷: human normal brain cell library 154 with 81,516 SAGE tags; dotted step-function line: best-fit GDP model for ✷-data; ◦: 8,936 SAGE tags sampled without replacement from library 154 (sub-library 154.1; Table 10.1); solid step-function line: best-fit GDP model for ◦-data. (c) Correlation plot of gene expression values in normal human mammary epithelial cells (library 6360) and human breast invasive ductal carcinoma cells (library 17716). (d) ◦, △: empirical histograms for the normal and carcinoma cells described in plot (c), respectively; solid line links fitted GDP model data for ◦-data.]
Table 10.1. Characterization of empirical frequency distributions of gene expression levels for human (H.s.), mouse (Mo.), and yeast SAGE libraries. URLs of libraries for human and mouse tissues (http://www.ncbi.nlm.nih.gov/UniLib; http://www.ncbi.nlm.nih.gov/SAGE) and for yeast cells in G2/M, S, and log phases of cell growth (http://www.sagenet.org).

Sample        M      N      p1    J     J/M    k ± SE       b ± SE
H.s. 154.1    8936   4590   0.76  181   0.030  1.36±0.023   −0.12±0.013
H.s. 17716    39473  14978  0.73  655   0.017  1.09±0.001   −0.26±0.001
H.s. 6360     49167  18511  0.72  540   0.011  1.10±0.001   −0.23±0.001
H.s. 154      81516  19137  0.53  1598  0.020  1.25±0.012    0.57±0.016
Mo. 19018     43274  17754  0.72  630   0.015  1.19±0.001   −0.17±0.001
Mo. 20427     64240  24796  0.73  425   0.007  1.14±0.001   −0.20±0.001
Yeast, G2/M   19527  5303   0.67  519   0.027  0.96±0.006   −0.20±0.006
Yeast, S      19871  5785   0.67  561   0.028  0.98±0.004   −0.20±0.004
Yeast, log    20096  5324   0.66  636   0.032  0.97±0.004   −0.17±0.004
Yeast, total  59494  11329  0.62  1716  0.029  0.94±0.008   −0.11±0.008
The model that best fit the empirical GELPFs was the Pareto-like function (Kuznetsov, 2001b):

$$f(m) = z^{-1}/(m + b)^{k+1}, \tag{10.1}$$
where f(m) is the probability that a distinct tag (representing a gene), randomly chosen from the N distinct tags, occurs m times in the library. The function f(m) involves two unknown parameters, k and b, where k > 0 and b > −1; the normalization factor z is the generalized Riemann zeta-function value $z = \sum_{j=1}^{J} (j + b)^{-(k+1)}$, where J is the observed expression level for the most abundant tag in the library; J increases with library size M, and for large M, J ≈ aM, where 0.032 ≥ a ≥ 0.007. We will call Eq. 10.1 the Generalized Discrete Pareto (GDP) model. The parameter k characterizes the skewness of the probability function; the parameter b characterizes the deviation of the GDP distribution from a simple power law (the GDP with b = 0; see, for example, the dotted line in Figure 10.1a). The insert plot in Figure 10.1a illustrates that the fitted GDP model predicts well the empirical cumulative fraction function $R_e(m)$, which demonstrates that our model fits well over the entire range of experimental values.
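As one illustration of how Eq. 10.1 can be fit, the following Python sketch estimates (k, b) by maximizing the likelihood of an observed histogram with SciPy. The optimizer, the starting point, and truncating z at the largest observed level J are our assumptions; the authors' actual fitting procedure is not described here.

```python
import numpy as np
from scipy.optimize import minimize

def gdp_neg_loglik(params, m, n_m):
    """Negative log-likelihood of the GDP model f(m) = z^{-1} (m + b)^{-(k+1)}
    (Eq. 10.1). m: expression levels; n_m: number of distinct tags per level."""
    k, b = params
    if k <= 0 or b <= -1:
        return np.inf
    J = int(m.max())
    z = np.sum((np.arange(1, J + 1) + b) ** (-(k + 1)))  # truncated zeta factor
    log_f = -(k + 1) * np.log(m + b) - np.log(z)
    return -np.sum(n_m * log_f)

def fit_gdp(m, n_m):
    m, n_m = np.asarray(m, float), np.asarray(n_m, float)
    result = minimize(gdp_neg_loglik, x0=np.array([1.0, 0.0]),
                      args=(m, n_m), method='Nelder-Mead')
    return result.x  # estimated (k, b)
```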
2.2 Effect of Sample Size on the Distribution Shape

Similarly-sized libraries made using the same method from many different human tissues and cell lines have similar numbers of distinct gene tags and are characterized by empirical GELPFs with nearly equivalent parameters
in their best-fit GDP models (Table 10.1). For example, the correlation plot in Figure 10.1c shows many differences between the expression levels of the same distinct SAGE tags (representing mRNAs) observed in two libraries derived from normal human mammary epithelial cells and from breast invasive carcinoma tissue (library sizes 49,167 and 39,473 SAGE tags, respectively). The Pearson correlation coefficient, r, between the gene expression values for these two samples is 0.45, a moderate level of similarity of gene-expression profiles. However, the empirical GELPFs and the best-fit parameters of the GDP model for these libraries are very similar (Figure 10.1d; Table 10.1). As the size of any library increases, the shape of the empirical GELPF changes systematically: (1) p1, the fraction of distinct tags represented by only one copy, becomes smaller; (2) J increases proportionally to M; (3) the parameter b becomes bigger; and (4) the parameter k slowly decreases (Figure 10.1b; Table 10.1). We observed these properties in the SAGE libraries obtained from all studied types of cells and cell tissues, in particular purified and bulk cell tissue samples (Zucchi et al., 2004), and in the libraries for different eukaryotic organisms (Kuznetsov, 2005). For instance, although the yeast genome is less complex, yeast libraries behave similarly (Table 10.1; Figure 10.2a). We also found that for yeast and human libraries, all values of the scaling parameter a (a = J/M) fall within narrow ranges (Table 10.1). These and our recent observations suggest that all studied cell types have a general underlying probability distribution whose skewed form is dependent on the size of the sample. We called such distributions the "large number of rare gene expression event law" (Kuznetsov and Bonner, 1999). Importantly, in self-similar (fractal) biological and physical systems described by a simple power law, the parameters are independent of the size of the system (Adami, 1998; Stanley et al., 1999; Jeong et al., 2000; Rzhetsky and Gomez, 2001), but this is not the case here. For example, Table 10.1 shows that the parameter b in the GDP model becomes larger as library size increases (r = 0.9 for 18 SAGE libraries; data not presented). Power law models, including Eq. 10.1, also predict an unlimited increase in the number of species as the sample size approaches infinity (Appendix A), whereas the number of species (expressed genes) is a finite number. Thus, we must take into account the sample-size effect. We developed a new statistical distribution model, called the Binomial Differential (BD) distribution (Kuznetsov, 2001a; Kuznetsov, 2001b), which assumes explicitly that its parameters are size-independent and that the number of expressed genes is finite (see below). This model also assumes that each expressed gene has a finite probability of being observed in any given library obtained from a cell population, and that each gene is statistically independently
expressed. Although transcription events of specific genes may be correlated in a given cell, most transcription events within a cell population are expected to be statistically independent events. These assumptions are consistent with observations (Chelly et al., 1989; Ko, 1992; Ross et al., 1994; Newlands et al., 1998; Kuznetsov et al., 2002a). In the next section we briefly describe our model.
3. Probability Distribution and an Estimator of the Total Number of Expressed Genes
We will consider a simple probabilistic model of mRNA sampling (mRNAs being represented by SAGE tags via the cloning and sequencing process) from a transcriptome, assuming explicitly that the parameters of the model are size-independent and that the number of expressed genes is finite. Let us assume that Nt genes 1, 2, ..., Nt are expressed with Mt associated transcripts in total in the cells of a large cell population. Also assume that these genes are expressed independently with respective probabilities q_1, q_2, ..., q_{N_t}, where Pr(a random transcript corresponds to gene i) = q_i. Let the random variable s_i denote the number of transcripts of gene i in a random library of size M. Note that \sum_{i=1}^{N_t} s_i = M. When M ≪ Mt, sampling with replacement is an acceptable model of library construction. This follows a multinomial distribution (Johnson et al., 1992). The joint probability of observing s_1 = y_1 mRNA transcripts of gene 1, s_2 = y_2 mRNA transcripts of gene 2, ..., s_{N_t} = y_{N_t} mRNA transcripts of gene Nt in a given library of size M is defined by the probability function f(y_1, ..., y_{N_t}; M) := Pr[s_1 = y_1, ..., s_{N_t} = y_{N_t}], where

f(y_1, \ldots, y_{N_t}; M) := \frac{M!}{\prod_{j=1}^{N_t} y_j!} \prod_{j=1}^{N_t} q_j^{y_j}. \qquad (10.2)
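A small simulation can make this sampling model concrete. The sketch below draws one library of size M from Eq. 10.2, using hypothetical gene probabilities q_i (a skewed Pareto-shaped profile chosen only for illustration, not values estimated from data), and tabulates the resulting empirical GELPF:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transcriptome: Nt genes with skewed probabilities q_i.
Nt = 5_000
q = rng.pareto(1.5, size=Nt) + 1.0
q /= q.sum()

M = 50_000                         # library size (number of sampled tags)
s = rng.multinomial(M, q)          # counts s_1, ..., s_Nt; their sum equals M

# Empirical GELPF: fraction of detected genes observed m = 1, 2, ... times.
detected = s[s > 0]
m_vals, n_m = np.unique(detected, return_counts=True)
print(detected.size, "distinct genes; p1 =", n_m[0] / detected.size)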
The function f has the unknown parameters q_1, q_2, ..., q_{N_t} and N_t, together with the constraints \sum_{i=1}^{N_t} q_i = 1 and \sum_{j=1}^{N_t} y_j = M. Let p_m denote the probability that a randomly chosen distinct gene is represented by m associated transcripts in the library, for m = 1, 2, .... Based on the joint probability function f, we estimated the number of genes at each expression level m and, taking this over all sampled transcripts when M is large enough, obtained the probability function of gene expression levels, p_m, in terms of the number of transcripts M and the estimated number of genes N(M) in a sample of size M, as follows (Kuznetsov, 2001a; Kuznetsov, 2001b; Kuznetsov, 2005):

p_m \approx h(m) := (-1)^{m+1} \, \frac{M!}{m!\,(M-m)!} \, \frac{d^m N}{dM^m} \, \frac{1}{N}, \qquad (10.3)
Figure 10.2. Analysis of the GELPF for the yeast transcriptome. (a) Log-log plot of the fraction of genes/ORFs versus expression level, m (molecules/cell). Dotted step-function line: best-fit GDP model for the original G2/M phase-arrested yeast cell library (see Figure 10.1a). Solid step-function line: the fraction of genes/ORFs estimated by the BD model for a single yeast cell; •: data simulated from the fitted GDP model for N = 3,009 ORFs; ◦: relative frequency of 3,009 genes/ORFs in a single log-phase yeast cell, estimated from GeneChip data (www.hsph.harvard/geneexpression) (Jelinsky et al., 2000). Dashed line: the fitted GDP model with k = 0.86 ± 0.01, b = 0.37 ± 0.003 for the ◦-data. The histogram for the ◦-data was constructed as follows: for each ORF/gene, the scaled hybridization intensity signal value, I, in the yeast GeneChip database (Jelinsky et al., 2000) was converted to the number of mRNA molecules per single yeast cell using the empirical formula m = (I − 20)/165. (b) Growth curves for the SAGE library: the number of distinct tags or ORFs versus library size, M (tags). ◦: number of "true" distinct tags of sub-libraries from the pooled yeast library of 47,393 "true" tags; dashed line: LG model with d = 20,000 ± 1,946, c = 0.356 ± 0.02 for the ◦-data; •: number of genes/ORFs observed in these sub-libraries; LG model with d = 6,575 ± 185, c = 0.579 ± 0.01 for the •-data; short-dashed line: the number of redundant "true" tags.
where m = 1, 2, ... and N(M) is a differentiable function of M. We will call the function h(m) the binomial differential (BD) function. Taking m = 1 in Eq. 10.3, we obtain the ordinary differential equation

\frac{dN}{dM} = p_1 \frac{N}{M}, \qquad (10.4)

with N(1) = 1. We call the function N(M) defined by Eq. 10.4 the population "logarithmic growth" (LG) model. Here p_1 is a decreasing function of M (Kuznetsov, 2001b). We have used the empirical approximation

p_1 = \frac{1 + (1/d)^c}{1 + (M/d)^c}, \qquad (10.5)

where c and d are positive constants. Using an explicit specification of p_1 allows us to fit the BD and LG models to empirical histograms. With p_1 defined by Eq. 10.5, Eq. 10.4 has the exact solution

N(M) = \left( \frac{M^c \,(1 + 1/d^c)}{1 + (M/d)^c} \right)^{(1 + 1/d^c)/c}, \qquad (10.6)

with

\lim_{M \to \infty} N(M) = N_t = (1 + d^c)^{(1 + 1/d^c)/c}. \qquad (10.7)
Nt is "the gene richness" estimator of the number of expressed genes in a large population of cells. Using Eq. 10.3 with the fitted parameters d and c provides a means of computing p_1, p_2, ... at a given library size M. Note that, unlike the fixed GDP models, the BD probability function depends on the number of distinct genes, N, and the library size, M; it also yields the finite value Nt for the total number of genes as M → ∞. The probability function p_m has a skewed form and is approximated by the power law form (p_m ∼ m^{-2}; the Lotka-Zipf law, http://linkage.rockefeller.edu/wli/zipf), which has been used to describe many other large-scale, complex phenomena such as income, word occurrence in text, numbers of citations to a journal article, biological genera by number of species, etc. (Yule, 1924; Simon and Van Wormer, 1963; Mandelbrot, 1982; Kuznetsov, 2005).
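The LG model is easy to evaluate numerically. The sketch below implements Eqs. 10.5–10.7 directly (the parameter values are the yeast genes/ORFs estimates quoted in the next section, so the printed numbers should come out near the Ncell and Nt values reported there; rounding is ours):

import numpy as np

def p1(M, c, d):
    """Eq. 10.5: fraction of distinct tags observed exactly once at size M."""
    return (1.0 + (1.0 / d) ** c) / (1.0 + (M / d) ** c)

def N_LG(M, c, d):
    """Eq. 10.6: number of distinct expressed genes at library size M."""
    e = (1.0 + (1.0 / d) ** c) / c
    return (M ** c * (1.0 + (1.0 / d) ** c) / (1.0 + (M / d) ** c)) ** e

def N_total(c, d):
    """Eq. 10.7: the "gene richness" limit of N(M) as M -> infinity."""
    return (1.0 + d ** c) ** ((1.0 + (1.0 / d) ** c) / c)

c, d = 0.579, 6_580.0              # yeast genes/ORFs fit (see Section 4.1)
print(round(N_LG(15_000, c, d)))   # ~3,000 genes at Mcell = 15,000 tags
print(round(N_total(c, d)))        # ~7,000 genes in a large cell population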
4. Determination of the Number of Expressed Genes and GELPF in a Single Cell
4.1 The Number of Expressed Genes and GELPF in a Single Yeast Cell
Without removing experimental errors from SAGE libraries, one cannot obtain an accurate estimate of the total number of expressed genes Nt
and the GELPF. We selected only SAGE tags whose location on the organism's chromosome map coincided with protein-coding gene or open reading frame (ORF) regions. Our gene richness estimator (Eq. 10.7) could then estimate Nt from this data set and reconstruct the true underlying GELPF, even when the SAGE library matches only a fraction of all genes. Figure 10.2 illustrates our approach using the Tag Location database (http://genome-www.stanford.edu/Saccharomyces), which contains ∼8,500 distinct SAGE tags that match ∼4,700 of the ∼6,200 yeast genes or open reading frames (ORFs). An ORF is a DNA sequence which is (potentially) translatable into protein. First, since almost all yeast protein-coding genes/ORFs and their locations on chromosomes are known, we obtained a fraction of the "true" distinct tags (and their expression levels) in a yeast SAGE library by eliminating "sequencing error tags" (Velculescu et al., 1997) that fail to match the yeast genome, and erroneous tags that fail to match known 3' NlaIII gene/ORF regions and adjacent 3'-end regions present in the chromosome Tag Location database. Second, by randomly sampling "true" tags, we constructed population growth curves for the numbers of "true" distinct tags and for the corresponding numbers of genes/ORFs tabulated in the Tag Location database (Figure 10.2b). The LG model (Eq. 10.6) fits the size-dependent data both for the "true" tags and for the genes/ORFs. In the case of "true" distinct tags (◦, Figure 10.2b), our estimator (Eq. 10.7) predicts a very large value: 25,103 ± 2,000 distinct "true" tags in a large yeast cell population. For genes/ORFs (•, Figure 10.2b), a reasonable estimate (see, for example, Cantor and Smith, 1999; Johnson, 2000) of the total number of expressed genes, Nt = 7,025 ± 200, was obtained. Because a pooled library was used, we can conclude that all or almost all yeast genes are expressed in a growing normal yeast cell population, i.e., Nt ≈ G, where G is the total number of genes in the entire genome. At any given time, a cell generates mRNA transcripts from only a subset, Ncell, of the genes in the genome. Using the estimated parameters c = 0.579 and d = 6,580 in the LG function (Eq. 10.6) and an estimate Mcell = 15,000 of the number of mRNAs per yeast cell (Velculescu et al., 1997), Eq. 10.6 predicts Ncell = 3,009 expressed genes/ORFs per cell. This estimate is consistent with our estimate for a single yeast cell in the G2/M phase-arrested state (2,936 genes/ORFs; Kuznetsov, 2001b) and is ∼10% smaller than a published estimate of the number of expressed genes/ORFs in a log-phase yeast cell (3,298 genes/ORFs; Velculescu et al., 1997). Note that the latter estimate has a positive bias because it was obtained without eliminating SAGE tags that do not correspond to protein-coding regions of DNA and without correcting the estimate for SAGE tag redundancy. The GELPF for a single yeast cell (Figure 10.2a) was estimated for corrected data with both the BD (Eq. 10.3) and GDP (Eq. 10.1) models
(Kuznetsov, 2001b). To validate a GELPF model based on SAGE data, we also used independent data obtained by Affymetrix GeneChip technology (Jelinsky and Samson, 1999), which detects more rarely-expressed yeast genes than the SAGE method. Figure 10.2a shows a histogram constructed from the normalized hybridization signals converted to gene expression values for the ∼3,000 most highly expressed genes/ORFs, representing ∼16,000 transcripts in a yeast log-phase cell. Figure 10.2a shows that the frequency distribution for the GeneChip microarray data follows the GDP model and is consistent with the GELPF for corrected SAGE data. Similarly skewed frequency distributions were also observed in 30 other microarray samples (untreated and treated) from yeast cells (Holstege et al., 1998; Jelinsky and Samson, 1999; Jelinsky et al., 2000).
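Growth curves such as those in Figure 10.2b can be emulated by Monte Carlo rarefaction. The sketch below uses a hypothetical pooled library with geometrically distributed copy numbers in place of the actual SAGE data:

import numpy as np

rng = np.random.default_rng(1)

def growth_curve(copy_numbers, sizes, rng):
    """Number of distinct tags in random sub-libraries drawn without
    replacement from a pooled library (one Monte Carlo draw per size)."""
    pool = np.repeat(np.arange(copy_numbers.size), copy_numbers)
    return [np.unique(rng.choice(pool, size=s, replace=False)).size
            for s in sizes]

copies = rng.geometric(0.3, size=8_000)   # hypothetical per-tag copy numbers
print(growth_curve(copies, [1_000, 5_000, 10_000, int(copies.sum())], rng))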
4.2 Estimate of the Number of Expressed Genes and the GELPF in a Human Cell
After validation with the yeast database, we can apply the BD model to a large human cell SAGE transcriptome (Velculescu et al., 1999) in order to predict the GELPF for a single human cell and for large populations of the same cell type. Figure 10.3a shows that the GDP model (step-function 1) fits well an empirical histogram of distinct tags for a pooled library of colon cancer cells (library size 591,717 tags) (Velculescu et al., 1999). To obtain the "true" distinct tags for the human colon cancer transcriptome data set, we first selected all 69,381 distinct tags and then discarded the tags that did not match UniGenes associated with "known" genes in the UniGene database (Lash et al., 2000). We also discarded those tags that were classified as "sequence errors" in the yeast database. We obtained 19,777 "true" tags forming a new library with 411,361 tags. The number of mRNAs in a typical human cell, Mcell, is ∼300,000 copies per cell (Velculescu et al., 1999). Taking 300,000 tags randomly from the 411,361 tags, we obtained 17,690 "true" distinct tags, which represent the content of a single colon cancer cell (Figure 10.3a, GDP model: step-function line 2). Figure 10.3b shows that the Monte Carlo-generated growth curve for the "true" tags data is fitted by the LG model. However, this fit yields the value Nt = 72,884 as M → ∞ (with c = 0.319 ± 0.010; d = 46,232 ± 200). This estimate is about twice as large as the recent estimates (∼26,000–35,000 genes) provided by the Human Genome Sequencing Projects for the entire human genome (Venter et al., 2001; IHGSC, 2001). This implies that the total number of "true" tags that match the same gene grows dramatically as the SAGE library size becomes bigger. Let Ng(M) denote the true number of expressed genes in a library of size M. If N(M) is the total number of "true" distinct tags in the same
Figure 10.3. Analysis of the GELPF for human and yeast transcriptomes. (a) Log-log plot of the number of distinct tags or genes versus the number of mRNA copies per cell. ⋄: the numbers of 69,381 distinct tags for the human colon cancer transcriptome SAGE library of size 591,717 tags; dotted step-function line 1: best-fit GDP function with k = 0.98 ± 0.005, b = 0.405 ± 0.007 for the ⋄-data. Solid step-function lines in curve-sets 2 and 3: the numbers of genes represented by 1, 2, ..., 6 transcripts per cell, as predicted by the BD model for a single human cell and a single yeast cell, respectively; +: GELPF simulated from the inverse best-fit GDP function for a single yeast cell; ◦: GELPF simulated from the inverse best-fit GDP function for a single human cell; dashed lines in curve-sets 3 and 2: the best-fit GDP function to these data points for the yeast (+) and the human single cell (◦; with k = 0.75 ± 0.089, b = 1.07 ± 0.25), respectively. Histograms 3 and 2 were generated by sampling from the inverse fitted GDP function 3,009 times (3,009 genes) for a single yeast cell and 10,337 times (10,337 genes) for a single human cell, respectively. Inset plot (cumulative fractions Re and R versus 1/m): dotted lines 2 and 1: the cumulative fraction plots computed from the ◦- and +-histograms, respectively; solid lines 2 and 1: the cumulative fraction functions, R, from the best-fit GDP functions (dashed lines 3 and 2 on the main plot). (b) Growth curves for tag subsets of the human SAGE library: the number of distinct tags or genes versus library size, M (tags). ◦: the number of "true" distinct tags in random sub-libraries taken from the human colon cancer cell line transcriptome library of 411,391 "true" tags; curve 1 (dashed line): best-fit LG model (with d = 46,232 ± 200, c = 0.319 ± 0.010) for the ◦-data; curve 2 (dotted line): best fit of Ng + Nr (see Appendix B) to the ◦-data; curve 3 (solid line): the number of different genes, Ng; curve 4: the number of redundant distinct tags, Nr; curve 5 (short-dashed line): the number of redundant distinct tags for the yeast data (see also the short-dashed line in Figure 10.2b).
library, then Nr = N − Ng is the number of distinct tags that redundantly represent genes. Introducing the function Nr(M) allows us to deal with the tags-to-gene and tag-to-genes multiple-matching problems (Velculescu et al., 1999; Lash et al., 2000; Caron et al., 2001). Figure 10.3b shows the number of expressed genes Ng (curve 3), the number of redundant tags Nr (curve 4), and the sum Ng + Nr (curve 2) with increasing sample size. Figure 10.3b also shows that the number of redundant tags occurring in a human library (curve 4) is larger than the number of redundant tags extrapolated for a same-size yeast library (curve 5) (see also Figure 10.2b). The latter difference is expected due to the higher complexity of the human genome. The best-fit estimate of Ng (curve 3, Figure 10.3b) predicts Ncell = 10,336 expressed genes for a single cell (at Mcell = 300,000 tags); Nt = 31,238 expressed genes are predicted by the LG model for a large colon cancer cell population (M → ∞). Note that our procedure for selecting "true" tags tends to be conservative, because some of the UniGene clusters may themselves be erroneous, and the UniGene database, although large, is not a complete catalogue of all human transcripts, i.e., it does not cover the entire human genome (Lash et al., 2000). Therefore, some of the SAGE tags which we discarded may still represent real transcripts, as has been shown many times (for example, Croix et al., 2000).
Comparison of GELPFs for yeast and human single cells. Figure 10.3a shows the probability function of the number of expressed genes for a single human colon cancer cell (step-function 2) and for a single yeast cell (step-function 3), estimated by the BD model. These functions were computed as described in Appendix B. To show the variability of the frequencies of different genes at the different expression levels, we also sampled expression levels from the fitted GDP model Ncell times (Ncell = 10,336) (◦, Figure 10.3a), just as we did for the yeast data. The shapes of the yeast and human GELPFs are quite similar (Figure 10.3a). However, 38% and 31% of active genes are represented by a single mRNA copy per cell for yeast and human cells, respectively. The GDP model parameter values are significantly different (k = 1.51 ± 0.002 and b = 2.07 ± 0.009 for a single yeast cell, and k = 0.747 ± 0.089 and b = 1.074 ± 0.250 for a single human cell, respectively), as are the cumulative fraction of mRNAs plots (inset plot in Figure 10.3a). On average, there are five mRNAs per gene in a single yeast cell and 30 mRNAs per gene in a single human cell.
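The simulated single-cell GELPFs just described (the ◦ and + points in Figure 10.3a) amount to drawing Ncell expression levels from the fitted discrete GDP model. A minimal inverse-sampling sketch, using the single-human-cell parameters quoted in the text and an assumed support cutoff J:

import numpy as np

rng = np.random.default_rng(2)

def sample_gdp(n, k, b, J, rng):
    """Draw n expression levels from the discrete GDP model, truncated at an
    assumed maximum level J, by sampling from its normalized pmf."""
    m = np.arange(1, J + 1)
    pmf = (m + b) ** (-(k + 1.0))
    pmf /= pmf.sum()
    return rng.choice(m, size=n, p=pmf)

levels = sample_gdp(10_336, k=0.747, b=1.074, J=1_000, rng=rng)
print((levels == 1).mean())   # fraction of genes with one mRNA copy (~0.3)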
5. Global Transcription Response to Damaging Factors
The GELPF can be used to characterize global changes in transcriptional profiles that take place when cells are exposed to reactive chemical
agents, such that virtually every molecule in a cell is at risk of being altered in some way. We have inspected data at http://www.hsph.harvard.edu/geneexpression, which represent 30 mRNA profiles of ∼6,200 yeast genes/ORFs sampled from normal cells at different phases of cell growth and from cells treated with clinically relevant chemical and physical damaging factors. These data sets were obtained with GeneChip microarray technology. Despite the differences among the chemical factors and the differences in transcription profiles of cells at different phases of cell growth, the responses of cells to diverse exposures look very uniform. We found that the global response of the transcription machinery in yeast cells is a nonlinear function of the damage level (Figure 10.4a). Interestingly, the relationship between the total number of expressed genes in treated cells and the level of cell damage is opposite to the mRNA response (Figure 10.4b). Figure 10.4c shows the typical changes of the empirical frequency distribution and of the shape of the fitted GDP model in the case of the response of S. cerevisiae genes upon exposure to a mildly toxic dose of an alkylating agent, methyl methanesulfonate (MMS). The inset plot in Figure 10.4c shows that the number of low-expressed genes per treated cell is increased, but the expression levels of highly-abundant transcripts are decreased. Note that in all our analyses only hybridization signals corresponding to ≥ 0.5 mRNA molecules per cell were considered. The correlation plot in Figure 10.4d shows that gene-expression profiles for pairs of samples obtained from normal cells are highly reproducible (r12 = 0.91). Figure 10.4d also shows that the damaging factor mostly amplifies the transcription of many hundreds of rarely-transcribed genes and reduces the abundance of transcripts of highly- and moderately-active genes. At 60 min after initiation of the treatment, the total number of expressed genes per cell had increased by ∼600 genes (20%); however, the total number of mRNAs had decreased by ∼2,000 molecules (12%). We obtained similar relationships in normal and MMS-treated cells sampled in the G1, S and G2 phases of cell growth, and in cells sampled in the log-phase of cell growth and then treated with several distinct damaging agents (data not presented). Jelinsky et al. (1999; 2000) showed that MMS exposure represses groups of genes involved in RNA and nucleotide synthesis and in the synthesis and assembly of ribosomal proteins and, simultaneously, activates protein-degradation processes. Non-linear amplification and/or induction of the expression of hundreds of very low-transcribed genes at any phase of cell growth, in a quantitatively predictable manner, is a big surprise. How does the cell regulate the massive production of these rare mRNAs in such low, but reasonably well-determined, numbers per cell? We could assume that in a normal yeast cell only a limited amount of specific transcription factors and other less specific regulatory elements of the transcription machinery exists. Most of these
Figure 10.4. Global transcriptional response of yeast cells to the damaging agent MMS. (a, b) Relationships between the number of mRNA transcripts per cell (a), the number of expressed genes per cell (b), and the percent survival (as determined by a colony-forming activity test). The percent survival for several data points was not measured directly, but was estimated from the dose-dependence or time-dependence relationships for the same damaging factor. ◦, •: untreated and treated cells, respectively. (c) Log-phase yeast cells were exposed to 0.1% MMS for one hour, resulting in 90% survival (as determined by colony-forming ability). Log-log plot. ◦: normal cell sample (#zero2) with 2,970 genes/ORFs representing 16,762 mRNAs/cell; dotted step-function line: best-fit GDP model at k = 0.86 ± 0.001 and b = 0.37 ± 0.003 for the ◦-data; •: 3,548 genes/ORFs representing 14,733 mRNAs/cell; solid step-function line: best-fit GDP model at k = 1.26 ± 0.003 and b = 1.0 ± 0.009 for the •-data. Inset plot: log-linear plots of the left side of the distributions from the main plot, at m < 15. (d) ◦, •: correlation plots of the expression-value pairs (m1, m2) in normal cell samples (#zero1, #zero3; r12^2 = 0.91) and of (m1, m3) in a pair of normal and treated cell samples (#zero2, #MMS; r13^2 = 0.65), respectively.
regulator molecules could be preoccupied for much of the time with the transcription of genes coding for the highly-abundant and moderately-abundant proteins (i.e., ribosomal genes). Stress factors or damaging agents could dramatically reduce de novo transcription and de novo protein synthesis (Jelinsky and Samson, 1999; Jelinsky et al., 2000), lead to a temporary increase of free transcription factors in the cell nucleus, and increase the possibility of interaction of transcription factors with the promoter regions of many normally rarely-expressed genes. Interestingly, the observed global gene response to different stress conditions in yeast cell populations demonstrates the properties of finite-size self-organized critical (SOC) systems (a diverse class of random cellular
automata models) (Kauffman, 1993; Adami, 1998). Such systems, which are based on simple logical rules, assume that a finite number of different components (i.e., genes) interact locally in the system, where noise can propagate through the entire system and, under some conditions, lead to catastrophically large events (Adami, 1998). Power law-like distributions associated with SOC systems demonstrate size-dependent effects, robustness to perturbations, and changes in system composition and network re-organization in response to a random environment.
6. Stochastic and Descriptive Models of Gene Expression Process
Analysis of the human transcriptome has allowed us to estimate the cumulative numbers of expressed protein-coding genes as well as of erroneous and redundant sequences. After eliminating the erroneous tags and redundant tags, we estimated that there are ∼10,000 protein-coding genes expressed in any random single human colon cancer cell. This estimate is ∼5.5 times smaller than the previous estimate of 56,000 expressed genes per cell made for the same human SAGE transcriptome (Velculescu et al., 1999). Thus, an effective process for removing both erroneous tags and redundant multiple-matching tags from SAGE libraries is critical for obtaining an accurate profile of expressed genes and for more accurately estimating the number of genes in human cells and in other organisms. After removing the errors and redundancy, the GELPF for yeast shows that ∼1,200 genes/ORFs are represented on average by a single mRNA molecule per cell (Figure 10.2a). The population growth curve for genes/ORFs (Figure 10.2b) indicates that ∼3,800 additional genes/ORFs (∼55% of all yeast genes) are expressed at less than one transcript per cell. These estimates are consistent with our estimates based on data found in the large-scale yeast oligonucleotide microarray databases (Figure 10.4c). A similar statistical analysis for human transcriptome data indicates that ∼70% of all protein-coding human genes are expressed with less than 1 transcript per cell (Figure 10.3a). These results strongly support our assumption that a major fraction of expressed genes in a eukaryotic cell are transcribed sporadically and randomly. This point is supported by many observations. Activating gene expression is a complex, multi-step probabilistic process that must ultimately load the transcription machinery onto a gene. This can only occur when appropriate chromatin states allow specific transcription factors to access the relevant sequences within a promoter. Once bound, these factors and associated transcriptional cofactors form a complex that positions RNA polymerase so transcription can begin. The transcription of different genes
in different species is a fast process, with a rate of ∼1–2 kb/min (Femino et al., 2000; Jackson, 2000). During proliferation, a mammalian cell can produce ∼10^7 copies of a specific protein. An example of this is the transcription initiation factor eIF-4A (Jackson, 2000). A single-copy gene can sustain this level of protein synthesis if it is transcribed continuously with one to two initiations per minute (assuming a transcription rate of 2 kb/min, a translation rate of 250 amino acids per minute at 1 initiation per 0.4 minutes, and the average mRNA life-time and cell cycle parameters taken from Jackson, 2000). However, even in proliferating mammalian cells, many genes are transcribed infrequently, typically less than once per hour (Jackson, 2000). It is clear from the corrected probability distribution of gene expression levels in both proliferative and non-proliferative cells (Figure 10.1, Figure 10.3; Table 10.1) that most mRNAs present at < 2 copies per cell need only be transcribed very seldom (i.e., >1 molecule per 5 hours at the typical half-life (τ1/2 = 5 h) of human mRNAs; Jackson, 2000). Mechanistic aspects of such a pulsed transcription process in a cell may be due to (1) temporal modulation of supercoiling effects of chromosomes, (2) temporal inactivation of regulatory repressors (i.e., promoter methylation), and (3) the limited amount of specific transcription factors and other regulatory elements of the transcription machinery, which could be preoccupied for a long time with the transcription of genes coding for abundant proteins. There is growing evidence that initiation of transcription of at least a major fraction of expressed genes occurs sporadically and randomly, both in time and in location on chromosomes, in a variety of cell systems. Much work in support of this mechanism has been done in cell cultures in which gene activity was determined in separated cells rather than in the entire cell or tissue population. It was reported (Weintraub, 1988; Walters et al., 1995) that enhancers act to increase the probability but not the rate of transcription. This random on/off model of initiation/suppression of transcription is supported by the results in (Ko, 1992; McAdams, 1999). It was also shown that luciferase reporter promoter genes in transfected cells shuttle on and off in individual cells (Ross et al., 1994; Hume, 2000), suggesting that transcription might be a random event occurring infrequently and in short pulses. Newlands et al. (1998) and Sano et al. (2001) have shown striking examples of such stochastic gene expression patterns. Based on statistical analysis of GELPFs in different cell tissue types, in different states of the same cell type, and in cell types from different species, we suggested that the existence of such a random transcription process implies that all or almost all protein-coding genes in a genome should have a small
but positive probability of being transcribed in any given cell during any fixed time-interval. This suggestion is consistent with the observation of many low-abundance transcripts of various tissue-specific genes in human cells of different types, such as fibroblasts, lymphocytes, etc. (Chelly et al., 1989). Although not all cells of a population would have a copy of a specific transcript at a given moment, we would expect to see all these genes expressed, at least at a low level, in a sufficiently large cell population at any point in time. That is, ergodicity holds. This point is supported by data in the GeneChip database (Jelinsky et al., 2000): we observed that only 250 ORFs (∼150 of them "questionable" or "hypothetical" ORFs) of ∼6,200 genes/ORFs were not expressed in 6 samples from normally growing yeast cells. In mammalian cells, a significant fraction of genes are established with silencing (transcripts are not observed); the silent state of a gene can be inherited, but later reactivated through a stochastic, all-or-none mechanism at the level of a single cell (Sutherland et al., 2000). Hybridization experiments show that in proliferating HeLa cells at least 50,000 different transcriptional units in non-repeated DNA are transcribed (Jackson, 2000). That number exceeds recent estimates of the number of genes in the entire human genome reported by the Human Genome Sequencing Projects (26,000–35,000 genes; Venter et al., 2001; IHGSC, 2001). By our conservative estimates, presented above, ∼31,200 genes are active in the human colon cell type. Based on these data, we might suggest that all or almost all genes of the entire human genome are transcribed in distinct human cell types, i.e., we postulate the formula G ≈ Nti, where G and Nti are the total numbers of expressed genes in the entire human genome and in the i-th tissue type, respectively. Random and infrequent on/off transcription events must be more common in genes and gene clusters with a low probability of transcription. Physically, such random "basal" gene transcription events might reflect non-linear on/off-triggering of the independent "gene regulatory complexes" in response to internal or external fluctuations, including thermal molecular motion. Generally, the movement of proteins within the nucleus by passive, randomly directed diffusion, and other dynamical properties of proteins in the nucleus, are associated with stochastic mechanisms of gene expression (Misteli, 2001). Low-probability transcription events for many genes could be mediated by their own specific transcripts (Thieffry et al., 1998). In recent years, a diverse class of nuclear small regulatory RNAs (often denoted riboregulators) has been found, which are involved in cellular responses to stress conditions and act as global (negative) regulators by affecting the expression of many genes (Eddy, 2001). Many of these non-coding RNAs are cis-anti-sense RNAs that overlap coding genes on the other
genomic strand. In our analysis of the yeast transcriptome, we found that ∼500 distinct tags matched anti-sense sequences, matching in total ∼1,000 genome regions. A significant fraction of the RNA transcripts associated with this large number of SAGE tags might be involved in negative control of the transcription of protein-coding genes. On the other hand, the fast metabolism of nuclear non-coding RNA molecules provides greater mobility of enhancer and promoter regions, as well as of many protein molecules involved in the transcription process, and thus permits spatio-temporal conditions for the transcription of particular groups of genes that previously were not activated (Jackson, 2000). Thus, we suspect that in response to environmental changes, various stress conditions, and local fluctuations of molecular composition, the initiation of transcription and the regulation of post-transcriptional events for many rarely-expressed genes could be under the dynamical control of small non-coding RNAs, many of which represent anti-sense transcripts. Such autoregulation tends to keep a low-expressed gene "half-on," sporadically providing the mechanism of low expression for many genes. Because the distribution of the life-times of the "switch-on" and "switch-off" states of genes in a single cell appears to have a long right tail (McAdams, 1999), one of the two alleles of the same locus for a given low-expressed gene might remain in the same state for a long period of time. If this is the case, a natural clonal selection process in a population of cells could select the clone(s) with monoallelic gene expression, i.e., the phenomenon in which only a single copy, or allele, of a given gene of a diploid organism is expressed (Hollander, 1999; Ohlsson et al., 2001; Cook et al., 1998). Asymmetric transcription of "long-term controlled" rarely-expressed genes might also be limited by the non-uniformly distributed concentrations of components of the preinitiation complex and of some specific transcription factors. These concentration limits might also lead to monoallelic gene expression phenomena.
7. Probability Distributions of the Numbers of Putative Proteins by Their Cluster Sizes and DNA-binding Proteins by the Regulated Promoters
The probability distribution of expressed genes at the level of cytoplasmic mRNAs in a cell affects the probability distribution of the synthesized proteins. We expect a certain similarity in the (self-)organization of the gene expression processes at the protein level to that observed at the mRNA transcription level. Recently, Drs. Ramsden and Vohradsky (Ramsden and Vohradsky, 1998; Vohradsky and Ramsden, 2001) found that
the rates of occurrence of observed highly abundant proteins of prokaryotic organisms follow a skewed distribution, which we assume could be fitted by our GDP and BD models.1 Proteins clustered by the similarity of their sequences show another interesting example of a skewed distribution in proteomics. Li et al. (2001) applied a clustering algorithm, which they developed, to count the number of protein clusters in a population of 40,580 putative 'proteins' found in the International Protein Index database (http://www.ensemble.org/IPI). The log-log plot in Figure 10.5a shows that the empirical frequency distribution of the constructed protein clusters versus cluster size contains two distinct distributions. The frequency distribution of the first, highly diverse subpopulation of clusters (containing ∼97% of the 40,580 proteins) is fitted well by the GDP model (at k = 1.65 ± 0.006; b = −0.32 ± 0.04; J = 71). The second subpopulation of clusters contains a few large-size clusters (the zinc finger protein cluster is the largest, with 479 proteins). The negative value of the parameter b and the large fraction of "singleton" protein sequences (p1 = 0.87) point to a large number of "erroneous" small-size clusters and to undersampling of the data set. We might expect that many of the observed 25,237 "singleton" proteins will be clustered when the population of well-defined proteins becomes larger and clustering algorithms become more accurate. Analysis of the empirical frequency distribution of the numbers of 97 identified proteins regulating a given number of promoters in E. coli provides interesting insight into the structure of the entire regulatory system of transcription initiation (Thieffry et al., 1998). On a log-log plot, these data give us an idea of the complexity of transcriptional regulation in E. coli by proteins (Figure 10.5b). This figure shows that the GDP model approximates the empirical frequency distribution sufficiently well at k = 1.45 ± 0.053 and b = 1.91 ± 0.17. Figure 10.5b also reflects how complex the initiation of transcription for a given promoter can be. For example, the last right-hand point of Figure 10.5b shows that there is a single common protein in the E. coli proteome which is potentially involved in the regulation of 71 distinct promoters; however, 54 of the 97 regulating proteins (55%) interact with only 1 to 2 promoters and, on average, one of these DNA-binding proteins controls 4 promoters. Interestingly, global identification of transcription factor binding sites (TFBS) at the entire-genome level, for the human p53 protein detected using the chromatin immunoprecipitation (ChIP)–Paired End diTag (PET) method (Wei et al., 2005) and for the CREB transcription factor detected by the ChIP-SAGE method (Impey et al., 2005), also exhibits a Pareto-like distribution (or binomial differential probability function) of the number of transcription factor molecules clustered in a given promoter region (V.A. Kuznetsov, not presented).
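For completeness, one way to fit the GDP parameters (k, b) to a sample of occurrence values by maximum likelihood is sketched below; the chapter's own fitting and model-selection procedure (Kuznetsov, 2001b) may differ in detail:

import numpy as np
from scipy.optimize import minimize

def fit_gdp(occurrences):
    """Maximum-likelihood estimates of (k, b) for the GDP model, with the
    support taken as m = 1..J, J being the largest observed value."""
    occ = np.asarray(occurrences, dtype=float)
    support = np.arange(1, int(occ.max()) + 1)

    def neg_loglik(theta):
        k, b = theta
        if k <= 0.0 or b <= -1.0:
            return np.inf                     # outside the parameter domain
        z = np.sum((support + b) ** (-(k + 1.0)))
        return occ.size * np.log(z) + (k + 1.0) * np.sum(np.log(occ + b))

    return minimize(neg_loglik, x0=[1.0, 0.5], method="Nelder-Mead").x

# e.g., fit_gdp(cluster_sizes) for the protein-cluster sizes discussed above.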
Figure 10.5. Skewed probability distributions in proteomics. (a) Probability distribution of putative protein clusters by cluster size. The 40,580 putative proteins were selected and clustered by a measure of the proportion of identical amino acids in the aligned region between paired sequences (Li et al., 2001), at the level of similarity I ≥ 30% proposed by the authors. Log-log plot: ◦: the number of protein clusters; solid line and dotted line: fitted GDP model (b = −0.32 ± 0.004; k = 1.65 ± 0.006) and simple power law (at κ = 2.33 ± 0.03) for the ◦-data at J ≤ 71, respectively. The 2nd, 3rd and 4th empirical points on the left side of the plot represent averages of 3, 5 and 10 observations, respectively. (b) Relationship between the fraction of regulatory proteins and the number of regulated promoters in E. coli (Thieffry et al., 1998). Log-log plot: ◦: data; solid line: best-fit GDP model at b = 1.91 ± 0.17 and k = 1.45 ± 0.053 for the ◦-data.
8. Protein Domain Statistics in Proteomes
8.1 Statistical Analysis of Proteome Complexity
Basic evolutionary processes are probably most effectively understood in terms of distinct protein domain coding DNA sequences and thus in terms of associated protein domains, rather than in terms of genes as the primary evolving units (Rubin et al., 2000; Li et al., 2001; Koonin et al., 2000; Venter et al., 2001; IHGSC, 2001). Protein domains generally correspond to locally folded segments of proteins that serve as active binding sites governing selective interactions of a protein with other specific molecules. For many different proteins, these domains can be found alone or
in combination with other or the same domains, so as to combinatorially form the functional parts of the proteins that make up the proteome. About 4,700 "identifiable" domains, collectively occurring in more than 600,000 putative (predicted) proteins of 6 eukaryotes and more than 40 prokaryotes, were described in the InterPro database at the end of 2001 (www.ebi.ac.uk/interpro/). Current protein domain and protein family databases are incomplete and have errors and redundancies. It is obvious that many distinct domain-coding DNA sequences, and thus the associated protein domains, have been (1) highly conserved in evolution and (2) shared among different organisms (Rubin et al., 2000; Li et al., 2001; Venter et al., 2001; IHGSC, 2001). For example, about 50% of the 1401 InterPro protein domains of Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans are shared among these organisms. Only about 7% of InterPro domains are unique to the human proteome in comparison to these organisms (IHGSC, 2001). During evolution, proteomes have become more complex through such processes as gene duplication, mutation, and shuffling, in vertical descent within organisms of the same species (paralogous processes) and in horizontal displacement of genes between different species (orthologous processes) (Rubin et al., 2000; Koonin et al., 2000; Friedman and Hughes, 2001). It appears that a domain that has appeared in a species has a significant chance of being conserved in evolution; rather, more combinations of domains arise as proteome complexity grows. Thus, most biological functions appear to be provided by a limited set of shared protein domains which can be traced back to a common ancestor species (Rubin et al., 2000; Venter et al., 2001; IHGSC, 2001). This suggests that basic evolutionary processes could be fruitfully understood in terms of protein domains. How can we quantitatively describe proteome complexity? The frequencies of occurrence of distinct domains are very different even in the same eukaryotic proteome (Rubin et al., 2000): most domains are found only once in a given proteome (though the same domain may be present in many other organisms); however, a small percentage of domains occur very often. A given protein contains from one to many dozens of domains, and may include the same domain several times. Moreover, a given domain may appear in many distinct proteins in the same proteome and in the proteomes of many distinct species. Generally, the possible relationships between protein domains and proteins may be represented by a bipartite graph. Figure 10.6 shows a small bipartite graph with six domains (labeled 1 to 6) and 12 proteins (labeled A through L), with edges joining each protein to the domains it includes. Let the proteome domain profile for a specific organism be the list of all distinct protein domains inherent to the organism, together with the numbers of occurrences of each of the domains in the proteome of
Figure 10.6. A bipartite graph network of six domains (labeled 1 through 6) and 12 proteins (labeled A through L), with edges joining each protein to the domains it includes. The dashed line indicates an appearance of domain 6 in protein I.
the organism. Each protein that makes up the proteome is represented just once in counting domain occurrences. There is a fundamental statistical description of a proteome domain profile, which consists of the distribution of the domain occurrence values in the profile. Let X denote the number of occurrences of a random distinct domain in the profile. We may define the domain occurrence probability function (DOPF) f(m) := P(X = m) for m = 1, 2, .... The DOPF gives the probability of each possible domain occurrence value 1, 2, etc. arising for a random domain in the proteome domain profile. Currently, the proteome domain profiles for studied organisms, even those having fully sequenced genomes, are incomplete and can only be approximated by various inference-based methodologies, such as analysis of the genome sequences of the organism. Let n_m be the observed number of distinct domains which have the occurrence value m (occurring m times) in a given sample from the proteome. Let J denote the observed occurrence value for the most frequent domain in the sample proteome. Then \sum_{m=1}^{J} n_m = N is the number of distinct domains in the sample. The points (m, g(m)) for m = 1, ..., J, where g(m) = n_m/N, form the histogram corresponding to the empirical DOPF. Thus, the empirical DOPF specifies the fractions of distinct domains that occur 1, 2, ... times in the sample. The empirical DOPF is just the normalized histogram of domain occurrence values (Figure 10.7a,b).
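Computing the empirical DOPF from a proteome domain profile is straightforward; the sketch below substitutes a hypothetical Zipf-distributed profile for real InterPro occurrence counts:

import numpy as np

def empirical_dopf(occurrence_values):
    """Points (m, g(m)) of the empirical DOPF, where g(m) = n_m / N and
    occurrence_values[i] is the occurrence count of distinct domain i."""
    occ = np.asarray(occurrence_values)
    m, n_m = np.unique(occ, return_counts=True)
    return m, n_m / occ.size

rng = np.random.default_rng(3)
profile = rng.zipf(2.0, size=1_000)    # hypothetical domain profile
m, g = empirical_dopf(profile)
print(m[0], g[0])                      # g(1) is the p1 statistic used below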
We analyzed the empirical DOPFs for the proteomes of a number of fully-sequenced organisms. The data we used were obtained from the InterPro database (http://www.ebi.ac.uk/interpro). We examined how the empirical DOPFs change with the increasing complexity of eukaryotic organisms, including S. cerevisiae, C. elegans, Drosophila melanogaster, Homo sapiens and the Guillardia theta nucleomorph, and also several bacteria (L. lactis, E. coli, etc.). We observed that for each proteome, the histograms of domain occurrence values exhibited remarkably similar monotonically-skewed shapes, with a greater abundance of bigger domain occurrence values and more gaps among these higher-occurrence values. Figures 10.7a,b show the empirical probability functions (histograms) of the domain occurrence values
Figure 10.7. Comparative analysis of proteome complexity in yeast, worm and fly. (a) •: empirical frequency distribution of domain occurrence values in the yeast proteome sample; solid step-function line: the GDP model for the •-data; dotted line: best-fit Riemann-Zipf model (b = 0; k > 0) for the •-data. (b) Empirical frequency distributions of domain occurrence values in the fly (◦) and worm (+) proteome samples; solid step-function line: the GDP model for the ◦-data. (c) Correlation plot: co-occurrences of protein domains in the yeast and fly proteomes. Solid line: the regression model m_fly = 2.23 + 1.85 · m_yeast, where m_fly and m_yeast are the occurrence values for the fly and yeast proteome samples, respectively; the dotted line is the diagonal. (d) Empirical frequency distributions of domain occurrence values in the yeast proteome sample (•) and in same-size sub-samples randomly chosen from the fly (◦) and worm (+) proteome samples; solid, dashed and dotted lines fit points calculated by the best-fit GDP model for the •-, ◦-, and +-data, respectively; the sub-samples from the worm and fly data sets each contain 4,547 domain occurrences.
in the proteome samples of S. cerevisiae, C. elegans and Drosophila melanogaster. The histograms for the fly and for the worm are most similar, reflecting their greater complexity and proteome size. Note that the "protein sets" used were samples of the proteomes, not the complete proteome data sets. In particular, we used a list of 1401 InterPro domains (Rubin et al., 2000). We observed 1,000 of these domains in the yeast protein set, 1,144 in the worm protein set, and 1,177 in the fly protein set. All the histograms we studied can be approximated by the generalized discrete Pareto probability (GDP) function (Kuznetsov, 2001b): f(m) = z^{-1}/(m + b)^{k+1}, where f(m) is the probability that a domain, randomly chosen from the N distinct domains, occurs m times in a proteome. The function f involves two unknown parameters, k and b, where k > 0 and b > −1; the normalization factor z is the generalized Riemann Zeta-function value z = \sum_{j=1}^{J} (j + b)^{-(k+1)}. Let M be the total number of non-redundant occurrences of domains in the sample proteome. By the non-redundant statistic, we mean that a distinct domain (as a homology class) is counted only once even if it occurs several times in a protein (for instance, in repeats). We will call M the connectivity number. This number reflects the complexity of the proteome: M is larger for multicellular eukaryotic organisms (the fly and the worm have connectivity numbers 11,569 and 12,822, respectively) than for single-cell organisms (M is only 4,547 for the yeast). Interestingly, the mean domain occurrence value in a set of distinct domains, M/N, is also larger for more complex organisms. Moreover, M positively correlates with the occurrence value J of the most frequent domain in the sample proteome, and negatively correlates with the frequency p1 of domains occurring only once in the sample proteome. New domain usage within a proteome results from gene duplication, shuffling, and/or adding or deleting sequences that code for domains. This results in new proteins from old parts. A large proportion of proteins in different organisms share the same domains, even over widely divergent evolutionary paths. Importantly, a domain occurring once in some species has a high probability of being used in other species multiple times in different proteins. To specify this rule, we analyzed the occurrence, in the domain-coding sequences of the more complex organism yeast, of the domain-coding sequences that have been found in the tiny 551-kilobase eukaryotic genome of G. theta. We counted the number of occurrences of all 211 distinct domains which have been found in G. theta. Table 10.3 shows the frequency distribution of the G. theta domains identified (or missing) in the yeast proteome.
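The similarity measure used in these cross-species comparisons is the ordinary Pearson correlation of per-domain occurrence values over the domains shared by two species; a minimal sketch (the aligned occurrence vectors themselves would come from the InterPro-based profiles):

import numpy as np

def domain_similarity(occ_a, occ_b):
    """Pearson correlation r between aligned per-domain occurrence values
    of two proteome samples (as plotted in Figure 10.7c)."""
    a = np.asarray(occ_a, dtype=float)
    b = np.asarray(occ_b, dtype=float)
    return np.corrcoef(a, b)[0, 1]

# For the chapter's data this gave, e.g., r = 0.66 for worm vs fly.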
Only 9% (19 of all 211 distinct G. theta domains) are missing in the yeast proteome; 18 of these 19 domains occur in the G. theta proteome only once or twice. Most of the G. theta domains have been multiplied in the yeast proteome. We observed a positive correlation (r = 0.61; p < 0.001) between the G. theta domain occurrence values for these two species. These observations suggest that gene loss (or protein-coding sequence loss) is under negative selection, while the processes that add a new gene are under positive selection during the growth of organism complexity. The Pearson correlation coefficient, r, between the domain occurrence values for a pair of organisms provides an estimate of their evolutionary similarity. Figure 10.7c shows that a greater similarity is observed between the worm and the fly (r = 0.66) and that yeast is more closely related to the fly (r = 0.58) than to the worm (r = 0.47). The changes in the parameter values of the GDP model for the different species (see Figures 10.7a,b) strongly correlate with the changes in the major characteristics of the empirical DOPF: M, p1, and J. This suggests the existence of a size-dependence for the parameters of the distribution.2 In particular, the slope parameter k negatively correlates with the mean number of times that a random domain occurs within the proteome, and positively correlates with the fraction of single-domain proteins in the proteome. Thus the slope parameter k might be used to characterize the evolutionary trend toward using more domains (selected by evolution) in the proteome, as well as the trend toward reusing already existing specific domain(s) (in new multi-domain proteins or by recombining domains). The values of the parameters k and b in the GDP model may also reflect the evolutionary branching order of species (e.g., yeast (first), worm (second), fly (third) and human (fourth); Friedman and Hughes, 2001). Several theoretical models for the growth of multi-domain proteins during evolution have been suggested (Koonin et al., 2000; Rzhetsky and Gomez, 2001; Friedman and Hughes, 2001). Koonin et al. (2000), for example, showed that multi-domain proteins frequently are created through the (genetic) fusion of additional domains at one or both ends of a protein. We considered a more flexible evolutionary process in which single domain insertion occurs randomly within multi-domain proteins. To be precise, we simulated the reverse process (i.e., going back one evolutionary step to a simpler proteome), modeled by the random removal of one domain copy from the proteome. We applied this random removal process to both the worm and fly proteomes separately until we obtained proteomes with the same total number of domains (i.e., M = 4,547) as observed for yeast (Figure 10.7d). The empirical DOPFs of the domain occurrence values for these two sub-samples closely approximate the yeast DOPF (Figure 10.7, Table 10.2).
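The reverse-evolution simulation can be sketched as follows: one domain copy is removed at a time, uniformly over all copies, until the connectivity number reaches the target (illustrative only; the published runs used the actual fly and worm profiles):

import numpy as np

def random_removal(occurrences, target_M, rng):
    """Remove single domain copies uniformly at random until the total
    number of copies (the connectivity number) drops to target_M."""
    occ = np.asarray(occurrences, dtype=float).copy()
    while occ.sum() > target_M:
        i = rng.choice(occ.size, p=occ / occ.sum())  # pick a random copy
        occ[i] -= 1.0
    return occ[occ > 0].astype(int)   # domains losing all copies disappear

# e.g., random_removal(fly_profile, 4_547, np.random.default_rng(4))
# shrinks a fly-like profile to the yeast connectivity number.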
Table 10.2. Characteristics of protein domain families. Nt: the total number of genes/ORFs; A: the number of identifiable proteins corresponding to domain families; Am>1: the number of identifiable proteins containing more than one InterPro domain; N: the number of distinct InterPro domains; M: the total number of domain occurrences per proteome (Rubin et al., 2000); α: average number of appearances of a given domain in the proteome ("average diameter" of the domains-proteins network); μ: average number of observed domains per protein; p1: the fraction of domains occurring once; k, b: parameters of the GDP model; κ: the slope parameter of the simple power law; εGDP, εPL: model selection criterion (Kuznetsov, 2001b) for the GDP model and the simple power law, respectively. Flys and Worms denote the fly and worm sub-samples randomly reduced to the yeast connectivity number (see text).

           Fly          Worm         Yeast        Flys        Worms
Nt         13603        18424        6241         –           –
A          7419         8356         3056         –           –
Am>1       2130         2261         672          –           –
N          1177         1144         1000         847         787
M          11569        12822        4547         4547        4547
α          9.8          11.2         4.4          5.4         5.8
μ          1.56         1.53         1.49         –           –
J          579          545          125          236         199
p1         0.26         0.29         0.39         0.38        0.42
k          1.24±0.02    1.01±0.001   1.54±0.019   1.41±0.008  1.0±0.007
b          3.0±0.016    1.95±0.007   2.10±0.063   2.00±0.03   0.91±0.02
εGDP       4.7          5.7          5.5          5.3         5.5
κ          0.69         0.65         0.82         0.46        0.53
εPL        0.8          1.4          0.03         3.6         4.7
The parameters of the GDP model for the simulated "fly ancestor" (the fly sub-sample at M = 4,547) and for yeast were very similar. Greater differences were observed between the DOPF for the simulated "worm ancestor" (the worm sub-sample at M = 4,547) and the observed DOPF for yeast. In both simulations, a longer right tail of the DOPF for the simulated data (compared to the observed DOPF for the yeast) was observed. These deviations probably reflect a natural selection process that preserves various non-random co-occurrences for some sets of domains. Such co-occurrences are not properly represented by random removal of domains. Figure 10.8 shows that human protein domain complexity is greater than that of the other organisms with fully sequenced genomes; for example, the connectivity numbers for human, E. coli and N. meningitidis are 36,082, 3,649 and 1,693, respectively. On the other hand, the fraction of domains that occurred once in these organisms is 0.27, 0.52, and 0.66, respectively. The slope parameter, k, of the DOPFs is smallest in the case of the human sample proteome.
Figure 10.8. Comparative analysis of proteome complexity in human vs E. coli, yeast, worm and N. meningitidis-B. (a) ◦, +, •: empirical frequency distributions of domain occurrence values in the human (M = 36,082), E. coli (M = 3,649), and N. meningitidis-B (M = 1,693) proteome samples, respectively; the ◦-data set is connected by a line. All empirical DOPFs are fitted well by the GDP model. (b-d) Empirical frequency distributions of domain occurrence values in human proteome sub-samples (◦) vs the yeast (b; human sub-sample M = 5,461), E. coli (c; +; human sub-sample M = 3,649) and worm (d; human sub-sample M = 16,796) proteome samples. M is the connectivity number. To construct the histograms we used data from the fourth release (December 11, 2001) of the InterPro database.
Figure 10.8 also shows that random sub-samples taken from the human proteome have domain occurrence profiles whose empirical DOPFs are quite similar to the DOPFs for different species of smaller complexity. Thus, the process of evolutionary growth of the proteome, as measured by its domain network complexity, can be characterized by the DOPF across species. Our reverse-evolution simulation experiments based on random removal of domains fit sufficiently well to claim that such randomness is likely the predominant process in the evolution of the proteome (Figure 10.7 and Figure 10.8). These experiments assume that the addition of any domain to the proteome is equally likely and does not depend on the time of its occurrence or its location within a protein. Loss of some older domains also occurs in the course of evolution, but at a slower rate (see, for example, Table 10.3) than the addition of a new domain or the reuse of an old one. The major non-random process would probably consist of removing or adding domain pairs and larger protein
Table 10.3. Co-occurrence distribution of protein domains found in the cryptomonad organism G. theta in the sample proteome of S. cerevisiae. For all the domains that occurred s times (1 ≤ s ≤ 12) in G. theta, we counted the number of times these domains occurred in the sample proteome for the yeast. [The table body, a matrix of domain counts with rows indexed by the yeast occurrence values (0, 1, 2, ..., 54–115) and columns by the G. theta occurrence values s, is not legibly recoverable from this copy; the clearly recoverable marginals include the column totals 158, 28 and 9 domains for s = 1, 2 and 3, respectively, and a grand total of 211 distinct G. theta domains.]
This would clearly introduce co-occurrence events (which certainly exist in evolution) into our model. In bulk, the process of proteome complexity growth looks like the cumulative-advantage dynamics of a system of independent elements (protein domains) with associated counters: new domains are occasionally added, while the counters are incremented at a rate depending on their count values (Simon and Van Wormer, 1963). Note that our model can also be formulated in probabilistic terms (Kuznetsov et al., 2002b; Kuznetsov, 2003a): (1) At each random time-point
in the course of evolution, the proteome connectivity number is increased by one domain entry. (2) Any new entry (repeated gene duplication, gene loss, gene translocation, as well as the de novo origin of genes, horizontally transferred genes, etc.) occurs in the course of evolution with approximately the same probability. (3) The probability of entry of a given domain that is already present in a proteome is defined by the proportion of this domain in the proteome. Such a probabilistic model, excluding the domain-loss process, leads to a DOPF in the form of the BD probability distribution (Kuznetsov, 2001b; Kuznetsov, 2003b) (see next subsection). Our assumptions are weaker than those of Koonin et al. (2000). However, since the majority of randomly selected addition sites within the entire proteome are at the ends of proteins (that is, the majority of proteins contain only one or two domains; Rubin et al., 2000; see also Table 10.2), the left side of the DOPF should not be too sensitive to differences between the two specific models. The right tail of the frequency distribution might be much longer for our model. Within the framework of our model one could specify and test different mechanisms for the evolution of genome and proteome complexity and explore changes in the DOPFs for specific models of biological evolution, including Yule's (Yule, 1924), Fisher's (Fisher, 1930) and more recent models. In doing this analysis, we can suggest that at the proteome level, selection pressures do not produce a drastic global effect on the functional form of the distribution of evolutionarily conservative protein domains, but do affect the shape of the distribution, accounting for the increase in the sizes of some gene families and the re-location of some gene clusters.
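The three rules above define a cumulative-advantage growth process that is easy to simulate. The following minimal Python sketch is our illustration rather than the author's simulation code; the probability of a brand-new domain per entry, p_new, and the number of steps are arbitrary illustrative values:

# Sketch of the three-rule growth model stated above: at each step one domain
# entry is added; with probability p_new it is a brand-new domain, otherwise an
# existing domain is chosen with probability proportional to its current count.
import random
from collections import Counter

def simulate_proteome(n_steps=30000, p_new=0.05, seed=1):
    random.seed(seed)
    counts = [1]     # occurrence count of each distinct domain
    entries = [0]    # one element per entry: enables size-biased sampling
    for _ in range(n_steps):
        if random.random() < p_new:
            counts.append(1)                  # rule (2): a new domain enters
            entries.append(len(counts) - 1)
        else:
            d = random.choice(entries)        # rule (3): pick by proportion
            counts[d] += 1
            entries.append(d)
    return counts

counts = simulate_proteome()
N, M = len(counts), sum(counts)               # distinct domains, connectivity
dopf = Counter(counts)                        # m -> number of domains seen m times
print(f"M = {M}, N = {N}, p1 = {dopf[1] / N:.2f}")
for m in (1, 2, 3, 4, 5):
    print(f"proportion of distinct domains with m = {m}: {dopf[m] / N:.4f}")

The simulated DOPF is heavily skewed toward singletons, with a long right tail, in qualitative agreement with the empirical profiles discussed above.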
8.2
Prediction of the Numbers of Protein-Coding Genes in the Genome and of Protein Domains in the Entire Proteome
Although the exact numbers of genes in the human and mouse genomes remain to be determined, vertebrate genomes clearly contain more genes than those of Drosophila and C. elegans (Venter et al., 2001; IHGSC, 2001). In Figure 10.9, we show a remarkable linear relationship between the number of evolutionarily conserved protein-coding genes, G, and the proteome connectivity number (the total number of observed InterPro domains), M, in proteome samples comprising about 65 ± 3.4% (n = 20) of the assumed complete proteomes. Eighteen of these 20 data points are for organisms with fully-sequenced genomes; the two prediction points for mouse and human are shown as solid circles. According to the plot, 36,600 and 20,900 evolutionarily conserved protein-coding genes are in the human and mouse genomes, respectively.³
Figure 10.9. Relationship between the number of evolutionarily conserved protein-coding genes in the genome, G, and the observed proteome connectivity number, M, for H. sapiens, M. musculus and 18 organisms with fully-sequenced genomes: A. thaliana, C. elegans, D. melanogaster, R. loti, P. aeruginosa, S. cerevisiae, C. acetobutylicum, V. cholerae, M. tuberculosis, C. crescentus, E. coli K-12, Synechocystis sp. (PCC 6803), D. radiodurans, L. lactis, N. meningitidis-A, N. meningitidis-B, M. leprae, G. theta. ◦: data points for the 18 fully-sequenced genomes; •: the predicted points for mouse and human. The regression line G = 1.015M was estimated using the 18 data points of the fully-sequenced genomes. (We analysed data of the InterPro database, http://www.ebi.ac.uk/interpro/.)
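The through-origin regression underlying Figure 10.9 can be reproduced in a few lines. In the Python sketch below the (M, G) pairs are invented placeholders (the chapter's 18-genome table is not reproduced here); only M = 36,082 for the human proteome sample is quoted in the text:

# Through-origin least squares: slope = sum(M*G) / sum(M*M).
# The data arrays are illustrative placeholders, not the chapter's data set.
import numpy as np

M = np.array([2000, 3649, 5000, 9000, 14000, 20000, 26000], dtype=float)
G = np.array([2100, 3700, 5050, 9200, 14100, 20400, 26300], dtype=float)

slope = (M @ G) / (M @ M)                 # no-intercept least squares
print(f"slope = {slope:.3f}")             # the chapter reports 1.015

# Extrapolation to the human proteome sample (M = 36,082 in Figure 10.8):
print(f"predicted G(human) = {1.015 * 36082:.0f}")   # ~36,600, as in the text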
The linearity of this graph is surprising. It may be that we are seeing the early part of a saturation curve of proteome evolution, or that we did not have sufficiently representative statistical samples. Indeed, our extrapolated human data point may be an over-estimate of the number of genes, because the human and mouse proteomes available to us are not yet complete, and the results must be considered preliminary. However, even if the average of these proteomes represents 60–65% of the total, it seems unlikely that the predictions will change substantially with additional data. The slope in Figure 10.9 is 1.015. It suggests that in proteome size growth, each occasionally appearing new entry of a domain sequence encoded in the genome of the organism leads, on average, to a new functional gene copy. This is consistent with our model of protein complexity growth (see above)
and with our observations that for all analyzed prokaryotic and eukaryotic organisms the average value of the connectivity number, α (α = M/A, or "the average diameter" of the proteome network), is approximately constant or, more precisely, ranges in a very narrow interval (1.27−1.83, with mean value 1.5 ± 0.139, n = 20). Although the human, A. thaliana and mouse genomes show a higher proportion of recent gene duplications than other known eukaryotic organisms (Douglas et al., 2001; Venter et al., 2001), the α-value for these organisms is still only 1.83, 1.46 and 1.82, respectively. It implies a high proportion of single-gene and two-member families in these genomes. Figure 10.10 shows that, as the number of evolutionarily conserved protein-coding genes increases, the number of distinct protein domains, N, the fractions of rarely observed domains (p1, p2, and p3) and the occurrence value of the most frequent domain change in a reciprocal and predictable manner. For example, p1, the fraction of distinct domains represented by only one entry in the proteome, becomes smaller, but the occurrence value of the most common domain in the proteome, J, increases proportionally with the number of genes in the genome (J/G ≈ a, where a is a nearly constant parameter for all organisms studied here, equal to 0.027 with standard error ±0.008, n = 20). Increasing the number of genes also leads to a monotonic decrease of the slope of the probability function. Generally, we can consider such a size-dependent evolution process in terms of the BD model (Kuznetsov, 2001b; Kuznetsov, 2003b) if, in Eqs. (10.3–10.7), the variable N is formally taken as the number of distinct domains and the variable M is re-designated as G. Figure 10.10 shows that our BD model fits all observed data well, except the data point for A. thaliana. Only two fitted parameters of the model (c = 0.486 ± 0.0260, d = 4,500 ± 300) allow us to estimate the DOPF for a given organism with a known number of genes. (Note: our estimates are relatively robust and consistent with the data in Figure 10.9. When we excluded the data points for mouse, human and A. thaliana from our curve-fitting analysis, we obtained a numerically similar solution of the BD model, with parameter values close to the values presented above.) Moreover, Eq. 10.7 predicts the existence of ∼5,400 ± 500 InterPro domains in the entire proteome world (see also Figure 10.10a). Thus, as the proteome increases due to evolution, each independent domain, alone or in combination with other domains, forms the repertoire of proteins of the proteome. This process may be considered mostly as a random birth-and-death process with cumulative-advantage accumulation of domains in more complex proteomes. New domains and new uses of already present
Figure 10.10. Relationships between the number of evolutionarily conserved protein-coding genes in the 20 genomes and some characteristics of the DOPF. (a) The number of distinct domains in the proteomes; (b) the fraction of distinct domains that occurred once; (c) the fractions of distinct domains that occurred two and three times in a given proteome. ◦: 17 data points for fully-sequenced genomes (see Figure 10.9), except for the three data points for mouse and human (solid circles) and A. thaliana. Lines in all plots: best-fit curves provided by the BD model (at c = 0.486 ± 0.0260, d = 4,500 ± 400) to the ◦- and •-data points. (d) Observed occurrence values of the most frequent domain in a given proteome; regression line J(G) = 0.028G fitted to the ◦- and •-data points. In all plots, symbol assignments are uniform.
domains likely appear in proteomes of different species relatively independently, with both accumulating with similar probability. Generally, our analysis suggests that the increase in gene number in evolution occurred predominantly as a result of independent and random events: addition of single domains, either by mutation, gene conversion, reverse transcription, transposition, occasional horizontal transfer of genes, or by combinatorial genome shuffling; multiple independent gene duplications and occasional gene-block duplications became more probable in higher eukaryotic organisms.⁴ These forms of genome adaptation allow populations of unicellular and multicellular organisms to ensure survival in unpredictably changing environments. This can appear as several overlapping Pareto-like probability distributions with a very long left
tail of the empirical DOPF. Figures 10.7, 10.8 and 10.10 also demonstrate a general rule of the growth of an organism's complexity: to solve adaptation problems, more complex organisms preferentially increase the use of assemblies of "old" protein domains and blocks in new combinations. This conclusion is consistent with the results of a recent analysis by Hughes and Friedman (Hughes et al., 2001; Friedman and Hughes, 2001), who suggested that the increase in gene numbers in vertebrates may have occurred as a result of multiple independent gene duplications, as well as occasional duplication of chromosomal blocks. The molecular basis of the observed universality of the BD model in genome-related phenomena is not well understood at present, but it is at least fundamentally important in that this universality is related to stochastic birth-death-innovation processes in the course of evolution (Kuznetsov, 2003a; Kuznetsov et al., 2002b).
9.
Conclusion
The goal of this study was to characterize the basic statistical distributions appearing in different genome-related phenomena. We observed such distributions for transcript copy numbers in a single eukaryotic cell, for sizes of protein clusters, for protein domain occurrence values in bacterial and eukaryotic proteomes, and for numbers of regulatory DNA-binding proteins in the prokaryotic genome. In all these cases, the data follow a Pareto-like distribution. We also found that the form of the distribution depends systematically on the size of the sample, the physiological state of the cells, the cell type and the complexity of the organism. This allowed us to reveal several general rules of the cell response to damaging stress factors and of proteome complexity growth in the course of evolution. We developed a stochastic model of finite population growth that leads to a size-dependent Pareto-like distribution of the number of occurrences of species in the population; this model leads to what we call the Binomial Differential (BD) distribution. We presented evidence that the empirical frequency distributions of gene expression levels for all analyzed yeast, mouse and human cell types are well described by this model. Our distribution models could explain this universality by a simple stochastic mechanism. We also developed a new methodology to remove major experimental errors from gene-expression data sets generated by the SAGE method (see also Kuznetsov, 2005). The Binomial Differential model allows us to estimate the total number of expressed genes in a single cell and in a population of given cells even when transcriptome databases are incomplete and contain errors.
Mechanistic models of stochastic gene-expression processes in the adaptability, phenotypic and genetic variability of a population of cells of the same type were discussed. Further identification of the distribution of gene expression levels in different cells appears to be important in understanding the relationships between stochastic and more probabilistic mechanisms that govern the expression of genes associated with different cell types and with normal and pathological processes in multi-cellular organisms. Statistical analysis of global gene-expression data could also help us understand how the stochastic variability of gene expression in a single cell might lead to changes of the genotype repertoire in developing tissues and in the entire organism (Till et al., 1964; Ross et al., 1994; Hollander, 1999; Hume, 2000; Shulman and Wu, 1999; Ohlsson et al., 2001). Our ergodicity hypothesis for the initiation of the gene-expression process suggests that a random and sporadic "basal" transcription mechanism exists for all protein-coding genes in all or almost all eukaryotic cell types. This mechanism may provide a basic level of phenotypic diversity and adaptability in cell populations (Kuznetsov et al., 2002a). For example, sporadic expression of a large number of rarely-transcribed genes can produce apparent monoallelic expression phenomena in growing cell populations and can explain the phenomenon of "no-phenotype" gene knockouts (cases where loss of function of a gene has no visible effect on the organism). It is possible that the rarely expressed genes might evolve by a weak selection process. Many of these genes are of no immediate benefit to the genome, but could be significant during natural selection. Based on the analysis of "identifiable" protein domain occurrences in the sample proteomes of several fully-sequenced organisms, we developed a simple probabilistic model that describes the proteome complexity of an organism, and we showed that this description remains consistent as the proteome grows in evolution. Our model is the Binomial Differential probability function of domain occurrence values in a proteome. We found that such a function is common to all studied organisms with fully-sequenced genomes and that the form of this distribution changes systematically as the proteome complexity grows. Fitting and extrapolation of the BD model predicts ∼5,400 distinct protein domains in the entire proteome. We also discovered that the rate of growth of the number of protein-coding genes in evolving genomes appears to be a linear function of the total number of domains in the corresponding proteome samples. This allows us to estimate the number of protein-coding genes in complex genomes. Finally, ∼31,200 expressed genes were estimated based on the distribution of transcript numbers in the human colon cancer transcriptome, and ∼36,600 evolutionarily conserved protein-coding genes were predicted
for the entire human genome based on the analysis of the numbers of protein domains in the proteomes of fully-sequenced-genome organisms. Thus, the discovery of universal size-dependent relationships in the distributions of genes by their expression levels in cells of different organisms, and in the distributions of proteins by their domains in the evolution of eukaryotic organisms, may help us understand the basic mechanisms governing the growth of genome and proteome complexities during evolution, and quantitatively evaluate the relationship between natural selection and random events in the evolution process. Obviously, our assumptions and predictions were based on sample analysis and on simple statistical models; further development of the theory and new, more complete and reliable data are required. The Pareto-like skewed distributions observed in genomic and proteomic phenomena could be associated with fundamental properties of self-organized critical (SOC) system models. Such cellular automata models demonstrate a prominent size-dependent effect and are relatively robust to changes in the composition of their distinct components, to damage of their sub-systems, and to the dynamics of networks in the system. It will be important to study more deeply the relationships between the properties of SOC systems and the empirical statistical distributions presented here. Promising results in this direction have recently been obtained by introducing so-called Probabilistic Boolean Networks, a probabilistic generalization of the well-known Boolean network models, which, in particular, allows the analysis of small networks in large-scale genetic systems (Shmulevich et al., 2002). It is likely that our statistical approach and models could be applicable to the analysis of many other evolving large-scale systems (for example, in business, linguistics, sociology, informatics, the internet, physics, etc.).
Acknowledgments I thank Robert Bonner and Gary Knott for useful discussions and critical comments.
Appendix A: Infinity Limit for Population Growth Associated with the Generalized Pareto Probability Distribution

Here we will estimate the relationships between N and M for the Generalized Pareto distribution when M is large enough and the parameters of the distribution are constants (i.e., size-independent). For convenience, let us first find
these relationships for a continuous form of the Generalized Pareto density function:

    f(m) = ks \frac{(m_o + b)^k}{(m + b)^{k+1}},    (10.8)

where m ∈ [m_o, aM], s = 1/(1 − ((m_o + b)/(aM + b))^k), and m_o, k, b are constants with m_o > 0, k > 0, b > −1, a_o < a < 1, a_o > 0. The function f can be estimated as

    f = n_m / N,    (10.9)

where n_m is the observed number of distinct items (tags or genes) which have the occurrence level m in a given sample of size M. Then, using the estimate

    M = \int_{m_o}^{aM} n_m \, m \, dm,    (10.10)

we can obtain the formula

    N = M \cdot A(M),    (10.11)

where

    A(M) = \left[ ks (m_o + b)^k \int_{m_o}^{aM} \frac{m}{(m + b)^{k+1}} \, dm \right]^{-1}.    (10.12)

Taking the integral, we find that if M → ∞ then N ∼ M^k for 0 < k < 1, N ∼ M/\log M for k = 1, and N ∼ M for k > 1. Thus, for all values of the parameter k we have N → ∞ when M → ∞. Using a discrete analog of the Pareto probability function (see Eq. 10.1), with N = \sum_{m=1}^{aM} n_m and M = \sum_{m=1}^{aM} m \cdot n_m, where n_m ∝ N (m + b)^{-(k+1)} and k > 0, b > −1, a_o < a < 1, a_o > 0, we can obtain the formula

    N = M \cdot A_d(M),    (10.13)

where

    A_d(M) = \frac{\sum_{m=1}^{aM} (m + b)^{-(k+1)}}{\sum_{m=1}^{aM} m \, (m + b)^{-(k+1)}}.
In the case of the Lotka–Zipf law (b = 0, k = 1), we have:

    A_d(M) = \frac{\sum_{m=1}^{aM} 1/m^2}{\sum_{m=1}^{aM} 1/m} ≈ \frac{\pi^2/6}{\ln(aM) + \gamma},

where γ = 0.5772… is the Euler constant. Then,

    \lim_{M→∞} N ≈ \frac{(\pi^2/6) M}{\log M} → ∞.

In the other specific case, b = 0, k = 2, we have:

    \lim_{M→∞} N ≈ (1 + \pi^2/15) M → ∞.
One can check that Eq. (10.13) exhibits a similar growth of N to infinity, as we have shown for Eq. (10.11), when M → ∞.
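As a quick numerical illustration of Eq. (10.13) (our own sketch; the values of a, b and k below are arbitrary rather than fitted), one can tabulate N = M·A_d(M) for growing M and observe the k = 1 behavior derived above:

# Numerical check of Eq. (10.13): N = M * A_d(M) with
# A_d(M) = sum_{m=1}^{aM} (m+b)^-(k+1) / sum_{m=1}^{aM} m (m+b)^-(k+1).
# The parameter values a, b, k are illustrative, not fitted values.
import numpy as np

def N_of_M(M, a=0.5, b=0.0, k=1.0):
    m = np.arange(1, int(a * M) + 1, dtype=float)
    w = (m + b) ** (-(k + 1))
    return M * w.sum() / (m * w).sum()

for M in (10**3, 10**4, 10**5, 10**6):
    n = N_of_M(M)
    print(f"M = {M:>8}: N = {n:10.1f}, N*log(M)/M = {n * np.log(M) / M:.3f}")

# For k = 1 (the Lotka-Zipf case) the ratio N*log(M)/M slowly approaches
# pi^2/6 ~ 1.645, in agreement with the limit above; N grows without bound.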
Appendix B: Population Growth Curve for the Number of Human Expressed Genes

We model the size-dependences N_r and N_g for human gene-expression data as follows:

    dN_g/dM = p_{1g} N_g / M,    (10.14)

    dN_r/dM = p_1 N_g / (M + a),    (10.15)

where N_g(1) = 1, N_r(1) = 0, p_{1g} = (1 + (1/d_g)^c)/(1 + (M/d_g)^c), p_1 = (1 + (1/d)^c)/(1 + (M/d)^c); c, d, d_g, a are positive constants. The parameter a reflects the lag in the accumulation of redundant distinct tags in a growing library (see also the analogous curve in Figure 10.2b). The parameters c = 0.319 and d = 46,232 were determined above by fitting Eq. 10.6 to the "true" distinct tag data points. The parameter values a = 8,890 ± 1,430 and d_g = 17,532 ± 400 were estimated by fitting the function N_g(M) + N_r(M) (curve 2, Figure 10.3b) to the same data. Eq. 10.14 was used to compute the number of expressed genes in a single human cell and in a large population of these cells. Eq. 10.14 was combined with Eqs. 10.4–10.5 for a specified human cell type sample to estimate the numbers of expressed genes at each expression level in a single human cell (curve 2, Figure 10.3a). This computation method is the same as the one we used for the yeast data (Kuznetsov, 2001b; Kuznetsov et al., 2002a).
Notes

1. To analyze empirical distributions of the rates of protein synthesis, Ramsden and Vohradsky (Ramsden and Vohradsky, 1998) used a model proposed by Mandelbrot for the rank-frequency distribution of words in text sequences (Mandelbrot, 1982). However, the model systematically fit only a portion of the range that can be approximated by the model. The results indicated that the occurrences of the most probable "protein" words as well
as of the most rarely observed "protein" words are very different from those predicted by the Mandelbrot model for natural languages. Note also that the linguistic value of the Mandelbrot model has been doubted for a long time. Mandelbrot (Mandelbrot, 1982) and others (Li, 1992) have shown that even randomly generated "text" (with words of different lengths) exhibits power-law behavior close to that of natural languages.

2. Our goodness-of-fit analysis of frequency distributions of domain occurrence levels in eukaryotic proteomes shows that the data do not follow a simple power law (dotted line in Figure 10.1a,b; Table 10.2) or the Zipf law (the rank-frequency distribution form of the power law). Recent attempts to use linear regression models to fit such data sets in log-log plots appear to have led authors to strong biases in parameter estimation and to the incorrect conclusion that protein domain networks in proteomes and in genomes (Rzhetsky and Gomez, 2001; Wuchty, 2001) follow scale-free statistical properties, i.e., size-independence of the distributions.

3. Note: the much smaller estimate of the number of protein-coding genes in mouse should be due to the essential incompleteness of the sequencing of the mouse genome and the poor annotation of mouse genes at the end of 2001. Using more complete and accurate InterPro data from the beginning of 2003 (Kuznetsov, 2003a), we obtained much smaller differences between the estimates of the numbers of protein-coding genes in the human and mouse genomes.

4. Evolutionary biologists have hypothesized that gene duplication, giving rise to multi-gene families, has played an important role in evolution (Ohno, 1970; Vision et al., 2000). In particular, a role for duplication of complete genomes in adaptive evolution, through a single round or two rounds of genome duplication by polyploidization (the 1R and 2R hypotheses), has been widely cited; genomic duplication in A. thaliana, in particular, has been argued for (Vision et al., 2000). Consistent with such hypotheses would be a pronounced non-monotonicity in the shape of the empirical DOPF, or peak(s) at the domain occurrence values 2 and/or 4, for some organisms. However, for all of the 20 analyzed prokaryotic and eukaryotic organisms we did not observe peak(s) in the DOPF (see, for example, Figure 10.7 and Figure 10.8). For the vast majority of the organisms we did not observe a significant increment in the proportions at the domain occurrence values 2 and/or 4 in the empirical DOPFs (Figure 10.7, Figure 10.8, Figure 10.10). Only A. thaliana, D. melanogaster and a few bacterial organisms demonstrate a pronounced negative deviation of the observed proportion p1, and some positive deviation of the observed proportion p2, from the corresponding values predicted by our probability model (Figure 10.10). Thus, our analysis shows the inconsistency of the observations with the one-round or two-round polyploidization hypotheses for most of the analyzed organisms.
References

Adami, C. (1998). Introduction to Artificial Life. New York: Springer-Verlag.
Bishop, J. O., Morton, J. G., Rosbash, M., and Richardson, M. (1974). "Three Abundance Classes in HeLa Cell Messenger RNA." Nature 250:199–204.
Borodovsky, M. Yu. and Gusein-Zade, S. M. (1989). "A General Rule for Ranged Series of Codon Frequencies in Different Genomes." J Biomolecular Structure and Dynamics 6:1001–1012.
Cantor, C. R. and Smith, C. L. (1999). Genomics. New York: J. Wiley and Sons.
Caron, H., et al. (2001). "The Human Transcriptome Map: Clustering of Highly Expressed Genes in Chromosomal Domains." Science 291:1289–1292.
Chelly, J., Concordet, J.-P., Kaplan, J.-C., and Kahn, A. (1989). "Illegitimate Transcription: Transcription of any Gene in any Cell Type." Proc Natl Acad Sci USA 86:2617–2621.
Chen, J.-J., Rowley, J. D., and Wang, S. M. (2000). "Generation of Longer cDNA Fragments from Serial Analysis of Gene Expression Tags for Gene Identification." Proc Natl Acad Sci USA 97:349–353.
Cook, D. L., Gerber, A. N., and Tapscott, S. J. (1998). "Modeling Stochastic Gene Expression: Implications for Haploinsufficiency." Proc Natl Acad Sci USA 95:15641–15646.
Croix, B. S., et al. (2000). "Genes Expressed in Human Tumor Endothelium." Science 289:1197–1202.
Crollius, R., et al. (2000). "Estimate of Human Gene Number Provided by Genome-wide Analysis Using Tetraodon nigroviridis DNA Sequence." Nature Genetics 25:235–238.
Douglas, S., et al. (2001). "The Highly Reduced Genome of an Enslaved Algal Nucleus." Nature 410:1091–1096.
Eddy, S. R. (2001). "Non-coding RNA Genes and the Modern RNA World." Nature Rev Genetics 2:919–928.
Emmert-Buck, M. R., et al. (2000). "Molecular Profiling of Clinical Tissue Specimens: Feasibility and Applications." Am J Pathol 156:1109–1115.
Ewing, B. and Green, P. (2000). "Analysis of Expressed Sequence Tags Indicates 35,000 Human Genes." Nature Genetics 25:232–234.
Femino, A. M., Fay, F. S., Fogarty, K., and Singer, R. H. (1998). "Visualization of Single RNA Transcripts in Situ." Science 280:585–590.
Fisher, R. A. (1930). The Genetical Theory of Natural Selection. Oxford: Clarendon Press.
Friedman, R. and Hughes, A. L. (2001). "Pattern and Timing of Gene Duplication in Animal Genomes." Genome Res 11:1842–1847.
Guptasarma, P. (1995). "Does Replication-induced Transcription Regulate Synthesis of the Myriad Low Copy Number Proteins of Escherichia coli?" BioEssays 17:987–997.
Hogenesch, J. B., et al. (2001). "A Comparison of the Celera and Ensembl Predicted Gene Sets Reveals Little Overlap in Novel Genes." Cell 106:413–415.
Hollander, G. A. (1999). "On the Stochastic Regulation of Interleukin-2 Transcription." Seminars in Immunology 11:357–367.
Holstege, F. C. P., et al. (1998). "Dissecting the Regulatory Circuitry of a Eukaryotic Genome." Cell 95:717–728.
Huang, S.-P. and Weir, B. S. (2001). "Estimating the Total Number of Alleles Using a Sample Coverage Method." Genetics 159:1365–1373.
Hughes, A. L., da Silva, J., and Friedman, R. (2001). "Ancient Genome Duplications did not Structure the Human Hox-bearing Chromosomes." Genome Res 11:771–780.
Hume, D. A. (2000). "Probability in Transcriptional Regulation and Implications for Leukocyte Differentiation and Inducible Gene Expression." Blood 96:2323–2328.
International Human Genome Sequencing Consortium (2001). "Initial Sequencing and Analysis of the Human Genome." Nature 409:860–921.
Impey, S., McCorkle, S. R., Cha-Molstad, H., Dwyer, J. M., Yochum, G. S., Boss, J. M., McWeeney, S., Dunn, J. L., Mandel, G., and Goodman, R. H. (2004). "Defining the CREB Regulon: A Genome-wide Analysis of Transcription Factor Regulatory Regions." Cell 119:1041–1054.
Jackson, D. A., Pombo, A., and Iborra, F. (2000). "The Balance Sheet for Transcription: An Analysis of Nuclear RNA Metabolism in Mammalian Cells." FASEB J 14:242–254.
Jelinsky, S. A. and Samson, L. D. (1999). "Global Response of Saccharomyces cerevisiae to an Alkylating Agent." Proc Natl Acad Sci USA 96:1486–1491.
Jelinsky, S. A., Estep, P., Church, G. M., and Samson, L. D. (2000). "Regulatory Networks Revealed by Transcriptional Profiling of Damaged Saccharomyces cerevisiae Cells: Rpn4 Links Base Excision Repair with Proteasomes." Molec and Cell Biology 20:8157–8167.
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N., and Barabási, A.-L. (2000). "The Large-scale Organization of Metabolic Networks." Nature 407:651–654.
Johnston, M. (2000). "The Yeast Genome: On the Road to the Golden Age." Current Opinion in Genetics and Development 10:617–623.
Johnson, N. L., Kotz, S., and Kemp, A. W. (1992). Univariate Discrete Distributions. New York: John Wiley & Sons.
Kauffman, S. A. (1993). The Origins of Order: Self-Organization and Selection in Evolution. New York: Oxford University Press.
Ko, M. S. H. (1992). "Induction Mechanism of a Single Gene Molecule: Stochastic or Deterministic." BioEssays 14:341–346.
Koonin, E., Aravind, L., and Kondrashov, A. S. (2000). "The Impact of Comparative Genomics on our Understanding of Evolution." Cell 101:573–576.
Kuznetsov, V. A. and Bonner, R. F. (1999). "Statistical Tools for Analysis of Gene Expression Distributions with Missing Data." In: 3rd Annual Conference on Computational Genomics, p. 26. Baltimore, MD: The Institute for Genomic Research.
Kuznetsov, V. A. (2000). "The Genes Number Game in Growing Sample." J Comput Biol 7:642.
Kuznetsov, V. A. (2001a). "Analysis of Stochastic Processes of Gene Expression in a Single Cell." In: 2001 IEEE-EURASIP Workshop on Nonlinear Signals and Image Processing. Baltimore, MD: University of Delaware.
Kuznetsov, V. A. (2001b). "Distribution Associated with Stochastic Processes of Gene Expression in a Single Eukaryotic Cell." EURASIP J on Applied Signal Processing 4:285–296.
Kuznetsov, V. A., Knott, G. D., and Bonner, R. F. (2002a). "General Statistics of Stochastic Process of Gene Expression in Eukaryotic Cells." Genetics 161:1321–1332.
Kuznetsov, V. A., Pickalov, V. V., Senko, O. V., and Knott, G. D. (2002b). "Analysis of the Evolving Proteomes: Prediction of the Numbers of Protein Domains in Nature and the Number of Genes in Eukaryotic Organisms." J Biol Systems 10:381–408.
Kuznetsov, V. A. (2003a). "A Stochastic Model of Evolution of Conserved Protein Coding Sequence in the Archaeal, Bacterial and Eukaryotic Proteomes." Fluctuation and Noise Letters 3:L295–L324.
Kuznetsov, V. A. (2003b). "Family of Skewed Distributions Associated with the Gene Expression and Proteome Evolution." Signal Processing 83:889–910 (available online 14 Dec. 2002: http://www.ComputerScienceWeb.com).
Kuznetsov, V. A. (2005). "Mathematical Analysis and Modeling of SAGE Transcriptome." In: San Ming Wang, ed. SAGE: Current Technologies and Applications, pp. 139–179. Rowan House, Hethersett: Horizon Science Press.
Lash, A. S., et al. (2000). "SAGEmap: A Public Gene Expression Resource." Genome Res 10:1051–1060.
Li, W. (1992). "Random Texts Exhibit Zipf's-law-like Word Frequency Distribution." IEEE Transactions on Information Theory 38:1842–1845.
Li, W. (1999). "Statistical Properties of Open Reading Frames in Complete Genome Sequences." Computers & Chemistry 23:283–301.
Li, W.-H., Gu, Z., Wang, H., and Nekrutenko, A. (2001). "Evolutionary Analyses of the Human Genome." Nature 409:847–849.
Mandelbrot, B. (1982). The Fractal Geometry of Nature. New York: Freeman.
McAdams, H. H. and Arkin, A. (1999). "It's a Noisy Business! Genetic Regulation at the Nanomolar Scale." Trends in Genetics 15:65–69.
Misteli, T. (2001). "Protein Dynamics: Implications for Nuclear Architecture and Gene Expression." Science 291:843–847.
Newlands, S., et al. (1998). "Transcription Occurs in Pulses in Muscle Fibers." Genes Dev 12:2748–2758.
Newman, M. E. J., Strogatz, S. H., and Watts, D. J. (2001). "Random Graphs with Arbitrary Degree Distributions and Their Applications." Physical Rev E 64:026118.
Ohlsson, R., Paldi, A., and Marshall Graves, J. A. (2001). "Did Genomic Imprinting and X Chromosome Inactivation Arise from Stochastic Expression?" Trends in Genetics 17:136–141.
Ohno, S. (1970). Evolution by Gene Duplication. New York: Springer-Verlag.
Pennisi, E. (2000). "And the Gene Number is...?" Science 288:1146–1147.
Pombo, A., et al. (2000). "Specialized Transcription Factories Within Mammalian Nuclei." Critical Reviews in Eukaryotic Gene Expression 10:21–29.
Ramsden, J. J. and Vohradsky, J. (1998). "Zipf-like Behavior in Prokaryotic Protein Expression." Phys Review E 58:7777–7780.
Ross, I. L., Browne, C. M., and Hume, D. A. (1994). "Transcription of Individual Genes in Eukaryotic Cells Occurs Randomly and Infrequently." Immunol Cell Biol 72:177–185.
Rubin, G. M., et al. (2000). "Comparative Genomics of the Eukaryotes." Science 287:2204–2215.
Rzhetsky, A. and Gomez, S. M. (2001). "Birth of Scale-free Molecular Networks and the Number of Distinct DNA and Protein Domains Per Genome." Bioinformatics 17:988–996.
Sano, Y., et al. (2001). "Random Monoallelic Expression of Three Genes Clustered within 60 kb of Mouse T Complex Genomic DNA." Genome Res 11:1833–1841.
Shmulevich, I., Dougherty, E. R., Kim, S., and Zhang, W. (2002). "Probabilistic Boolean Networks: A Rule-based Uncertainty Model for Gene Regulatory Networks." Bioinformatics 18:261–274.
Shulman, M. J. and Wu, G. E. (1999). "Hypothesis: Genes which Function in a Stochastic Lineage Commitment Process are Subject to Monoallelic Expression." Seminars in Immunology 11:369–371.
Simon, H. A. and Van Wormer, T. A. (1963). "Some Monte-Carlo Estimates of the Yule Distribution." Behavioral Science 8:203–210.
Stanley, H. E., et al. (1999). "Scaling Features of Noncoding DNA." Physica A 273:1–18.
Sutherland, H. G., et al. (2000). "Reactivation of Heritably Silenced Gene Expression in Mice." Mammalian Genome 11:347–355.
Thieffry, D., Huerta, A. M., Perez-Rueda, E., and Collado-Vides, J. (1998). "From Specific Gene Regulation to Genomic Networks: A Global Analysis of Transcriptional Regulation in Escherichia coli." BioEssays 20:433–440.
Till, J. E., McCulloch, E. A., and Siminovitch, L. (1964). "A Stochastic Model of Stem Cell Proliferation, Based on the Growth of Spleen Colony-forming Cells." Proc Natl Acad Sci USA 51:29–38.
Velculescu, V. E., et al. (1997). "Characterization of the Yeast Transcriptome." Cell 88:243–251.
Velculescu, V. E., et al. (1999). "Analysis of Human Transcriptomes." Nat Genet 23:387–388.
Venter, J. C., et al. (2001). "The Sequence of the Human Genome." Science 291:1304–1351.
Vision, T. J., Brown, D. G., and Tanksley, S. D. (2000). "The Origins of Genome Duplications in Arabidopsis." Science 290:2114–2117.
Vohradsky, J. and Ramsden, J. J. (2001). "Genome Resource Utilization During Prokaryotic Development." FASEB J (express article 10.1096/fj.00-0889fje).
Walters, M. C., et al. (1995). "Enhancers Increase the Probability but not the Level of Gene Expression." Proc Natl Acad Sci USA 92:7125–7129.
Wei, C. L., Wu, Q., Vega, V. B., Chiu, K. P., Ng, P., Zhang, T., Shahab, A., Yong, H. C., Fu, Y. T., Weng, Z., Liu, J. J., Lee, Y. L., Kuznetsov, V. A., Sung, K., Lim, B., Liu, E. T., Yu, Q., Ng, H. H., and Ruan, Y. (2005). "A Precise Global Map of p53 Transcription Factor Binding Sites in the Human Genome." (submitted).
Weintraub, H. (1988). "Formation of Stable Transcription Complexes as Assayed by Analysis of Individual Templates." Proc Natl Acad Sci USA 85:5819–5823.
Wuchty, S. (2001). "Scale-free Behavior in Protein Domain Networks." Molec Biol Evol 18:1694–1702.
Yule, G. U. (1924). "A Mathematical Theory of Evolution, Based on the Conclusions of Dr. J. C. Willis, F. R. S." Philosophical Transactions of the Royal Society of London Ser B 213:21–87.
Zucchi, I., Mento, E., Kuznetsov, V. A., Scotti, M., Valsecchi, V., Simionati, B., Valle, G., Pilotti, S., Vicinanza, E., Reinbold, R., Vezzoni, P., Albertini, A., and Dulbecco, R. (2004). "Gene Expression Profiles of Epithelial Cells Microscopically Isolated from a Breast-invasive Ductal Carcinoma and a Nodal Metastasis." PNAS USA 101:18147–18152.
Chapter 11

STATISTICAL METHODS IN SERIAL ANALYSIS OF GENE EXPRESSION (SAGE)

Ricardo Z. N. Vêncio¹ and Helena Brentani²,³

¹BIOINFO-USP, Núcleo de Pesquisa em Bioinformática, Universidade de São Paulo, São Paulo, Brazil; ²Hospital do Câncer A. C. Camargo, São Paulo, Brazil; ³Ludwig Institute for Cancer Research, São Paulo, Brazil
1.
Introduction
The objective of this chapter is to give a general view of the state of the art in statistical methods applied to Serial Analysis of Gene Expression (SAGE) (Velculescu et al., 1995). The methods are divided according to the main tasks of Statistics: parameter estimation and differential expression detection. The estimation section is divided into point estimation and estimation by interval, i.e., finding numerical values for unknown quantities and finding error-bars for them. The differential gene expression section is divided into methods that perform: pair-wise class comparison between single libraries or pooled libraries; pair-wise class comparison considering biological replicates individually inside each class; and outlier discovery in a single multi-library group. Finally, we apply some of the discussed methods to real data to give more insight to the readers. Several of the presented methods are not SAGE-specific and could be applied to other counting data, such as those obtained from non-normalized Expressed Sequence Tags (EST) or the Massively Parallel Signature Sequencing (MPSS) technique (Brenner et al., 2000). The application to real data, additional figures, R language (Ihaka and Gentleman, 1996) scripts for some of the methods discussed, and supporting information can be found at the chapter's supplemental web-site: http://www.vision.ime.usp.br/~rvencio/CSAG/
2.
Biology and Bioinformatics Background
The SAGE method, described by Velculescu et al. (1995), is based on the isolation of unique sequence tags from individual transcripts. SAGE counts transcripts by sequencing a short 14-bp tag at the gene's 3' end adjacent to the last restriction enzyme site, usually NlaIII. The 4 bases of the 3'-most NlaIII restriction site (CATG) plus the following 10 variable bases define the transcript tag. After the concatenation of tags into long DNA molecules, sequencing of these concatemer clones allows quantification and identification of cellular transcripts. When dealing with SAGE data, it is important to bear in mind that this technology yields transcript counts, expressed as a fraction of the total amount of transcripts counted, and not results relative to another experiment or a particular housekeeping gene, as in hybridization-based techniques. This advantage avoids error-prone normalization between experiments. Another advantage is that SAGE determines expression levels directly from RNA samples: it is not necessary to have a gene-specific fragment of DNA arrayed to assay each gene. That is why SAGE is called an open system, justifying the analogy "Linux of functional genomics" (SAGE2000 Conference, September 2000). Transcript tags are extracted from the output files of the automatic sequencing machines, the chromatograms, using specific software. The steps are as follows (Lash et al., 2000; see the sketch at the end of this section):

1. locate the NlaIII sites (i.e., CATG "punctuation signals") within the ditag concatemer;
2. extract ditags of 20–26 bases in length, which fall between these sites;
3. remove repeat occurrences of ditags, including repeat occurrences in the reverse-complemented orientation;
4. define tags as the end-most 10 bases of each ditag, reverse-complementing the right-handed tag;
5. remove tags corresponding to linkers (e.g., TCCCCGTACA and TCCCTATTAA), as well as those with unspecified bases;
6. for each tag, count its number of occurrences.

A relevant problem in Bioinformatics is to assign the observed tags to their corresponding genes. Tag-to-gene mapping is accomplished by first orientating GenBank sequences using the poly-adenylation signal (ATTAAA or AATAAA), the poly-adenylation tail (minimum of 8 A's) and the orientation annotation (3' or 5'). Tags for genes are defined from the 10-base sequence
directly 3'-adjacent to the 3'-most NlaIII site (CATG), and then linked to a UniGene cluster identifier. UniGene is a system for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters; each UniGene cluster theoretically contains sequences that represent a unique gene (Schuler, 1997). When extracting tags, 5 arbitrary reliability "classes" are defined for tag-to-UniGene assignment, in the following order of reliability:

1. tags derived from well-characterized mRNA cDNA sequences from GenBank;
2. tags extracted from EST sequences with a polyadenylation signal and/or polyadenylation tail and annotated as 3' sequences;
3. tags derived from EST sequences with a polyadenylation signal and/or polyadenylation tail, but without a 3' or 5' annotation;
4. tags from EST sequences with a polyadenylation signal and/or polyadenylation tail, but annotated with a 5' orientation;
5. tags from EST sequences without a polyadenylation signal or tail, but annotated as having a 3' orientation.

For each tag, two other quality parameters are calculated: (i) the gene-to-tag assignment frequency (for how many different genes, out of the total of unique genes in the library, the tag is the best-match tag) and (ii) the tag-to-gene assignment frequency (for how many different tags, out of the total of unique tags in the library, the gene is the best-match gene). The Cancer Genome Anatomy Project SAGE project made available new Bioinformatics tools taking into account measurable reliability in tag-to-gene matching. This is accomplished in three major steps: (i) a confident tag list was distilled from ∼6.8 million experimentally observed SAGE tags; (ii) virtual SAGE tags (predicted from cDNA transcript sequences) were obtained, parsed into databases reflecting the origin of the transcript sequence, and ranked according to it; and (iii) custom programs were created to sift through the databases, choose the best tag-to-gene match, and present the results online. Alternative transcripts, redundant tags, and internal priming were also considered for tag selection. This is the SAGE Genie interface (http://cgap.nci.nih.gov/SAGE) (Boon et al., 2002). Ideally, these tags are long enough to be unique to one transcript, and the abundance of a given tag is assumed to be proportional to the expression level of that transcript in the original pool of RNA. However, SAGE is a sampling method: some transcripts present in low abundance may be missing, and the numbers of copies of others may not accurately reflect
their true abundance in cells, due to selection bias (Margulies et al., 2001). In addition, there are sequencing errors, the possibility of non-unique tags, and transcripts that produce no tag. A small number of transcripts are expected to lack the NlaIII enzyme anchoring site and therefore would be missed in the analysis; transcripts that fail to be represented due to the lack of the restriction enzyme anchoring site are estimated to be as few as 1% (Boon et al., 2002). It is interesting to note that about 30% of full-length transcripts contain a repetitive sequence. SAGE tags that fall inside repetitive elements will be counted many times for different transcripts and their frequency will be very high, compromising the gene-to-tag assignment (see web-Fig. 1). In the mouse SAGE web-site (http://mouse.biomed.cas.cz/sage/), for example, tags with reliable associations to 12 or more UniGene clusters were labeled as "repetitive/low-complexity" so as to be easily distinguished. These problems, in time, may be mitigated by the increasing number of SAGE tags collected in future SAGE screens and by the use of longer SAGE tags or different anchoring enzymes.
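As promised above, the following Python toy sketch (ours, not the cited pipeline software) implements extraction steps 1–6 on an invented concatemer string:

# Toy sketch of SAGE tag extraction, steps 1-6 above. The concatemer below is
# an invented example; real pipelines work from chromatogram base calls.
from collections import Counter

LINKERS = {"TCCCCGTACA", "TCCCTATTAA"}

def revcomp(s):
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_tags(concatemer):
    pieces = concatemer.split("CATG")                      # step 1
    ditags = [p for p in pieces if 20 <= len(p) <= 26]     # step 2
    seen, unique = set(), []
    for dt in ditags:                                      # step 3
        if dt not in seen and revcomp(dt) not in seen:
            seen.add(dt)
            unique.append(dt)
    tags = []
    for dt in unique:                                      # step 4
        tags += [dt[:10], revcomp(dt[-10:])]
    tags = [t for t in tags                                # step 5
            if t not in LINKERS and set(t) <= set("ACGT")]
    return Counter(tags)                                   # step 6

example = "CATG" + "AAAAACCCCCGGGGGTTTTT" + "CATG" + "ACGTACGTACGTACGTACGTAC" + "CATG"
print(count_tags(example))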
3.
Estimation
The objective of the Estimation process is to obtain numerical values for the unknown quantities in a sample or a population. Estimation is, along with hypothesis testing, a fundamental task of Statistics and is divided into Point Estimation and Estimation by Interval. Point Estimation attributes the best possible scientific guess to a numerical unknown quantity. The Frequentist approach to Statistics claims that the "best" estimators have certain properties, such as being unbiased. The Bayesian radically disagrees, since estimators are always biased by the a priori knowledge. Estimation by Interval aims to find numerical intervals that contain the unknown quantity, attaching some probability to this fact. An interval is also viewed with very different interpretations depending on whether the approach is Frequentist or Bayesian. In the Frequentist view, obtaining an α% confidence interval means just the application of a calculation procedure to our data that yields a numerical interval. This procedure guarantees that the calculated intervals, applied to virtual data, should contain the true parameter's value in at least α% of the times that one uses it. This virtual data is assumed to be created by the same underlying probability density function (pdf) that created the observed data. This does not mean that the probability that the true value is contained in the interval is α. In the Frequentist framework, the true value is a number and not a random variable, which does not allow such a probabilistic assertion. However,
this kind of desired assertion is possible using a Bayesian framework. The Bayesians do not believe in the "data that could be observed but was not" (virtual data) concept and base their conclusions on the parameter's a posteriori probability density function. The probability that a parameter's unknown value is inside an interval can be calculated for any numerical interval. To match the intuitive notion of an "error-bar", one calculates the smallest interval around the posterior pdf's maximum that integrates to α probability. This is the Bayesian credibility interval.
3.1
Point Estimation (Counts, Errors, Size)
The main tasks of Point Estimation are the estimation of: error-rates of the sequencing process, tag counts, tag abundances (also known as normalized counts, proportions, concentrations, etc.) and transcriptome size. Transcriptome size and distribution estimation, i.e., finding out how many different transcripts are expressed in a given cell condition and how transcripts are distributed among the expression levels, are very important problems. We leave these problems aside because they are well discussed in Chapter 10. We remind the reader that the transcripts' distribution is very skewed toward zero, i.e., there are several transcripts with low abundance and few highly abundant transcripts. Along with a careful reading of Chapter 10, we recommend the works of Stollberg et al. (2000) and Stern et al. (2003), which show, by simulation approaches, the ill-posed nature of these kinds of estimation problems at the usual size of SAGE libraries. Perhaps these problems could be resolved in the near future by using alternative technologies such as Massively Parallel Signature Sequencing (MPSS) (Brenner et al., 2000). In the following, we will discuss error-rate estimation, the first issue to be considered in a SAGE statistical analysis pipe-line. One could imagine that tag counting is clearly defined by the sequencing machine output and that it is sufficient to count the sequences identified in the chromatograms. However, SAGE sequencing itself is subject to stochastic processes such as enzyme amplification errors or sequencing base miscalling. Therefore, a tag outcome can be modeled and estimated, and the estimation of the error-rates is a relevant issue. The number of sequencing errors in a 10-base tag, assuming that base miscalling is equally probable for all nucleotides and independent of position inside the sequence, follows a Binomial(10, ε) distribution, where ε is the error-rate in errors-per-base units. The relevant quantities are P(just 1 error) = 10·ε(1−ε)⁹, P(errors ≥ 1) = 1 − (1−ε)¹⁰ and P(errors ≥ 2) = P(errors ≥ 1) − P(just 1 error). The first estimate of ε came from the work of Velculescu et al. (1997), since they compared their
SAGE results with the yeast's completely sequenced genome. They found ε ≈ 0.007, or P(errors ≥ 1) ≈ 0.068. The use of the 1% approximation for the error-rate is very common in SAGE analysis, imported from genome and EST sequencing projects, because it corresponds to the widely used quality score cutoff of 20 in the well-known phred sequencing chromatogram analysis software (Ewing and Green, 1998). Colinge and Feger (2001) worked with ε = 0.01 and thus P(errors ≥ 1) = 0.096 ≫ P(errors ≥ 2) = 0.004. This fact suggests that the majority of sequencing errors should be single base changes. A tag may have neighbor tags due to nucleotide substitution, insertion and deletion, i.e., tags that differ from each other by only one application of such editing operations. They constructed a transition matrix with elements {P}_jk = p(j|k), the probability that the observed k-th tag is counted as the observed j-th tag due to a one-step substitution sequencing error. Note that a rare tag with no single-substitution distant neighbor tag has p(j|j) = 1, in spite of having P(no errors) = (1−ε)¹⁰ = 0.904. A more refined approximation would consider insertions and deletions in the path from tag k to tag j:

    p(j|k) = P(j ← k) = P(I) + P(D) + P(S) − P(I)P(D) − P(I)P(S) − P(D)P(S) + P(I)P(D)P(S)    (11.1)
where I, D and S are the transitions {j ← k} due to one-step insertion, deletion or substitution errors, respectively. Blades et al. (2004a) devised a simple procedure to estimate these first-order error-rates from each library. They call shadows the tags generated by process artifacts. Cleverly, they write the error-rate as the increase in shadow counts s obtained with the increase in true counts x for a tag:

    ε = s/x = s/(y + s) = 1/((s/y)⁻¹ + 1) = 1/(θ⁻¹ + 1)    (11.2)
where y are the observed counts. If all neighbors were shadows, we would be able to estimate θ as the slope of an ordinary regression in the observed-counts vs. neighbor-counts space. However, this is not true, and therefore we need to identify in this space a subset of (y_j, s_j) points for which this assumption seems reasonable. Due to the highly skewed form of the transcriptome towards rare transcripts, one expects that this trend is only visible in the region of abundant tags, i.e., at higher y_j points. Also, the error-rate is expected to be small, so one expects a regression line close to a horizontal line.
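A self-contained Python sketch of this shadow-based estimator follows. Blades et al. (2004a) use robust regression; as a readily available stand-in we use the Theil–Sen estimator from scipy, and the (y_j, s_j) data below are synthetic, generated with a known error-rate of 0.09:

# Shadow-based estimation of the error-rate via Eq. (11.2): regress neighbor
# (shadow-candidate) counts s_j on observed counts y_j, take the slope as
# theta = s/y, and convert to epsilon = 1/(theta^-1 + 1).
import numpy as np
from scipy.stats import theilslopes

rng = np.random.default_rng(0)
true_eps = 0.09
x = rng.integers(20, 400, size=60)        # true counts of abundant tags
s = rng.binomial(x, true_eps)             # shadow counts produced by errors
y = x - s                                 # observed (correct) counts
s = s + rng.poisson(2.0, size=60)         # contamination: rightful neighbors

theta, intercept, lo, hi = theilslopes(s, y)
eps_hat = 1.0 / (1.0 / theta + 1.0)
print(f"estimated epsilon = {eps_hat:.3f} (true value used: {true_eps})")

The Theil–Sen slope tolerates the additive contamination from rightful neighbors, which mostly moves the intercept rather than the slope.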
+ – – – – – –
–
+
FADD
POU6F1
H2BFQ
EWSR1
SRPR
“average gene”
1
1
2
3
3
3
7
16
17
sorted
0
average
B25-NaN A15-GM
B46-GM A16-GM
A13-GM A14-GM
A17-GM A13-GM
B26-GM A12-GM
A08-AO A19-OL
B35-GM A18-OL
B34-GM B34-AO
B28-GM A11-AO
B41-GM A09-AO
B40-NAN A07-AO
B44-GM B04-AA
A11-AO B01-AA
A09-AO A06-AA
A10-AO A03-AA
A07-AO A01-AA
Nicotiana tabacum clonePR50 H2B histone family, member Q POU domain, class 6, transcription factor Homo sapiens cDNA FLJ38365 TNFRSF6 associated via death H.ch.14 DNA sequence BAC RHuman T-cell receptor germ-line 5.8
5.3
4.4
4.3
4.7
4.6
5.0
4.2
5.8
6.3
6.0
Figure 7.3. (see page 107)
0.7
0.8
0.6
1.4
1.8
1.4
2.2
0.9
2.6
1.0
0.6
1.6
5.1
4.0
4.9
3.9
4.5
4.5
5.5
4.0
5.9
5.6
5.5
1.3
1.7
1.0
1.1
1.1
1.1
1.3
2.0
1.8
2.2
0.9
0.9
AO 7.4
Cell growth, Cell cycle
Transduction, Apoptosis
Metabolism, Cell growth Transcription, Cell growth, Stem cells
Transcription
Metabolism, DNA repair Cell growth, Metabolism
Cell growth
AA 7.4
CLUSTER 4 significances: 0.56856 (OL vs.all= 0.2023 GM vs.all=0.27056 AA vs.AO+OL=0.96931)
B43-GM B01-GM
DNA cross-link repair 1A signal recognition particle receptor Ewing sarcoma breakpoint region 1
B34-AO B23-GM
DCLRE1A
B29-NaN B22-GM
18
B42-GM B27-GM
MATN2
B39-NaN B26-GM
43
B31-NaN B28-GM
Function
B48-GM B37-GM
– +
B38-GM B33-GM
Transduction, Cellgrowth, Cell-cycle
B22-GM B38-GM
Name
B36-GM B42-GM
insulin-like growth factor binding protein matrilin 2
B37-GM B44-GM
IGFBP1
B32-NaN B43-GM
Symbol
B47-GM B45-GM
43
B01-GM B46-GM
Votes
B30-GM A17-GM
DETAIL of ”average gene” and label of patients for GS-Gap grouping
average
A02-AA
A18-OL
A05-AA
A20-OL
A08-AO
A19-OL
A10-AO
B24-AA
A20-OL A06-AA
B20-GM A03-AA
B30-GM A05-AA
B36-GM B21-AA
B41-GM B01-AA
B48-GM A02-AA
+
B33-GM B47-GM
+/–
B23-GM B25-NaN
NL
B29-GM B29-NaN
GM
1.1
4.5
3.6
4.0
4.4
4.5
5.0
3.3
5.7
4.2
4.7
1.6
2.7
0.9
1.2
2.7
1.4
1.2
2.3
1.8
1.9
2.0
1.3
OL 6.8
B20-GM B31-NaN
AA AO OL
B27-GM B35-GM
mean / std. dev.
A15-GM B32-NaN
Genes and Properties selected by Gap Gene Shaving
A39-GM B39-NaN
Glioma Subtypes
4.4
5.7
4.5
4.8
4.9
4.6
5.7
4.7
6.2
6.6
5.5
1.4
1.3
1.5
1.4
0.8
1.6
1.3
1.4
2.0
1.0
1.4
1.6
GM 6.6
A16-GM B40-NaN
Glioma Subtypes AA AO OL GM
NL
RNTRE INPP5D TBP KIAA0268
4 3 2 2
homeo box D1
Transcription
Transcription
Transcription
Transcription
Transcription
Signal Transduction
Cell proliferation
Metabolism
Cell proliferation
Transcription
6.1
7.2
7.4
8.7
5.4
4.6
7.2
4.5
7.6
6.0
4.7
5.3
4.6
0.4
1.8
0.8
1.0
2.7
1.6
1.6
3.1
1.9
1.0
1.5
0.8
2.3
AA
4.8
5.9
6.1
7.6
4.0
4.7
6.0
5.0
7.2
6.5
4.9
4.9
3.7
1.2
1.2
1.6
0.8
0.5
1.1
0.8
0.6
0.6
1.5
1.3
1.4
1.0
AO
3.8
6.3
6.9
7.4
5.5
4.3
6.2
4.4
6.8
6.2
4.1
2.8
3.6
0.8
1.1
1.0
1.1
0.8
1.0
0.8
2.9
1.2
1.3
0.8
1.7
0.5
OL
mean / std. dev.
7 genes with Transcription Function
high mobility group AT-hook 1 translin
methyl-CpG binding domain protein related to the N terminus of tre. inositol polyphosph. 5- phosph. TATA box binding protein C219-reactive peptide
nuclear transcription factor Y, beta regenerating isletderived 1 alpha
Catalysis
Transcription Cell Growth Tumor necros. factor receptor, chaperone
alpha thalassemia mental retard syndr. heat shock protein 75 sialyl-transferase
Function
Name
Figure 7.4. (see page 108)
“average gene”
HOXD1
1
MBD1
9
+
REG1A
13
TSN
NFYB
16
2
STHM
25
– – – + + + – + + HMGA1
TRAP1
35
+
2
ATRX
40
+ +
Symbol
Genes and Properties GS-MDL and GS-Gap Votes
+/–
5.7
5.9
6.8
7.7
5.3
4.5
6.9
5.7
6.9
6.5
4.4
4.7
1.2
1.7
1.5
0.8
1.3
0.9
1.5
1.1
1.1
1.1
1.5
1.3
1.5
GM 4.4
Figure 7.4. Continued (see page 109)
DETAIL of ”average gene” and label of patients for GS-MDL grouping
GM
Glioma Subtypes
AA AO OL
NL
EFNB2 OPRK1 NAPG MEF2C DCLRE1A
39 43 20 43 18
+ + + + + + 6.6
6.3
8.0
7.1
7.1
7.0
7.8
3.3
1.0
3.3
3.0
4.1
3.8
3.1
AA
5.1
5.6
5.6
6.4
5.0
8.3
6.1
2.6
2.2
1.0
1.4
1.0
1.5
2.1
AO
4.2
4.2
4.9
6.2
6.6
8.3
5.5
Figure 7.5. (see page 110) Genes and properties selected by GS-MDL votes (SULT1A3, EFNB2, OPRK1, NAPG, MEF2C, DCLRE1A, and others), with mean / std. dev. of expression per glioma subtype (AA, AO, OL, GM, NL). Cluster 1 significances: 0.65937 (OL vs. all = 0.11673, GM vs. all = 0.044389, AA vs. AO+OL = 0.77953).
Figure 7.6. (see page 111) Genes and properties selected by GS-Gap votes (SULT1A3, UROS, OPRK1, MEF2C, MPI, NAPG, TERF1, TSPY, SGCD, and others), with detail of the "average gene" and patient labels per glioma subtype (AA, AO, OL, GM).
Figure 17.1. (see page 354) Splice-variation patterns in genome vs. cDNA A and cDNA B: exon flanked by a variable 5' splice site, internal and terminal cryptic exons, constitutive exon, intron inclusion, and possible terminal exon extension.
Figure 17.2. (see page 355) Variant vs. non-variant genomic exons: cryptic initial and internal exons, alternative starts, exons with 5' and 3' length variation, intron inclusion, sequence loss, and a skipped splice site in the initial exon (Types A and B).
Figure 17.4. (see page 360)
Figure 17.8. (see page 372)
Figure 18.1. (see page 382)
Figure 18.3. (see page 393)
Figure 18.4. (see page 396) Image-analysis pipeline: cytokeratin mask (gating, fill holes, binary mask, tumor mask), a-catenin membrane tag and DAPI nuclear tag merged into compartments (RESA, PLACE), and RESA target localization of the target (b-catenin).
To use all data points $(y_j, s_j)$ without choosing among them, it is necessary to carry out some very robust linear regression, since there is superposition with $(y_j, s_j)$ pairs for which several rightful neighbors exist. Blades et al. (2004a) claim that robust regression performs better than regular linear regression or resistant regression. Figure 11.1 illustrates their estimation method for one of the libraries used in their original work, the normal pancreatic HX library (GEO accession: GSM721). Using robust linear regression, they found an error-rate of ε = 0.09, with [0.07; 0.11] as the 95% confidence interval, for base substitution errors. The line corresponding to a 20% error-rate is shown for illustration (see also web-Fig. 2).

In the following, we discuss counting estimation. One can suspect that some of the rare tag occurrences are not real, being the result of experimental errors. An example could be an abundant tag that suffered a sequencing-error change in one of its 10 bases, yielding a non-existing unique tag or inflating the counts of other tags. Colinge and Feger (2001) proposed an estimation method that accounts for sequencing errors. They approximate the sequencing-error effect by the expectations of first-order errors and build a system of linear equations:

$$
\begin{cases}
y_1 = p(1|1)x_1 + \cdots + p(1|t)x_t\\
\quad\vdots\\
y_t = p(t|1)x_1 + \cdots + p(t|t)x_t
\end{cases}
\;\Rightarrow\;
\mathbf{x} =
\begin{bmatrix}
p(1|1) & \cdots & p(1|t)\\
\vdots & \ddots & \vdots\\
p(t|1) & \cdots & p(t|t)
\end{bmatrix}^{-1}\mathbf{y}
\tag{11.3}
$$
where $t$ is the number of observed unique tags, $p(j|k)$ is the probability that the observed $k$-th tag is counted as the observed $j$-th tag due to a one-step sequencing error, $\mathbf{x}$ are the true unknown counts and $\mathbf{y}$ are the observed counts.
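A minimal R sketch of this linear correction may help fix ideas; the 2x2 error matrix below is invented for illustration, since a real $p(j|k)$ matrix would come from a sequencing-error model:

```r
# Sketch of the first-order correction of Eq. 11.3: y = P x, so x = P^{-1} y.
# P[j, k] = p(j|k); this toy 2-tag matrix is hypothetical.
P <- matrix(c(0.95, 0.03,
              0.05, 0.97), nrow = 2, byrow = TRUE)
y <- c(180, 20)      # observed counts
x <- solve(P, y)     # estimated "true" counts; note they need not be integers,
x                    # which is exactly Akmaev and Wang's (2004) criticism
```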
Figure 11.1. The substitution error-rate estimation. Normal pancreatic library HX (GEO accession: GSM721). Adapted from (Blades et al., 2004a).
By first-order we mean that only the main effects are accounted for in the approximation, i.e., $p(j|k) = P(j \leftarrow k) \le P(j \leftarrow \cdots \leftarrow k)$, where the arrows denote the error process of insertion, deletion or substitution. In spite of the validity of the fundamental property $\sum_j x_j = m = \sum_j y_j$, their approximation forces an impossible continuity of counts: $y_j \in \{0, 1, \ldots, m\}$ but their solutions $x_j \in \mathbb{R}$. Akmaev and Wang (2004) criticize this fact, warning of the lack of interpretability of the true-count estimates, especially when they turn out negative. A possible and conceptually correct formulation of the error-free counting estimation problem should search for solutions in the constrained space $\Omega = \{(x_1, \ldots, x_t): x_j \in \mathbb{Z}_+, \sum_j x_j = m\}$. Computer scientists call such a problem an Integer Programming problem, and dealing with it as if it were a continuous Linear Programming problem, or even a simple Linear Algebra problem, is inappropriate. Akmaev and Wang (2004) offered an alternative to the Colinge and Feger (2001) method for correcting sequencing errors. They use a multi-step heuristic approach, closely tied to the mechanics of the SAGE process, that preserves the data's discrete nature and uses information from the chromatograms and phred scores (Ewing and Green, 1998). The statistical analysis is just one step of their bioinformatics algorithm, available through the SAGEScreen software. Another recent alternative is the method developed by Beißbarth et al. (2004), based on an EM algorithm and on a rigorous and complete statistical modeling of sequencing errors, taking advantage of phred scores. It is important to notice that corrections of potential errors at the counting level, or denoising techniques (Blades et al., 2004b), are promising approaches, but not yet widespread standard procedures in SAGE analysis.

In the following, we discuss abundance estimation. At first sight, the problem of estimating the abundance π ∈ [0; 1] of a tag could be regarded as easy and uninteresting: p = x/m and that's all, where x is the (pre-processed or not) number of counts for a given tag, m is the total amount of sequenced tags and p is the estimate for π. However, more elaborate options exist. In fact, p is the Maximum Likelihood (ML) estimator of a Bernoulli Process. The Bernoulli and Poisson models, and not the Hypergeometric model for example, are widely used mathematical frameworks for gene expression counting data because the approximation of "sampling from an infinite population" is adequate (of course, m is much smaller than the total amount of mRNA molecules in the harvested cells). In the Bayesian framework, all parameters are unknown quantities and previous knowledge about them is quantified by means of a priori pdfs. If we believe in the Bernoulli Process description of SAGE, then, by Bayesian analysis:
$$
\left.
\begin{aligned}
h(\pi) &\propto (1-\pi)^{\beta-1}\,\pi^{\alpha-1}\\
L(x|\pi) &\propto (1-\pi)^{m-x}\,\pi^{x}
\end{aligned}
\right\}
\;\Rightarrow\;
h(\pi|x) = \frac{(1-\pi)^{m-x+\beta-1}\,\pi^{x+\alpha-1}}{B(x+\alpha,\ m-x+\beta)}
\tag{11.4}
$$
where α > 0 and β > 0 are the parameters that define the Beta a priori pdf, x is the number of counts for a given tag, m is the total of sequenced tags, π is the tag abundance, L(·) is the likelihood function and B(·) is the beta special function. The Beta is a standard prior choice in Bayesian analysis and is very flexible in accommodating various kinds of prior knowledge. Note that Eq. 11.4 is equivalent to the result: if π ∼ Beta(α, β) and x|π,m ∼ Binomial(π, m), then π|x,m ∼ Beta(α + x, β + m − x), a basic result that will be used several times in this chapter. The Posterior Mode, i.e., the parameter value that leads the a posteriori pdf to its maximum, is p = (x + α − 1)/(m + α + β − 2). Therefore, the only way to obtain the same simple estimate as in the ML approach is to use the uniform a priori, i.e., α = β = 1. Other, informative prior choices can influence the abundance estimate and could be derived from the transcript level distributions.

Morris et al. (2003) raised a series of criticisms against the use of the simple ML estimator p = x/m in SAGE analysis. In spite of the very common use of the Binomial to model the outcome of a given tag, they remind us that SAGE is an incomplete multinomial sampling:
$$
L(\mathbf{x}|\boldsymbol{\pi}, m) = m! \prod_{j=1}^{t} \frac{\pi_j^{x_j}}{x_j!}
\tag{11.5}
$$
where $\pi_j$ is the abundance and $x_j$ the counts of the j-th tag, $\boldsymbol{\pi} \in \{(\pi_1, \ldots, \pi_t): \pi_j > 0, \sum_j \pi_j = 1\}$, t is the number of unique tags and m is the total amount of sequenced tags. By "incomplete" we mean that t is unknown, but to go further with this modeling one must assume that it is known in advance. Note that $\pi_j = 0$ is not allowed, but for an existent and non-observed tag the ML estimator is p = 0. Since we (assume to) know that there must be t transcribed tags, $x_j = 0$ for the j-th tag means that $\pi_j < 1/m$. These tags are called underrepresented and, since $\sum_j \pi_j = 1$, the others are overrepresented. They suggest that this is not a minor effect, based on the skewness of gene expression distributions, and propose a "Robin Hood" non-linear shrinkage estimator for the abundances. It is important to note that Stern et al. (2003) warned about the inability of usual SAGE studies to estimate the number of unique transcripts t, arguing that m should be larger than is nowadays available in SAGE studies. This could be a major drawback for the use of the Morris et al. (2003) estimator in general cases.
3.2 Estimation by Interval (Error-Bars)
The second and complementary approach to obtaining quantitative insights about unknown parameters is Estimation by Interval or, informally, defining error-bars. The SAGE process is analogous to the well-known "balls and urns" statistical problem. In practice, this means that much of the theoretical framework is already available and the proposed statistical models have a solid underlying physical basis. For microarray data, for example, this is not true, since competitive hybridization is a physical phenomenon much harder to model, requiring assumption-prone analysis from statisticians. Given the simple "counting" nature of SAGE data, it is easy to report a tag abundance with some error-bar. Using basic Bayesian statistics, as in Section 3.1, and choosing a non-informative uniform a priori, we saw (Eq. 11.4) that the tag abundance is π|x,m ∼ Beta(1 + x, 1 + m − x), where x is the count and m the total size of the library. Once a credibility level α is defined, it is only necessary to integrate around the posterior's peak until this probability is reached (Fig. 11.2). A tag with x = 16 counts in m = 80,000 has an abundance of 2.0 · 10⁻⁴ and its 68% and 95% credibility intervals are [1.5 · 10⁻⁴; 2.5 · 10⁻⁴] and [1.2 · 10⁻⁴; 3.1 · 10⁻⁴], respectively. A tag with x = 32 and m = 160,000 also has an abundance of 2.0 · 10⁻⁴; however, its 68% and 95% credibility intervals are more precise, respectively [1.7 · 10⁻⁴; 2.4 · 10⁻⁴] and [1.4 · 10⁻⁴; 2.7 · 10⁻⁴], as one intuitively expects. Sometimes people prefer to model such rare counting data, like SAGE, using Poisson random variables. In this case, the parameter in focus is λ, the number of counts per m (note that λ = mπ).
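A minimal R sketch of this error-bar construction (function name ours): we use equal-tailed intervals from qbeta, which for the nearly symmetric posteriors of the examples above are very close to the peak-centered intervals described in the text.

```r
# Credibility interval for a tag abundance under a uniform prior,
# where pi | x, m ~ Beta(1 + x, 1 + m - x) as in Eq. 11.4.
abundance_interval <- function(x, m, level = 0.95) {
  tail <- (1 - level) / 2
  qbeta(c(tail, 1 - tail), 1 + x, 1 + m - x)
}
abundance_interval(16, 80000)   # roughly [1.2e-4, 3.1e-4]
abundance_interval(32, 160000)  # narrower: roughly [1.4e-4, 2.7e-4]
```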
Figure 11.2. The error-bar construction. Left: three examples of credibility intervals. Right: the posterior peak can coincide with an interval boundary.
Audic and Claverie (1997) recall the Frequentist alternative for this problem, Ricker's formula, to find an α% confidence interval [λ₁; λ₂] for λ. Using this different approach with, again, x = 16 and m = 80,000, we get λ = 16.0 with [12.0; 21.1] and [9.1; 26.0] as 68% and 95% confidence intervals, respectively, or, expressed as abundance, λ/m = 2.0 · 10⁻⁴ with [1.5 · 10⁻⁴; 2.6 · 10⁻⁴] and [1.1 · 10⁻⁴; 3.2 · 10⁻⁴]. Note that in this example the results are very similar to the Bayesian analysis, but this is not a general fact: there are several examples in the statistical literature in which Bayesian and Frequentist analyses disagree. In gene expression analysis, hybridization-based techniques such as cDNA microarrays or the traditional northern blot give, by construction, relative results, e.g., expression ratios. Given two gene abundances obtained by SAGE, it is easy to transform them into an expression ratio, but the opposite is impossible. Until now, the only solution that we know of for the Estimation by Interval of expression ratios in the SAGE community is a Bayesian one (Vêncio et al., 2003). As seen before, π|x,m follows a Beta pdf, but for two classes A and B the pdf of R = (πA|xA,mA)/(πB|xB,mB) could be hard to obtain analytically. Vêncio et al. (2003) sampled pseudo-random numbers from the Beta pdfs that describe each class and estimated the pdf of the expression ratio R by calculating the quotient for every pair of simulated observations. In fact, the estimated distribution is the pdf of a re-parametrization of the expression ratio, Q = 1/(1 + R) ∈ [0; 1], which is better suited for non-parametric Kernel Density Estimators, since R ∈ [0; ∞). Once the α% credibility interval is obtained, it is easy to go back to R space by solving Q for R. For a given tag, the final result could be presented as R = 5.8 with, as possible scenarios for the 95% credibility interval, [5.5; 6.0], [1.2; 15.3], [2.2; ∞] or [0.3; 25.5], instead of a simple 5.8-fold change. The first scenario is the ideal situation, with an intuitively small interval indicating a relatively precise result for the differential expression ratio; the second scenario suggests that, although differentially expressed, the quantitative aspect of the ratio is poor, since the possible ratios are spread over a wide interval; the third shows that the only safe conclusion, at this level of credibility, is that the ratio is greater than 2.2-fold; and the last scenario indicates that the interval is very wide, crossing the non-differential-expression barrier R = 1, and should not be taken seriously in spite of the apparent 5.8-fold change.
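A minimal R sketch of this Monte Carlo ratio-interval idea (function name ours; we take simple quantiles of the simulated Q values rather than the kernel-density-based interval of the original work):

```r
# 95% credibility interval for R = piA/piB via simulation (uniform priors).
ratio_interval <- function(xA, mA, xB, mB, level = 0.95, nsim = 1e5) {
  R <- rbeta(nsim, 1 + xA, 1 + mA - xA) / rbeta(nsim, 1 + xB, 1 + mB - xB)
  q <- 1 / (1 + R)                    # Q = 1/(1+R), bounded in [0, 1]
  tail <- (1 - level) / 2
  qi <- quantile(q, c(tail, 1 - tail))
  sort((1 - qi) / qi)                 # back to R space: R = (1-Q)/Q
}
```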
4. Differential Expression Detection
Although SAGE is innovative from a biological viewpoint, it is, from a statistical viewpoint, a very old and well-known problem: drawing balls from urns. We will discuss the statistical viewpoint of the comparison between
libraries and its possibilities, but one should always keep in mind that we are working with difficult biological data. This means that sometimes we do not have enough samples to compare, for example because the disease is very rare or because obtaining the samples is technically hard. For these cases, several single-library pair-wise comparison methods are available, even though, from the statistical viewpoint, such comparisons sometimes sound inappropriate. As SAGE, "Digital Northern" or MPSS are not easy and cheap techniques, we would like to stress that in this field it is very important to release data into public databases to ameliorate this problem. Only recently has the SAGE community learned how to account for the within-class, or between-library, variability in SAGE analysis (Baggerly et al., 2003; Vêncio et al., 2004). Until then, differential expression detection methods merged all count observations from the libraries that compose a class into "pseudo-libraries", in order to use previously available pair-wise comparison techniques.
4.1 Single-Library or "Pseudo-library"
There are several methods for dealing with a single library in each class, i.e., methods that consider only the variability due to sampling error. Even when biological replicates are available, it is very common in SAGE studies to construct a "pseudo-library" aggregating the counts of the biological replicates and to use these single-library methods. There are three clearly distinct groups of methods: simulation-based, Frequentist and Bayesian. The simulation-based method is well known because it was the method chosen in the original Science paper describing differential expression analysis with SAGE (Zhang and Zhou et al., 1997). Some details of their algorithm are available only inside the simulation software's source code (Prof. Kinzler, Johns Hopkins University School of Medicine, personal communication). To find differentially expressed tags between two libraries A and B, simulated data sets were generated using a Monte Carlo technique, or in other words, by creating the k-th simulated data set distributing the $(x_{jA} + x_{jB})$ counts of the j-th tag according to the following rule, repeated for all t unique tags:
$$
\begin{cases}
x_{jA}^{(k)} = \displaystyle\sum_{z=1}^{x_{jA}+x_{jB}} 1_{\{u_z \le m_A (m_A + m_B)^{-1}\}}, & u_z \sim U[0;1]\\
x_{jB}^{(k)} = x_{jA} + x_{jB} - x_{jA}^{(k)}
\end{cases}
\tag{11.6}
$$
where the $u_z$ are realizations of the uniform pdf, $1_{\{\cdot\}}$ is the indicator function and $m_A$, $m_B$ are the total sequenced tags of each library. To quantify the evidence of
the tag, they defined a measure called P-chance, $K_j$:

$$
K_j = \frac{\min\left(100,\ \sum_{k=1}^{100000} 1_{\{|x_{jA}^{(k)} - x_{jB}^{(k)}| \ge |x_{jA} - x_{jB}|\}}\, 1_{\{(x_{jA}^{(k)} - x_{jB}^{(k)}) \cdot (x_{jA} - x_{jB}) \ge 0\}}\right)}{\min(tot,\ 100000)}
\tag{11.7}
$$

where $tot$ is given by

$$
\sum_{k=1}^{tot} 1_{\{|x_{jA}^{(k)} - x_{jB}^{(k)}| \ge |x_{jA} - x_{jB}|\}}\, 1_{\{(x_{jA}^{(k)} - x_{jB}^{(k)}) \cdot (x_{jA} - x_{jB}) \ge 0\}} = 100
$$
This is only a formal way of saying that they simulate samples until 100 occurrences of the event "the difference in the simulated data set is equal to or greater than the actually observed difference", or until the 100,000-run limit is reached, counting the occurrences up to this limit. The selection of a significant K is done by comparing it with those obtained from an artificial data set that represents the null hypothesis of no differential behavior between the libraries:

$$
x_{jA_N} = x_{jB_N} = \frac{x_{jA} + x_{jB}}{2}
\tag{11.8}
$$
Repeating the same procedure of Eq. 11.6 and Eq. 11.7 for this "null data set", one can choose a suitable K for the observed experimental differences. The SAGE300 and SAGE2002 software packages implement this method using 40 "null data sets" to rank K. The Frequentist approach is based on proposing a function of the data and finding that function's pdf when no differences between classes exist, i.e., the so-called null pdf. From this Classical framework come the meanings of p-value, power of the test, size of the test, etc. It is always based on the fact that there exists some (assumed) distribution from which the data were generated, and one tries to estimate mistaken conclusions (false positives and false negatives) with respect to this underlying pdf. There is a considerable Statistics literature comparing methods for proportion testing; comparisons in the SAGE context are also available elsewhere (Man et al., 2000; Romualdi et al., 2001; Ruijter et al., 2002). Some studies show small advantages of one method over the others, but Ruijter et al. (2002) remind us that the differences are technical and negligible in the face of the drastic approximation of dealing with no biological replicates or, as they call it, the "one measurement" framework. We discuss replication-based methods in Section 4.2. The main Frequentist methods are the well-known Classical Fisher's exact test for contingency tables and Z- or χ²-based methods that use asymptotic results to get p-values. For large m values, the combinatorial computation needed in Fisher's test becomes
hard, but the standard approximations become very accurate. The best-known approximate test in the SAGE community was suggested by Kal et al. (1999). For the significance level α, we say that there is differential expression if:

$$
\frac{\left| \dfrac{\sum_A x_i}{\sum_A m_i} - \dfrac{\sum_B x_i}{\sum_B m_i} \right|}{\sqrt{\left( \dfrac{\sum_A x_i + \sum_B x_i}{\sum_A m_i + \sum_B m_i} \right)\left( 1 - \dfrac{\sum_A x_i + \sum_B x_i}{\sum_A m_i + \sum_B m_i} \right)\left( \dfrac{1}{\sum_A m_i} + \dfrac{1}{\sum_B m_i} \right)}} > Z_{\alpha/2}
\tag{11.9}
$$

where $Z_{\alpha/2}$ is the standard normal α/2 quantile, $x_i$ is the number of counts of a particular tag in the i-th library and $m_i$ is the sequenced total of the i-th library.
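A minimal R sketch of this test (function name ours), taking the pooled class totals $\sum x_i$ and $\sum m_i$ as inputs:

```r
# Two-proportion z statistic of Eq. 11.9 on pooled class totals.
kal_test <- function(xA, mA, xB, mB) {
  p <- (xA + xB) / (mA + mB)            # pooled proportion
  z <- (xA / mA - xB / mB) / sqrt(p * (1 - p) * (1 / mA + 1 / mB))
  2 * pnorm(-abs(z))                    # two-sided p-value
}
```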
The Bayesian approaches follow the Bayesian Statistics framework, using Bayes' rule to go from previous information, the so-called a priori pdf, to the information updated by the observations, the so-called a posteriori pdf. They work in the parameter space instead of the sample space and do not admit the existence of "data that could be observed but was not". Therefore, the Bayesian p-value does not have the same interpretation as the Frequentist one. It is easy to rewrite Eq. 11.4 to accommodate several separate replicates from a Bernoulli Process, only generalizing the likelihood function:

$$
L(\mathbf{X}|\pi) = \prod_{i=1}^{n} L(x_i|\pi) \propto (1-\pi)^{\sum_{i=1}^{n} m_i - \sum_{i=1}^{n} x_i}\; \pi^{\sum_{i=1}^{n} x_i}
\tag{11.10}
$$
This means that $\pi|\text{data} \sim \text{Beta}(\alpha + \sum_i x_i,\ \beta + \sum_i m_i - \sum_i x_i)$ for class A or B. There are several ways to rank the "equality of abundances" hypothesis, ranging from the simple Bayes Error Rate (Duda et al., 2000) to the well-known Jeffreys' test for precise hypotheses (Jeffreys, 1961) or the genuine Bayesian test for precise hypotheses presented in Madruga et al. (2003). Analysis of the equality of abundances πA = πB is not the only paradigm that could be used: it is also possible to carry out significance ranking using absolute counts or expression ratio fold-changes πA/πB. The best-known Bayesian methods using these alternative paradigms are Audic and Claverie's (1997) method and the method implemented by SAGEmap and SAGE Genie (Lal and Lash et al., 1999).
Supposing that the sampling process is well approximated by a Poisson distribution, Audic and Claverie (1997) write the probability of observing the data from one class given the data observed in the other class. They say that there is differential expression, with some pre-defined probability α, if:

$$
\sum_{k = \sum_B x_i}^{\infty} \left[ \frac{\left(k + \sum_A x_i\right)!}{\left(\sum_A x_i\right)!\; k!} \left( \frac{\sum_B m_i}{\sum_A m_i} \right)^{k} \left( 1 + \frac{\sum_B m_i}{\sum_A m_i} \right)^{-k - \sum_A x_i - 1} \right] < \frac{\alpha}{2}
\tag{11.11}
$$
where again the $m_i$'s and $x_i$'s are summed over each class A and B to create the pool, and B(·) is the beta special function. Another well-known Bayesian method is the one adapted by Lal and Lash et al. (1999) from Chen et al. (1998) to accommodate classes with different total counts. This method is implemented in the important SAGE public database tools: the National Center for Biotechnology Information's SAGEmap (http://www.ncbi.nlm.nih.gov/SAGE) and the National Cancer Institute's Cancer Genome Anatomy Project - SAGE Genie (http://cgap.nci.nih.gov/SAGE). They determine the posterior probability of fold-changes in the expression ratio R greater than an arbitrary value R*. The a posteriori pdf used is:

$$
h(q|\mathbf{X}, \mathbf{M}, c) \propto q^{\,c + \sum_A x_i}\, (1-q)^{\,c + \sum_B x_i} \left( 1 + \frac{\sum_A m_i}{\sum_B m_i}\, q - q \right)^{-\left( \sum_A x_i + \sum_B x_i \right)}
\tag{11.12}
$$

where q is a convenient reparameterization of the expression ratio, q = R/(R + 1), and c is a constant that comes from an a priori pdf modeling the researchers' previous knowledge. In the SAGEmap and SAGE Genie tools, for example, it is assumed that c = 3 (Lal and Lash et al., 1999). Eq. 11.12 holds for the fold-change R of class A relative to class B and for an estimated abundance in class A greater than in class B, $p_A \ge p_B$; if the contrary occurs, simply permute the class labels in Eq. 11.12. It is important to note that a crucial difference/difficulty arises because the differential expression conclusion depends on a pre-defined fold-change: the test answers a question about this fold-change, and it is the user's responsibility to define what fold-change means differential expression. Using SAGE Genie's method with $\sum_A m_i = 80{,}000$, $\sum_A x_i = 60$, $\sum_B m_i = 50{,}000$, $\sum_B x_i = 10$, a $p_A/p_B = 3.75$ fold change and R* = 2, for example, we obtain by integration of Eq. 11.12: P(R ≥ 2) = P(0.666 ≤ q ≤ 1) = 0.93. For another cutoff example, R* = 4, we obtain: P(R ≥ 4) = P(0.8 ≤
q ≤ 1) = 0.19. Therefore, there is little chance that the true fold-change is greater than 4-fold, but a considerable chance that it is greater than 2-fold (see web-Fig. 3). However, if one defines that differential expression occurs simply when one abundance is greater than the other, then P(R > 1) = P(0.5 < q ≤ 1) = 0.999. This highlights the user's responsibility in defining a suitable R*.
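Under our reconstruction of Eq. 11.12, this worked example can be checked by direct numerical integration; a minimal R sketch (function name ours, c = 3 as in SAGE Genie):

```r
# P(R >= Rstar) by integrating the (unnormalized) posterior of Eq. 11.12.
fold_change_prob <- function(xA, mA, xB, mB, Rstar, cc = 3) {
  h <- function(q) q^(cc + xA) * (1 - q)^(cc + xB) *
                   (1 + (mA / mB) * q - q)^(-(xA + xB))
  norm <- integrate(h, 0, 1)$value
  integrate(h, Rstar / (Rstar + 1), 1)$value / norm
}
fold_change_prob(60, 80000, 10, 50000, 2)  # should come close to 0.93
```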
4.2 Replicated Libraries in One Class
Suppose that one is sampling colored balls from several urns, choosing at random, before each draw, which urn will be sampled. To reach general conclusions about blue balls, one must weight the sampling by the probability of choosing a particular urn, especially if it is known that each urn could have a different abundance of blue balls. A very similar situation occurs when dealing with biological replicates in SAGE analysis: like the different urns in the above illustration, it is intuitive to accept that different biological replicates have distinct abundances for a given gene. For a given tag, the counting process of the i-th library is commonly modeled as a Bernoulli Process with a fixed unknown abundance $\pi_i \in [0; 1]$. The pdf of this abundance among all n libraries is unknown, and $\pi_i$ is the i-th realization of π. For a fixed tag, the likelihood of a particular count $x_i$ in a total of $m_i$ sequenced tags is often modeled by the Binomial, weighted by the possible π outcomes:

$$
L(x_i|m_i, \boldsymbol{\theta}) = \int_0^1 \binom{m_i}{x_i} f(\pi|\boldsymbol{\theta})\, (1-\pi)^{m_i - x_i}\, \pi^{x_i}\, d\pi
\tag{11.13}
$$
The function f(·) is the unknown pdf of the tag abundance π, and is all we want/need to know. This function is parameterized by the vector θ. This is a mixture model, with the Binomial being the mixing distribution, but others, such as the Poisson, could be used (Bueno et al., 2002). Since we do not know the stopping rule in advance, the pdf may not be exactly a Binomial, but it differs only by a multiplicative constant. To reach the common Binomial model, used by almost all SAGE methods, it is enough to assume that f(·) is a degenerate pdf over some scalar value θ, i.e., a Dirac's Delta function constrained to [0; 1]. With this assumption, we tacitly ignore the possible variability between libraries due to any reason other than sampling, since only π = θ has positive density. Using a much more realistic approximation, one could assume that the tag abundance in different libraries, f(·), is described by a Beta random variable with non-zero variance. This leads to the well-known Beta-Binomial model,
which appears in the SAGE differential expression context introduced by Baggerly et al. (2003), derived as a hierarchical model in a Frequentist framework instead of as a particular case of a mixture model. For a fixed tag, given the vector of counts $\mathbf{X} = (x_1, \ldots, x_n)$ and the vector of sequenced totals $\mathbf{M} = (m_1, \ldots, m_n)$ in all n libraries of the same class, it is necessary to fit the Beta-Binomial model parameters and to test for differential expression. In the Frequentist framework first proposed by Baggerly et al. (2003), $p_i = x_i/m_i$ is used as an estimator of $\pi_i$ and a linear combination of these abundances is proposed as the correct way to combine results from different libraries:

$$
p = \sum_{i=1}^{n} w_i p_i, \qquad
V_u = \frac{\sum_{i=1}^{n} w_i^2 p_i^2 - p^2 \sum_{i=1}^{n} w_i^2}{1 - \sum_{i=1}^{n} w_i^2}, \qquad
w_i \propto \frac{m_i(\alpha + \beta)}{\alpha + \beta + m_i}
\tag{11.14}
$$

where the $w_i$ are the weights that yield the unbiased minimum-variance estimator $V_u$ for the weighted proportion's variance, and $\boldsymbol{\theta} = (\alpha, \beta)$ are the Beta pdf parameters. However, this unbiased variance could be unrealistically small when it becomes smaller than the sampling variability: we know that the variance of this model cannot be smaller than the variance obtained if we do not consider within-class variability. Therefore, they propose the final ad hoc estimator:

$$
V = \max\left( V_u;\ V_{\text{pseudo-lib}} \right)
\tag{11.15}
$$

where

$$
V_{\text{pseudo-lib}} = \frac{1}{\sum_{i=1}^{n} m_i}\; \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} m_i} \left( 1 - \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} m_i} \right)
$$
The max(·) function assures that V is not unrealistically small when $V_u$ is. $V_{\text{pseudo-lib}}$ is exactly the natural estimator of the variance if one considers f(·) to be a Dirac's Delta instead of a Beta pdf as the underlying model. Note that the $\sum_i x_i / \sum_i m_i$ term is equivalent to the abundance obtained if one merges all libraries into a "pseudo-library". To fit all these parameters, they used the computationally practical Method of Moments. Once $p_A$, $p_B$, $V_A$ and $V_B$ are found for classes A and B, it is necessary to test whether the proportions are significantly different. Evoking asymptotic results, they propose
the use of a $t_w$ statistic following a Student's $t_{df}$ pdf:

$$
t_w = \frac{p_A - p_B}{\sqrt{V_A + V_B}}, \qquad
df = \frac{(V_A + V_B)^2}{\dfrac{V_A^2}{n_A - 1} + \dfrac{V_B^2}{n_B - 1}}
\tag{11.16}
$$

where $n_A$ and $n_B$ are the numbers of libraries in each class.
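A minimal R sketch of Eqs. 11.14–11.16 (function names ours; the weights w are assumed given, e.g., from a method-of-moments fit of (α, β), normalized to sum to one):

```r
# Per-class weighted estimates (Eqs. 11.14-11.15) and the t_w test (Eq. 11.16).
class_stats <- function(x, m, w) {
  p  <- sum(w * x / m)
  Vu <- (sum(w^2 * (x / m)^2) - p^2 * sum(w^2)) / (1 - sum(w^2))
  Vp <- (sum(x) / sum(m)) * (1 - sum(x) / sum(m)) / sum(m)  # pseudo-library
  list(p = p, V = max(Vu, Vp), n = length(x))
}
tw_test <- function(A, B) {
  tw <- (A$p - B$p) / sqrt(A$V + B$V)
  df <- (A$V + B$V)^2 / (A$V^2 / (A$n - 1) + B$V^2 / (B$n - 1))
  2 * pt(-abs(tw), df)                  # two-sided p-value
}
```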
A different approach, which accounts for within-class variability, uses the Bayesian Statistics framework and does not rely on asymptotic results, was recently presented by our group (Vêncio et al., 2004). Considering our likelihood as obtained by the Beta-Binomial model, it is easy to write the a posteriori pdf:

$$
g(\theta_1, \theta_2 | \mathbf{X}, \mathbf{M}) \propto 1_{\{\theta_2 \ge \sigma\}} \cdot \prod_{i=1}^{n} \frac{B(\alpha_\theta + x_i,\ \beta_\theta + m_i - x_i)}{B(\alpha_\theta, \beta_\theta)}
\tag{11.17}
$$
where the indicator function is our a priori pdf. Note that we made a reparameterization, since now $\boldsymbol{\theta} = (\theta_1, \theta_2)$ is the mean and standard deviation (stdv) of the Beta pdf that describes each class. The new parameter space, $\Theta = \{(\theta_1, \theta_2): 0 \le \theta_1 \le 1,\ 0 \le \theta_2^2 < \theta_1(1-\theta_1) \le 1/4\}$, is more intuitive than the common (α, β) one and is bounded (much more amenable to numerical computations). We use the sub-index θ in α and β to remind us that they are functions of the new parameterization, easily obtained from the Beta mean and stdv expressions. The a priori is a uniform pdf over Θ, but constrained to variances greater than the variance obtained by the "pseudo-library" construction; this variance "working floor" σ² comes from the Beta pdf obtained using the Eq. 11.10 generalization in Eq. 11.4. We find the two Beta pdfs that describe each class by taking the Posterior Mode, i.e., the (θ₁, θ₂) ∈ Θ that leads Eq. 11.17 to its maximum (see web-Fig. 4). Finally, to test whether a tag is differentially expressed between the two classes, we propose an evidence measure other than the p-value: the intuitive Bayes Error Rate (Duda et al., 2000):

$$
E = \int_0^1 \min\left( f(\pi|\hat{\boldsymbol{\theta}}^A),\ f(\pi|\hat{\boldsymbol{\theta}}^B) \right) d\pi
\tag{11.18}
$$
where $\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}(\text{posterior})$. Small Bayes Error Rate values E indicate that the two Beta pdfs are "far apart", and thus give high evidence of differential expression (web-Fig. 5). We rank tags by the evidence E and let biologists say what they intuitively consider an unacceptable level of superposition (classification error) between the two classes. An indispensable tool for checking the intuitive consistency of the results obtained with any method is the graphical representation of all individual observations, as in Fig. 11.3 of Section 5. Using this simple tool, one can easily note the inconsistency of "pseudo-library" methods in several cases.
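A minimal R sketch of this Bayes Error Rate computation (function names ours), given each class's fitted mean and standard deviation (θ₁, θ₂):

```r
# Bayes Error Rate (Eq. 11.18) between the two fitted Beta pdfs.
mean_sd_to_ab <- function(th) {         # (theta1, theta2) -> (alpha, beta)
  k <- th[1] * (1 - th[1]) / th[2]^2 - 1
  c(th[1] * k, (1 - th[1]) * k)
}
bayes_error <- function(thA, thB) {
  a <- mean_sd_to_ab(thA); b <- mean_sd_to_ab(thB)
  integrate(function(p) pmin(dbeta(p, a[1], a[2]),
                             dbeta(p, b[1], b[2])), 0, 1)$value
}
```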
4.3 Multiple Libraries Outlier Finding
A third type of comparison analysis using counting data is to find outlier libraries in a multiple-library context. The methods reviewed in the previous sections deal with pair-wise comparisons of classes having one or more libraries, accounting or not for biological replication. Multiple-library comparison, on the other hand, is not pair-wise but rather searches for tags with "non-usual" behavior in a set of libraries: the concept of class is no longer used or, in other words, all libraries are regarded as a single class. Probably the best-known method for outlier detection is the one presented by Stekel et al. (2000). It improves on the previously available method introduced by Greller and Tobin (1999), which detects only an outlier very different from the others in a set of libraries. Stekel et al. (2000) proposed a flexible solution that tries to detect whether a transcript has the same abundance across several libraries simultaneously. For example, the input could be several tissues and our aim could be to detect tissue-specific transcripts. They avoid p-value pitfalls and simply rank their proposed R statistic. For a given tag, and fixing some cutoff R*, we say that there is differential expression in at least one library if:

$$
\sum_{i=1}^{n} \left[ x_i \ln\left( \frac{x_i \sum_{j=1}^{n} m_j}{m_i \sum_{j=1}^{n} x_j} \right) \right] = R > R^*
\tag{11.19}
$$
where $x_i$ is the count for the tag in the i-th library and $m_i$ is the total of sequenced tags in the i-th library. To help the user define an R* cutoff, they propose an alternative evidence measure called believability. This measure is obtained by a usual randomization strategy or from asymptotic appeal, since $2R \to \chi^2_{n-1}$.
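A minimal R sketch of this statistic (function name ours):

```r
# R statistic of Eq. 11.19 for one tag; x, m are per-library counts/totals.
stekel_R <- function(x, m) {
  expected <- m * sum(x) / sum(m)               # counts under equal abundance
  sum(ifelse(x > 0, x * log(x / expected), 0))  # 0 * log(0) taken as 0
}
# Asymptotic "believability": 2R is approximately chi-squared with n-1 df, e.g.
# pchisq(2 * stekel_R(x, m), df = length(x) - 1, lower.tail = FALSE)
```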
5. Illustration of Methods Application
In order to gain intuition about the methods presented in this chapter, we applied some of them to a relevant publicly available data set. Our analysis, along with the R language (Ihaka and Gentleman, 1996) scripts, is available in detail at the chapter's supplemental web-site: http://www.vision.ime.usp.br/∼rvencio/CSAG/. Our aim is to search for genes differentially expressed between grade II and grade III astrocytomas, using bulk material collected from different patients. The data are available at the SAGE Genie web-site (http://cgap.nci.nih.gov/SAGE). Table 11.1 shows the libraries used in this illustration. Here we do not want to focus on the biology of the analysis but rather on the fundamental difference of methods
Table 11.1. Brain tumor libraries from SAGE Genie.

#   Library Name – Class A                      Total Tags
1   SAGE Brain astrocytoma grade III B H1020    51573
2   SAGE Brain astrocytoma grade III B H970     106982
3   SAGE Brain astrocytoma grade III B R140     118733
4   SAGE Brain astrocytoma grade III B R927     107344
    grade III merged "pseudo-library"           384632

#   Library Name – Class B                      Total Tags
5   SAGE Brain astrocytoma grade II B H563      88568
6   SAGE Brain astrocytoma grade II B H359      105764
7   SAGE Brain astrocytoma grade II B H388      106285
8   SAGE Brain astrocytoma grade II B H530      102439
    grade II merged "pseudo-library"            403056
for differential expression detection. However, we took care to define each class using libraries of the same histopathological grade and only from bulk material, excluding cell lines. Following the previous sections, we applied a pipeline for statistical analysis: we used some methods of the Estimation section and concentrated our attention on the Differential Expression Detection section. First, we tried to estimate the sequencing substitution, insertion and deletion error-rates of our data set. We applied the Blades et al. (2004a) error-rate estimation method described in Section 3.1, but it was difficult to carry out the line fitting, similar to that of Fig. 11.1, due to the lack of points at higher expression levels. On the other hand, the method proposed by Akmaev and Wang (2004) can be applied only to the original output data from the sequencing machines, the chromatograms; it was thus impossible to use here, since the public databases hold the raw counting data and not the original chromatograms. Therefore, we moved on without counting corrections. Second, given the tags' counts, we used the Posterior Mode for abundance estimation with the non-informative uniform a priori pdf (Eq. 11.4), matching the Maximum Likelihood estimator; we want to focus on the differential expression detection issue. Third, as an initial approach to differential expression detection, we merged all libraries of each class, summing their observations and creating the so-called "pseudo-libraries", as usual in SAGE analysis. To perform Fisher's Exact Test, the χ² test, and the Audic and Claverie (1997) method, discussed in Section 4.1, we used a user-friendly, freely available software package called IDEG6 (Romualdi et al., 2001). It has an on-line web version and
also performs the Stekel et al. (2000) and Greller and Tobin (1999) methods for multi-library analysis. To perform the Lash and Lal et al. (1999) Bayesian method, we used SAGE Genie's on-line tool, but we have also implemented their method as an R script to allow its use on any data set. Finally, to perform the same analysis in a replicated-library context, discussed in Section 4.2, we used the only two available solutions: our SAGEbetaBin method (Vêncio et al., 2004) and the first published solution, the Baggerly et al. (2003) t-test approximation. As emphatically announced in the introduction of Section 4, the results can be very different using the common "pseudo-library" methods or these two methods that account for the inherent within-class variability. A tag considered differentially expressed using replicated-library methods should always appear as differentially expressed using "pseudo-library" methods; however, the opposite is not always true. An example of such an effect is obtained for the AATAGAAATT tag, corresponding to the secreted phosphoprotein 1 (osteopontin, bone sialoprotein I, early T-lymphocyte activation 1) gene. Using any available "pseudo-library" method, one is led to believe that this tag is differentially expressed with high significance. As calculated by the IDEG6 software, all methods give 0.00 (zero!) p-values. SAGE Genie's method gives a 0.00 (zero) p-value for a difference greater than 2-fold, or 0.01 for a difference greater than 4-fold. Also, our error-bar method (Vêncio et al., 2003) shows that the 95% credibility interval is [4.3; 6.7] for the R = 5.3-fold change of class A relative to class B and does not cross the ratio R = 1. All of these results indicate a very high level of confidence in the differential expression conclusion. However, if one plots the individual abundance results for each library, it is easy to see that the conclusion of all these methods is suspicious. The two methods that account for within-class variability do not claim a high significance for this tag, as one intuitively suspects looking at the superposition of crosses (class B) and circles (class A) in Fig. 11.3: the Baggerly et al. (2003) t-test approximation gives a p-value of 0.21 and our SAGEbetaBin evidence is 0.48, indicating great superposition between the pdfs that describe each class (curves in Fig. 11.3), as discussed in Section 4.2. These results do not support this tag as being differentially expressed in general terms. This is an illustrative example, because just one class A library leads the traditional analysis to a mistaken conclusion, but there are other, more subtle cases (see the supplemental web-site data and web-Fig. 6). In this illustration we consider a tag differentially expressed if it has an (arbitrarily defined) Bayes Error Rate E ≤ 0.05. Sometimes this finding could bring difficulties to the validation of differential expression by other techniques. It is important to note that when we are dealing with high variability between samples, and we are not taking
Figure 11.3. Example of a tag regarded as differentially expressed by "pseudo-library" methods but discarded by replicated-library methods (E = 0.48; class A = circles, class B = crosses).
this variability into account when selecting our differentially expressed candidates, the validation process can become completely arbitrary: it will depend on the samples chosen. The final list of differentially expressed tags between grade II and grade III astrocytomas, their values, and fold-change error-bars are available at the supplemental web-site.
6. Conclusions
In this chapter we aimed to give a guide to the state of the art in statistical methods for SAGE analysis. We only scratched the surface of some issues, for the sake of staying focused on differential expression detection problems, but we hope the main ideas are useful for tracking the original literature. We saw that the estimation of a tag's abundance need not be simply the observed counts divided by the sequenced total, but can receive sophisticated treatments such as multinomial estimation, correction of potential sequencing errors, incorporation of a priori knowledge, and so on. Given an (assumed) error-corrected data set, one can search for differentially expressed tags among conditions. Several methods for this were mentioned, but we stress the importance of using biological replication designs to capture general information. Finally, we want to point out that only the accumulation of experimental data in public databases, with biological replication, and the use of good statistics can improve the usefulness of SAGE, MPSS or
EST counting data in general, helping to elucidate basic and applied gene expression questions.
Acknowledgments

RZNV received support from FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo). We thank Prof. Carlos A. B. Pereira (IME-USP) for teaching us most of what we learned about Statistical Science and Dr. Sandro J. de Souza (Ludwig Institute for Cancer Research) for teaching us most of what we learned about SAGE. We give a special acknowledgement to all the colleagues who carefully revised our manuscript.
References

Akmaev, V.R. and Wang, C.J. (2004) Correction of sequence based artifacts in serial analysis of gene expression. Bioinformatics 20, 1254–1263.
Audic, S. and Claverie, J. (1997) The significance of digital gene expression profiles. Genome Research 7, 986–995.
Baggerly, K.A., Deng, L., Morris, J.S. and Aldaz, C.M. (2003) Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics 19, 1477–1483.
Beißbarth, T., Hyde, L., Smyth, G.K., Job, C., Boon, W., Tan, S., Scott, H.S. and Speed, T.P. (2004) Statistical modeling of sequencing errors in SAGE libraries. Bioinformatics 20, i31–i39.
Blades, N., Velculescu, V.E. and Parmigiani, G. (2004a) Estimation of sequencing error rates in SAGE libraries. Genome Biology, in press.
Blades, N., Jones, J.B., Kern, S.E. and Parmigiani, G. (2004b) Denoising of data from serial analysis of gene expression. Bioinformatics, in press.
Boon, K., Osório, E.C., Greenhut, S.F., Schaefer, C.F., Shoemaker, J., Polyak, K., Morin, P.J., Buetow, K.H., Strausberg, R.L., Souza, S.J. and Riggins, G.J. (2002) An anatomy of normal and malignant gene expression. Proc. Natl. Acad. Sci. USA 99, 11287–11292.
Brenner, S., Johnson, M., Bridgham, J., et al. (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature Biotechnology 18, 630–634.
Bueno, A.M.S., Pereira, C.A.B., Rabello-Gay, M.N. and Stern, J.M. (2002) Environmental genotoxicity evaluation: Bayesian approach for a mixture statistical model. Stochastic Environmental Research and Risk Assessment 16, 267–278.
Chen, H., Centola, M., Altschul, S.F. and Metzger, H. (1998) Characterization of gene expression in resting and activated mast cells. J. Exp. Med. 188, 1657–1668.
Colinge, J. and Feger, G. (2001) Detecting the impact of sequencing errors on SAGE data. Bioinformatics 17, 840–842.
Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification, 2nd Edition. Wiley-Interscience Press.
Ewing, B. and Green, P. (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8, 186–194.
Greller, L.D. and Tobin, F.L. (1999) Detecting selective expression of genes and proteins. Genome Research 9, 282–296.
Ihaka, R. and Gentleman, R. (1996) R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 5, 299–314.
Jeffreys, H. (1961) Theory of Probability. Oxford University Press.
Kal, A.J., van Zonneveld, A.J., Benes, V., van den Berg, M., Koerkamp, M.G., Albermann, K., Strack, N., Ruijter, J.M., Richter, A., Dujon, B., Ansorge, W. and Tabak, H.F. (1999) Dynamics of gene expression revealed by comparison of serial analysis of gene expression transcript profiles from yeast grown on two different carbon sources. Mol. Biol. Cell 10, 1859–1872.
Lal, A., Lash, A.E., Altschul, S.F., Velculescu, V., Zhang, L., McLendon, R.E., Marra, M.A., Prange, C., Morin, P.J., Polyak, K., Papadopoulos, N., Vogelstein, B., Kinzler, K.W., Strausberg, R.L. and Riggins, G.J. (1999) A public database for gene expression in human cancers. Cancer Research 21, 5403–5407.
Lash, A.E., Tolstoshev, C.M., Wagner, L., Schuler, G.D., Strausberg, R.L., Riggins, G.J. and Altschul, S.F. (2000) SAGEmap: a public gene expression resource. Genome Research 10, 1051–1060.
Madruga, M.R., Pereira, C.A.B. and Stern, J.M. (2003) Bayesian evidence test for precise hypotheses. Journal of Planning and Inference 117, 185–198.
Man, M.Z., Wang, X. and Wang, Y. (2000) POWER SAGE: comparing statistical tests for SAGE experiments. Bioinformatics 16, 953–959.
Margulies, E.H., Kardia, S.L. and Innis, J.W. (2001) Identification and prevention of a GC content bias in SAGE libraries. Nucleic Acids Res. 29, e60.
Morris, J.S., Baggerly, K.A. and Coombes, K.R. (2003) Bayesian shrinkage estimation of the relative abundance of mRNA transcripts using SAGE. Biometrics 59, 476–486.
Romualdi, C., Bortoluzzi, S. and Danieli, G.A. (2001) Detecting differentially expressed genes in multiple tag sampling experiments: comparative evaluation of statistical tests. Human Molecular Genetics 10, 2133–2141.
Ruijter, J.M., Kampen, A.H.C. and Baas, F. (2002) Statistical evaluation of SAGE libraries: consequences for experimental design. Physiol. Genomics 11, 37–44.
Schuler, G.D. (1997) Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med. 75, 694–698.
Stekel, D.J., Git, Y. and Falciani, F. (2000) The comparison of gene expression from multiple cDNA libraries. Genome Research 10, 2055–2061.
Stern, M.D., Anisimov, S.V. and Boheler, K.R. (2003) Can transcriptome size be estimated from SAGE catalogs? Bioinformatics 19, 443–448.
Stollberg, J., Urschitz, J., Urban, Z. and Boyd, C.D. (2000) A quantitative evaluation of SAGE. Genome Research 10, 1241–1248.
Vêncio, R.Z.N., Brentani, H. and Pereira, C.A.B. (2003) Using credibility intervals instead of hypothesis tests in SAGE analysis. Bioinformatics 19, 2461–2464.
Vêncio, R.Z.N., Brentani, H., Patrão, D.F.C. and Pereira, C.A.B. (2004) Bayesian model accounting for within-class biological variability in Serial Analysis of Gene Expression (SAGE). BMC Bioinformatics 5, 119.
Velculescu, V.E., Zhang, L., Vogelstein, B. and Kinzler, K.W. (1995) Serial analysis of gene expression. Science 270, 484–487.
Velculescu, V.E., Zhang, L., Zhou, W., Vogelstein, J., Basrai, M.A., Bassett, D.E., Hieter, P., Vogelstein, B. and Kinzler, K.W. (1997) Characterization of the yeast transcriptome. Cell 88, 243–251.
Zhang, L., Zhou, W., Velculescu, V.E., Kern, S.E., Hruban, R.H., Hamilton, S.R., Vogelstein, B. and Kinzler, K.W. (1997) Gene expression profiles in normal and cancer cells. Science 276, 1268–1272.
Chapter 12

NORMALIZED MAXIMUM LIKELIHOOD MODELS FOR BOOLEAN REGRESSION WITH APPLICATION TO PREDICTION AND CLASSIFICATION IN GENOMICS

Ioan Tabus, Jorma Rissanen, and Jaakko Astola
Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
Abstract

Boolean regression models are useful tools for various applications in nonlinear filtering, prediction, classification, and clustering. We discuss here the so-called normalized maximum likelihood (NML) models for these classes of models and discuss the connections with the minimum description length principle. Examples of discrimination of cancer types with these models for the Boolean regression demonstrate the efficiency of the method, especially its ability to select sets of feature genes for discrimination at error rates significantly smaller than those obtainable with other methods.

1. Introduction
We discuss the NML models for Boolean model classes. Such a model for the linear regression problem was introduced and analyzed recently (Rissanen, 2000). We restate the classification problem as a modeling problem in terms of a class of parametric models for which the maximum likelihood parameter estimates can be easily computed. We first review the NML model for Bernoulli strings as the solution of a minmax optimization problem. We then introduce two model classes for the case where the binary strings to be modeled are observed jointly with several other binary strings (regression variables). We derive the NML model for both classes, provide fast evaluation procedures, and review the connection with the MDL principle.
The concept of gene expression was introduced four decades ago with the discovery of the messenger RNA, when the theory of genetic regulation of protein synthesis was described (Jacob and Monod, 1961). Each human cell contains approximately three billion base pairs, which encode about 30,000 to 40,000 genes. In any particular cell, only a small fraction of the genes is being actively transcribed. Genes carry the essential information for protein synthesis. Cell metabolism, in both healthy and malignant states, is governed by complex interactions between genes. A gene may have a varying level of expression in the cell at a certain time. Many important biological processes can be probed by measuring the gene transcription levels, and models of regulatory pathways can be inferred from these measurements (Russel et al., 2000). The availability of cDNA microarrays (Schena et al., 1995) makes it possible to measure simultaneously the expression levels for thousands of genes. The huge amount of data provided by cDNA microarray measurements is explored in order to answer fundamental questions about gene functions and their inter-dependencies, and hopefully to provide answers to questions like “what is the type of the disease affecting the cells and what are the best ways to treat it”. Large-scale studies of gene expression are expected to provide the transition from structural to functional genomics (Hieter and Boguski, 1997), where knowing the complete sequence of a genome is only the first step in understanding how it works. As a first step in deciphering complex biological interactions at a biomolecular level it is necessary to study the microarray data in order to provide sets of possible rules for how the expressions of certain genes determine the expression level of a target gene. Gene expression data obtained in microarray experiments may often be discretized as binary or ternary data, the values 1,0,-1 carrying the meanings of overexpressed, normal, and repressed, respectively, which are the descriptors needed to define regulatory pathways (Kim and Dougherty, 2000). The prediction and classification problems addressed here refer to processing of the binary or ternary valued cDNA microarray data in order to find groups of genes, or factors, which are very likely to determine the activity of a target gene (or a class label), as opposed to the oversimplified model in which one gene only influences a single target gene (or class label). Previous solutions to related problems were given by using perceptron predictors, see (Kim and Dougherty, 2000; Kim et al., 2000), which were shown to provide more reliable information than the determination of dependencies based solely on the correlation coefficient linking single genes. In our approach to prediction we look for flexible classes of models with good predictive properties, but we also must consider the complexity
of the models, the right balance between the two modeling aspects being set by the MDL principle. One possible way to study the classification problem is in terms of Boolean regression models. Suppose that the available data is in a matrix X, where the entry $x(i,j) \in \{0,1\}$ is a binary (quantized) gene expression, the row index $i \in \{1, \ldots, N\}$ identifies the gene, and the column index $j \in \{1, \ldots, n\}$ identifies the "patient". We denote by $x_j$ the j-th column of the matrix X. Furthermore, a class label $y_j$ is known for all patients (e.g., $y_j = 0$ or $y_j = 1$ for the j-th patient having disease type A or type B, respectively). Our goal is to build Boolean models $\hat{y}_j = f(x_{i_1,j}, \ldots, x_{i_k,j})$ and to select the set of informative genes, $\{i_1, i_2, \ldots, i_k\}$. One natural requirement for a good model is a small sum of absolute errors

$$
\sum_{j=1}^{n} |y_j - \hat{y}_j|.
\tag{12.1}
$$
We show that this minimization actually results as a consequence of the NML formalism. We have shown in (Tabus et al., 2003) that the more complex problem of m-ary classification can be solved using a similar modelling approach. Even though the problem of classification into m classes can very easily be connected to m individual Boolean classification problems (each of them finding a classifier for a certain class against all others), in some cases the results of these Boolean problems are conflicting and further decisions have to be taken (e.g., by majority voting); therefore the m-class problem is better addressed by the general approach described in (Tabus et al., 2003). In the following we deal with the Boolean classification problem, because it has an importance of its own in practical applications and, furthermore, it allows a more intuitive derivation and suggestive interpretation of the obtained classifiers in terms of Boolean functions.
2. The NML Model for Bernoulli Strings
In this section we assume that a Bernoulli variable Y with $P(Y = 0) = \theta$ is observed repeatedly n times, generating the string $y^n = y_1, \ldots, y_n$. We look for a distribution $q(y^n)$ over all strings of length n such that the ideal codelength $\log \frac{1}{q(y^n)}$ assigned to a particular string $y^n$ by this distribution is as close as possible to the ideal codelength $\log \frac{1}{P(y^n; \hat{\theta}(y^n))}$ obtainable with the Bernoulli models. In the coding scenario, the decoder is allowed to use a predefined distribution, q(·), but he cannot use the distribution $P(\cdot; \hat{\theta}(y^n))$ because he does not have $\hat{\theta}(y^n)$ available. The latter is the
most advantageous distribution in the family $P(y^n; \theta)$ for the string $y^n$, since it maximizes $P(y^n; \hat{\theta}(y^n))$ and therefore minimizes the ideal codelength $\log \frac{1}{P(y^n; \hat{\theta}(y^n))}$. We use the term "ideal" code length because we drop the unessential requirement that a real code length must be an integer-valued quantity. The distribution $q(y^n)$ is selected such that the "regret" of using $q(y^n)$ instead of $P(y^n; \hat{\theta}(y^n))$, namely,

$$
\log \frac{1}{q(y^n)} - \log \frac{1}{P(y^n; \hat{\theta}(y^n))} = \log \frac{P(y^n; \hat{\theta}(y^n))}{q(y^n)},
\tag{12.2}
$$

is minimized for the worst case $y^n$; i.e.,

$$
\min_q \max_{y^n} \log \frac{P(y^n; \hat{\theta}(y^n))}{q(y^n)}.
\tag{12.3}
$$
Theorem 1 (Shtarkov, 1987). The minimizing distribution is given by

$$
q(y^n) = \frac{P(y^n; \hat{\theta}(y^n))}{C_n},
\tag{12.4}
$$

where

$$
C_n = \sum_{m=0}^{n} \binom{n}{m} \left( \frac{m}{n} \right)^{m} \left( 1 - \frac{m}{n} \right)^{n-m}.
\tag{12.5}
$$

Proof. Fix $y^n$, having m zeros. Then, if $P(Y = 0) = \theta$, we have by independence $P(y^n; \theta) = \theta^m (1-\theta)^{n-m}$, and the ML estimate of θ satisfies $\frac{m}{\theta} - \frac{n-m}{1-\theta} = 0$, which gives $\hat{\theta}(y^n) = \frac{m}{n}$ and $P(y^n; \hat{\theta}(y^n)) = \left(\frac{m}{n}\right)^m \left(1 - \frac{m}{n}\right)^{n-m}$. The constant $C_n$ in (12.5) clearly normalizes $P(y^n; \hat{\theta}(y^n))$ to a distribution $q(y^n)$ such that $\sum_{y^n \in \{0,1\}^n} q(y^n) = 1$. To show that this distribution is minmax, we consider any different distribution $p_n(y^n)$ and take a string $z^n$ such that $p_n(z^n) < q(z^n)$ (such a $z^n$ exists since $p_n(\cdot)$ and $q(\cdot)$ must sum to unity and they cannot be equal for all strings). We have $\max_{y^n} \log \frac{P(y^n; \hat{\theta}(y^n))}{p_n(y^n)} \ge \log \frac{P(z^n; \hat{\theta}(z^n))}{p_n(z^n)} > \log \frac{P(z^n; \hat{\theta}(z^n))}{q(z^n)} = \log C_n = \max_{y^n} \log \frac{P(y^n; \hat{\theta}(y^n))}{q(y^n)}$, which shows q(·) to be minmax. The minimizing distribution in (12.4) is also called the NML (universal) model (Barron et al., 1998).
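A minimal R sketch (function name ours) evaluating the normalizing constant of Eq. 12.5 directly; note that in R 0^0 evaluates to 1, so the boundary terms m = 0 and m = n come out right:

```r
# Normalizing constant C_n of the Bernoulli NML model (Eq. 12.5).
Cn <- function(n) {
  m <- 0:n
  sum(choose(n, m) * (m / n)^m * (1 - m / n)^(n - m))
}
Cn(10)   # small-n check; for large n, work with logarithms instead
```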
Another strong optimality property of the NML models was recently proved in (Rissanen, 2001), where the following minmax problem was formulated: find the (universal) distribution which minimizes the average regret

$$
\min_q \max_g E_g \log \frac{P(Y^n; \hat{\theta}(Y^n))}{q(Y^n)} = \min_q \max_g \sum_{y^n} g(y^n) \log \frac{P(y^n; \hat{\theta}(y^n))}{q(y^n)},
\tag{12.6}
$$
where g(·), the generating distribution of the data, and q(·) run through any sets that include the NML model.

Theorem 2 (Rissanen, 2001). The minimizing distribution q(·) and the worst-case data-generating distribution for the minmax problem (12.6) are given by (12.4) and (12.5).

Proof. We have

$$
\min_q \max_g E_g \log \frac{P(Y^n; \hat{\theta}(Y^n))}{q(Y^n)} = \min_q \max_g \left[ D(g\|q) - D(g\|\hat{q}) + \log C_n \right],
$$

where $\hat{q}(y^n) = \frac{P(y^n; \hat{\theta}(y^n))}{C_n}$ and $D(g\|q)$ is the relative entropy, or Kullback-Leibler distance, between the distributions g and q. Since only the first term depends on q, the minimizing q is q = g, which makes $D(g\|q) = 0$; the maximizing g is $g = \hat{q}$, which makes $D(g\|\hat{q}) = 0$; and finally $g = q = \hat{q}$ proves the claim. Note how the minmax problem generalizes Shannon's noiseless coding theorem, which states that if the data have been generated by a single distribution g, then the ideal coding distribution that minimizes the Kullback-Leibler distance $E_g \log(g(X^n)/q(X^n))$ has to mimic g. Note too that in the optimization problem above no assumption was made that the data-generating mechanism is included in the class of models, which is of utmost relevance for practical problems in which nothing is known about any "true" data-generating mechanism.
3. The NML Model for Boolean Regression
We consider a binary random variable Y which is observed jointly with a binary regressor vector $X \in B^k$. In a useful model class, a carefully selected Boolean function $f: B^k \to \{0, 1\}$ should provide a reasonable prediction f(X) of Y, in the sense that the absolute error E = |Y − f(X)| has a high probability of being 0. Since E, Y, f(X) are binary-valued, we have E = |Y − f(X)| = Y ⊕ f(X), which also implies Y = f(X) ⊕ E, where ⊕ is the modulo-2 sum.
To formalize this we consider a model defined as follows:

$$Y = f(X) \oplus E = \begin{cases} f(X) & \text{if } E = 0 \\ \overline{f(X)} & \text{if } E = 1 \end{cases} \qquad (12.7)$$

where $f(\cdot)$ is a Boolean function and the error $E$ is independently drawn from a Bernoulli source with parameter $\theta$; i.e., $P(E=0)=\theta$ and $P(E=1)=1-\theta$, or for short

$$P(E=j) = \theta^{1-j}(1-\theta)^j, \quad \text{for } j \in \{0,1\}. \qquad (12.8)$$
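Data from the model (12.7)-(12.8) are simple to simulate; the sketch below is our own illustration (the choice of $f$, the regressor list, and the parameter value are arbitrary examples, not from the chapter).

```python
import random

def simulate(f, xs, theta, rng=random):
    # Y = f(X) XOR E, with P(E = 0) = theta (Eq. 12.8), E drawn independently.
    return [f(x) ^ (0 if rng.random() < theta else 1) for x in xs]

f_and = lambda x: x[0] & x[1]                 # an arbitrary Boolean f
xs = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5     # k = 2 regressor vectors
print(simulate(f_and, xs, theta=0.9))         # mostly noise-free labels
```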
Denote by $b_i \in \{0,1\}^k$ the vector having as entries the bits in the binary representation of the integer $i$, i.e., $b_0 = [0,\dots,0,0]$, $b_1 = [0,\dots,0,1]$, etc. Further, define by (12.7) and (12.8) the conditional probability of $y \in \{0,1\}$ for code $b_i \in \{0,1\}^k$,

$$P(Y=y|X=b_i) = \theta^{1-y\oplus f(b_i)}(1-\theta)^{y\oplus f(b_i)} \stackrel{\text{def}}{=} P(y; f, b_i, \theta). \qquad (12.9)$$
The Boolean regression problem considered is to find the optimal universal model (in a minmax sense to be specified shortly) for the following class of models:

$$\mathcal{M}(\theta,k,f) = \left\{ P(y; f, b_i, \theta) = \theta^{(1-y\oplus f(b_i))}(1-\theta)^{(y\oplus f(b_i))} \right\} \qquad (12.10)$$
where $\theta \in [0,1]$, $y \in \{0,1\}$, $b_i \in \{0,1\}^k$. A more flexible class of models can be obtained by allowing the parameter $\theta$ to depend on the actually observed regressor vector $b_i$. This leads us to define a second class of models, $\mathcal{M}(\Theta,k,f)$, as follows:

$$\mathcal{M}(\Theta,k,f) = \left\{ P(y; f, b_i, \theta_i) = \theta_i^{(1-y\oplus f(b_i))}(1-\theta_i)^{(y\oplus f(b_i))} \right\} \qquad (12.11)$$

where $i = 0,\dots,2^k-1$, $\Theta = \{\theta_0,\dots,\theta_{2^k-1}\}$ is the set of parameters, $\theta_i$ is a parameter constrained to $[0,1]$, and $k$ is the structure parameter.
3.1 The NML Model for the Boolean Class $\mathcal{M}(\Theta,k,f)$

When the sequence $y^n = y_1 \dots y_n$ and another sequence of binary regressor vectors $b^n = b_{i_1},\dots,b_{i_n}$ are observed, a member of $\mathcal{M}(\Theta,k,f)$
assigns to the sequence $y^n$ the following probability:

$$P(y^n; \Theta, k, f, b^n) = \prod_{j=1}^{n} \theta_{i_j}^{(1-y_j\oplus f(b_{i_j}))} (1-\theta_{i_j})^{(y_j\oplus f(b_{i_j}))}. \qquad (12.12)$$

By grouping the factors we get

$$P(y^n; \Theta, k, f, b^n) = \prod_{\ell \,|\, b_\ell \in b^n} \theta_\ell^{\sum_{j|i_j=\ell}(1-y_j\oplus f(b_\ell))} (1-\theta_\ell)^{\sum_{j|i_j=\ell}(y_j\oplus f(b_\ell))} \qquad (12.13)$$
which shows that we can carry out the maximization with respect to the $\theta_\ell$-parameters independently for each regression vector in the following way. Consider the subsequence $z^{m_\ell} = z_1 z_2 \dots z_{m_\ell}$ formed with the observed values $y_i$ that correspond to the regressor vector $b_\ell$; denote by $\beta_\ell$ the prediction at regressor $b_\ell$ (i.e. $\beta_\ell = f(b_\ell) \in \{0,1\}$), and by $m_{\ell 0}$ and $m_{\ell 1} = \sum_{j=1}^{m_\ell} z_j$ the number of zeros and ones, respectively, in the sequence $z^{m_\ell}$. In the rest of this subsection, we write for notational simplicity $m_\ell, m_{\ell 0}, m_{\ell 1}$ as $m, m_0, m_1$, respectively, for a fixed regressor vector $b_\ell$. The likelihood is maximized independently at every $b_\ell$:

$$\max_{\theta_\ell \in [0,1],\, \beta_\ell \in \{0,1\}} P(z^m; \theta_\ell, \beta_\ell) = \max_{\theta_\ell \in [0,1],\, \beta_\ell \in \{0,1\}} \theta_\ell^{\sum_{j=1}^m (1-z_j\oplus\beta_\ell)} (1-\theta_\ell)^{\sum_{j=1}^m (z_j\oplus\beta_\ell)}.$$

When $\beta_\ell = 0$, the extremum of $\log P(z^m;\theta_\ell,\beta_\ell=0) = (m-m_1)\log\theta_\ell + m_1\log(1-\theta_\ell)$ is attained for $\frac{m-m_1}{\theta_\ell} - \frac{m_1}{1-\theta_\ell} = 0$, which gives $\theta_\ell = \frac{m-m_1}{m}$ and $P_{\max,\beta_\ell=0} = \left(\frac{m-m_1}{m}\right)^{m-m_1}\left(\frac{m_1}{m}\right)^{m_1}$. The second derivative, $\left(\log P(z^m;\theta_\ell,\beta_\ell=0)\right)'' = -\frac{m_0}{\theta_\ell^2} - \frac{m_1}{(1-\theta_\ell)^2}$, is negative, so this is indeed a maximum; the case $\beta_\ell = 1$ attains the same maximum value, with $\theta_\ell = \frac{m_1}{m}$, so a convention such as $f(b_\ell) = 1$ if $E(y|b_\ell) > 0.5$ and $f(b_\ell) = 0$ if $E(y|b_\ell) < 0.5$ will fix the choice. Since $E(y|b_\ell) > 0.5$ is equivalent to $m_1 > m_0$, we define with this selection of the Boolean function uniquely the ML estimates to be $\hat\theta_\ell(z^m) = \max\left(\frac{m_0}{m}, \frac{m_1}{m}\right) \ge 0.5$ and $\hat f_{y^n}(b_\ell) = 1_{m_1 > m_0}$, where $1_x$ is the selector function, being 1 whenever the condition $x$ is true, and 0 otherwise.
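In code, these per-regressor ML estimates reduce to counting zeros and ones of $y$ at each observed regressor vector. A minimal sketch of ours (function and variable names are our own, not the chapter's):

```python
from collections import defaultdict

def ml_estimates(xs, ys):
    # counts[b] = [m0, m1]: zeros and ones of y observed at regressor b
    counts = defaultdict(lambda: [0, 0])
    for x, y in zip(xs, ys):
        counts[tuple(x)][y] += 1
    # theta_hat = max(m0, m1)/m >= 0.5 and f_hat(b) = 1 iff m1 > m0
    return {b: (max(m0, m1) / (m0 + m1), int(m1 > m0))
            for b, (m0, m1) in counts.items()}

xs = [(0, 1), (0, 1), (0, 1), (1, 1), (1, 1)]
ys = [1, 1, 0, 0, 0]
print(ml_estimates(xs, ys))   # {(0, 1): (0.666..., 1), (1, 1): (1.0, 0)}
```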
3.2 The NML Model for the Boolean Class $\mathcal{M}(\theta,k,f)$

When the errors $E(t)$ are assumed to be Bernoulli distributed $B(\theta)$ with the same parameter for all the regressor vectors $b_\ell$, we get the class of models $\mathcal{M}(\theta,k,f)$ defined in (12.10). When the sequence $y^n = y_1 \dots y_n$ and the sequence of binary regressor vectors $b^n = b_{i_1},\dots,b_{i_n}$ are observed, a
member of the class $\mathcal{M}(\theta,k,f)$ assigns to the sequence $y^n$ the following probability:

$$P(y^n; \theta, k, f, b^n) = \prod_{j=1}^{n} \theta^{(1-y_j\oplus f(b_{i_j}))}(1-\theta)^{(y_j\oplus f(b_{i_j}))} = \theta^{\sum_{j=1}^n (1-y_j\oplus f(b_{i_j}))}(1-\theta)^{\sum_{j=1}^n (y_j\oplus f(b_{i_j}))} = \theta^{n_0}(1-\theta)^{n-n_0}, \qquad (12.18)$$
where $n_0$ is the number of zeros in the sequence $\{\varepsilon_j = y_j \oplus f(b_{i_j})\}_{j=1}^n$. The ML estimate of the model parameters,

$$(\hat\theta(y^n), \hat f_{y^n}) = \arg\max_{\theta, f} P(y^n; \theta, k, f, b^n), \qquad (12.19)$$

can be obtained in two stages, first by maximizing with respect to $f$,

$$\max_f P(y^n; \theta, k, f, b^n), \qquad (12.20)$$

and observing that the optimal $f(\cdot)$ does not depend on $\theta$. For a fixed $\theta > 0.5$, the function $P(y^n;\theta,k,f,b^n) = \theta^{n_0}(1-\theta)^{n-n_0}$ increases monotonically with $n_0$, and (12.20) is maximized by maximizing $n_0$, or, equivalently, by minimizing $n - n_0$:

$$\min_f (n - n_0) = \min_f \sum_{j=1}^{n} |y_j - f(b_{i_j})| = \min_f \sum_{\ell=0}^{2^k-1} \sum_{j | b_{i_j} = b_\ell} \left[ (1-y_j) f(b_\ell) + y_j (1-f(b_\ell)) \right] = \min_f \sum_{\ell=0}^{2^k-1} \left[ m_{\ell 0} f(b_\ell) + m_{\ell 1} (1-f(b_\ell)) \right]. \qquad (12.21)$$
Equation (12.21) shows that $f$ is optimal both for the mean absolute error (MAE) and the mean square error (MSE) as criteria. It can also be seen that the assignment of $f(b_\ell)$ depends solely on $m_{\ell 0}, m_{\ell 1}$, and the solution is given by

$$\hat f_{y^n}(b_\ell) = \begin{cases} 0 & \text{if } m_{\ell 0} \ge m_{\ell 1} \\ 1 & \text{if } m_{\ell 0} < m_{\ell 1} \end{cases} \qquad (12.22)$$

which can readily be computed from the data set. Denote by $n_0^*(y^n)$ the number of zeros in the sequence $\{\varepsilon_j = y_j \oplus \hat f_{y^n}(b_{i_j})\}_{j=1}^n$. To completely
solve the ML estimation problem we have to find

$$\max_\theta P(y^n; \theta, k, \hat f_{y^n}, b^n), \qquad (12.23)$$

for which the maximizing parameter is $\hat\theta(y^n) = \frac{n_0^*(y^n)}{n}$. Therefore

$$P(y^n; \hat\theta(y^n), k, \hat f_{y^n}, b^n) = \left(\frac{n_0^*(y^n)}{n}\right)^{n_0^*(y^n)} \left(1 - \frac{n_0^*(y^n)}{n}\right)^{n - n_0^*(y^n)}. \qquad (12.24)$$
We need to define a distribution $q(y^n)$ over all possible sequences $y^n$, which is the best in the minmax sense

$$\min_q \max_{y^n} \frac{P(y^n; \hat\theta(y^n), k, \hat f_{y^n}, b^n)}{q(y^n)}. \qquad (12.25)$$

The NML model, then, is

$$q(y^n) = \frac{P(y^n; \hat\theta(y^n), k, \hat f_{y^n}, b^n)}{C_n(k, b^n)} = \frac{\left(\frac{n_0^*(y^n)}{n}\right)^{n_0^*(y^n)} \left(1 - \frac{n_0^*(y^n)}{n}\right)^{n - n_0^*(y^n)}}{\sum_{x^n} \left(\frac{n_0^*(x^n)}{n}\right)^{n_0^*(x^n)} \left(1 - \frac{n_0^*(x^n)}{n}\right)^{n - n_0^*(x^n)}}, \qquad (12.26)$$

where

$$C_n(k, b^n) = \sum_{y^n} \left(\frac{n_0^*(y^n)}{n}\right)^{n_0^*(y^n)} \left(1 - \frac{n_0^*(y^n)}{n}\right)^{n - n_0^*(y^n)}. \qquad (12.27)$$
We remark that $n_0^*$ depends on $y^n$ through $\hat f_{y^n}$ in a complicated manner. When $k = 0$, the normalization factor is $C_n(0, b^n) = C_n$, given in (12.5). Alternative expressions for the coefficient $C_n(k, b^n)$ provide a faster evaluation. We need to specify the distinct elements in the set $\{b_\ell \,|\, b_\ell \in b^n\}$ as $\{b_{j_1},\dots,b_{j_K}\}$, and denote by $z_q$ the subsequence of $y^n$ observed when the regressor vector is $b_{j_q}$. Let $n_q$ be the length of the subsequence $z_q$, having $m_q$ zeros. For the optimal Boolean function $\hat f_{y^n}(b_{j_q})$ in (12.22), the number of zeros in the sequence of errors, evaluated for the regressor vector $b_{j_q}$, is $\max(m_q, n_q - m_q)$. Therefore the overall number of zeros in the sequence of errors is $n_0^*(y^n) = \sum_{q=1}^K \max(m_q, n_q - m_q)$, and the Bernoulli parameter is $\hat\theta_{y^n} = \frac{n_0^*(y^n)}{n} = \frac{1}{n}\sum_{q=1}^K \max(m_q, n_q - m_q)$.
We may write the conditional probability of the string $y^n$, when the optimal Boolean function $\hat f_{y^n}(\cdot)$ is used and the numbers of zeros in the subsequences $\{z_q\}$ are $m_1,\dots,m_K$, as follows:

$$P(y^n; \hat f_{y^n}, m_1,\dots,m_K) = \left[\frac{\sum_{q=1}^K \max(m_q, n_q - m_q)}{n}\right]^{\sum_{q=1}^K \max(m_q, n_q - m_q)} \times \left[\frac{\sum_{q=1}^K \min(m_q, n_q - m_q)}{n}\right]^{\sum_{q=1}^K \min(m_q, n_q - m_q)}. \qquad (12.28)$$

There are $\prod_{q=1}^K \binom{n_q}{m_q}$ ways to generate sequences $y^n$ having the given numbers $m_1,\dots,m_K$ of zeros in the subsequences $z_1,\dots,z_K$, respectively, and all of them have the same probability, given by (12.28). In order to get the sum $C_n(k,b^n)$ we group the terms in (12.27) as follows:

$$C_n(k, b^n) = \sum_{m_1=0}^{n_1} \sum_{m_2=0}^{n_2} \cdots \sum_{m_K=0}^{n_K} \binom{n_1}{m_1}\binom{n_2}{m_2}\cdots\binom{n_K}{m_K} \, \hat\theta^{\sum_{q=1}^K \max(m_q, n_q - m_q)} (1-\hat\theta)^{\sum_{q=1}^K \min(m_q, n_q - m_q)}, \qquad (12.29)$$

where $\hat\theta = \frac{1}{n}\sum_{q=1}^K \max(m_q, n_q - m_q)$. While the expression for $C_n(k,b^n)$ in (12.27) involves summation of $2^n$ terms, the expression (12.29) requires addition of only $\prod_{q=1}^K n_q$ terms.
A fast algorithm for computation of $C_n(k, b^n)$. Observe that (12.29) can be alternatively expressed as

$$C_n(k, b^n) = \sum_{n_1^*=0}^{\lfloor n/2 \rfloor} \left(\frac{n_1^*}{n}\right)^{n_1^*} \left(1 - \frac{n_1^*}{n}\right)^{n - n_1^*} S_{K,n_1,\dots,n_K}(n_1^*), \qquad (12.30)$$

where $S_{K,n_1,\dots,n_K}(n_1^*)$ is the number of sequences $y^n$ having $n_1^*$ ones in the residual sequence. We note that $n_1^* = \sum_{q=1}^K \min(m_q, n_q - m_q)$ and the numbers $S_{K,n_1,\dots,n_K}(n_1^*)$ can be easily computed, recursively in $K$. Denote
first

$$h_\ell(m) = \begin{cases} 0 & \text{if } m > \frac{n_\ell}{2} \\ \binom{n_\ell}{m} & \text{if } m = \frac{n_\ell}{2} \\ 2\binom{n_\ell}{m} & \text{otherwise,} \end{cases} \qquad (12.31)$$

which is the number of sequences of $n_\ell$ bits having either $m$ bits set to 1 or $n_\ell - m$ bits set to 1, for $0 \le m \le \frac{n_\ell}{2}$. By combining each of the $S_{K-1,n_1,\dots,n_{K-1}}(n_1^* - m_K)$ sequences having $n_1^* - m_K$ ones in the residual sequence with each of the $h_K(m_K)$ sequences having either $m_K$ bits set to 1 or $n_K - m_K$ bits set to 1, we get sequences having $(n_1^* - m_K) + \min(m_K, n_K - m_K) = n_1^*$ bits of 1 in their residual sequence. Therefore the following recurrence relation holds:

$$S_{K,n_1,\dots,n_K}(n_1^*) = \sum_{m_K=0}^{n_K} h_K(m_K) \, S_{K-1,n_1,\dots,n_{K-1}}(n_1^* - m_K), \qquad (12.32)$$

where, by convention, $S_{K-1,n_1,\dots,n_{K-1}}(n_1^* - m_K) = 0$ for negative arguments, $n_1^* - m_K < 0$. We note that the recurrence is simply a convolution sum, $S_{K,n_1,\dots,n_K} = h_K \otimes S_{K-1,n_1,\dots,n_{K-1}}$, and from here we conclude that

$$S_{K,n_1,\dots,n_K} = h_1 \otimes h_2 \otimes \cdots \otimes h_K. \qquad (12.33)$$

We can easily see that $S_{K_1,n_1,\dots,n_{K_1}}(i) = 0$ for $i > \sum_{q=1}^{K_1} \frac{n_q}{2}$, due to the fact that the optimal residual sequence cannot have more than $\sum_q \frac{n_q}{2}$ ones. Also, from (12.31) we note that only $\frac{1}{2^K}\prod_{q=1}^K n_q$ terms have to be added when evaluating all the convolution sums (12.33). Therefore the computation of $C_n(k, b^n)$ by using (12.30) is about $2^K$ times faster than when using (12.29).
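The recursion (12.31)-(12.33) is straightforward to implement. Below is a small self-contained sketch of ours that builds the $h_\ell$ vectors, convolves them, and evaluates (12.30); the subsequence lengths $n_q$ are supplied by the caller, and all names are our own.

```python
from math import comb

def h_vector(n_l):
    # h_l(m) of Eq. (12.31): counts n_l-bit sequences with m or n_l - m ones;
    # nonzero only for m <= n_l / 2.
    h = [0] * (n_l + 1)
    for m in range(n_l // 2 + 1):
        h[m] = comb(n_l, m) if 2 * m == n_l else 2 * comb(n_l, m)
    return h

def convolve(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                out[i + j] += ai * bj
    return out

def normalizer(subseq_lengths):
    # C_n(k, b^n) via Eqs. (12.30)-(12.33), given the lengths n_q of the
    # subsequences observed at the K distinct regressor vectors.
    n = sum(subseq_lengths)
    S = [1]                               # identity for the convolution
    for n_q in subseq_lengths:
        S = convolve(S, h_vector(n_q))    # S_K = h_1 (*) ... (*) h_K
    total = 0.0
    for n1, count in enumerate(S):        # n1 = ones in the residual sequence
        if count:
            p = n1 / n
            total += count * p ** n1 * (1 - p) ** (n - n1)
    return total

print(normalizer([3, 4, 2]))   # K = 3 distinct regressors, n = 9 samples
```

With a single distinct regressor ($K = 1$, $n_1 = n$) this reduces to the Bernoulli normalizer $C_n$ of (12.5), matching the remark above.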
3.3 A Two Part Code for the Boolean Class $\mathcal{M}(\Theta,k,f)$

We consider a coding problem where the decoder has available the sequence of regression vectors $b^n$ and the set $\{b_{j_1},\dots,b_{j_K}\}$, with the numbers $\{n_q\}$ (recall that we denote by $z_q$ the subsequence of $y^n$ observed when the regressor vector is $b_{j_q}$, and by $n_q$ the length of the subsequence $z_q$, having $m_q$ zeros).
In order to specify the parameters $\{\hat\theta_q = \frac{m_q}{n_q}\}$ it is necessary to encode the values $\{m_q\}$, which can be done with

$$Cost(\hat\Theta) = \sum_{q=1}^K \log_2 n_q \qquad (12.34)$$

bits, while the cost of encoding $y^n$, conditional on $\hat\Theta$, is

$$Cost(y^n | \hat\Theta) = -\log_2 \prod_{q=1}^K \hat\theta_q^{m_q} (1-\hat\theta_q)^{n_q - m_q} \qquad (12.35)$$

bits, leading to an overall description length of $Cost(\hat\Theta) + Cost(y^n|\hat\Theta)$.
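The two-part codelength (12.34)-(12.35) is a simple sum over the $K$ groups; here is a minimal sketch of ours, taking the pairs $(m_q, n_q)$ as input (the function name and the example values are our own illustrations).

```python
from math import log2

def two_part_codelength(groups):
    # groups: list of (m_q, n_q) pairs, m_q zeros out of n_q samples
    # observed at the q-th distinct regressor vector.
    param_cost = sum(log2(n_q) for _, n_q in groups)     # encode each m_q
    data_cost = 0.0
    for m_q, n_q in groups:
        theta_q = m_q / n_q
        if 0 < theta_q < 1:          # deterministic groups contribute 0 bits
            data_cost -= m_q * log2(theta_q) + (n_q - m_q) * log2(1 - theta_q)
    return param_cost + data_cost

print(two_part_codelength([(5, 6), (1, 8), (4, 4)]))
```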
3.4 A Two Part Code for the Boolean Class $\mathcal{M}(\theta,k,f)$

The decoder has again available the sequence of regressor vectors $b^n$. In order to specify the parameter $\hat\theta(y^n) = \frac{n_0^*(y^n)}{n}$ it is necessary to encode the value $n_0^*$, which can be done with

$$Cost(\hat\theta(y^n)) = \log_2 n \text{ bits.} \qquad (12.36)$$

The ML estimates $\{\hat f(b_q)\}$ need one bit each. Therefore the Boolean function for the $K$ distinct regressor vectors has the cost $Cost(\hat f(y^n)) = K$. The cost of encoding $y^n$, conditional on $(\hat\theta, \hat f(y^n))$, is

$$Cost(y^n | \hat\theta(y^n), \hat f(y^n)) = -\log_2 \hat\theta(y^n)^{n_0^*(y^n)} (1-\hat\theta(y^n))^{n - n_0^*(y^n)} \qquad (12.37)$$

bits, and the overall description length is

$$Cost(y^n | \mathcal{M}(\theta,k,f)) = Cost(\hat\theta(y^n)) + Cost(\hat f(y^n)) + Cost(y^n | \hat\theta(y^n), \hat f(y^n)) = \log_2 n + K - \log_2 \hat\theta(y^n)^{n_0^*(y^n)} (1-\hat\theta(y^n))^{n - n_0^*(y^n)}.$$

4. Experimental Results
We illustrate the classification based on the NML model for Boolean regression models using the microarray DNA data Leukemia (ALL/AML) of
(Golub et al., 1999), publicly available at http://www-genome.wi.mit.edu/MPR/data_set_ALL_AML.html. The microarray contains 6817 human genes, sampled from 72 cases of cancer, of which 47 are of ALL type and 25 of AML type. The data are preprocessed as recommended in (Golub et al., 1999) and (Dudoit et al., 2000): first, the gene values are truncated below at 100 and above at 16000; secondly, genes having the ratio of the maximum over the minimum less than 5, or the difference between the maximum and the minimum less than 500, are excluded; and finally, the logarithm to the base 10 is applied to the preprocessed gene expression values. The resulting data matrix $\tilde X$ has 3571 rows and 72 columns. We design a two-level quantizer by applying the LBG algorithm (Linde et al., 1980); the resulting decision threshold is 2.6455 when all the entries in the matrix $\tilde X$ are used as a training set (but we note that no information about the true classes is used during the quantization stage). The entries in the matrix $\tilde X$ are quantized to binary values, resulting in the binary matrix $X$.
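The preprocessing pipeline described above can be sketched in a few lines; the following is our own illustration (loading the Golub data is not shown, and the stand-in random matrix plus the threshold value 2.6455 quoted from the text replace a proper LBG training run).

```python
import numpy as np

def preprocess(expr, threshold=2.6455):
    # 1) truncate to [100, 16000]; 2) drop genes with max/min < 5 or
    # max - min < 500; 3) log10; 4) binarize with the global threshold.
    x = np.clip(expr, 100, 16000)
    keep = (x.max(axis=1) / x.min(axis=1) >= 5) & \
           (x.max(axis=1) - x.min(axis=1) >= 500)
    x = np.log10(x[keep])
    return (x > threshold).astype(int), keep

raw = np.random.uniform(50, 20000, size=(20, 72))  # stand-in for 6817 x 72
binary, kept = preprocess(raw)
print(binary.shape)
```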
4.1 The NML Model for the Boolean Regression Models with k = 1
Table 12.1 lists the best 10 genes as ranked by the NML model for the Boolean class $\mathcal{M}(\theta,1,f)$. The corresponding codelengths are in the range 26-44 bits, well below 70 bits, which is the codelength assigned by the NML model for the Bernoulli class. Figure 12.1 shows the gene expression values for the gene which minimizes the codelength assigned by the NML model of the Boolean class $\mathcal{M}(\theta,1,f)$. We note that the binary profile for the quantized gene expression is quite robust to small changes in the value of the threshold, and also that it correlates very well with the true class profile. In Fig. 12.2 we compare the codelengths assigned by the NML models for each of the classes $\mathcal{M}(\Theta,1,f)$ and $\mathcal{M}(\theta,1,f)$. The ranking of the best 6 genes is the same by both of the models.

Table 12.1. The best 10 genes for predicting the class label with the NML model of the class $\mathcal{M}(\theta,1,f)$.

Codelength   Gene Index   Gene accession number
26.7         2288         M84526
30.6         1834         M23197
30.6         3252         U46499
34.2         6855         M31523
34.2         760          D88422
37.5         6376         M83652
40.6         1685         M11722
43.5         6378         M83667
43.5         2128         M63379
43.6         1882         M27891

Figure 12.1. Gene expressions: quantized and unquantized expression levels of gene 2288, M84526 (DFD component of complement (adipsin)).
4.2 The NML Model for the Boolean Regression Models with k = 2

We consider here the Boolean classes of regression models using $k = 2$ regressors. There are $3570 \times 3571 = 12\,748\,470$ possible pairs of genes. We computed the codelength assigned by the NML model of the Boolean class $\mathcal{M}(\theta,2,f)$ for all possible pairs of genes. In Fig. 12.3 we show the codelengths obtained by the best 100 pairs of genes, as assigned by the NML model of the Boolean class $\mathcal{M}(\theta,2,f)$. We also show for these gene pairs the codelength assigned by the NML model of the Boolean class $\mathcal{M}(\Theta,2,f)$. The codelengths assigned to each gene pair by the two models are very close most of the time, although significant differences are observed for some gene pairs. The model $\mathcal{M}(\Theta,2,f)$ is more flexible and is sometimes able to explain the data in fewer bits, but the model $\mathcal{M}(\theta,2,f)$ is better in most of the cases.

In Fig. 12.4 we compare the NML model and the two-part code for the class of models $\mathcal{M}(\theta,2,f)$, for the best 100 pairs of genes. We observe that for each pair of genes the codelength assigned by the simple two-part code described in Section 3.4 is about 4 bits longer than the NML codelength.
Figure 12.2. The 100 best genes by the NML model of the class $\mathcal{M}(\theta,1,f)$ (sorted in increasing order of codelength). The NML codelength assigned by the class $\mathcal{M}(\Theta,1,f)$ is also shown for each gene. By comparison, the NML model of the Bernoulli class $B(\theta)$ assigns to the string $y^n$ the codelength 70.57 bits.
Figure 12.3. The 100 best pairs of genes by the NML model of the class $\mathcal{M}(\theta,2,f)$ (sorted in increasing order of codelength). The NML codelength assigned by the class $\mathcal{M}(\Theta,2,f)$ is also shown for each pair.
Figure 12.4. The 100 best pairs of genes by the NML model of the class $\mathcal{M}(\theta,2,f)$ (sorted in increasing order). The two-part code codelength assigned by the class $\mathcal{M}(\theta,2,f)$ is shown to rank the gene pairs in a similar manner.
However, this difference is almost constant, suggesting that a first pre-selection according to the faster two-part code could be used to obtain a smaller pool of gene pairs out of the 12 748 470 possible pairs, and then the more precise NML model could be used for the final selection of the best gene pairs. Figure 12.5 shows a higher discrepancy between the codelengths assigned by the NML model and the two-part code for the class of models $\mathcal{M}(\Theta,2,f)$. Finally, Fig. 12.6 shows for the best 120 pairs of genes $x_1 x_2$ the codelength assigned by the NML model of the class $\mathcal{M}(\theta,2,f)$ together with the codelength assigned by the NML model of the class $\mathcal{M}(\theta,1,f)$ (only the codelength for the best gene, $x_1$ or $x_2$, is shown). We conclude that the NML model $\mathcal{M}(\theta,2,f)$ for each of the 120 pairs of genes explains the data better than their subnested NML models $\mathcal{M}(\theta,1,f)$.
4.3 The NML Model for the Boolean Regression Models with k = 3

In Fig. 12.7 we show the best 100 triplets of genes $(x_1 x_2 x_3)$, ranked by the codelength assigned by the NML model of the class $\mathcal{M}(\theta,3,f)$, together with the codelength assigned by the NML model of the class $\mathcal{M}(\theta,2,f)$ (the codelength of the best of the three possible pairs $(x_1 x_2)$, $(x_1 x_3)$, $(x_2 x_3)$ is shown for each triplet). All listed models with three genes are better than the sub-nested models having only two genes.
Figure 12.5. The 100 best pairs of genes by the NML model of the class $\mathcal{M}(\Theta,2,f)$ (sorted in increasing order of codelength). The two-part code codelength assigned by the class $\mathcal{M}(\Theta,2,f)$ is also shown for each pair.
Figure 12.6. The 120 best pairs of genes by the NML model of the class $\mathcal{M}(\theta,2,f)$ (sorted in increasing order of codelength). The minimum of the NML codelength assigned by the class $\mathcal{M}(\theta,1,f)$ for each of the genes in the pair is also shown.
Figure 12.7. The 100 best triplets of genes by the NML model of the class $\mathcal{M}(\theta,3,f)$ (sorted in increasing order of codelength). The minimum of the NML codelength assigned by the class $\mathcal{M}(\theta,2,f)$ for each pair of genes in the triplet is also shown.
4.4 Extension of the Classification for Unseen Cases of the Boolean Regressors

The Boolean regressors observed in the training set may not span all the $2^k$ possible binary vectors. If a binary vector $b_q$ is not in the training set, the decision $f^*(b_q)$ for classification remains undecided during the training stage. We pick the value of $f^*(b_q)$ by nearest neighbor voting, taking the majority vote of the neighbors $b_\ell$ situated at Hamming distance 1 for which $f^*(b_\ell)$ was decided during the training stage. Denote by $N_1(b_q) = \{b_\ell : b_\ell \in b^n,\, w_H(b_\ell \oplus b_q) = 1,\, f^*(b_\ell) \text{ is decided}\}$ the set of decided neighbors of $b_q$ at Hamming distance 1. The voting decision is then

$$f^*(b_q) = \begin{cases} 1 & \text{if } \sum_{b_\ell \in N_1(b_q)} f^*(b_\ell) > \sum_{b_\ell \in N_1(b_q)} (1 - f^*(b_\ell)) \\ 0 & \text{if } \sum_{b_\ell \in N_1(b_q)} f^*(b_\ell) < \sum_{b_\ell \in N_1(b_q)} (1 - f^*(b_\ell)). \end{cases} \qquad (12.38)$$

If after voting there still remains a tie ($\sum_{b_\ell \in N_1(b_q)} f^*(b_\ell) = \sum_{b_\ell \in N_1(b_q)} (1 - f^*(b_\ell))$), we take the majority vote of the neighbors at Hamming distance 2, and continue, if necessary, until a clear decision is reached.
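The escalating-distance vote of (12.38) is easy to express in code; below is a minimal sketch of ours (names and the tie fall-back for the case where every distance ties are our own conventions, not specified in the chapter).

```python
def vote_unseen(bq, decided):
    # decided: dict mapping decided regressor bit-tuples to f*(b) in {0, 1}.
    # Take the majority vote at Hamming distance 1, 2, ... per Eq. (12.38).
    k = len(bq)
    for dist in range(1, k + 1):
        votes = [v for b, v in decided.items()
                 if sum(x != y for x, y in zip(b, bq)) == dist]
        ones = sum(votes)
        if ones != len(votes) - ones:       # a clear majority at this distance
            return int(ones > len(votes) - ones)
    return 0                                 # fall-back: all distances tied

decided = {(0, 0, 0): 1, (0, 1, 1): 0, (1, 1, 0): 0}
print(vote_unseen((0, 1, 0), decided))      # two 0-votes beat one 1-vote -> 0
```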
Figure 12.8. The four triplets of genes with the best classification error rates, 0.01%, 0.008%, 0.007%, and 0.004%, respectively, ranked as second, fifth, sixth and eleventh according to the NML model of the class $\mathcal{M}(\theta,3,f)$. The original (non-quantized) values of the genes are shown, pseudo-colored using the colormap indicated at the bottom of the figure; the class labels are represented with green for the ALL class and red for the AML class. The predictor based on the genes $x_1 = 1834$, $x_2 = 3631$ and $x_3 = 6277$ has the Boolean function $ALL = f(x_1,x_2,x_3) = \bar x_1 \bar x_2 \bar x_3$. The predictor based on the genes $x_1 = 1834$, $x_2 = 3631$ and $x_3 = 5373$ has the Boolean function $ALL = f(x_1,x_2,x_3) = \bar x_1 \bar x_2 \bar x_3$. The predictor based on the genes $x_1 = 1834$, $x_2 = 3631$ and $x_3 = 6279$ has the Boolean function $ALL = f(x_1,x_2,x_3) = \bar x_1 \bar x_2 \bar x_3$. The predictor based on the genes $x_1 = 1144$, $x_2 = 1882$ and $x_3 = 5808$ has the Boolean function $ALL = f(x_1,x_2,x_3) = x_1 \bar x_2 \vee \bar x_2 \bar x_3$.
4.5 Estimation of Classification Errors Achieved with Boolean Regression Models with k = 3

The Leukemia data set from (Golub et al., 1999) was considered recently in a study that compared several classification methods (Dudoit et al., 2000). The evaluation of the performance there is based on the classification error as estimated in a 2:1 crossvalidation experiment. In order to compare our classification results with these results, we estimate the classification error in the same way, namely by dividing the 72-patient set at random into a training set of $n_T = 48$ patients and a test set of $n_s = 24$ patients, finding the optimal predictor $f^*(\cdot)$ over the training set, classifying the test set by use of the predictor $f^*(\cdot)$ (the extension for cases unseen in the training set is done as in the previous subsection), and counting the number of classification errors produced over the test set. The random split is repeated $n_r = 10000$ times, and the estimated classification error is computed as the percentage of the total number of errors observed in the $(n_r \cdot n_s)$ test classifications. We mention for comparison that the best classification methods tested in (Dudoit et al., 2000) have classification errors higher than 1%. As we see in Table 12.3 and Fig. 12.9, there are several predictors with three genes achieving classification error rates as low as 0.004%. Note the remarkable agreement in the ranking of the gene triplets by the codelength assigned by the NML model of the class $\mathcal{M}(\theta,3,f)$ and the estimated classification error rates. As for the genes involved in the optimal predictors of Table 12.3, we note that five genes belong to the set of 50 "informative" genes selected in (Golub et al., 1999), namely M23197, M84526, M27891, M83652, X95735.

Table 12.2. The best 14 pairs of genes for predicting the class label with the NML model of the class $\mathcal{M}(\theta,2,f)$.

Codelength   Pair of Genes   Gene accession numbers
18.3         2288  5714      M84526    HG1496-HT1496
19.2         1834  2288      M23197    M84526
19.2         2288  4079      M84526    X05409
19.3         2288  3631      M84526    U70063
19.3         2288  5808      M84526    HG2981-HT3127
19.3         1807  2288      M21551    M84526
19.3         1834  6277      M23197    M30703
19.3         1834  5373      M23197    S76638
19.3         1834  6279      M23197    X97748
19.4         758   4342      D88270    X59871
19.4         1144  1882      J05243    M27891
23.2         73    1834      AB000584  M23197
23.2         1834  5714      M23197    HG1496-HT1496
23.2         1834  5949      M23197    M29610
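The random-split estimation procedure described above is a generic loop; the following is a minimal sketch of ours, where `fit` and `predict` are placeholders standing in for the Best-Fit training step and the voting-extended classifier of Section 4.4 (the toy stand-ins at the bottom are our own illustration).

```python
import random

def cv_error(xs, ys, fit, predict, n_splits=10000, n_test=24, rng=random):
    # 2:1 random-split cross-validation: fit on the training part,
    # count errors on the held-out part, repeat, report a percentage.
    n, errors = len(ys), 0
    for _ in range(n_splits):
        idx = rng.sample(range(n), n)              # a random permutation
        test, train = idx[:n_test], idx[n_test:]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors += sum(predict(model, xs[i]) != ys[i] for i in test)
    return 100.0 * errors / (n_splits * n_test)

# toy stand-ins: always predict the majority class of the training labels
fit = lambda xs, ys: int(sum(ys) * 2 > len(ys))
predict = lambda model, x: model
```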
5. Conclusions

Boolean regression classes of models are powerful modeling tools, having associated NML models which can be easily computed and used in MDL inference, in particular for factor selection.
Table 12.3. The best 18 triplets of genes for predicting the class label according to the NML model of the class $\mathcal{M}(\theta,3,f)$.

Code length   Classification error [%]   Triplet of Genes    Gene accession numbers
6.9           0.912                      1834 2288 5714      M23197 M84526 HG1496-HT1496
7.9           0.010                      1834 3631 6277      M23197 U70063 M30703
7.9           0.891                      758 4250 4342       D88270 X53586 X59871
8.0           0.652                      2288 4847 6376      M84526 X95735 M83652
8.7           0.008                      1834 3631 5373      M23197 U70063 S76638
8.7           0.007                      1834 3631 6279      M23197 U70063 X97748
8.7           0.910                      1144 1217 1882      J05243 L06132 M27891
8.8           0.649                      302 2288 6376       D25328 M84526 M83652
8.8           0.055                      1144 1834 1882      J05243 M23197 M27891
8.8           0.063                      1834 1882 6049      M23197 M27891 U89922
8.8           0.004                      1144 1882 5808      J05243 M27891 HG2981-HT3127
8.8           0.584                      2288 3932 6376      M84526 U90549 M83652
8.9           0.558                      2288 5518 6376      M84526 X95808 M83652
8.9           0.560                      1399 2288 6376      L21936 M84526 M83652
8.9           0.620                      1241 2288 6376      L07758 M84526 M83652
8.9           0.605                      2288 3660 6376      M84526 U72342 M83652
8.9           0.582                      2288 4399 6376      M84526 X63753 M83652
8.9           0.556                      2288 4424 6376      M84526 X65867 M83652
Figure 12.9. The best 100 triplets of genes by the NML model of the class $\mathcal{M}(\theta,3,f)$, and the estimated classification error of the corresponding predictors.
Comparing the MDL methods based on the two-part codes with those based on the NML models, we note that the former are faster to evaluate, but the latter provide a significantly shorter codelength and hence a better description of the data. When analyzing gene expression data, speed may be a major concern, since one has to test $\binom{n}{k}$ possible groupings of $k$ genes, with $n$ in the order of thousands and $k$ usually less than 10. The two-part codes may then be used for pre-screening of the gene groupings, to remove the obviously poor performers, and then the NML model can be applied to obtain the final selection from a smaller pool of candidates. The running time for all our experiments reported here is in the order of tens of minutes.

The use of the MDL principle for classification with the class of Boolean models provides an effective classification method, as demonstrated with the important cancer classification example based on gene expression data. The NML model for the class $\mathcal{M}(\theta,k,f)$ was used for the selection of informative feature genes. When using the sets of feature genes selected by the NML model, we achieved classification error rates significantly lower than those reported recently for the same data set.
References

Barron, A., Rissanen, J., Yu, B. (1998) The minimum description length principle in coding and modeling. IEEE Trans. on Information Theory, Special commemorative issue: Information Theory 1948-1998, vol. 44, no. 6, 2743-2760.
Dudoit, S., Fridlyand, J., Speed, T.P. (2000) Comparison of discrimination methods for the classification of tumors using gene expression data. Dept. of Statistics, University of California, Berkeley, Technical Report 576.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, Vol. 286, pp. 531-537.
Kim, S., Dougherty, E.R. (2000) Coefficient of determination in nonlinear signal processing. Signal Processing, 80:2219-2235.
Hieter, P., Boguski, M. (1997) Functional genomics: it's all how you read it. Science 278, 601-602.
Jacob, F., Monod, J. (1961) Genetic regulatory mechanisms in the synthesis of proteins. Journal of Molecular Biology, Vol. 3, 318-356.
Kim, S., Dougherty, E.R., Chen, Y., Sivakumar, K., Meltzer, P., Trent, J.M., Bittner, M. (2000) Multivariate measurement of gene expression relationships. Genomics, 67, 201-209.
Linde, Y., Buzo, A., Gray, R.M. (1980) An algorithm for vector quantizer design. IEEE Transactions on Communications, 28:84-95.
Rissanen, J. (1978) Modelling by shortest data description. Automatica, vol. 14, 465-471.
Rissanen, J. (1984) Universal coding, information, prediction and estimation. IEEE Trans. on Information Theory, vol. 30, 629-636.
Rissanen, J. (1986) Stochastic complexity and modeling. Ann. Statist., vol. 14, pp. 1080-1100.
Rissanen, J. (2000) MDL denoising. IEEE Trans. on Information Theory, vol. IT-46, no. 7, 2537-2543.
Rissanen, J. (2001) Strong optimality of the normalized ML models as universal codes and information in data. IEEE Trans. on Information Theory, vol. IT-47, no. 5, 1712-1717.
Russell, P.J. (2000) Fundamentals of Genetics. 2nd edition, San Francisco: Addison Wesley Longman Inc.
Schena, M., Shalon, D., Davis, R.W., Brown, P.O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270: 467-470.
Shtarkov, Yu.M. (1987) Universal sequential coding of single messages. Translated from Problems of Information Transmission, Vol. 23, No. 3, 3-17.
Tabus, I., Astola, J. (2001) On the use of MDL principle in gene expression prediction. Journal of Applied Signal Processing, Volume 2001, No. 4, pp. 297-303.
Tabus, I., Astola, J. (2000) MDL optimal design for gene expression prediction from microarray measurements. Tampere University of Technology, Technical Report, ISBN 952-15-0529-X.
Tabus, I., Rissanen, J., and Astola, J. (2003) Classification and feature gene selection using the normalized maximum likelihood model for discrete regression. Signal Processing, Special issue on Genomic Signal Processing, Vol. 83, No. 4, pp. 713-727.
Chapter 13

INFERENCE OF GENETIC REGULATORY NETWORKS VIA BEST-FIT EXTENSIONS

Harri Lähdesmäki¹, Ilya Shmulevich¹, Olli Yli-Harja², and Jaakko Astola²

¹Cancer Genomics Laboratory, Department of Pathology, The University of Texas M. D. Anderson Cancer Center, Houston, Texas, USA
²Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
1. Introduction
One of the most important breakthroughs in recent years in molecular biology is microarray technology, which allows monitoring of gene expression at the transcript level for thousands of genes in parallel (Schena et al., 1995) (see also e.g. (Baldi and Hatfield, 2002; Zhang et al., 2004)). Even though mRNA is not the final product of a gene, armed with the knowledge of gene transcript levels in various cell types and different types of diseases, such as cancer (Golub et al., 1999; van’t Veer et al., 2002), under different developmental stages (Wen et al., 1998; Spellman et al., 1998), and under a variety of conditions, such as in response to specific stimuli (Iyer et al., 1999; DeRisi et al., 1997), scientists can gain a deeper understanding of the functional roles of genes, of the cellular processes in which they participate, and of their regulatory interactions. Thus, gene expression data for many cell types and organisms at multiple time points and experimental conditions are rapidly becoming available (Brazma and Vilo, 2000). In fact, the amounts of data typically gathered in experiments call for computational methods and formal modeling in order to make meaningful interpretations (Huang, 1999). The emerging view is that as biology becomes a more quantitative science, modeling approaches will become more and more usual (Brazma and Vilo, 2000). In order to understand the nature of cellular function, it is necessary to study the behavior of genes in a holistic rather than in an individual manner. A significant role is played by the development and analysis of mathematical and computational methods in order to construct formal models of
genetic interactions. This research direction provides insight and a conceptual framework for an integrative view of genetic function and regulation and paves the way toward understanding the complex relationship between the genome and the cell. Moreover, this direction has provided impetus for experimental work directed toward verification of these models. There have been a number of attempts to model gene regulatory networks, including linear models (van Someren et al., 2000; D'Haeseleer et al., 1999), Bayesian networks (Murphy and Mian, 1999; Friedman et al., 2000; Friedman, 2004), neural networks (Weaver et al., 1999), differential equation models (Chen et al., 1999; de Hoon et al., 2003; Sakamoto and Iba, 2001), and models including stochastic components on the molecular level (McAdams and Arkin, 1997). A model class that has received a considerable amount of attention is the so-called Random Boolean Network model, originally introduced by Kauffman (Kauffman, 1969; Glass and Kauffman, 1973; Kauffman, 1993) approximately thirty years ago. In this model, the state of a gene is represented by a Boolean variable (ON or OFF) and interactions between the genes are represented by Boolean functions, which determine the state of a gene on the basis of the states of some other genes. Recent research seems to indicate that many realistic biological questions may be answered within the seemingly simplistic Boolean formalism, which in essence emphasizes fundamental, generic principles rather than quantitative biochemical details (Huang, 1999; Shmulevich and Zhang, 2002a) (see also Chapter 12). One of the appealing properties of Boolean networks is that they are inherently simple, emphasizing generic network behavior rather than quantitative biochemical details, but are able to capture much of the complex dynamics of gene regulatory networks. For example, the dynamic behavior of such networks corresponds to, and can be used to model, many biologically meaningful phenomena, such as cellular state dynamics possessing switch-like behavior, stability, and hysteresis (Huang, 1999). Besides the conceptual framework afforded by such models, a number of practical uses may be reaped by inferring the structure of the genetic models from experimental data, that is, from gene expression profiles. One such use is the identification of suitable drug targets in cancer therapy. To that end, much recent work has gone into identifying the structure of gene regulatory networks from expression data (Liang et al., 1998; Akutsu et al., 1998; Akutsu et al., 1999; Akutsu et al., 2000; Ideker et al., 2000; Maki et al., 2001; Karp et al., 1999; Shmulevich et al., 2002b; Lähdesmäki et al., 2003). Most of the work, however, has focused on the so-called Consistency Problem, namely, the problem of determining whether there exists a
network that is consistent with the examples. While this problem is important in computational learning theory, as it can be used to prove the hardness of learning for various function classes, it may not be applicable in a realistic situation containing noisy observations or errors, as is the case with microarrays. Measurement errors can arise in the data acquisition process or may be due to unknown latent factors. A learning paradigm that can incorporate such inconsistencies is called the Best-Fit Extension Problem. Essentially, the goal of this problem is to establish a rule or, in our case, a network that would make as few misclassifications as possible. In order for an inferential algorithm to be useful, it must be computationally tractable. In this chapter, we consider the computational complexity of the Best-Fit Extension Problem for the Random Boolean Network model. We show that for many classes of Boolean functions the problem is polynomial-time solvable, implying its practical applicability to real data analysis. We first review the necessary background information on Random Boolean Networks and then discuss the Best-Fit Extension Problem for Boolean functions and its complexity for Boolean networks.
2. Boolean Networks
For consistency of notation with other related work, we will be using the same notation as in (Akutsu et al., 1999). A Boolean network $G(V,F)$ is defined by a set of nodes $V = \{v_1,\dots,v_n\}$ and a list of Boolean functions $F = (f_1,\dots,f_n)$. A Boolean function $f_i(v_{i_1},\dots,v_{i_k})$ with $k$ specified input nodes is assigned to node $v_i$. In general, $k$ could vary as a function of $i$, but we may define it to be a constant without loss of generality by setting $k = \max_i k(i)$ and allowing the unnecessary variables (nodes) in each function to be fictitious. For a function $f$, the variable $x_i$ is fictitious if $f(x_1,\dots,x_{i-1},0,x_{i+1},\dots,x_n) = f(x_1,\dots,x_{i-1},1,x_{i+1},\dots,x_n)$ for all $x_1,\dots,x_{i-1},x_{i+1},\dots,x_n$. A variable that is not fictitious is called essential. We shall also refer to $k$ as the indegree of the network. Each node $v_i$ represents the state (expression) of gene $i$, where $v_i = 1$ means that gene $i$ is expressed and $v_i = 0$ means it is not expressed. Collectively, the states of individual genes in the genome form a gene activity profile (GAP) (Huang, 1999). The list of Boolean functions $F$ represents how genes regulate each other. That is, any given gene transforms its inputs (regulatory factors that bind to it) into an output, which is the state or expression of the gene itself. All genes (nodes) are updated synchronously in accordance with the functions assigned to them, and this process is then repeated. The artificial synchrony simplifies computation while preserving the qualitative, generic properties of global network dynamics (Huang, 1999; Kauffman, 1993; Wuensche, 1998).
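One synchronous update of such a network is a one-line operation; the toy three-gene network below is our own illustration (not an example from the chapter), and its trajectory happens to settle into a limit cycle, of the kind discussed next.

```python
def step(state, functions, inputs):
    # One synchronous update: node i reads the states of its input nodes
    # and applies its Boolean function f_i.
    return tuple(f(tuple(state[j] for j in wires))
                 for f, wires in zip(functions, inputs))

# toy network: v1 <- NOT v3, v2 <- v1 AND v3, v3 <- v1 OR v2
functions = [lambda x: 1 - x[0], lambda x: x[0] & x[1], lambda x: x[0] | x[1]]
inputs = [(2,), (0, 2), (0, 1)]
state = (1, 0, 0)
for _ in range(5):
    state = step(state, functions, inputs)
    print(state)          # cycles back to (1, 0, 0) after four steps
```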
Consider the state space of a Boolean network with $n$ genes. The number of possible GAPs is then equal to $2^n$. For every GAP, there is a successor GAP into which the system transitions in accordance with its structural rules as defined by the Boolean functions. Thus, there is a directionality that is intrinsic to the dynamics of such systems. Consequently, the system ultimately transitions into so-called attractor states. The states of the system that flow into the same attractor state make up the basin of attraction of that attractor (Wuensche, 1998). Sometimes the system periodically cycles between several limit-cycle attractors. It is interesting to note that such behavior even exists for some infinite networks (networks with an infinite number of nodes) (Moran, 1995), such as those in which every Boolean function is the majority function. Moreover, the convergence of a discrete dynamical system to attractors should be well known to many researchers from the area of non-linear signal processing, where convergence to root signals has been studied for many classes of digital filters (Gabbouj et al., 1992). Root signals are those signals that are invariant to further processing by the same filter. Some filters are known to reduce any signal to a root signal after a finite number of passes, while others possess cyclic behavior. Although the large number of possible GAPs would seem to preclude computer-based analysis, simulations show that for networks with low connectivity, only a small number of GAPs actually correspond to attractors (Kauffman, 1993). Since other GAPs are unstable, the system is normally not found in those states unless perturbed. In fact, real gene regulatory networks are shown to have moderately low mean connectivity. For instance, the estimated mean connectivity in Escherichia coli is around 2 or 3 (Thieffry et al., 1998), while in higher metazoa it is between 4 and 8 (Arnone and Davidson, 1997). On the other hand, exact values of $k$ for different genes are unknown. Therefore, we let the indegree $k$, as well as the number of genes $n$ and the number of measurements $m$, be free parameters throughout this chapter.
3. The Best-Fit Extension Problem
One of the central goals in the development of network models is the inference of their structure from experimental data. In the strictest sense, this task falls under the umbrella of computational learning theory (Kearns and Vazirani, 1994). Essentially, we are interested in establishing “rules” or, in our case, Boolean functions by observing binary INPUT/OUTPUT relationships. Thus, this task can also be viewed as a system identification
problem. One approach is to study the so-called Consistency Problem, considered for Boolean networks in (Akutsu et al., 1999). The Consistency Problem is important in computational learning theory (Valiant, 1984) and can be thought of as a search for a rule from examples. That is, given some sets $T$ and $F$ of "true" and "false" vectors, respectively, we aim to discover a Boolean function $f$ that takes on the value 1 for all vectors in $T$ and the value 0 for all vectors in $F$. We may also assume that the target function $f$ is chosen from some class of possible target functions. One important reason for studying the complexity of the Consistency Problem is its relation to the PAC approximate learning model of Valiant (Valiant, 1984). If the Consistency Problem for a given class is NP-hard, then this class is not PAC-learnable. Moreover, this would also imply that this class cannot be learned with equivalence queries (Angluin, 1987). Unfortunately, in realistic situations we usually encounter errors that may lead to inconsistent examples. This is no doubt the case for gene expression profiles as measured from microarrays, regardless of how the binarization is performed. In order to cope with such inconsistencies, we can relax our requirement and attempt to establish a rule that makes the minimum number of misclassifications. This is called the Best-Fit Extension Problem and has been extensively studied in (Boros et al., 1998) for many function classes. We now briefly define the problem for Boolean functions; the generalization to Boolean networks is straightforward.

A partially defined Boolean function (pdBf) is defined by a pair of sets $(T,F)$ such that $T, F \subseteq \{0,1\}^n$, where $T$ is the set of true vectors and $F$ is the set of false vectors. A function $f$ is called an extension of pdBf$(T,F)$ if $T \subseteq T(f)$ and $F \subseteq F(f)$, where $T(f) = \{x \in \{0,1\}^n : f(x) = 1\}$ and $F(f) = \{x \in \{0,1\}^n : f(x) = 0\}$. Suppose that we are also given positive weights $w(x)$ for all vectors $x \in T \cup F$ and define $w(S) = \sum_{x \in S} w(x)$ for a subset $S \subseteq T \cup F$ (Boros et al., 1998). Then, the error size of the function $f$ is defined as

$$\varepsilon(f) = w(T \cap F(f)) + w(F \cap T(f)). \qquad (13.1)$$
If $w(x) = 1$ for all $x \in T \cup F$, then the error size is just the number of misclassifications. The goal is then to output subsets $T^*$ and $F^*$ such that $T^* \cap F^* = \emptyset$ and $T^* \cup F^* = T \cup F$, for which pdBf$(T^*,F^*)$ has an extension in some class of functions $C$ (chosen a priori), and such that $w(T^* \cap F) + w(F^* \cap T)$ is minimum. Consequently, any extension $f \in C$ of pdBf$(T^*,F^*)$ has minimum error size. In practice, one may have measured identical true and/or false vectors several times, each of them possibly associated with a different positive
weight. Since the weight of any subset $S \subseteq T \cup F$ is additive, it is justified to define the weight of $x \in T$ (resp. $F$) to be the sum of all weights of vectors $x$ belonging to $T$ (resp. $F$). So, before forming the sets $T$ and $F$ and assigning weights to their elements, one may need to take into account possible multiplicities in the original lists (or multisets) of measurements.

It is clear that the Best-Fit Extension Problem is computationally more difficult than the Consistency Problem, since the latter is a special case of the former, that is, when $\varepsilon(f) = 0$. The computational complexity of these problems has been studied for many function classes in (Boros et al., 1998). For example, the Best-Fit Extension Problem was proved to be polynomially solvable for all transitive classes and some others, while for many classes, including threshold, Horn, Unate, and positive self-dual functions, it was shown to be NP-hard. It is important to note here that if the class $C$ of functions is not restricted (i.e., all Boolean functions), then an extension exists if and only if $T$ and $F$ are disjoint. This can be checked in $O(|T| \cdot |F| \cdot \text{poly}(n))$ time, where poly$(n)$ is the time needed to answer "is $x = y$?" for $x \in T$, $y \in F$. This is precisely why attention has been focused on various subclasses of Boolean functions.

For the case of Boolean networks, we are given $n$ partially defined Boolean functions defined by the sets $(T_1,F_1),\dots,(T_n,F_n)$. Since we are making "genome-wide" observations, it follows that $|T_1 \cup F_1| = \dots = |T_n \cup F_n| = m$. Given some class of functions $C$, we say that the network $G(V,F)$ is consistent with the observations if $f_i$ from $C$ is an extension of pdBf$(T_i,F_i)$ for all $i$. In (Akutsu et al., 1999) it was shown that when $C$ is the class of Boolean functions containing no more than $k$ essential variables (maximum indegree of the network), the Consistency Problem is polynomially solvable in $n$ and $m$. In fact, it turns out that if we make no restriction whatsoever on the function class, the Consistency Problem for Boolean networks is still polynomial-time solvable, because for each node $v_i$ all we need to do is check whether or not $T_i \cap F_i = \emptyset$. For a restricted class $C$, we can say that if the Consistency Problem is polynomially solvable for one Boolean function (i.e., one node), then it is also polynomially solvable for the entire Boolean network, in terms of $n$ and $m$. The reason is that the time required to construct an extension simply has to be multiplied by $n$, the number of nodes. For example, as shown in (Akutsu et al., 1999), the time needed to construct one extension from the class of functions with $k$ essential variables ($k$ fixed) is $O(2^{2^k} \cdot n^k \cdot m \cdot \text{poly}(n))$, because there are a total of $2^{2^k}$ Boolean functions that must be checked for each of the $\binom{n}{k}$ possible combinations of variables and for $m$ observations. Thus, the Consistency Problem for the entire network can be solved in $O(2^{2^k} \cdot n^k \cdot m \cdot n \cdot \text{poly}(n))$ time, for fixed $k$. Instead
of performing an exhaustive search over all functions, one can accelerate the Consistency Problem by constructing the truth table of a (possibly) consistent function (Lähdesmäki et al., 2003), resulting in an improved time complexity of $O(n^k \cdot m \cdot n \cdot \text{poly}(n))$. We now see that the same must hold true for the Best-Fit Extension Problem as well. Consider again the class of functions with $k$ essential variables. Then, one can calculate the error size $\varepsilon(f)$ for every Boolean function $f$, for each of the $\binom{n}{k}$ possible combinations of variables, over all $m$ observations, and keep track of the minimum error size as well as the corresponding function and its variables. To generalize this to a Boolean network, the process must simply be repeated for every one of the $n$ nodes, essentially multiplying the time needed for obtaining a best-fit extension by $n$. Consequently, the Best-Fit Extension Problem is polynomial-time solvable for Boolean networks when all functions are assumed to have no more than $k$ essential variables. The resulting time complexity is of the order $O(2^{2^k} \cdot n^k \cdot m \cdot n \cdot \text{poly}(n))$. Moreover, if $C$ is the class of all Boolean functions (i.e., no restrictions), then the Best-Fit Extension Problem for Boolean networks can also be solved in polynomial time, by virtue of it being polynomially solvable for general Boolean functions (see (Boros et al., 1998)). So, we can say the following: if it is known that the Best-Fit Extension Problem is solvable in polynomial time in $n$ and $m$ for one Boolean function from class $C$, then the Best-Fit Extension Problem has a polynomial-time solution for a Boolean network in which all functions belong to class $C$. For example, it is known that for the class of monotone (positive) Boolean functions, the Boolean function version of the Best-Fit Extension Problem is polynomially solvable (Boros et al., 1998). Then it immediately follows that the Boolean network version of the Best-Fit Extension Problem is also polynomial-time solvable.

A faster method for solving the Best-Fit Extension Problem for a Boolean network consisting of $k$-variable Boolean functions was proposed in (Lähdesmäki et al., 2003). Consider the inference problem for a single node in the network. Assume that we have two vectors $c^{(0)}, c^{(1)} \in R^{2^k}$, where $R$ denotes the set of real numbers, and $c^{(0)}$ and $c^{(1)}$ are indexed from 1 to $2^k$ and are initially zero vectors. Let $s$ be a bijective mapping $s : \{0,1\}^n \to \{1,\dots,2^n\}$ that can be used to encode all binary vectors of length $n$ as positive integers. Then, during one pass over the given examples, $c^{(0)}$ and $c^{(1)}$ can be updated to

$$c_i^{(0)} = w(x), \text{ if } x \in F \wedge s(x) = i, \qquad c_i^{(1)} = w(x), \text{ if } x \in T \wedge s(x) = i. \qquad (13.2)$$

Elements of $c_i^{(0)}$ and $c_i^{(1)}$ that are not set in Equation (13.2) remain zero-valued due to initialization. Let a fully determined candidate function be $f$, with the corresponding truth table $\mathbf{f}$ having only binary elements; $f_{s(x)}$ defines the output value for the input vector $x$. Let $\bar{\mathbf{f}}$ denote the complement of $\mathbf{f}$, i.e., a vector where all zeros and ones are flipped. Then, the error size of the function $f$ can be written as

$$\varepsilon(f) = \sum_{i=1}^{2^k} c_i^{(\bar f_i)}. \qquad (13.3)$$

It is now easy to see that the error size is minimized when $c_i^{(\bar f_i)}$ is minimum or, conversely, $c_i^{(f_i)}$ is maximum for all $i$. Thus, the truth table of the optimal Boolean function is $f_i = \arg\max_j c_i^{(j)}$. So, the truth table of a function minimizing Equation (13.1) can be found within a single pass over the set elements of the vectors $c^{(0)}$ and $c^{(1)}$. Therefore, when the time spent on memory initialization is ignored, the Best-Fit Extension Problem can be solved, for the class of $k$-variable Boolean functions, in time $O(m \cdot \text{poly}(k))$. The generalization to the Boolean network is as straightforward as already shown for the Consistency Problem. That is, the above method must be applied to all $\binom{n}{k}$ variable combinations and all $n$ genes. So, the optimal solution of the Best-Fit Extension Problem for the entire Boolean network can be found in time

$$O(n^k \cdot m \cdot n \cdot \text{poly}(k)). \qquad (13.4)$$
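The single-pass procedure of (13.2)-(13.3) amounts to accumulating weights per input vector and then picking the heavier class at each truth-table entry. A minimal sketch of ours (names are our own; ties and unobserved inputs are resolved arbitrarily, as noted in the comments):

```python
def best_fit(xs, ys, weights=None):
    # Accumulate c^(0) (false) and c^(1) (true) weights per input vector,
    # then take the optimal truth table f_i = argmax_j c_i^(j) (Eq. 13.3).
    k = len(xs[0])
    c0, c1 = [0.0] * (1 << k), [0.0] * (1 << k)
    for x, y, w in zip(xs, ys, weights or [1.0] * len(ys)):
        i = int("".join(map(str, x)), 2)        # s(x): bit vector -> index
        (c1 if y else c0)[i] += w
    table = [int(a >= b) for a, b in zip(c1, c0)]   # ties/unseen default to 1
    error = sum(min(a, b) for a, b in zip(c0, c1))  # minimum error size
    return table, error

xs = [(0, 1), (0, 1), (1, 1), (1, 1), (1, 1)]
ys = [1, 0, 1, 1, 0]
print(best_fit(xs, ys))   # truth table over inputs 00,01,10,11 and error 2.0
```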
Gene expression time-series are usually noisy and the number of measurements is fairly small. Therefore, the best function is often not unique after applying the above inference scheme, and selecting only one predictor function per gene may lead to incorrect results. So, we may ultimately be interested in finding all functions $f$ having error size smaller than or equal to some threshold, i.e., $\varepsilon(f) \le \varepsilon_{\max}$ (Lähdesmäki et al., 2003).

We first consider only a single function and a single variable combination. Let us assume that we are considering the class of functions with $k$ variables and have already found the optimal function, in the sense of the Best-Fit Extension Problem, using the method introduced above. Thus, we know the vectors $c^{(0)}$ and $c^{(1)}$, the error size of the Best-Fit function $\varepsilon(f^{opt}) = \varepsilon^{opt}$, and the optimal binary function $f^{opt}$ itself through its truth table $\mathbf{f}^{opt}$. In order to simplify the following computation/notation, we introduce some new variables. Define $\mathbf{c}$ to contain the absolute values of the elementwise differences between $c^{(0)}$ and $c^{(1)}$, i.e., $c_i = |c_i^{(0)} - c_i^{(1)}|$, and let $\mathbf{f}'$ denote the truth table of a distorted (non-optimal) function $f'$. The truth table of $f'$ can now be written as $\mathbf{f}' = \mathbf{f}^{opt} \oplus \mathbf{d}$, where $\mathbf{d} \in \{0,1\}^{2^k}$ is the portion of distortion from the optimal function and $\oplus$ stands for addition modulo 2, i.e., $\mathbf{d} = \mathbf{f}' \oplus \mathbf{f}^{opt}$. Then, we get directly from Equation (13.3) that

$$\varepsilon(f') = \sum_{i=1}^{2^k} c_i^{(\bar f'_i)} = \sum_{i=1}^{2^k} c_i^{(\bar f^{opt}_i)} + \sum_{i : d_i = 1} c_i = \varepsilon^{opt} + \mathbf{c}^T \mathbf{d}, \qquad (13.5)$$

and can write the set of functions to be found, in terms of truth tables, as

$$\{\mathbf{f}^{opt} \oplus \mathbf{d} : \mathbf{c}^T \mathbf{d} \le \varepsilon_{\max} - \varepsilon^{opt}\}. \qquad (13.6)$$

At this stage the problem of finding the set of Boolean functions $\{f' : \varepsilon(f') \le \varepsilon_{\max}\}$ has been converted to a problem of finding a subset $D \subset \{0,1\}^{2^k}$ such that all elements $\mathbf{d} \in D$ share the property defined in Equation (13.6). Perhaps the easiest way of continuing is as follows. Let us first sort the vector $\mathbf{c}$ in increasing order, denote the sorted vector by $\mathbf{c}'$, and denote the corresponding permutation (sorting) and inverse permutation operators by $p$ and $p^{-1}$. Now, we aim to find the set of distortion vectors $D'$ in this new "permutation domain" that satisfy

$$\{\mathbf{d}' : \mathbf{c}'^T \mathbf{d}' \le \varepsilon_{\max} - \varepsilon^{opt}\}. \qquad (13.7)$$

Note that this is closely related to Equation (13.6). The following greedy recursive algorithm (Fig. 13.1) can be used to find $D'$ such that their inverse permutations together with $\mathbf{f}^{opt}$ define all functions $f$ satisfying $\varepsilon(f) \le \varepsilon_{\max}$. The statement for the while-loop is assumed to be evaluated in "short-circuit evaluation" fashion, which prevents indexing $\mathbf{c}'$ beyond its length. When we initialize $\mathbf{d}' \leftarrow (0 \dots 0)$, $D' \leftarrow \{\mathbf{d}'\}$ and $i_0 \leftarrow 0$, the set $D'$ will contain the proper "distortion" vectors, or their $p$ permutations, after the function call Distortion-Vectors($\mathbf{d}'$, $\varepsilon^{opt}$, $i_0$), assuming $\varepsilon(f^{opt}) \le \varepsilon_{\max}$. In order to find the final distortion vectors we still need to apply the inverse permutation $p^{-1}$ to the found vectors in $D'$. Then, the truth tables of the searched functions, satisfying $\{f : \varepsilon(f) \le \varepsilon_{\max}\}$, can be formed as $\{\mathbf{f}^{opt} \oplus \mathbf{d} : \mathbf{d} \in p^{-1}(D')\}$ (see (Lähdesmäki et al., 2003) for further details). The computational complexity, when including the sorting step, can be written as $O(2^k \cdot \log 2^k \cdot \text{poly}(k) + |\{f : \varepsilon(f) \le \varepsilon_{\max}\}|) = O(2^k \cdot k \cdot \text{poly}(k) + |\{f : \varepsilon(f) \le \varepsilon_{\max}\}|)$, assuming that we are considering the class of functions with $k$ variables. In order to find the functions for the whole Boolean network, one needs to apply the above procedure for all $\binom{n}{k}$ variable combinations and $n$ genes, essentially multiplying the time complexity by $\binom{n}{k} \cdot n$.

As in most estimation tasks, over-fitting is an issue in the inference of gene regulatory networks. That is, as the maximum indegree $k$ increases, we get a better fit for the training set, resulting in overtrained network models.
Distortion-Vectors(d′, ε, i)
    j ← i + 1
    while j ≤ 2^k ∧ ε + c′_j ≤ ε_max
        d′_j ← 1
        D′ ← D′ ∪ {d′}
        Distortion-Vectors(d′, ε + c′_j, j)
        d′_j ← 0
        j ← j + 1
    endwhile

Figure 13.1. A greedy algorithm that can be used to find the set of functions $\{f : \varepsilon(f) \le \varepsilon_{\max}\}$.
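For concreteness, here is a direct Python rendering of the pseudocode in Fig. 13.1 (our own translation; the 1-indexed pseudocode becomes 0-indexed lists, and the returned vectors still live in the sorted "permutation domain" of Eq. (13.7)).

```python
def distortion_vectors(c_sorted, eps_opt, eps_max):
    # Return all 0/1 distortion vectors d' with c'^T d' <= eps_max - eps_opt.
    L = len(c_sorted)
    d = [0] * L
    found = [tuple(d)]                 # the all-zero vector: f_opt itself
    def recurse(eps, i):
        j = i + 1
        while j <= L and eps + c_sorted[j - 1] <= eps_max:  # short-circuit
            d[j - 1] = 1
            found.append(tuple(d))
            recurse(eps + c_sorted[j - 1], j)
            d[j - 1] = 0
            j += 1
    recurse(eps_opt, 0)
    return found

print(distortion_vectors([1, 2, 5], eps_opt=1, eps_max=4))
# -> [(0,0,0), (1,0,0), (1,1,0), (0,1,0)], i.e. all d' with c'.d' <= 3
```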
A known fact is that models that are "too adapted" to the training set lose their generalization capability for cases outside the training set. In order to avoid possible over-fitting, it may be useful to consider some cross-validation approaches. For that purpose, let us first introduce a connection between the Best-Fit Extension paradigm and a pattern recognition method. Let us now consider the input-output patterns as random variables $X$ and $Y$ having a joint distribution $\pi$. In the pattern recognition context, it is common to search for a predictor (classifier) which has the smallest probability of misprediction, i.e., $f = \arg\min_{f \in C} P[f(X) \ne Y]$. When the joint distribution $\pi$ is unknown, a classification rule is applied to the sample data to construct a predictor. In the discrete setting, the most often used rule is the so-called histogram rule. The plug-in estimate of $\pi$ is $\hat\pi(x,y) = n(x,y)/m$, where $n(x,y)$ is the number of times $x$ and $y$ are observed jointly in the data and $m$ denotes the number of samples. The histogram rule finds a predictor which minimizes the resubstitution error on the given data set, i.e., $\hat f = \arg\min_{f \in C} \hat P[f(X) \ne Y]$, where the probability is computed relative to the plug-in estimate $\hat\pi$.

Next we consider the connection between the Best-Fit extension and the histogram rule. Given the plug-in estimate $\hat\pi$, define the corresponding Best-Fit weights for the seen inputs $x$ as $w(x) = |\hat\pi(x,0) - \hat\pi(x,1)|$, and set $x \in T$ if $\hat\pi(x,1) \ge \hat\pi(x,0)$, otherwise $x \in F$. Now it is easy to see that the error size of a function $f$ satisfies $\varepsilon(f) = w(T \cap F(f)) + w(F \cap T(f)) = \hat P[f(X) \ne Y] - \hat P[\hat f(X) \ne Y]$, where $\hat f$ is the optimal (unconstrained) Boolean predictor. Thus, minimizing the error size will also minimize the resubstitution error. Alternatively, given the weights and the sets $T$ and $F$, define a normalization constant
$C = w(T) + w(F)$. Then set the plug-in estimate $\hat\pi$ as follows: $\hat\pi(x,0) = w(x)/C$ if $x \in F$, $\hat\pi(x,1) = w(x)/C$ if $x \in T$, and otherwise $\hat\pi(x,0) = \hat\pi(x,1) = 0$. Now it follows that $\hat P[f(X) \ne Y] = \frac{1}{C}\varepsilon(f)$. So, a predictor $f$ which minimizes the resubstitution error also minimizes the error size. Thus, the Best-Fit extension and the histogram rule are tightly interconnected. This gives a more formal basis for the use of different error estimation and model selection strategies, such as cross-validation and bootstrap, together with the Best-Fit Extension method. A close connection with the resubstitution error estimate also enables the use of different distribution-free error bounds, which are usually based on the resubstitution error estimate (see e.g. (Devroye et al., 1996)).

Another possible approach to prevent over-fitting is to utilize some complexity-penalizing methods, such as the information-theoretic Minimum Description Length (MDL) principle. The use of the MDL principle in gene regulatory network learning was introduced in (Tabus and Astola, 2001), who considered the MDL criterion-based two-part code together with a particular encoding scheme for finding good predictor functions for a single gene. In detail, the two-part code consists of the code length for the measured data given the model and the code length for the model itself, often denoted by $L(D|M)$ and $L(M)$. The encoding method in (Tabus and Astola, 2001) encodes only the model residuals, i.e., the errors generated by the model (function) of interest. Based on the discussion above, it is easily observed that the Best-Fit Extension method can be utilized in the MDL-based inference approach as well. That is, with unity weights $w(x) = 1$ for all the INPUT/OUTPUT pairs, the error size of a Boolean function in the Best-Fit method is just the number of misclassifications. That measure provides a fast way of computing the MDL criterion in gene regulatory network inference. It may also be useful to constrain the search of possible predictor functions only to a certain (restricted) class of functions. Indeed, Harris et al. (2002) recently studied more than 150 known regulatory transcriptional systems with varying numbers of regulating components and found that these controlling rules are strongly biased toward so-called canalizing functions. Another family of function classes that has recently been shown to possess useful properties and proposed as models for genetic regulation are certain Post class functions (Shmulevich et al., 2003).
4. Simulation Analysis
In this section we present inference results for the cdc15 yeast gene expression time-series data set taken from (Spellman et al., 1998), where 799 cell cycle regulated genes were identified (Lähdesmäki et al., 2003). The
cell cycle regulated genes were split into five groups in each of which genes behaved similarly. The groups were referred to as G1, S, G2, M and M/G1, corresponding to different states of the cell cycle. In order to limit the computational complexity, we restricted our experiments to those 799 cell cycle regulated genes. Further, we did not infer the entire regulatory network, but searched for the functions of a selected set of genes. In this experiment, we were interested in the set of five genes: {Cln1, Cln2, Ndd1, Hhf1, Bud3}. (See (Spellman et al., 1998) for information about function, phase of peak expression, and other details of the selected genes.) Instead of selecting only one function per gene, we searched all functions with an error size not exceeding a pre-determined threshold εmax = 5. (When the number of variables is 3 we used εmax = 3 because the number of functions is quite large for ε = 4, 5.) We preprocessed the expression data using the following steps. The normalization was carried out as in (Spellman et al., 1998). The missing expression values in the cdc15 experiment were estimated by using the weighted K-nearest neighbors method introduced in (Troyanskaya et al., 2001). The number of neighbors in the estimation method was set to 15 and the Euclidean norm was used as the basis for the similarity measure, as suggested in (Troyanskaya et al., 2001). We omitted the genes that have more than 20% missing values. In order to check the biological relevance of our results we added some genes that are known to regulate the genes of interest. Some of the added genes were not found to be cell-cycle regulated in (Spellman et al., 1998). After these steps, the set of 799 cell cycle regulated genes was reduced to 733 genes. The data binarization was performed using the quantization method described in (Shmulevich and Zhang, 2002) together with a global threshold equal to 0. Since no measures of quality are available for our set of measurements, we used unity weights, w(x) = 1, for all measurements. The maximum indegree of the functions, k, was set to 1, 2, and 3. The resulting histograms of the error size of the found functions for all the selected genes are shown in Fig. 13.2. The horizontal axis corresponds to the error size and the vertical axis shows the number of functions for each gene that have the corresponding error size. The error sizes are also shown in numerical form in Table 13.1. Note that the set of functions with maximum indegree k = i + 1 contains all the functions with maximum indegree k = i. Recall that the column corresponding to zero error size contains the numbers of consistent functions. Recently, yeast cell cycle transcriptional activators have been studied in (Simon et al., 2001) by combining both expression measurements and location analysis. In that study, of particular interest were certain transcriptional regulators and cyclin genes, including the genes Cln1, Cln2,
Figure 13.2. The histograms of the error size of the found functions for all the selected genes. The maximum indegree is (a) k = 1, (b) k = 2 and (c) k = 3. The bars for the different genes (Cln1, Cln2, Ndd1, Hhf1, Bud3) are coded as shown on the graphs.
Ndd1 and Hhf1. We compare our results with those reported in that study in order to validate their biological significance. We also want to emphasize that, in order to simplify the function inference task,
Table 13.1. The number of functions having error size between 0 and 5 for all the selected genes. The maximum indegree is set to k = 1, k = 2 and k = 3. For indegree k = 3 we only show the number of functions for error sizes 0 through 3.

Indegree  Gene    0         1         2          3          4         5
k = 1     Cln1    0         0         3          8          23        42
          Cln2    0         0         7          18         41        60
          Ndd1    0         0         0          0          0         1
          Hhf1    0         0         2          20         50        68
          Bud3    0         0         5          15         38        50
k = 2     Cln1    1         175       4930       17022      47018     91895
          Cln2    12        432       9018       26425      61933     104595
          Ndd1    0         1         34         378        2664      16434
          Hhf1    21        332       4110       26665      69989     113637
          Bud3    16        510       6712       20769      50257     80427
k = 3     Cln1    27051     860663    12131506   50893601
          Cln2    118042    2526975   22822482   77088111
          Ndd1    420       13657     246522     2495556
          Hhf1    95834     1696500   14920744   72192691
          Bud3    93683     2127964   16209816   56397855
we could have concentrated only on the genes that are known to regulate each other. That is not, however, the purpose of the proposed inference method. Instead, we really want to search through the whole "space" of models (functions). The key motivation for doing so is to find possible predictor functions that are not already known. In the study by Simon et al. (2001), the complex Swi4/Swi6 was found to be a transcriptional regulator of the cyclin gene Cln1. In our analysis, the Best-Fit predictor function for the gene Cln1, with input variables {Swi4,Swi6}, had error size 3. Recently, genome-wide location analysis has also revealed that genes such as Mbp1 and Ndd1 can bind to the gene Cln1 (Lee et al., 2002). So, since the complex Swi4/Swi6 and the genes Mbp1 and Ndd1 all can presumably regulate the gene Cln1, it is instructive to look at the three variable predictor functions having input variables {Swi4,Swi6,Mbp1} and {Swi4,Swi6,Ndd1}. The Best-Fit predictor functions for those input variables turned out to have error size 2. Even though the error size for those functions is not the smallest, it is in clear agreement with the results shown in (Simon et al., 2001) and (Lee et al., 2002). For instance, the two variable predictor {Swi4,Swi6} is among the top 0.5% of all two variable predictors. Similarly, the three variable predictors
{Swi4,Swi6,Mbp1} and {Swi4,Swi6,Ndd1} are both among the top 0.08% of all three variable predictors. For the gene Cln2, a known regulator is Rme1 (Banerjee and Zhang, 2002). The Best-Fit error size for that single variable function is 4, and it is found to be among the top 2.3% of all one variable predictors. The transcriptional regulator Swi4/Swi6 has also been found to be able to regulate the gene Cln2 (Simon et al., 2001). Based on the data used, the Best-Fit method did not find that variable combination alone to be a potential regulator (error size 5). However, when looking at the three variable functions, e.g. with variables {Swi4,Swi6,Rme1} and {Swi4,Swi6,Clb2} (Clb2 can also regulate the gene Cln2 (Simon et al., 2001)), the minimum error sizes were found to be as low as 3 and 2, respectively. Previous studies have shown that the complexes Swi4/Swi6 and Swi6/Mbp1 can bind to the promoter region of the gene Ndd1. So, one would expect a function with variables {Swi4,Swi6,Mbp1} to be a reasonably good predictor for the gene Ndd1. However, the binary Best-Fit analysis on the used data set did not give very strong support for these genes being a good predictor set for the gene Ndd1; the error size was found to be 5. One possible reason, in addition to noise, could be that the expression profile of Ndd1 has 4 missing measurements in the cdc15 experiment. The interpolated values may have caused some artificial effects. In general, our results agree with most of the known regulatory behavior (Simon et al., 2001; Banerjee and Zhang, 2002; Lee et al., 2002). The Best-Fit functions for the known transcriptional regulators are among the best ones in almost every case. However, we can also identify some problems in our experiments. The first is the insufficient length of the time-series or, alternatively, the large size of the space of possible predictors. That is, the probability of finding a good predictor by chance is fairly high. For instance, since we are looking at cell cycle regulated genes only, two genes A and B may simply have similar (periodic) expression profiles with a proper "phase difference." If gene A has peaks and troughs in its expression profile slightly before gene B, it may be found to regulate gene B. In such a case, we are not able to decide whether or not the found transcription factor is spurious. Further biological measurements such as location data (Ren et al., 2000) or functional studies may be needed. Second, as the Best-Fit framework incorporates positive weights, it would have been beneficial if one had been able to utilize them in this experiment. We, on the other hand, only used equal weights for each vector in T and F. For instance, each measurement could be associated with some indicative measure of its quality. Then, instead of throwing away the low quality measurements, one is able to control their relative influence in the subsequent data analysis steps by down-weighting the most unreliable
ones. Estimating the quality of gene expression measurements has been discussed in (Hautaniemi et al., 2003). Once such quality measurements become available, they can be directly used in gene regulatory network inference within the Best-Fit Extension framework.
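As an illustration of the search procedure used in this section, the sketch below scores every predictor set of a given indegree by its Best-Fit error size (unity weights) on a binarized expression matrix. The matrix here is random and the gene names merely echo those discussed above; with the real data, X would hold the 733 binarized cell cycle profiles.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Hypothetical binarized expression matrix: rows = genes, columns = time-points.
genes = ["Cln1", "Swi4", "Swi6", "Mbp1", "Ndd1"]
X = rng.integers(0, 2, size=(len(genes), 20))      # random stand-in for real data
target, k = "Cln1", 2

tgt = genes.index(target)
# State transitions: inputs at time t predict the target at time t + 1.
T_in, t_out = X[:, :-1], X[tgt, 1:]

def best_fit_error(pred_idx):
    """Minimum error size over all Boolean functions of the given inputs
    (unity weights): sum over seen input patterns of min(#0s, #1s)."""
    patterns = {}
    for t in range(T_in.shape[1]):
        key = tuple(T_in[pred_idx, t])
        c0, c1 = patterns.get(key, (0, 0))
        patterns[key] = (c0 + (t_out[t] == 0), c1 + (t_out[t] == 1))
    return sum(min(c0, c1) for c0, c1 in patterns.values())

candidates = [i for i in range(len(genes)) if i != tgt]
scores = sorted((best_fit_error(list(s)), [genes[i] for i in s])
                for s in combinations(candidates, k))
for err, names in scores:
    print(err, names)
```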
5. Conclusions
The ability to efficiently infer the structure of Boolean networks has immense potential for understanding the regulatory interactions in real genetic networks. We have considered a learning strategy that is well suited for situations in which inconsistencies in observations are likely to occur. This strategy produces a Boolean network that makes as few misclassifications as possible and is a generalization of the well-known Consistency Problem. We have focused on the computational complexity of this problem. It turns out that for many function classes, the Best-Fit Extension Problem for Boolean networks is polynomial-time solvable, including those networks having bounded indegree and those in which no assumptions whatsoever about the functions are made. This promising result provides motivation for developing efficient algorithms for inferring network structures from gene expression data.
References

Akutsu, T., Kuhara, S., Maruyama, O. and Miyano, S. (1998) Identification of gene regulatory networks by strategic gene disruptions and gene overexpressions. Proc. of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'98), 695–702.
Akutsu, T., Miyano, S. and Kuhara, S. (1999) Identification of Genetic Networks from a Small Number of Gene Expression Patterns Under the Boolean Network Model. Pacific Symposium on Biocomputing 4, 17–28.
Akutsu, T., Miyano, S., and Kuhara, S. (2000) Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics 16, 727–734.
Angluin, D. (1987) Learning regular sets from queries and counterexamples. Information and Computation, 75:2, 87–106.
Arnone, M. I. and Davidson, E. H. (1997) The hardwiring of development: Organization and function of genomic regulatory systems. Development, 124, 1851–1864.
Baldi, P. and Hatfield, G. W. (2002) DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling, Cambridge University Press.
Banerjee, N. and Zhang, M. Q. (2002) Functional genomics as applied to mapping transcription regulatory networks. Current Opinion in Microbiology, 5(3), 313–317.
Boros, E., Ibaraki, T., and Makino, K. (1998) Error-Free and Best-Fit Extensions of Partially Defined Boolean Functions. Information and Computation, 140, 254–283.
Brazma, A. and Vilo, J. (2000) Gene expression data analysis. FEBS Letters 480, 17–24.
Chen, T., He, H. L. and Church, G. M. (1999) Modeling gene expression with differential equations. Pacific Symposium on Biocomputing, 4, 29–40.
D'Haeseleer, P., Wen, X., Fuhrman, S. and Somogyi, R. (1999) Linear modeling of mRNA expression levels during CNS development and injury. Pacific Symposium on Biocomputing, 4, 41–52.
de Hoon, M. J. L., Imoto, S., Kobayashi, K., Ogasawara, N. and Miyano, S. (2003) Inferring gene regulatory networks from time-ordered gene expression data of Bacillus subtilis using differential equations. Pacific Symposium on Biocomputing, 8, 17–28.
DeRisi, J.L., Iyer, V.R., and Brown, P.O. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680–686.
Devroye, L., Györfi, L. and Lugosi, G. (1996) A Probabilistic Theory of Pattern Recognition, Springer-Verlag.
Friedman, N., Linial, M., Nachman, I. and Pe'er, D. (2000) Using Bayesian Networks to Analyze Expression Data. Journal of Computational Biology, 7, 601–620.
Friedman, N. (2004) Inferring cellular networks using probabilistic graphical models. Science, 303, 799–805.
Gabbouj, M., Yu, P-T., and Coyle, E. J. (1992) Convergence behavior and root signal sets of stack filters. Circuits Systems & Signal Processing, 11:1, 171–193.
Glass, L. and Kauffman, S. A. (1973) The logical analysis of continuous non-linear biochemical control networks. Journal of Theoretical Biology, 39, 103–129.
Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C. and Lander, E. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Harris, S. E., Sawhill, B. K., Wuensche, A. and Kauffman, S. (2002) A model of transcriptional regulatory networks based on biases in the observed regulation rules. Complexity, 7, 23–40.
Hautaniemi, S., Edgren, H., Vesanen, P., Wolf, M., Järvinen, A. K., Yli-Harja, O., Astola, J., Kallioniemi, O. P. and Monni, O. (2003) A novel strategy for microarray quality control using Bayesian networks. Bioinformatics, 19(16), 2031–2038.
Huang, S. (1999) Gene expression profiling, genetic networks, and cellular states: an integrating concept for tumorigenesis and drug discovery. Journal of Molecular Medicine 77, 469–480.
Ideker, T. E., Thorsson, V. and Karp, R. M. (2000) Discovery of regulatory interactions through perturbation: Inference and experimental design. Pacific Symposium on Biocomputing, 5, 302–313.
Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore, T., Lee, J. C. F., Trent, J. M., Staudt, L. M., Hudson Jr., J., Boguski, M. S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P. O. (1999) The transcriptional program in the response of human fibroblasts to serum. Science, 283, 83–87.
Karp, R. M., Stoughton, R. and Yeung, K. Y. (1999) Algorithms for choosing differential gene expression experiments. RECOMB99, 208–217.
Kauffman, S. A. (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology, 22, 437–467.
Kauffman, S. A. (1993) The origins of order: Self-organization and selection in evolution, Oxford University Press, New York.
Kearns, M. J. and Vazirani, U. V. (1994) An Introduction to Computational Learning Theory, MIT Press.
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., et al. (2002) Transcriptional Regulatory Networks in Saccharomyces cerevisiae. Science, 298, 799–804.
Liang, S., Fuhrman, S. and Somogyi, R. (1998) REVEAL, A General Reverse Engineering Algorithm for Inference of Genetic Network Architectures. Pacific Symposium on Biocomputing 3, 18–29.
Lähdesmäki, H., Shmulevich, I. and Yli-Harja, O. (2003) On learning gene regulatory networks under the Boolean network model. Machine Learning, 52, 147–167.
Maki, Y., Tominaga, D., Okamoto, M., Watanabe, S. and Eguchi, Y. (2001) Development of a system for the inference of large scale genetic networks. Pacific Symposium on Biocomputing, 6, 446–458.
McAdams, H. H. and Arkin, A. (1997) Stochastic mechanisms in gene expression. Proc. Natl. Acad. Sci. USA, 94, 814–819.
Moran, G. (1995) On the period-two-property of the majority operator in infinite graphs. Trans. Amer. Math. Soc. 347, No. 5, 1649–1667.
Murphy, K. and Mian, S. (1999) Modelling Gene Expression Data using Dynamic Bayesian Networks. Technical Report, University of California, Berkeley.
Ren, B., Robert, F., Wyrick, J. J., Aparicio, O., Jennings, E. G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., et al. (2000) Genome-wide location and function of DNA binding proteins. Science, 290, 2306–2309.
Sakamoto, E. and Iba, H. (2001) Inferring a system of differential equations for a gene regulatory network by using genetic programming. Proc. Congress on Evolutionary Computation '01, 720–726.
Schena, M., Shalon, D., Davis, R. W., and Brown, P.O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467–470.
Shmulevich, I. and Zhang, W. (2002) Binary Analysis and Optimization-Based Normalization of Gene Expression Data. Bioinformatics, 18, 555–565.
Shmulevich, I., Dougherty, E. R., Kim, S., and Zhang, W. (2002) Probabilistic Boolean Networks: A Rule-based Uncertainty Model for Gene Regulatory Networks. Bioinformatics, 18, 261–274.
Shmulevich, I., Lähdesmäki, H., Dougherty, E. R., Astola, J. and Zhang, W. (2003) The role of certain Post classes in Boolean network models of genetic networks. Proc Natl Acad Sci USA, 100, 10734–10739.
Simon, I., Barnett, J., Hannett, N., Harbison, C. T., Rinaldi, N. J., Volkert, T. L., Wyrick, J. J., Zeitlinger, J., Gifford, D. K., Jaakkola, T. S. and Young, R. A. (2001) Serial regulation of transcriptional regulators in the yeast cell cycle. Cell, 106, 697–708.
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9, 3273–3297.
Szallasi, Z. and Liang, S. (1998) Modeling the Normal and Neoplastic Cell Cycle With Realistic Boolean Genetic Networks: Their Application for Understanding Carcinogenesis and Assessing Therapeutic Strategies. Pacific Symposium on Biocomputing 3, 66–76.
Tabus, I. and Astola, J. (2001) On the use of MDL principle in gene expression prediction. Journal of Applied Signal Processing, 4, 297–303.
Thieffry, D., Huerta, A. M., Pérez-Rueda, E., and Collado-Vides, J. (1998) From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli. BioEssays, 20:5, 433–440.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R. B. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.
van Someren, E. P., Wessels, L.F.A., and Reinders, M.J.T. (2000) Linear modeling of genetic networks from experimental data. Intelligent Systems for Molecular Biology (ISMB 2000), San Diego, August 19–23.
Valiant, L. G. (1984) A theory of the learnable. Comm. Assoc. Comput. Mach., 27, 1134–1142.
van't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R. and Friend, S. H. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536.
Weaver, D.C., Workman, C.T. and Stormo, G.D. (1999) Modeling Regulatory Networks with Weight Matrices. Pacific Symposium on Biocomputing, 4, 112–123.
Wen, X., Fuhrman, S., Michaels, G. S., Carr, D. B., Smith, S., Barker, J. L., and Somogyi, R. (1998) Large-Scale Temporal Gene Expression Mapping of Central Nervous System Development. Proc Natl Acad Sci USA, 95, 334–339.
Wuensche, A. (1998) Genomic Regulation Modeled as a Network with Basins of Attraction. Pacific Symp. on Biocomp. 3, 89–102.
Zhang, W., Shmulevich, I. and Astola, J. (2004) Microarray Quality Control, John Wiley and Sons.
Chapter 14
REGULARIZATION AND NOISE INJECTION FOR IMPROVING GENETIC NETWORK MODELS
Eugene van Someren 1, Lodewyk Wessels 1,2, Marcel Reinders 1, and Eric Backer 1
1 Information and Communication Theory Group, 2 Systems and Control Engineering, Faculty of Information Technology and Systems, Delft University of Technology, Delft, The Netherlands
1. Introduction
Genetic network modeling is the field of research that tries to find the underlying network of gene-gene interactions from the measured set of gene expressions. Up to now, several different modeling approaches have been suggested, such as Boolean networks (Liang et al., 1998), Bayesian networks (Friedman et al., 2000), linear networks (van Someren et al., 2000a; D'Haeseleer et al., 1999), neural networks (Weaver et al., 1999; Wahde and Hertz, 1999) and differential equations (Chen et al., 1999b). In these approaches, genetic interactions are represented by parameters in a parametric model which need to be inferred from the measured gene expressions over time. Current micro-array technology has caused a significant increase in the number of genes whose expression can be measured simultaneously on a single array. However, the number of measurements that are taken in a time-course experiment has not increased in a similar fashion. As a result, typical gene expression data sets consist of relatively few time-points (generally less than 20) with respect to the number of genes (thousands). This so-called dimensionality problem and the fact that measurements contain a substantial amount of measurement noise are two of the most fundamental problems in genetic network modeling. Generally, when learning parameters of genetic network models from ill-conditioned data (many genes, few time samples), the solutions become
arbitrary. Comparative studies (van Someren et al., 2001a; Wessels et al., 2001) have recently reported that at least a number of the currently proposed models suffer from poor inferential power when estimated from such data. The performance of these models can be significantly improved by employing additional (biological) knowledge about the properties of real genetic networks, such as limited connectivity, dynamic stability and robustness. In this chapter, several methods to impose robustness are considered. First, we present the fundamental problems of genetic network modeling and review currently known methodologies to overcome these problems. Then we introduce the concept of noise injection, where the original measured data set is artificially expanded with a set of noisy duplicates. In addition, two methods are presented, ridge regression and lasso regression, that impose robustness by directly minimizing the first derivatives. We show how each of these approaches influences the obtained interaction parameters in the case of linear models and to what extent they are equivalent. Finally, experimental investigations show under which conditions these three regularization methods improve the performance in comparison with other models currently proposed in the literature.
2. Current Approaches to Tackling the Dimensionality Problem
In order to capture the combinatorial nature of genetic relations (Holstege et al., 1998), gene regulation should be modelled as a network of genetic interactions rather than by methods that are based solely on pairwise comparisons (Arkin et al., 1997; Chen et al., 1999a). The inference of genetic network models is, however, especially hampered by the dimensionality problem. Fortunately, (biological) knowledge about the general properties of genetic networks can be employed to alleviate some of the data requirements. In particular, true genetic networks are assumed to be 1) sparsely connected, because genes are only influenced by a limited number of other genes (Arnone and Davidson, 1997), 2) robust against noise, because small fluctuations in the expression-state may not lead to large differences in the consecutive expression-states; in other words, the general behavior of the system is not influenced by the small stochastic disturbances which are inevitable in biological systems (D'Haeseleer et al., 1999), 3) redundant, because genes are known to share functionality (Wen et al., 1998), 4) dynamically stable, because the expression of a gene cannot grow without bound, and 5) smooth, because the activity levels of genes are assumed to behave smoothly over time (D'Haeseleer et al., 1999). This knowledge can be used to alleviate the dimensionality problem in the
modeling process either by a pre-processing step that modifies the data set prior to the inference process or by directly controlling the inference process through direct regularization. Pre-processing: With a pre-processing step the dimensionality problem can be reduced either by artificially reducing the number of genes or by artificially increasing the number of time-points. Reduction of the number of genes can be achieved by thresholding and/or clustering the data (Wahde and Hertz, 1999; van Someren et al., 2001a; Chen et al., 1999a). Thresholding is based on the fact that small signals should be rejected because of their strong corruption with measurement noise. Clustering solves the ambiguity problem by considering co-expressed genes as a single 'meta-gene' and is biologically inspired by the redundancy assumption. Unfortunately, the use of thresholding and clustering is limited to what is biologically plausible and can therefore only partly reduce the dimensionality problem. By exploiting the smoothness of gene expression as a function of time, the number of time-points can be increased by employing interpolation (D'Haeseleer et al., 1999). However, our experience shows that interpolation reduces the dimensionality problem only marginally, regardless of the number of time-points added. Direct Regularization: The use of biological constraints to regulate the inference process directly has only been applied in a few earlier papers. In previous work (van Someren et al., 2000a; van Someren et al., 2001b), we exploited the fact that networks have a limited connectivity to search directly for sparse linear networks. Weaver et al. (1999) also enforce sparsity, but only briefly describe an approach that iteratively sets small weights in the gene regulation matrix to zero. D'Haeseleer employs weight decay and weight elimination when learning recurrent neural networks to tackle the dimensionality problem (D'Haeseleer et al., 1999). Thus far, none of the reported modeling approaches have incorporated methodologies that explicitly increase robustness or dynamic stability of the obtained gene regulation matrices. In this chapter, we focus on methods that impose robustness.
3. Learning Genetic Network Models
In this section, the representation of gene expression data and genetic network models is first defined, and we describe how genetic interactions can be learned from data. Throughout this chapter, square brackets are used to denote concatenation of vectors or matrices, and the following notation for positive integers is used: N_K^+ = {1, 2, . . . , K}.
Data: Let g_i(t) represent the gene expression level of gene i at time-point t and let N genes be measured. Then an expression-state is represented by a column vector of all gene expression values measured at a certain point in time, i.e. g(t) = [g_1(t), . . . , g_N(t)]^T. A typical time-course gene expression data set reflects how one state of gene expression levels is followed by a consecutive state of gene expression. For learning a genetic network model, the data set is best represented as a training set of Q state-transition (input-output) pairs, represented by S = ⟨X; T⟩. Here, input matrix X = [x_1, x_2, . . . , x_Q] and target (output) matrix T = [t_1, t_2, . . . , t_Q] are matrices containing the concatenated input-states and target output-states, respectively. The pairs are constructed such that each input state x_q directly precedes the target (output) state t_q, i.e. if x_q = g(t) then t_q = g(t + 1). Genetic Network Model: A genetic network model is a representation of the genetic interactions such that for a given state of gene expression it can predict (output) the consecutive state(s). In general, a genetic network model represents a parameterized non-linear mapping, y = f(x), from an N-dimensional input vector into an N-dimensional output vector. One of the simplest continuous models is a linear genetic network model. This model is studied in this chapter because it allows insight into the consequences of the dimensionality problem and the effects of regularization, while its relatively limited number of parameters makes it less sensitive to the dimensionality problem. In addition, several other continuous genetic network models are just extensions of the linear model (Wessels et al., 2001). The linear model assumes that the gene expression level of each gene is the result of a weighted sum of all other gene expression levels at the previous time-point, or in vector notation:

y_q = W · x_q,    q ∈ N_Q^+    (14.1)
The interaction parameter w_ij represents the existence (w_ij ≠ 0) or absence (w_ij = 0) of a controlling action of gene j on gene i, whether it is activating (w_ij > 0) or inhibiting (w_ij < 0), as well as the strength (|w_ij|) of the relation. The complete matrix of interactions, W, is called the gene regulation matrix (GRM). To learn the gene regulation matrix, we simply require that the predicted states, y_q, are as close as possible to the target states, t_q, for example by requiring that the mean square error, E, be minimal, i.e.:

Ŵ = arg min_W E,  with  E = (1/2Q) Σ_{q=1}^{Q} ||y_q − t_q||²    (14.2)
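A minimal sketch of this data representation and objective: a made-up time-course matrix is converted into the training set S = ⟨X; T⟩ and pushed through a linear model; all values are random placeholders.

```python
import numpy as np

# Hypothetical time-course: N genes at 6 consecutive time-points (random stand-in).
N, time_points = 4, 6
rng = np.random.default_rng(1)
G = rng.standard_normal((N, time_points))   # column t is the state g(t)

# Training set S = <X; T>: each input state directly precedes its target state.
X = G[:, :-1]          # x_q = g(t),     an N x Q matrix
T = G[:, 1:]           # t_q = g(t + 1), an N x Q matrix
Q = X.shape[1]

# Linear genetic network model (Eq. 14.1) and its mean square error (Eq. 14.2).
W = rng.standard_normal((N, N))             # a candidate gene regulation matrix
E = np.sum((W @ X - T) ** 2) / (2 * Q)
print(Q, E)
```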
For problems affected by the dimensionality problem, minimizing E results in infinitely many GRMs, which all give a zero mean square error. In the linear case this is obvious, since each of the Q elements in the training set constitutes one linear equation in N unknowns and typically N >> Q, i.e. many unknowns and only a few equations to solve them. For example, in the case of a genetic network with two inputs, one output and one state-transition pair, this set of zero-error solutions corresponds to a line in the space of all solutions, as is illustrated in Fig. 14.1. Pseudo-inverse: A well-known solution to this problem is to select from the set of zero-error solutions the one with minimal sum of squared weights. This solution is known as the Moore-Penrose pseudo-inverse, for which the following analytical solution exists (see also Fig. 14.1):

Ŵ_PSEUDO = T · (X^T · X)^{-1} · X^T    (14.3)
This is one possible way to select a weight matrix from a whole array of possibilities. However, many other ways of regularizing the problem exist. In this chapter, we will focus on choosing the GRM such that the resulting genetic network model is robust.
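A small numerical illustration of Eq. 14.3 (assuming X has full column rank so that X^T X is invertible); numpy's pinv gives the same minimum-norm, zero-training-error solution.

```python
import numpy as np

rng = np.random.default_rng(2)
N, Q = 10, 4                     # under-determined: more genes than pairs
X = rng.standard_normal((N, Q))
T = rng.standard_normal((N, Q))

# Eq. 14.3: minimum-norm zero-error solution (X^T X is Q x Q and invertible here).
W_pseudo = T @ np.linalg.inv(X.T @ X) @ X.T
# Equivalent, and numerically safer in general: W = T @ pinv(X).
W_alt = T @ np.linalg.pinv(X)

print(np.allclose(W_pseudo @ X, T))          # zero training error
print(np.allclose(W_pseudo, W_alt))
```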
Figure 14.1. Weight space of an example genetic network with two inputs, one output and one state-transition pair given. Indicated are the set of zero-error solutions, the pseudo-inverse and a range of solutions for different parameter settings of ridge regression and lasso regression.
4. Robust Methods
A genetic network model is robust against noise in the input when small fluctuations in the input-state do not lead to large differences in the output-state. Noise Injection: The basic idea of noise injection is that robustness can be imposed by augmenting the original data set with state-transition pairs onto which small distortions are added, such that the model automatically learns to be robust. To achieve this, an extended training set, S′ = ⟨[X, X′]; [T, T′]⟩, is constructed that contains, for each state-transition pair in the original training set, the original pair and N_r repetitions of the original pair corrupted with zero-mean additive Gaussian noise of given standard deviation, σ_r, i.e.:

X′ = [x^{11}, x^{12}, . . . , x^{qm}, . . . , x^{QN_r}], where x^{qm} = x_q + x̃^{qm},
T′ = [t^{11}, t^{12}, . . . , t^{qm}, . . . , t^{QN_r}], where t^{qm} = t_q + t̃^{qm},
q ∈ N_Q^+,  m ∈ N_{N_r}^+,  x̃^{qm}, t̃^{qm} ∼ N(0, σ_r I)    (14.4)
Here, the double superscript qm indicates the m-th noise-corrupted copy of the q-th original. A pleasant side-effect of using the augmented training set, S′, is that, already for relatively small N_r, the (linear) model becomes over-determined. Consequently, least squares estimation can be employed to infer the gene regulation matrix from the data set S′. We denote this new approach as noise injection, for which a solution is obtained as follows:

Ŵ_NOISE = T′ · X′^T · (X′ · X′^T)^{-1}    (14.5)

The number of repetitions should always be chosen as high as is computationally feasible to avoid random effects. For the experiments in this chapter, we let it grow proportionally to the number of genes, i.e. N_r = 5N. Ridge Regression: An alternative approach to make a model less sensitive to deviations in the input is to directly minimize the model's first derivatives of the outputs with respect to the inputs, ∂y/∂x. This can be achieved by augmenting the original mean squared error (Eq. 14.2) with, amongst others, a quadratic term (Hastie et al., 2001). In the case of a linear model, with first derivatives ∂y_i/∂x_j = w_ij, this results in:

Ŵ_RIDGE = arg min_W [ (1/2Q) Σ_{q=1}^{Q} ||W · x_q − t_q||² + λ Σ_{i=1}^{N} Σ_{j=1}^{N} w_ij² ]    (14.6)
For this problem, an analytic solution exists, known as ridge regression (Hoerl and Kennard, 1970):

Ŵ_RIDGE = T · X^T · (X · X^T + λI)^{-1},    λ > 0    (14.7)
This method utilizes λ as a parameter that provides a tunable trade-off between data-fit and shrinkage of the weights. In general, ridge regression tends to produce solutions with many small, but non-zero, weights. Lasso Regression: Alternatively, the original mean squared error (Eq. 14.2) can be augmented with a slightly different penalty term, namely one that shrinks the sum of the absolute values of the weights (Tibshirani, 1994):

Ŵ_LASSO = arg min_W [ (1/2Q) Σ_{q=1}^{Q} ||W · x_q − t_q||² + γ Σ_{i=1}^{N} Σ_{j=1}^{N} |w_ij| ]    (14.8)
A solution to this equation can generally be found using only a few iterations of the EM-algorithm (Grandvalet, 1998). This method is called the Least Absolute Shrinkage and Selection Operator because it tends to shrink the weights such that only a few weights remain non-zero. Lasso regression is especially suited for genetic network modeling as it tries to produce models that exhibit robustness as well as limited connectivity. Parameter Setting: The performance of all three methods depends heavily on the right setting of the free parameter, i.e. the noise strength σ_r, λ and/or γ. Figure 14.1 depicts the different solutions of ridge and lasso regression for a range of λ and/or γ settings. The right setting of this parameter depends on the number of genes, the original connectivity, the underlying type of model, the number of measurements, but most heavily on the expected amount of measurement noise. At the expense of increased computational costs, we propose to employ a leave-one-out cross-validation procedure to automatically determine the right parameter for each new data set. Given a set of k possible parameter settings, at each step, this procedure 1) sets aside one sample out of the training set, 2) learns, using the remaining samples, k different model solutions for each of the k different parameter settings, 3) predicts, using the input of the left-out sample, k different outputs for each of the found k solutions, and 4) determines, using the output of the left-out sample and the k predictions, the corresponding prediction errors. This procedure is repeated for each sample in the training-set and the resulting prediction-errors, corresponding to one parameter setting, are averaged. The final model solution
is then learned on the complete training-set using the parameter setting that showed the lowest average prediction-error on the unseen data. Example: To further illustrate the difference in shrinkage, the solutions and corresponding errors of noise injection, ridge regression and lasso regression for different parameter settings are depicted in Fig. 14.2 for an example of a 27-gene network. The three plots on the left show how, for increasing parameter settings, the weights of each approach are shrunk. For further illustration, in all plots, the solution marked with stars on the y-axis does not correspond to the lowest parameter setting, but to the pseudo-inverse solution. All three methods start for low parameter settings (σ_r, λ, γ ↓ 0) with a different solution: while noise injection is only approximately similar to the pseudo-inverse, ridge regression is exactly equivalent to the pseudo-inverse, whereas lasso regression directly sets a substantial number of weights to zero. For very large parameter settings (λ, γ → ∞) noise injection approaches some fixed solution, while the ridge and lasso regression solutions approach the zero matrix (all weights zero). The right-most plots depict the corresponding training error and the average leave-one-out error. The parameter value that will be chosen based on the minimum of the leave-one-out error is indicated in all plots with a vertical dashed line.
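The shrinkage behavior just described can be reproduced on synthetic data with a few lines of code. The sketch below implements the three estimators; for the lasso we use plain iterative soft-thresholding rather than the EM iterations of (Grandvalet, 1998), which reaches the same optimum of Eq. 14.8. Sizes and the parameter grid are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
N, Q = 8, 5
X = rng.standard_normal((N, Q))
T = rng.standard_normal((N, Q))

def ridge(lam):
    # Eq. 14.7: W = T X^T (X X^T + lambda I)^(-1).
    return T @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(N))

def lasso(gamma, steps=5000):
    # Eq. 14.8 solved by iterative soft-thresholding (proximal gradient).
    W = np.zeros((N, N))
    L = np.linalg.norm(X, 2) ** 2 / Q          # Lipschitz constant of the gradient
    for _ in range(steps):
        W = W - ((W @ X - T) @ X.T / Q) / L    # gradient step on the MSE term
        W = np.sign(W) * np.maximum(np.abs(W) - gamma / L, 0.0)   # shrink
    return W

def noise_injection(sigma_r, Nr=5 * N):
    # Eq. 14.4-14.5: augment with Nr noisy copies, then ordinary least squares.
    Xa = np.hstack([X] + [X + rng.normal(0, sigma_r, X.shape) for _ in range(Nr)])
    Ta = np.hstack([T] + [T + rng.normal(0, sigma_r, T.shape) for _ in range(Nr)])
    return Ta @ Xa.T @ np.linalg.inv(Xa @ Xa.T)

for p in (0.01, 0.1, 1.0):
    Wr, Wl, Wn = ridge(p), lasso(p), noise_injection(p)
    print(f"param {p}: |W_ridge|={np.abs(Wr).sum():.2f}, "
          f"lasso non-zeros={(np.abs(Wl) > 1e-8).sum()}, "
          f"|W_noise|={np.abs(Wn).sum():.2f}")
```

As the parameter grows, the ridge weights shrink toward zero smoothly, the lasso count of non-zero weights drops, and the noise-injection solution stabilizes, mirroring the left-hand panels of Fig. 14.2.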
5. Noise Injection is Equivalent to Regularization
Bishop studied the relation between adding noise to a training set and regularization for a set of general non-linear models (Bishop, 1994). He has shown that, for small noise strength, adding noise to the data is equivalent to minimization of the following error function:

Ẽ = E + η² E^R    (14.9)
Here, E^R is controlled by the strength of the noise, η², and can be further approximated by (Eq. 19 in (Bishop, 1994)):

Ê^R = (1/2Q) Σ_{q=1}^{Q} Σ_{i=1}^{N} Σ_{j=1}^{N} ( ∂y_i^q / ∂x_j^q )²    (14.10)
Index i labels the N predicted genes and index j labels the N genes used as input. In the case of genetic network models the parameters remain constant for each training sample, so we can sum over all q; in the case of a linear model, the partial derivatives are simply the corresponding
Figure 14.2. Shrinkage of weights (left) and corresponding errors (right) as a function of the free parameter for noise injection (top), ridge regression (middle) and lasso regression (bottom). Weights corresponding to the pseudo-inverse are marked with stars.
weights in the gene regulation matrix. Thus, Eq. (14.10) simplifies to:

Ê^R = (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} w_ij²    (14.11)
Consequently, for small noise strength and a large number of repetitions, noise injection is equivalent to ridge regression. Furthermore, in the under-determined case and in the limit of very small λ and/or σ_r², respectively, both are equivalent to the application of the Moore-Penrose pseudo-inverse to the original data set. Obviously, the mean square error, E, then becomes zero since a perfect solution exists.
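This equivalence is easy to check numerically. In the sketch below we compare the noise-injection solution against ridge regression; the matching value λ ≈ Q σ_r² (in the parameterization of Eq. 14.7) is our own back-of-the-envelope calibration, not a constant given in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
N, Q, Nr, sigma_r = 6, 4, 2000, 0.05
X = rng.standard_normal((N, Q))
T = rng.standard_normal((N, Q))

# Noise injection: many noisy copies of each state-transition pair (Eqs. 14.4-14.5).
Xa = np.hstack([X] + [X + rng.normal(0, sigma_r, X.shape) for _ in range(Nr)])
Ta = np.hstack([T] + [T + rng.normal(0, sigma_r, T.shape) for _ in range(Nr)])
W_noise = Ta @ Xa.T @ np.linalg.inv(Xa @ Xa.T)

# Ridge regression (Eq. 14.7) with the assumed matching penalty lambda = Q sigma_r^2.
W_ridge = T @ X.T @ np.linalg.inv(X @ X.T + Q * sigma_r**2 * np.eye(N))

rel = np.linalg.norm(W_noise - W_ridge) / np.linalg.norm(W_ridge)
print(f"relative difference: {rel:.3f}")   # small for small sigma_r and large Nr
```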
6. Comparison with Other Models
To show that imposing robustness improves the performance of genetic network models, the four types of "robust" models (pseudo-inverse, noise injection, ridge regression and lasso regression) are compared with a reference model that always returns a random stable GRM of connectivity four, and with a model that employs hierarchical clustering (van Someren et al., 2000b), denoted as the cluster model. Experimental Setup: For a given number of genes, N, and a given connectivity (number of non-zero weights in each row of the GRM), C, a uniformly random gene regulation matrix is generated. To ensure that the random GRM is stable, all its elements are divided by the magnitude of the largest eigenvalue. Then, using a linear model and a given number of time-points, T, a subset is generated from a randomly chosen initial state. Measurement noise is added by corrupting all elements of this data set with zero-mean additive Gaussian noise of given standard deviation, σ_m. The performance was evaluated by monitoring the inferential power, the predictive power, the number of connections and the computational costs. The principal goal of genetic network modeling is to maximize the inferential power. This criterion expresses the correlation between the true genetic interactions and the relationships inferred by the model and is defined by:

IP = (1/2) (PC(W, Ŵ) + 1)    (14.12)

Here, PC returns the Pearson correlation coefficient between the two vectors that are obtained by transforming the entries of each matrix into one vector. The inferred model parameters are, however, determined by maximizing the predictive power, i.e. by minimizing the mean squared error (Eq. 14.2). The predictive power is given by:

PP = 1/(1 + E)    (14.13)
An important cause of decreased inferential power (which easily arises with insufficient regularization) is the variability of the obtained solutions
when small modifications are made to the data set. This inferential stability is determined by measuring the average inferential power between all solutions, Ŵ_k, obtained by inference on K corresponding noise-corrupted data sets, X_k = X + X̃, where the elements of X̃ are given by x̃_n^q ∼ N(0, .01). Formally, the inferential stability is defined as:

IS = (2/(K(K−1))) Σ_{k=1}^{K−1} Σ_{m=k+1}^{K} IP(Ŵ_k, Ŵ_m)    (14.14)
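The three performance measures translate directly into code; the following are straightforward transcriptions of Eqs. 14.12–14.14 for numpy arrays.

```python
import numpy as np

def inferential_power(W_true, W_hat):
    # Eq. 14.12: Pearson correlation between vectorized GRMs, mapped to [0, 1].
    pc = np.corrcoef(W_true.ravel(), W_hat.ravel())[0, 1]
    return 0.5 * (pc + 1.0)

def predictive_power(W_hat, X, T):
    # Eq. 14.13, with E the mean squared error of Eq. 14.2.
    E = np.sum((W_hat @ X - T) ** 2) / (2 * X.shape[1])
    return 1.0 / (1.0 + E)

def inferential_stability(W_hats):
    # Eq. 14.14: average pairwise IP over solutions from perturbed data sets.
    K = len(W_hats)
    pairs = [(k, m) for k in range(K) for m in range(k + 1, K)]
    return sum(inferential_power(W_hats[k], W_hats[m]) for k, m in pairs) / len(pairs)
```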
Experimental settings: In subsequent experiments, either the number of genes, N, was varied while the connectivity, C, was fixed at four, or the connectivity was varied while the number of genes was fixed at thirty. Available micro-array data generally consist of multiple small (time-course) experiments. Therefore, data sets were generated by merging four differently initialized subsets of T = 5 time-points each. Note that one such data set consists of Q = 4 · (T − 1) = 16 state-transition pairs. For each data set, σ_m was chosen such that the peak-signal-to-noise-ratio (PSNR) was 40 dB. Parameter values of σ_r, λ and γ were set to their optimal value (determined by the leave-one-out procedure). To filter out random effects, the performance of the models at each fixed condition (data-point) was determined by the average over 40 data sets. Experiments were done using GENLAB, a gene expression analysis toolbox developed by the authors and available at our website http://www.genlab.tudelft.nl. Results for small networks: Figure 14.3 depicts the inferential power, predictive power on a test-set, connectivity and computational costs of the different models for small networks of up to thirty genes. The gray patch denotes the average performance of the reference model plus and minus two standard deviations. The cluster model performs well for N < 16; however, it tackles the dimensionality problem only marginally and performs poorly for N > 16. A remarkable observation is that all robust models outperform the reference and cluster models in terms of inferential power, especially in the under-determined case (N > 16). This supports the suggestion that different ways of imposing robustness all improve the inferential power. Around N ≈ Q the pseudo-inverse shows a dip in performance with respect to the other robust models. This phenomenon is also observed in classification tasks and is probably caused by the fact that the covariance matrix becomes ill-conditioned when the number of samples is close to the number of dimensions so that, effectively, no regularization takes place. Both ridge regression and noise injection exhibit good and similar inferential power and do not show the same dip as the pseudo-inverse.
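The data-generating procedure described above can be sketched as follows; the PSNR convention used here (peak amplitude divided by 10^(dB/20)) is one common definition and is our assumption, as the chapter does not spell it out.

```python
import numpy as np

rng = np.random.default_rng(5)
N, C, T_pts, runs = 30, 4, 5, 4

# Random stable GRM: C non-zero entries per row, scaled by the spectral radius.
W = np.zeros((N, N))
for i in range(N):
    idx = rng.choice(N, size=C, replace=False)
    W[i, idx] = rng.uniform(-1, 1, size=C)
W /= np.abs(np.linalg.eigvals(W)).max()

# Merge four differently initialized time-courses of T = 5 points each.
X_list, T_list = [], []
for _ in range(runs):
    g = [rng.uniform(-1, 1, N)]
    for _ in range(T_pts - 1):
        g.append(W @ g[-1])
    g = np.column_stack(g)
    X_list.append(g[:, :-1])
    T_list.append(g[:, 1:])
X, T = np.hstack(X_list), np.hstack(T_list)   # Q = 4 (T - 1) = 16 pairs

# Additive measurement noise at a 40 dB peak-signal-to-noise ratio.
peak = np.abs(np.hstack([X, T])).max()
sigma_m = peak / (10 ** (40 / 20))
X += rng.normal(0, sigma_m, X.shape)
T += rng.normal(0, sigma_m, T.shape)
print(X.shape, T.shape, sigma_m)
```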
Figure 14.3. Inferential power, predictive power on test-set, connectivity and computational costs of six genetic network models (see legend) for small networks as a function of no. genes (x-axis). See text for more detailed description.
Most striking is the superior performance of lasso regression, which tries to find solutions that are not only robust, but also exhibit limited connectivity (since weights are set to zero). The inferential power shows strong correlation with the predictive power on a separate test-set, supporting the use of cross-validation to determine the right parameter setting. The connectivity of the obtained solutions shows that the pseudo-inverse, ridge regression and noise injection return fully connected solutions, whereas lasso regression limits the connectivity. However, lasso regression still over-estimates the connectivity (which should be four) under all conditions. Finally, it is shown that, as expected, the analytical solutions (pseudo-inverse and ridge regression) are considerably faster and thus more likely to scale up. Noise injection is the slowest model, because it needs to solve more equations due to the extension of the data set. Results for large networks: The inferential power and computational costs for larger networks are depicted in Fig. 14.4.
Figure 14.4. Inferential power and computational costs of six genetic network models (see legend) for large networks as a function of no. genes (x-axis).
Although decreasing performance for larger networks is to be expected, it becomes apparent that the performance of all models drops sharply when the network size increases from 30 to 200 genes, whereas the performance decreases more asymptotically for larger networks. At the same time, the exponential increase in the computational costs for noise injection caused us to refrain from applying noise injection to networks larger than 100 genes. For a hundred-gene network, it already took almost two hours to determine the connections going into a single node on a 450 MHz Pentium II. Results for varying connectivity: The difference in performance of the models is expected to depend strongly on the actual complexity of the original GRM. Therefore, the models were also tested on a 30-gene GRM with connectivity varying from 1 to 30. The results of this experiment are depicted in Fig. 14.5. The results indicate that the pseudo-inverse, noise injection and ridge regression perform equivalently and consistently across different connectivity conditions. The remarkably constant performance seems to indicate that the models are not affected by the addition of more connections, suggesting that the individual contributions of each gene can still be recognized. Furthermore, it becomes apparent that lasso regression favors data sets created by GRMs with low connectivity and performs worse than the pseudo-inverse, noise injection and ridge regression when the connectivity becomes larger than half the number of genes. Results of inferential stability: The inferential stability as a function of the number of genes for small networks is depicted in Fig. 14.6. Apart from a downward trend, the inferential stability shows strong correlation with the inferential power depicted in Fig. 14.3. These figures illustrate that low inferential stability implies low inferential power. Thus,
Figure 14.5. Inferential power of six genetic network models (see legend) for a 30-gene network as a function of connectivity (x-axis).
Figure 14.6. Inferential stability of six genetic network models (see legend) as a function of no. genes (x-axis).
it becomes apparent that the dip in the inferential power of the pseudo-inverse and the marginal performance of the cluster model are both caused by inferential instability. By imposing robustness (noise injection, ridge regression and lasso regression) the models become inferentially stable and the general performance is improved. However, the inferential power can be further improved by imposing other constraints that do not necessarily improve inferential stability. For example, the superior inferential power of lasso regression, which imposes limited connectivity, cannot fully be explained by significantly improved inferential stability.
7. Discussion
In this chapter, we have shown that imposing robustness generally improves the performance of genetic network models. In our view, direct
regularization is to be preferred to adding noise, and we believe it is an essential tool for successful genetic network modeling. However, when a good regularization term is hard to define or the inference process does not allow inclusion of such a term, adding noise will prove to be a very simple alternative to regularization. As a rule of thumb, the number of added samples should be chosen as high as is computationally feasible and the added noise strength should be chosen based on cross-validation. Whether these observations hold for non-linear models or even realistic data sets needs to be investigated further. Nevertheless, in addition to thresholding, clustering and interpolation, adding noise provides a simple and general tool to tackle the dimensionality problem and is definitely the most effective method thus far.
Acknowledgments
This work was funded by the Dioc-IMDS program of the Delft University of Technology.
References

Arkin, A., Shen, P., and Ross, J. (1997). A test case of correlation metric construction of a reaction pathway from measurements. Science, 277.
Arnone, A. and Davidson, B. (1997). The hardwiring of development: Organization and function of genomic regulatory systems. Development, 124:1851–1864.
Bishop, C. (1994). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116.
Chen, T., Filkov, V., and Skiena, S. (1999a). Identifying gene regulatory networks from experimental data. In Proceedings of the third annual international conference on Computational molecular biology (RECOMB99), pages 94–103. Association for Computing Machinery.
Chen, T., He, H., and Church, G. (1999b). Modeling gene expression with differential equations. In Pacific Symposium on Biocomputing '99, volume 4, pages 29–40. World Scientific Publishing Co.
D'Haeseleer, P., Wen, X., Fuhrman, S., and Somogyi, R. (1999). Linear modeling of mRNA expression levels during CNS development and injury. In Pacific Symposium on Biocomputing '99, volume 4, pages 41–52. World Scientific Publishing Co.
Friedman, N., Linial, M., Nachman, I., and Pe'er, D. (2000). Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7:601–620.
Grandvalet, Y. (1998). Least absolute shrinkage is equivalent to quadratic penalization. In International Conference on Artificial Neural Networks (ICANN'98), Skövde, Sweden.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
Hoerl, A. and Kennard, R. (1970). Ridge regression: Applications to nonorthogonal problems. Technometrics, 12:69–82.
Holstege, F., Jennings, E., Wyrick, J., Lee, T., Hengartner, C., Green, M., Golub, T., Lander, E., and Young, R. (1998). Dissecting the regulatory circuitry of a eukaryotic genome. Cell, 95(5):717–728.
Liang, S., Fuhrman, S., and Somogyi, R. (1998). Reveal, a general reverse engineering algorithm for inference of genetic network architectures. In Pacific Symposium on Biocomputing '98, volume 3, pages 18–29. World Scientific Publishing Co.
Tibshirani, R. (1994). Regression selection and shrinkage via the lasso. Technical report.
van Someren, E., Wessels, L., and Reinders, M. (2000a). Information extraction for modeling gene expressions. In Proceedings of the 21st Symposium on Information Theory in the Benelux, pages 215–222, Wassenaar, The Netherlands. Werkgemeenschap voor Information Theory in the Benelux.
van Someren, E., Wessels, L., and Reinders, M. (2000b). Linear modeling of genetic networks from experimental data. In Altman, R., Bailey, T., Bourne, P., Gribskov, M., Lengauer, T., Shindyalov, I., Eyck, L. T., and Weissig, H., editors, Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 355–366, La Jolla, California. AAAI.
van Someren, E., Wessels, L., and Reinders, M. (2001a). Genetic network models: A comparative study. In Proceedings of SPIE, Micro-arrays: Optical Technologies and Informatics, volume 4266, San Jose, California.
van Someren, E., Wessels, L., Reinders, M., and Backer, E. (2001b). Searching for limited connectivity in genetic network models. In Proceedings of the Second International Conference on Systems Biology (ICSB'01), Pasadena, California.
Wahde, M. and Hertz, J. (1999). Coarse-grained reverse engineering of genetic regulatory networks. In IPCAT 99.
Weaver, D., Workman, C., and Stormo, G. (1999). Modeling regulatory networks with weight matrices. In Pacific Symposium on Biocomputing '99, volume 4, pages 112–123, Hawaii. World Scientific Publishing Co.
Wen, X., Fuhrman, S., Michaels, G., Carr, D., Smith, S., Barker, J., and Somogyi, R. (1998). Large-scale temporal gene expression mapping of central nervous system development. In Proceedings of the National Academy of Sciences of the USA, volume 95, pages 334–339. National Academy of Sciences.
Wessels, L., van Someren, E., and Reinders, M. (2001). A comparison of genetic network models. In Pacific Symposium on Biocomputing 2001, volume 6, pages 508–519, Hawaii. World Scientific Publishing Co.
Chapter 15
PARALLEL COMPUTATION AND VISUALIZATION TOOLS FOR CODETERMINATION ANALYSIS OF MULTIVARIATE GENE EXPRESSION RELATIONS
Edward B. Suh 1, Edward R. Dougherty 2, Seungchan Kim 2, Michael L. Bittner 3, Yidong Chen 3, Daniel E. Russ 1, and Robert L. Martino 1
1 Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, Maryland, USA
2 Department of Electrical Engineering, Texas A&M University, College Station, Texas, USA
3 Cancer Genetic Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
1. Introduction
A key goal of functional genomics is to develop methods for determining ways in which individual actions of genes are integrated in the cell. Kim et al. have proposed using the nonlinear coefficient of determination for finding associations between genes (Kim et al., 2000a; Kim et al., 2000b; Dougherty et al., 2000). The method assesses the codetermination of gene transcriptional states based on statistical evaluation of reliably informative subsets of data derived from large-scale gene-expression measurements. It measures the degree to which the transcriptional levels of a small gene set can be used to predict the transcriptional state of a target gene in excess of the predictive capability of the mean level of the target. Conditions besides transcriptional features can be used as predictive elements, thus broadening the information that can be evaluated in modelling biological regulation. Application of the statistical framework to a large set of genes requires a prohibitive amount of computer time on a classical single-CPU computing machine. To meet the computational requirement, we have developed
a parallel implementation of the codetermination method, named the Parallel Analysis of Gene Expression (PAGE) program. PAGE has performed a codetermination analysis of melanoma skin cancer expression data in two weeks of computer time using 64 processors of a Beowulf parallel computing machine (Sterling et al., 1999) at the Center for Information Technology, National Institutes of Health (NIH). This analysis would have required about one and a half years on a sequential computing machine. PAGE consists mainly of three modules. The first module generates predictor gene sets out of a large gene pool. It has an exclusionary feature that eliminates any set that contains a subset in an exclusion list. The second computes the coefficient of determination (CoD) for the gene sets generated by the first module. The third module checks whether the CoD calculated from the data satisfies a certain theoretical requirement; if it does not, the module adjusts it accordingly. This chapter analyzes the performance of the parallel implementation. PAGE provides a massive amount of computationally derived information that is too large for a user to review directly. In order to facilitate intelligent navigation and further analyses of the computationally derived information, computer-aided data visualization and analysis tools have been developed. The data visualization software provides tools within a graphical user interface (GUI) environment for analysis, interpretation, and application by geneticists. Aspects of this data visualization program, Visualization of Gene Expression (VOGE), are illustrated.
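The first module's job, enumerating predictor sets while honoring an exclusion list, can be sketched in a few lines. The gene names and exclusion list below are illustrative only, and the real PAGE implementation distributes this enumeration across processors.

```python
from itertools import combinations

def predictor_sets(genes, k, exclusion_list):
    """Module 1 of PAGE (sketch): emit every k-gene predictor set, skipping
    any set that contains an excluded subset."""
    exclusion = [frozenset(e) for e in exclusion_list]
    for subset in combinations(genes, k):
        s = set(subset)
        if any(e <= s for e in exclusion):
            continue
        yield subset

# Hypothetical gene pool and exclusion list (illustrative names only).
pool = ["WNT5A", "MART1", "PGAM1", "RET1", "HADHB"]
excluded = [("WNT5A", "MART1")]
for s in predictor_sets(pool, 3, excluded):
    print(s)
```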
2. Codetermination Algorithm
When using multiple microarrays, raw signal intensities vary extensively due to the processes of preparing and printing the expressed sequence tag (EST) elements, and of preparing and labelling the cDNA representations of the RNA pools. This problem is solved via internal standardization. An algorithm calibrates the data internally to each microarray and statistically determines whether the data justify the conclusion that an expression is up- or down-regulated (Chen et al., 1997). Prior to analysis, data from a microarray are reduced by thresholding changes in transcript level into ternary form: −1 (down-regulated), +1 (up-regulated), or 0 (invariant). The determination paradigm involves nonlinear multivariate predictors acting on the ternary input data. A nonlinear predictor is a logical system L that estimates a target expression level Y based on a set of input expression levels X1, X2, . . . , Xk. The target Y is the output of a biological system S that is observed when X1, X2, . . . , Xk are input to S. The optimal predictor yields a logical value Ypred that best predicts Y. The target value is random owing to the various possible values of X1, X2, . . . , Xk that might be input
to S, as well as the fact that our knowledge of S is necessarily incomplete. Statistical training uses only the facts that X1, X2, . . . , Xk are among the inputs to S, that the output Y of S can be measured, and that a logical system L can be constructed whose output Ypred statistically approximates Y. The logic of L represents an operational model of our understanding. The theoretically optimal predictor, which has minimum error across the population, is unknown and must be statistically designed (estimated) from a sample. How well a designed predictor performs depends on the training procedure and the sample size n. The error εn of a designed predictor exceeds the error εopt of the optimal predictor. For a large number of microarrays, εn ≈ εopt, but for the small samples typically used, εn exceeds εopt. The coefficient of determination (CoD), θopt, of the optimal predictor is the relative decrease in error owing to the observed variables:

$$\theta_{opt} = \frac{\varepsilon_b - \varepsilon_{opt}}{\varepsilon_b} \qquad (15.1)$$

where εb is the error of the best predictor in the absence of observations. Since the error εopt of the optimal predictor cannot exceed εb, 0 ≤ θopt ≤ 1. If optimization is relative to mean-square error (MSE), which is the expectation E[|Ypred − Y|²], then the best predictor based on the predictor variables X1, X2, . . . , Xk is the conditional expectation of Y given X1, X2, . . . , Xk. The best predictor of Y in the absence of observations is its mean, μY, and the corresponding error is the variance of Y. Predicting Y by its mean might yield small or large MSE depending on the variance of Y. Thus, there is normalization by εb in Equation 15.1 to measure the effect of the observations. Here, the conditional expectation is quantized to ternary values, and εb is the error from predicting Y by applying the ternary threshold to μY. For designed predictors, εb and εopt are replaced by sample estimates, and errors are estimated by cross-validation. The coefficient of determination measures normalized predictive capability in a probabilistic, not causal, sense. A high CoD may indicate causality of the predictors on the target or vice versa. It may also indicate diffuse control along network pathways, or coregulation of both predictors and target by another mechanism. Hence, the CoD is called the codetermination in biology.
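To make Equation 15.1 concrete, the following minimal sketch (Python; our illustration, not part of PAGE, with hypothetical error values) computes the CoD from two error estimates:

    def cod(e_b, e_opt):
        """Coefficient of determination (Equation 15.1).

        e_b   : error of the best predictor without observations
                (predicting Y by the ternary threshold of its mean)
        e_opt : error of the designed predictor using X1..Xk
        """
        return (e_b - e_opt) / e_b

    # Hypothetical example: observations cut the prediction error
    # from 0.60 to 0.15, explaining 75% of the baseline error.
    print(cod(0.60, 0.15))  # 0.75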
3. Prediction System Design
In this and the next two sections, system design, parallel-program modules, and visualization of algorithm output are discussed. Although we have previously implemented a perceptron-based prediction system (Kim et al., 2000a; Kim et al., 2000b; Dougherty et al., 2000), we focus on the design of a logic-filter-based prediction system in this chapter.
Rather than use the ternary-quantized conditional expectation for the predictor, we use the conditional mode. This requires less computation, the differences between it and the conditional expectation are small, and it avoids predicting values that have not occurred in the samples, a desirable property for such coarse quantization. The ternary logic filter is constructed via the conditional mode: for any observation vector x, L(x) is the value of Y seen most often with x in the sample data. The size of the table defining the predictor grows exponentially with the number of predictor variables, and the number of conditional probabilities to estimate increases accordingly. For gene-expression ratio data, the number of pairs available for filter design is very limited; in fact, we often do not observe all vectors. When applying the filter to test data, there may be inputs not observed during design; a value needs to be assigned to such inputs, and the algorithm provides several choices. For a given target gene and predictor-gene combination, the algorithm first designs a nonlinear prediction model, the ternary logic filter, with a training data set. It then uses a test data set to evaluate predictor performance based on the coefficient of determination. A user may pre-select t target genes of interest. Finding gene sets with high CoDs for a target gene requires searching through one-, two-, three-, or up to k-predictor-gene combinations. There may exist prior knowledge to guide this combinatorial search, or we may search through the full space, which grows exponentially as k and the number of genes in the gene pool increase. For estimating the CoD, a user can choose a number, N, of repetitions of data splitting; the higher N, the better the expectation is estimated. The system has nested loops to go through N repetitions of data splitting, and has mCk = m!/(k!(m − k)!) combinations to compute for each of t target genes. Since incremental relations between smaller and larger predictor sets are important, it is necessary to calculate the CoD for k-predictor gene combinations for each k = 1, 2, 3, . . . up to some stopping point. The overall run-time (computing complexity) of these operations is O(N · mCk · t). While design and validation (error estimation) of a prediction model for a single predictor combination for a target gene may not require significant computational resources, the prediction system does require significant parallel computing performance as k and m are increased.
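The sketch below gives a minimal, self-contained version of this design-and-validate step (Python; our illustration, not the PAGE source). The ±0.5 quantization cutoffs, the use of misclassification error in place of the chapter's error measure, and the fallback of assigning unseen test inputs the ternary threshold of the training mean are our assumptions, the last being one of the several choices mentioned above:

    import numpy as np
    from collections import Counter, defaultdict

    def ternary_threshold(value):
        # Quantize a real value to -1, 0, +1 (cutoffs at +/-0.5 are assumed).
        return -1 if value < -0.5 else (1 if value > 0.5 else 0)

    def design_filter(X_train, y_train):
        """Conditional-mode ternary logic filter: map each observed
        input vector to the target value seen most often with it."""
        table = defaultdict(Counter)
        for x, y in zip(map(tuple, X_train), y_train):
            table[x][y] += 1
        return {x: c.most_common(1)[0][0] for x, c in table.items()}

    def estimate_cod(X_train, y_train, X_test, y_test):
        flt = design_filter(X_train, y_train)
        default = ternary_threshold(np.mean(y_train))  # fallback for unseen inputs
        y_pred = [flt.get(tuple(x), default) for x in X_test]
        e_opt = np.mean([p != y for p, y in zip(y_pred, y_test)])
        e_b = np.mean([default != y for y in y_test])  # baseline: mean's ternary value
        return 0.0 if e_b == 0 else (e_b - e_opt) / e_b

In PAGE, this design-and-validate step would be repeated over N random train/test splits and the resulting CoD estimates averaged.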
4. Parallel Analysis of Gene Expression (PAGE)
As more, and newer, types of biological information become available from the biomedical research community, parallel computing is required to analyze this information with new methods
in reasonable time (Martino et al., 1997; Suh et al., 1998). This section presents PAGE, a parallel program for analyzing the codetermination of gene transcriptional states from large-scale simultaneous gene expression measurements with cDNA microarrays.
4.1 The Three Sequential Algorithms and Motivation for Parallel Implementation
The program module gcombi generates k-predictor gene combinations from a set of m genes in lexicographic order. A k-predictor gene combination of an m-gene pool is obtained by selecting k distinct integers out of m and arranging them in increasing order. The total number of distinct k-combinations of an m-set is mCk. The computational complexity of the sequential gcombi program is gcombi(m, k) = O(mCk · k), and the run-time is proportional to this complexity. Parallelization of gcombi is necessary for large values of k and m. Optionally, a k-predictor combination can be excluded if it contains a subset of (k − 1) genes that appears unimportant in predicting the target gene. The module glogic uses a logical-filter-based design to find the CoD between expression values of different k-predictor gene combinations and a given target gene. Design and validation are performed independently on each k-predictor combination, and each combination takes about the same amount of computing time for the design and validation operation. Theoretically, adjoining predictors to a predictor set gives no worse performance; however, owing to estimation error, adding more predictors may yield a lower estimated CoD (Kim et al., 2000a; Kim et al., 2000b; Dougherty et al., 2000). The gadjust module therefore adjusts the CoD value of a k-predictor gene combination by comparing it to the CoD values of its (k − 1)-predictor subsets and changing it to the maximum of these CoD values if it does not exceed them all. The computational complexity of this adjustment operation has an upper bound of O(mCk · m).
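A compact sequential sketch of the gcombi exclusion test and the gadjust rule (Python, under our own naming; itertools supplies the lexicographic combinations, and cod_of_subsets is a hypothetical lookup function) might look as follows:

    from itertools import combinations

    def gcombi(m, k, excluded):
        """Yield k-combinations of gene indices 0..m-1 in lexicographic
        order, skipping any combination that contains an excluded
        (k-1)-gene subset. Excluded subsets must be sorted tuples."""
        excluded = set(map(tuple, excluded))
        for combo in combinations(range(m), k):
            if not any(sub in excluded for sub in combinations(combo, k - 1)):
                yield combo

    def gadjust(cod, combo, cod_of_subsets):
        """Raise a k-set CoD to the best CoD among its (k-1)-subsets
        when estimation error has pushed it below them."""
        best_sub = max(cod_of_subsets(sub)
                       for sub in combinations(combo, len(combo) - 1))
        return max(cod, best_sub)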
4.2 Parallel Implementation
As each of the three modules is computationally intensive, we implemented parallel versions of all three, creating the three phases of the parallel program PAGE. PAGE was implemented on a Beowulf cluster parallel architecture (Sterling et al., 1999) running the Red Hat Linux operating system. The Beowulf cluster is a distributed-memory parallel machine that facilitates single-program-multiple-data (SPMD) and multiple-program-
multiple-data (MPMD) parallel computing environments (Hwang, 1993). We use both the SPMD and the MPMD approaches, with the industry-standard Message Passing Interface (MPI) handling communication between the processing nodes in our parallel program, PAGE. The measures we use to evaluate the parallel performance of PAGE are speedup and efficiency. Speedup is defined as χp = Tseq/Tp, where Tseq is the sequential run-time on a single processor and Tp is the parallel run-time on P processors; Tp is the run-time of the processor that completes its assigned tasks last among the P processors. Ideal speedup is P. Efficiency, defined as Ep = χp/P, is a measure of how effectively a parallel program uses the processors. Ideal efficiency is 100%.
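These two measures are simple to compute; a minimal sketch (Python, plugging in the single-processor and 64-processor run-times reported for gcombi in Table 15.1 below) is:

    def speedup(t_seq, t_p):
        return t_seq / t_p

    def efficiency(t_seq, t_p, p):
        return 100.0 * speedup(t_seq, t_p) / p

    # Values taken from Table 15.1 (gcombi, 1 vs. 64 processors).
    print(round(speedup(24.51, 0.39), 1))         # ~62.8
    print(round(efficiency(24.51, 0.39, 64), 1))  # ~98.2 (the table reports 98.1,
                                                  #  presumably from unrounded times)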
4.3 Parallelization Methods
There exist several ways of distributing the computational work when implementing a parallel version of an algorithm, including uniform load assignment and dynamic load assignment.

Uniform load assignment. The uniform load assignment strategy was used for the first and second modules of PAGE. This strategy decomposes the data, D = {d1, d2, . . . , dq}, where q = mCk, into P contiguous partitions, S1 = {d1, d2, . . . , di}, S2 = {di+1, di+2, . . . , dj}, . . . , SP = {dt+1, dt+2, . . . , dq}, with an equal number of data elements in each partition (i = j − i = · · · = q − t). It then uniquely maps the P partitions onto P processors. Processors work in parallel on the data elements in their assigned partitions. Uniform load assignment works well if the data elements require equal computation times (workloads). If data elements require different computation times, workload imbalance among the P processors will cause the processors to complete their computations at different times, so that idle processors wait for the final processor to complete its job. The parallel execution time Tp is therefore determined by the maximum execution time Tmax among the processors, which can severely lower the efficiency.

Dynamic load assignment. To improve parallel performance by reducing load imbalance, we employed a dynamic load partition and assignment scheme for the third phase of PAGE. Load mapping (load balancing) consists of two operations: (1) partitioning a data domain into data partitions, and (2) assigning each partition to a certain processor in a parallel machine. Given a data domain D = {d1, d2, . . . , dq} with q data elements, the partitioning operation computes the total load WD of D to determine an ideal partition size λavg. Assuming ej is the
computational weight of data element dj, the total load is obtained by $W_D = \sum_{j=1}^{q} e_j$. An ideal partition size, λavg, is then obtained for P partitions (q ≥ P) by

$$\lambda_{avg} = \frac{1}{p} W_D = \frac{1}{p} \sum_{j=1}^{q} e_j . \qquad (15.2)$$

Using Equation 15.2, the partitioning operation must then partition D into evenly weighted data partitions, D = S0 ∪ S1 ∪ · · · ∪ Sp−1. The size of partition Si is determined by

$$S_i = \left\{ d_{k_i} \;\middle|\; 1 \le k_i \le q, \ \sum_{i=0}^{p-1} k_i = q \right\}, \qquad (15.3)$$

$$\lambda_{avg} \approx \lambda_{S_0} \approx \lambda_{S_1} \approx \cdots \approx \lambda_{S_{p-1}} = \sum_{k_p} e_{k_p}, \qquad (15.4)$$
where ki is the number of data elements in data partition Si, λSi is the total load of Si, and eki is the computational weight of data element dki in Si. The partitions Si may not contain equal numbers of data elements, because each data element has a different computational weight. The partitioning operation is highly problem dependent and is known to be NP-hard (Garey et al., 1979). The assignment operation assigns each data partition Si to a processor Pj in a way that minimizes the interconnection distance between communicating processors (dilation). Assuming cut-through wormhole routing, the dilation is considered to be one if processors Pj and Pk are directly connected; otherwise, the dilation is the hop count (the number of intermediate processors on the shortest interconnection path between processors Pj and Pk). The assignment operation is highly dependent on the parallel machine architecture and is known to be computationally intractable (Bokhari, 1981; Garey et al., 1979). To reduce processor idle time, it is necessary to perform the partitioning and assignment operations for an equitable load distribution over the processors. A dynamic load assignment scheme determines the partition sizes for the processors during run-time (Choudhary et al., 1993; Suh et al., 1998). In applying the dynamic load assignment scheme to gadjust, we group the k-predictor gene combinations into a linear chain of N disjoint partitions with equal computational costs, not equal numbers of combinations. A mapping π : {1, . . . , mCk} → {1, . . . , p} from the combinations to the processors generates, for each processor 1 ≤ i ≤ p, a pair (start[i], end[i]). Processor i gets all combinations j where start[i] ≤ j ≤ end[i] (Choudhary et al., 1993; Martino et al., 1997; Suh et al., 1998). The
load λl on a processor l under mapping π is the sum of the loads of the combinations assigned to it: λl = Σ{i|π(i)=l} load[i]. We assign combinations to a processor l starting from combination start[l] up to some combination end[l] until λl satisfies (λavg − δ) ≤ λl ≤ (λavg + δ). This approach yields good load balancing; however, the overhead associated with the computation of λavg and λl might be high.
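A sequential sketch of this chained greedy partitioning (our own minimal version, not the PAGE source) sweeps the per-combination loads once and closes a partition when its accumulated load reaches the ideal size λavg:

    def chain_partition(loads, p):
        """Split a list of per-combination loads into p contiguous
        partitions of roughly equal total load. Returns a list of
        (start, end) index pairs, end exclusive."""
        total = sum(loads)
        target = total / p                 # ideal partition size, lambda_avg
        bounds, start, acc = [], 0, 0.0
        for j, w in enumerate(loads):
            acc += w
            # Close the partition once it reaches the target, keeping
            # enough elements for the remaining processors.
            if (acc >= target and len(bounds) < p - 1
                    and (len(loads) - j - 1) >= (p - len(bounds) - 1)):
                bounds.append((start, j + 1))
                start, acc = j + 1, 0.0
        bounds.append((start, len(loads)))
        return bounds

    # Hypothetical weights for nine combinations split over three processors:
    print(chain_partition([5, 1, 1, 3, 2, 2, 4, 1, 1], 3))
    # -> [(0, 3), (3, 6), (6, 9)] with partition loads 7, 7, 6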
4.4 Parallel Versions of Algorithms
We implemented a parallel version of gcombi with exclusion by first having each processor determine all k-predictor gene combinations of the m-gene pool. Next, the combinations are linearly grouped into P partitions of equal size. In the third step, the P partitions are distributed to the P processors, each processor being assigned an equal number of combinations. Finally, each processor compares and excludes any k-predictor combination containing an excluded (k − 1)-predictor subset. Table 15.1 shows the parallel performance of gcombi for the generation of 3-predictor gene combinations from a 587-gene pool with an exclusion of 43,000 2-predictor gene combinations. The glogic module performs the same design and validation operation on each k-predictor combination, independently, and each combination takes about the same computing time; hence, parallelization can be achieved via the uniform load assignment scheme. Table 15.2 shows the parallel performance of glogic for the 3-predictor gene combination analysis of a single target gene out of 58 target genes in a 587-gene pool. The parallel glogic achieves almost linearly increasing parallel performance with increasing numbers of processors.
Table 15.1. Parallel performance of gcombi for 587C3 with an exclusion of 43,000 two-gene predictors.

Number of Processors   Runtime (hours)   Speedup   Efficiency (%)
 1                     24.51              1.0      100.0
 2                     12.25              2.0      100.0
 4                      6.12              4.0      100.0
 8                      3.02              8.1      101.3
16                      1.52             16.1      100.6
32                      0.77             31.8       99.4
64                      0.39             62.8       98.1
Table 15.2. Parallel performance of glogic for a single target gene with k = 3 and m = 586. (Note: When a gene is selected as a target gene, it is not used as a predictor. For each selected target gene, it is only necessary to compute the CoD for 586C3 = 586!/(3!(586 − 3)!) = 33,366,840 3-predictor gene combinations, where m = 587 − 1 = 586.)

Number of Processors   Runtime (hours)   Speedup   Efficiency (%)
 1                     72.71              1.0      100.0
 2                     36.35              2.0      100.0
 4                     18.36              4.0       99.0
 8                      9.29              7.8       97.8
16                      4.69             15.5       96.9
32                      2.37             30.7       95.9
64                      1.21             60.1       93.4
To analyze the data set for a melanoma study with 587 genes in the gene pool and 58 target genes, we designed and tested the model of a ternary logic filter with a full combination of up to three predictor genes on the NIH Beowulf machine. It took less than 3 days of computer time to analyze the 58 target genes using 32 nodes (64 processors); without the parallelization, it would have taken about 178 days of computer time. For the gadjust module, parallelization based on uniform partitioning did not perform well owing to load imbalance. Therefore, we implemented the dynamic load assignment scheme to minimize processor idle time and improve parallel performance. Using the dynamic load assignment scheme on the Beowulf system, we achieved the parallel performance shown in Table 15.3, which gives a detailed timing breakdown of gadjust for 58 target genes. Adjustment Runtime is the computer time taken to adjust the CoD values; as seen in the table, it scales well and achieves high parallel efficiency with an increasing number of processors. It uses the load partition and
Table 15.3. Parallel performance of the dynamic load assignment scheme on gadjust for 58 target genes. (Note: Total Runtime = Adjustment Runtime + I/O Time + Load Assignment.)

Number of    Adjustment        Efficiency   I/O Time   Load Assignment   Total Runtime   Efficiency
Processors   Runtime (hours)   (%)          (hours)    (hours)           (hours)         (%)
 1           35.43             100.0        9.19       0                 44.62           100.0
 2           20.63              85.9        9.08       0.19              29.90            74.6
 4           10.75              82.4        5.62       0.09              16.46            67.7
 8            5.41              81.8        2.37       0.05               7.83            71.2
16            2.63              84.2        1.62       0.05               4.30            64.8
assignment information calculated by the dynamic load partition and assignment phase. Total Runtime is the time taken to do both the CoD adjustment and the file transfer. As the number of processors increases, overall efficiency drops because the file I/O time becomes dominant compared to the CoD adjustment time.
5. Visualization of Gene Expression (VOGE)
PAGE produces a massive amount of computationally derived information (approximately 200 Gbytes) that is too large to study and interpret directly. Without the aid of an appropriate data visualizer, it would be almost impossible to interpret the derived information to determine associations between genes based on their expression patterns. To aid geneticists in their interpretation of the derived information, we have developed an interactive graphical visualization program, VOGE, that provides tools to display, manipulate, and interpret computationally derived information using the following integrated GUIs: decision tree graphs, tables, filters, histograms, and links to other genetic databanks. Figure 15.1 is a screen snapshot of VOGE that shows its data manipulation, filtering, and raw-data browsing facilities. The right window lists the target genes in a gene pool. Upon selection of a target gene, VOGE displays the CoD values in a spreadsheet-like environment with filtering functions that biologists at NHGRI/NIH have suggested are useful in extracting relevant relationships for a given study. If a user wants to see the primary data (microarray ratios), a window displaying them can be pulled up. Once a group of predictor genes is chosen, a user can visualize their relationships to target genes in an arrow diagram, as shown in Fig. 15.2. The highlighted blue circle in the center is the target gene of interest; it is expanded to display those gene combinations that have significant CoDs with respect to that target gene. The predictor genes in the selected combinations are displayed as red circles. If any predictor gene for a given target gene is also a target gene itself, it is displayed as a blue circle and is considered a sub-target gene for this display. Any sub-target gene can be expanded to display its predictor combinations. This GUI provides a tree-like view or a circular tree view with the capability to rotate or change viewing angles. It also provides an auto-plot feature that constructs a tree of gene combinations for a given target gene with a specified depth and number of children, and a zoom function to focus or defocus on a region of interest. Genes that are displayed more than once can be cut or copied onto each other. Upon a click on a gene, a popup menu appears with various options, such
Figure 15.1. VOGE: spreadsheet-like environment.
as a raw data visualizer and links to various genetic databanks to retrieve information on the chosen gene. The goal of this GUI is to provide access to the broad types of information that could be used to construct a hypothetical genetic regulatory network for a given target gene. While using the visualization environment, a user can connect via a Web browser interface to the UniGene database as well as other genetic databases to obtain relevant information, as shown in the left window of
Figure 15.2. VOGE: arrow-diagram visualizer with web-based UniGene connectivity.
Fig. 15.2. Access to all sources of prior knowledge concerning genes that a given gene has been shown to interact with, as well as processes a given gene has been shown to influence, is of tremendous value in pruning the large number of possible connections down to the most likely few. Currently, many tools are being constructed to carry out experimental searches for gene-to-gene interactions and to catalogue those interactions that have
already been reported in the literature. VOGE is designed to allow rapid connection to online representations of these data sets as soon as they are available. New features are continually being added to VOGE as NHGRI biologists gain more experience with this tool.
6. Summary and Conclusions
This chapter presents PAGE, a parallel program for analyzing the codetermination of gene transcriptional states from large-scale simultaneous gene expression measurements with cDNA microarrays, and its application to a large set of genes. Using PAGE, it was possible to compute coefficients of determination for all possible three-predictor sets from 587 genes for 58 targets in a reasonable amount of time. Given the limited sample sizes currently being used for microarray analysis, it is not necessary to go beyond three predictors at this time, since the data are insufficient for four-predictor CoD estimation. As shown in Tables 15.1, 15.2, and 15.3, significant speedups are achieved by the parallelization when compared to the sequential versions of the program modules. A related data visualization program, VOGE, helps geneticists navigate, manipulate, and interpret the massive amount of computationally derived information produced by PAGE. Tools provided in VOGE enhance the ability of researchers to interpret and apply the computationally derived information for understanding the functional roles of genes.
Acknowledgments

All biological data used in our experiments came from the Cancer Genetics Branch at the National Human Genome Research Institute. We are indebted to the Division of Computer System Services at the Center for Information Technology (CIT) of the National Institutes of Health (NIH) for providing computational resources and expertise, including the use of the CIT Beowulf system. The authors from the CIT, NIH, would also like to thank Thomas Pohida of the CIT, NIH, for introducing us to the exciting area of gene expression analysis.
References

Bokhari, S. H. (1981) On the Mapping Problem. IEEE Transactions on Computers, C-30, 207–214.
Chen, Y., Dougherty, E. R., and Bittner, M. L. (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Optics, 2, 364–374.
Choudhary, A. N., Narahari, B., and Krishnamurti, R. (1993) An Efficient Heuristic Scheme for Dynamic Remapping of Parallel Computation. Parallel Computing, 19, 633–649.
Dougherty, E. R., Kim, S., and Chen, Y. (2000) Coefficient of determination in nonlinear signal processing. Signal Processing, 80, 2219–2235.
Garey, M. R., and Johnson, D. S. (1979) Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman.
Hwang, K. (1993) Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, Inc.
Kim, S., Dougherty, E. R., Bittner, M. L., Chen, Y., Sivakumar, K., Meltzer, P., and Trent, J. M. (2000) A general nonlinear framework for the analysis of gene interaction via multivariate expression arrays. J. Biomed. Optics, 5, 411–424.
Kim, S., Dougherty, E. R., Chen, Y., Sivakumar, K., Meltzer, P., Trent, J. M., and Bittner, M. (2000) Multivariate measurement of gene expression relationships. Genomics, 67, 201–209.
Martino, R. L., Yap, T. K., and Suh, E. B. (1997) Parallel algorithms in molecular biology. High-Performance Computing and Networking, Springer-Verlag, 232–240.
Sterling, T. L., Salmon, J., Becker, D. J., and Savarese, D. F. (1999) How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters, The MIT Press.
Suh, E., Narahari, B., and Simha, R. (1998) Dynamic load balancing schemes for computing accessible surface area of protein molecules. Proceedings of the International Conference on High-Performance Computing (HiPC'98), 326–333.
Chapter 16

SINGLE NUCLEOTIDE POLYMORPHISMS AND THEIR APPLICATIONS

Rudy Guerra and Zhaoxia Yu
Department of Statistics, Rice University, Houston, Texas, USA
1. Introduction
Biotechnology has had a tremendous impact on modern biology, especially molecular biology and genetics. Genetic data of various types and resolution are now available, including DNA sequence, genotype, haplotype, allele-sharing, gene expression, and protein expression data. In addition to making these data available, biotechnology has also made possible high-throughput assays that can generate large amounts of data. A prime example is microarray technology, which allows RNA expression measurement on thousands of genes simultaneously. An important use of measured genetic data is in finding polymorphisms that underlie human diseases such as asthma, diabetes, and Alzheimer's disease, among others. To this end, one of the most popular types of genetic data comes in the form of single nucleotide polymorphisms (SNPs), which are estimated to occur every 600–1000 bp in the human genome. In this chapter we give general background on SNPs and their application in genetic association studies and haplotype reconstruction. References to SNP databases and application software are also given. The main context is human genetics. A few of the more important definitions and concepts necessary for the discussion are given below for easy reference.

Allele. Variant of a DNA sequence at a specific locus on a single chromosome, usually in reference to a gene or genetic marker.

Genotype. Paired alleles at a fixed locus on homologous chromosomes.

Haplotype. Allelic combination of different loci (markers, genes) on the same chromosome.
Genetic Polymorphism. DNA sequence variation across individuals, typically at loci such as SNPs and sequence repeats.

Genetic Marker. A segment of DNA with an identifiable physical location on a chromosome and whose inheritance can be followed. A marker can be a gene, or it can be some section of DNA with no known function. [NHGRI]

Genetic Linkage. The association of genes on the same chromosome. Two genes on the same chromosome are said to be linked. A common measure of association between two linked genes is the recombination fraction.

Understanding the genetic basis of phenotypes, including diseases, requires an appreciation and understanding of genetic polymorphisms. Indeed, polymorphisms known as mutations contribute to many diseases, and being able to detect sequence differences between normal and diseased individuals is an important step in understanding the genetics of disease. The Human Genome Project has generated and catalogued various classes of genetic polymorphism, which can be used to investigate genomic variation across individuals or groups of individuals. The largest class is the single nucleotide polymorphism, accounting for approximately 90% (Collins et al. 1998) of naturally occurring human DNA variation. The following definition of a SNP is due to Brookes (1999):

SNP: Single base pair positions in genomic DNA at which different sequence alternatives (alleles) exist in normal individuals in some population(s), wherein the least frequent allele has an abundance of 1% or greater.

The popular working definition of a SNP is a diallelic marker, but according to the above definition this is somewhat misleading (Brookes 1999). Nevertheless, there seems to be little harm in using the popular definition, by which a SNP is defined by two alleles, say A and G, at a specific location. Individuals would therefore have one of three genotypes: AA, AG, or GG. It is important to remember that DNA is double-stranded; the complementarity of DNA would seem to indicate that all four nucleotides are present at the base-pair location, since A and G individuals carry a T and C, respectively, on the complementary strand. However, in defining a SNP only one of the two complementary strands (Watson or Crick) is used. The typical frequency with which one observes single-base differences in genomic DNA from two equivalent chromosomes is on the order of 1/1000 bp (nucleotide diversity). The typical frequency of SNPs in a whole population is about 1/300 bp. By screening more individuals (more chromosomes) more base differences can be found, but the nucleotide
diversity index remains unchanged. SNPs are estimated to occur about once every 1000 bp, although SNP density can vary by as much as 100-fold across the human genome (Brookes 1999). SNPs are found throughout the entire genome, including promoters, exons, and introns. However, since less than 5% of a person's DNA codes for proteins, most SNPs are found in non-coding sequences. The most common (2/3) SNP type is defined by the alleles C and T. Although there is much promise and hope in using SNPs for the identification of genes that determine (complex) diseases, many questions remain as to how best to use SNP data for this purpose. One common approach is the case-control study design, in which the respective SNP patterns are compared (see below). This is reasonable for candidate genes or a relatively small collection of SNPs, but the ideal situation would be to compare large numbers of cases to controls with a dense set of SNPs over the entire genome. Achieving this, however, will require ultra-high-throughput assays (Isaksson et al. 2000) to discover, score, and genotype SNPs, and such technologies are now being introduced, including dynamic allele-specific hybridization (DASH) (Prince et al., 2001), reduced representation shotgun (RRS) sequencing (Altshuler et al., 2000), MALDI-TOF mass spectrometry (e.g., Bray, Boerwinkle, and Doris, 2001), and SNP microarrays (e.g., Matsuzaki, 2004). The identification and cataloguing of SNPs began in earnest with the Human Genome Project (Collins et al., 1998) and continues today. Beginning in the late 1990s, several private efforts, such as those of Genset, Incyte, and Celera, began to identify and build SNP databases. To ensure publicly available data for the scientific community, major efforts were initiated by the National Institutes of Health (NHGRI), The SNP Consortium (TSC), and the Human Genome Variation database (HGVbase) (Brookes et al., 2000; Fredman et al., 2002). In 2001, The SNP Consortium (The International SNP Map Working Group, 2001) reported 1.4 million SNPs in the human genome, with an average density of one SNP every 1.9 kb. Along with many other sources, these data were eventually deposited in a public SNP database, dbSNP (Sherry et al. 2001), maintained by the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov/SNP/). As of this writing, dbSNP (Build 122) reports 8.3 million SNPs in the human genome, with a density of about 28 SNPs per 10 kb. dbSNP also provides databases for other organisms, including mouse, rat, chicken, zebrafish, and the malaria parasite, among others. To facilitate research, NCBI provides cross-annotation with resources such as PubMed, GenBank, and LocusLink. dbSNP is also included in the Entrez retrieval system for integrated access to a number of other databases and software tools. Several other SNP databases continue to be available (e.g., HGVbase), but it appears that dbSNP will serve as the main public repository of SNPs.
The information provided by the SNP databases is a very important and valuable resource for research. Nevertheless, their usefulness is determined by quality and coverage. Since there are literally hundreds of sources that deposit SNP information into the databases, issues of quality are particularly important. The heart of the matter is whether or not a submitted SNP is real. Related issues concern SNP distribution relative to genic and non-genic regions, the allele-frequency spectrum of the SNPs, and frequency differences among racial groups. To address these and other related issues, the public databases make some provisions for quality control and validation. In an independent study of public databases, Jiang et al. (2003) compared records in dbSNP and HGVbase to their own database of SNPs in pharmaceutically relevant genes (promoters, exons, introns, exon-intron boundary regions). Of the 126,300 SNPs in 6788 genes from their database, Jiang et al. matched 22,257 SNPs to HGVbase and 27,775 SNPs to dbSNP. The Jiang et al. SNPs were found by resequencing a standard cohort of 70 unrelated individuals comprising four major ethnic groups. Jiang et al. were able to verify that 60% of the public SNPs with minor allele frequencies greater than 1% were real; the remainder are thought to be of very low frequency, mismapped, or not polymorphic. No sampling bias was found with respect to ethnicity at high-frequency SNPs. Jiang et al. report on seven other similar studies, with confirmations ranging from 45% to 95%. Factors that differed among the studies include sample size (number of chromosomes), SNP detection methods, and genome coverage. Remarkable are two studies (Mullikin et al. 2000; Altshuler et al. 2000) that each report 95% confirmation (Mullikin: 74/78 SNPs; Altshuler: 216/227 SNPs) using the reduced representation shotgun sequencing technique (Altshuler et al. 2000). The general advice is to carefully consider the sources of any SNPs extracted from the public databases. With careful scrutiny, these public databases provide a wealth of polymorphisms. SNPs can be used for a variety of purposes in both basic and applied research, ranging from theoretical population genetics to genetic counseling. In this chapter we focus on their use in genetic association studies and haplotype block reconstruction, both of which are used to help localize disease susceptibility genes. Readers interested in recombination, linkage disequilibrium, mutation, population admixture, estimation of population growth rates, and other aspects of population genetics and evolutionary history are referred to papers by Nielsen (2000), Pritchard and Przeworski (2001), Li and Stephens (2003), and Zhao et al. (2003), among others.
2. SNPs and Genotype-Phenotype Association
In many situations geneticists understand enough about a biological process that they can identify specific genes that may in part determine the
trait. Total cholesterol, for example, is a genetic trait, and much is known about how cholesterol gets in and out of the bloodstream, its role in coronary artery disease, and its relationship with environmental factors (e.g., diet and exercise). See, for example, Rader, Cohen, and Hobbs (2003) for a recent review of developments in the molecular pathogenesis and treatment of monogenic forms of severe hypercholesterolemia. In cases such as this, biomedical researchers can statistically analyze SNPs within candidate genes for their possible association (correlation) with the trait of interest. The possible relationship between a fixed locus and a phenotype has traditionally been studied with family data using genetic linkage analysis (Ott, 1999). In this approach, the idea is to estimate, through genetic recombination, the physical or genetic distance between a given locus and a putative trait locus: genetic markers that co-segregate with disease status provide evidence of a nearby trait locus. One limitation of genetic linkage is that the resolution of physical distance is on the order of megabases, whereas genetic association methods are believed to have detection levels on the order of kilobases (Risch and Merikangas, 1996; Kruglyak, 1999). In contrast to genetic linkage analysis, association analysis can be based on population data (unrelateds) or family data. In a population-based case-control design, an association analysis evaluates the null hypothesis of equal genotypic distributions between cases and controls. If the trait is continuous, a one-way analysis of variance can be used to evaluate a null hypothesis of equal means across the three genotypes at a SNP locus. Analogous tests can be conducted using alleles or haplotypes. If a significant association is detected, the conclusion is that the given marker locus is in linkage disequilibrium (Section 4.1 below) with a susceptibility locus.
To formalize the genetic case-control study, we make the following typical assumptions for a binary trait that defines cases and controls.

Genetic Case-Control Study Assumptions

a. There is a binary trait that defines cases and controls.

b. A random sample of n unrelated cases and a random sample of m unrelated controls are collected. The cases are independent of the controls.

c. A discrete risk factor that is obtained retrospectively is available for each case and control. In genetic studies the risk factor is usually defined by a set of alleles or genotypes of a genetic marker or gene.
Detailed discussions of case-control study design and analysis for genotype-phenotype associations are given in Khoury et al. (1993). In this setting, the numbers of cases and controls are fixed, and the allele or genotype counts within cases and controls are random. The comparison can be viewed as a test of homogeneity, whereby the distributions of case and control counts are assumed equal under the null hypothesis. Table 16.1 shows a generic contingency table for genotype counts corresponding to a SNP with alleles A and B. This approach is not unique to SNPs, since any genetic marker defined by two alleles or two allelic classes can be analyzed similarly. The first row shows case counts r0, r1, and r2 corresponding to genotypes with 0, 1, or 2 B-alleles, respectively, with sample size n = r0 + r1 + r2. Similarly, m = s0 + s1 + s2 for the control sample in the second row. The total sample size is N = n + m. The natural pairing of alleles as genotypes makes the above case-control genotype table a basis for preliminary analysis. However, other table structures are possible. If one considers the fact that the alleles at a given locus are inherited from two different parents, the data may be viewed as 2N independent observations, instead of the usual sample size of N. In this case, each individual contributes two alleles to the counts, as in Table 16.2. Cases have 2r0 + r1 A-alleles, since each AA individual contributes two A-alleles and each AB individual contributes one A-allele. The other counts are calculated similarly. This distribution of counts also corresponds (Lewis 2002) to a multiplicative model with a k-fold increased risk for AB and a k²-fold increased risk for BB.
Table 16.1. Case-control genotype counts.

          AA    AB    BB    Total
Case      r0    r1    r2    n
Control   s0    s1    s2    m
Total     n0    n1    n2    N
Table 16.2. Case-control allele distribution.

          A            B            Total
Case      2r0 + r1     2r2 + r1     2n
Control   2s0 + s1     2s2 + s1     2m
Total     n1 + 2n0     n1 + 2n2     2N
Table 16.3. Case-control dominance distribution.

          AA    AB + BB    Total
Case      r0    r1 + r2    n
Control   s0    s1 + s2    m
Total     n0    n1 + n2    N
A third tabulation of the counts is given in Table 16.3, where interest is in carrying allele B or not; the AB and BB genotypes are therefore indistinguishable as far as the analysis is concerned. This corresponds exactly to an assumption of the B-allele being dominant to the A-allele or, equivalently, the A-allele being recessive to the B-allele. It is also an appropriate formulation when one is unable to distinguish the BB homozygote from the AB heterozygote, as was common, for example, when HLA typing was done by serology (Sasieni, 1997). In all three approaches, a standard chi-square test can be used to compare the observed counts to the expected counts under a null hypothesis of equal distributions between cases and controls. A test of homogeneity comparing the case genotype distribution to the control genotype distribution in Table 16.1 is accomplished with the statistic

$$X^2 = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2$$
where the summation is over all six cells, Oi are the observed counts, and Ei are the expected counts. If cell i is defined by row j and column k, then Ejk is calculated as RjCk/(n + m), where Rj and Ck are the row-j and column-k totals, respectively. The X² statistic has an asymptotic chi-square distribution with two degrees of freedom. Because the distribution is asymptotic, the usual precautions regarding sample sizes should be kept in mind (Agresti, 1990). The genotype data in Table 16.1 can also be analyzed under an additive model assumption, where an r-fold increase in risk is associated with the AB genotype and a 2r-fold increase with the BB genotype (Lewis 2002). This can also be viewed as a "dosage-effect" analysis, and the test can be performed with Armitage's trend test (Armitage 1955; Agresti 1990). The other two scenarios can be analyzed similarly, with Tables 16.2 and 16.3 having 1 df in the χ² test.
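As an illustration, the chi-square test of homogeneity on a genotype table like Table 16.1 can be run with standard tools; the sketch below uses scipy with made-up counts:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical genotype counts (rows: case, control; columns: AA, AB, BB).
    table = np.array([[30, 45, 25],
                      [50, 40, 10]])

    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    print(f"X^2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")  # df = 2 for a 2x3 table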
Table 16.4. Case-control odds ratios.

Table   Odds ratio   Estimator
16.1    ψhet         r1 s0 / (r0 s1)
16.1    ψhom         r2 s0 / (r0 s2)
16.2    ψallele      (2r2 + r1)(2s0 + s1) / ((2r0 + r1)(2s2 + s1))
16.3    ψsero        (r1 + r2) s0 / (r0 (s1 + s2))
Sasieni (1997) gives an excellent discussion of the interpretation and comparison of these three approaches to analyzing genetic case-control studies. Using the AA homozygote as the reference group, the genotype counts allow for two 2-by-2 table odds ratios, another is available from the allele counts, and the "serological" table provides a fourth odds ratio, as shown in Table 16.4. The distinction between ψallele and ψsero is particularly important. The serological odds ratio, ψsero, compares the odds of disease in subjects with either an AB or BB genotype, that is, exposure to at least one B allele, to that in AA subjects. The comparison is ultimately at the genotype level, and there is no need to make a dominance assumption equating the risk of disease between BB homozygotes and AB heterozygotes. The allelic odds ratio, on the other hand, is a comparison at the allelic level. In the situation where the allele of interest (B) is rare, ψallele approximates the relative gene frequency in cases and controls. Sasieni (1997) remarks on the difficulty of interpreting ψallele, as it is hard to imagine the risk of an allele developing the disease, and thus generally recommends against using the allelic odds ratio. Under the null hypothesis of no association between the disease and the genetic locus, the chi-square statistics associated with genotype and serological data are both asymptotically χ² with 1 df, and they are locally most powerful under certain assumptions. Factors that affect the chi-square test include Hardy-Weinberg equilibrium and co-dominance of the allele of interest. Additional discussion is given by Lewis (2002), including the important topic of population stratification, which can lead to false-positive associations.
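A quick sketch of the four estimators in Table 16.4, using hypothetical genotype counts, might be:

    def odds_ratios(r0, r1, r2, s0, s1, s2):
        """Return the four case-control odds ratios of Table 16.4 from
        genotype counts (r: cases, s: controls; 0/1/2 copies of B)."""
        return {
            "het":    (r1 * s0) / (r0 * s1),
            "hom":    (r2 * s0) / (r0 * s2),
            "allele": ((2*r2 + r1) * (2*s0 + s1)) / ((2*r0 + r1) * (2*s2 + s1)),
            "sero":   ((r1 + r2) * s0) / (r0 * (s1 + s2)),
        }

    # Hypothetical counts as in the chi-square example above.
    print(odds_ratios(30, 45, 25, 50, 40, 10))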
3. SNPs, Haplotypes and Genetic Association
Association studies attempt to take advantage of linkage disequilibrium, which exists at small genetic distances. Thus, one has to be quite lucky to come upon a single SNP that is in LD with a trait locus. Haplotypes of SNPs, on the other hand, cover more genetic distance and may have more statistical power to detect trait loci than a single SNP. Figure 16.1 shows an example of a two-SNP haplotype; each SNP has alleles 0 and 1.
[Figure 16.1 (caption below) reports, among other quantities: single-SNP tables — χ² = 1.84, P = .175, Fisher exact = .475, D′ = 1; haplotype table — χ² = 10, P = .007, Fisher exact = .005.]
Figure 16.1. Single-SNP vs haplotype-SNP association. Each SNP has two alleles represented as black (1) and white (0) in the vertical bars. Haplotypes in the bar-figure and tables are read similarly; for example, 30% of the population has haplotype AD = 11. See text for additional explanation.
The vertical bar for SNP A shows that 30% of the population has allele 1, while 70% (30% + 40%) has allele 0. Considering SNP A together with the disease locus D, 30% of the population has haplotype AD = 11, 30% has AD = 01, and 40% has AD = 00. This information is also shown in the AD contingency table. The data for SNP B are read similarly. Using either the (asymptotic) χ²-test or Fisher's exact test, a sample of 10 individuals does not yield a significant association for AD or BD. However, when the haplotype AB is considered, we do obtain a significant result; D is positive (D = 1) when either A or B is positive, and zero otherwise. Figure 16.2 shows another example with different allele frequencies and sample size (n = 50). Here A is statistically associated with D, while B is not. The haplotype AB is strongly associated with disease. However, the results are difficult to interpret, as the homozygotes of AB (11, 00) are associated with a positive disease status, while the heterozygotes are not. Of course, if one is interested in looking at interactions between SNP loci, the combinations can be evaluated regardless of linkage disequilibrium between the loci. The lesson is that there must be some basis for constructing haplotypes for association, or a test more appropriate than the generic χ²-test should be considered; that is, the biology of the situation should motivate the analysis. Several authors have considered the relative performance of single-SNP and haplotype-SNP association studies. In general, haplotypes are shown
[Figure 16.2 (caption below) reports, among other quantities: a single-SNP table — χ² = 2.38, P = .123, Fisher exact = .217, D′ = 0.33; haplotype table — χ² = 29.17, P = 2E-6, Fisher exact = 6E-8.]
Figure 16.2. Single-SNP and haplotype-SNP association. See Fig. 16.1 for explanation.
to have better power, but see Yu and Guerra (2004) for a discussion of interpretation. Simulation studies are reported by Service et al. (1999) and Zollner and von Haeseler (2000). Akey and Xiong (2001) provide theoretical power calculations based on standard chi-square statistics. Morris and Kaplan (2002) consider the relative power of single-SNP and haplotype-SNP analyses in the situation where the disease locus has multiple susceptibility alleles. The answer depends on the degree of nonrandom association among the component SNPs of the haplotype: the weaker the linkage disequilibria among the SNPs, the better the haplotype analysis performs. One important class of haplotypes are those located within functional regions, since they may be able to capture cis-acting susceptibility variants interacting within the same gene (Epstein and Satten 2003; Neale and Sham 2004). Related discussions are given by Hirschhorn et al. (2002) and Pritchard and Cox (2002). It is also worth mentioning that several other efforts use multiple-SNP data for association, but not necessarily through haplotype analysis, for example: logistic regression (Cruickshanks et al., 1992); Bayesian genomic control (Devlin and Roeder 1999); logic regression (Kooperberg et al. 2001); sums of single-SNP statistics (Hoh and Ott 2003; Hao et al. 2004); and Hotelling's T² (Xiong et al. 2002; Fan and Knapp 2003).
3.1 Haplotype Methods for Genetic Association
As with single SNPs, when working with binary traits, a sample of haplotypes or estimated haplotype frequencies can be analyzed with a χ²-test. If
there are only a handful of common haplotypes, this approach is potentially useful; however, with more haplotype variation one encounters small cell counts, which can lead to incorrect p-values. Work on quantitative traits by Long and Langley (1999) compared phase-known haplotype association to single-SNP association. Their simulations showed that haplotype-SNP tests performed no better than single-SNP tests. However, the nature of their single-SNP test (HMP: haploid marker permutation test) is not exactly based on one-marker-at-a-time methodology. The single-SNP approach is based on the SNP (among many) that shows the highest ANOVA F-statistic, and its significance is evaluated by a permutation distribution obtained by permuting the (quantitative) phenotypes over the observed marker data (e.g., Churchill and Doerge 1994; Wan, Cohen, and Guerra 1997). The HMP test, therefore, is in fact a multiple-marker test which by construction only allows evaluation of a single SNP. If the collection of markers defines a haplotype, then this approach may be viewed as an approximate haplotype analysis, since all SNPs are being used through the identification of the most strongly correlated SNP. A bona fide single-SNP analysis allows for the possibility of multiple significant SNPs. Long and Langley (1999) also consider a haploid haplotype one-way ANOVA test (HHA), whereby the means associated with haplotypes from a sample of haploid individuals are compared: Yij = hi + εij, i = 1, . . . , H; j = 1, . . . , ni, where H distinct phase-known haplotypes are observed, each with population mean hi. If the error terms (εij) follow a Gaussian distribution with equal variances, then an F-test is used; otherwise a permutation test is suggested. ANOVA models are at the basis of other proposals. For example, Zaykin et al. (2002) propose haplotype methods for diploid individuals. For N individuals, one can view the 2N haplotypes as 2N observations and analyze them according to the HHA model of Long and Langley (1999); Zaykin et al. (2002) call this the "2N-based ANOVA model." An alternative approach, the haplotype trend regression (HTR) method, maintains the paired haplotypes within each individual. The regression model is Yi = Di β + εi, where Di is a design vector indicating the two haplotypes of individual i; Dij is 1 if individual i is homozygous for haplotype j, 1/2 if the heterozygous individual possesses haplotype j, and 0 otherwise. The authors suggest using permutation methods to test the null hypothesis of no haplotypic effect, H0: β1 = β2 = · · · = βH, for H distinct haplotypes. Individual haplotype effects can also be tested. Phase-unknown data (genotypes) are analyzed by using the Expectation-Maximization algorithm (Dempster, Laird, and Rubin 1977), as described below. In the case
of inferred haplotypes, the design matrix is replaced by conditional probabilities of haplotypes given the observed genotype data. Zaykin et al. (2002) conclude that haplotype analysis based on the HTR method can be more powerful than both the 2N-based ANOVA method and single-SNP studies. A generalization of the above ANOVA models to ambiguous-phase SNP data is given by Schaid et al. (2002), who propose generalized linear models to accommodate qualitative or quantitative phenotypes and non-genetic covariates. Score tests are used to evaluate hypotheses of global haplotype association or haplotype-specific association. A small sketch of the HTR design vector described above follows.
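To make the HTR design vector concrete, the sketch below (our illustration; the two-SNP haplotype codes are hypothetical) builds the row Di for one individual from an ordered pair of haplotypes:

    def htr_design_row(pair, haplotypes):
        """HTR design vector for one individual: 1 if homozygous for
        haplotype j, 1/2 if heterozygous carrying j, 0 otherwise."""
        row = []
        for h in haplotypes:
            carried = (pair[0] == h) + (pair[1] == h)  # 0, 1, or 2 copies
            row.append(carried / 2.0)
        return row

    haps = ["00", "01", "10", "11"]            # hypothetical two-SNP haplotypes
    print(htr_design_row(("01", "01"), haps))  # homozygous:   [0.0, 1.0, 0.0, 0.0]
    print(htr_design_row(("00", "11"), haps))  # heterozygous: [0.5, 0.0, 0.0, 0.5]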
3.2 Estimating Haplotypes with SNPs
Haplotypes ultimately determine the genetic variation in a population and are thus of interest beyond genetic association studies. One limitation of haplotype-based studies, however, is the fact that haplotypes are difficult to collect, much more so than genotypes. Although laboratory methods are available for obtaining haplotypes from multi-site genotype data (Saiki et al. 1985; Scharf et al. 1986; Newton et al. 1989; Wu et al. 1989; Ruano et al. 1990) or by genotyping family members (Perlin et al. 1994; Sobel and Lange 1996), the process is still time- and labor-consuming. As such, inferential methods for reconstructing haplotypes from more easily acquired genotype data are desired. The nature of the combinatorial problem is easily appreciated by considering the fact that the genotype of an individual heterozygous at n loci of unknown phase resolves into 2^(n−1) possible haplotype pairs. One of the major uses of SNPs is to infer haplotypes from unphased genotype data. The first major statistical method for inferring haplotypes was introduced by Clark (1990). In this approach, the genotypes of individuals with unambiguous phase are haplotyped first; the unambiguous phases are those in which no more than one site is heterozygous. A sequential iterative method is then used to haplotype the remaining individuals based on the haplotypes that have already been identified. The basic idea is to determine whether an ambiguous genotype could have arisen from one of the known haplotypes. Following Clark (1990), suppose one of the known haplotypes is ATGGTAC and that we have a 7-site sequence AT{G,C}G{C,T}AC with ambiguous third and fifth positions. This gives four possible haplotype assignments, one of which matches the known sequence ATGGTAC. Therefore, another count is added to ATGGTAC, and the homologous haplotype ATCGCAC is also added to the evolving haplotype list, since we must account for the observed genotype. This process continues until all data are exhausted. Matching
ambiguous haplotypes to the "known" group minimizes the total number of inferred haplotypes and is thus a type of parsimony method in the spirit of Ockham's razor. Two limitations of Clark's method are the possible lack of unambiguous haplotypes to start the process and the fact that the results can be sensitive to the order in which individuals are added to the "known" haplotype group. A toy version of Clark's matching step is sketched below.
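The sketch below is our own toy version of the matching step in Clark's method (biallelic sites; genotypes coded 0/1/2 as minor-allele counts, so 1 marks a heterozygous site):

    from itertools import product

    def resolutions(genotype):
        """All haplotype pairs consistent with a genotype coded as
        minor-allele counts (0, 1, 2); 1 marks a heterozygous site."""
        het = [i for i, g in enumerate(genotype) if g == 1]
        base = [g // 2 for g in genotype]          # 0->0, 2->1 at homozygous sites
        n = len(het)
        for bits in product((0, 1), repeat=max(n - 1, 0)):
            h1 = base[:]
            for site, b in zip(het, (0,) + bits):  # fix first het site: strands unordered
                h1[site] = b
            h2 = tuple(g - a for g, a in zip(genotype, h1))
            yield tuple(h1), h2

    def clark_step(genotype, known):
        """Clark's matching step: if some resolution contains a known
        haplotype, accept it and add its complement to the known set."""
        for h1, h2 in resolutions(genotype):
            if h1 in known or h2 in known:
                known.update([h1, h2])
                return h1, h2
        return None

    known = {(1, 0, 1), (0, 0, 0)}
    print(clark_step((2, 0, 1), known))  # -> ((1, 0, 0), (1, 0, 1)); (1, 0, 0) is added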
Excoffier and Slatkin (1995) proposed a maximum likelihood approach to estimate haplotype frequencies; see also Hawley and Kidd (1995) and Long et al. (1995). Following the notation of Stephens et al. (2001a), let G = (G1, G2, ..., Gn) denote the observed genotypes of n sampled individuals; H = (H1, H2, ..., Hn) their (unknown) haplotype pairs, Hi = (hi1, hi2); M the number of possible haplotypes in the population; F = (F1, F2, ..., FM) the set of unknown population haplotype frequencies; and f = (f1, f2, ..., fM) the unknown sample haplotype frequencies. In this application, the likelihood is a function of F given the observed genotype data G:

$$L(F) = \Pr(G \mid F) = \prod_{i=1}^{n} \Pr(G_i \mid F) = \prod_{i=1}^{n} \sum_{(h_1, h_2) \in \mathcal{H}_i} F_{h_1} F_{h_2},$$
where ℋi is the set of all ordered haplotype pairs consistent with the multi-site genotype data Gi. The likelihood calculations are based on the EM algorithm under an assumption of Hardy-Weinberg equilibrium (HWE). Let pi denote the population frequency of haplotype hi. In the E-step, the current haplotype frequencies are used to calculate phased-genotype probabilities. In the M-step, the haplotype frequencies that maximize the likelihood are calculated based on the updated phased-genotype probabilities from the previous E-step. The haplotype frequencies are easily found by counting if the gametic phases of the observed genotype data are known, which is precisely the essence of the "missing data" in the E-step. It is important to emphasize that the EM method as described above is for estimating haplotype frequencies per se; estimation of the haplotypes themselves is not obvious (Stephens et al., 2001a). The performance of the EM algorithm is excellent for large samples, regardless of the recombination rates among the loci. A minimal version of this EM iteration is sketched below.
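The following sketch is a minimal EM iteration for haplotype frequencies (our illustration, reusing the resolutions() helper from the Clark sketch above; genotypes are 0/1/2-coded, and a fixed iteration count stands in for a convergence test):

    from collections import defaultdict

    def em_haplotype_freqs(genotypes, n_iter=50):
        """EM estimate of haplotype frequencies from unphased
        genotypes, assuming Hardy-Weinberg equilibrium."""
        # Enumerate each genotype's consistent haplotype pairs once.
        pairs = [list(resolutions(g)) for g in genotypes]
        haps = {h for ps in pairs for pair in ps for h in pair}
        freq = {h: 1.0 / len(haps) for h in haps}   # uniform start
        n = len(genotypes)
        for _ in range(n_iter):
            counts = defaultdict(float)
            for ps in pairs:
                # E-step: posterior weight of each pair under current freqs.
                weights = [freq[h1] * freq[h2] for h1, h2 in ps]
                total = sum(weights)
                for (h1, h2), w in zip(ps, weights):
                    counts[h1] += w / total
                    counts[h2] += w / total
            # M-step: re-estimate frequencies from expected haplotype counts.
            freq = {h: counts[h] / (2 * n) for h in haps}
        return freq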
324
COMPUTATIONAL GENOMICS
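A minimal sketch of these two steps in Python is given below (hypothetical names; it assumes each individual's compatible haplotype pairs have already been enumerated, e.g. as in the Clark sketch above).

def em_haplotype_freqs(compatible_pairs, n_haps, n_iter=100):
    # compatible_pairs[i] lists the (h1, h2) index pairs consistent with
    # individual i's unphased genotype; n_haps is the number of candidate
    # haplotypes.
    n = len(compatible_pairs)
    freq = [1.0 / n_haps] * n_haps            # uniform start
    for _ in range(n_iter):
        counts = [0.0] * n_haps
        for pairs in compatible_pairs:
            # E-step: posterior phase probabilities under HWE. For a
            # genotype with a heterozygous site every compatible pair is
            # heterozygous, so the HWE factor of 2 cancels on normalizing.
            w = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(w) or 1.0
            for (h1, h2), wk in zip(pairs, w):
                counts[h1] += wk / total
                counts[h2] += wk / total
        # M-step: with phases probabilistically "counted", the MLE of each
        # frequency is its share of the 2n sampled chromosomes.
        freq = [c / (2.0 * n) for c in counts]
    return freq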
To avoid convergence to local maxima, Tregouet et al. (2004) introduced a stochastic version of the EM algorithm. The number of loci that can be efficiently haplotyped by EM-type approaches is limited: as the number of loci increases, the computer space needed to store the haplotypes grows exponentially. More recent estimation methods gain computational efficiency by using population genetic theory. For example, the coalescent (Kingman 1982) is used to better predict haplotype patterns in natural populations. By exploiting such a priori expectations about haplotype patterns, Stephens, Smith, and Donnelly (2001a) proposed a novel class of Bayesian models for haplotype inference. In addition to significantly reducing error rates, the proposed method has the added bonus of providing a measure of uncertainty in the phase calls. Like the EM algorithm, their approach views the inference as a "missing data" problem whereby the haplotypes are treated as unobserved variables whose conditional distribution is estimated given the unphased genotype data. The method is based on a Gibbs sampler whereby an estimated stationary distribution, $\Pr(H \mid G)$, of a Markov chain is sampled to obtain the reconstructed haplotypes. Two other notable Bayesian contributions have been made following the ideas of Stephens et al. (2001a). Niu et al. (2002) introduced prior annealing and partition ligation to enhance the speed of haplotype reconstruction. Prior annealing protects the algorithm from converging to a local maximum, while partition ligation addresses the difficult problem of estimating haplotype frequencies over a large number of contiguous sites. To this end, the whole region is partitioned into several mutually exclusive shorter segments, each of which can be efficiently analyzed. Estimation of haplotype frequencies within each segment, as well as the reassembly of the entire segment, is accomplished by Gibbs sampling. Lin et al. (2002) proposed a version of the Stephens-Smith-Donnelly algorithm with two modifications. First, they account for missing data. Second, they avoid the problem of guessing haplotypes at random when an individual does not match any known (or already inferred) haplotypes in the sample; recall Clark's method. To maintain the basic idea that future-sampled haplotypes should resemble what has already been observed, Lin et al. (2002) suggest looking for matches only at sites where the individual is heterozygous; clearly, homozygous sites already help to fix the haplotype reconstruction. Both simulated (Stephens et al. 2001a) and real data (Stephens et al. 2001b) demonstrate that the Stephens-Smith-Donnelly algorithm performed better than existing methods of the time. More recently, Stephens and Donnelly (2003) compared the Stephens-Smith-Donnelly algorithm to those of Niu et al. (2002) and Lin et al. (2002). In addition, they introduced a new
algorithm that incorporates the modeling and computational strategies from all three Bayesian approaches. The new algorithm (PHASE 2.0) outperforms the individual algorithms when analyzing phase-unknown population data. It is worth noting that the maximum likelihood estimates of haplotype frequencies coincide with the mode of a posterior distribution under a Dirichlet prior. Thus, the EM algorithm approach (which is one of many ways to find MLEs) to estimating haplotype frequencies is an instance of a Bayesian method, albeit with an unrealistic prior as discussed above (Stephens et al. 2001b). The above methods (Clark, EM, Stephens-Smith-Donnelly) seem to form the basis of many algorithms for haplotype reconstruction. Several other approaches, however, are available; the following are not exhaustive but give a flavor of the different types. The partition-ligation approach has also been applied with the EM algorithm (Qin et al. 2002). To accommodate a large number of loci, Clayton's SNPHAP program implements a sequential method that starts with two loci and then adds one locus at a time until completion. Researchers from computer science and engineering have also considered haplotype reconstruction. Gusfield (2002) and Eskin et al. (2003) propose phylogenetic approaches, and Eronen et al. (2004) introduce a Markov chain approach aimed at long marker maps with variable linkage disequilibrium. Unlike other methods, the Markov approach does not treat the entire haplotype as a unit; instead, it specifically accommodates recombination and works with "local" haplotype fragments that may be conserved over many generations. A selection of software packages for haplotype reconstruction is listed in Section 6 of this chapter.
4. SNPs and Haplotype Blocks
One problem in genome-wide mapping studies is the multiple testing of many markers, which can lead to a large number of false-positive genotype-phenotype correlations. Adjustments for multiple testing can in turn be affected by correlated test statistics associated with neighboring markers. In this section we consider the problem of finding regions of high linkage disequilibrium over chromosomal segments, the so-called haplotype block reconstruction problem. Haplotype block structures can minimize correlation among tests and reduce the problem of multiple testing by treating each block as a single multilocus marker. Daly et al. (2001), Patil et al. (2001), and Gabriel et al. (2002) were among the first to formalize the idea that haplotype blocks prevail over the human genome and that this structure could be found, with historical recombination a natural source of boundaries between blocks. Although
debate still exists regarding the biological plausibility of haplotype blocks, the evidence for their existence outweighs the evidence against. Wall and Pritchard (2003b) provide an excellent review of these and several other issues.
4.1 Linkage Disequilibrium
To estimate the haplotype block structure on a chromosomal segment, a measure of correlation between alleles on the same chromosome is needed; the one most commonly used is linkage disequilibrium (LD) (Lewontin and Kojima, 1960; Weir, 1990), also called gametic phase disequilibrium or allelic association. Consider two diallelic loci A and B with alleles $A_i$ ($i = 1, 2$) and $B_j$ ($j = 1, 2$), respectively. Denoting the haplotype with alleles $A_i$ and $B_j$ as $A_iB_j$, linkage disequilibrium is defined as the difference between the observed frequency $p_{ij}$ of the haplotype and its expected value under the assumption of complete linkage equilibrium: $D = p_{ij} - p_{A_i} p_{B_j}$. It is well known that the range and magnitude of $D$ are highly sensitive to the underlying allele frequencies. It is therefore more common to work with a normalized version, $D'$, which is bounded by $-1$ and $1$ (Lewontin, 1964), but see Lewontin (1988) for further discussion:

$$D' = \begin{cases} D/\min\{p_{A_1}(1-p_{B_1}),\ p_{B_1}(1-p_{A_1})\}, & D > 0 \\ D/\min\{p_{A_1}p_{B_1},\ (1-p_{A_1})(1-p_{B_1})\}, & D \le 0. \end{cases} \tag{16.1}$$
Over many generations LD between two loci diminishes due to recombination. However, the rate of erosion depends on the genetic distance between the two loci, with closer loci maintaining their LD longer than loci farther apart. Letting $D_t$ denote LD at generation $t$, the relationship between LD and recombination is $D_t = (1 - \theta)^t D_0$, where $\theta$ is the recombination rate per site per generation and $D_0$ is the initial LD (Ott, 1999). In practice the sign of $D'$ is of little interest and thus $|D'|$ is often reported. When $|D'|$ is 0, the two loci are said to be in linkage equilibrium, or completely independent; when $|D'|$ attains its maximum value of 1, the two loci are said to be completely linked. Values of $|D'|$ between 0 and 1 are not so easily interpreted. The natural estimator of $D$ ($|D'|$) is obtained by substituting the observed haplotype and allele frequencies from the sample. Although $D$ ($D'$) has an appealing definition and a simple estimator, several shortcomings are recognized. The distribution of $|D'|$ can only be roughly approximated, and estimates of $|D'|$ show high variation even between pairs of sites that are far from each other (Hudson 1985). Devlin and Risch (1995) give an overview of LD and many alternative measures.
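For concreteness, here is a sketch (hypothetical function names) of computing D, D' as in Eq. (16.1), and the decay relation, given the four haplotype frequencies of two diallelic loci.

def ld_measures(p11, p12, p21, p22):
    # p11..p22 are the frequencies of haplotypes A1B1, A1B2, A2B1, A2B2.
    pA1 = p11 + p12                      # allele frequencies by margins
    pB1 = p11 + p21
    D = p11 - pA1 * pB1
    if D > 0:
        denom = min(pA1 * (1 - pB1), pB1 * (1 - pA1))
    else:
        denom = min(pA1 * pB1, (1 - pA1) * (1 - pB1))
    Dprime = D / denom if denom > 0 else 0.0
    return D, Dprime

def ld_decay(D0, theta, t):
    # LD after t generations with recombination rate theta per generation.
    return (1 - theta) ** t * D0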
Figure 16.3. LD heatmap as measured by |D'| on 37 SNPs from dataset 10a of the European population collected by Daly et al. (2001). The coordinate (x, y) = (18, 1) shows LD = 0.4 between SNPs 1 and 18. The heatmap is symmetric, and the four dark square regions along the diagonal represent sets of SNPs that are in high LD. Unrelated individuals (n = 48) and loci with MAF > 0.1 were used.
Figure 16.3 shows a heatmap of pairwise LD using |D'|. Represented are 37 ordered SNPs along a chromosomal segment; LD between a pair of SNPs is represented by a colored pixel. The SNPs are from dataset 10a of the European population of Daly et al. (2001). Genotype data were collected on unrelated individuals (n = 48), and LD calculations were restricted to SNPs with a minor allele frequency (MAF) of at least 10%. The axes show the relative locations of the SNPs and do not account for the physical distance between SNPs. Visual inspection of the plot shows a high degree of LD in four "blocks," as indicated by four dark squares, the first covering SNPs 1-12. The block structure in heatmaps is not always so well defined, and formal statistical methods are needed to objectively select the block boundaries. Two packages that can calculate pairwise LD are GOLD (Abecasis and Cookson 2000; http://www.sph.umich.edu/csg/abecasis/GOLD/) and EMLD (Q. Huang, https://epi.mdanderson.org/~qhuang/Software/pub.htm).
4.2 Haplotype Blocks
As indicated by Fig. 16.3, there appear to be physically contiguous stretches of DNA where recombination events are relatively rare, resulting in blocks of high linkage disequilibrium. Initial studies of this phenomenon (Daly et al., 2001; Jeffreys et al., 2001; Johnson et al., 2001; Patil et al., 2001) further characterized these regions by their haplotype variation. For example, in a discussion of the block structure along a 500kb region at 5q31, Daly et al. (2001) estimate that only two haplotypes are responsible for 96% of sampled chromosomes along an 84kb stretch covered by 8 SNPs. Ten other blocks were found in the 500kb region, and each one accounted for > 90% of sampled chromosomes with only 2-4 haplotypes. Moreover, within each block, none of the common haplotypes showed evidence of being derived from the others by recombination, which strongly indicates the existence of a few ancestral haplotypes that tended to be inherited without recombination. One advantage of having a block structure would be in the application to genetic association studies, which have traditionally been plagued by low polymorphism (per SNP) and multiple testing. Haplotype blocks could be treated as individual markers with a higher degree of polymorphism, while reducing the problem of multiple comparisons by minimizing redundant testing arising from markers in linkage disequilibrium. The evidence for haplotype blocks provided by these earlier authors is quite compelling. Later research tempered the idea of general block-like structures over the entire human genome. In particular, Wall and Pritchard (2003a) and Stumpf and Goldstein (2003) discussed the role of recombination rate heterogeneity in determining block structures. The general conclusion is that block structures are likely present over parts of the genome, but not all, and that recombination occurring in narrow hotspots is likely responsible for the block structures. Even when a block structure exists in a given chromosomal region, it can be difficult to reconstruct (estimate). This is especially true of studies that depend on unphased genotype data from unrelated individuals. As discussed above, the haplotypes themselves have to be estimated from the genotype data, which in turn adds another level of uncertainty to the block reconstruction problem. The problem thus seems circular: knowing the block structure would inform us on where to estimate haplotypes, but estimating the block structure requires haplotype information. One way around the problem is to work with pairwise LD information over a set of SNPs in the region of interest. Ideally, a good definition of a haplotype block would depend on a measure of "regional" linkage disequilibrium (Wall and Pritchard 2003b), but in practice blocks are largely constructed from pairwise LD.
Figure 16.4. Haplotype block model. The simplest model assumes that a chromosome is naturally divided by historical recombination hot spots. The regions between hot spots are haplotype blocks, each of which exhibits limited haplotype variation.
The SNP data from Daly et al. (2001) were based on family trios and therefore carried parental transmission information not common in population-based studies. Figure 16.4 shows a generic haplotype block structure. The simplest model can be developed by assuming that recombination hot spots determine the boundaries of the blocks. The success of this approach depends on the ability to directly measure or accurately estimate recombination, which can be difficult in humans. Accurate reconstruction of haplotype blocks depends on a number of factors that should be accounted for in either the estimation method or the interpretation of results. One factor already noted is the estimation of haplotype frequencies from unphased genotype data. Perhaps the most important factor is population admixture, which is known (Ewens and Spielman, 1995) to lead to false-positive results in association studies of even single markers. Gabriel et al. (2002) and Yu and Guerra (2004) show that older populations (e.g., African-Americans) tend to have more blocks of shorter length than relatively younger (Caucasian-American) populations. Estimated block topologies will therefore depend on the underlying population heterogeneity represented in the sample (Pritchard and Przeworski 2001; Wall and Pritchard 2003b). Additional factors include (Yu and Guerra, 2004) sample size (number of chromosomes), SNP density, minor-allele-frequency thresholds, the measure of LD, and any assumptions of the statistical procedures or models used to estimate the block structure. The treatment of "rare" SNPs is also important. For example, if certain SNPs are excluded from analysis due to a MAF threshold, then the inferred haplotype variation within blocks is underestimated, since the rare SNPs are not represented in the estimation of haplotypes. Several block definitions and corresponding search algorithms have been proposed. Recent reviews are given by Wall and Pritchard (2003b), Yu and Guerra (2004), and Ding et al. (2005). One major feature common among the various block definitions is that blocks should contain limited genetic variation, in that only a few haplotypes should account for most of the sampled chromosomes. Ding et al. (2005) evaluate the impact of the three major operational definitions for haplotype blocks shown in the box below.
Operational Haplotype Block Definitions

Diversity: Block structure defined to have low sequence "diversity" within blocks (Patil et al., 2001).

LD: Requires high (low) pairwise LD within (between) blocks (Gabriel et al., 2002).

Recombination: Defines blocks as recombination-free regions (Wang et al., 2002).

To complete the operational definitions, algorithms must be specified to implement the methods. Thus, for the diversity-based method, low sequence diversity may be taken to mean that a minimum fraction of all observed haplotypes is represented beyond a certain threshold in the sample; for example, Ding et al. (2005) require that within blocks 80% of the observed haplotypes occur in at least 5% of the sample. The dynamic-programming algorithm of Zhang et al. (2002) can be used to implement this method. For the LD-based approach one can define a block as a region where a minimum percentage (e.g., 90%) of all SNP pairs are in strong LD, which in turn must be defined using some measure of LD (e.g., D'). The recombination-based method has been implemented with the four-gamete test of Hudson and Kaplan (1985). The first approaches to haplotype block reconstruction were proposed by Daly et al. (2001), Patil et al. (2001), Gabriel et al. (2002), and Zhang et al. (2002). Daly et al. (2001), Patil et al. (2001), and Zhang et al. (2002) propose methods based on haplotype diversity, with multiple (> 2) SNPs jointly analyzed. Daly et al. (2001) define haplotype blocks as regions that have only a few (2-4) haplotypes accounting for > 90% of the sampled chromosomes. Patil et al. (2001) and Zhang et al. (2002) propose methods similar to Daly et al. (2001) and also use the idea of tagging SNPs (see below) to uniquely identify the haplotypes within blocks with a minimal set of SNPs. Gabriel et al. (2002) propose a pairwise LD approach. The pairwise LD approaches are highly intuitive and relatively easy to implement, and thus seem to be the most popular. The method is sequential, starting at one end of the chromosome (or chromosomal segment) and building haplotype blocks one SNP at a time. If the first two (ordered) SNPs are found to be strongly correlated, the algorithm then considers adding the third SNP, and so on. With the pairwise approach there is much flexibility in how one decides whether to add the third and later SNPs.
Gabriel et al. (2002) use pairwise linkage disequilibrium as measured by D' to assess the correlation between two SNPs. To assess the degree of LD for a given pair of SNPs, they calculate 5% and 95% confidence bounds for |D'| and classify the pair as (a) strong LD, if the lower bound is greater than 0.7 and the upper bound is greater than 0.98; (b) weak LD, if the upper bound is less than 0.9; or (c) ambiguous, otherwise. A haplotype block is defined as a set of contiguous loci such that the ratio of the number of strongly linked pairs to weakly linked pairs is at least 19, ignoring the ambiguous pairs. Due to the relatively high level of variation in estimates of |D'|, many pairs are found to have ambiguous linkage, resulting in a potential loss of information. It appears (Yu and Guerra, 2004) that many of the ambiguous cases are characterized by large differences in minor allele frequencies at the two SNPs. One possible consequence (Yu and Guerra 2004) of having too many ambiguous cases is failing to include SNPs in blocks when they should be included; in turn, this may result in more and/or shorter blocks than necessary. No single numerical measure seems to capture the different aspects of linkage disequilibrium, so haplotype block reconstruction based on the pairwise LD approach may yield different results depending on the LD measure used. Recognizing such limitations, Yu and Guerra (2004) proposed a new composite summary of LD for LD-based haplotype block reconstruction. The first part of the composite LD measure is a majority-counting summarization of LD. Consider two diallelic loci A and B with alleles $A_1, A_2$ and $B_1, B_2$. Without loss of generality, assume that $A_1$ and $B_1$ are the minor-frequency alleles. We thus have two classes of haplotypes: coupling ($A_1B_1$, $A_2B_2$) and repulsion ($A_1B_2$, $A_2B_1$). Strong evidence of LD is indicated if either of the two classes is observed in high proportion. A measure of LD, called "proportion disequilibrium" (PLD), is thus defined as

$$PLD = \max(p_{A_1B_1} + p_{A_2B_2},\ p_{A_1B_2} + p_{A_2B_1}),$$

with the constraint $p_{A_1B_1} + p_{A_2B_2} + p_{A_1B_2} + p_{A_2B_1} = 1$. The idea underlying PLD is the same as that underlying haplotype blocks: by the definition of a block, we expect a "few" haplotypes within a block to account for the vast majority of chromosomes. PLD can be used as a measure of LD and, in fact, has characteristics similar to the LD measure r (Hill and Weir, 1994; Devlin and Risch, 1995). PLD has much smaller variation than D'. On the other hand, like r, PLD cannot attain the value 1 (complete linkage disequilibrium) unless there are only two haplotypes; D' can attain the value of 1 when only one of the haplotype frequencies is zero. To complete the composite summary of LD, Yu and Guerra (2004) summarize the 2×2 table of haplotype counts by combining PLD with another term, defined as $\alpha = \min\{p_{A_1B_1}, p_{A_1B_2}, p_{A_2B_1}, p_{A_2B_2}\}$. When $\alpha$ is very small, there are practically three haplotypes and thus no strong evidence of recombination between the two loci. Note that PLD on its own may only be, say, 60%, indicating low LD; however, if one of the haplotype frequencies is very small, we have additional evidence in favor of strong LD. The composite LD summary is thus defined: a pair of loci is in (a) strong LD, if either PLD > 0.95 or α < 0.015; (b) weak LD, if PLD < 0.9 or α > 0.03; (c) ambiguous, otherwise. Let S be the number of pairs in strong LD and W the number of pairs in weak LD. A haplotype block is defined as a region where S/(S + W) > 0.98. The algorithm starts with the first SNP and finds the longest block that satisfies the criterion; the longest block may be of length one, a single SNP. The process starts again with the first SNP following the first estimated block, and continues until the entire chromosomal segment has been analyzed.
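A sketch of the composite classification and the resulting block scan follows (hypothetical names; the thresholds are those quoted above).

def composite_ld_class(p11, p12, p21, p22):
    # Classify one SNP pair from its four haplotype frequencies.
    pld = max(p11 + p22, p12 + p21)      # coupling vs repulsion mass
    alpha = min(p11, p12, p21, p22)      # rarest haplotype frequency
    if pld > 0.95 or alpha < 0.015:
        return "strong"
    if pld < 0.9 or alpha > 0.03:
        return "weak"
    return "ambiguous"

def is_block(classes):
    # Block rule: S/(S + W) > 0.98, ignoring ambiguous pairs.
    s = sum(c == "strong" for c in classes)
    w = sum(c == "weak" for c in classes)
    return s + w > 0 and s / (s + w) > 0.98

def scan_blocks(ld_class, n_snps):
    # Greedy scan: from the current start, take the longest stretch that
    # satisfies is_block (possibly a single SNP), then restart after it.
    # ld_class maps each pair (i, j), i < j, to its class string.
    blocks, start = [], 0
    while start < n_snps:
        end = start
        for j in range(n_snps - 1, start, -1):
            classes = [ld_class[(a, b)] for a in range(start, j + 1)
                       for b in range(a + 1, j + 1)]
            if is_block(classes):
                end = j
                break
        blocks.append((start, end))
        start = end + 1
    return blocks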
Several other methods have been proposed for haplotype block reconstruction (Nothnagel et al. 2002; Zhu et al. 2003; Kimmel and Shamir 2004; Koivisto et al. 2004; Greenspan and Geiger, 2004). However, given the relatively short time since the first proposals appeared in 2001, the methods of Daly et al. (2001), Patil et al. (2001), and Gabriel et al. (2002) appear to remain the most popular. It is therefore of interest to know the relative advantages and disadvantages of any new proposal against at least one of these approaches. Assessing the performance of different block definitions and search algorithms, however, is challenging due to the variety of data sets reported by different authors. When sampled populations, marker densities, sample sizes, minor allele frequency thresholds, and other factors differ from paper to paper, it is indeed quite difficult to compare methods. One difficulty in assessing the relative performance of methods is defining appropriate criteria for comparison. For example, several methods are algorithmic and do not provide measures of statistical accuracy. Indeed, each reconstructed block topology is an estimate; some blocks will be statistically significant and others will not. From a statistical viewpoint, measures of accuracy are needed not just to evaluate method performance, but also for practice. Simulations are also useful for comparison, but in this case much care is required to model realistic evolutionary forces that determine haplotype variation. One very useful model is the coalescent (Hudson, 2002), which has been used in haplotype block simulations (e.g., Wall and Pritchard 2003a). Fallin and Schork (2000) used Dirichlet distributions to simulate haplotype frequencies. Issues in comparing alternative block estimation methods have been discussed by Wall and Pritchard (2003a), Schwartz et al. (2003), Yu and Guerra (2004), and Ding et al. (2005).
Wall and Pritchard (2003a) proposed three criteria as a test of fitness for a block model; they can also be used to assess the performance of block identification algorithms. The criteria are: (1) Coverage, the percentage of physical distance covered by blocks; a region under a true block structure should largely be covered by blocks. (2) Holes: if loci A, B and C are physically ordered as A-B-C, then a hole occurs if A and C are in strong LD but either A and B or B and C are in weak LD; a true block structure should show few holes. (3) Overlapping blocks: two blocks are said to overlap if at least one site can be identified with both blocks; there should be no overlapping blocks in a true block structure. Schwartz et al. (2003) consider the problem of comparing two competing block partitions from two different methods. They use the number of shared boundaries between the two partitions as a statistic for comparison and provide a P-value formula for the null hypothesis that the two block partitions were determined randomly and independently of one another. They also suggest using this statistic to test the robustness of a particular block partition method, since most methods can give several equally good solutions. The proposed statistic is intriguing and, as far as we know, is the first approach suggested for formally comparing competing block partitions. This is a very important problem, especially as interest grows in using haplotype blocks for association studies. One point that warrants further investigation is the definition of "shared boundaries," which in Schwartz et al. (2003) appears to be absolute concordance at the SNP-location resolution. Yu and Guerra (2004) propose a more traditional Type I and Type II error rate approach to compare methods in the context of simulation studies; see below. For the region of interest, each SNP is either correctly captured by a block or not, and these binary results can be used to calculate the error rates, as well as the Wall and Pritchard (2003a) criteria. Sun, Stephens, and Zhao (2004) report results on the impact of sample size and marker density on block partitions. Using real data sets (Daly et al., 2001) from African-American (group B), Japanese, and Chinese (group C) populations, these authors show that both sample size and marker density have a substantial impact on block partitioning and tagging SNPs. Both the number of blocks and the number of tagging SNPs increase with sample size and/or marker density, at least up to the sizes and densities examined. This behavior is also reported by Yu and Guerra (2004).
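As an illustration, holes might be counted as follows (a sketch under one reading of the definition, charging each strong outer pair at most once; ld_class maps index pairs to the classes used above, and all names are hypothetical).

def count_holes(ld_class, n_snps):
    # A hole: loci i < j < k with (i, k) in strong LD while (i, j) or
    # (j, k) is in weak LD.
    holes = 0
    for i in range(n_snps):
        for k in range(i + 2, n_snps):
            if ld_class.get((i, k)) != "strong":
                continue
            if any(ld_class.get((i, j)) == "weak" or
                   ld_class.get((j, k)) == "weak"
                   for j in range(i + 1, k)):
                holes += 1
    return holes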
4.3 Simulations
Below we briefly summarize some results from simulation studies conducted by Yu and Guerra (2004) comparing their composite-LD approach with Gabriel's LD method for block reconstruction. In both cases blocks are constructed via pairwise LD; the difference lies in how one decides on the strength of disequilibrium. Readers are referred to the original paper of Yu and Guerra (2004) for simulation details and more extensive results. The coalescent (Hudson, 2002) was used to simulate haplotype data, which in turn were randomly paired to produce unphased genotype population data. We simulated haplotype data in part based on the characteristics of real data from Gabriel et al. (2002). We assumed a constant population size N = 10,000 and a mutation parameter $\theta = 4N\mu$ set to $7.836 \times 10^{-5}$ per bp (Wall and Pritchard, 2003a). In the first simulation we modeled block length as an exponential distribution with mean 30kb, with blocks separated by recombination hot spots of 1kb in length. In other simulations we changed the parameters to generate two hot spots each 10kb in length and one hot spot of 1kb. In each simulation we ran 50 data sets with 200 chromosomes. This resulted in 100 unphased individuals whose genotypes were subjected to haplotype block inference, using the EM algorithm (Excoffier and Slatkin 1995) for haplotype frequency estimation. Results from the two simulation experiments are summarized in Table 16.5; Figs. 16.5 and 16.6 show the respective true block models and instances of simulated data with estimated block structure. The striking difference between the two methods is the increase in coverage and decrease in Type II error rates with the Yu and Guerra approach. The statistical power (1 − Type II) of the Yu and Guerra approach is especially encouraging with a view toward association studies. Still problematic for both methods are the hole and overlap frequencies, which in the case of a true block structure may not be expected to be at these levels. The results, however, are consistent with those of Wall and Pritchard (2003a).
Table 16.5. Simulation results for block reconstruction. The true model included 50 SNPs over a 100kb region with four haplotype blocks separated by three recombination hot spots. Cell values are mean percentages over 50 simulated data sets, each with 200 chromosomes. Only SNPs with observed MAF > 0.1 were used. (A) Hot spot lengths (1kb, 1kb, 1kb); see Fig. 16.5. (B) Hot spot lengths (10kb, 1kb, 10kb); see Fig. 16.6.

Simulation          Coverage   Holes   Overlap   Type I   Type II
A  Gabriel et al.    33.3%      23.5    17.3      0.47     21.77
   Yu and Guerra     50.0       16.4    16.3      1.83      6.78
B  Gabriel et al.    24.6%      25.9    15.31     0.57     22.03
   Yu and Guerra     41.8       22.6    19.1      6.37      7.86
Figure 16.5. Three replications of Simulation A blocks. Hashmarks show SNP locations. See Table 16.5 caption and text for details. (Please refer to the color section of this book to view this figure in color.)

Figure 16.6. Three replications of Simulation B blocks. Hashmarks show SNP locations. See Table 16.5 caption and text for details. (Please refer to the color section of this book to view this figure in color.)
4.4 Applications
We also applied the block methods of Gabriel et al. (2002) and Yu and Guerra (2004) to the four populations from Daly et al. (2001), using only unrelated individuals and markers with MAF ≥ 0.1. The coverages for the Utah CEPH, African-American, East Asian, and Nigerian samples using the Gabriel et al. method were 46%, 24%, 41% and 25%, respectively, which increased to 58%, 41%, 54% and 45% using the Yu and Guerra method. Hole and overlap frequencies were about the same. The apparently low coverage values may indicate a weak to moderate block-like pattern in these data and/or chromosomal region (Wall and Pritchard 2003a). There is now much interest in haplotype-based association studies using SNPs, especially in the context of complex traits, including quantitative traits. As discussed in Section 3.1, there is some debate concerning the relative merits of single-SNP and haplotype-SNP approaches for genetic association. Intuitively, one might expect haplotypes to provide higher power. The issue is not so simple, however, especially in the case of unphased SNP genotypes, where one must first estimate haplotype frequencies and/or haplotype blocks. Yu and Guerra (2004) considered the question in the context of real genes in which all SNPs have been identified. The real data are part of a larger study, the Dallas Heart Study (Victor et al., 2004), investigating cardiovascular disease. We selected one of the genes under study and simulated a continuous phenotype. We assume there is a single trait locus in complete LD with one of the SNPs (dotted line in Fig. 16.7) and model the continuous trait as a mixture of three normal distributions with means 50, 55 and 60 corresponding to genotypes CC, CA and AA, and a common standard deviation of 20. Haplotype blocks were estimated using the approach of Yu and Guerra (2004). To test the single SNPs we used an additive (allelic) model defined by a predictor that counts the number of A alleles. The block haplotypes were also tested with an additive model using Haplo.STAT (Schaid et al., 2002). Figure 16.7 shows the results. Nine blocks were found, including a very thin one at 30kb comprising two SNPs and a very short one at 38kb with three SNPs. There was significant single-SNP association at the assumed disease locus, as well as at neighboring SNP loci, with a general decreasing trend in significance as the SNPs move farther from the trait SNP. A significant global block p-value was also found at the block covering the disease locus, although the block p-value is less than the single-SNP p-value.
Figure 16.7. Single-SNP vs. haplotype-SNP association. The y-axis is -log(p), with y = 3 corresponding to a significance level of 0.05 shown as the solid horizontal line. The x-axis is physical position; SNPs are shown as hashmarks. Individual dots correspond to single-SNP tests. The nine shaded rectangles are estimated haplotype blocks, and their heights correspond to a global p-value. (Please refer to the color section of this book to view this figure in color.)
In this application there is concordance (vis-à-vis significance) between single-SNP and haplotype-SNP analysis at the trait locus, as well as at the third, sixth, and seventh blocks. The other blocks, for example block 2, show results that conflict with the single-SNP tests within the blocks. Block 4 is highly significant and contains two SNPs, one significant and one not under single-SNP analysis. The causes of such discrepancies are still being investigated by various authors. See Yu and Guerra (2004) for further discussion, including results on specific haplotypes that can help explain some of the discrepancies.
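The trait simulation and single-SNP additive test just described can be sketched as follows (hypothetical names; the means and standard deviation follow the text, and the additive test reduces to the slope t-statistic of a simple linear regression on allele count).

import random

def simulate_trait(genotypes):
    # Genotype codes 0, 1, 2 = number of A alleles (CC, CA, AA).
    means = {0: 50.0, 1: 55.0, 2: 60.0}
    return [random.gauss(means[g], 20.0) for g in genotypes]

def additive_t_stat(genotypes, trait):
    # Slope t-statistic for regressing trait on allele count.
    n = len(trait)
    mg = sum(genotypes) / n
    mt = sum(trait) / n
    sxx = sum((g - mg) ** 2 for g in genotypes)
    sxy = sum((g - mg) * (y - mt) for g, y in zip(genotypes, trait))
    beta = sxy / sxx
    resid = [y - mt - beta * (g - mg) for g, y in zip(genotypes, trait)]
    s2 = sum(r * r for r in resid) / (n - 2)
    return beta / (s2 / sxx) ** 0.5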
4.5 Tagging SNPs
The idea of tagging SNPs (Johnson et al. 2001) is to identify a few SNPs that capture most of the variation at a haplotype locus. In a haplotype block model there would be a set of haplotype tagging SNPs (htSNPs) for each block. One advantage of htSNPs is that future sampled chromosomes need only be typed at the tagging SNPs, saving time and resources. In Fig. 16.8 there are four SNPs giving four haplotypes with frequencies 0.3, 0.3, 0.2 and 0.2, respectively. SNP1 and SNP2 are redundant, and SNP3 carries allele G in 80% of the population.
Haplotype   SNP1   SNP2   SNP3   SNP4   Frequency
hap1         A      T      G      C      0.3
hap2         A      T      G      T      0.3
hap3         C      A      G      C      0.2
hap4         C      A      C      T      0.2

Figure 16.8. An illustration of tagging SNPs.
We can thus select as tagging SNPs (1,4) or (2,4) without loss of information. A standard measure of haplotype variation is the haplotype diversity D, defined as the total number of base differences across all possible pairwise comparisons of the observed haplotypes. If $z = (z_1, \ldots, z_S)$ represents a haplotype of S linked SNPs, each coded by alleles 0 and 1, then the haplotype diversity based on a sample of n haplotypes is

$$D = \sum_{i=1}^{n} \sum_{j=1}^{n} (z_i - z_j)^T (z_i - z_j).$$
Chapman et al. (2003) proposed using D to find htSNPs. Any subset of k SNPs partitions the observed haplotypes into no more than $2^k$ groups. By computing D within each group, we can define the residual diversity R of the subset as the sum of the within-group diversities. The proportion of diversity explained (PDE) by the selected set of SNPs is then defined as PDE = 1 − R/D and can be used to measure the informativeness of the selected SNPs. SNP sets with a high PDE ensure that little information is lost. Other authors propose methods that simultaneously estimate haplotype blocks and tagging SNPs. In the greedy block-partition algorithm of Patil et al. (2001), the partition with the minimal number of SNPs that can distinguish the common haplotypes is selected. Zhang et al. (2002) note that the greedy algorithm of Patil et al. (2001) does not guarantee an optimal solution and thus propose an extended dynamic-programming solution. The tagging methods discussed above all require phase information or estimated phase information, which is computationally expensive when there are a large number of SNPs. Meng et al. (2003) introduced a method based on the spectral decomposition of the LD matrix for dimension reduction. The benefit is that phase information is not needed, which makes the algorithm more efficient and eliminates the extra uncertainty due to estimation of haplotype frequencies from unphased data. Section 6 gives a list of software programs that offer tagging SNP calculations.
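A sketch of the diversity and PDE computations (hypothetical names; haplotypes coded 0/1 per SNP as in the definition above):

from itertools import combinations

def diversity(haps):
    # D: double sum over ordered pairs of squared differences; for 0/1
    # coding this is twice the total base differences over unordered pairs.
    return 2 * sum(sum(a != b for a, b in zip(h1, h2))
                   for h1, h2 in combinations(haps, 2))

def pde(haps, tag_idx):
    # Group haplotypes by their alleles at the tagging SNPs, sum the
    # within-group diversities (R), and return PDE = 1 - R/D.
    groups = {}
    for h in haps:
        groups.setdefault(tuple(h[i] for i in tag_idx), []).append(h)
    R = sum(diversity(g) for g in groups.values())
    D = diversity(haps)
    return 1.0 - R / D if D else 1.0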
5. Conclusions
SNPs are highly abundant in the human genome and account for most of its sequence variation. This makes them a valuable resource for population genetics, evolution, and gene mapping. In this chapter we have given an overview of the major issues arising in their application to haplotype and haplotype block estimation and to genetic association. The discussion should make clear that many statistical methods have been developed for these problems, but there is still much to understand about the relative merits of the competing methods. Perhaps more important is further understanding of the practical utility of the methods.
6. Resources
A generally useful website for literature and software on statistical genetics, including SNPs and haplotypes, is http://linkage.rockefeller.edu/. Haplotype data from unrelated individuals can be simulated based on coalescent theory (Kingman, 1982). The ms package (Hudson 2002; http://home.uchicago.edu/~rhudson1/source/mksamples.html) can simulate sequences under evolutionary forces such as mutation, recombination, and migration. In order to generate sequences with high recombination, Li and Stephens (2003) introduced a distance-shrinking method to produce the effect of recombination hotspots. The package hdhot, with block structure, is available at http://www.biostat.umn.edu/~nali/SoftwareListing.html. The HapMap Project (The International HapMap Consortium, Nature 426:789-796; www.hapmap.org) is a public database of haplotype resources. The goal of this project is to compare the genetic sequences of several hundred individuals to identify common variation; DNA samples come from populations representing African, Asian, and European ancestry.
6.1 Selected Haplotype Reconstruction Software
If available, published papers introducing the software are given below; in all cases an internet URL is provided. A more comprehensive listing is given by Halldórsson, Istrail and De La Vega (Hum. Hered. 2004; 58:190-202).

Arlequin: Excoffier and Slatkin (1995); cmpg.unibe.ch/software/arlequin/

EH: Xie and Ott (1993); linkage.rockefeller.edu/software/eh
EH+: Zhao, Curtis, and Sham (2000); www.iop.kcl.ac.uk/IoP/Departments/PsychMed/GEpiBSt/software.shtml

EM-decoder and Haplotyper: Niu et al. (2002); EM-decoder: www.people.fas.harvard.edu/~junliu/em/em.htm; Haplotyper: www.people.fas.harvard.edu/~junliu/Haplo/docMain.htm

PHASE: Stephens, Smith, and Donnelly (2001a); Stephens and Donnelly (2003); www.stat.washington.edu/stephens/software.html

PLEM: Qin, Niu, and Liu (2002); www.people.fas.harvard.edu/~junliu/plem

SNPHAP: Clayton, D.; www-gene.cimr.cam.ac.uk/clayton/software/snphap.txt

HAP: Eskin, Halperin, and Karp (2003); www.cs.columbia.edu/compbio/hap/

Haplo.STAT: Schaid et al. (2002); www.mayo.edu/hsr/people/schaid.html

HaploBlock: Greenspan and Geiger (2003); bioinfo.cs.technion.ac.il/haploblock/
6.2 Tagging SNP Software
htSNP: Chapman et al. (2003); www-gene.cimr.cam.ac.uk/clayton/software/stata

HaploBlockFinder: Zhang and Jin (2003); cgi.uc.edu/cgi-bin/kzhang/haploBlockFinder.cgi
SNPTagger: Ke and Cardon (2003); www.well.ox.ac.uk/~xiayi/haplotype

TagSNPs: Stram et al. (2003); www-rcf.usc.edu/~stram/tagSNPs.htm

HapBlock: Zhang et al. (2002); www.cmb.usc.edu/msms/HapBlock

ldSelect: Carlson et al. (2004); droog.gs.washington.edu/ldSelect.html
Acknowledgments

Drs. Jeffrey Wall and Jonathan Pritchard kindly provided formatted data from the Daly datasets, as well as their C simulation program for the coalescent (Wall and Pritchard 2003a), whose foundation is due to Li and Stephens (2003). Gudmundur Arni Thorisson of the Haplotype Map Project provided helpful information on the databases dbSNP and HapMap. Drs. Helen Hobbs, Jonathan Cohen, and Alex Pertsemlidis provided SNP data and helpful discussion. Zhaoxia Yu is supported by a training fellowship from the W.M. Keck Foundation to the Gulf Coast Consortia through the Keck Center for Computational and Structural Biology, Rice University. Rudy Guerra is partially supported by NIH BAAHL0204 and NSF EIA0203396.
References

Abecasis, G.R. and Cookson, W.O. (2000). GOLD: graphical overview of linkage disequilibrium. Bioinformatics, 16(2):182-183.
Agresti, A. (1990). Categorical Data Analysis. Wiley.
Akey, J., Jin, L., and Xiong, M. (2001). Haplotypes vs single marker linkage disequilibrium tests: what do we gain? Eur. J. Hum. Genet., 9(4):291-300.
Altshuler, D., Pollara, V.J., Cowles, C.R., Van Etten, W.J., Baldwin, J., Linton, L., and Lander, E.S. (2000). An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature, 407:513-516.
Armitage, P. (1955). Tests for linear trends in proportions and frequencies. Biometrics, 11:375-386.
Bray, M.S., Boerwinkle, E., and Doris, P.A. (2001). High-throughput multiplex SNP genotyping with MALDI-TOF mass spectrometry: practice, problems and promise. Hum. Mutat., 17(4):296-304.
Brookes, A.J. (1999). The essence of SNPs. Gene, 234(2):177-186.
Brookes, A.J., Lehväslaiho, H., Siegfried, M., Boehm, J.G., Yuan, Y.P., Sarkar, C.M., Bork, P., and Ortigao, F. (2000). HGBASE: A database of SNPs and other variations in and around human genes. Nucleic Acids Research, 28:356-360.
Carlson, C.S., Eberle, M.A., Rieder, M.J., Yi, Q., Kruglyak, L., and Nickerson, D.A. (2004). Selecting a maximally informative set of single-nucleotide polymorphisms for association analysis using linkage disequilibrium. American Journal of Human Genetics, 74:106-120.
Chapman, J.M., Cooper, J.D., Todd, J.A., and Clayton, D.G. (2003). Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Human Heredity, 56:18-31.
Churchill, G.A. and Doerge, R.W. (1994). Empirical threshold values for quantitative trait mapping. Genetics, 138(3):963-971.
Clark, A.G. (1990). Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol., 7:111-122.
Collins, F.S., Brooks, L.D., and Chakravarti, A. (1998). A DNA polymorphism discovery resource for research on human genetic variation. Genome Res., 8(12):1229-1231.
Cousin, E., Genin, E., Mace, S., Ricard, S., Chansac, C., del Zompo, M., and Deleuze, J.F. (2003). Association studies in candidate genes: strategies to select SNPs to be tested. Hum. Hered., 56:151-159.
Cruickshanks, K.J., Vadheim, C.M., Moss, S.E., Roth, M.P., Riley, W.J., Maclaren, N.K., Langfield, D., Sparkes, R.S., Klein, R., and Rotter, J.I. (1992). Genetic marker associations with proliferative retinopathy in persons diagnosed with diabetes before 30 yr of age. Diabetes, 41:879-885.
Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J., and Lander, E.S. (2001). High-resolution haplotype structure in the human genome. Nat. Genet., 29:229-232.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Ser. B, 39:1-22.
Devlin, B. and Risch, N. (1995). A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics, 29:311-322.
Devlin, B. and Roeder, K. (1999). Genomic control for association studies. Biometrics, 55(4):997-1004.
Ding, K., Zhou, K., Zhang, J., Knight, J., Zhang, X., and Shen, Y. (2005). The effect of haplotype-block definitions on inference of haplotype-block structure and htSNPs selection. Mol. Biol. Evol., 22:148-159.
Epstein, M.P. and Satten, G.A. (2003). Inference on haplotype effects in case-control studies using unphased genotype data. Am. J. Hum. Genet., 73:1316-1329.
Eronen, L., Geerts, F., and Toivonen, H. (2004). A Markov chain approach to reconstruction of long haplotypes. Pacific Symposium on Biocomputing, 9:104-115.
Eskin, E., Halperin, E., and Karp, R.M. (2003). Efficient reconstruction of haplotype structure via perfect phylogeny. J. Bioinform. Comput. Biol., 1(1):1-20.
Ewens, W.J. and Spielman, R.S. (1995). The transmission/disequilibrium test: history, subdivision, and admixture. Am. J. Hum. Genet., 57:455-464.
Excoffier, L. and Slatkin, M. (1995). Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol., 12(5):921-927.
Fallin, D. and Schork, N.J. (2000). Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am. J. Hum. Genet., 67:947-959.
Fan, R. and Knapp, M. (2003). Genome association studies of complex diseases by case-control designs. Am. J. Hum. Genet., 72:850-868.
Fredman, D., Siegfried, M., Yuan, Y.P., Bork, P., Lehväslaiho, H., and Brookes, A.J. (2002). HGVbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources. Nucleic Acids Research, 30:387-391.
Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., Liu-Cordero, S.N., Rotimi, C., Adeyemo, A., Cooper, R., Ward, R., Lander, E.S., Daly, M.J., and Altshuler, D. (2002). The structure of haplotype blocks in the human genome. Science, 296:2225-2229.
Garner, C. and Slatkin, M. (2003). On selecting markers for association studies: patterns of linkage disequilibrium between two and three diallelic loci. Genet. Epidemiol., 24(1):57-67.
Greenspan, G. and Geiger, D. (2004). Model-based inference of haplotype block variation. Journal of Computational Biology, 11(2-3):493-504.
Gusfield, D. (2002). Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. In Proceedings of the Sixth Annual International Conference on Computational Biology, 166-175. ACM Press.
Hao, K., Xu, X., Laird, N., Wang, X., and Xu, X. (2004). Power estimation of multiple SNP association test of case-control study and application. Genet. Epidemiol., 26:22-30.
Harding, R.M., Fullerton, S.M., Griffiths, R.C., Bond, J., Cox, M.J., Schneider, J.A., Moulin, D.S., and Clegg, J.B. (1997). Archaic African and Asian lineages in the genetic ancestry of modern humans. Am. J. Hum. Genet., 60:772-789.
Hawley, M.E. and Kidd, K.K. (1995). HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. Hered., 86:409-411.
Hedrick, P.W. (1987). Gametic disequilibrium measures: proceed with caution. Genetics, 117:331-341.
Hill, W.G. and Weir, B.S. (1994). Maximum-likelihood estimation of gene location by linkage disequilibrium. Am. J. Hum. Genet., 54:705-714.
Hirschhorn, J.N., Lohmueller, K., Byrne, E., and Hirschhorn, K. (2002). A comprehensive review of genetic association studies. Genet. Med., 4:45-61.
Hoh, J. and Ott, J. (2003). Mathematical multi-locus approaches to localizing complex human trait genes. Nat. Rev. Genet., 4:701-709.
Hudson, R.R. (1983). Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology, 23:183-201.
Hudson, R.R. (1985). The sampling distribution of linkage disequilibrium under an infinite allele model without selection. Genetics, 109:611-631.
Hudson, R.R. (2001). Two-locus sampling distributions and their application. Genetics, 159:1805-1817.
Hudson, R.R. (2002). Generating samples under a Wright-Fisher neutral model. Bioinformatics, 18:337-338.
Hudson, R.R. and Kaplan, N.L. (1985). Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111:147-164.
Isaksson, A., Landegren, U., Syvanen, A.C., Bork, P., Stein, C., Ortigao, F., and Brookes, A.J. (2000). Discovery, scoring and utilization of human single nucleotide polymorphisms: a multidisciplinary problem. Eur. J. Hum. Genet., 8:154-156.
Jiang, R., Duan, J., Windemuth, A., Stephens, J.C., Judson, R., and Xu, C. (2003). Genome-wide evaluation of the public SNP databases. Pharmacogenomics, 4:779-789.
Johnson, G., Esposito, L., Barratt, B., Smith, A., Heward, J., et al. (2001). Haplotype tagging for the identification of common disease genes. Nature Genetics, 29:233-237.
Ke, X. and Cardon, L.R. (2003). Efficient selective screening of haplotype tag SNPs. Bioinformatics, 19(2):287-288.
Khoury, M.J., Beaty, T.H., and Cohen, B.H. (1993). Fundamentals of Genetic Epidemiology. Oxford University Press.
Kimmel, G. and Shamir, R. (2005). GERBIL: Genotype resolution and block identification using likelihood. Proc. Natl. Acad. Sci., 102:158-162.
Kingman, J.F.C. (1982). The coalescent. Stochastic Proc. Appl., 13:235-248.
Koivisto, M., Perola, M., Varilo, T., Hennah, W., Ekelund, J., et al. (2003). An MDL method for finding haplotype blocks and for estimating the strength of haplotype block boundaries. In: R.B. Altman, A.K. Dukner, L. Hunter, T.A. Jung, and T.E. Klein (eds.), Proceedings of the Eighth Pacific Symposium on Biocomputing (PSB'03), pp. 502-513, World Scientific.
Kooperberg, C., Ruczinski, I., LeBlanc, M., and Hsu, L. (2001). Sequence analysis using logic regression. Genetic Epidemiology, 21(S1):626-631.
Kruglyak, L. (1999). Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet., 22:139-144.
Lewis, C.M. (2002). Genetic association studies: design, analysis and interpretation. Brief. Bioinform., 3:146-153.
Lewontin, R.C. and Kojima, K. (1960). The evolutionary dynamics of complex polymorphisms. Evolution, 14:458-472.
Lewontin, R. (1964). The interaction of selection and linkage. I. General considerations; heterotic models. Genetics, 49:49-67.
Lewontin, R. (1988). On measures of gametic disequilibrium. Genetics, 120:849-852.
Long, A.D. and Langley, C.H. (1999). The power of association studies to detect the contribution of candidate genetic loci to variation in complex traits. Genome Res., 9:720-731.
Long, J.C., Williams, R.C., and Urbanek, M. (1995). An E-M algorithm and testing strategy for multiple-locus haplotypes. Am. J. Hum. Genet., 56:799-810.
Li, N. and Stephens, M. (2003). Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165:2213-2233.
Lin, S., Cutler, D.J., Zwick, M.E., and Chakravarti, A. (2002). Haplotype inference in random population samples. Am. J. Hum. Genet., 71:1129-1137.
Matsuzaki, H., Loi, H., Dong, S., Tsai, Y., Fang, J., et al. (2004). Parallel genotyping of over 10,000 SNPs using a one-primer assay on a high-density oligonucleotide array. Genome Research, 14(3):414-425.
Meng, Z., Zaykin, D.V., Xu, C.F., Wagner, M., and Ehm, M.G. (2003). Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am. J. Hum. Genet., 73:115-130.
Morris, R.W. and Kaplan, N.L. (2002). On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genet. Epidemiol., 23:221-233.
Mullikin, J.C., Hunt, S.E., Cole, C.G., Mortimore, B.J., Rice, C.M., et al. (2000). An SNP map of human chromosome 22. Nature, 407:516-520.
Neale, B.M. and Sham, P.C. (2004). The future of association studies: gene-based analysis and replication. Am. J. Hum. Genet., 75:353-362.
Newton, C.R., Graham, A., Heptinstall, L.E., Powell, S.J., Summers, C., et al. (1989). Analysis of any point mutation in DNA: the amplification refractory mutation system (ARMS). Nucleic Acids Res., 17:2503-2516.
Nielsen, R. (2000). Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics, 154:931-942.
Niu, T., Qin, Z.S., Xu, X., and Liu, J.S. (2002). Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet., 70(1):157-169.
Nothnagel, M., Furst, R., and Rohde, K. (2002). Entropy as a measure for linkage disequilibrium over multilocus haplotype blocks. Hum. Hered., 54:186-198.
Ott, J. (1999). Analysis of Human Genetic Linkage, Third edition. Johns Hopkins University Press.
Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., et al. (2001). Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294:1719-1723.
Perlin, M.W., Burks, M.B., Hoop, R.C., and Hoffman, E.C. (1994). Toward fully automated genotyping: allele assignment, pedigree construction, phase determination, and recombination detection in Duchenne muscular dystrophy. Am. J. Hum. Genet., 55:777-787.
Prince, J.A., Feuk, L., Howell, W.M., Jobs, M., Emahazion, T., Blennow, K., and Brookes, A.J. (2001). Robust and accurate single nucleotide polymorphism genotyping by dynamic allele-specific hybridization (DASH): design criteria and assay validation. Genome Res., 11:152-162.
Pritchard, J.K. and Przeworski, M. (2001). Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet., 69:1-14.
Pritchard, J.K. and Cox, N.J. (2002). The allelic architecture of human disease genes: common disease-common variant... or not? Hum. Mol. Genet., 11:2417-2423.
Qin, Z.S., Niu, T., and Liu, J.S. (2002). Partition-ligation expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am. J. Hum. Genet., 71(5):1242-1247.
Rader, D.J., Cohen, J.C., and Hobbs, H.H. (2003). Monogenic hypercholesterolemia: new insights in pathogenesis and treatment. J. Clin. Invest., 111:1795-1803.
Risch, N. and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science, 273:1516-1517.
Ruano, G., Kidd, K.K., and Stephens, J.C. (1990). Haplotype of multiple polymorphisms resolved by enzymatic amplification of single DNA molecules. Proc. Natl. Acad. Sci., 87:6296-6300.
Saiki, R.K., Scharf, S., Faloona, F., Mullis, K.B., Horn, G.T., Erlich, H.A., and Arnheim, N. (1985). Enzymatic amplification of β-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science, 230:1350-1354.
Sasieni, P.D. (1997). From genotypes to genes: doubling the sample size. Biometrics, 53:1253-1261.
Schaid, D.J., Rowland, C.M., Tines, D.E., Jacobson, R.M., and Poland, G.A. (2002). Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet., 70:425-434.
Scharf, S.J., Horn, G.T., and Erlich, H.A. (1986). Direct cloning and sequence analysis of enzymatically amplified genomic sequences. Science, 233:1076-1078.
Schwartz, R.S., Halldórsson, B., Bafna, V., Clark, A.G., and Istrail, S. (2003). Robustness of inference of haplotype block structure. J. of Comp. Bio., 10:13-19.
Schork, N.J., Fallin, D., and Lanchbury, J.S. (2000). Single nucleotide polymorphisms and the future of genetic epidemiology. Clin. Genet., 58:250-264.
Schork, N.J., Nath, S.K., Fallin, D., and Chakravarti, A. (2000). Linkage disequilibrium analysis of biallelic DNA markers, human quantitative trait loci, and threshold-defined case and control subjects. Am. J. Hum. Genet., 67:1208-1218.
Service, S.K., Lang, D.W., Freimer, N.B., and Sandkuijl, L.A. (1999). Linkage-disequilibrium mapping of disease genes by reconstruction of ancestral haplotypes in founder populations. Am. J. Hum. Genet., 64:1728-1738.
Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., and Sirotkin, K. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Research, 29:308-311.
Sobel, E. and Lange, K. (1996). Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker sharing statistics. Am. J. Hum. Genet., 58:1323-1337.
Stephens, M. and Donnelly, P. (2003). A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet., 73:1162-1169.
Stephens, M., Smith, N.J., and Donnelly, P. (2001a). A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet., 68:978-989.
Stephens, M., Smith, N.J., and Donnelly, P. (2001b). Reply to Zhang et al. Am. J. Hum. Genet., 69:912-914.
Stram, D.O., Haiman, C.A., Hirschhorn, J.N., Altshuler, D., Kolonel, L.N., Henderson, B.E., and Pike, M.C. (2003). Choosing haplotype-tagging SNPs based on unphased genotype data from a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Hum. Hered., 55(1):27-36.
Stumpf, M.P. and Goldstein, D.B. (2003). Demography, recombination hotspot intensity, and the block structure of linkage disequilibrium. Curr. Biol., 13:1-8.
Sun, X., Stephens, J.C., and Zhao, H. (2004). The impact of sample size and marker selection on the study of haplotype structures. Human Genomics, 1:179-193.
Terwilliger, J.D., Haghighi, F., Hiekkalinna, T.S., and Goring, H.H. (2002). A bias-ed assessment of the use of SNPs in human complex traits. Current Opinion in Genetics and Development, 12:726-734.
The International SNP Map Consortium (2001). A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409:928-933.
Tregouet, D.A., Escolano, S., Tiret, L., Mallet, A., and Golmard, J.L. (2004). A new algorithm for haplotype-based association analysis: the Stochastic-EM algorithm. Ann. Hum. Genet., 68:165-177.
Victor, R.G., Haley, R.W., Willett, D.L., Peshock, R.M., Vaeth, P.C., et al. (2004). The Dallas Heart Study: a population-based probability sample for the multidisciplinary study of ethnic differences in cardiovascular health. Am. J. Cardiol., 93(12):1473-1480.
Wall, J.D. and Pritchard, J.K. (2003a). Assessing the performance of the haplotype block model of linkage disequilibrium. Am. J. Hum. Genet., 73:502-515.
Wall, J.D. and Pritchard, J.K. (2003b). Haplotype blocks and linkage disequilibrium in the human genome. Nature Reviews Genetics, 4:587-597.
Wan, Y., Cohen, J., and Guerra, R. (1997). A permutation test for the robust sib-pair linkage method. Ann. Hum. Genet., 61:79-87.
Wang, N., Akey, J.M., Zhang, K., Chakraborty, R., and Jin, L. (2002). Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am. J. Hum. Genet., 71:1227-1234.
REFERENCES
349
Wang, D.G., Fan, J.B., Siao, C.J., Berno, A., Young, P., et al. (1998). Large-scale identification, mapping and genotyping of single-nucleotide polymorphisms in the human genome. Science, 280: 1077-1082. Weir, B.S. (1990). Genetic Data Analysis. Methods for discrete population genetic data. Sinauer, Sunderland, Mass. Wu, D.Y., Ugozzoli, L., PAL, B.K. and WALLACE, R.R. (1989). Allelespecific amplification of P-globin genomic DNA for diagnosis of sicklecell anemia. Proc. Natl. Acad. Sci., 86:2757. Xie, X. and Ott, J. (1993). Testing linkage disequilibrium between a disease gene and marker loci. Am. J. Hum. Genet., 53:1107. Xiong, M., Zhao, J., and Boerwinkle, E.(2002). Generalized T 2 test for genome association studies. Am. J. Hum. Genet., 70:1257-1268. Yu, Z. and Guerra, R. (2004). Haplotype blocks and association: Operating performance and new methods, TR2004-07, Department of Statistics, Rice University. Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G.(2002). Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum.Hered., 53:79-91. Zhang, K., Calabrese, P., Nordborg, M., and Sun, F.(2002). Haplotype block structure and its applications to association studies: power and study designs. Am. J. Hum. Genet., 71:1386-1394. Zhang, K., Deng, M., Chen, T., Waterman, M. S., and Sun, F.(2002). A dynamic programming algorithm for haplotype block partitioning. Proc. Natl. Acad. Sci., 99:7335-7339. Zhang, K. and Jin, L. (2003). HaploBlockFinder: haplotype block analyses. Bioinformatics, 19(10):1300-1301. Zhao, J. H., Curtis, D., Sham, P. C. (2000). Model-free analysis and permutation tests for allelic associations. Hum. Hered., 50(2):133-139. Zhao, H., Pfeiffer, R., and Gail, M. H.(2003). Haplotype analysis in population genetics and association studies. Pharmacogenomics, 4:171-178. Zollner, S. and von Haeseler, A.(2000). A coalescent approach to study linkage disequilibrium between single-nucleotide polymorphisms. Am. J. Hum. Genet., 66:615-628. Zou G. and Zhao, H. (2003). Haplotype frequency estimation in the presence of genotyping errors. Hum. Hered., 56:131-138. Zhu, X., Yan, D., Cooper, R. S., Luke, A., Ikeda, M. A., Chang, Y. P., Weder, A., and Chakravarti, A.(2003). Linkage disequilibrium and haplotype diversity in the genes of the renin-angiotensin system: findings from the family blood pressure program. Genome Res., 13:173-181.
Chapter 17 THE CONTRIBUTION OF ALTERNATIVE TRANSCRIPTION AND ALTERNATIVE SPLICING TO THE COMPLEXITY OF MAMMALIAN TRANSCRIPTOMES Mihaela Zavolan and Christian Schönbach University of Basel, Basel, Switzerland
1. Introduction
In multi-cellular organisms, cells from different tissues have dramatically different phenotypes, even though they share the same genetic information. Until recently, phenotype diversity was thought to be largely the result of transcriptional regulation, different sets of genes being expressed in different cell types or at different developmental stages. It therefore came as a surprise that the human genome contains around 40,000 genes, only about twice as many as the genome of the worm Caenorhabditis elegans (The C. elegans Sequencing Consortium 1998). Probing the mammalian transcriptome with high-throughput sequencing techniques revealed that the key to this puzzle resides – at least partially – in the increased complexity of gene structure in mammalian genomes. 60% of human (Lander et al. 2001) and mouse (Zavolan et al. 2003) multi-exon genes have splice variants, compared to only 22% of worm genes, although some authors argued that this difference is due to deeper coverage of the mammalian transcriptome compared to that of the worm (Brett et al. 2002). Another surprising finding of full-length cDNA sequencing studies is that mammalian genes frequently initiate transcription from different promoters, and that there are multiple polyadenylation sites for a given gene. Thus, in eukaryotes, the one gene–one protein paradigm of molecular biology turns out to be the exception rather than the rule, and various molecular mechanisms contribute to transcriptome diversification. In this chapter we present an analysis of alternative splicing in the Fantom2 dataset of full-length mouse cDNAs (FANTOM Consortium and the RIKEN GSC Genome Exploration Group 2002). We found that at least
41% – and perhaps as many as 60% – of multi-exon mouse genes have multiple splice forms, with splice variation affecting predominantly the coding region. Many of the observed splice variants may, however, be rare transcripts resulting from a "noisy" splicing process, as suggested by our analyses of the effect of splice variation on exon length and reading frame. Several of our analyses aimed at a better understanding of the regulation of alternative splicing. We found, for instance, that one of the mechanisms by which alternative splicing is induced is the use of alternative transcription start sites. A Bayesian analysis of the sequences flanking the splice sites indicates that constitutive and cryptic¹ exons differ in their flanking splice signals. These differences are presumably the basis for previous reports of "stronger" splice signals of constitutive exons. With respect to the presence of motifs that might be important for exon recognition, we found that constitutive exons are enriched in several known splice enhancer motifs, while other motifs, generally pyrimidine-rich, occur with higher frequency in cryptic exons. The functional significance of these candidate regulatory motifs remains to be established. Alternative splicing has emerged as one of the most important mechanisms underlying the diversity of phenotypes exhibited by mammalian cells. Even repeat elements, generally considered to be parasitic, can contribute to the diversification of the transcriptome: they can be spliced into the functional transcripts of the host cells, sometimes in a species-specific manner, and sometimes as alternative exons. With this enormous potential for transcriptome diversification, the main question that remains to be answered in the coming years is how the expression of alternative splice forms is regulated by the cell. To add another layer of complexity to this problem, we found a higher frequency of alternative splicing in genes that could themselves be involved in the regulation of alternative splicing. These observations and their biological implications are discussed in the light of published work by others.
2. Alternative Splicing in Mouse
Initial sequencing and annotation of the human (Lander et al. 2001; Venter et al. 2001) and mouse (Gregory et al. 2002) genomes revealed that the number of protein-coding genes is around 40,000, much smaller than expected. Anticipating the complexity of gene structure in these species – which cannot be predicted by current gene prediction algorithms – several groups embarked on efforts to capture and sequence full-length cDNA sequences (Kawai et al. 2001; Suzuki et al. 2001). We have studied the extent and pattern of alternative splicing in the mouse transcriptome using the FANTOM datasets, consisting initially of
21,076, and, more recently, of 60,770 full-length cDNA sequences. To ensure a more complete coverage of the mouse transcriptome, this latter set was supplemented with 44,122 public-domain mRNA sequences from LocusLink and from the non-EST divisions of the Mouse Gene Index and GenBank (FANTOM Consortium and the RIKEN GSC Genome Exploration Group 2002). All sequences were aligned to the mouse genome assembly to infer the splice junctions, as described in Zavolan et al. (2003). What became apparent immediately is that the complexity of mammalian gene structure makes it difficult to even define what a gene or a locus is. FANTOM Consortium and the RIKEN GSC Genome Exploration Group (2002) focused instead on transcription units, defined as "segments of genome from which transcripts are generated". But how can we infer the boundaries and structure of a transcription unit? We do not know what mechanisms determine the expression of different transcripts from a given transcription unit; all that we observe is the limited number of transcripts that have been sequenced by the sequencing projects. In Zavolan et al. (2003) we considered transcription units to be represented by clusters of transcripts mapping to the genome in an overlapping manner. Two transcripts are placed in the same cluster if their genomic mappings overlap in at least one exonic nucleotide. This computational definition of transcription units allowed us to retrieve quite accurately the transcription units produced by manual annotation: only 84 of our 11,677 (< 1%) clusters containing multiple spliced transcripts straddled more than one manually-annotated transcription unit. Figure 17.1 shows a set of full-length transcripts that are especially difficult to assign computationally to transcription units. What appears to be a very complex pattern of splice variation in the upper panel corresponds, in fact, to three entirely different protein-coding genes: hyaluronidase 1 and 3 and N-acetyl transferase 6. It has been shown very recently that the transcription of this genomic region yields a polycistronic transcript (Shuttleworth et al. 2002). In addition, interferon-related developmental regulator 2 is expressed from the negative strand of this locus, and its exons are interleaved with those of the hyaluronidase gene. For our analyses of splice variation we selected only transcripts that mapped quasi-completely to the genome assembly: we required at least 95% identity between the cDNA and the genome, both over the entire length of the cDNA and over each individual exon. In addition, we discarded transcripts for which some intron-exon junctions could not be identified by the mapping program. This procedure yielded 77,640 multiple-exon transcripts, which were clustered as described above.
Figure 17.1. The structure of the hyaluronidase locus on chromosome 9, which gives rise to polycistronic transcripts containing, besides Hyal1, the genes Fus2 and Hyal3. The gene for interferon-related developmental factor 2 is encoded on the opposite strand, downstream of Fus2. Constitutive exons are shown in green, exons with alternative 3' splice sites in yellow, and exons with alternative 5' splice sites in blue. Cryptic exons are drawn with a surrounding black box. Intronic regions are shown in red. (Please refer to the color section of this book to view this figure in color).
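The overlap-based cluster definition above is straightforward to implement. The following Python sketch is our illustration, not the original FANTOM pipeline; transcript names and coordinates are invented, and exons are half-open (start, end) intervals assumed to lie on the same chromosome and strand.

```python
# Sketch of the clustering described above: two transcripts join the same
# cluster if any pair of their exons shares at least one genomic nucleotide.

def cluster_transcripts(transcripts):
    """transcripts: dict mapping name -> list of (start, end) exons."""
    names = list(transcripts)
    parent = {n: n for n in names}          # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def exons_overlap(ex1, ex2):
        return any(s1 < e2 and s2 < e1 for s1, e1 in ex1 for s2, e2 in ex2)

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if exons_overlap(transcripts[a], transcripts[b]):
                parent[find(a)] = find(b)   # union the two clusters

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())

# Toy example: A and B share exonic sequence, C does not.
example = {
    "cDNA_A": [(100, 200), (300, 400)],
    "cDNA_B": [(350, 400), (500, 600)],
    "cDNA_C": [(1000, 1100)],
}
print(cluster_transcripts(example))  # [['cDNA_A', 'cDNA_B'], ['cDNA_C']]
```

At genome scale one would sort exons by coordinate and sweep rather than test all pairs; the quadratic version merely keeps the definition explicit.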
The types of splice variation that we sought to identify were: exons that are included in some transcripts and skipped in others (cryptic exons), exons with multiple splice acceptor sites, exons with multiple splice donor sites, and exons with multiple splice variations. In order to report a non-redundant set of alternative splice events, we focused on what we called "genomic exons". A genomic exon was defined as the union of all overlapping exons identified in the cDNAs of a cluster. We first compared all exons of all transcripts in a cluster in a pairwise fashion to identify those which used alternative donor and acceptor splice sites, and then propagated this information to genomic exons. Events that could also be interpreted as incomplete splicing were not reported as splice variations. These include intron inclusions and terminal exon extensions (Fig. 17.2). Technically, exons that appear in initial or terminal positions in some transcripts and are skipped in other transcripts cannot be unambiguously classified within our classification scheme. This is because we assume that the 5' and 3' ends of a given transcript are not defined with absolute certainty by the sequencing procedure. We made a specific note of such exons because, as we will discuss below, they may be indicative of alternative transcription inducing alternative splicing. The flow of the data through the analysis procedure is summarized in Table 17.1. In particular, we found multiple splice forms in 40% of the clusters containing multiple multi-exon transcripts, more than we initially found in the Fantom1 dataset (Zavolan et al. 2002), or than other authors reported based on analyses of EST sequences (Brett et al. 2002). Clearly, increasing the depth of transcriptome coverage by sequencing more transcripts from different tissues yields higher estimates.
Figure 17.2. Schematic representation of the forms of splice variation that we analyzed. For the color scheme see caption of Fig. 17.1. (Please refer to the color section of this book to view this figure in color).
Table 17.1. Mouse transcriptome summary. Flow of the data through the analysis procedure.

Data set description          Sequences    Clusters
Genome-mapped                   101,356      36,617
Quasi-complete mappings          77,640      33,042
Spliced                          54,490      18,970
Singletons                        7,293       7,293
Multi-transcript                 47,197      11,677
Alternative splicing             22,150       4,750
Without splice variation         25,047       6,927
However, the current estimates are still considerably lower than those for the human transcriptome, where as much as 59% of genes have multiple splice forms (Lander et al. 2001). Is the structure of mouse genes really simpler? To address this question, we developed a Bayesian model and estimated an upper bound on the fraction of mouse genes with multiple splice forms, taking into account the fact that cDNA sequencing gives only limited coverage of the potential transcriptome. The basic premise of the model is that as more transcripts are sequenced, some of the clusters that so far do not contain splice variants will acquire alternative splice forms. Thus, what we want to calculate is an upper bound on the probability P(n|k) that a gene has n splice forms given that k identically-spliced transcripts have been observed. In a Bayesian framework, this probability depends on the prior probability P(n) that a gene has n splice forms, and on the prior probability P(p_1, p_2, ..., p_n | n) for the relative frequencies with which these splice forms (1 through n) occur in the cDNA pool from which our data have been generated. To calculate an upper bound on P(n|k), we will assume that these priors are such that they put much of the probability on there being a large number of splice forms. For P(n) we assume a "scale-prior", that is, P(n) ∝ 1/n. This prior is uniform in log(n), and it is also the unique prior that is invariant under scale transformations n → λn and transformations of the form n → n^λ. For the prior P(p_1, p_2, ..., p_n | n) we take a uniform distribution, which again puts a relatively high a priori weight on very skewed frequencies p_i, thus increasing the a priori probability that we observe only one splice form even if alternative splice forms exist. With these assumptions, the probability of observing k identically-spliced transcripts is

$$P(k|n, \vec{p}\,) = \sum_{i=1}^{n} (p_i)^k .$$

By integrating out the unknown parameters p_i, which in a Bayesian framework are called "nuisance parameters", the probability P(k|n) of observing k times the same splice
form given that the gene has, in fact, n splice forms is

$$P(k|n) = \frac{\int \sum_{i=1}^{n} (p_i)^k \, dp_1 \cdots dp_n}{\int dp_1 \cdots dp_n} = \frac{n!\,k!}{(n+k-1)!},$$

where the integrals are taken over the simplex $p_i \ge 0$, $\sum_{i=1}^{n} p_i = 1$. We use Bayes' theorem to find the posterior probability that the gene has n distinct splice forms given that k identically-spliced transcripts have been observed. This probability is

$$P(n|k) = \frac{P(k|n)P(n)}{\sum_{n'=1}^{\infty} P(k|n')P(n')} = \frac{k-1}{k} \, \frac{(n-1)!\,k!}{(n+k-1)!}.$$
The probability that there is exactly one splice form given that k identical transcripts have been sampled is P(n = 1|k) = 1 − 1/k. Thus, the probability that the gene has multiple splice forms is P(n > 1|k) = 1/k. By summing this probability over all multiple-transcript clusters that contain only identical transcripts in our dataset, we find that an additional 2,232 of the 6,927 non-variant clusters are expected to have multiple splice forms. Together with the 4,750 clusters already observed to be variant, this yields an upper bound of 60% of genes with multiple splice forms. This value, extremely close to the estimate obtained for human genes (Lander et al. 2001), is confirmed by our subsequent analysis of all the mouse mRNA and EST sequences found in GenBank as of August 2002 – we found that 58.9% of multi-exon genes have multiple splice forms (Zavolan, unpublished data).
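Since P(n > 1|k) = 1/k, the expected number of clusters with hidden splice variation is obtained by summing 1/k over the non-variant clusters. A minimal sketch, with a hypothetical list of cluster sizes standing in for the real data:

```python
# Upper-bound calculation: for a cluster in which all k sampled transcripts
# are identically spliced, P(multiple splice forms) = 1/k. The size list
# below is a placeholder for the 6,927 non-variant multi-transcript clusters.

def upper_bound_variant_fraction(n_variant, nonvariant_cluster_sizes, n_total):
    hidden = sum(1.0 / k for k in nonvariant_cluster_sizes)
    return (n_variant + hidden) / n_total

sizes = [2, 2, 3, 5, 2, 4]                      # hypothetical cluster sizes
print(upper_bound_variant_fraction(4750, sizes, 11677))
# With the real sizes, the hidden term is 2,232 and the bound is ~0.60.
```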
3. Impact of Alternative Splicing on the Coding Potential
While bioinformatic analyses reveal that splice variation is indeed very common in mammals, the question that is much more difficult to address computationally is to what extent this variation is functional. We performed a number of analyses aimed at providing supplementary evidence toward this end. The first question that we addressed was to what extent individual splice events occur reproducibly in vivo. We searched dbEST sequences for variant exon forms initially inferred from cDNA sequences. A particular exon was considered "confirmed" if all its variant forms were present in the EST data. As expected, we find that exons with multiple variant forms have a lower rate of confirmation, and that initial and terminal exons, for which there is usually deeper EST sampling, have a higher rate of confirmation. Generally, the rate at which alternative 5' or 3' splice sites are confirmed is 20-25% (Zavolan et al. 2003), which is relatively
low. Multiple variations – for example, exons that can be either skipped or included and, if included, have variant boundaries – are confirmed at even lower rates than one would expect if individual variations were confirmed independently. This suggests that either different splice variants occur at markedly different frequencies, or there is correlated occurrence of certain splice variations, or both. Thus, we conclude that only simple forms of splice variation (exon inclusion/skipping and use of alternative 5'/3' splice sites) have a substantial rate of confirmation in the EST data, and that a large proportion of observed splice forms are probably rare transcripts produced as a result of a "noisy" splicing process (see also below). Does this splice variation affect the coding potential of a transcript? That is, does it tend to occur in the coding region of transcripts rather than in their untranslated regions? To address this question, we made use of the annotation of coding regions of representative transcripts generated by the Fantom2 annotation project. The details of constructing the set of representative transcripts are given in FANTOM Consortium and the RIKEN GSC Genome Exploration Group (2002). Briefly, full-length cDNAs were clustered, and cluster representatives were chosen using a combination of computational analyses and manual curation. This dataset was then merged with a dataset of transcripts from public databases, and another step of clustering was used to reduce redundancies. Human curation produced the final set of transcription units with their associated representative transcripts. Using the genome mapping and the FANTOM-annotated coding region (CDS) (Furuno et al. 2003) of the representative transcript, we determined the genomic coordinates corresponding to the start and end of the representative coding region. For clusters for which no representative transcript was curated (1,207 of the 4,750 variant clusters), we simply chose the longest cDNA in the cluster as the representative. Distinct protein forms will be generated when cryptic exons or alternative 5' and/or 3' splice sites fall within the genomic map of the CDS. Similarly to what has been reported for human transcripts (Modrek et al. 2001), we found that the large majority of variant exons – 6,225 (74%) of 8,456 – occur within the coding region of the representative transcript. Thus, most of the splice variation is expected to affect the protein coding potential of a transcript. Are alternative splice sites selected for preserving the reading frame? Previous studies suggested that, in human, alternatively spliced exons tend to have a length that is a multiple of three, thus preserving the reading frame (Thanaraj et al. 2003). Similarly, the length of the introns seems to be such that the reading frame tends to be preserved (de Souza et al. 1998). We sought to determine whether the Fantom2 data set supported this hypothesis. We first selected exons for which the alternative splice
sites were documented unambiguously, without the possibility of error introduced by sequencing or by genome assembly artifacts. Namely, we only analyzed simple variations such as alternative 5' and alternative 3' splice sites, with both exons flanking the junction matching the genome perfectly for at least 10 nucleotides from the splice junction. We then computed the distribution of distances between the alternative splice sites (Fig. 17.3). Surprisingly, we found that a large fraction of alternative splice sites are less than 10 nucleotides apart: 83 (18%) of the 452 cases of alternative 5' splice sites and 237 (40%) of the 587 cases of alternative 3' splice sites. Only alternative 3' splice sites – 310 of 587 cases (52.8%) – seem to be selected for preserving the frame (p-value < 2.2 × 10^−16 relative to a null model in which each of the reading frames occurs with identical probability) by keeping the distance between alternative splice sites a multiple of three. In contrast, we did not find such evidence for alternative 5' splice sites: only 159 (35.2%) of the 452 alternative 5' splice sites are situated a multiple-of-three distance away from each other (p-value = 0.4247). We do not know the reason for this asymmetry, other than that it seems to be entirely due to the frequent use (161 of the 587 cases) of alternative 3' splice sites that are only 3 nucleotides apart. These findings prompted us to infer that the spliceosome is quite "noisy" and that its repositioning around the splice signals generates much of the observed splice variation. This conjecture is supported by more recent evidence showing that selection for preserving the reading frame is evident only in alternative exons conserved between species (Resch et al. 2004; Sorek et al. 2004), and therefore presumed to be involved in functional splice variation. Surprisingly, the selection pressure seems to be higher for exons that are only infrequently included in a transcript. The picture that emerges out of all of these studies is that some splice forms that differ only locally in the protein sequence have acquired specific functions in the cell and are therefore under strong selective pressures that maintain them during evolution. In addition to these, a "noisy" splicing process generates many splice variants containing frameshift errors, which will most likely be the target of nonsense-mediated decay. Of course, not all of the frameshift-inducing alternative splicing is necessarily attributable to low-frequency splicing errors. In our database of mouse alternative splice forms (Zavolan et al. 2002), we came across the case of the UDP-galactose transport-related protein, in which a frame-shifting splice event occurred repeatedly, in transcripts from different tissues. In fact, Lewis et al. (2003) presented evidence that frameshift-inducing splice variation may occur in a regulated manner, as a mechanism for post-transcriptional down-regulation of mRNA transcripts.

Figure 17.3. Histogram of distances between alternative splice sites. Panel A: alternative 5' splice sites; panel B: alternative 3' splice sites. The x axis shows the distance between alternative splice sites (in nucleotides), and the y axis the number of alternative junction pairs.
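The frame-preservation statistics above can be checked with an exact binomial test: under the null model, the distance between alternative splice sites is a multiple of three with probability 1/3. The chapter does not state which test was used; a two-sided test on the quoted counts gives p-values close to the reported ones.

```python
# Binomial check of frame preservation; counts are those quoted in the text,
# and the two-sided alternative is our assumption.
from scipy.stats import binomtest

# Alternative 3' splice sites: 310 of 587 distances are multiples of three.
print(binomtest(310, 587, p=1/3).pvalue)   # << 2.2e-16

# Alternative 5' splice sites: 159 of 452, consistent with the 1/3 null.
print(binomtest(159, 452, p=1/3).pvalue)   # ~0.42, cf. 0.4247 in the text
```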
4. Alternative Splicing and Alternative Transcription
Perhaps the most surprising finding in the initial dataset of mouse full-length cDNAs (Zavolan et al. 2002) was the frequency of transcripts whose transcription appeared to be initiated from alternative promoters. Using a Bayesian model, we estimated this frequency to be at least 20%. Additionally, we found that many of the transcripts initiated from alternative promoters have alternative initial exons. Here we present our model applied to the much larger Fantom2 dataset. The main issue that the model had to address was how to accurately estimate the frequency of alternative transcription while excluding the possibility of cloning artifacts that may have resulted in truncated clones. For this reason, we focused on cases in which apparent alternative transcription was associated with alternative splicing in initial and terminal exons (Fig. 17.4). Such cases cannot be explained by mere loss of sequence from the beginning or the end of the clone.
Figure 17.4. Schematic representation of the events that could lead to cryptic initial and terminal exons. (Please refer to the color section of this book to view this figure in color).
Let us focus on cryptic initial exons. These could have been generated in two ways: transcription could have been initiated at an alternative promoter and then a cryptic 5' splice site was chosen to generate the cryptic initial exon; alternatively, transcription may have been initiated at a unique promoter, but splice variation coupled with sequence loss from the beginning of the transcript generated what appears now to be an initial cryptic exon. Two types of splice variation could have been involved: a cryptic second exon (followed by the loss of the entire first exon), and an alternative 5' splice site of the initial exon (followed by partial loss of this exon). Assume that the frequency of alternative promoter usage is f. Further assume that the frequency with which cryptic initial exons are generated from conserved promoters is p_c and from alternative promoters p_a. What we would like to estimate is f. If we could use the frequencies of the above types of splice variations in internal exons to infer p_c, we would at least be able to put bounds on f, knowing that both f and p_a take values between 0 and 1. The frequency of splice variation may, however, be dramatically different in internal compared to initial or terminal regions of a transcript. To circumvent this problem to some extent, we reasoned that even though the absolute frequency of splice variation may be different in initial versus internal positions, the relative ratio of the various splice forms may still be similar. Therefore, we turned to a second type of splice variation in initial exons that may be associated with alternative promoter usage (Fig. 17.4). This involves exons that are not cryptic, but appear in internal positions in some transcripts and as initial exons in other transcripts. In the latter case, however, the 5' end of the exon extends beyond the 3' splice signal which is used when the same exon appears internal to a transcript. We call this form of initial exon type B, and the cryptic initial exon type A. Let us now redefine our variables: given that an alternative transcription start site has been used (with probability f), transcripts have a probability p_a to generate an exon of type A (and 1 − p_a to generate an exon of type B); given that the same transcription start site has been used and nucleotide loss has occurred (with probability 1 − f), transcripts have probability p_c to be of type A. The key assumption that we make is that p_c can be estimated from the relative frequencies of the various splice variations in internal exons. We can now write the probability P_A that a transcript with variation in the initial exon is of type A as

$$P_A = f p_a + (1 - f) p_c, \qquad (17.1)$$

and similarly for type B

$$P_B = f (1 - p_a) + (1 - f)(1 - p_c). \qquad (17.2)$$

The probability to observe n_A exons of type A and n_B exons of type B is

$$P(n_A, n_B | f, p_a, p_c) = \left(f p_a + (1 - f) p_c\right)^{n_A} \left(f (1 - p_a) + (1 - f)(1 - p_c)\right)^{n_B}. \qquad (17.3)$$

Let m_A be the number of internal exons that were either cryptic or had variant 5' splice signals. Let m_B be the number of internal exons with variant 3' splice signals or with included intron(s). The probability to observe this data given p_c is

$$P(m_A, m_B | p_c) = p_c^{m_A} (1 - p_c)^{m_B}. \qquad (17.4)$$

Combining this with the equation above, we obtain the probability for all the observed counts

$$P(n_A, n_B, m_A, m_B | f, p_a, p_c) = \left(f p_a + (1 - f) p_c\right)^{n_A} \left(f (1 - p_a) + (1 - f)(1 - p_c)\right)^{n_B} p_c^{m_A} (1 - p_c)^{m_B}. \qquad (17.5)$$

The maximum likelihood values for f, p_a, and p_c satisfy

$$p_c^* = \frac{m_A}{m_A + m_B} \equiv w \qquad (17.6)$$

and

$$f^* p_a^* + (1 - f^*) p_c^* = \frac{n_A}{n_A + n_B} \equiv u. \qquad (17.7)$$

Here we have introduced the fraction of A types in internal (w) and initial (u) exons to simplify the expressions. Solving for f^* we find

$$f^* = \frac{u - w}{p_a^* - w}. \qquad (17.8)$$
We may now derive bounds on f^* and p_a^* knowing that both f and p_a take values between 0 and 1. If u > w (higher frequency of type A in initial exons), we find that f^* ≥ (u − w)/(1 − w) and p_a^* ≥ u. If u < w, we have f^* > (w − u)/w and p_a^* < u. In the Fantom2 dataset we have n_A = 1976, n_B = 1770, m_A = 3249 + 612 = 3861, and m_B = 1047 + 495 = 1542. Using these numbers we find that f^* > 0.26 and p_a^* < 0.53. The derivation for terminal exons is entirely analogous to the derivation above for initial exons. There we find f^* > 0.58 and p_a^* < 0.15. These values indicate that the use of alternative transcription start sites and especially of alternative polyadenylation signals is very common in mouse. In cases of alternative polyadenylation, the most common outcome is an elongation of the terminal exon. In contrast, the choice
of an alternative promoter is followed quite frequently by the choice of a cryptic splice signal to produce a novel initial exon. Many of the alternative initial exons are predicted to change the protein sequence: of all the 4,242 variant clusters that have been annotated as protein-encoding, 933 (22%) have cryptic initial and/or terminal exons that lie entirely within the coding region. These transcripts, if translated, would lack the initial and/or terminal parts of the protein.
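For concreteness, the bound computation of Eqs. (17.6)-(17.8) can be written out directly; the sketch below is our code, not the original analysis script, and it reproduces the f* > 0.26 bound from the counts quoted above.

```python
# Direct transcription of Eqs. (17.6)-(17.8) and the bounds derived from
# 0 <= f <= 1 and 0 <= p_a <= 1, evaluated on the counts quoted in the text.

def promoter_usage_bound(n_A, n_B, m_A, m_B):
    u = n_A / (n_A + n_B)   # fraction of type A among variant initial exons
    w = m_A / (m_A + m_B)   # fraction of type A among internal exons (= p_c*)
    if u > w:
        return (u - w) / (1 - w)   # lower bound on f*, with p_a* >= u
    return (w - u) / w             # lower bound on f*, with p_a* < u

# Initial exons: n_A = 1976, n_B = 1770.
# Internal exons: m_A = 3249 + 612, m_B = 1047 + 495.
f_min = promoter_usage_bound(1976, 1770, 3249 + 612, 1047 + 495)
print(f"f* > {f_min:.2f}")   # f* > 0.26, as in the text
```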
5. Regulation of Splicing
How the expression of alternative splice forms is controlled has been the focus of intense study (Robberson et al. 1990; Talerico and Berget 1990; Hoffman and Grabowski 1992; Zahler et al. 1993; Staknis and Reed 1994; Wang et al. 1995; Berget 1995; McCullough and Berget 1997; Caceres and Krainer 1997; Liu et al. 1998; Eldridge et al. 1999; Tacke and Manley 1999; Blencowe et al. 2000). The hypothesis that emerged is that cryptic exons are not readily recognizable by the spliceosome, and that additional signals, provided in a tissue-specific manner, are needed for the inclusion of these exons in the mature mRNA. Why would cryptic exons be less recognizable than constitutive exons? Berget (1995) found that cryptic exons tend to be shorter than constitutive exons, while Stamm et al. (2000) showed that the signals flanking alternative exons deviate from the consensus splice signals. Exonic splice enhancers can compensate for weaker splice sites by binding to regulatory proteins (Berget 1995; Liu et al. 1998; Schaal and Maniatis 1999; Tacke and Manley 1999; Fairbrother et al. 2002), and Smith and Valcarcel (2000) proposed that the tissue-dependent expression of these regulatory proteins is responsible for the pattern of exon inclusion. As the Fantom2 dataset provided us with the largest database of splice variants available at the time of our study, we set out to answer the following questions: are cryptic exons shorter, on average, than constitutive exons; are cryptic exons flanked by weaker splice signals; are there differences in the oligonucleotide composition of cryptic versus constitutive exons (presumably due to the presence of exonic splice enhancers or silencers)? For this purpose we extracted from our database a set of 15,298 constitutive and 3,468 cryptic internal exons, together with their flanking splice signals.
5.1 Length Distribution is Different Between Constitutive and Cryptic Exons
As shown in Fig. 17.5, the average length of cryptic exons is quite similar to that of constitutive exons, although cryptic exons have a somewhat larger mean and also a larger variance (142.4 ± 192.7 versus 129.2 ± 92.2). The two distributions are, however, different overall (Kolmogorov-Smirnov test, P-value = 2.075 × 10^−12), with the most striking difference being an enrichment of cryptic exons at both ends. Thus, in contrast to Berget (1995), who found that cryptic exons tend to be shorter than constitutive ones, we found an enrichment of cryptic exons at both extremes of the length interval. We proposed that the discrepancy between the results of these two studies stems from the fact that Berget analyzed EST sequences, while our study was based entirely on full-length cDNA sequences. ESTs are generally short, of a few hundred nucleotides, and if we were to truncate the length distribution of the exons in our dataset at 250 nucleotides, after which the data of Berget become very sparse, we would also infer that cryptic exons are shorter than constitutive exons. It is yet unclear why both short exons and long exons tend to be overlooked by the spliceosome. According to the exon definition model (Berget 1995), vertebrate exons are "recognized" through interactions that take place across exons between splicing factors and SR proteins. It is conceivable that there is an exon length at which these interactions occur optimally. Shorter exons may not contain all the appropriate signals, or may not accommodate all the required splicing factors. On the other hand, very long exons may not allow the assembly of spliceosomal proteins with appropriate spatial relationships, and therefore these exons may be equally excluded from the mature mRNA.

Figure 17.5. Length distribution of cryptic and constitutive exons (x axis: exon length in nucleotides; y axis: frequency).
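The comparison of the two length distributions is a standard two-sample Kolmogorov-Smirnov test. A sketch with simulated placeholder lengths (the real input would be the 15,298 constitutive and 3,468 cryptic exon lengths):

```python
# Two-sample Kolmogorov-Smirnov test on cryptic versus constitutive exon
# lengths. The arrays here are randomly generated placeholders, not the
# exon sets of the study; the parameters merely mimic the reported means.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
constitutive_len = rng.gamma(shape=2.0, scale=65.0, size=15298)  # mean ~130 nt
cryptic_len = rng.gamma(shape=1.1, scale=130.0, size=3468)       # mean ~143 nt, heavier tails

stat, pval = ks_2samp(cryptic_len, constitutive_len)
print(f"KS statistic = {stat:.3f}, P-value = {pval:.3g}")
```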
5.2 Constitutive and Cryptic Exons Differ in their Flanking Splice Signals
To determine the positions in the splice signal that may be important in distinguishing between cryptic and constitutive exons, we used the following Bayesian model. At each position i in the splice signal, let n_α^i and m_α^i be the numbers of nucleotides α (α ∈ A = {A, C, G, T}) observed in the constitutive and cryptic exons, respectively. We then computed, at each position, the likelihoods of the observed nucleotide frequencies assuming that (1) constitutive and cryptic exons had different underlying nucleotide frequencies p_α^i and q_α^i, and (2) the underlying frequencies were identical between the two datasets. Under the first model, the likelihood for the observed nucleotide counts given the frequencies p_α^i and q_α^i is

$$L_1(D^i | p^i, q^i) = \prod_{\alpha \in \mathcal{A}} (p_\alpha^i)^{n_\alpha^i} (q_\alpha^i)^{m_\alpha^i}.$$

Using a uniform prior for the unknown frequencies p_α^i and q_α^i we obtain

$$L_1(D^i) = \frac{\int L_1(D^i | p^i, q^i) \, dp^i \, dq^i}{\int dp^i \, dq^i} = \frac{3! \prod_{\alpha \in \mathcal{A}} n_\alpha^i!}{(n^i + 3)!} \cdot \frac{3! \prod_{\alpha \in \mathcal{A}} m_\alpha^i!}{(m^i + 3)!},$$

where $n^i = \sum_{\alpha \in \mathcal{A}} n_\alpha^i$, $m^i = \sum_{\alpha \in \mathcal{A}} m_\alpha^i$, and the integrals are over the simplex $\sum_\alpha p_\alpha^i = \sum_\alpha q_\alpha^i = 1$. Under the second model, assuming that the nucleotide counts at position i in the constitutive and cryptic exons derive from the same underlying distribution p_α^i, we obtain

$$L_2(D^i | p^i) = \prod_{\alpha \in \mathcal{A}} (p_\alpha^i)^{n_\alpha^i + m_\alpha^i}.$$

Using again a uniform prior over p_α^i we obtain

$$L_2(D^i) = \frac{3! \prod_{\alpha \in \mathcal{A}} (n_\alpha^i + m_\alpha^i)!}{(n^i + m^i + 3)!}.$$

The likelihood ratio L_1(D^i)/L_2(D^i) for each position relative to the 3' and 5' splice junctions, respectively, is shown in Fig. 17.6. We observe statistically significant differences at positions +1, +4 and +5 in the 5' splice site and positions −5 and −6 in the 3' splice site. We conclude that these differences probably underlie observations that cryptic exons have "weaker" splice sites (Stamm et al. 2000).
Figure 17.6. Position-dependent ratio between the likelihoods that nucleotide frequencies are different versus identical between the splice signals flanking constitutive and cryptic exons. Panel A shows the profile of this ratio along the 5’ splice signal, and panel B the profile along the 3’ splice signal. The x axis indicates the distance (in nucleotides) from the splice junction.
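The likelihood ratio plotted in Fig. 17.6 follows directly from the closed-form marginal likelihoods derived above. A sketch of the per-position computation, with invented nucleotide counts:

```python
# Per-position log likelihood ratio L1(D^i)/L2(D^i): separate nucleotide
# frequencies for constitutive and cryptic exons versus a shared one.
# The Dirichlet(1,1,1,1) marginal likelihood has the closed form used in
# the text; example counts are invented for illustration.
from math import lgamma

def log_marginal(counts):
    """log of 3! * prod(c!) / (sum(c) + 3)! for a 4-vector of counts."""
    return lgamma(4) + sum(lgamma(c + 1) for c in counts) - lgamma(sum(counts) + 4)

def log_ratio(n, m):
    """log [ L1 / L2 ]; positive values favor different underlying signals."""
    pooled = [a + b for a, b in zip(n, m)]
    return log_marginal(n) + log_marginal(m) - log_marginal(pooled)

# Hypothetical counts at one splice-signal position (order: A, C, G, T).
n_constitutive = [820, 110, 3050, 120]   # strongly biased toward G
m_cryptic      = [310,  95,  480, 115]   # visibly weaker bias
print(log_ratio(n_constitutive, m_cryptic))  # >> 0: evidence for a difference
```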
5.3 Constitutive Exons are Enriched in Known Splice Enhancer Motifs
While there is a general consensus that cryptic exons are in some way less recognizable to the spliceosome than constitutive exons, the signals that enable the recognition of cryptic exons in certain tissues are yet to be uncovered. Making use of our relatively large dataset of constitutive and cryptic exons, we sought to identify motifs whose representation differs between these two datasets, and which would therefore be candidate splice enhancers or silencers. Previous studies suggested that many sequences can function as splice enhancers. In particular, Blencowe et al. (2000) and Eldridge et al. (1999) showed that 15-20% of randomized 18- or 20-mers, when substituted for an ESE within different exon contexts, promote splicing, and a high level of degeneracy was observed by Fairbrother et al. (2002) with even shorter sequences (6 nucleotides long). Our aim was not only to pinpoint the sequence motifs that contribute to exon recognition in general, but mainly to find those motifs that help distinguish between constitutive and cryptic exons. The method that we used to extract these motifs was the following. For each "word" of 8 or fewer nucleotides, we computed the frequency f_co of occurrence in the set of 15,298 constitutive exons and the frequency f_cr in the set of 3,468 cryptic exons (let s²_co and s²_cr be the corresponding variances). Similarly to Fairbrother et al. (2002), we focused our analysis on the neighborhood of splice signals, namely 40 nucleotides around the 5'- and 3'-splice junctions of the exons. Under the assumption that regulatory motifs occur independently at any position around the splice boundary, the
number of occurrences of the motif is binomially distributed. Approximating the binomial distribution by a Gaussian, we calculated, for each motif, the value

$$z = \frac{f_{cr} - f_{co}}{\sqrt{s_{co}^2 + s_{cr}^2}},$$

which quantifies the difference between the mean frequencies of a motif in the two datasets, assuming that the variance is the same. We then selected, for each length l, all oligonucleotides for which the p-value corresponding to this z-statistic is below 4^−l, meaning that we expect less than one false positive motif at each length. These are motifs that are over-represented in cryptic exons. The list is, of course, redundant: short motifs that are part of longer over-represented motifs are also over-represented. We therefore removed those motifs that are contained within longer motifs that also occur in the list. Finally, we retained only those motifs whose frequency cannot be predicted from the mono-nucleotide frequencies in the corresponding set of exons. We used a similar procedure to identify motifs that are over-represented in constitutive exons. Figure 17.7 shows the final set of motifs, classified based on

1 The location where the motif is differentially represented (5' or 3' splice site)

2 Over- or under-representation in constitutive vs. cryptic exons

3 Over- or under-representation of the motif when compared with the frequency expected from the mono-nucleotide frequencies within the corresponding set of exons.

Interestingly, some of the previously reported splice enhancer motifs (Fairbrother et al. 2002) turn out to be over-represented in constitutive exons compared to cryptic exons. Of these, TGAAG- and AAGAA-containing motifs are over-represented in both constitutive and cryptic exons relative to their mono-nucleotide frequencies, while TGGA-containing motifs are enriched in constitutive exons but depleted in cryptic exons relative to the mono-nucleotide frequencies. This suggests that, aside from different length and weaker splice signals, the absence of splice enhancers may also be responsible for making the cryptic exons less recognizable by the spliceosome. Cryptic exons must, however, contain some signals that determine their inclusion in the mature mRNA. Candidate signals may reside in the set of sequences enriched in cryptic compared to constitutive exons. These are pyrimidine-rich motifs, reminiscent of the SRp20-specific motifs described by Schaal and Maniatis (1999), or perhaps of motifs to which the polypyrimidine tract binding-protein binds (Fairbrother and Chasin 2000). Additionally, we found a number of motifs resembling binding sites of hnRNP proteins A1 (Burd and Dreyfuss 1994; Chabot et al. 1997), F and H
(Min et al. 1995), and of p54nrb (Basu et al. 1997). These binding sites, some of which inhibit exon inclusion, are G-rich, frequently containing an AGGG core, or contain alternating A and G nucleotides. These results suggest that cryptic exons may also contain signals that generally inhibit splicing, and that in certain tissues the proteins binding to such signals are under-expressed, rendering the exons recognizable by the spliceosome.

Figure 17.7. Nucleotide motifs whose representation is significantly different between constitutive and cryptic exons. Left panels contain motifs over-represented in constitutive exons; right panels contain motifs over-represented in cryptic exons. Upper and lower panels contain motifs found in the neighborhood of the 5'- and 3'-splice sites, respectively. Additionally, each set of motifs has been subdivided based on its enrichment or depletion relative to the mono-nucleotide frequencies in the two categories of exons.
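The motif screen itself reduces to a z-test on word frequencies. In the sketch below, using the binomial variance of the occurrence frequency for s²_co and s²_cr is our reading of the procedure, and the counts are invented:

```python
# z-score screen for motifs differentially represented between cryptic and
# constitutive exon windows, with the significance cutoff p < 4**(-l).
from math import sqrt, erfc

def motif_z(count_cr, n_cr, count_co, n_co):
    f_cr, f_co = count_cr / n_cr, count_co / n_co
    var_cr = f_cr * (1 - f_cr) / n_cr   # binomial variance of the frequency
    var_co = f_co * (1 - f_co) / n_co
    return (f_cr - f_co) / sqrt(var_cr + var_co)

def significant(word, count_cr, n_cr, count_co, n_co):
    z = motif_z(count_cr, n_cr, count_co, n_co)
    p = erfc(abs(z) / sqrt(2))          # two-sided Gaussian tail
    return p < 4.0 ** (-len(word)), z, p

# Hypothetical counts of one 5-mer in 3,468 cryptic and 15,298 constitutive windows.
print(significant("CTTCC", 260, 3468, 620, 15298))
```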
5.4 Recruitment of Repeat Sequences in Alternative Splicing
Repetitive sequences occupy large fractions of mammalian genomes. For example, transposon-derived interspersed repeats cover 46% of the human (Lander et al. 2001) and 38% of the mouse (Waterston et al. 2002) genomes, but only 3% of the genome of the Japanese pufferfish Fugu rubripes. The high repeat content in mammals raises the question of whether repeats are yet another mechanism to generate complexity and variation in higher organisms. Analyses of potential molecular functions of repeat sequences have been mostly restricted to triplet repeats involved in neurological diseases and spontaneous mutations. In fact, intracisternal A-particle (IAP) insertions contribute to about 10% of spontaneous mouse mutants (Waterston et al. 2002). Thus far, little is known about the effect that repeats might have on splicing and the alteration of protein function. To capture information on spliced repeats, we compared genomic exon positions with the RepeatMasker-determined repeat positions in the corresponding coding sequence (CDS)-containing cDNAs. If the repeat sequence terminates at the splice site of the genomic exon, the repeat in the cDNA is truncated. If the repeat spans the splice junction (exon-intron-exon), two adjacent truncated repeat sequences will be found within the cDNA. The results of this analysis were integrated into the "Functional REPeat" database FREP (http://facts.gsc.riken.go.jp/FREP/) (Nagashima et al. 2004). In total, 1,401 repeats in 1,253 cDNAs (7.6% of the transcript set) were found to be spliced. More than 50% of spliced repeats are attributed to SINEs (33.8%) and LTRs (16.3%). The remaining repeats are derived from simple (16.3%) and other repeat families. Of functional interest are two subsets: 1) repeats supplying alternative splice sites, and 2) spliced repeats overlapping with the CDS or residing inside the CDS. The latter are potential sources of species- or strain-specific modifications of protein functions. We identified 615 (43.8%) spliced repeats in the CDS of 579 (18.7%) cDNA sequences. Insertion of interspersed repeats or expansion of simple repeats may generate alternative splice sites. One can identify alternative splice sites provided by repeat sequences either by mapping all the data (cDNAs, splice sites, repeat elements) onto the genome or by extracting variant exons from existing databases such as MouSDB (Zavolan et al. 2003) and aligning those with repeat-containing cDNAs. We have used the latter strategy: pairwise alignment of repeat-containing cDNAs to the variant exon sequences of the 4,750 variant clusters of MouSDB showed that 239 repeats of 229 cDNAs contributed to 4.6% (219 of 4,750) of variant clusters. The actual number might be even higher, because a repeat sequence that resides mostly in an intron and covers only a few bases of the adjacent exon(s) cannot be detected by RepeatMasker in the cDNA
sequence. To capture all spliced repeats it would be necessary to use the first strategy of projecting all the data onto the genome to determine the spatial relationship between repeat elements and variant splice junctions. Despite these shortcomings, we were able to extract several interesting biological findings. More than 90% of repeat-containing genomic exons with CDS potential were spliced in-frame, reflecting the selection pressure on insertion or exon shuffling events. Most of the remaining spliced repeats associated with frameshifts or unexpected stop codons were caused by sequence quality problems, rather than repeat insertions. One of the few candidates containing a spliced repeat that causes a premature translation stop is a flavin-containing monooxygenase-1 (fmo-1) variant. Clone A630093H08 (AK042457) harbors a retrovirus-derived IAP element of the LTR ERV-K family. The IAP supplies a cryptic splice site in genomic exon 5, which results in a premature stop codon. The putative translation product has a truncated fmo-domain (InterPro IPR000960) that may affect fmo-dependent drug metabolism. Ketoconazole-like drugs for systemic mycotic infections and prostate cancer are toxic if metabolized by Fmo-1 (Rodriguez and Acosta 1997). Therefore, toxicity testing in mice that express the IAP-associated Fmo-1 isoform may generate misleading results. Tumor necrosis factor superfamily member 13b (Tnfsf13b) of C57Bl/6J mice provides an example of strain-specific expansion of the protein-coding sequence by a spliced repeat. Tnfsf13b (NM 033622) carries in its CDS a 93 bp LTR MaLR repeat. The cDNA-to-genome alignment of Tnfsf13b revealed that this repeat is spliced in-frame and accounts for the entire exon 3. The repeat was found in Tnfsf13b of the C57Bl/6J mouse strain, but not in Tnfsf13b of BALB/c, rat, or human. The functionally active form of human TNFSF13B is a trimer (Liu et al. 2002) that binds to TNFRSF13B, TNFRSF13C, and TNFRSF17 (Gavin et al. 2003). The LTR MaLR insertion causes a unique extension of the N-terminal sequence of the mature Tnfsf13b in C57Bl/6J mice. It is possible that the different conformation may affect trimerization and/or receptor binding. Since Tnfsf13b is a key molecule in systemic lupus erythematosus (SLE)-like autoimmune diseases, it will be interesting to establish whether the repeat-containing Tnfsf13b of C57Bl/6J affects the receptor signaling network and susceptibility toward SLE-like diseases. Dlm1 encodes a Z-DNA binding protein that is inducible by γ-interferon and expressed in activated macrophages from the colonic mucosa of Crohn's disease patients and from stomach adenocarcinomas. Extreme transcript variability obscures the functional characterization of Dlm1. Human Dlm1 produces more than 2,000 transcript variants (Rothenburg et al. 2002). Some of the alternative splicing appears to be facilitated by repeat insertions. Of note, exon 9 is derived from a Tigger1 DNA transposon (Smit and
Riggs 1996). Although Tiggers are very ancient repeat elements, mouse and rat Dlm1 seem to have lost this exon, as they completely lack the sequence corresponding to human exon 9. Mouse Dlm1 has instead acquired L1 and B1 repeats that are absent in the human ortholog. The B1 repeat of genomic exon 4 in Dlm1 (AK008179) generates an alternative splice site leading to a short variant. The apparent species-specific usage of different repeat elements to increase the number of splice variants may have interesting effects on the differential role of Dlm1 in Z-DNA binding and RNA editing in mouse and human. The functional modifications of gene products described above are mediated by the splicing machinery acting on specific repeat types located at specific positions within the genes. This adds yet another level of complexity to the mammalian transcriptome, and when we consider the large combinatorial space of transcript variations as a function of potential splice sites, repeat positions, splicing factors, and single nucleotide polymorphisms (not discussed in this review), it becomes clear that our knowledge about the transcriptome and its impact on phenotype is rather rudimentary. Further intra- and cross-species analyses are required to understand the flexibility of mammalian transcriptomes. In particular, the unexpected species-specific splicing differences caused by repeat insertions require further systematic cross-species investigations to re-evaluate mouse strains as models for human diseases.
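The coordinate logic used to call spliced repeats can be illustrated with simple interval arithmetic. The sketch below is our reconstruction under stated assumptions, not the FREP pipeline; coordinates are hypothetical, and the real inputs would be RepeatMasker annotations and cDNA-to-genome exon maps.

```python
# A repeat annotated on the genome is called "spliced" when it truncates at,
# or spans, a boundary of a genomic exon. Intervals are half-open (start, end).

def classify_repeat(repeat, exons):
    r_start, r_end = repeat
    hits = [(s, e) for s, e in exons if s < r_end and r_start < e]
    if not hits:
        return "intronic_or_intergenic"
    if len(hits) > 1:
        return "spans_splice_junction"      # truncated pieces in >1 exon
    s, e = hits[0]
    if r_start < s or r_end > e:
        return "truncated_at_splice_site"   # extends past one exon boundary
    return "contained_in_exon"

exons = [(1000, 1200), (1500, 1650), (2000, 2300)]
print(classify_repeat((1150, 1550), exons))  # spans_splice_junction
print(classify_repeat((1600, 1700), exons))  # truncated_at_splice_site
print(classify_repeat((2050, 2250), exons))  # contained_in_exon
```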
6. Alternative Splicing of Regulatory Factors
The category of genes that we were especially interested in studying from the point of view of alternative splicing is that of genes that themselves regulate alternative splicing. This topic is not new: the classical example of functionally distinct splice forms comes from splicing factors involved in sex determination in Drosophila (Baker 1989). In the Fantom2 dataset, we found splice variants for many of the known splice-regulatory proteins: SF3B, SC35, SRp40, SRp55, KSRP, SRRP86, hnRNP C, SFRS7, SMN-related factor, U4/U6 ribonucleoprotein HPRP3, and others. As an example, Fig. 17.8 shows the intron-exon structure of three full-length transcripts from the Fantom2 data set and of one transcript found in the public databases, all encoding polypyrimidine tract-binding protein (also known as heterogeneous nuclear ribonucleoprotein I, hnRNP I). We used the SMART protein domain prediction program (Schultz et al. 1998) to infer the domains encoded in each of the splice variants. The longest form of this protein contains four RNA recognition motifs (RRMs). Alternative splicing generates forms in which some of the exons are excluded, resulting in the loss of two or even three RRMs. The latter form has been isolated
from a tumor tissue, but the other forms seem to co-exist in the thymus. It thus appears that splicing factors, with their modular architecture featuring RNA recognition motifs, are easily amenable to splice variation.

Figure 17.8. Gene structure (left side) and predicted protein domain composition (right side) of various splice forms of the polypyrimidine tract-binding protein. Predicted RNA recognition motifs are shown as diamonds. (Please refer to the color section of this book to view this figure in color).

With large transcriptome datasets and several eukaryotic genomes now published, one can begin to quantitatively study the extent and pattern of splicing of regulatory proteins. We began these analyses with another class of proteins which, due to the connection that we and others have found between alternative transcription and alternative splicing, may also contribute to the regulation of splicing. This is the set of zinc finger-containing proteins, which were extensively annotated during the Fantom2 project. Comparable annotation of splicing regulatory proteins was not available at that time, and a statistical analysis of this dataset is the focus of a further study.
6.1 Alternative Splicing of Zinc Finger-Containing Proteins
Many zinc finger-containing proteins have multiple isoforms (Ravasi et al. 2003): of the 11,677 transcription units with multiple transcripts, 655 (5.6%) were annotated as encoding zinc-finger proteins, and 311 of these (47.5%) had multiple splice forms. This frequency is significantly higher than for the rest of the transcription units (4,439 variant TUs among a total of 11,022 TUs = 41.1%; P-value = 0.0002), and the result is not due to deeper sampling of transcription units corresponding to transcription factors (average number of transcripts sampled per zinc finger TU = 4.0 versus 4.04 for the rest of the TUs). The frequency increased even further when ESTs from dbEST were included in the analysis of splice variation (data not shown), indicating that many variants are yet to be discovered. Furthermore, we found that the use of alternative transcription start sites is a common mechanism for altering the protein sequence of transcription factors: in 334 (51%) of the 655 zinc finger-containing TUs,
we found at least one transcript predicted to generate a truncated protein. Truncated protein forms may have important regulatory functions. For example, Yang et al. (2002) found that STAT92E is negatively regulated by an N-terminally truncated STAT derived from an alternative promoter site. The high rate of alternative splicing in the zinc finger superfamily could reflect the modular domain architecture and the fact that individual domains commonly occur as single exons within a gene. This gene structure appears to facilitate the addition and swapping of entire domains during evolution (Morgenstern and Atchley 1999).
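The enrichment of splice variation among zinc-finger TUs reported above can be checked with a standard contingency-table test. The chapter does not name the test it used; a Fisher exact test on the quoted counts is one reasonable choice and gives a P-value close to the reported 0.0002.

```python
# 2x2 table of variant/non-variant TUs: zinc-finger versus all other
# multi-transcript transcription units, with counts from the text.
from scipy.stats import fisher_exact

#                 variant  non-variant
table = [[311, 655 - 311],         # zinc finger-containing TUs
         [4439, 11022 - 4439]]     # remaining multi-transcript TUs

odds, p = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds:.2f}, P = {p:.4g}")
```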
7. Conclusions
Initially, alternative splicing was thought to be a minor processing pathway affecting about 5% of all genes (Sharp 1994). With the growth of mammalian EST and cDNA datasets and with more sophisticated bioinformatic analyses (see, for example, Mironov et al. 1999), it has become increasingly clear that alternative splicing is a common phenomenon. A recent microarray survey of exon-exon junctions in over 10,000 human genes suggests that at least 74% of all human multi-exon genes are alternatively spliced (Johnson et al. 2003). Alternative splicing is not only one of the most important gene regulatory mechanisms in eukaryotes, it is also one of the most versatile, as it can have a number of downstream effects. Perhaps the most well-known effect of alternative splicing is the insertion or deletion of protein domains. Only very recently has it become apparent that reproducible alternative splicing events result in frame-shifted transcripts which are then subjected to nonsense-mediated decay. Yet another downstream effect of alternative splicing has been observed in the case of the T-cell receptor gene. Aberrant Tcr-β transcripts bearing nonsense codons are down-regulated by the nonsense-mediated decay factor UPF2 (Wang et al. 2002b; Mendell et al. 2002), this mechanism probably preventing the expression of truncated proteins resulting from non-productive rearrangement of T cell receptors. Wang and co-workers have moreover found (Wang et al. 2002a) that mutations in Tcr-β transcripts that cause premature termination or nonsense codons are countered with an increase in the number of alternatively spliced transcripts that skip the nonsense codon. Thus, in this case alternative splicing is used to bypass errors introduced in the reading frame by other mechanisms. Finally, some alternative splicing events do not affect the coding region of the transcript but rather the untranslated regions. For example, a form of TCRζ with an alternatively spliced 3' UTR is generated with increased frequency in patients with systemic lupus erythematosus (Nambiar et al. 2001). The implication of this change in the splicing pattern is yet unknown, and the first hypothesis that comes to mind is that
alternative splicing in the 3' UTR may affect the expression of the protein by changing the stability of the mRNA. The discovery of post-transcriptional gene regulation by microRNAs acting on the 3' UTR of mRNAs makes it very tempting to speculate that some of the effects of alternative splicing in the 3' UTRs are mediated by a change in susceptibility to regulation by microRNAs. What makes a given transcript be expressed in a given cell at a given time? Given the interdependencies in generating and controlling gene product diversity, we are only at the very beginning of understanding context-dependent expression. Recent analyses suggest that as the number of genes in an organism grows, the fraction of the organism's genes that is involved in regulatory processes grows as well (van Nimwegen 2003). In line with this, our own studies indicate that regulatory genes are also more extensively diversified through alternative splicing than non-regulatory genes. It is clear that the regulated alternative splicing of regulatory genes has an enormous potential for implementing complex regulatory circuits. First of all, it seems likely that the regulation of alternative splicing is, like the regulation of transcription, under combinatorial control, with collections of interacting RNA-binding proteins together affecting alternative splicing (Smith and Valcarcel 2000). The alternative splicing of regulatory proteins may then have several downstream effects. In cases where protein interaction domains are affected, the ability of these regulators to interact with other regulators may change. If the RNA- or DNA-binding domains are affected by the alternative splicing, then this may affect their ability to recognize regulatory motifs in RNA and DNA, respectively. All these changes will affect the composition of various enhanceosomes, with a plethora of downstream effects on gene expression, including on the rate of transcription initiation, on the use of alternative transcription start sites, and on alternative splicing itself. The complex regulatory interactions determining the expression of regulatory proteins are likely implemented by constellations of regulatory motifs in coding and non-coding DNA that are under strong evolutionary selective pressures. Indeed, a recent paper that uncovered ultraconserved sequence elements in mammalian genomes (Bejerano et al. 2004) found that many of them correspond to alternative exons and their flanking regions in genes involved in RNA processing and in developmentally-related transcription regulation. This suggests that the alternative splicing of regulatory genes may be especially tightly regulated. Unraveling this regulatory network is not only an exciting challenge for the future, but in our opinion an essential prerequisite for any systems-biology study of the "Omics" space (Toyoda and Wada 2004).
Notes
1. We called an exon cryptic if it was included in some transcript(s) and skipped in other(s). These exons were called alternative or cassette exons in other studies.
References
Baker, B. 1989. Sex in flies: the splice of life. Nature 340: 521–524. Basu, A., Dong, B., Krainer, A.R., and Howe, C.C. 1997. The intracisternal A–particle proximal enhancer–binding protein activates transcription and is identical to the RNA– and DNA–binding protein p54nrb/NonO. Mol. Cell Biol. 17: 677–686. Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W., Mattick, J., and Haussler, D. 2004. Ultraconserved elements in the human genome. Science 304: 1321–1325. Berget, S. 1995. Exon recognition in vertebrate splicing. J. Biol. Chem. 270: 2411–2414. Blencowe, B., Bauren, G., Eldridge, A., Issner, R., Nickerson, J., Rosonina, E., and Sharp, P. 2000. The SRm160/300 splicing coactivator subunits. RNA 6: 111–120. Brett, D., Pospisil, H., Valcarcel, J., Reich, J., and Bork, P. 2002. Alternative splicing and genome complexity. Nat. Genet. 30: 29–30. Burd, C. and Dreyfuss, G. 1994. RNA binding of hnRNP A1: significance of hnRNP A1 high–affinity binding sites in pre–mRNA splicing. EMBO J. 13: 1197–1204. Caceres, J. and Krainer, A. 1997. Mammalian pre–mRNA splicing factors. In Eukaryotic mRNA processing (ed. A. Krainer), pp. 174–182. Oxford Univ. Press. Chabot, B., Blanchette, M., Lapierre, I., and La Branche, H. 1997. An intron element modulating 5' splice site selection in the hnRNP A1 pre–mRNA interacts with hnRNP A1. Mol. Cell Biol. 17: 1776–1786. de Souza, S., Long, M., Klein, R., Roy, S., Lin, S., and Gilbert, W. 1998. Towards a resolution of the introns early/late debate: only phase zero introns are correlated with the structure of ancient proteins. Proc. Natl Acad. Sci. USA 95: 5094–5098. Eldridge, A., Li, Y., Sharp, P., and Blencowe, B. 1999. The SRm160/300 splicing coactivator is required for exon–enhancer function. Proc. Natl. Acad. Sci. USA 96: 6125–6130. Fairbrother, W. and Chasin, L. 2000. Human genomic sequences that inhibit splicing. Mol. Cell Biol. 20: 6816–6825. Fairbrother, W., Yeh, R., Sharp, P., and Burge, C. 2002. Predictive identification of exonic splicing enhancers in human genes. Science 297: 1007–1013.
FANTOM Consortium and the RIKEN GSC Genome Exploration Group. 2002. Analysis of the mouse transcriptome based upon functional annotation of 60,770 full length cDNAs. Nature 420: 563–573. Furuno, M., Kasukawa, T., Saito, R., Adachi, J., Suzuki, H., Baldarelli, R., Hayashizaki, Y., and Okazaki, Y. 2003. CDS annotation in full–length cDNA sequence. Genome Res. 13: 1478–87. Gavin, A., Ait–Azzouzene, D., Ware, C., and Nemazee, D. 2003. DeltaBAFF, an alternate splice isoform that regulates receptor binding and biopresentation of the b cell survival cytokine, BAFF. J. Biol. Chem. 278: 38220–38228. Gregory, S., Sekhon, M., Schein, J., Zhao, S., Osoegawa, K., Scott, C., Evans, R., Burridge, P., Cox, T., Fox, C., Hutton, R., Mullenger, I., Phillips, K., Smith, J., Stalker, J., Threadgold, G., Birney, E., Wylie, K., Chinwalla, A., Wallis, J., Hillier, L., Carter, J., Gaige, T., Jaeger, S., Kremitzki, C., Layman, D., Maas, J., McGrane, R., Mead, K., Walker, R., Jones, S., Smith, M., Asano, J., Bosdet, I., Chan, S., Chittaranjan, S., Chiu, R., Fjell, C., Fuhrmann, D., Girn, N., Gray, C., Guin, R., Hsiao, L., Krzywinski, M., Kutsche, R., Lee, S., Mathewson, C., McLeavy, C., Messervier, S., Ness, S., Pandoh, P., Prabhu, A., Saeedi, P., Smailus, D., Spence, L., Stott, J., Taylor, S., Terpstra, W., Tsai, M., Vardy, J., Wye, N., Yang, G., Shatsman, S., Ayodeji, B., Geer, K., Tsegaye, G., Shvartsbeyn, A., Gebregeorgis, E., Krol, M., Russell, D., Overton, L., Malek, J., Holmes, M., Heaney, M., Shetty, J., Feldblyum, T., Nierman, W., Catanese, J., Hubbard, T., Waterston, R., Rogers, J., de Jong, P., Fraser, C., Marra, M., McPherson, J., and Bentley, D. 2002. A physical map of the mouse genome. Nature 418: 743–750. Hoffman, B. and Grabowski, P. 1992. U1 snRNP targets an essential splicing factor, U2AF65, to the splice site by a network of interactions spanning the exon. Genes Dev. 6: 2554–2568. Johnson, J., Castle, J., Garrett–Engele, P., Kan, Z., Loerch, P., Armour, C., Santos, R., Schadt, E., Stoughton, R., and Shoemaker, D. 2003. Genome–wide survey of human alternative pre–mRNA splicing with exon junction microarrays. Science 302: 2141–2144. Kawai, J., Shinagawa, A., Shibata, K., Yoshino, M., Itoh, M., Ishii, Y., Arakawa, T., Hara, A., Fukunishi, Y., Konno, H., Adachi, J., Fukuda, S., Aizawa, K., Izawa, M., Nishi, K., Kiyosawa, H., Kondo, S., Yamanaka, I., Saito, T., Okazaki, Y., Gojobori, T., Bono, H., Kasukawa, T., Saito, R., Kadota, K., Matsuda, H., Ashburner, M., Batalov, S., Casavant, T., Fleischmann, W., Gaasterland, T., Gissi, C., King, B., Kochiwa, H., Kuehl, P., Lewis, S., Matsuo, Y., Nikaido, I., Pesole, G., Quackenbush, J., Schriml, L., Staubli, F., Suzuki, R., Tomita, M., Wagner, L., Washio, T., Sakai, K., Okido, T., Furuno, M., Aono, H., Baldarelli, R.,
Barsh, G., Blake, J., Boffelli, D., Bojunga, N., Carninci, P., de Bonaldo, M., Brownstein, M., Bult, C., Fletcher, C., Fujita, M., Gariboldi, M., Gustincich, S., Hill, D., Hofmann, M., Hume, D., Kamiya, M., Lee, N., Lyons, P., Marchionni, L., Mashima, J., Mazzarelli, J., Mombaerts, P., Nordone, P., Ring, B., Ringwald, M., Rodriguez, I., Sakamoto, N., Sasaki, H., Sato, K., Schonbach, C., Seya, T., Shibata, Y., Storch, K., Suzuki, H., Toyo–oka, K., Wang, K., Weitz, C., Whittaker, C., Wilming, L., Wynshaw–Boris, A., Yoshida, K., Hasegawa, Y., Kawaji, H., Kohtsuki, S., and Hayashizaki, Y. 2001. Functional annotation of a full–length mouse cDNA collection. Nature 409: 685–690. Lander, E., Linton, L., Birren, B., Nusbaum, C., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921. Lewis, B., Green, R., and Brenner, S. 2003. Evidence for the widespread coupling of alternative splicing and nonsense–mediated mRNA decay in humans. Proc. Natl. Acad. Sci. USA 100: 189–192. Liu, H., Zhang, M., and Krainer, A. 1998. Identification of functional exonic splicing enhancer motifs recognized by individual SR proteins. Genes Dev. 12: 1998–2012. Liu, Y., Xu, L., Opalka, N., Kappler, J., Shu, H., and Zhang, G. 2002. Crystal structure of sTALL–1 reveals a virus–like assembly of TNF family ligands. Cell 108: 383–394. McCullough, A. and Berget, S. 1997. G triplets located throughout a class of small vertebrate introns enforce intron borders and regulate splice site selection. Mol. Cell Biol. 17: 4562–4571. Mendell, J., Rhys, C., and Dietz, H. 2002. Separable roles for rent1/hUpf1 in altered splicing and decay of nonsense transcripts. Science 298: 419–422. Min, H., Chan, R., and Black, D. 1995. The generally expressed hnRNP F is involved in a neural–specific pre–mRNA splicing event. Genes Dev. 9: 2659–2671. Mironov, A., Fickett, J., and Gelfand, M. 1999. Frequent alternative splicing of human genes. Genome Res. 9: 1288–1293. Modrek, B., Resch, A., Grasso, C., and Lee, C. 2001. Genome–wide detection of alternative splicing in expressed sequences of human genes. Nucl. Acids Res. 29: 2850–2859. Morgenstern, B. and Atchley, W. 1999. Evolution of bHLH transcription factors: modular evolution by domain shuffling? Mol. Biol. Evol. 16: 1654–1663. Nagashima, T., Matsuda, H., Silva, D., Petrovsky, N., RIKEN GER Group, GSL Members, Konagaya, A., and Schönbach, C. 2004. FREP: a database of functional repeats in mouse cDNAs. Nucl. Acids Res. 32: D471–D475.
Nambiar, M., Enyedy, E., Warke, V., Krishnan, S., Dennis, G., Kammer, G., and Tsokos, G. 2001. Polymorphisms/mutations of TCR–zeta–chain promoter and 3' untranslated region and selective expression of TCR zeta–chain with an alternatively spliced 3' untranslated region in patients with systemic lupus erythematosus. J. Autoimmun. 16: 133–142. Ravasi, T., Huber, T., Zavolan, M., Forrest, A., Gaasterland, T., Grimmond, S., Riken GER Group, GSL Members, and Hume, D. 2003. Systematic characterization of the zinc–finger–containing proteins in the mouse transcriptome. Genome Res. 13: 1430–1442. Resch, A., Xing, Y., Alekseyenko, A., Modrek, B., and Lee, C. 2004. Evidence for a subpopulation of conserved alternative splicing events under selection pressure for protein reading frame preservation. Nucl. Acids Res. 32: 1261–1269. Robberson, B., Cote, G., and Berget, S. 1990. Exon definition may facilitate splice site selection in RNAs with multiple exons. Mol. Cell Biol. 10: 84–94. Rodriguez, R. and Acosta, D. 1997. Metabolism of ketoconazole and deacetylated ketoconazole by rat hepatic microsomes and flavin–containing monooxygenases. Drug. Metab. Dispos. 25: 772–777. Rothenburg, S., Schwartz, T., Koch–Nolte, F., and Haag, F. 2002. Complex regulation of the human gene for the Z–DNA binding protein DLM–1. Nucl. Acids Res. 30: 993–1000. Schaal, T. and Maniatis, T. 1999. Selection and characterization of pre–mRNA splicing enhancers: identification of novel SR protein–specific enhancer sequences. Mol. Cell Biol. 19: 1705–1719. Schultz, J., Milpetz, F., Bork, P., and Ponting, C. 1998. SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl. Acad. Sci. USA 95: 5857–5864. Sharp, P. 1994. Split genes and RNA splicing. Cell 77: 805–815. Shuttleworth, T., Wilson, M., Wicklow, B., Wilkins, J., and Triggs–Raine, B. 2002. Characterization of the murine hyaluronidase gene region reveals complex organization and cotranscription of Hyal1 with downstream genes, Fus2 and Hyal3. J. Biol. Chem. 277: 23008–23018. Smit, A. and Riggs, A. 1996. Tiggers and DNA transposon fossils in the human genome. Proc. Natl. Acad. Sci. 93: 1443–1448. Smith, C. and Valcarcel, J. 2000. Alternative pre–mRNA splicing: the logic of combinatorial control. Trends Biochem. Sci. 25: 381–388. Sorek, R., Shamir, R., and Ast, G. 2004. How prevalent is functional alternative splicing in the human genome? Trends Genet. 20: 68–71. Staknis, D. and Reed, R. 1994. SR proteins promote the first specific recognition of pre–mRNA and are present together with the U1 small nuclear ribonucleoprotein particle in a general splicing enhancer complex. Mol. Cell Biol. 14: 7670–7682. Stamm, S., Zhu, J., Nakai, K., Stoilov, P., Stoss, O., and Zhang, M. 2000. An alternative–exon database and its statistical analysis. DNA Cell Biol. 19: 739–756. Suzuki, Y., Taira, H., Tsunoda, T., Mizushima–Sugano, J., Sese, J., Hata, H., Ota, T., Isogai, T., Tanaka, T., Morishita, S., Okubo, K., Sakaki, Y., Nakamura, Y., Suyama, A., and Sugano, S. 2001. Diverse transcriptional initiation revealed by fine, large–scale mapping of mRNA start sites. EMBO Rep. 2: 388–393. Tacke, R. and Manley, J. 1999. Determinants of SR protein specificity. Curr. Opin. Cell Biol. 11: 358–362. Talerico, M. and Berget, S. 1990. Effect of 5' splice site mutations on splicing of the preceding intron. Mol. Cell Biol. 10: 6299–6305. Thanaraj, T., Clark, F., and Muilu, J. 2003. Conservation of human alternative splice events in mouse. Nucl. Acids Res. 31: 2544–2552. The C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282: 2012–2018. Toyoda, T. and Wada, A. 2004. Omic space: coordinate–based integration and analysis of genomic–phenomic interactions. Bioinformatics. van Nimwegen, E. 2003. Scaling laws in the functional content of genomes. Trends Genet. 19: 479–484. Venter, C., et al. 2001. The sequence of the human genome. Science 291: 1304–1351. Wang, J., Hamilton, J., Carter, M., Li, S., and Wilkinson, M. 2002a. Alternatively spliced TCR mRNA induced by disruption of reading frame. Science 297: 108–110. Wang, J., Vock, V., Li, S., Olivas, O., and Wilkinson, M. 2002b. A quality control pathway that down–regulates aberrant T–cell receptor (TCR) transcripts by a mechanism requiring UPF2 and translation. J. Biol. Chem. 277: 18489–18493. Wang, Z., Hoffman, H., and Grabowski, P. 1995. Intrinsic U2AF binding is modulated by exon enhancer signals in parallel with changes in splicing activity. RNA 1: 335–346. Waterston, R., Lindblad–Toh, K., Birney, E., Rogers, J., Abril, J., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520–562. Yang, E., Henriksen, M., Schaefer, O., Zakharova, N., and Darnell, J. 2002. Dissociation time from DNA determines transcriptional function in a STAT1 linker mutant. J. Biol. Chem. 277: 13455–13462.
Zahler, A., Neugebauer, K., Lane, W. S., and Roth, M. 1993. Distinct functions of SR proteins in alternative pre–mRNA splicing. Science 260: 219–222. Zavolan, M., Kondo, S., Schonbach, C., Adachi, J., Hume, D., Hayashizaki, Y., Gaasterland, T., RIKEN GER Group, and GSL Members. 2003. Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res. 13: 1290–1300. Zavolan, M., van Nimwegen, E., and Gaasterland, T. 2002. Splice variation in mouse full–length cDNAs identified by mapping to the mouse genome. Genome Res. 12: 1377–1385.
Chapter 18
COMPUTATIONAL IMAGING AND STATISTICAL ANALYSIS OF TISSUE MICROARRAYS: QUANTITATIVE AUTOMATED ANALYSIS OF TISSUE MICROARRAYS
Aaron J. Berger, Robert L. Camp, and David L. Rimm∗
Dept. of Pathology, Yale University School of Medicine, New Haven
1. Introduction
Unlike DNA microarrays, where each microscopic spot represents a unique cloned cDNA or oligonucleotide, the spots on tissue microarrays are larger and contain histologic tissue sections from unique patients or tumors. This technology reverses the typical array paradigm: instead of testing one sample for expression of thousands of genes, tissue microarrays allow testing of one gene (or, more typically, a protein) on hundreds or thousands of samples (patients). Tissue microarray (TMA) technology was first described by Wan and Furmanski (Wan et al., 1987), and later advanced by Kononen and Kallioniemi in 1997 with the production of an apparatus for the mass production of TMAs (Kononen et al., 1998). Consisting of an ordered array of tissue cores—up to 2,000—on a single glass slide, tissue microarrays provide a mechanism for the maximal use of scarce tissue resources. Most tissue microarrays are currently constructed from pathology tissue block archives, and the corresponding clinical data can be correlated with experiments performed on these tissues. TMAs allow for the validation of new concepts in cell and
∗ Mr. Berger is supported by NIH/NIGMS Medical Scientist Training Program Grant GM07205 (AB). Dr. Camp is supported by a grant from the NIH (K0-8 ES11571, NIEHS). Dr. Rimm is supported by a grant from the Patrick and Catherine Weldon Donaghue Foundation for Medical Research, as well as grants from the NIH and US Army Grant DAMD-17-02-0463.
molecular biology on human tissue (Rimm et al., 2001a; Rimm et al., 2001b). They are the ultimate step in the association of gene expression information with human tissues and human disease. Figure 18.1 shows an example of a tissue microarray and a schematic for TMA construction.
It is estimated that there are over 30,000 genes within the human genome, encoding over 100,000 proteins. Researchers are currently beset with the task of filtering through the vast number of gene and protein targets to identify those with clinical relevance and diagnostic, prognostic, or therapeutic potential. Target validation is a critical step in sifting through such targets. Traditionally, target validation was done with static assays such as Northern blot analysis, RT-PCR, macroarrays, microarrays, and gene chips. These technologies simply provide evidence of differential expression of specific genes, and only rudimentary clinical information is obtained. For most techniques in molecular biology, tissue is homogenized to isolate RNA or protein for expression analysis. Unfortunately, the tissue obtained is not necessarily composed solely of the cells of interest. The tissue homogenate can contain normal cells, tumor cells, blood cells, muscle cells, and other cell types that may result in misleading information. Critical spatial information is lost in these studies, confounding the results and making truly well-informed mechanistic explanations nearly impossible.
Figure 18.1. TMA construction (Please refer to the color section of this book to view this figure in color).
Tissue microarrays supply a mechanism for conservation of tissue, while providing the ability to evaluate hundreds of archival tissue specimens on one microscope slide. Because all tissues are exposed to precisely the same conditions, the slide-to-slide variability inherent to immunohistochemistry and in situ hybridization is minimized. Under the guidance of a pathologist, the target tissues are core-biopsied with a 0.6 mm diameter needle, and the cores are placed in a 'recipient' paraffin block. The maximum number of specimens one block can hold varies from 60 (with 1.5 mm needles) to 2,000 (with new smaller-diameter needles), with the most common size being 0.6 mm, allowing a maximum of about 750 histospots. The block containing the array is sectioned in an identical fashion to any paraffin-embedded tissue block. The maximum number of sections a block can provide depends on the size of the original tumor and the skill of the histotechnologist, but it is not uncommon to obtain hundreds of cores from a single conventional specimen (Rimm et al., 2001a; Rimm et al., 2001b). A key criticism of tissue microarrays is that the small tissue disk is not representative of the entire tumor. This is particularly problematic when the tumors are highly heterogeneous, at both the morphological and the molecular level. To evaluate this issue, numerous studies have compared IHC findings on TMAs with their corresponding traditional whole tissue sections (Camp et al., 2000; Engellau et al., 2001; Fernebro et al., 2002; Garcia et al., 2003; Ginestier et al., 2002; Gulmann et al., 2003; Hedvat et al., 2002; Hendriks et al., 2003; Hoos et al., 2001; Merseburger et al., 2003; Moch et al., 1999; Mucci et al., 2000; Natkunam et al., 2001; Nocito et al., 2001; Rassidakis et al., 2002; Rubin et al., 2002; Sallinen et al., 2000; Torhorst et al., 2001; Tzankov et al., 2003; Yosepovich and Kopolovic, 2002), with the vast majority revealing a high level of concordance (Camp et al., 2000; Engellau et al., 2001; Fernebro et al., 2002; Ginestier et al., 2002; Hedvat et al., 2002; Hendriks et al., 2003; Hoos et al., 2001; Moch et al., 1999; Mucci et al., 2000; Nocito et al., 2001; Rassidakis et al., 2002; Rubin et al., 2002; Sallinen et al., 2000; Torhorst et al., 2001; Tzankov et al., 2003; Yosepovich and Kopolovic, 2002). Most of these studies used multiple cores from donor blocks in order to determine how many samples are needed to obtain results on TMAs that are concordant with whole tissue sections. It has generally been found that two to three samples are needed to achieve a good concordance level. However, this number is highly dependent on the tumor type and the design of the study. In our effort on breast cancer, Camp et al. analyzed estrogen and progesterone receptors and HER2 on 2-10 cores per breast cancer donor block. It was found that analysis of two cores was sufficient to obtain identical results as compared with the corresponding whole
tissue sections in 95% of cases; four cores yield a concordance of 99% (Camp et al., 2000). It is important to recognize that these studies are based on the assumption that traditional whole tissue sections, the current gold standard for molecular tumor tissue analysis, are representative of an entire tumor (Simon et al., 2003). A section from a tumor measuring 3 x 2 cm, cut 3 μm thick, has a volume of 0.0018 cm³. This volume is only 1/19,000 of a tumor with a diameter of 4 cm, or 1/150,000 of a tumor with a diameter of 8 cm. A TMA sample measuring 0.6 mm in diameter represents a tumor volume of 0.00000108 cm³, which is 1/1600 of a 3 cm x 2 cm x 3 μm whole section tissue slice. Based on these numbers, the representativity problem is about 100 times greater between the entire tumor and a traditional large section than between results obtained on large sections and TMA sections (Simon et al., 2003). In addition, it should be noted that the tissues on the TMA are carefully selected to be representative of tumor on the whole tissue section.
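To make the arithmetic concrete, the following minimal sketch (ours, in Python) reproduces the quoted ratios. The dimensions come from the text; idealizing the tumors as spheres and treating the 0.6 mm core footprint as a square are our illustrative assumptions, the latter matching the chapter's 0.00000108 cm³ figure.

```python
# Illustrative check of the representativity arithmetic quoted from
# Simon et al. (2003). Dimensions are taken from the text.
import math

THICKNESS_CM = 3e-4                       # 3 um section thickness, in cm

whole_section = 3.0 * 2.0 * THICKNESS_CM  # 3 x 2 cm face -> 0.0018 cm^3
core = 0.06 * 0.06 * THICKNESS_CM         # 0.6 mm core -> 1.08e-6 cm^3 (square approx.)

def sphere(diameter_cm):
    # Volume of a spherical tumor of the given diameter
    return (4.0 / 3.0) * math.pi * (diameter_cm / 2.0) ** 3

print(f"section vs 4 cm tumor: 1/{sphere(4.0) / whole_section:,.0f}")  # ~1/19,000
print(f"section vs 8 cm tumor: 1/{sphere(8.0) / whole_section:,.0f}")  # ~1/150,000
print(f"core vs whole section: 1/{whole_section / core:,.0f}")         # ~1/1,600-1,700
```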
2. Oxidation and Storage
Tissue microarrays have facilitated the evaluation of large cohort studies with the use of formalin-fixed, paraffin-embedded tissues that can sometimes be almost a century old. The current method of archival storage of tissues—formalin fixation followed by paraffin embedding—is a remarkably resilient method of specimen and antigen preservation. However, once the samples from such blocks are cut into 5 micron sections and deparaffinized, the tissue is exposed to ambient air and the potential effects of oxidation. Several studies have demonstrated the detrimental effects of storing pre-cut tissue slides under ambient conditions (Bertheau et al., 1998; Fergenbaum et al., 2004; Henson, 1996; Jacobs et al., 1996; van den Broek and van de Vijver, 2000). For some proteins (e.g., estrogen receptor), pre-cut slides stored in room air lose all detectable antigenicity in one month, with a statistically significant difference from fresh-cut seen by 6 days (DiVito et al., 2004), as demonstrated in Fig. 18.2. Unfortunately, there is no universally accepted method for storing slides that reduces the loss of antigenicity. Cold storage of slides at 4°C has shown some promise, but still results in significant loss of antigenicity (Jacobs et al., 1996; van den Broek and van de Vijver, 2000). One method of antigen preservation that has demonstrated moderate success is paraffin coating of pre-cut slides (Jacobs et al., 1996). Non-formalin-based fixatives have been tested and have some benefits, but they do not reduce the loss of antigenicity due to oxidation (Ahram et al., 2003; Beckstead, 1994; Gillespie et al., 2002; Sen Gupta et al., 2003; Vincek et al., 2003; Wester et al., 2003). While these fixatives show great promise in overcoming some of the deleterious effects of formalin-based fixation, they are not useful for retrospective studies on paraffin-embedded archival tissues, the vast majority of which are formalin-fixed.
Figure 18.2. Estrogen receptor immunohistochemistry is abrogated by storage under ambient air conditions. Matched histospots of breast cancer from a freshly-cut slide, as well as slides stored for 2, 6, and 30 days in ambient air. Slides were stained for ER and visualized with DAB. Staining is localized to the tumor nuclei when assayed under high power.
The advent of tissue microarrays as a tool for the study of immunohistochemical stains has renewed interest in finding methods for cut-slide preservation. To avoid repeatedly facing the paraffin block containing a TMA, serial sections are typically cut at one time and stored as individual slides. While this procedure increases the quantity of high-quality slides that can be cut from a particular TMA block, it necessitates rigorous slide storage techniques that will ensure antigen preservation. Our laboratory has developed a reliable method for the long-term storage of TMA slides that entails paraffin-coating of slides followed by long-term storage in a nitrogen desiccation chamber (DiVito et al., 2004). This method of cut-slide storage maintains tissue antigenicity comparable to freshly-cut slides for up to 3 months.
3. Fixation and Antigen Retrieval
For over 100 years, pathologists have relied upon formalin—a 37% solution of formaldehyde gas in water—for fixation of tissue specimens (Gown, 2004). Formalin has several advantages over alcohol and other precipitative (non-cross-linking) fixatives, in that it maintains excellent preservation of morphologic detail. Quick and effective fixation of fresh tissue is necessary to fix antigens in situ, prior to the onset of autolytic changes. Although the biochemical mechanisms of the fixation process are not entirely understood, it is clear that formaldehyde-based fixation involves the formation of cross-links of amino groups (Fox et al., 1985; Fraenkel-Conrat
et al., 1947; Sompuram et al., 2004), among other reactions, recently summarized by Shi et al. (2000b). These formalin-induced protein cross-linkages can result in partial or complete loss of immunoreactivity of various antigens, either through intramolecular crosslinks within the epitope region of the target antigen, or through cross-linking of unrelated proteins to the epitope. Although formalin fixation will allow some epitopes to emerge unchanged (formalin-resistant), others will undergo substantial changes (formalin-sensitive). Despite the loss of immunoreactivity by many antigens as a result of formalin fixation, formalin-fixed paraffin-embedded (FFPE) tissue remains the medium of choice for clinical and research studies because of the superior preservation of morphology. Various alternatives to formalin have been tested, but all have failed, and it is likely that an ideal fixative will never be found (Larsson, 1988; Shi et al., 2001b).
To counteract or 'reverse' the effects of formalin fixation—to improve the immunoreactivity of FFPE tissues—a number of unmasking techniques, such as enzymatic digestion and heat-induced antigen retrieval, were introduced. The first attempts to 'improve' IHC involved the use of tryptic digestion prior to immunofluorescent staining (Huang et al., 1976). Other proteolytic enzymes (e.g., bromelain, chymotrypsin, ficin, pepsin, pronase) were tried for restoration of immunoreactivity to tissue antigens with varying degrees of success (Key, 2004). Proteolytic digestion compensates for the impermeable nature of the non-coagulant fixatives by "etching" the tissue and allowing hidden determinants to be exposed. Use of these enzymes may, however, also entail the risk of destroying some epitopes.
Currently, the principle of antigen retrieval is based upon the application of heat for varying lengths of time to FFPE tissue sections in an aqueous medium. The technique, initially developed by Shi et al. in 1991, is a high-temperature heating method to recover the antigenicity of tissue sections that has been masked by formalin fixation (Shi et al., 1991). A number of modalities for heat generation have been tested, including microwave oven, pressure cooker, steamer, and autoclave. The advantages and disadvantages of various heating modalities and retrieval solutions are the subject of ongoing experiments (Shi et al., 2000b). Alternate terminology used for "antigen retrieval" includes epitope retrieval, heat-induced epitope retrieval, target retrieval, and target unmasking. The latter two versions have more generic appeal and have also been applied to the retrieval of nucleic acid targets for in situ hybridization. In general, antigen retrieval refers to a technique, used widely in pathology and other fields of morphology, which improves IHC staining on archival FFPE tissue sections by reducing detection thresholds of immunostaining (increasing sensitivity) and by retrieving some antigens (e.g., Ki-67, ER-1D5, androgen receptor, many CD markers) that are otherwise negative in IHC. Antigen retrieval serves as a pretreatment amplification step, in contrast to amplification in the detection phase (e.g., multistep or polymeric detection systems) or post-detection (e.g., enhanced substrates such as DAB using metal, imidazole, CARD, anti-end product, or gold-silver enhancement) (Shi et al., 2001a).
Although the molecular mechanism of antigen retrieval is not entirely clear at this time, it is apparently influenced by the pH of the antigen retrieval solution as well as the temperature and time of heating (Shi et al., 2000b). A number of hypotheses have been proposed, including: loosening or breaking of the formalin-induced cross-links (Shi et al., 1992; Shi et al., 1997); protein denaturation (Cattoretti et al., 1993); hydrolysis of Schiff bases (Shiurba et al., 1998); chelation of cage-like calcium complexes, which may develop during formalin fixation (Morgan et al., 1997; Morgan et al., 1994); and heat-induced reversal of chemical modification to protein structure that occurs during formalin fixation (Shi et al., 1997). This topic has been most recently characterized by the Mason group in a series of papers examining the biophysical effects of formalin on RNase (Rait et al., 2004a; Rait et al., 2004b). The technique of antigen retrieval, a heat-induced remodification of formalin-induced modification of protein conformation, may contribute to standardization of IHC staining of FFPE tissue by equalizing the intensity of IHC staining under variable conditions of fixation and processing (Shi et al., 2001a). While the molecular mechanisms of antigen retrieval are the subject of intense investigation, the main concern is that the procedure works and provides reproducible results. Importantly, antigen retrieval has become increasingly useful in research and diagnostic pathology, as it provides increased sensitivity for the demonstration of molecular markers not previously demonstrable in standard tissue sections. A number of different antigen retrieval techniques are in development, and it is essential that scientists and pathologists focus on standardization of antigen retrieval, which will likely entail optimization for individual antigens (Boenisch, 2001).
4. Standardization of Immunohistochemistry
The Biological Stain Commission (BSC) was founded in 1944 as a nonprofit organization to address the problem of standardization of chemical stains in pathology (Mowry, 1980). The BSC was able to achieve considerable success with regard to biological dyes, but the introduction of
immunohistochemistry brought new challenges. The origins of immunohistochemistry date back to 1941, when Coons et al. established an immunofluorescence technique for detection of bacteria (Coons et al., 1941). But it was many years before it became a common part of diagnostic pathology. In 1977, the National Cancer Institute held a workshop to address the standardization of immuno-reagents in diagnostic pathology (DeLellis et al., 1979). From its inception until recently, the use of IHC was limited to qualitative assessments of expression. The major application of the method in pathology was in the detection of cellular lineage markers to diagnose poorly differentiated malignancies. Interpretation of results was a matter of positive or negative, present or absent. Little attempt was made to distinguish degrees of staining, or differences in intensity (Shi et al., 2000a). The introduction of immunoperoxidase methods and increasingly sensitive detection systems in the mid-1970s provided the ability for broad application of IHC on routine FFPE tissues. With the promise of unparalleled specificity, IHC transformed histopathology from something resembling an art into something more closely resembling a science (Taylor, 1986). The application of IHC to routinely processed tissues represented a new era for the demonstration of antigens in situ, allowing pathologists to correlate immunologic findings directly with traditional cytologic and histologic criteria (Taylor, 1980).
Immunohistochemistry typically follows a series of universal steps:
A) Pretreatment (antigen retrieval), often with pressure-cooking of tissue in citrate buffer to unmask antigens hidden by formalin cross-links or other fixation. Other agents used for antigen retrieval include proteases, such as pepsin, trypsin, and bromelain.
B) Application of primary antibody; the antibody binds to antigens of interest.
C) Washing off excess primary antibody.
D) Application of labeled anti-IgG antibody (secondary antibody), which binds to the primary antibody. The secondary antibody is typically conjugated to biotin or horseradish peroxidase.
E) Application of detection chemicals, including a chromogen (color-changing reagent), usually 3,3'-diaminobenzidine (DAB). It is necessary to add an avidin-biotin-peroxidase complex prior to DAB in the case of biotinylated antibodies.
The slide is then counterstained with hematoxylin to identify all nuclei. This is the standard method for the 'brown stain'.
While immunohistochemistry has become an established method in research and clinical pathology, a surprising lack of standardization and reproducibility exists among different laboratories. Enormous variability has developed in terms of the reagents available, the detection methods used, and, most importantly, the interpretation and reporting of immunohistochemical findings (Taylor, 2000). Although efforts have been made to standardize immunohistochemistry, difficulties remain.
Recently, the need for IHC standardization has expanded due to the rise of bio-specific therapies that are approved with partner pharmaco-diagnostics. Quantitative (or at least semiquantitative) IHC is likely to become more critical as pathologists are required to accurately distinguish expression levels of markers to determine whether or not a patient is a candidate for a specific therapeutic. The classic example is the evaluation of HER-2/neu expression to aid in prognosis of breast cancer. Although this marker is an "easy one", there has still been considerable literature on the inaccuracy of the pathologist-based test. More recently, market forces have resulted in standardized test kits, such as the HercepTest (DAKO, Carpinteria, CA) and the INFORM DNA probe test (Ventana Medical Systems, Tucson, AZ), that have been approved by the FDA. Nonetheless, the accuracy and comparability of these methods are somewhat controversial (Yaziji et al., 2004). Efforts to standardize immunohistochemistry have focused on three areas: (1) antibodies and reagents, (2) technical procedures, and (3) interpretation of IHC findings for research and diagnostic uses. There has been a recommendation for a Total Test Approach to standardization of IHC (Taylor, 2000), requiring that the pathologist pay close attention to each and every step of the procedure: from excision of the biopsy, including the type and duration of fixation, through the antigen retrieval method and detection procedure, to interpretation of the resulting stain. Much of the variability in staining is introduced by differences in fixatives or fixation times (Taylor, 1994), and these can be reduced, if not eliminated, by the use of an optimized antigen retrieval protocol (Shi et al., 2000a). In addition to optimization of antigen retrieval, the authors suggest simplification of the IHC staining method and the implementation of quantitative IHC.
5. Quantitative Immunohistochemistry
Although the current gold standard for diagnosis amongst pathologists relies upon morphological criteria, histochemistry—immunohistochemistry in particular—has provided a more accurate mechanism for diagnosis, as well as the ability to assess molecular interactions in situ. The primary usage has been as an ancillary test to assist in classifying tumors or making a diagnosis. But there is now a trend toward increased usage of this test as a mechanism to predict response to therapy. This trend, as well as advances in computational power, is driving numerous efforts toward quantitative assessment of expression. The critical information that makes techniques such as immunohistochemistry and in situ hybridization so valuable is that they provide molecular information within the architectural context of tissues. This information, including the architectural and
subcellular localization, can be critical for classifying tumors, or providing prognostic or predictive information. This information needs to be maintained as a unique part of the molecular information package, which now includes various molecular-based techniques that rely on suspensions or extracts of cells or tissues, such as PCR, DNA sequencing, FACS, and ELISA assays.
Quantitative immunohistochemistry (QIHC) is composed of two general methodologies: manual and automated analysis. Pathologists typically perform the manual method of QIHC. After identifying the region of interest within a tissue section (i.e., tumor or other diseased tissue), the intensity or percentage of IHC staining is estimated. This method is presently the more commonly used, and is well recognized in the literature. However, this approach is relatively subjective, based on visual estimation of the intensity and percentage of positive IHC staining. The immunoreactive grading system provides a visual aggregate of intensity/percentage, denoted by "0, 1+, 2+, 3+", representing negative, weak, moderate, and strong staining, respectively. Semi-quantitative systems have been invented and some have seen extensive usage, including the H-score system (McCarty et al., 1986) and the Allred system (Harvey et al., 1999). Both are based on the product of staining intensity and the percentage of positive cells. Though widely used, the manual system of QIHC is imperfect, resting on human judgment, and suffers from a high degree of inter- and intra-observer variability (Kay et al., 1994; Thomson et al., 2001).
Automated systems for QIHC begin with acquisition and storage of digital images, followed by image correction and enhancement, segmentation of objects within the images, and ultimately image measurement (Oberholzer et al., 1996). Several commercial systems are currently available for image analysis of brown-stain IHC slides. Among these systems, the CAS system, originally invented by James Bacus (Bacus et al., 1988), was perhaps the most popular as judged by the number of units sold, although it is now outdated. The CAS 200 system, and its successor, the BLISS system, are designed for quantitative analysis of "brown" stained IHC material. The optical density of the reaction product is determined through conversion of the light transmission, or gray level, and the relative area of the staining reaction is also determined. Other systems with similar functionality include the following:
• Automated Cellular Imaging System (ACIS®) (Chromavision Medical Systems, San Juan Capistrano, CA) http://www.chromavision.com/index.htm
• BLISS workstation with TMAscore (Bacus Laboratories, Inc., Lombard, IL) http://www.bacuslabs.com/bliss.html
• ScanScope® systems (Aperio Technologies, Inc., Vista, CA) http://www.aperio.com/products-software.asp
• PATHIAM (Bioimagene, Inc., San Jose, CA) http://www.bioimagene.com/pathiam.htm
• Discovery-1™ and Discovery-TMA™ (Molecular Devices Corp., Sunnyvale, CA) http://www.moleculardevices.com/pages/instruments/microscopy_main.html
• TissueAnalytics™ (Tissue Informatics, Inc., Pittsburgh, PA) http://www.tissueinformatics.com/products/index.html
• GenoMX™ system (Biogenex, San Ramon, CA) http://www.biogenex.net/profile.php?pagename=csfaq_genomx
These systems generally cost between $200,000 and $300,000. In addition, some investigators have developed analysis systems on their own (Liu et al., 2002; Vrolijk et al., 2003) or using commercially available software (Matkowskyj et al., 2003; Matkowskyj et al., 2000). Because these systems are designed to analyze brown-stain IHC, they rely upon absorbance (i.e., optical density) to compute an intensity score. In spectrophotometry, Beer's Law (A = εlc) states that the absorbance (A) of a species at a particular wavelength of electromagnetic radiation is proportional to the concentration (c) of the absorbing species and to the length of the path (l) of the electromagnetic radiation through the sample containing the absorbing species. The extinction coefficient (ε), also known as molar absorptivity, is dependent upon the wavelength of the light and is related to the probability that the species will absorb light of a given wavelength. When A equals zero, no photons are absorbed; when A equals 1.0, 90% of photons are absorbed, i.e., 10% are detected; when A equals 2.0, 99% of photons are absorbed, i.e., 1% is detected. Most brown stains are optimized by eye for signal to noise such that the absorbance of a 'positive' case is over 1.0 and often over 2.0. Thus, only a tiny percentage of the light is detected, which decreases the dynamic range of the assay and makes colocalization more challenging. Fluorescence-based imaging could circumvent these problems, but that imaging modality has other problems, and thus very few companies have developed quantitative analysis on the basis of this platform.
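The dynamic-range argument can be illustrated with a minimal sketch (ours, in Python; not taken from any of the systems above) that converts absorbance to the fraction of photons reaching the detector via the transmittance relation T = 10^(-A):

```python
# Fraction of incident light detected at a given absorbance (Beer's law):
# transmittance T = 10**(-A). Stains tuned to A > 1-2 leave the camera
# with only a few percent of the light to quantify.
for A in (0.0, 0.5, 1.0, 2.0, 3.0):
    detected = 10 ** (-A)  # transmittance
    print(f"A = {A:.1f}: {100 * detected:6.2f}% of photons detected")
```

Running this reproduces the figures in the text: at A = 1.0 only 10% of photons are detected, and at A = 2.0 only 1%, so nearly all of the measurable signal change is compressed into a small fraction of the detector's range.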
6. Fluorescence-Based Platforms for Quantitative Analysis
Although fluorescence-based methods have numerous advantages with respect to dynamic range and co-localization, they have not been broadly used, for several reasons. First, formalin-fixed tissue generally has a substantial amount of auto-fluorescence, especially in the wavelengths of the most common fluorophores; red cells are a particular problem. A second issue is the higher cost and limited accessibility of epifluorescent microscopy equipment. Perhaps the greatest barrier is the inability of the pathologist to judge the surrounding tissue and architectural context. This can be seen as a huge disadvantage, or arguably an advantage if the goal is to reduce subjectivity. In flow cytometry, pathologists do not see any cells, but that is not considered a disadvantage. Furthermore, recent studies suggest that pathologists tend to select non-representative regions when they select an area to score (Gann et al., personal communication), which can decrease the overall accuracy of assessment of protein expression levels.
A small number of laboratories have now begun developing analysis systems for fluorescence-based quantification. The following paragraphs describe our lab's efforts in this area (Camp et al., 2002). The assembly consists largely of off-the-shelf microscopy equipment based on an Olympus AX51 automated microscope, with a Cooke digital camera, and computerized image acquisition based on the IP Lab software (v3.54, Scanalytics, Inc.). The imaging system captures stacks of images that are then analyzed with custom software developed in our laboratory, written by Robert Camp, called AQUA (Automated Quantitative Analysis; Fig. 18.3). The underlying principle behind the software is that spatial information is maintained, but all localization is based on molecular information rather than feature identification or morphology. Thus the steps for the analysis of each image include identification of a mask to separate tumor from non-tumor, followed by definition of compartments on the basis of other molecular interactions. For example, keratin may be used as a mask and DAPI to define a nuclear compartment. The target of interest (for example, estrogen receptor) is then measured in a quantitative fashion only in the pixels that are within the mask and compartment defined by the previous markers. The intensity is then divided by the area of the compartment to define a normalized value that is directly proportional to the protein concentration of the target. The following is a detailed description of the process.
Figure 18.3. AQUA software. The windows of the AQUA software are shown, including the algorithm description window, the settings window, the score window, and four image windows displaying different raw or processed images. For example, the upper left image is the keratin immunofluorescence image from which the upper right mask image was constructed. (Please refer to the color section of this book to view this figure in color.)
Image acquisition: Images of microarrays are obtained using an Olympus Motorized Reflected Fluorescence System and software (IP Lab v3.54, Scanalytics, Inc.), with an attached Cooke Sensicam QE High Performance camera through an Olympus BX51 microscope with automated x, y, z stage movement. Low-power images of microarrays are stitched together using multiple low-resolution images of the microarray (64 x 64 pixel) at approximately 7-micron resolution. Histospots are identified using a custom-written spotfinder algorithm (written by Robert Camp) based on the signal from DAPI or other tags. This signal is thresholded to create a binary image of the microarray. Histospots are identified using size criteria. Rows and columns of histospots are then identified, and missing histospots filled in, allowing each histospot to be identified based on its row/column grid position. The coordinates of each histospot are then recorded. Subsequently, monochromatic, high-resolution (1024 x 1024 pixel, 0.5-micron resolution) images are obtained of each histospot, both in the plane of focus and 8 microns below it, and recorded in an image stack as bitmaps.
This depth, slightly below the bottom of the tissue, was determined to be optimal for 5-micron histologic sections. A resolution of 0.5 microns is suitable for distinguishing between large subcellular compartments such as the cell membrane and nuclei. Efforts are underway to accommodate smaller subcellular compartments (for example, mitochondria, nucleoli) using high-power objectives. Images are obtained using a pixel intensity dynamic range of 0–4095.
RESA/PLACE algorithmic analysis of images: Through the use of two algorithms, RESA (rapid exponential subtraction algorithm) and PLACE (pixel-based locale assignment for compartmentalization of expression), AQUA carries out the following functions:
• identification of the tumor locale within the histospot (the "tumor mask");
• identification of subcellular compartments (nuclei, membranes, cytoplasm, etc.);
• definition of the localization and intensity of the "target".
First, a tumor-specific mask is generated by thresholding the image of a marker that differentiates tumor from surrounding stroma and other non-tumor material. This creates a binary mask (each pixel is either 'on' or 'off'). Keratin is the most common mask used for epithelial neoplasms, and S100 is used for melanocytic lesions. As formalin-fixed tissues can exhibit autofluorescence, analysis may give multiple background peaks. The RESA/PLACE algorithms determine which of these peaks is predominant and set a binary mask threshold at a slightly higher intensity level. This provides an adaptive (unique to each histospot) thresholding system that ensures that only the target signal from the tumor, and not the surrounding elements, is analyzed. Thresholding levels are verified by spot-checking a few images and then automated for the remaining images. This binary mask can be modified using standard image manipulations. In most cases this involves filling holes of a particular size (for example, less than 500 pixels, to fill in tumor nuclei that do not stain for S100) and removing extraneous single pixels. Once set, these image manipulations are performed automatically on all images. All subsequent image manipulations involve only image information from the masked area.
Next, two images (one in focus, one slightly deeper) are taken of the compartment-specific tags and the target marker. A percentage of the out-of-focus image is subtracted from the in-focus image, based on a pixel-by-pixel analysis of the two images. This percentage is determined according to the ratio of the highest/lowest intensity pixels in the in-focus image, representing the signal-to-noise ratio of the image. By using an exponential scale, RESA subtracts low-intensity pixels in images with a low signal-to-noise ratio less heavily than low-intensity pixels from images with a high signal-to-noise ratio. The overall degree of subtraction is based on a user-defined percentage for each subcellular compartment. For most applications this is empirically set to 40% of the total signal, and remains constant for images from an entire microarray. RESA thus eliminates nearly all of the out-of-focus information. The algorithm has the added benefit of enhancing the interface between areas of higher-intensity staining and adjacent areas of lower-intensity staining, allowing more accurate assignment of pixels to adjacent compartments. In contrast to the compartment-specific tags, the RESA subtraction of the target signal is uniform and not based on the overall intensity of the image. This ensures that the same amount of subtraction occurs with the target signal from all specimens.
Finally, the PLACE algorithm assigns each pixel in the image to a specific subcellular compartment. Pixels that cannot be accurately assigned to a compartment to within a user-defined degree of confidence (usually 95%) are discarded. This is accomplished iteratively by determining the ratio of signal from two compartment-specific markers that minimizes the spillover of marker from one compartment into another. Once each pixel is assigned to a subcellular compartment (or excluded as described above), the signal in each location is summed. These data are saved and can subsequently be expressed either as a percentage of total signal or as the average signal intensity per compartment area. The score is expressed on a scale of 1 to 1000 as the total intensity divided by the area of the compartment, to 3 significant figures. A step-by-step illustration of this process is shown in the schematic in Fig. 18.4. These algorithms are described in further detail in a patent on this technology owned by Yale University.
The result of this assay (or other quantitative microscopy systems) is a "number" that represents an objective continuous variable for subsequent analysis. This allows statistical analysis of histologic data in a manner more similar to other quantitative areas of medicine (blood glucose, liver enzymes, or drug levels). Although these systems remove the subjectivity in assessment, there are still other issues that must be systematically addressed, including sample selection (the region analyzed) and a multitude of steps related to processing. However, the addition of quantitative analysis removes the most subjective step and allows the typical rigorous QC and QA of laboratory tests. It is easy to envision future application of these technologies in the clinical lab. Specifically, a likely application is the assessment of biopsy tissue for the qualification of cancer patients for new biospecific therapies, based on quantitatively assessed levels of target protein expression.
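The pixel-level logic lends itself to a compact illustration. The sketch below (ours, in Python/NumPy) is a schematic reconstruction, not the patented AQUA code: the function name, threshold values, and the simple ratio rule for assigning pixels to compartments are all assumptions, and RESA's exponential out-of-focus subtraction is omitted. It shows only the skeleton of the computation: threshold a masking marker into a binary tumor mask, assign masked pixels to compartments by comparing compartment tags, and report target intensity per unit compartment area.

```python
# Schematic sketch of a mask -> compartment -> score pipeline, assuming
# 2-D NumPy arrays of identical shape for a masking marker (e.g., keratin),
# a nuclear tag (DAPI), a membrane tag, and the target image.
import numpy as np

def compartment_scores(mask_img, nuclear_tag, membrane_tag, target_img,
                       mask_threshold=100.0, confidence_ratio=2.0):
    # 1. Binary tumor mask: pixels where the masking marker exceeds threshold
    tumor = mask_img > mask_threshold

    # 2. Per-pixel compartment assignment inside the mask; pixels whose
    #    nuclear/membrane tag ratio is ambiguous are discarded, loosely
    #    mimicking PLACE's exclusion of low-confidence pixels
    ratio = nuclear_tag.astype(float) / np.maximum(membrane_tag, 1)
    compartments = {
        "nuclei": tumor & (ratio > confidence_ratio),
        "membrane": tumor & (ratio < 1.0 / confidence_ratio),
    }

    # 3. Score: summed target intensity divided by compartment area,
    #    i.e., an average intensity proportional to target concentration
    return {name: (target_img[pixels].sum() / pixels.sum() if pixels.any() else 0.0)
            for name, pixels in compartments.items()}
```

On real data the assignment rule would be calibrated per study (and a cytoplasmic compartment might be defined as mask minus nuclei minus membrane), but the core idea, per-pixel assignment followed by intensity-per-area normalization, matches the description above.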
Figure 18.4. The schematic demonstrates a step-by-step methodology for the RESA and PLACE algorithms used by AQUA in the assessment of β-catenin expression in colon cancer (Camp et al., 2002). A) Anti-cytokeratin is used to distinguish colon carcinoma cells from the surrounding stroma. B) Gating of that image creates a binary mask. C) The mask is enhanced by filling holes and removing small objects. D) Anti-α-catenin serves as a membrane tag. E) The membrane tag image is exponentially subtracted using RESA. F) DAPI identifies the nuclei. G) The nuclear tag image is exponentially subtracted using RESA. H) The membrane and nuclear tag images are merged, and overlapping pixels are removed; white pixels represent overlap. J) Anti-β-catenin is the target marker of interest. K) The target marker image is exponentially subtracted using RESA. L) The intensity of the marker is divided into the various subcellular compartments, represented graphically by red (membrane), green (cytoplasm), and blue (nuclei). In this tumor, β-catenin is predominantly localized to the membrane (red). (Please refer to the color section of this book to view this figure in color.)
Abbreviations
AQUA: automated quantitative analysis
BSC: Biological Stain Commission
DAB: diaminobenzidine
ELISA: enzyme-linked immunosorbent assay
ER: estrogen receptor
FACS: fluorescence-activated cell sorting
FFPE: formalin-fixed, paraffin-embedded
IHC: immunohistochemistry
PCR: polymerase chain reaction
PLACE: pixel-based locale assignment for compartmentalization of expression
QA: quality assurance
QC: quality control
QIHC: quantitative immunohistochemistry
RESA: rapid exponential subtraction algorithm
TMA: tissue microarray
References
Ahram, M., M.J. Flaig, J.W. Gillespie, P.H. Duray, W.M. Linehan, D.K. Ornstein, S. Niu, Y. Zhao, E.F. Petricoin, 3rd, and M.R. Emmert-Buck. 2003. Evaluation of ethanol-fixed, paraffin-embedded tissues for proteomic applications. Proteomics. 3:413-21. Bacus, S., J.L. Flowers, M.F. Press, J.W. Bacus, and K.S. McCarty, Jr. 1988. The evaluation of estrogen receptor in primary breast carcinoma by computer-assisted image analysis. Am J Clin Pathol. 90:233-9. Beckstead, J.H. 1994. A simple technique for preservation of fixation-sensitive antigens in paraffin-embedded tissues. J Histochem Cytochem. 42:1127-34. Bertheau, P., D. Cazals-Hatem, V. Meignin, A. de Roquancourt, O. Verola, A. Lesourd, C. Sene, C. Brocheriou, and A. Janin. 1998. Variability of immunohistochemical reactivity on stored paraffin slides. J Clin Pathol. 51:370-4. Boenisch, T. 2001. Formalin-fixed and heat-retrieved tissue antigens: a comparison of their immunoreactivity in experimental antibody diluents. Appl Immunohistochem Mol Morphol. 9:176-9. Camp, R.L., L.A. Charette, and D.L. Rimm. 2000. Validation of tissue microarray technology in breast carcinoma. Lab Invest. 80:1943-9. Camp, R.L., G.G. Chung, and D.L. Rimm. 2002. Automated subcellular localization and quantification of protein expression in tissue microarrays. Nat Med. 8:1323-7.
Cattoretti, G., S. Pileri, C. Parravicini, M.H. Becker, S. Poggi, C. Bifulco, G. Key, L. D'Amato, E. Sabattini, E. Feudale, et al. 1993. Antigen unmasking on formalin-fixed, paraffin-embedded tissue sections. J Pathol. 171:83-98.
Coons, A.H., H.J. Creech, and R.N. Jones. 1941. Immunological properties of an antibody containing a fluorescent group. Proc Soc Exp Biol Med. 47:200-202.
DeLellis, R.A., L.A. Sternberger, R.B. Mann, P.M. Banks, and P.K. Nakane. 1979. Immunoperoxidase technics in diagnostic pathology. Report of a workshop sponsored by the National Cancer Institute. Am J Clin Pathol. 71:483-8.
DiVito, K.A., L.A. Charette, D.L. Rimm, and R.L. Camp. 2004. Long-term preservation of antigenicity on tissue microarrays. Lab Invest.
Engellau, J., M. Akerman, H. Anderson, H.A. Domanski, E. Rambech, T.A. Alvegard, and M. Nilbert. 2001. Tissue microarray technique in soft tissue sarcoma: immunohistochemical Ki-67 expression in malignant fibrous histiocytoma. Appl Immunohistochem Mol Morphol. 9:358-63.
Fergenbaum, J.H., M. Garcia-Closas, S.M. Hewitt, J. Lissowska, L.C. Sakoda, and M.E. Sherman. 2004. Loss of antigenicity in stored sections of breast cancer tissue microarrays. Cancer Epidemiol Biomarkers Prev. 13:667-72.
Fernebro, E., M. Dictor, P.O. Bendahl, M. Ferno, and M. Nilbert. 2002. Evaluation of the tissue microarray technique for immunohistochemical analysis in rectal cancer. Arch Pathol Lab Med. 126:702-5.
Fox, C.H., F.B. Johnson, J. Whiting, and P.P. Roller. 1985. Formaldehyde fixation. J Histochem Cytochem. 33:845-53.
Fraenkel-Conrat, H., B. Brandon, and H. Olcott. 1947. The reaction of formaldehyde with proteins, IV: participation of indole groups: gramicidin. J Biol Chem. 168:99-118.
Garcia, J.F., F.I. Camacho, M. Morente, M. Fraga, C. Montalban, T. Alvaro, C. Bellas, A. Castano, A. Diez, T. Flores, C. Martin, M.A. Martinez, F. Mazorra, J. Menarguez, M.J. Mestre, M. Mollejo, A.I. Saez, L. Sanchez, and M.A. Piris. 2003. Hodgkin and Reed-Sternberg cells harbor alterations in the major tumor suppressor pathways and cell-cycle checkpoints: analyses using tissue microarrays. Blood. 101:681-9.
Gillespie, J.W., C.J. Best, V.E. Bichsel, K.A. Cole, S.F. Greenhut, S.M. Hewitt, M. Ahram, Y.B. Gathright, M.J. Merino, R.L. Strausberg, J.I. Epstein, S.R. Hamilton, G. Gannot, G.V. Baibakova, V.S. Calvert, M.J. Flaig, R.F. Chuaqui, J.C. Herring, J. Pfeifer, E.F. Petricoin, W.M. Linehan, P.H. Duray, G.S. Bova, and M.R. Emmert-Buck. 2002. Evaluation of non-formalin tissue fixation for molecular profiling studies. Am J Pathol. 160:449-57.
Ginestier, C., E. Charafe-Jauffret, F. Bertucci, F. Eisinger, J. Geneix, D. Bechlian, N. Conte, J. Adelaide, Y. Toiron, C. Nguyen, P. Viens, M.J. Mozziconacci, R. Houlgatte, D. Birnbaum, and J. Jacquemier. 2002. Distinct and complementary information provided by use of tissue and DNA microarrays in the study of breast tumor markers. Am J Pathol. 161:1223-33.
Gown, A.M. 2004. Unmasking the mysteries of antigen or epitope retrieval and formalin fixation. Am J Clin Pathol. 121:172-4.
Gulmann, C., D. Butler, E. Kay, A. Grace, and M. Leader. 2003. Biopsy of a biopsy: validation of immunoprofiling in gastric cancer biopsy tissue microarrays. Histopathology. 42:70-6.
Harvey, J.M., G.M. Clark, C.K. Osborne, and D.C. Allred. 1999. Estrogen receptor status by immunohistochemistry is superior to the ligand-binding assay for predicting response to adjuvant endocrine therapy in breast cancer. J Clin Oncol. 17:1474-81.
Hedvat, C.V., A. Hegde, R.S. Chaganti, B. Chen, J. Qin, D.A. Filippa, S.D. Nimer, and J. Teruya-Feldstein. 2002. Application of tissue microarray technology to the study of non-Hodgkin's and Hodgkin's lymphoma. Hum Pathol. 33:968-74.
Hendriks, Y., P. Franken, J.W. Dierssen, W. De Leeuw, J. Wijnen, E. Dreef, C. Tops, M. Breuning, A. Brocker-Vriends, H. Vasen, R. Fodde, and H. Morreau. 2003. Conventional and tissue microarray immunohistochemical expression analysis of mismatch repair in hereditary colorectal tumors. Am J Pathol. 162:469-77.
Henson, D.E. 1996. Loss of p53-immunostaining intensity in breast cancer. J Natl Cancer Inst. 88:1015-6.
Hoos, A., M.J. Urist, A. Stojadinovic, S. Mastorides, M.E. Dudas, D.H. Leung, D. Kuo, M.F. Brennan, J.J. Lewis, and C. Cordon-Cardo. 2001. Validation of tissue microarrays for immunohistochemical profiling of cancer specimens using the example of human fibroblastic tumors. Am J Pathol. 158:1245-51.
Huang, S.N., H. Minassian, and J.D. More. 1976. Application of immunofluorescent staining on paraffin sections improved by trypsin digestion. Lab Invest. 35:383-90.
Jacobs, T.W., J.E. Prioleau, I.E. Stillman, and S.J. Schnitt. 1996. Loss of tumor marker-immunostaining intensity on stored paraffin slides of breast cancer. J Natl Cancer Inst. 88:1054-9.
Kay, E.W., C.J. Walsh, M. Cassidy, B. Curran, and M. Leader. 1994. C-erbB-2 immunostaining: problems with interpretation. J Clin Pathol. 47:816-22.
Key, M. 2004. Antigen Retrieval. DakoCytomation.
Kononen, J., L. Bubendorf, A. Kallioniemi, M. Barlund, P. Schraml, S. Leighton, J. Torhorst, M.J. Mihatsch, G. Sauter, and O.P. Kallioniemi. 1998. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med. 4:844-7.
Larsson, L.-I. 1988. Immunocytochemistry: theory and practice. CRC Press, Boca Raton, Fla. 272 pp.
Liu, C.L., W. Prapong, Y. Natkunam, A. Alizadeh, K. Montgomery, C.B. Gilks, and M. van de Rijn. 2002. Software tools for high-throughput analysis and archiving of immunohistochemistry staining data obtained with tissue microarrays. Am J Pathol. 161:1557-1565.
Matkowskyj, K.A., R. Cox, R.T. Jensen, and R.V. Benya. 2003. Quantitative immunohistochemistry by measuring cumulative signal strength accurately measures receptor number. J Histochem Cytochem. 51:205-214.
Matkowskyj, K.A., D. Schonfeld, and R.V. Benya. 2000. Quantitative immunohistochemistry by measuring cumulative signal strength using commercially available software Photoshop and Matlab. J Histochem Cytochem. 48:303-312.
McCarty, K.S., Jr., E. Szabo, J.L. Flowers, E.B. Cox, G.S. Leight, L. Miller, J. Konrath, J.T. Soper, D.A. Budwit, W.T. Creasman, et al. 1986. Use of a monoclonal anti-estrogen receptor antibody in the immunohistochemical evaluation of human tumors. Cancer Res. 46:4244s-4248s.
Merseburger, A.S., M.A. Kuczyk, J. Serth, C. Bokemeyer, D.Y. Young, L. Sun, R.R. Connelly, D.G. McLeod, F.K. Mostofi, S.K. Srivastava, A. Stenzl, J.W. Moul, and I.A. Sesterhenn. 2003. Limitations of tissue microarrays in the evaluation of focal alterations of bcl-2 and p53 in whole mount derived prostate tissues. Oncol Rep. 10:223-8.
Moch, H., P. Schraml, L. Bubendorf, M. Mirlacher, J. Kononen, T. Gasser, M.J. Mihatsch, O.P. Kallioniemi, and G. Sauter. 1999. High-throughput tissue microarray analysis to evaluate genes uncovered by cDNA microarray screening in renal cell carcinoma. Am J Pathol. 154:981-6.
Morgan, J.M., H. Navabi, and B. Jasani. 1997. Role of calcium chelation in high-temperature antigen retrieval at different pH values. J Pathol. 182:233-7.
Morgan, J.M., H. Navabi, K.W. Schmid, and B. Jasani. 1994. Possible role of tissue-bound calcium ions in citrate-mediated high-temperature antigen retrieval. J Pathol. 174:301-7.
Mowry, R.W. 1980. Report from the president. The Biological Stain Commission: its goals, its past and its present status. Stain Technol. 55:1-7.
Mucci, N.R., G. Akdas, S. Manely, and M.A. Rubin. 2000. Neuroendocrine expression in metastatic prostate cancer: evaluation of high throughput tissue microarrays to detect heterogeneous protein expression. Hum Pathol. 31:406-14.
Natkunam, Y., R.A. Warnke, K. Montgomery, B. Falini, and M. van De Rijn. 2001. Analysis of MUM1/IRF4 protein expression using tissue microarrays and immunohistochemistry. Mod Pathol. 14:686-94.
Nocito, A., L. Bubendorf, E. Maria Tinner, K. Suess, U. Wagner, T. Forster, J. Kononen, A. Fijan, J. Bruderer, U. Schmid, D. Ackermann, R. Maurer, G. Alund, H. Knonagel, M. Rist, M. Anabitarte, F. Hering, T. Hardmeier, A.J. Schoenenberger, R. Flury, P. Jager, J. Luc Fehr, P. Schraml, H. Moch, M.J. Mihatsch, T. Gasser, and G. Sauter. 2001. Microarrays of bladder cancer tissue are highly representative of proliferation index and histological grade. J Pathol. 194:349-57.
Oberholzer, M., M. Ostreicher, H. Christen, and M. Bruhlmann. 1996. Methods in quantitative image analysis. Histochem Cell Biol. 105:333-55.
Rait, V.K., T.J. O'Leary, and J.T. Mason. 2004a. Modeling formalin fixation and antigen retrieval with bovine pancreatic ribonuclease A: I. Structural and functional alterations. Lab Invest. 84:292-9.
Rait, V.K., L. Xu, T.J. O'Leary, and J.T. Mason. 2004b. Modeling formalin fixation and antigen retrieval with bovine pancreatic RNase A: II. Interrelationship of cross-linking, immunoreactivity, and heat treatment. Lab Invest. 84:300-6.
Rassidakis, G.Z., D. Jones, A. Thomaides, F. Sen, R. Lai, F. Cabanillas, T.J. McDonnell, and L.J. Medeiros. 2002. Apoptotic rate in peripheral T-cell lymphomas. A study using a tissue microarray with validation on full tissue sections. Am J Clin Pathol. 118:328-34.
Rimm, D.L., R.L. Camp, L.A. Charette, J. Costa, D.A. Olsen, and M. Reiss. 2001a. Tissue microarray: a new technology for amplification of tissue resources. Cancer J. 7:24-31.
Rimm, D.L., R.L. Camp, L.A. Charette, D.A. Olsen, and E. Provost. 2001b. Amplification of tissue by construction of tissue microarrays. Exp Mol Pathol. 70:255-64.
Rubin, M.A., R. Dunn, M. Strawderman, and K.J. Pienta. 2002. Tissue microarray sampling strategy for prostate cancer biomarker analysis. Am J Surg Pathol. 26:312-9.
Sallinen, S.L., P.K. Sallinen, H.K. Haapasalo, H.J. Helin, P.T. Helen, P. Schraml, O.P. Kallioniemi, and J. Kononen. 2000. Identification of differentially expressed genes in human gliomas by DNA microarray and tissue chip techniques. Cancer Res. 60:6617-22.
Sen Gupta, R., D. Hillemann, T. Kubica, G. Zissel, J. Muller-Quernheim, J. Galle, E. Vollmer, and T. Goldmann. 2003. HOPE-fixation enables improved PCR-based detection and differentiation of Mycobacterium tuberculosis complex in paraffin-embedded tissues. Pathol Res Pract. 199:619-23.
Shi, S.R., C. Cote, K.L. Kalra, C.R. Taylor, and A.K. Tandon. 1992. A technique for retrieving antigens in formalin-fixed, routinely acid-decalcified, celloidin-embedded human temporal bone sections for immunohistochemistry. J Histochem Cytochem. 40:787-92.
Shi, S.R., R.J. Cote, and C.R. Taylor. 1997. Antigen retrieval immunohistochemistry: past, present, and future. J Histochem Cytochem. 45:327-43.
Shi, S.R., R.J. Cote, and C.R. Taylor. 2001. Antigen retrieval techniques: current perspectives. J Histochem Cytochem. 49:931-7.
Shi, S.-R., J. Gu, C. Cote, and C. Taylor. 2000a. Standardization of routine immunohistochemistry: where to begin? In Antigen Retrieval Techniques: Immunohistochemistry & Molecular Morphology. S.-R. Shi, J. Gu, and C. Taylor, editors. Eaton Publishing, Natick, MA. 255-272.
Shi, S.-R., J. Gu, and J. Turrens. 2000b. Development of the antigen retrieval technique: philosophical and theoretical bases. In Antigen Retrieval Techniques: Immunohistochemistry & Molecular Morphology. S.-R. Shi, J. Gu, and C. Taylor, editors. Eaton Publishing, Natick, MA. 17-40.
Shi, S.R., M.E. Key, and K.L. Kalra. 1991. Antigen retrieval in formalin-fixed, paraffin-embedded tissues: an enhancement method for immunohistochemical staining based on microwave oven heating of tissue sections. J Histochem Cytochem. 39:741-8.
Shiurba, R.A., E.T. Spooner, K. Ishiguro, M. Takahashi, R. Yoshida, T.R. Wheelock, K. Imahori, A.M. Cataldo, and R.A. Nixon. 1998. Immunocytochemistry of formalin-fixed human brain tissues: microwave irradiation of free-floating sections. Brain Res Brain Res Protoc. 2:109-19.
Simon, R., M. Mirlacher, and G. Sauter. 2003. Tissue microarrays in cancer diagnosis. Expert Rev Mol Diagn. 3:421-30.
Sompuram, S.R., K. Vani, E. Messana, and S.A. Bogen. 2004. A molecular mechanism of formalin fixation and antigen retrieval. Am J Clin Pathol. 121:190-9.
Taylor, C.R. 1980. Immunohistologic studies of lymphoma: past, present, and future. J Histochem Cytochem. 28:777-87.
Taylor, C.R. 1986. Immunomicroscopy: A Diagnostic Tool for the Surgical Pathologist. WB Saunders Co, Philadelphia.
Taylor, C.R. 1994. An exaltation of experts: concerted efforts in the standardization of immunohistochemistry. Hum Pathol. 25:2-11.
Taylor, C.R. 2000. The total test approach to standardization of immunohistochemistry. Arch Pathol Lab Med. 124:945-51.
Thomson, T.A., M.M. Hayes, J.J. Spinelli, E. Hilland, C. Sawrenko, D. Phillips, B. Dupuis, and R.L. Parker. 2001. HER-2/neu in breast cancer: interobserver variability and performance of immunohistochemistry with 4 antibodies compared with fluorescent in situ hybridization. Mod Pathol. 14:1079-86.
Torhorst, J., C. Bucher, J. Kononen, P. Haas, M. Zuber, O.R. Kochli, F. Mross, H. Dieterich, H. Moch, M. Mihatsch, O.P. Kallioniemi, and G. Sauter. 2001. Tissue microarrays for rapid linking of molecular changes to clinical endpoints. Am J Pathol. 159:2249-56.
Tzankov, A., A. Zimpfer, A. Lugli, J. Krugmann, P. Went, P. Schraml, R. Maurer, S. Ascani, S. Pileri, S. Geley, and S. Dirnhofer. 2003. High-throughput tissue microarray analysis of G1-cyclin alterations in classical Hodgkin's lymphoma indicates overexpression of cyclin E1. J Pathol. 199:201-7.
van den Broek, L.J., and M.J. van de Vijver. 2000. Assessment of problems in diagnostic and research immunohistochemistry associated with epitope instability in stored paraffin sections. Appl Immunohistochem Mol Morphol. 8:316-21.
Vincek, V., M. Nassiri, M. Nadji, and A.R. Morales. 2003. A tissue fixative that protects macromolecules (DNA, RNA, and protein) and histomorphology in clinical samples. Lab Invest. 83:1427-35.
Vrolijk, H., W. Sloos, W. Mesker, P. Franken, R. Fodde, H. Morreau, and H. Tanke. 2003. Automated acquisition of stained tissue microarrays for high-throughput evaluation of molecular targets. J Mol Diagn. 5:160-167.
Wan, W.H., M.B. Fortuna, and P. Furmanski. 1987. A rapid and efficient method for testing immunohistochemical reactivity of monoclonal antibodies against multiple tissue samples simultaneously. J Immunol Methods. 103:121-9.
Wester, K., A. Asplund, H. Backvall, P. Micke, A. Derveniece, I. Hartmane, P.U. Malmstrom, and F. Ponten. 2003. Zinc-based fixative improves preservation of genomic DNA and proteins in histoprocessing of human tissues. Lab Invest. 83:889-99.
Yaziji, H., L.C. Goldstein, T.S. Barry, R. Werling, H. Hwang, G.K. Ellis, J.R. Gralow, R.B. Livingston, and A.M. Gown. 2004. HER-2 testing in breast cancer using parallel tissue-based methods. JAMA. 291:1972-7.
Yosepovich, A., and J. Kopolovic. 2002. [Tissue microarray technology: a new and powerful tool for the molecular profiling of tumors]. Harefuah. 141:1039-41, 1090.
Index
A
Adaptive thresholding, 8
Akaike's criterion, 102
Alias, 39
Allele, definition of, 311
Allelic association, see Gametic phase disequilibrium
Alternative splicing, 351–353, 357, 360–363, 369–374
  in mouse, 352–357
  on coding potential, impact of, 357–360
  of regulatory factors, 371–373
  tissue-specific, 363
Alternative splicing database, 351
Alternative transcription, 360–363
Amino acids, 63, 137–138, 141, 143–144, 150, 152, 155–156, 159, 180
Analysis of images
  PLACE algorithmic, 394–396
  RESA algorithmic, 394–396
Analysis of variance (ANOVA), 42, 45, 94, 315
Anaplastic astrocytoma, 91, 103, 106
Anaplastic oligodendroglioma, 91–92, 103, 106
Anti-fluorescein antibody, 23
Antigen retrieval, 385
  mechanism of, 387
  pH of, 387
  principle of, 385–386
  standardization of, 387
  technique of, 387
Anti-sense RNA (aRNA), 23
Apoptosis, 107f
AQUA
  assessment of β-catenin expression in colon cancer, 396f
  functions, 394
  software, 392–393
Attractor, 262
Autocorrelation matrix, 126
Autoimmune diseases, see Systemic lupus erythematosus (SLE)
Automated quantitative analysis, 392, 397, see AQUA

B
Background detection, 6, 17
Background intensity, 6–7, 11, 26
"Banning-off" procedure, 94
Basin of attraction, 262
Bayes classifier, 120–121, 124, 129
Bayes error rate, 120–121, 124, 131, 222, 226, 229
Bayesian information criterion (BIC), 99
Bayesian model, 324, 356, 360, 365
Believability, 227
Best-Fit Extension Problem, 261, 263–266, 274
Beta-binomial model, 225
Binomial Differential (BD) distribution, 169, 197
Biological Stain Commission (BSC), 387
Biopsy tissue, assessment of cancer patients, 395
Boolean
  formalism, 260
  function, 237, 239–240, 242, 244–245, 247, 254f, 260–267
    error size, 263, 265–266, 268–269
    extension, 253–254
    monotone, 265
    partially defined, 263–264
  regression, 235, 237, 239–241, 246–249, 251, 254
Bootstrapping, 44–45
Bounding box, 6–8
Breast cancer, 126, 383, 389
"Brown" stained IHC, analysis of, 390–391

C
C. elegans, genome of worm, 351
Cancer, 75, 78, 85–86, 91, 119, 126, 133–134, 166, 174, 176, 179, 198, 211, 223, 231, 248, 257, 259–260, 298, 370, 383, 388, 395
  classification, 74–75, 86, 134, 257
cDNA
  library, 211
  microarray, 2–3, 4f, 17, 38, 49–50, 59, 91, 133, 219, 236, 301, 309
  sequences, full-length, 352–353, 364
Celera discovery system (CDS), 358, 369–370
Cell growth genes, 103
Cell-cycle, 70, 107f, 270
Center affine filter, 146, 148
Chemical stains in pathology, problem of standardization, 387
Chemotherapy, 260
Clark method, 324–325
Class discovery, 75
Class prediction, 75–77, 81
Classification, 63, 77, 81, 89–90, 97, 106, 119–122, 124, 127, 131, 133–134, 235, 237, 247, 253, 268, 289, 355
  accuracy, 75
  cancer, 75, 78, 85–86
  error, 65–66, 226, 254–255, 257
  molecular, 90
Classification-expectation-maximization (CEM), see Expectation-maximization (EM) algorithm, steps in
Classifier, 91–92, 119–125, 128–134, 237, 268
  constrained, 124
Clinical genomics, 382
Clustering glioma gene-expression, 102
  by GS-Gap and GS-MDL using discrimination power, 103, 104f
  differences between the two different lineages of glioma, 106, 108f
  expressions of Endothelin receptor type B (EDNRB), 106, 113f
  expressions of Heat shock protein 75 (TRAP1), 106, 112f
  expressions of Myosin light polypeptide kinase (MYLK), 106, 113f
  expressions of RNTRE (USP6NL), 106, 112f
  expressions of Sialyltransferase 7 (SIAT7B), 106, 112f
  transition from low grades to most aggressive grade, 103, 105f
Clustering, 49, 63, 66–67, 76, 89–90, 95–97, 102, 106, 183, 281, 288, 293, 358
  hierarchical, 62, 69
  methods, 89
Clusters, finding of, 89
Codetermination, 297–299, 301, 309
Coefficient of determination (CoD), 297–300
Coefficient of variation, 11–13, 50
Complexity, 129, 134, 165, 176, 183–185, 187–191, 193, 194, 197–199, 212, 236, 270, 291, 300, 351–353, 369, 371
  computational, 93, 261, 263, 270, 274, 301
  model, 261
Computational learning theory, 261–263
Concordance correlation coefficient, 27, 31
Confidence interval ratio, 11
Confounded effect, 39
Consistency problem, 260, 263–266, 274
Constitutive and cryptic exons, length distribution difference of, 363–364
Core needle biopsies (CNB), 22, 383
Correlation, 27, 31, 34, 45, 63–65, 76–77, 236, 288, 291, 315, 325, 331
Corresponding coding sequence (CDS), 358, 369–370
Cross-correlation vector, 127
Cross-term interference, 145
Cross-validation, 65, 92, 130, 134, 268–269, 285, 290, 293, 299
Cryptic exons, recognition of, 366
Cumulative fraction function, 167f, 168, 175f
Customized expression array, 4

D
Data visualization, 298, 309
dbSNP, 313–314
DCLRE1 (DNA cross-link repair 1A), 106
Dendrimer (3DNA) technique, 22
Denoising techniques, 216
Design cost, 120, 124, 129
Design, balanced, 40
Design, even, 40, 45
Detection of bacteria, immunofluorescence technique, 388
Diabetes, 38, 311
Diagnosis, 92–93, 389
Diallelic marker, see Single nucleotide polymorphisms
Differential expression detection
  illustration of methods application, 227–230
  multiple libraries outlier finding, 227
  replicated libraries in one class, 224–226
  single-library or "pseudo-library", 220–224
Differential expression, 42, 44–45, 59, 106, 209, 219
Dimensionality problem, 279–283, 289, 293
DNA
  damage, 69
  repair, 106
  replication, 69
DNA-binding proteins, 164, 182–183, 197
Domain occurrence probability function (DOPF), 186
Domain profile, 185
Drosophila melanogaster, 185, 187, 188
Dynamic allelic-specific hybridization (DASH), 313
Dynamical system, 262

E
Electromagnetic radiation, wavelength of, 391
EM algorithm, 324–325
  convergence of, 99
EMLD, 327
Enhancer, 110–111, 180, 182, 352, 363, 366–368
Entrez Database, 103
Entrez retrieval system, 313
Epanechnikov kernel, 123, 133t, 134
Epidermal growth factor receptor (EGFR), 152, 153t, 159
Ergodicity, 181, 198
Erroneous tags, 173, 179
Error model, 50, 62
EST sequences, 211
  analyses of, 355, 357, 364
Estimation by interval, 218–219
Estimation process for unknown quantities of sample or population
  estimation by interval, 212, 218–219
  point estimation, 212–217
Estimator, 13, 90, 96, 100, 102, 119, 126, 129–130, 134, 170–173, 212–219, 225, 228, 318t, 326
Estrogen receptor (ER), 384, 385f, 392, 397
Euclidean metric, 64f, 65, 72
Euclidean norm, 270
Evolution, 105t, 143, 163, 165, 185, 189, 191–197, 198, 199, 339, 360, 373
Exon, 164, 351–353, 355t, 357–358, 360f, 360–366, 368–371
  arrays, 373
Expectation-maximization (EM) algorithm, steps in, 97–98
  C step, 98
  E step, 98
  M step, 98
Expectation-maximization algorithm, 321
Experimental design, 37–40
Expressed sequence tags (EST), 209, 298
Expression ratio, 1, 4, 11–17, 62–63, 72, 219, 222–223, 300
Extension, 45, 122, 253–254, 259, 261–266, 268–269, 274, 282, 290, 355, 370

F
FANTOM, datasets
  full-length cDNA sequences, 353
  public-domain mRNA sequences, 353, 357
Fantom2 dataset, 351, 353, 355, 358, 360, 362–363, 371–373
Feature genes, 91, 235, 257
FFPE tissue, 386–388
Fibroblast growth factors, 139–140, 144, 148–150
Fine needle aspiration (FNA), 22
Finite mixture model, 97
Fisher discriminant, 92
Fisher's exact test, 221, 228, 319
Fluorescence-based methods, 391
  for quantitative analysis, 392–396
Fluorescent in situ hybridization (FISH), 386
Fluorescent signal amplification, 23–24
Formalin-based fixation, effects of, 385
Formalin-fixed paraffin-embedded (FFPE) tissue, 386
Fourier transform, 137–138
  discrete Fourier transform, 139
Frequentist approach, 219, 221
Functional classification, 63
Functional variants, 360, 370

G
Gabriel's LD method for block reconstruction, 333–334
Gametic phase disequilibrium, 326
Gap statistics selection, 94–95
Gaussian kernel, 123, 133
Gaussian process, 6
GenBank, groups, dataset, collector
  FANTOM Consortium, 353, 358
  RIKEN GSC Genome Exploration Group 2002, 353, 358
Gene chips, 382
Gene expression level probability function (GELPF), 166
Gene selection process in leave-one-out cross-validation (LOO-CV) procedure, 92–93
Gene shaving (GS), 90
Gene structure, complexity of in mammalian genomes, 351
Gene
  activity profile (GAP), 261
  annotation, 166, 202, 210–211, 313, 352–353
  expression level, 166, 282
    monoallelic, 182, 198
  profiles, 61, 63–64, 108t, 169, 177, 260, 263
  finding, 266, 351
  informative, 76–86, 237, 255, 257
  prediction, 164, 352
  predictor, 76, 254f, 298, 300, 304, 306
  recognition, 103
  regulatory network, 269–270, 274, 307, 374
    Bayesian network, 275
    Boolean network, 261–262
    linear model, 282, 286, 288
    neural network, 128–129, 132–134
  target, 236, 297, 300–301, 304–307
  transcript, 259
GeneChip technology, 174
Generalized discrete Pareto (GDP) distribution, 168, 187
Genetic linkage, definition of, 312
Genetic marker, definition of, 312
Genetic polymorphism, definition of, 312
Genome, 1, 11, 61–62, 163–164, 166, 169, 173–174, 176, 180–182, 183, 186, 188, 193–199, 211, 214, 223, 236, 260–261, 264, 309, 311–314, 325, 328, 339, 351, 353, 355, 356t, 358–359, 369–370, 382
Genomic exons, 355
Genotype, 198, 311–318, 321–324, 327–329, 334, 336
  definition of, 311
Gibbs sampling, 324
Glioblastoma, 91, 103, 104t
Glioma classification, lineages in
  anaplastic astrocytoma and anaplastic oligodendroglioma, 91
  astrocytic and oligodendrocytic, 91
  astrocytoma and oligodendroglioma, 91
  glioblastoma multiforme, 91
Glioma data set, description of, 91–93
Glioma multiforme (GM) subtype of glioma, 91
GOLD, 327
GoMiner, 103
Grading, 390
Greedy algorithm, 77, 80–81, 268f, 338
Grid, 4–6, 50, 66–67, 71, 393
GS algorithm, 90
GS-Gap, see GS algorithm
GS-MDL, 95–96, see also GS algorithm
  evaluation of code length, 96
  performance of, 99

H
Haploid haplotype one-way ANOVA test (HHA), 321
Haploid marker permutation (HMP) test, 321
Haplotype block structures and SNP relationship
  applications, 336–337
  haplotype blocks, 328–333
  linkage disequilibrium, 326–327
  simulations, 333–336
  tagging SNPs, 337–338
Haplotype block, 328–333
  criteria for, model, 333
  definition of, 332
  methods for, construction, 332
Haplotype reconstruction software, list of, 339–340
Haplotype trend regression (HTR) method, 321
Haplotype, definition of, 311
Hardy-Weinberg equilibrium (HWE), 323
Heterogeneous nuclear ribonucleoprotein (hnRNP), 371
HGVbase, 314
Hidden layer, 128–129
Hierarchical clustering, see Clustering
Histogram, 6–10, 43–44, 166–168, 171–175, 186–188, 191, 268–271, 359f
Histogram rule, 121, 123–124, 129–130, 268–269
Homeodomain proteins, 106, 139, 150–151, 158
Horseradish peroxidase (HRP), 23, 388
Human genome, 164, 174, 176, 181, 199, 309, 311–313, 325, 328
  sequencing projects, 174, 181, 312–313
Hybridization trends, 55
Hydrophobicity, 137, 150, 151f, 152, 156, 158f
Hyperplane, 126–128, 134
Hysteresis, 260

I
IDEG6, 229
IGFBP1 (insulin-like growth factor binding protein 1), 106
IHC, 383, 386–391
Image acquisition, 392–394
Image analysis, 3, 4f, 17, 390, 397
Immunohistochemistry (IHC), 383, 385–391, 397
  new challenges, 388
  origins of, 388
  standardization of, 387–389
In vitro transcription, 23, 24f
Inferential power, 280, 288–292
Informative genes, 76–82, 84–86, 237, 255
Inner product, 64, 66–67, 72, 93
Insulin-like growth factor binding protein (IGFBP2), 105f, 106, 107f
Integer programming problem, 216
Intron, 313–314, 353–355, 358, 360f, 362, 369, 370
Ionizing radiation, 391

J
Jeffreys' test for precise hypothesis, 222

K
Kernel, 82, 123–124, 133–134, 219
Kernel density estimators, 219
k-Nearest neighbor (k-NN), 65, 123, 133, 270
Knowledge-discovery, 223

L
Lasso regression, 280, 283f, 285–286, 287f, 288, 290–292
Leave-one-out estimation, 130, 134
Leukemia data, 254
Library, 166–176, 201, 211, 214–215, 218, 220, 222, 224, 227, 228t, 229
Limit-cycle, 262
Linearly separable, 78, 126–128
Lloyd algorithm, 92
Load assignment
  dynamic, 302–305
  dynamic load partition, 302, 306
  uniform, 302, 304
Local factors, 70
Logarithmic growth (LG) model, 172
Lotka-Zipf law, 172, 201

M
Majority voting, 65, 76, 79, 81–84, 237
MALDI-TOF mass-spectrometry, 313
Mammalian gene structure, complexity of, 353
Mammalian transcriptomes
  complexity of, 371
  flexibility of, 371
Mann-Whitney method, 9, 10f
Massively parallel signature sequencing (MPSS) technique, 209, 213
MATN2 (matrilin 2), 106
Mature mRNA, 363–365, 367
Maximal-margin hyperplane, 127
Maximum likelihood, 13, 96, 216, 228, 235, 241–242, 323, 325, 362
MDL principle, 90, 95–96, 99–100, 235, 237, 257, 269
Medicine, areas of
  blood glucose, 395
  drug levels, 395
  liver enzymes, 395
Metabolic
  activation, 106
  detoxification, 44
Metric, 16f, 17, 26–27, 49, 63, 64f, 65–67, 72, 154
Minimum description length (MDL) principle, 90, 269
  criterion, 101
  estimation of number of gene clusters using, 99–102
Mitotic index, 91
Molecular
  classification, 85, 90
  markers, 387
  pathway, 91
Monte Carlo technique, 220
Mouse gene index, 353
Moving-window rule, 123–124
mRNA, 1–2, 13, 21, 37, 39t, 163–164, 166, 169–170, 171f, 173–174, 175f, 176–177, 178f, 179–180, 182, 211, 216, 259, 353, 357, 360, 363–364, 367, 374
Multidimensional scaling (MDS), 67
Multinomial distribution, 97, 170
Mutation, 61–70, 72, 185, 196, 312, 314, 334, 339, 369, 373
MYO6 (myosin VI), 106

N
National Center for Biotechnology Information (NCBI), 103, 223, 313
Natural selection, 190, 198–199
Nearest-neighbor rule, 122–124
Neighbors and shadows, 214–215
Neural network, 125–134, 281
Neuroblastoma, 133–134
Neuron, 106, 128, 132–133
Nitrogen desiccation chamber, 385
3'-most NlaIII site (CATG), 210–211
Noise, 11, 15, 26, 34, 51f, 55, 62–66, 72, 97, 102, 167f, 179, 273, 279–281, 284–286, 289, 293, 391, 394–395
Noise injection, 280–292
Non-coding RNA, 181–182
Normalization, 4f, 6, 12, 14–15, 29, 37, 57–58, 168, 188, 242, 244, 268, 270
  ratio, 11
Normalized maximum likelihood (NML), 235–257
NP-hard, 80, 263–264, 303
Number of expressed genes, 166, 169–181, 197, 203

O
Ockham's razor principle, 323
Oligodendroglioma, 91–92, 103, 104f, 106
Open reading frame (ORF), 173
Operational haplotype block definitions, 330
Optical density, 391
Orthogonal effect, 39–40, 45–46
Overfitting, 86
Overrepresented tags, 217

P
p53, 57, 183
PAC learning, 263
Paraffin-embedded tissue, 383–384
Parallel analysis of gene expression (PAGE), 298, 300–306
  Gadjust, 301, 303, 305
  Gcombi, 301, 304
  Glogic, 301, 304–305
Parallel computing, 298, 300, 302
Parallel performance, 304–305
  efficiency, 302
  speedup, 302
Pareto-like function, 166
Partially confounded effect, 40
Perceptron, 125
Phenotype, 61, 106, 165, 198, 312, 314–315, 321–322, 336, 351–352, 371
phred sequencing chromatogram analysis software, 214
Pixel-based locale assignment for compartmentalization of expression (PLACE), 394
  algorithm, 395
Plug-in rule, 121
Point estimation
  abundant estimation, 216–217
  counting estimation, 215–216
  error-rates estimation, 213–214
Polymerase chain reaction (PCR), 23, 397
Polynomial-time, 261, 263–265, 274
Polypyrimidine tract-binding protein (PTB), 371, 372f
Positive controls, 51, 53, 57, 58f
POU domain transcription factors, 106
Power law, 166, 168–169, 172, 179, 184f, 190t, 201, 202
Prediction strength, 81
Predictive power, 288–290
Principal components analysis, 8
Probability distribution, 120, 164, 170, 180, 182, 184f, 193, 196, 199
Probability function, 165–166, 168, 170, 172, 176, 183, 186–187, 195, 198–200
Probably approximately correct (PAC) learning, 263
Prognosis, 389
Promoter sequence, 23, 24f
Proportion disequilibrium (PLD), 331
Protein
  clusters, 164–165, 183, 184f, 197
  domain, 164–165, 184–186, 188–190, 192–193, 195, 197–199
  molecule, 138, 182
  sequence, 137–143, 145, 150, 152, 154–157, 160
Protein-coding genes, number of, 352
Proteolytic enzymes
  bromelain, 386, 388
  chymotrypsin, 386
  ficin, 386
  pepsin, 386, 388
  pronase, 386
Proteome, 163–165, 183–199
Pseudoinverse, 127, 283, 292
Pseudo-library, see Differential expression detection

Q
QIHC, 390, 397
QIHC, automated systems for
  acquisition of, 390
  image measurement, 390
  storage of digital images, 390
Quality control, 2, 50, 55, 314, 397
Quality metric, 16–17
Quantitative immunohistochemistry, 389–391, see QIHC
Quantization, 248, 270, 300

R
Random variable, 120, 154, 170, 212, 218, 224, 239, 268
Rapid exponential subtraction algorithm (RESA), 394
Rare transcripts, 165–166, 214, 352, 358
Rarely-expressed gene, 178, 182
Rate of agreement, 28, 32
Real-time PCR, 3
Reduced representation shotgun (RRS) sequencing, 313–314
Reference channel, 50, 55
Reference efficiency, 55, 235, 302, 304–305, 324
RefSNP, 311
Regularization, 279–282, 286, 288, 293
Repeat sequences in alternative splicing, recruitment of, 369–371
Replicate ratios, 52, 54–56, 57f
Replicate spottings, 50, 52f–53f
Replicates, 50, 52, 54, 58f, 92, 209, 220–222, 224
Reproducibility, 26–27, 31, 34, 388
Resonant recognition model, 139, 144
Resubstitution estimator, 129–130
RGB image, 3
Rhabdomyosarcoma, 133
Riboregulator, 181
Ridge regression, 280, 283f, 284–286, 287f, 289–292
RNA
  amplification, 22–23, 24f, 25, 28
  splicing, 353
r-of-k function, 77–79, 84, 92
Root signal, 262
Rosenblatt perceptron algorithm, 126

S
SAGE Genie, 222–223, 227, 229
  brain tumor library, 228t
  interface, 211
SAGE library, 171f, 173–174, 175f
SAGE2002 software, 221
SAGE300 software, 221
SAGEmap, 222–223
Sample-size effect, 169
Segmentation, 4–5, 8, 17, 390
Self-organized critical systems, 178, 199
Self-organizing map, 62, 66–67, 70
Sequencing errors, 213–216, 230
Serial analysis of gene expression (SAGE) method, 164, 209
  advantage of, 210
  errors in sequencing, 212
  steps in transcript tags extraction, 210
Shift-log transformation, 41
Shrinkage, 217, 285–286, 287f
Sigmoid, 128–129, 133
Signal enhancement, 26
Signal recognition particle receptor (SRPR) gene, 103
Signal-to-noise ratio, 15, 34, 394–395
Silencer, 363, 366
Similarity measures, 63, 64f
Single library, see Differential expression detection
Single nucleotide polymorphisms, 311–312
  analysis, 321, 337
  consortium, 313
  database, 313
  definition of, 312
  genotype-phenotype relationship, 314–318
  microarrays, 313
  single-SNP and haplotype-SNP association, 318–320
  use of, 314
Singular value decomposition (SVD) in microarray data analysis, 102
Skew distribution, 163, 183
Small sample size, 169
Smooth t-statistics, 30, 32
SNP, see Single nucleotide polymorphisms
SNPHAP program, 325
Splice variation
  amenable to, 372
  analyses of, 353, 372
  correlated occurrence of, 358
  forms of, 355, 358
  frequency of, 358, 361
  impact of, 357
  on exon length, 352, 363
  types of, 353, 361
Splicing, 351–353, 355, 356t, 357–358, 360, 363–364, 366, 368–374
  regulation of, 363–365
Stability, 52, 260, 280–281, 289, 291–292, 374
Statistics ratio, 11
Stephens-Smith-Donnelley algorithm, 324
Storage of tissues, current method of
  formalin-based fixation, 384
  paraffin-embedded, 384
Streptavidin, 23
Studentizing, 49, 59
Support vector machine (SVM), 82, 127, 133–134
Survival, 139–140, 178
SVDMAN, 102
Systemic lupus erythematosus (SLE), 370

T
T7 RNA polymerase, 23, 25
Tag, 164, 166, 168–169, 173, 175–176, 203, 210–222, 224–227, 229–230, 298
  location database, 173
  redundancy, 173
Tagging SNPs, see Haplotype block structures and SNP relationship
Tag-to-gene mapping, 210
Tag-to-UniGene assignment, reliability order for, 211
Target detection, 4f, 7–10, 17
Ternary logic filter, 300, 305
Test set, 65, 92, 254–255, 289–290
Threshold function, 77–79, 84–85
Time-frequency, 137, 139, 141–145, 148, 158f, 159–160
Tissue microarray, 381–385, 397
  advent of, 385
  (TMA) technology, 381
Total cholesterol, 315
Total number of genes, 98, 165, 172–173, 190
Toxicity, 370
Training set, 92, 134, 248, 253–255, 268, 282–286
Transcription units (TUs), 353, 358, 372
Transcriptome, 89, 170–171, 174–175, 179, 182, 197–199, 213–214, 351–353, 355–356, 371–372
Trimmed mean, 11
True distinct tag, 171, 173–175, 203
T-score, 30, 32
Tumor heterogeneity, 117
Tumorigenesis, 276
Type II error rates, 333–334
Tyramide signal amplification (TSA), 23

U
U-matrix, 69
Unbiasedness, 130–131
Underrepresented tags, 217
Undersampling, 164, 183
UniGene cluster, 211–212

V
Validation, 174, 300–301, 304, 314, 381–382
Vapnik-Chervonenkis (VC) dimension, 132
Variable
  essential, 264–265
  fictitious, 261
Visualization of gene expression (VOGE), 298, 306–309
Voronoi cell, 122

W
Wavelet transform, 143–144
Weighted voting, 75, 78, 82
Wiener filter, 126, 133
Wigner-Ville distribution, 144–146, 159
WINNOW algorithm, 77, 79–80

Y
Yeast genome, 169, 173