English Pages 312
Edited by Frank Emmert-Streib and Matthias Dehmer
Statistical Diagnostics for Cancer
Titles of the Series “Quantitative and Network Biology”

Volume 1
Dehmer, M., Emmert-Streib, F., Graber, A., Salvador, A. (eds.)
Applied Statistics for Network Biology: Methods in Systems Biology
2011
ISBN: 978-3-527-32750-8

Volume 2
Dehmer, M., Varmuza, K., Bonchev, D. (eds.)
Statistical Modelling of Molecular Descriptors in QSAR/QSPR
2012
ISBN: 978-3-527-32434-7

Related Titles

Zhou, X.-H., Obuchowski, N. A., McClish, D. K.
Statistical Methods in Diagnostic Medicine
2011
ISBN: 978-0-470-18314-4

Azuaje, F.
Bioinformatics and Biomarker Discovery: “Omic” Data Analysis for Personalized Medicine
2010
ISBN: 978-0-470-74460-4
Quantitative and Network Biology
Series Editors M. Dehmer and F. Emmert-Streib
Volume 3
Statistical Diagnostics for Cancer Analyzing High-Dimensional Data
Edited by Frank Emmert-Streib and Matthias Dehmer
The Editors
Matthias Dehmer
UMIT
Institut für Bioinformatik und Translationale Forschung
Eduard Wallnöfer Zentrum 1
6060 Hall/Tyrol
Austria

Frank Emmert-Streib
Queen's University Belfast
Center for Cancer Research and Cell Biology
97, Lisburn Road
Belfast BT9 7BL
United Kingdom
Cover: Network design by Shailesh Tripathi
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty can be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Card No.: applied for

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

© 2013 Wiley-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany

All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Print ISBN: 978-3-527-33262-5
ePDF ISBN: 978-3-527-66544-0
ePub ISBN: 978-3-527-66545-7
mobi ISBN: 978-3-527-66546-4
oBook ISBN: 978-3-527-66547-1

Typesetting Thomson Digital, Noida, India
Printing and Binding Markono Print Media Pte Ltd, Singapore
Cover Design Grafik-Design Schulz, Fußgönheim

Printed in Singapore
Printed on acid-free paper
Contents

Preface
List of Contributors
Part One General Overview

1 Control of Type I Error Rates for Oncology Biomarker Discovery with High-Throughput Platforms
Jeffrey Miecznikowski, Dan Wang, and Song Liu
1.1 Brief Summary
1.2 Introduction
1.3 High-Throughput Platforms
1.3.1 Gene Expression Arrays
1.3.2 RNA-Seq
1.3.3 DNA Methylation Arrays
1.3.4 Mass Spectrometry Platforms
1.3.5 aCGH Arrays
1.3.6 Preprocessing HT Platforms
1.4 Analysis of Experiments
1.4.1 Linear Regression
1.4.1.1 Simple Linear Regression
1.4.1.2 Multiple Regression
1.4.2 Logistic Regression (Y Discrete)
1.4.2.1 Multiple Logistic Regression
1.4.3 Survival Modeling
1.4.3.1 Kaplan–Meier Analysis
1.5 Multiple Testing Type I Errors
1.5.1 FWER, k-FWER Methods
1.5.1.1 Adjusted Bonferroni Method
1.5.1.2 Holm Procedure
1.5.1.3 Generalized Hochberg Procedure
1.5.1.4 Generalized Šidák Procedure
1.5.1.5 minP and maxT procedures
1.6 Discussion
1.7 Perspective
References

2 Overview of Public Cancer Databases, Resources, and Visualization Tools
Frank Emmert-Streib, Ricardo de Matos Simoes, Shailesh Tripathi, and Matthias Dehmer
2.1 Brief Overview
2.2 Introduction
2.3 Different Cancer Types are Genetically Related
2.4 Incidence and Mortality Rates of Cancer
2.5 Cancer and Disorder Databases
2.6 Visualization and Network-Based Analysis Tools
2.6.1 Web-Based Software
2.6.2 R-Based Packages
2.7 Conclusions
2.8 Perspective
References

Part Two Bayesian Methods

3 Discovery of Expression Signatures in Chronic Myeloid Leukemia by Bayesian Model Averaging
Ka Yee Yeung
3.1 Brief Introduction
3.2 Chronic Myeloid Leukemia (CML)
3.3 Variable Selection on Gene Expression Data
3.4 Bayesian Model Averaging (BMA)
3.4.1 The Iterative BMA Algorithm (iBMA)
3.4.2 Computational Assessment
3.5 Case Study: CML Progression Data
3.6 The Power of iBMA
3.7 Laboratory Validation
3.8 Conclusions
3.9 Perspective
3.10 Publicly Available Resources
References

4 Bayesian Ranking and Selection Methods in Microarray Studies
Hisashi Noma and Shigeyuki Matsui
4.1 Brief Summary
4.2 Introduction
4.3 Hierarchical Mixture Modeling and Empirical Bayes Estimation
4.4 Ranking and Selection Methods
4.4.1 Ranking Based on Effect Sizes
4.4.1.1 Posterior Mean (PM)
4.4.1.2 Rank Posterior Mean (RPM)
4.4.1.3 Tail-Area Posterior Probability (TPP)
4.4.2 Ranking Based on Selection Accuracy of Differential Genes
4.4.2.1 Posterior Probability of Differentially Expressed (PPDE)
4.4.2.2 Evaluating Selection Accuracy
4.5 Simulations
4.6 Application
4.7 Concluding Remarks
4.8 Perspective
4.9 Appendix: The EM Algorithm
References

5 Multiclass Classification via Bayesian Variable Selection with Gene Expression Data
Yang Aijun, Song Xinyuan, and Li Yunxian
5.1 Brief Summary
5.2 Introduction
5.3 Matrix Variate Distribution
5.4 Method
5.4.1 Model
5.4.2 Prior Specification
5.4.3 Computation
5.4.4 Classification
5.5 Real Data Analysis
5.5.1 Leukemia Data
5.5.2 Lymphoma Data
5.5.3 Computational Time
5.6 Discussion
5.7 Perspective
References

6 Semisupervised Methods for Analyzing High-dimensional Genomic Data
Devin C. Koestler
6.1 Brief Summary
6.2 Motivation
6.3 Existing Approaches
6.3.1 Fully Unsupervised Procedures
6.3.2 Fully Supervised Procedures
6.3.3 Semisupervised Procedures
6.3.3.1 Semisupervised Clustering
6.3.3.2 Semisupervised RPMM
6.3.3.3 Considerations Regarding Semisupervised Procedures
6.4 Data Application: Mesothelioma Cancer Data Set
6.4.1 Results: Mesothelioma Cancer Data Set
6.5 Perspective
References

Part Three Network-Based Approaches

7 Colorectal Cancer and Its Molecular Subsystems: Construction, Interpretation, and Validation
Vishal N. Patel and Mark R. Chance
7.1 Brief Summary
7.2 Colon Cancer: Etiology
7.3 Colon Cancer: Development
7.4 The Pathway Paradigm
7.5 Cancer Subtypes and Therapies
7.6 Molecular Subsystems: Introduction
7.7 Molecular Subsystems: Construction
7.7.1 Measurements
7.7.2 Manifolds
7.8 Molecular Subsystems: Interpretation
7.8.1 Examples
7.9 Molecular Subsystems: Validation
7.10 Worked Example: Label-Free Proteomics
7.10.1 Whole Protein-Level Significance
7.10.2 Peptide-Level Significance
7.10.3 Exon-Level Significance
7.10.4 Summarizing the Results
7.11 Conclusions
7.12 Perspective
References

8 Network Medicine: Disease Genes in Molecular Networks
Sreenivas Chavali and Kartiek Kanduri
8.1 Brief Summary
8.2 Introduction
8.3 Genetic Architecture of Human Diseases
8.4 Systems Properties of Disease Genes
8.4.1 Network Measures
8.4.2 Disease and Disease-Gene Networks
8.4.3 Disease Genes in Protein Interaction Networks
8.4.4 Identification of Disease Modules
8.5 Disease Gene Prioritization
8.5.1 Linkage Methods
8.5.2 Disease-Module-Based Methods
8.5.3 Diffusion-Based Methods
8.6 Conclusion
8.7 Perspectives
References

9 Inference of Gene Regulatory Networks in Breast and Ovarian Cancer by Integrating Different Genomic Data
Binhua Tang, Fei Gu, and Victor X. Jin
9.1 Brief Summary
9.2 Introduction
9.3 Theory and Contents of Gene Regulatory Network
9.3.1 Basic Theory of Gene Regulatory Network
9.3.2 Content of Gene Regulatory Network
9.3.2.1 Identify and Infer the Structure Properties and Regulatory Relationships of Gene Networks
9.3.2.2 Understand the Basic Rules of Gene Expression and Function
9.3.2.3 Discover the Transfer Rules of Genetic Information During Gene Expression
9.3.2.4 Study on the Gene Function in a Systematic Framework
9.4 Inference of Gene Regulatory Networks in Human Cancer
9.4.1 The In Silico Analytical Approach
9.4.1.1 Study Case 1: Inference of Static Gene Regulatory Network of Estrogen-Dependent Breast Cancer Cell Line
9.4.1.2 Study Case 2: Gene Regulatory Network of Genome-Wide Mapping of TGFβ/SMAD4 Targets in Ovarian Cancer Patients
9.4.2 A Bayesian Inference Approach for Genetic Regulatory Analysis
9.4.2.1 Study Case: ERα Transcriptional Regulatory Dynamics in Breast Cancer Cell
9.5 Conclusions
9.6 Perspective
References

10 Network-Module-Based Approaches in Cancer Data Analysis
Guanming Wu and Lincoln Stein
10.1 Brief Summary
10.2 Introduction
10.3 Notation and Terminology
10.4 Network Modules Containing Functionally Similar Genes or Proteins
10.5 Network Module Searching Methods
10.5.1 Greedy Network Module Search Algorithms
10.5.2 Objective Function Guided Search
10.5.3 Network Clustering Algorithms
10.5.4 Community Search Algorithms
10.5.5 Mutual Exclusivity-Based Search Algorithms
10.5.6 Weighted Gene Expression Network
10.6 Applications of Network-Module-Based Approaches in Cancer Studies
10.6.1 Network Modules and Cancer Prognostic Signatures
10.6.2 Cancer Driver Gene Search Based on Network Modules
10.6.3 Using Network Patterns to Identify Cancer Mechanisms
10.7 The Reactome FI Cytoscape Plug-in
10.7.1 Construction of a Functional Interaction Network
10.7.2 Network Clustering Algorithm
10.7.3 Cancer Gene Index Data Set
10.7.4 Analyzing the TCGA OV Mutation Data Set
10.7.4.1 Loading the Mutation File into Cytoscape and Constructing a FI Subnetwork
10.7.4.2 Network Clustering and Network Module Functional Analysis
10.7.4.3 Module-Based Survival Analysis
10.7.4.4 Cancer Gene Index Data Overlay Analysis
10.8 Conclusions
10.9 Perspective
References

11 Discriminant and Network Analysis to Study Origin of Cancer
Li Chen, Ye Tian, Guoqiang Yu, David J. Miller, Ie-Ming Shih, and Yue Wang
11.1 Brief Summary
11.2 Introduction
11.3 Overview of Relevant Machine Learning Techniques
11.3.1 Fisher’s Discriminant Analysis and ANOVA
11.3.2 Hierarchical Clustering
11.3.3 One-Versus-All Support Vector Machine and Nearest-Mean Classifier
11.3.4 Differential Dependency Network
11.4 Methods
11.4.1 CNA Data Analysis for Testing Existence of Monoclonality
11.4.1.1 Preprocessing
11.4.1.2 Assessing Statistical Significance of Monoclonality
11.4.1.3 Visualization of Monoclonality
11.4.2 A Two-Stage Analytical Method for Testing the Origin of Cancer
11.4.2.1 Basic Assumptions
11.4.2.2 Tissue Heterogeneity Correction
11.4.2.3 Stage 1: Feature Selection and Classification
11.4.2.4 Stage 2: Transcriptional Network Comparison
11.5 Experiments and Results
11.5.1 Monoclonality
11.5.1.1 Testing Existence of Monoclonality
11.5.1.2 The Significance of Monoclonality
11.5.2 Testing the Origin of Ovarian Cancer
11.5.2.1 Stage 1 Results
11.5.2.2 Stage 2 Results
11.6 Conclusion
11.7 Perspective
References

12 Intervention and Control of Gene Regulatory Networks: Theoretical Framework and Application to Human Melanoma Gene Regulation
Nidhal Bouaynaya, Roman Shterenberg, Dan Schonfeld, and Hassan M. Fathallah-Shaykh
12.1 Brief Summary
12.2 Gene Regulatory Network Models
12.3 Intervention in Gene Regulatory Networks
12.3.1 Optimal Stochastic Control
12.3.2 Heuristic Control Strategies
12.3.3 Structural Intervention Strategies
12.4 Optimal Perturbation Control of Gene Regulatory Networks
12.4.1 Feasibility Problem
12.4.2 Optimal Perturbation Control
12.4.2.1 Minimal-Energy Perturbation Control
12.4.2.2 Fastest-Convergence Rate Perturbation Control
12.4.3 Trade-offs Between Minimal-Energy and Fastest Convergence Rate Perturbation Control
12.4.4 Robustness of Optimal Perturbation Control
12.5 Human Melanoma Gene Regulatory Network
12.6 Perspective
References

Part Four Phenotype Influence of DNA Copy Number Aberrations

13 Identification of Recurrent DNA Copy Number Aberrations in Tumors
Vonn Walter, Andrew B. Nobel, D. Neil Hayes, and Fred A. Wright
13.1 Introduction
13.2 Genetic Background
13.2.1 Definitions
13.2.2 Mechanisms of DNA Copy Number Change: An Overview
13.2.3 CNAs and Cancer
13.2.4 Sporadic and Recurrent CNAs
13.2.5 Measuring DNA Copy Number
13.2.6 Other Issues to Consider When Assessing DNA Copy Number
13.3 Analyzing DNA Copy Number: Single Sample Methods
13.3.1 Notation
13.3.2 Quality Control and Preprocessing
13.3.3 Thresholding
13.3.4 Segmentation Algorithms
13.3.5 Methods Based on Hidden Markov Models
13.4 Analyzing DNA Copy Number Data: Multiple Sample Methods to Detect Recurrent CNAs
13.4.1 Additional Preprocessing and Summary Statistics
13.4.2 Multiple Testing
13.4.3 Assessing Statistical Significance: An Overview
13.5 Analyzing DNA Copy Number Data with DiNAMIC
13.5.1 Cyclic Shifts
13.5.2 Assessing Statistical Significance with DiNAMIC
13.5.3 Peeling
13.5.4 Confidence Intervals for Recurrent CNAs
13.5.5 Bootstrap Test-Based Confidence Intervals in Real Datasets
13.6 Open Questions
References

14 The Cancer Cell, Its Entropy, and High-Dimensional Molecular Data
Wessel N. van Wieringen and Aad W. van der Vaart
14.1 Brief Summary
14.2 Introduction
14.3 Background
14.3.1 Molecular Biology
14.3.2 Cancer
14.3.3 Measurement Devices
14.4 Entropy Increase
14.5 Statistical Arguments
14.6 Statistical Methodology
14.6.1 Experiments
14.6.2 Entropy
14.6.3 Mutual Information
14.7 Simulation
14.8 Application to Cancer Data
14.8.1 Analyses of Type II Experiments
14.8.2 Analyses of Type I Experiments
14.8.3 Potential
14.8.4 Discussion
14.9 Conclusion
14.10 Perspective
14.11 Software
References

Index
Preface

The data revolution in biology and medicine not only provides opportunities to enhance our fundamental understanding of biological processes, patho- and tumorigenesis, and epidemiology but also poses a considerable challenge for data analysis. For this reason, novel statistical and computational approaches are required to unravel the mass of data provided by contemporary sequencing and array technologies [4, 5]. The aim of the book Statistical Diagnostics for Cancer: Analyzing High-Dimensional Data is to present statistical methods, focusing on a systems level, that can be applied to a wide spectrum of genetics and genomics data from high-throughput cancer experiments. Owing to the breathtaking progress in biology in recent years, many experimental approaches that originated in molecular and cell biology are now on the verge of entering medical research. For this reason, the major goal of the present book is to advocate and promote novel analysis methods that hold great promise for prognostic and diagnostic purposes in biomedical research.

Along the way toward this goal, we face several principal problems that need to be addressed systematically [1, 2, 3]. Three problems are of particular importance. First, in contrast to traditional clinical data, data from high-throughput experiments are very high dimensional, involving thousands or even tens of thousands of variables. Usually, this requires dimension reduction or variable selection to tame the computational complexity of such high-dimensional problems. Second, because gene products depend on each other at the molecular level, there is a nonnegligible heterogeneity in these data, posing difficulties for parametric statistical models. For this reason, nonparametric methods, for example, bootstrap or resampling methods, are frequently used. Third, it is increasingly common that high-throughput data from different technologies are simultaneously available, which requires their meaningful integration.

According to the World Health Organization (WHO), cancer is one of the leading causes of death in the developed countries. For this reason, this book focuses entirely on this menace to health, and the chapters discuss a large variety of methods applied to different cancer types. For example, investigations of breast cancer, cervical cancer, colorectal cancer, lung cancer, leukemia, lymphoma, melanoma, ovarian cancer, and prostate cancer are presented in a way that highlights the genetic and molecular understanding obtained of these complex diseases but also provides a thorough explanation of the statistical methods.

This book is intended for researchers and for graduate and advanced undergraduate students in the interdisciplinary fields of computational biology, biostatistics, bioinformatics, and systems biology studying problems in biological and biomedical sciences. Each chapter is comprehensively presented, accessible not only to researchers from this field but also to interested students or scientists specialized in related areas. To enable this, each chapter presents not only technical results but also the background knowledge necessary to understand the statistical method or the biological problem under consideration. In addition, each chapter starts with a section called “Brief Summary” and finishes with a section called “Perspective.” These sections are nontechnical in nature, providing the reader with a brief overview of the presented topic. These features allow the book to be used as a textbook for, for example, an interdisciplinary seminar for advanced students.

In Figure 1, we show an overview of all chapters in this book. Due to the complexity of general approaches to cancer, it is not possible to categorize the chapters uniquely by just one keyword. For this reason, we provide in Figure 1 a three-dimensional categorization, which is based on (1) the data types used, (2) the statistical and computational methods, and (3) the cancer types studied. For each of these conceptual categories, we use a color code, as provided in Figure 1. In addition, the book is organized in four parts. In the first part, chapters present a general overview of generic methods and data used in the remainder of the book. The second part focuses on Bayesian methods and the third part on network-based approaches. Finally, part four contains chapters describing the influence of DNA copy number aberrations on the phenotype. This conceptual overview may help the reader quickly find a specific chapter that deals with a particular subset of cancer types or statistical methods.

Many colleagues, whether consciously or unconsciously, have provided us with input, help, and support before and during the preparation of the present book. In particular, we would like to thank Andreas Albrecht, Gökmen Altay, Subhash Basak, Jaine Blayney, Danail Bonchev, Frederick Campbell, Aedin Culhane, Maria Duca, Dean Fennell, Galina Glazko, Armin Graber, Beryl Graham, Benjamin Haibe-Kains, Peter Hamilton, Des Higgins, Maria Hughes, Patrick Johnston, Frank Kee, Declan Kieran, Chang Sik Kim, Terry Lappin, Kang Li, D. D. Lozovanu, Florian Markowetz, Darragh McArt, Dennis McCance, James McCann, Abbe Mowshowitz, Ken Mills, Paul Mullan, Arcady Mushegian, Katie Orr, Andrei Perjan, John Quackenbush, Andre Ribeiro, Bert Rima, Sudhakar Sahoo, Ricardo de Matos Simoes, Francesca Shearer, John Story, Simon Tavare, Shailesh Tripathi, Peter Valent, Kurt Varmuza, Yinhai Wang, Kathleen Williamson, and Shu-Dong Zhang, and we apologize to all whom we have mistakenly not named. We would also like to thank our editors Andreas Sendtko and Gregor Cicchetti from Wiley-Blackwell, who have always been available and helpful.
[Figure 1: chart mapping each chapter to its data types (DNA microarray, DNA sequencing, DNA copy number, DNA methylation, proteomics, RNA-Seq), its methods (Bayesian, Monte Carlo, network-based, supervised, and unsupervised methods), and its cancer types.]
Cancer types: breast (bc), cervical (ce), colorectal (co), lung (lc), leukemia (le), lymphoma (ly), melanoma (me), mesothelioma (mo), ovarian (oc) and prostate (pc)
Figure 1 Brief overview of the book chapters with respect to the three major conceptual topics: high-throughput data types (a), statistical and computational methods (b), and cancer types (c).
Finally, we hope this book helps to spread the enthusiasm and joy we have for this field and inspires readers in their own practical or theoretical research problems.

Belfast and Hall/Tyrol, May 2012
Frank Emmert-Streib and Matthias Dehmer
References
1 Alon, U. (2006) An Introduction to Systems Biology: Design Principles of Biological Circuits, Chapman & Hall/CRC, Boca Raton, FL.
2 Barabasi, A. L. and Oltvai, Z. N. (2004) Network biology: Understanding the cell’s functional organization. Nat. Rev. Genet., 5, 101–113.
3 von Bertalanffy, L. (1950) An outline of general systems theory. Brit. J. Philos. Sci., 1 (2), 134–165.
4 Emmert-Streib, F. and Glazko, G. (2011) Network biology: a direct approach to study biological function. WIREs Syst. Biol. Med., 3 (4), 379–391.
5 Palsson, B.O. (2006) Systems Biology: Properties of Reconstructed Networks, Cambridge University Press, Cambridge, UK.
List of Contributors

Yang Aijun
Nanjing Audit University
School of Finance
Jiangsu, 211815
China

Nidhal Bouaynaya
University of Arkansas at Little Rock
Department of Systems Engineering
ETAS 300 H., 2801 S. University Ave.
Little Rock, AR 72204
USA

Mark R. Chance
Case Western Reserve University
Center for Proteomics and Bioinformatics
10900 Euclid Ave., BRB 930
Cleveland, OH 44106-4988
USA

Sreenivas Chavali
MRC Laboratory of Molecular Biology
Hills Road
Cambridge CB2 0QH
UK

Li Chen
Johns Hopkins University
Department of Pathology
School of Medicine
1550 Orleans Street
Baltimore, MD 21231
USA

Matthias Dehmer
UMIT
Institut für Bioinformatik und Translationale Forschung
Eduard Wallnöfer Zentrum 1
6060 Hall/Tyrol
Austria

Ricardo de Matos Simoes
Max F. Perutz Laboratories
Center for Integrative Bioinformatics Vienna
Dr. Bohr Gasse 9
1030 Vienna
Austria

Frank Emmert-Streib
Queen’s University Belfast
Computational Biology and Machine Learning Lab
Center for Cancer Research and Cell Biology
97 Lisburn Road
Belfast BT9 7BL
UK

Hassan M. Fathallah-Shaykh
University of Alabama at Birmingham
Department of Neurology
School of Medicine
FOT 1020, 510 20th St South
Birmingham, AL 35294-3410
USA

Fei Gu
The Ohio State University
Department of Biomedical Informatics
Columbus, OH 43210
USA

D. Neil Hayes
University of North Carolina at Chapel Hill
UNC Lineberger Comprehensive Cancer Center
School of Medicine
CB#7295
450 West Drive
Chapel Hill, NC 27599-7295
USA

Victor X. Jin
The Ohio State University
Department of Biomedical Informatics
Columbus, OH 43210
USA

Kartiek Kanduri
Turku Centre for Biotechnology
Turku
Finland

Devin C. Koestler
Dartmouth Medical School
One Medical Center Drive
Section of Biostatistics & Epidemiology
7927 Rubin Building
Lebanon, NH 03756
USA

Song Liu
University at Buffalo
Department of Biostatistics
723 Kimball Tower
New York, NY 14214
USA

Shigeyuki Matsui
Kyoto University School of Public Health
Department of Biostatistics
Yoshida Konoe-cho, Sakyo-ku
Kyoto, 606-8501
Japan

Jeffrey Miecznikowski
University at Buffalo
Department of Biostatistics
723 Kimball Tower
New York, NY 14214
USA

David J. Miller
Johns Hopkins University
Department of Pathology
School of Medicine
1550 Orleans Street
Baltimore, MD 21231
USA

Andrew B. Nobel
University of North Carolina at Chapel Hill
Department of Statistics and Operations Research
Hanes Hall, CB#3260
Chapel Hill, NC 27599-3260
USA

Hisashi Noma
Kyoto University School of Public Health
Department of Biostatistics
Yoshida Konoe-cho, Sakyo-ku
Kyoto, 606-8501
Japan

Vishal N. Patel
Case Western Reserve University
Center for Proteomics and Bioinformatics
10900 Euclid Ave., BRB 930
Cleveland, OH 44106-4988
USA

Dan Schonfeld
UIC College of Engineering
Electrical and Computer Engineering
851 S. Morgan M/C 154
Chicago, IL 60607
USA

Ie-Ming Shih
Johns Hopkins University
Department of Pathology
School of Medicine
1550 Orleans Street
Baltimore, MD 21231
USA

Roman Shterenberg
University of Alabama at Birmingham
Department of Mathematics
452 Campbell Hall
1300 University Boulevard
Birmingham, AL 35294-1170
USA

Lincoln Stein
Stony Brook University
Department of Biomedical Engineering
Stony Brook, NY 11794
USA

Binhua Tang
The Ohio State University
Department of Biomedical Informatics
Columbus, OH 43210
USA

Ye Tian
Virginia Tech Research Center – Arlington
The Bradley Department of Electrical and Computer Engineering
900 N. Glebe Road
Arlington, VA 22201
USA

Shailesh Tripathi
Queen’s University Belfast
Computational Biology and Machine Learning Lab
Center for Cancer Research and Cell Biology
97 Lisburn Road
Belfast BT9 7BL
UK

Aad W. van der Vaart
Vrije Universiteit
Department of Mathematics
Faculty of Sciences
De Boelelaan 1081a
1081 HV Amsterdam
The Netherlands

Wessel N. van Wieringen
Vrije Universiteit
Department of Mathematics
Faculty of Sciences
De Boelelaan 1081a
1081 HV Amsterdam
The Netherlands

Vonn Walter
University of North Carolina at Chapel Hill
UNC Lineberger Comprehensive Cancer Center
School of Medicine
CB#7295
450 West Drive
Chapel Hill, NC 27599-7295
USA

Dan Wang
University at Buffalo
Department of Biostatistics
723 Kimball Tower
New York, NY 14214
USA

Yue Wang
Virginia Tech
Virginia Tech Research Center – Arlington
The Bradley Department of Electrical and Computer Engineering
900 N. Glebe Road
Arlington, VA 22201
USA

Fred A. Wright
University of North Carolina at Chapel Hill
Department of Biostatistics
4115B McGavran-Greenberg
135 Dauer Drive, Campus Box 7420
Chapel Hill, NC 27599-7420
USA

Guanming Wu
MaRS Centre
Ontario Institute for Cancer Research
South Tower
101 College Street, Suite 800
Toronto, ON M5G 0A3
Canada

Song Xinyuan
The Chinese University of Hong Kong
Department of Statistics
Hong Kong SAR
The People’s Republic of China

Ka Yee Yeung
University of Washington
Department of Microbiology
Seattle, WA 98195-8070
USA

Guoqiang Yu
Johns Hopkins University
Department of Pathology
School of Medicine
1550 Orleans Street
Baltimore, MD 21231
USA

Li Yunxian
Yunnan University of Economics and Finance
School of Finance
Yunnan
China
Part One General Overview
Statistical Diagnostics for Cancer: Analyzing High-Dimensional Data, First Edition. Edited by Frank Emmert-Streib and Matthias Dehmer. © 2013 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2013 by Wiley-VCH Verlag GmbH & Co. KGaA.
1 Control of Type I Error Rates for Oncology Biomarker Discovery with High-Throughput Platforms
Jeffrey Miecznikowski, Dan Wang, and Song Liu
1.1 Brief Summary
This chapter provides an overview of the genetic and proteomic high-throughput platforms and the statistical methods used to evaluate molecular biomarkers for cancer diagnosis. Commonly, these experimental platforms are used in cancer diagnosis where the biomarkers can be used to determine cancer subtypes and thus potential treatments. Because of the large amount of data from these platforms, accurate testing methods are necessary. In this chapter, we highlight the statistical methods used to evaluate each potential biomarker and limit the number of false positives under a specific error rate.
1.2 Introduction
Since the invention of microarray technology and related high-throughput technologies, researchers have been able to compile large amounts of information. This information enables researchers to uncover potentially new targets for therapies and to enhance our knowledge of biological systems. These high-throughput platforms have become commonly used experimental platforms in the biological realm [1]. A high-throughput platform is designed to measure large numbers (thousands or millions) of signatures in a biological organism at a given time point. These platforms are a product of the postgenomic era and are often used to determine how genomic expression is regulated or involved in biological processes. They often use hybridization and sequence-based technologies such as gene expression microarrays and RNA-Seq platforms. Specifically, these platforms and technologies have revolutionized the way researchers study cancer, especially with regard to diagnosis and prognosis.

Current cancer classification consists of more than 200 subtypes of cancer [2]. For a patient to receive the most appropriate therapy, the clinician must identify the cancer subtype, stage, and/or grade as accurately as possible. Clinicians commonly use morphologic characteristics of biopsy specimens but “it gives very limited information and clearly misses much important tumor aspects such as rate of proliferation, capacity for invasion and metastases, and development of resistance mechanisms to certain treatment agents” [3]. Therefore, new molecular diagnostic methods are needed to improve these classification methods. The huge amount of molecular information that can be extracted and integrated to find common patterns is thus a major advantage of these high-throughput platforms. These new technologies will allow researchers to enhance cancer diagnostics by (1) classifying tumor samples into known and new taxonomic categories, (2) discovering new diagnostic and therapeutic markers, and (3) identifying new subtypes that correlate with treatment outcome.

The design and cost of high-throughput platforms, and the amount of information received from them, make it necessary for statisticians to be involved in the analysis of these experiments. Often the statistician’s primary task is to determine the genomic/proteomic regions of interest for further interrogation, verification, or validation. These regions of interest should be regions of the genome or proteome that are statistically significantly correlated with the outcome of interest, for example, survival, drug response, or cancer subtype. With such large numbers of tests, reporting significance based on univariate p-values less than 0.05 leads to a large number of false positives. Besides limiting the number of false positives, another challenge in developing valid high-throughput-derived biomarkers is obtaining large enough datasets with sufficient patient follow-up time [4, 5].
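To make the point about univariate p-values below 0.05 concrete, the following is a minimal sketch, not taken from the chapter. It assumes that every tested feature is truly null, that the tests are independent, and that null p-values are uniform on [0, 1]. The function names and the choice of 20,000 tests are illustrative assumptions only.

```python
import random


def expected_false_positives(num_null_tests: int, alpha: float) -> float:
    """Expected number of null features with p < alpha (linearity of expectation)."""
    return num_null_tests * alpha


def prob_at_least_one_false_positive(num_null_tests: int, alpha: float) -> float:
    """Chance of one or more false positives, ASSUMING independent tests."""
    return 1.0 - (1.0 - alpha) ** num_null_tests


def simulate_false_positives(num_null_tests: int, alpha: float, seed: int = 0) -> int:
    """Draw uniform null p-values and count how many fall below alpha."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(num_null_tests))


if __name__ == "__main__":
    m, alpha = 20000, 0.05  # hypothetical array with 20,000 probes
    print("expected false positives:", expected_false_positives(m, alpha))
    print("simulated false positives:", simulate_false_positives(m, alpha))
    print("P(at least one):", prob_at_least_one_false_positive(m, alpha))
```

Under these assumptions, roughly one in twenty null features passes an unadjusted 0.05 threshold, so the count of false positives grows in proportion to the number of features measured.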
Hence, in light of these concerns the concept of statistical significance has been re-evaluated over the last 20 years, most notably with the study of the false discovery rate (FDR) in [6]. Namely, multiple testing procedures have been greatly studied and refined in order to control a suitable Type I error rate in these experiments. The goal of these modern statistical procedures is to limit the number of false positive probes or genes that proceed to the validation phase of these experiments. This chapter is designed to study some of the high-throughput technologies that are employed in these experiments and the methods used to control Type I error in their results. In the remaining sections, this chapter outlines the high-throughput platforms for cancer diagnosis and the statistical methods to obtain a univariate measure of significance for each gene/protein/probe. Each subsection on statistical methods also contains a hypothetical cancer experiment that would employ the described statistical technique. The chapter then outlines methods that use the univariate p-values to control multiple-testing-based Type I errors for these experiments. Finally, the chapter closes with a discussion and a perspective for future work.
1.3 High-Throughput Platforms
In the following subsections, we outline several of the common high-throughput platforms used in experiments designed for cancer diagnosis. These platforms were chosen to illustrate the diversity of platforms available for interrogating deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or proteins.
1.3.1 Gene Expression Arrays
The human genome consists of DNA sequences located within the nucleus of each cell. Specific DNA sequences are copied (transcribed) into messenger RNA (mRNA). These mRNA copies transition from the nucleus to the cytoplasm of the cell in order for the corresponding sequence to be used in manufacturing various protein molecules. Genetic microarray technology makes use of this process [7–9]. Generally speaking, there are two types of microarray technology: two-dye spotted pin technology and Affymetrix technology. In the two-dye spotted pin technology, target complementary DNA (cDNA) elements are laid out on a microscopic glass slide and are probed with dye-labeled samples. The target cDNA elements generated in advance are physically arrayed in a two-dimensional grid on a chemically modified glass slide. Then for the two-dye spotted method, equal amounts of two purified mRNA samples are separately reverse transcribed using primer sets labeled with two different fluorescent dyes. The two resulting dye-labeled samples are used as probes in a competitive hybridization reaction with the target elements on the chip. After hybridization, a laser scanner generates two images of the chip at the wavelengths of light corresponding to each sample for each spot on the chip. References [10, 11] discuss these chips further and the preprocessing and analysis methods used on the images that create the microarray data. Affymetrix gene expression microarrays represent the other major type of gene expression microarray technologies. The Affymetrix DNA microarrays, called “GeneChips” according to the Affymetrix trademark, are generated using semiconductor and photolithography manufacturing techniques. The major distinction between Affymetrix and spotting techniques is that multiple short probes (20–40 base pairs) are used to measure gene expression levels. For this reason, preprocessing methods are critical for this chip. 
A thorough outline of these methods is available in [10], with a comparison of the various techniques provided in [12]. Regardless of the type of gene expression microarray employed, after preprocessing the array experiment will result in an intensity level for a given probe/subject that reflects the amount of expression for that given probe/subject. Both types of gene expression arrays are commonly employed to study cancer diagnosis, such as in [13–17], and cancer prognosis, such as in [18–22].
1.3.2 RNA-Seq
Recently, RNA-Seq has emerged as a powerful new technology for transcriptome analysis [23]. A typical RNA-Seq experiment takes a sample of purified RNA, has it sheared, converted to cDNA, and sequenced on a massively parallel sequencer, such as the Genome Analyzer (or HiSeq) from Illumina Inc, SOLiD from Life Technologies Inc, or 454 from Roche Inc. This process generates short (e.g., 75 bp) reads taken from either one end or both ends of each cDNA fragment. Depending on the sequencing depth, the number of sequenced short reads per sample could range from around 10 to 100 million. By mapping millions of RNA-Seq reads to individual genes' transcripts, one can estimate the overall mRNA abundance and detect differentially expressed genes. Unlike gene expression microarrays that rely on prior probe design and existing transcript annotations, RNA-Seq can be used to analyze any gene and any transcriptome. Applications to cancer studies can be found in [24, 25]. The development of analytic methods to process and analyze the RNA-Seq data is an active area of ongoing research [26].
1.3.3 DNA Methylation Arrays
As a major epigenetic modification, DNA methylation plays a vital role in transcriptional regulation and chromatin remodeling. Aberration of the DNA methylation profile has been associated with many human diseases including cancer [27]. Use of DNA methylation microarrays is a popular approach in studies to characterize the epigenetic landscape of human cells [28]. Three widely used commercial platforms to perform methylation profiling are the GoldenGate Methylation Beadarray, Infinium HumanMethylation27 BeadChip, and Infinium HumanMethylation450 BeadChip provided by Illumina Inc. The first two arrays quantitatively target 1505 cytosine-phosphate-guanine (CpG) loci covering around 800 genes and 27,578 CpG sites targeting around 14,000 genes, respectively, while the last one covers 99% of RefSeq genes and 96% of CpG islands within the human genome. For each targeted locus, the raw fluorescent signals from both methylated (Cy5) and unmethylated (Cy3) alleles are extracted to create the average methylation ($\beta$) value derived from multiple replicate methylation measurements. The resulting methylation level ($\beta$ value) for each locus ranges between zero and one. Zero indicates absent methylation and one indicates complete methylation. Since the release of these arrays, many analytic methods have been developed to process and analyze the Illumina DNA methylation array data [29].
1.3.4 Mass Spectrometry Platforms
Mass spectrometry is an analytic tool used to identify proteins, where the associated instrument (a mass spectrometer) measures the masses of molecules converted into ions via the mass-to-charge (m/z) ratio. This technology can be used to profile protein markers from tissue or bodily fluids, such as serum or plasma, in order to compare biological samples from different patients or different conditions. Matrix-assisted laser desorption and ionization time-of-flight (MALDI-TOF) is a popular tool used by scientists, where a metal plate with the matrix containing the sample is placed into a vacuum chamber and excited by a laser, causing the protein molecules to travel (or “fly”) through the tube until they strike a detector that records the time-of-flight for the various proteins under study; surface-enhanced laser desorption and ionization time-of-flight (SELDI-TOF) is an analog of MALDI-TOF. The interested reader is referred to [30] for discussion regarding the experimental design that creates the data, and elaboration on the MALDI and SELDI constructs. The resulting data are spectral functions containing the m/z ratio and associated intensity, where the peaks in the spectral plots correspond to proteins (or peptides) present in the sample. These procedures generate large amounts of spectral data and can detect protein differential expression and modification in different treatment groups. Noisy data, however, can lead to a high rate of false positive peak identification. This is an issue when working to establish an unbiased, automated approach to detect protein changes, particularly in low abundance proteins. Nevertheless, various mass spectrometry platforms have been used in experiments describing cancer diagnosis, such as in [31–33], while pitfalls and concerns with these platforms in cancer diagnosis are noted in [34, 35].
1.3.5 aCGH Arrays
Array-based Comparative Genomic Hybridization (aCGH) technology is similar to cDNA arrays and is an extension of conventional CGH that is used to identify and quantify DNA copy number changes across the genome in a single experiment [36]. The advantages of aCGH include high-resolution and high-throughput measurement capability, allowing for more quantitative analysis of genomic aberrations. In BAC aCGH arrays, the probes corresponding to locations on a genome are cloned (grown) in a bacterial culture and then arrayed on a glass slide. BAC aCGH technology can be employed to discover markers in diseases as in [37–40] and for detecting genomic imbalances in cancers as described in [41–51]. In BAC aCGH studies, the markers for cancer are often discovered by comparing the signal at a given chromosome locus between the tumor sample and a control sample. Specifically, researchers often examine the base 2 logarithm of the ratio of the tumor sample to the control sample, $\log_2(T/C)$. Some of the normalization methods for this logarithmic ratio are described in [53]. This normalized ratio allows researchers to determine the presence of an imbalance in copy number for a given marker between the tumor sample (T) and the control sample (C).
1.3.6 Preprocessing HT Platforms
In short, preprocessing algorithms are required in nearly all high-throughput experiments (see, for example, [52]). This is because high-throughput platforms measure both biological signal and technical signal. Therefore, the goal of preprocessing algorithms is to remove the technical signal. This technical signal can be considered in terms of background correction and normalization to adjust across experiments. Often these preprocessing techniques are specific to the platform employed (see, for example, [53]). For these reasons, we will not cover all of the preprocessing methods available. However, to give the reader a feel for preprocessing methods, we discuss quantile normalization, a technique that has been applied and adopted in several different high-throughput platforms [54]. A nice feature of quantile normalization is that it does not require the construction of (non)linear models to describe the experimental system. As each experimental unit (e.g., mouse, patient, cell line, or sample) is measured via the proposed high-throughput platform, a (genetic) profile for this experimental unit is obtained. In quantile normalization, we impose the same empirical distribution of the high-throughput intensity on each profile (e.g., the profile for each experimental unit will have the same quartiles, median, etc.). The algorithm proposed in [54] is designed so that all profiles are matched (aligned) with the empirical distribution of the averaged sample profiles.
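As a rough illustration of the alignment step just described, the following Python sketch sorts each profile, averages the sorted profiles rank by rank to form a reference distribution, and then maps every value to the reference value of its rank. The function name `quantile_normalize` and the toy data are hypothetical, ties are broken arbitrarily by sort order, and this is a minimal sketch under those assumptions rather than the exact algorithm of [54].

```python
def quantile_normalize(profiles):
    """Give every profile the same empirical distribution.

    profiles: list of equal-length lists, one list of intensities per
    experimental unit (hypothetical input layout).
    """
    n_probes = len(profiles[0])
    sorted_profiles = [sorted(p) for p in profiles]
    # Reference distribution: mean of the r-th smallest value across profiles.
    reference = [
        sum(sp[r] for sp in sorted_profiles) / len(profiles)
        for r in range(n_probes)
    ]
    normalized = []
    for p in profiles:
        # Probe indices ordered from smallest to largest intensity.
        order = sorted(range(n_probes), key=lambda j: p[j])
        out = [0.0] * n_probes
        for rank, j in enumerate(order):
            out[j] = reference[rank]  # replace value by reference quantile
        normalized.append(out)
    return normalized


# Toy example with two profiles of three probes each.
print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
# -> [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization every profile contains exactly the same set of values, so all quantiles agree across experimental units while the within-profile ranking of the probes is preserved.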
1.4 Analysis of Experiments
After preprocessing the experiment, we ultimately obtain an $N \times M$ summary matrix, $X = (x_{nm})$, where $x_{nm}$ denotes the normalized measure of probe (gene/protein) $m$ in sample $n$. This data matrix will be used for subsequent statistical analysis. It can be interpreted as a collection of $M$ explanatory vectors, each of length $N$. In our setting, we assume that the researcher is interested in examining which of the $M$ vectors are correlated with the outcome vector of interest $Y$. Since each of the $M$ vectors represents a gene/protein or, generally speaking, a “probe,” we can consider this analysis as a “probe by probe” analysis where each probe represents a potential biomarker. For each of the following subsections, we assume that each column in our $X$ matrix corresponds to a biomarker under consideration. Our goal in the following subsections is to assign a p-value measuring the correlation between the biomarker and the outcome of interest. The outcome of interest is denoted by $Y$ and contains a value for each sample in the matrix $X$. The outcome of interest can take several forms: (1) a continuous (or nearly continuous) variable, for example, size of tumor, (2) a categorical variable, for example, healthy versus disease, or (3) a censored continuous variable, for example, survival time or time to recurrence. In the following subsections, we outline the analysis for each outcome variable setting and provide a cancer-related hypothetical experiment suitable for statistical analysis via the proposed methods.
1.4.1 Linear Regression
In a linear regression setting for discovering high-throughput biomarkers, our goal is to determine which biomarkers are significantly correlated with our outcome of interest which, for this section, is assumed to be suitably continuous. Examples of continuous outcomes in biomarker discovery may include drug level concentrations, white blood cell count, marker staining percentage, and tumor size. The remainder of this section first introduces the simple linear regression model, and later addresses the multiple regression model designed to assess the correlation between our markers and the continuous outcome of interest.
1.4.1.1 Simple Linear Regression
Example 1.1 An experiment is conducted to study the correlation between gene expression and tumor size (a surrogate measure for the extent of disease) in breast cancer patients at the time of diagnosis. To that end, we obtain breast cancer tumor samples from a random cohort of patients recently diagnosed with breast cancer. These tumor samples are processed to obtain mRNA and are interrogated with a gene expression array to obtain the expression level for a set of genes. The outcome of interest is the tumor size. We would like to know which genes are significantly associated with the tumor size and which are not.
In a simple linear regression, we consider only a single biomarker, which is considered a predictor or explanatory variable for the outcome or response variable $Y$. In a simple linear regression with $N$ observations, the model is stated as
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, 3, \ldots, N, \tag{1.1}$$
where $Y_i$ is the outcome for the $i$th sample, $\beta_0, \beta_1$ are (unknown) parameters, and $X_i$ is the value of the biomarker (probe) for the $i$th sample. In this model, we assume that the error terms $\varepsilon_i$ are independent with a constant (unknown) variance $\sigma^2$. In a simple linear regression, we can estimate our unknown parameters, $\beta_0$, $\beta_1$, $\sigma^2$, using least squares estimators. In a least squares estimation, our goal is to determine values for the parameters that minimize our error in the fitted model. For the pairs of observations $(X_i, Y_i)$, we consider the deviation of $Y_i$ from its fitted value from the linear regression by examining the deviation (DEV) defined as
$$\mathrm{DEV}_i = Y_i - (\beta_0 + \beta_1 X_i). \tag{1.2}$$
With the definition of deviation capturing our concept of “error,” the goal in the least squares estimation is to minimize the sum of the squared deviations:
$$Q = \sum_{i=1}^{N} \left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2 = \sum_{i=1}^{N} \mathrm{DEV}_i^2. \tag{1.3}$$
As shown in [55], the following formulas yield the point estimators $b_0$ and $b_1$ for $\beta_0$ and $\beta_1$, respectively, that minimize $Q$:
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}. \tag{1.4}$$
Note that $\bar{X}$ and $\bar{Y}$ are the sample means of the $X_i$ and the $Y_i$ observations, respectively. A commonly used estimator for $\sigma^2$ is given by the mean squared error (MSE):
$$\mathrm{MSE} = \frac{\sum \left(Y_i - (b_0 + b_1 X_i)\right)^2}{N - 2} = \frac{\sum \mathrm{DEV}_i^2}{N - 2}, \tag{1.5}$$
where the deviations are evaluated at the estimates $b_0$ and $b_1$. In order to measure the significance of the correlation between the predictor and response, we need to make an assumption about the form of the distribution of $\varepsilon_i$. We assume that the error terms $\varepsilon_i$ are independently normally distributed with mean 0 and variance $\sigma^2$ (denoted by $N(0, \sigma^2)$). With this assumption, we have the ability to assess the significance of $\beta_1$, or, in other words, ask the question, “Is $\beta_1$ significantly different from 0?” Specifically, the hypothesis test of interest is stated as
$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0. \tag{1.6}$$
In short, hypothesis testing allows researchers to make decisions between two hypotheses, based on observed data. An introduction to statistical hypothesis testing is provided in [56, 57], with more advanced treatments in [58, 59]. In order to evaluate our test in (1.6), we need to derive a test statistic and its distribution under the null hypothesis. For our point estimator $b_1$ in (1.4), it can be shown (as in [55]) that, under the null hypothesis, $b_1$ is normally distributed with mean 0 and variance given by
$$\sigma^2(b_1) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}, \tag{1.7}$$
where $\sigma^2(b_1)$ denotes the variance of $b_1$. As mentioned earlier, $\sigma^2$ is unknown, but MSE is a commonly used estimator for $\sigma^2$. Hence, our estimator for $\sigma^2(b_1)$ in (1.7) can be expressed as
$$s^2(b_1) = \frac{\mathrm{MSE}}{\sum (X_i - \bar{X})^2}, \tag{1.8}$$
where MSE is given in (1.5). Since $b_1$ is normally distributed, the standardized statistic $(b_1 - \beta_1)/\sigma(b_1)$ is a standard normal variable. Using our estimator in (1.8) for $\sigma^2(b_1)$, our test statistic $Z$ can be given as
$$Z = \frac{b_1}{s(b_1)}. \tag{1.9}$$
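The least squares quantities in (1.4), (1.5), (1.8), and (1.9) are simple enough to compute directly. The Python sketch below does so for one biomarker and also evaluates the two-sided tail probability of a t distribution with N − 2 degrees of freedom, which is the p-value that the text goes on to define in (1.10). The function names and the toy data are hypothetical. The t tail area is approximated with Simpson's rule using only the standard library, which is adequate for illustration but loses accuracy for extremely small p-values, so a statistical package would be preferable in practice.

```python
import math


def t_two_sided_p(z, df, steps=2000):
    """Approximate 2*K(-|z|), K being the CDF of a t distribution with df
    degrees of freedom, by Simpson's rule on [0, |z|] (steps must be even)."""
    a = abs(z)
    if a == 0.0:
        return 1.0
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    const /= math.sqrt(df * math.pi)

    def density(t):
        return const * (1.0 + t * t / df) ** (-(df + 1) / 2)

    h = a / steps
    total = density(0.0) + density(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    central = total * h / 3.0  # Pr(0 < T < |z|)
    return max(0.0, 1.0 - 2.0 * central)


def simple_linear_test(x, y):
    """Return b0, b1, MSE, Z and the two-sided p-value for H0: beta1 = 0."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx  # (1.4)
    b0 = ybar - b1 * xbar                                             # (1.4)
    mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)  # (1.5)
    z = b1 / math.sqrt(mse / sxx)                                     # (1.8), (1.9)
    return b0, b1, mse, z, t_two_sided_p(z, n - 2)


def probe_p_values(X, y):
    """One p-value per column of the N x M matrix X (given as a list of rows)."""
    n_probes = len(X[0])
    return [simple_linear_test([row[m] for row in X], y)[4]
            for m in range(n_probes)]


# Toy data: one biomarker measured in five samples.
b0, b1, mse, z, p = simple_linear_test([1, 2, 3, 4, 5],
                                       [2.1, 3.9, 6.2, 7.8, 10.1])
print(round(b0, 3), round(b1, 3), p < 0.001)
```

Applying `probe_p_values` to the full summary matrix yields one univariate p-value per biomarker, which is the input required by multiple testing procedures.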
Note that under the null hypothesis $Z$ follows a $t$ distribution with $N - 2$ degrees of freedom (denoted by $Z \sim t_{N-2}$). Having obtained the test statistic and its distribution under the null hypothesis, we can obtain the p-value: a value indicating the probability of obtaining a test statistic at least as extreme as the observed statistic under the assumption that the null hypothesis is true. See [56, 58, 60] for a more thorough discussion of p-values. For our biomarker $X$, using the statistic in (1.9), we can calculate a (univariate) p-value for the test in (1.6) by the following:
$$\text{p-value} = 2K(-|Z|), \tag{1.10}$$
where $K$ denotes the cumulative distribution function (CDF) for the $t$ distribution with $N - 2$ degrees of freedom. This p-value is univariate in the sense that it refers to the level of significance for a single biomarker. The univariate p-value in (1.10) does not address the significance in light of testing $M$ possible biomarkers (see Section 1.5). Nevertheless, by using (1.10), we can compute an $M$-length vector consisting of p-values for each of the biomarkers under consideration.
1.4.1.2 Multiple Regression
Example 1.2: Continued from Example 1.1 To further our analysis of the genes associated with tumor size, we would like to adjust for patient age. That is, the experimenters are interested in which genes are significantly associated with tumor size after adjusting for the patient’s age. In this experiment patient age acts as another explanatory variable.
Multiple regression represents an extension of the ideas developed in Section 1.4.1.1. In a multiple regression, we include multiple predictor variables in the model to explain the response variable. This setting is useful to evaluate potential biomarkers in light of other variables, for example, patient age or patient race. With two predictor variables (e.g., two biomarkers) $X_1$ and $X_2$, the first-order multiple regression model is given by
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i. \tag{1.11}$$
The model in (1.11) is first order in the sense that each variable is included in the model, but there is no interaction variable $X_1 X_2$ included in the model. Following the methodology in (1.1) and (1.11), we can generalize our regression model for $m$ variables $X_1, X_2, X_3, \ldots, X_m$ as
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \cdots + \beta_m X_{im} + \varepsilon_i. \tag{1.12}$$
Following similar methodology to that outlined in Section 1.4.1.1, we can formally test the significance of variable $X_j$ using likelihood or least squares methods to estimate our parameters in (1.11) and (1.12); see [55, 61, 62].
1.4.2 Logistic Regression (Y Discrete)
Example 1.3 Bladder cancer clinicians are interested in proteomic biomarkers associated with the two major subtypes of bladder cancer. This information will further improve the ability of clinicians to diagnose and classify bladder cancer patients. Hence, a mass spectrometry experiment is performed to analyze the proteome in a set of bladder cancer tumors from a cohort of the papillary transitional cell carcinoma subtype and a cohort of the nonpapillary transitional cell carcinoma subtype. The experimenters are interested in proteins that are differentially expressed between the two bladder cancer subtypes.
In this setting, we assume that $Y$ is a binary (two categories) random variable. For example, $Y$ may denote healthy or disease subjects. Note that it is outside the scope of this chapter to fully explore regression settings where $Y$ consists of more than two categories. Hence, for the remainder of this section, we code our binary outcome variable $Y$ as 0 or 1. Using a logistic regression model and a single predictor variable $X$, we can model our outcome as follows:
$$E(Y) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}, \tag{1.13}$$
where $E(\cdot)$ represents the expected value function and $\exp(\cdot)$ represents the exponential function. Similar to the regression models, we can use likelihood methods or least squares methods to obtain $b_0, b_1$, estimators of $\beta_0, \beta_1$. Unfortunately, closed-form solutions for $b_0, b_1$ do not exist, so computer-intensive numerical search procedures such as those employed in R (see [63]) and SAS® software are necessary. Once we have determined estimates for $\beta_0$ and $\beta_1$ as $b_0$ and $b_1$, respectively, our goal is to test the same hypothesis as in (1.6). Unfortunately, deriving the test statistic and, ultimately, calculating the p-value in this setting is not as straightforward as in linear regression. A common test for $\beta_1$ in the logistic regression setting is the likelihood ratio test [56–58]. In short, we compute the partial deviance, representing the difference in deviance between the model containing $\beta_1$ and the model where $\beta_1 = 0$. Before defining partial deviance, we define deviance (DEV) for the logistic regression model in (1.13) as
$$\mathrm{DEV} = -2 \sum_{i=1}^{N} \left[ Y_i \log(\hat{Y}_i) + (1 - Y_i) \log(1 - \hat{Y}_i) \right], \tag{1.14}$$
where $\hat{Y}_i$ is the fitted value for sample $i$ in the logistic regression model. The fitted value $\hat{Y}_i$ is obtained by using $b_0$ and $b_1$ in place of $\beta_0$ and $\beta_1$ in (1.13). Thus, we can define the partial deviance (PD) as the difference between the deviance for a model where $\beta_1 = 0$ and the deviance (calculated in (1.14)) for a model containing $\beta_1$ (as in (1.13)). Under the null hypothesis in (1.6), we have (asymptotically) that PD follows a chi-squared distribution with 1 degree of freedom. Thus, we can obtain a p-value measuring the significance of $\beta_1$ as
$$\text{p-value} = 1 - G(\mathrm{PD}), \tag{1.15}$$
where $G$ represents the CDF for a chi-squared distribution with 1 degree of freedom. Thus, for a binary outcome, we can use (1.15) to obtain a p-value for each biomarker under consideration.
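The likelihood ratio test described above can be sketched in a few lines of Python. The sketch fits the single-predictor model (1.13) by Newton-Raphson iterations (one possible numerical search procedure, chosen here for illustration and not necessarily what R or SAS use internally), computes the deviance (1.14) for the fitted model and for the intercept-only model, and converts the partial deviance into the p-value (1.15). All function names and the toy data are hypothetical. The chi-squared CDF with 1 degree of freedom is evaluated through the identity $1 - G(x) = \mathrm{erfc}(\sqrt{x/2})$, and the code assumes the two groups are not perfectly separated by the biomarker.

```python
import math


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson estimates (b0, b1) for the model in (1.13)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Information matrix entries.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def deviance(y, fitted):
    """Deviance as in (1.14)."""
    return -2.0 * sum(yi * math.log(fi) + (1 - yi) * math.log(1.0 - fi)
                      for yi, fi in zip(y, fitted))


def chi2_1_sf(value):
    """1 - G(value) for a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(value, 0.0) / 2.0))


def likelihood_ratio_test(x, y):
    """Return b1, the partial deviance, and the p-value in (1.15)."""
    b0, b1 = fit_logistic(x, y)
    full = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    ybar = sum(y) / len(y)          # fitted value when beta1 = 0
    null = [ybar] * len(y)
    partial_dev = deviance(y, null) - deviance(y, full)
    return b1, partial_dev, chi2_1_sf(partial_dev)


# Toy data: biomarker intensity x and subtype indicator y (0 or 1).
x = [0.5, 1.2, 1.9, 2.3, 3.1, 3.8, 4.4, 5.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
print(likelihood_ratio_test(x, y))
```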
1.4.2.1 Multiple Logistic Regression
Example 1.4: Continued from Example 1.3 Researchers have determined that cigarette smoking plays a role in bladder cancer. Hence, the researchers would like to know the proteins associated with bladder cancer subtype after adjusting for smoking pack years – a measure quantifying the amount of cigarette smoking for each patient. In this example, smoking pack years acts as an additional explanatory variable.
Similar to the multiple linear regression model, we can generalize our model in (1.13) for $m$ biomarkers as follows:
$$E(Y) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m)}. \tag{1.16}$$
Similarly, we can test for the significance of $\beta_j$ using analogs of (1.14) and (1.15), where asymptotically the statistic follows a chi-squared distribution with 1 degree of freedom under the null hypothesis. The simple and multiple logistic regression models can be extended to situations where the outcome variable has more than two groups or levels (polychotomous). Treatment of these situations can be found in [64–66].
1.4.3 Survival Modeling
Survival analysis is commonly performed in biomarker testing within oncology research. Prognostic or predictive biomarkers, by design, are meant to explain the patient's overall cancer outcome or the effect of a therapeutic intervention. Commonly the outcome or effect studied in these situations is the survival time, time to recurrence, or time to disease progression. In all three situations, this variable is considered right censored, where the event is observed only if it occurs prior to some prespecified time. For example, patients may be followed with events recorded for up to five years. The amount of follow-up time should be a balance based on the number of expected events and the resources required to follow the patients over that time frame. The following texts provide thorough treatments of survival analysis [67–69]. Our goal in this section is to introduce the Kaplan–Meier estimator as a method to, ultimately, test and obtain a p-value representing the significance of a biomarker in assessing time-to-event data.
1.4.3.1 Kaplan–Meier Analysis
Example 1.5 Researchers are interested in which DNA copy number changes are associated with shorter survival in ovarian cancer patients. To study this question, researchers analyze a set of ovarian cancer tumors using aCGH technology. Each patient in this study has been followed for at least 5 years with their survival times documented. The goal is to determine copy number imbalances that are significantly associated with shorter survival. To this end, in each sample, the aCGH-derived data for each probe or location on the genome are dichotomized into either normal copy number or copy number imbalance. These dichotomized data are examined to determine which regions are significantly correlated with patient survival.
In a survival analysis setting, we define $S(t)$ to be the probability that an experimental unit from a given population will have a lifetime exceeding $t$. That is, for a random variable $T$ representing the lifetime of the experimental unit, we have
$$S(t) = \Pr(T > t), \tag{1.17}$$
where $\Pr(\cdot)$ denotes probability. Related to the survival function, we define the hazard function, denoted by $h(t)$, as the event rate at time $t$, conditional on survival until time $t$ or later. Mathematically, when $T$ is a continuous random variable, we have
$$h(t) = -\frac{d \log S(t)}{dt}. \tag{1.18}$$
For a sample of size $N$ from this population, let the observed times until an event for the $N$ sample members be given as follows:
$$t_1 \le t_2 \le t_3 \le \cdots \le t_N. \tag{1.19}$$
Corresponding to each $t_i$ is $n_i$, the number of patients at risk just prior to time $t_i$, and $d_i$, the number of events at time $t_i$. With this notation we define the Kaplan–Meier estimator designed to estimate the survival function $S(t)$ for a random variable $T$ as
$$\hat{S}(t) = \prod_{t_i \le t} \frac{n_i - d_i}{n_i}.$$
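The product-limit form of the Kaplan–Meier estimator, built from the at-risk counts $n_i$ and event counts $d_i$ defined above, can be illustrated with a short Python sketch. The function name `kaplan_meier`, the 0/1 coding of the censoring indicator, and the toy follow-up data are assumptions made for this example; subjects censored at a given time are treated as still at risk for events at that same time.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of S(t).

    times:  follow-up time for each patient
    events: 1 if the event was observed at that time, 0 if right censored
    Returns a list of (event time, estimated survival just after that time).
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0        # events at time t
        leaving = 0  # everyone (events and censored) leaving the risk set at t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            leaving += 1
            i += 1
        if d > 0:
            survival *= (n_at_risk - d) / n_at_risk  # factor (n_i - d_i) / n_i
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Toy data: five patients, two of them censored (event indicator 0).
for t, s in kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]):
    print(t, round(s, 3))
```

In a biomarker setting such as Example 1.5, one such curve would be estimated for each of the two dichotomized groups at every probe before the groups are compared.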
a) Note that $I(\cdot)$ in the equation for k-FDR denotes the indicator function and $\max(\cdot)$ denotes the maximum operator. Further note that $q$ in TPPFP should be determined prior to testing.
1.5 Multiple Testing Type I Errors
1.5.1 FWER, k-FWER Methods
The k-FWER error rate is a generalized version of the family-wise error rate (FWER). Control of FWER refers to controlling the probability of committing one or more false discoveries. If we let $V$ denote the number of false positives from $M$ hypothesis tests (biomarkers), then, following [76], control of FWER at level $\alpha$ can be expressed as
$$\Pr(V \ge 1) \le \alpha \tag{1.24}$$
or equivalently,
$$\Pr(V = 0) \ge 1 - \alpha. \tag{1.25}$$
Note that $\alpha$ is usually chosen to be small, for example, 0.05. Often (1.24) is abbreviated as FWER $\le \alpha$. In k-FWER the equation becomes
$$\Pr(V \ge k) \le \alpha, \tag{1.26}$$
where $k$ and $\alpha$ are usually determined prior to the analysis. Similar to FWER, control of k-FWER at level $\alpha$ can be expressed as k-FWER $\le \alpha$. Practically speaking, controlling k-FWER allows researchers to claim that with high probability there are fewer than $k$ false positives in their list of significant biomarkers. Naturally the choice of $k$ is critical when controlling k-FWER. The choice should be made prior to the analysis and it should be based on the resources available to validate the biomarkers in the significance list. If there are relatively limited resources available to validate the biomarkers, then $k$ could be rather small (conservative); otherwise $k$ could be larger (liberal). The following subsections discuss the variety of methods available to control FWER and k-FWER.
1.5.1.1 Adjusted Bonferroni Method
The adjusted Bonferroni method to control k-FWER is a generalized version of the Bonferroni correction designed to control FWER [76]. The Bonferroni correction is designed to control the FWER at level $\alpha$ by performing each individual test at significance level $\alpha/M$, where $M$ is the number of tests. The adjustment given in [76] to control k-FWER at level $\alpha$ is done by performing each test at level $k\alpha/M$. That is, a biomarker and the corresponding hypothesis test is considered significantly associated (reject the null) if the p-value is less than $k\alpha/M$. Under this scheme, the probability of $k$ or more false positives is no larger than $\alpha$, that is, k-FWER $\le \alpha$. The proof is supplied in [76] and is a generalization of the proof for the original Bonferroni method designed to control FWER.
1.5.1.2 Holm Procedure
A method to control k-FWER using the Holm procedure is given in [76]. This method is an adjustment to the Holm method designed to control the FWER [77]. The Holm method is considered a “step-down” procedure [58] which, essentially, means the p-value cut point for significance is based on considering the ranked vector of p-values starting with the most significant p-values. The following procedure describes the Holm method to control FWER at level $\alpha$ for $M$ tests. Let
$$\alpha_1 \le \alpha_2 \le \cdots \le \alpha_M \tag{1.27}$$
be constants defined by $\alpha_i = \alpha/(M - i + 1)$. For each of the $M$ biomarkers under consideration, we denote their corresponding null hypotheses by $H_1, H_2, \ldots, H_M$. We let the ordered p-values (smallest to largest) be denoted by $p_{(1)} \le \cdots \le p_{(M)}$, corresponding to the ordered null hypotheses $H_{(1)}, \ldots, H_{(M)}$. If $p_{(1)} > \alpha_1$, then reject no null hypothesis. Otherwise, if
$$p_{(1)} \le \alpha_1, \ldots, p_{(r)} \le \alpha_r, \tag{1.28}$$
then reject hypotheses $H_{(1)}, \ldots, H_{(r)}$, where the largest $r$ satisfying (1.28) is used. With this framework to control FWER at level $\alpha$, the Holm method to control k-FWER at level $\alpha$ as stated in [76] is done by redefining $\alpha_i$ as ka=M; i Qk j > > ; i Qk j > > ; ik : a j¼1 M iþj
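The adjusted Bonferroni rule and the Holm step-down procedure for FWER, as described in this section, need only the vector of univariate p-values. A minimal Python sketch follows. The function names and toy p-values are hypothetical, `holm` uses the FWER constants $\alpha_i = \alpha/(M - i + 1)$ from (1.27), and the sketch does not attempt the k-FWER version of the Holm constants.

```python
def adjusted_bonferroni(pvals, alpha=0.05, k=1):
    """Reject when the p-value is below k*alpha/M (k = 1 gives Bonferroni)."""
    cut = k * alpha / len(pvals)
    return [p < cut for p in pvals]


def holm(pvals, alpha=0.05):
    """Holm step-down procedure controlling FWER at level alpha."""
    m = len(pvals)
    # Indices of the hypotheses ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda j: pvals[j])
    reject = [False] * m
    for i, j in enumerate(order, start=1):
        if pvals[j] <= alpha / (m - i + 1):   # p_(i) <= alpha_i
            reject[j] = True
        else:
            break  # stop at the first ordered p-value that fails
    return reject


pvals = [0.001, 0.02, 0.03, 0.5]
print(adjusted_bonferroni(pvals, alpha=0.05, k=1))  # [True, False, False, False]
print(adjusted_bonferroni(pvals, alpha=0.05, k=2))  # [True, True, False, False]
print(holm(pvals, alpha=0.05))                      # [True, False, False, False]
```

Raising k from 1 to 2 doubles the per-test cut point, which is how tolerating one false positive translates into a longer list of significant biomarkers.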
1.5.1.3 Generalized Hochberg Procedure The generalized Hochberg procedure is originally presented in [78] as a method to control FWER. It is expanded in [76] and is shown to be closely related to the generalized Holm procedure. In fact it is stated in [76] that “Hochberg’s procedure is the step-up version of Holm’s step-down procedure.” Recall that Holm’s procedure is considered a step-down procedure because it starts by considering the most significant p-values and once a p-value is larger than a threshold the process stops and all smaller p-values (hypotheses) are considered significant (reject the null). In an analogous way, a step-up procedure starts with the least significant p-values and once a p-value is smaller than a threshold, the process stops and all smaller p-values (null hypotheses) are rejected. With this notion is mind and assuming independent p-values, we can state Hochberg’s procedure as given in [76] as follows: if pðMÞ aM , then reject all the null hypotheses, that is, accept all alternative hypotheses. Otherwise, reject null hypothesis H ð1Þ ; . . . ; H ðrÞ where r is the largest integer satisfying pðrÞ ar with the ai defined in (1.30). 1.5.1.4 Generalized S9idak Procedure A thorough treatment of the generalized S9 idak method is presented in [73]. Note, that the notation and technical details required for their presentation of this method are outside the scope of this chapter. However, using reasonable assumptions, we can simplify the generalized S9 idak method presented in [73]. We consider
using a beta-uniform model (BUM) as the distribution of the p-values [79]. The BUM is a mixture model for generating p-values: we assume that the p-values are independently distributed, with each observation drawn either from a uniform distribution (true null hypotheses) or from a beta distribution (true alternative hypotheses). We expect that true alternatives will yield, on average, small p-values, and hence a beta distribution with a mean near zero is a reasonable model for the alternative p-values. With this setting, the generalized Šidák procedure works by rejecting all hypotheses with a p-value less than $p_{cut}$, where $p_{cut}$ is such that
$F(k - 1 \mid M, p_{cut}) = 1 - \alpha$    (1.31)
where F is the CDF of a binomial random variable of size M and probability of success $p_{cut}$. Notationally, we have $W \sim \mathrm{Bin}(M, p_{cut})$. In short, the Šidák method can be interpreted in light of mixture models that treat the M hypotheses as a mixture of alternative hypotheses (discoveries) and null hypotheses. For a collection of M tests, the Šidák method is designed to select a success probability parameter in a binomial distribution where a success means the test follows the alternative hypothesis, while a failure means the test follows the null hypothesis. Under this assumption, the proof that the Šidák method controls k-FWER can be found in [73, 80].
1.5.1.5 minP and maxT Procedures
Recently, two data-driven methods, minP and maxT, have been proposed to control k-FWER [81–83]. These methods require a bootstrap or permutation step to estimate the null distribution [84]. This is in contrast to the adjusted Bonferroni method, the Holm method, the Hochberg method, and the generalized Šidák method, which require only the M-length vector of p-values. Due to the complexity of these algorithms, we feel these methods are outside the scope of this chapter.
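As a rough sketch of two of the p-value-only procedures from this section, the Python below implements a step-up rule in the style of Section 1.5.1.3 and solves (1.31) numerically for $p_{cut}$. The names `step_up`, `binom_cdf`, and `sidak_cutoff` are hypothetical, the bisection solver is one possible choice among many, and the step-up example assumes the FWER constants $\alpha/(M - i + 1)$ rather than the constants of (1.30).

```python
from math import comb

def step_up(pvals, alphas):
    """Reject H_(1), ..., H_(r), where r is the largest i with p_(i) <= alpha_i."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    r = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alphas[rank - 1]:
            r = rank  # keep the largest rank that passes
    rejected = set(order[:r])
    return [i in rejected for i in range(len(pvals))]

def binom_cdf(x, n, p):
    """P(W <= x) for W ~ Bin(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(x + 1))

def sidak_cutoff(M, k, alpha, tol=1e-12):
    """Solve F(k - 1 | M, p_cut) = 1 - alpha for p_cut by bisection.

    F is decreasing in p_cut, so the root lies between 0 and 1.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(k - 1, M, mid) > 1 - alpha:
            lo = mid  # P(at least k false positives) is still below alpha
        else:
            hi = mid
    return lo

# Hypothetical p-values; a step-down rule would reject nothing here,
# because the smallest p-value already exceeds alpha / M.
pvals = [0.03, 0.04, 0.045]
alphas = [0.05 / (3 - i + 1) for i in (1, 2, 3)]
print(step_up(pvals, alphas))

p_cut = sidak_cutoff(M=100, k=2, alpha=0.05)
print(p_cut)
```

For k = 1 the cutoff reduces to the familiar closed form $1 - (1 - \alpha)^{1/M}$, which gives a quick sanity check on the solver.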
1.6 Discussion
It cannot be overstated that numerous assumptions must be verified for the regression and survival analysis methods to be valid. The field of regression/survival diagnostics refers to the general class of techniques for detecting whether the assumptions of these methods hold. We encourage the reader to explore the references in each of the sections for more thorough coverage of the assumptions and diagnostic techniques for each method. Accurate Type I error control in high-throughput experiments is crucial in order to avoid costly downstream experiments attempting to validate false positives. Further, it is important to understand the assumptions and implications involved in choosing a Type I error rate to control (see Table 1.2). For example, in our work on pathway-based microarray analysis [85], we showed that k-FWER methods are more robust than the other error rate control methods.
The dependence structure in our tests is a key aspect of these k-FWER methods. For example, independent test statistics (p-values) are required for the presented versions of the Šidák, Holm, and Hochberg methods. Due to the similarity of probes and their genomic inter-relatedness, this assumption is most likely unreasonable in high-throughput experiments. Recently, there have been several works that discuss the dependence structure assumptions for these methods [86–89]. Future work for these k-FWER methods will continue to explore their robustness to violations of the assumed dependence structure. In this chapter, we have highlighted several methods designed to control k-FWER, where k-FWER is a probability statement about the distribution of V, the number of false positives. We can generalize our treatment of Type I error rates by considering a Type I error rate, generically, as a functional of the distribution of an error random variable (e.g., V or V/R). That is, the Type I error rate can be characterized in terms of a general functional $\theta(F)$, where F represents the distribution of the (error) random variable of interest; for example, F is the distribution of V, the number of false positives. Future work in this area will explore the possibility of unifying the assumptions required for generic Type I error control and the possibility of formulating general expressions for power. In addition to the probe-by-probe testing we discussed in this chapter, there are alternative methods to analyze these data, including principal components analysis (PCA) and gene set enrichment analysis (GSEA). Principal components analysis can be supervised (outcome taken into consideration) or unsupervised (outcome information ignored). Both approaches have their merit and can be used in prediction and classification [90, 91].
Meanwhile, GSEA methods are designed to assess the significance of a cohort or group of probes/genes. These methods test a hypothesis of significance for each cohort or group of genes, and thus a p-value can be assigned to each group of genes rather than to an individual gene. Commonly used algorithms include the original GSEA algorithm [92] and, more recently, the gene set analysis (GSA) algorithm [93]. Most commonly, these methods use predetermined gene sets compiled from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [94] database or the Human Protein Reference Database (HPRD) [95]. These databases include pathways for metabolism, genetic information processing, environmental information processing, cellular processes, human diseases, and drug development.
1.7 Perspective
In addition to common clinical–pathological variables used in cancer diagnosis, with the success of the human genome project researchers are using molecular variables to aid in the diagnosis and subtype classification of cancer. Genetic markers such as estrogen receptor gene and breast cancer susceptibility gene mutations have been commonly used for years. However, researchers continue to search for novel putative biomarkers derived from interrogating the entire genome or
proteome. These high-throughput experiments to find novel biomarkers yield high-dimensional datasets. In general, with these high-dimensional datasets, the task of the statistician is to reduce the dimension of the data. This dimension reduction, sometimes called feature extraction, should be performed in a way that removes noise while retaining biological signal. In a biological high-throughput experiment, this reduction can be performed by selecting a subset of biological probes that are significantly associated with the outcome of interest. Statistical significance of association is assessed while controlling a Type I error rate designed to limit the number of false positives when simultaneously testing all of the biological probes against the outcome of interest. These biological probes can be genes, genetic regions, proteins, peptides, or microRNAs, while the outcome of interest may be continuous, discrete, or censored, and the Type I error rate might be the rate of false positives or the probability of committing a certain number of false positives. In this chapter, we describe the high-throughput platforms that generate this type of high-dimensional data and the statistical methods employed to assess overall statistical significance with the various outcomes of interest. These statistical methods can be used in large-dimensional datasets obtained from high-throughput platforms designed to discover potentially novel biomarkers in the diagnosis of cancer. Future work in these areas will include further development and validation techniques for the putative markers obtained from these types of experiments. Statisticians continue to advance statistical methods to control Type I errors and are keenly interested in designing methods to control Type I error in light of correlation.
The strategy for choosing a Type I error method/scheme based on the type of data under consideration as well as on the validation methods (and their error rates) that will be used for the markers is an active area of ongoing research. Also, recently scientists have started to explore combining datasets from multiple high-throughput experiments. This field of integrative analysis will require a new set of statistical methods to integrate DNA, RNA, and proteomic data all gathered on the same set of patients. These integrative approaches hold promise for researchers looking to gain insights on complex interactions involving multiple biological systems.
References
1 Rajan, S., Djambazian, H., Dang, H., Sladek, R., and Hudson, T. (2011) The living microarray: a high-throughput platform for measuring transcription dynamics in single cells. BMC Genomics, 12 (1), 115.
2 National Cancer Institute (2011) The cancer genome atlas. http://cancergenome.nih.gov/newsevents/forthemedia/backgrounder.
3 Perez-Diez, A., Morgun, A., and Shulzhenko, N. (2007) Microarrays for cancer diagnosis and classification. Microarray Technol. Cancer Gene Prof., 593, 74–85.
4 Pepe, M. (2005) Evaluating technologies for classification and prediction in medicine. Stat. Med., 24, 3687–3696.
5 Pepe, M. and Longton, G. (2005) Standardizing diagnostic markers to evaluate and compare their performance. Epidemiology, 16 (5), 598–603.
6 Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B Met., 57 (1), 289–300.
7 Brown, P. and Botstein, D. (1999) Exploring the new world of the genome with DNA microarrays. Nat. Genet., 21 (Suppl 1), 33–37.
8 Duggan, D., Bittner, M., Chen, Y., Meltzer, P., and Trent, J. (1999) Expression profiling using cDNA microarrays. Nat. Genet., 21 (Suppl 1), 10–14.
9 Lockhart, D. and Winzeler, E. (2000) Genomics, gene expression and DNA arrays. Nature, 405, 827–836.
10 Dudoit, S., Fridlyand, J., and Speed, T. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97 (457), 77–87.
11 Yang, Y., Dudoit, S., Luu, P., Lin, D., Peng, V., Ngai, J., and Speed, T. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30 (4), e15.
12 Zhu, Q., Miecznikowski, J., and Halfon, M. (2010) Preferred analysis methods for Affymetrix GeneChips: II. an expanded, balanced, wholly-defined spike-in dataset. BMC Bioinform., 11 (1), 285.
13 Glas, A., Floore, A., Delahaye, L., Witteveen, A., Pover, R., Bakx, N., Lahti-Domenici, J., Bruinsma, T., Warmoes, M., Bernards, R. et al. (2006) Converting a breast cancer microarray signature into a high-throughput diagnostic test. BMC Genomics, 7 (1), 278.
14 Gordon, G., Jensen, R., Hsiao, L., Gullans, S., Blumenstock, J., Ramaswamy, S., Richards, W., Sugarbaker, D., and Bueno, R. (2002) Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res., 62 (17), 4963.
15 Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. et al. (2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci., 98 (26), 15149.
16 Statnikov, A., Aliferis, C., Tsamardinos, I., Hardin, D., and Levy, S. (2005) A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21 (5), 631–643.
17 Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci., 99 (10), 6567.
18 Barrier, A., Boelle, P., Roser, F., Gregg, J., Tse, C., Brault, D., Lacaine, F., Houry, S., Huguier, M., Franc, B. et al. (2006) Stage II colon cancer prognosis prediction by tumor gene expression profiling. J. Clin. Oncol., 24 (29), 4685–4691.
19 Michiels, S., Koscielny, S., and Hill, C. (2005) Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, 365 (9458), 488–492.
20 Miecznikowski, J., Wang, D., Liu, S., Sucheston, L., and Gold, D. (2010) Comparative survival analysis of breast cancer microarray studies identifies important prognostic genetic pathways. BMC Cancer, 10 (1), 573.
21 Sotiriou, C., Neo, S., McShane, L., Korn, E., Long, P., Jazaeri, A., Martiat, P., Fox, S., Harris, A., and Liu, E. (2003) Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc. Natl. Acad. Sci. USA, 100 (18), 10393.
22 Van’t Veer, L., Dai, H., van de Vijver, M., He, Y., Hart, A., Mao, M., Peterse, H., Van der Kooy, K., Marton, M., Witteveen, A. et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415 (6871), 530–536.
23 Wang, Z., Gerstein, M., and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet., 10, 57–63.
24 Levin, J., Berger, M., Adiconis, X., Rogov, P., Melnikov, A., Fennell, T., Nusbaum, C., Garraway, L., and Gnirke, A. (2009) Targeted next-generation sequencing of a cancer transcriptome enhances detection of sequence variants and novel fusion transcripts. Genome Biol., 10 (10), R115.
25 Pflueger, D., Rickman, D., Sboner, A., Perner, S., LaFargue, C., Svensson, M., Moss, B., Kitabayashi, N., Pan, Y., De La Taille, A. et al. (2009) N-myc downstream regulated gene 1 (NDRG1) is fused to ERG in prostate cancer. Neoplasia, 11 (8), 804.
26 Oshlack, A., Robinson, M., and Young, M. (2010) From RNA-seq reads to differential expression results. Genome Biol., 11, 220.
27 Portela, A. and Esteller, M. (2010) Epigenetic modifications and human disease. Nat. Biotechnol., 28, 1057–1068.
28 Laird, P. (2010) Principles and challenges of genome-wide DNA methylation analysis. Nat. Rev. Genet., 11, 191–203.
29 Siegmund, K. (2011) Statistical approaches for the analysis of DNA methylation microarray data. Hum. Genet., 129, 585–595.
30 Levner, I. (2005) Feature selection and nearest centroid classification for protein mass spectrometry. BMC Bioinform., 6.
31 Kolch, W., Neusüß, C., Pelzing, M., and Mischak, H. (2005) Capillary electrophoresis–mass spectrometry as a powerful tool in clinical diagnosis and biomarker discovery. Mass Spectrom. Rev., 24 (6), 959–977.
32 Koopmann, J., Zhang, Z., White, N., Rosenzweig, J., Fedarko, N., Jagannath, S., Canto, M., Yeo, C., Chan, D., and Goggins, M. (2004) Serum diagnosis of pancreatic adenocarcinoma using surface-enhanced laser desorption and ionization mass spectrometry. Clin. Cancer Res., 10 (3), 860–868.
33 Paweletz, C., Trock, B., Pennanen, M., Tsangaris, T., Magnant, C., Liotta, L., and Petricoin, E. III (2001) Proteomic patterns of nipple aspirate fluids obtained by SELDI-TOF: potential for new biomarkers to aid in the diagnosis of breast cancer. Dis. Markers, 17 (4), 301.
34 Diamandis, E. (2004) Analysis of serum proteomic patterns for early cancer diagnosis: drawing attention to potential problems. J. Natl. Cancer Inst., 96 (5), 353–356.
35 Diamandis, E. (2004) Mass spectrometry as a diagnostic and a cancer biomarker discovery tool. Mol. Cell. Proteomics, 3 (4), 367–378.
36 Snijders, A.M., Nowak, N., Segraves, R., Blackwood, S., Brown, N., Conroy, J., Hamilton, G., Hindle, A.K., Huey, B., Kimura, K., Law, S., Myambo, K., Palmer, J., Ylstra, B., Yue, J.P., Gray, J.W., Jain, A.N., Pinkel, D., and Albertson, D.G. (2001) Assembly of microarrays for genome-wide measurement of DNA copy number. Nat. Genet., 29 (3), 263–264.
37 Lugtenberg, D., de Brouwer, A., Kleefstra, T., Oudakker, A., Frints, S., Schrander-Stumpel, C., Fryns, J., Jensen, L., Chelly, J., Moraine, C. et al. (2006) Chromosomal copy number changes in patients with non-syndromic X-linked mental retardation detected by array CGH. J. Med. Genet., 43 (4), 362.
38 Miyake, N., Shimokawa, O., Harada, N., Sosonkina, N., Okubo, A., Kawara, H., Okamoto, N., Kurosawa, K., Kawame, H., Iwakoshi, M. et al. (2006) BAC array CGH reveals genomic aberrations in idiopathic mental retardation. Am. J. Med. Genet. Part A, 140 (3), 205–211.
39 Stankiewicz, P. and Beaudet, A. (2007) Use of array CGH in the evaluation of dysmorphology, malformations, developmental delay, and idiopathic mental retardation. Curr. Opin. Genet. Dev., 17 (3), 182–192.
40 Ullmann, R., Turner, G., Kirchhoff, M., Chen, W., Tonge, B., Rosenberg, C., Field, M., Vianna-Morgante, A., Christie, L., Krepischi-Santos, A. et al. (2007) Array CGH identifies reciprocal 16p13.1 duplications and deletions that predispose to autism and/or mental retardation. Hum. Mutat., 28 (7), 674–682.
41 Albertson, D. (2003) Profiling breast cancer by array CGH. Breast Cancer Res. Treat., 78 (3), 289–298.
42 Albertson, D., Ylstra, B., Segraves, R., Collins, C., Dairkee, S., Kowbel, D., Kuo, W., Gray, J., and Pinkel, D. (2000) Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nat. Genet., 25 (2), 144–146.
43 Albertson, D.G., Collins, C., McCormick, F., and Gray, J.W. (2003) Chromosome aberrations in solid tumors. Nat. Genet., 34 (4), 369–376.
44 Garnis, C., Coe, B., Zhang, L., Rosin, M., and Lam, W. (2003) Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene, 23 (14), 2582–2586.
45 Hackett, C.S., Hodgson, J.G., Law, M.E., Fridlyand, J., Osoegawa, K., de Jong, P.J., Nowak, N.J., Pinkel, D., Albertson, D.G., Jain, A., Jenkins, R., Gray, J.W., and Weiss, W.A. (2003) Genome-wide array CGH analysis of murine neuroblastoma reveals distinct genomic aberrations which parallel those in human tumors. Cancer Res., 63 (17), 5266–5273.
46 Hodgson, G., Hager, J., Volik, S., Hariono, S., Wernick, M., Moore, D., Albertson, D., Pinkel, D., Collins, C., Hanahan, D. et al. (2001) Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nat. Genet., 29 (4), 459–464.
47 Idbaih, A., Marie, Y., Lucchesi, C., Pierron, G., Manie, E., Raynal, V., Mosseri, V., Hoang-Xuan, K., Kujas, M., Brito, I. et al. (2008) BAC array CGH distinguishes mutually exclusive alterations that define clinicogenetic subtypes of gliomas. Int. J. Cancer, 122 (8), 1778–1786.
48 Pinkel, D. and Albertson, D. (2005) Array comparative genomic hybridization and its applications in cancer. Nat. Genet., 37, S11–S17.
49 Pollack, J., Sorlie, T., Perou, C., Rees, C., Jeffrey, S., Lonning, P., Tibshirani, R., Botstein, D., Borresen-Dale, A., and Brown, P. (2002) Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc. Natl. Acad. Sci. USA, 99, 12963–12968.
50 Rossi, M., Conroy, J., McQuaid, D., Nowak, N., Rutka, J., and Cowell, J. (2006) Array CGH analysis of pediatric medulloblastomas. Genes Chromosomes Cancer, 45 (3), 290–303.
51 Veltman, J.A., Fridlyand, J., Pejavar, S., Olshen, A.B., Korkola, J.E., DeVries, S., Carroll, P., Kuo, W.-L., Pinkel, D., Albertson, D., Cordon-Cardo, C., Jain, A.N., and Waldman, F.M. (2003) Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Res., 63 (11), 2872–2880.
52 Leek, J., Scharpf, R., Bravo, H., Simcha, D., Langmead, B., Johnson, W., Geman, D., Baggerly, K., and Irizarry, R. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet., 11 (10), 733–739.
53 Miecznikowski, J., Gaile, D., Liu, S., Shepherd, L., and Nowak, N. (2011) A new normalizing algorithm for BAC CGH arrays with quality control metrics. J. Biomed. Biotechnol., 2011.
54 Bolstad, B., Irizarry, R., Astrand, M., and Speed, T. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19 (2), 185–193.
55 Neter, J., Wasserman, W., and Kutner, M. (1989) Applied Linear Regression Models, Richard D. Irwin, Homewood, IL.
56 Casella, G. and Berger, R. (2001) Statistical inference.
57 Wasserman, L. (2004) All of Statistics: A Concise Course in Statistical Inference, Springer, Berlin.
58 Lehmann, E. (1997) Testing Statistical Hypotheses, Springer, Berlin.
59 Schervish, M.J. (1995) Theory of Statistics, Springer, Berlin.
60 Schervish, M.J. (1996) P values: what they are and what they are not. Am. Stat., 50, 203–206.
61 Draper, N. and Smith, H. (1998) Applied regression analysis (Wiley series in probability and statistics).
62 Rawlings, J., Pantula, S., and Dickey, D. (1998) Applied Regression Analysis: A Research Tool, Springer, Berlin.
63 R Development Core Team (2008) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
64 Hosmer, D. and Lemeshow, S. (2000) Applied Logistic Regression, vol. 354, Wiley-Interscience, New York.
65 Lesaffre, E. and Albert, A. (1989) Multiple-group logistic regression diagnostics. Appl. Stat., 38, 425–440.
66 Marshall, R. and Chisholm, E. (1985) Hypothesis testing in the polychotomous logistic model with an application to detecting gastrointestinal cancer. Stat. Med., 4 (3), 337–344.
67 Klein, J. and Moeschberger, M. (2003) Survival Analysis: Techniques for Censored and Truncated Data, Springer, Berlin.
68 Lee, E. and Wang, J. (2003) Statistical Methods for Survival Data Analysis, vol. 364, Wiley-Interscience, New York.
69 Prentice, R. and Kalbfleisch, J. (1980) The Statistical Analysis of Failure Time Data, John Wiley & Sons, New York.
70 Kaplan, E. and Meier, P. (1958) Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc., 53, 457–481.
71 Mantel, N. et al. (1966) Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemoth. Rep. 1, 50 (3), 163.
72 Hosmer, D. and Lemeshow, S. (1999) Applied Survival Analysis: Regression Modeling of Time to Event Data, Wiley Online Library, New York.
73 Guo, W. and Romano, J. (2007) A generalized Sidak–Holm procedure and control of generalized error rates under independence. Stat. Appl. Genet. Mol. Biol., 6 (1), Article 3.
74 Nichols, T. and Hayasaka, S. (2003) Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat. Method Med. Res., 12 (5), 419–446.
75 Gentleman, R., Carey, V., Huber, W., Dudoit, S., and Irizarry, R. (2005) Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Springer, Berlin.
76 Lehmann, E. and Romano, J. (2005) Generalizations of the familywise error rate. Ann. Stat., 33, 1138–1154.
77 Holm, S. (1979) A simple sequentially rejective multiple test procedure. Scand. J. Stat., 6, 65–70.
78 Hochberg, Y. (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75 (4), 800.
79 Pounds, S. and Morris, S. (2003) Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics, 19 (10), 1236–1242.
80 Miecznikowski, J., Gold, D., Shepherd, L., and Liu, S. (2011) Deriving and comparing the distribution for the number of false positives in single step methods to control k-FWER. Stat. Probabil. Lett., 81 (11), 1695–1705.
81 Dudoit, S., van der Laan, M., and Pollard, K. (2004) Multiple testing. Part I. Single-step procedures for control of general type I error rates. Stat. Appl. Genet. Mol. Biol., 3 (1), 1040.
82 van der Laan, M., Dudoit, S., and Pollard, K. (2004) Multiple testing. Part II. Step-down procedures for control of the familywise error rate. Stat. Appl. Genet. Mol. Biol., 3 (1), 1041.
83 Westfall, P. and Young, S. (1993) Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment, Wiley-Interscience, New York.
84 Efron, B. and Tibshirani, R. (1997) An Introduction to the Bootstrap, Chapman & Hall, New York.
85 Gold, D., Miecznikowski, J., and Liu, S. (2009) Error control variability in pathway-based microarray analysis. Bioinformatics, 25 (17), 2216–2221.
86 Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. Ann. Stat., 29, 1165–1188.
87 Bhattacharjee, M., Dhar, S., and Subramanian, S. (2011) Recent Advances in Biostatistics: False Discovery Rates, Survival Analysis, and Related Topics, vol. 4, World Scientific Publishing, Singapore.
88 Sarkar, S. (2008) Generalizing Simes’ test and Hochberg’s step-up procedure. Ann. Stat., 36 (1), 337–363.
89 Sarkar, S., Guo, W., and Finner, H. (2011) On adaptive procedures controlling the familywise error rate. J. Stat. Plan. Infer., 142, 65–78.
90 Bair, E., Hastie, T., Paul, D., and Tibshirani, R. (2006) Prediction by supervised principal components. J. Am. Stat. Assoc., 101 (473), 119–137.
91 Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Chan, W., Botstein, D., and Brown, P. (2000) Gene shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol., 1 (2), 0003–1.
92 Subramanian, A., Tamayo, P., Mootha, V., Mukherjee, S., Ebert, B., Gillette, M., Paulovich, A., Pomeroy, S., Golub, T., Lander, E. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci., 102 (43), 15545–15550.
93 Efron, B. and Tibshirani, R. (2007) On testing the significance of sets of genes. Ann. Appl. Stat., 1 (1), 107–129.
94 Kanehisa, M. and Goto, S. (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 28 (1), 27–30.
95 Mishra, G., Suresh, M., Kumaran, K., Kannabiran, N., Suresh, S., Bala, P., Shivakumar, K., Anuradha, N., Reddy, R., Raghavan, T. et al. (2006) Human protein reference database – 2006 update. Nucleic Acids Res., 34 (Database Issue), D411.
2 Overview of Public Cancer Databases, Resources, and Visualization Tools
Frank Emmert-Streib, Ricardo de Matos Simoes, Shailesh Tripathi, and Matthias Dehmer
2.1 Brief Overview
In this chapter, we provide a general overview of incidence and mortality rates of the most severe cancer types. Further, we provide information about public databases containing valuable resources, for example, expression data or SNP arrays. Due to the complex nature of cancer, we do not limit ourselves to just one cancer type but provide information about a large variety of different cancer types. As motivation for this, we briefly review the Human Disease Network [1, 2] focusing on a subnetwork thereof showing a network consisting of different cancer types. Finally, we advocate the usage of the statistical programming language R as a flexible and efficient means to integrate the multilevel sources from different databases with each other.
2.2 Introduction
Modern research in biology is driven by data and their statistical and computational analysis [3–7], and a similar development can be anticipated for biomedical research [1, 8–11]. Especially for studies involving complex disorders, such a data-driven approach is generally expected to be promising. In contrast to monogenic or single-gene diseases such as cystic fibrosis, sickle cell anemia, or Huntington’s disease, cancer is a so-called complex disease. This means that modifications in single genes are not sufficient to explain a disorder; instead, multiple genes and their interactions, in combination with lifestyle and environmental factors, need to be considered simultaneously. These combinatorial effects increase the level of difficulty associated with cancer research. Due to the multifactorial nature of cancer, data from high-throughput technologies, for example, from next-generation sequencing (NGS) or expression arrays,
Statistical Diagnostics for Cancer: Analyzing High-Dimensional Data, First Edition. Edited by Frank Emmert-Streib and Matthias Dehmer. # 2013 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2013 by Wiley-VCH Verlag GmbH & Co. KGaA.
provide a valuable means to extract functional information at various molecular levels. Given the high costs involved in the generation of such data, public databases provide an important source of information. For this reason, a survey of various public databases and resources for different cancer types is given in this chapter. This chapter is organized as follows. In the next section, we recall why cancer is a complex disease and show that, on a genetic level, different cancer types are related to each other. In Section 2.4, we review incidence and mortality rates of the most common cancer types. In Section 2.5, we provide information about many public cancer-related databases and describe their contents. Section 2.6 provides information about visualization and network-based analysis tools. This chapter finishes with a conclusions and perspective section.
2.3 Different Cancer Types are Genetically Related
In medicine, diseases are defined by their etiology, pathogenesis, or symptoms, for example, with respect to their phenotypic manifestation. For cancer, such a categorization has led to over 100 different types, listed in the Online Mendelian Inheritance in Man (OMIM) [12]. Despite undeniable differences among such cancer types, the seminal paper by Goh et al. [2] revealed genetic commonalities not only among cancer types, but also among complex diseases in general, including obesity, diabetes, deafness, epilepsy, and many more. More precisely, a list of disorder–gene associations was compiled from OMIM, from which a disease network was constructed. In this network, nodes correspond to disorders, and an edge connects two disorders if they share at least one disease gene. For example, the disease gene BRCA1 is related to breast cancer, ovarian cancer, and papillary serous carcinoma. Hence, all three cancer types are genetically related to each other. The resulting network has been termed the human disease network. While in [2] a human disease network for 1284 disorders based on 1777 known disease genes was constructed, we show in Figure 2.1 only a subnetwork thereof consisting of 84 different cancer types. In this figure, bigger nodes correspond to cancer types with a larger number of known disease genes. However, for simplicity, the width of the links does not reflect the number of common disease genes between pairs of cancer types. Figure 2.1 is a visual demonstration of the complex nature of cancer and its various types. If there were cancer-specific genes related only to particular cancer types, the shown genetic cancer network of 84 cancer types would be disconnected. Instead, one can see that there is strong interconnectedness among the different cancer types.
Due to this genetic connection of different cancer types, we present in the following sections information about many and not just one particular cancer, because, as shown in [2], it is very fruitful to study various cancer types simultaneously.
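The network construction described in this section (disorders as nodes, an edge whenever two disorders share at least one disease gene) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the association table is made up except for the BRCA1 links mentioned in the text; "disorder X", "disorder Y", and the GENE_* labels are placeholders, not OMIM entries.

```python
from itertools import combinations

# Hypothetical disorder -> disease-gene associations. Only the BRCA1 links
# come from the text; everything else is a placeholder for illustration.
associations = {
    "breast cancer": {"BRCA1", "GENE_A"},
    "ovarian cancer": {"BRCA1"},
    "papillary serous carcinoma": {"BRCA1"},
    "disorder X": {"GENE_A", "GENE_B"},
    "disorder Y": {"GENE_C"},
}

def disease_network(assoc):
    """Return {(d1, d2): shared_genes} for every pair sharing at least one gene."""
    edges = {}
    for d1, d2 in combinations(sorted(assoc), 2):
        shared = assoc[d1] & assoc[d2]
        if shared:
            edges[(d1, d2)] = shared
    return edges

def connected_components(nodes, edges):
    """Group nodes into connected components with a simple graph search."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbors[node] - comp)
        seen |= comp
        components.append(comp)
    return components

edges = disease_network(associations)
components = connected_components(list(associations), edges)
print(len(edges), "edges;", len(components), "connected components")
```

In this toy table, "disorder Y" shares no gene with any other disorder and therefore forms its own component, which is the kind of fragmentation the text notes is not observed for the 84 cancer types.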
Figure 2.1 The Genetic Cancer Network shows genetic connections between 84 different cancer types. This corresponds to a subnetwork of the human disease network [2]. (Labeled nodes: glioblastoma, gastric cancer, colorectal, ovarian, lymphoma, breast, pancreatic, leukemia, prostate, thyroid, and hepatocellular cancer.)
2.4 Incidence and Mortality Rates of Cancer
In this section, we present incidence and mortality rates of cancer for women and men as reported by the World Health Organization (WHO) [13, 14]. Because environmental and lifestyle factors also affect cancer, the WHO reports this information separately for different geographic and developmental regions. In the following, we present the data for the category “more developed regions.” Tables 2.1 and 2.2 show incidence and mortality rates separated by gender. Here, ASR means age-standardized rate. Because age has a significant influence on the risk of cancer, it is important to adjust for this factor when populations with different age structures are compared. Incidence is the number of new cases arising in a given period in a specified population, and the incidence rate is the number of new cases per 100,000 persons per year. Similarly, mortality is the number of deaths occurring in a given period in a specified population, and the mortality rate is the number of deaths per 100,000 persons per year. The percentages (%) given for incidence and mortality in Tables 2.1 and 2.2 refer to the fraction among all cancer types (not all of which are listed in these tables). For women, breast cancer has by far the highest incidence rate (66.4), followed by colorectal cancer. The mortality rate is also highest for breast cancer, but only slightly ahead of lung cancer. For men, prostate cancer has the highest incidence rate (61.7), followed by lung cancer, which also has the highest mortality rate.

In the next section, we provide information about the availability of samples, for example, from expression or sequencing experiments. We will also see that the distribution of available samples roughly reflects the distribution of the incidence and mortality rates shown in Tables 2.1 and 2.2. This emphasizes the strong correlation between basic and applied research in the biomedical sciences and the efficient synchronization between their research agendas.

Table 2.1 Estimated age-standardized incidence (Inc) and mortality (Mor) rates for women in the category “more developed regions,” WHO [13, 14].

Cancer                     Inc (%)  Inc (ASR)    Mor (%)  Mor (ASR)
Bladder                        2.0        3.6        1.6        1.0
Brain, nervous system          1.5        4.4        2.2        2.6
Breast                        26.7       66.4       15.5       15.3
Cervix uteri                   3.0        9.1        2.7        3.2
Colorectum                    13.1       24.3       12.6        9.7
Corpus uteri                   5.5       13.0        2.7        2.3
Gallbladder                    1.3        2.1        2.1        1.5
Hodgkin lymphoma               0.5        1.9        0.2        0.3
Kidney                         2.7        5.9        2.1        1.7
Larynx                         0.3        0.6        0.2        0.2
Leukemia                       2.4        5.9        3.2        2.8
Lip, oral cavity               1.1        2.3        0.7        0.6
Liver                          1.6        2.7        3.3        2.5
Lung                           9.4       18.8       15.4       13.6
Melanoma of skin               3.2        8.7        1.1        1.1
Multiple myeloma               1.2        2.2        1.7        1.3
Nasopharynx                    0.1        0.2        0.1        0.1
Non-Hodgkin lymphoma           3.3        7.1        2.8        2.2
Esophagus                      0.7        1.3        1.2        1.0
Other pharynx                  0.3        0.8        0.3        0.3
Ovary                          3.8        9.3        5.3        5.1
Pancreas                       3.1        5.5        6.5        5.1
Stomach                        3.9        7.3        5.8        4.7
Thyroid                        2.9        9.2        0.5        0.4
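The rate definitions used in this section can be made concrete with a small calculation. The following is a minimal sketch of direct age standardization, in which age-specific rates are weighted by the age shares of a standard population. All counts, age groups, and weights below are hypothetical values chosen for illustration; they are not WHO data and not the standard population used for the tables in this chapter.

```python
def crude_rate(cases, populations, per=100_000):
    """Crude rate: all cases divided by the whole population, per `per` persons."""
    return per * sum(cases) / sum(populations)


def age_standardized_rate(cases, populations, standard_weights, per=100_000):
    """Directly age-standardized rate (ASR).

    Each age-specific rate (cases / persons) is weighted by the share of that
    age group in a standard population, so that populations with different
    age structures become comparable.
    """
    if not (len(cases) == len(populations) == len(standard_weights)):
        raise ValueError("all inputs need one entry per age group")
    total_weight = sum(standard_weights)
    return per * sum(
        (weight / total_weight) * (n_cases / n_persons)
        for n_cases, n_persons, weight in zip(cases, populations, standard_weights)
    )


# Hypothetical one-year counts for three age groups (0-39, 40-64, 65+).
# Both populations have identical age-specific rates but different age structures.
older_cases, older_population = [8, 80, 200], [400_000, 400_000, 200_000]
younger_cases, younger_population = [14, 50, 50], [700_000, 250_000, 50_000]
standard_weights = [60, 30, 10]  # hypothetical standard population shares

crude_older = crude_rate(older_cases, older_population)
crude_younger = crude_rate(younger_cases, younger_population)
asr_older = age_standardized_rate(older_cases, older_population, standard_weights)
asr_younger = age_standardized_rate(younger_cases, younger_population, standard_weights)

print(f"crude rates per 100,000: {crude_older:.1f} vs {crude_younger:.1f}")
print(f"ASR per 100,000:         {asr_older:.1f} vs {asr_younger:.1f}")
```

In this toy setting the crude rates differ (28.8 versus 11.4 per 100,000) only because one population is older, whereas the age-standardized rates coincide (17.2 per 100,000), which is the reason for reporting ASR values when regions are compared.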
Table 2.2 Estimated age-standardized incidence (Inc) and mortality (Mor) rates for men in the category “more developed regions,” WHO [13, 14].

Cancer                     Inc (%)  Inc (ASR)    Mor (%)  Mor (ASR)
Bladder                        5.9       16.3        3.6        4.6
Brain, nervous system          1.5        5.8        2.2        3.9
Colorectum                    13.2       37.7       10.9       15.1
Gallbladder                    0.9        2.3        1.2        1.6
Hodgkin lymphoma               0.5        2.2        0.2        0.4
Kidney                         3.8       11.9        2.8        4.1
Larynx                         1.7        5.4        1.5        2.4
Leukemia                       2.7        9.1        3.2        4.8
Lip, oral cavity               2.1        6.8        1.4        2.3
Liver                          2.8        8.2        5.0        7.2
Lung                          16.2       47.1       26.9       39.2
Melanoma of skin               2.9        9.6        1.2        1.9
Multiple myeloma               1.2        3.3        1.4        1.9
Nasopharynx                    0.2        0.6        0.1        0.3
Non-Hodgkin lymphoma           3.2       10.3        2.5        3.6
Other pharynx                  1.4        4.5        1.4        2.2
Esophagus                      2.1        6.5        3.5        5.3
Pancreas                       2.9        8.3        5.4        7.9
Prostate                      21.7       61.7        8.9       10.5
Stomach                        5.8       16.7        7.2       10.3
Testis                         1.0        4.6        0.1        0.3
Thyroid                        0.8        2.9        0.2        0.3

2.5 Cancer and Disorder Databases

Because of the pivotal role of high-throughput data in modern biology and the biomedical sciences, we provide in the following pointers to valuable public resources. An important resource of publicly available data is the National Center for Biotechnology Information (NCBI), which maintains the Gene Expression Omnibus (GEO) database [15, 16]. In Table 2.3, we provide an overview of the RNA and genomic samples available (accessed April 2012). Here “RNA samples” refers to data from DNA microarrays or next-generation sequencing, for example, RNA-seq data, and “genomic samples” refers to data from array CGH, chromatin immunoprecipitation, DNA methylation, or SNP arrays. Although the GEO database is not solely devoted to cancer but also includes samples from other disorders and nondisease-related data, thousands of samples are currently available. A cancer-specific database is maintained by The Cancer Genome Atlas (TCGA) [17], which is a part of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). The TCGA currently provides data for 21 different cancer types, listed in Table 2.4. Available data types include gene expression, SNP, methylation, and somatic mutations. However, these data are not simultaneously available for all patient samples. Table 2.5 lists further important databases containing downloadable data for various cancer types or for disorders more generally. Databases specifically devoted to protein–protein interactions and ontologies are listed in Table 2.6.

Table 2.3 Available “RNA” and “genomic” samples from the NCBI GEO (Gene Expression Omnibus) database (accessed April 2012) for various cancer types [15].

Cancer              RNA samples  Genomic samples
All cancer types           9287             3024
Bladder                     175                1
Breast                     2106              543
Cervical                    181               31
Colorectum                  895              173
Leukemia                    941              538
Lymphoma                    881              607
Lung                        579               43
Ovary                      1062              179
Melanoma of skin            828              455
Esophagus                    29                0
Pancreas                    607              306
Prostate                    624              169

Table 2.4 The Cancer Genome Atlas (TCGA) [17] provides data about gene expression, SNP, methylation, somatic mutations, and clinical information a). Each entry gives the cancer type and the number of samples.

Acute Myeloid Leukemia [LAML]: 200
Bladder Urothelial Carcinoma [BLCA]: 78
Brain Lower Grade Glioma [LGG]: 80
Breast invasive carcinoma [BRCA]: 864
Cervical squamous cell carcinoma and endocervical adenocarcinoma [CESC]: 37
Colon adenocarcinoma [COAD]: 422
Glioblastoma multiforme [GBM]: 581
Head and Neck squamous cell carcinoma [HNSC]: 292
Kidney renal clear cell carcinoma [KIRC]: 501
Kidney renal papillary cell carcinoma [KIRP]: 97
Liver hepatocellular carcinoma [LIHC]: 55
Lung adenocarcinoma [LUAD]: 351
Lung squamous cell carcinoma [LUSC]: 283
Ovarian serous cystadenocarcinoma [OV]: 591
Pancreatic adenocarcinoma [PAAD]: 38
Prostate adenocarcinoma [PRAD]: 153
Rectum adenocarcinoma [READ]: 165
Skin Cutaneous Melanoma [SKCM]: 242
Stomach adenocarcinoma [STAD]: 133
Thyroid carcinoma [THCA]: 232
Uterine Corpus Endometrioid Carcinoma [UCEC]: 451

a) The symbols in brackets ([ ]) correspond to the disease codes used by the TCGA.
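To inspect how the GEO samples of Table 2.3 are distributed over cancer types, the counts can be converted into shares of the “All cancer types” totals. The following minimal sketch does this in plain Python; the counts are copied from Table 2.3, while the variable and function names are illustrative choices and not part of any GEO interface.

```python
# (RNA samples, genomic samples) per cancer type, copied from Table 2.3.
geo_samples = {
    "Bladder": (175, 1),
    "Breast": (2106, 543),
    "Cervical": (181, 31),
    "Colorectum": (895, 173),
    "Leukemia": (941, 538),
    "Lymphoma": (881, 607),
    "Lung": (579, 43),
    "Ovary": (1062, 179),
    "Melanoma of skin": (828, 455),
    "Esophagus": (29, 0),
    "Pancreas": (607, 306),
    "Prostate": (624, 169),
}
TOTAL_RNA = 9287      # "All cancer types", RNA samples
TOTAL_GENOMIC = 3024  # "All cancer types", genomic samples


def percent(count, total):
    """Share of `count` in `total`, in percent."""
    return 100.0 * count / total


# Share of each cancer type among all RNA and all genomic cancer samples.
rna_share = {name: percent(rna, TOTAL_RNA) for name, (rna, _) in geo_samples.items()}
genomic_share = {
    name: percent(genomic, TOTAL_GENOMIC) for name, (_, genomic) in geo_samples.items()
}

# Rank cancer types by their number of RNA samples, largest first.
ranking = sorted(geo_samples, key=lambda name: geo_samples[name][0], reverse=True)

for name in ranking:
    print(f"{name:<18}{rna_share[name]:6.1f}% RNA {genomic_share[name]:6.1f}% genomic")
```

A ranking of this kind can then be set against the incidence and mortality figures of Tables 2.1 and 2.2 to judge how closely sample availability follows disease burden.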
Table 2.5 A survey of cancer and general disease databases. Each entry gives the database and the information provided.

4DXpress [18], http://4dx.embl.de/4DXpress/welcome.do: Platform to query and compare gene expression data during the development of major model animals.
ArrayExpress [19], http://www.ebi.ac.uk/arrayexpress/: Downloadable database of expression experiments.
Breast cancer database, http://www.breastcancerdatabase.org/: Molecular alterations associated with breast cancer for DNA, RNA, protein, and drug-induced.
CaSNP [20], http://cistrome.dfci.harvard.edu/CaSNP/: Collection of copy number alterations (CNA) from SNP arrays.
CellMap, http://cancer.cellmap.org/cellmap/: Selected set of human cancer-focused pathways.
cmap [21], http://www.broadinstitute.org/cmap/: Collection of genome-wide transcriptional expression data from cultured human cells treated with small molecules.
Cosmic [22], http://www.sanger.ac.uk/genetics/CGP/cosmic/: Somatic mutations and related details for various cancer types.
dbGaP [23], http://www.ncbi.nlm.nih.gov/gap: Genome-wide association, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits.
dbvar [24], http://www.ncbi.nlm.nih.gov/dbvar/: Database of genomic structural variation.
Diseasome [25], http://diseasome.kobic.re.kr/: An integrated database of genes, genetic variation, and diseases.
Drugbank [26], http://www.drugbank.ca/: The database contains 6712 drug entries including 1441 FDA-approved small molecule drugs, 135 FDA-approved drugs, 84 nutraceuticals and 5084 experimental drugs.
GEO [15, 16], http://www.ncbi.nlm.nih.gov/geo/: Repository for array and sequence-based data.
Genome-Wide Association Catalog [27], www.genome.gov/gwastudies: Catalog of SNP-trait associations from published genome-wide association studies.
HapMap [28], http://hapmap.ncbi.nlm.nih.gov/: Haplotype map of the human genome.
HLungDB [29], http://www.megabionet.org/bio/hlung/: Genes, proteins and miRNAs involved in lung cancer and related clinical information.
KEGG EXPRESSION [30, 31], http://www.genome.jp/kegg/expression/: Gene expression profile data for various organisms.
KEGG DISEASE [30, 31], http://www.genome.jp/kegg/disease/: Collection of disease entries capturing knowledge on genetic and environmental perturbations.
OMIM [12], http://www.ncbi.nlm.nih.gov/omim: Human genes and genetic disorders.
Oncomine [32], https://www.oncomine.org: Expression and genomic data for various cancer types.
PID [33], http://pid.nci.nih.gov/: Curated pathways of human molecular signaling and regulatory events.
PrognoScan [34], http://gibk21.bse.kyutech.ac.jp/PrognoScan/: Gene expression and patient prognosis.

Table 2.6 Overview of downloadable protein databases (first part) and ontologies (second part). Each entry gives the name and the information provided.

Protein databases:
BioGrid [35], http://thebiogrid.org/: Database of genetic and protein interactions for various organisms.
CCSB Interactome [36, 37], http://interactome.dfci.harvard.edu/: Downloadable data of yeast two-hybrid protein-protein interactions for various organisms.
DIP [38], http://dip.doe-mbi.ucla.edu/dip/Main.cgi: Experimentally determined, curated interactions between proteins.
IntAct [39], http://www.ebi.ac.uk/intact/main.xhtml: Protein interaction data.

Ontologies:
GO [40], http://www.geneontology.org/: Gene ontology provides a controlled vocabulary of terms for describing molecular function, cellular component and biological process.
KEGG [30, 41], http://www.genome.jp/kegg/: The Kyoto Encyclopedia of Genes and Genomes is a collection of different databases for systems information, genomic information, and chemical information.
2.6 Visualization and Network-Based Analysis Tools
In the last section of this chapter, we provide an overview of visualization methods for network data that are either web-based or make use of the statistical programming language R [42]. The former are especially helpful for pure application, whereas the latter additionally allow modifications, extensions, and new developments.

2.6.1 Web-Based Software

Web-based portals that allow the visualization and analysis of genomic network data are very popular, because these tools allow us to quickly gain insights into the complex data structures encountered in high-throughput data. In Table 2.7, we list some of the most useful tools. Many of these emphasize the visualization of biological networks and, hence, enable an exploratory analysis [43, 44]. This is particularly beneficial for a first inspection of data, because genomics data are usually high-dimensional and our understanding of the underlying biology is limited, which hampers the formulation of precise hypotheses and makes it hard to form an initial picture of the information contained in a data set.

Table 2.7 Overview of web interfaces, visualization, and analysis tools. Each entry gives the software and the information provided.

Cytoscape [45], http://www.cytoscape.org/: Visualizing complex networks and integrating these with any type of attribute data.
DAVID [46], http://david.abcc.ncifcrf.gov/: Functional annotation analysis.
Enrichment Map [47], http://baderlab.org/Software/EnrichmentMap: Functional enrichment visualization.
GraphWeb [48], http://biit.cs.ut.ee/graphweb/: Public web server for graph-based analysis of biological networks.
NeAT [49], http://rsat.bigre.ulb.ac.be/rsat/: Algorithms for the analysis of biological networks.
VisANT [50], http://visant.bu.edu/: Integrative Visual Analysis Tool for Biological Networks and Pathways.

2.6.2 R-Based Packages

Whenever a program is available that solves the specific problem under investigation, using it has a clear advantage over developing a similar solution from scratch. However, if no such program can solve a particular problem, there is no way around implementing a new solution. In this respect, the statistical programming language R [42] can be considered the gold standard in the Computational Biology and Biostatistics community. Briefly, the advantage of R over many other programming languages is that it is a high-level programming language, like Python or Perl, that allows easy integration of user-developed functions written in C/C++, Fortran, or R itself. Owing to its free availability and a community effort that has established various package repositories, for example, CRAN and bioconductor [51], it combines the advantages of short development times, no costs, speed (because functions can be implemented in a low-level programming language), and compatibility with a very large and continuously growing number of packages. Such packages are frequently provided as supplements to publications. Hence, they provide up-to-date information about current methodological developments, and the publications present results of their application to contemporary biological problems. Figure 2.2 visualizes the pivotal role of R in the Computational Biology and Biostatistics community. The key point here is the circular connectivity among the individual components of the development and analysis process. The circular integration of R into genomic analysis is necessary to bridge the gap between the theoretical methods from statistics and machine learning on one side and the data provided by the databases on the other. In other words, R serves as an embodiment of the analysis methods so that they can be applied to real data. In Table 2.8, we list a number of very useful R packages that focus on biological networks. Here “bioconductor” refers to an R package repository accessible at http://www.bioconductor.org/ [51].
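To give a flavor of the structural analyses that network packages support, the following minimal sketch computes two elementary topological measures, node degree and network density, for a small undirected graph. It is written in plain Python purely for illustration and does not reproduce the interface of any R package named in this chapter; the edge list and node names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs, ignoring self-loops."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


def node_degrees(adjacency):
    """Degree of each node: the number of distinct neighbors."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


def network_density(adjacency):
    """Fraction of all possible undirected edges that are present."""
    n_nodes = len(adjacency)
    if n_nodes < 2:
        return 0.0
    n_edges = sum(len(neighbors) for neighbors in adjacency.values()) / 2
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Hypothetical interaction network among five genes.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
adjacency = build_adjacency(edges)
degrees = node_degrees(adjacency)
density = network_density(adjacency)

print(degrees)
print(density)
```

Dedicated packages offer many more measures and far better performance; the sketch only shows the kind of quantity such tools compute.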
Figure 2.2 R-centric implementation of data integration, visualization, analysis, and method development and their resulting circular interconnectedness.

Table 2.8 Overview of R-based visualization and analysis packages. Each entry gives the package and the information provided.

bnlearn [52], http://www.bnlearn.com/: Package for learning the graphical structure of Bayesian networks, estimating their parameters, and performing inference.
c3net [53, 54], http://cran.r-project.org/web/packages/c3net/: Inferring gene regulatory networks with direct physical interactions from microarray expression data using C3NET.
graph [55], bioconductor: Handling, creating and visualizing graphs.
igraph [56], http://igraph.sourceforge.net/: Creating and manipulating undirected and directed graphs.
minet [57, 58], bioconductor: Provides various algorithms, for example, MRnet and Aracne, for inferring mutual information networks.
Rgraphviz [59, 60], bioconductor: Interfaces R with the AT&T graphviz library for plotting networks.
qpgraph [61, 62], bioconductor: qp-graphs are undirected Gaussian graphical Markov models built from q-order partial correlations. Useful for learning undirected graphical Gaussian Markov models from data sets with p ≫ n.
QuACN [63, 64], http://cran.r-project.org/web/packages/QuACN: Offers a set of ca. 150 topological network measures to analyze complex networks structurally.

2.7 Conclusions

The purpose of this chapter was to provide an overview of incidence and mortality rates of the most severe cancer types and to survey public databases containing valuable resources, for example, expression data or SNP arrays. In addition, we advocated the use of the statistical programming language R as an efficient means to implement statistical and computational methods in order to analyze and integrate data from different databases. Because the field is growing fast, this overview is inevitably incomplete. However, the principle visualized in Figure 2.2 should translate to new genomic data types as well.
2.8 Perspective
Technological progress in the biomedical sciences is difficult to predict because the field is developing quickly. However, we expect the next few years to bring a significant increase in the availability of next-generation sequencing data, that is, DNA-seq, RNA-seq, and ChIP-seq [65–67]. This will lead to a tremendous increase in the quantity of data and poses a considerable technological challenge for its storage and distribution. Further, given the fast-paced nature of the field, we anticipate that the role of R as an enabling tool for genomic data analysis will become further established.
References

1 Barabási, A.-L. (2007) Network medicine – from obesity to the “diseasome”. N. Engl. J. Med., 357 (4), 404–407.
2 Goh, K.-I., Cusick, M.E., Valle, D., Childs, B., Vidal, M., and Barabasi, A. (2007) The human disease network. Proc. Natl. Acad. Sci., 104 (21), 8685–8690.
3 Dehmer, M., Emmert-Streib, F., Graber, A., and Salvador, A. (eds) (2011) Applied Statistics for Network Biology: Methods for Systems Biology, Wiley-Blackwell, Hoboken, NJ.
4 Emmert-Streib, F. and Dehmer, M. (2011) Networks for systems biology: conceptual connection of data and function. IET Syst. Biol., 5 (3), 185.
5 Palsson, B. (2006) Systems Biology, Cambridge University Press, Cambridge, New York.
6 Schadt, E. (2009) Molecular networks as sensors and drivers of common human diseases. Nature, 461, 218–223.
7 Vidal, M. (2009) A unifying view of 21st century systems biology. FEBS Lett., 583 (24), 3891–3894.
8 Emmert-Streib, F. (2007) The chronic fatigue syndrome: A comparative pathway analysis. J. Comput. Biol., 14 (7), 961–972.
9 Emmert-Streib, F. and Dehmer, M. (eds) (2010) Medical Biostatistics for Complex Diseases, Wiley-Blackwell, Weinheim.
10 Emmert-Streib, F. and Glazko, G. (2011) Pathway analysis of expression data: deciphering functional building blocks of complex diseases. PLoS Comput. Biol., 7 (5), e1002053.
11 Zanzoni, A., Soler-Lopez, M., and Aloy, P. (2009) A network medicine approach to human disease. FEBS Lett., 583 (11), 1759–1765.
12 Online Mendelian Inheritance in Man, OMIM (TM).
13 Ferlay, J., Shin, H.R., Bray, F., Forman, D., Mathers, C., and Parkin, D.M. (2010) Cancer Incidence and Mortality Worldwide: IARC Cancerbase no. 10, IARC Press, Lyon, France, pp. 1–11.
14 Pisani, P., Bray, F., and Parkin, D.M. (2002) Estimates of the world-wide prevalence of cancer for 25 sites in the adult population. Int. J. Cancer, 97 (1), 72–81.
15 Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Muertter, R.N., Holko, M., Ayanbule, O., Yefanov, A., and Soboleva, A. (2011) NCBI GEO: archive for functional genomics data sets 10 years on. Nucl. Acids Res., 39 (suppl 1), D1005–D1010.
16 Edgar, R., Domrachev, M., and Lash, A.E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucl. Acids Res., 30, 207–210.
17 The Cancer Genome Atlas Research Network (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455 (7216), 1061–1068.
18 Haudry, Y., Berube, H., Letunic, I., Weeber, P.-D., Gagneur, J., Girardot, C., Kapushesky, M., Arendt, D., Bork, P., Brazma, A. et al. (2008) 4DXpress: a database for cross-species expression pattern comparisons. Nucl. Acids Res., 36 (Database issue), D847–D853.
19 Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., Holloway, E., Kapushesky, M., Kemmeren, P., Lara, G.G. et al. (2003) ArrayExpress: a public repository for microarray gene expression data at the EBI. Nucl. Acids Res., 31 (1), 68–71.
20 Cao, Q., Zhou, M., Wang, X., Meyer, C.A., Zhang, Y., Chen, Z., Li, C., and Liu, X.S. (2011) CaSNP: a database for interrogating copy number alterations of cancer genome from SNP array data. Nucl. Acids Res., 39 (suppl 1), D968–D974.
21 Lamb, J., Crawford, E.D., Peck, D., Modell, J.W., Blat, I.C., Wrobel, M.J., Lerner, J., Brunet, J.-P., Subramanian, A., Ross, K.N. et al. (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science, 313 (5795), 1929–1935.
22 Forbes, S.A., Bindal, N., Bamford, S., Cole, C., Kok, C.Y., Beare, D., Jia, M., Shepherd, R., Leung, K., Menzies, A., Teague, J.W., Campbell, P.J., Stratton, M.R., and Futreal, P.A. (2011) COSMIC: mining complete cancer genomes in the catalogue of somatic mutations in cancer. Nucl. Acids Res., 39 (suppl 1), D945–D950.
23 Mailman, M.D., Feolo, M., Jin, Y., Kimura, M., Tryka, K., Bagoutdinov, R., Hao, L., Kiang, A., Paschall, J., Phan, L. et al. (2007) The NCBI dbGaP database of genotypes and phenotypes. Nat. Genet., 39 (10), 1181–1186.
24 Church, D.M., Lappalainen, I., Sneddon, T.P., Hinton, J., Maguire, M., Lopez, J., Garner, J., Paschall, J., DiCuccio, M., Yaschenko, E., Scherer, S.W., Feuk, L., and Flicek, P. (2010) Public data archives for genomic structural variation. Nat. Genet., 42 (10), 813–814.
25 Yang, J.O.O., Hwang, S., Oh, J., Bhak, J., and Sohn, T.-K.K. (2008) An integrated database-pipeline system for studying single nucleotide polymorphisms and diseases. BMC Bioinformatics, 9 (Suppl 12).
26 Knox, C., Law, V., Jewison, T., Liu, P., Ly, S., Frolkis, A., Pon, A., Banco, K., Mak, C., Neveu, V., Djoumbou, Y., Eisner, R., Guo, A.C., and Wishart, D.S. (2011) DrugBank 3.0: A comprehensive resource for “Omics” research on drugs. Nucl. Acids Res., 39 (Suppl 1), D1035–D1041.
27 Hindorff, L.A., Sethupathy, P., Junkins, H.A., Ramos, E.M., Mehta, J.P., Collins, F.S., and Manolio, T.A. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci., 106 (23), 9362–9367.
28 Thorisson, G.A., Smith, A.V., Krishnan, L., and Stein, L.D. (2005) The international HapMap project website. Genome Res., 15 (11), 1592–1593.
29 Wang, L., Xiong, Y., Sun, Y., Fang, Z., Li, L., Ji, H., and Shi, T. (2010) HLungDB: an integrated database of human lung cancer research. Nucl. Acids Res., 38 (suppl 1), D665–D669.
30 Kanehisa, M. and Goto, S. (2000b) KEGG: Kyoto encyclopedia of genes and genomes. Nucl. Acids Res., 28, 27–30.
31 Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K., Itoh, M., Kawashima, S., Katayama, T., Araki, M., and Hirakawa, M. (2006) From genomics to chemical genomics: new developments in KEGG. Nucl. Acids Res., 34, D354–D357.
32 Rhodes, D.R., Yu, J., Shanker, K., Deshpande, N., Varambally, R., Ghosh, D., Barrette, T., Pandey, A., and Chinnaiyan, A.M. (2004) ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia, 6 (1), 1–6.
33 Schaefer, C.F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., and Buetow, K.H. (2009) PID: the pathway interaction database. Nucl. Acids Res., 37 (Suppl 1), D674–D679.
34 Mizuno, H., Kitada, K., Nakai, K., and Sarai, A. (2009) PrognoScan: a new database for meta-analysis of the prognostic value of genes. BMC Med. Genom., 2 (1), 18.
35 Breitkreutz, B.-J., Stark, C., Reguly, T., Boucher, L., Breitkreutz, A., Livstone, M., Oughtred, R., Lackner, D.H., Bahler, J., Wood, V., Dolinski, K., and Tyers, M. (2008) The BioGRID interaction database: 2008 update. Nucl. Acids Res., 36 (Suppl 1), D637–D640.
36 Venkatesan, K., Rual, J.-F., Vazquez, A., Stelzl, U., Lemmens, I., Hirozane-Kishikawa, T., Hao, T., Zenkner, M., Xin, X., Goh, K.-I. et al. (2009) An empirical framework for binary interactome mapping. Nat. Methods, 6 (1), 83–90.
37 Yu, H., Braun, P., Yildirim, M.A., Lemmens, I., Venkatesan, K., Sahalie, J., Hirozane-Kishikawa, T., Gebreab, F., Li, N., Simonis, N., Hao, T., Rual, J.-F., Dricot, A., Vazquez, A., Murray, R.R., Simon, C., Tardivo, L., Tam, S., Svrzikapa, N., Fan, C., de Smet, A.-S., Motyl, A., Hudson, M.E., Park, J., Xin, X., Cusick, M.E., Moore, T., Boone, C., Snyder, M., Roth, F.P., Barabasi, A.-L., Tavernier, J., Hill, D.E., and Vidal, M. (2008) High-quality binary protein interaction map of the yeast interactome network. Science, 322 (5898), 104–110.
38 Xenarios, I., Rice, D.W., Salwinski, L., Baron, M.K., Marcotte, E.M., and Eisenberg, D. (2000) DIP: the database of interacting proteins. Nucl. Acids Res., 28 (1), 289–291.
39 Aranda, B., Achuthan, P., Alam-Faruque, Y., Armean, I., Bridge, A., Derow, C., Feuermann, M., Ghanbarian, A.T., Kerrien, S., Khadake, J., Kerssemakers, J., Leroy, C., Menden, M., Michaut, M., Montecchi-Palazzi, L., Neuhauser, S.N., Orchard, S., Perreau, V., Roechert, B., van Eijk, K., and Hermjakob, H. (2009) The IntAct molecular interaction database in 2010. Nucl. Acids Res., 40, 878.
40 Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H. et al. (2000) Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat. Genet., 25 (1), 25–29.
41 Kanehisa, M. and Goto, S. (2000a) KEGG: Kyoto encyclopedia of genes and genomes. Nucl. Acids Res., 28, 27–30.
42 R Development Core Team (2008) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0.
43 Hoaglin, D., Mosteller, F., and Tukey, J. (1983) Understanding Robust and Exploratory Data Analysis, John Wiley & Sons, New York.
44 Tukey, J. (1977) Exploratory Data Analysis, Addison-Wesley, New York.
45 Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., and Ideker, T. (2003) Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res., 13 (11), 2498–2504.
46 Dennis, G., Sherman, B.T., Hosack, D.A., Yang, J., Gao, W., Lane, H.C., and Lempicki, R.A. (2003) DAVID: database for annotation, visualization, and integrated discovery. Genome Biol., 4 (5), R60.
47 Merico, D., Isserlin, R., Stueker, O., Emili, A., and Bader, G.D. (2010) Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. PLoS ONE, 5, e13984.
48 Reimand, J., Tooming, L., Peterson, H., Adler, P., and Vilo, J. (2008) Graphweb: mining heterogeneous biological networks for gene modules with functional significance. Nucl. Acids Res., 36 (suppl 2), W452–W459.
49 Brohee, S., Faust, K., Lima-Mendez, G., Sand, O., Janky, R., Vanderstocken, G., Deville, Y., and Van Helden, J. (2008) NeAT: a toolbox for the analysis of biological networks, clusters, classes and pathways. Nucl. Acids Res., 36 (Web Server issue), W444–W451.
50 Hu, Z., Mellor, J., Wu, J., Yamada, T., Holloway, D., and DeLisi, C. (2005) VisANT: data-integrating visual framework for biological networks and modules. Nucl. Acids Res., 33 (suppl 2), W352–W357.
51 Gentleman, R., Carey, V., Bates, D. et al. (2004) Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol., 5, R80.
52 Scutari, M. (2010) Learning Bayesian networks with the bnlearn R package. J. Statistical Software, 35 (3), 1–22.
53 Altay, G. and Emmert-Streib, F. (2010) Inferring the conservative causal core of gene regulatory networks. BMC Syst. Biol., 4, 132.
54 Altay, G. and Emmert-Streib, F. (2011) Structural influence of gene networks on their inference: analysis of C3NET. Biol. Direct, 6, 31.
55 Carey, V.J., Gentry, J., Whalen, E., and Gentleman, R. (2005a) Network structures and algorithms in bioconductor. Bioinformatics, 21 (1), 135–136.
56 Csardi, G. and Nepusz, T. (2008) igraph-package.
57 Meyer, P., Kontos, K., and Bontempi, G. (2007) Information-theoretic inference of large transcriptional regulatory networks. EURASIP J. Bioinf. Syst. Biol., 2007, 79879.
58 Meyer, P., Lafitte, F., and Bontempi, G. (2008) minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics, 9 (1), 461.
59 Carey, V.J., Gentry, J., Whalen, E., and Gentleman, R. (2005b) Network structures and algorithms in bioconductor. Bioinformatics, 21 (1), 135–136.
60 Ellson, J., Gansner, E.R., Koutsofios, E., North, S.C., and Woodhull, G. (2001) Graphviz–Open Source Graph Drawing Tools, vol. 2265, Springer, Berlin, pp. 483–484.
61 Castelo, R. (2006) A robust procedure for Gaussian graphical model search from microarray data with p larger than n. J. Machine Learn. Res., 7, 2621–2650.
62 Castelo, R. and Roverato, A. (2009) Reverse engineering molecular regulatory networks from microarray data with qp-graphs. J. Comput. Biol., 16 (2), 213–227.
63 Mueller, L., Kugler, K., Graber, A., Emmert-Streib, F., and Dehmer, M. (2011) Structural measures for network biology using QuACN. BMC Bioinformatics, 12 (1), 492.
64 Mueller, L.A., Kugler, K.G., Dander, A., Graber, A., and Dehmer, M. (2010) QuACN – An R package for analyzing complex biological networks quantitatively. Bioinformatics, 27 (1), 140–141.
65 Marguerat, S. and Bähler, J. (2010) RNA-seq: from technology to biology. Cell. Mol. Life Sci., 67, 569–579.
66 Park, K. and Kim, D. (2009) Localized network centrality and essentiality in the yeast–protein interaction network. Proteomics, 9 (22), 5143–5154.
67 Wang, Z., Gerstein, M., and Snyder, M. (2009) RNA-Seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet., 10 (1), 57–63.
Part Two Bayesian Methods
Statistical Diagnostics for Cancer: Analyzing High-Dimensional Data, First Edition. Edited by Frank Emmert-Streib and Matthias Dehmer. # 2013 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2013 by Wiley-VCH Verlag GmbH & Co. KGaA.
3 Discovery of Expression Signatures in Chronic Myeloid Leukemia by Bayesian Model Averaging
Ka Yee Yeung
3.1 Brief Introduction
Gene expression data has been used to develop molecular diagnostic tests in cancer. Gene expression is the conversion of the genetic information encoded in a gene into other functional gene products. These gene products (such as messenger RNAs and proteins) are used in various cell activities, and subsequently would give rise to the organism’s phenotypes. High-throughput technologies, such as microarrays and sequencing, allow the measurement of the activity levels (or expression levels) of tens of thousands of genes simultaneously. These high-throughput technologies mainly contribute to the discovery phase of the development of molecular diagnostics, in which exploratory studies are used to identify potential targets. The objective of the discovery phase is to determine a shortlist of high-priority candidates. The number of such candidate targets is limited by the capacity of downstream target validation, which is time consuming, costly, and labor intensive. Therefore, a small set of potential target genes is highly desirable for the development of inexpensive diagnostic tests. The expression patterns of patient samples from different types of cancer, different stages of cancer, or different prognosis can be studied using computational techniques. In particular, classification is the prediction of the diagnostic category of a tissue sample from its expression array phenotype, given the availability of similar data from tissues in identified categories. A challenge in predicting diagnostic categories using high-throughput expression data is that the number of genes (or variables) is usually much greater than the number of tissue samples (or observations) available. Furthermore, only a subset of the genes is relevant in distinguishing different classes (or labels). The selection of relevant genes for classification is known as variable selection or feature selection. 
In this chapter, we will review a multivariate variable selection technique called Bayesian model averaging (BMA) and its applications in gene selection and classification of gene expression data. BMA considers model uncertainty in the variable selection process by averaging over multiple models. In this context, models are sets of (potentially overlapping) predictive variables (or genes). Typical gene selection and classification procedures ignore model uncertainty and use a single set of relevant genes (model) to predict class. We will illustrate the power of BMA on gene expression data studying the progression of chronic myeloid leukemia (CML).
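The idea of averaging over models can be illustrated with a toy calculation. In general, BMA weights the prediction of each model M_k by its posterior probability given the data D, that is, Pr(y | D) = Σ_k Pr(y | M_k, D) Pr(M_k | D), and the posterior probability that a gene is relevant is the sum of the posterior probabilities of all models containing it. The sketch below applies these two formulas to three hypothetical models; the gene names, model probabilities, and per-model predictions are made-up numbers, and the sketch does not show how such quantities are estimated from expression data.

```python
# Each hypothetical model: (set of genes, posterior model probability,
# predicted probability that a new sample belongs to class 1).
models = [
    ({"geneA", "geneB"}, 0.5, 0.9),
    ({"geneA", "geneC"}, 0.3, 0.7),
    ({"geneD"}, 0.2, 0.2),
]


def posterior_inclusion_probabilities(models):
    """For each gene, sum the posterior probabilities of the models containing it."""
    total = sum(weight for _, weight, _ in models)
    inclusion = {}
    for genes, weight, _ in models:
        for gene in genes:
            inclusion[gene] = inclusion.get(gene, 0.0) + weight / total
    return inclusion


def bma_prediction(models):
    """Average the per-model predictions, weighted by posterior model probability."""
    total = sum(weight for _, weight, _ in models)
    return sum(weight * prediction for _, weight, prediction in models) / total


inclusion = posterior_inclusion_probabilities(models)
averaged = bma_prediction(models)

# A single-model procedure would keep only the most probable model.
best_genes, _, best_prediction = max(models, key=lambda model: model[1])

print(sorted(inclusion.items(), key=lambda item: -item[1]))
print(f"BMA prediction: {averaged:.2f}, best single model: {best_prediction:.2f}")
```

In this toy example the averaged prediction (0.70) is less extreme than that of the single most probable model (0.90), because the less probable models still contribute according to their weight.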
3.2 Chronic Myeloid Leukemia (CML)
CML is a cancer of the bone marrow and blood. It is characterized by a reciprocal translocation between chromosomes 9 and 22 yielding the Bcr–Abl fusion protein. It is this constitutively active tyrosine kinase that drives CML pathophysiology [1]. CML is usually diagnosed in the chronic phase (CP), when treatment is very effective for most patients. The only known cure for CML is bone-marrow or stem-cell transplantation. Tyrosine kinase inhibitors (TKIs), such as imatinib, dasatinib, or nilotinib, which inhibit Bcr–Abl and consequently its downstream targets, have improved survival rates of CML patients, especially patients in the CP [2]. CML originates from the presence of a genetic abnormality in blood cells and progresses through distinct phases. When untreated, the CP evolves through the accelerated phase (AP) to an acute leukemia, blast crisis (BC) [1]. CML tends to progress relatively slowly compared to acute leukemia. However, the time to progression varies widely, from half a year to 15 years [3]. Currently, there are no clinical or molecular measures that can predict CML progression of individual patients at the time of diagnosis, making it difficult to adapt therapy to the risk level of each patient.
3.3 Variable Selection on Gene Expression Data
Microarrays measure the expression of a large number of genes. A typical microarray dataset consists of experiments under multiple conditions, such as temperature changes, time series, different cancer types, drug, or genetic perturbations. From a computational perspective, a microarray dataset can be conceptualized as a matrix of real numbers, where the rows are the genes and the columns represent experimental conditions. Suppose there are a total of G genes and E experimental conditions. In the case of the CML progression gene expression data, the experimental conditions (i.e., columns in the matrix) represent patient samples in different phases of CML. Specifically, there are a total of E = 72 patient samples, among which 42 samples are in the CP and 30 samples are in BC [4]. Figure 3.1 shows a toy example of gene expression data, in which the rows are the genes and the columns represent patient samples. In the case of gene expression data, the samples usually consist of different types of tissue samples, for example, cancer versus noncancer [5], different types of cancer [6], and response to therapy [7]. The entries of the matrix represent expression levels. There are different technologies that can be used to profile gene expression, for example, microarrays (single-channel versus two-color) and next-generation sequencing. In two-color microarrays, the entries usually represent log ratios with respect to a reference. The entries in Figure 3.1 are color coded so that positive expression values are shown in red, while negative expression values are shown in green.
The goal of variable selection is to select a subset of relevant genes from the training data that are predictive of the classes (i.e., type A or B) in the test data. Typical gene expression data consist of thousands or even tens of thousands of genes. The figure shows only four genes for clarity of presentation. In this example, both genes 1 and 3 can be used to discriminate between types A and B in the training data. In other words, genes 1 and 3 are downregulated with respect to the reference (i.e., negative expression ratios) in type A samples and upregulated (i.e., positive expression ratios) in type B samples. In this toy example, either gene 1 or 3 can be used as a signature gene to predict the class of a test sample E′, for which the class is assumed to be unknown. Based on the expression of genes 1 and 3, one would likely predict E′ to be a sample from type A. This is a simple example of univariate gene selection, in which one set of genes is selected to predict the classes of samples in the test set. There are many univariate gene selection methods in the literature, such as the t-test, signal-to-noise ratio [6], the between-sum-of-squares to within-sum-of-squares (BSS/WSS) ratio [8], and partial least squares [9]. Most of these methods aim to identify variables (genes) with distinct expression levels in different classes, and are usually very efficient. Saeys et al. [10] review variable selection methods used in bioinformatics applications.
Figure 3.1 A toy example illustrating gene expression data and the application of variable selection techniques.
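The univariate ranking idea behind the toy example can be sketched in a few lines of Python. This is only an illustration: the expression values, the gene names, and the helper `signal_to_noise` are hypothetical and are not taken from Figure 3.1, and the score used is the signal-to-noise ratio in the form (difference of class means) / (sum of class standard deviations).

```python
from statistics import mean, stdev

# Hypothetical log-ratio data: one list of training-sample values per gene.
# The numbers are made up for illustration; they are not read off Figure 3.1.
expression = {
    "gene1": [-1.2, -0.8, -1.0, 0.9, 1.1, 1.3],
    "gene2": [0.2, -0.5, 0.4, -0.1, 0.3, -0.3],
    "gene3": [-0.9, -1.1, -0.7, 1.0, 0.8, 1.2],
    "gene4": [0.5, -0.6, 0.1, 0.6, -0.4, 0.0],
}
labels = ["A", "A", "A", "B", "B", "B"]


def signal_to_noise(values, labels, pos="B", neg="A"):
    """Score one gene: (mean_pos - mean_neg) / (sd_pos + sd_neg)."""
    pos_vals = [v for v, c in zip(values, labels) if c == pos]
    neg_vals = [v for v, c in zip(values, labels) if c == neg]
    return (mean(pos_vals) - mean(neg_vals)) / (stdev(pos_vals) + stdev(neg_vals))


# Score each gene on its own (univariate), then rank by absolute score.
scores = {gene: signal_to_noise(vals, labels) for gene, vals in expression.items()}
ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
print(ranked[:2])  # ['gene1', 'gene3']
```

Because each gene is scored independently, two genes that carry the same information (here gene1 and gene3) both rank highly, which is one reason more than one gene set can be equally predictive.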
However, due to the large number of candidate gene predictors, it is likely that more than a single set of genes can be predictive. Typical model selection approaches select a single predictive model (a set of predictive genes) and then proceed as if the selected model has generated the data, which might lead to overconfident inferences. BMA is a multivariate variable selection technique that accounts for model uncertainty by averaging over the posterior distributions of multiple models, weighted by their posterior model probability [11, 12]. We will describe BMA in detail in Section 3.4, and we will use logistic regression as the classification method.
Here, we will focus on the binary classification case. However, some applications involve three or more classes. Our method, iterative BMA, can also be generalized to more than two classes; please refer to [13] for details. Many other classification and variable selection methods have been used in the literature for gene expression data. As an example, Ramaswamy et al. [14] combined support vector machines, which are binary classifiers, to solve the multiclass classification problem. Nguyen and Rocke [9, 15] used partial least squares (PLS) for feature selection, together with traditional classification algorithms such as logistic discrimination and quadratic discrimination, to classify multiple tumor types on microarray data. Tibshirani et al. [16] developed an integrated feature selection and classification algorithm called shrunken centroids for classifying multiple cancer types, in which features are selected by considering one gene at a time. Dudoit et al. [8] compared the performance of different discrimination methods, including nearest-neighbor classifiers, linear discriminant analysis, and classification trees, for classifying multiple tumor types using gene expression data.
3.4 Bayesian Model Averaging (BMA)
Let Y be the response variable (class) of a sample in the test set, where Y = 0 or 1. In the case of the CML gene expression data, CP patient samples belong to class Y = 0 and the BC samples belong to class Y = 1. Let D be the training data for which the classes are known. In BMA, the posterior probability of Y = 1 given the training set D is a weighted average: the posterior probability of Y = 1 given the training set D and model $M_k$ is multiplied by the posterior probability of model $M_k$ given the training set D, and the products are summed over a set of models $M_k$ in $M$:

\[
\Pr(Y = 1 \mid D) = \sum_{k \in M} \Pr(Y = 1 \mid D, M_k)\, \Pr(M_k \mid D) \tag{3.1}
\]
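As a small numerical illustration of Equation 3.1, the sketch below averages per-model predictions using posterior model probabilities as weights. The three probabilities and weights are invented for the example, and the function name `bma_probability` is hypothetical. The renormalization step is an added convenience for the case where the retained weights do not sum exactly to 1; it is not part of Equation 3.1 itself.

```python
def bma_probability(model_probs, model_posteriors):
    """Equation 3.1: sum over models of Pr(Y=1 | D, M_k) * Pr(M_k | D)."""
    if len(model_probs) != len(model_posteriors):
        raise ValueError("one posterior weight is needed per model")
    total = sum(model_posteriors)
    # Renormalize in case the retained models' weights do not sum exactly to 1.
    weighted = sum(p * w for p, w in zip(model_probs, model_posteriors))
    return weighted / total


# Hypothetical example: three models with their predictions for one test sample.
model_probs = [0.90, 0.80, 0.30]       # Pr(Y=1 | D, M_k) for k = 1, 2, 3
model_posteriors = [0.50, 0.30, 0.20]  # Pr(M_k | D)
p_bc = bma_probability(model_probs, model_posteriors)
print(round(p_bc, 4))  # 0.75
```

A single-model approach would report 0.90 (the best model's prediction); averaging tempers that value with the less confident models in proportion to their support.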
We use logistic regression [17] to compute the probability that a test sample belongs to class 1 under model $M_k$, that is, $\Pr(Y = 1 \mid D, M_k)$. In logistic regression,

\[
\log \frac{\Pr(Y = 1 \mid D, M_k)}{\Pr(Y = 0 \mid D, M_k)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q ,
\]

where the $x$'s represent the expression levels of the selected genes, the $\beta$'s are the regression parameters, and q is the number of selected variables with a nonzero regression coefficient. It is nontrivial to determine the set of models M in the weighted average calculation in Equation 3.1. Raftery used the leaps and bounds algorithm [18] to efficiently identify a reduced set of good models. The leaps and bounds algorithm rapidly returns the best nbest models of each size (up to 30 variables). A larger nbest produces more candidate models and hence increases the computational time. The parameter nbest is set to 10 in our empirical studies. After the leaps and bounds step, Madigan and Raftery used the Occam’s window method to further reduce the number of candidate models by discarding models that are much less likely than the best model supported by the data (the default is 20 times less likely).
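The two ingredients of this paragraph can be sketched as follows. The first function evaluates the logistic model for one sample; the second applies an Occam's window filter to a list of candidate models. The sketch assumes that each model's posterior probability is approximated from its BIC score (the approximation that this section introduces next), with equal prior probabilities, so that the posterior is proportional to exp(-BIC/2) and lower BIC is better. The function names, the BIC values, and the coefficients are hypothetical.

```python
import math


def logistic_probability(intercept, coefficients, x):
    """Pr(Y=1 | D, M_k) from the log-odds b0 + b1*x1 + ... + bq*xq."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-log_odds))


def occams_window(bic_scores, max_ratio=20.0):
    """Keep models at most `max_ratio` times less likely than the best one.

    Assumes Pr(M_k | D) is proportional to exp(-BIC_k / 2) (equal model priors).
    Returns {model index: renormalized posterior probability}.
    """
    best = min(bic_scores)
    # exp(-(BIC_k - BIC_best) / 2) approximates Pr(M_k | D) / Pr(M_best | D).
    relative = [math.exp(-(b - best) / 2.0) for b in bic_scores]
    kept = [k for k, r in enumerate(relative) if r >= 1.0 / max_ratio]
    total = sum(relative[k] for k in kept)
    return {k: relative[k] / total for k in kept}


# Hypothetical BIC scores for four candidate models.
posteriors = occams_window([100.0, 101.0, 104.0, 110.0])
print(sorted(posteriors))  # [0, 1, 2]: the last model falls outside the window

# Hypothetical two-gene model evaluated on one sample.
p = logistic_probability(-0.5, [1.2, 0.8], [1.0, -0.25])
```

Subtracting the best BIC before exponentiating keeps the arithmetic stable when BIC values are large.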
Another issue is the computation of $\Pr(M_k \mid D)$, the posterior probability of model $M_k$ given the training set D. Raftery used the Bayesian information criterion (BIC) score to approximate the posterior probability of a model $M_k$. The posterior probability of each selected gene is computed by summing the posterior probabilities of the selected models in which the gene of interest is included.
3.4.1 The Iterative BMA Algorithm (iBMA)
The BMA algorithm described above was originally developed in the context of social sciences and is not applicable to data in which the number of genes (variables) is greater than the number of samples (observations), which is the case for gene expression data. In addition, the leaps and bounds algorithm rapidly becomes inefficient as the number of variables increases beyond 40. However, typical microarray datasets consist of thousands or even tens of thousands of genes with relatively few experimental conditions (usually on the order of tens). Therefore, we developed an iterative BMA algorithm (iBMA), which first orders genes with a univariate gene selection method and then moves a 30-variable window down the ordered genes [13]. Specifically, we used the ratio BSS/WSS [8] to determine the initial gene order. Intuitively, genes with relatively large variations between classes and relatively small variations within classes are promising discriminative genes. The ratio BSS/WSS is a univariate gene selection method in which genes with large BSS/WSS ratios are good candidate relevant genes. For a gene j, let $D_{ij}$ denote the expression level of gene j in sample i; $\bar{D}_{kj}$ denote the average expression level of gene j over samples in class k; and $\bar{D}_{\cdot j}$ denote the average expression level of gene j over all samples. Note that this notation is the transpose of the toy example in Figure 3.1. Also, let the indicator variable $I(Y_i = k)$ equal 1 if sample i belongs to class k, and 0 otherwise. The BSS/WSS ratio for gene j is defined as

\[
\frac{\mathrm{BSS}(j)}{\mathrm{WSS}(j)} = \frac{\sum_i \sum_k I(Y_i = k)\,(\bar{D}_{kj} - \bar{D}_{\cdot j})^2}{\sum_i \sum_k I(Y_i = k)\,(D_{ij} - \bar{D}_{kj})^2} \tag{3.2}
\]
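Equation 3.2 translates directly into code. The sketch below computes the BSS/WSS ratio for one gene at a time and ranks genes in descending order of the ratio; the gene names and expression values are made up for illustration, and `bss_wss` is a hypothetical helper name. Because each sample belongs to exactly one class, the double sum over i and k collapses to a single sum over samples.

```python
def bss_wss(values, labels):
    """BSS/WSS ratio (Equation 3.2) for one gene.

    values[i] is the expression of the gene in sample i (D_ij);
    labels[i] is the class of sample i (Y_i).
    """
    overall_mean = sum(values) / len(values)              # D-bar_{.j}
    by_class = {}
    for v, y in zip(values, labels):
        by_class.setdefault(y, []).append(v)
    class_mean = {y: sum(vs) / len(vs) for y, vs in by_class.items()}  # D-bar_{kj}
    # I(Y_i = k) picks out the class of sample i, so each sample contributes once.
    bss = sum((class_mean[y] - overall_mean) ** 2 for y in labels)
    wss = sum((v - class_mean[y]) ** 2 for v, y in zip(values, labels))
    return bss / wss  # undefined (division by zero) if a gene has no within-class variation


# Hypothetical data: 3 samples in class 0 and 3 samples in class 1.
labels = [0, 0, 0, 1, 1, 1]
data = {
    "geneA": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],  # class means differ strongly
    "geneB": [1.0, 5.0, 9.0, 2.0, 5.0, 8.0],  # class means are identical
}
ratios = {gene: bss_wss(vals, labels) for gene, vals in data.items()}
ranking = sorted(ratios, key=ratios.get, reverse=True)
print(ratios["geneA"], ranking)  # 13.5 ['geneA', 'geneB']
```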
In step 1 of iBMA, we compute the BSS/WSS ratio for each of the G genes in the training data and arrange the genes in descending order of the BSS/WSS ratio. We then apply BMA to the top 30 ranked genes. We choose the top 30 genes because the leaps and bounds algorithm is inefficient when the number of genes (variables) is much greater than 30. Since genes with high posterior probabilities are good candidates for relevant genes, we remove genes to which BMA assigns low posterior probabilities of being in the predictive models. Suppose that q genes are removed. The next q genes from the rank-ordered BSS/WSS ratios are added to the set of genes so that we maintain a window of 30 genes, and BMA is applied again. These steps of gene swaps and iterative applications of BMA are continued until all genes have been considered.
Outline of iBMA from Yeung et al. [13]
Input: training set D containing G genes over E patient samples, and class labels for the E patient samples.
Preprocessing step: rank all G genes using a univariate gene selection procedure. Let $x_1, x_2, \ldots, x_G$ be the ordered list of genes. Select the top p ranked genes: $x_1, x_2, \ldots, x_p$.
Parameters: nbest and p.
1) Initially, start with the 30 top-ranked genes ($x_1, x_2, \ldots, x_{30}$), and apply BMA. Let toBeProcessed be an ordered list of genes with ranks 31 to p. Initially, toBeProcessed ← {$x_{31}, x_{32}, \ldots, x_p$}.
2) Repeat until all p genes are processed:
   a) Remove all genes with posterior probabilities < 5%.
   b) Adaptive threshold step: if all genes have posterior probabilities ≥ 5%, determine the minimum posterior probability, minProbne0, among the 30 genes in the current window. Remove all genes with posterior probabilities < (minProbne0 + 1)%.
   c) Let removedGenes be the set of genes removed, and suppose that q genes are removed.
   d) Replace the q removed genes with the next q genes from toBeProcessed. Update toBeProcessed ← toBeProcessed \ removedGenes.
   e) Apply BMA.
Output: selected models and their posterior probabilities, selected genes and their corresponding posterior probabilities, and maximum likelihood estimates of the regression parameters in each model.
3.4.2 Computational Assessment
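Before turning to the assessment, the window-update logic of the iBMA outline in Section 3.4.1 can be sketched as below. This is a minimal sketch under several assumptions, not the authors' implementation: `run_bma` is a hypothetical callback standing in for a real BMA fit (for example, the R packages mentioned in Section 3.10) and must return the posterior probability, in percent, of each gene in the window; the queue update in step 2d is read as removing from toBeProcessed the genes that were just moved into the window; and the final output keeps the genes with nonzero posterior probability in the last BMA run.

```python
def iterative_bma(ranked_genes, run_bma, p=1000, window_size=30, threshold=5.0):
    """Sketch of the iBMA loop; `run_bma(window)` is a caller-supplied stand-in."""
    genes = list(ranked_genes[:p])
    window = genes[:window_size]
    to_be_processed = genes[window_size:]
    probs = run_bma(window)                                   # step 1
    while to_be_processed:                                    # step 2
        removed = {g for g in window if probs[g] < threshold}         # 2a
        if not removed:                                               # 2b: adaptive threshold
            min_prob = min(probs[g] for g in window)
            removed = {g for g in window if probs[g] < min_prob + 1.0}
        q = len(removed)                                              # 2c
        incoming = to_be_processed[:q]                                # 2d
        to_be_processed = to_be_processed[q:]
        window = [g for g in window if g not in removed] + incoming
        probs = run_bma(window)                                       # 2e
    return {g: probs[g] for g in window if probs[g] > 0.0}


def toy_bma(window):
    """Hypothetical stand-in: three 'informative' genes get 60%, all others 0%."""
    informative = {"g1", "g2", "g50"}
    return {g: (60.0 if g in informative else 0.0) for g in window}


ranked = [f"g{i}" for i in range(1, 101)]   # g1 has the best univariate rank
selected = iterative_bma(ranked, toy_bma, p=100)
print(sorted(selected))  # ['g1', 'g2', 'g50']
```

The adaptive step guarantees that at least one gene leaves the window in every pass, so the queue always shrinks and the loop terminates. In the toy run, g50 is found even though its univariate rank is outside the initial 30-gene window.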
We use cross validation to assess the prediction accuracy of iBMA. In m-fold cross validation, the training set is randomly divided into m equal subsets. Each of these m subsets is left out in turn for the evaluation of classification accuracy, while the other (m − 1) subsets are used as inputs to the classification algorithm. This process is repeated 100 times. In leave-one-out cross validation (LOOCV), each sample in the training set is left out in turn for the evaluation of classification accuracy. In cross validation, we fit the models using the training data in each fold and each run of the cross validation procedure, and the test data are used only for evaluation purposes. For example, in threefold cross validation repeated 100 times, we fit the models 3 × 100 = 300 times, and the models selected could vary in different folds and runs. We adopt the Brier score [19] to account for the magnitudes of predicted probabilities. The intuition is that a predicted probability close to the true class label (0 or 1) is more desirable than a predicted probability around 0.5. Denote the predicted probability that sample i belongs to class 1, $\Pr(Y_i = 1 \mid D)$, by $p_i$. The Brier score is defined as

\[
\sum_{i=1}^{E} (Y_i - p_i)^2 ,
\]

which is the sum of squares of the differences between the true class and the predicted probability over all samples. If the predicted probabilities $p_i$ are constrained to be equal to 0 or 1, the Brier score is equal to the total number of classification errors. An analytic method with a relatively small number of errors and a relatively small Brier score achieves higher prediction accuracy.
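The two evaluation measures and the fold construction can be sketched as follows. The helper names and the example predictions are hypothetical; the 0.5 cutoff matches the classification rule used later in the case study, and the fold assignment is one simple way to obtain m equal subsets, not necessarily the one used by the authors.

```python
import random


def brier_score(y_true, p_pred):
    """Sum of squared differences between true classes (0/1) and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_pred))


def count_errors(y_true, p_pred, cutoff=0.5):
    """Number of misclassifications when predicting class 1 if p >= cutoff."""
    return sum(int((p >= cutoff) != bool(y)) for y, p in zip(y_true, p_pred))


def m_fold_indices(n_samples, m, rng):
    """Randomly split sample indices 0..n_samples-1 into m folds of (near) equal size."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[f::m] for f in range(m)]


# Hypothetical predictions for three test samples.
y_true = [0, 1, 1]
p_pred = [0.1, 0.8, 0.4]
print(round(brier_score(y_true, p_pred), 2), count_errors(y_true, p_pred))  # 0.41 1

# With hard 0/1 predictions the Brier score equals the number of errors.
assert brier_score(y_true, [0, 1, 0]) == count_errors(y_true, [0, 1, 0]) == 1

# Threefold split of 72 samples: each fold holds 24 test samples.
folds = m_fold_indices(72, 3, random.Random(0))
print([len(f) for f in folds])  # [24, 24, 24]
```

In each run, one fold serves as the test set and the remaining folds as the training set; repeating the split with a different random seed gives the repeated cross validation described above.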
3.5 Case Study: CML Progression Data
We applied iBMA to the CML progression gene expression data consisting of a total of 72 CP and BC CML patient samples [20]. Before applying iBMA, 2612 genes that discriminate the 42 CP and 30 BC patients were identified using a univariate measure by Radich et al. [4]. Using parameters nbest = 10 and p = 1000, iBMA identified six signature genes (see Table 3.1) [20]. Table 3.1 shows that the top three univariate ranked genes (DDX47, IGSF2, and LTB4R) are selected by iBMA, together with three additional genes with lower univariate rankings. In this example, a total of 21 models are selected: each of the six signature genes is selected in a single-gene model with a posterior probability of 12.87%, and in five different two-gene models, each with a posterior probability of 1.52%. Because the posterior probability of a gene is equal to the sum of the posterior probabilities of all the selected models containing the gene of interest, each of the six signature genes is selected with posterior probability 12.87% + (5 × 1.52%) = 20.47%. In BMA and iBMA, the predicted probability of a sample in a given class, $\Pr(Y = 1 \mid D)$, is computed by averaging over the predicted probabilities from these 21 selected models, weighted by the posterior probabilities of each of these models. Figure 3.2 shows that the six signature genes selected by iBMA are differentially expressed in the CP versus the BC patients. We evaluated the prediction accuracy of iBMA using threefold cross validation. In particular, two-thirds of the E = 72 samples were randomly selected to be in the training set, and the remaining one-third of the samples were used to evaluate the accuracy of the CP (Y = 0) and BC (Y = 1) prediction. This random split was repeated 100 times. If $\Pr(Y = 1 \mid D) < 0.5$, we classified a patient sample as CP; otherwise, we classified the sample as BC. We used $\Pr(Y = 1 \mid D)$ to compute the Brier score, and the thresholded class prediction to compute the number of misclassifications.
Table 3.1 This research was originally published in Blood [20]. © the American Society of Hematology.

Acc #        Gene name   Probne0 (%)   BSS/WSS rank
NM_016355    DDX47       20.5          1
NM_004258    IGSF2       20.5          2
NM_000752    LTB4R       20.5          3
NM_014062    ART4        20.5          13
NM_005505    SCARB1      20.5          69
NM_005888    SLC25A3     20.5          77
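The posterior probabilities in Table 3.1 can be checked from the model posterior probabilities quoted in the text. The sketch below enumerates the 6 single-gene and 15 two-gene models, assigns them the reported probabilities (12.87% and 1.52%), and sums over the models that contain each gene. The data structure is only an illustration of the bookkeeping; it is not output from the BMA software.

```python
from itertools import combinations

genes = ["DDX47", "IGSF2", "LTB4R", "ART4", "SCARB1", "SLC25A3"]

# Posterior model probabilities (in percent) as reported in the text.
models = {}
for gene in genes:
    models[(gene,)] = 12.87             # six single-gene models
for pair in combinations(genes, 2):
    models[pair] = 1.52                 # fifteen two-gene models

# Posterior probability of a gene = sum over the selected models that contain it.
gene_prob = {g: sum(prob for m, prob in models.items() if g in m) for g in genes}

print(len(models))                      # 21
print(round(gene_prob["DDX47"], 2))     # 20.47, reported as 20.5 in Table 3.1
print(round(sum(models.values()), 2))   # 100.02, i.e. 100% up to rounding
```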
Figure 3.2 A heatmap showing the six signature genes identified by iBMA. This research was originally published in Blood [20]. © the American Society of Hematology.
Since the training data consisting of (2/3) × 72 = 48 randomly selected samples were used to build the model in each fold of the cross validation runs, a potentially different set of selected genes is used to predict the classes of the (1/3) × 72 = 24 test samples. Subsequently, we derived a distribution of different Brier scores over the 100 cross validation runs (see Figure 3.3). On the CML progression gene expression data, iBMA produced an average number of classification errors of 0.20 and an average Brier score of 0.21 in the 24 test samples over 100 cross validation test sets, producing an average prediction accuracy of 99.17%.
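The figures in this paragraph fit together as follows, assuming that the reported accuracy is one minus the average number of errors divided by the size of the test set:

```latex
\[
\tfrac{2}{3} \times 72 = 48, \qquad
\tfrac{1}{3} \times 72 = 24, \qquad
1 - \frac{0.20}{24} \approx 0.9917 = 99.17\% .
\]
```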
3.6 The Power of iBMA
We illustrate the power of iBMA by comparing its prediction accuracy to that of other methods in cross validation. We use the same threefold cross validation setup repeated 100 times as in Figure 3.3. In the first comparison, we select the top six univariate genes using the BSS/WSS ratio, and use all six genes in a single logistic regression model to compute $\Pr(Y = 1 \mid D)$ in each fold of the 100 cross validation runs. Figure 3.4 shows that the Brier scores span a larger range (between 0 and 4) than those of iBMA, with an average of 0.375. Using iBMA, only 2 out of 100 cross validation runs produce a Brier score above 0.9, while 24 out of 100 runs produce a Brier score above 0.9 using the top six univariate genes. We repeated this strategy using different numbers of top univariate genes and obtained similar results. In other words, using the top univariate genes yields a higher average Brier score in cross validation, and the distribution of these Brier scores is more extreme than that of iBMA. In our second comparison, we aim to show the power of iBMA-selected models containing multiple genes. In particular, in each cross validation run, we take the iBMA-selected genes with nonzero posterior probabilities, put each of these genes in a single-gene model, and average over the predicted probabilities of these single-gene models. This strategy yields an average Brier score of 0.24 (higher than the 0.21 obtained from the iBMA-selected models). When we compare the Brier scores from the iBMA models and the single-gene models in each cross validation run, the single-gene models produce higher (i.e., less favorable) Brier scores than iBMA in the majority of runs (96 out of 100). Therefore, we conclude that iBMA yields a more robust and less extreme distribution of Brier scores in our cross validation studies. In addition, the average Brier score from iBMA is also more favorable.
Figure 3.3 A histogram showing the distribution of the Brier scores of iBMA over the 100 cross validation runs.
Figure 3.4 A histogram showing the distribution of the Brier scores of the top six univariate genes over the 100 cross validation runs.
3.7 Laboratory Validation
In addition to computational assessment with cross validation, we followed up our computationally derived signature genes with laboratory validation. Specifically, we profiled the expression levels of selected genes in independent patient samples using quantitative real-time polymerase chain reaction (RT-PCR or QPCR) [20]. QPCR is an alternative technology to microarrays for measuring the expression levels of selected genes. It is less expensive than microarrays when the number of genes is relatively small. We hypothesized that our six-gene signature from Table 3.1 could discriminate early from late CP, a distinction that is not possible with current pathological or clinical classifications. In 67 independent CP patient samples (45 early CP, 22 late CP), the six-gene signature measured by QPCR was highly predictive of early versus late CP disease. Among the 67 patients, 43 early CP patients and 17 late CP patients were correctly classified. Notably, the two misclassified early CP patients subsequently failed imatinib mesylate therapy, and among the five misclassified late CP patients, two did well on subsequent nilotinib therapy. Therefore, for these patients, our predicted labels disagreed with the predetermined CP classification but were consistent with the subsequent response to treatment.
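The validation counts reported above imply the following accuracy figures. These percentages are not stated in the chapter; they are derived here purely from the quoted counts, and the variable names are illustrative.

```python
# Counts quoted in the text for the 67 independent CP samples.
early_total, late_total = 45, 22
early_correct, late_correct = 43, 17

# Misclassified patients per group (2 early CP, 5 late CP, as in the text).
early_missed = early_total - early_correct
late_missed = late_total - late_correct

overall_accuracy = (early_correct + late_correct) / (early_total + late_total)
early_accuracy = early_correct / early_total   # fraction of early CP classified as early
late_accuracy = late_correct / late_total      # fraction of late CP classified as late

print(early_missed, late_missed)               # 2 5
print(round(overall_accuracy, 4))              # 0.8955
print(round(early_accuracy, 4), round(late_accuracy, 4))  # 0.9556 0.7727
```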
3.8 Conclusions
In this chapter, the computational problem of variable selection for high-dimensional functional genomics data has been discussed. A supervised, multivariate probabilistic method called iBMA and its application to CML progression gene expression data have been reviewed. iBMA yields posterior probabilities of the predictions, selected genes, and selected models, and systematically determines the number of signature genes and models. We have previously shown that iBMA typically selects a small number of predictive genes that yield prediction accuracy comparable to that of other methods that use more genes [13]. iBMA adopts the initial univariate ranking step, the leaps and bounds algorithm, and the Occam’s window strategy to constrain its search space, and hence it is relatively computationally efficient compared to other multivariate variable selection methods. The desirable theoretical properties of BMA have been shown in Raftery and Zheng [21]. In particular, BMA point estimators and predictions minimize mean squared error on average over datasets drawn from the ensemble of models considered. The BMA model tends to have better predictive performance than standard model selection methods. Here, we complemented these theoretical results with empirical results of iBMA, illustrated on the CML progression gene expression data. We showed the power of iBMA by comparing its prediction accuracy in cross validation to that obtained using top univariate genes and single-gene models. The signature genes that iBMA identified on the CML progression gene expression data were also predictive of CML progression in independent patient samples profiled using a different technology (QPCR). Given the close relationship between disease phase and treatment outcome, iBMA has the potential to be a powerful tool for developing diagnostic tests from high-dimensional gene expression data. Such tests could identify patients at increased risk of treatment failure at the time of diagnosis.
3.9 Perspective
There is much room for future work. In terms of method improvement, the theoretical properties of iBMA need to be addressed. In particular, what is the impact of the initial univariate ranking step? There are many possible univariate ranking measures in addition to the BSS/WSS ratio. One could study the empirical performance of various such univariate measures in the initial ranking step. We have previously determined that nbest = 10 and p = 1000 are optimal input parameters for iBMA in Yeung et al. [13] using cross validation. The parameter p controls the number of top univariate ranked genes passed on to iBMA. Instead of a hard threshold, one could study the distribution of the univariate statistics to determine a soft threshold. With ever-increasing computational power, one could potentially relax the iBMA window size (currently set to 30). A larger window size will allow BMA to consider models containing more variables at the same time, at the expense of computational efficiency. Our experience is that iBMA typically selects models containing a few variables, but this could be a consequence of all our approximation parameters. Alternatively, one might consider methods other than leaps and bounds to constrain the model search space. Due to the high dimensionality of gene expression data, many variables (genes) are highly correlated with the class labels. BMA and iBMA address this issue by averaging over multiple models instead of choosing one and only one model. However, even with this model averaging technique, the set of signature genes selected could still differ when different subsets of the data are used to build the classifier. Regularized regression methods (e.g., lasso [22], elastic net [23]) have also been applied to high-dimensional gene expression data in which there are more variables than observations. Regularized regression methods combine shrinkage and variable selection.
As an example, lasso minimizes the usual sum of squared errors, subject to a bound on the sum of the absolute values of the regression coefficients. Efficient implementations of regularized methods also exist [24, 25]. A well-documented problem with variable selection methods using gene expression data is that gene selection is heavily influenced by the subset of patients, even when the feature selection method and dataset remain constant [26]. Much larger training data are needed to generate a robust gene list [27]. For the development of diagnostic tests to be used in the clinical setting, the instability of the selected genes remains a challenge. We recently proposed the use of expert knowledge and additional data sources as an alternative strategy to derive a stable set of signature genes [28]. Again using CML as our case study, we developed a method that integrates gene expression data with reference genes known to be associated with CML in the literature and predicted functional relationships estimated from heterogeneous data sources. We showed that our new method, integrated iBMA, identified gene signatures that are more robust and stable than those obtained using gene expression data alone. Most importantly, we identified gene signatures that are predictive of relapse of CML CP patients even after adjustment for known risk factors associated with transplant outcomes.
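For reference, the lasso criterion mentioned earlier in this section can be written out as a constrained least-squares problem. The notation is an assumption made for this illustration: $y_i$ is the response of sample i, $x_{ij}$ is the expression level of gene j in sample i, and $t \ge 0$ is the bound that controls the amount of shrinkage.

```latex
\[
\min_{\beta_0,\,\beta_1,\ldots,\beta_G} \;
\sum_{i=1}^{E} \Bigl( y_i - \beta_0 - \sum_{j=1}^{G} x_{ij}\,\beta_j \Bigr)^{2}
\quad \text{subject to} \quad
\sum_{j=1}^{G} \lvert \beta_j \rvert \le t .
\]
```

A small bound t forces many coefficients to be exactly zero, which is how the method performs variable selection and shrinkage at the same time.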
3.10 Publicly Available Resources
The CML progression microarray data [4] are publicly available at the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/) with accession number GSE4170. The software implementation of BMA is available as an R package at http://www.r-project.org. iBMA is available as a Bioconductor package called iterativeBMA at http://www.bioconductor.org.
Acknowledgments
The author would like to thank Drs. Roger Bumgarner, Vivian Oehler, Jerry Radich, and Adrian Raftery for their scientific contributions to the research projects described in this chapter. This work is supported by NIH grants R01GM084163 and 3R01GM084163-02S2.
References
1 Deininger, M.W., Goldman, J.M., and Melo, J.V. (2000) The molecular biology of chronic myeloid leukemia. Blood, 96, 3343–3356.
2 Druker, B.J., Talpaz, M., Resta, D.J., Peng, B., Buchdunger, E., Ford, J.M., Lydon, N.B., Kantarjian, H., Capdeville, R., Ohno-Jones, S., and Sawyers, C.L. (2001) Efficacy and safety of a specific inhibitor of the BCR-ABL tyrosine kinase in chronic myeloid leukemia. N. Engl. J. Med., 344, 1031–1037.
3 Faderl, S., Talpaz, M., Estrov, Z., and Kantarjian, H.M. (1999) Chronic myelogenous leukemia: biology and therapy. Ann. Intern. Med., 131, 207–219.
4 Radich, J.P., Dai, H., Mao, M., Oehler, V., Schelter, J., Druker, B., Sawyers, C., Shah, N., Stock, W., Willman, C.L., Friend, S., and Linsley, P.S. (2006) Gene expression changes associated with progression and response in chronic myeloid leukemia. Proc. Natl. Acad. Sci., 103, 2794–2799.
5 Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., and Levine, A.J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci., 96, 6745–6750.
6 Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
7 van’t Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R., and Friend, S.H. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536.
8 Dudoit, S., Fridlyand, J., and Speed, T.P. (2002) Comparison of discrimination methods for the classification of tumors in gene expression data. J. Am. Stat. Assoc., 97, 77–87.
9 Nguyen, D.V. and Rocke, D.M. (2002) Tumor classification by partial least squares using microarray gene expression profiles. Bioinformatics, 18, 39–50.
10 Saeys, Y., Inza, I., and Larranaga, P. (2007) A review of feature selection techniques in bioinformatics. Bioinformatics, 23, 2507–2517.
11 Hoeting, J.A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999) Bayesian model averaging: a tutorial. Stat. Sci., 14, 382–401.
12 Raftery, A.E. (1995) Bayesian model selection in social research (with discussion). Sociol. Methodol., 25, 111–193.
13 Yeung, K.Y., Bumgarner, R.E., and Raftery, A.E. (2005) Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics, 21, 2394–2402.
14 Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., and Golub, T.R. (2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci., 98, 15149–15154.
15 Nguyen, D.V. and Rocke, D.M. (2002) Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18, 39–50.
16 Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci., 99, 6567–6572.
17 Hosmer, D.W. and Lemeshow, S. (2000) Applied Logistic Regression, John Wiley & Sons, New York.
18 Furnival, G.M. and Wilson, R.W. (1974) Regression by leaps and bounds. Technometrics, 16, 499–511.
19 Brier, G.W. (1950) Verification of forecasts expressed in terms of probability. Mon. Weather Rev., 78, 1–3.
20 Oehler, V.G., Yeung, K.Y., Choi, Y.E., Bumgarner, R.E., Raftery, A.E., and Radich, J.P. (2009) The derivation of diagnostic markers of chronic myeloid leukemia progression from microarray data. Blood, 114, 3292–3298.
21 Raftery, A.E. and Zheng, Y. (2003) Discussion: performance of Bayesian model averaging. J. Am. Stat. Assoc., 98, 931–938.
22 Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B, 58, 267–288.
23 Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B, 67, 301–320.
24 Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004) Least angle regression. Ann. Stat., 32, 407–499.
25 Friedman, J., Hastie, T., and Tibshirani, R. (2010) Regularization paths for generalized linear models via coordinate descent. J. Stat. Software, 33, 1–22.
26 Ein-Dor, L., Kela, I., Getz, G., Givol, D., and Domany, E. (2005) Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21, 171–178.
27 Ein-Dor, L., Zuk, O., and Domany, E. (2006) Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci., 103, 5923–5928.
28 Yeung, K.Y., Gooley, T.A., Zhang, A., Raftery, A.E., Radich, J.P., and Oehler, V.G. (2012) Predicting relapse prior to transplantation in chronic myeloid leukemia by integrating expert knowledge and expression data. Bioinformatics, 28, 823–830.
4 Bayesian Ranking and Selection Methods in Microarray Studies
Hisashi Noma and Shigeyuki Matsui
4.1 Brief Summary
One of the main purposes of microarray studies is screening of differentially expressed genes among different clinical subtypes or prognostic classes of diseases as candidates for further investigation. Because of limited resources in the genome-wide screening, prioritizing genes and quantitative evaluation of selection accuracy are relevant statistical tasks. In this chapter, we review the Bayesian methods for gene ranking based on hierarchical mixture models. The hierarchical mixture models incorporate the differential and nondifferential components and allow information borrowing across differential genes with separation from nuisance, nondifferential genes. We also provide a method for evaluating accuracy in selecting differential genes. Numerical evaluations via simulations and an application to a lung cancer clinical study are presented.
4.2 Introduction
Many advances in modern cancer research have come from attempts to characterize diseases at a molecular level, namely that of genes. High-throughput DNA microarrays, which allow simultaneous measurement of the expression levels of thousands of genes, or even an entire genome, have the potential to be useful in elucidating disease biology and aggressiveness, identifying new therapeutic targets, and developing new molecular diagnostics for optimized medicine for individual patients. For general background and statistical aspects of microarray studies, see references [1–7]. One of the main purposes of genome-wide studies using microarrays is the screening of differentially expressed genes among different phenotypes, such as clinical subtypes and prognostic classes of disease, for further investigation. Due
Statistical Diagnostics for Cancer: Analyzing High-Dimensional Data, First Edition. Edited by Frank Emmert-Streib and Matthias Dehmer. # 2013 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2013 by Wiley-VCH Verlag GmbH & Co. KGaA.
to the high dimensionality of microarray data, false findings are a serious concern, and many researchers have therefore focused on controlling false positives in the framework of multiple testing, in particular on controlling the false discovery rate (FDR) [8–10]. However, multiple testing methodologies are designed only to ensure control of false positives for a set of significant genes. Because the number of genes that can be investigated in subsequent studies is generally limited, irrespective of the number of significant genes, prioritizing or ranking genes could be a more relevant statistical output. As a gene ranking based on the magnitude of the association of gene expression with the clinical phenotype, or the effect size, the fold change has been widely used; it corresponds to the ratio or difference of mean expression levels between different clinical subtype classes [4, 11, 12]. Some researchers have reported that gene ranking based on the fold change is reproducible (e.g., MAQC Consortium [13]). However, gene ranking based on gene-by-gene statistics, such as the fold change, is not necessarily accurate, particularly in small or moderate sample settings [14, 15]. Here, accuracy refers to accuracy in selecting differentially expressed genes or in selecting genes with the largest effect sizes. Accuracy can override reproducibility: in general, a reproducible gene ranking is not necessarily accurate. Bayesian approaches have been widely applied in exploratory clinical and epidemiological studies for hypothesis generation that investigate a large number of associations under limited sample sizes [16–19]. They are also expected to be efficient for high-dimensional microarray data. In particular, empirical Bayes approaches are expected to be efficient by “borrowing strength” across genes. Furthermore, a decision theoretic formulation can provide optimal ranking and selection rules for gene selection.
We consider two criteria on which genes are ranked: (i) selecting genes with the largest effect sizes, and (ii) selecting differentially expressed genes (minimizing false positives and maximizing true positives). In this chapter, we consider the Bayesian optimal ranking and selection methods based on these two criteria. An important characteristic of microarray data is that a large proportion of the genes investigated are nondifferential. Incorporating this structure allows information sharing across differential genes, separated from the nuisance, nondifferential genes. We employ a hierarchical mixture model, in which the prior distribution is a mixture of differential and nondifferential components [20–25]. Within the Bayesian scheme, this model allows quantitative evaluation of the degree of false positives in gene selection, like the FDR in multiple testing. This chapter is organized as follows: we present the framework of hierarchical mixture modeling and empirical Bayes inference in Section 4.3, and describe Bayesian optimal ranking and selection methods in Section 4.4. In Section 4.5, we compare the proposed methods with other widely used methods through simulations. Finally, we provide an application to a lung cancer clinical study in Section 4.6. Discussion and concluding remarks appear in Section 4.7, and an outlook on these themes is provided in Section 4.8.
4.3 Hierarchical Mixture Modeling and Empirical Bayes Estimation
The gene expression data considered here comprise normalized log ratios from two-color cDNA arrays or normalized log signals from oligonucleotide arrays (e.g., Affymetrix GeneChip). We consider a two-class comparison problem, for example, a comparison of poor and good prognosis classes on the basis of the expression levels of $m$ candidate genes from $n$ samples. For gene $j$ ($j = 1, \ldots, m$), let $\theta_j$ be the parameter of interest, that is, the difference in the mean expression level between the two classes. As an estimator of $\theta_j$, let $Y_j$ be the fold change, which is the difference in the sample mean expression levels obtained from the $n$ samples [11, 12]. We consider a two-stage model with i.i.d. sampling from a three-component mixture prior and from a normal gene-specific sampling model:

$$
\begin{aligned}
Y_j \mid \theta_j &\sim N(\theta_j, \sigma_j^2) \\
\theta_j &\sim \pi_0\,\delta(\theta) + \pi_1\,g_1(\theta \mid \xi_1) + \pi_2\,g_2(\theta \mid \xi_2)
\end{aligned}
\tag{4.1}
$$
Here, $\delta(\theta)$ is the Dirac delta function, representing nondifferential expression between the two classes. The density functions $g_1(\theta \mid \xi_1)$ and $g_2(\theta \mid \xi_2)$ correspond to differential expression in the two directions, the components of underexpression and overexpression, respectively, for a particular class, for example, poor prognosis. We assume natural conjugate normal distributions $N(\mu_1, \tau_1^2)$ and $N(\mu_2, \tau_2^2)$ ($\mu_1 > 0$, $\mu_2 < 0$) for $g_1(\theta \mid \xi_1)$ and $g_2(\theta \mid \xi_2)$, respectively. The proportion $\pi_i$ represents the mixing proportion ($i = 0, 1, 2$), where $\pi_0 + \pi_1 + \pi_2 = 1$. In the first-stage model in (4.1), we assume that the gene-specific variance $\sigma_j^2$ is known. We denote by $\psi_{ij}$ ($i = 0, 1, 2$; $j = 1, 2, \ldots, m$) unobservable indicator random variables, such that $\psi_{ij} = 1$ if gene $j$ belongs to the $i$th component, and $\psi_{ij} = 0$ otherwise. The $\pi_i$ ($i = 0, 1, 2$) corresponds to the probability that $\psi_{ij} = 1$. The number of differentially expressed genes can therefore be defined as

$$
\nu = \sum_{j=1}^{m} (\psi_{1j} + \psi_{2j})
$$

such that $\pi_1 + \pi_2 = \nu/m$. In empirical Bayes methods, the prior distribution is estimated from the data of exchangeable units (genes in our model). This approach can improve the efficiency of inference from the frequency perspective by sharing information across the exchangeable units. Many researchers have applied these methods to microarray data [20–25]. The hyperparameter $\gamma = (\pi_0, \pi_1, \pi_2, \mu_1, \tau_1^2, \mu_2, \tau_2^2)$ can be obtained by maximizing the marginal likelihood of $Y_j$ [15], which is obtained by integrating over $\theta_j$ in the joint distribution of $(Y_j, \theta_j)$:

$$
\begin{aligned}
h_{0j}(y \mid \gamma) &= \int \varphi(y \mid \theta_j, \sigma_j^2)\,\delta(\theta_j)\,d\theta_j = \varphi(y \mid 0, \sigma_j^2) \\
h_{ij}(y \mid \gamma) &= \int \varphi(y \mid \theta_j, \sigma_j^2)\,\varphi(\theta_j \mid \mu_i, \tau_i^2)\,d\theta_j = \varphi(y \mid \mu_i, \sigma_j^2 + \tau_i^2) \quad (i = 1, 2)
\end{aligned}
$$

where $\varphi(\cdot \mid \mu, \sigma^2)$ is the density function of $N(\mu, \sigma^2)$, and $h_{0j}$, $h_{1j}$, and $h_{2j}$ are the marginal densities of the $Y_j$'s in each component, that is, $N(0, \sigma_j^2)$, $N(\mu_1, \sigma_j^2 + \tau_1^2)$, and $N(\mu_2, \sigma_j^2 + \tau_2^2)$, respectively. We employ the expectation maximization (EM) algorithm [26] to cope with the unobservable indicator variables $\psi_{ij}$ in the mixture model. Details of the EM algorithm are described in the appendix.
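The appendix is not reproduced here, so the following is only a minimal sketch of how an EM iteration for the hyperparameter could look, under the assumption that both the component indicators and the gene-specific effects are treated as missing data. The function names, starting values, and toy data are illustrative, and the sign constraints on the two component means are not enforced.

```python
import math
import random

def norm_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(y, s2, pi, mu, tau2):
    """One EM update of (pi, mu, tau2). Also returns the log marginal
    likelihood evaluated at the *input* hyperparameters."""
    w_sum = [0.0, 0.0, 0.0]   # summed membership probabilities per component
    first = [0.0, 0.0]        # weighted sums of E[theta_j | component i, y_j]
    second = [0.0, 0.0]       # weighted sums of E[theta_j^2 | component i, y_j]
    loglik = 0.0
    for yj, vj in zip(y, s2):
        # E-step: weighted marginal densities pi_i * h_ij(y_j)
        h = [pi[0] * norm_pdf(yj, 0.0, vj)]
        h += [pi[i + 1] * norm_pdf(yj, mu[i], vj + tau2[i]) for i in range(2)]
        total = sum(h)
        loglik += math.log(total)
        for i in range(3):
            w_sum[i] += h[i] / total
        for i in range(2):
            w = h[i + 1] / total
            post_mean = (tau2[i] * yj + vj * mu[i]) / (tau2[i] + vj)
            post_var = tau2[i] * vj / (tau2[i] + vj)
            first[i] += w * post_mean
            second[i] += w * (post_mean ** 2 + post_var)
    # M-step: closed-form maximizers of the expected complete-data log-likelihood
    m = len(y)
    new_pi = tuple(w / m for w in w_sum)
    new_mu = tuple(first[i] / w_sum[i + 1] for i in range(2))
    new_tau2 = tuple(second[i] / w_sum[i + 1] - new_mu[i] ** 2 for i in range(2))
    return new_pi, new_mu, new_tau2, loglik

# Toy data drawn from the model with made-up hyperparameters (illustration only)
rng = random.Random(1)
y, s2 = [], []
for _ in range(500):
    u = rng.random()
    theta = 0.0 if u < 0.8 else rng.gauss(0.5 if u < 0.9 else -0.5, 0.1)
    var = 0.01 + 0.02 * rng.random()
    s2.append(var)
    y.append(rng.gauss(theta, math.sqrt(var)))

pi_hat, mu_hat, tau2_hat = (0.6, 0.2, 0.2), (0.3, -0.3), (0.05, 0.05)
history = []
for _ in range(25):
    pi_hat, mu_hat, tau2_hat, loglik = em_step(y, s2, pi_hat, mu_hat, tau2_hat)
    history.append(loglik)
print(pi_hat, mu_hat, tau2_hat)
```

Because each step is an exact EM update for this complete-data formulation, the recorded log marginal likelihood should not decrease from one iteration to the next, which gives a simple check on an implementation.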
4.4 Ranking and Selection Methods
We suppose that top-ranked genes are selected for further investigation in subsequent studies, where the number of selected genes is prespecified as $K$ (e.g., 100, 200) owing to limited resources. We consider Bayesian optimal ranking and selection methods for the selected $K$ genes under an explicitly specified criterion. In Bayesian decision theory, the optimal decision rule based on a loss function $L(\theta, a)$ is the $a(Y)$ that minimizes the posterior Bayes risk [17, 27],

$$
\mathrm{Risk}(a(Y), Y) = E_{\theta \mid Y}[L(\theta, a(Y)) \mid Y]
$$

where the posterior distribution is

$$
p_j(\theta \mid y_j) = \Pr(\psi_{0j} = 1 \mid y_j)\,p_{0j}(\theta \mid y_j) + \Pr(\psi_{1j} = 1 \mid y_j)\,p_{1j}(\theta \mid y_j) + \Pr(\psi_{2j} = 1 \mid y_j)\,p_{2j}(\theta \mid y_j)
$$

where $p_{0j}$, $p_{1j}$, and $p_{2j}$ are the posterior densities of each component, which are obtained as $\delta(\theta)$ and

$$
N\!\left(\frac{\tau_i^2 y_j + \sigma_j^2 \mu_i}{\tau_i^2 + \sigma_j^2},\; \frac{\tau_i^2 \sigma_j^2}{\tau_i^2 + \sigma_j^2}\right) \quad (i = 1, 2)
$$

respectively. The posterior probability that gene $j$ belongs to the $i$th component ($i = 0, 1, 2$) is expressed as

$$
\upsilon_{ij} = \Pr(\psi_{ij} = 1 \mid y_j) = \frac{\pi_i\,h_{ij}(y_j \mid \xi_i)}{\pi_0\,h_{0j}(y_j) + \pi_1\,h_{1j}(y_j \mid \xi_1) + \pi_2\,h_{2j}(y_j \mid \xi_2)}
\tag{4.2}
$$
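As a concrete illustration of the membership probabilities in (4.2) and of the conjugate component posteriors, the sketch below evaluates both for a single gene. The function names and the hyperparameter values are hypothetical and chosen only for illustration; they are not estimates from any data set.

```python
import math

def norm_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def membership(y, s2, pi, mu, tau2):
    """Posterior probabilities that a gene is nondifferential, in component 1,
    or in component 2, given its fold change y and known variance s2."""
    h = [pi[0] * norm_pdf(y, 0.0, s2),
         pi[1] * norm_pdf(y, mu[0], s2 + tau2[0]),
         pi[2] * norm_pdf(y, mu[1], s2 + tau2[1])]
    total = sum(h)
    return [v / total for v in h]

def component_posterior(y, s2, mu_i, tau2_i):
    """Mean and variance of the effect given y and membership in a
    differential component with prior N(mu_i, tau2_i)."""
    mean = (tau2_i * y + s2 * mu_i) / (tau2_i + s2)
    var = tau2_i * s2 / (tau2_i + s2)
    return mean, var

# Illustrative hyperparameters (assumed values)
pi, mu, tau2 = (0.90, 0.05, 0.05), (0.10, -0.10), (0.05 ** 2, 0.05 ** 2)
for y in (0.0, 0.15, -0.30):
    print(y, [round(u, 3) for u in membership(y, 0.01, pi, mu, tau2)])
print(component_posterior(0.3, 0.01, 0.10, 0.0025))
```

The component posterior mean is a precision-weighted compromise between the observed fold change and the prior mean, so noisier genes are pulled more strongly toward the prior mean.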
In this section, we review the Bayesian optimal ranking and selection methods for an adequate criterion or loss function. In Section 4.4.1, we consider the selection of genes with the greatest absolute effect sizes $|\theta_j|$. In Section 4.4.2, we consider the selection of differentially expressed genes, that is, minimizing false positives and maximizing true positives in the selected $K$ genes. In addition, in Section 4.4.2.2, we provide an index of false positives in the Bayesian ranking framework with a fixed number $K$ of selected genes.
4.4.1 Ranking Based on Effect Sizes
This subsection discusses methods of gene ranking based on effect sizes, that is, the absolute values $|\theta_j|$, that are optimal for given criteria or loss functions.
4.4.1.1 Posterior Mean (PM)
One simple approach is to rank genes based on their estimates of the $\theta_j$. The Bayes estimator of $\theta_j$ that minimizes the squared error loss function,

$$
L_{SEL}(\theta_j, \theta_j^{est}) = (\theta_j^{est} - \theta_j)^2
$$

is the posterior mean [17]. For the hierarchical mixture model, the posterior mean is obtained as

$$
E[\theta_j \mid y_j] = \upsilon_{0j}\,E[\theta_j \mid \psi_{0j} = 1, y_j] + \upsilon_{1j}\,E[\theta_j \mid \psi_{1j} = 1, y_j] + \upsilon_{2j}\,E[\theta_j \mid \psi_{2j} = 1, y_j]
\tag{4.3}
$$
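To make the effect of (4.3) on a ranking concrete, the following sketch computes the posterior mean for two hypothetical genes. All numerical values are assumptions chosen for illustration: gene A has a large but very noisy fold change, gene B a smaller but precisely measured one.

```python
import math

def norm_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_mean(y, s2, pi, mu, tau2):
    """Posterior mean of the effect under the three-component mixture.
    The nondifferential component contributes zero to the numerator."""
    total = pi[0] * norm_pdf(y, 0.0, s2)
    numerator = 0.0
    for i in range(2):
        h = pi[i + 1] * norm_pdf(y, mu[i], s2 + tau2[i])
        cond_mean = (tau2[i] * y + s2 * mu[i]) / (tau2[i] + s2)
        numerator += h * cond_mean
        total += h
    return numerator / total

# Illustrative hyperparameters (assumed values)
pi, mu, tau2 = (0.90, 0.05, 0.05), (0.10, -0.10), (0.0025, 0.0025)

pm_a = posterior_mean(0.50, 0.25, pi, mu, tau2)     # large fold change, large variance
pm_b = posterior_mean(0.25, 0.0025, pi, mu, tau2)   # smaller fold change, small variance
print(pm_a, pm_b)
```

Under these assumed values the raw fold change ranks gene A above gene B, whereas the posterior mean reverses the order, because most of the posterior mass for gene A stays on the nondifferential component.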
The posterior mean (4.3) is the weighted average of the posterior means for the individual components, where the weights are the posterior probabilities of component membership (4.2). The form of (4.3) indicates that gene ranking based on this statistic incorporates the uncertainty regarding the membership of gene $j$. Here, we refer to the ranking based on the magnitude of the posterior means as the “PM” method.
4.4.1.2 Rank Posterior Mean (RPM)
Although the posterior means of the $\theta_j$ are optimal as estimators of the gene-specific parameters, the PM method is not necessarily optimal if the ranks based on the $\theta_j$ are the target parameters. Different ranking methods can be derived from different loss functions. Laird and Louis [28] considered a squared error loss function for the difference between the estimated and true ranks. We define the ranking parameters for differential genes as

$$
\begin{aligned}
R_j = {} & \psi_{1j} \sum_{i=1}^{m} \{\psi_{1i}\,I(\theta_i \le \theta_j) + \psi_{2i}\,I(-\theta_i \le \theta_j)\} \\
& + \psi_{2j} \sum_{i=1}^{m} \{\psi_{1i}\,I(\theta_i \le -\theta_j) + \psi_{2i}\,I(-\theta_i \le -\theta_j)\}
\end{aligned}
$$

which corresponds to the ranks based on the absolute values of $\theta_j$ in a descending order. For a gene $j$ that belongs to the nondifferential component, $R_j$ is defined to be 0. Based on the squared error loss function for estimating the rank parameter $R_j$,

$$
L_{SEL}(R_j, R_j^{est}) = (R_j^{est} - R_j)^2
$$

where $R_j^{est}$ is an estimator of $R_j$, the Bayesian estimator of $R_j$ is obtained as its posterior mean:

$$
\begin{aligned}
\hat{R}_j = E[R_j \mid y] = {} & \upsilon_{1j} \sum_{i=1}^{m} \{\upsilon_{1i} \Pr(\theta_i \le \theta_j \mid y_i, y_j, \psi_{1i} = 1, \psi_{1j} = 1) \\
& \qquad\quad + \upsilon_{2i} \Pr(-\theta_i \le \theta_j \mid y_i, y_j, \psi_{2i} = 1, \psi_{1j} = 1)\} \\
& + \upsilon_{2j} \sum_{i=1}^{m} \{\upsilon_{1i} \Pr(\theta_i \le -\theta_j \mid y_i, y_j, \psi_{1i} = 1, \psi_{2j} = 1) \\
& \qquad\quad + \upsilon_{2i} \Pr(-\theta_i \le -\theta_j \mid y_i, y_j, \psi_{2i} = 1, \psi_{2j} = 1)\}
\end{aligned}
$$
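The posterior expected rank can be evaluated in closed form because, given the hyperparameters, the component posteriors of different genes are independent normals. The sketch below is one possible implementation under that assumption. It handles the term for a gene compared with itself by counting the gene once with weight equal to its probability of being differential, which is a choice made here rather than something stated in the text. Function names and numerical values are illustrative.

```python
import math

def norm_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def std_norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rank_posterior_mean(y, s2, pi, mu, tau2):
    """Posterior expected rank of each gene among the differential genes
    (a larger value means a larger absolute effect)."""
    sign = (1.0, -1.0)   # orient component 2 so that larger means stronger
    ups, oriented = [], []
    for yj, vj in zip(y, s2):
        h = [pi[0] * norm_pdf(yj, 0.0, vj)]
        h += [pi[k + 1] * norm_pdf(yj, mu[k], vj + tau2[k]) for k in range(2)]
        total = sum(h)
        ups.append((h[1] / total, h[2] / total))
        oriented.append([(sign[k] * (tau2[k] * yj + vj * mu[k]) / (tau2[k] + vj),
                          tau2[k] * vj / (tau2[k] + vj)) for k in range(2)])
    ranks = []
    m = len(y)
    for j in range(m):
        r = ups[j][0] + ups[j][1]        # gene j compared with itself
        for i in range(m):
            if i == j:
                continue
            for l in range(2):           # component of gene j
                for k in range(2):       # component of gene i
                    mean_j, var_j = oriented[j][l]
                    mean_i, var_i = oriented[i][k]
                    prob = std_norm_cdf((mean_j - mean_i) / math.sqrt(var_i + var_j))
                    r += ups[j][l] * ups[i][k] * prob
        ranks.append(r)
    return ranks

# Illustrative inputs (assumed values): three clearly differential genes, one near zero
pi, mu, tau2 = (0.8, 0.1, 0.1), (0.3, -0.3), (0.04, 0.04)
y = [0.5, 0.3, -0.4, 0.0]
s2 = [0.0025] * 4
r_hat = rank_posterior_mean(y, s2, pi, mu, tau2)
print([round(r, 2) for r in r_hat])
```

The double loop costs on the order of $m^2$ evaluations, so for genome-wide data a vectorized or approximate computation would be needed in practice.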
The Bayesian estimator is shrunken toward the mid-rank $(\nu + 1)/2$ for differential genes, and is generally not an integer. Optimal integer ranks are produced by ranking based on $\hat{R}_j$ [14, 15, 28–30]. Here, we call this statistic “RPM,” the posterior mean of the rank parameter. For an ascending ordering, the rank statistic is transformed to $\tilde{R}_j = \nu - R_j + 1$, and the Bayesian estimator is its posterior mean, $\hat{\tilde{R}}_j = \hat{\nu} - \hat{R}_j + 1$, where $\hat{\nu} = \sum_{j=1}^{m} (\upsilon_{1j} + \upsilon_{2j})$.
4.4.1.3 Tail-Area Posterior Probability (TPP)
Because of the limited resources available in subsequent studies, the number of selected genes, $K$, may be prespecified as a small number [31–33]. In this case, the $K$ genes with the greatest absolute values $|\theta_j|$ are natural targets in gene selection. Lin et al. [14] discussed these selection problems within general Bayesian theory. They provided classification loss functions for top-ranked units and derived optimal ranking and selection rules based on Bayesian decision theory. We adopt their rank-based misclassification loss function, which penalizes equally the two kinds of misclassification between the true top $K$ ranked genes and the other genes:

$$
L_{MC}(K; R, R^{est}) = \frac{1}{m} \sum_{j=1}^{m} \{FP(K; R_j, R_j^{est}) + FN(K; R_j, R_j^{est})\}
$$

where

$$
\begin{aligned}
FP(K; R_j, R_j^{est}) &= I\{R_j \le \nu - K,\; R_j^{est} > \nu - K\} \\
FN(K; R_j, R_j^{est}) &= I\{R_j > \nu - K,\; R_j^{est} \le \nu - K\}
\end{aligned}
$$

The derived optimal rule is to select the $K$ genes with the largest values of

$$
L_j(K) = \Pr(R_j > (\nu - K) \mid y)
$$
However, it is difficult to obtain the posterior distribution of $R_j$ and to calculate $L_j(K)$ analytically. Although the calculation can be conducted via numerical integration, we can instead use a simple computable approximation of $L_j(K)$ [14, 15]. Define $c_K = K/(\nu + 1)$ and let $G_1$ and $G_2$ be the cumulative distribution functions of $g_1$ and $g_2$. The approximation of $L_j(K)$ is obtained as

$$
L_j(K) \approx \upsilon_{1j} \Pr(\theta_j \ge \bar{G}_1^{-1}(1 - c_K) \mid \psi_{1j} = 1, y_j) + \upsilon_{2j} \Pr(\theta_j \le \bar{G}_2^{-1}(c_K) \mid \psi_{2j} = 1, y_j)
$$

where

$$
\bar{G}_1(t) = \frac{\sum_{j=1}^{m} \upsilon_{1j} \Pr(\theta_j \le t \mid \psi_{1j} = 1, y_j)}{\sum_{j=1}^{m} \upsilon_{1j}}, \qquad
\bar{G}_2(t) = \frac{\sum_{j=1}^{m} \upsilon_{2j} \Pr(\theta_j \le t \mid \psi_{2j} = 1, y_j)}{\sum_{j=1}^{m} \upsilon_{2j}}
$$

This approximation of $L_j(K)$ corresponds to the tail-area posterior probability of $\theta_j$. Here, we call this rule the “TPP (tail-area posterior probability)” method. The quantity $\nu$ can be replaced by its estimator $\hat{\nu}$. Additionally, $\bar{G}_1$ and $\bar{G}_2$ themselves are the Bayes estimators of the effect size distributions $G_1$ and $G_2$ [30], and thus can provide further useful information for the screening task.
4.4.2 Ranking Based on Selection Accuracy of Differential Genes
4.4.2.1 Posterior Probability of Differentially Expressed (PPDE)
Another criterion is the selection of differentially expressed genes, that is, minimizing false positives (selected nondifferential genes) in the set of selected $K$ genes irrespective of their effect sizes or, in other words, maximizing true positives among the $K$ genes. This framework is typically considered in multiple testing settings to improve the average power [34, 35]. From the Bayesian perspective, the classical decision theoretic formulation provides an explicit selection rule as follows. For minimizing false positives, a misclassification loss function similar to that considered in Section 4.4.1.3 can be applied:

$$
L_{MC}(\psi, \psi^{est}) = \frac{1}{m} \sum_{j=1}^{m} \{FP(\psi_{0j}, \psi_{0j}^{est}) + FN(\psi_{0j}, \psi_{0j}^{est})\}
$$

where $\psi_{0j}^{est}$ is an estimator of $\psi_{0j}$ and

$$
\begin{aligned}
FP(\psi_{0j}, \psi_{0j}^{est}) &= I\{\psi_{0j} = 1,\; \psi_{0j}^{est} = 0\} \\
FN(\psi_{0j}, \psi_{0j}^{est}) &= I\{\psi_{0j} = 0,\; \psi_{0j}^{est} = 1\}
\end{aligned}
$$

Minimizing the posterior expected loss function, the Bayes rule is derived explicitly as selecting the $K$ genes with the largest values of

$$
\Upsilon_j = \upsilon_{1j} + \upsilon_{2j}
$$
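The two selection statistics introduced so far, the tail-area approximation of Section 4.4.1.3 and the posterior probability of being differential, can be computed side by side. The sketch below is a minimal illustration, assuming normal component posteriors, the plug-in estimate of the number of differential genes, and a simple bisection search to invert the averaged posterior distribution functions. Function names and numerical inputs are hypothetical.

```python
import math

def norm_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(x, mean, var):
    """Cumulative distribution function of N(mean, var) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def tpp_and_ppde(y, s2, pi, mu, tau2, K):
    """Tail-area posterior probability (approximation) and PPDE per gene."""
    ups, post = [], []
    for yj, vj in zip(y, s2):
        h = [pi[0] * norm_pdf(yj, 0.0, vj)]
        h += [pi[i + 1] * norm_pdf(yj, mu[i], vj + tau2[i]) for i in range(2)]
        total = sum(h)
        ups.append((h[1] / total, h[2] / total))
        post.append([((tau2[i] * yj + vj * mu[i]) / (tau2[i] + vj),
                      tau2[i] * vj / (tau2[i] + vj)) for i in range(2)])
    ppde = [u1 + u2 for u1, u2 in ups]
    nu_hat = sum(ppde)                   # plug-in number of differential genes
    c_k = K / (nu_hat + 1.0)
    if not 0.0 < c_k < 1.0:
        raise ValueError("K must be smaller than the estimated number of differential genes + 1")

    def g_bar(t, i):
        """Membership-weighted average of the component-i posterior CDFs."""
        num = sum(u[i] * norm_cdf(t, p[i][0], p[i][1]) for u, p in zip(ups, post))
        return num / sum(u[i] for u in ups)

    means = [p[i][0] for p in post for i in range(2)]
    max_sd = max(math.sqrt(p[i][1]) for p in post for i in range(2))

    def g_bar_inv(prob, i):
        """Invert g_bar by bisection on a bracket covering all posteriors."""
        lo, hi = min(means) - 10.0 * max_sd, max(means) + 10.0 * max_sd
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if g_bar(mid, i) < prob:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    q1 = g_bar_inv(1.0 - c_k, 0)         # upper-tail threshold, component 1
    q2 = g_bar_inv(c_k, 1)               # lower-tail threshold, component 2
    tpp = [u[0] * (1.0 - norm_cdf(q1, p[0][0], p[0][1]))
           + u[1] * norm_cdf(q2, p[1][0], p[1][1])
           for u, p in zip(ups, post)]
    return tpp, ppde

# Illustrative inputs (assumed values)
pi, mu, tau2 = (0.8, 0.1, 0.1), (0.3, -0.3), (0.04, 0.04)
y = [0.5, 0.3, -0.4, 0.0, 0.05, -0.02]
s2 = [0.0025] * 6
tpp, ppde = tpp_and_ppde(y, s2, pi, mu, tau2, K=1)
print([round(v, 3) for v in tpp])
print([round(v, 3) for v in ppde])
```

In this toy example the second gene is almost certainly differential, so its PPDE is close to 1, yet its tail-area probability is close to 0 because its effect is unlikely to be among the very largest. This is the sense in which the two criteria can rank the same gene very differently.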
Therefore, an optimal gene ranking for this criterion can be obtained by ranking the genes according to $\Upsilon_j$, the posterior probability of being differentially expressed (PPDE), from largest to smallest. This result is well known as a Bayes solution of the classical optimal selection problem [27], and has also been discussed in the microarray literature [24, 36–38]. From the frequentist perspective, an optimal decision rule named the “optimal discovery procedure (ODP)” was developed by Storey [34], analogous to the most powerful test in single significance testing [39]. The ODP is a decision rule in multiple testing that maximizes the expected true positives under fixed expected false positives (EFP). Consider the two-class comparison problem in Section 4.3 and a fixed-effects model for the fold change $Y_j$ ($j = 1, 2, \ldots, m$):

$$
Y_j \mid \theta_j \sim N(\theta_j, \sigma_j^2)
$$

where $\sigma_j^2$ is a known variance of $Y_j$, and we suppose homogeneous variances, such that $\sigma_j^2 = \sigma^2\ (= 1)$ for $j = 1, 2, \ldots, m$, for gene expression levels divided by the gene-specific within-class standard deviations of the individual genes. Note that $\theta_j$ is the true difference in the mean expression level between the two classes in the
frequentist sense, and not a random parameter as in (4.1). We consider $m$ multiple hypothesis tests for

$$
H_{0j}: \theta_j = 0 \quad \text{vs.} \quad H_{1j}: \theta_j \ne 0 \qquad (j = 1, 2, \ldots, m)
$$

Here, without loss of generality, suppose that the null hypothesis is true for tests $j = 1, \ldots, m_0$, and the alternative is true for $j = m_0 + 1, \ldots, m$. Then, applying Storey's [34] result, the test statistic that achieves the ODP criterion is

$$
S_{ODP}(y) = \frac{\varphi(y \mid \theta_{m_0+1}, \sigma^2) + \varphi(y \mid \theta_{m_0+2}, \sigma^2) + \cdots + \varphi(y \mid \theta_m, \sigma^2)}{m_0\,\varphi(y \mid 0, \sigma^2)}
$$

Namely, for a fixed cutoff $\lambda$ ($0 \le \lambda < \infty$) chosen to attain an acceptable EFP level, the null hypothesis for test $j$ is rejected if and only if $S_{ODP}(y_j) \ge \lambda$. This result means that the ODP statistic $S_{ODP}(y)$ provides an optimal ranking for Storey's criterion. More recently, Noma and Matsui [35] also derived the ODP statistic under the empirical Bayes framework based on hierarchical Bayesian models like (4.1). Adapting their results to the hierarchical model (4.1), the ODP statistic becomes

$$
\begin{aligned}
R_{ODP}(y) &= \frac{\pi_1 \int \varphi(y \mid \theta_j, \sigma^2)\,\varphi(\theta_j \mid \mu_1, \tau_1^2)\,d\theta_j + \pi_2 \int \varphi(y \mid \theta_j, \sigma^2)\,\varphi(\theta_j \mid \mu_2, \tau_2^2)\,d\theta_j}{\pi_0 \int \varphi(y \mid \theta_j, \sigma^2)\,\delta(\theta_j)\,d\theta_j} \\
&= \frac{\pi_1\,h_{1j}(y \mid \mu_1, \tau_1^2) + \pi_2\,h_{2j}(y \mid \mu_2, \tau_2^2)}{\pi_0\,h_{0j}(y)}
\end{aligned}
$$

that is, the marginal likelihood ratio statistic. Again, for a fixed cutoff $\lambda$ ($0 \le \lambda < \infty$) chosen to attain an acceptable EFP level, the null hypothesis for test $j$ is rejected if and only if $R_{ODP}(y_j) \ge \lambda$. This result also means that ordering genes by the values of $R_{ODP}(y)$ maximizes the expected true positives. Note that, by estimating the hyperparameters under the hierarchical mixture model, the method based on $R_{ODP}(y)$ circumvents the problem of estimating the true status of each significance test (null or alternative) and the true probability distribution corresponding to each test, which arises in applying the method based on $S_{ODP}(y)$. Interestingly, the ODP statistic $R_{ODP}(y)$ corresponds to the posterior probability $\Upsilon_j$ in a one-to-one manner, since from (4.2) $\Upsilon_j = R_{ODP}(y_j)/\{1 + R_{ODP}(y_j)\}$, which is increasing in $R_{ODP}(y_j)$. Therefore, in the empirical Bayes framework, the Bayesian optimal rule derived from the misclassification loss function $L_{MC}$ accords with the ODP rule from the frequentist perspective.
4.4.2.2 Evaluating Selection Accuracy
In the screening tasks of microarray studies, the assessment of false positives in the selected gene set is particularly important. In the framework of multiple testing, where the number of selected genes (significant genes) is regarded as a random variable, the FDR [8–10] is widely used as a quantitative measure of false positives. A counterpart of the FDR in the framework of ranking and selection with a fixed number of selected genes, $K$, is the conditional FDR (cFDR; [40]):

$$
\mathrm{cFDR} = \frac{1}{K} \sum_{j \in J} \psi_{0j}
$$
where $J$ represents the set of indices for the selected $K$ genes. Because the cFDR accords exactly with the FDR under the Bayesian mixture models [40], we use the term FDR consistently in the following. In the framework of Bayesian selection rules, the FDR accords with the misclassification error rate of the differential/nondifferential classification [9, 10, 40]. Thus, an estimator of the FDR can be obtained as

$$
\widehat{\mathrm{FDR}} = \frac{1}{K} \sum_{j \in J} \upsilon_{0j}
$$

which can be regarded as an estimator of the misclassification rate in discriminant analysis [37]. This estimator performed well in numerical evaluations [4, 24, 37, 41]. Note that this estimator is valid for the $K$ genes selected by any ranking or selection rule, provided that the hierarchical mixture model (4.1) is correctly specified. For example, it can be used even for the top $K$ genes selected by the naive fold change ranking.
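Because the FDR estimator is simply the average posterior null probability over the selected genes, it can be attached to any ranking. The short sketch below illustrates this with made-up posterior null probabilities and two different rankings; the helper names are hypothetical.

```python
def top_k(scores, K):
    """Indices of the K genes with the largest scores."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:K]

def estimated_fdr(null_prob, selected):
    """Average posterior probability of being nondifferential over the selected genes."""
    return sum(null_prob[j] for j in selected) / len(selected)

# Made-up posterior null probabilities for five genes
null_prob = [0.01, 0.20, 0.90, 0.05, 0.60]

# Ranking by PPDE, which equals one minus the posterior null probability
ppde = [1.0 - p for p in null_prob]
fdr_ppde = estimated_fdr(null_prob, top_k(ppde, 3))

# Ranking by some other statistic, for example absolute fold changes (made up)
abs_fold_change = [0.1, 0.5, 0.9, 0.2, 0.8]
fdr_fc = estimated_fdr(null_prob, top_k(abs_fold_change, 3))

print(fdr_ppde, fdr_fc)
```

By construction, no other choice of three genes can give a smaller estimated FDR than the PPDE ranking, which mirrors the optimality of the PPDE rule for this criterion.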
4.5 Simulations
In this section, we describe simulation studies undertaken to evaluate the performance of the Bayesian ranking and selection methods. Simulation data were generated under the two-stage sampling model (4.1). For gene $j$, the expression data within each of the two classes were generated from a normal distribution, where the mean difference between classes was specified as $\theta_j$. Throughout the simulations, the number of genes, $m$, was set at 10 000. The mixing proportions were set to $\pi_0 = 0.90$ and $\pi_1 = \pi_2 = 0.05$. With respect to the parameters of the differential components $N(\mu_1, \tau_1^2)$ and $N(\mu_2, \tau_2^2)$, we considered $\mu_1 = -\mu_2 = 0.10, 0.20$ and standard deviation $\tau_1 = \tau_2 = 0.05$. The gene-specific variance was generated from a scaled $\chi^2$ distribution with one degree of freedom, multiplied by 0.40. These distributions were designed to cover situations in which the mean of the standardized mean difference in gene expression between the two classes across differential genes is approximately 0.30, which is typical of clinical studies with microarrays [42, 43]. We considered total sample sizes of $n = 20$, 40, or 60, with equal sizes ($= n/2$) for the two classes. We conducted 1200 simulations for each scenario. For assessing the accuracy of gene ranking, we considered the following seven ranking statistics: (a) the absolute values of the fold changes $|Y_j|$, (b) the two-sided P-values of the two-sample t-statistic, (c) the empirical Bayes estimates of $\theta_j$ assuming the conjugate normal–normal model without invoking the mixture structure (PMU), and the PM, RPM, TPP, and PPDE based on the hierarchical mixture model (4.1). Statistics (a) and (b) have been commonly used for gene ranking in microarray studies [11, 12]. Statistic (c), a well-known empirical Bayes estimator based on the simple natural conjugate model, is considered as a simple ranking statistic for microarray studies [44].
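A data-generating step of this kind could be sketched as follows. This is a simplified illustration rather than the procedure actually used: it draws the fold change directly from its sampling distribution instead of simulating sample-level expression data, and it assumes that the variance of a difference of two class means with $n/2$ samples per class is $4/n$ times the gene-specific variance. The function name and defaults are hypothetical.

```python
import math
import random

def simulate_fold_changes(m, n, mu1, tau, pi0=0.90, seed=0):
    """Draw true effects, fold changes, and fold-change variances for m genes."""
    rng = random.Random(seed)
    theta, y, s2 = [], [], []
    for _ in range(m):
        u = rng.random()
        if u < pi0:
            t = 0.0                                  # nondifferential gene
        elif u < pi0 + (1.0 - pi0) / 2.0:
            t = rng.gauss(mu1, tau)                  # component 1, mean +mu1
        else:
            t = rng.gauss(-mu1, tau)                 # component 2, mean -mu1
        gene_var = 0.40 * rng.gauss(0.0, 1.0) ** 2   # 0.40 times a chi-square(1) draw
        var_y = 4.0 * gene_var / n                   # assumed variance of the fold change
        theta.append(t)
        s2.append(var_y)
        y.append(rng.gauss(t, math.sqrt(var_y)))
    return theta, y, s2

theta, y, s2 = simulate_fold_changes(m=10000, n=20, mu1=0.10, tau=0.05)
null_fraction = sum(1 for t in theta if t == 0.0) / len(theta)
print(null_fraction)
```

Fixing the seed makes a single replicate reproducible; repeating the call with different seeds would give independent replicates of one scenario.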
Table 4.1 Simulation results: empirical SP a) of 1200 experiments. PMU, PM, RPM, TPP, and PPDE are the Bayesian ranking methods.

| K | $\mu_1$ | n | Fold change | P-value | PMU | PM | RPM | TPP | PPDE |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.10 | 20 | 0.014 | 0.130 | 0.135 | 0.158 | 0.156 | 0.166 | 0.154 |
| 100 | 0.10 | 40 | 0.019 | 0.198 | 0.186 | 0.213 | 0.208 | 0.235 | 0.204 |
| 100 | 0.10 | 60 | 0.024 | 0.228 | 0.212 | 0.246 | 0.234 | 0.277 | 0.223 |
| 100 | 0.20 | 20 | 0.020 | 0.166 | 0.171 | 0.191 | 0.188 | 0.218 | 0.171 |
| 100 | 0.20 | 40 | 0.031 | 0.183 | 0.201 | 0.255 | 0.251 | 0.281 | 0.176 |
| 100 | 0.20 | 60 | 0.045 | 0.183 | 0.223 | 0.297 | 0.291 | 0.319 | 0.177 |
| 200 | 0.10 | 20 | 0.026 | 0.152 | 0.154 | 0.180 | 0.179 | 0.184 | 0.178 |
| 200 | 0.10 | 40 | 0.034 | 0.230 | 0.216 | 0.250 | 0.247 | 0.259 | 0.244 |
| 200 | 0.10 | 60 | 0.041 | 0.276 | 0.256 | 0.291 | 0.286 | 0.307 | 0.282 |
| 200 | 0.20 | 20 | 0.036 | 0.221 | 0.230 | 0.248 | 0.247 | 0.264 | 0.244 |
| 200 | 0.20 | 40 | 0.056 | 0.278 | 0.280 | 0.304 | 0.300 | 0.339 | 0.281 |
| 200 | 0.20 | 60 | 0.078 | 0.294 | 0.303 | 0.352 | 0.348 | 0.387 | 0.290 |

a) Selection probability: the probability of selecting the truly top K genes with the largest effect sizes.
We considered two criteria for the accuracy of gene ranking: (1) the selection probability (SP), the probability of selecting the truly top $K$ genes with the largest effect sizes, and (2) the false positive probability (FPP), the probability of selecting nondifferential genes in the selected $K$ gene set. The SP represents the accuracy in selecting the truly top-ranked genes, and the FPP represents the error of selecting nondifferential genes. Tables 4.1 and 4.2 present the empirical SP and FPP in the 1200 simulations for $K = 100, 200$. Overall, the SP of the PM, RPM, and TPP was greater than that of the conventional methods (fold change, P-values). In particular, the TPP method demonstrated the greatest SP in all the scenarios, reflecting its optimality for selecting the top $K$ genes, although the PM and RPM methods had comparable SP. In contrast, the PPDE method based on the empirical Bayes approach had a smaller SP than the PM, RPM, and TPP methods, as the misclassification loss function of the PPDE method does not concern effect sizes. The FPP of the PPDE method, on the other hand, was the smallest in all settings, as expected. The RPM ranking provided an FPP nearly equivalent to that of the PPDE ranking. For the TPP ranking, the FPP was greater than for the other empirical Bayes methods. Regarding the empirical Bayes method without the mixture structure, the PMU method had a lower SP and a greater FPP than the RPM method based on the hierarchical mixture model. Note that the fold change had a very small SP and a large FPP in all settings. This can be explained by the fact that, unlike the other statistics, the fold change has quite large variability, especially under the usual settings of microarray experiments with small sample sizes. When the sample size increases, the ranking variability decreases. The poor performance of the fold change ranking with respect to the FPP can also be explained by its lack of any safeguard against selecting null genes.

Table 4.2 Simulation results: empirical FPP a) of 1200 experiments. PMU, PM, RPM, TPP, and PPDE are the Bayesian ranking methods.

| K | $\mu_1$ | n | Fold change | P-value | PMU | PM | RPM | TPP | PPDE |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.10 | 20 | 0.887 | 0.520 | 0.373 | 0.351 | 0.349 | 0.411 | 0.349 |
| 100 | 0.10 | 40 | 0.874 | 0.238 | 0.144 | 0.137 | 0.131 | 0.268 | 0.129 |
| 100 | 0.10 | 60 | 0.858 | 0.099 | 0.046 | 0.061 | 0.047 | 0.203 | 0.042 |
| 100 | 0.20 | 20 | 0.853 | 0.103 | 0.016 | 0.027 | 0.024 | 0.125 | 0.015 |
| 100 | 0.20 | 40 | 0.808 | 0.002 | 0.000 | 0.008 | 0.006 | 0.060 | 0.000 |
| 100 | 0.20 | 60 | 0.755 | 0.000 | 0.000 | 0.004 | 0.003 | 0.040 | 0.000 |
| 200 | 0.10 | 20 | 0.887 | 0.629 | 0.559 | 0.532 | 0.531 | 0.551 | 0.531 |
| 200 | 0.10 | 40 | 0.874 | 0.429 | 0.378 | 0.354 | 0.352 | 0.414 | 0.351 |
| 200 | 0.10 | 60 | 0.860 | 0.299 | 0.258 | 0.243 | 0.239 | 0.337 | 0.237 |
| 200 | 0.20 | 20 | 0.856 | 0.273 | 0.170 | 0.159 | 0.158 | 0.248 | 0.157 |
| 200 | 0.20 | 40 | 0.810 | 0.061 | 0.025 | 0.035 | 0.033 | 0.139 | 0.023 |
| 200 | 0.20 | 60 | 0.759 | 0.011 | 0.004 | 0.018 | 0.015 | 0.094 | 0.003 |

a) False positive probability: the probability of selecting the nondifferential genes in the selected K gene set.
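For a single simulated data set, the two accuracy measures could be computed along the following lines. This sketch interprets the SP as the proportion of the truly top $K$ genes that appear among the selected $K$ genes and the FPP as the proportion of selected genes whose true effect is exactly zero; averaging over replicates would then give the empirical values. The function name and the toy inputs are hypothetical.

```python
def selection_metrics(scores, theta, K):
    """Return (SP, FPP) for the K genes with the largest scores.

    scores : ranking statistic per gene, where larger means ranked higher
    theta  : true effects, with exactly 0.0 for nondifferential genes
    """
    m = len(scores)
    selected = set(sorted(range(m), key=lambda j: scores[j], reverse=True)[:K])
    true_top = set(sorted(range(m), key=lambda j: abs(theta[j]), reverse=True)[:K])
    sp = len(selected & true_top) / K
    fpp = sum(1 for j in selected if theta[j] == 0.0) / K
    return sp, fpp

# Toy example with six genes, three of them nondifferential
theta = [0.0, 0.3, -0.5, 0.0, 0.1, 0.0]
noisy_scores = [0.9, 0.8, 0.1, 0.2, 0.7, 0.3]
print(selection_metrics(noisy_scores, theta, K=2))

# An oracle ranking by the true absolute effects attains SP = 1 and FPP = 0
oracle_scores = [abs(t) for t in theta]
print(selection_metrics(oracle_scores, theta, K=2))
```

For statistics where small values indicate strong evidence, such as P-values or the ascending form of the RPM, the scores would need to be negated before being passed to this function.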
4.6 Application
In clinical practice, many current and former smokers are suspected of having lung cancer on the basis of abnormal radiographic imaging and/or symptoms. Although some noninvasive initial diagnostic tests such as flexible bronchoscopy are available, their sensitivity for lung cancer is not sufficient. Thus, most patients require further invasive diagnostic tests, which typically delay treatment by several months and generate additional costs and risks for the patients. Spira et al. [45] performed gene-expression profiling of large-airway epithelial cell brushings obtained from current and former smokers who underwent flexible bronchoscopy, as a diagnostic study for clinical suspicion of lung cancer. Using Affymetrix HG-U133A microarrays, 80 of 22 215 probe sets were selected for the diagnostic algorithm. Here, we analyzed the gene expression profiles of 192 participants (73 smokers without cancer and 119 smokers with lung cancer). The data are available from the Gene Expression Omnibus (GEO) database under accession number GSE4115. Using the EM algorithm, the hyperparameter $\gamma$ was estimated as $\hat{\pi}_0 = 0.304$, $\hat{\pi}_1 = 0.429$, $\hat{\pi}_2 = 0.267$, $\hat{\mu}_1 = 0.068$, $\hat{\tau}_1^2 = 0.033^2$, $\hat{\mu}_2 = -0.075$, and $\hat{\tau}_2^2 = 0.058^2$.
Table 4.3 Selected top-ranked genes based on RPM ranking and their summaries of the lung cancer data a). PM, RPM, TPP, and PPDE are the Bayesian ranking methods.

| Rank | Probe | FDR b) | Fold change | P-value | PM | RPM | TPP | PPDE |
|---|---|---|---|---|---|---|---|---|
| 1 | 204378_at | 0.000 | 0.392 (130) | 0.000 (3) | 0.225 (1) | 16.1 (1) | 0.670 (1) | 1.000 (5) |
| 2 | 205538_at | 0.000 | 0.255 (545) | 0.000 (1) | 0.200 (8) | 22.4 (2) | 0.409 (23) | 1.000 (1) |
| 3 | 222339_x_at | 0.000 | 0.362 (168) | 0.000 (5) | 0.215 (2) | 27.0 (3) | 0.580 (2) | 1.000 (10) |
| 4 | 204524_at | 0.000 | 0.308 (288) | 0.000 (11) | 0.201 (6) | 49.1 (4) | 0.434 (12) | 1.000 (13) |
| 5 | 212001_at | 0.000 | 0.302 (312) | 0.000 (13) | 0.199 (9) | 52.6 (5) | 0.416 (13) | 1.000 (14) |
| 10 | 217497_at | 0.000 | 0.249 (576) | 0.000 (26) | 0.183 (21) | 87.8 (10) | 0.253 (77) | 1.000 (11) |
| 20 | 213502_x_at | 0.000 | 0.331 (236) | 0.000 (63) | 0.187 (17) | 150.1 (20) | 0.320 (28) | 1.000 (156) |
| 30 | 207283_at | 0.000 | 0.186 (1417) | 0.000 (18) | 0.159 (120) | 188.3 (30) | 0.047 (1706) | 1.000 (4) |
| 40 | 221198_at | 0.000 | 0.273 (437) | 0.000 (129) | 0.174 (40) | 222.4 (40) | 0.210 (101) | 1.000 (198) |
| 50 | 217653_x_at | 0.000 | 0.594 (27) | 0.000 (27) | 0.190 (15) | 255.6 (50) | 0.374 (15) | 0.997 (549) |
| 75 | 221068_at | 0.001 | 0.277 (414) | 0.000 (231) | 0.169 (66) | 316.0 (75) | 0.183 (135) | 0.999 (365) |
| 100 | 213326_at | 0.001 | 0.187 (1410) | 0.000 (179) | 0.150 (201) | 401.8 (100) | 0.043 (1596) | 1.000 (113) |
| 125 | 220708_at | 0.001 | 0.206 (1037) | 0.000 (356) | 0.151 (185) | 477.9 (125) | 0.069 (812) | 0.999 (337) |
| 150 | 217410_at | 0.001 | 0.316 (270) | 0.000 (390) | 0.163 (99) | 539.4 (150) | 0.172 (147) | 0.994 (879) |
| 175 | 219957_at | 0.001 | 0.241 (632) | 0.000 (542) | 0.153 (167) | 592.4 (175) | 0.102 (405) | 0.996 (728) |
| 200 | 207372_s_at | 0.002 | 0.201 (1142) | 0.000 (569) | 0.145 (256) | 650.4 (200) | 0.054 (1069) | 0.997 (571) |

a) The values in parentheses are the integer ranks of each ranking statistic.
b) FDR estimate for a gene set of that gene and its upper ranked genes (like Q-value in multiple testing; Storey [9, 10]).
The variance $\sigma_j^2$ was estimated on a gene-by-gene basis, assuming a common variance between the two groups. Some of the top-ranked probes according to the RPM and PPDE statistics are presented in Tables 4.3 (RPM) and 4.4 (PPDE), respectively. These tables also give the realized values of the ranking statistics (fold change, P-value, PM, RPM, TPP [K = 200], PPDE) and their integer ranks (note that genes with low values of the statistic were highly ranked for the RPM method). Clearly, these two lists were considerably discordant. The point estimates of $\theta_j$ (the fold change and PM) for the genes top-ranked by the RPM statistic were relatively greater than those for the genes top-ranked by the PPDE statistic. In addition, the FDR estimates for the PPDE ranking were smaller than those for the RPM ranking, as expected, although they were comparable. These results can be explained by the difference in the designated loss function. The rankings obtained by the different ranking statistics were also quite inconsistent. In particular, the fold-change ranks of the genes listed in Tables 4.3 and 4.4 were considerably lower than, and fairly discordant with, the ranks given by the other statistics. This is likely due to the serious variability of the fold-change ranking seen in the simulations in Section 4.5. The rankings by the PM and RPM were less discrepant. The top-ranked genes of the TPP and RPM statistics differed to some degree, but overall there was substantial overlap.

Table 4.4 Selected top-ranked genes based on the PPDE ranking and their summaries for the lung cancer data. a)

| Rank | Probe | FDR b) | Fold change | P-value | PM | RPM | TPP | PPDE |
|---|---|---|---|---|---|---|---|---|
| 1 | 205538_at | 0.000 | 0.255 (545) | 0.000 (1) | 0.200 (8) | 22.4 (2) | 0.409 (23) | 1.000 (1) |
| 2 | 203658_at | 0.000 | 0.140 (3082) | 0.000 (8) | 0.115 (931) | 1125.5 (483) | 0.078 (753) | 1.000 (2) |
| 3 | 204566_at | 0.000 | 0.175 (1704) | 0.000 (10) | 0.126 (597) | 758.5 (260) | 0.228 (84) | 1.000 (3) |
| 4 | 207283_at | 0.000 | 0.186 (1417) | 0.000 (18) | 0.159 (120) | 188.3 (30) | 0.047 (1706) | 1.000 (4) |
| 5 | 204378_at | 0.000 | 0.392 (130) | 0.000 (3) | 0.225 (1) | 16.1 (1) | 0.670 (1) | 1.000 (5) |
| 10 | 222339_x_at | 0.000 | 0.362 (168) | 0.000 (5) | 0.215 (2) | 27.0 (3) | 0.580 (2) | 1.000 (10) |
| 20 | 213180_s_at | 0.000 | 0.204 (1072) | 0.000 (31) | 0.128 (565) | 793.6 (279) | 0.267 (48) | 1.000 (20) |
| 30 | 212102_s_at | 0.000 | 0.123 (4119) | 0.000 (110) | 0.102 (1482) | 1870.4 (1003) | 0.023 (2638) | 1.000 (30) |
| 40 | 203619_s_at | 0.000 | 0.208 (1022) | 0.000 (44) | 0.126 (625) | 875.1 (333) | 0.247 (56) | 1.000 (40) |
| 50 | 212319_at | 0.000 | 0.236 (685) | 0.000 (59) | 0.173 (44) | 160.6 (25) | 0.173 (180) | 1.000 (50) |
| 75 | 206288_at | 0.000 | 0.151 (2541) | 0.000 (138) | 0.111 (1101) | 1460.7 (712) | 0.083 (569) | 1.000 (75) |
| 100 | 214188_at | 0.000 | 0.261 (511) | 0.000 (81) | 0.176 (33) | 168.0 (26) | 0.217 (97) | 1.000 (100) |
| 125 | 205010_at | 0.000 | 0.158 (2286) | 0.000 (232) | 0.135 (407) | 651.0 (201) | 0.009 (5907) | 1.000 (125) |
| 150 | 205898_at | 0.000 | 0.287 (375) | 0.000 (45) | 0.127 (592) | 930.5 (356) | 0.284 (34) | 1.000 (150) |
| 175 | 206567_s_at | 0.000 | 0.162 (2124) | 0.000 (287) | 0.136 (382) | 649.1 (198) | 0.012 (4610) | 1.000 (175) |
| 200 | 205515_at | 0.000 | 0.266 (476) | 0.000 (70) | 0.124 (682) | 1043.4 (427) | 0.246 (53) | 1.000 (200) |

a) The values in parentheses are the integer ranks of each ranking statistic.
b) FDR estimate for the gene set consisting of that gene and all genes ranked above it (like the Q-value in multiple testing; Storey [9, 10]).

Figure 4.1 presents forest plots for the top 100 genes selected by the RPM and PPDE statistics. The upper panels illustrate the fold change and 95% t-intervals, and the lower panels illustrate the posterior mean and 95% posterior probability intervals of those genes. For fair comparison of the magnitudes of these estimates, all of the realized values are presented as absolute values. Again, the discrepancy in ranking described in Tables 4.3 and 4.4 is clearly visible in these panels. The point and interval estimates of the effect sizes for the top genes by the RPM method were located at relatively greater values than those of the top genes by the PPDE method. Furthermore, the interval estimates for the top genes by the PPDE method can be more precise than those of the top genes by the RPM method. In terms of the Bayesian estimates of $\theta_j$ for the top genes by the RPM method, the PM and RPM rankings were generally consistent, whereas the ranking by the fold change was quite inconsistent with these rankings.
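The FDR column of Table 4.4 attaches to each gene an estimate for the list made up of that gene and all genes ranked above it. The estimator used by the authors is not restated in this section; a common choice in the empirical Bayes literature, assumed here, is to average 1 − PPDE over the list. The sketch below illustrates that assumption with hypothetical PPDE values (the function name and the numbers are illustrative only).

```python
def bayesian_fdr(ppde):
    """Estimated FDR of every top-k list when genes are ranked by PPDE.

    Assumption: the FDR of a list is estimated by the average posterior
    probability of non-differential expression (1 - PPDE) over that list.
    """
    ranked = sorted(ppde, reverse=True)  # highest posterior probability first
    fdr, running_error = [], 0.0
    for k, p in enumerate(ranked, start=1):
        running_error += 1.0 - p         # posterior probability that gene k is a false positive
        fdr.append(running_error / k)    # average over the top-k list
    return fdr


# Hypothetical PPDE values for six genes (not taken from the lung cancer data).
ppde = [0.99, 0.40, 0.95, 0.10, 0.80, 0.999]
fdr = bayesian_fdr(ppde)
for k, value in enumerate(fdr, start=1):
    print(f"top {k}: estimated FDR = {value:.4f}")
```

Because the genes are sorted by decreasing PPDE, the estimate can only grow as the list gets longer, which is why a ranking by PPDE yields the smallest estimated FDR for any fixed list size.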
Figure 4.1 Forest plots of the $\theta_j$'s for the top 100 genes based on the RPM and PPDE rankings; for fair comparison, absolute values of the estimates are presented. Panels: fold change and 95% confidence interval (RPM ranking); fold change and 95% confidence interval (PPDE ranking); PM and 95% posterior probability interval (RPM ranking); PM and 95% posterior probability interval (PPDE ranking). Horizontal axes: rank (0 to 100); vertical axes: $\theta$ (0.0 to 0.8).
4.7 Concluding Remarks
Because the risk of false findings in gene screening with high-dimensional microarray data is so serious, many researchers have focused on controlling false positives within the framework of multiple testing. However, multiple testing methodologies only ensure control of false positives for a set of significant genes. Because a selected gene set will be subject to further investigation, prioritizing or ranking individual genes becomes particularly important. This chapter has moved in this direction by seeking accurate gene ranking and selection methods. Sharing information across genes and incorporating the differential/nondifferential mixture structure are expected to improve accuracy. In our simulations, the PM, RPM, and TPP methods had greater probabilities of selecting the differentially expressed genes with the greatest effects than conventional methods, while keeping lower false positive probabilities. The PPDE method, in turn, exhibited the smallest false positive probabilities. These results are reasonable because a ranking method performs well for the criterion or loss function from which it is derived. Therefore, ranking methods should be selected carefully according to the criterion of interest, that is, depending on whether one is interested in differentially expressed genes with the largest effect sizes or in differentially expressed genes regardless of their effect sizes. The former would be important for selecting biologically significant genes, while the latter would be important for selecting statistically significant genes, possibly before developing a classifier using the selected genes. In general, the RPM method can be recommended because it performed well for both selection criteria in our simulation study.
The accuracy of the ranking based on the fold change can be expected to improve as the sample size increases, because of its frequentist optimality in large-sample theory. However, in small-sample settings, as in many microarray studies, this ranking method can perform poorly, as indicated in the simulations. Furthermore, with respect to the false positive probability, the ranking based on the fold change did not perform well even in the large-sample setting (n = 60). Although the MAQC project [13] has previously reported good reproducibility with this ranking method, reproducibility does not necessarily imply accuracy, as remarked in Section 4.2. Hence, it is not generally recommended, especially when the analyst has concerns about the selection of false positives. As a topic for future research, more studies are needed on relaxing the modeling assumptions. For possible violations of the normal distribution assumption (4.1), other parametric models with some analytical tractability, such as the skewed normal distribution [46, 47], could be good candidates. Nonparametric prior models and empirical Bayes estimation via the smoothing-by-roughing approach [48, 49] could also be applied. Although the variance $\sigma_j^2$ is assumed to be known in the first level of the model (4.1), it can be estimated jointly in the framework of hierarchical mixture modeling (e.g., Noma and Matsui [35]).
As demonstrated in Section 4.6, the effect-size estimates should also be reported as one of the relevant statistical outputs in gene ranking. Although the Bayesian optimal point estimator is simply the posterior mean [17], the interval estimate should also be reported. Recently, from the frequentist perspective, Benjamini and Yekutieli [50] proposed the false coverage-statement rate as an adjustment of frequentist confidence intervals that maintains the coverage probability for the selected gene set while taking data-dependent gene selection into account. For the empirical Bayes approaches in particular, their frequentist performance would be of interest. Another important research topic is the determination of the required sample size in the ranking and selection framework [31, 32]. In particular, the development of sample size formulas for the selection criteria in Section 4.4 under the framework of hierarchical mixture models and empirical Bayes estimation is a subject for future research.
4.8 Perspective
Microarray technology has provided a powerful tool for identifying relevant genes in elucidating the mechanisms of oncogenesis and in developing optimized medicine for individual patients. In addition to the outputs on statistical significance from multiple testing, accurate gene ranking for an adequate loss function and accurate estimates of effect sizes can provide relevant statistical outputs for selecting important genes for subsequent studies. The development of related analytical methods and evaluation of their performance and utility in practical use are warranted. At the same time, the current tendency toward large-scale genomic studies in clinical oncology emphasizes the importance of designing powerful gene screening (through development of appropriate sample size estimation methods) and developing efficient strategies for gene screening and validation.
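4.9 Appendix: The EM Algorithm

Regarding the $Y_j$'s as observed variables and the $\gamma_{ij}$'s and $\theta_j$'s as missing variables, the EM algorithm can be applied. In the standard way, the recursions for $\pi_i$ $(i = 0, 1, 2)$ are obtained analytically as

$$\pi_i^{(t+1)} = \frac{1}{m}\sum_{j=1}^{m} \Pr\left(\gamma_{ij} = 1 \mid y_j, g^{(t)}\right), \qquad (i = 0, 1, 2)$$

which is the sample mean of the posterior probabilities of belonging to the $i$th component. For the parameters of the differential components, $\mu_1, \mu_2, \tau_1^2, \tau_2^2$, the recursions are

$$\mu_i^{(t+1)} = \frac{\sum_{j=1}^{m} E\left[\theta_j \mid y_j, \gamma_{ij} = 1, \mu_i^{(t)}, \tau_i^{(t)2}\right] \Pr\left(\gamma_{ij} = 1 \mid y_j, g^{(t)}\right)}{\sum_{j=1}^{m} \Pr\left(\gamma_{ij} = 1 \mid y_j, g^{(t)}\right)}$$

$$\tau_i^{(t+1)2} = \frac{\sum_{j=1}^{m} E\left[\left(\theta_j - \mu_i^{(t+1)}\right)^2 \mid y_j, \gamma_{ij} = 1, \mu_i^{(t+1)}, \tau_i^{(t)2}\right] \Pr\left(\gamma_{ij} = 1 \mid y_j, g^{(t)}\right)}{\sum_{j=1}^{m} \Pr\left(\gamma_{ij} = 1 \mid y_j, g^{(t)}\right)}, \qquad (i = 1, 2)$$

where

$$E\left[\theta_j \mid y_j, \gamma_{ij} = 1, \mu_i^{(t)}, \tau_i^{(t)2}\right] = \frac{\tau_i^{(t)2} y_j + \sigma_j^2 \mu_i^{(t)}}{\tau_i^{(t)2} + \sigma_j^2}$$

$$E\left[\left(\theta_j - \mu_i^{(t+1)}\right)^2 \mid y_j, \gamma_{ij} = 1, \mu_i^{(t+1)}, \tau_i^{(t)2}\right] = \left(\frac{\tau_i^{(t)2} y_j + \sigma_j^2 \mu_i^{(t+1)}}{\tau_i^{(t)2} + \sigma_j^2}\right)^2 + \frac{\tau_i^{(t)2} \sigma_j^2}{\tau_i^{(t)2} + \sigma_j^2} - 2\mu_i^{(t+1)} \frac{\tau_i^{(t)2} y_j + \sigma_j^2 \mu_i^{(t+1)}}{\tau_i^{(t)2} + \sigma_j^2} + \mu_i^{(t+1)2}$$

The superscripts denote the iteration number of the recursion.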
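The recursions above can be sketched in a few lines of code. The version below is a minimal illustration rather than the authors' implementation: it assumes that the nondifferential component (i = 0) places $\theta_j$ at exactly zero, so that marginally $y_j$ is $N(0, \sigma_j^2)$ under component 0 and $N(\mu_i, \tau_i^2 + \sigma_j^2)$ under components 1 and 2, and it runs on simulated data. All function and variable names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step(y, s2, pi, mu, tau2):
    """One update of (pi, mu, tau2); index 0 of mu and tau2 is unused."""
    # Posterior membership probabilities of each gene under the current values.
    resp = []
    for yj, s2j in zip(y, s2):
        dens = [pi[0] * normal_pdf(yj, 0.0, s2j)]  # assumed null: theta_j = 0
        dens += [pi[i] * normal_pdf(yj, mu[i], tau2[i] + s2j) for i in (1, 2)]
        total = sum(dens)
        resp.append([d / total for d in dens])
    new_pi = [sum(r[i] for r in resp) / len(y) for i in range(3)]
    new_mu, new_tau2 = [0.0] * 3, [0.0] * 3
    for i in (1, 2):
        weight = sum(r[i] for r in resp)
        # Weighted average of the conditional posterior means of theta_j.
        new_mu[i] = sum(
            r[i] * (tau2[i] * yj + s2j * mu[i]) / (tau2[i] + s2j)
            for r, yj, s2j in zip(resp, y, s2)
        ) / weight
        # Weighted average of E[(theta_j - new_mu)^2], evaluated at the updated mean.
        second = 0.0
        for r, yj, s2j in zip(resp, y, s2):
            post_mean = (tau2[i] * yj + s2j * new_mu[i]) / (tau2[i] + s2j)
            post_var = tau2[i] * s2j / (tau2[i] + s2j)
            second += r[i] * (post_var + (post_mean - new_mu[i]) ** 2)
        new_tau2[i] = second / weight
    return new_pi, new_mu, new_tau2


# Simulated data (illustration only): 60% null genes, 20% with effects
# around -2 and 20% with effects around +2; known variance 0.04 for every gene.
rng = random.Random(1)
y, s2 = [], []
for _ in range(300):
    u = rng.random()
    theta = 0.0 if u < 0.6 else rng.gauss(-2.0 if u < 0.8 else 2.0, 0.5)
    y.append(rng.gauss(theta, 0.2))
    s2.append(0.04)

pi, mu, tau2 = [0.8, 0.1, 0.1], [0.0, -1.0, 1.0], [0.0, 1.0, 1.0]
for _ in range(100):
    pi, mu, tau2 = em_step(y, s2, pi, mu, tau2)
print(pi, mu[1:], tau2[1:])
```

The expression `post_var + (post_mean - new_mu[i]) ** 2` is the expanded four-term second moment of the appendix written in compact form.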
References
1 Parmigiani, G., Garrett, E.S., Irizarry, R.A., and Zeger, S.L. (eds.) (2003) The Analysis of Gene Expression Data: Methods and Software, Springer, New York.
2 Speed, T. (ed.) (2003) Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall/CRC, Boca Raton, FL.
3 Simon, R.M., Korn, E.L., McShane, L.M., Radmacher, M.D., Wright, G.W., and Zhao, Y. (2003) Design and Analysis of DNA Microarray Investigations, Springer, New York.
4 McLachlan, G.J., Do, K.-A., and Ambroise, C. (2004) Analyzing Microarray Gene Expression Data, John Wiley & Sons, Hoboken, NJ.
5 Wiuf, C. and Andersen, C.L. (eds.) (2009) Statistics and Informatics in Molecular Cancer Research, Oxford University Press, New York.
6 Strachan, T. and Read, A. (2011) Human Molecular Genetics, 4th edn, Garland Science, New York.
7 Matsui, S. and Noma, H. (2012) Handbook of Statistics in Clinical Oncology, 3rd edn (eds J. Crowley and A. Hoering), Chapman & Hall/CRC, Boca Raton, FL, p. 561.
8 Benjamini, Y. and Hochberg, Y. (1995) J. R. Statist. Soc. B, 57, 289.
9 Storey, J.D. (2002) J. R. Statist. Soc. B, 64, 479.
10 Storey, J.D. (2003) Ann. Statist., 31, 2013.
11 Guo, L., Lobenhofer, E.K., Wang, C., Shippy, R., Harris, S.C., Zhang, L., Mei, N., Chen, T., Herman, D., Goodsaid, F.M., Hurban, P., Phillips, K.L., Xu, J., Deng, X.T., Sun, Y.M.A., Tong, W.D., Dragan, Y.P., and Shi, L.M. (2006) Nat. Biotechnol., 24, 1162.
12 Choe, S.E., Boutros, M., Michelson, A.M., Church, G.M., and Halfon, M.S. (2005) Genome Biol., 6, R16.
13 MAQC Consortium (2006) Nat. Biotechnol., 24, 1151.
14 Lin, R., Louis, T.A., Paddock, S.M., and Ridgeway, G. (2006) Bayesian Anal., 1, 915.
15 Noma, H., Matsui, S., Omori, T., and Sato, T. (2010) Biostatistics, 11, 281.
16 Breslow, N.E. (1990) Statist. Sci., 5, 269.
17 Carlin, B.P. and Louis, T.A. (2009) Bayesian Methods for Data Analysis, 3rd edn, Chapman & Hall/CRC, New York.
18 Efron, B. (2010) Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge University Press, Cambridge.
19 Rothman, K.J., Greenland, S., and Lash, T.L. (eds.) (2008) Modern Epidemiology, 3rd edn, Lippincott Williams & Wilkins, Philadelphia.
20 Gottardo, R., Pannucci, J.A., Kuske, C.R., and Brettin, T. (2003) Biostatistics, 4, 597.
21 Lönnstedt, I. and Speed, T. (2002) Statist. Sin., 12, 31.
22 Lo, K. and Gottardo, R. (2007) Bioinformatics, 23, 328.
23 Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., and Tsui, K.W. (2001) J. Comput. Biol., 8, 37.
24 Newton, M.A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2004) Biostatistics, 5, 155.
25 Kendziorski, C.M., Newton, M.A., Lan, H., and Gould, M.N. (2003) Statist. Med., 22, 3899.
26 Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977) J. R. Statist. Soc. B, 39, 1.
27 Berger, J.O. (1985) Statistical Decision Theory and Bayesian Analysis, 2nd edn, Springer, New York.
28 Laird, N.M. and Louis, T.A. (1989) J. Educ. Statist., 14, 29.
29 Louis, T.A. and Shen, W. (1999) Statist. Med., 18, 2493.
30 Shen, W. and Louis, T.A. (1998) J. R. Statist. Soc. B, 60, 455.
31 Matsui, S., Zeng, S., Yamanaka, T., and Shaughnessy, J. (2008) Biometrics, 64, 217.
32 Matsui, S. and Oura, T. (2009) Statist. Med., 28, 2801.
33 Müller, P., Parmigiani, G., and Rice, K. (2007) Bayesian Statistics 8 (eds J.M. Bernardo, S. Bayarri, J.O. Berger, D. Dawid, D. Heckerman, A.F.M. Smith, and M. West), Oxford University Press, Oxford, p. 349.
34 Storey, J.D. (2007) J. R. Statist. Soc. B, 69, 347.
35 Noma, H. and Matsui, S. (2012) Statist. Med., 31, 165.
36 McLachlan, G.J. (1992) Discriminant Analysis and Statistical Pattern Recognition, John Wiley & Sons, New York.
37 McLachlan, G.J., Bean, R.W., and Jones, L.B.T. (2006) Bioinformatics, 22, 1608.
38 Noma, H. and Matsui, S. (2010) Jpn. J. Biometrics, 31, 13.
39 Neyman, J. and Pearson, E.S. (1933) Phil. Trans. R. Soc., 231, 13.
40 Tsai, C.-A., Hsueh, H.-M., and Chen, J.J. (2003) Biometrics, 59, 1071.
41 Newton, M.A., Wang, P., and Kendziorski, C.M. (2006) Bayesian Inference for Gene Expression and Proteomics (eds K.-A. Do, P. Müller, and M. Vannucci), Cambridge University Press, Cambridge, p. 40.
42 Matsui, S. and Noma, H. (2011) Biostatistics, 12, 223.
43 Matsui, S. and Noma, H. (2011) Biometrics, 67, 1225.
44 Crager, M.R. (2010) Statist. Med., 29, 33.
45 Spira, A., Beane, J.E., Shah, V., Steiling, K., Liu, G., Schembri, F., Gilman, S., Dumas, Y.-M., Calner, P., Sebastiani, P., Sridhar, S., Beamis, J., Lamb, C., Anderson, T., Gerry, N., Keane, J., Lenburg, M.E., and Brody, J.S. (2007) Nat. Med., 13, 361.
46 Azzalini, A. (1985) Scand. J. Statist., 12, 171.
47 O'Hagan, A. and Leonard, T. (1976) Biometrika, 63, 201.
48 Laird, N.M. and Louis, T.A. (1991) Comput. Statist. Data Anal., 12, 27.
49 Shen, W. and Louis, T.A. (1999) J. Comput. Graph. Statist., 8, 800.
50 Benjamini, Y. and Yekutieli, D. (2005) J. Am. Statist. Ass., 100, 71.
5 Multiclass Classification via Bayesian Variable Selection with Gene Expression Data Yang Aijun, Song Xinyuan, and Li Yunxian
5.1 Brief Summary
Selecting a small number of relevant genes for classification has received a great deal of attention in microarray data analysis. While the development of methods for microarray data with only two classes is relevant, developing more efficient algorithms for classification with any number of classes is important. In this chapter, we focus on a Bayesian method that selects groups of genes on the basis of their expression in DNA samples derived under different experimental conditions. We first propose a Bayesian stochastic search variable selection approach for multiclass classification, which can identify relevant genes by assessing sets of genes jointly. Next, we discuss the associated multiclass classification. We then demonstrate the performance of the approach on two well-known gene expression profiling data sets: the leukemia data and the lymphoma data. In the last two sections, we present some concluding remarks and suggestions for further research. The technical details are provided in the supplementary material, which is available at https://sites.google.com/site/andyaijunyang/appendix-and-code.
5.2 Introduction
Statistical Diagnostics for Cancer: Analyzing High-Dimensional Data, First Edition. Edited by Frank Emmert-Streib and Matthias Dehmer. © 2013 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2013 by Wiley-VCH Verlag GmbH & Co. KGaA.

In practice, DNA microarray gene expression data are usually characterized by few samples and a large number of genes. Multiclass classification based on data with a relatively small number of samples (n) compared to the number of variables (p) involved is an important topic in bioinformatics. The problem of high-dimensional multiclass classification is challenging because many noise variables that may not be relevant to classification exist, and these variables can potentially degrade the prediction performance of classification. Moreover, it is necessary to identify which variables contribute most to the multiclass classification.

Many variable-selection methods related to multiclass classification have been described in the bioinformatics literature. These methods can be classified into univariate and multivariate approaches. Univariate methods consider each variable individually, based on its marginal utility for the classification task. These methods include parametric and nonparametric methods. Examples include the weighted voting scheme [1], the threshold number of misclassification score [2], the significance analysis of microarray statistic [3], the ratio of between-groups to within-groups sum of squares [4], the pairwise mean difference [5], and the Wilcoxon test statistic [6]. Due to their conceptually simple nature, univariate methods have attracted much attention. However, they do not consider the correlations between variables, resulting in a subset of variables that may not be optimal for the classification task considered. To take into account the dependency between genes and to achieve a reduced number of relevant genes, Yeung and Bumgarner [7] and Jaeger et al. [8] proposed multivariate gene selection procedures, which do not score each variable individually but determine the combinations of variables that yield high prediction accuracy. The multivariate Bayesian gene selection approach based on the stochastic search variable selection method [9] has been applied to the multiclass classification problem (see [10, 11]). Sha et al. [10] proposed an algorithm based on a multinomial probit model that uses an adding/deleting and swapping algorithm. According to Lamnisos et al. [12], this kind of algorithm, which randomly chooses either to add or delete a single explanatory variable or to swap two explanatory variables in the model, often leads to high model acceptance rates when the number of variables is substantially larger than the sample size. Moreover, the Metropolis random walk suggested by Sha et al. [10], with local proposals and a high acceptance rate, is often associated with poor mixing of the MCMC chains.
Furthermore, as their approach did not capture a priori correlation in the parameters, eliciting a prior covariance matrix with p > n is difficult [13]. Zhou et al. [11] proposed a multivariate Bayesian model using the g-prior [14] for the unknown regression coefficients related to relevant genes. For situations with high-dimensional or highly collinear covariates, the covariance matrix involved in the g-prior is nearly singular (see [15]), resulting in unstable convergence of the algorithm. Moreover, their methods assumed the covariance matrix of the random errors to be an identity matrix. This specification has several limitations. For instance, it entails some symmetry between different classes and, because it postulates independent latent variables, it implies an independence-from-irrelevant-alternatives assumption that is not appropriate in some applications [16]. Finally, both Sha et al. [10] and Zhou et al. [11] calculated the leave-one-out cross-validation (LOOCV) error within the gene selection process. According to Ambroise and McLachlan [17] and Rocke et al. [18], an optimistic selection bias arises when this internal LOOCV procedure is used to estimate the prediction error.

In this chapter, we consider a multivariate Bayesian probit model together with a stochastic search variable selection (SSVS) method for gene selection and the classification of diagnostic category in a multiclass problem. We propose a generalized g-prior (gg-prior) to overcome the problem induced by the possible singularity of the covariance matrix involved in the g-prior distribution of the regression coefficients. We show that this kind of gg-prior is effective in coping with situations with a large number of genes and a small number of samples. Moreover, unlike the method based on approximation, we perform a full Bayesian analysis through a Markov chain Monte Carlo (MCMC; [19])-based stochastic search algorithm. In developing our gg-SSVS algorithm, the efficient sampling scheme suggested by Panagiotelis and Smith [20] is implemented. For the posterior analysis associated with this sampling scheme, the unknown intercept and regression coefficients in the proposed model are integrated out from the joint posterior distribution. This gives a simple and well-defined posterior distribution that ensures stable convergence of the resulting MCMC methods. Hence, our algorithm is more stable and efficient than the MCMC-based algorithms of Sha et al. [10] and Zhou et al. [11]. In addition, the gg-SSVS approach produces the posterior probability for the selected genes, which is helpful in a diagnostic setting. We illustrate the advantage of our method on two well-known microarray data sets: acute leukemia data [1] and lymphoma data [21]. We compare the performance of the proposed gg-SSVS approach with some other classification procedures in the literature, such as those of Dettling and Bühlmann [22] and Yeung et al. [23], among others. Our results show that the proposed gg-SSVS approach reduces the number of selected genes and produces a prediction accuracy comparable to that of existing methods for variable selection and classification.

The rest of this chapter is structured as follows. The next section provides a brief review of the matrix variate distribution. In Section 5.4, we specify the model on the basis of the stochastic search variable selection procedure. Discussions of the related prior distributions, the implementation of the Bayesian method, and the associated classification are also presented. The results obtained from the analysis of the two published data sets are given in Section 5.5. Some concluding remarks are presented in Section 5.6. The technical details are provided in the supplementary material.

5.3 Matrix Variate Distribution
We follow the notation introduced by Dawid [24] for the matrix variate distribution. $M + N(P, \Sigma)$ will stand for a matrix normal distribution of $X$, where $M$ is the matrix mean of $X$, and $p_{ii}\Sigma$ and $\sigma_{jj}P$ are the covariance matrices of the $i$th row and $j$th column of $X$, respectively. Let $\Sigma \sim IW(\delta; Q)$; then the induced marginal distribution for $X$ is a matrix $T$ distribution denoted as $T(\delta; P, Q)$. The probability density functions of the matrix normal distribution and the matrix $T$ distribution are given by Brown [25] (see supplementary material).

5.4 Method
5.4.1 Model
Suppose we are given a training data set that consists of $n$ samples $(X_1, Y_1), \ldots, (X_n, Y_n)$, where $X_i = (X_{i1}, X_{i2}, \ldots, X_{ip}) \in \mathbb{R}^p$ represents the covariates or input vector, and $Y_i$ is a categorical response variable from sample $i$ that takes on the values $0, 1, \ldots, K-1$. Based on the training data, we aim to predict the target values of previously unseen points given a set of new covariates. Following the standard approach for the multinomial probit model (see [26]), we introduce $n$ auxiliary variables $Z_i = (Z_{i1}, \ldots, Z_{i,K-1})$, $i = 1, 2, \ldots, n$, to connect the multinomial probit model to the following multivariate normal linear regression model:

$$Z_i = \alpha + X_i B + \varepsilon_i, \qquad i = 1, 2, \ldots, n \qquad (5.1)$$
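where $\alpha$ is a $(K-1)$-dimensional vector of intercepts, $B$ is a $p \times (K-1)$ matrix of regression coefficients, and $\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{i,K-1})$ are i.i.d. $N(0, \Sigma)$. The relationship between the auxiliary variables $Z_i$ and the discrete observations $Y_i$ is defined as follows:

$$Y_i = \begin{cases} 0 & \text{if } \max_{1 \le k \le K-1} Z_{ik} \le 0, \\ j & \text{if } \max_{1 \le k \le K-1} Z_{ik} > 0 \text{ and } Z_{ij} = \max_{1 \le k \le K-1} Z_{ik} \end{cases}$$

[...]

$$\cdots \times \left|\frac{X_\gamma' X_\gamma}{c} + \tau I_{p_\gamma}\right|^{\frac{K-1}{2}} \exp\left\{-\frac{1}{2}\,\mathrm{tr}\left[\left(\frac{X_\gamma' X_\gamma}{c} + \tau I_{p_\gamma}\right) B_\gamma \Sigma^{-1} B_\gamma'\right]\right\} |R_0|^{\frac{r_0+K-2}{2}} \exp\left\{-\frac{\alpha \Sigma^{-1} \alpha'}{2h}\right\} \exp\left\{-\frac{\mathrm{tr}(\Sigma^{-1} R_0)}{2}\right\} |\Sigma|^{-\frac{n+p_\gamma+r_0+2K-1}{2}} \prod_{i=1}^{p} \pi_i^{\gamma_i} (1-\pi_i)^{1-\gamma_i} \qquad (5.10)$$

where $A_i$ is equal to either $\{Z_i : \max_{1 \le k \le K-1} Z_{ik} \le 0\}$ or $\{Z_i : \max_{1 \le k \le K-1} Z_{ik} > 0,\ Z_{ij} = \max_{1 \le k \le K-1} Z_{ik}\}$ [...]

$$\cdots\; Z_{\mathrm{new},k},\ \forall k \ne j)\; p\left(Z_{\mathrm{new}} \mid Y, X_{\mathrm{new}}, Z^{(t)}, \gamma^{(t)}\right) dZ_{\mathrm{new}} \qquad (5.23)$$

Efficient methods for calculating the multivariate integrals in Equations 5.21 and 5.23 are described by Genz and Bretz [31].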
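To make the role of the latent vector concrete, the sketch below maps a draw of the latent variables to a class label and estimates class probabilities of the form $P(Z_{\mathrm{new},j} > Z_{\mathrm{new},k},\ \forall k \ne j)$ by plain Monte Carlo. This is a simple stand-in for the numerical integration methods of Genz and Bretz [31], not a reproduction of them. It assumes the usual multinomial probit convention that class 0 is chosen when no latent variable is positive, and it takes the predictive mean and covariance of the latent vector as given; the function names and inputs are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a small symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def latent_to_class(z):
    """Class 0 if no latent value is positive, otherwise the 1-based index of the largest."""
    best = max(range(len(z)), key=lambda k: z[k])
    return 0 if z[best] <= 0 else best + 1


def class_probabilities(mean, cov, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(class = j), j = 0..K-1, for a latent vector ~ N(mean, cov)."""
    rng = random.Random(seed)
    low = cholesky(cov)
    counts = [0] * (len(mean) + 1)
    for _ in range(n_draws):
        e = [rng.gauss(0.0, 1.0) for _ in mean]
        z = [mean[i] + sum(low[i][k] * e[k] for k in range(i + 1))
             for i in range(len(mean))]
        counts[latent_to_class(z)] += 1
    return [c / n_draws for c in counts]


# Three classes (K = 3), hence two latent variables; hypothetical predictive moments.
probs = class_probabilities([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(probs)
```

With a zero mean and identity covariance, the exact probabilities are 0.25 for class 0 and 0.375 for each of the other two classes, which gives a quick sanity check on the simulation.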
5.5 Real Data Analysis
5.5.1 Leukemia Data
We first applied our classification method to the leukemia data, which were originally analyzed by Golub et al. [1] and are available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. The gene expression levels were obtained from Affymetrix high-density oligonucleotide arrays containing p = 6817 human genes. Golub et al. [1] gathered bone marrow or peripheral blood samples from 72 patients suffering from either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML), which were identified based on their origins, lymphoid (lymph or lymphatic tissue related) and myeloid (bone marrow related), respectively. The data comprise 47 cases of ALL (38 B-cell ALL and 9 T-cell ALL) and 25 cases of AML, which were already divided into a training set of 38 samples, of which 19 are ALL-B, 8 are ALL-T, and 11 are AML, and a test set of 34 samples, of which 19 are ALL-B, 1 is ALL-T, and 14 are AML. Following the protocol in Dudoit et al. [4], the following preprocessing steps were applied to the data: (i) thresholding: floor of 100 and ceiling of 16 000; (ii) filtering: exclusion of genes with max/min ≤ 5 and (max − min) ≤ 500, where max and min refer, respectively, to the maximum and minimum expression levels of a particular gene across samples; and (iii) base 10 logarithmic transformation. The filtering resulted in 3571 genes. We further transformed the gene expression data to have mean zero and standard deviation one across samples.

To conduct the Bayesian gg-SSVS procedure, we set $c = 10$, $\pi_i = 0.005$ $(i = 1, \ldots, p)$, $h = 100$, $R_0 = 2I$, $r_0 = 3$, and $\tau = 0.01$. The initial value $\gamma^{(0)}$ was taken with 25 randomly selected elements set to 1. Three diagnostic plots suggested by Smith and Kohn [32] and Brown et al. [33] were used to check convergence. Figure 5.1a shows the most significant genes, which are determined by the posterior gene inclusion probabilities.
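The preprocessing steps (i) to (iii) and the final standardization can be sketched as follows. This is an illustration on made-up numbers, not the code used for the analysis. Two points are assumptions: a gene is dropped when either filtering criterion is met (the reading that matches Dudoit et al. [4]), and the sample standard deviation is used for scaling, since the text does not say which version was applied. The function name `preprocess` is hypothetical.

```python
import math
import statistics


def preprocess(expr, floor=100.0, ceiling=16000.0, min_ratio=5.0, min_range=500.0):
    """expr: list of genes, each a list of raw expression values across samples.

    Returns (gene_index, standardized log10 values) for the genes that pass the filter.
    """
    kept = []
    for gene_index, values in enumerate(expr):
        clipped = [min(max(v, floor), ceiling) for v in values]  # (i) thresholding
        hi, lo = max(clipped), min(clipped)
        if hi / lo <= min_ratio or hi - lo <= min_range:         # (ii) filtering (assumed "or")
            continue
        logged = [math.log10(v) for v in clipped]                # (iii) log10 transform
        mean = statistics.mean(logged)
        sd = statistics.stdev(logged)                            # sample SD (assumption)
        kept.append((gene_index, [(v - mean) / sd for v in logged]))
    return kept


# Three hypothetical genes measured on four samples.
expr = [
    [50.0, 20000.0, 400.0, 900.0],   # clipped to [100, 16000]; passes the filter
    [120.0, 150.0, 130.0, 140.0],    # nearly constant; filtered out
    [100.0, 700.0, 300.0, 2000.0],   # passes the filter
]
processed = preprocess(expr)
print([index for index, _ in processed])
```

A gene that passes the filter always has a max/min ratio above 5, so its standard deviation is strictly positive and the scaling step is well defined.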
Figure 5.1b plots the number of selected genes versus the iteration number, and Figure 5.1c plots the log relative posterior probabilities of the selected genes, $\log p(\gamma \mid Y, X, Z)$, versus the iteration number. Figures 5.1b and c show that the three chains mixed well within 10,000 iterations. We collected 50,000 observations after 10,000 burn-in iterations to estimate the posterior gene inclusion probabilities (see (5.17)). Based on the entire training data, the 12 most significant genes, ranked by their posterior gene inclusion probabilities, are presented in Table 5.1. The leading gene in Table 5.1 is M27891, which also leads the lists of strong genes in the works of Yeung et al. [23] and Koo et al. [34]. Cystatins (CST3) are endogenous protein inhibitors of cathepsins, and these protease–inhibitor pairs, reported in myeloid cell lines with altered development, might be important in the etiology of AML. Golub et al. [1] already showed that the cystatin C gene is responsible for the subtype classification of leukemia as a two-class (ALL/AML) problem. The CST3 gene was also identified by Antonov et al. [35] for AML/ALL classification.
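The posterior gene inclusion probabilities behind the ranking in Table 5.1 are Monte Carlo estimates. Equation (5.17) is not reproduced in this section; a natural estimator, assumed here, is the fraction of retained MCMC draws of $\gamma$ in which a gene is included. The sketch below shows that calculation on a toy chain (all names and values are hypothetical).

```python
def inclusion_probabilities(gamma_draws, burn_in):
    """Fraction of post-burn-in draws in which each gene's indicator equals 1."""
    kept = gamma_draws[burn_in:]
    n_genes = len(kept[0])
    return [sum(draw[g] for draw in kept) / len(kept) for g in range(n_genes)]


def top_genes(probabilities, k):
    """Indices of the k genes with the largest inclusion probabilities."""
    order = sorted(range(len(probabilities)),
                   key=lambda g: probabilities[g], reverse=True)
    return order[:k]


# Toy chain: 2 burn-in draws followed by 4 retained draws over 4 genes.
gamma_draws = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
probabilities = inclusion_probabilities(gamma_draws, burn_in=2)
print(probabilities, top_genes(probabilities, 2))
```

In the analysis described above, the chain would hold 60,000 draws with `burn_in=10000`, and `top_genes(probabilities, 12)` would correspond to the 12 genes listed in Table 5.1.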
Figure 5.1 (a) Gene inclusion probabilities (in percentages) versus the gene index; (b) and (c) the number of selected genes and the log relative posterior probabilities of the selected genes, respectively, versus the iteration number for the first 10,000 iterations.

Table 5.1 Significant genes found for discriminating ALL-T, ALL-B, and AML.

| Rank | Gene ID | Gene description |
|---|---|---|
| 1 | M27891 | CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage) a),b) |
| 2 | X03934 | GB DEF = T-cell antigen receptor gene T3-delta a) |
| 3 | X59871 | TCF7 Transcription factor 7 (T-cell specific) a) |
| 4 | U23852 | GB DEF = T-lymphocyte specific protein tyrosine kinase p56lck (lck) aberrant mRNA |
| 5 | D88422 | CYSTATIN A |
| 6 | M89957 | IGB Immunoglobulin-associated beta (B29) |
| 7 | X04145 | CD3G CD3G antigen, gamma polypeptide (TiT3 complex) |
| 8 | M37271 | T-cell antigen CD7 precursor |
| 9 | U05259 | MB-1 gene |
| 10 | M31523 | TCF3 Transcription factor 3 (E2A immunoglobulin enhancer binding factors E12/E47) |
| 11 | U22376 | C-myb gene extracted from human (c-myb) gene, complete primary cds, and five complete alternatively spliced cds |
| 12 | U49020 | MEF2A gene (myocyte-specific enhancer factor 2A, C9 form) extracted from Human myocyte-specific enhancer factor 2A (MEF2A) gene, first coding |

a) Yeung et al. [23].
b) Koo et al. [34].
The relevance of gene X59871 to T-cell ALL has been reported in the biological literature. The gene TCF7 (transcription factor 7, T-cell specific) encodes a transcription factor that is a member of the high-mobility group protein family. The expression of TCF7 is specific to T cells, and the gene product was originally designated TCF-1, a T-cell specific transcription factor. A closely related factor, LEF-1 (lymphocyte transcription factor), is expressed in both T- and B-cell lineages. Both TCF-1 and LEF-1 arise from the same gene, TCF7, by alternative splicing and the use of dual promoters [36]. We also identified some genes not identified by Yeung et al. [23] and Koo et al. [34], such as U05259 and M31523. The MB-1 gene encodes the Ig-alpha protein of the B-cell antigen component but may have other functions in addition to its role in signal transduction in B lineage cells. Ha et al. [37] reported that MB-1 transcripts could be detected in pre-B cell lines, in fetal bone marrow, and in normal, mitogen-activated, and transformed B cells, but not in myeloma plasma cells. Furthermore, MB-1 is located in the 19q13 chromosomal region, which is known to be a site of recurrent abnormalities in ALL. The MB-1 gene was also identified for AML/ALL classification [1, 2]. Kamps et al. [38] showed that the heterodimers between tissue-specific basic helix-loop-helix (bHLH) proteins and TCF3 play major roles in determining tissue-specific cell fate during embryogenesis, such as muscle or early B-cell differentiation. They are involved in a form of pre-B-cell acute lymphoblastic leukemia (B-ALL) through a chromosomal translocation that involves TCF3 and PBX1.

We first evaluate the performance of the classification method for a selected subset of genes with the LOOCV procedure. An external LOOCV procedure proposed by Ambroise and McLachlan [17] was used to perform the evaluation.
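What makes the LOOCV external is that gene selection is repeated inside every fold, so the held-out sample never influences which genes are chosen. The sketch below illustrates that structure under simplifying assumptions: genes are ranked by the BSS/WSS ratio of Dudoit et al. [4], and a nearest-centroid rule stands in for the gg-SSVS selection and classification step, which is not reproduced here. The data and all function names are hypothetical.

```python
def bss_wss(x, y, gene):
    """Ratio of between-group to within-group sums of squares for one gene."""
    column = [row[gene] for row in x]
    overall = sum(column) / len(column)
    bss = wss = 0.0
    for label in set(y):
        group = [v for v, lab in zip(column, y) if lab == label]
        centre = sum(group) / len(group)
        bss += len(group) * (centre - overall) ** 2
        wss += sum((v - centre) ** 2 for v in group)
    return bss / wss if wss > 0 else float("inf")


def select_genes(x, y, n_keep):
    scores = [bss_wss(x, y, g) for g in range(len(x[0]))]
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:n_keep]


def nearest_centroid(x, y, genes, new_row):
    """Stand-in classifier: label of the closest class centroid on the selected genes."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(x, y) if lab == label]
        dist = sum((new_row[g] - sum(r[g] for r in rows) / len(rows)) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def external_loocv(x, y, n_keep):
    """Number of misclassified samples when selection is redone in every fold."""
    errors = 0
    for i in range(len(x)):
        x_train, y_train = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(x_train, y_train, n_keep)  # uses training samples only
        if nearest_centroid(x_train, y_train, genes, x[i]) != y[i]:
            errors += 1
    return errors


# Toy data: 9 samples, 3 classes, 4 genes; genes 0 and 1 carry the class signal.
x = [
    [0.0, 0.1, 5.0, 1.0], [0.2, -0.1, 1.0, 3.0], [-0.1, 0.0, 3.0, 2.0],
    [3.0, 0.1, 2.0, 2.5], [3.2, -0.2, 4.0, 1.5], [2.9, 0.0, 1.5, 3.5],
    [0.1, 3.0, 3.5, 2.0], [-0.2, 3.1, 2.5, 1.0], [0.0, 2.8, 4.5, 3.0],
]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
selected = select_genes(x, y, 2)
errors = external_loocv(x, y, 2)
print(sorted(selected), errors)
```

Selecting the genes once on all samples and then cross-validating only the classifier would be the internal procedure, which is the source of the selection bias discussed in the introduction.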
Similar to many other multivariate methods, the external LOOCV procedure is challenged by substantial memory requirements and long computation times. Following earlier attempts to overcome these problems (see [39, 40]), we perform the external LOOCV procedure as follows: (1) omit one observation from the training set; (2) based on the remaining observations, reduce the set of available genes to the top 50 genes as ranked by the ratio BSS/WSS [4]; (3) re-select the p most significant genes from these 50 genes by our gg-SSVS approach; (4) use the re-selected p genes to classify the left-out sample; and (5) go back to Step (1) and select another observation. This process is repeated until each observation in the training set has been held out and predicted exactly once. The numbers of misclassification errors of our method with p = 8, 10, and 12 are 3, 2, and 2, respectively.

We further evaluate the performance of the classification methods on the test data. Our classification of the test data with p = 8, 10, and 12 genes yielded one misclassification error, that is, an error rate of 0.0294 (see Table 5.2). The test data have also been analyzed by other multiclass classification methods. For instance, Lee and Lee [41] reported one test error with the multicategory support vector machine procedure using 40 selected genes. Yeung et al. [23] applied the Bayesian model averaging (BMA) approach and reported one misclassified sample on the test set using 15 genes. This result is one of the most favorable results in the literature. Tan et al. [42] applied the k-Top Scoring Pairs (k-TSP) method to classify the test data and reported one classification error with 36 genes. Koo et al. [34] applied the structured polychotomous machine (SPM) to the test data and reported three classification errors using four genes. Our results on the test error rate, together with those given in previously published papers, are summarized in Table 5.2. Our method, with fewer genes, is shown to be comparable to other popular classification methods.

Table 5.2 The comparison of classification results for leukemia test data.

     Method                  No. of genes   Overall test error rate
1    Multicategory SVM a)    40             0.0294
2    HC-k-TSP c)             36             0.0294
3    BMA b)                  15             0.0294
4    PAM d)                  8              0.0588
5    SVM-RFE d)              6              0.0882
6    SPM d)                  4              0.0882
7    gg-SSVS                 8              0.0294
8    gg-SSVS                 10             0.0294
9    gg-SSVS                 12             0.0294

a) Lee and Lee [41]. b) Yeung et al. [23]. c) Tan et al. [42]. d) Koo et al. [34].

Whether or not the selected genes serve as legitimate markers for multiclass classification of the test data was further verified by a heat map of the selected genes. By visual inspection of the expression of the 12 selected genes, we detect some patterns for classifying ALL-T, ALL-B, and AML. Figure 5.2 illustrates three different patterns of the 12 selected genes in the same fashion as Figure 5.1 in Lee and Lee [41] and Figure 5 in Koo et al. [34].

Figure 5.2 Genes that distinguish ALL-B, ALL-T, and AML. Each column corresponds to a sample array and each row corresponds to a gene. The heat map is generated by using the Matrix2png software. Genes with expression levels greater than the mean are colored in red and those below the mean are colored in green.

To assess the sensitivity of the Bayesian results to the hyperparameter inputs of the prior distributions, we reanalyzed the data set using different values of $c$, $\pi_i$, $h$, $R_0$, $r_0$, and $t$. For instance, using $c = 5$ as suggested by Lamnisos et al. [12], $\pi_i = 0.007$, $h = 200$, $R_0 = 4I$, $r_0 = 6$, and $t = 0.005$, the identification of the relevant genes and the performance of classification are essentially the same as before.

5.5.2 Lymphoma Data
The lymphoma data set was previously analyzed by Alizadeh et al. [21] and is publicly available at http://llmpp.nih.gov/lymphoma/data/figure1. This data set contains gene expression levels of 4026 well-measured genes for the three most prevalent adult lymphoid malignancies: diffuse large B-cell lymphoma (DLBCL), chronic lymphocytic leukemia (CLL), and follicular lymphoma (FL). The total sample size is 62, of which 42 samples are DLBCL, 11 samples are CLL, and 9 samples are FL. Some samples contain a number of genes with unreliable or missing data. The following steps [4, 43] are used to impute the missing data for each gene with missing entries: (i) compute its correlation with all other $p - 1$ genes, and (ii) for each missing entry, identify the five nearest genes having complete data for this entry and impute the missing entry by the average of the corresponding entries for the five neighbors. Each sample is further standardized to have mean zero and variance one across genes. We classify DLBCL, CLL, and FL using our method. We applied the Bayesian gg-SSVS method with the same hyperparameter inputs as in the first example. The initial value of the indicator vector $\gamma^{(0)}$ is again taken with 25 randomly selected elements set to 1. The posterior gene inclusion probabilities estimated on the entire training data are presented in Figure 5.3. The relevant genes selected on the basis of these probabilities are reported in Table 5.3, together with the relevant genes selected by Tibshirani et al. [44] and Draminski et al. [45].

Figure 5.3 Gene inclusion probabilities (in percentages) versus the gene index.

Table 5.3 Significant genes found for discriminating DLBCL, CLL, and FL.

Rank  Gene ID     Gene description
1     GENE1622X   CD63 antigen (melanoma 1 antigen); Clone = 769861 a),b)
2     GENE3805X   ISGF3 gamma = IFN alpha/beta-responsive transcription factor ISGF3 gamma subunit (p48); Clone = 1372,520 a)
3     GENE1644X   (cathepsin L; Clone = 345,538) a),b)
4     GENE1775X   (Unknown UG Hs.140,483 ESTs; Clone = 1319,683) a)
5     GENE1648X   Cathepsin B; Clone = 297,219 a)
6     GENE1647X   Cathepsin B; Clone = 261,517 a),b)
7     GENE1673X   Glutathione peroxidase 1; Clone = 712,106 a)
8     GENE1610X   Mig = Humig = chemokine targeting T cells; Clone = 8 a),b)
9     GENE1795X   CD31 = PECAM-1; Clone = 359,925
10    GENE653X    (Lactate dehydrogenase A; Clone = 686,889) a),b)
11    GENE2403X   (Unknown; Clone = 1356,913) a),b)
12    GENE30X     (NC2 alpha subunit = repressor of class II gene transcription through specific binding to TBP-promoter complexes via heterodimeric histone fold domains; Clone = 1340,774)

a) Tibshirani et al. [44]. b) Draminski et al. [45].

Since there is no test set available, the external LOOCV procedure described in the leukemia data section was applied to obtain the classification error on the training set. In Table 5.4, we compare our classification results with the following popular classification methods: LogitBoost (estimated), AdaBoost (100 iterations), classification tree [22], random forest var.sel., SC.s, and NN.vs [46]. We observe from Table 5.4 that our results are comparable to those obtained by the existing methods.

Table 5.4 Comparison of LOOCV results of different methods for lymphoma data.
     Method                                 No. of genes   LOOCV error rate
1    SC.s b)                                2796           0.0330
2    Random forest var.sel. (s.e. = 0) b)   73             0.0470
3    Random forest var.sel. (s.e. = 1) b)   58             0.0420
4    NN.vs b)                               15             0.0400
5    LogitBoost, estimated a)               10             0.0323
6    AdaBoost, 100 iterations a)            10             0.0484
7    Classification tree a)                 10             0.2258
8    gg-SSVS                                8              0.0323
9    gg-SSVS                                10             0.0323
10   gg-SSVS                                12             0.0161

a) Dettling and Bühlmann [22]. b) Díaz-Uriarte and Andes [46].
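The external LOOCV procedure used for both data sets can be sketched as follows. This is a minimal illustration on simulated data, not the authors' implementation: the function names `bss_wss` and `external_loocv` are hypothetical, the second-stage gg-SSVS selection is replaced by simply keeping the top p genes of the BSS/WSS ranking, and a nearest-centroid rule stands in for the classifier. The point it shows is that gene selection is repeated inside every fold, so the held-out sample never influences which genes are chosen.

```python
import numpy as np

def bss_wss(X, y):
    """Ratio of between-class to within-class sum of squares for each gene (column)."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        centroid = Xk.mean(axis=0)
        bss += len(Xk) * (centroid - overall) ** 2
        wss += ((Xk - centroid) ** 2).sum(axis=0)
    return bss / (wss + 1e-12)  # small constant avoids division by zero

def external_loocv(X, y, n_top=50, p=8):
    """Count LOOCV errors with gene selection redone inside every fold."""
    n = X.shape[0]
    errors = 0
    for i in range(n):
        train = np.arange(n) != i                      # step (1): omit observation i
        Xtr, ytr = X[train], y[train]
        ranked = np.argsort(bss_wss(Xtr, ytr))[::-1][:n_top]   # step (2): top genes
        genes = ranked[:p]        # step (3): placeholder for gg-SSVS (assumption)
        classes = np.unique(ytr)
        centroids = np.array([Xtr[ytr == k][:, genes].mean(axis=0) for k in classes])
        dists = ((centroids - X[i, genes]) ** 2).sum(axis=1)
        errors += int(classes[np.argmin(dists)] != y[i])       # step (4): classify
    return errors

# Simulated three-class data: 30 samples, 200 genes, 5 of them informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 10)
X = rng.normal(size=(30, 200))
X[:, :5] += 3.0 * y[:, None]
n_errors = external_loocv(X, y, n_top=50, p=8)
print(n_errors, n_errors / len(y))
```

Selecting the genes once on the full data and then cross-validating only the classifier would reuse the held-out sample during selection, which is the selection bias that Ambroise and McLachlan [17] warn about.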
5.5.3 Computational Time
A single run of gg-SSVS on the whole set of variables takes about 4.5 h for the leukemia data and about 5 h for the lymphoma data, for 60,000 iterations on a PC with an Intel Core2 1.86 GHz CPU and 1 GB of RAM.
5.6 Discussion
This chapter studies the problem of gene selection and multiclass classification when the sample size is small and the number of genes is large. Auxiliary variables are employed to relate the multinomial probit model to a multivariate regression model. We propose a Bayesian stochastic search variable selection method for gene selection on multiclass microarray data. The gg-prior is employed to solve the singularity problem of the covariance matrix involved in the g-prior. To draw the indicator variable, we integrate the regression coefficients out of the joint posterior distribution, so that the MCMC chain will not be reducible. Our method also produces the posterior probabilities for selected genes, which is helpful in biological interpretation. Compared to other approaches on the same multiclass microarray data, our method uses fewer genes and produces comparable classification accuracy.

Yang and Song [47] proposed a hierarchical Bayesian model with an MCMC-based stochastic search algorithm to perform gene selection and classification for a two-class problem. They employed a generalized singular g-prior (gsg-prior) based on the Moore–Penrose generalized inverse of the covariance matrix. We also applied the gsg-prior to gene selection and multiclass classification. The gsg-SSVS with p = 8, 10, and 12 reported a 0.0588 error rate in all three cases for the leukemia test data, which is slightly worse than the current results in Table 5.2, and LOOCV error rates of 0.0323, 0.0323, and 0.0161 for the lymphoma data, which are the same as the current results in Table 5.4. However, the gsg-SSVS approach is more computationally demanding because the Moore–Penrose generalized inverse of the covariance matrix must be computed in each MCMC iteration.

In this chapter, we consider $c$ and $\pi_i$ as known hyperparameters in their prior distributions. This restriction can be relaxed by treating them as unknown parameters and assigning prior distributions to them. Extending our framework to account for an interaction structure between genes would also be of interest.
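The singularity issue mentioned above can be illustrated numerically. Zellner's g-prior uses the inverse of the cross-product matrix of the design matrix, which does not exist when the number of selected genes exceeds the number of samples. The sketch below uses a small simulated matrix (the dimensions and variable names are assumptions for illustration only) and shows that the cross-product matrix is rank deficient while its Moore–Penrose generalized inverse, the ingredient of the gsg-prior, is still well defined. It does not implement either prior.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 8                      # fewer samples than selected genes
X = rng.normal(size=(n, p))      # hypothetical expression matrix (samples x genes)
G = X.T @ X                      # p x p cross-product matrix, rank at most n

# The rank of G equals the rank of X, so G cannot be inverted when n < p.
rank = np.linalg.matrix_rank(X)

# The Moore-Penrose generalized inverse exists for any matrix.
G_pinv = np.linalg.pinv(G, rcond=1e-10)

print("rank:", rank, "dimension:", p)
print("G G+ G == G:", np.allclose(G @ G_pinv @ G, G))
```

Computing such a generalized inverse requires a singular value decomposition, which is consistent with the remark that recomputing it in every MCMC iteration is costly.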
5.7 Perspective
One possible future research direction is the development of Bayesian variable selection approaches designed specifically to enhance the robustness of the finally selected gene subsets. Given the small sample sizes of the majority of bioinformatics applications, we feel that the further development of such techniques, combined with appropriate evaluation criteria, constitutes an interesting direction for future Bayesian variable selection research. Another interesting opportunity for future Bayesian variable selection research is the extension toward upcoming bioinformatics domains, such as text and literature mining. While Bayesian variable selection approaches are not yet as central in these domains as in gene expression analysis, we believe that their application will become essential in dealing with the high-dimensional character of these applications.
References
1 Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
2 Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., and Yakhini, Z. (2000) Tissue classification with gene expression profiles. J. Comput. Biol., 7, 559–583.
3 Tusher, V.G., Tibshirani, R., and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98, 5116–5121.
4 Dudoit, Y., Yang, H., Callow, M., and Speed, T. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.
5 Nguyen, D.V. and Rocke, D.M. (2002) Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 18, 1216–1226.
6 Dettling, M. (2004) Bag boosting for tumor classification with gene expression data. Bioinformatics, 20, 3583–3593.
7 Yeung, K.Y. and Bumgarner, R.E. (2003) Multi-class classification of microarray data with repeated measurements: application to cancer. Genome Biol., 4, R83.
8 Jaeger, J., Sengupta, R., and Ruzzo, W.L. (2003) Improved gene selection for classification of microarrays. Pac. Symp. Biocomput., 8, 53–64.
9 George, E.I. and McCulloch, R.E. (1993) Variable selection via Gibbs sampling. J. Am. Stat. Assoc., 88, 881–889.
10 Sha, N., Vannucci, M., Tadesse, M.G., Brown, P.J., Dragoni, I., Davies, N., Roberts, T.C., Contestabile, A., Salmon, N., Buckley, C., and Falciani, F. (2004) Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics, 60, 812–819.
11 Zhou, X., Wang, X., and Dougherty, E.R. (2006) Multi-class cancer classification using multinomial probit regression with Bayesian gene selection. IEE Proc.-Syst. Biol., 153, 70–78.
12 Lamnisos, D., Griffin, J.E., and Steel, Mark F.J. (2009) Transdimensional sampling algorithms for Bayesian variable selection in classification problems with many more variables than observations. J. Comput. Graph. Stat., 18, 592–612.
13 Gupta, M. and Ibrahim, J.G. (2009) An information matrix prior for Bayesian analysis in generalized linear models with high dimensional data. Stat. Sinica, 19, 1641–1663.
14 Zellner, A. (1986) On assessing prior distributions and Bayesian regression analysis with g-prior distributions, in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (eds P.K. Goel and A. Zellner), North-Holland, Amsterdam, pp. 233–243.
15 Gupta, M. and Ibrahim, J.G. (2007) Variable selection in regression mixture modeling for the discovery of gene regulatory networks. J. Am. Stat. Assoc., 102, 867–880.
16 Train, K. (2003) Discrete Choice Methods with Simulation, Cambridge University Press, Cambridge.
17 Ambroise, C. and McLachlan, G.J. (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA, 99, 6562–6566.
18 Rocke, D.R., Ideker, T., Troyanskaya, O., Quackenbush, J., and Dopazo, J. (2009) Papers on normalization, variable selection, classification or clustering of microarray data. Bioinformatics, 25, 701–702.
19 Gilks, W., Richardson, S., and Spiegelhalter, D. (1996) Markov Chain Monte Carlo in Practice, Chapman and Hall, London.
20 Panagiotelis, A. and Smith, M. (2008) Bayesian identification, selection and estimation of semiparametric functions in high dimensional additive models. J. Econometrics, 143, 291–316.
21 Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Staudt, L.M. et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.
22 Dettling, M. and Bühlmann, P. (2003) Boosting for tumor classification with gene expression data. Bioinformatics, 19, 1061–1069.
23 Yeung, K.Y., Bumgarner, R.E., and Raftery, A.E. (2005) Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics, 21, 2394–2402.
24 Dawid, A.P. (1981) Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68, 265–274.
25 Brown, P.J. (1993) Measurement, Regression, and Calibration, Clarendon, Oxford.
26 Albert, J. and Chib, S. (1993) Bayesian analysis of binary and polychotomous response data. J. Am. Stat. Assoc., 88, 669–679.
27 Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE T. Pattern Anal., 6, 721–741.
28 Lachenbruch, P.A. and Mickey, M.R. (1968) Estimation of error rates in discriminant analysis. Technometrics, 10, 1–11.
29 McLachlan, G.J. (1992) Discriminant Analysis and Statistical Pattern Recognition, John Wiley & Sons, New York, p. 342.
30 Gelfand, A. (1996) Model determination using sampling-based methods, in Markov Chain Monte Carlo in Practice (eds W.R. Gilks, S. Richardson, and D.J. Spiegelhalter), Chapman and Hall, London, pp. 145–158.
31 Genz, A. and Bretz, F. (2002) Methods for the computation of multivariate t-probabilities. J. Comput. Graph. Stat., 11, 950–971.
32 Smith, M. and Kohn, R. (1996) Nonparametric regression via Bayesian variable selection. J. Econometrics, 75, 317–343.
33 Brown, P.J., Vannucci, M., and Fearn, T. (1998) Multivariate Bayesian variable selection and prediction. J. Royal Statist. Soc. B, 60, 627–641.
34 Koo, J.Y., Sohn, I., Kim, S., and Lee, J.W. (2006) Structured polychotomous machine diagnosis of multiple cancer types using gene expression. Bioinformatics, 22, 950–958.
35 Antonov, A.V., Tetko, I.V., Mader, M.T., Budczies, J., and Mewes, H.W. (2004) Optimization models for cancer classification: extracting gene interaction information from microarray expression data. Bioinformatics, 20, 644–652.
36 Kingsmore, S.F., Watson, M.L., and Seldin, M.F. (1995) Genetic mapping of the T lymphocyte-specific transcription factor 7 gene on mouse chromosome 11. Mamm. Genome, 6, 378–380.
37 Ha, H.J., Kubagawa, H., and Burrows, P.D. (1992) Molecular cloning and expression pattern of a human gene homologous to the murine mb-1 gene. J. Immunol., 148, 1526–1531.
38 Kamps, M.P., Murre, C., Sun, X.-H., and Baltimore, D. (1990) A new homeobox gene contributes the DNA binding domain of the t(1;19) translocation protein in pre-B ALL. Cell, 6, 547–555.
39 Chu, W., Ghahramani, Z., Falciani, F., and Wild, D.L. (2005) Biomarker discovery in microarray gene expression data with Gaussian processes. Bioinformatics, 21, 3385–3393.
40 Le Cao, K.-A. and Chabrier, P. (2008) ofw: an R package to select continuous variables for multi-class classification with a stochastic wrapper method. J. Stat. Software, 28, 1–16.
41 Lee, Y. and Lee, C.K. (2003) Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics, 19, 1132–1139.
42 Tan, A.C., Naiman, D.Q., Xu, L., Winslow, R.L., and Geman, D. (2005) Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics, 21, 3896–3904.
43 Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R.B. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.
44 Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2003) Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat. Sci., 18, 104–117.
45 Draminski, M. et al. (2008) Monte Carlo feature selection for supervised classification. Bioinformatics, 24, 110–117.
46 Díaz-Uriarte, A. (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7, 3.
47 Yang, A.J. and Song, X.Y. (2010) Bayesian variable selection for disease classification using gene expression data. Bioinformatics, 26, 215–222.
6 Semisupervised Methods for Analyzing High-dimensional Genomic Data Devin C. Koestler
6.1 Brief Summary
Researchers are frequently interested in the use of high-dimensional genomic data for predicting molecular subtypes, especially when those subtypes are associated with patient survival time, time to disease recurrence, or response to treatment. This is often complicated by the fact that neither the subtypes themselves nor the number of subtypes is known in advance. The identification of such subtypes using DNA microarray data remains a largely open research question; however, a number of promising approaches have been proposed in recent years that fall into a class of statistical procedures called semisupervised methods. In this chapter, we describe semisupervised methods for identifying biologically and clinically meaningful cancer subtypes using DNA microarray data. We begin by motivating semisupervised methods through a discussion of fully unsupervised and fully supervised approaches to this problem. This is followed by a discussion of the basic components of semisupervised methods as well as a description of two promising semisupervised procedures: the semisupervised clustering (SS-Clust) algorithm [1] and semisupervised recursively partitioned mixture models (SS-RPMM) [2]. We then illustrate these methods on mesothelioma cancer data, where the goal is to identify subtypes that associate with patient survival time. We finish with some concluding remarks. Throughout this chapter, we use the terms subtypes, classes, and clusters interchangeably.
6.2 Motivation
Imagine a hypothetical study where we observe survival information on a group of patients, all of whom have the same cancer diagnosis. Given this information, we can begin to understand the long-term prognosis of this cancer by studying the survival profile among these subjects (Figure 6.1). From Figure 6.1a, we can see that patients with this type of cancer have a median survival time of around $t^* = 23$ months. That is, we expect only 50% of subjects with this type of cancer to survive beyond 23 months. Indeed, this is a rapidly fatal cancer! Suppose, however, that in reality there are two subtypes (subtype 1 and subtype 2) of this cancer, which are distinguished by differences in the molecular profile of the tumor, and that knowledge of patient subtype holds important implications for survival time. Looking at Figure 6.1b, we see that patients with subtype 1 have a considerably better long-term prognosis than patients with subtype 2 (i.e., $t_1^* > t_2^*$). Whether or not this is a clinically relevant difference is a separate issue; the important point at the moment is that knowledge of patient subtype tells us something about how long we expect these patients to survive. From a clinical perspective, there would be tremendous value in using the molecular information that defines the two subtypes to predict subtype for a future group of patients, especially if the difference in survival is clinically relevant. The ability to accurately predict subtype for a future patient might also facilitate more targeted treatment strategies, ultimately resulting in more favorable survival prognoses.

Figure 6.1 Kaplan–Meier survival estimates (probability of survival versus time) for (a) all subjects with the same clinical disease diagnosis and (b) stratified by patient subtype. The median survival times for subtypes 1 and 2 are denoted by $t_1^*$ and $t_2^*$, respectively. Time is assumed to be in months.

Although this example is hypothetical, the prospect of such discoveries has sparked significant interest within the scientific community in the identification of clinically important molecular subtypes/classes for a variety of different cancers. The task of class identification concerns the discovery of biologically meaningful classes or groups of subjects using information contained in DNA microarrays, with the goal that the resulting classes will provide clues to patient disparities with respect to disease recurrence, survival time, and/or response to treatment. More formally, the objective is to identify a set of latent classes $C_i = k \in \{1, 2, \ldots, K\}$, based on available microarray data $X_i$, such that the resultant classes associate with some phenotype of interest, $Y_i$, $i = 1, 2, \ldots, n$. Class identification is complicated by a number of factors, including the high dimensionality ($P$ features) and small sample sizes ($n$ samples) characteristic of microarray data (aptly referred to as the "small $n$, large $P$" problem) and the fact that the number of classes $K$ is unknown for most practical applications. Semisupervised methods address the first of these complications by restricting the feature space to a relevant subset of the features in $X$. The guiding principle of these approaches is that, since much of the feature space $X$ does not differ across subjects in any biologically meaningful way, restricting the feature space to the features most relevant to the problem at hand will facilitate more meaningful class identification and subsequent class prediction for a future group of subjects. The term relevant is intentionally left vague for the moment, but will be further described in the sections that follow.
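The Kaplan–Meier estimates and median survival times discussed in this section can be computed with a few lines of code. The sketch below is a minimal pure-Python version on made-up data; the function names and the ten survival times are assumptions chosen only for illustration and are not taken from the study described above.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    events[i] is 1 if death was observed at times[i] and 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect all subjects whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

def median_survival(curve):
    """Smallest time at which estimated survival is at most 0.5 (None if never reached)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical follow-up times in months; 0 marks a censored observation.
times = [5, 8, 12, 12, 20, 23, 30, 34, 40, 40]
events = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
print(curve)
print("median survival:", median_survival(curve))
```

Applying the same two functions separately to the subjects of each subtype would give the stratified curves and the subtype-specific medians.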
6.3 Existing Approaches
Using the previous example as motivation, our focus throughout the remainder of this chapter is the problem of identifying cancer subtypes that associate with patient survival time. To illustrate the various methods at our disposal, suppose we have microarray data $X_{P \times n} = [X_1, X_2, \ldots, X_n]$ on a sample of $n$ subjects who have the same cancer diagnosis. Here, $X_i$ is a $P \times 1$ vector of features (e.g., SNPs, gene expression levels, or DNA methylation levels) for subject $i$. For each of these subjects, we also observe the survival time $S_i = (T_i, \delta_i)$, $i = 1, 2, \ldots, n$, where $T_i$ is the time from disease diagnosis to death or censoring and $\delta_i$ indicates whether or not death was observed for subject $i$. More specifically, $T_i \in [0, \infty)$, and $\delta_i = 1$ if subject $i$ died during the study period and $\delta_i = 0$ if subject $i$ survived the length of the study or was lost to follow-up. The objectives herein are as follows: given microarray data and survival times $D_0 = [(X_1, S_1), (X_2, S_2), \ldots, (X_n, S_n)]$, we seek to
1) Identify latent class memberships $C_1, C_2, \ldots, C_n$, where $C_i = k \in \{1, 2, \ldots, K\}$, and
2) Train a classifier that can be used to predict class membership $\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_m$ for a future group of $m$ subjects using available microarray data, $D_1 = [X_1, X_2, \ldots, X_m]$, where $X_i$ is a $P \times 1$ vector of features.

Approaches for identifying clinically relevant cancer subtypes typically fall into three broad categories: fully unsupervised, fully supervised, and, more recently, semisupervised approaches. These approaches are characterized by the way in which the available data $(X, S)$ are used to determine class membership $C$. As described below, fully unsupervised approaches use only $X$ and fully supervised approaches use only $S$ for the determination of classes, whereas semisupervised approaches use both $X$ and $S$ to guide class discovery and, in doing so, may facilitate more meaningful class discovery.
6.3.1 Fully Unsupervised Procedures
Broadly speaking, unsupervised learning refers to the problem of trying to find concealed structure in unlabeled data. In the context of the problem at hand, "concealed structure" represents disparities in the molecular profile that define the cancer subtypes. Thus, a fully unsupervised approach uses only the microarray data, $X$, for the identification of classes/clusters of patients that have a similar molecular profile across some set of features. This can be achieved using any one of a number of different approaches, including nonparametric clustering methods such as K-means clustering [3] and hierarchical clustering [4], or model-based clustering methods via finite mixture models [5]. The choice of one clustering method over another is beyond the scope of this chapter; however, readers are referred to [6, 7] for more thorough coverage of this topic. Whether all or a selected subset of the features are used for cluster analysis, they are chosen independently of the phenotype of interest, $S$. As we will soon see, this is the key aspect that delineates fully unsupervised from semisupervised procedures. Choices for subsetting the feature space of $X$ independently of $S$ include considering only the top $M < P$ most variable features, or ranking features by the coefficient of variation ($c_v = \sigma/\mu$), which represents the inverse "signal-to-noise" ratio. The underlying assumption here is that the features exhibiting the greatest amounts of variation are likely candidates for meaningful class discrimination and separation. Alternatively, features can be chosen using prior biological knowledge, such as features involved in specific biological pathways or those that have an established or purported involvement in the disease process under study. For example, in the context of gene expression data collected on cancer patients, genes could be selected based on their involvement in cell-cycle control, apoptosis, signaling pathways, or any other biological pathways relevant to the disease under study. Once class assignments are made on a training data set $D_0$, the class labels $C_1, C_2, \ldots, C_n$ and the microarray data, $X$, are used to train a classifier for predicting class membership for a future group of subjects, $D_1$ (Figure 6.2). Although class discovery is driven entirely by biological information contained in the microarray data, $S$ is not used in the selection of features for clustering analysis, so there is no guarantee that the identified classes will strongly predict survival time.

6.3.2 Fully Supervised Procedures
Contrary to fully unsupervised approaches, fully supervised approaches use only the phenotypic information, $S$, for the determination of class membership. In the context of the problem at hand, class discovery and assignment are based only on survival time; for example, samples are partitioned into two classes (low- and high-risk classes) based on the observed median survival time. Alternatively, if clinically relevant thresholds exist for the disease under study, samples could be divided into groups based on that information. Once class labels $C_1, C_2, \ldots, C_n$ have been assigned to the subjects in the training data, this information is used in combination with the microarray data collected on those subjects, $X$, for training a classifier. This classifier is then used to predict class membership for a future set of subjects, $D_1$ (Figure 6.2). One obvious limitation of this approach is that it requires prespecification of the number of risk groups, which is often not known and may not be easily determined using only the phenotypic information. Additionally, the classes may lack biological meaning since the microarray data were not used for the identification of classes.

Figure 6.2 Diagrams illustrating fully supervised (a) and fully unsupervised (b) procedures. Fully supervised procedures (a) begin by selecting a threshold for survival time, which is used to partition subjects into two or more groups (i.e., high/low risk groups). Using the class labels in conjunction with subject-specific microarray data, a classifier is trained that can be used for predicting class membership using the genomic data collected for a future group of subjects. Fully unsupervised procedures (b) begin by clustering the subjects on the basis of all features or a selected subset. Based on the clustering solution, subjects are assigned class labels, which serve as the basis for training a classifier for predicting class membership for a future set of subjects.

6.3.3 Semisupervised Procedures
Both fully unsupervised and supervised approaches have some obvious limitations. To circumvent these limitations, semisupervised methods make use of both X and S for the identification of cancer subtypes and by doing so provide biologically meaningful classes that are likely to accurately predict outcome. We describe below
{1,2}
98
j 6 Semisupervised Methods for Analyzing High-dimensional Genomic Data the five components shared by semisupervised methods followed by a description of semisupervised clustering and semisupervised RPMM. 1) Splitting the full data into training and testing sets: The first step of any semisupervised procedure consists of splitting the data into training (used for learning) and testing (used for validation) sets. Sometimes the split is a natural one, such as cases where separate data sets were collected for the purposes of learning and validation. Often times, however, data are collected as one complete set and need to be randomly split into training and testing sets for subsequent steps of the semisupervised procedure. Conventions vary with respect to the proportion of samples allocated to the training and testing sets, however a 50/50 split balances model overfit and a sufficient number of samples in the testing data for model validation. 2) Supervision: The supervision step is the primary component that distinguishes semi-supervised from fully unsupervised approaches. In this step, features are (1) examined with respect to their association with the primary phenotype of interest and (2) ranked based on their strength of association. There are a number of considerations for carrying out (1), which depend on the nature of the phenotype variable (i.e., binary, categorial, continuous, and time-to-event), the existence of potential confounders in the covariate data, and correlation between features. 3) Clustering and class assignment: After features are ranked based on their association with the phenotype of interest, samples in the training set are then clustered using the top M ranked features, where M is either preset by the user or selected using a cross-validation approach. The resulting clustering solution is then used to assign class membership to the samples in the training set. 
4) Predicting class assignment in the testing set: Class membership is then predicted for the samples in the testing set using the latent class solution fit to the training data. The method employed for class assignment predictions, however, largely depends upon the general framework used for clustering the training set samples. For example, in the semisupervised RPMM procedure, training set samples are clustered using a mixture model approach. Since this solution results in parameters that define the various clusters/classes, an empirical Bayes framework can be easily employed for predicting class membership for a future group of subjects based on the same top M features used in the initial clustering of the training data.
5) Testing the association with phenotype: The final step involves testing the association between the classes predicted for the samples in the testing set and the phenotype of interest. Methods for testing the association between predicted class and phenotype will depend on the nature of the primary phenotype variable (i.e., binary, categorical, continuous, or time-to-event) and the existence of potential confounders in the covariate data. For example, if the phenotype is a time-to-event outcome (e.g., survival time), then one can use a log-rank test or a Cox proportional hazards model that adjusts for potential confounders to examine the association between predicted class and time-to-event outcome.
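The first two components above can be sketched in a few lines. This is a minimal illustration only, assuming a continuous phenotype and using the absolute Pearson correlation as the measure of association; the function names, the toy data, and the choice M = 10 are hypothetical and are not taken from the chapter.

```python
import numpy as np

def split_train_test(n_samples, train_fraction=0.5, seed=0):
    """Randomly split sample indices into training and testing sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return np.sort(order[:n_train]), np.sort(order[n_train:])

def rank_features(X, y):
    """Rank features (rows of X) by absolute Pearson correlation with y."""
    Xc = X - X.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    score = np.abs(Xc @ yc) / denom
    return np.argsort(-score), score

# Hypothetical toy data: P = 200 features, n = 60 samples;
# only feature 0 is associated with the phenotype y.
rng = np.random.default_rng(1)
P, n, M = 200, 60, 10
y = rng.normal(size=n)
X = rng.normal(size=(P, n))
X[0] += 3.0 * y

train, test = split_train_test(n)                    # step 1: 50/50 split
order, score = rank_features(X[:, train], y[train])  # step 2: supervision
top_M = order[:M]                                    # features passed on to clustering
```

Ranking is done on the training samples only, so that the testing set remains untouched until the final association test.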
6.3.3.1 Semisupervised Clustering
Unlike fully unsupervised procedures, where features are chosen for clustering irrespective of the phenotype, the semisupervised clustering (SS-Clust) procedure [1] begins by preselecting the features that are most associated with the phenotype of interest. When the phenotype is survival time, this is accomplished by fitting P Cox proportional hazards models (one for each of the P features):
$$h(t_i) = \exp\{\gamma_0 + \gamma_p X_{ip}\}\, h_0(t) \qquad (6.1)$$
where $h_0(t)$ represents the baseline hazard function, $X_{ip}$ is the value of the pth feature for the ith subject, and $\gamma_p$ is the proportional hazards estimate of the log-hazard ratio for the pth feature. For each of the P univariate Cox models, the resulting absolute test statistics (i.e., values of $|\gamma_p|/\mathrm{se}(\gamma_p)$) are recorded and used to rank-order the features based on their association with survival time. Using the M features with the largest absolute test statistics, K-means clustering is used to partition the n subjects into K clusters, where each sample belongs to only a single class.
Since M and K are tuning parameters, they should be selected with care. M can be selected based on prior constraints regarding the maximum number of allowable features; in some instances an investigator may have an upper limit with respect to the number of features used for clustering and class assignment. Alternatively, M can be more robustly determined using cross-validation. We refer interested readers to [1, 2] for further details on cross-validation for semisupervised approaches. The selection of K is one of the foremost issues in problems involving clustering and is a separate issue from the process of actually solving the clustering problem. Similar to the selection of M, in some cases existing biological knowledge or user-defined thresholds can be used for selecting K; however, these methods for specifying K may not coincide with the number of clusters that is best supported by the data. For these reasons, more robust statistical methods have been proposed for determining K, which typically involve iterating through multiple selections of K (i.e., $K \in \{2, 3, \ldots, K_{\max}\}$) and comparing the between- and within-cluster distances to arrive at a suitable specification for K. Depending on the user-defined upper threshold, $K_{\max}$, these procedures can result in considerable computational burden.
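The feature-ranking step for a survival phenotype can be sketched as follows. Note the substitution: rather than the Wald statistic described above, this sketch uses the univariate Cox score statistic evaluated at a log-hazard ratio of zero, which avoids iterative model fitting and typically yields a similar ranking. The function name, the simulated data, and M = 5 are hypothetical.

```python
import numpy as np

def cox_score_statistics(X, time, event):
    """Univariate Cox score statistic (at log-hazard ratio 0) for each feature.

    X: (P, n) features; time: (n,) follow-up times; event: (n,) 1 = event, 0 = censored.
    """
    order = np.argsort(-time)                       # sort subjects by decreasing time
    Xs, ts = X[:, order], time[order]
    ds = event[order].astype(bool)
    csum = np.cumsum(Xs, axis=1)                    # running sums give risk-set sums
    csum2 = np.cumsum(Xs ** 2, axis=1)
    # Risk set of subject j: everyone with time >= ts[j] (ties included)
    size = np.searchsorted(-ts, -ts, side="right")
    mean = csum[:, size - 1] / size                 # risk-set mean of each feature
    var = csum2[:, size - 1] / size - mean ** 2     # risk-set variance of each feature
    U = (Xs[:, ds] - mean[:, ds]).sum(axis=1)       # score, summed over events
    V = var[:, ds].sum(axis=1)                      # information, summed over events
    return U / np.sqrt(V)

# Hypothetical simulated data: only feature 0 raises the hazard.
rng = np.random.default_rng(0)
P, n, M = 50, 120, 5
X = rng.normal(size=(P, n))
t_event = rng.exponential(1.0 / np.exp(1.5 * X[0]))
t_cens = rng.exponential(2.0, size=n)
time = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(int)

z = cox_score_statistics(X, time, event)
top_M = np.argsort(-np.abs(z))[:M]                  # rows of X to pass to K-means
```

A positive statistic indicates that larger feature values are associated with a higher hazard; only the absolute value is used for ranking.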
Using the class labels determined by the K-means clustering solution and the $P \times n$ microarray data, $X$, a nearest shrunken centroid (NSC) classifier [8] is then trained. Briefly, this method computes standardized centroids for each of the K classes, $\bar{x}_{pk}/\hat{\sigma}_{pk}$, where $\bar{x}_{pk}$ and $\hat{\sigma}_{pk}$ represent the within-class mean and standard deviation of the pth feature in the kth class, $k \in \{1, 2, \ldots, K\}$. Each of the class centroids is then "shrunk" toward the overall centroid for all classes by a user-defined threshold. This shrinkage has the advantage of making the classifier more accurate by reducing the effect of noisy genes. This can be easily accomplished using the function "pamr.train" in the R package pamr. Given microarray data $X_{P \times m}$ for a future group of m subjects (testing/validation data), NSC compares the squared distance between the microarray data for each subject and each of the class centroids and assigns subjects to the class with the minimum squared distance. This results in predicted classes $\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_m$ for each of the m subjects. This can be achieved using the "pamr.predict" function, also contained in the R package pamr. The last step involves an assessment of the phenotypic relevance of the identified classes. This is carried out by fitting a Cox proportional hazards model of the form
$$h(t) = \exp\Big\{\gamma_0 + Z^{T}\delta + \sum_{k=2}^{K} \gamma_k\, I(\hat{C}_i = k)\Big\}\, h_0(t) \qquad (6.2)$$
where $Z$ represents any additional relevant patient-specific information to be controlled for (e.g., age, gender, smoking status), $I(\hat{C}_i = k)$ is an indicator of class membership in the kth class for the ith patient, and $\gamma_k$ is the proportional hazards estimate of the log-hazard ratio for the kth class. Based on (6.2), the predicted classes are examined for their association with survival time by testing the hypothesis $H_0: \gamma_2 = \cdots = \gamma_K = 0$.
6.3.3.2 Semisupervised RPMM
Semisupervised RPMM (SS-RPMM) [2] operates in much the same way as SS-Clust; however, it replaces K-means clustering with RPMM [9] and NSC with an empirical Bayes classifier (Figure 6.2). Similar to SS-Clust, SS-RPMM begins by identifying the features that are most associated with survival time using the Cox proportional hazards model in Equation 6.1. The P features are then rank-ordered based on their strength of association with survival time, and the top M are selected for the clustering of samples using RPMM. In short, RPMM is a hierarchical, model-based clustering method for navigating clusters in a mixture model. Based on binary recursive partitioning (BRP), RPMM begins by comparing the model goodness-of-fit between 1-class and 2-class mixture models. If the 2-class model fits the data better, these classes are further split into two new classes and compared to the previous split in terms of model goodness-of-fit. Recursion continues until the algorithm arrives at the most parsimonious representation of the data. This procedure results in an estimate of the number of clusters K as well as the posterior probabilities of class membership, $P(C_i = k \mid X_i, \hat{\zeta})$, $k = 1, \ldots, K$ and $i = 1, \ldots, n$, where $\hat{\zeta}$ is a vector of parameter estimates based on the mixture-model fit. Because RPMM is a model-based method for clustering data, it relies on the choice of a distribution for modeling the features in the microarray data.
For suitably transformed gene expression data (e.g., log-transformed), a Gaussian-distributed RPMM generally works well. For array-based DNA methylation data, methylation values for a particular locus are approximately continuously distributed between 0 (fully unmethylated locus) and 1 (fully methylated locus) and are aptly referred to as β-values. Hence, a beta-distributed RPMM is a sensible choice for clustering DNA methylation data. Gaussian- and beta-distributed RPMMs can be fit using the functions "glcTree" and "blcTree", respectively, in the R package RPMM.
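For reference, the beta-distributed case can be written out explicitly. The following is a sketch under the assumption that the M selected features are modeled as independent within a class; the shape parameters $a_{pk}$ and $b_{pk}$ are notation introduced here for illustration and do not appear in the chapter.

```latex
% Beta density for a single methylation value x in (0, 1)
f(x; a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1 - x)^{b-1},
\qquad 0 < x < 1,\; a > 0,\; b > 0

% Likelihood of subject i's M features within class k,
% assuming independence of features within a class
P(X_i \mid C_i = k, \theta_k) = \prod_{p=1}^{M} f(x_{ip}; a_{pk}, b_{pk}),
\qquad \theta_k = (a_{1k}, b_{1k}, \ldots, a_{Mk}, b_{Mk})
```

The Gaussian case has the same product form, with each factor replaced by a normal density with a class-specific mean and variance.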
Using the parameter estimates obtained from RPMM, $\hat{\zeta}$, and microarray data $X_i$, $i = 1, 2, \ldots, m$, for a future group of subjects, a naive Bayes classifier is used to predict class membership probabilities:
$$P(C_i = k \mid X_i, \hat{\zeta}) = \frac{\hat{\eta}_k\, P(X_i \mid C_i = k, \hat{\theta}_k)}{\sum_{k'=1}^{K} \hat{\eta}_{k'}\, P(X_i \mid C_i = k', \hat{\theta}_{k'})} \qquad (6.3)$$
where $\hat{\eta}_k = P(C_i = k)$, $P(X_i \mid C_i = k, \hat{\theta}_k)$ is the likelihood for subject i within the kth class, and $\hat{\theta}_k$ is a vector of model parameters for the kth class. Once class membership probabilities have been estimated, the phenotypic relevance of the identified classes can be assessed by fitting Cox proportional hazards models of the form
$$h(t) = \exp\Big\{\gamma_0 + Z^{T}\delta + \sum_{k=2}^{K} \gamma_k\, P(C_i = k \mid X_i, \hat{\zeta})\Big\}\, h_0(t) \qquad (6.4)$$
where $Z$ and $\gamma_k$ are as described earlier. Alternatively, the model in Equation 6.2 can be used by assigning subjects to the class that has the maximum class membership probability (i.e., $\hat{C}_i = \arg\max_k P(C_i = k \mid X_i, \hat{\zeta})$).
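The posterior calculation in Equation 6.3 can be sketched for the Gaussian case as follows. This is an illustrative sketch only: it assumes that features are independent within a class, works on the log scale for numerical stability, and uses hypothetical parameter values rather than estimates from a fitted RPMM.

```python
import numpy as np

def class_posteriors(X_new, eta, mu, sigma):
    """Posterior class membership probabilities for new subjects (Gaussian case).

    X_new: (M, m) features by subjects; eta: (K,) class proportions;
    mu, sigma: (M, K) class-specific means and standard deviations.
    Returns an (m, K) matrix whose rows sum to 1.
    """
    x = X_new.T[:, :, None]                                   # (m, M, 1)
    log_dens = (-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                - 0.5 * ((x - mu) / sigma) ** 2)              # (m, M, K)
    log_joint = np.log(eta) + log_dens.sum(axis=1)            # numerator of (6.3), log scale
    log_joint -= log_joint.max(axis=1, keepdims=True)         # guard against underflow
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)             # normalize over classes

# Hypothetical parameters: K = 2 classes, M = 3 features
eta = np.array([0.5, 0.5])
mu = np.array([[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]])
sigma = np.ones((3, 2))
X_new = np.array([[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]])       # two new subjects

post = class_posteriors(X_new, eta, mu, sigma)
hard_class = post.argmax(axis=1)                              # maximum-probability class
```

The rows of `post` can be entered directly into a model of the form (6.4), whereas `hard_class` corresponds to the maximum-probability assignment used with model (6.2).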
6.3.3.3 Considerations Regarding Semisupervised Procedures
As noted in the preceding sections, the SS-Clust and SS-RPMM strategies are procedurally identical but differ in the underlying methodological framework used to carry out class identification and class prediction. Disparities at this level influence the outcome of these procedures, thus warranting further discussion. In the paragraph below we compare K-means clustering and RPMM, which respectively represent the basis for class identification in the SS-Clust and SS-RPMM methods. One limitation of K-means clustering is the requisite prespecification of K, which is unknown for most practical applications. Although there are iterative procedures for determining suitable values of K in the context of K-means, RPMM estimates K automatically. This comes at the cost of increased computational time, the magnitude of which varies considerably with the specification of the underlying model. For example, since there is no closed-form maximum likelihood solution for the two parameters that define the beta distribution, numerical optimization is necessary, thereby contributing to increased computational time. Another key limitation of K-means is its cluster model, which is based on spherical clusters that are separable in such a way that the mean value converges toward the cluster center. Under this framework, clusters are expected to be of similar size, so that assignment to the nearest cluster center is the correct assignment. RPMM, on the other hand, which uses the expectation-maximization (EM) algorithm for parameter estimation, is better able to accommodate clusters of variable size than K-means. A further limitation of RPMM is its reliance on the assumption of class-conditional independence of genomic features.
This assumption is typically made for the purposes of scalability, as fully unstructured covariance matrices are generally infeasible for large P, small n data. Although research has shown that violations of this assumption do not drastically impact the clustering solution, several techniques have been proposed that relax the assumption of class-conditional independence by incorporating more flexible covariance structures. Alternatively, in the context of SS-RPMM, one could limit the selection of features for clustering to those that are mutually independent but individually predictive of survival. The above points are intended to raise awareness of some of the major differences between the SS-Clust and SS-RPMM methodologies and are by no means comprehensive.
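To make the class prediction step of SS-Clust more concrete, the nearest shrunken centroid idea from Section 6.3.3.1 can be sketched as below. This is a simplified illustration and not the pamr implementation: it assumes a pooled within-class standard deviation, soft thresholding of the standardized centroid differences, and a fixed, hypothetical threshold; in practice the threshold would be chosen by cross-validation.

```python
import numpy as np

def nsc_train(X, labels, delta):
    """Fit shrunken class centroids. X: (P, n); labels: (n,); delta: threshold."""
    classes = np.unique(labels)
    n, K = X.shape[1], len(classes)
    overall = X.mean(axis=1)                                          # overall centroid
    cent = np.stack([X[:, labels == k].mean(axis=1) for k in classes], axis=1)
    n_k = np.array([(labels == k).sum() for k in classes])
    resid = X - cent[:, np.searchsorted(classes, labels)]
    s = np.sqrt((resid ** 2).sum(axis=1) / (n - K))                   # pooled within-class SD
    scale = s + np.median(s)                                          # guards against tiny SDs
    m_k = np.sqrt(1.0 / n_k - 1.0 / n)
    d = (cent - overall[:, None]) / (m_k * scale[:, None])            # standardized differences
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)        # soft thresholding
    shrunk = overall[:, None] + m_k * scale[:, None] * d_shrunk
    return {"classes": classes, "centroids": shrunk, "scale": scale, "prior": n_k / n}

def nsc_predict(model, X_new):
    """Assign each column of X_new to the class with the nearest shrunken centroid."""
    diff = X_new[:, :, None] - model["centroids"][:, None, :]         # (P, m, K)
    dist = ((diff / model["scale"][:, None, None]) ** 2).sum(axis=0)  # (m, K)
    score = dist - 2.0 * np.log(model["prior"])
    return model["classes"][np.argmin(score, axis=1)]

# Hypothetical toy data: 50 features, of which the first 5 separate two classes.
rng = np.random.default_rng(0)
P, n_per = 50, 20
labels = np.repeat([0, 1], n_per)
X = rng.normal(size=(P, 2 * n_per))
X[:5, labels == 1] += 2.5
X_test = rng.normal(size=(P, 2 * n_per))
y_test = np.repeat([0, 1], n_per)
X_test[:5, y_test == 1] += 2.5

model = nsc_train(X, labels, delta=1.5)
pred = nsc_predict(model, X_test)
accuracy = np.mean(pred == y_test)
```

Features whose standardized differences fall below the threshold in every class contribute equally to all class distances and therefore drop out of the classification rule, which is how the shrinkage reduces the influence of noisy features.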
6.4 Data Application: Mesothelioma Cancer Data Set
Here we examine the above methods using a cancer data set. The mesothelioma data set is described in Christensen et al. [10] and consists of 158 tumor samples derived from two independent series of mesothelioma cases. Using one series as the training data ($n_0 = 79$) and one as the testing data ($n_1 = 79$), we illustrate the above methods with the goal of identifying cancer subtypes that associate with patient survival time. For the mesothelioma data set, tumor samples were profiled for DNA methylation using the Illumina GoldenGate methylation bead array, resulting in methylation β-values for 1413 autosomal CpG loci. Thus, our microarray data $X$ is a $1413 \times 158$ matrix of methylation β-values. In addition to the DNA methylation measurements and survival information collected for these subjects, the mesothelioma data set also contains information on the patient's age at the time of diagnosis, patient gender, and tumor histology.
Several different unsupervised approaches were considered for the purpose of identifying cancer subtypes that relate to patient survival time. Using the training data, the top M most variable CpGs were identified and subsequently used to cluster the observations in the training set. Here, we fixed $M = 100$ for simplicity and ease of comparison across the various methods, but note that in practice this number would be determined more robustly using cross-validation. We considered three different unsupervised clustering approaches: K-means clustering with $K \in \{2, 3, 4\}$ (K-means), a Gaussian-distributed RPMM (RPMM (Gaussian)), and a beta-distributed RPMM (RPMM (Beta)). For the K-means-based approaches, an NSC classifier was trained to predict class membership for the subjects in the testing data. For the RPMM-based approaches, empirical Bayes was used to predict the class membership probabilities, and subjects were assigned to the class with the highest probability (i.e., $\hat{C}_i = \arg\max_k P(C_i = k \mid X_i, \hat{\zeta})$, $k = 1, 2, \ldots, K$).
For the fully supervised approaches, subjects in the training data were assigned to "high"- or "low"-risk classes based on the median survival time computed in the training set. The two fully supervised methods, Median cut and Median cut (no cens), differed only in how censored observations were assigned to classes. For
the latter, all censored observations were considered to be in the "low"-risk class, whereas for the former, subjects were assigned to classes based on the observed censoring or failure time. Similar to the K-means procedure described above, once subjects were assigned to classes, an NSC classifier was trained to predict class membership for the subjects in the testing data.
Lastly, we considered both SS-Clust and SS-RPMM as representative semisupervised approaches. Whereas the fully unsupervised approaches clustered subjects in the training data using the $M = 100$ most variable loci, here we used the $M = 100$ loci most associated with patient survival, adjusting for patient age, gender, and tumor histology. Again, we note that in practice M would be determined using cross-validation, but we fix it here for simplicity. For the SS-Clust procedure we considered $K \in \{2, 3, 4\}$, and for the SS-RPMM procedure we fit both Gaussian- and beta-distributed RPMMs (Figure 6.3). Similar to the fully unsupervised RPMM procedures described earlier, subjects in the testing data were assigned to the class with the highest class membership probability. For each of the above methodologies, model (6.2) was used to examine the association between the predicted classes in the testing set and patient survival time, adjusting for patient age, gender, and tumor histology. The p-value and pseudo-$R^2$ from these models were recorded and used to compare the various methodologies and their capacity for uncovering classes that associate with patient survival time.

Figure 6.3 Diagram illustrating the various components of the SS-RPMM procedure. EB: empirical Bayes.

Table 6.1 Results obtained from fully unsupervised, supervised, and semisupervised procedures for identifying phenotypically important classes in the mesothelioma cancer data set.

Procedure            Method                     K      P-value   Pseudo-R2
Fully unsupervised   K-means a)                 2      0.257     0.05
                     K-means a)                 3      0.383     0.05
                     K-means a)                 4      0.527     0.05
                     RPMM (Gaussian) b)         2 c)   0.152     0.07
                     RPMM (Beta) b)             3 c)   0.243     0.07
Fully supervised     Median-cut a)              2      0.230     0.06
                     Median-cut (no cens) a)    2      0.209     0.06
Semisupervised       SS-Clust a)                2      0.050     0.09
                     SS-Clust a)                3      0.006     0.16
                     SS-Clust a)                4      0.028     0.13
                     SS-RPMM (Gaussian) b)      7 c)   0.002     0.29
                     SS-RPMM (Beta) b)          5 c)   0.004     0.21

a) Denotes methods that used an NSC classifier for predicting class membership for the subjects in the testing set.
b) Denotes methods that used empirical Bayes for predicting class membership probabilities for the subjects in the testing data; subjects were assigned to the class with the largest probability of class membership.
c) Denotes instances in which K was directly estimated from the data.

6.4.1 Results: Mesothelioma Cancer Data Set
Although our analysis represents a relatively small-scale comparison of fully unsupervised, fully supervised, and semisupervised methods, the results given in Table 6.1 show a clear preference toward semisupervised methods for identifying phenotypically important classes based on DNA methylation information. The pseudo-$R^2$, which ranges from 0 to 1 with higher values indicating better model fit, suggests that the semisupervised methods are much better suited for identifying biologically and clinically important classes. This is unsurprising, given that these methods partition a feature space that consists of only the features most relevant to the phenotype under consideration. This results in more meaningful and directed class identification in the training data and an increased likelihood that the predicted classes in the independent testing set hold clinical and biological value. Fully unsupervised approaches, which used the most variable features in the training set for class discovery, performed much more poorly, suggesting that class discovery based on the most variable features alone may fall short in terms of identifying classes that associate with the phenotype of interest.
Figure 6.4 Kaplan-Meier survival curve(s) based on the estimates for all subjects in the testing data (a) and stratified by the five classes (Class 1 to Class 5) identified using the SS-RPMM (Beta) procedure (b). Both panels plot the probability of survival against time (months, 0 to 60); panel (a) marks a median survival time of 16 months.
Focusing our attention on the results from the SS-RPMM (Beta) procedure in more detail, we note that the five predicted classes in the testing data have distinct survival trajectories (Figure 6.4). Most notably, subjects predicted to be in class 2 have much more favorable long-term prognoses than those in any of the other classes. Even more striking is that the median survival time for subjects in this class is more than 2.5 times greater than the overall median survival time for subjects diagnosed with mesothelioma (approx. 14 months, depending on the source) and 3.5 times greater than that of the highest-risk class identified by this procedure. Although the profile of methylation markers identified here is far from being used in clinical practice, the results suggest that information on the molecular level (in this case DNA methylation signatures) can provide additional insight into the pathophysiology of a cancer and may help identify patients with differing long-term prognoses.
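Median survival times such as those compared above are read from Kaplan-Meier curves. A minimal sketch of the estimator is given below; the function names and the small data set are hypothetical and unrelated to the mesothelioma data.

```python
import numpy as np

def kaplan_meier(time, event):
    """Distinct event times and the Kaplan-Meier survival estimate at each of them."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)              # subjects still under observation at t
        deaths = np.sum((time == t) & event)     # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def median_survival(event_times, surv):
    """First event time at which estimated survival is 0.5 or lower (None if never)."""
    below = np.nonzero(surv <= 0.5)[0]
    return float(event_times[below[0]]) if below.size else None

# Hypothetical follow-up times in months; event = 1 is a death, 0 is censored.
time = [2, 3, 3, 5, 8, 8, 12, 15]
event = [1, 1, 0, 1, 1, 0, 1, 0]

event_times, surv = kaplan_meier(time, event)
median = median_survival(event_times, surv)
```

Applying the same two functions within each predicted class gives class-specific curves and medians of the kind shown in panel (b) of Figure 6.4.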
6.5 Perspective
Large-scale genomic data have provided researchers a glimpse into the molecular composition of a wide variety of diseases, including cancer. This has generated a great deal of interest in the application of these data for the discovery of biologically and clinically meaningful cancer subtypes, coupled with a companion need for appropriate analytical methods for carrying out such analyses. In this chapter, we described several promising semisupervised approaches (namely, SS-Clust and SS-RPMM) for addressing such questions, with an emphasis on identifying cancer subtypes that associate with patient survival time. Although our focus was on "time-to-event" phenotypes (i.e., survival), the semisupervised procedures described here can be easily modified to accommodate binary, continuous, or categorical phenotypes, for example, by substituting the Cox proportional hazards model with a linear regression or generalized linear model, depending on the nature of the phenotypic variable. Semisupervised methods operate on the premise of more directed class identification by restricting the feature space to the most meaningful features for the problem at hand. In doing so, they reduce the feature space to a more manageable set of features, thereby lessening the complexity and computational burden of the problem. Also, from an economic standpoint, reducing the number of features for class discovery and prediction will contribute to lower costs in terms of validating the identified biomarkers using more comprehensive biological validation techniques (e.g., pyrosequencing, quantitative PCR). Taken together, these points render semisupervised methods promising new avenues for the analysis of high-dimensional genomic data.
References
1 Bair, E. and Tibshirani, R. (2004) Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol., 2 (4), E108.
2 Koestler, D.C., Marsit, C.J., Christensen, B.C., Karagas, M.R., Bueno, R., Sugarbaker, D.J., Kelsey, K.T., and Houseman, E.A. (2010) Semi-supervised recursively partitioned mixture models for identifying cancer subtypes. Bioinformatics, 26 (20), 2578–2585.
3 Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., and Church, G.M. (1999) Systematic determination of genetic network architecture. Nat. Genet., 22 (3), 281–285.
4 Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95 (25), 14863–14868.
5 Fraley, C. and Raftery, A.E. (2002) Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc., 97, 611–631.
6 Clifford, H., Wessely, F., Pendurthi, S., and Emes, R.D. (2011) Comparison of clustering methods for investigation of genome-wide methylation array data. Front. Genet., 2, 88.
7 Datta, S. and Datta, S. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics, 19 (4), 459–466.
8 Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2003) Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat. Sci., 18, 104–117.
9 Houseman, E.A., Christensen, B.C., Yeh, R.-F., Marsit, C.J., Karagas, M.R., Wrensch, M., Nelson, H.H., Wiemels, J., Zheng, S., Wiencke, J.K., and Kelsey, K.T. (2008) Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions. BMC Bioinf., 9, 365.
10 Christensen, B.C., Houseman, E.A., Godleski, J.J., Marsit, C.J., Longacker, J.L., Roelofs, C.R., Karagas, M.R., Wrensch, M.R., Yeh, R.-F., Nelson, H.H., Wiemels, J.L., Zheng, S., Wiencke, J.K., Bueno, R., Sugarbaker, D.J., and Kelsey, K.T. (2009) Epigenetic profiles distinguish pleural mesothelioma from normal pleura and predict lung asbestos burden and clinical outcome. Cancer Res., 69 (1), 227–234.
Part Three Network-Based Approaches
Statistical Diagnostics for Cancer: Analyzing High-Dimensional Data, First Edition. Edited by Frank Emmert-Streib and Matthias Dehmer. © 2013 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2013 by Wiley-VCH Verlag GmbH & Co. KGaA.
7 Colorectal Cancer and Its Molecular Subsystems: Construction, Interpretation, and Validation
Vishal N. Patel and Mark R. Chance
7.1 Brief Summary
In this chapter, we begin by introducing a specific type of cancer – colorectal cancer – and the bioinformatic tools applied in the analysis of colorectal tumor specimens. We then introduce an emerging approach for analyzing multidimensional data sets, and we refer to the resulting construct as a "molecular subsystem." We describe relevant issues related to the construction, interpretation, and validation of these subsystems. In particular, we provide a worked example at the end of the chapter to illustrate the statistical concerns in constructing a molecular subsystem from proteomic data.
7.2 Colon Cancer: Etiology
Colorectal cancer – a cancer of the large bowel – is the third leading cause of cancer death among adult Americans, and it is estimated that 1 in 20 individuals will develop colorectal cancer in their lifetime [2]. As the intestinal epithelium serves as a crucial medium for interaction with our environment (e.g., absorption of nutrients from food), it is not surprising that a variety of environmental triggers have been associated with an increased risk for developing colorectal cancer, such as smoking [3] and saturated fat intake [4], among others. However, environmental insults can explain only a fraction of cancer risk, and much of the remaining story arises from an individual's genetic background (see Figure 7.1). The most striking example of the genetic contribution to colorectal cancer can be found in the hereditary syndrome known as familial adenomatous polyposis (FAP), in which numerous polyps grow throughout the large bowel at an early age, some of which invariably progress into adenocarcinomas [5]. FAP is known to run in families in an autosomal dominant fashion, and its genetic basis was uncovered in the late 1980s as mapping uniquely to the gene Apc [6–8]. In demonstrating the genetic basis for colorectal cancer, studies of FAP patients paved the way for recent genome-wide experiments searching for genetic risk factors [9].

Figure 7.1 Diagram of colorectal cancer development. An individual's genetic background and his/her environment contribute to the initiation of colorectal cancer by inducing somatic mutations in the epithelium and/or a chronic inflammatory response in the supportive tissue (called the lamina propria). The accumulation of genetic aberrations allows for clonal expansion of an aberrant precursor cell population, resulting in tumor formation.
7.3 Colon Cancer: Development
A lifetime of environmental insults, coupled with the natural aging process, results in the accumulation of damage to the DNA of colonic epithelial cells. The genetic character of nonhereditary colorectal cancer first became apparent in the early 1990s, when patterns of mutations were observed to be correlated with disease severity: it was observed that those aberrant crypt foci (ACFs) – the precursor epithelial lesion [10] – that progress to neoplasia are often characterized by mutations in Apc, and the growing tumor "builds" upon this first hit with mutations in additional genes, such as Kras and/or Tp53 [11]. By perturbing normal gene function, mutations can drastically alter the state of a cell, establishing new – and potentially uncontrollable – equilibria for cellular survival and proliferation. Cancer genes can be categorized as tumor suppressors if they suppress tumor growth or as oncogenes if they promote growth; consequently, mutated tumor suppressors (e.g., Tp53) often have reduced or abrogated functions, while mutated oncogenes (e.g., Kras) have enhanced or new functions in the tumor. While mutations in such genes that actively sustain tumor development – referred to as "drivers" – are necessary for tumor progression, an individual tumor may be littered with mutations in over 80 different genes [12], and these additional "passenger" mutations may have arisen simply due to the hyperproliferative and/or promutagenic condition of the tumor.
While genetic aberrations are necessary for tumor initiation, promotion of tumor growth (see Figure 7.1) depends upon sustained injurious signals from the environment, which, in turn, induce further genetic damage. Though environmental insults can assault the epithelial cells directly, the supporting cells around the epithelium also mount a response to the insult by way of inflammation – a process observed in colorectal tumor specimens over a century ago [13]. The chronic inflammatory response is characterized by an invasion of lymphocytes and other immune cells into the intestinal epithelium. As the inflammatory process is designed to repair tissue damage, the immune cells secrete a variety of communicatory molecules, called cytokines, which promote proliferation of the damaged epithelium, effectively allowing the cells to form a "wound" at the site of injury. By promoting cellular proliferation, inflammation creates an environment favorable to the accumulation of genetic insults. The inflammatory response can also damage DNA directly via the release of reactive oxygen and nitrogen species, free radical-containing molecules designed to damage invading bacteria and viruses [14].
7.4 The Pathway Paradigm
While the accumulation of somatic mutations is the hallmark of cancer, it has become increasingly clear that the mutant genes alone do not define the trajectory of the disease. Indeed, patients whose tumors' mutational profiles are >85% distinct can still show similar histopathologic grading and survival (personal analysis of publicly available glioblastoma data [15]). Given the apparent redundancy of gene function, the field of cancer biology relies upon a modular framework for molecular analysis, organizing genes according to signaling pathways: series of biochemical processes whereby a (molecular) stimulus is transduced into a functional output (see Figure 7.2 for an example). Though individual genes often appear to have their own identity (possessing unique functions and interaction partners), their redundancy becomes apparent when considered in the context of a signaling pathway. Mutated driver genes were recently found to co-exist within signaling pathways [16], supporting the hypothesis that multiple hits to a pathway or module are necessary to produce a functional effect. A more specific example involves the tumor suppressor Tp53, a gene frequently mutated in late-stage colorectal cancer [17]. As Tp53 controls the transcription of a cell cycle inhibitor, Cdkn1a (also known as p21), homozygous deletion of Cdkn1a in mouse models also results in a tumorigenic predisposition [18]; amplification of Mdm2 can also disrupt the Tp53 pathway (as in Figure 7.2).
Figure 7.2 Pathways of Tp53 (shown above as p53). Tp53 controls the transcription of Cdkn1a (shown above as p21), which, when mutated, also results in a predisposition to cancer. Tumorigenesis can also be triggered by mutations in Mdm2. Protein interactions are indicated by blue diamonds; red arrows indicate transcriptional induction. Genes with red boxes have been found to be mutated in the germline; those in green boxes have been found to be mutated only somatically. Reprinted by permission from Macmillan Publishers Ltd: Nature Medicine © [1].
In addition to functional redundancy, pathways also contain information on interactions. As shown in Figure 7.2, genes can affect each other through direct, physical interactions between their protein products, as well as via indirect regulation of gene transcription. Often, mutant genes in separate pathways (which, therefore, do not interact with each other via the aforementioned mechanisms) synergize with each other, and these long-range genetic interactions are central to cancer biology, where dozens of mutations in myriad pathways are operating simultaneously. In fact, a mutant gene's functional role is strongly affected by the presence of these synergistic, accompanying mutations. A lone mutation in Kras, for instance, leads only to self-limiting lesions and/or ACFs, while a Kras mutation secondary to a hit in Apc correlates with polyp formation and cancer [11].
7.5 Cancer Subtypes and Therapies
In spite of the mutational heterogeneity found among colorectal cancer patients, it is clear that similarities between patients exist at more general phenotypic levels. Recently, various cancers have been clustered by their mRNA expression profiles [19–21] and proteomic signatures [22, 23], pointing to the existence of molecular subtypes in cancer. Importantly, certain molecular subtypes show differences in outcome and response to therapy. For instance, it was recently found that a molecular therapy, cetuximab, targeting the epidermal growth factor receptor (Egfr) improved overall colorectal cancer survival, though resistance to therapy was widespread [24]. Interestingly, Egfr’s mutational status alone is less helpful in
understanding cetuximab efficacy than other molecular and genetic markers [25]. In particular, oncogenic transformation of Kras correlates with cetuximab resistance in patients, as these mutations activate Egfr-related pathways independent of Egfr’s mutational status [26]. By understanding the genetic interactions and pathways linking Kras to Egfr, prescribing practices for cetuximab therapy have been drastically altered, forcing physicians to consider a patient’s tumor genotype prior to administering treatment.
7.6 Molecular Subsystems: Introduction
The use of signaling pathways and other maps (e.g., the genome, interaction networks) to organize data has been a characteristic feature in the era of high-throughput data analysis of cancer. This approach has heralded a new subfield that uses theoretical frameworks to couple biological measurements together. In these recent approaches, the molecules of interest – for example, driver genes – are studied in the context of an organizing manifold, for example, a signaling network, and we refer to this joint structure between molecules and manifold as a molecular subsystem. The manifold imposes a structure onto the system components, and, thus, a particular manifold represents one hypothesis for the architecture of driver gene cooperativity. As a subsystem is defined, in part, by its constituent molecules, subsystems can be constructed at individual levels of the molecular hierarchy to create networks of interacting genes, RNAs, or proteins. Alternatively, subsystems traversing multiple hierarchical levels exist, such as a regulatory network linking a transcription factor to the gene products whose expression it controls. In the following, we first discuss approaches for measuring and mapping the components of molecular subsystems (“Construction”), and then discuss the biological and statistical meaning of the resultant subsystem (“Interpretation”). Since a molecular subsystem has two facets – the molecules and the manifold – that require experimental exploration, we conclude with a section on this important, but often overlooked, point (“Validation”).
7.7 Molecular Subsystems: Construction
7.7.1 Measurements
As molecular subsystems arise from the coordinated actions of many individual parts – be they nucleotides, amino acids, peptides, or proteins – we must be able to measure these numerous parts efficiently and en masse if we wish to study subsystems. However, our limited exploration of biological space prevents us from fully knowing all of a subsystem’s components, and, consequently, our measurement tools must also be capable of revealing pieces unbeknownst to the observer. Such tools have emerged in the past decade due, in no small part, to the sequencing of the genome. With the genome sequence, it became possible to create thorough compendia of genetic elements, and technologies soon followed for measuring these elements and their products. While some of these tools measure only prespecified targets (e.g., microarrays), other tools – particularly in proteomics – take measurements stochastically from an experimental sample. Both types of technologies lend themselves to the study of molecular subsystems and have helped to usher in an era of “discovery science,” wherein large omic datasets are explored to arrive at biological hypotheses ex post facto.
Array-based technologies for measuring mRNA transcripts were among the first to emerge in the mid-1990s; for a review of the technology see Allison et al. [27] and Wheelan et al. [28]. The use of microarrays for measuring mRNA transcripts has become increasingly commonplace as their costs have declined, and large repositories for expression profiling experiments now exist (e.g., the Gene Expression Omnibus [29]). Variants of expression profiling have been introduced for a variety of applications: arrays with probes spanning splice junctions for detecting alternative splicing [30]; probes targeting transposable elements [31]; and probes targeting microRNA (miRNA) species [32].
For proteins, antibody arrays are the intuitive analog to microarrays. However, antibodies are notorious for their inconsistency and lack of specificity [33]. Moreover, with current estimates of the occurrence of alternative splicing, the number of potential proteins is estimated to be well over 300,000, and, thus, specific antibody production becomes very costly.
The alternative modality that has dominated the field is based on mass spectrometry (MS), wherein a sample is separated by liquid chromatography (LC) prior to ionization and sequencing by tandem rounds of MS. While DNA- or RNA-based experiments can tile their probes on an array, liquid chromatography coupled mass spectrometry (LC-MS)-based approaches sample semistochastically from a complex mixture of peptides; this approach is discussed in detail in the worked example at the end of the chapter. Though ripe for discovery, this critical step introduces a degree of uncertainty into the analysis that is still being addressed [34]. Nonetheless, steady improvements in chromatographic reproducibility and spectral sensitivity have allowed for the development of quantitative LC-MS-based methods where up to a few thousand proteins can be quantified in relative terms.
7.7.2 Manifolds
While the aforementioned technologies provide point measurements of subsystem components, an organizing principle, which we refer to as a manifold, is also necessary to describe the system. The genome has proven to be a surprisingly complex manifold due to the wealth of information hidden within it. The most obvious structure in this manifold is the gene itself. However, as our awareness of
regulatory mechanisms grows – promoters, enhancers, miRNA, and so on – we push the boundaries of where a gene starts and where a gene stops [35]. The hidden regulation of the genome has consistently created problems in the field of microarray analysis, where, for example, probes for a single gene often show inconsistent differential expression across the length of the gene, strongly suggestive of alternative splicing. Though this form of regulation was largely ignored early on, new array technologies with increased probe density across a gene are allowing for the improved detection of splice variants [30]. Top-down, proteomic approaches also show promise for unbiased detection of splice isoforms by making use of the genome, as illustrated in Figure 7.1a. Early subsystem analysis of microarray data focused on the use of ontologies to organize the measurements. Initial studies [36] used the gene ontology (GO), the most widely used and well-curated ontology, to annotate individual molecular species, and the saturation of GO terms within a list of differentially expressed genes (or vice versa) remains a commonly used measure. While ontologies are used for finding a common annotation among a group of genes, their biological relevance is called into question as they are (1) biased toward well-studied genes and processes and (2) difficult to organize and interpret (e.g., higher terms in the GO hierarchy are less biologically informative) [37]. An alternative manifold to represent more complex relationships is one based on networks, wherein a molecular unit may interact with any number of partners. A network is constructed from a set of one-dimensional elements, since only one dimension – the edge weight – is required to describe the relationship between any two points or nodes. The genome itself can be represented as a network in which the edges are measured by the inverse pairwise genomic distance between loci. 
Ontologies can also be represented as networks, as was done for the set of gene–disease relationships found in OMIM (Online Mendelian Inheritance in Man) to create a disease network [38].
Transcriptional, or coexpression, networks are commonly used as they can be constructed from a (sufficiently large) set of microarray data, calculating pairwise correlations between all genes (see Figure 7.1c). Coexpression manifolds have proven useful because coexpressed genes (1) may have a common regulator or (2) may be regulating each other. With the genomic coverage of current microarrays, a coexpression network can provide a global picture of interactions among genes, as opposed to the more limited view provided by protein–protein interaction (PPI) networks (Figure 7.3). However, coexpression networks have little meaning in the context of traditional pathways or interactions – two coexpressed genes may not physically interact or they may be in different cellular compartments – and, as such, they have been most useful in uncovering targets with strong transcriptional roles [39].
Figure 7.3 Three types of commonly used manifolds: (a) the genome, shown with peptide measurements made via LC-MS/MS from wild-type (WT) and knock-out (KO) mice; (b) a protein–protein interaction network, shown with mRNA measurements from wild-type and Apc mutant mice (data from Patel et al.); and (c) a coexpression network between proteomic targets and transcription factors, displayed as a heatmap.
Protein–protein interaction networks are derived from databases compiled from specific experimental probes of physical interactions among proteins. These databases are largely composed of interactions measured via yeast two-hybrid (Y2H) or affinity purification-mass spectrometry (AP/MS). Some of the most widely used interaction databases are the Human Protein Reference Database (HPRD) [40], biological General Repository for Interaction Datasets (BioGRID) [41], Biomolecular Interaction Network Database (BIND) [42], Database of Interacting Proteins (DIP) [43], Online Predicted Human Interaction Database (OPHID) [44], Molecular Interaction database (MINT) [45], and IntAct [46]. AP/MS “pulls out” a complex of proteins interacting with a bait, and the set of interactions is modeled either as complete (with every protein interacting with every other protein) or as a “wheel-and-spoke” (with the bait alone interacting with every protein in the complex), though it is recognized that neither model is representative of most cases [47]. The Y2H method focuses on pairwise interactions generated through genomic cloning of specific domains, and can be applied on a genome scale. These high-throughput screens for interactions have well-known limitations that produce false positives (e.g., spurious interactions, or two proteins interacting in a Y2H may not interact in vivo) and false negatives (weak interactions may be missed), and several groups – both academic (e.g., HPRD) and commercial (e.g., MetaCore, Ingenuity) – have provided manual curation of the data.
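The two AP/MS modeling conventions mentioned above can be made concrete with a short sketch. The bait and prey names are hypothetical, and the helper functions are illustrative only; they are not taken from any of the databases cited in the text.

```python
from itertools import combinations

def spoke_edges(bait, preys):
    """Wheel-and-spoke model: the bait alone interacts with every prey."""
    return {frozenset((bait, prey)) for prey in preys}

def complete_edges(bait, preys):
    """Complete model: every protein in the purified complex interacts with every other."""
    members = [bait, *preys]
    return {frozenset(pair) for pair in combinations(members, 2)}

# Hypothetical pull-down: one bait and three co-purified proteins.
bait, preys = "BaitA", ["PreyB", "PreyC", "PreyD"]
spoke = spoke_edges(bait, preys)
complete = complete_edges(bait, preys)
print(len(spoke), len(complete))  # 3 spoke edges versus 6 complete edges
```

For a complex of n proteins the spoke model yields n - 1 edges and the complete model n(n - 1)/2, so the choice of model changes the apparent degree of every prey in the resulting network.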
While the genome is a “complete” manifold, PPI networks are certainly not, and our current knowledge of human PPIs is estimated to cover only 10% of the true number of interactions [48]. While coupling multiple independent PPI networks together can serve to filter out the noise, multiplexing PPIs introduces additional biases, such as promoting well-studied proteins – which tend to be disease associated – and artificially inflating their value as “hubs” [49]. However, as discussed previously, the construction of molecular subsystems hinges not on the contribution of any single molecule or interaction, but, rather, on the properties of the molecular ensemble. Consequently, subsystems based on PPIs can often tolerate a high degree of noise in the underlying manifold or in the measured molecules.
7.8 Molecular Subsystems: Interpretation
In designing and, more importantly, interpreting a molecular subsystem, we make use of two well-founded biological assumptions: hierarchy and heterarchy. Molecular heterarchies are systems in which there are no absolute governing molecules, but, rather, each molecule is attributed a similar position of importance. The principle of heterarchy is implicit in the use of many manifolds, for the manifolds discussed here have no innate pecking order. Yet, the biological tools at our disposal are not amenable to studying heterarchies, as mouse models, siRNA transfections, and so on are designed to target a single (or a few) genes at a time. To use these tools efficiently, we are inadvertently biased toward looking for hierarchy: we may cluster nodes by degree or unearth regulatory “hub” proteins to discover hidden hierarchy in networks [50, 51].
The global hierarchy of molecular biology stems from the introduction of the central dogma, which gave primacy to “lower” levels (DNA) as controlling or storing information found at “higher” levels (RNA, protein). With the unceasing discovery of new forms of regulation (e.g., RNA–protein interactions, microRNA), our current understanding of reality is far more complex than imagined by the central dogma. Nonetheless, the original principles are still useful when integrating high-throughput data sources: mutations in the cancer genome, for example, that are in cis with changes in mRNA expression may have a greater functional role than those not transcribed [56]. Along these lines, considering how the protein product may be affected by a mutated driver gene (e.g., via the introduction of a stop codon or splicing alterations) can provide evidence as to the functional role of the mutation [16, 57].
7.8.1 Examples
In Table 7.1, we have listed a few commonly studied molecular subsystems; some use two forms of high-throughput data to improve the power for biological inference. In the following section, we discuss the implications of using particular manifolds and provide examples from the literature.

Table 7.1 Potential molecular subsystems and general questions that can be asked of them.

| Molecular species measured | Manifold | Pairs well with | Can be used to study |
|---|---|---|---|
| mRNA | Genome | Protein–DNA binding a) | Transcriptionally active euchromatin [52] |
| mRNA | Genome | DNA dosage b) | Transcriptionally active copy number variation |
| Peptides c) | Genome | – | Alternative splice variants that are translated |
| mRNA | Known pathway(s) | – | Transcriptional activity of pathway(s) |
| SNPs | Coexpression network | – | Transcriptional modules regulated by a SNP |
| mRNA | Coexpression network | miRNA binding (predictions) | Modules of genes regulated by a common miRNA species |
| SNPs | PPI network | mRNA data | Clustering of transcriptionally active SNPs [53] |
| Proteins | PPI network | mRNA data | Dysregulated network neighborhoods [23] |
| PTMs | PPI network | – | Modularity of PTM patterns |
| Mutations | PPI network | – | Genetic interactions in a pathway context [54] |
| Mutations | Disease network | – | Phenotypically related genes [38]; new candidate disease genes [55] |

a) Via ChIP-seq. b) Via CGH or SNP array. c) Digested from proteins in LC-MS/MS workflows.

The genome is a useful manifold as it can serve as a defensible point-of-causation in certain instances. To interpret mRNA expression changes in obese human populations, for example, Emilsson et al. correlated these expression changes with genetic polymorphisms by way of SNP arrays [58]. This allowed expression changes with a genetic basis to be culled from those that arise by secondary or indirect mechanisms. Genomic location is also a valuable manifold in cancer studies, where genomic aberrations are frequent hallmarks of the disease [59]. Relative amounts of DNA can be measured using comparative genomic hybridization (CGH) arrays, as was done for the analysis of glioblastoma samples by Bredel et al., and, after mapping these measurements to the genome, were used to provide insight into correlated patterns of genomic gains and losses [60].
Leveraging the modular nature of coexpression manifolds, Akavia et al. found that genes with correlated patterns of mRNA expression can serve to identify causal mutations in melanoma [61]. Similarly, Horvath et al. traversed coexpression
modules in glioblastoma to uncover a gene, Aspm, with a previously unexamined regulatory role [39]. Compiling PPIs allows one to extend the linear signaling pathway paradigm to create networks of physical interactions within which traditional pathways are embedded. Cerami et al. capitalized on this idea to identify clusters of physically connected driver genes in glioblastoma, providing support for the hypothesis that driver genes are mutated in different pathways [62]. Nibbe et al. [63] found that groups of proteins proximal (as measured by protein–protein interactions) to multiple driver genes show higher levels of mRNA dysregulation than groups of proteins proximal to only a single driver gene, simultaneously illustrating that multiple levels of information in the molecular hierarchy can be integrated to identify hotspots in the signaling network [63].
7.9 Molecular Subsystems: Validation
As discussed, subsystems are composed of two features – the molecular units and the organizing manifold – and both components require experimental validation. One approach to validation has been to demonstrate that analysis of a particular manifold can triangulate functionally relevant molecules. Nibbe et al. showed that PPI networks are eminently useful in finding neighborhoods of dysregulated proteins in colorectal cancer [23]. Lim et al. used a PPI subnetwork for inherited ataxias to identify Puratrophin-1, previously unknown to associate with these diseases. In addition, they validated their PPI manifold directly by testing randomly sampled interactions (which were compiled via Y2H) by coimmunoprecipitation [54]. The biological relevance of manifolds can also be supported by demonstrating their evolutionary conservation, as was done for PPI networks [64], serving as a global validation of the manifold.
An alternative approach to validation is motivated statistically, using aggregate measures of the subsystem to assess its importance. The assumption in these approaches is that a manifold–molecule coupling should exhibit stronger coordinated changes (i.e., differential expression) than expected. Thus, when Nibbe et al. [23] used exhaustive searches to identify transcriptional “hotspots” in PPI subnetworks, they tested the validity of these resultant hotspots statistically, permuting either the phenotype labels or the gene labels. These empirical null distributions represent two different null hypotheses: phenotype label permutation models a null distribution in which the measurements have no association with the known phenotype categories; gene name permutation creates a null distribution in which the interaction pattern among genes/proteins is abrogated.
In recently published work, we illustrated how proteomic data can be leveraged to statistically test the relevance of a candidate subnetwork [65]. As proteomic data are sparse and measure targets downstream of oncogenes and tumor suppressors, a second map is required to associate proteomic targets with a hypothesized subsystem. A coexpression network can be particularly useful in this situation, as strong coexpression of subsystem genes with proteomic targets provides evidence for a regulatory role of the upstream, subsystem genes. This relationship can be directly tested by calculating the strength of coexpression between the subsystem and the proteomic targets, and then calculating significance against the background level of coexpression between the subsystem and all measured molecules [66].
Drawing from engineering systems theory, system structures can be tested by examining system properties upon failure of individual components, allowing us to gauge the cascading effect of perturbations within the network architecture. This is analogous to the underpinnings of cancer biology, wherein we search for genes whose “failure” (i.e., tumor suppressors) critically impacts the functioning of the system. Using these principles, we demonstrated that the classic and most widely used model of gene failure – the knock-out mouse – can be re-envisioned in a subsystem context: the mutated gene is one component of a hypothesized molecular subsystem, and the knock-out mouse is a model perturbation of this subsystem. In our study, we tested a PPI network manifold by using mice mutated at two different subsystem genes, Apc and Cdkn1a [67]. Coupled with mRNA and proteomic measurements, we were able to test for (1) the differential expression of subsystem genes at the mRNA level and (2) the association – via PPIs or coexpression – of the subsystem with proteomic targets measured upon each of the two perturbations.
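The two permutation nulls described earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited studies: the score (mean absolute difference in group means), the toy data, and all function names are hypothetical. Gene-label permutation is simplified here to drawing a random gene set of the same size, which is equivalent only because the toy score ignores network topology.

```python
import random

def subnetwork_score(expr, labels, genes):
    """Mean absolute difference in group means across the subnetwork genes."""
    total = 0.0
    for g in genes:
        a = [x for x, lab in zip(expr[g], labels) if lab == 0]
        b = [x for x, lab in zip(expr[g], labels) if lab == 1]
        total += abs(sum(a) / len(a) - sum(b) / len(b))
    return total / len(genes)

def phenotype_permutation_p(expr, labels, genes, n_perm, rng):
    """Null 1: measurements have no association with the phenotype categories."""
    observed = subnetwork_score(expr, labels, genes)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        hits += subnetwork_score(expr, shuffled, genes) >= observed
    return (hits + 1) / (n_perm + 1)

def gene_permutation_p(expr, labels, genes, n_perm, rng):
    """Null 2: gene names are scrambled, abrogating the interaction pattern."""
    observed = subnetwork_score(expr, labels, genes)
    all_genes = sorted(expr)
    hits = 0
    for _ in range(n_perm):
        random_genes = rng.sample(all_genes, len(genes))
        hits += subnetwork_score(expr, labels, random_genes) >= observed
    return (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): 30 genes, 4 samples per group,
# with three "hotspot" genes shifted upward in group 1.
rng = random.Random(0)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
hot = ["g0", "g1", "g2"]
expr = {}
for i in range(30):
    shift = 5.0 if f"g{i}" in hot else 0.0
    expr[f"g{i}"] = [rng.gauss(0, 0.1) + shift * lab for lab in labels]

p_pheno = phenotype_permutation_p(expr, labels, hot, 500, rng)
p_gene = gene_permutation_p(expr, labels, hot, 500, rng)
print(f"phenotype-label p = {p_pheno:.3f}, gene-label p = {p_gene:.3f}")
```

With only four samples per group there are just 70 distinct labelings, so the phenotype-label p-value cannot fall much below 2/70 however strong the signal; the gene-label null is not limited in this way, which is one practical reason to report both.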
7.10 Worked Example: Label-Free Proteomics
In this section, we discuss a statistical framework that can be used in the interpretation of proteomic data collected by means of liquid chromatography coupled mass spectrometry (LC-MS). LC-MS has been increasingly used for proteomic profiling as it allows for a “bottom up” omics approach, wherein a complex protein sample is enzymatically digested and the resulting mixture of peptides is analyzed by LC-MS/MS (MS/MS indicating tandem mass spectrometry). The component peptides are finally identified by comparing the observed spectra to a database of theoretical spectra. While peptide-level intensity information – often in the form of spectral counts, XIC, or p-values – is technically informative and suggestive of protein level changes, it may be only minimally biologically informative, as numerous possibilities exist as to why a particular peptide exhibits differences in abundance between two biological conditions. Thus, the question remains – and is not by any means solved in the field – as to how one should interpret a change in abundance at the peptide level. A particular peptide may be found to differ between two biological conditions for a number of technical reasons: (1) ion suppression in the mass spectrometer in one condition alone, (2) misalignment of the chromatograms, and (3) technical variation. With sufficient biological or technical replicates, and the increasing accuracy of both mass spectrometers and chromatographic separation systems, many of these issues can be overcome, leaving room for the more compelling biological explanations for differences in the abundance of a peptide: (1) differences in abundance of the parent protein between the two biological conditions (the result typically of interest), (2) the gain/loss of that peptide’s exon (alternative splicing) in one sample, or (3) posttranslational modifications (PTMs) on the peptide that were not searched for explicitly in the spectrum-matching step (Figure 7.4). Thus, we are interested in both the global expression change, for example, the overall change in protein expression, as well as the local changes that reflect the peptide-by-peptide behavior, as both may be biologically informative. To incorporate these issues into the label-free quantification process, we present a statistical framework to quantify the dependence of a single peptide on the behavior of its sibling peptides, allowing one to prioritize “deviant” peptides (i.e., candidates for PTMs) or deviant exons (i.e., candidates for alternative splicing).
Figure 7.4 The nested hierarchy of cooperativity as we assemble molecular subsystems from peptides to proteins to networks.
7.10.1 Whole Protein-Level Significance
Our experiment begins with LC-MS/MS data collected on two experimental groups, and our goal is to identify entities that are differentially expressed between the two groups. To infer changes in protein abundance, we first assume that such regulation would result in unidirectional changes shared by the majority of peptides measured from a protein. To assess this, peptide-level t-statistics are first summarized at the exon level:
$$t_e = \frac{1}{k}\sum_{i=1}^{k} t_i$$
where the t-statistic assigned to an exon, $t_e$, is defined as the average of the $k$ member peptides’ t-statistics. This prevents a single exon or domain from overshadowing undersampled regions of the protein. Then, the gene-level statistic, $t_g$, is defined as the sum of the $n_p$ exon-level t-statistics; $n_p$ should only include those exons that had daughter peptides measured in the LC-MS/MS experiment.
$$t_g = \sum_{e=1}^{n_p} t_e$$
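The exon-level and gene-level summaries can be sketched in a few lines. The sketch below also forms a resampling null for $t_g$ by drawing whole samples with replacement without regard to group; that is one assumed reading of a bootstrap that ignores phenotype membership, not a confirmed description of the authors' pipeline. The peptide names, intensities, exon assignments, and the choice of an unequal-variance t-statistic are all hypothetical.

```python
import math
import random
from statistics import fmean, variance

def welch_t(a, b):
    """Unequal-variance two-sample t-statistic (0.0 if both groups are constant)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (fmean(a) - fmean(b)) / se if se > 0 else 0.0

def gene_statistic(peptides, exon_of, cols_a, cols_b):
    """t_g: sum over measured exons of t_e, the mean t-statistic of an exon's peptides."""
    by_exon = {}
    for pep, values in peptides.items():
        t = welch_t([values[j] for j in cols_a], [values[j] for j in cols_b])
        by_exon.setdefault(exon_of[pep], []).append(t)
    return sum(fmean(ts) for ts in by_exon.values())

def bootstrap_p(peptides, exon_of, n_a, n_b, n_boot=10_000, seed=0):
    """Two-sided p-value P(|t| >= |t_g|) under a null that ignores group membership."""
    rng = random.Random(seed)
    n = n_a + n_b
    observed = gene_statistic(peptides, exon_of, range(n_a), range(n_a, n))
    hits = 0
    for _ in range(n_boot):
        # Resample whole samples (columns) so that the dependence
        # between sibling peptides is carried into the null.
        draw = [rng.randrange(n) for _ in range(n)]
        null = gene_statistic(peptides, exon_of, draw[:n_a], draw[n_a:])
        hits += abs(null) >= abs(observed)
    return observed, (hits + 1) / (n_boot + 1)

# Hypothetical log-intensities: columns 0-3 are group A, columns 4-7 are group B.
peptides = {
    "pep1": [10.1, 9.8, 10.3, 10.0, 12.2, 11.9, 12.4, 12.0],
    "pep2": [8.0, 8.3, 7.9, 8.1, 10.1, 9.8, 10.2, 10.0],
    "pep3": [5.2, 5.0, 4.9, 5.1, 7.0, 7.3, 6.9, 7.1],
}
exon_of = {"pep1": "exon1", "pep2": "exon1", "pep3": "exon2"}

t_g, p = bootstrap_p(peptides, exon_of, n_a=4, n_b=4, n_boot=2000)
print(f"t_g = {t_g:.2f}, p = {p:.4f}")
```

Averaging within an exon before summing across exons means that `exon1`, with two peptides, contributes no more to $t_g$ than `exon2`, with one.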
We can calculate the two-sided p-value of a gene being differentially expressed as $P(|t| \ge |t_g|)$, approximating the null distribution with the bootstrap density estimate (10,000 bootstrap samples).
7.10.2 Peptide-Level Significance
While current LC-MS/MS pipelines output peptide-level intensity, this information must be evaluated in a whole protein context to be biologically informative. At the individual peptide level, we are interested in cases when the behavior of a single peptide does not conform to the behavior of the rest of the protein, as such situations indicate local (as opposed to global) regulation of the protein, which may occur in the form of PTMs. Using t-statistics as our summary measure of choice, we must calculate the conditional probability of observing a peptide $i$ given the behavior of the protein’s remaining peptides:
$$P(t_i \mid t_{-i}, \Sigma_i) = \frac{P(t_i, t_{-i} \mid \Sigma_i)}{P(t_{-i} \mid \Sigma_i)}$$
where $t_i$ is the t-statistic associated with peptide $i$, and $t_{-i}$ is the statistic associated with the rest of the protein, defined as
$$t_{-i} = \frac{1}{k-1}\sum_{j \ne i} t_j$$
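As a small worked illustration of the sibling average just defined, the following sketch computes it for every peptide of one protein. The t-statistics are made up for the example, and the function name is hypothetical.

```python
def sibling_means(t_stats):
    """For each peptide i, return the mean t-statistic of the other k - 1 peptides."""
    k = len(t_stats)
    if k < 2:
        raise ValueError("at least two peptides are needed to define siblings")
    total = sum(t_stats)
    return [(total - t_i) / (k - 1) for t_i in t_stats]

# Hypothetical t-statistics for four peptides of one protein; the third disagrees.
t_stats = [2.1, 1.8, -3.0, 2.4]
for t_i, t_rest in zip(t_stats, sibling_means(t_stats)):
    print(f"t_i = {t_i:+.1f}   sibling mean = {t_rest:+.2f}")
```

The third peptide pairs a strongly negative $t_i$ with a clearly positive sibling mean, which is the kind of disagreement the conditional probability is meant to flag.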
$t_{-i}$ represents the average t-statistic of the sibling peptides (i.e., from the same parent protein) for peptide $i$. The covariance between peptides, $\Sigma_i$, must be included since the observed abundance of a particular peptide depends not only on the abundance of its sibling peptides, but also on the level of interpeptide dependence. Thus, we are interested in calculating the probability of observing a certain behavior for peptide $i$, given some observed behavior for the remainder of the protein. As seen above, this requires knowing (1) the joint distribution of $t_i$ and $t_{-i}$, which is $P(t_i, t_{-i} \mid \Sigma_i)$, (2) the marginal distribution of $t_{-i}$, $P(t_{-i} \mid \Sigma_i)$, and (3) the covariance, $\Sigma_i$, between all of the peptides.
The average (or sum) of t-statistics tends toward a normal distribution if the component distributions’ degrees of freedom are sufficiently large. Given the small sample size in most proteomic datasets, this tendency is not always observed. Furthermore, the patent lack of independence among peptides results in a sampling distribution that cannot be calculated analytically (i.e., from the convolution of the individual t-distributions). Thus, we can approximate the distribution for $t_{-i}$ by generating 10,000 bootstrap samples and using maximum likelihood estimation to fit a t-location-scale distribution to the data (fixing the mean at zero); we call the resulting degrees of freedom $\nu_{t_{-i}}$ and the scale parameter $\sigma_{t_{-i}}$. While the scale parameter can be estimated robustly, the degrees of freedom is usually overfit to the particular instance of the bootstrap density, resulting in unstable estimates for $\nu_{t_{-i}}$. As bootstrap estimation preserves the correlation structure between peptides, this procedure has the added benefit of simplifying the probability as follows:
$$P(t_{-i} \mid \Sigma_i) = \hat{P}(t_{-i})$$
where $\hat{P}(\cdot)$ refers to the bootstrap density estimate, and, thus, $\Sigma_i$ does not need to be explicitly calculated as it is inherent in the estimated density. Similarly, we can use bootstrap samples to parametrize the distribution of $\hat{P}(t_i, t_{-i})$. We suggest the bivariate t-distribution as developed by Shaw and Lee as (1) it can be decomposed into the product of two t-distributions as the correlation between the marginals approaches zero (i.e., the property of statistical independence) and (2) it allows us to specify different degrees of freedom on the marginals [68]. The latter property is particularly important in many real-world applications, since assuming equidensity of the tails is a strong and often unrealistic assumption. The distribution is specified as follows:
$$P(t_1, t_2) = C\,\alpha_1^{-\frac{\nu_1}{2}-1}\,\alpha_2^{-\frac{\nu_2}{2}-1}\left[\sqrt{\alpha_1\alpha_2}\;\Gamma\!\left(\frac{\nu_1+1}{2}\right)\Gamma\!\left(\frac{\nu_2+1}{2}\right)\,{}_2F_1\!\left(\frac{\nu_1+1}{2},\frac{\nu_2+1}{2};\frac{1}{2};\frac{\gamma^2}{4\alpha_1\alpha_2}\right) + \gamma\,\Gamma\!\left(\frac{\nu_1}{2}+1\right)\Gamma\!\left(\frac{\nu_2}{2}+1\right)\,{}_2F_1\!\left(\frac{\nu_1}{2}+1,\frac{\nu_2}{2}+1;\frac{3}{2};\frac{\gamma^2}{4\alpha_1\alpha_2}\right)\right]$$
$$\alpha_1 = 1 + \frac{t_1^2}{\nu_1\cos^2\theta}, \qquad \alpha_2 = 1 + \frac{t_2^2}{\nu_2\cos^2\theta}, \qquad \gamma = \frac{2\,t_1 t_2 \sin\theta}{\sqrt{\nu_1\nu_2}\,\cos^2\theta}, \qquad C = \frac{1}{\pi\cos\theta\,\sqrt{\nu_1\nu_2}\;\Gamma\!\left(\frac{\nu_1}{2}\right)\Gamma\!\left(\frac{\nu_2}{2}\right)}$$
where $\Gamma(\cdot)$ is the gamma function, and ${}_2F_1(\cdot)$ is the hypergeometric function. The “mixing angle,” $\theta$, of the two marginals is related to Pearson’s correlation coefficient as
$$\rho = \sin\theta\,\sqrt{\left(\frac{\nu_1}{2}-1\right)\left(\frac{\nu_2}{2}-1\right)}\;\frac{\Gamma\!\left(\frac{\nu_1-1}{2}\right)\Gamma\!\left(\frac{\nu_2-1}{2}\right)}{\Gamma\!\left(\frac{\nu_1}{2}\right)\Gamma\!\left(\frac{\nu_2}{2}\right)}$$
Figure 7.5 The bivariate t-distribution. The mixing angle, $\theta$, controls the amount of density along the diagonal. The degrees of freedom control the density of the tails (i.e., the “flaring” of the tails seen above).
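The density and the correlation relation above can be evaluated numerically with the standard library alone. The sketch below is a direct transcription of the two formulas; the function names are hypothetical, and the plain power series used for the hypergeometric function is an implementation choice that is adequate only because its last argument stays below $\sin^2\theta < 1$ here.

```python
import math

def hyp2f1(a, b, c, z, tol=1e-12, max_terms=100_000):
    """Gauss hypergeometric function by its power series (valid for |z| < 1)."""
    term, total = 1.0, 1.0
    for n in range(max_terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total

def bivariate_t_pdf(t1, t2, nu1, nu2, theta):
    """Bivariate t density with separate marginal degrees of freedom, |theta| < pi/2."""
    cos2 = math.cos(theta) ** 2
    a1 = 1.0 + t1 * t1 / (nu1 * cos2)
    a2 = 1.0 + t2 * t2 / (nu2 * cos2)
    g = 2.0 * t1 * t2 * math.sin(theta) / (math.sqrt(nu1 * nu2) * cos2)
    z = g * g / (4.0 * a1 * a2)
    const = 1.0 / (math.pi * math.cos(theta) * math.sqrt(nu1 * nu2)
                   * math.gamma(nu1 / 2) * math.gamma(nu2 / 2))
    first = (math.sqrt(a1 * a2)
             * math.gamma((nu1 + 1) / 2) * math.gamma((nu2 + 1) / 2)
             * hyp2f1((nu1 + 1) / 2, (nu2 + 1) / 2, 0.5, z))
    second = (g * math.gamma(nu1 / 2 + 1) * math.gamma(nu2 / 2 + 1)
              * hyp2f1(nu1 / 2 + 1, nu2 / 2 + 1, 1.5, z))
    return const * a1 ** (-nu1 / 2 - 1) * a2 ** (-nu2 / 2 - 1) * (first + second)

def rho_from_theta(theta, nu1, nu2):
    """Pearson correlation implied by the mixing angle (requires nu1, nu2 > 2)."""
    log_ratio = (math.lgamma((nu1 - 1) / 2) + math.lgamma((nu2 - 1) / 2)
                 - math.lgamma(nu1 / 2) - math.lgamma(nu2 / 2))
    return (math.sin(theta) * math.sqrt((nu1 / 2 - 1) * (nu2 / 2 - 1))
            * math.exp(log_ratio))

def student_t_pdf(t, nu):
    """Univariate Student t density, used only as a check at theta = 0."""
    return (math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
            * (1.0 + t * t / nu) ** (-(nu + 1) / 2))

# At theta = 0 the joint density factorizes into the two marginals.
joint = bivariate_t_pdf(0.7, -1.2, 5, 9, 0.0)
product = student_t_pdf(0.7, 5) * student_t_pdf(-1.2, 9)
print(joint, product)
```

Because the implied correlation is $\sin\theta$ multiplied by a factor that depends on the degrees of freedom, the same mixing angle corresponds to a weaker Pearson correlation when the marginals are heavy-tailed; `math.gamma` overflows for very large degrees of freedom, so a log-space variant would be needed in that regime.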
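$\nu_1$ and $\nu_2$ represent the degrees of freedom for the two statistics, $t_1$ and $t_2$, whose joint distribution we are modeling; the distribution is illustrated in Figure 7.5. For our data, these two statistics are $t_i$ and $t_{-i}/\sigma_{t_{-i}}$, where $t_{-i}$ must be rescaled by $\sigma_{t_{-i}}$. The degrees of freedom for $t_{-i}$ was estimated above as $\nu_{t_{-i}}$, and the degrees of freedom for $t_i$ can be calculated analytically by the Welch–Satterthwaite equation
$$\nu_i = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{1}{n_1-1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2-1}\left(\dfrac{s_2^2}{n_2}\right)^2}$$
where $s_1^2$, $s_2^2$ and $n_1$, $n_2$ are the sample variances and sample sizes of the two experimental groups.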
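The Welch–Satterthwaite degrees of freedom are straightforward to compute for a single peptide. In this sketch the two lists of intensities are hypothetical, and the function name is illustrative.

```python
from statistics import variance

def welch_satterthwaite_df(group1, group2):
    """Approximate degrees of freedom of the unequal-variance t-statistic."""
    n1, n2 = len(group1), len(group2)
    q1 = variance(group1) / n1  # s_1^2 / n_1
    q2 = variance(group2) / n2  # s_2^2 / n_2
    return (q1 + q2) ** 2 / (q1 ** 2 / (n1 - 1) + q2 ** 2 / (n2 - 1))

# Hypothetical intensities of one peptide in the two experimental groups.
wild_type = [1.0, 2.0, 3.0, 4.0]
knock_out = [10.0, 14.0, 9.0, 15.0]
nu_i = welch_satterthwaite_df(wild_type, knock_out)
print(f"nu_i = {nu_i:.2f}")
```

The result always lies between the smaller of $n_1 - 1$ and $n_2 - 1$ and the pooled value $n_1 + n_2 - 2$, reaching the latter only when the two groups have equal size and equal variance.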
To solve for the “mixing angle,” $\theta$, we calculate Pearson’s correlation, $\rho$, between $t_i$ and $t_{-i}$ for each bootstrap sample (10,000 pairs). However, the small sample size in most proteomic datasets results in inflated values of $\rho$ (and $\theta$) – a situation well described in the statistics literature surrounding covariance estimation [69]. To account for overestimation of the sample covariance, we must “regularize,” or stabilize, the estimate of $\theta$. Regularization of the covariance is particularly important when specifying null bivariate distributions (Student’s t or normally distributed), as these distributions concentrate along the diagonal as $\theta$ approaches $\pi/2$, resulting in an underestimation of the real error – and misleading p-values – in tightly correlated data. Current methods for covariance regularization introduce a degree of sparsity to the sample covariance matrix by shrinking the eigenvalues to a prespecified target – a process which can be distilled into a linear combination of two matrices:
$$U^{*} = \lambda T + (1 - \lambda)\,U$$
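The linear combination can be illustrated on a 2 x 2 covariance matrix. This is a sketch under stated assumptions: the diagonal target and the fixed weight of 0.25 are arbitrary choices made for the example, and it should not be read as the regularization actually applied in the worked example.

```python
def shrink_covariance(U, lam):
    """Return lam * T + (1 - lam) * U, with T = diag(U) as an assumed target."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    p = len(U)
    T = [[U[i][j] if i == j else 0.0 for j in range(p)] for i in range(p)]
    return [[lam * T[i][j] + (1.0 - lam) * U[i][j] for j in range(p)]
            for i in range(p)]

def correlation(S):
    """Correlation implied by a 2 x 2 covariance matrix."""
    return S[0][1] / (S[0][0] * S[1][1]) ** 0.5

# Hypothetical sample covariance with a strongly inflated off-diagonal entry.
U = [[1.0, 0.9], [0.9, 1.0]]
U_star = shrink_covariance(U, 0.25)
print(correlation(U), correlation(U_star))
```

With a diagonal target the variances are left unchanged and the correlation is simply multiplied by one minus the weight, which pulls the implied mixing angle away from $\pi/2$.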
Here, $U$ is the sample covariance matrix, $T$ is the target matrix, and $\lambda$ is the shrinkage parameter. While Schafer and Strimmer [70] outline several target matrices for different applications, the choice of a particular shrinkage estimator requires assumptions that are not always clear to the investigator. In addition, the estimation of $\lambda$ is based on calculating variances of the sample (co)variances [70], which is not straightforward in our case, as our estimates are based on bootstrap samples.
As the null densities $\hat{P}(t_i, t_{-i})$ and $\hat{P}(t_{-i})$ are estimated from bootstrap samples – a process that scrambles phenotype membership – the resultant probability of observing a particular peptide, $\hat{P}(t_i \mid t_{-i})$, is aptly suited to hypothesis testing, with the null hypothesis being “no association with phenotype.” We calculate the p-value for a peptide as $\hat{P}(t \ge t_i \mid t_{-i})$, that is, the probability of observing a t-statistic more extreme than $t_i$, given the observed value for $t_{-i}$.
7.10.3 Exon-Level Significance
The framework for evaluating the significance of a single peptide can be extended to study the behavior of exons, provided they have been proteomically probed. As before, the statistic used to represent an individual exon is the average of its $k$ constituent peptides:

$$t_e = \frac{1}{k}\sum_{i=1}^{k} t_i$$
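The averaging step can be sketched as follows. This is a minimal illustration under stated assumptions: the intensities are made up, Welch's t-statistic stands in for whichever per-peptide statistic is actually used, and phenotype labels are shuffled as a simple stand-in for the label-scrambling resampling described for the peptide-level null. All function names are hypothetical.

```python
import random
import statistics


def welch_t(group_a, group_b):
    """Welch's t-statistic comparing the means of two groups of intensities."""
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    return (mean_a - mean_b) / (var_a / len(group_a) + var_b / len(group_b)) ** 0.5


def exon_statistic(peptide_rows, labels):
    """t_e: the mean of the per-peptide t-statistics for one exon."""
    t_values = []
    for row in peptide_rows:
        cases = [x for x, lab in zip(row, labels) if lab == 1]
        controls = [x for x, lab in zip(row, labels) if lab == 0]
        t_values.append(welch_t(cases, controls))
    return statistics.fmean(t_values)


def null_exon_statistics(peptide_rows, labels, n_resamples=200, seed=0):
    """Recompute t_e after scrambling phenotype labels n_resamples times."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_resamples):
        rng.shuffle(shuffled)
        null.append(exon_statistic(peptide_rows, shuffled))
    return null


# Hypothetical log-intensities: two peptides of one exon, six samples each.
PEPTIDE_ROWS = [
    [10.1, 10.4, 10.2, 11.3, 11.6, 11.2],
    [8.0, 8.3, 7.9, 9.1, 9.4, 9.0],
]
LABELS = [0, 0, 0, 1, 1, 1]  # 0 = control, 1 = case

observed = exon_statistic(PEPTIDE_ROWS, LABELS)
null = null_exon_statistics(PEPTIDE_ROWS, LABELS)
print(observed, len(null))
```

The list returned by `null_exon_statistics` plays the role of the resampled estimates of $t_e$; only 200 replicates are drawn here to keep the example quick.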
The dependence between peptides prevents the distribution of $t_e$ from being calculated analytically, so we determine its associated scale, $\sigma_{t_e}$, by fitting 10,000 bootstrap estimates to a t-location-scale distribution via maximum likelihood. We are interested in the probability

$$P(t_e \mid \bar{t}_e, S) = \hat{P}(t_e \mid \bar{t}_e)$$

where $\bar{t}_e$ is defined as

$$\bar{t}_e = \frac{1}{n_p - 1}\sum_{f \neq e}^{n_p} t_f$$

$\bar{t}_e$ is the average of the $t_f$ for the $n_p - 1$ exons that have daughter peptides measured in the experiment. The specification of the distribution for $\hat{P}(t_e \mid \bar{t}_e)$ follows as above for the peptide-level analysis.

7.10.4 Summarizing the Results
As proteomic datasets often sample 1000–2000 proteins simultaneously, the p-values we calculate can be immensely useful in prioritizing candidates for further study. Proteins with significant changes across all peptides are likely to exhibit changes in overall protein abundance; proteins with one to two peptides significantly changing between groups may have PTMs on these peptides; and proteins with evidence for a differentially expressed exon may have undergone alternative splicing. These targets can then be validated with appropriate wet-lab strategies, for example, Western blots for proteins changing in abundance, or PCR followed by gel electrophoresis for alternatively spliced candidates.

Further insight can be gained by organizing the three events – peptides, exons, and proteins – by their “genomic width,” that is, the nucleotide length. The frequency of each event must be normalized by the total number of events of that type observed. When plotted against the logarithm of the number of nucleotides, the three events form three roughly normal distributions. In Figure 7.6, we have illustrated this process with proteomic data collected from the intestinal epithelium of a particular mouse model of colon cancer, the hydroxyprostaglandin dehydrogenase, or Hpgd, knock-out mouse [71]. For illustration, we have also plotted a theoretical noise level (not based on real data), which is important in assessing the reproducibility of the three events. We can see that the distribution of peptides, many of which have very short genomic widths, is strongly convolved with both biologic and technical (i.e., the LC-MS/MS pipeline) noise. We also see that, as measurements are integrated across larger and larger spans of the genome, the noise level should decrease and the corresponding signal increase, allowing us to detect events at larger genomic widths.
Thus, we reiterate that such an analysis would not have been possible if the peptide data had been considered independently of the genome manifold.

Figure 7.6 Genomic width of proteomic data. Peptides measured in a shotgun mouse proteomics experiment were binned by their genomic width (number of base pairs); the length was also calculated for exons covered by two or more peptides and genes covered by three or more peptides. Each group was normalized by the total number of events observed in that category. Approximating normal distributions are shown as shadows. A curve of theoretical exponentially decaying noise is also plotted.

To evaluate our p-values, we must calculate rates of false discoveries or errors, that is, situations where the null hypothesis is incorrectly rejected. A false discovery in an LC-MS/MS pipeline may arise due to misassignment of a peptide, or to suppression by other peptides, or from other unknown issues. In addition, the probability of making a false discovery increases as we evaluate an increasing number of p-values. However, there is no absolute way to quantify or control this rate. Various strategies – from controlling the expected proportion of false discoveries among the rejected hypotheses (the false discovery rate, or FDR) to controlling the probability of making any false rejection within a family of tests (the family-wise error rate, or FWER) – have been proposed, and each method provides a tradeoff between conservative control and the number of false positives. If the validation strategy – for example, Western blots for proteins – is labor intensive, then we do not want to waste our effort testing false positives; if the validation strategy is high throughput – for example, microarray – then we can tolerate a higher number of false positives with little cost. In the end, no single approach is appropriate in all settings, and we caution the student to control error rates with the validation strategy in mind.
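The tradeoff between the two families of procedures can be seen in a small sketch. It implements the Bonferroni adjustment, which controls the FWER, and the Benjamini–Hochberg adjustment, which controls the FDR when the tests are independent or positively dependent. These two procedures are chosen here as common representatives and are not named in the text above. The p-values are invented for illustration.

```python
def bonferroni(p_values):
    """FWER control: multiply each p-value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR control: step-up adjusted p-values of Benjamini and Hochberg."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical peptide-level p-values.
P_VALUES = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
ALPHA = 0.05

fwer_hits = sum(p <= ALPHA for p in bonferroni(P_VALUES))
fdr_hits = sum(p <= ALPHA for p in benjamini_hochberg(P_VALUES))
print(fwer_hits, fdr_hits)
```

At a threshold of 0.05, the Bonferroni adjustment keeps one candidate and the Benjamini–Hochberg adjustment keeps two in this toy input, which reflects the more permissive nature of FDR control. Which of the two is preferable depends on the cost of validating a false positive, as discussed above.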
7.11 Conclusions
Herein, we have summarized a bioinformatic trend emerging in cancer biology, one in which ensembles of high-throughput molecular measurements are studied in the context of an organizing map, or manifold, and we dub this marriage of molecule and manifold a molecular subsystem. As studies of molecular subsystems often link disparate analytical tools and approaches, the field is becoming increasingly complex, especially for scientists unfamiliar with the past decade of bioinformatics research. To this end, we illustrated one approach to deconvoluting proteomic data, making use of the genome as a scaffold on which peptide-level information can be interpreted. The interpretation of the end results, however, is far from obvious. Traditional biology suggests that we simply cherry pick proteomic “targets” from this list of proteins for further experimental exploration, though this comes at the expense of discarding valuable data. In this context, a basic understanding of manifolds – and the resultant molecular subsystems that can be studied with them – prevents us from losing the forest for the trees, and we hope that our explanation
of this concept proves useful both for students navigating an ‘omics experiment and for experienced bioinformaticians developing new approaches for studying subsystems.
7.12 Perspective
The emerging trend in bioinformatics focuses on the analysis of high-throughput genome sequencing data, particularly in cancers, with the aim of personalizing medical diagnosis and therapy. Motivated by the success of key molecular therapeutics – such as trastuzumab in breast cancer and imatinib in chronic myelogenous leukemia – scientists have been mining genomic data for mechanistic insight into disease. Early sequencing studies of breast and colon cancer [16] made it clear, however, that there is a surprising amount of discordance in the genes mutated between patients. Given that patients with histologically similar tumors can have drastically different mutations driving tumor growth, the prevailing paradigm is that mutational heterogeneity is somehow integrated to produce similar phenotypes, and the current model for integration of these mutations is the signaling pathway. While useful for representing known knowledge, signaling pathways may fall short in predicting disease outcomes or responses to therapy, as they do not include all possible connections between proteins. In this regard, more liberal models of signaling, such as protein–protein interaction networks, may be valuable in discovery-oriented projects. As pathways and networks ultimately only represent sets of proteins, however, these existing models are limited by their inability to accommodate multidimensional data – sequencing, methylation, miRNA, mRNA, protein, and so on – that can aid in identifying the genetic mutations that are contributing to an observable phenotype. While efforts have been made to develop computational systems for multiplexing high-dimensional datasets [65, 72], the next phase of bioinformatic innovation will focus on the development of new theoretical frameworks for integrating and interpreting genomic data. Perhaps more importantly, bioinformaticians must develop methods for determining which data are important. 
Awash in a torrent of genomic information, scientists struggle to identify which mutations are biologically valuable and worth pursuing further. To cull the signal from the noise, several groups have developed approaches that call upon the interplay between bioinformatics and wet-lab biology [60, 73, 74], demonstrating that the success of bioinformatic pipelines for analyzing genomic data hinges on their ability to identify single, biologically relevant targets. While this objective often seems at odds with the global understanding that systems biologists pursue, we must remember that the current paradigm in medicine strives towards identifying single, druggable molecules, and this approach has proven beneficial for patients with certain diseases (e.g., chronic myelogenous leukemia). This is not to say that we should be lured away by the call of the most “frequent” mutations or the most “significant” expression patterns: the field is in dire need of models and theories to make sense of the hundreds of infrequently
mutated genes that constitute the majority of a patient’s cancer genome [75]. Thus, it falls upon the bioinformatician to balance the development of frameworks providing a global perspective for high-throughput ‘omic data with those yielding local information – that is, biologically relevant targets – for designing clinical interventions.
References
1 Vogelstein, B. and Kinzler, K.W. (2004) Cancer genes and the pathways they control. Nat. Med., 10, 789–799.
2 Jemal, A., Siegel, R., Xu, J., and Ward, E. (2010) Cancer statistics. CA Cancer J. Clin., 60, 277–300.
3 Samowitz, W.S. et al. (2006) Association of smoking, CpG island methylator phenotype, and V600E BRAF mutations in colon cancer. J. Natl. Cancer Inst., 98, 1731–1738.
4 Willett, W.C., Stampfer, M.J., Colditz, G.A., Rosner, B.A., and Speizer, F.E. (1990) Relation of meat, fat, and fiber intake to the risk of colon cancer in a prospective study among women. N. Engl. J. Med., 323, 1664–1672.
5 Galiatsatos, P. and Foulkes, W.D. (2006) Familial adenomatous polyposis. Am. J. Gastroenterol., 101, 385–398.
6 Bodmer, W.F. et al. (1987) Localization of the gene for familial adenomatous polyposis on chromosome 5. Nature, 328, 614–616.
7 Kinzler, K.W. et al. (1991) Identification of FAP locus genes from chromosome 5q21. Science, 253, 661–665.
8 Groden, J. et al. (1991) Identification and characterization of the familial adenomatous polyposis coli gene. Cell, 66, 589–600.
9 Tomlinson, I.P. et al. (2008) A genome-wide association study identifies colorectal cancer susceptibility loci on chromosomes 10p14 and 8q23.3. Nat. Genet., 40, 623–630.
10 Takayama, T. et al. (1998) Aberrant crypt foci of the colon as precursors of adenoma and cancer. N. Engl. J. Med., 339, 1277–1284.
11 Pretlow, T.P. and Pretlow, T.G. (2005) Mutant KRAS in aberrant crypt foci (ACF): initiation of colorectal cancer? Biochim. Biophys. Acta, 1756, 83–96.
12 Wood, L.D. et al. (2007) The genomic landscapes of human breast and colorectal cancers. Science, 318, 1108–1113.
13 Balkwill, F. and Mantovani, A. (2001) Inflammation and cancer: back to Virchow? Lancet, 357, 539–545.
14 Maeda, H., Okamoto, T., and Akaike, T. (1998) Human matrix metalloprotease activation by insults of bacterial infection involving proteases and free radicals. Biol. Chem., 379, 193–200.
15 Chin, L. et al. (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455, 1061–1068.
16 Wood, L.D. et al. (2007) The genomic landscapes of human breast and colorectal cancers. Science, 318, 1108–1113.
17 Vogelstein, B. et al. (1988) Genetic alterations during colorectal-tumor development. N. Engl. J. Med., 319, 525–532.
18 Yang, W.C. et al. (2001) Targeted inactivation of the p21(WAF1/cip1) gene enhances Apc-initiated tumor formation and the tumor-promoting activity of a Western-style high-risk diet by altering cell maturation in the intestinal mucosal. Cancer Res., 61, 565–569.
19 Tay, S.T. et al. (2003) A combined comparative genomic hybridization and expression microarray analysis of gastric cancer reveals novel molecular subtypes. Cancer Res., 63, 3309–3316.
20 Jorissen, R.N. et al. (2009) Metastasis-associated gene expression changes predict poor outcomes in patients with Dukes stage B and C colorectal cancer. Clin. Cancer Res., 15, 7642–7651.
21 Rouzier, R. et al. (2005) Breast cancer molecular subtypes respond differently to preoperative chemotherapy. Clin. Cancer Res., 11, 5678–5685.
22 Boyd, Z.S. et al. (2008) Proteomic analysis of breast cancer molecular subtypes and biomarkers of response to targeted kinase inhibitors using reverse-phase protein microarrays. Mol. Cancer Ther., 7, 3695–3706.
23 Nibbe, R.K., Markowitz, S., Myeroff, L., Ewing, R., and Chance, M.R. (2009) Discovery and scoring of protein interaction subnetworks discriminative of late stage human colon cancer. Mol. Cell. Proteomics, 8, 827–845.
24 Jonker, D.J. et al. (2007) Cetuximab for the treatment of colorectal cancer. N. Engl. J. Med., 357, 2040–2048.
25 Cunningham, D. et al. (2004) Cetuximab monotherapy and cetuximab plus irinotecan in irinotecan-refractory metastatic colorectal cancer. N. Engl. J. Med., 351, 337–345.
26 Karapetis, C.S. et al. (2008) K-ras mutations and benefit from cetuximab in advanced colorectal cancer. N. Engl. J. Med., 359, 1757–1765.
27 Allison, D.B., Cui, X., Page, G.P., and Sabripour, M. (2006) Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet., 7, 55–65.
28 Wheelan, S.J., Martinez Murillo, F., and Boeke, J.D. (2008) The incredible shrinking world of DNA microarrays. Mol. Biosyst., 4, 726–732.
29 Barrett, T. et al. (2009) NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res., 37, D885–D890.
30 Wang, G.S. and Cooper, T.A. (2007) Splicing in disease: disruption of the splicing code and the decoding machinery. Nat. Rev. Genet., 8, 749–761.
31 Gabriel, A. et al. (2006) Global mapping of transposon location. PLoS Genet., 2, e212.
32 Krichevsky, A.M., King, K.S., Donahue, C.P., Khrapko, K., and Kosik, K.S. (2003) A microRNA array reveals extensive regulation of microRNAs during brain development. RNA, 9, 1274–1281.
33 Fuchs, S.M., Krajewski, K., Baker, R.W., Miller, V.L., and Strahl, B.D. (2011) Influence of combinatorial histone modifications on antibody and effector protein recognition. Curr. Biol., 21, 53–58.
34 Nesvizhskii, A.I., Keller, A., Kolker, E., and Aebersold, R. (2003) A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem., 75, 4646–4658.
35 Heard, E. et al. Ten years of genetics and genomics: what have we achieved and where are we heading? Nat. Rev. Genet., 11, 723–733.
36 Subramanian, A. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA, 102, 15545–15550.
37 du Plessis, L., Skunca, N., and Dessimoz, C. (2011) The what, where, how and why of gene ontology – a primer for bioinformaticians. Brief. Bioinform., 12, 723–735.
38 Goh, K.I. et al. (2007) The human disease network. Proc. Natl. Acad. Sci. USA, 104, 8685–8690.
39 Horvath, S. et al. (2006) Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc. Natl. Acad. Sci. USA, 103, 17402–17407.
40 Mishra, G.R. et al. (2006) Human protein reference database – 2006 update. Nucleic Acids Res., 34, D411–D414.
41 Breitkreutz, B.J. et al. (2008) The BioGRID Interaction Database: 2008 update. Nucleic Acids Res., 36, D637–D640.
42 Alfarano, C. et al. (2005) The biomolecular interaction network database and related tools 2005 update. Nucleic Acids Res., 33, D418–D424.
43 Salwinski, L. et al. (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res., 32, D449–D451.
44 Brown, K.R. and Jurisica, I. (2005) Online Predicted Human Interaction Database. Bioinformatics, 21, 2076–2082.
45 Chatr-aryamontri, A. et al. (2007) MINT: the Molecular INTeraction database. Nucleic Acids Res., 35, D572–D574.
46 Kerrien, S. et al. (2007) IntAct – open source resource for molecular interaction data. Nucleic Acids Res., 35, D561–D565.
47 Bader, G.D. and Hogue, C.W. (2002) Analyzing yeast protein–protein interaction data obtained from different sources. Nat. Biotechnol., 20, 991–997.
48 Hart, G.T., Ramani, A.K., and Marcotte, E.M. (2006) How complete are current yeast and human protein-interaction networks? Genome Biol., 7, 120.
49 Hakes, L., Pinney, J.W., Robertson, D.L., and Lovell, S.C. (2008) Protein-protein interaction networks and biology – what’s the connection? Nat. Biotechnol., 26, 69–72.
50 Barabasi, A.L., Gulbahce, N., and Loscalzo, J. (2011) Network medicine: a network-based approach to human disease. Nat. Rev. Genet., 12, 56–68.
51 Zhang, B. and Horvath, S. (2005) A general framework for weighted gene coexpression network analysis. Stat. Appl. Genet. Mol. Biol., 4, Article17. http://www.degruyter.com/view/j/sagmb.2005.4.1/sagmb.2005.4.1.1128/sagmb.2005.4.1.1128.xml.
52 Schubeler, D. et al. (2004) The histone modification pattern of active genes revealed through genome-wide chromatin analysis of a higher eukaryote. Genes Dev., 18, 1263–1271.
53 Butte, A., Califano, A., Friend, S., Ideker, T., and Schadt, E.E. (2011) Integrative network-based association studies: leveraging cell regulatory models in the post-GWAS era. Nat. Prec. Available from Nature Precedings
54 Lim, J. et al. (2006) A protein–protein interaction network for human inherited ataxias and disorders of Purkinje cell degeneration. Cell, 125, 801–814.
55 Wu, X., Jiang, R., Zhang, M.Q., and Li, S. (2008) Network-based global inference of human disease genes. Mol. Syst. Biol., 4, 189.
56 Ghazalpour, A. et al. (2006) Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genet., 2, e130.
57 Youn, A. and Simon, R. (2011) Identifying cancer driver genes in tumor genome sequencing studies. Bioinformatics, 27, 175–181.
58 Emilsson, V. et al. (2008) Genetics of gene expression and its effect on disease. Nature, 452, 423–428.
59 Kinzler, K.W. and Vogelstein, B. (2002) The Genetic Basis of Human Cancer, 2nd edn, McGraw-Hill Professional, New York.
60 Bredel, M. et al. (2009) A network model of a cooperative genetic landscape in brain tumors. JAMA, 302, 261–275.
61 Akavia, U.D. et al. (2010) An integrated approach to uncover drivers of cancer. Cell, 143, 1005–1017.
62 Cerami, E., Demir, E., Schultz, N., Taylor, B.S., and Sander, C. (2010) Automated network analysis identifies core pathways in glioblastoma. PLoS One, 5, e8918.
63 Nibbe, R.K., Koyuturk, M., and Chance, M.R. (2010) An integrative-omics approach to identify functional subnetworks in human colorectal cancer. PLoS Comput. Biol., 6, e1000639.
64 Sharan, R. et al. (2005) Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. USA, 102, 1974–1979.
65 Bebek, G., Patel, V., and Chance, M.R. (2010) PETALS: proteomic evaluation and topological analysis of a mutated Locus’ signaling. BMC Bioinform., 11, 596.
66 Linderman, G.C., Patel, V.N., Chance, M.R., and Bebek, G. (2011) BiC: a web server for calculating bimodality of coexpression between gene and protein networks. Bioinformatics, 27, 1174–1175.
67 Patel, V.N. et al. (2010) Prediction and testing of biological networks underlying intestinal cancer. PLoS One, 5, e12497.
68 Shaw, W.T. and Lee, K.T.A. (2008) Bivariate Student t distributions with variable marginal degrees of freedom and independence. J. Multivariate Anal., 99, 1276–1287.
69 Ledoit, O. and Wolf, M. (2002) Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. Ann. Stat., 30, 1081–1102.
70 Schafer, J. and Strimmer, K. (2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat. Appl. Genet. Mol. Biol., 4.
71 Yan, M. et al. (2009) 15-Hydroxyprostaglandin dehydrogenase inactivation as a mechanism of resistance to celecoxib chemoprevention of colon tumors. Proc. Natl. Acad. Sci. USA, 106, 9409–9413.
72 Kong, J. et al. (2011) Integrative, multimodal analysis of glioblastoma using TCGA molecular data, pathology images, and clinical outcomes. IEEE Trans. Biomed. Eng., 58, 3469–3474.
73 Verhaak, R.G. et al. (2010) Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell, 17, 98–110.
74 Brennan, C. et al. (2009) Glioblastoma subclasses can be defined by activity among signal transduction pathways and associated genomic alterations. PLoS One, 4, e7752.
75 Sjöblom, T. et al. (2006) The consensus coding sequences of human breast and colorectal cancers. Science, 314, 268–274.
8 Network Medicine: Disease Genes in Molecular Networks
Sreenivas Chavali and Kartiek Kanduri
8.1 Brief Summary
Important challenges in understanding human genetic diseases include (i) identifying disease genes and (ii) obtaining mechanistic insights into disease etiology. Addressing these challenges is essential to understand disease pathogenesis and to design intervention and prevention strategies. Decades of research have identified disease genes associated with several human genetic diseases. Recent advances in building databases and in integrating and analyzing multiple large-scale datasets, especially in the field of network biology, have made it possible to elucidate the properties of disease genes at the systems level. In this chapter, we will discuss how disease and disease gene networks have shed light on the properties of different diseases, their shared genetic architecture, and how different genes contribute to comorbidity. Furthermore, we discuss the position of disease genes in protein interaction networks and its implications, how the lessons learnt by analyzing systems properties of disease genes are being exploited to prioritize novel disease candidates, and how these advances provide mechanistic insights into disease pathogenesis. We outline some important problems that need to be addressed to realize better diagnosis, prevention, and intervention.
8.2 Introduction
The rediscovery of Mendel’s work at the dawn of the twentieth century laid a strong foundation for the field of genetics. The characterization of pathological conditions such as sickle-cell anemia, phenylketonuria, and hemophilia, and the subsequent elucidation of the underlying genetic causes of their etiology, instigated the evolution of the field of medical genetics. Methodological and conceptual advances in medical genetics drove the identification of many genetic diseases, mostly caused by single gene defects with clear familial inheritance patterns. Subsequent efforts, predominantly led by Victor McKusick, helped in cataloging genetic diseases and the associated genes, leading to the creation of one of the most comprehensive databases, Online Mendelian Inheritance in Man (OMIM) [1]. In this chapter, we will discuss the genetic architecture of human diseases determined by studying individual diseases, the systems properties of disease genes determined by diseasome-wide studies (many diseases taken together), and how this knowledge could be applied to prioritize genes when identifying genetic determinants of common complex diseases, with special emphasis on cancers.
8.3 Genetic Architecture of Human Diseases
The genotype of an individual is defined as the genetic component that leads to the externally tractable trait, referred to as the phenotype. Phenotypes, here diseases, result from the interplay of genes and the environment. A disease gene is identified by a significantly higher segregation of sequence variations in that gene in a group of individuals with the disease than in healthy subjects. Thus, a gene that harbors a sequence variation (allele) leading to the disease phenotype, or significantly segregating in diseased individuals, is referred to as a disease gene. A mutation in such a gene results in a disease, and most often such diseases can be tracked in families, provided the mutations do not cause prepubertal lethality. Diseases can be genetically heterogeneous at two levels, namely allelic and genic (or locus). Allelic heterogeneity refers to a condition where many different mutations in a single gene lead to the same disease; for instance, mutations in the TP53 gene, which encodes the tumor suppressor protein p53, cause Li–Fraumeni syndrome. Locus heterogeneity means that mutations in more than one gene can lead to the same phenotype, as is the case for inherited breast cancer, which is caused by mutations in as many as 13 genes [2, 3]. Human genetic diseases can be broadly classified, based on the number of disease genes, as monogenic, oligogenic, and polygenic. Monogenic diseases are caused by single gene defects with little or no influence from the environment (complete penetrance). In monogenic diseases, the presence of a single mutation at a locus is essential and sufficient to manifest the disease. The disease-causing loci show Mendelian patterns of inheritance. However, increasing knowledge suggests that the classification of certain diseases as monogenic might be an oversimplification and that more than one gene might be involved in the manifestation of the disease [4]. Diseases where interactions between a few genes result in the phenotype are known as oligogenic.
For instance, breast cancers resulting from mutations in BRCA1 are known to be modified by a mutation at the APC locus [5]. Complex diseases such as cancers and metabolic, inflammatory, and neurological diseases involve many genes (polygenic) and several environmental factors. Thus, unlike monogenic forms, complex diseases result from the interaction of many alleles with small but varying effect sizes and the environment, and are hence multifactorial (Figure 8.1).

Figure 8.1 Simplified view of genetic architecture of human genetic diseases. The diseases can result from mutations in single genes (monogenic) or multiple genes (polygenic). Contributions from modifier loci and environmental factors reduce the penetrance of a mutation in a disease gene. When all individuals carrying the same mutation express the same phenotype, the mutation is said to be completely penetrant. Complex diseases are often multifactorial and involve a large number of mutations in different genes with varying degrees of contribution and several environmental factors. Figure is adapted from [8].

Monogenic disease genes were mostly discovered by studying families that inherit the disease using positional cloning. Positional cloning involves (i) identification of the candidate region that represents the approximate chromosomal position of a disease gene using linkage analysis and (ii) identification of the disease gene and mutation in the candidate region. However, identifying the genetic and environmental determinants of the common complex diseases has been a formidable challenge. Traditional methods, such as linkage analysis, have yielded limited success in identifying the causal genes, predominantly owing to the small effects of the alleles. This led to the advent of a framework for population-based association studies, which provide a statistical statement about the co-occurrence of alleles and/or phenotypes based on the allele frequency differences in diseased subjects compared to healthy individuals [6]. Technological advancements in genotyping in recent years have allowed the identification of common genetic variants (mostly single nucleotide polymorphisms) associated with complex diseases through genome-wide association studies (GWAS) [7]. Both linkage studies and GWAS identify candidate regions on the genome, which often contain large numbers of disease-candidate genes. Owing to the limited potential of linkage analysis and GWAS to identify causal mutations and genes, very little is understood about the genes and mechanisms that underlie complex diseases.
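The allele-frequency comparison behind an association study can be sketched for a single biallelic variant. The example below is an illustration only: the allele counts are invented, the function name is hypothetical, and a Pearson chi-square test on the 2×2 allele-count table is used as one common choice of test, not necessarily the one used in any study cited here.

```python
import math


def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square (1 df), its p-value, and the odds ratio for a 2x2 allele table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + control_alt, case_ref + control_ref]

    chi_square = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi_square += (table[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    return chi_square, p_value, odds_ratio


# Hypothetical counts: 1000 cases and 1000 controls, two alleles per person.
chi_square, p_value, odds_ratio = allelic_association(
    case_alt=600, case_ref=1400, control_alt=500, control_ref=1500
)
print(chi_square, p_value, odds_ratio)
```

A genome-wide study repeats such a test for a very large number of variants, so the resulting p-values need a multiple-testing correction before any locus is declared associated.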
In the subsequent sections, we will discuss our current knowledge of the systems-level properties of disease genes and how applying systems biology approaches and integrating multiple large datasets help us prioritize and identify novel disease genes.
8.4 Systems Properties of Disease Genes
Several years of reductionist work worldwide have led to the accumulation of an enormous amount of information. Systems biology aims at converting this information into knowledge and aiding a better comprehension of biological systems, such as all protein–protein interactions in a cell. This is facilitated by the fast-evolving theory of complex networks, which provides an integrated framework to map and understand biological systems by quantifying the topological and dynamic properties of the networks that characterize them. Networks are graphical representations consisting of nodes (vertices) and edges (links). For instance, in a protein interaction network a node represents a protein and an edge between two nodes represents an interaction. Edges can be directed, indicating the direction of the interaction, or undirected. Several studies based on network theory have provided important insights into the organization, behavior, and evolution of biological systems. It is now well established that biological networks are not random, but rather are characterized by a core set of organizing principles.

8.4.1 Network Measures
The basic network measures that are used to characterize the centrality of nodes, and thus their importance in the network, are as follows [9, 10].

a) Degree: It quantifies the number of links a node has with other nodes in a network. The degree distribution provides the probability of finding a node with a specified number of links. A random network has a peaked degree distribution with a characteristic node degree. Biological networks are scale free with a power-law degree distribution, wherein there are a few highly connected nodes (hubs) and many low-degree nodes. In Figure 8.2, the degree of node “C” is 4 as it has four links.

b) Closeness centrality: This centrality (also called closeness) is defined as the reciprocal of the average distance (number of links in the shortest path) to every other node. Thus, a node with high closeness is, on average, close in graph distance to the other nodes. Closeness is calculated as follows:

$$C_c(n) = \frac{1}{\mathrm{avg}(L(n, m))}$$

where avg is the average and $L(n, m)$ is the length of the shortest path between nodes $n$ and $m$. The closeness for node “C” in Figure 8.2 is 0.83.

c) Betweenness centrality: This centrality (also referred to as betweenness) is a global centrality measure, which determines the centrality of a node in a network based on the total number of shortest paths going through that node. Thus, nodes that occur on many shortest paths between other nodes have high betweenness. Betweenness is calculated as follows:

$$C_b(n) = \sum_{s \neq n \neq t} \frac{P_{st}(n)}{P_{st}}$$

where $C_b(n)$ is the betweenness centrality of node $n$; $s$ and $t$ are nodes different from $n$; $P_{st}$ denotes the number of shortest paths from $s$ to $t$; and $P_{st}(n)$ is the number of shortest paths from $s$ to $t$ that $n$ lies on. The betweenness for node “C” in Figure 8.2 is 0.7.

d) Clustering coefficient: This network measure quantifies the cohesiveness of the neighborhood of a node and is defined as the ratio between the number of edges linking nodes adjacent to a node and the total possible number of edges among them. Thus, the clustering coefficient characterizes the overall tendency of nodes to form clusters or groups:

$$C_n = \frac{2 e_n}{k_n (k_n - 1)}$$

where $k_n$ is the number of neighbors of $n$ and $e_n$ is the number of connected pairs between all neighbors of $n$. Thus, the clustering coefficient of node “C” in Figure 8.2 is 0.17.

Figure 8.2 A representative network. Nodes are shown as solid gray circles (A through F) and the connection between two nodes (also called a link or edge) is shown as a solid black line.

8.4.2 Disease and Disease-Gene Networks
As highlighted previously, decades of research in medical genetics and subsequent efforts in creating databases provided an opportunity to explore whether human genetic disorders and the corresponding disease genes might be related to each other at a higher level of cellular and organismal organization. Goh et al. were the first to adopt a systems-based approach, constructing a human disease network (HDN; Figure 8.3a) and a disease gene network (DGN; Figure 8.3b) based on data from OMIM [11]. In the HDN, a node represents a genetic disease and two nodes are connected if they share a common disease gene. Similarly, in the DGN, a node represents a disease gene and two genes
Figure 8.3 Networks of human diseases and human disease genes. (a) HDN: two diseases share a connection if they have one or more disease genes in common. The node size corresponds to the number of participating disease genes, as shown on the right. (b) DGN: two genes share a connection if they are implicated in the same disease. The colors in both the panels refer to the different disease classes provided on the right side of the figure. The figure was adapted from [11].
share a link if they are associated with the same disease. The most striking property of the HDN is the presence of a giant component of 516 diseases, suggesting some shared genetic origin of most diseases. Oncological disorders such as colon cancer and breast cancer represented hubs connected to a large number of distinct disorders, because of the tight connection among different cancer subtypes through common tumor suppressor genes such as TP53 and PTEN. Cancers clearly appear as a distinct cluster in the HDN with many overlapping genes owing to high locus heterogeneity, as can be seen from the DGN. Strikingly, in the HDN and DGN, disorders and genes are more likely to be linked to disorders and genes of the same disorder class, suggesting pathophysiological clustering of diseases and disease genes. In line with this approach, Barrenas et al. constructed a complex disease network (CDN; Figure 8.4a) and a complex disease gene network (CGN; Figure 8.4b) based on complex disease genes identified by GWAS [12]. As GWAS identifies variants with high allele effect sizes, the CDN indicated that a few diseases such as type 1 diabetes and multiple sclerosis might involve a larger number of such variants than Parkinson’s disease or restless leg syndrome. Most cancers formed distinct clusters with very few shared genes. Similarly, the CGN showed that genes involved in cancers and certain nervous system diseases were rarely associated with any other diseases; genes associated with one type of cancer were not associated with other types. Phenotypically similar or comorbid diseases are perceived to share disease-associated genes (shared components hypothesis) [13]. However, the CDN and CGN indicated that phenotypically similar diseases do not share high effect size alleles, although their shared pathogenesis suggests that they might share moderate effect size alleles (Figure 8.4c).
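The construction rule shared by these networks (link two diseases that have a gene in common, and link two genes that are implicated in the same disease) amounts to projecting a bipartite disease-gene association table onto each of its two node sets. The following is only a minimal sketch of that idea; the association table, the identifiers, and the function names `disease_network` and `gene_network` are hypothetical and are not taken from OMIM or from the cited studies.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical disease -> gene associations (toy data for illustration only).
disease_genes = {
    "disease_1": {"G1", "G2"},
    "disease_2": {"G2", "G3"},
    "disease_3": {"G4"},
}

def disease_network(assoc):
    """Link two diseases if they share at least one gene (HDN-style projection).

    Returns a dict mapping each disease pair to the set of shared genes.
    """
    edges = {}
    for d1, d2 in combinations(sorted(assoc), 2):
        shared = assoc[d1] & assoc[d2]
        if shared:
            edges[(d1, d2)] = shared
    return edges

def gene_network(assoc):
    """Link two genes if they are associated with the same disease (DGN-style projection).

    Returns a dict mapping each gene pair to the set of diseases they share.
    """
    edges = defaultdict(set)
    for disease, genes in assoc.items():
        for g1, g2 in combinations(sorted(genes), 2):
            edges[(g1, g2)].add(disease)
    return dict(edges)

hdn = disease_network(disease_genes)
dgn = gene_network(disease_genes)
print("disease-disease links:", hdn)
print("gene-gene links:", dgn)
```

In this toy table, disease_3 has a single gene that is shared with no other disease, so it stays isolated in both projections; in a real dataset such isolated nodes lie outside the giant component.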
Taken together, constructing disease and disease-gene networks helps to identify the genetic architecture of human diseases, the underlying genes, and their shared molecular pathogenesis.

8.4.3 Disease Genes in Protein Interaction Networks
No molecule in a cell acts in isolation; each acts in combination with several other molecules, forming a network of interactions that affects the phenotype. A mutation in a gene can therefore perturb a large and complex set of interactions at the protein level, resulting in a disease. This effect can be mediated through complete loss of gene products (e.g., proteins) or through interaction-specific alterations [14]. Genes essential for survival (essential genes) in yeast were found to be high-degree nodes (hubs) in the protein interaction network (interactome) [15]. This led to the hypothesis that hubs in the human interactome might be associated with diseases, as a mutation in a hub protein might affect interactions with many more proteins than a mutation in a nonhub protein. This prompted an extensive investigation of protein interactions in diseases by determining the aforementioned network measures for disease genes [16].
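To make the four measures from Section 8.4.1 concrete, the sketch below computes them for one node of a small undirected graph using only the Python standard library. The six-node edge list is hypothetical: it was chosen so that node "C" has degree 4 and yields the values quoted in the text (0.83, 0.7, and 0.17), but it is not claimed to be the graph drawn in Figure 8.2. Dividing betweenness by the number of node pairs that exclude n is also an assumption made here, since the text does not state how the quoted value was normalized. The graph is assumed to be connected.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected edge list (not taken from Figure 8.2).
edges = [("C", "A"), ("C", "B"), ("C", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def bfs(adj, source):
    """Return shortest-path lengths and shortest-path counts from source."""
    dist, count = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                count[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                count[v] += count[u]
    return dist, count

def degree(adj, n):
    return len(adj[n])

def closeness(adj, n):
    """Reciprocal of the average shortest-path length from n to all other nodes."""
    dist, _ = bfs(adj, n)
    lengths = [d for m, d in dist.items() if m != n]
    return len(lengths) / sum(lengths)

def clustering(adj, n):
    """2 * e_n / (k_n * (k_n - 1)); defined as 0 for nodes with fewer than two neighbors."""
    k = len(adj[n])
    if k < 2:
        return 0.0
    e = sum(1 for u, v in combinations(adj[n], 2) if v in adj[u])
    return 2 * e / (k * (k - 1))

def betweenness(adj, n):
    """Sum of P_st(n) / P_st over unordered pairs s, t (both different from n),
    divided by the number of such pairs (normalization is an assumption)."""
    info = {s: bfs(adj, s) for s in adj}
    dist_n, count_n = info[n]
    others = [v for v in adj if v != n]
    total = 0.0
    for s, t in combinations(others, 2):
        dist_s, count_s = info[s]
        if dist_s[n] + dist_n[t] == dist_s[t]:   # n lies on a shortest s-t path
            total += count_s[n] * count_n[t] / count_s[t]
    return total / (len(others) * (len(others) - 1) / 2)

for name, func in [("degree", degree), ("closeness", closeness),
                   ("betweenness", betweenness), ("clustering", clustering)]:
    print(name, round(func(adj, "C"), 2))
```

For a real interactome, the same functions would be evaluated for every disease and nondisease gene and the two distributions compared. Recomputing all breadth-first searches for each queried node is wasteful at that scale, so a dedicated graph library would be the more practical choice.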
Figure 8.4 Networks of complex diseases and complex disease genes. (a) CDN in which two diseases are linked if they share one or more disease genes. The node size is indicative of the number of disease genes, as highlighted on the left. The colors refer to the different disease classes provided on the right. (b) CGN in which two genes are linked if they are associated with the same disease. The node size indicates the number of diseases a gene is associated with. Panels (a) and (b) are adapted from [12]. (c) Allelic spectrum inferred from the CDN and CGN, showing that phenotypically similar complex diseases share fewer high effect size alleles and might share more moderate effect size alleles.
Initial studies investigating the network properties of human disease genes were based on cancers. Genes that were upregulated in cancerous tissues were found to be central in the interactome and highly connected [17]. Shared topological features with essential genes in the interactome were attributed to the
requirement of these upregulated genes for proliferation of the cancerous tissue. A subsequent study of cancer genes with mutations showed that their protein products had more interaction partners than noncancer proteins. In addition, the authors showed that the cancer proteins resided in larger clusters and had a higher tendency to participate in more clusters than noncancer proteins [18]. By analyzing the DGN, Goh et al. showed that the products of disease genes tended (i) to have more interactions with each other than with nondisease genes, (ii) to be expressed in the same tissues, and (iii) to have shared functionality. They also found that the vast majority of disease genes are nonessential and, contradicting previous studies, that human disease genes are not high-degree nodes in the human interactome [11]. Subsequently, Feldman et al. evaluated the network properties of disease genes, largely represented by monogenic disease genes, and showed that genes with intermediate degrees were more likely to harbor germ-line disease mutations [19]. Thus, it remained unclear whether human disease genes have elevated degrees comparable to those of essential genes. A better picture of network properties was obtained by classifying disease genes based on their effect on the disease phenotype. This showed that in the interactome the different classes of genes were ordered as essential genes, monogenic disease genes, complex disease genes, and nondisease genes, in decreasing order of centrality (Figure 8.5a) [12]. Furthermore, the discrepancy was resolved by the finding that orthologs of mouse essential genes that cause diseases in humans (essential disease genes) are overrepresented among disease genes associated with cancers [20]. These genes have higher degrees in the interactome.
Taken together, these studies suggest that disease genes are not randomly positioned in the interactome, but tend to have higher connectivity than nondisease genes and to occur in central regions of the network. A recent study took a different approach, classifying disease genes based on the number of diseases they were associated with. Pleiotropy, in human genetic diseases, refers to the ability of different mutations within the same gene to cause different pathological effects. Accordingly, disease genes were classified as shared genes associated with phenotypically divergent diseases (phenodiv genes), shared genes associated with phenotypically similar diseases (phenosim genes), and specific genes associated with only one disease in OMIM. Phenodiv genes had high degrees comparable to those of essential genes in the interactome. Phenosim genes were less central than phenodiv genes, followed by specific genes and nondisease genes. With both essential genes and phenodiv genes being hubs, the authors investigated other network attributes that conferred lethality and pleiotropy, respectively. Essential genes were found to be intramodular hubs (Figure 8.5b) and phenodiv genes intermodular hubs (Figure 8.5c), with the former, unlike the latter, being highly co-expressed with their interactors. Essential genes were predominantly nuclear proteins with transcriptional regulation activities, while phenodiv genes such as AKT1 were cytoplasmic proteins involved in signal transduction [21]. This clarified how different classes of disease genes are positioned in the interactome.
Figure 8.5 Disease genes in the protein interaction network. (a) Topological position of different types of disease genes in the interactome. In utero essential genes (mutations in which are lethal) and mouse essential genes whose mutations lead to disease in humans (essential disease genes) occupy central positions, followed by monogenic and complex disease genes. Essential genes are intramodular hubs (node with dark blue border in panel (b)), most often co-expressed with all their interactors in all tissues, while shared genes associated with phenotypically diverse diseases are intermodular hubs (node with dark blue border in panel (c)), expressed with different interactors in different tissues (as indicated by the different colors in panel (c)). Notably, both the essential genes and the phenodiv genes have the same number of connections (degree). Panels (b) and (c) are adapted from [21].
Taken together, the analyses of disease genes in the interactome showed how disease genes are positioned in the network, how they are co-expressed with their interactors, and which functions they serve. However, given the incompleteness of the human interactome, these studies need to be interpreted carefully, as disease-causing proteins may have higher degrees simply because they are better studied (knowledge bias).
8.4.4 Identification of Disease Modules
Most biological functions, such as signaling, arise from interactions among different cellular components rather than from the activity of a single molecule. This suggests that a protein and its interactors are more likely to participate in the same biological function. Accordingly, proteins involved in the same disease also showed an increased tendency to interact with each other (local hypothesis) [11–13, 19, 22]. This indicates that the proteins involved in the same disease share a network neighborhood and form local clusters in the interactome, commonly referred to as disease modules. Thus, the disease outcome is perceived to result from a combination of different defects that perturb the activity of a disease module or disease subnetwork (Figure 8.6). For instance, 108 pathways, most of which involved phosphatidylinositol 3-kinase (PI3K) signaling, were preferentially mutated in breast cancer [23]. Similarly, 38 colorectal cancer pathways were identified, many of which centered on PI3K signaling. Pathways related to cell adhesion, the cytoskeleton, and the extracellular matrix were also altered in colorectal cancer [23]. In lung adenocarcinoma, genetic alterations were shown to occur frequently in genes of the MAPK signaling, p53 signaling, Wnt signaling, cell cycle, and mTOR pathways [24]. In glioblastoma, frequent genetic alterations were observed in three critical signaling pathways, namely RTK/RAS/PI3K, p53, and RB, which were altered in 88, 87, and 78% of tumors, respectively [25]. Similarly, the RB, PI3K/RAS, and NOTCH signaling pathways were altered in ovarian carcinoma [26]. These studies thus demonstrate combinatorial disease mechanisms and the overrepresentation of disease genes in distinct interaction clusters relevant to disease. Identification of disease modules involves integrating the interactome with memberships in protein complexes, regulatory interactions such as co-expression and possible co-regulation, and the metabolic network.
Topological modules, determined by different network clustering methods, or functional modules, representing a local clustering of functionally related proteins, are identified in molecular networks. These are then tested for statistically significant enrichment of previously known disease genes compared to random expectation and are then clustered to obtain a topologically compact subgraph containing most disease genes, which forms the disease module. Through integration of gene expression and protein–protein interactions in acute myeloid leukemia (AML), Lee et al. identified subnetworks dysregulated in AML patients. These subnetworks were enriched for known AML genes and associated with key leukemogenic processes such as myeloid differentiation [27]. Since complex diseases involve a large number of genes, the knowledge obtained from studying the systems properties of disease genes and identifying disease modules has been extensively exploited to prioritize likely disease candidates, which is discussed in the following section.
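The enrichment step can be illustrated with a one-sided hypergeometric test, which is one common way to compare the number of known disease genes in a module against random expectation. The text does not say which test the cited studies used, so the choice of test, the function name `enrichment_p_value`, and the example counts below are all assumptions made for illustration.

```python
from math import comb

def enrichment_p_value(n_network, n_disease, n_module, n_overlap):
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    n_network: number of genes in the whole network (the background)
    n_disease: number of known disease genes in the network
    n_module:  number of genes in the candidate module
    n_overlap: number of known disease genes found in the module
    """
    total = comb(n_network, n_module)
    upper = min(n_disease, n_module)
    tail = sum(
        comb(n_disease, k) * comb(n_network - n_disease, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: a 20-gene module containing 5 of 50 known disease genes
# in a network of 1000 genes (expected overlap by chance is 20 * 50 / 1000 = 1).
p = enrichment_p_value(1000, 50, 20, 5)
print(f"enrichment p-value: {p:.3g}")
```

When many candidate modules are tested in the same network, the resulting p-values would normally be corrected for multiple testing before any module is called enriched.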
Figure 8.6 Different scales of organization of disease-related components. (a) Simplified overview of disease gene identification at the genome level by large-scale re-sequencing or genotyping efforts. The distribution of disease genes (G1 to G7) on different chromosomes (Chr A to F) is shown. (b) Functional relevance of the disease genes is determined by identifying pathways enriched for their gene products. (c) Disease modules are identified in the interactome by applying clustering algorithms to obtain a compact subgraph that is enriched for known disease genes (shown as solid gray nodes).
8.5 Disease Gene Prioritization
As stated in Section 8.3, linkage analysis and GWAS point toward many candidates, and identifying the causal disease mutation and gene is difficult. The knowledge acquired on the systems properties of disease genes has led to the development of several network-based tools to predict potential disease candidate genes [28]. Barabasi et al. loosely grouped these tools into three categories, namely linkage methods, disease module-based methods, and diffusion-based methods [13]. In this section, we discuss these different methods and highlight how they have aided in cancer gene identification, wherever applicable.

8.5.1 Linkage Methods
Linkage methods were developed to prioritize disease candidates in loci linked to a disease by positional cloning approaches. These methods assume that the interaction partners of a disease gene that are located in its chromosomal vicinity have a higher likelihood of being associated with the same disease (Figure 8.7a). This approach was employed to prioritize disease candidates in several complex diseases, such as schizophrenia, bipolar disorder, autism, and type 2 diabetes, for which genome-wide linkage scans were available [29–33]. By integrating genomic information such as shared chromosomal positions with molecular interaction networks and the cellular localization of their protein products, these methods showed 10- to 1000-fold enrichment of true disease-causing genes over random selection.
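In its simplest form, the linkage idea reduces to a lookup: among the genes in a linkage interval, keep those whose protein products interact directly with a known disease protein. The sketch below ranks such candidates by the number of known disease proteins they contact. The identifiers, the toy interaction list, and the function name `prioritize_in_interval` are hypothetical, and the published tools add further evidence (for example, cellular localization) that is not modeled here.

```python
def prioritize_in_interval(interval_genes, known_disease_genes, interactions):
    """Rank linkage-interval genes by their number of direct interaction
    partners that are known disease genes; genes with no such partner are dropped."""
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    scores = {
        gene: len(neighbors.get(gene, set()) & known_disease_genes)
        for gene in interval_genes
    }
    # Sort by descending score, then by name so that ties are ordered reproducibly.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(gene, score) for gene, score in ranked if score > 0]

# Hypothetical data: P1-P3 lie in the linkage interval, D1 and D2 are known disease proteins.
interactions = [("P1", "D1"), ("P1", "D2"), ("P2", "D1"), ("P3", "X1")]
candidates = prioritize_in_interval({"P1", "P2", "P3"}, {"D1", "D2"}, interactions)
print(candidates)
```

Here P3 is removed because its only partner, X1, is not a known disease protein, which mirrors the filtering role that the interaction evidence plays within a linkage interval.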
Figure 8.7 Different methods to prioritize disease candidates. (a) Linkage methods: protein products of genes located in the linkage interval of a disease (P1, P2, etc.) which interact with known disease proteins are considered to be likely disease candidates. (b) Disease module-based methods: topological, functional, or disease-relevant clusters are obtained in the interactome and the members of the module hitherto unknown to be disease associated are considered as disease candidates (for instance, P4). (c) Diffusion-based methods: an iterative random walk or propagation flow is undertaken starting from disease proteins along the links in the interactome with certain probability. As a result of this, each protein is assigned a disease-association score, which reflects the likelihood of a protein to be associated with a disease. PPI stands for protein–protein interaction. The figure was adapted from [13].
8.5.2 Disease-Module-Based Methods
These methods assume that genes belonging to a module (topological, functional, or disease) in the interactome are more likely to be involved in the same disease [34]. Genes that were hitherto unknown to be disease-relevant but are interaction partners in the disease modules, identified as detailed in Section 8.4.4, are inspected as potential disease candidates (Figure 8.7b). Variants of this conceptual framework, which are primarily based on “guilt by association,” have been employed for various complex diseases including different types of cancers, type 2 diabetes, and cardiovascular disease [35–45]. For instance, Bonifaci et al. applied an integrative approach for identifying candidate low-penetrance breast cancer susceptibility genes in GWAS data through the analysis of diverse sources of biological evidence. They identified cell communication and cell death as the major biological processes perturbed in breast cancer risk and prioritized candidates based on their genomic and transcriptomic properties, molecular interactions, and possible functional effects [35]. By combining reverse-engineered gene networks with expression profiles, the androgen receptor gene (AR) was identified as an important genetic mediator, and the AR pathway as a highly enriched pathway, in metastatic prostate cancer [36]. Heiser et al. integrated genomic, transcriptomic, and proteomic profiles from different breast cancer cell lines and identified Pak1 as an important regulator of the MAPK pathway, which is deregulated in breast cancer [37]. Nibbe et al. identified proteins whose levels changed significantly in late stage human colorectal cancer. Using these proteins as seeds, they generated disease subnetworks in the interactome by integrating gene-expression data, and identified protein combinations and novel disease candidates significantly discriminative of late stage cancer [38].
One of the important limitations in obtaining a disease module is the incompleteness of the currently available interactomes, especially in the vicinity of known disease proteins. Several studies have sought to overcome this by identifying relevant interactions, and this approach was successfully applied to several complex diseases [46–49]. Starting with four known genes encoding breast cancer tumor suppressors (BRCA1, BRCA2, ATM, and CHEK2), Pujana et al. combined gene-expression profiling with functional genomic and proteomic data from various species. They identified HMMR, which encodes a centrosome subunit, and demonstrated previously unknown functional associations with the breast cancer-associated gene BRCA1 [46]. In addition to their use in disease gene prioritization, disease modules or subnetworks have been shown to improve disease prognosis, that is, the prediction of disease outcome. For instance, by integrating differential expression with the protein interaction network, Chuang et al. demonstrated that disease subnetworks are better predictors of breast cancer metastasis than individual genes [50]. Taylor et al. examined changes in the dynamic structure of the human interactome and showed that disease-associated protein interaction modules are useful indicators for predicting breast cancer outcome [51].
8.5.3 Diffusion-Based Methods
These methods capture global relationships by measuring global distances between candidate proteins and known disease proteins within the interactome, and use these distances to infer potential involvement in the same disease. They employ a random walk with restart or a propagation flow that starts from the disease proteins and diffuses to any neighboring protein in the interactome with equal probability. This aids in the identification of proteins and interactions that are in close proximity to the disease proteins, as these are visited most often in successive iterations (Figure 8.7c). Thus, both proteins that interact with several disease proteins and those that do not directly interact with any disease protein but are in close proximity to them gain high probabilistic weights. These methods have been used to prioritize candidates in diverse diseases, including multifactorial diseases such as type 2 diabetes, prostate cancer, and breast cancer [52–54]. For prostate cancer, 12 of the top 19 candidates prioritized by this method had some evidence of disease association [53]. A comparative study of different prioritization methods showed that the diffusion-based methods outperformed the other two in disease candidate prioritization [55].
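A random walk with restart can be sketched as a simple fixed-point iteration: at each step the walker either returns to one of the seed (disease) proteins with probability r or moves to a uniformly chosen neighbor. The version below is a minimal illustration using only the standard library. The restart probability of 0.3, the toy path-shaped network, and the function name `random_walk_with_restart` are assumptions for this example and do not reproduce the settings of the cited tools. Every node is assumed to have at least one neighbor.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until convergence,
    where W moves probability from each node to its neighbors with equal weight."""
    nodes = sorted(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        change = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if change < tol:
            break
    return p

# Hypothetical interactome: the path D1 - A - D2 - B - C - E, with seeds D1 and D2.
edges = [("D1", "A"), ("A", "D2"), ("D2", "B"), ("B", "C"), ("C", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

seeds = {"D1", "D2"}
scores = random_walk_with_restart(adj, seeds)
ranking = sorted((v for v in adj if v not in seeds), key=lambda v: -scores[v])
print(ranking)
```

In this toy network the candidate that interacts with both seeds (A) receives the highest score, and the scores fall off with distance from the seeds, so a protein two or three links away still receives a nonzero weight without any direct interaction with a disease protein.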
8.6 Conclusion
In summary, retrieving data from several different experimental studies helps to construct networks, which provide enormous power to understand the general principles underlying human genetic diseases and the properties of different disease genes. Thus, the extensive study of the systems properties of different diseases and disease genes, in disease networks and in protein interaction networks, has provided better insights into the genetic architecture and molecular pathogenicity of individual and comorbid diseases. Furthermore, these studies have paved the way for new strategies and tools to prioritize and predict likely disease candidates and so expedite disease gene discovery. In addition, network medicine has aided in shifting scales from studying and identifying individual disease genes to identifying disease mechanisms. Recent advances in this field permit the integration and analysis of multiple large-scale datasets capturing disease-related changes at various molecular levels, such as (i) genetic variation, including single nucleotide changes, insertions and deletions, and copy number variations [44, 56]; (ii) epigenetic changes such as methylation differences, nucleosome positioning, and posttranslational modifications of histones [57–59]; (iii) transcriptomic changes, which include gene-expression changes caused by changes in synthesis, transport, and degradation through regulators such as transcription factors, RNA binding proteins, and microRNAs [60–62]; (iv) changes in the proteome, which include changes in protein levels brought about by changes in synthesis, posttranslational modifications, localization, and degradation [63]; and (v) metabolic changes caused by changes in the levels of the substrates, catalysts, and products [64]. Integration and analysis of multiple large-scale datasets aim to provide a holistic picture of the molecular pathogenesis of a disease and its associated traits, and to identify reliable and efficient biomarkers, which aid in better diagnosis and prognosis and, more importantly, better drug targets. However, such efforts are currently limited by our incomplete knowledge of different molecular networks. To conclude, future advances in understanding the organization, structure, and function of the different molecular networks of a cell, and how they are altered in disease conditions, will provide a better understanding of the pathophysiology of diseases. This will provide a promising starting point for developing better drugs that locally target the disease-related changes that bring about global changes in cellular activity.
8.7 Perspectives
To obtain a better understanding of human genetic diseases and to realize the dream of better diagnosis and intervention, a few important problems need to be addressed effectively. These include improvements in (i) symptom-based diagnosis of disease, as most complex diseases are heterogeneous and vary in clinical severity, (ii) endo-phenotyping or subphenotyping of complex diseases based on constitutive traits to obtain a phenome-based categorization, (iii) cheap, fast, and efficient methods to obtain large-scale datasets at the various molecular levels detailed in Section 8.6, and (iv) statistical tools and network construction and analysis tools to integrate and analyze multiple genome-level datasets.
Acknowledgments
Sreenivas Chavali acknowledges Medical Research Council, UK and European Molecular Biology Organization for support. We thank Pavithra L Chavali for critically reading the manuscript and providing helpful comments.
References

1 McKusick, V.A. (2007) Mendelian inheritance in man and its online version, OMIM. Am. J. Hum. Genet., 80, 588.
2 Walsh, T. and King, M.C. (2007) Ten genes for inherited breast cancer. Cancer Cell, 11, 103.
3 McClellan, J. and King, M.C. (2010) Genetic heterogeneity in human disease. Cell, 141, 210.
4 Badano, J.L. and Katsanis, N. (2002) Beyond mendel: an evolving view of human genetic disease transmission. Nat. Rev. Genet., 3, 779.
5 Redston, M. et al. (1998) The APCI1307K allele and breast cancer risk. Nat. Genet., 20, 13.
6 Cardon, L.R. and Bell, J.I. (2001) Association study designs for complex diseases. Nat. Rev. Genet., 2, 91.
7 Hirschhorn, J.N. and Daly, M.J. (2005) Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet., 6, 95.
8 Chavali, S., Ghosh, S., and Bharadwaj, D. (2009) Hemophilia B is a quasiquantitative condition with certain mutations showing phenotypic plasticity. Genomics, 94, 433.
9 Assenov, Y., Ramirez, F., Schelhorn, S.E., Lengauer, T., and Albrecht, M. (2008) Computing topological parameters of biological networks. Bioinformatics, 24, 282.
10 Barabasi, A.L. and Oltvai, Z.N. (2004) Network biology: understanding the cell’s functional organization. Nat. Rev. Genet., 5, 101.
11 Goh, K.I. et al. (2007) The human disease network. Proc. Natl. Acad. Sci. USA, 104, 8685.
12 Barrenas, F., Chavali, S., Holme, P., Mobini, R., and Benson, M. (2009) Network properties of complex human disease genes identified through genome-wide association studies. PLoS One, 4, e8090.
13 Barabasi, A.L., Gulbahce, N., and Loscalzo, J. (2011) Network medicine: a network-based approach to human disease. Nat. Rev. Genet., 12, 56.
14 Zhong, Q. et al. (2009) Edgetic perturbation models of human inherited disorders. Mol. Syst. Biol., 5, 321.
15 Jeong, H., Mason, S.P., Barabasi, A.L., and Oltvai, Z.N. (2001) Lethality and centrality in protein networks. Nature, 411, 41.
16 Kann, M.G. (2007) Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief Bioinform., 8, 333.
17 Wachi, S., Yoneda, K., and Wu, R. (2005) Interactome–transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues. Bioinformatics, 21, 4205.
18 Jonsson, P.F. and Bates, P.A. (2006) Global topological features of cancer proteins in the human interactome. Bioinformatics, 22, 2291.
19 Feldman, I., Rzhetsky, A., and Vitkup, D. (2008) Network properties of genes harboring inherited disease mutations. Proc. Natl. Acad. Sci. USA, 105, 4323.
20 Dickerson, J.E., Zhu, A., Robertson, D.L., and Hentges, K.E. (2011) Defining the role of essential genes in human disease. PLoS One, 6, e27368.
21 Chavali, S., Barrenas, F., Kanduri, K., and Benson, M. (2010) Network properties of human disease genes with pleiotropic effects. BMC Syst. Biol., 4, 78.
22 Gandhi, T.K. et al. (2006) Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat. Genet., 38, 285.
23 Wood, L.D. et al. (2007) The genomic landscapes of human breast and colorectal cancers. Science, 318, 1108.
24 Ding, L. et al. (2008) Somatic mutations affect key pathways in lung adenocarcinoma. Nature, 455, 1069.
25 Cancer Genome Atlas Research Network (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455, 1061.
26 Cancer Genome Atlas Research Network (2011) Integrated genomic analyses of ovarian carcinoma. Nature, 474, 609.
27 Lee, E., Jung, H., Radivojac, P., Kim, J.W., and Lee, D. (2009) Analysis of AML genes in dysregulated molecular networks. BMC Bioinf, 10 (Suppl. 9), S2.
28 Wang, X., Gulbahce, N., and Yu, H. (2011) Network-based methods for human disease gene prediction. Brief. Funct. Genom., 10, 280.
29 Krauthammer, M., Kaufmann, C.A., Gilliam, T.C., and Rzhetsky, A. (2004) Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer’s disease. Proc. Natl. Acad. Sci. USA, 101, 15148.
30 Franke, L. et al. (2006) Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet., 78, 1011.
31 Oti, M., Snel, B., Huynen, M.A., and Brunner, H.G. (2006) Predicting disease genes using protein–protein interactions. J. Med. Genet., 43, 691.
32 Iossifov, I., Zheng, T., Baron, M., Gilliam, T.C., and Rzhetsky, A. (2008) Genetic-linkage mapping of complex hereditary disorders to a whole-genome molecular-interaction network. Genome. Res., 18, 1150.
33 Sharma, A., Chavali, S., Tabassum, R., Tandon, N., and Bharadwaj, D. (2010) Gene prioritization in Type 2 diabetes using domain interactions and network analysis. BMC Genom., 11, 84.
34 Lage, K. et al. (2007) A human phenome–interactome network of protein complexes implicated in genetic disorders. Nat. Biotechnol., 25, 309.
35 Bonifaci, N. et al. (2008) Biological processes, properties and molecular wiring diagrams of candidate low-penetrance breast cancer susceptibility genes. BMC Med. Genom., 1, 62.
36 Ergun, A., Lawrence, C.A., Kohanski, M.A., Brennan, T.A., and Collins, J.J. (2007) A network biology approach to prostate cancer. Mol. Syst. Biol., 3, 82.
37 Heiser, L.M. et al. (2009) Integrated analysis of breast cancer cell lines reveals unique signaling pathways. Genome. Biol., 10, R31.
38 Nibbe, R.K., Markowitz, S., Myeroff, L., Ewing, R., and Chance, M.R. (2009) Discovery and scoring of protein interaction subnetworks discriminative of late stage human colon cancer. Mol. Cell Proteomics, 8, 827.
39 Liu, M. et al. (2007) Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genet., 3, e96.
40 Ray, M., Ruan, J., and Zhang, W. (2008) Variations in the transcriptome of Alzheimer’s disease reveal molecular networks involved in cardiovascular diseases. Genome. Biol., 9, R148.
41 Wheelock, C.E. et al. (2009) Systems biology approaches and pathway tools for investigating cardiovascular disease. Mol. Biosyst., 5, 588.
42 Diez, D. et al. (2010) The use of network analyses for elucidating mechanisms in cardiovascular disease. Mol. Biosyst., 6, 289.
43 Calvano, S.E. et al. (2005) A network-based analysis of systemic inflammation in humans. Nature, 437, 1032.
44 Chen, Y. et al. (2008) Variations in DNA elucidate molecular networks that cause disease. Nature, 452, 429.
45 Dobrin, R. et al. (2009) Multi-tissue coexpression networks reveal unexpected subnetworks associated with disease. Genome. Biol., 10, R55.
46 Pujana, M.A. et al. (2007) Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat. Genet., 39, 1338.
47 Goehler, H. et al. (2004) A protein interaction network links GIT1, an enhancer of Huntington aggregation, to Huntington’s disease. Mol. Cell, 15, 853.
48 Lim, J. et al. (2006) A protein–protein interaction network for human inherited ataxias and disorders of Purkinje cell degeneration. Cell, 125, 801.
49 Camargo, L.M. et al. (2007) Disrupted in schizophrenia 1 interactome: evidence for the close connectivity of risk genes and a potential synaptic basis for schizophrenia. Mol. Psychiatry, 12, 74.
50 Chuang, H.Y., Lee, E., Liu, Y.T., Lee, D., and Ideker, T. (2007) Network-based classification of breast cancer metastasis. Mol. Syst. Biol., 3, 140.
51 Taylor, I.W. et al. (2009) Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat. Biotechnol., 27, 199.
52 Kohler, S., Bauer, S., Horn, D., and Robinson, P.N. (2008) Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet., 82, 949.
53 Vanunu, O., Magger, O., Ruppin, E., Shlomi, T., and Sharan, R. (2010) Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol., 6, e1000641.
54 Yang, P., Li, X., Wu, M., Kwoh, C.K., and Ng, S.K. (2011) Inferring gene-phenotype associations via global protein complex network propagation. PLoS One, 6, e21502.
55 Navlakha, S. and Kingsford, C. (2010) The power of protein interaction networks for associating genes with diseases. Bioinformatics, 26, 1057.
56 Emilsson, V. et al. (2008) Genetics of gene expression and its effect on disease. Nature, 452, 423.
57 Esteller, M. (2008) Epigenetics in cancer. N. Engl. J. Med., 358, 1148.
58 Hansen, K.D. et al. (2011) Increased methylation variation in epigenetic domains across cancer types. Nat. Genet., 43, 768.
59 Portela, A. and Esteller, M. (2010) Epigenetic modifications and human disease. Nat. Biotechnol., 28, 1057.
60 Lu, J. et al. (2005) MicroRNA expression profiles classify human cancers. Nature, 435, 834.
61 Kim, J. et al. (2010) A Myc network accounts for similarities between embryonic stem and cancer cell transcription programs. Cell, 143, 313.
62 Lukong, K.E., Chang, K.W., Khandjian, E.W., and Richard, S. (2008) RNA-binding proteins in human genetic disease. Trends Genet., 24, 416.
63 Solit, D.B. and Mellinghoff, I.K. (2010) Tracing cancer networks with phosphoproteomics. Nat. Biotechnol., 28, 1028.
64 Sreekumar, A. et al. (2009) Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression. Nature, 457, 910.
j151
j153
9 Inference of Gene Regulatory Networks in Breast and Ovarian Cancer by Integrating Different Genomic Data
Binhua Tang, Fei Gu, and Victor X. Jin
9.1 Brief Summary
The primary goal of modeling gene regulatory networks in human cancer is to reveal the pathways that drive cancer cells toward specific phenotypes. Knowledge of cancer-specific gene regulatory networks could aid the design of effective intervention strategies, such as introducing a factor or drug that alters the network to avoid undesirable cancerous cellular states. This chapter presents two computational approaches that infer the underlying regulatory architecture by integrating different high-throughput experimental data in human cancer. Our gene regulatory network analysis strongly suggests a rewired estrogen receptor α (ERα)-regulated network in breast cancer cells and a rewired SMAD4-regulated network in ovarian cancer cells.
9.2 Introduction
The expression of one gene is affected by many other genes, and their interactions constitute a complex regulatory network of gene expression. It is well recognized that almost all cellular activities are governed by this gene network. Constructing the altered gene regulatory networks in human cancer versus normal tissue will enable the design or development of new therapeutic drugs that target individual network nodes. Advances in biological technologies, especially microarrays, next-generation sequencing, and protein mass spectrometry, make it possible to study gene networks at the genomic level. Researchers have therefore begun to study gene networks from a complex-systems point of view. Many theories have been developed to study gene regulatory networks, such as stochastic models [1–3], Boolean models [4–6], information theory-based methods [7–14], ordinary differential equation (ODE) methods [15–17], and Bayesian statistical methods [18–20].
Statistical Diagnostics for Cancer: Analyzing High-Dimensional Data, First Edition. Edited by Frank Emmert-Streib and Matthias Dehmer. © 2013 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2013 by Wiley-VCH Verlag GmbH & Co. KGaA.
Inference of gene regulatory networks is one of the main fields of postbioinformatics. Using bioinformatics methods and techniques for data collection, analysis, modeling, simulation, and inference is a major way to study the complex network relationships among genes. These approaches have been used to model and analyze diverse cancer cases. For example, Torkamani et al. adopted a network reconstruction and coexpression module identification method to analyze functionally related gene modules targeted in breast and colorectal cancer [21]; Ciriello et al. proposed a correlation and statistical test-based approach for analyzing glioblastoma and serous ovarian cancers [22]; Fujita et al. discussed the effects of measurement error on regulatory network modeling and proposed an improved least-squares estimation method, applied to human lung and mouse liver cancers [23]; Tomaru et al. proposed a matrix RNAi approach to identify regulatory functions and networks among transcriptional regulatory factors (TRFs) in human hepatocytes and liver cases [24]; and Shimamura et al. used a vector autoregressive model to identify gene functional associations from time-series microarray experiments in breast cancer [25]. This chapter reviews the study of gene regulatory networks and their application to several study cases in human cancer.
9.3 Theory and Contents of Gene Regulatory Network
9.3.1 Basic Theory of Gene Regulatory Network
One assumption underlying gene regulatory networks is that two genes with similar expression profiles may collaborate in regulation. They may also function similarly and share the same expression process. This assumption is the basis for a variety of analysis and modeling approaches. In higher organisms, each gene or protein interacts with an estimated average of 4 to 8 genes, which are involved in about 10 biological functions. The overall regulation of gene expression patterns is therefore the result of the joint cooperation of local gene expression. In a highly connected cellular environment, it is thus important to consider the whole genome as a network when analyzing gene function. Various high-throughput experiments (gene microarray, ChIP-seq (chip), and mass spectrometry) can be used to study gene functions, gene synergies and constraint relationships, and gene regulatory networks.
When a gene is transcribed, a group of transcription factors (TFs) binds to the genomic regions of the gene and regulates its expression. These TFs are themselves products of other genes. When a gene is translated into protein, that protein may open the promoter region of another gene. Through transcription and translation, the product of one gene can directly or indirectly affect the expression of other genes, and even its own expression. Changes in the expression of multiple genes create a variable environment for the expression of other genes. Together, these interactions constitute a complex molecular-level biological information system: the gene network.
The purpose of studying gene networks is to systematically analyze and simulate the overall gene expression relationships in particular species or tissues by establishing gene transcriptional regulatory networks. First, the characteristics of each gene are identified from its synchronous or antisynchronous expression and its expression intensity; then, genes are grouped by a clustering method; finally, the gene regulatory network is reconstructed and the relevant parameters of the specific model are analyzed. A gene regulatory network is essentially a continuous and complex dynamic system, so simplifying the solution of the model is very important.
9.3.2 Content of Gene Regulatory Network
9.3.2.1 Identify and Infer the Structure Properties and Regulatory Relationships of Gene Networks
Currently, the most studied gene networks are those identified from gene expression profiles. The tasks include: identifying gene regulatory network structure from expression data; testing the global impact of dynamic network characteristics by random perturbation of individual genes; analyzing gene networks from large numbers of datasets; inferring the gene interaction mechanisms of a network in the steady state by establishing a static network; inferring gene function and the biological network as a series of logic circuits on the basis of expression profiles; and identifying the causal structure of gene networks. A gene network is a complex system that contains the complex regulatory relationships between a gene and its products: for example, whether a gene is transcribed into mRNA, whether it is translated into protein, whether it is regulated after transcription (e.g., by splicing), whether it is modified after translation (e.g., by phosphorylation), and whether its gene silencing-related methylation status changes.
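The first step of the workflow in Section 9.3.1, identifying synchronous or antisynchronous expression between genes, can be illustrated with a minimal Python sketch. It uses Pearson correlation as one common similarity measure; the function name `coexpression_edges`, the cutoff `r_min`, and the toy data are illustrative assumptions and are not taken from the chapter.

```python
import numpy as np

def coexpression_edges(expr, genes, r_min=0.9):
    """Return gene pairs whose expression profiles are strongly correlated.

    expr  : array-like, one row per gene, one column per sample or time point
    genes : gene names, in the same order as the rows of expr
    r_min : hypothetical cutoff on |Pearson r| (an assumption of this sketch)
    """
    r = np.corrcoef(np.asarray(expr, dtype=float))  # gene-by-gene Pearson r
    edges = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            if abs(r[i, j]) >= r_min:
                kind = "synchronous" if r[i, j] > 0 else "antisynchronous"
                edges.append((genes[i], genes[j], kind))
    return edges

# Toy profiles (made up): g2 rises with g1, g3 falls as g1 rises.
toy = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 2, 4]]
for edge in coexpression_edges(toy, ["g1", "g2", "g3", "g4"]):
    print(edge)
```

Such a thresholded correlation list is only a starting point: it captures similarity of profiles, not the direction or mechanism of regulation, which is why the later sections add binding data.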
By appropriately simplifying and quantifying the structure of gene networks, and drawing on the experimental data obtained and prior knowledge, bioinformatics helps to explore information, construct network models, and understand regulation and related mechanisms.
9.3.2.2 Understand the Basic Rules of Gene Expression and Function
It is important to understand when, and which, gene's expression is key to constructing the gene network. Each gene can be considered an individual variable whose state is either expressed or not expressed. Alternatively, each gene can be regarded as a continuous variable whose value is based on the gene expression rate. One purpose of studying gene regulatory networks is to understand the underlying rules that govern gene expression and function.
9.3.2.3 Discover the Transfer Rules of Genetic Information During Gene Expression
Genes transfer genetic information through expression. In a gene regulatory network, quantitative methods can be used to explore this information during gene expression. A gene regulatory network model is one approach to obtaining genetic information from gene expression data. By extracting and analyzing this information, we can better understand disease pathogenesis.
9.3.2.4 Study on the Gene Function in a Systematic Framework
With the development of functional genomics, the discovery of new gene functions has become an important task. Gene regulatory networks are an important aspect of functional genomics research. Systematic research and data mining may reveal new features of genes. Correlating diseases with genes or related processes through the network is important in drug design.
9.4 Inference of Gene Regulatory Networks in Human Cancer
We use reverse engineering, a computational methodology for inferring the unknown or hidden gene network topology from various genomic data (ChIP-seq (chip), gene expression (RNA-seq), or protein–protein interaction).
9.4.1 The In Silico Analytical Approach
The in silico analysis method (Figure 9.1) begins with ChIP-seq datasets and gene expression data. The identified binding peaks of a given TF are mapped to known genes, and genes carrying binding peaks of the given TF are further correlated with gene expression data based on RefSeq Gene ID. The TF-binding peaks are then used to find the most significant motifs with the software ChIPMotifs [26]; the TFs whose known motifs in the TRANSFAC [27] and JASPAR [28] databases match these motifs are used as hub TFs. The hub TF–gene connection is determined by scanning the position weight matrices (PWMs) of the hub TFs in all binding peaks, and a permutation test is used to calculate the reliability of each connection of the network.
To infer the regulatory network, the PWMs of the TFs mapped to the TRANSFAC and JASPAR databases were used to scan the peak region of each gene. To make the result more reliable, we use stringent thresholds (1 for the core score, 0.95 for the PWM score) to determine the underlying TF-binding sites. The PWM score and core score are calculated as follows. Given a sequence with the same length as the PWM (column number n), we first calculate the sequence score:
$$S_{\mathrm{seq}} = \sum_{j=1}^{n} w(S_i, j) \tag{9.1}$$
where $w(S_i, j)$ is the score at row i, the row whose nucleotide matches the given sequence at column j.
Figure 9.1 The computational analysis procedure for building the ER network. After peak calling, the hub TFs were identified by ChIPMotifs from the ChIP-seq data. Moreover, those data were further processed by overlapping ChIP peaks with the corresponding microarray gene expression data. Then hub TFs and regulated genes were further analyzed by scanning the PWMs before the network building and pathway analysis.
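Then, the minimum and maximum scores of the PWM can be calculated as
$$S_{\min} = \sum_{j=1}^{n} \min_{i} \{ w(i, j) \} \tag{9.2}$$
$$S_{\max} = \sum_{j=1}^{n} \max_{i} \{ w(i, j) \} \tag{9.3}$$
The PWM score for the given sequence is given by Equation 9.4:
$$S_{\mathrm{PWM}} = \frac{S_{\mathrm{seq}} - S_{\min}}{S_{\max} - S_{\min}} \tag{9.4}$$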
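Equations 9.1 to 9.4 can be traced with a short Python sketch. It assumes the PWM is stored as a 4 × n array with rows in the order A, C, G, T; this layout, the function name `pwm_score`, and the toy matrix are assumptions made for illustration and do not describe the ChIPMotifs implementation.

```python
import numpy as np

ROWS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed row order of the PWM

def pwm_score(pwm, seq):
    """Normalized PWM score of a sequence as long as the PWM (Eqs. 9.1-9.4)."""
    pwm = np.asarray(pwm, dtype=float)
    if pwm.shape != (4, len(seq)):
        raise ValueError("sequence length must equal the number of PWM columns")
    rows = [ROWS[base] for base in seq.upper()]
    s_seq = pwm[rows, np.arange(len(seq))].sum()  # Eq. 9.1: matching entries
    s_min = pwm.min(axis=0).sum()                 # Eq. 9.2: worst possible sequence
    s_max = pwm.max(axis=0).sum()                 # Eq. 9.3: best possible sequence
    return (s_seq - s_min) / (s_max - s_min)      # Eq. 9.4: rescaled to [0, 1]

# Toy 3-column PWM (made up) whose consensus is "ACG".
toy_pwm = [[0.7, 0.1, 0.1],
           [0.1, 0.7, 0.1],
           [0.1, 0.1, 0.7],
           [0.1, 0.1, 0.1]]
print(pwm_score(toy_pwm, "ACG"), pwm_score(toy_pwm, "ACT"))
```

Because the score is rescaled between the worst and best attainable sums, the 0.95 cutoff quoted above has the same meaning for every PWM regardless of its length or entry scale. The sketch does not handle a flat PWM, for which the denominator would be zero.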
Our program ChIPMotifs gives the length (k) and the start (k1) and end (k2) positions of the core region in a PWM. The given sequence is scanned (n − k + 1) times, moving forward one position each time, to find the maximum core score. For the dth scan, the sequence score can be represented as
$$S_{\mathrm{seq},d} = \sum_{j=d}^{d+k-1} w(S_i, j) \tag{9.5}$$
For Equations 9.1 and 9.2, only the boundary of j was changed, from 1, …, n to k1, …, k2. The core score for the dth scan is
$$S_{\mathrm{core},d} = \frac{S_{\mathrm{seq},d} - S_{\min}}{S_{\max} - S_{\min}} \tag{9.6}$$
and the core score for the given sequence is
$$S_{\mathrm{core}} = \max_{d} \left( S_{\mathrm{core},d} \right) \tag{9.7}$$
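The core-score scan of Equations 9.5 to 9.7 can be sketched in Python under one reading of the text: the core submatrix (PWM columns k1 to k2) is slid along the sequence one position at a time, each window is scored against the core columns, and the score is normalized by the minimum and maximum attainable over the core. The 0-based inclusive column bounds, the A, C, G, T row order, and the function name `core_score` are assumptions of this sketch.

```python
import numpy as np

ROWS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed row order of the PWM

def core_score(pwm, k1, k2, seq):
    """Maximum normalized core score over all windows of seq (Eqs. 9.5-9.7).

    k1, k2 : 0-based, inclusive column bounds of the core region (assumption).
    """
    core = np.asarray(pwm, dtype=float)[:, k1:k2 + 1]
    k = core.shape[1]
    if len(seq) < k:
        raise ValueError("sequence is shorter than the core region")
    s_min = core.min(axis=0).sum()  # Eq. 9.2 restricted to the core columns
    s_max = core.max(axis=0).sum()  # Eq. 9.3 restricted to the core columns
    cols = np.arange(k)
    best = float("-inf")
    for d in range(len(seq) - k + 1):            # one pass per window start d
        rows = [ROWS[base] for base in seq[d:d + k].upper()]
        s_seq_d = core[rows, cols].sum()         # Eq. 9.5
        best = max(best, (s_seq_d - s_min) / (s_max - s_min))  # Eqs. 9.6, 9.7
    return best

# Toy 4-column PWM (made up) with consensus "ACGT"; the core is columns 1-2 ("CG").
toy_pwm = np.full((4, 4), 0.1)
toy_pwm[[0, 1, 2, 3], [0, 1, 2, 3]] = 0.7
print(core_score(toy_pwm, 1, 2, "AACGA"))
```

With the core-score threshold of 1 quoted in the text, a site is accepted only if some window matches the highest-scoring base at every core position.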
If a gene carries a motif for ERα and for a hub TF, a connection between the hub TF (including ERα) and the gene is made. To test the significance of the network, a permutation strategy was used to determine the probability of each edge of the network arising under random circumstances. Because a TF-binding site region is composed of specific sequences, and the network edges are obtained only by scanning the sequence region with the PWM, we shuffled the sequence of each peak region 1000 times to see how many times a specific TF-binding site was hit by the scanning process. The ratio (number of hits divided by 1000) was calculated for each edge. A low ratio was considered to indicate high statistical significance (we used 0.2 as a cutoff to include as many connections as possible while keeping the relationships reliable).
9.4.1.1 Study Case 1: Inference of Static Gene Regulatory Network of Estrogen-Dependent Breast Cancer Cell Line
Estrogen-mediated gene regulation is a challenging question that may require powerful genome-wide profiling tools such as ChIP-based technologies. In breast cancer cells, ERα can mediate genomic transcription regulation through nuclear-initiated steroid signaling and nongenomic activation of various protein kinase cascades. In the classical genomic pathway, the estrogen receptor binds to estrogen response elements (EREs) in the regulatory regions of target genes and recruits co-activators or co-repressors to modulate gene transcription. The nonclassical genomic pathway does not require an ERE but mediates transcription through the interactions of ERα with other proteins such as AP1 [29], NF-κB [30], SP1 [31, 32], and others. At the molecular level, we need to identify the genes that are targeted and regulated by estrogen receptors. Estrogen receptors, once activated, may increase or decrease the transcription of their numerous targets, which have been investigated by expression arrays.
In this study, a publicly available ChIP-seq dataset was collected for ERα-binding sites in breast cancer MCF7 cells upon estrogen exposure. The dataset also includes RNA polymerase II (Pol-II)-binding sites, since Pol-II binding provides direct information on potential transcription activation. Computational approaches were then applied to investigate the hierarchical regulatory information for ERα regulation in MCF7 cells, and the transcriptional regulatory networks were constructed with target hubs. The ChIP-seq data contain a total of 12,516 ERα-binding peaks corresponding to 5693 annotated genes and 13,261 Pol-II-binding peaks corresponding to 5186 genes; 2661 genes have both Pol-II- and ERα-binding peaks. Among these 2661 genes with enriched double (ERα and Pol-II) binding peaks, 273 overlap with the 1,513 E2-induced genes showing differential expression in MCF7 (Figure 9.2) [33].
Figure 9.2 A summary of correlations of identified ERα- and Pol-II-binding peaks with gene expression profile after E2 induction in MCF7 cells.
In this study case, the top 2000 peaks with the highest scores (enrichments) were selected as the input data for ChIPMotifs. Running ChIPMotifs identified ERE, PAX6, PITX2, and RORA. Several TFs reported in previous studies to be associated with ERα (CEBP, FOS (AP1), and FOXA1) were not identified. One possible reason is that ChIPMotifs first identifies motifs ab initio in a set of relatively short sequences (300–500 bp) and only then finds possible matching TFs in the TRANSFAC database. It might therefore miss co-TFs located more than 500 bp from ERα. Using longer sequences, however, might reduce the specificity of the identified motifs, for example by failing to identify ERE. Regardless, these three co-TFs were included in the analysis. The sequence logos of these TFs are shown in Figure 9.3.
Figure 9.3 The list of TF motifs identified by the de novo ChIPMotifs approach.
The ChIP-seq data were integrated with the time series of E2-induced gene expression data. To determine whether a gene is differentially expressed, the difference in expression level between the 12 h and 0 h time points was used: a positive value means up-regulated and a negative value means down-regulated. Thus, all genes in the networks were differentially expressed and had ERα- and Pol-II-binding ChIP peaks. The ERα peak must lie between 100 kb upstream of the 5′ TSS and 100 kb downstream of the 3′ TSS, and the Pol-II peak between 10 kb upstream of the 5′ TSS and 10 kb downstream of the 3′ TSS; the ERα peak and the Pol-II peak need not overlap, but they must be located in the same gene. Furthermore, for the transcriptional regulatory network, only TFs were used for the network construction (called normal TFs). ERα-binding peaks associated with those normal TFs
were further scanned with the PWMs of the hub TFs (the TFs whose motifs were enriched in the top 2000 ERα-binding peaks of the genes with both ERα- and Pol-II-binding sites) to determine whether there is any connection between a hub TF and a normal TF. A shuffling test was performed to test the reliability of each connection of the network. The resulting regulatory networks were constructed and topologically visualized using Cytoscape. In the network, all normal and hub TFs are represented as nodes (red nodes for up-regulated genes, green nodes for down-regulated genes, and blue nodes for hub TFs), and all connections are represented as edges between two nodes (Figure 9.4). An edge is directed from a hub TF to a normal TF. An edge between a hub TF (e.g., FOXA1) and a normal TF that is differentially expressed following estrogen stimulation (e.g., MYC) means that a motif of the hub TF (FOXA1) is found within the ERα peak region(s) associated with the other TF gene (MYC). Since every normal TF has both ERα and Pol-II peaks, the edge represents the possible direct or indirect binding of the hub TF (e.g., FOXA1) together with ERα to regulate the normal TF (e.g., MYC).
Figure 9.4 The regulatory network for E2-treated MCF7 cells.
9.4.1.2 Study Case 2: Gene Regulatory Network of Genome-Wide Mapping of TGFβ/SMAD4 Targets in Ovarian Cancer Patients
In this study case, ChIP-seq technology was applied to study transforming growth factor-β (TGFβ)/SMAD4 regulation in the platinum-sensitive ovarian cancer cell line A2780. SMAD4-binding loci were profiled in these cells under TGFβ stimulation. Combined with computational approaches, the binding patterns of SMAD4 were investigated and compared with those in a normal immortalized ovarian surface epithelial cell line (IOSE) from a previous study and in human keratinocytes (HaCaT) from Koinuma et al. [34]. The TGFβ signaling pathway plays an important role in controlling proliferation, differentiation, and other cellular processes including the growth of
ovarian surface epithelial cells. Dysregulation of TGFβ signaling may be crucial to the development of epithelial ovarian cancer. The effects of TGFβ are mediated by three TGFβ ligands, TGFβ1, TGFβ2, and TGFβ3, through TGFβ type 1 and type 2 receptors. Because the affinity of the activated SMAD complex for the SMAD-binding element is insufficient to support association with the endogenous promoters of target genes, SMAD complexes associate with other DNA-binding TFs to regulate expression. Many studies have shown that various families of TFs, such as the forkhead, homeobox, zinc finger, LEF1, Ets, and basic helix–loop–helix families, can serve as SMAD4 partner proteins to achieve high affinity and selectivity for target promoters with the appropriate binding elements [35].
Figure 9.5 SMAD4 target genes in three cell lines. (a) Overlapped genes among three cell lines. (b) GO analysis of the genes of three cell lines.
A2780 is a human epithelial ovarian cancer cell line, but not an aggressive one. A2780 cells are still sensitive to a key chemotherapeutic drug, cisplatin (cis-diamminedichloroplatinum(II)). They have only an intermediate level of TGFβ dysregulation: they are still able to induce SMAD4 expression and transduce
existing SMAD4 from the cytoplasm to the nucleus following TGFβ stimulation [36]. As such, this cancer line is often used as a model for studying ovarian cancer. Previous studies identified a set of 150 TGFβ-stimulated SMAD4 target genes in IOSE and a set of 92 TGFβ-stimulated SMAD4 target genes in HaCaT (an immortalized keratinocyte cell line). It is not surprising that only 6 of the 150 genes in IOSE and 6 of the 92 genes in HaCaT are shared with the 318 SMAD4 target genes in this study, and only 1 gene is common to all 3 studies (Figure 9.5), since one (A2780) is a cancer cell line and the other two are normal cell lines. Another possible reason for such low overlap is the limited number of targets identified using promoter arrays (ChIP-promoter-chip). GO analysis also shows that target genes in HaCaT and IOSE are mainly involved in the regulation of cell proliferation (or antiapoptosis) and developmental processes (muscle development), which differ from the target genes in A2780 (Figure 9.5).
To further compare the TGFβ-stimulated SMAD4-dependent gene regulatory information among these three cell types, the computational analytical approach was applied to build the SMAD4-dependent regulatory networks in HaCaT, IOSE, and A2780, respectively (Figure 9.6). Briefly, the approach starts with ChIP-based datasets and gene expression data. The identified SMAD4-binding loci are mapped to known genes, and genes with at least one SMAD4-binding locus are further correlated with gene expression data based on RefSeq Gene ID. The set of SMAD4 target genes that are differentially expressed after TGFβ stimulation is then used to find the most significant TF-binding partners with ChIPMotifs or ChIPModules; these partners are used as hub TFs. The hub TF–gene connection is determined by scanning the hub TFs' PWMs in all binding loci, and a permutation test is used to test the reliability of each connection of the network.
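The permutation (shuffling) test used for each hub TF–gene connection, as described in Section 9.4.1, can be sketched as follows. This is a simplified illustration, not the authors' code: it scans the forward strand only, applies only the normalized PWM score (0.95 cutoff) and not the core score, and assumes a 4 × n PWM with rows in the order A, C, G, T. The names `has_site` and `shuffle_ratio` are hypothetical.

```python
import random
import numpy as np

ROWS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed row order of the PWM

def has_site(pwm, seq, cutoff=0.95):
    """True if any window of seq reaches the normalized PWM score cutoff."""
    pwm = np.asarray(pwm, dtype=float)
    n = pwm.shape[1]
    s_min, s_max = pwm.min(axis=0).sum(), pwm.max(axis=0).sum()
    cols = np.arange(n)
    for d in range(len(seq) - n + 1):
        rows = [ROWS[base] for base in seq[d:d + n]]
        if (pwm[rows, cols].sum() - s_min) / (s_max - s_min) >= cutoff:
            return True
    return False

def shuffle_ratio(pwm, peak_seq, n_shuffles=1000, cutoff=0.95, seed=0):
    """Fraction of shuffled copies of peak_seq in which the motif is still hit."""
    rng = random.Random(seed)          # seeded only to make the sketch repeatable
    bases = list(peak_seq.upper())
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(bases)             # preserves base composition of the peak
        if has_site(pwm, "".join(bases), cutoff):
            hits += 1
    return hits / n_shuffles

# Toy motif (made up) with consensus "AAA" and a made-up peak sequence.
toy_pwm = np.full((4, 3), 0.01)
toy_pwm[0, :] = 0.97
print(shuffle_ratio(toy_pwm, "AAACGTCGTACG", n_shuffles=200))
```

Under the rule stated in the text, an edge would be kept when this ratio falls below 0.2, meaning the motif rarely appears by chance in sequences with the same base composition as the peak.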
Figure 9.6 Gene regulatory network of three cell lines.
Six hub TFs, GFI1, NR3C1, SOX17, STAT4, ZNF354C, and TCF8, were identified from the 318 SMAD4-dependent target genes in A2780 cells, while four hub TFs, LEF1 (TCF), ELK1, COUPTF (NR2F5), and E2F, were identified in IOSE cells in previous studies using a similar approach (CART model). Three hub TFs, E2F1, SP1, and USF, were also identified for the 92 SMAD4-dependent target genes in HaCaT cells; these are very similar to the TF motifs identified in the previous study. The top motif AP1 reported previously is missing from our results. This is because ChIPModules uses an advanced classification algorithm that can eliminate TF motifs that are also enriched in random sets. Interestingly, one hub TF, E2F (E2F1), is common to the two normal cell lines, but none is shared with A2780 cells. Together with the GO function analysis, the results indicate that E2F may act as a major SMAD4 cotranscriptional factor partner in mediating cell proliferation in normal cells but is lost in carcinoma cells. The resulting gene regulatory networks (GRNs) for the three cell lines are shown in Figure 9.6. Overall, the gene regulatory network analysis strongly suggests that TGFβ stimulates a different SMAD4-dependent regulatory mechanism in ovarian cancer cells than in normal cells; in other words, the SMAD4 regulatory network is rewired in ovarian cancer cells.
9.4.2 A Bayesian Inference Approach for Genetic Regulatory Analysis
We propose a Bayesian multivariate statistical approach for modeling the time-variant transcriptional regulatory network. The basic model framework is as follows:
$$\dot{y}_i(t) = \sum_{j=1}^{N} a_{ij}\, x_j(t) + \varepsilon, \quad i = 1, \ldots, M \tag{9.8}$$
where $\dot{y}_i(t)$ denotes the transcription rate of the ith gene, $x_j(t)$ the expression level of the jth gene at the investigated time, $a_{ij}$ the regulatory coefficient (strength) of the jth gene for any transcription regulatory activity it has on the ith gene, and $\varepsilon$ the potential stochastic effects during the transcription regulatory process, which are assumed to follow a normal distribution, $\varepsilon \sim N(\mu, \sigma^2)$. Thus, for a genetic regulatory network containing M transcription factors at T time points, the above equation can be organized as
$$\dot{Y}_{M \times T} = [AX]_{M \times T} + J_{M \times T} \tag{9.9}$$
where $\dot{Y} = (\dot{y}_1\ \dot{y}_2\ \cdots\ \dot{y}_m)'$ denotes the transcription rate matrix of the M transcription factors, $A = (a_1\ a_2\ \cdots\ a_m)'$ the regulatory coefficient matrix, $X = (x_1\ x_2\ \cdots\ x_m)'$ the gene matrix, and $J = (\varepsilon_1\ \varepsilon_2\ \cdots\ \varepsilon_m)'$ the error term. Inferring the coefficient matrix A in the above equation therefore yields concrete knowledge of the regulatory strength of the transcription factors over the target genes under investigation. Based on Bayes' theorem, the above equation can be formulated as
$$(\dot{Y} \mid A, X) = AX + J \tag{9.10}$$
and the error term J is specified as independent, normally distributed random vectors with l-dimensional zero mean and an l × l covariance matrix Σ. The multivariate l-dimensional normal distribution of the errors is
$$p(\varepsilon_i \mid \Sigma) \propto |\Sigma|^{-1/2}\, e^{-\frac{1}{2}\varepsilon_i' \Sigma^{-1} \varepsilon_i} \tag{9.11}$$
where $\varepsilon_i$ is the l-dimensional error vector. From the multivariate normal error specification, the observation vector also follows a multivariate normal distribution,
$$p(\dot{y}_i \mid A, x_i, \Sigma) \propto |\Sigma|^{-1/2}\, e^{-\frac{1}{2}(\dot{y}_i - A x_i)' \Sigma^{-1} (\dot{y}_i - A x_i)} \tag{9.12}$$
Extended to the matrix representation of the model, this becomes
$$p(\dot{Y} \mid A, X, \Sigma) \propto |\Sigma|^{-1/2}\, e^{-\frac{1}{2}\operatorname{tr}\, \Sigma^{-1} (\dot{Y} - AX)' (\dot{Y} - AX)} \tag{9.13}$$
where the trace operator "tr" denotes the sum of the diagonal entries of its matrix argument. Parameter inference comprises two subproblems: determining the marginal posterior distributions of the parameters and computing the marginal posterior mean estimates. To determine the marginal posterior distribution of the coefficient matrix A, the joint posterior distribution is marginalized by integrating over Σ, that is,
$$\int_{\Sigma} p(A, \Sigma \mid \dot{Y}, X)\, d\Sigma = p(A \mid \dot{Y}, X) \propto f(D, A, X) \tag{9.14}$$
where $f(\cdot)$ denotes a functional operator, $D = (X_0 X_0')^{-1}$, $X_0$ is the prior input matrix, and the posterior coefficient mean matrix is given by
$$\bar{A} = (X \dot{Y}' + A_0 D)(D^{-1} + X X')^{-1} \tag{9.15}$$
where $A_0$ is the prior mean matrix. Owing to the proportionality characteristics of the coefficient matrix inferred by the Bayesian statistical analysis, we normalize the coefficients by scaling them to the range −1 to 1.
9.4.2.1 Study Case: ERα Transcriptional Regulatory Dynamics in Breast Cancer Cell
ERα often responds to estrogen stimulus in a time-dependent manner to regulate downstream target genes. In this study case, we apply the Bayesian multivariate statistical approach to integrate estrogen (E2)-stimulated time-series ChIP-seq and gene expression data in a breast cancer cell line, MCF7, and infer the hierarchical structure of dynamic ERα-centered regulatory networks. In addition, the statistical network properties, the network modularity and its underlying functions, and the correlation between motif patterns and the corresponding gene expression patterns are analyzed. ER-binding sites in each ChIP-seq dataset were identified with the wBELT peak calling program developed in our laboratory [37]. Since several statistical parameters are associated with each output, such as FDR, bin-size, and others, we need to determine an optimal set of binding sites. We therefore propose a data feature detection algorithm that can be formalized as a class of optimal track analysis:
$$\arg\max_{i} P_i, \quad i \in N, \qquad \text{s.t.}\ f_i \le \xi,\ b_i = \beta,\ p_i \le \delta \tag{9.16}$$
where $P_i$ denotes the set of optimal peak numbers under the corresponding argument constraints, $f_i$ the FDR, $b_i$ the bin-size, $p_i$ the p-threshold, and ξ, β, and δ the presupposed argument values. We define a track rate function (TR) to quantitatively characterize the underlying data features from the argument pair sets (peak number and FDR):
$$\mathrm{TR}_i = \frac{\mathrm{SAT}_i}{\mathrm{SST}_i} = \frac{\sum_{j=1}^{M} S(j)}{\sum_{k=1}^{N} S(k)}, \quad i \in N \tag{9.17}$$
where SAT denotes the actual track scores, SST the shortest track scores, and S(·) the score operator for each track step.
Figure 9.7 The selection of optimal parameters for the ERα ChIP-seq data at time point 0. (a) A distribution of peak numbers (upper panel) and FDR (lower panel) versus the p-threshold. (b) A track rate distribution for peak number and FDR with respect to the iteration number N. (c) The identification of hub TFs according to the occurrence frequency of the candidates. The percentages denote the corresponding occurrence distribution among all candidates identified from the time-series ChIP-seq data. (d) The pairwise intersection matrix of TF candidates across all four time points. The diagonal entries denote the number of candidates at each time point, and the off-diagonal entries denote the number of candidates shared between any two different time points.
Estrogen-stimulated ERα ChIP-seq was conducted at 0, 1, 4, and 24 h. Figure 9.7a and b illustrates the optimal parameter selection process for the ERα ChIP-seq data at time point 0, where the selection normally seeks the highest peak number with a suitable bin-size while the corresponding FDR remains statistically acceptable. Once the optimal parameters for each time point were determined, ERα target genes were identified at each time point.
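The constrained selection in Equation 9.16 amounts to choosing, among peak-calling runs, the one with the most peaks whose FDR and p-threshold stay within preset bounds at a fixed bin-size. A minimal Python sketch follows; the `Run` record, the function `select_run`, and all numerical values are hypothetical and do not reflect the actual wBELT output format. The track rate of Equation 9.17 is not modeled here.

```python
from typing import NamedTuple, Optional, Sequence

class Run(NamedTuple):
    """One peak-calling run (hypothetical record layout)."""
    peaks: int          # P_i: number of peaks called
    fdr: float          # f_i: estimated false discovery rate
    bin_size: int       # b_i: bin-size used
    p_threshold: float  # p_i: p-value threshold used

def select_run(runs: Sequence[Run], max_fdr: float, bin_size: int,
               max_p: float) -> Optional[Run]:
    """Eq. 9.16: maximize the peak number subject to the three constraints."""
    feasible = [r for r in runs
                if r.fdr <= max_fdr              # f_i <= xi
                and r.bin_size == bin_size       # b_i == beta
                and r.p_threshold <= max_p]      # p_i <= delta
    return max(feasible, key=lambda r: r.peaks) if feasible else None

# Made-up runs: the third has the most peaks but violates the FDR bound.
runs = [Run(9000, 0.01, 100, 1e-4),
        Run(12000, 0.04, 100, 1e-3),
        Run(15000, 0.12, 100, 1e-2),
        Run(13000, 0.03, 200, 1e-3)]
print(select_run(runs, max_fdr=0.05, bin_size=100, max_p=1e-3))
```

Returning `None` when no run is feasible makes an over-strict combination of bounds visible to the caller, rather than silently falling back to an unacceptable run.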
Next, we identified ERα-regulated hub TFs by scanning all possible TF candidates in the TRANSFAC database, which collects more than 1000 PWMs [27, 38]. We determined the final hub TFs using two methods: (1) the frequency of occurrence of each TF in all binding sites (Figure 9.7c), where the frequencies range from 1% to 4% and the total occurrence counts for the identified TF candidates are 3,997,164 (0 h), 4,184,256 (1 h), 2,174,429 (4 h), and 3,912,712 (24 h); and (2) manual selection of other major hub TFs reported in the literature to be functionally associated with ERα [39–47]. As a result, a total of 11 TFs, namely ERα, AP1, CEBP, GATA2, HNF3α, MYC, NFκB, OCT1, PAX2, SP1, and XBP1, were selected as the hub TFs and used for inferring a Bayesian statistical model.
We further performed correlation analysis for the TF candidates by calculating a pairwise TF intersection matrix across the four time points (Figure 9.7d). The analysis revealed that both the number of TF candidates identified at each time point and the number of candidates shared between any two time points vary. These observations indicate the time-variant nature of the ERα regulatory process. Because the E2-stimulated time-series gene expression data have a relatively small sample size and contain no knowledge of transcription factors, binding information, or direct target genes, conventional Bayesian modeling has its limitations. To infer the ERα-centered regulatory network, we therefore integrate the time-series E2-stimulated ERα ChIP-seq data, which allow transcription factors and hubs to be detected and facilitate the reverse engineering of the regulatory network by inferring the parameters of the Bayesian statistical framework (Figure 9.8).
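The parameter inference step of Section 9.4.2 can be illustrated with a minimal NumPy sketch of a posterior-mean estimate for the coefficient matrix, followed by scaling to the range −1 to 1. The sketch assumes the standard conjugate (matrix-normal prior) result for the model Ẏ = AX + J, written as Ā = (ẎX′ + A0 D⁻¹)(D⁻¹ + XX′)⁻¹ with D⁻¹ = X0 X0′. The ordering of the products is chosen here so that the dimensions conform (Ẏ is M × T, X is N × T, A is M × N); this is an assumption of the sketch and should be checked against Equation 9.15 and its original derivation. The function names and the simulated data are hypothetical.

```python
import numpy as np

def posterior_mean(Y_dot, X, A0, X0):
    """Posterior mean of A for Y_dot = A X + noise (conjugate-prior sketch).

    Y_dot : (M, T) transcription rates     X  : (N, T) expression levels
    A0    : (M, N) prior mean of A         X0 : (N, K) prior input matrix
    """
    D_inv = X0 @ X0.T                 # D = (X0 X0')^{-1}, so D^{-1} = X0 X0'
    rhs = Y_dot @ X.T + A0 @ D_inv    # data term plus prior term, shape (M, N)
    S = X @ X.T + D_inv               # symmetric (N, N) precision-like matrix
    return np.linalg.solve(S, rhs.T).T  # equals rhs @ inv(S), without forming inv

def scale_to_unit(A):
    """Scale coefficients into [-1, 1] by the largest absolute entry
    (one simple normalization; an assumption of this sketch)."""
    m = np.abs(A).max()
    return A / m if m > 0 else A

# Simulated, noise-free example with a nearly uninformative prior.
rng = np.random.default_rng(0)
A_true = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 50))
A_hat = posterior_mean(A_true @ X, X, np.zeros((3, 4)), 1e-4 * np.eye(4))
print(np.round(scale_to_unit(A_hat), 2))
```

A weak prior (small X0) lets the data dominate, so the estimate approaches the least-squares solution, while a strong prior (large X0) pulls the estimate toward A0. This is useful when, as noted above, the time series contains few points.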
9.5 Conclusions
High-throughput techniques such as ChIP-seq, ChIP-chip, ChIP-PET, RNA-seq, and expression microarrays can provide detailed information on the genome-wide binding of TFs, histone modifications, and the expression level of each gene in normal and cancerous cells. These high-throughput omics data demand new computational methods and analysis perspectives to satisfy diverse research purposes. In this chapter, we reviewed some recent work in this field and discussed our recently proposed computational methods for the inference and analysis of gene regulatory networks. In particular, we presented two case studies, that is, the TGFβ-stimulated SMAD4-dependent gene regulatory network in ovarian cancer and the ERα regulatory network in breast cancer cell lines. Both cases integrate ChIP-seq and microarray gene expression data. For the ERα case, based on the time-series ChIP-seq data sets, we investigated the statistical properties and modularity characteristics of the inferred dynamic genetic network. Quantitative modeling provides a feasible and effective approach to deepening our understanding of the underlying biological mechanisms of transcriptional regulation.
Figure 9.8 The ERα transcription regulatory network structure and related analysis at 1 h. (a) The inferred ERα transcription regulatory network structure at 1 h. Red edges denote positive activation, and dashed blue edges denote negative inhibition. Arrowheads denote activation and T-shaped ends denote inhibition; the plots for other time points use the same representation. (b) The hierarchical topology of the inferred ERα transcription regulatory network at 1 h. Panels (c) and (d) illustrate the connectivity distribution, Pearson correlation, and p-value distributions (between the regulatory coefficients and SNRs) as functions of uniform regulatory strength for the network structure at 1 h.
9.6 Perspective
Integration of diverse sources of omics data has greatly facilitated the analysis of transcription factor associations and has moved our view of transcriptional regulation toward a systems biology perspective. New and advanced experimental techniques such as Hi-C [48, 49] now allow investigators to dissect these networks more globally, moving from a linear paradigm to a three-dimensional model. Meanwhile, feasible and problem-oriented methods and algorithms for gene network analysis promote diverse computational topics and yield novel insights and results in the cancer research field. The ultimate goal of modeling gene regulatory networks in human cancer is to find and design effective intervention strategies that alter the network so as to avoid undesirable cancerous cellular states. As with gene therapeutic interventions, such alterations may be possible through the introduction of a factor or drug that converts a cancer-specific gene regulatory network into a normal network. Undoubtedly, efficient collaboration and synergy between wet-lab experiments and computational power, together with the development of modeling and analysis, will eventually promote such genomic network research to a new level.
References
1 de Jong, H. (2004) Modeling and simulation of genetic regulatory networks. Positive Systems, pp. 111–118.
2 El Samad, H., Khammash, M., Petzold, L., and Gillespie, D. (2005) Stochastic modelling of gene regulatory networks. Int. J. Robust Nonlinear Control, 15 (15), 691–711.
3 Cai, X. and Wang, X. (2007) Stochastic modeling and simulation of gene networks – A review of the state-of-the-art research on stochastic simulations. IEEE Signal Proc. Mag., 24 (1), 27–36.
4 Thomas, R. (1973) Boolean formalization of genetic control circuits. J. Theor. Biol., 42 (3), 563–585.
5 Akutsu, T., Miyano, S., and Kuhara, S. (1999) Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pac. Symp. Biocomput., 4, 17–28.
6 Shmulevich, I., Dougherty, E., Kim, S., and Zhang, W. (2002) Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics, 18 (2), 261–274.
7 Butte, A. and Kohane, I. (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac. Symp. Biocomput., 5, 418–429.
8 Zhao, W., Serpedin, E., and Dougherty, E.R. (2006) Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics, 22 (17), 2129–2135.
9 Ribeiro, A.S., Kauffman, S.A., Lloyd-Price, J., Samuelsson, B., and Socolar, J.E.S. (2008) Mutual information in random Boolean models of regulatory networks. Phys. Rev. E, 77 (1), 011901.
10 Margolin, A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R., and Califano, A. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinf., 7 (S1), S7.
11 Faith, J.J., Hayete, B., Thaden, J.T., Mogno, I., Wierzbowski, J., Cottarel, G., Kasif, S., Collins, J.J., and Gardner, T.S. (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol., 5 (1), e8.
12 Meyer, P., Lafitte, F., and Bontempi, G. (2008) minet: A R/bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinf., 9 (1), 461.
13 Tang, B., Wu, X., Tan, G., Chen, S.-S., Jing, Q., and Shen, B. (2010) Computational inference and analysis of genetic regulatory networks via a supervised combinatorial-optimization pattern. BMC Syst. Biol., 4 (S2), S3.
14 Altay, G. and Emmert-Streib, F. (2010) Inferring the conservative causal core of gene regulatory networks. BMC Syst. Biol., 4 (1), 132.
15 Tegnér, J., Yeung, M.K.S., Hasty, J., and Collins, J.J. (2003) Reverse engineering gene networks: integrating genetic perturbations with dynamical modeling. PNAS, 100 (10), 5944–5949.
16 Perkins, T.J., Hallett, M., and Glass, L. (2004) Inferring models of gene expression dynamics. J. Theor. Biol., 230 (3), 289–299.
17 Tang, B., He, L., Jing, Q., and Shen, B. (2009) Model-based identification and adaptive control of the core module in a typical cell cycle pathway via network and system control theories. Adv. Complex Syst., 12 (1), 21–43.
18 Friedman, N., Linial, M., Nachman, I., and Pe’er, D. (2000) Using Bayesian networks to analyze expression data. J. Comput. Biol., 7 (3–4), 601–620.
19 Werhli, A., Grzegorczyk, M., and Husmeier, D. (2006) Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics, 22, 2523–2531.
20 Needham, C., Bradford, J., Bulpitt, A., and Westhead, D. (2006) Inference in Bayesian networks. Nat. Biotechnol., 24, 51–53.
21 Torkamani, A. and Schork, N.J. (2009) Identification of rare cancer driver mutations by network reconstruction. Genome Res., 19 (9), 1570–1578.
22 Ciriello, G., Cerami, E., Sander, C., and Schultz, N. (2012) Mutual exclusivity analysis identifies oncogenic network modules. Genome Res., 22, 398–406.
23 Fujita, A., Patriota, A., Sato, J., and Miyano, S. (2009) The impact of measurement errors in the identification of regulatory networks. BMC Bioinf., 10 (1), 412.
24 Tomaru, Y., Nakanishi, M., Miura, H., Kimura, Y., Ohkawa, H., Ohta, Y., Hayashizaki, Y., and Suzuki, M. (2009) Identification of an inter-transcription factor regulatory network in human hepatoma cells by Matrix RNAi. Nucleic Acids Res., 37, 1049–1060.
25 Shimamura, T., Imoto, S., Yamaguchi, R., Fujita, A., Nagasaki, M., and Miyano, S. (2009) Recursive regularization for inferring gene networks from time-course gene expression profiles. BMC Syst. Biol., 3 (1), 41.
26 Jin, V.X., Apostolos, J., Nagisetty, N.S.V.R., and Farnham, P.J. (2009) W-ChIPMotifs: a web application tool for de novo motif discovery from ChIP-based high-throughput data. Bioinformatics, 25 (23), 3191–3193.
27 Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Prüß, M., Reuter, I., and Schacherer, F. (2000) TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res., 28 (1), 316–319.
28 Wasserman, W. and Sandelin, A. (2004) Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet., 5, 276–287.
29 Jakacka, M., Ito, M., Weiss, J., Chien, P., Gehm, B., and Jameson, J. (2001) Estrogen receptor binding to DNA is not required for its activity through the nonclassical AP1 pathway. J. Biol. Chem., 276, 13615–13621.
30 Speir, E., Yu, Z., Takeda, K., Ferrans, V., and Cannon, R. (2000) Competition for p300 regulates transcription by estrogen receptors and nuclear factor-kappaB in human coronary smooth muscle cells. Circ. Res., 87, 1006–1011.
31 Gaub, M., Bellard, M., Scheuer, I., Chambon, P., and Sassone-Corsi, P. (1990) Activation of the ovalbumin gene by the estrogen receptor involves the fos-jun complex. Cell, 63, 1267–1276.
32 Abdelrahim, A., Samudio, I., Smith, R., Burghardt, R., and Safe, S. (2002) Small inhibitory RNA duplexes for Sp1 mRNA block basal and estrogen-induced gene expression and cell cycle progression in MCF-7 breast cancer cells. J. Biol. Chem., 277, 28815–28822.
33 Gu, F., Hsu, H.-K., Hsu, P.-Y., Wu, J., Ma, Y., Parvin, J., Huang, T., and Jin, V. (2010) Inference of hierarchical regulatory network of estrogen-dependent breast cancer through ChIP-based data. BMC Syst. Biol., 4 (1), 170.
34 Koinuma, D., Tsutsumi, S., Kamimura, N., Taniguchi, H., Miyazawa, K., Sunamura, M., Imamura, T., Miyazono, K., and Aburatani, H. (2009) Chromatin immunoprecipitation on microarray analysis of Smad2/3 binding sites reveals roles of ETS1 and TFAP2A in transforming growth factor β signaling. Mol. Cell. Biol., 29, 172–186.
35 Kennedy, B.A., Deatherage, D.E., Gu, F., Tang, B., Chan, M.W.Y., Nephew, K.P., Huang, T.H.M., and Jin, V.X. (2011) ChIP-seq defined genome-wide map of TGFβ/SMAD4 targets: implications with clinical outcome of ovarian cancer. PLoS ONE, 6 (7), e22606.
36 Derynck, R., Akhurst, R.J., and Balmain, A. (2001) TGF-β signaling in tumor suppression and cancer progression. Nat. Genet., 29 (2), 117–129.
37 Lan, X., Bonneville, R., Apostolos, J., Wu, W., and Jin, V.X. (2011) W-ChIPeaks: a comprehensive web application tool for processing ChIP-chip and ChIP-seq data. Bioinformatics, 27 (3), 428–430.
38 Jin, V.X., Rabinovich, A., Squazzo, S.L., Green, R., and Farnham, P.J. (2006) A computational genomics approach to identify cis-regulatory modules from chromatin immunoprecipitation microarray data – A case study using E2F1. Genome Res., 16 (12), 1–11.
39 Cicatiello, L., Mutarelli, M., Grober, O.M.V., Paris, O., Ferraro, L., Ravo, M., Tarallo, R., Luo, S., Schroth, G.P., Seifert, M. et al. (2010) Estrogen receptor α controls a gene network in luminal-like breast cancer cells comprising multiple transcription factors and microRNAs. Am. J. Pathol., 176 (5), 2113–2130.
40 Jin, V.X., Leu, Y.-W., Liyanarachchi, S., Sun, H., Fan, M., Nephew, K.P., Huang, T.H.-M., and Davuluri, R.V. (2004) Identifying estrogen receptor α target genes using integrated computational genomics and chromatin immunoprecipitation microarray. Nucleic Acids Res., 32 (22), 6627–6635.
41 Hurtado, A., Holmes, K.A., Ross-Innes, C.S., Schmidt, D., and Carroll, J.S. (2011) FOXA1 is a key determinant of estrogen receptor function and endocrine response. Nat. Genet., 43 (1), 27–33.
42 Carroll, J.S., Liu, X.S., Brodsky, A.S., Li, W., Meyer, C.A., Szary, A.J., Eeckhoute, J., Shao, W., Hestermann, E.V., Geistlinger, T.R. et al. (2005) Chromosome-wide mapping of estrogen receptor binding reveals long-range regulation requiring the forkhead protein FoxA1. Cell, 122 (1), 33–43.
43 DeNardo, D.G., Kim, H.-T., Hilsenbeck, S., Cuba, V., Tsimelzon, A., and Brown, P.H. (2005) Global gene expression analysis of estrogen receptor transcription factor cross talk in breast cancer: identification of estrogen-induced/activator protein-1-dependent genes. Mol. Endocrinol., 19 (2), 362–378.
44 Carroll, J.S., Meyer, C.A., Song, J., Li, W., Geistlinger, T.R., Eeckhoute, J., Brodsky, A.S., Keeton, E.K., Fertuck, K.C., Hall, G.F. et al. (2006) Genome-wide analysis of estrogen receptor binding sites. Nat. Genet., 38 (11), 1289–1297.
45 Cheng, A.S.L., Jin, V.X., Fan, M., Smith, L.T., Liyanarachchi, S., Yan, P.S., Leu, Y.-W., Chan, M.W.Y., Plass, C., Nephew, K.P. et al. (2006) Combinatorial analysis of transcription factor partners reveals recruitment of c-MYC to estrogen receptor-α responsive promoters. Mol. Cell, 21 (3), 393–404.
46 Hurtado, A., Holmes, K.A., Geistlinger, T.R., Hutcheson, I.R., Nicholson, R.I., Brown, M., Jiang, J., Howat, W.J., Ali, S., and Carroll, J.S. (2008) Regulation of ERBB2 by oestrogen receptor-PAX2 determines response to tamoxifen. Nature, 456 (7222), 663–666.
47 Sengupta, S., Sharma, C.G.N., and Jordan, V.C. (2011) Estrogen regulation of X-box binding protein-1 and its role in estrogen induced growth of breast and endometrial cancer cells. Horm. Mol. Biol. Clin. Investig., 2 (2), 235–243.
48 Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J., Dorschner, M.O. et al. (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326 (5950), 289–293.
49 van Berkum, N.L., Lieberman-Aiden, E., Williams, L., Imakaev, M., Gnirke, A., Mirny, L.A., Dekker, J., and Lander, E.S. (2010) Hi-C: a method to study the three-dimensional architecture of genomes. J. Vis. Exp., (39), e1869.
10 Network-Module-Based Approaches in Cancer Data Analysis
Guanming Wu and Lincoln Stein
10.1 Brief Summary
Researchers have recognized that biological phenotypes are underlain by the network of interactions among proteins, RNAs, DNAs, and other entities in the cell. Network-based data analysis approaches have been widely used in cancer data analysis. In this chapter, we review network-module-based approaches and their applications to cancer data analysis. We also introduce a software tool for doing network-module-based data analysis for cancer data.
10.2 Introduction
It has been widely recognized that it is the network of interactions among DNAs, proteins, regulatory RNAs, and other macromolecules in the cell, rather than single genes, that underlies the phenotypes displayed by cells [1]. It is crucial to consider this cellular network in order to understand the behavior of the cell in both normal and disease states. Recently, the biomedical research community has begun to routinely generate massive data sets at the genomic, proteomic, and tissue levels. Systems biology approaches can combine heterogeneous data sets into integrative views, thereby generating new insights. Network-based approaches are popular and powerful methods for integrating diverse data sets. During the past decade, researchers have recognized that protein–protein interaction networks and other kinds of biological networks contain modular structures [2]. These modules are organized hierarchically: from small network motifs, to intermediate network modules, to larger subnetworks. A network motif usually contains several proteins or genes forming a feedback loop or another kind of structure [3]. A network module contains a group of proteins or genes that are topologically close (topological module), functionally similar (functional module), or related by a common disease phenotype (disease module) [2]. A subnetwork is a large ensemble of genes, usually constructed from a gene list derived from differential gene expression analysis or other high-throughput data analysis methods. A subnetwork typically contains multiple network modules. This chapter focuses on network modules and discusses how they can be applied to the high-throughput analysis of cancer data sets. We first review some methods and progress in network-module-based approaches, and then introduce a software tool, the Reactome FI Cytoscape plug-in, which was developed for network-module-based cancer data analysis. We use the public TCGA ovarian cancer (OV) [4] mutation data set as an example.
Statistical Diagnostics for Cancer: Analyzing High-Dimensional Data, First Edition. Edited by Frank Emmert-Streib and Matthias Dehmer. © 2013 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2013 by Wiley-VCH Verlag GmbH & Co. KGaA.
10.3 Notation and Terminology
A biological network can be described as a graph $G(V, E)$, with a set of vertices $V$ and a set of edges $E$. Vertices in such a graph can be proteins, genes, regulatory RNAs, or other biological entities, while edges indicate interactions between vertices. These interactions may be physical interactions, functional interactions, or other types of relationships (e.g., coexpression), depending on the type of biological network under study. A subnetwork is defined as a subgraph $S(V', E')$, where $V' \subseteq V$ and $E' \subseteq E$. Vertices in a subgraph may or may not be linked together. A network module is a subnetwork. Usually, vertices in a network module are linked together, forming a connected subgraph or graph component. Other terms for network module in common use in the literature are “network cluster” and “network component.” For cancer data analysis, researchers can use pregenerated normal cellular networks, such as protein–protein interaction networks, project cancer data onto these networks to construct cancer-related subnetworks, or build cancer networks directly from cancer candidate genes. Several types of data can be used to construct cancer-related networks: differentially expressed genes, somatic mutations, copy number variations, and others such as structural variations and fusion proteins.
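The notation above can be made concrete with a short sketch. It assumes an undirected network stored as a vertex set and a set of vertex pairs. The function names and the toy vertices `A` to `E` are hypothetical and do not correspond to any real interaction data. The subgraph built here is the induced one (all edges of $E$ whose endpoints both lie in $V'$), which is one common way of satisfying $V' \subseteq V$ and $E' \subseteq E$.

```python
from collections import deque


def induced_subgraph(vertices, edges, keep):
    """Return (V', E'): the kept vertices and every edge joining two of them."""
    v_sub = set(keep) & set(vertices)
    e_sub = {(u, v) for (u, v) in edges if u in v_sub and v in v_sub}
    return v_sub, e_sub


def connected_components(vertices, edges):
    """Split an undirected graph into connected components (breadth-first search)."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in sorted(vertices):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Toy network G(V, E); vertex names are placeholders.
V = {"A", "B", "C", "D", "E"}
E = {("A", "B"), ("B", "C"), ("D", "E")}

# Project a hypothetical candidate gene list onto the network.
V_sub, E_sub = induced_subgraph(V, E, {"A", "B", "D"})
modules = connected_components(V_sub, E_sub)
print(V_sub, E_sub, modules)
```

In this toy case the subnetwork has three vertices but only one edge, so it falls into two components: a connected module {A, B} and an isolated vertex D. This mirrors the statement that vertices in a subgraph may or may not be linked together.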
10.4 Network Modules Containing Functionally Similar Genes or Proteins
Network modules can be divided into three types: topological modules, functional modules, and disease modules [2]. Topological modules are usually found using network clustering algorithms, which rely only on the topological properties of the interaction network, without considering the function of the genes or proteins it contains. By definition, nodes within topological modules tend to link to each other more frequently than to nodes in different modules. In contrast, functional modules are significantly enriched with pathway or gene ontology (GO) [5] annotations. By definition, nodes in functional modules share similar biological functions, and should be linked together in the underlying network. Genes in disease modules are related by sharing underlying disease phenotypes. These three types of modules overlap each other, but may not be identical because different algorithms are used to define them. Regardless of the algorithms used to identify them, genes in the same network module tend to have similar biological functions [6]. One reason for this is trivial: many biological networks are constructed from annotated pathways or complexes, and therefore the interactions among network nodes are implicitly related to their biological functions. More interestingly, some biological networks are constructed from physical protein–protein interactions generated by high-throughput experiments such as yeast two-hybrid (Y2H) and coimmunoprecipitation (Co-IP)/mass spectrometry. Though Y2H has a high false positive rate [7], proteins involved in Y2H interactions usually have similar biological functions based on GO annotations [7]. Proteins involved in interactions generated from high-throughput Co-IP also show similar biological functions [8].
10.5 Network Module Searching Methods
Many methods have been developed to search for network modules or components. The basic problem can be stated thus: “given a network of biological entities and their interactions, partition the network into two or more modules such that the number of interactions within modules is maximized relative to interactions between modules.” In the following sections, we introduce some popular and widely used algorithms. Our categorization of these algorithms is not strict, and serves only for convenience of presentation.
10.5.1 Greedy Network Module Search Algorithms
Greedy search algorithms are the simplest and earliest network module algorithms employed. For example, Ideker et al. [9] used a greedy network module search to identify modules in protein–protein interaction networks using gene expression data. This method works by projecting gene expression values generated from high-throughput microarray data sets onto protein–protein interaction networks. The algorithm converts p-values from t-test or other tests generated from differential gene expression analysis into z-scores, and searches for gene modules that have enriched z-scores. Based on Ideker et al.’s original algorithm, Chuang et al. [10] from the same group developed a new algorithm by using mutual information (MI), instead of z-scores, to search for MI-enriched network components that can be used to classify metastatic versus nonmetastatic breast tumors.
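The scoring and search idea described above can be sketched as follows. This is a simplified illustration, not the published implementation: it assumes each gene's p-value is mapped to a z-score through the inverse standard normal CDF, that a module is scored by the sum of its members' z-scores divided by the square root of the module size, and that the module grows one neighbor at a time while the score improves. The function names, the toy network, and the p-values are hypothetical, and refinements such as background calibration of scores are omitted.

```python
import math
from statistics import NormalDist


def p_to_z(p):
    """Map a p-value (0 < p < 1) to a z-score; small p gives large positive z."""
    return NormalDist().inv_cdf(1.0 - p)


def aggregate_z(genes, z):
    """Score a gene set by its summed z-scores, scaled by sqrt(set size)."""
    return sum(z[g] for g in genes) / math.sqrt(len(genes))


def greedy_module(seed, neighbors, z):
    """Grow a module from `seed`, adding the neighbor that most improves the
    aggregate score, and stop when no neighbor improves it."""
    module = {seed}
    score = aggregate_z(module, z)
    while True:
        frontier = {n for g in module for n in neighbors.get(g, ())} - module
        best, best_score = None, score
        for candidate in sorted(frontier):
            candidate_score = aggregate_z(module | {candidate}, z)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
        if best is None:
            return module, score
        module.add(best)
        score = best_score


# Hypothetical p-values and interactions (illustration only).
p_values = {"A": 0.001, "B": 0.01, "C": 0.5, "D": 0.9, "E": 0.02}
neighbors = {
    "A": {"B"},
    "B": {"A", "C", "E"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": {"B"},
}
z_scores = {gene: p_to_z(p) for gene, p in p_values.items()}
module, module_score = greedy_module("A", neighbors, z_scores)
print(sorted(module), round(module_score, 3))
```

Starting from seed A, the sketch adds B and then E, and rejects C because including a gene with p = 0.5 (z = 0) lowers the size-normalized score. Because the search only ever accepts local improvements, a different seed can lead to a different module.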
Based on Ideker et al.’s original algorithm, the Ideker group has developed an open-source Cytoscape plug-in called jActiveModules (http://chianti.ucsd.edu/cyto_web/plugins/displayplugininfo.php?name=jActiveModules) to search for network modules based on pregenerated p-values for genes or proteins in protein–protein interaction networks. jActiveModules has been used in many disease-related studies. Liu et al. [11] used this plug-in to identify significant pathways for type 2 diabetes. Liu et al. [12] used the same algorithm to search for p-value-enriched network modules in genome-wide association studies (GWAS). The algorithm behind greedy search is straightforward. It starts by selecting a protein node within a protein interaction network as the seed, checks the seed’s neighbor nodes one by one, out to a preassigned neighborhood depth, and calculates a score for the resulting group of proteins. If the score with the neighbor node included is better than the score without it, the newly checked node is kept in the growing module; otherwise, another node in the neighborhood is checked. Modules found by this algorithm are allowed to overlap, so one protein can be assigned to more than one network module.
10.5.2 Objective Function Guided Search
Greedy search algorithms are not guaranteed to find a globally maximum-scored network component or module. Very recently, Dao et al. [13] reported an algorithm that uses the color-coding paradigm to identify optimally discriminative network-module-based markers from protein–protein interaction networks, based on gene expression data sets, in order to classify drug-responsive breast cancer patients. The main idea behind Dao’s approach is to project samples into a k-dimensional space (gene space) such that samples from the same class are as close together as possible, while samples from different classes are as far apart as possible. The result of running the algorithm is a subnetwork containing k genes. The authors applied this algorithm to a breast cancer drug-response study and showed that it outperforms several others.
10.5.3 Network Clustering Algorithms
The above two types of algorithms are designed to find functional or disease modules and build modules based on the phenotype under study. In contrast, network clustering algorithms identify network modules based on their topological properties alone and produce topological modules from pregenerated networks, or cancer-related subnetworks. Many network clustering algorithms fall into this category.
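The purely topological criterion can be illustrated with a small sketch that counts, for a given assignment of nodes to modules, how many edges fall within modules and how many run between them. The function name `edge_counts`, the node labels, and the toy network (two triangles joined by a single edge) are hypothetical. The sketch only evaluates a partition supplied by hand and does not implement any particular clustering algorithm.

```python
def edge_counts(edges, module_of):
    """Return (within, between): the number of edges whose endpoints share a
    module and the number whose endpoints lie in different modules."""
    within = sum(1 for u, v in edges if module_of[u] == module_of[v])
    return within, len(edges) - within


# Toy network: two triangles joined by one bridging edge.
edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("x", "y"), ("x", "z"), ("y", "z"),
    ("c", "x"),
]

good_partition = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2, "z": 2}
poor_partition = {"a": 1, "b": 2, "c": 1, "x": 2, "y": 1, "z": 2}

print(edge_counts(edges, good_partition))
print(edge_counts(edges, poor_partition))
```

A partition that follows the two triangles keeps six of the seven edges inside modules, whereas an arbitrary relabeling of the same nodes cuts most of them, which is the contrast a topological clustering algorithm tries to exploit.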
MCODE [14] is a popular network clustering algorithm developed to search for protein complexes in a protein interaction network. CFinder [15] is an algorithm for locating network cliques and overlapping modules in biological networks. SPICi [16] searches for network modules in a weighted graph using network seeds and expansion methods similar to a greedy search algorithm. MCL [17] is based on Markov random walks in a weighted network. Recently, Rhrissorrakrai and Gunsalus [18] published a new module identification algorithm called MINE, based on the MCODE algorithm, which outperforms other popular algorithms when judged by functional annotations using GO [5] and by modularity as defined by Newman and Girvan [19].
10.5.4 Community Search Algorithms
Community search algorithms were originally developed for studying communities in social networks. Like network clustering algorithms, these algorithms are based on network topological properties, and can be applied to pregenerated networks or cancer-related subnetworks. Newman and Girvan [19] proposed an edge-betweenness algorithm to search for network communities. Later, Newman [20] developed another algorithm inspired by the classic spectral partitioning algorithm. Several studies [21–23] have used the edge-betweenness algorithm to search for network modules in biological networks, treating functional network modules as communities embedded in biological networks. Functional enrichment analysis [21] shows that network modules generated by these community search algorithms show functional enrichment similar to that of modules generated by other types of network module search algorithms. Newman defined a metric called modularity [20], which can be written as

\[ Q = \frac{1}{4m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) (s_i s_j + 1) = \frac{1}{4m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) s_i s_j \tag{10.1} \]

where $A_{ij}$ is the number of edges between vertices $i$ and $j$ (usually 0 for no edge and 1 for one edge); $k_i$ and $k_j$ are the degrees of vertices $i$ and $j$, respectively; and $m$ is the total number of edges in the network. The second equality holds because $\sum_{ij} \left( A_{ij} - k_i k_j / 2m \right) = 0$. The above equation can be written in matrix form as follows:

\[ Q = \frac{1}{4m} s^{T} B s \tag{10.2} \]

where $s$ is the column vector containing elements $s_i$, which describes each node's cluster assignment (for a division into two groups, $s_i = +1$ or $-1$), and $B$ is a real symmetric matrix with elements

\[ B_{ij} = A_{ij} - \frac{k_i k_j}{2m} \tag{10.3} \]

Newman used a spectral partition method [20] to optimize the above expression. For more detailed discussion, refer to [20]. In our Reactome FI Cytoscape plug-in, we implemented this spectral partition-based algorithm for network module detection in subnetworks (see below). The algorithm runs much faster than the original edge-betweenness algorithm.
10.5.5 Mutual Exclusivity-Based Search Algorithms
It has been recognized that in cancer genomes, if several genes belong to the same biological pathway, then usually only one of them is mutated in any given cancer patient. Particularly striking examples have been found for genes whose protein products are involved in the p53 and RB pathways. This phenomenon has been called mutual exclusivity [24]. Recently, several high-throughput cancer resequencing projects have provided further support for it [4, 25], and several groups [26, 27] have used mutual exclusivity to search for cancer-related network modules. Using protein interaction networks, Ciriello et al. [26] proposed a brute-force search algorithm to find network modules whose genes are mutually exclusive. Their method, called “mutual exclusivity modules in cancer (MeMo)”, has been applied to the TCGA glioblastoma multiforme (GBM) [25] and high-grade serous OV [4] data sets, yielding a number of novel modules that appear to follow the mutual exclusivity rule. Miller et al. [27] developed a method based on the winnow algorithm. They used winnow to score each gene pair by exclusivity, and generated a weighted graph based on a mutation matrix calculated using validated SNPs and focal CNAs. They then used a greedy search-based method to search for network modules and calculated the module significance score using a minimum length encoding method. The authors found some significant modules by applying their method to the same TCGA GBM data set. Both the Ciriello and Miller methods identified a p53 pathway-related module containing three genes, two of which were in common (TP53 and MDM2). Otherwise, there is little overlap among the modules identified by these two mutual exclusivity algorithms. Another issue is that the modules identified by the Ciriello and Miller algorithms tend to be small. The top significant modules from Ciriello et al. [26] contain only five genes, while the significant modules from Miller et al. [27] contain two to three genes, which is barely large enough to merit designation as network modules.
10.5.6 Weighted Gene Expression Network
A gene expression network is a network constructed from high-throughput gene expression data sets generated by microarray or RNA-seq experiments. Many reports have described the construction of gene coexpression networks based on correlation analysis [28] or the reverse-engineering of gene regulatory networks using a variety of complex and sophisticated algorithms such as Bayesian networks [29]. Horvath et al. [30] developed a simple algorithm called the weighted gene coexpression network method. This method uses pairwise gene expression correlation results as edge weights in a coexpression network, and then generates coexpression modules using hierarchical clustering. The clustering-derived modules are then analyzed for enrichment in pathways or GO annotations. Horvath et al. have applied their methods to several disease data analyses, including glioblastoma multiforme [31].
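The two steps just described, correlation-derived edge weights followed by hierarchical clustering, can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than the published method, which includes further steps not covered here: the edge weight is taken to be the absolute Pearson correlation, the dissimilarity is one minus that weight, clusters are merged by average linkage, and merging stops at a hand-picked distance threshold. The function names, the gene labels `g1` to `g4`, and the expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_weights(expr):
    """Edge weight |r| for every gene pair (keys are alphabetically ordered pairs)."""
    genes = sorted(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h]))
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


def average_linkage_modules(expr, max_distance):
    """Agglomerative clustering on distance 1 - |r|; merging stops once the
    closest pair of clusters is farther apart than max_distance."""
    weights = coexpression_weights(expr)

    def distance(g, h):
        return 1.0 - weights[(g, h) if g < h else (h, g)]

    clusters = [[g] for g in sorted(expr)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(g, h) for g in clusters[i] for h in clusters[j]]
                d = sum(distance(g, h) for g, h in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_distance:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical expression profiles across five samples (illustration only).
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "g3": [5.0, 4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 5.0, 1.0],
}
modules = average_linkage_modules(expression, max_distance=0.5)
print(modules)
```

Because the weight uses the absolute correlation, the anticorrelated gene g3 ends up in the same module as g1 and g2; a signed weighting would separate them. That choice, like the threshold of 0.5, is an assumption of this sketch.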
10.6 Applications of Network-Module-Based Approaches in Cancer Studies
The above network module detection algorithms have been widely used in cancer and other disease data analyses. The results from these studies are promising. Groups have used these algorithms to identify cancer prognostic signatures to predict patient survival and response to pharmaceutical agents, to prioritize disease candidate genes, and to search for network patterns for cancer candidate genes.
10.6.1 Network Modules and Cancer Prognostic Signatures
Chuang et al. [10] used a greedy search algorithm based on Ideker et al.’s original algorithm [9], with mutual information as the network module score, to compare two groups of breast cancer samples: one metastatic and the other nonmetastatic. Based on the network modules identified, Chuang et al. constructed a classifier to predict breast cancer patients’ metastasis status. Their results show better performance than individual gene-based approaches: on the basis of area-under-curve classifier metrics, their network-based approach was more reproducible between two breast cancer data sets than signatures based on the same number of top genes sorted by discriminative scores. Dao et al. [13] used their color-coding-based algorithm to search for a subnetwork prognostic signature that could distinguish drug-sensitive from nonsensitive breast cancer samples. Using an algorithm called CRANE, Chowdhury et al. [32] built a neural network based on subnetwork signatures that could predict metastasis of colorectal cancer.
10.6.2 Cancer Driver Gene Search Based on Network Modules
Finding cancer driver genes is very important for drug target discovery. Many methods have been proposed for searching for drug targets and cancer driver genes. Cerami et al. [23] developed a network-module-based method using the edge-betweenness algorithm. Applying their method to the TCGA GBM data set, Cerami et al. found several network modules related to GBM, and identified several new GBM candidate driver genes, including AGAP2, AVIL, KIT, TEK, FRS2, and KDR.
10.6.3 Using Network Patterns to Identify Cancer Mechanisms
Projecting cancer candidate genes onto the gene interaction network and then studying the derived network modules may identify common mechanisms across different cancer types. Using cancer-related signaling pathways from NCI-PID [33] and BioCarta (http://www.biocarta.com/), Cui et al. [34] constructed a cancer-related signaling network containing 1634 nodes and 5089 interactions. Their results show that network modules (called blocks in the original paper) can “collaborate” with each other, and that a subset of these collaborations occurs in most tumors. These collaborative patterns cannot be found at the gene level but appear clearly at the network module level. In our work [35], we constructed a functional interaction network by combining multiple network-related data sources, including curated pathways, Y2H, co-immunoprecipitation, gene expression, and shared gene ontology terms. This network is described in more detail in the next section. By applying this network to a variety of public data sets from brain, breast, colorectal, and pancreatic cancer patients, we found that the majority of samples have mutated genes in two large modules, one corresponding to nuclear proteins and the other to cytoplasmic and plasma membrane-expressed proteins. We inferred from this network pattern that cancer cells most likely require complementary alterations in nuclear genes controlling gene regulation on the one hand and in cytoplasmic/plasma membrane genes mediating signal processing on the other.
10.7 The Reactome FI Cytoscape Plug-in
In this section, we introduce a software tool we developed called the Reactome FI [35] plug-in, a Cytoscape [36] extension, that will allow readers to perform network-module-based data analysis on their own cancer data sets. The reader can launch this software application by following links in the user guide web page: http://wiki.reactome.org/index.php/Reactome_FI_Cytoscape_Plugin. The plug-in can be launched as a Java Web Start application, and can also be downloaded into a previously installed Cytoscape standalone application. A Java virtual machine with version 1.5 or above is required to run this software application. In the remainder of this chapter, we will present some technical background on this tool, and show how to use it to perform a network-module-based analysis using the recently published TCGA ovarian cancer mutation data set [4] as an example.
10.7.1 Construction of a Functional Interaction Network
For high-throughput data analysis, we need a biological network with sufficiently high coverage (e.g., > 50% of all human proteins or genes). Protein interaction networks constructed from high-throughput Y2H or Co-IP experiments usually have high coverage, but may have a high false-positive rate [7]. Interaction networks constructed by extracting interactions from human-curated pathways are more reliable, but manual curation of pathways is labor-intensive, and genomic coverage in a manually curated pathway database is usually low. To obtain a biological interaction network with both high coverage and high reliability, we constructed a so-called functional interaction network (FI network hereafter). We first extracted interactions from human-curated pathways in multiple pathway databases, including Reactome [37], KEGG [38], Panther [39], NCI-PID [33], and CellMap (http://cancer.cellmap.org/cellmap/). We then scored protein–protein pairwise relationships using a simple machine learning technique based on multiple data sources, including protein–protein interactions, gene coexpression, and shared GO annotations. Finally, we merged the high-scoring relationships with the extracted curated interactions to form an interaction network. We call the interactions in this network functional interactions because two proteins that are linked in the network are more likely to be functionally related than two proteins that are not. The final interaction network contains around 10,000 proteins (46% of all Swiss-Prot proteins) and close to 210,000 functional interactions, 47% of which were extracted from human-curated pathways. By comparing the predicted functional interactions against literature references, we found that our functional interaction network is highly reliable [35].
10.7.2 Network Clustering Algorithm
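Newman’s spectral method, on which the clustering in this section relies, splits a network in two according to the signs of the leading eigenvector of the modularity matrix B, where B_ij = A_ij - k_i k_j / (2m), A is the adjacency matrix, k_i is the degree of node i, and m is the number of edges. The following is a minimal sketch of a single bisection step in plain Python. It is not the plug-in’s server-side implementation: the function names, the power-iteration eigen-solver, and the toy graph (two triangles joined by one edge) are assumptions made for illustration, and the graph is assumed to contain at least one edge. The full algorithm applies such bisections repeatedly and refines the result.

```python
def modularity_matrix(n, edges):
    """B[i][j] = A[i][j] - k_i * k_j / (2m) for an undirected, unweighted graph."""
    adj = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    degree = [sum(row) for row in adj]
    two_m = sum(degree)  # sum of degrees equals 2m
    return [[adj[i][j] - degree[i] * degree[j] / two_m for j in range(n)]
            for i in range(n)]


def leading_eigenvector(matrix, iterations=500):
    """Power iteration on (matrix + shift * I).

    The shift (a Gershgorin bound) makes every eigenvalue non-negative, so the
    iteration converges to the eigenvector of the largest eigenvalue rather
    than the one of largest magnitude.
    """
    n = len(matrix)
    shift = max(sum(abs(v) for v in row) for row in matrix)
    # Deterministic, non-uniform start vector (a random start is more usual).
    vec = [float(i) - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) + shift * vec[i]
               for i in range(n)]
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    return vec


def bisect(n, edges):
    """Assign each node to module 0 or 1 by the sign of the leading eigenvector."""
    b = modularity_matrix(n, edges)
    vec = leading_eigenvector(b)
    # Rayleigh quotient of the unit-length vector estimates the leading eigenvalue.
    eigenvalue = sum(vec[i] * b[i][j] * vec[j] for i in range(n) for j in range(n))
    if eigenvalue <= 1e-9:
        return [0] * n  # no positive eigenvalue: treat the network as indivisible
    return [1 if v > 0 else 0 for v in vec]


# Toy graph (hypothetical): triangle 0-1-2 and triangle 3-4-5 joined by edge 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
labels = bisect(6, toy_edges)
```

On the toy graph the two triangles end up in different modules, which is the split one would expect from the single bridging edge.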
To search for network modules, we use Newman’s spectral partition-based modularity optimization algorithm [20] for fast performance. Our network clustering algorithm runs on the server side as a web service.
10.7.3 Cancer Gene Index Data Set
For cancer data analysis, we downloaded the cancer gene index data set as described in this document: https://wiki.nci.nih.gov/display/ICR/Cancer+Gene+Index+End+User+Documentation. The cancer gene index is a caBIG project (https://cabig.nci.nih.gov/), which annotates cancer genes based on published literature using both text mining and human validation methods. We imported the whole cancer gene index data set into a relational MySQL database built by using a programming API, so that the data can be loaded into the Reactome FI plug-in.
10.7.4 Analyzing the TCGA OV Mutation Data Set
As an example of how to use the Reactome FI plug-in, we take the recently published TCGA ovarian cancer mutation data set [4]. The ovarian data set is the second data set published by the TCGA project, after the 2008 GBM data set [25]. It can be downloaded from the supplemental content hosted on Nature’s web site. The mutation data file (2010-09-11380C-Table_S2.1_20110113.txt) contains mutation information for 316 ovarian cancer samples based on exome second-generation sequencing results, and lists 8420 genes having nonsynonymous mutations.
10.7.4.1 Loading the Mutation File into Cytoscape and Constructing a FI Subnetwork
The plug-in has been designed to work directly on the TCGA mutation data file format (MAF). Besides the MAF format, the plug-in can also load a simple gene list, or a tab-delimited file containing rows of genes versus numbers of samples containing the mutated genes. The user can choose the file format in the set parameter dialog (Figure 10.1) after selecting Reactome FIs/Gene Set/Mutation Analysis from the main PlugIns menu. We choose genes that are mutated in at least three samples. By leaving “Use linker genes” and “Show genes not linked to others” unselected, we construct a FI subnetwork containing only genes that interact with at least one other selected gene. Since about half of our FIs were extracted from human-curated reactions and complexes, we have developed a method to annotate FIs in the FI subnetwork. The user can annotate FIs after a FI subnetwork has been constructed, or during construction, as we are doing here.
Figure 10.1 The dialog to select the file format and set parameters for constructing a FI subnetwork using a mutation data file.
The constructed FI subnetwork for the TCGA ovarian mutation data contains 490 genes and 1764 FIs (Figure 10.2). The node sizes in the figure are proportional to the numbers of samples in which the genes are mutated. For example, TP53 is the largest node because almost 95% (299 of 316) of patient samples have at least one mutation in the gene.
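The selection steps described above (keep genes mutated in at least three samples, use no linker genes, and drop genes that are not linked to any other selected gene) can be sketched in a few lines of Python. This is only an illustration of the logic, not the plug-in’s code: the function names, the (gene, sample) input layout, and the toy gene and sample identifiers are all hypothetical.

```python
from collections import defaultdict


def samples_by_gene(mutations):
    """Map each gene to the set of samples in which it is mutated.

    `mutations` is an iterable of (gene, sample_id) pairs, for example parsed
    from the gene and sample columns of a mutation file.
    """
    result = defaultdict(set)
    for gene, sample in mutations:
        result[gene].add(sample)
    return result


def build_fi_subnetwork(mutations, fi_edges, min_samples=3):
    """Return (genes, edges, node_size) for the FI subnetwork of recurrently mutated genes."""
    gene_samples = samples_by_gene(mutations)
    # Keep genes mutated in at least `min_samples` distinct samples.
    selected = {g for g, s in gene_samples.items() if len(s) >= min_samples}
    # No linker genes: keep only FIs whose two ends are both selected.
    edges = {tuple(sorted((a, b))) for a, b in fi_edges
             if a in selected and b in selected and a != b}
    # Drop selected genes that are not linked to any other selected gene.
    genes = {g for edge in edges for g in edge}
    # Node size is the number of samples in which the gene is mutated.
    node_size = {g: len(gene_samples[g]) for g in genes}
    return genes, edges, node_size


# Toy input with hypothetical gene and sample identifiers.
toy_mutations = [
    ("G1", "s1"), ("G1", "s2"), ("G1", "s3"), ("G1", "s3"),  # duplicate call counted once
    ("G2", "s1"), ("G2", "s2"), ("G2", "s4"),
    ("G3", "s1"), ("G3", "s2"),                                # two samples only: filtered out
    ("G4", "s2"), ("G4", "s3"), ("G4", "s4"),                  # recurrent but unlinked
]
toy_fis = [("G1", "G2"), ("G2", "G3"), ("G3", "G4")]
genes, edges, node_size = build_fi_subnetwork(toy_mutations, toy_fis)
```

In the toy input, G3 falls below the three-sample threshold and G4 has no selected interaction partner, so only G1 and G2 and the FI between them remain.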
Figure 10.2 The FI subnetwork containing genes mutated in three or more samples from the TCGA OV mutation data file. The node sizes are proportional to the numbers of samples having genes mutated. TP53 has the most samples mutated, and has been displayed as the largest node in this screenshot.
Figure 10.3 A part of the TCGA OV FI subnetwork, zoomed in to show FI annotations. FIs extracted from activation or catalysis are displayed as arrows, inhibition as “T,” and FIs extracted from complexes or reaction inputs as solid lines without arrows or T. Predicted FIs are displayed as dashed lines.
FIs can be annotated based on the original data sources. Figure 10.3 shows a zoomed-in view of part of the FI network displayed in Figure 10.2. For example, the functional interaction between MMP8 and JUP is an FI extracted from a reaction where MMP8 is used as a catalyst, which was annotated in the Panther database. The FI between THBS3 and LTBP1 is an inhibition as described by the KEGG pathway. The FI between APC and THBS2 is a predicted FI based on a protein–protein physical interaction occurring in fly and a shared GO biological process annotation. One can select a FI in the diagram panel and use a popup menu called “Reactome FI/Query FI Source” to drill down to the original data source and see how the two proteins are involved in the same FI.
10.7.4.2 Network Clustering and Network Module Functional Analysis
To perform network clustering using the built-in spectral partition algorithm, use the popup menu item “Cluster FI Network” in the network panel. After network clustering, genes in different network modules are colored differently. One can navigate network modules using the “Network Module Browser” at the bottom of the Cytoscape main window (Figure 10.4).
Figure 10.4 Network module browser showing some properties of network modules.
The user can use the plug-in to annotate either the whole FI subnetwork or only the discovered network modules by performing pathway and/or GO annotation enrichment analysis. To perform functional enrichment analysis for network modules, use the popup menu “Analyze Module Functions” in the network panel, and choose either “Pathway Enrichment” or one of the GO annotations (cellular component, biological process, or molecular function). Some genes may not be annotated in any pathway, so the user may need to use GO annotation terms, since GO annotation has much higher coverage than human-curated pathways. Figure 10.5 shows the results of pathway annotations for modules, listed in a tab at the bottom of the main window. The p-values displayed in the results are based on a binomial test for fast calculation, so that a 1000-permutation test can be run very quickly to get false discovery rates.
Figure 10.5 Pathway annotations for network modules, showing the pathways enriched in each network module. In this screenshot, a Reactome pathway, “Integrin cell surface interactions,” is enriched in Module 2 with FDR < $2.0 \times 10^{-4}$.
To further explore the annotation of discovered modules, the user can select a pathway in the “Pathways in Modules” tab (Figure 10.5), and then use a popup menu item called “Show Pathway Diagram” to bring up a pathway diagram that shows the genes contained in the selected module. For example, OV Module 2 is enriched in genes corresponding to the Reactome pathway “Integrin cell surface interactions.” After the pathway diagram menu item is selected, the FI plug-in displays Reactome’s diagram of this pathway. Proteins or complexes in the diagram that match genes listed in the selected row are highlighted in blue (Figure 10.6). The plug-in uses manually laid-out pathway diagrams for pathways from the Reactome database, diagrams laid out automatically with a hierarchical layout algorithm for NCI-PID, Panther, and CellMap, and a web service API to map genes directly to KEGG pathway diagrams on the KEGG web site. By studying the detailed pathway diagrams displayed for highlighted genes or proteins, the user can understand the actual mechanisms underlying the functions of cancer-related mutated genes.
Figure 10.6 Pathway diagram for Reactome pathway “Integrin cell surface interactions.” Proteins or complexes that can be mapped to listed genes in the selected row have been highlighted in blue. The mapping of a complex is based on its contained protein components.
10.7.4.3 Module-Based Survival Analysis
As described above, a single network module or a set of such modules can be used as a signature of cancer patient prognosis. The FI plug-in provides a feature to perform survival analysis if clinical information is available for the samples used in network construction. For the TCGA OV data set, the appropriate clinical information can be downloaded from the same web site as the mutation data: 2010-0911380C-Table_S1.2.txt.
The plug-in uses a server-side R script to perform survival analysis. It provides an option to use either the Cox proportional hazards (CoxPH) model or the Kaplan–Meier model for survival analysis [40]. The CoxPH model is a popular mathematical model for survival analysis. It is usually written as follows:
$$h(t, X) = h_0(t)\exp\left(\sum_{i=1}^{p} \beta_i X_i\right) \qquad (10.4)$$
where $X = (X_1, X_2, \ldots, X_p)$ is a vector of explanatory/predictor variables, $h(t, X)$ is the hazard at time point $t$, and $h_0(t)$ is the baseline hazard, which is a function of $t$ but not of the explanatory variables; this is the proportional hazards assumption. Conversely, the exponential expression in the equation does not involve time $t$, so the explanatory variables are assumed to be time independent. In actual data sets, some clinical variables may in fact be time dependent, necessitating the use of the extended Cox model; unfortunately, this model is not yet supported by the FI plug-in. Usually, the coefficients $\beta_i$ are estimated by maximum likelihood, and p-values are reported based on several statistical tests.
The FI plug-in is also capable of performing Kaplan–Meier survival analysis on network modules derived from cancer mutation data sets. After network clustering, the plug-in splits samples into two groups: samples having genes mutated in a module and samples having no genes mutated in the module. The plug-in uses the log-rank test to compare two or more survival curves and estimate p-values.
To perform survival analysis with the FI plug-in, use the popup menu “Analyze Module Functions/Survival Analysis…”. One can perform CoxPH analysis on all modules, or select a single network module for the analysis. The survival analysis results are displayed in the “Survival Analysis” tab in the left results pane (Figure 10.7). In the OV data set, we found that Module 6 is significantly related to patient survival. Figure 10.8 shows the Kaplan–Meier survival analysis result for Module 6, which indicates that OV patients having genes mutated in Module 6 have longer overall survival times than patients having no mutated genes in this module. To explore the molecular basis for this observation, we can check pathway annotations for Module 6.
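The module-based grouping and the log-rank comparison described above can be sketched in plain Python as follows. This is a minimal illustration of the standard two-group log-rank statistic and is not the plug-in’s server-side R script. The function names, the input layout, and the toy follow-up times are assumptions made for the example; the p-value uses the chi-square distribution with one degree of freedom.

```python
import math


def module_groups(mutated_genes_by_sample, module_genes):
    """Return 1 for samples with at least one mutated gene in the module, else 0."""
    return [1 if genes & module_genes else 0 for genes in mutated_genes_by_sample]


def logrank_test(times, events, groups):
    """Two-group log-rank test.

    times  : follow-up time of each sample
    events : 1 if death was observed, 0 if the sample was censored
    groups : 1 or 0, e.g. as returned by module_groups
    Returns (chi_square, p_value) with one degree of freedom.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Samples still under observation just before time t.
        at_risk = [(g, tt, e) for tt, e, g in zip(times, events, groups) if tt >= t]
        n = len(at_risk)
        n1 = sum(1 for g, _, _ in at_risk if g == 1)
        d = sum(1 for _, tt, e in at_risk if tt == t and e == 1)
        d1 = sum(1 for g, tt, e in at_risk if tt == t and e == 1 and g == 1)
        # Observed minus expected deaths in group 1 under the null hypothesis.
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in group 1.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi_square = observed_minus_expected ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Toy data (hypothetical): three samples with a mutated module gene survive longer.
toy_sample_genes = [{"A"}, {"B"}, {"C"}, {"M1"}, {"M2", "A"}, {"M1", "M2"}]
toy_module = {"M1", "M2"}
toy_times = [1, 2, 3, 6, 7, 8]
toy_events = [1, 1, 1, 1, 1, 1]
toy_groups = module_groups(toy_sample_genes, toy_module)
chi_square, p_value = logrank_test(toy_times, toy_events, toy_groups)
```

With real data, censored samples (event value 0) still count toward the number at risk until their last follow-up time, which is why the at-risk set is rebuilt at every event time.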
Pathway annotations show that genes in Module 6 are significantly related to the calcium signaling pathway in KEGG (http://www.genome.jp/kegg/pathway/hsa/hsa04020.html). Calcium homeostasis is essential for cell migration, and for tumor metastasis in particular [41]. It may be that mutations in Module 6 genes disrupt calcium homeostasis, thereby impairing the tumor’s ability to metastasize and extending patients’ overall survival.
Figure 10.7 Module-based CoxPH survival analysis results. Two results are displayed in this figure: the first is from CoxPH analysis of all modules containing at least 13 genes; the second is from CoxPH analysis based on Module 6.
10.7.4.4 Cancer Gene Index Data Overlay Analysis
Finally, the plug-in allows users to overlay discovered network modules with cancer gene index data derived from text mining and human curation. To overlay cancer gene index data onto the displayed FI subnetwork, choose the popup menu item “Load Cancer Gene Index” in the network panel. The FI plug-in loads the NCI disease ontology into the NCI Disease tab in the Cytoscape Control Panel. The user can browse and search the disease ontology, and view the definition of a selected disease ontology term. In Figure 10.9, we have selected Malignant_Ovarian_Neoplasm from the NCI disease ontology. Genes that have been annotated with the selected disease are highlighted in yellow. The user can also select a single gene in the network panel, and use “Reactome FI/Fetch Cancer Gene Index” to view detailed annotations for the selected gene.
Figure 10.8 Kaplan–Meier survival analysis for Module 6. Samples in Group 1 (red) have no genes mutated in Module 6, while samples in Group 2 (green) have genes mutated. The inset shows the detailed results from Kaplan–Meier survival analysis.
10.8 Conclusions
In this chapter, we have reviewed methods for identifying and characterizing cellular network modules, and for using these modules to interpret high-throughput cancer data. We have also presented a walk-through of using the Reactome FI Cytoscape plug-in to perform network-module-based analysis on a published cancer data set. For reasons of space, the focus of this chapter has been the more established topic of protein interaction networks, and we have not covered other types of cellular networks. For example, many recent studies [42, 43] have been performed on networks of miRNAs and their targets.
10.9 Perspective
A number of challenges remain in the creation and application of network modules. One issue is the reliability of the underlying network. Different types of networks have been used in high-throughput data analysis, including protein–protein interaction networks, gene regulatory networks, miRNA/target interaction networks, and genetic interaction networks. It is critical that the networks used as the foundation for network analysis are reliable and cover enough proteins or genes. Many interactions have been generated from high-throughput data analysis; some of them have been predicted computationally and have not been validated. We expect that more reliable networks will become available in the future, enabling more reliable data analysis.
Figure 10.9 Screenshot showing overlaying of cancer gene index data onto a displayed FI subnetwork. Genes that have been annotated for the selected disease term have been highlighted in yellow.
Another issue is the methodology for combining diverse sources of network information. In a cell, proteins, regulatory RNAs, DNA, and other molecules interact with each other to form an interacting network. Currently most, if not all, network-based data analyses use only a single type of interaction network. We expect that these different entity types will be integrated into one single network. This kind of integrative analysis will require more powerful network analysis algorithms and software tools.
Finally, the annotation of network modules remains a challenge. Pathway annotations and enrichment analysis are routinely used to annotate the functions of network modules. Many interaction networks used in data analysis are undirected, and no semantic meanings, such as activation or inhibition, are assigned to the interactions. Pathways, in contrast, are usually annotated as functional units and show directed relationships among pathway components in either directed interactions or biochemical reactions. Some interaction networks are actually generated from pathways. How to integrate rich pathway information into undirected networks is a challenging question. We hope that future network-based studies will shed more light on this area and provide much more powerful network tools for systems biology-based data analysis.
References
1 Vidal, M., Cusick, M.E., and Barabasi, A. (2011) Cell, 144, 986.
2 Barabasi, A., Gulbahce, N., and Loscalzo, J. (2011) Nat. Rev. Genet., 12, 56.
3 Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., and Alon, U. (2002) Science, 298, 824.
4 The cancer genome atlas research network. (2011) Nature, 484, 609.
5 Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., and Sherlock, G. (2000) Nat. Genet., 25, 25.
6 Sharan, R., Ulitsky, I., and Shamir, R. (2007) Mol. Syst. Biol., 3, 88.
7 von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., and Bork, P. (2002) Nature, 417, 399.
8 Ewing, R.M., Chu, P., Elisma, F., Li, H., Taylor, P., Climie, S., McBroom-Cerajewski, L., Robinson, M.D., O’Connor, L., Li, M. et al. (2007) Mol. Syst. Biol., 3, 89.
9 Ideker, T., Ozier, O., Schwikowski, B., and Siegel, A.F. (2002) Bioinformatics, 18 (Suppl 1), S233.
10 Chuang, H., Lee, E., Liu, Y., Lee, D., and Ideker, T. (2007) Mol. Syst. Biol., 3, 140.
11 Liu, M., Liberzon, A., Kong, S.W., Lai, W.R., Park, P.J., Kohane, I.S., and Kasif, S. (2007) PLoS Genet., 3, e96.
12 Liu, Y., Patel, S., Nibbe, R., Maxwell, S., Chowdhury, S.A., Koyuturk, M., Zhu, X., Larkin, E.K., Buxbaum, S.G., Punjabi, N.M., Gharib, S.A., Redline, S., and Chance, M.R. (2011) Pac. Symp. Biocomput., 16, 14.
13 Dao, P., Wang, K., Collins, C., Ester, M., Lapuk, A., and Sahinalp, S.C. (2011) Bioinformatics, 27, i205.
14 Bader, G.D. and Hogue, C.W. (2003) BMC Bioinform., 4, 2.
15 Adamcsek, B., Palla, G., Farkas, I.J., Derényi, I., and Vicsek, T. (2006) Bioinformatics, 22, 1021.
16 Jiang, P. and Singh, M. (2010) Bioinformatics, 26, 1105.
17 van Dongen, S. (2000) Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, the Netherlands, May.
18 Rhrissorrakrai, K. and Gunsalus, K.C. (2011) BMC Bioinform., 12, 192.
19 Newman, M.E.J. and Girvan, M. (2004) Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 69, 26113.
20 Newman, M.E.J. (2006) Proc. Natl. Acad. Sci. USA, 103, 8577.
21 Dunn, R., Dudbridge, F., and Sanderson, C.M. (2005) BMC Bioinf., 6, 39.
22 Luo, F., Yang, Y., Chen, C.F., Chang, R., Zhou, J., and Scheuermann, R.H. (2007) Bioinformatics, 23, 207.
23 Cerami, E., Demir, E., Schultz, N., Taylor, B.S., and Sander, C. (2010) PLoS One, 5, e8918.
24 Vogelstein, B. and Kinzler, K.W. (2004) Nat. Med., 10, 789.
25 The cancer genome atlas research network. (2008) Nature, 455, 1061.
26 Ciriello, G., Cerami, E., Sander, C., and Schultz, N. (2012) Genome Res., 22, 398.
27 Miller, C.A., Settle, S.H., Sulman, E.P., Aldape, K.D., and Milosavljevic, A. (2011) BMC Med. Genom., 4, 34.
28 Lee, H.K., Hsu, A.K., Sajdak, J., Qin, J., and Pavlidis, P. (2004) Genome Res., 14, 1085.
29 Friedman, N., Linial, M., Nachman, I., and Pe’er, D. (2000) J. Comput. Biol., 7, 601.
30 Langfelder, P. and Horvath, S. (2008) BMC Bioinform., 9, 559.
31 Horvath, S., Zhang, B., Carlson, M., Lu, K.V., Zhu, S., Felciano, R.M., Laurance, M.F., Zhao, W., Qi, S., Chen, Z., Lee, Y., Scheck, A.C., Liau, L.M., Wu, H., Geschwind, D.H., Febbo, P.G., Kornblum, H.I., Cloughesy, T.F., Nelson, S.F., and Mischel, P.S. (2006) Proc. Natl. Acad. Sci. USA, 103, 17402.
32 Chowdhury, S.A., Nibbe, R.K., Chance, M.R., and Koyutürk, M. (2011) J. Comput. Biol., 18, 263.
33 Schaefer, C.F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., and Buetow, K.H. (2009) Nucleic Acids Res., 37, D674.
34 Cui, Q., Ma, Y., Jaramillo, M., Bari, H., Awan, A., Yang, S., Zhang, S., Liu, L., Lu, M., O’Connor-McCourt, M., Purisima, E.O., and Wang, E. (2007) Mol. Syst. Biol., 3, 152.
35 Wu, G., Feng, X., and Stein, L. (2010) Genome Biol., 11, R53.
36 Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., and Ideker, T. (2003) Genome Res., 13, 2498.
37 Croft, D., O’Kelly, G., Wu, G., Haw, R., Gillespie, M., Matthews, L., Caudy, M., Garapati, P., Gopinath, G., Jassal, B., Jupe, S. et al. (2011) Nucleic Acids Res., 39, D691.
38 Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., and Hattori, M. (2004) Nucleic Acids Res., 32, D277.
39 Mi, H., Dong, Q., Muruganujan, A., Gaudet, P., Lewis, S., and Thomas, P.D. (2010) Nucleic Acids Res., 38, D204.
40 Kleinbaum, D.G. and Klein, M. (2005) Survival Analysis: A Self Learning Guide, Springer, New York.
41 Prevarskaya, N., Skryma, R., and Shuba, Y. (2011) Nat. Rev. Cancer, 11, 609.
42 Zhang, S., Li, Q., Liu, J., and Zhou, X.J. (2011) Bioinformatics, 27, i401.
43 Schmeier, S., Schaefer, U., Essack, M., and Bajic, V.B. (2011) BMC Syst. Biol., 5, 183.
11 Discriminant and Network Analysis to Study Origin of Cancer
Li Chen, Ye Tian, Guoqiang Yu, David J. Miller, Ie-Ming Shih, and Yue Wang
11.1 Brief Summary
Enabled by rapid advances in biological data acquisition technologies and developments in computational methodologies, interdisciplinary research in machine learning for biomedicine tackles various challenging biological questions by comprehensively scrutinizing (multiplatform) data from multiple, distinct vantages. Understanding the origin and progression of cancer has great practical import for advancing both biological knowledge and potential clinical treatments. Technically, the most challenging biological questions inspire and promote the development and applications of novel computational methods. This chapter presents a coalition of state-of-the-art machine learning methods and leading-edge scientific puzzles. With DNA copy number and transcriptome data, we were able to design specific statistical hypothesis tests to reveal the origin of cancer by comparing the genomic and transcriptome codes and biological network structures.
11.2 Introduction
Machine learning has played an important role in analyzing high-dimensional genomic and transcriptome data for biomedical research [1–4]. Many machine learning approaches, such as discriminant analysis, network modeling, cluster analysis, and pattern classification, can be effectively used to help test novel biomedical hypotheses [5–7]. In this chapter, we describe two novel applications of machine learning methods to study the origin of cancer using high-throughput DNA copy number and gene expression data.
Although significant progress has been made in the understanding and diagnosis of various human cancers, basic questions about the origin of some cancers remain unanswered. For example, controversy exists both over whether or not metastatic cancers arise from a single precursor cancer cell [7] and over where and how cancer originates in the body [8]. Specifically, many studies have shown that primary prostate cancers are multifocal [9–11], composed of multiple, genetically distinct cancer cell clones. Whether multiclonal primary prostate cancers typically give rise to multiclonal or monoclonal prostate cancer metastases is largely unknown, although studies at single chromosomal loci are consistent with the latter case. Another example is the controversy over the origin of ovarian serous carcinoma. The accepted view of ovarian carcinogenesis is that serous carcinoma begins in the ovaries and then spreads to the pelvic and abdominal cavities before metastasizing to distant sites. In recent studies, several research groups [8, 12] found that the most aggressive ovarian cancer, high-grade (HG) serous carcinoma, is associated with a presumed precursor lesion in the fallopian tube (FT) rather than in the ovary. This suggests that the prevailing theory of the origin of ovarian carcinogenesis may be flawed. Clarification of these puzzles should have significant scientific and clinical implications. For example, if ovarian cancer indeed originates in the fallopian tube instead of the ovary, we can focus on detecting early lesions in the fallopian tube and thus advance the detection of ovarian cancer; we can also excise the fallopian tube without removing the ovary in preventive medicine, a new strategy expected to minimize the adverse effects of oophorectomy (resection of the ovary) while maintaining efficacy in ovarian cancer prevention. Thanks to rapid advances in genome-wide analysis technologies, it has now become possible to test cancer-related hypotheses based on high-throughput genomic data in a statistically sound manner [6–8].
Statistical Diagnostics for Cancer: Analyzing High-Dimensional Data, First Edition. Edited by Frank Emmert-Streib and Matthias Dehmer. © 2013 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2013 by Wiley-VCH Verlag GmbH & Co. KGaA.
To rigorously test specific biomedical hypotheses using specific types of genomic data, novel machine learning approaches must be carefully designed, based on: (1) a good understanding of the underlying biomedical problems and the characteristics of the acquired data, (2) vigilant consideration of unique, subtle issues in modeling biological mechanisms, and (3) innovative methodologies that exploit both the statistical power of the available data and accumulated relevant biomedical knowledge [1, 3, 4, 13–15].
11.3 Overview of Relevant Machine Learning Techniques
11.3.1 Fisher’s Discriminant Analysis and ANOVA
Fisher’s linear discriminant analysis is an effective pattern recognition scheme to find a linear combination of features that maximizes the separability of patterns based on the ratio of between-class to within-class scatter derived from multiple classes of samples. The resulting combination is commonly used for dimensionality reduction, allowing improved data visualization or subsequent classification. In Section 11.4.1, Fisher’s linear discriminant analysis will be used for visualizing somatic DNA copy number alteration (CNA) patterns [16] of prostate cancer samples in a 3D space (see Figure 11.4d and e).
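To make the idea concrete, the following minimal sketch computes Fisher’s discriminant direction for the simplest case of two classes with two features, where the direction that maximizes the ratio of between-class to within-class scatter is proportional to the inverse within-class scatter matrix applied to the difference of the class means. It is an unweighted, two-class simplification written in plain Python, not the multiclass method used later in this chapter; the function names and the toy data are assumptions made for illustration.

```python
def mean_vector(samples):
    """Component-wise mean of a list of 2D samples."""
    n = len(samples)
    return [sum(x[d] for x in samples) / n for d in range(2)]


def scatter_matrix(samples, mean):
    """Sum of outer products of the centered samples (2 x 2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in samples:
        c = [x[0] - mean[0], x[1] - mean[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += c[a] * c[b]
    return s


def fisher_direction(class_a, class_b):
    """Unit vector w proportional to S_W^{-1} (m_a - m_b).

    S_W here is the unnormalized sum of the two class scatter matrices;
    rescaling S_W changes only the length of w, not its direction.
    """
    m_a, m_b = mean_vector(class_a), mean_vector(class_b)
    s_a, s_b = scatter_matrix(class_a, m_a), scatter_matrix(class_b, m_b)
    s_w = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
    inv = [[s_w[1][1] / det, -s_w[0][1] / det],
           [-s_w[1][0] / det, s_w[0][0] / det]]
    diff = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return [w[0] / norm, w[1] / norm]


def project(samples, w):
    """Project 2D samples onto the discriminant direction."""
    return [x[0] * w[0] + x[1] * w[1] for x in samples]


# Toy data (hypothetical): two elongated, roughly parallel classes.
class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (2.0, 3.5)]
class_b = [(2.0, 1.0), (3.0, 2.0), (4.0, 3.0), (3.0, 1.5)]
w = fisher_direction(class_a, class_b)
proj_a, proj_b = project(class_a, w), project(class_b, w)
```

In the toy data, neither feature separates the classes on its own, because the two classes overlap along both axes, whereas the projections onto w do not overlap. This is the sense in which the combination improves separability.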
Given $K$ classes (of samples), each with prior probability $p_k$, mean vector $\mu_k$, and covariance matrix $\Sigma_k$, $k = 1, 2, \ldots, K$, we define the pairwise between-class scatter matrix as $S_{k,k'} = (\mu_k - \mu_{k'})(\mu_k - \mu_{k'})^{T}$ and the pooled within-class scatter matrix as $S_W = \sum_{k=1}^{K} p_k \Sigma_k$. The discriminatory component vector $v$ is given by
$$v = \arg\max_{u} \frac{\sum_{k=1}^{K-1}\sum_{k'=k+1}^{K} p_k p_{k'}\,\omega(\Delta_{k,k'})\,u^{T} S_{k,k'}\,u}{u^{T} S_W\,u} \qquad (11.1)$$
where $\omega(\Delta_{k,k'})$ is the weighting function of the between-class Mahalanobis distance $\Delta_{k,k'} = \sqrt{(\mu_k - \mu_{k'})^{T} S_W^{-1} (\mu_k - \mu_{k'})}$. It can be shown that $v = S_W^{-1/2} e_{\max}$, where $e_{\max}$ is the eigenvector of $\sum_{k=1}^{K-1}\sum_{k'=k+1}^{K} p_k p_{k'}\,\omega(\Delta_{k,k'})\,S_W^{-1/2} S_{k,k'} S_W^{-1/2}$ associated with the maximum eigenvalue [16]. The discriminatory component vector $v$ can be used to visualize a high-dimensional data set $X_{M \times N}$ (where $M$ is the vector dimension and $N$ the sample size) in a lower dimensional space while maximally preserving the group separability (see (11.1)) [17]. For example, we can visualize the data set $X_{M \times N}$ in a 3D space by $[v_1, v_2, v_3]^{T} X_{M \times N}$, where $v_1$, $v_2$, and $v_3$ are the discriminatory component vectors associated with the three largest eigenvalues.
A related technique is the analysis of variance (ANOVA). Unlike Fisher’s discriminant analysis, ANOVA can be used to select a subset (instead of a linear combination) of features that separates multiple classes of samples. In Section 11.4.2, ANOVA will be used to select differentially expressed genes between three different normal tissue types, and thus improve the accuracy of subsequent multiclass classifications. Let $N_k$ denote the number of samples in class $k$, $k = 1, 2, \ldots, K$, and let $N = \sum_{k=1}^{K} N_k$ be the total number of samples. For gene $j$, denote the expression level of the $i$th sample in the $k$th class by $x_{k,i}(j)$, the class mean by $\mu_k(j)$, and the pooled mean by $\mu(j)$. The F statistic for gene $j$ is then given by
$$F(j) = \frac{\sum_{k=1}^{K} N_k \left(\mu_k(j) - \mu(j)\right)^2 / (K - 1)}{\sum_{k=1}^{K}\sum_{i=1}^{N_k} \left(x_{k,i}(j) - \mu_k(j)\right)^2 / (N - K)} \qquad (11.2)$$
The P value associated with gene $j$, which indicates the gene’s discrimination power, is then given by $1 - C_F(F(j))$, where $C_F$ is the cumulative distribution function of $F(j)$ (under the null hypothesis of equal class means, an F distribution with $K - 1$ and $N - K$ degrees of freedom).
11.3.2 Hierarchical Clustering
Clustering algorithms assign a set of samples into groups (called clusters), such that samples in the same cluster are more similar to each other than to samples in other clusters. Hierarchical clustering is a popular algorithm that creates a hierarchy of tree-structured groups, represented by a dendrogram. In Section 11.4.1, we will apply a standard hierarchical clustering technique, that is, average linkage agglomerative clustering, to reveal the potential clusters embedded in metastatic prostate cancer CNA profiles, and will perform significance testing to assess the hypothesis that metastatic tumors in the same patient have much more similar CNA profiles than those of tumors in other patients. In average linkage agglomerative clustering, the average linkage function for two clusters, denoted by $X_1 = \{x_{1,1}, \ldots, x_{1,N_1}\}$ and $X_2 = \{x_{2,1}, \ldots, x_{2,N_2}\}$, is the average distance over all pairs of samples with one sample taken from each cluster, as given by

\[
D(X_1, X_2) = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{i'=1}^{N_2} d(x_{1,i}, x_{2,i'}) \tag{11.3}
\]
where $d(x_{1,i}, x_{2,i'})$ is the Euclidean distance (or any other measure of dissimilarity) between two samples. An agglomerative approach is used to create the hierarchy of clusters based on the linkage function (11.3). Initially, each sample is treated as a cluster. Then pairs of clusters are merged in a stage-wise fashion. Among all candidate clusters at each stage, the pair of clusters $X_k$ and $X_{k'}$ for which $D(X_k, X_{k'})$ is minimum is merged. Repeatedly merging clusters creates a dendrogram.

11.3.3 One-Versus-All Support Vector Machine and Nearest-Mean Classifier
In supervised classification, a trained classifier is used to assign an input sample (represented by a given number of measured features) to one of the underlying categories (classes). In Section 11.4.2, multiclass classification will be used in a novel fashion to ascertain the origin of ovarian cancer. Specifically, we will use a classifier trained on three normal tissue candidates to classify the HG samples and assess the most probable origin of HG ovarian cancer, based on high-throughput gene expression data [5, 6, 18].

The support vector machine (SVM) is a widely applied binary classification method [19] whose robust performance is attributable to the convex optimization and structural regularization principles that are key parts of the SVM classifier design formulation [20]. Given samples $x_i$, $i = 1, 2, \ldots, N$, and their class labels $y_i = \pm 1$ indicating which class $x_i$ belongs to, the SVM solves the following optimization problem (where $C$ is a constant weight):

\[
\arg\min_{w, \xi, b} \left\{ \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i \right\} \quad \text{s.t.} \quad y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{11.4}
\]

Effectively, the above problem seeks the linear weight vector that maximizes the margin (the minimum distance of any sample to the classifier decision boundary), while at the same time allowing for some slackness (relaxation) with respect to this margin, applied to outlier samples.

To carry over the advantages of SVMs from binary classification tasks to multicategory classification, the one-versus-rest SVM (OVRSVM) classifier synthesizes multiple SVM binary classifiers, each trained to distinguish the samples in a particular class from the samples in all remaining classes [21]. To classify a sample to one of $K$ classes, $K$ SVMs are applied to each sample, with the SVM that produces the largest (most positive) output value chosen to indicate the "winning" class. Figure 11.1 shows an illustrative OVRSVM classifier for three classes. The OVRSVM classifier has proved highly successful for multicategory classification tasks involving limited amounts of high-dimensional training data in real-world applications. OVRSVM produces results that are often at least as accurate as other more complicated methods, including single-machine multicategory schemes [22].

Figure 11.1 Illustration of an OVR committee classifier solution for multicategory classification (three classes, in this case). The dotted lines are the decision hyperplanes associated with each of the component binary SVMs and the bold line set represents the final decision boundary after the winner-take-all classification rule is applied.

The Nearest-Mean classifier is another popular classification method. For samples $x_i$, $i = 1, 2, \ldots, N$, belonging to $K$ classes, letting $\mu_k$ be the mean of class $k$, $k = 1, 2, \ldots, K$, and $d_{i,k} = \lVert x_i - \mu_k \rVert^2$, the classification rule is given by

\[
f(x_i) =
\begin{cases}
1, & \text{if } d_{i,1} = \min_{k = 1, 2, \ldots, K} d_{i,k} \\
2, & \text{if } d_{i,2} = \min_{k = 1, 2, \ldots, K} d_{i,k} \\
\vdots \\
K, & \text{if } d_{i,K} = \min_{k = 1, 2, \ldots, K} d_{i,k}
\end{cases}
\tag{11.5}
\]
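The two decision rules of this section can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names (`fit_class_means`, `nearest_mean_predict`, `ovr_winner`), the toy data, and the SVM output values are hypothetical and do not come from the chapter, and the binary SVMs themselves are not trained here. The sketch uses the squared Euclidean distance, which selects the same class as the unsquared distance.

```python
import numpy as np

def fit_class_means(X, y):
    """Return the sorted class labels and one mean vector per class."""
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    return classes, means

def nearest_mean_predict(X, classes, means):
    """Assign each row of X to the class whose mean is closest, as in (11.5)."""
    # d[i, k] = squared Euclidean distance from sample i to the mean of class k
    d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def ovr_winner(decision_values, classes):
    """Winner-take-all rule: pick the class whose binary SVM output is largest."""
    return classes[np.argmax(decision_values, axis=1)]

# Toy training set: two classes in two dimensions (illustrative values only).
X_train = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
y_train = np.array([0, 0, 1, 1])
classes, means = fit_class_means(X_train, y_train)
pred = nearest_mean_predict(np.array([[1.0, 1.0], [9.0, 9.0]]), classes, means)

# Hypothetical outputs of three binary SVMs (columns) for two samples (rows).
scores = np.array([[0.3, -1.2, -0.5], [-0.7, -0.1, 0.4]])
winners = ovr_winner(scores, np.array([0, 1, 2]))
```

Here `pred` holds the nearest-mean class of each test sample and `winners` holds the class selected by the one-versus-rest rule.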
11.3.4 Differential Dependency Network
It is widely believed that genes function in complex, context-specific networks that are subject to dynamic changes. The differential dependency network (DDN) [23], a computational statistical method, identifies the most discriminating network topologies between two tissue types, thus revealing how different their gene regulatory networks are. In an earlier study [24], network structure changes were analyzed computationally, based on given pathways, to uncover the pathological alterations caused by chronic fatigue syndrome. That study shares with DDN the focus on differential networks under specific biological conditions, but it places more emphasis on the statistical analysis of network distances between conditions. DDN is a general computational method for detecting statistically significant network changes between two groups. Given a P value threshold, DDN finds the gene-gene connections that exist exclusively in one of the two tissues. In Section 11.4.2, DDN will be applied to assess the similarity between gene regulatory networks of tumor and normal tissues. For a given tumor tissue and multiple normal tissues, the normal tissue with the gene regulatory network most similar to that of the tumor tissue is suggested to be the origin of the tumor.

The DDN method uses gene expression data to detect the topological differences between the networks of interest (e.g., pathways). Input gene expression data for tumor and normal tissues are standardized and then used in the network inference of tumor and normal tissues, respectively. Network inference applies Lasso optimization [25] to identify gene connections. These connections are identified based on the power of one set of genes to predict the expression levels of another (target) gene. If a linear combination of a set of genes can predict the expression value of another gene with acceptable error, connections are defined to exist between the set of genes and the predicted gene. By adopting this idea, all possible gene connections are evaluated, and the most confident ones are retained to represent the network structures in tumor and normal tissues, respectively.
Finally, permutation testing is used to determine if a connection shows a statistically significant difference between tumor and normal tissues. A differential network is thus constructed to highlight the parts of the transcriptional network with statistically significant differences between tumor and normal tissues. The workflow of DDN is shown in Figure 11.2.
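The network inference step can be illustrated with a minimal sketch, assuming a plain coordinate-descent Lasso written in NumPy rather than the DDN software itself. The function names (`lasso_cd`, `infer_edges`), the penalty `lam = 0.3`, and the synthetic three-gene data are assumptions made for illustration, and the permutation test on each differential connection is omitted.

```python
import numpy as np

def standardize(X):
    """Center each gene (column) and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent on (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual that leaves out the contribution of feature j.
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / n
            # Soft-thresholding sets weak predictors exactly to zero.
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

def infer_edges(X, lam):
    """Regress every gene on all others; nonzero coefficients become edges."""
    Z = standardize(X)
    p = Z.shape[1]
    edges = set()
    for t in range(p):
        others = [j for j in range(p) if j != t]
        coefs = lasso_cd(Z[:, others], Z[:, t], lam)
        for j, c in zip(others, coefs):
            if c != 0.0:
                edges.add((min(j, t), max(j, t)))
    return edges

# Synthetic data (assumption): gene 2 follows gene 0 in "normal", gene 1 in "tumor".
rng = np.random.default_rng(0)
n = 500
g0, g1 = rng.normal(size=n), rng.normal(size=n)
normal = np.column_stack([g0, g1, g0 + 0.1 * rng.normal(size=n)])
h0, h1 = rng.normal(size=n), rng.normal(size=n)
tumor = np.column_stack([h0, h1, h1 + 0.1 * rng.normal(size=n)])

edges_normal = infer_edges(normal, lam=0.3)
edges_tumor = infer_edges(tumor, lam=0.3)
# Connections present in exactly one of the two conditions.
differential = edges_normal ^ edges_tumor
```

In this toy setting the differential set contains the connections that exist in only one condition; in the method described above, each such connection would additionally be tested by permutation before entering the differential network.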
11.4 Methods
This chapter introduces two hybrid machine learning methods to test hypotheses on the origin of cancer, based on different genomic data platforms. Section 11.4.1 describes a method for testing the existence of monoclonality in metastatic cancer cells via the CNA profile; Section 11.4.2 describes a hybrid method for testing the origin of ovarian cancer based on microarray gene expression data.

11.4.1 CNA Data Analysis for Testing Existence of Monoclonality
As described in Section 11.2, it is largely unknown whether multiclonal primary prostate cancers typically give rise to multiclonal or monoclonal prostate cancer metastases. To test the existence of monoclonality, we propose to perform a hybrid statistical machine learning analysis on high-throughput genome-wide CNA profiles [16]. There are K patients, each with several different metastatic tumor samples. For each tumor sample, we have the CNA profile. Our assumption is that the CNA profile can be used to characterize the similarity between DNA clones, that is, if monoclonality exists, metastatic cancer cells from the same patient should have similar CNA patterns. Thus, we want to investigate whether metastatic tumor samples from the same patient have CNA profiles that are more similar to each other than to profiles from other patients.

Figure 11.2 Network distance evaluation process. (a) Gene expression data of tumor and normal are standardized before calculation. (b) Local dependency models under two conditions, tumor and normal, are inferred, respectively, by using the Lasso estimator, with a corresponding coefficient of determination assigned. (c) Confident local dependency models are combined as a test model set, with permutation testing used to identify significantly changed local dependency models. (d) All significantly changed local dependency models are put together to form a differential dependency network. A network distance can then be calculated between tumor and normal.

11.4.1.1 Preprocessing

To assess possible clonal relationships of metastasizing cells, CNA profiles are obtained from anatomically separate cancer sites in men who died from metastatic prostate cancer. For the CNA data, we first remove the uninformative loci, that is, loci with few CNAs across all samples in the data set. Using the resulting dimension-reduced data, we perform our subsequent analysis.

11.4.1.2 Assessing Statistical Significance of Monoclonality

11.4.1.2.1 Hierarchical Clustering

We first perform average linkage clustering (an agglomerative strategy) on the CNA profiles, as introduced in Section 11.3, merging clusters until the number of clusters equals the number of patients (K). After this clustering, samples are represented by a dendrogram (a tree structure), which describes the similarity of their CNA patterns in a hierarchical fashion. If monoclonality exists, we would expect that samples from the same patient will be grouped into a common cluster.
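The clustering step can be sketched directly from the average linkage function (11.3). The following is a naive NumPy illustration under stated assumptions, not the implementation used for the results in this chapter: the function name `average_linkage` and the toy 2D data are hypothetical, and the loop favors readability over speed.

```python
import numpy as np

def average_linkage(X, n_clusters):
    """Merge clusters by minimum average pairwise distance until n_clusters remain."""
    n = len(X)
    # Pairwise Euclidean distances between all samples.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(n)]  # start with one cluster per sample
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance over all cross-cluster pairs, as in (11.3).
                link = d[np.ix_(clusters[a], clusters[b])].mean()
                if link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Toy profiles (assumption): two well-separated groups of three samples each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = average_linkage(X, n_clusters=2)
```

Stopping at `n_clusters` equal to the number of patients mirrors the stopping rule described above; recording the sequence of merges instead would give the full dendrogram.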
11.4.1.2.2 The Measure of Monoclonality

Hierarchical clustering by itself gives only a visual, subjective impression of monoclonality (based on the tendency of samples from the same patient to be clustered together). However, we can go further and give a quantitative measure of monoclonality. Specifically, grouping "by patient" and "by cluster" imposes two different partitions of the same set of samples. We quantify the agreement between these two partitions by the "minimal mismatch error" measure, which was originally proposed to evaluate the performance of clustering algorithms [26]. The minimal mismatch error is defined as the minimal number of mismatched samples that can be achieved by matching clusters to patients. Mathematically, for samples $X = \{x_i \mid i = 1, 2, \ldots, N\}$ from $K$ patients, hierarchical clustering (HC) partitions the samples into $K$ clusters. For $x_i$, let its patient label be $y_i \in \{1, 2, \ldots, K\}$ and its cluster label be $w(x_i) \in \{1, 2, \ldots, K\}$. Since clustering is unsupervised, the cluster labels are arbitrary and can be permuted, for example, with samples in the $k$th cluster assigned the $k'$th cluster label. Letting the permutation operator be $\pi$, the permuted cluster labels are $\pi(w(x_i))$, $i = 1, 2, \ldots, N$. Thus, the minimal mismatch error $E(W(X), Y)$ between $W(X) = \{w(x_i) \mid i = 1, 2, \ldots, N\}$ and $Y = \{y_i \mid i = 1, 2, \ldots, N\}$ is defined by choosing the permutation $\pi$ that minimizes the matching error between $W(X)$ and $Y$, as follows:

\[
E(W(X), Y) = \min_{\pi} \sum_{k=1}^{K} \sum_{i=1}^{N_k} 1\{\pi(w(x_i)) \ne y_i\} \tag{11.6}
\]

where $N_k$ is the number of samples from the $k$th patient (the inner sum runs over these samples), and $1\{\cdot\}$ is the indicator function, which returns 1 if $\pi(w(x_i)) \ne y_i$, and 0 otherwise.
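As a hedged illustration, (11.6) can be evaluated by treating the matching of clusters to patients as a linear assignment problem. The sketch below assumes SciPy's `linear_sum_assignment` as the solver; the function name `minimal_mismatch_error` and the example labels are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def minimal_mismatch_error(cluster_labels, patient_labels):
    """Smallest number of mismatched samples over all cluster-to-patient matchings."""
    cluster_labels = np.asarray(cluster_labels)
    patient_labels = np.asarray(patient_labels)
    clusters = np.unique(cluster_labels)
    patients = np.unique(patient_labels)
    # agreement[c, p] = number of samples in cluster c that come from patient p
    agreement = np.array([[np.sum((cluster_labels == c) & (patient_labels == p))
                           for p in patients] for c in clusters])
    # Maximizing the matched samples is the same as minimizing the mismatches.
    rows, cols = linear_sum_assignment(-agreement)
    return int(len(patient_labels) - agreement[rows, cols].sum())

# Example labels (assumption): seven samples from three patients.
patients = [0, 0, 0, 1, 1, 2, 2]
clusters = [2, 2, 1, 1, 1, 0, 0]
error = minimal_mismatch_error(clusters, patients)
perfect = minimal_mismatch_error([1, 1, 0, 0], [0, 0, 1, 1])
```

In the example, one sample of patient 0 falls into the cluster matched to patient 1, so the error is 1, while a clustering that differs from the patient partition only by label names gives 0.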
Calculation of (11.6) appears to be a daunting computational task, even without considering the repeated calculations that will be required within a random permutation test (see the next section). On closer inspection of (11.6), however, it can be recognized as an instance of the combinatorial assignment problem, whose optimal solution can be found via the Hungarian algorithm in time polynomial in the number of patients [27].

11.4.1.2.3 Assessing Statistical Significance of Monoclonality

Based on the minimal mismatch error $E(W(X), Y)$, we can assess the statistical significance of monoclonality. The null hypothesis is that the CNA patterns of samples from the same patient are no more similar to each other than those from other patients, that is, $E(W(X), Y)$ is not significantly small. To assess the statistical significance, we approximate the null distribution by applying random permutation testing, rather than by making any (potentially unjustified) statistical assumptions about $E(W(X), Y)$. For each realization, we randomly permute the patient labels of samples and recalculate $E(W(X), Y)$ based on the permuted patient labels.

11.4.1.3 Visualization of Monoclonality

Hierarchical clustering already gives some visual illustration of how well samples from the same patient are clustered together. To better visualize the pattern of monoclonality, we display the CNA patterns via the top discriminatory components in 3D Euclidean space, as extracted by discriminatory component analysis under the weighted pairwise Fisher criterion (wFC-DCA) [17]. This approach is designed to limit the influence of outlier classes on the final discriminatory projection, making it more robust than classical multiclass linear discriminant analysis. The details of wFC-DCA are introduced in Section 11.3, and the results are shown in Section 11.5.

11.4.2 A Two-Stage Analytical Method for Testing the Origin of Cancer
Identifying the origin of tumors is a crucial yet difficult task in cancer research. So far, significant efforts have been made, through experimental or computational approaches, to reveal the origin of high-grade ovarian serous carcinoma. Since tumor cells evolve from some original, normal cells, it is plausible to suppose that the tumor cells and the normal cells from which they originate should have biological traits in common. We look for hints along this direction by comparing gene expression profiles and related network topologies of tumor cells and candidate "origin" normal cells. Based on gene expression data, we propose a statistical machine learning method for testing the origin of HG by reference to normal tissues from several organs, which are the origin candidates. Although some previous studies [5, 6] have proposed to compare the gene expression profiles of HG and normal organ tissues, their designs have limitations: they fail to measure the similarity between HG and normal organs directly based on expression values [6], make only an incomplete comparison between HG and normal organs [5], and ignore tissue contamination, which may bring noticeable bias to their results. The proposed method uses two complementary pipelines, which converge on the same goal: (1) by comparing expression profiles via feature selection and classification, the origin of the tumor is decided as the candidate with the expression profile most similar to the tumor on the subset of genes that are (likely) unaffected by the tumor; (2) by representing topological changes using a network distance measure, the normal cell with the least network distance to the tumor is concluded to be the origin. An experiment on ovarian cancer will be demonstrated, with the results shown to be consistent with recent biological hypotheses.

11.4.2.1 Basic Assumptions

Figure 11.3 describes the problem of identifying HG's origin. There has been controversy, over the past decades, concerning which organ is the origin of HG [5, 6, 8, 28]. Existing studies suggest three candidates: the ovarian surface epithelium (OSE), the FT, and the endometrium (Endo), together comprising the epithelial cells of gynecological organs. Our task is to infer which candidate is most probably the origin, based on gene expression profiles of HG tissue and normal tissues from the three candidate organs. To identify the origin of HG, two well-accepted biological concepts are employed as cornerstones of our statistical analysis. The first is that a cancer will carry the unique gene expression signature of its normal origin on a subset of its genes (let us denote this gene set as A); the second is that there is a common gene subset (let us denote it as set B) over which every normal organ will have a unique gene expression pattern (i.e., tissue type-specific transcriptome), well-distinguishing it from other normal organs (cell differentiation [29]).

Figure 11.3 Identification of the origin of HG.
From the first concept, the candidate that is the origin of HG will have a gene expression profile similar to that of HG on gene subset A; from the second concept, the candidates will have gene expression profiles that differ from each other on gene subset B. Thus, $A \cap B$ is the ideal gene set to tell which candidate is the true origin, comparing the gene expression pattern of HG on this gene set against the patterns of each of the origin candidates. Put simply, over $A \cap B$, we can identify the origin of HG as the candidate possessing the most similar expression pattern to HG.
Unfortunately, it is difficult in practice to specify the gene set A, because we have no a priori knowledge about which subset of genes carries the signature of HG's origin. So it is difficult to estimate $A \cap B$. However, we can find an alternative by further considering the available underlying biological knowledge. Because $A \cap B$ is the ideal gene set that makes HG uniquely close to its true origin, and $B = (A \cap B) \cup (\bar{A} \cap B)$, if $\bar{A} \cap B$ does not make HG closer to its false positive candidates, we can use B as a reasonable surrogate for $A \cap B$. Consider $\bar{A}$ and B, where $\bar{A}$ denotes the genes affected by cancer, and B denotes the gene set discriminating the candidates. Note that, in principle, neither $\bar{A}$ nor B should have strong association with any individual candidate organ, that is, neither of these gene sets should bias HG to be closer to any individual candidate. Thus, $\bar{A} \cap B$ should not bias HG toward any of the candidates. Accordingly, it is reasonable to use the gene set B (which can be estimated) to test the origin of HG.

11.4.2.2 Tissue Heterogeneity Correction

Gene expression profiling is widely applied in cancer research. Physicians believe that expression profiling promises to be an important tool for identifying the unknown origin of cancer [18]. However, tissue contamination may occur while acquiring gene expression data, resulting in a mixture of the tissue of interest with its adjacent tissues, which may be of different phenotypes [30, 31]. Thus, the gene expression pattern of a tissue sample may represent a mixture of the gene expression patterns of the multiple types of tissues that are present, which will affect subsequent analysis. Based on observed gene expression data, we use in silico tools to correct this possible tissue heterogeneity [30, 31]. By estimating the mixing matrix of tissue contamination, we can approximately recover the pure-tissue gene expression profiles from the observed gene expression profiles.
11.4.2.3 Stage 1: Feature Selection and Classification

Gene expression data are high-dimensional, especially compared to the usually limited sample size. The gene set that differentiates the origin of HG from other normal organs may be only a small portion of the whole gene set. Moreover, because of the genomic damage in cancer cells, HG will only show similarity to its origin over a subset of these informative genes, and the similarity produced by this (small) subset may easily be "buried" among the huge number of uninformative and noisy genes. In sum, the curse of dimensionality [20] arises as a major issue here. By experimental assessment, we can also observe that gene expressions of HG samples do not resemble those of any normal organ. Therefore, a gene selection strategy is necessary for our subsequent analysis. Based on our discussion in Section 11.4.2.1, a one-way ANOVA F test, as introduced in Section 11.3, was applied to select a gene set that can significantly differentiate the candidate normal organs from each other. The significance threshold is 0.05 after Bonferroni correction, yielding 134 selected probe sets.

Based on the selected genes, we measure the similarity between HG and the three candidates of origin (OSE, FT, and Endo). Here we apply two classifiers (OVRSVM and Nearest-Mean, as introduced in Section 11.3) to measure such similarity. We use the testing accuracy of each of the two classifiers as a summary statistic and perform random permutation testing to assess the statistical significance of each classification result. The testing accuracy is obtained via leave-one-out cross-validation. In each permutation, we randomly permute the label for each sample and, based on the label-permuted data, we retrain the classifiers, rerun cross-validation, and recalculate the testing accuracy. The null distribution of the summary statistic is realized by tens of thousands of permutations, and the significance is inferred by comparing the observed testing accuracy (without permutation) to the null distribution.

11.4.2.4 Stage 2: Transcriptional Network Comparison

While all existing literature has focused on the comparison of gene expression values between the tumor and the suspected origins, we also perform gene regulatory network comparison using network inference and evaluation tools. Gene networks are very complex, context-specific, and subject to dynamic changes, with different network topology inference methods potentially leading to very different inference results. Instead of identifying the underlying structures, detection of network differences will give sparser and more accurate results. We use DDN to identify the most significantly different network topologies between the tumor and the candidate origins, to reveal how different their gene regulatory networks are. This computational approach is intended to provide clues as to which normal organ is the true origin, that is, where HG starts to evolve. It is reasonable to hypothesize that HG tumor cells may retain certain gene network structure characteristics from their normal ancestral cells, and that gene network topological differences between HG and its real origin will be less pronounced than with nonorigin organs.
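As an illustration of the Stage 1 procedure (gene selection by the F statistic of (11.2) with Bonferroni correction, then a permutation test on leave-one-out accuracy), the following sketch uses a Nearest-Mean classifier on synthetic data. Everything specific here is an assumption: the function names, the synthetic data set, the use of only 200 permutations to keep the example fast, and the add-one form of the permutation P value. How the HG samples enter the summary statistic is not spelled out above, so the sketch works on a generic labeled data set.

```python
import numpy as np
from scipy.stats import f as f_dist

def anova_f(X, y):
    """Per-gene (column) F statistic of (11.2) and its upper-tail P value."""
    classes = np.unique(y)
    n, k = len(y), len(classes)
    grand = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        between += len(Xc) * (mu - grand) ** 2
        within += ((Xc - mu) ** 2).sum(axis=0)
    F = (between / (k - 1)) / (within / (n - k))
    return F, f_dist.sf(F, k - 1, n - k)

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a Nearest-Mean classifier."""
    hits = 0
    idx = np.arange(len(y))
    for i in idx:
        mask = idx != i
        classes = np.unique(y[mask])
        means = np.array([X[mask][y[mask] == c].mean(axis=0) for c in classes])
        pred = classes[((means - X[i]) ** 2).sum(axis=1).argmin()]
        hits += int(pred == y[i])
    return hits / len(y)

def permutation_p_value(X, y, n_perm=200, seed=0):
    """Compare observed accuracy with accuracies under permuted labels."""
    rng = np.random.default_rng(seed)
    observed = loo_accuracy(X, y)
    null = np.array([loo_accuracy(X, rng.permutation(y)) for _ in range(n_perm)])
    # Add-one correction (an assumption of this sketch) avoids a P value of zero.
    return observed, (1 + np.sum(null >= observed)) / (1 + n_perm)

# Synthetic data (assumption): 3 classes x 10 samples, 50 genes, 2 informative.
rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 10)
X = rng.normal(size=(30, 50))
X[:, 0] += 10 * y
X[:, 1] -= 10 * y

F, p_values = anova_f(X, y)
selected = np.flatnonzero(p_values * X.shape[1] < 0.05)  # Bonferroni at 0.05
observed, p_value = permutation_p_value(X[:, selected], y)
```

With real data, the number of permutations would be much larger (the text uses tens of thousands), and gene selection would be performed on the normal tissue samples only, as described in Section 11.4.2.1.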
11.5 Experiments and Results

11.5.1 Monoclonality

11.5.1.1 Testing Existence of Monoclonality

We isolated DNA from 94 anatomically separate cancer sites in 30 men who died from metastatic prostate cancer (Figure 11.4a) and analyzed it by chromosomal metaphase-based comparative genomic hybridization (CGH), Affymetrix Genome-Wide Human SNP (single-nucleotide polymorphism) Array 6.0 analysis (Affy6), or both. We studied 85 sites from 29 of the patients by CGH. To assess possible clonal relationships of metastasizing cells, we studied two or more anatomically separate cancerous lesions by CGH in 24 patients (80 samples, ranging from two to eight samples per patient). After removing uninformative loci through preprocessing, we obtained 218 loci across the genome that were affected by either copy number gain or loss. We analyzed CNA profiles on these 218 loci, for the 80 samples, by unsupervised hierarchical clustering, as introduced in Section 11.4.1.2 (Figure 11.4b). Among the 24 patients, 15 (63%) have all member samples in a common cluster, that is, there are 15 clusters that perfectly correspond to 15 patients, indicating monoclonality in the majority of patients.

Figure 11.4 DNA CNA profile in prostate cancer samples. (a) Anatomic sample type indicators for 85 cancerous DNA samples analyzed by CGH and 58 cancerous samples analyzed by Affy6 superimposed on posterior bone scan views for each patient. Legend indicates color and number coding of anatomic origin categories. Patients for whom two or more anatomically distinct prostate cancer samples were studied are denoted by colored symbols, with the patient's number superimposed. (b) Unsupervised hierarchical clustering dendrogram based on 218-locus metastatic prostate cancer CGH dataset for 80 samples from 24 patients for whom more than one anatomically separate cancerous DNA sample was available. (c) Unsupervised hierarchical clustering of Affy6 CNA data from 58 anatomically separate metastatic prostate cancer sites in 14 patients. All samples from each of 14 patients cluster together. (d) wFC-DCA of CGH data from 80 metastatic prostate cancer samples from 24 patients projected in 3D Euclidean space. (e) wFC-DCA of Affy6 data for 58 metastatic prostate cancer samples from 14 patients projected in 3D Euclidean space.
Figure 11.5 Histogram of minimal mismatch error by random permutation test of (a) CGH data and (b) Affy6 data. Red line denotes the minimal mismatch error of the observed label assignment; blue bar denotes the matching error of random label assignment. P value