Statistical Genomics: Methods and Protocols (Methods in Molecular Biology, 1418) 1493935763, 9781493935765

This volume expands on statistical analysis of genomic data by discussing cross-cutting groundwork material, public data

143 103 14MB

English Pages 429 [419] Year 2016

Report DMCA / Copyright

DOWNLOAD PDF FILE

Table of contents :
Preface
Contents
Contributors
Part I: Groundwork
Chapter 1: Overview of Sequence Data Formats
1 Introduction
2 Results
2.1 FASTQ: Format of Raw Short Read Files
2.2 FASTA: Format of Reference Genome Files
2.3 BAM/SAM: Format of Alignment (Mapping) Data Files
2.3.1 Header Section
@HD Line
@SQ Lines
@RG Lines
@PG Lines
@CO Line
2.3.2 Alignment Section
2.4 GFF/GTF and BED: Formats for Genomic Feature Annotation Files
2.5 VCF: A Special File Format for Genomic Variations
3 Conclusions
References
Chapter 2: Integrative Exploratory Analysis of Two or More Genomic Datasets
1 Introduction
2 Methods
2.1 Analysis of Two Datasets Using Co-inertia Analysis
2.1.1 Co-inertia Analysis
2.2 Case Study I: Integration of NCI-60 Cell Line Transcriptomic and Proteomic Data
2.2.1 PCA of Individual Datasets
2.2.2 CIA of Both Datasets
2.2.3 CIA: Exploring the Variables
2.3 Exploration of Three or More Datasets
2.3.1 Multiple Co-inertia Analysis
2.3.2 Case Study 2: Cross Comparison of Gene Expression Data Obtained on Four Different Microarray Platforms
2.3.3 Example2: Integrative Analysis of Transcriptome, Proteome, and Phosphoproteome of Stem Cell Lines
3 Session Info
References
Chapter 3: Study Design for Sequencing Studies
1 Introduction
2 Statistical Considerations
2.1 Statistical Background
2.1.1 Randomization
2.1.2 Sampling Distributions
2.1.3 Statistical Testing
2.1.4 Statistical Power
2.1.5 Confounding
2.2 Statistical Principles of Design
2.2.1 Pilot Experiments
2.2.2 Protocols
2.2.3 Blocking
2.2.4 Replication
2.3 Design of Sequencing Studies
2.3.1 Treatment Design
2.3.2 Sample Design
2.3.3 Sample Preparation Design
3 Sequencing Design
3.1 Library Construction
3.2 Library Validation
3.3 Read Length
3.4 Multiplexing
3.5 Sequencing Depth
3.6 Targeted Sequencing
4 Conclusion
References
Chapter 4: Genomic Annotation Resources in R/Bioconductor
1 Introduction
2 Materials
3 Methods
3.1 Using the AnnotationHub
3.2 OrgDb Objects
3.3 TxDb Objects
3.4 OrganismDb Objects
3.5 BSgenome Objects
3.6 biomaRt
3.7 Creating Annotation Objects
4 Notes
References
Part II: Public Genomic Data
Chapter 5: The Gene Expression Omnibus Database
1 Introduction
2 Methods
2.1 Retrieve a Specific GEO Record
2.2 Quick Search Using Keywords
2.3 Advanced Search Using Structured Queries and Filters
2.4 Search Programmatically
2.5 Query with a Nucleotide Sequence
2.6 DataSet Analysis Tools
2.7 GEO Profiles Analysis Tools
2.8 Analyze with GEO2R
2.9 Visualize Data as a Genome Track
2.10 Download GEO Data
3 Methods/Conclusion
4 Notes
References
Chapter 6: A Practical Guide to The Cancer Genome Atlas (TCGA)
1 Introduction to TCGA
1.1 Data Use Restrictions
1.2 Future Directions in Cancer Genomics
1.2.1 Exceptional Responders Initiative
1.2.2 ALCHEMIST Clinical Trial
1.2.3 Cancer Driver Discovery Project
2 Using the TCGA Data Coordinating Center (DCC) Portal
2.1 Getting Help
2.2 TCGA Data Access and Use
2.3 Data Types and Data Levels
2.4 TCGA Samples and Identifiers
2.5 Data Archives and Files
2.6 Querying and Downloading TCGA Data
2.6.1 Data Matrix
Data Matrix Application/Tutorial
2.6.2 File Search
File Search Application/Tutorial
2.6.3 Bulk Download
Bulk Download Application/Tutorial
2.6.4 HTTP Directories
Open Access Directories/Controlled Access Directories/Tutorial
2.7 Other Key Features and Resources
2.7.1 MAF Search Facility
2.7.2 Annotations Database
2.7.3 Biospecimen Metadata Browser
2.7.4 Clinical Data Dictionary
2.7.5 Publication Pages
2.7.6 Web Services
3 Accessing TCGA Sequence Data
3.1 Cancer Genomics Hub (CGHub) Overview
3.2 Query Metadata
3.2.1 CGHub Metadata Browser
3.2.2 cgquery
3.3 Steps to Download BAM/FASTQ Files
3.3.1 Get a dbGAP Authorization
3.3.2 Get a CGhub Keyfile
3.3.3 Get and Install GeneTorrent
3.3.4 Retrieve BAM/FASTQ Files
Scenario One
Scenario Two
3.4 Safeguard Raw Sequence Files
3.4.1 Regarding Redistribution of Raw Sequence Files
3.4.2 Publication Moratorium
References
Part III: Applications
Chapter 7: Working with Oligonucleotide Arrays
1 Introduction
1.1 Array Types and Applications
1.2 The Need for Flexible Software
1.3 Handling Oligonucleotide Microarrays Data
1.3.1 Data Import and Manipulation of Probe-Level Data
1.3.2 Preprocessing
Background Correction
Normalization
Summarization
1.3.3 Preprocessing Different Array Types
1.3.4 Using Preprocessed Data for Additional Analyses
2 Materials
3 Methods
3.1 Preprocessing Affymetrix Expression Arrays
3.2 Preprocessing Affymetrix Gene ST Arrays
3.3 Preprocessing Affymetrix SNP Arrays
4 Notes
References
Chapter 8: Meta-Analysis in Gene Expression Studies
1 Introduction
2 Materials
2.1 Microarray Dataset Identification
2.1.1 The Gene Expression Omnibus
2.2 Dataset Preparation
2.2.1 Curation
2.2.2 Preprocessing
2.2.3 Batch Effects
2.2.4 Gene Collapsing
2.2.5 Pathway or Gene Set Collapsing
2.2.6 Duplicate Checking
2.2.7 Gene Pre-filtering
3 Methods
3.1 Fixed Effects Meta-analysis
3.2 Random Effects Meta-analysis
3.3 Fixed vs. Random Effects Meta-analysis
3.4 Rank-Based Meta-analysis
3.5 Other Approaches to Synthesis
3.6 Case Study
3.7 Extensions to Predictive Modeling
4 Notes
References
Chapter 9: Practical Analysis of Genome Contact Interaction Experiments
1 Introduction
2 Materials
3 Methods
3.1 Getting Started with Genome Contact Interaction Data Analysis
3.2 Identifying Statistical Significant Hi-C Contacts Using a Model-Based Approach
3.3 Detecting Genome Contact Interaction Differences between Two Conditions
4 Notes/Conclusion
References
Chapter 10: Quantitative Comparison of Large-Scale DNA Enrichment Sequencing Data
1 Introduction
2 Materials
2.1 Installation of MEDIPS
2.2 Preprocessing of the Sequencing Data
3 Methods
3.1 Parameter Settings and Quality Control
3.1.1 Stacked Reads
3.1.2 Extend Reads
3.1.3 Window Size
3.1.4 Coverage Saturation Analysis
3.2 Importing Alignment Data
3.3 Differential Coverage Analysis
3.4 Merging Neighboring DERs and Annotation
3.5 Notes
3.6 R Script to Download the H3K4me2 ChIP-seq Data from SRA
3.7 Shell Script for the Alignment of the H3K4me2 ChIP-seq Data (Fastq Files)
3.8 R Script for Quality Control and Differential Enrichment Analysis of the Aligned H3K4me2 ChIP-seq Data Comparing TH2 and N...
References
Chapter 11: Variant Calling From Next Generation Sequence Data
1 Introduction
1.1 Statistical Methods in Variant Analysis
1.2 Choice of Analysis Software
1.3 Variant Filtering and Accuracy
1.4 Software Pipelines and Reproducibility
2 Materials
3 Methods
3.1 Alignment of Reads
3.1.1 FASTQ Files
3.1.2 Running Novoalign
3.2 Removal of PCR Duplicates
3.2.1 Duplicate Removal Option 1
3.2.2 Duplicate Removal Option 2
3.3 Variant and Genotype Calling
3.3.1 Variant Calling with Freebayes
3.3.2 Variant Calling with bam2mpg
3.3.3 Compression and Random Access with VCF Files
3.4 Annotation of VCF Files
3.5 Viewing and Filtering Variants
3.6 Supplementary Protocols
3.6.1 Obtaining Sequence Data from NCBI
3.6.2 Downloading a Genome´s Reference Sequence
3.6.3 Removing Alternate Haplotypes and Masking Pseudo-Autosomal Regions in the Reference
3.6.4 Formatting a Novoalign Database
4 Notes
References
Chapter 12: Genome-Scale Analysis of Cell-Specific Regulatory Codes Using Nuclear Enzymes
1 Introduction
2 Analysis of Chromatin Accessibility
2.1 Assay Protocols, Biases, and Data Reproducibility
2.2 Building a Profile for Data Visualization in a Genome Browser
2.3 Region Detection Algorithm
2.4 Region Annotation and Integrative Analyses
2.5 Motif Analysis on Hotspots
3 Analysis of TF Footprints
3.1 Data Requirement
3.2 Cut Count Profiles
3.3 Artifacts from the Enzyme Bias on Sequence Patterns at Cut Sites
3.4 Cut Count Analysis When Matching ChIP-seq Data Are Available for TFs of Interest
3.5 Detection of Putative Footprints and Limitations in Inference of TF Occupancy In Vivo
3.6 Sequence Motif Analysis
References
Part IV: Tools
Chapter 13: NGS-QC Generator: A Quality Control System for ChIP-Seq and Related Deep Sequencing-Generated Datasets
1 Introduction
2 Materials
2.1 Data Description: Preliminary Datasets Quality Evaluation (FastQC)
2.2 Data Description: Datasets Format for NGS-QC Generator Analysis
3 Methods
3.1 Preliminary Sequencing Quality Assessment
3.1.1 Sequence Quality
3.1.2 Clonal Reads
3.1.3 Adapter Contamination
3.2 Exploring the NGS-QC Generator Portal
3.3 Accessing to the NGS-QC Generator Tool via the Dedicated Galaxy Instance
3.3.1 An Overview of the Galaxy Interface
3.3.2 Performing a Global Quality Control Analysis
3.3.3 Visualize Local Enrichment Patterns in the Context of Their Quality
3.4 Exploring the NGS-QC Generator Database
3.4.1 A General Overview
3.4.2 The NGS-QC Database as a Source of Information for Providing Potential Hints to Interpret ChIP-Sequencing Assays with Lo...
3.4.3 Identifying ChIP-seq Grade Antibodies from the NGS-QC Database Content
4 Conclusions and Perspectives
5 Notes
References
Chapter 14: Operating on Genomic Ranges Using BEDOPS
1 Introduction
2 Methods
2.1 Genomic Data Formats
2.2 Pipelining with Streams
3 Core Functionality
3.1 Set-Based Operations
3.2 Proximity
3.3 Statistics and Annotations
4 Working Efficiently with Big Data
4.1 Parallelization
4.2 Sorting
4.3 Compression
5 Further Reading
References
Chapter 15: GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality
1 Introduction
1.1 Brief History
2 Computational Features
2.1 Using Global Information
2.2 Known Splicing Information
2.3 Tolerance to SNPs, Bisulfite Sequencing, RNA-Editing, and PAR-CLIP Sequencing
2.4 Standardized Indel Positions
2.5 Trimming Ends
2.6 Ambiguous Splicing
2.7 Chimeric Alignments and Distant Splicing
2.8 Overlapping Alignments
2.9 Suboptimal Alignments
2.10 Ranking Alignments and Eliminating Duplicates
2.11 Circular Chromosomes
2.12 Alternate Scaffolds
3 Usage
3.1 Downloading and Compilation
3.2 Pre-processing a Genome
3.3 Auxiliary SNP and Splicing Information
3.4 Alignment
3.5 Parallel Computation
4 Data Structures and Algorithms
4.1 Linear Genome: Vertical Storage Format
4.2 Genomic Hash Tables: Increased Specificity Through Compression
4.3 Handling Large Genomes
4.4 Segmental Hash Tables
4.5 Diagonalization in GMAP
4.6 Resolving Gaps in GMAP: SIMD Dynamic Programming
4.7 ESAs: Greedy Match-and-Extend
4.8 Integration of GMAP Algorithm into GSNAP
4.9 Segment Chaining
5 User Features and Downstream Analysis
5.1 Output Formats
5.2 Alignment Types and Split Output
5.3 Utility Programs
5.4 Integration with R/Bioconductor
5.5 Analyzing Datasets
5.6 Biological Inference
6 Discussion
References
Chapter 16: Visualizing Genomic Data Using Gviz and Bioconductor
1 Introduction
2 Materials
2.1 How to Install
2.2 Data Set Description
3 Methods
3.1 Track Objects
3.2 Display Parameters
3.3 A Real World Example
4 Notes
References
Chapter 17: Introducing Machine Learning Concepts with WEKA
1 Data Mining and Machine Learning
1.1 Structured vs. Unstructured Data
1.2 Supervised vs. Unsupervised Learning
1.3 Prediction: Classification and Regression
1.4 Clustering and Associations
1.5 Nominal and Numeric Data
2 WEKA: Getting Started
2.1 Loading Data
2.2 Preprocessing Data
2.3 Choosing a Classifier
2.4 Training and Testing
2.5 Running the Experiment
2.6 Understanding the Results
3 Approaching a Bioinformatics Problem
4 Other Kinds of Learning
5 Concluding Remarks
References
Chapter 18: Experimental Design and Power Calculation for RNA-seq Experiments
1 Introduction
2 Materials
3 Method
3.1 Obtaining PROPER
3.2 Setting Up Simulation Scenario
3.2.1 Using PROPER Provided Simulation Parameters
3.2.2 Using User Identified Data Source to Set Up Simulation Basis
3.3 Running Simulation and DE Detection
3.4 Power Assessment and Comparison
3.5 More Samples or Deeper Sequencing
3.6 To Filter or Not to Filter
4 Notes
References
Chapter 19: It´s DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Meth
1 Introduction
2 Basic Theory of the Differential Expression Analysis
2.1 Overview
2.2 From Reads to Genewise Counts
2.3 Modelling Variability in a Quasi-Negative Binomial Framework
2.4 Introducing the Generalized Linear Model
2.5 Normalizing for Composition Biases
3 Tutorial with Real Data
3.1 Description of the Data Set
3.2 Read Alignment and Processing
3.3 Count Loading and Annotation
3.4 Filtering to Remove Low Counts
3.5 Normalization for Composition Bias
3.6 Exploring Differences Between Libraries
3.7 Dispersion Estimation
3.8 Testing for Differential Expression
4 Advanced Usage
4.1 Analysis of Variance
4.2 Complicated Contrasts
4.3 Gene Ontology Enrichment Analysis
4.4 Rotation Gene Set Tests
4.5 Session Information
References
Index
Recommend Papers

Statistical Genomics: Methods and Protocols (Methods in Molecular Biology, 1418)
 1493935763, 9781493935765

  • 0 0 0
  • Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up
File loading please wait...
Citation preview

Methods in Molecular Biology 1418

Ewy Mathé · Sean Davis Editors

Statistical Genomics Methods and Protocols

METHODS

IN

MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

Statistical Genomics Methods and Protocols

Edited by

Ewy Mathé Biomedical Informatics, College of Medicine, Ohio State University, Columbus, OH, USA

Sean Davis National Institutes of Health, National Cancer Institute, Bethesda, MD, USA

Editors Ewy Mathe´ Biomedical Informatics, College of Medicine Ohio State University Columbus, OH, USA

Sean Davis National Institutes of Health National Cancer Institute Bethesda, MD, USA

ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-4939-3576-5 ISBN 978-1-4939-3578-9 (eBook) DOI 10.1007/978-1-4939-3578-9 Library of Congress Control Number: 2016933669 # Springer Science+Business Media New York 2016 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. Printed on acid-free paper This Humana Press imprint is published by Springer Nature The registered company is Springer Science+Business Media LLC New York

Preface Statistical Analysis of Genomic Data is, indeed, a very broad topic. We have attempted in this volume to provide chapters with cross-cutting groundwork materials, public data repositories, common applications of statistical analysis in genomics, and some representative toolsets for operating on genomic data. While we cannot be comprehensive in a single volume, we have tried to provide a breadth of both applications and tools. The authors of the individual chapters have largely focused on practical aspects of their topics, as we feel that application is an integral part of learning about statistical analysis of genomic data. More specifically, the volume is divided into four parts. In the first part, we have included overview material and resources that can be applied across topics later in the book. In the second part, a couple of prominent public repositories for genomic data are covered in some depth. In the third part, several different biological applications of statistical genomics are presented. In the fourth and last part, software tools that can be used to facilitate ad hoc analysis and data integration are highlighted. Finally, we thank the chapter authors for the generosity of their time and insight in preparing their excellent contributions. Ewy Mathe´ Sean Davis

Columbus, OH, USA Bethesda, MD, USA

v

Contents Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Contributors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

PART 1

GROUNDWORK

1 Overview of Sequence Data Formats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hongen Zhang 2 Integrative Exploratory Analysis of Two or More Genomic Datasets . . . . . . . . . Chen Meng and Aedin Culhane 3 Study Design for Sequencing Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Loren A. Honaas, Naomi S. Altman, and Martin Krzywinski 4 Genomic Annotation Resources in R/Bioconductor . . . . . . . . . . . . . . . . . . . . . . . Marc R.J. Carlson, Herve´ Page`s, Sonali Arora, Valerie Obenchain, and Martin Morgan

PART II

3 19 39 67

PUBLIC GENOMIC DATA

5 The Gene Expression Omnibus Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Emily Clough and Tanya Barrett 6 A Practical Guide to The Cancer Genome Atlas (TCGA) . . . . . . . . . . . . . . . . . . . Zhining Wang, Mark A. Jensen, and Jean Claude Zenklusen

PART III

v ix

93 111

APPLICATIONS

7 Working with Oligonucleotide Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Benilton S. Carvalho 8 Meta-Analysis in Gene Expression Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Levi Waldron and Markus Riester 9 Practical Analysis of Genome Contact Interaction Experiments . . . . . . . . . . . . . Mark A. Carty and Olivier Elemento 10 Quantitative Comparison of Large-Scale DNA Enrichment Sequencing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthias Lienhard and Lukas Chavez 11 Variant Calling From Next Generation Sequence Data . . . . . . . . . . . . . . . . . . . . . Nancy F. Hansen 12 Genome-Scale Analysis of Cell-Specific Regulatory Codes Using Nuclear Enzymes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Songjoon Baek and Myong-Hee Sung

vii

145 161 177

191 209

225

viii

Contents

PART IV 13

14

15

16 17 18

19

TOOLS

NGS-QC Generator: A Quality Control System for ChIP-Seq and Related Deep Sequencing-Generated Datasets . . . . . . . . . . . . . . . . . . . . . . . . Marco Antonio Mendoza-Parra, Mohamed-Ashick M. Saleem, Matthias Blum, Pierre-Etienne Cholley, and Hinrich Gronemeyer Operating on Genomic Ranges Using BEDOPS . . . . . . . . . . . . . . . . . . . . . . . . . . Shane Neph, Alex P. Reynolds, M. Scott Kuehn, and John A. Stamatoyannopoulos GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality . . . . . . . . . . . . . . . . . . . . . . Thomas D. Wu, Jens Reeder, Michael Lawrence, Gabe Becker, and Matthew J. Brauer Visualizing Genomic Data Using Gviz and Bioconductor . . . . . . . . . . . . . . . . . . Florian Hahne and Robert Ivanek Introducing Machine Learning Concepts with WEKA . . . . . . . . . . . . . . . . . . . . . Tony C. Smith and Eibe Frank Experimental Design and Power Calculation for RNA-seq Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhijin Wu and Hao Wu It’s DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR . . . . . . . Aaron T.L. Lun, Yunshun Chen, and Gordon K. Smyth

Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

243

267

283

335 353

379

391 417

Contributors NAOMI S. ALTMAN  Department of Statistics and Huck Institutes of Life Sciences, The Pennsylvania State University, PA, USA SONALI ARORA  Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA SONGJOON BAEK  Laboratory of Receptor Biology and Gene Expression, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA TANYA BARRETT  National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA GABRIEL BECKER  Genentech, South San Francisco, CA, USA MATTHIAS BLUM  Equipe Labellise´e Ligue Contre le Cancer, Department of Functional Genomics and Cancer, Institut de Ge´ne´tique et de Biologie Mole´culaire et Cellulaire (IGBMC)/CNRS/INSERM/Universite´ de Strasbourg, Illkirch Cedex, France MATTHEW BRAUER  Genentech, South San Francisco, CA, USA MARC R.J. CARLSON  Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA; Seattle Children’s Research Institute, Seattle, WA, USA MARK A. CARTY  Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA; Memorial Sloan Kettering Cancer Center, New York, NY, USA BENILTON S. CARVALHO  Brazilian Institute of Neuroscience and Neurotechnology (BRAINN) and Department of Statistics, University of Campinas, Campinas, Sa˜o Paulo, Brazil LUKAS CHAVEZ  German Cancer Research Center, Heidelberg, Germany YUNSHUN CHEN  Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia PIERRE-ETIENNE CHOLLEY  Equipe Labellise´e Ligue Contre le Cancer, Department of Functional Genomics and Cancer, Institut de Ge´ne´tique et de Biologie Mole´culaire et Cellulaire (IGBMC)/CNRS/INSERM/Universite´ de Strasbourg, Illkirch Cedex, France EMILY CLOUGH  National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA AEDIN CULHANE  Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA OLIVIER ELEMENTO  Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA EIBE FRANK  Department of Computer Science, University of Waikato, Hamilton, New Zealand HINRICH GRONEMEYER  Equipe Labellise´e Ligue Contre le Cancer, Department of Functional Genomics and Cancer, Institut de Ge´ne´tique et de Biologie Mole´culaire et Cellulaire (IGBMC)/CNRS/INSERM/Universite´ de Strasbourg, Illkirch Cedex, France FLORIAN HAHNE  Novartis Institute for Biomedical Research, Basel, Switzerland NANCY F. HANSEN  National Human Genome Research Institute, Rockville, MD, USA LOREN A. HONAAS  Department of Biology, The Pennsylvania State University, Wenatchee, WA, USA

ix

x

Contributors

ROBERT IVANEK  Department of Biomedicine, University of Basel, Basel, Switzerland MARK A. JENSEN  Research Administration Directorate, Leidos Biomedical Research Inc., Frederick National Laboratory for Cancer Research, Frederick, MD, USA MARTIN KRZYWINSKI  Canada’s Michael Smith Genome Sciences Centre, Vancouver, BC, Canada M. SCOTT KUEHN  Opower Inc., San Francisco, CA, USA MICHAEL LAWRENCE  Genentech, South San Francisco, CA, USA MATTHIAS LIENHARD  Max Planck Institute for Molecular Genetics, Berlin, Germany AARON T. L. LUN  Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia MARCO ANTONIO MENDOZA-PARRA  Equipe Labellise´e Ligue Contre le Cancer, Department of Functional Genomics and Cancer, Institut de Ge´ne´tique et de Biologie Mole´culaire et Cellulaire (IGBMC)/CNRS/INSERM/Universite´ de Strasbourg, Illkirch Cedex, France CHEN MENG  Chair of Proteomics and Bioanalytics, Technische Universit€ at Mnchen, Freising, Germany MARTIN MORGAN  Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA SHANE NEPH  Department of Genome Sciences, Altius Institute for Biomedical Sciences, Seattle, WA, USA VALERIE OBENCHAIN  Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA HERVE´ PAGE`S  Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA JENS REEDER  Genentech, South San Francisco, CA, USA ALEX P. REYNOLDS  Department of Genome Sciences, Altius Institute for Biomedical Sciences, Seattle, WA, USA MARKUS RIESTER  Novartis Institutes for BioMedical Research (NIBR), Cambridge, MA, USA MOHAMED-ASHICK M. SALEEM  Equipe Labellise´e Ligue Contre le Cancer, Department of Functional Genomics and Cancer, Institut de Ge´ne´tique et de Biologie Mole´culaire et Cellulaire (IGBMC)/CNRS/INSERM/Universite´ de Strasbourg, Illkirch Cedex, France TONY C. SMITH  Department of Computer Science, University of Waikato, Hamilton, New Zealand GORDON K. SMYTH  Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia JOHN A. STAMATOYANNOPOULOS  Altius Institute for Biomedical Sciences, Seattle, WA, USA; Department of Medicine, University of Washington, Seattle, WA, USA; Department of Genome Sciences, University of Washington, Seattle, WA, USA MYONG-HEE SUNG  Laboratory of Receptor Biology and Gene Expression, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA; Laboratory of Molecular Biology and Immunology, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA LEVI WALDRON  Department of Epidemiology and Biostatistics, City University of New York, School of Public Health, New York, NY, USA ZHINING WANG  Center for Cancer Genomics, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA

Contributors

xi

THOMAS D. WU  Genentech, South San Francisco, CA, USA ZHIJIN WU  Department of Biostatistics, Brown University, Providence, RI, USA HAO WU  Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA JEAN CLAUDE ZENKLUSEN  Center for Cancer Genomics, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA HONGEN ZHANG  Center for Cancer Research, National Institutes of Health, National Cancer Institute, Bethesda, MD, USA

Part I Groundwork

Chapter 1 Overview of Sequence Data Formats Hongen Zhang Abstract Next-generation sequencing experiment can generate billions of short reads for each sample and processing of the raw reads will add more information. Various file formats have been introduced/developed in order to store and manipulate this information. This chapter presents an overview of the file formats including FASTQ, FASTA, SAM/BAM, GFF/GTF, BED, and VCF that are commonly used in analysis of next-generation sequencing data. Key words Sequencing data file format, Next-generation sequencing, Sequencing data, FASTQ, FASTA, SAM/BAM, GFF/GTF, BED, VCF

1

Introduction Next-generation sequencing (NGS) refers to high-throughput technologies for large scale DNA sequencing (such as whole genome sequencing, whole-exome sequencing, RNA-seq, miRNAseq, ChIP-seq, and DNA Methylation) and could be conducted with different platforms, for example, Illumina(Solexa), Roche 454, Ion Torrent, and SOliD [1–5]. The main output of a nextgeneration sequencing experiment is short reads (short DNA sequences of ¼ 15"> ##INFO¼ ##INFO¼ ##INFO¼ ##INFO¼ ##FILTER¼ ##FILTER¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼ ##FORMAT¼

In this example, VCF file version is v4.1 and the file is generated with VarScan2 software. The ##INFO lines listed five IDs which will be used in the INFO column of data line to describe, for the variant in each line, the average read depth of bases with Phred score  15, how many of samples are wild type, heterozygous,

Overview of Sequence Data Formats

15

homozygous, or no call. The two ##FILTER lines listed how a variant being filtered. The nine ##FORMAT lines show what contents will and how be arranged in the FORMAT column of each data line. Data lines in VCF file are tab-delimited text lines and each line contains nine fixed fields (columns) followed by one or more sample column(s). The first nine fields are: l

CHROM: chromosome name.

l

POS: the leftmost position of the variant in the sequence.

l

ID: variant identifier such as SNP id. “.” denotes unavailable.

l

REF: reference base (for SNP) or sequence (for INDEL).

l

ALT: alternate base or bases (the observed variant).

l

QUAL: Phred-scaled quality score.

l

FILTER: fiter status. PASS for passed all filters or a semicolonseparated list of code for filters that fail.

l

INFO: additional information in ¼ format.

l

FORMAT: colon separated key and value.

l

Sample field(s): one or more sample columns may exist in a VCF file. This field has the values for a sample arranged in the format defined in previous FORMAT field.

The first seven columns in each VCF data line are information of the variant itself, for example, the line below shows variant is on chromosome 1 at position of 880639 base, no ID (snp) since it is a deletion, reference bases are TC and C is missing in sample sequence. The call of this variant has no quality score but has passed the filtering. #CHROM POS ID REF ALT QUAL 1 880639 . TC T . PASS

FILTER

The INFO and FORMAT columns in VCF data line list information of the variant in sample(s). To understand the meaning of contents in these two columns, one must refer to the information given by header lines. For example, the contents of “ADP¼10; WT¼0;HET¼1;HOM¼0;NC¼0” in INFO column indicate that average depth (ADP) for the variant in the samples is 10, one heterozygous (HET) is called and no other calls. The FORMAT column lists all tags (type of information) and their order for each sample. If a FORMAT column has following content: “GT:GQ: SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR” and values in a sample column is as “0/1:22:10:10:4:6:60 %:5.418E-3:36:38:3:1:0:6”, the information listed for the sample will be genotype (0/1), genotype quality (22), raw read depth from SAMtools (10), quality read depth with Phred score>¼15 (10),

16

Hongen Zhang

depth of reference-supporting bases (4), depth of variantsupporting bases (6), variant allele frequency (60 %), P-value from Fisher’s Exact Test (0.005418), average quality of referencesupporting bases (36), average quality of variant-supporting bases (38), depth of reference-supporting bases on forward strand (3), depth of reference-supporting bases on reverse strand (1), depth of variant-supporting bases on forward strand (0), and depth of variant-supporting bases on reverse strand (6). Detailed specification of VCF format can be found at https:// github.com/samtools/hts-specs. VCFtools [16] is a software equipped with numerous functionalities to process VCF files. A small VCF file can also be viewed from command line of Unix/ Linux computer or Microsoft Excel.

3

Conclusions Next-generation sequencing technologies have brought new file formats and resurrected old formats. Understanding the basics of these formats is important to knowing what information they contain and how they can be used in various steps from raw data generation to interpretable biological results.

References 1. Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26:1135–1145 2. Metzker ML (2010) Sequencing technologies— the next generation. Nat Rev Genet 11:31–46 3. Quail MA, Smith M, Cooupland P et al (2012) A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13:341 4. Mardis ER (2008) Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9:387–402 5. Mardis ER (2013) Next-generation sequencing platforms. Annu Rev Anal Chem 6:287–303 6. Flicek P, Birney E (2009) Sense from sequence reads: methods for alignment and assembly. Nat Methods 6(Suppl 11):S6–S12 7. Medvedev P, Stanciu M, Brudno M (2009) Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6(Suppl 11):S13–S20 8. Pepke S, Wold B, Mortazavi A (2009) Computation for ChIP-seq and RNA-seq studies. Nat Methods 6(Suppl 11):S22–S32

9. van Dijk EL, Auger H, Jaszczyszyn Y et al (2014) Ten years of next-generation sequencing technology. Trends Genet 30:418–426 10. Voelkerding KV, Dames SA, Durtschi JD (2009) Next-generation sequencing: from basic research to diagnostics. Clin Chem 55:641–658 11. Pavlopoulos GA, Oulas A, Lacucci E et al (2013) Unraveling genomic variation from next generation sequencing data. BioData Min 6:13 12. Allcock RJN (2014) Production and analytic bioinformatics for next-generation DNA sequencing. In: Trent R (ed) Clinical bioinformatics, 2nd edn. Humana, New York, pp 17–30 13. Cock PJ, Fields CJ, Goto N et al (2010) The Sanger FASTQ file format for sequences with quality scores, and the solexa/illumina FASTQ variants. Nucleic Acids Res 38:1767–1771 14. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/Map format and SAMtools. Bioinformatics 25:2078–2079 15. The SAM/BAM Format Specification Working Group (2014) Sequence alignment/map format specification. http://samtools.github.io/ hts-specs/SAMv1.pdf

Overview of Sequence Data Formats 16. Danecek P, Auton A, Abecasis G et al (2011) The variant call format and VCFtools. Bioinformatics 27:2156–2158 17. Ewing B, Hillier L, Wendl MC et al (1998) Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8:175–185 18. Ewing B, Green P (1998) Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 8:186–194 19. Andrews S (2010) FastQC: a quality control tool for high throughput sequence data., Available online at http://www.bioinformatics. babraham.ac.uk/projects/fastqc 20. Lipman D, Pearson W (1985) Rapid and sensitive protein similarity searches. Science 227:1435–1441 21. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci 85:2444–2448 22. Wu TD, Watanabe CK (2005) GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21:1859–1875 23. Langmead B, Trapnell C, Pop M et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25 24. Li H, Durbin R (2010) Fast and accurate longread alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595

17

25. Dobin A, Davis CA, Schlesinger F et al (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15–21 26. Robinson JT, Thorvaldsdo´ttir H, Winckler W et al (2011) Integrative Genomics Viewer. Nat Biotechnol 29:24–26 27. Thorvaldsdo´ttir H, Robinson JT, Mesirov JP (2013) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14:178–192 28. Generic Feature Format (GFF). http://www. sanger.ac.uk/resources/software/gff/spec.html 29. GFF/GTF File Format—Definition and supported options. http://www.ensembl.org/ info/website/upload/gff.html 30. BED File Format. Definition and supported options. http://useast.ensembl.org/info/ website/upload/bed.html 31. BED format. http://genome.ucsc.edu/FAQ/ FAQformat.html#format1 32. The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073 33. McVean GA, Abecasis DM, Auton R et al (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65

Chapter 2 Integrative Exploratory Analysis of Two or More Genomic Datasets Chen Meng and Aedin Culhane Abstract Exploratory analysis is an essential step in the analysis of high throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, protein, phosphoproteins) on a same set of biological samples, thereby creating multiple different types of omics data or multiassay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a lower number of variables, allowing a more easily identification of the correlated structure in and between multiple high dimensional datasets. Graphical representations can be employed to this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower dimensional space, so the most variant variables from each dataset can be selected and associated with samples, which can be further used to facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Last, as an example of integrative analysis of multiassay or multiomics data we analyzed transcriptomic, proteomic, and phosphoproteomic data from pluripotent (iPS) and embryonic stem (ES) cell lines. Key words Multivariate, Dimension reduction, Multi-omics, Multiassay, Data integration

1

Introduction High throughput technologies including microarray, sequencing, mass spectrometry based proteomics which assay biological molecules have developed rapidly in the past decades. These technologies generate vast amounts of data that describe biological samples at genomic scale and are often called omics data. The capacity and performance of these technologies have improved concurrently with dramatic decreases in cost, and therefore modern

Ewy Mathe´ and Sean Davis (eds.), Statistical Genomics: Methods and Protocols, Methods in Molecular Biology, vol. 1418, DOI 10.1007/978-1-4939-3578-9_2, © Springer Science+Business Media New York 2016

19

20

Chen Meng and Aedin Culhane

omics studies frequently apply multiple omics techniques to describe the same set of biological observations, such studies include The Cancer Genome Atlas (TCGA), Cancer Cell Line Encyclopedia (CCLE), and ENCyclopedia of DNA Elements (ENCODE). These projects systematically profile large numbers of biological samples resulting in multiple levels of qualitative or quantitative omics data. Whilst the systematically measuring large number of biological molecules (genes, proteins) can reveal novel knowledge that cannot be discovered by traditional methods, the accumulation of multiple omics data presents new challenges for data integration and interpretation. Several exploratory data analysis (EDA) methods including correspondence analysis (CA), principal component analysis (PCA) have been widely applied to study single omics data [1, 2]. These EDA methods are frequently performed in the early stage of analysis for quality control, detecting batch effect or exploring basic cluster structure in a dataset. In the analysis of multiple omics data, EDA also needs to identify correlations and associations between each of the high dimensional datasets. In this chapter, we describe the following EDA methods that enable researchers to identify relationships between two or more high dimensional datasets: l

Co-inertia analysis (CIA) can be used to explore relationships between two datasets [3];

l

Multiple co-inertia analysis (MCIA) can be applied to analyze multiple datasets [4].

Both methods project observations (samples) and variables onto a lower dimensional space, but constrain the dimension reduction such that the new variables represent covariant structure among datasets. Variables from each dataset are transformed onto the same scale. The association between variables and samples can be visualized in this new space, which greatly facilitates the detection of global variance structure and identification of the most informative variables across datasets.

2

Methods

2.1 Analysis of Two Datasets Using Co-inertia Analysis 2.1.1 Co-inertia Analysis

Co-inertia analysis is a multivariate exploratory approach used to identify the covariance between two datasets that have the same set of observations [5]. In the field of omics data analysis, CIA was first introduced to cross-platform comparison of microarray data [3]. With increasing availability of other omics data, it has also been applied to integration of different types of omics data [6]. Two omics datasets may be represented by two matrices, X and Y . In this chapter, we assume the rows of a matrix are samples (observations) and columns are variables, such as genes, proteins, other small molecules, etc. Similar to PCA, CIA is a dimension

Integrative Exploratory Analysis of Two or More Genomic Datasets

21

reduction technique but it considers two datasets simultaneously. For the ith dimension, CIA finds a pair of new variables, designated as co-inertia components or dimensions, using a linear combination of the original variables in X and Y , so as to maximize the squared covariance between them. argmaxui , vi cov 2 ðX ui , Y vi Þ

ð1Þ

X ui and Y vi are the co-inertia components for matrix X and Y , respectively; vi and ui are the linear combination coefficients, which is comparable to the loading vectors in the PCA. Due to the optimized criteria, the co-inertia components capture the most important covariance structure between the two datasets. The co-structure between the two datasets may be visualized by the co-inertia components in a lower dimensional space. 2.2 Case Study I: Integration of NCI-60 Cell Line Transcriptomic and Proteomic Data

The NCI-60 panel is a collection of 60 cancer cell lines from nine different tissues of origin. It includes leukemia, melanoma, ovarian, renal, breast, prostate, colon, lung, and central nervous system (CNS). These cell lines are widely used for in vitro screening of anticancer compounds. In attempts to discover gene–drug interactions, several genome wide data profiling approaches have been applied to these cell lines, including DNA copy number variation, DNA mutation, gene expression, protein expression, drug sensitivity, etc. In this case study, we will examine the mRNA expression measure by Agilent GE 4x44K microarray platform (downloaded from [7]) and the proteome data (mass spectrometry based proteomics) [8]. We will use CIA to explore the similarity between datasets and cell lines. We load the required package and data using: library(omicade4) library(made4) load("../data/NCI60_rnaprotein.RDA")

NCI60_rnaprotein is an object of class list, which consists of two numerical matrices, mRNA and protein. These two matrices have the following dimensions:

summary(NCI60_rnaprotein) ##

Length Class

Mode

## mRNA

640958 -none- numeric

## protein 431288 -none- numeric sapply(NCI60_rnaprotein, dim) ##

mRNA protein

## [1,] 11051 ## [2,]

58

7436 58

22

Chen Meng and Aedin Culhane

Each of the matrices has 58 cell lines in columns. Due to the problem of data quality, we removed two cell lines, resulting in 58 cell lines included in this analysis. CIA requires that the columns in the matrices are correctly matched, to verify this: identical(colnames(NCI60_rnaprotein$mRNA),colnames(NCI60_rnaprotein$protein)) ## [1] TRUE

However, the number of rows in different matrices may be different. To facilitate the visualization later, we first create some auxiliary variables to indicate the names of cell lines, tissues of origin of cell lines, and the color for each. names