English Pages 418 [431] Year 2016
Methods in Molecular Biology 1418
Ewy Mathé · Sean Davis Editors
Statistical Genomics Methods and Protocols
Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651
Edited by
Ewy Mathé Biomedical Informatics, College of Medicine, Ohio State University, Columbus, OH, USA
Sean Davis National Institutes of Health, National Cancer Institute, Bethesda, MD, USA
ISSN 1064-3745
ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-4939-3576-5
ISBN 978-1-4939-3578-9 (eBook)
DOI 10.1007/978-1-4939-3578-9
Library of Congress Control Number: 2016933669

© Springer Science+Business Media New York 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Humana Press imprint is published by Springer Nature
The registered company is Springer Science+Business Media LLC New York
Preface

Statistical Analysis of Genomic Data is, indeed, a very broad topic. We have attempted in this volume to provide chapters with cross-cutting groundwork materials, public data repositories, common applications of statistical analysis in genomics, and some representative toolsets for operating on genomic data. While we cannot be comprehensive in a single volume, we have tried to provide a breadth of both applications and tools. The authors of the individual chapters have largely focused on practical aspects of their topics, as we feel that application is an integral part of learning about statistical analysis of genomic data.

More specifically, the volume is divided into four parts. In the first part, we have included overview material and resources that can be applied across topics later in the book. In the second part, a couple of prominent public repositories for genomic data are covered in some depth. In the third part, several different biological applications of statistical genomics are presented. In the fourth and last part, software tools that can be used to facilitate ad hoc analysis and data integration are highlighted.

Finally, we thank the chapter authors for the generosity of their time and insight in preparing their excellent contributions.

Ewy Mathé, Columbus, OH, USA
Sean Davis, Bethesda, MD, USA
Contents

Preface
Contributors

PART I GROUNDWORK

1 Overview of Sequence Data Formats
  Hongen Zhang
2 Integrative Exploratory Analysis of Two or More Genomic Datasets
  Chen Meng and Aedin Culhane
3 Study Design for Sequencing Studies
  Loren A. Honaas, Naomi S. Altman, and Martin Krzywinski
4 Genomic Annotation Resources in R/Bioconductor
  Marc R.J. Carlson, Hervé Pagès, Sonali Arora, Valerie Obenchain, and Martin Morgan

PART II PUBLIC GENOMIC DATA

5 The Gene Expression Omnibus Database
  Emily Clough and Tanya Barrett
6 A Practical Guide to The Cancer Genome Atlas (TCGA)
  Zhining Wang, Mark A. Jensen, and Jean Claude Zenklusen

PART III APPLICATIONS

7 Working with Oligonucleotide Arrays
  Benilton S. Carvalho
8 Meta-Analysis in Gene Expression Studies
  Levi Waldron and Markus Riester
9 Practical Analysis of Genome Contact Interaction Experiments
  Mark A. Carty and Olivier Elemento
10 Quantitative Comparison of Large-Scale DNA Enrichment Sequencing Data
  Matthias Lienhard and Lukas Chavez
11 Variant Calling From Next Generation Sequence Data
  Nancy F. Hansen
12 Genome-Scale Analysis of Cell-Specific Regulatory Codes Using Nuclear Enzymes
  Songjoon Baek and Myong-Hee Sung

PART IV TOOLS

13 NGS-QC Generator: A Quality Control System for ChIP-Seq and Related Deep Sequencing-Generated Datasets
  Marco Antonio Mendoza-Parra, Mohamed-Ashick M. Saleem, Matthias Blum, Pierre-Etienne Cholley, and Hinrich Gronemeyer
14 Operating on Genomic Ranges Using BEDOPS
  Shane Neph, Alex P. Reynolds, M. Scott Kuehn, and John A. Stamatoyannopoulos
15 GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality
  Thomas D. Wu, Jens Reeder, Michael Lawrence, Gabe Becker, and Matthew J. Brauer
16 Visualizing Genomic Data Using Gviz and Bioconductor
  Florian Hahne and Robert Ivanek
17 Introducing Machine Learning Concepts with WEKA
  Tony C. Smith and Eibe Frank
18 Experimental Design and Power Calculation for RNA-seq Experiments
  Zhijin Wu and Hao Wu
19 It's DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR
  Aaron T.L. Lun, Yunshun Chen, and Gordon K. Smyth

Index
Contributors

NAOMI S. ALTMAN Department of Statistics and Huck Institutes of Life Sciences, The Pennsylvania State University, PA, USA
SONALI ARORA Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
SONGJOON BAEK Laboratory of Receptor Biology and Gene Expression, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
TANYA BARRETT National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
GABRIEL BECKER Genentech, South San Francisco, CA, USA
MATTHIAS BLUM Equipe Labellisée Ligue Contre le Cancer, Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC)/CNRS/INSERM/Université de Strasbourg, Illkirch Cedex, France
MATTHEW BRAUER Genentech, South San Francisco, CA, USA
MARC R.J. CARLSON Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA; Seattle Children's Research Institute, Seattle, WA, USA
MARK A. CARTY Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA; Memorial Sloan Kettering Cancer Center, New York, NY, USA
BENILTON S. CARVALHO Brazilian Institute of Neuroscience and Neurotechnology (BRAINN) and Department of Statistics, University of Campinas, Campinas, São Paulo, Brazil
LUKAS CHAVEZ German Cancer Research Center, Heidelberg, Germany
YUNSHUN CHEN Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia
PIERRE-ETIENNE CHOLLEY Equipe Labellisée Ligue Contre le Cancer, Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC)/CNRS/INSERM/Université de Strasbourg, Illkirch Cedex, France
EMILY CLOUGH National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
AEDIN CULHANE Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
OLIVIER ELEMENTO Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA
EIBE FRANK Department of Computer Science, University of Waikato, Hamilton, New Zealand
HINRICH GRONEMEYER Equipe Labellisée Ligue Contre le Cancer, Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC)/CNRS/INSERM/Université de Strasbourg, Illkirch Cedex, France
FLORIAN HAHNE Novartis Institute for Biomedical Research, Basel, Switzerland
NANCY F. HANSEN National Human Genome Research Institute, Rockville, MD, USA
LOREN A. HONAAS Department of Biology, The Pennsylvania State University, Wenatchee, WA, USA
ROBERT IVANEK Department of Biomedicine, University of Basel, Basel, Switzerland
MARK A. JENSEN Research Administration Directorate, Leidos Biomedical Research Inc., Frederick National Laboratory for Cancer Research, Frederick, MD, USA
MARTIN KRZYWINSKI Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
M. SCOTT KUEHN Opower Inc., San Francisco, CA, USA
MICHAEL LAWRENCE Genentech, South San Francisco, CA, USA
MATTHIAS LIENHARD Max Planck Institute for Molecular Genetics, Berlin, Germany
AARON T. L. LUN Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia
MARCO ANTONIO MENDOZA-PARRA Equipe Labellisée Ligue Contre le Cancer, Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC)/CNRS/INSERM/Université de Strasbourg, Illkirch Cedex, France
CHEN MENG Chair of Proteomics and Bioanalytics, Technische Universität München, Freising, Germany
MARTIN MORGAN Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
SHANE NEPH Department of Genome Sciences, Altius Institute for Biomedical Sciences, Seattle, WA, USA
VALERIE OBENCHAIN Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
HERVÉ PAGÈS Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
JENS REEDER Genentech, South San Francisco, CA, USA
ALEX P. REYNOLDS Department of Genome Sciences, Altius Institute for Biomedical Sciences, Seattle, WA, USA
MARKUS RIESTER Novartis Institutes for BioMedical Research (NIBR), Cambridge, MA, USA
MOHAMED-ASHICK M. SALEEM Equipe Labellisée Ligue Contre le Cancer, Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC)/CNRS/INSERM/Université de Strasbourg, Illkirch Cedex, France
TONY C. SMITH Department of Computer Science, University of Waikato, Hamilton, New Zealand
GORDON K. SMYTH Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia
JOHN A. STAMATOYANNOPOULOS Altius Institute for Biomedical Sciences, Seattle, WA, USA; Department of Medicine, University of Washington, Seattle, WA, USA; Department of Genome Sciences, University of Washington, Seattle, WA, USA
MYONG-HEE SUNG Laboratory of Receptor Biology and Gene Expression, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA; Laboratory of Molecular Biology and Immunology, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
LEVI WALDRON Department of Epidemiology and Biostatistics, City University of New York, School of Public Health, New York, NY, USA
ZHINING WANG Center for Cancer Genomics, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
THOMAS D. WU Genentech, South San Francisco, CA, USA
ZHIJIN WU Department of Biostatistics, Brown University, Providence, RI, USA
HAO WU Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
JEAN CLAUDE ZENKLUSEN Center for Cancer Genomics, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
HONGEN ZHANG Center for Cancer Research, National Institutes of Health, National Cancer Institute, Bethesda, MD, USA
Part I Groundwork

Chapter 1 Overview of Sequence Data Formats

Hongen Zhang

Abstract

A next-generation sequencing experiment can generate billions of short reads for each sample, and processing of the raw reads adds more information. Various file formats have been introduced or developed to store and manipulate this information. This chapter presents an overview of the file formats commonly used in the analysis of next-generation sequencing data, including FASTQ, FASTA, SAM/BAM, GFF/GTF, BED, and VCF.

Key words Sequencing data file format, Next-generation sequencing, Sequencing data, FASTQ, FASTA, SAM/BAM, GFF/GTF, BED, VCF
1 Introduction

Next-generation sequencing (NGS) refers to high-throughput technologies for large-scale DNA sequencing (such as whole-genome sequencing, whole-exome sequencing, RNA-seq, miRNA-seq, ChIP-seq, and DNA methylation) and can be conducted on different platforms, for example, Illumina (Solexa), Roche 454, Ion Torrent, and SOLiD [1–5]. The main output of a next-generation sequencing experiment is short reads (short DNA sequences of [...]

[...] = 15">
##INFO=
##INFO=
##INFO=
##INFO=
##FILTER=
##FILTER=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
In this example, the VCF file version is v4.1 and the file was generated with the VarScan2 software. The ##INFO lines list five IDs that are used in the INFO column of each data line to describe, for the variant on that line, the average read depth of bases with Phred score >= 15 and how many of the samples are wild type, heterozygous, homozygous, or no call. The two ##FILTER lines describe the filters applied to variants. The fourteen ##FORMAT lines define which tags appear in the FORMAT column of each data line and how the corresponding values are arranged.

Data lines in a VCF file are tab-delimited text lines, and each line contains nine fixed fields (columns) followed by one or more sample columns. The nine fixed fields and the sample field(s) are:

• CHROM: chromosome name.
• POS: the leftmost position of the variant in the sequence.
• ID: variant identifier, such as a SNP ID. "." denotes unavailable.
• REF: reference base (for a SNP) or sequence (for an INDEL).
• ALT: alternate base or bases (the observed variant).
• QUAL: Phred-scaled quality score.
• FILTER: filter status. PASS if the variant passed all filters; otherwise, a semicolon-separated list of codes for the filters that failed.
• INFO: additional information in key=value format, with entries separated by semicolons.
• FORMAT: colon-separated list of keys (tags) that defines the order of the values in each sample column.
• Sample field(s): one or more sample columns may exist in a VCF file. Each holds the values for one sample, arranged in the order defined by the preceding FORMAT field.
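The QUAL field and the "Phred score >= 15" threshold both use the Phred scale, on which a score Q corresponds to an error probability P through Q = -10 log10(P). A minimal sketch of the conversion in R follows; the function names `phred_to_prob` and `prob_to_phred` are illustrative and not part of any package discussed in this chapter.

```r
# Convert between Phred-scaled scores and error probabilities.
# Function names are hypothetical, chosen for this illustration only.
phred_to_prob <- function(q) 10^(-q / 10)
prob_to_phred <- function(p) -10 * log10(p)

# Error probabilities for a few common thresholds
q <- c(15, 20, 30)
data.frame(phred = q, error_prob = phred_to_prob(q))
```

Under this scale, the threshold of 15 used in the header corresponds to an error probability of roughly 3 in 100, while a score of 30 corresponds to 1 in 1000.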
The first seven columns in each VCF data line describe the variant itself. For example, the line below shows a variant on chromosome 1 at position 880639. It has no ID (such as a SNP identifier) because it is a deletion: the reference bases are TC and the C is missing in the sample sequence. The call of this variant has no quality score but has passed filtering.

#CHROM  POS     ID  REF  ALT  QUAL  FILTER
1       880639  .   TC   T    .     PASS
The INFO and FORMAT columns in a VCF data line list information about the variant in the sample(s). To understand the contents of these two columns, one must refer to the information given by the header lines. For example, the content "ADP=10;WT=0;HET=1;HOM=0;NC=0" in the INFO column indicates that the average depth (ADP) for the variant in the samples is 10, and that one heterozygous (HET) call was made and no other calls. The FORMAT column lists all tags (types of information) and their order for each sample. If a FORMAT column has the content "GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR" and the values in a sample column are "0/1:22:10:10:4:6:60%:5.418E-3:36:38:3:1:0:6", the information listed for the sample is genotype (0/1), genotype quality (22), raw read depth from SAMtools (10), quality read depth with Phred score >= 15 (10), depth of reference-supporting bases (4), depth of variant-supporting bases (6), variant allele frequency (60%), P-value from Fisher's Exact Test (0.005418), average quality of reference-supporting bases (36), average quality of variant-supporting bases (38), depth of reference-supporting bases on the forward strand (3), depth of reference-supporting bases on the reverse strand (1), depth of variant-supporting bases on the forward strand (0), and depth of variant-supporting bases on the reverse strand (6).

The detailed specification of the VCF format can be found at https://github.com/samtools/hts-specs. VCFtools [16] is a software package with numerous functions for processing VCF files. A small VCF file can also be viewed from the command line of a Unix/Linux computer or in Microsoft Excel.
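The way the INFO, FORMAT, and sample columns fit together can be sketched in a few lines of base R. This is only an illustration using the example values quoted above; the helper name `parse_vcf_fields` is hypothetical, and real files are better handled with dedicated tools such as VCFtools.

```r
# Split the INFO, FORMAT, and sample columns of one VCF data line.
# parse_vcf_fields is a hypothetical helper written for this illustration.
parse_vcf_fields <- function(info, format, sample) {
  # INFO: semicolon-separated key=value pairs
  pairs <- strsplit(strsplit(info, ";", fixed = TRUE)[[1]], "=", fixed = TRUE)
  info_values <- vapply(pairs, function(p) p[2], character(1))
  names(info_values) <- vapply(pairs, function(p) p[1], character(1))

  # FORMAT keys and sample values are colon-separated and share one order
  keys   <- strsplit(format, ":", fixed = TRUE)[[1]]
  values <- strsplit(sample, ":", fixed = TRUE)[[1]]
  stopifnot(length(keys) == length(values))
  names(values) <- keys

  list(info = info_values, sample = values)
}

fields <- parse_vcf_fields(
  info   = "ADP=10;WT=0;HET=1;HOM=0;NC=0",
  format = "GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR",
  sample = "0/1:22:10:10:4:6:60%:5.418E-3:36:38:3:1:0:6"
)

fields$info[["HET"]]                 # number of heterozygous calls, as text
fields$sample[["GT"]]                # genotype
as.numeric(fields$sample[["PVAL"]])  # P-value converted to a number
```

All values are returned as character strings, because their types are declared in the header lines rather than in the data line itself; conversion is left to the caller.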
3 Conclusions

Next-generation sequencing technologies have brought new file formats and resurrected old formats. Understanding the basics of these formats is important to knowing what information they contain and how they can be used in various steps from raw data generation to interpretable biological results.
References

1. Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26:1135–1145
2. Metzker ML (2010) Sequencing technologies - the next generation. Nat Rev Genet 11:31–46
3. Quail MA, Smith M, Coupland P et al (2012) A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13:341
4. Mardis ER (2008) Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9:387–402
5. Mardis ER (2013) Next-generation sequencing platforms. Annu Rev Anal Chem 6:287–303
6. Flicek P, Birney E (2009) Sense from sequence reads: methods for alignment and assembly. Nat Methods 6(Suppl 11):S6–S12
7. Medvedev P, Stanciu M, Brudno M (2009) Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6(Suppl 11):S13–S20
8. Pepke S, Wold B, Mortazavi A (2009) Computation for ChIP-seq and RNA-seq studies. Nat Methods 6(Suppl 11):S22–S32
9. van Dijk EL, Auger H, Jaszczyszyn Y et al (2014) Ten years of next-generation sequencing technology. Trends Genet 30:418–426
10. Voelkerding KV, Dames SA, Durtschi JD (2009) Next-generation sequencing: from basic research to diagnostics. Clin Chem 55:641–658
11. Pavlopoulos GA, Oulas A, Iacucci E et al (2013) Unraveling genomic variation from next generation sequencing data. BioData Min 6:13
12. Allcock RJN (2014) Production and analytic bioinformatics for next-generation DNA sequencing. In: Trent R (ed) Clinical bioinformatics, 2nd edn. Humana, New York, pp 17–30
13. Cock PJ, Fields CJ, Goto N et al (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38:1767–1771
14. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079
15. The SAM/BAM Format Specification Working Group (2014) Sequence alignment/map format specification. http://samtools.github.io/hts-specs/SAMv1.pdf
16. Danecek P, Auton A, Abecasis G et al (2011) The variant call format and VCFtools. Bioinformatics 27:2156–2158
17. Ewing B, Hillier L, Wendl MC et al (1998) Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8:175–185
18. Ewing B, Green P (1998) Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 8:186–194
19. Andrews S (2010) FastQC: a quality control tool for high throughput sequence data. Available online at http://www.bioinformatics.babraham.ac.uk/projects/fastqc
20. Lipman D, Pearson W (1985) Rapid and sensitive protein similarity searches. Science 227:1435–1441
21. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci 85:2444–2448
22. Wu TD, Watanabe CK (2005) GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21:1859–1875
23. Langmead B, Trapnell C, Pop M et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25
24. Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595
25. Dobin A, Davis CA, Schlesinger F et al (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15–21
26. Robinson JT, Thorvaldsdóttir H, Winckler W et al (2011) Integrative Genomics Viewer. Nat Biotechnol 29:24–26
27. Thorvaldsdóttir H, Robinson JT, Mesirov JP (2013) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14:178–192
28. Generic Feature Format (GFF). http://www.sanger.ac.uk/resources/software/gff/spec.html
29. GFF/GTF File Format - Definition and supported options. http://www.ensembl.org/info/website/upload/gff.html
30. BED File Format. Definition and supported options. http://useast.ensembl.org/info/website/upload/bed.html
31. BED format. http://genome.ucsc.edu/FAQ/FAQformat.html#format1
32. The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073
33. McVean GA, Abecasis DM, Auton R et al (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65
Chapter 2 Integrative Exploratory Analysis of Two or More Genomic Datasets

Chen Meng and Aedin Culhane

Abstract

Exploratory analysis is an essential step in the analysis of high-throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of a single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, proteins, phosphoproteins) on the same set of biological samples, thereby creating multiple different types of omics data, or multiassay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a smaller number of variables, allowing easier identification of the correlated structure within and between multiple high-dimensional datasets. Graphical representations can be employed for this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower dimensional space, so the most variable features from each dataset can be selected and associated with samples, which can be further used to facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Last, as an example of integrative analysis of multiassay or multiomics data, we analyzed transcriptomic, proteomic, and phosphoproteomic data from pluripotent (iPS) and embryonic stem (ES) cell lines.

Key words Multivariate, Dimension reduction, Multi-omics, Multiassay, Data integration
Ewy Mathé and Sean Davis (eds.), Statistical Genomics: Methods and Protocols, Methods in Molecular Biology, vol. 1418, DOI 10.1007/978-1-4939-3578-9_2, © Springer Science+Business Media New York 2016

1 Introduction

High-throughput technologies that assay biological molecules, including microarrays, sequencing, and mass spectrometry-based proteomics, have developed rapidly in the past decades. These technologies generate vast amounts of data that describe biological samples at genomic scale and are often called omics data. The capacity and performance of these technologies have improved concurrently with dramatic decreases in cost, and therefore modern omics studies frequently apply multiple omics techniques to describe the same set of biological observations. Such studies include The Cancer Genome Atlas (TCGA), the Cancer Cell Line Encyclopedia (CCLE), and the ENCyclopedia of DNA Elements (ENCODE). These projects systematically profile large numbers of biological samples, resulting in multiple levels of qualitative or quantitative omics data. While systematically measuring large numbers of biological molecules (genes, proteins) can reveal novel knowledge that cannot be discovered by traditional methods, the accumulation of multiple omics data presents new challenges for data integration and interpretation.

Several exploratory data analysis (EDA) methods, including correspondence analysis (CA) and principal component analysis (PCA), have been widely applied to study single omics data [1, 2]. These EDA methods are frequently performed in the early stage of analysis for quality control, detecting batch effects, or exploring the basic cluster structure in a dataset. In the analysis of multiple omics data, EDA also needs to identify correlations and associations between each of the high-dimensional datasets. In this chapter, we describe the following EDA methods that enable researchers to identify relationships between two or more high-dimensional datasets:

• Co-inertia analysis (CIA) can be used to explore relationships between two datasets [3];
• Multiple co-inertia analysis (MCIA) can be applied to analyze multiple datasets [4].
Both methods project observations (samples) and variables onto a lower dimensional space, but constrain the dimension reduction such that the new variables represent covariant structure among datasets. Variables from each dataset are transformed onto the same scale. The association between variables and samples can be visualized in this new space, which greatly facilitates the detection of global variance structure and identification of the most informative variables across datasets.
2 Methods

2.1 Analysis of Two Datasets Using Co-inertia Analysis

2.1.1 Co-inertia Analysis
Co-inertia analysis is a multivariate exploratory approach used to identify the covariance between two datasets that have the same set of observations [5]. In the field of omics data analysis, CIA was first introduced for cross-platform comparison of microarray data [3]. With the increasing availability of other omics data, it has also been applied to the integration of different types of omics data [6]. Two omics datasets may be represented by two matrices, $X$ and $Y$. In this chapter, we assume the rows of a matrix are samples (observations) and the columns are variables, such as genes, proteins, other small molecules, etc. Similar to PCA, CIA is a dimension reduction technique, but it considers two datasets simultaneously. For the $i$th dimension, CIA finds a pair of new variables, designated as co-inertia components or dimensions, using a linear combination of the original variables in $X$ and $Y$, so as to maximize the squared covariance between them:

$$\operatorname*{arg\,max}_{u_i,\, v_i} \; \operatorname{cov}^2\left(X u_i,\; Y v_i\right) \tag{1}$$

$X u_i$ and $Y v_i$ are the co-inertia components for matrices $X$ and $Y$, respectively; $u_i$ and $v_i$ are the linear combination coefficients, which are comparable to the loading vectors in PCA. Because of this optimization criterion, the co-inertia components capture the most important covariance structure between the two datasets. The co-structure between the two datasets may be visualized by plotting the co-inertia components in a lower dimensional space.

2.2 Case Study I: Integration of NCI-60 Cell Line Transcriptomic and Proteomic Data
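Before turning to the NCI-60 data, the criterion of maximizing the squared covariance between a pair of components can be illustrated on toy data with base R. The sketch below is a simplification and not the implementation used by the packages in this chapter: it assumes column-centered matrices, unit-norm coefficient vectors, and uniform weights, in which case the first pair of coefficient vectors is given by the first singular vectors of the cross-product matrix. All object names and the simulated data are hypothetical.

```r
# Toy illustration of the squared-covariance criterion (simplified CIA).
# Assumptions: samples in rows, column-centered data, unit-norm u and v,
# uniform weights. Simulated data; not the NCI-60 matrices.
set.seed(1)
n <- 20                                  # number of shared samples
X <- matrix(rnorm(n * 5), nrow = n)      # dataset 1: 5 variables
Y <- matrix(rnorm(n * 4), nrow = n)      # dataset 2: 4 variables
Y[, 1] <- X[, 1] + rnorm(n, sd = 0.1)    # plant some shared structure

Xc <- scale(X, center = TRUE, scale = FALSE)
Yc <- scale(Y, center = TRUE, scale = FALSE)

# For centered data, cov(Xc u, Yc v) = t(u) %*% t(Xc) %*% Yc %*% v / (n - 1),
# so the maximizing unit vectors are the leading singular vectors of t(Xc) %*% Yc.
s  <- svd(crossprod(Xc, Yc))
u1 <- s$u[, 1]
v1 <- s$v[, 1]

comp_x <- Xc %*% u1                      # first component for X
comp_y <- Yc %*% v1                      # first component for Y
sq_cov <- cov(comp_x, comp_y)[1, 1]^2    # value of the criterion
```

Plotting `comp_x` against `comp_y` shows how closely the two datasets agree along the first axis; later axes correspond to the subsequent singular vectors.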
The NCI-60 panel is a collection of 60 cancer cell lines from nine different tissues of origin: leukemia, melanoma, ovarian, renal, breast, prostate, colon, lung, and central nervous system (CNS). These cell lines are widely used for in vitro screening of anticancer compounds. In attempts to discover gene–drug interactions, several genome-wide data profiling approaches have been applied to these cell lines, including DNA copy number variation, DNA mutation, gene expression, protein expression, and drug sensitivity. In this case study, we examine the mRNA expression measured on the Agilent GE 4x44K microarray platform (downloaded from [7]) and the proteome data (mass spectrometry-based proteomics) [8]. We use CIA to explore the similarity between datasets and cell lines. We load the required packages and data using:

library(omicade4)
library(made4)
load("../data/NCI60_rnaprotein.RDA")
NCI60_rnaprotein is an object of class list, which consists of two numerical matrices, mRNA and protein. These two matrices have the following dimensions:

summary(NCI60_rnaprotein)
##         Length Class  Mode
## mRNA    640958 -none- numeric
## protein 431288 -none- numeric

sapply(NCI60_rnaprotein, dim)
##       mRNA protein
## [1,] 11051    7436
## [2,]    58      58

Each of the matrices has 58 cell lines in columns. Because of data quality problems, we removed two cell lines, resulting in 58 cell lines included in this analysis. CIA requires that the columns in the matrices are correctly matched. To verify this:

identical(colnames(NCI60_rnaprotein$mRNA), colnames(NCI60_rnaprotein$protein))
## [1] TRUE
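If such a check returned FALSE for another pair of datasets, the columns would first need to be restricted to the shared samples and put into the same order. A minimal sketch with base R follows; the toy matrices and object names are hypothetical and stand in for real expression matrices with samples in columns.

```r
# Align two matrices on their shared, identically ordered sample columns.
# Toy matrices only; names are hypothetical.
mrna <- matrix(1:6, nrow = 2,
               dimnames = list(c("g1", "g2"), c("A", "B", "C")))
prot <- matrix(1:6, nrow = 3,
               dimnames = list(c("p1", "p2", "p3"), c("C", "A")))

shared <- intersect(colnames(mrna), colnames(prot))  # samples present in both
mrna_matched <- mrna[, shared, drop = FALSE]
prot_matched <- prot[, shared, drop = FALSE]

identical(colnames(mrna_matched), colnames(prot_matched))
```

Subsetting by name rather than by position keeps each column attached to its sample label, so a reordering in one dataset cannot silently mismatch samples.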
However, the number of rows in different matrices may be different. To facilitate the visualization later, we first create some auxiliary variables to indicate the names of cell lines, tissues of origin of cell lines, and the color for each. names