Bioinformatics for Omics Data: Methods and Protocols (Methods in Molecular Biology, 719) 1617790265, 9781617790263


Table of contents:
Bioinformatics for Omics Data
Preface
Contents
Contributors
Part I: Omics Bioinformatics Fundamentals
Part II: Omics Data and Analysis Tracks
Part III: Applied Omics Bioinformatics
Index

Methods in Molecular Biology™

Series Editor
John M. Walker
School of Life Sciences, University of Hertfordshire, Hatfield, Hertfordshire, AL10 9AB, UK



For other titles published in this series, go to www.springer.com/series/7651

Bioinformatics for Omics Data: Methods and Protocols

Edited by

Bernd Mayer
emergentec biodevelopment GmbH, Vienna, Austria

Editor
Bernd Mayer, Ph.D.
emergentec biodevelopment GmbH, Gersthofer Strasse 29-31, 1180 Vienna, Austria
[email protected]
and
Institute for Theoretical Chemistry, University of Vienna, Währinger Strasse 17, 1090 Vienna, Austria
[email protected]

ISSN 1064-3745 e-ISSN 1940-6029 ISBN 978-1-61779-026-3 e-ISBN 978-1-61779-027-0 DOI 10.1007/978-1-61779-027-0 Springer New York Dordrecht Heidelberg London Library of Congress Control Number: 2011922257 © Springer Science+Business Media, LLC 2011 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or ­dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of going to press, ­neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Printed on acid-free paper Humana Press is part of Springer Science+Business Media (www.springer.com)

Preface

This book discusses the multiple facets of “Bioinformatics for Omics Data,” an area of research that intersects with and integrates diverse disciplines, including molecular biology, applied informatics, and statistics, among others. Bioinformatics has become a default technology for data-driven research in the Omics realm and a necessary skill set for the Omics practitioner. Progress in miniaturization, coupled with advancements in readout technologies, has enabled a multitude of cellular components and states to be assessed simultaneously, providing an unparalleled ability to characterize a given biological phenotype. However, without appropriate processing and analysis, Omics data add nothing to our understanding of the phenotype under study. Even managing the enormous amounts of raw data that these methods generate has become something of an art.

Viewed from one perspective, bioinformatics might be perceived as a purely technical discipline. However, as a research discipline, bioinformatics might more accurately be viewed as “[molecular] biology involving computation.” Omics has triggered a paradigm shift in experimental study design, expanding beyond hypothesis-driven approaches to research that is essentially explorative. At present, Omics is in the process of consolidating various intermediate forms between these two extremes. In this context, bioinformatics for Omics data serves both hypothesis generation and validation and is thus much more than mere data management and processing. Bioinformatics workflows with data interpretation strategies that reflect the complexity of biological organization have been designed. These approaches combine abundance profiles with regulatory elements, all expressed as interaction networks, thus allowing a one-step (descriptive) embodiment of wide-ranging cellular processes. Here, the seamless transition to computational Systems Biology becomes apparent, the ultimate goal of which is representing the dynamics of a phenotype in quantitative models capable of predicting the emergence of higher order molecular processes and functions from the interplay of the basic molecular entities that constitute a living cell.

Bioinformatics for Omics data is certainly embedded in a highly complex technological and scientific environment, but it is also a component and driver of one of the most exciting developments in modern molecular biology. Thus, while this book seeks to provide practical guidelines, it hopefully also conveys a sense of the fascination associated with this research field.

This volume is structured in three parts. Part I provides central analysis strategies, standardization, and data management guidelines, as well as fundamental statistics for analyzing Omics profiles. Part II addresses bioinformatics approaches for specific Omics tracks, spanning genome, transcriptome, proteome, and metabolome levels. For each track, the conceptual and experimental background is provided, together with specific guidelines for handling raw data, including preprocessing and analysis. Part III presents examples of integrated Omics bioinformatics applications, complemented by case studies on biomarker and target identification in the context of human disease.

I wish to express my gratitude to all authors for their dedication in providing excellent chapters, and to John Walker, who initiated this project. As for any omissions or errors, the responsibility is mine. In any case, enjoy reading.

Vienna, Austria

Bernd Mayer


Contents

Preface
Contributors

Part I: Omics Bioinformatics Fundamentals

 1. Omics Technologies, Data and Bioinformatics Principles (Maria V. Schneider and Sandra Orchard)
 2. Data Standards for Omics Data: The Basis of Data Sharing and Reuse (Stephen A. Chervitz, Eric W. Deutsch, Dawn Field, Helen Parkinson, John Quackenbush, Phillipe Rocca-Serra, Susanna-Assunta Sansone, Christian J. Stoeckert, Jr., Chris F. Taylor, Ronald Taylor, and Catherine A. Ball)
 3. Omics Data Management and Annotation (Arye Harel, Irina Dalah, Shmuel Pietrokovski, Marilyn Safran, and Doron Lancet)
 4. Data and Knowledge Management in Cross-Omics Research Projects (Martin Wiesinger, Martin Haiduk, Marco Behr, Henrique Lopes de Abreu Madeira, Gernot Glöckler, Paul Perco, and Arno Lukas)
 5. Statistical Analysis Principles for Omics Data (Daniela Dunkler, Fátima Sánchez-Cabo, and Georg Heinze)
 6. Statistical Methods and Models for Bridging Omics Data Levels (Simon Rogers)
 7. Analysis of Time Course Omics Datasets (Martin G. Grigorov)
 8. The Use and Abuse of -Omes (Sonja J. Prohaska and Peter F. Stadler)

Part II: Omics Data and Analysis Tracks

 9. Computational Analysis of High Throughput Sequencing Data (Steve Hoffmann)
10. Analysis of Single Nucleotide Polymorphisms in Case–Control Studies (Yonghong Li, Dov Shiffman, and Rainer Oberbauer)
11. Bioinformatics for Copy Number Variation Data (Melissa Warden, Roger Pique-Regi, Antonio Ortega, and Shahab Asgharzadeh)
12. Processing ChIP-Chip Data: From the Scanner to the Browser (Pierre Cauchy, Touati Benoukraf, and Pierre Ferrier)
13. Insights Into Global Mechanisms and Disease by Gene Expression Profiling (Fátima Sánchez-Cabo, Johannes Rainer, Ana Dopazo, Zlatko Trajanoski, and Hubert Hackl)
14. Bioinformatics for RNomics (Kristin Reiche, Katharina Schutt, Kerstin Boll, Friedemann Horn, and Jörg Hackermüller)
15. Bioinformatics for Qualitative and Quantitative Proteomics (Chris Bielow, Clemens Gröpl, Oliver Kohlbacher, and Knut Reinert)
16. Bioinformatics for Mass Spectrometry-Based Metabolomics (David P. Enot, Bernd Haas, and Klaus M. Weinberger)

Part III: Applied Omics Bioinformatics

17. Computational Analysis Workflows for Omics Data Interpretation (Irmgard Mühlberger, Julia Wilflingseder, Andreas Bernthaler, Raul Fechete, Arno Lukas, and Paul Perco)
18. Integration, Warehousing, and Analysis Strategies of Omics Data (Srinubabu Gedela)
19. Integrating Omics Data for Signaling Pathways, Interactome Reconstruction, and Functional Analysis (Paolo Tieri, Alberto de la Fuente, Alberto Termanini, and Claudio Franceschi)
20. Network Inference from Time-Dependent Omics Data (Paola Lecca, Thanh-Phuong Nguyen, Corrado Priami, and Paola Quaglia)
21. Omics and Literature Mining (Vinod Kumar)
22. Omics–Bioinformatics in the Context of Clinical Data (Gert Mayer, Georg Heinze, Harald Mischak, Merel E. Hellemons, Hiddo J. Lambers Heerspink, Stephan J.L. Bakker, Dick de Zeeuw, Martin Haiduk, Peter Rossing, and Rainer Oberbauer)
23. Omics-Based Identification of Pathophysiological Processes (Hiroshi Tanaka and Soichi Ogishima)
24. Data Mining Methods in Omics-Based Biomarker Discovery (Fan Zhang and Jake Y. Chen)
25. Integrated Bioinformatics Analysis for Cancer Target Identification (Yongliang Yang, S. James Adelstein, and Amin I. Kassis)
26. Omics-Based Molecular Target and Biomarker Identification (Zhang-Zhi Hu, Hongzhan Huang, Cathy H. Wu, Mira Jung, Anatoly Dritschilo, Anna T. Riegel, and Anton Wellstein)

Index

Contributors S. James Adelstein  •  Harvard Medical School, Harvard University, Boston, MA, USA Shahab Asgharzadeh  •  Department of Pediatrics and Pathology, Keck School of Medicine, Childrens Hospital Los Angeles, University of Southern California, Los Angeles, CA, USA Stephan J.L. Bakker  •  Department of Nephrology, University Medical Center Groningen, Groningen, The Netherlands Catherine A. Ball  •  Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA Marco Behr  •  emergentec biodevelopment GmbH, Vienna, Austria Touati Benoukraf  •  Université de la Méditerranée, Marseille, France; Centre d’Immunologie de Marseille-Luminy, Marseille, France; CNRS, UMR6102, Marseille, France; Inserm, U631, Marseille, France Andreas Bernthaler  •  emergentec biodevelopment GmbH, Vienna, Austria Chris Bielow  •  AG Algorithmische Bioinformatik, Institut für Informatik, Freie Universität Berlin, Berlin, Germany Pierre Cauchy  •  Inserm, U928, TAGC, Marseille, France; Université de la Méditerranée, Marseille, France Jake Y. Chen  •  Indiana University School of Informatics, Indianapolis, IN, USA Stephen A. Chervitz  •  Affymetrix, Inc., Santa Clara, CA, USA Irina Dalah  •  Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel Eric W. Deutsch  •  Institute for Systems Biology, Seattle,WA, USA Ana Dopazo  •  Genomics Unit, Centro Nacional de Investigaciones Cardiovasculares, Madrid, Spain Anatoly Dritschilo  •  Lombardi Cancer Center, Georgetown University, Washington, DC, USA Daniela Dunkler  •  Section of Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria David P. Enot  •  BIOCRATES life sciences AG, Innsbruck, Austria Raul Fechete  •  emergentec biodevelopment GmbH, Vienna, Austria Pierre Ferrier  •  Centre d’Immunologie de Marseille-Luminy (CIML), Marseille, France Dawn Field  •  NERC Centre for Ecology and Hydrology, Oxford, UK Claudio Franceschi  •  ‘L Galvani’ Interdept Center, University of Bologna, Bologna, Italy Alberto de la Fuente  •  CRS4 Bioinformatica, Parco Tecnologico SOLARIS, Pula, Italy Srinubabu Gedela  •  Stanford University School of Medicine, Stanford, CA, USA Gernot Glöckler  •  emergentec biodevelopment GmbH, Vienna, Austria Martin G. Grigorov  •  Nestlé Research Center, Lausanne, Switzerland Clemens Gröpl  •  Ernst-Moritz-Arndt-Universität Greifswald, Greifswald, Germany Bernd Haas  •  BIOCRATES life sciences AG, Innsbruck, Austria ix


Jörg Hackermüller  •  Bioinformatics Group, Department of Computer Science, University of Leipzig, Leipzig, Germany; Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany Hubert Hackl  •  Division for Bioinformatics, Innsbruck Medical University, Innsbruck, Austria Martin Haiduk  •  emergentec biodevelopment GmbH, Vienna, Austria Arye Harel  •  Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel Georg Heinze  •  Section of Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria Merel E. Hellemons  •  Department of Nephrology, University Medical Center Groningen, Groningen, The Netherlands Steve Hoffmann  •  Interdisciplinary Center for Bioinformatics and The Junior Research Group for Transcriptome Bioinformatics in the LIFE Research Cluster, University Leipzig, Leipzig, Germany Friedemann Horn  •  Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany; Institute of Clinical Immunology, University of Leipzig, Leipzig, Germany Zhang-Zhi Hu  •  Lombardi Cancer Center, Georgetown University, Washington DC, USA Hongzhan Huang  •  Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE, USA Mira Jung  •  Lombardi Cancer Center, Georgetown University, Washington, DC, USA Amin I. Kassis  •  Harvard Medical School, Harvard University, Boston, MA, USA Oliver Kohlbacher  •  Eberhard-Karls-Universität Tübingen, Tübingen, Germany Vinod Kumar  •  Computational Biology, Quantitative Sciences, GlaxoSmithKline, King of Prussia, PA, USA Hiddo J. Lambers Heerspink  •  Department of Nephrology, University Medical Center Groningen, Groningen, The Netherlands Doron Lancet  •  Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel Paola Lecca  •  The Microsoft Research – University of Trento Centre for Computational and Systems Biology, Povo, Trento, Italy Yonghong Li  •  Celera Corporation, Alameda, CA, USA Henrique Lopes de Abreu Madeira  •  emergentec biodevelopment GmbH, Vienna, Austria Arno Lukas  •  emergentec biodevelopment GmbH, Vienna, Austria Gert Mayer  •  Department of Internal Medicine IV (Nephrology and Hypertension), Medical University of Innsbruck, Innsbruck, Austria Harald Mischak  •  mosaiques diagnostics GmbH, Hannover, Germany Irmgard Mühlberger  •  emergentec biodevelopment GmbH, Vienna, Austria Thanh-Phuong Nguyen  •  The Microsoft Research – University of Trento Centre for Computational and Systems Biology, Povo, Trento, Italy Rainer Oberbauer  •  Medical University of Vienna and KH Elisabethinen Linz, Vienna, Austria Soichi Ogishima  •  Department of Bioinformatics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan


Sandra Orchard  •  EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton,Cambridge, UK Antonio Ortega  •  Department of Electrical Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA Helen Parkinson  •  EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK Paul Perco  •  emergentec biodevelopment GmbH, Vienna, Austria Shmuel Pietrokovski  •  Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel Roger Pique-Regi  •  Department of Human Genetics, University of Chicago, Chicago, IL, USA Corrado Priami  •  The Microsoft Research – University of Trento Centre for Computational and Systems Biology, Povo, Trento, Italy Sonja J. Prohaska  •  Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany John Quackenbush  •  Department of Biostatistics, Dana-Farber Cancer Institute, Boston, MA, USA Paola Quaglia  •  The Microsoft Research – University of Trento Centre for Computational and Systems Biology, Povo, Trento, Italy Johannes Rainer  •  Bioinformatics Group, Division Molecular Pathophysiology, Medical University Innsbruck, Innsbruck, Austria Kristin Reiche  •  Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany Knut Reinert  •  AG Algorithmische Bioinformatik, Institut für Informatik, Freie Universität Berlin, Berlin, Germany Anna T. Riegel  •  Lombardi Cancer Center, Georgetown University, Washington, DC, USA Phillipe Rocca-Serra  •  EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK Simon Rogers  •  Inference Research Group, Department of Computing Science, University of Glasgow, Glasgow, UK Peter Rossing  •  Steno Diabetes Center Denmark, Gentofte, Denmark Marilyn Safran  •  Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel Fátima Sánchez-Cabo  •  Genomics Unit, Centro Nacional de Investigaciones Cardiovasculares, Madrid, Spain Susanna-Assunta Sansone  •  EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK Maria V. Schneider  •  EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK Katharina Schutt  •  Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany; Institute of Clinical Immunology, University of Leipzig, Leipzig, Germany Dov Shiffman  •  Celera Corporation, Alameda, CA, USA Peter F. Stadler  •  Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany


Christian J. Stoeckert Jr  •  Department of Genetics and Center for Bioinformatics, University of Pennsylvania School of Medicine, Philadelphia, PA, USA Hiroshi Tanaka  •  Department of Computational Biology, Graduate School of Biomedical Science, Tokyo Medical and Dental University, Tokyo, Japan Chris F. Taylor  •  EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK Ronald Taylor  •  Computational Biology & Bioinformatics Group, Pacific Northwest National Laboratory, Richland, WA, USA Alberto Termanini  •  ‘L Galvani’ Interdept Center, University of Bologna, Bologna, Italy Paolo Tieri  •  ‘L Galvani’ Interdept Center, University of Bologna, Bologna, Italy Zlatko Trajanoski  •  Division for Bioinformatics, Innsbruck Medical University, Innsbruck, Austria Kerstin Boll  •  Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany; Institute of Clinical Immunology, University of Leipzig, Leipzig, Germany Melissa Warden  •  Department of Pediatrics and Pathology, Keck School of Medicine, Childrens Hospital Los Angeles, University of Southern California, Los Angeles, CA, USA Klaus M. Weinberger  •  BIOCRATES life sciences AG, Innsbruck, Austria Anton Wellstein  •  Lombardi Cancer Center, Georgetown University, Washington, DC, USA Martin Wiesinger  •  emergentec biodevelopment GmbH, Vienna, Austria Julia Wilflingseder  •  Medical University of Vienna and KH Elisabethinen Linz, Vienna, Austria Cathy H. Wu  •  Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE, USA Yongliang Yang  •  Department of Radiology, Harvard Medical School, Harvard University, Boston, MA, USA; Center of Molecular Medicine, Department of Biological Engineering, Dalian University of Technology, Dalian, China Dick de Zeeuw  •  Department of Nephrology, University Medical Center Groningen, Groningen, The Netherlands Fan Zhang  •  Indiana University School of Informatics, Indianapolis, IN, USA

Part I Omics Bioinformatics Fundamentals

Chapter 1

Omics Technologies, Data and Bioinformatics Principles

Maria V. Schneider and Sandra Orchard

Abstract

We provide an overview of the state of the art of Omics technologies, the types of Omics data, and the bioinformatics resources relevant and related to Omics. We also illustrate the bioinformatics challenges of dealing with high-throughput data. This overview touches on several fundamental aspects of Omics and bioinformatics: data standardisation, data sharing, storing Omics data appropriately, and exploring Omics data in bioinformatics. Though the principles and concepts presented hold for the various technological fields, we concentrate on three main Omics fields, namely genomics, transcriptomics, and proteomics. Finally, we address the integration of Omics data and provide several useful links for bioinformatics and Omics.

Key words: Omics, Bioinformatics, High-throughput, Genomics, Transcriptomics, Proteomics, Interactomics, Data integration, Omics databases, Omics tools

1. Introduction

The last decade has seen an explosion in the amount of biological data generated by an ever-increasing number of techniques enabling the simultaneous detection of a large number of alterations in molecular components (1). The Omics technologies utilise these high-throughput (HT) screening techniques to generate the large amounts of data required to enable a system-level understanding of correlations and dependencies between molecular components. Omics techniques need to be high throughput because they must analyse very large numbers of genes, gene expression events, or proteins, either in a single procedure or in a combination of procedures. Computational analysis, i.e., the discipline now known as bioinformatics, is a key requirement for the study of the vast amounts of data generated. Omics requires the use of


t­ echniques that can handle extremely complex biological samples in large quantities (e.g. high throughput) with high sensitivity and specificity. Next generation analytical tools require improved robustness, flexibility and cost efficiency. All of these aspects are being continuously improved, potentially enabling institutes such as the Wellcome Trust Sanger Sequencing Centre (see Note 1) to generate thousands of millions of base pairs per day, rather than the current output of 100  million per day (http://www. yourgenome.org/sc/nt). However, all this data production makes sense only if one is equipped with the necessary analytical resources and tools to understand it. The evolution of the laboratory techniques has therefore to occur in parallel with a corresponding improvement in analytical methodology and tools to handle the data. The phrase Omics – a suffix signifying the measurement of the entire complement of a given level of biological molecules and information – encompasses a variety of new technologies that can help explain both normal and abnormal cell pathways, networks, and processes via the simultaneous monitoring of thousands of molecular components. Bioinformaticians use computers and statistics to perform extensive Omics-related research by searching biological databases and comparing gene sequences and proteins on a vast scale to identify sequences or proteins that differ between diseased and healthy tissues, or more general between different phenotypes. “Omics” spans an increasingly wide range of fields, which now range from genomics (the quantitative study of protein coding genes, regulatory elements and noncoding sequences), transcriptomics (RNA and gene expression), proteomics (e.g. focusing on protein abundance), and metabolomics (metabolites and metabolic networks) to advances in the era of post-genomic biology and medicine: pharmacogenomics (the quantitative study of how genetics affects a host response to drugs), physiomics (physiological dynamics and functions of whole organisms) and in other fields: nutrigenomics (a rapidly growing discipline that focuses on identifying the genetic factors that influence the body’s response to diet and studies how the bioactive constituents of food affect gene expression), phylogenomics (analysis involving genome data and evolutionary reconstructions, especially phylogenetics) and interactomics (molecular interaction networks). Though in the remainder of this chapter we concentrate on an isolated few examples of Omics technologies, much of what is said, for ­example about data standardisation, data sharing, storage and analysis requirements are true for all of these different technological fields. There are already large amounts of data generated by these technologies and this trend is increasing, for example second and third generation sequencing technologies are leading to an ­exponential increase in the amount of sequencing data available. From a computational point of view, in order to address the


c­ omplexity of these data, understand molecular regulation and gain the most from such comprehensive set of information, knowledge discovery – the process of automatically searching large volumes of data for patterns – is a crucial step. This process of bioinformatics analysis includes: (1) data processing and ­molecule (e.g. protein) identification, (2) statistical data analysis, (3) pathway analysis, and (4) data modelling in a system wide context. In this chapter we will present some of these analytical methods and discuss ways in which data can be made accessible to both the specialised bioinformatician, but in particular to the research scientist.

2. Materials

There are a variety of definitions of the term HT; however, we can loosely apply this term to cases where automation is used to increase the throughput of an experimental procedure. HT technologies exploit robotics, optics, chemistry, biology and image analysis research. The explosion in data production in the public domain is a consequence of falling equipment prices, the opening of major national screening centres and new HT core facilities at universities and other academic institutes. The role of bioinformatics in HT technologies is essential.

2.1. Genomics High-Throughput Technologies

High-Throughput Sequencing (HTS) technologies are used not only for traditional applications in genomics and metagenomics (see Note 2), but also for novel applications in the fields of transcriptomics, metatranscriptomics (see Note 3), epigenomics (see Note 4), and studies of genome variation (see Note 5). Next-generation sequencing platforms determine sequence data from amplified single DNA fragments and have been developed specifically to lend themselves to robotics and parallelisation. Current methods can directly sequence only relatively short (300–1,000 nucleotides long) DNA fragments in a single reaction. Short-read sequencing technologies dramatically reduce the sequencing cost. There were initial fears that the increase in quantity might result in a decrease in quality, and improvements in accuracy and read length are still being sought. Nevertheless, these advances have significantly reduced the cost of several sequencing applications, such as resequencing individual genomes (2) and readout assays (e.g. ChIP-seq (3) and RNA-seq (4)).
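The read-level output of such platforms is typically delivered as FASTQ records, and even basic quality control starts with simple per-read summaries. The following is a minimal sketch of that first step, assuming standard four-line FASTQ records and Phred+33 quality encoding; the file name is a placeholder, and real pipelines would rely on dedicated QC tools rather than a hand-rolled parser like this one.

```python
# Minimal sketch: summarise read length and base quality from a FASTQ file.
# Assumes standard four-line records and Phred+33 quality encoding; the
# file name "reads.fastq" is a placeholder.

def parse_fastq(path):
    """Yield (identifier, sequence, quality) tuples from a FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break
            seq = handle.readline().rstrip()
            handle.readline()                 # the "+" separator line
            qual = handle.readline().rstrip()
            yield header[1:], seq, qual

def mean_phred(qual):
    """Convert Phred+33 characters to scores and average them."""
    scores = [ord(c) - 33 for c in qual]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    lengths, quals = [], []
    for _, seq, qual in parse_fastq("reads.fastq"):
        lengths.append(len(seq))
        quals.append(mean_phred(qual))
    if lengths:
        print("reads:", len(lengths))
        print("mean length: %.1f" % (sum(lengths) / len(lengths)))
        print("mean quality: %.1f" % (sum(quals) / len(quals)))
```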

2.2. Transcriptomics High-Throughput Technologies

The transcriptome is the set of all messenger RNA (mRNA) molecules, or “transcripts”, produced in a single cell or a population of cells. Several methods have been developed to gain expression information at a high-throughput level.


Global gene expression analysis has been conducted either by hybridization with oligonucleotide microarrays, or by counting of sequence tags. Digital transcriptomics with pyrophosphatase based ultra-high throughput DNA sequencing of ditags represents a revolutionary approach to expression analysis, which generates genome-wide expression profiles. ChIP-Seq is a technique that combines chromatin immunoprecipitation with sequencing technology to identify and quantify in vivo protein–DNA interactions on a genome-wide scale. Many of these applications are directly comparable to microarray experiments, for example ChIP-chip and ChIP-Seq are for all intents and purposes the same (5). The most recent increase in data generation in this evolving field is due to novel cycle-array sequencing methods (see Note 6), also known as next-generation sequencing (NGS), more commonly described as second-generation sequencing which are already being used by technologies such as next-generation expressed-sequence-tag sequencing (see Note 7). 2.3. Proteomics High-Throughput Technologies

Proteomics is the large-scale study of proteins, particularly their expression patterns, structures and functions, and there are various HT techniques applied to this area. Here we explore two main proteomics fields: Mass Spectrometry HT and Protein– Protein Interactions (PPIs).

2.3.1. Mass Spectrometry High-Throughput Technologies

Mass spectrometry is an important emerging method for the characterization of proteins. It is also a rapidly developing field which is currently moving towards large-scale quantification of specific proteins in particular cell types under defined conditions. The rise of gel-free protein separation techniques, coupled with advances in MS instrumentation sensitivity and automation, has provided a foundation for high throughput approaches to the study of proteins. The identification of parent proteins from derived peptides now relies almost entirely on the software of search engines, which can perform in silico digests of protein sequence to generate peptides. Their molecular mass is then matched to the mass of the experimentally derived protein fragments.
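As a concrete illustration of the in silico digestion step just described, the sketch below cleaves a protein sequence with trypsin-like rules (after K or R, but not before P) and computes the monoisotopic mass of each resulting peptide, which a search engine would then match against the experimentally observed fragment masses. The residue masses are rounded standard values and the toy sequence is invented; real search engines additionally model missed cleavages, modifications and scoring.

```python
# Minimal sketch: in silico tryptic digest of a protein sequence and
# computation of peptide monoisotopic masses. Residue masses are rounded
# monoisotopic values; verify against a reference table before real use.

RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to every free peptide

def tryptic_digest(sequence):
    """Cleave after K or R, except when followed by P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_res = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_res != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

if __name__ == "__main__":
    protein = "MKWVTFISLLLLFSSAYSRGVFRR"   # toy sequence, not a real entry
    for pep in tryptic_digest(protein):
        print(f"{pep:15s} {peptide_mass(pep):10.4f}")
```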

2.3.2. Interactomics HT Technologies

Studying protein–protein interactions provides valuable insights into many fields by helping precisely understand a protein’s role inside a specific cell type, with many of the techniques commonly used to experimentally determine protein interactions lending themselves to high throughput methodologies. Complementation assays (e.g. 2-hybrid) measure the oligomerisation-assisted complementation of two fragments of a single protein which when united result in a simple biological readout – the two protein fragments are fused to the potential bait/prey interacting partners respectively. This methodology is easily scalable to HT since it can


yield very high numbers of coding sequences assayed in a ­relatively simple experiment and a wide variety of interactions can be detected and characterised following one single, commonly used protocol. However, the proteins are being expressed in an alien cell system with a loss of temporal and physiological control of expression patterns, resulting in a large number of false-positive interactions. Affinity-based assays, such as affinity chromatography, pull-down and coimmunoprecipitation, rely on the strength of the interaction between two entities. These techniques can be used on interactions which form under physiological conditions, but are only as good as the reagents and techniques used to identify the participating proteins. High throughput mass spectrometry is increasingly used for the rapid identification of the participants in an affinity complex. Physical methods depend on the properties of molecules to enable measurement of an interaction, as typified by techniques such as X-ray crystallography and enzymatic assays. High quality data can be produced but highly purified proteins are required, which has always proved a rate ­limiting step. Availability of automated chromatography systems and custom robotic systems that streamline the whole process, from cell harvesting and lysis through to sample clarification and chromatography has changed this, and increasing amounts of data are being generated by such experiments. 2.4. Challenges in HT Technologies

It is now largely the case that high throughput methods exist for all or most of the Omics domains. The challenge now is to ­prevent bottlenecks appearing in the storing, annotation, and analysis of the data. First the data which is required to describe both – how an experiment was performed and the results generated by it – must be defined. A place to store that information must be identified, a means by which it will be gathered has to be agreed upon, and ways in which the information will be queried, retrieved and analysed must also be decided. Data in isolation is of limited use, so ideally the data format chosen should be appropriate to enable the combination and comparison of multiple datasets, both ­in-house and with other groups working in the same area. HT data is increasingly used in a broader context beyond the individual project; consequently it is becoming more important to standardise and share this information appropriately and to preinterpret it for the scientists who are not involved with the experiment, whilst still making the raw data available for those who wish to perform their own analyses.

2.5. Bioinformatics Concepts

In high throughput research, knowledge discovery starts by collecting, selecting and cleaning the data in order to fill a database. A database is a collection of files (archive) of consistent data that are stored in a uniform and efficient manner. A relational database consists of a set of tables, each storing records (instances).
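A minimal sketch of these relational concepts, using Python's built-in sqlite3 module, is shown below; the table and column names are invented for illustration, and a production database would be designed along the conceptual, logical and physical phases described in this section.

```python
# Minimal sketch of a relational table of protein records, each identified
# by a unique accession (primary key) and carrying a fixed set of typed
# attributes. Table and column names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")           # throwaway in-memory database
conn.execute("""
    CREATE TABLE protein (
        accession  TEXT PRIMARY KEY,         -- unique identifier, e.g. UniProtKB
        gene_name  TEXT NOT NULL,
        organism   TEXT NOT NULL,
        length_aa  INTEGER
    )
""")
conn.execute("CREATE INDEX idx_gene ON protein (gene_name)")  # speeds frequent queries

conn.execute(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    ("P04637", "TP53", "Homo sapiens", 393),
)

for row in conn.execute(
    "SELECT accession, gene_name FROM protein WHERE organism = ?",
    ("Homo sapiens",),
):
    print(row)
conn.close()
```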


A record is represented as a set of attributes which define a property of a record. Attributes can be identified by their name and store a value. All records in a table have the same number and type of attributes. Database design is a crucial step in which the data requirements of the application have first to be defined (conceptual design), including the entities and their relationships. Logical design is the implementation of the database using database ­management systems, which ensure that the process is scalable. Finally the physical design phase estimates the workload and refines the database design accordingly. It is during this phase that table designs are optimized, indexing is implemented and clustering approaches are optimized. These are fundamental in order to obtain fast responses to frequent queries without jeopardising the database integrity (e.g. redundancy). Primary or archived databases contain information directly deposited by submitters and give an exact representation of their published data, for example DNA sequences, DNA and protein structures and DNA and protein expression profiles. Secondary or derived databases are socalled because they contain the results of analysis on the primary resources, including information on sequence patterns or motifs, variants and mutations and evolutionary relationships. The fundamental characteristic of a database record is a unique identifier. This is crucial in biology given the large number of situations where a single entity has many names, or one name refers to multiple entities. To some extent, this problem can be overcome by the use of an accession number, a primary key derived by a reference database to describe the appearance of that entity in that database. For example, using the UniProtKB protein sequence database accession number of human p53 gene products (P04637) gives information on the sequence of all the isoforms of these proteins, gene and protein nomenclature as well as a wealth of information about its function and role in a cell. More than one protein sequence database exists, and the vast majority of protein sequences exist in all of these. Fortunately resources to translate between these multiple accession numbers now exist, for example the Protein Identifier Cross-Reference (PICR) Service at the European Bioinformatics Institute (EBI) (see Note 8). The Omics fields share with all of biology the challenge of handling ever-increasing amounts of complex information effectively and flexibly. Therefore a crucial step in bioinformatics is to choose the appropriate representation of the data. One of the simplest but most efficient approaches has been the use of ­controlled vocabularies (CVs), which provide a standardised ­dictionary of terms for representing and managing information. Ontologies are structure CVs. An excellent example of this ­methodology is the Gene Ontology (GO) that describes gene products in terms of their associated biological processes, cellular


components and molecular functions in a species independent manner. Substantial effort has been, and continues to be, put into the development and maintenance of the ontologies themselves; the annotation of gene products, which entails making associations between the ontologies and the genes and gene products across databases; and the development of tools that facilitate the creation, maintenance and use of ontologies. The hierarchical nature of these CVs enable more meaningful queries to be made, for example searching either a microarray or proteomics experiment for expression patterns on the brain, enable experiments annotated to the cortex to be included because the BRENDA tissue CV recognises the cortex as “part-of ” the brain (http://www. ebi.ac.uk/ontology-lookup/browse.do?ontName=BTO). Use of these CVs have been encouraged, and even made mandatory, by many groups such as the Microarray Expression Data group (MGED) which recommends the use of the MGED ontology (6) for the description of key experimental concepts and, where possible, ontologies developed by other communities for describing terms such as anatomy, disease and chemical compounds. Clustering methods are used to identify patterns in the data, in other words to recognise what is similar, to identify what is different, and from there to know when differences are meaningful. These three steps are not trivial at all; proteins for example exhibit rich evolutionary relationships and complex molecular interactions and hence present many challenges for computational sequence analysis. Sequence similarity refers to the degree to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity (the extent to which two (nucleotide or amino acid) sequences are invariant) and/or conservation (changes at a specific position of an amino acid or nucleotide sequence that preserve the physicochemical properties of the original residue). The applications of sequence similarity searching are numerous, ranging from the characterization of newly sequenced genomes, through phylogenetics, to species identification in environmental samples. However, it is important to keep in mind that identifying similarity between sequences (e.g. nucleotide or amino acid sequences) is not necessarily equivalent to identifying other pro­ perties of such sequences, for example their function.
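To make the identity/conservation distinction concrete, the sketch below scores two pre-aligned protein sequences. The physicochemical grouping used to call a substitution "conservative" is a simplified stand-in for a proper substitution matrix such as BLOSUM, and real analyses would start from an alignment produced by a dedicated tool.

```python
# Minimal sketch: percent identity and a crude conservation count for two
# pre-aligned protein sequences of equal length (gaps as "-").

# Toy physicochemical groups; illustrative, not a published classification.
GROUPS = ["AVLIM", "FWY", "ST", "DE", "KRH", "NQ", "C", "G", "P"]

def group_of(residue):
    for g in GROUPS:
        if residue in g:
            return g
    return None

def compare(aln1, aln2):
    assert len(aln1) == len(aln2), "sequences must be aligned to equal length"
    identical = conserved = aligned = 0
    for a, b in zip(aln1, aln2):
        if a == "-" or b == "-":
            continue                          # skip gapped columns
        aligned += 1
        if a == b:
            identical += 1
        elif group_of(a) is not None and group_of(a) == group_of(b):
            conserved += 1
    return 100.0 * identical / aligned, 100.0 * conserved / aligned

if __name__ == "__main__":
    s1 = "MKT-AYIAKQR"
    s2 = "MKTLAYLSKQR"
    ident, cons = compare(s1, s2)
    print(f"identity: {ident:.1f}%  conservative substitutions: {cons:.1f}%")
```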

3. Methods It is obvious that without bioinformatics it is impossible to make sense of the huge data produced in Omics research. If we look at the increase of the EMBL Nucleotide Sequence Database (EMBL-Bank), the Release 105 on 27-AUG-2010 contained


195,241,608 sequence entries comprising 292,078,866,691 nucleotides, which translated to a total of 128 GB of compressed and 831 GB of uncompressed data. Bioinformatics does not only have to provide the structures in which to store the information, but must also store it in such a way that it is retrievable and comparable, not only to similar data but also to other types of information. The challenges and concepts that bioinformatics as a discipline currently encompasses do not essentially differ from those listed by (7); they have merely expanded to meet the challenges imposed by the volume of data produced. These include:

1. A precise, predictive model of transcription initiation and termination: the ability to predict where and when transcription will occur in a genome (fundamental for HTS and proteomics);
2. A precise, predictive model of RNA splicing/alternative splicing: the ability to predict the splicing pattern of any primary transcript in any tissue (fundamental for transcriptomics and proteomics);
3. Precise, quantitative models of signal transduction pathways: the ability to predict cellular responses to external stimuli (required in proteomics and pathway analysis);
4. Determination of effective protein:DNA, protein:RNA and protein:protein recognition codes (important for recognition of interactions among the various types of molecules);
5. Accurate ab initio protein structure prediction (required for proteomics and pathway analysis);
6. Rational design of small molecule inhibitors of proteins (chemogenomics);
7. Mechanistic understanding of protein evolution: understanding exactly how new protein functions evolve (comparative genomics);
8. Mechanistic understanding of speciation: molecular details of how speciation occurs (comparative genome sequences, sequence variation);
9. Continued development of effective gene ontologies: systematic ways to describe the functions of any gene or protein (genomics, transcriptomics, and proteomics).

The above list summarises general concepts required for multiple Omics data sources. Next we describe issues which are specific to one particular field, but may have downstream consequences in other areas.

3.1. The Role of Bioinformatics in Genomics

Here we will explore two major challenges in genomics: de novo sequencing assembly and genome annotation.


3.1.1. De Novo Genome Sequencing

A critical stage in de  novo genome sequencing is the assembly of  shotgun reads, in other words putting together fragments ­randomly extracted from the sample to form a set of contiguous sequences and contigs that represent the DNA in the sample. Algorithms are available for whole genome shotgun fragment assembly, including Atlas (8), Arachne (9), Celera (10), PCAP (11), Phrap (http://www.phrap.org) and Phusion (12). All these programmes rely on the overlap-layout-consensus approach (13) where all the reads are compared to each other in a pair-wise fashion. However, this approach presents several disadvantages, especially in the case of next-generation microread sequencing. EDENA (14) is the only microread assembler developed using computation of pairwise overlaps. Included reads, i.e. reads which align over their whole length onto another read, have to be removed from the graph; this means that mixed-length sequencing cannot be performed directly with an overlap graph. Short reads are either simply mapped onto long read contigs or they are assembled separately (Daniel Zerbino personal communication). The use of a sequence graph to represent an assembly was introduced by (15). Idury and Waterman presented an assembly algorithm for an alternative sequencing technique, sequencing by hybridisation, where an oligoarray could detect all the k nucleotide words, also known as k-mers, present in a given genome. Pevzner et  al. (16) expanded on this idea, proposing a slightly different formalisation of the sequence graph, called the de Bruijn graph, whereby the k-mers are represented as arcs and overlapping k-mers join at their tips, and consecutively presented algorithms to build and correct errors in the de Bruijn graph (13), use paired-end reads (16) or short reads (17). Zerbino and Birney (18) developed a new set of algorithms, collectively called “Velvet,” to manipulate de Bruijn graphs for genomic sequence assembly for the de novo assembly of microreads. Several studies have used Velvet (19–22). Other analytical software adopting the use of the de Bruijn graph are ALLPATHS (23) and SHORTY (24) specialised in localising the use of paired-end reads, whereas the ABySS (25, 26) successfully parallelised the construction of the de Bruijn graph, thus removing practical memory limitations on assemblies. The field of de  novo assembly of NGS reads is constantly evolving and there is not yet a firm process or best practise set in place.
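The de Bruijn graph construction at the core of tools such as Velvet and ABySS can be illustrated in a few lines: each k-mer observed in the reads becomes an arc connecting its (k-1)-mer prefix to its (k-1)-mer suffix. The sketch below builds only this raw graph from toy reads; error correction, the use of paired-end information and graph simplification, which the assemblers mentioned above implement, are omitted.

```python
# Minimal sketch of de Bruijn graph construction from short reads:
# every k-mer becomes an arc joining its (k-1)-mer prefix to its
# (k-1)-mer suffix.
from collections import defaultdict

def de_bruijn(reads, k):
    """Return adjacency map: (k-1)-mer prefix -> list of (k-1)-mer suffixes."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

if __name__ == "__main__":
    reads = ["ACGTACGA", "GTACGATT", "CGATTACG"]   # toy micro-reads
    for prefix, suffixes in sorted(de_bruijn(reads, k=4).items()):
        print(prefix, "->", ", ".join(sorted(suffixes)))
```

In a real assembler, contigs then correspond to unambiguous paths through this graph, which is why read errors (which create spurious tips and bubbles) must be corrected before or during traversal.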

3.1.2. Genome Annotation

Genome annotation is the process of marking the genes and other biological features in a DNA sequence. It consists of two main steps: (1) Gene Finding: identifying elements on the genome and (2) adding biological information to these elements. There are automatic annotation tools to perform all this by computer analysis, as opposed to manual annotation which involves human expertise. Ideally, these approaches coexist and complement each


other in the same annotation pipeline. The basic level of ­annotation uses BLAST to find similarities, and annotates genomes based on that. However, nowadays more and more additional information is added to the annotation platform. Structural annotation consists of the identification of genomic elements: ORFs and their localisation, gene structure, coding regions and location of regulatory motifs. Functional annotation consists in attaching biological information to genomic elements: biochemical function, biological function, involved regulation and interactions and expression. These steps may involve both biological experiments and in silico analysis and are often initially performed in related databases, usually protein sequence databases such as UniProtKB, and transferred back onto the genomic sequence. A variety of software tools have been developed to permit scientists to view and share genome annotations. The additional information allows manual annotators to disentangle discrepancies between genes that have been given conflicting annotation. For example, the Ensembl genome browser relies on both curated data sources as well as a range of different software tools in their automated genome annotation pipeline (27). Genome annotation remains a major challenge for many genome projects. The identification of the location of genes and other genetic control elements is frequently described as defining the biological “parts list” for the assembly and normal operation of an organism. Researchers are still at an early stage in the process of delineating this parts list, as well as trying to understand how all the parts “fit together”. 3.2. The Role of Bioinformatics in Transcriptomics

Both microarray and proteomics experiments provide long lists of transcripts (mRNA and proteins respectively) co-expressed at any one time and the challenge is to give biological relevance to these lists. Several different computational algorithms have been developed and can be usefully applied at various steps of the analytical pipeline. Clustering methods are used to order and visualise the underlying patterns in large scale expression datasets showing similar patterns that can therefore be grouped according to their co-regulation/co-expression (e.g. specific developmental times or cellular/tissue locations). This indicates (1) co-regulated transcripts which might be functionally related and (2) the clusters represent a natural structure of the data. Transcripts can also be grouped by their known – or predicted function. A resource commonly used for this is the GO ontology (http://www.geneontology.org). There are several bioinformatics tools for calculating the number of significantly enriched GO terms, for example: (1) GO miner (http://discover.nci.nih.gov/ gominer) generates a summary of GO terms that are significantly enriched in a user input list of protein accession numbers when compared to a reference database like UniProtKB/SwissProt;


(2)  GO slims which are subsets of GO terms from the whole Gene Ontology and are particularly useful for giving a summary of the results of GO annotation of a genome, microarrays and proteomics (http://amigo.geneontology.org/cgi-bin/amigo/ go.cgi). 3.3. The Role of Bioinformatics in Proteomics 3.3.1. Protein Annotation

3.3.2. Protein–Protein Interaction Analysis and Comparative Interactomics

The use of different bioinformatics approaches to determine the presence of a gene or open reading frame (ORF) in those genomes can lead to divergent ORF annotations (even for data generated from the same genomic sequences). It is therefore crucial to use the correct dataset for protein sequence translations. One method for confirming a correct protein sequence is mass spectrometry based proteomics, in particular by de  novo sequencing which does not rely on pre-existing knowledge of a protein sequence. However, historically, there has initially been no method for publishing these protein sequences, except as long lists reported directly with the article or included on the publisher’s website as supplementary information. In either case, these lists are typically provided as PDF or spreadsheet documents with a custom-made layout, making it practically impossible for computer programmes to interpret them, or efficiently query them. A solution to this problem is provided by the PRIDE database (http://www.ebi.ac. uk/pride) which provides a standards compliant, public repository for mass spectrometry based proteomics, giving access to experimental evidence that a transcribed gene product does exist, as well as the pattern of tissues in which it is expressed (28). The annotation of protein functional information largely relies on manual curation, namely biologists reading the scientific literature and transferring the information to computational records – a process in which the UniProtKB curators have lead the way for many years. The many proteins for which functional information is not available, however, rely on selected information being transferred from closely related orthologues in other species. A number of protein signature databases now exist, which create algorithms to recognise these closely related protein families or domains within proteins. These resources have been combined in a single database, Interpro (http://www.ebi.ac.uk/ interpro) and the tool InterProScan (see Note 9) (http://www. ebi.ac.uk/Tools/InterProScan) is available for any biologist wishing to perform their own automated protein (or gene) annotation (29). Protein–protein interactions are generally represented in graphical networks with nodes corresponding to the proteins and edges to the interactions. Although edges can vary in length most networks represent undirected and only binary interactions. Bioinformatics tools and computational biology efforts into graph theory methods have and continue to be part of the knowledge


discovery process in this field. Analysis of PPI networks involves many challenges, due to the inherent complexity of these networks, high noise level characteristic of the data, and the presence of unusual topological phenomena. A variety of data-mining and statistical techniques have been applied to effectively analyze PPI data and the resulting PPI networks. The major challenges for computational analysis of PPI networks remain: 1. Unreliability of large scale experiments; 2. Biological redundancy and multiplicity: a protein can have several different functions; or a protein may be included in one or more functional groups. In such instances overlapping clusters should be identified in the PPI networks, however since conventional clustering methods generally produce pairwise disjoint clusters, they may not be effective when applied to PPI networks; 3. Two proteins with different functions frequently interact with each other. Such frequent connections between the proteins in different functional groups expand the topological complexity of the PPI networks, posing difficulties to the detection of unambiguous partitions. Intensive research trying to understand and characterise the structural behaviours of such systems from a topological perspective have shown that features such as small-world properties (any two nodes can be connected via a short path of a few links), scalefree degree distributions (power-law degree distribution indicating that a few hubs bind numerous small nodes), and hierarchical modularity (hierarchical organization of modules) suggests that a functional module in a PPI network represents a maximal set of functionally associated proteins. In other words, it is composed of those proteins that are mutually involved in a given biological process or function. In this model, the significance of a few hub nodes is emphasized, and these nodes are viewed as the determinants of survival during network perturbations and as the essential backbone of the hierarchical structure. The information retrieved from HT interactomics data could be very valuable as a means to obtain insights into a systems evolution (e.g. by comparing the organization of interaction networks and by analyzing their variation and conservation). Likewise, one could learn whether and how to extend the network information obtained experimentally in well-characterised model systems onto different organisms. Cesareni et al. (30) concluded that, despite the recent completion of several high throughput experiments aimed at the description of complete interactomes, the available interaction information is not yet of sufficient coverage and quality to draw any biologically meaningful conclusion from the comparison of different interactomes. The development of more


accurate experimental and informatics approaches is required to allow us to study network evolution.

3.4. Storing Omics Data Appropriately

The massive amounts of data produced in Omics experiments can help us gain insights into underlying biological processes only if they are carefully recorded and stored in databases, where they can be queried, compared and analyzed. Data has to be stored in a structured and standardized format that enables data sharing between multiple resources, as well as common tool development and the ability to merge data sets generated by different technologies. Omics is very much technology driven, and all instrument and software manufacturers initially produce data in their own proprietary formats, often then tying customers into a limited number of downstream analytical instruments. Efforts have been ongoing for many years to develop and encourage the development of common formats to enable data exchange and standardized methods for the annotation of such data to allow dataset comparison. These efforts were spear-headed by the transcriptomics ­community, who developed the MIAME standards (Minimum Information About a Microarray Experiment, http://www.mged. org/Workgroups/MIAME/miame.html) (31). The MIAME standards describe the set of information sufficient to interpret a microarray experiment and its results unambiguously, to enable verification of the data and potentially to reproduce the experiment itself. Their lead was soon followed by the proteomics ­community with the MIAPE standards (Minimum Information About a Proteomics Experiment, http://www.psidev.info/index. php?q=node/91), the interaction community (MIMIx, http:// imex.sourceforge.net/MIMIx) and many others. This has resulted in the development of tools which can combine datasets, for example it is possible to import protein interaction data into the visualisation tool Cytoscape (http://www.cytoscape.org) in a common XML format (PSI-MI) and overlay this with expression data from a microarray experiment.
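As a minimal sketch of what "capturing the minimum information" can look like in practice, the snippet below records a few MIAME-style descriptors alongside the result files in a machine-readable form. The field names are illustrative only and do not constitute an official MIAME, MIAPE or PSI-MI schema; the community formats referenced above should be used for real submissions.

```python
# Minimal sketch: recording experiment metadata in a structured, shareable
# form next to the raw data. Field names are illustrative placeholders.
import json

experiment = {
    "experiment_id": "EXP-0001",             # placeholder identifier
    "design": {
        "organism": "Homo sapiens",
        "tissue": "cortex",                  # term from a controlled vocabulary
        "tissue_ontology": "BTO",
        "factors": ["disease state"],
    },
    "protocol": {
        "platform": "oligonucleotide microarray",
        "labeling": "biotin",
        "replicates": 3,
    },
    "data_files": [
        {"file": "sample_01_raw.cel", "type": "raw"},
        {"file": "normalised_matrix.txt", "type": "processed"},
    ],
}

with open("EXP-0001_metadata.json", "w") as out:
    json.dump(experiment, out, indent=2)
print(json.dumps(experiment["design"], indent=2))
```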

3.5. Exploring Omics Data in Bioinformatics

Below we will follow the three Omics fields we described above. It would be impossible to list all the databases dealing with these data; however, as the European Bioinformatics Institute hosts one of the most comprehensive sets of bioinformatics databases and also actively coordinates, or is involved in, setting standards and their implementation, it serves as an exemplar for databases that are at the state of the art for standards, technologies and integration of the data. A list of major institutes and their databases is provided at the end of this chapter (see Note 18).

3.5.1. Genomics

The genome is a central concept at the heart of biology. Since the first complete genome was sequenced in the mid-1990s, over 800



more have been sequenced, annotated, and submitted to the public databases. New ultra-high-throughput sequencing technologies are now beginning to generate complete genome sequences at an accelerating rate, both to gap-fill portions of the taxonomy where no genome sequence has yet been deciphered (e.g. the GEBA project, http://www.jgi.doe.gov/programs/GEBA, which aims to sequence 6,000 bacteria from taxonomically distinct clades), and to generate data for variation in populations of species of special interest (e.g. the 1000 Genomes Project in human, http://www.1000genomes.org, and the 1001 Genomes Project in Arabidopsis, http://www.1001genomes.org). In addition, modern sequencing technologies are increasingly being used to generate data for gene regulation and expression on a genome-wide scale. The vast amount of information associated with the genomic sequence demands a way to organise and access it (see Note 19). A successful example of this is the genome browser Ensembl. Ensembl (http://www.ensembl.org) is a joint project between the EBI and the Wellcome Trust Sanger Institute that annotates chordate genomes (i.e. vertebrates and closely related invertebrates with a notochord, such as sea squirt). Gene sets from model organisms such as yeast and fly are also imported for comparative analysis by the Ensembl "compara" team. Most annotation is updated every 2 months; however, the gene sets are determined about once a year. A new browser, http://www.ensemblgenomes.org, has now been set up to access non-chordate genomes from bacteria, plants, fungi, metazoa and protists. Ensembl provides genes and other annotation such as regulatory regions, conserved base pairs across species, and mRNA and protein mappings to the genome. Ensembl condenses many layers of genome annotation into a simplified view for the ease of the user. The Ensembl gene set reflects a comprehensive transcript set based on protein and mRNA evidence in the UniProt and NCBI RefSeq databases (see Note 10). These proteins and mRNAs are aligned against a genomic sequence assembly imported from a relevant sequencing centre or consortium. Transcripts are clustered into the same gene if they have overlapping coding sequence. Each transcript is given a list of the mRNAs and proteins it is based upon. Ensembl utilises BioMart, a query-optimised database for efficient data mining described below, and the application of a comparative analysis pipeline: Compara. The Ensembl Compara multi-species database stores the results of genome-wide species comparisons calculated for each data release, including: (1) comparative genomics: whole genome alignments and synteny regions; and (2) comparative proteomics: orthologue and paralogue predictions.



Ensembl Compara includes GeneTrees, a comprehensive gene-orientated phylogenetic resource. It is based on a computational pipeline to handle clustering, multiple alignment, and tree generation, including the handling of large gene families. Ensembl also imports variations, including Single Nucleotide Polymorphisms (SNPs) and insertion-deletion mutations (indels) and their flanking sequence, from various sources. These sequences are aligned to the reference sequence; variation positions are calculated in this way, along with any effects on transcripts in the area. The majority of variations are obtained from NCBI dbSNP. For human, other sources include Affymetrix GeneChip Arrays, the European Genome-phenome Archive, and whole genome alignments of individual sequences from the Venter (32), Watson (33) and Celera individuals (34). Sources for other species include Sanger re-sequencing projects for mouse, and alignments of sequences from the STAR consortium for rat. Ancestral alleles from dbSNP were determined through a comparison study of human and chimpanzee DNA (35).
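Much of the annotation described above can also be retrieved programmatically rather than through the browser. The sketch below queries the public Ensembl REST service for a gene symbol; this service post-dates the browser-era description in this chapter, so the endpoint, the example gene and the returned fields should be checked against the current REST documentation at https://rest.ensembl.org.

```python
# Illustrative sketch: look up a gene in Ensembl via its public REST API.
# Endpoint layout follows https://rest.ensembl.org; fields may change
# between releases, so treat this as an example rather than a reference.
import requests

server = "https://rest.ensembl.org"
endpoint = "/lookup/symbol/homo_sapiens/BRCA2"

response = requests.get(server + endpoint,
                        params={"expand": 1},
                        headers={"Content-Type": "application/json"})
response.raise_for_status()
gene = response.json()

# Basic annotation: genomic coordinates and the transcripts clustered
# into this gene, mirroring what the browser displays.
print(gene["id"], gene["seq_region_name"], gene["start"], gene["end"])
print("number of transcripts:", len(gene.get("Transcript", [])))
```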

3.5.2. Transcriptomics

There is a wide range of HT transcriptomics data: single and dual channel microarray-based experiments measuring mRNA, miRNA and, more generally, non-coding RNA. One can also include non-array techniques such as serial analysis of gene expression (SAGE). There are three main public repositories for microarray-based studies: ArrayExpress (36), Gene Expression Omnibus (37), and CIBEX (38). Here we describe the EBI microarray repository, ArrayExpress, which consists of three components:
● the ArrayExpress Repository – a public archive of functional genomics experiments and supporting data,
● the ArrayExpress Warehouse – a database of gene expression profiles and other bio-measurements,
● the ArrayExpress Atlas – a new summary database and meta-analytical tool of ranked gene expression across multiple experiments and different biological conditions.

The Warehouse and Atlas allow users to query for differentially expressed genes by gene names and properties, experimental conditions and sample properties, or a combination of both (39). The recently developed ArrayExpress Atlas of Gene Expression (http://www.ebi.ac.uk/microarray-as/atlas) allows the user to query for condition-specific gene expression across multiple data sets. The user can query for a gene or a set of genes by name, synonym, Ensembl identifier or GO term or, alternatively, for a biological sample property or condition (e.g. tissue type, disease name, developmental stage, compound name or identifier). Queries for both genes and conditions are also possible; for example, the user can query for all "DNA repair" genes up-regulated in "cancer", which returns a list of "experiment, condition, gene" triplets, each with a P-value and an up/down arrow characterising the significance and direction of a gene's differential expression in a particular condition in an experiment.



ArrayExpress accepts data generated on all array-based technologies, including gene expression, protein array, ChIP-chip and genotyping. More recently, data from transcriptomic and related applications of uHTS technologies such as Illumina (Solexa Ltd, Saffron Walden, UK) and 454 Life Sciences (Roche, Branford, Connecticut) are also accepted. For Solexa data, FASTQ files, sample annotation and processed data files corresponding to transcription values per genomic location are submitted and curated to the emerging standard MINSEQE (http://www.mged.org/minseqe), and instrument-level data are stored in the European Short Read Archive (http://www.ebi.ac.uk/embl/Documentation/ENA-Reads.html). The ArrayExpress Warehouse now includes gene expression profiles from in situ gene expression measurements, as well as other molecular measurement data from metabolomics and protein profiling technologies. Where in situ and array-based gene expression data are available for the same gene, these are displayed in the same view and links are provided to the multispecies 4DXpress database of in situ gene expression (39). The Gene Expression Atlas provides a statistically robust framework for the integration of gene expression results across different platforms at a meta-analytical level. It also represents a simple interface for identifying strong differential expression candidate genes in conditions of interest. The Atlas also integrates ontologies for high-quality annotation of gene and sample attributes and builds new summarised gene expression views, with the aim of providing analysis of putative signalling pathway targets, discovery of correlated gene expression patterns and the identification of condition/tissue-specific patterns of gene expression. A list of URLs for bioinformatics resources relevant to transcriptomics can be found in Subheading 4 (see Note 20).
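FASTQ, the de facto format for the instrument-level short-read data mentioned above, stores each read as four lines: identifier, sequence, separator, and per-base qualities. The sketch below parses such a file with plain Python; the file name is a placeholder, and the quality conversion assumes the Sanger/Illumina 1.8+ ASCII offset of 33.

```python
# Illustrative sketch: reading a FASTQ file (four lines per read).
# "reads.fastq" is a hypothetical file name used only for the example.
from itertools import islice
from statistics import mean

def parse_fastq(path):
    """Yield (identifier, sequence, quality string) for each read."""
    with open(path) as handle:
        while True:
            block = list(islice(handle, 4))
            if len(block) < 4:
                break
            header, seq, _, qual = (line.rstrip("\n") for line in block)
            yield header[1:], seq, qual

def phred_scores(quality, offset=33):
    """Convert an ASCII quality string to Phred scores (Sanger offset 33)."""
    return [ord(ch) - offset for ch in quality]

for read_id, seq, qual in parse_fastq("reads.fastq"):
    print(read_id, len(seq), round(mean(phred_scores(qual)), 1))
```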

3.5.3. Proteomics

A list of bioinformatics resources relevant to proteomics can be found in Note 21.

3.5.3.1. Protein Sequence and Functional Annotation

Translated proteins and their co-translational and post-translational modifications (PTMs) are the backbone of proteomics (28). UniProt is the most comprehensive data repository on protein sequence and functional annotation. It is maintained by a collaboration of the Swiss Institute of Bioinformatics (SIB), the Protein Information Resource (PIR), and the EBI. It has four components, each of them optimized for different user profiles:
1. The UniProt Knowledgebase (UniProtKB) comprises two sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.



(a) UniProtKB/Swiss-Prot contains high quality annotation extracted from the literature and from computational analyses curated by experts. Annotations include, amongst others: protein function(s), protein domains and sites, PTMs, subcellular location(s), tissue specificity, structure, interactions, and diseases associated with deficiencies or abnormalities.
(b) UniProtKB/TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL/GenBank/DDBJ nucleotide sequence databases, excluding some types of data such as pseudogenes. UniProtKB/TrEMBL records are annotated automatically based on computational analyses.
2. The UniProt Reference Clusters (UniRef) provide clustered sets of all sequences from the UniProtKB database and selected UniProt Archive records to obtain complete coverage of sequences at different resolutions (100, 90, and 50% sequence identity), while hiding redundant sequences.
3. The UniProt Archive (UniParc) is a repository that reflects the history of all protein sequences.
4. The UniProt Metagenomic and Environmental Sequences database (UniMES) contains data from metagenomic projects such as the Global Ocean Sampling Expeditions.
UniProtKB includes cross-references from over 120 external databases, including Gene Ontology (GO), InterPro (protein families and domains), PRIDE (protein identification data), IntEnz (enzymes) (see Note 11), OMIM (the Online Mendelian Inheritance in Man database) (see Note 12), interaction databases (e.g. IntAct, DIP, MINT, see Note 22), Ensembl, several genomic databases from potential pathogens (e.g. EchoBase, Ecogene, LegioList, see Note 13), the European Hepatitis C Virus database (http://euhcvdb.ibcp.fr/euHCVdb) and others.
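Individual UniProtKB records can also be retrieved programmatically. The sketch below fetches one entry as FASTA from the current UniProt REST interface (https://rest.uniprot.org), which post-dates the services described in the text, so the URL pattern should be checked against the live documentation; the accession used is only an example.

```python
# Illustrative sketch: fetch a UniProtKB record in FASTA format.
# The REST URL pattern follows https://rest.uniprot.org and may evolve;
# P69905 (human haemoglobin subunit alpha) is just an example accession.
import requests

accession = "P69905"
url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"

response = requests.get(url, timeout=30)
response.raise_for_status()

header, *sequence_lines = response.text.splitlines()
sequence = "".join(sequence_lines)
print(header)
print("sequence length:", len(sequence))
```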

3.5.3.2. Mass Spectrometry Repositories

Several repositories have been established to store protein and peptide identifications derived from MS, the main method for the identification and quantification of proteins (28). There are two main repositories for MS data in proteomics:
● the Proteomics IDEntifications database (PRIDE, http://www.ebi.ac.uk/pride), and
● Peptidome (http://www.ncbi.nlm.nih.gov/peptidome),
together with a number of related resources such as PeptideAtlas (http://www.peptideatlas.org) and the Global Proteome Machine (http://www.thegpm.org/GPMDB), which take deposited raw data for reanalysis in their own pipelines.

These all serve as web-based portals for data mining, data visualisation, data sharing, and cross-validation resources in the field.



The Proteomics IDEntifications (PRIDE) database has been built to turn publicly available data, buried in numerous academic publications, into publicly accessible data. PRIDE is fully compliant with the standards released by the HUPO-PSI and also makes extensive use of CVs such as Taxonomy, the BRENDA Tissue Ontology and Gene Ontology; thus, direct access to PRIDE data organised by species, tissue, sub-cellular location, disease state and project name can be obtained via the "Browse Experiments" menu item. PRIDE remains the most complete database in terms of metadata associated with peptide identifications, since it contains numerous experimental details of the protocols followed by the submitters (28). The detailed metadata in PRIDE have enabled analyses of large datasets which have proven to yield very interesting information for the field (28). PRIDE uses Tranche (see Note 14) to allow the sharing of massive data files, currently including search engine output files and binary raw data from mass spectrometers, which can be accessed via a hyperlink from PRIDE. As a member of the Proteome Exchange consortium, PRIDE will make both the annotated metadata and the raw spectral data available, via Tranche, to related analytical pipelines such as PeptideAtlas (see Note 15) and The Global Proteome Machine (see Note 16).

3.5.3.3. Protein–Protein Interactions and Interactomics

Both the number of laboratories producing PPI data and the size of such experiments continue to increase, and a number of repositories exist to collect these data (see Note 22). Here we explore IntAct, a freely available, open source database system and set of analysis tools for molecular interaction data derived from the literature or from direct user submissions. IntAct follows a deep curation model, capturing a high level of detail from the full text of the experimental publication. Queries may be performed on the website, with the initial presentation of the data as a list of binary interaction evidences. Users can access the individual evidences that describe the interaction of two specific molecules, allowing users to filter result sets (e.g. by interaction detection method) to retain only user-defined evidences. For convenience, evidence pertaining to the same interactors is grouped together in the binary interaction evidence table. Downloads of any datasets are available in both PSI-MI XML and tab-delimited MITAB format, providing end users with the highest level of detail without compromising the integrity and simplicity of access to the data (40). IntAct is also involved in a major data exchange collaboration driven by the major public interaction data providers (listed at the end of this chapter): the International Molecular Exchange Consortium (IMEx, http://imex.sourceforge.net) partners share curation efforts and exchange completed records on molecular interaction data. Curation has been aligned to a common standard, as detailed in the curation manuals of the individual databases and summarised in the joint curation manual available at http://imex.sourceforge.net.



IMEx partner databases request the set of minimum information about a molecular interaction experiment (MIMIx) to be provided with each data deposition (41). The use of common data standards encourages the development of tools utilising this format. For example, Cytoscape (http://www.cytoscape.org) is an open source bioinformatics software platform for visualising molecular interaction networks and integrating these interactions with gene expression profiles and other state data, in which data from resources such as IntAct can be visualised and combined with other datasets. The value of the information obtained from comparing networks depends heavily on both the quality of the data used to assemble the networks themselves and the coverage of these networks (30, 42). The most comprehensive studies are in Saccharomyces cerevisiae; however, it should be noted that two comparable, "comprehensive" experiments, performed in parallel by two different groups using the same approach (tandem affinity purification technology), ended up with fewer than 30% of the interactions discovered by each group in common (43), suggesting that coverage is far from complete.
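The low overlap quoted above is easy to quantify for any pair of interaction datasets once each interaction is reduced to an unordered protein pair. The sketch below does this for two hypothetical tab-delimited edge lists; the file names are placeholders.

```python
# Illustrative sketch: overlap between two interaction datasets.
# The two file names are hypothetical; each file lists one interaction
# per line as "PROTEIN_A<TAB>PROTEIN_B".
def load_interactions(path):
    """Return interactions as a set of unordered protein pairs."""
    pairs = set()
    with open(path) as handle:
        for line in handle:
            a, b = line.rstrip("\n").split("\t")[:2]
            pairs.add(frozenset((a, b)))
    return pairs

study_1 = load_interactions("tap_study_1.tsv")
study_2 = load_interactions("tap_study_2.tsv")

shared = study_1 & study_2
print("shared interactions:", len(shared))
print("fraction of study 1 confirmed by study 2:", len(shared) / len(study_1))
print("Jaccard index:", len(shared) / len(study_1 | study_2))
```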

3.6. Integration of Omics Data

In the Omics field, several efforts have been made, and continue to be made, to create computational tools for integrating Omics data. These need to address three different aspects of the integration (44):
1. To identify the network scaffold by delineating the connections that exist between cellular components;
2. To decompose the network scaffold into its constituent parts in an attempt to understand the overall network structure;
3. To develop cellular or system models to simulate and predict the network behaviour that gives rise to particular cellular phenotypes.
As we have seen in the previous sections, there are significant challenges with modern post-genomics data sets:
1. Many technological platforms, both hardware and software, are available for several Omics data types, but some of these are prone to introducing technical artefacts;
2. Standardized data representations are not always adopted, which complicates cross-experiment comparisons;
3. Data quality, context and lab-to-lab variations represent another important hurdle that must be overcome in genome-scale science.
Obviously, the spread of Omics data across a wide variety of formats represents a challenge, adding technical hitches to integrating data and migrating across platforms.



One of the important techniques often used is XML, which provides a document markup language that is easier to learn, retrieve, store and transmit, and is semantically richer than HTML (45). Here we present three different infrastructures which have been used and which represent different ways of integrating Omics data: BioMart, Taverna and the BII infrastructure.

3.6.1. BioMart

BioMart is a query-oriented DBMS developed jointly by the Ontario Institute for Cancer Research and the EBI. BioMart (http://www.biomart.org) is particularly suited for providing "data mining"-like searches of complex descriptive data. It can be used with any type of data, as shown by some of the resources currently powered by BioMart: Ensembl, UniProt, InterPro, HGNC, Rat Genome Database, ArrayExpress DW, HapMap, GermOnLine, PRIDE, PepSeeker, VectorBase, HTGT and Reactome. BioMart comes with an "out of the box" website that can be installed, configured and customised according to user requirements. Further access is provided by graphical and text based applications, or programmatically using web services or APIs written in Perl and Java. BioMart has built-in support for query optimisation and data federation, and it can also be configured to work as a DAS 1.5 Annotation server. The process of converting a data source into BioMart format is fully automated by the tools included in the package. Currently supported RDBMS platforms are MySQL, Oracle and Postgres. BioMart is completely Open Source, licensed under the LGPL, and freely available to anyone without restrictions (46).
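BioMart's web service accepts an XML query document and streams back tab-separated rows, which is what the graphical interface builds behind the scenes. The sketch below sends such a query to the Ensembl BioMart endpoint; the endpoint URL, dataset, filter and attribute names are examples only and must be checked against the configuration of the mart actually being queried.

```python
# Illustrative sketch: querying a BioMart server through its web service.
# The endpoint, dataset, filter and attribute names are examples only and
# should be verified against the configuration of the mart being queried.
import requests

MARTSERVICE = "http://www.ensembl.org/biomart/martservice"

query_xml = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Filter name="chromosome_name" value="21"/>
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="external_gene_name"/>
  </Dataset>
</Query>"""

response = requests.get(MARTSERVICE, params={"query": query_xml}, timeout=60)
response.raise_for_status()

# The service streams back plain tab-separated rows.
for row in response.text.splitlines()[:5]:
    print(row)
```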

3.6.2. Taverna

The Taverna workbench (http://taverna.sourceforge.net) is a free software tool for designing and executing workflows, created by the myGrid project (http://www.mygrid.org.uk/tools/taverna) and funded through OMII-UK (http://www.omii.ac.uk). Taverna allows users to integrate many different software tools, including web services from many different domains. Bioinformatics services include those provided by the National Center for Biotechnology Information, the EBI, the DNA Databank of Japan, SoapLab, BioMOBY and EMBOSS (see Note 17). Effectively, Taverna allows a scientist with limited computational background and technical resource support to construct highly complex analyses over public and private data and computational resources, all from a standard PC, UNIX box or Apple computer. A successful example of using Taverna in Omics is the work of Li et al. (47), where the authors describe a workflow involving the statistical identification of differentially expressed genes from microarray data followed by the annotation of their relationships to cellular processes. They show that Taverna can be used by data analysis experts as a generic tool for composing ad hoc analyses of quantitative data by combining the use of scripts written in the R programming language with tools exposed as services in workflows (47).



3.6.3. BII Infrastructure

As we have seen, it is now possible to run complex multi-assay studies through a variety of Omics technologies, for example determining the effect of a compound on a number of subjects by characterising a metabolic profile (by mass spectrometry), measuring tissue-specific protein and gene expression (by mass spectrometry and DNA microarrays, respectively), and conducting conventional histological analysis. It is essential that such complex metadata (i.e. sample characteristics, study design, assay execution, sample-data relationships) are reported in a standard manner to correctly interpret the final results (data) that they contextualise. Relevant EBI systems, such as ArrayExpress, PRIDE and ENA-Reads (the European Nucleotide Archive (ENA) accepts data generated by NGS methodologies such as 454, Illumina and ABI SOLiD), are built to store microarray-based, proteomics and NGS-based experiments, respectively. However, these systems have different submission and download formats, and diverse representations of the metadata and terminologies used. Nothing yet exists to archive metabolomics-based assays and other conventional biomedical/environmental assays. The BioInv Index (BioInvestigation Index, http://www.ebi.ac.uk/net-project/projects.html) infrastructure (BII) aims to fill this gap. The BII infrastructure aims to be a single entry point for those researchers wishing to deposit their multi-assay studies and datasets, and/or to easily download similar datasets. This infrastructure allows the experimental metadata of biological, biomedical and environmental studies to be represented and stored in a common way. Although relying on other EBI production systems, the BII infrastructure shields users from their diverse formats and ontologies by progressively implementing integrative cross-domain "standards" such as MIBBI, OBO Foundry and ISA-TAB in the editor tool. A prototype instance is up and running at http://www.ebi.ac.uk/bioinvindex/home.seam.
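ISA-TAB, mentioned above, is a plain tab-delimited format in which each row of a study file describes one sample and its characteristics. The sketch below loads such a file with pandas and summarises it; the file name is a placeholder, and while column labels such as Sample Name, Characteristics[organism] and Protocol REF are genuine ISA-TAB conventions, the exact headings depend on the configuration used for a given study.

```python
# Illustrative sketch: summarising a tab-delimited ISA-TAB style study file.
# "s_study.txt" is a placeholder; real ISA-TAB files follow the column
# naming rules of the ISA-TAB specification and its configurations.
import pandas as pd

study = pd.read_csv("s_study.txt", sep="\t")

# Count samples per organism, assuming the usual Characteristics[...] column.
organism_col = "Characteristics[organism]"
if organism_col in study.columns:
    print(study[organism_col].value_counts())

# List the protocols applied to each sample, if recorded.
protocol_cols = [c for c in study.columns if c.startswith("Protocol REF")]
if "Sample Name" in study.columns:
    print(study[["Sample Name", *protocol_cols]].head())
```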

4. Notes

1. Wellcome Trust Sanger Sequencing Centre: The Sanger Institute is a genome research institute primarily funded by the Wellcome Trust. The Sanger uses large-scale sequencing, informatics and analysis of genetic variation to further improve our understanding of gene function in health and disease and to generate data and resources of lasting value to biomedical research, see http://www.sanger.ac.uk.



2. Metagenomics: The term indicates the study of metagenomes, genetic material recovered directly from environmental samples. It is also used generically for environmental genomics, ecogenomics or community genomics. Metagenomics data can be submitted and stored in appropriate databases (see http://www.ncbi.nlm.nih.gov/Genbank/metagenome.html and http://www.ebi.ac.uk/genomes/wgs.html).
3. Metatranscriptomics: This term refers to studies where microbial gene expression in the environment is assessed (e.g. by pyrosequencing) directly from natural microbial assemblages.
4. Epigenomics: Understanding the large numbers of variations in DNA methylation and chromatin modification by exploiting omics techniques. There are various recent efforts in this direction (e.g. http://www.heroic-ip.eu).
5. Studies of genome variation: Clear examples of the advances on this front come from the large-scale human variation databases which archive and provide access to experimental data resulting from HT genotyping and sequencing technologies. The European Genotype Archive (http://www.ebi.ac.uk/ega/page.php) provides dense genotype data associated with distinct individuals. Another relevant project on this front is ENCODE (http://www.genome.gov/10005107), the Encyclopedia Of DNA Elements, which aims to identify all functional elements in the human genome sequence.
6. Cycle-array sequencing methods (also known as NGS): Cycle-array methods generally involve multiple cycles of some enzymatic manipulation of an array of spatially separated oligonucleotide features. Each cycle only queries one or a few bases, but an enormous number of features are processed in parallel. Array features can be ordered or randomly dispersed.
7. Next generation expressed-sequence-tag sequencing: ESTs are small pieces of DNA sequence (200–500 nucleotides long) that are generated by sequencing of an expressed gene. Bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms are sequenced and used as "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. Characterising transcripts through sequences rather than hybridization to a chip has its advantages (e.g. the sequencing approach does not require knowledge of the genome sequence as a prerequisite, as the transcript sequences can be compared to the closest annotated reference sequence in the public database using standard computational tools).
8. The PICR service reconciles protein identifiers across multiple source databases (http://www.ebi.ac.uk/tools/picr).



9. InterPro/InterProScan: InterPro is a database of protein families, domains, regions, repeats and sites in which identifiable features found in known proteins can be applied to new protein sequences (http://www.ebi.ac.uk/interpro/index.html). InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool: InterProScan. InterProScan is a sequence search package that combines the individual search methods of the member databases and provides the results in a consistent format; the user can choose among text, raw, HTML or XML. The results display potential GO terms and the InterPro entry relationships where applicable (http://www.ebi.ac.uk/Tools/InterProScan).
10. NCBI RefSeq databases: The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa, see http://www.ncbi.nlm.nih.gov/RefSeq.
11. IntEnz: The Integrated relational Enzyme database is a freely available resource focused on enzyme nomenclature (http://www.ebi.ac.uk/intenz).
12. OMIM: the Online Mendelian Inheritance in Man database (http://www.ncbi.nlm.nih.gov/omim).
13. Genomic databases from potential pathogens: EchoBase is a database that curates new experimental and bioinformatic information about the genes and gene products of the model bacterium Escherichia coli K-12 strain MG1655; http://www.york.ac.uk/res/thomas. The Ecogene database contains updated information about the E. coli K-12 genome and proteome sequences, including extensive gene bibliographies; http://ecogene.org. LegioList is a database dedicated to the analysis of the genomes of Legionella pneumophila strain Paris (endemic in France), strain Lens (epidemic isolate), strain Philadelphia 1, and strain Corby; http://genolist.pasteur.fr/LegioList.
14. Tranche: Tranche is a free and open source file sharing tool that facilitates the storage of large amounts of data, see https://trancheproject.org.
15. PeptideAtlas: PeptideAtlas (http://www.peptideatlas.org) is a multi-organism, publicly accessible compendium of peptides identified in a large set of tandem mass spectrometry proteomics experiments.



16. The Global Proteome Machine: Open-source, freely available informatics system for the identification of proteins using tandem mass spectra of peptides derived from an enzymatic digest of a mixture of mature proteins; for more see http://www.thegpm.org.
17. EMBOSS: EMBOSS is "The European Molecular Biology Open Software Suite". It is a free, Open Source software analysis package especially designed for the needs of the molecular biology user community. EMBOSS automatically copes with data in a variety of formats and allows transparent retrieval of sequence data from the web, see http://emboss.sourceforge.net/what.
18. Selected projects, organisations and institutes relevant in Omics:
http://www.ebi.ac.uk
http://www.ncbi.nlm.nih.gov
http://www.bii.a-star.edu.sg
http://www.ibioinformatics.org
http://www.bioinformatics.org.nz
http://www.isb-sib.ch
http://www.igb.uci.edu
http://www.uhnres.utoronto.ca/centres/proteomics
http://www.humanvariomeproject.org
http://www.expasy.org/links.html
http://bioinfo.cipf.es
http://www.bcgsc.ca
http://www.blueprint.org
http://www.cmbi.kun.nl/edu/webtutorials
http://newscenter.cancer.gov/sciencebehind
http://www.genome.gov/Research
http://cmgm.stanford.edu
19. Genomics related resources:
Genomes Pages at the EBI: http://www.ebi.ac.uk/genomes
http://www.ensembl.org/index.html, http://www.ensemblgenomes.org
Caenorhabditis elegans (and some other nematodes): http://www.wormbase.org
Database for Drosophila melanogaster: http://flybase.org
Mouse Genome Informatics: http://www.informatics.jax.org
Rat Genome Database: http://rgd.mcw.edu



Saccharomyces Genome Database: http://www.yeastgenome.org
Pombe genome Project: http://www.sanger.ac.uk/Projects/S_pombe
AceDB genome database: http://www.acedb.org/introduction.shtml
HIV Sequence Database: http://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html
3-D structural information about nucleic acids: http://ndbserver.rutgers.edu
Gene Ontology: http://www.geneontology.org
Human mitochondrial genome database: http://www.mitomap.org
20. Transcriptomics related resources:
ArrayExpress: http://www.ebi.ac.uk/microarray-as/ae
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo
MGED Society: http://www.mged.org
miRBASE: http://www.mirbase.org
Comparative RNA: http://www.rna.ccbb.utexas.edu
Arabidopsis gene expression database: http://www.arexdb.org
Noncoding RNA database: http://www.ncrna.org/frnadb
Mammalian noncoding RNA database: http://jsm-research.imb.uq.edu.au/rnadb
Noncoding RNA databases: http://biobases.ibch.poznan.pl/ncRNA
Comprehensive Ribosomal RNA database: http://www.arb-silva.de
RNA modification pathways: http://modomics.genesilico.pl
RNA modification database: http://library.med.utah.edu/RNAmods
RNAi database: http://nematoda.bio.nyu.edu:8001/cgi-bin/index.cgi
Genomic tRNA database: http://gtrnadb.ucsc.edu
MicroCosm Targets: http://www.ebi.ac.uk/enright-srv/microcosm/htdocs/targets/v5
miRNA sequences: http://www.ebi.ac.uk/enright-srv/MapMi
microRNA binding and siRNA off-target effects: http://www.ebi.ac.uk/enright/sylamer
21. Proteomics related resources:
Protein sequences: http://www.uniprot.org
ExPASy Proteomics Service: http://www.expasy.org



Protein Information Resource: http://pir.georgetown.edu
Gene Ontology (GO) annotations to proteins: http://www.ebi.ac.uk/GOA/index.html
The Peptidase database: http://merops.sanger.ac.uk
Molecular Class-Specific Information System (MCSIS) project: http://www.gpcr.org
PROWL (mass spectrometry and gaseous ion chemistry): http://prowl.rockefeller.edu
Protein fingerprinting: http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
Protein families: http://pfam.sanger.ac.uk
Domain prediction: http://hydra.icgeb.trieste.it/~kristian/SBASE
Protein domain families: http://prodom.prabi.fr/prodom/current/html/home.php
Protein families, domains and regions: http://www.ebi.ac.uk/interpro/index.html
Simple Modular Architecture Research Tool: http://smart.embl-heidelberg.de
Integrated Protein Knowledgebase: http://pir.georgetown.edu/iproclass
TIGRFAMS: http://www.jcvi.org/cms/research/projects/tigrfams/overview
Protein Data Bank: http://www.rcsb.org/pdb/home/home.do
PRIDE: http://www.ebi.ac.uk/pride
Protein Data Bank in Europe: http://www.ebi.ac.uk/pdbe
Peptidome: http://www.ncbi.nlm.nih.gov/peptidome
PeptideAtlas: http://www.peptideatlas.org
The Global Proteome Machine: http://www.thegpm.org/GPMDB/index.html

22. Protein–protein interaction databases:
IntAct: http://www.ebi.ac.uk/intact/main.xhtml
IMEx: http://imex.sourceforge.net
DIP: http://dip.doe-mbi.ucla.edu
MINT: http://mint.bio.uniroma2.it/mint
MPact: http://mips.gsf.de/genre/proj/mpact
MatrixDB: http://matrixdb.ibcp.fr
MPIDB: http://www.jcvi.org/mpidb
BioGRID: http://www.thebiogrid.org



Acknowledgements

The authors would like to thank Dr. Gabriella Rustici and Dr. Daniel Zerbino for useful insights and information on transcriptomics and genome assembly, respectively. The authors would also like to thank Dr. James Watson for useful comments on the manuscript.

References

1. Knasmüller, S. et al. (2008) Use of conventional and -omics based methods for health claims of dietary antioxidants: A critical overview. Br J Nutr 99, ES3–52.
2. Hillier, L.W. et al. (2008) Whole-genome sequencing and variant discovery in C. elegans. Nat Methods 5, 183–88.
3. Johnson, D.S. et al. (2007) Genome-wide mapping of in vivo protein–DNA interactions. Science 316, 1441–42.
4. Mortazavi, A. et al. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621–8.
5. Rustici, G. et al. (2008) Data storage and analysis in ArrayExpress and Expression Profiler. Curr Protoc Bioinformatics 7, 7–13.
6. Whetzel, P.L. et al. (2006) The MGED Ontology: A resource for semantics-based description of microarray experiments. Bioinformatics 22, 866–73.
7. Burge, C., Birney, E., and Fickett, J. (2002) Top 10 future challenges for bioinformatics. Genome Technol 17, 1–3.
8. Havlak, P. et al. (2004) The Atlas genome assembly system. Genome Res 14, 721–32.
9. Batzoglou, S. et al. (2002) ARACHNE: A whole genome shotgun assembler. Genome Res 12, 177–89.
10. Myers, E.W. et al. (2000) A whole-genome assembly of Drosophila. Science 287, 2196–204.
11. Huang, X. et al. (2003) PCAP: A whole-genome assembly program. Genome Res 13, 2164–70.
12. Mullikin, J.C., and Ning, Z. (2003) The Phusion assembler. Genome Res 13, 81–90.
13. Pevzner, P.A., Tang, H., and Waterman, M.S. (2001) An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 98, 9748–53.
14. Hernandez, D. et al. (2008) De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Res 18, 802–9.
15. Idury, R., and Waterman, M. (1995) A new algorithm for DNA sequence assembly. J Comput Biol 2, 291–306.
16. Pevzner, P., and Tang, H. (2001) Fragment assembly with double-barrelled data. Bioinformatics 17, S225–33.
17. Chaisson, M.J., and Pevzner, P.A. (2008) Short read fragment assembly of bacterial genomes. Genome Res 18, 324–30.
18. Zerbino, D.R., and Birney, E. (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18, 821–9.
19. Ossowski, S. et al. (2008) Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res 18, 2024–33.
20. Farrer, R.A. et al. (2009) De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads. FEMS Microbiol Lett 1, 103–11.
21. Wakaguri, H. et al. (2008) DBTSS: Database of transcription start sites, progress report. Nucleic Acids Res 36, D97–101.
22. Chen, X. et al. (2009) High throughput genome-wide survey of small RNAs from the parasitic protists Giardia intestinalis and Trichomonas vaginalis. Genome Biol Evol 1, 165–75.
23. Butler, J. et al. (2008) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res 18, 810–20.
24. Chen, J., and Skiena, S. (2007) Assembly for double-ended short-read sequencing technologies. In 'Advances in Genome Sequencing Technology and Algorithms', edited by E. Mardis, S. Kim, and H. Tang. Artech House Publishers, Boston.
25. Simpson, J.T. et al. (2009) ABySS: A parallel assembler for short read sequence data. Genome Res 19, 1117–23.
26. Jackson, B.G., Schnable, P.S., and Aluru, S. (2009) Parallel short sequence assembly of transcriptomes. BMC Bioinformatics 10, S1–14.
27. Spudich, G., Fernandez-Suarez, X.M., and Birney, E. (2007) Genome browsing with Ensembl: A practical overview. Brief Funct Genomic Proteomic 6, 202–19.
28. Vizcaíno, J.A. et al. (2009) A guide to the Proteomics Identifications Database proteomics data repository. Proteomics 9, 4276–83.
29. Hunter, S. et al. (2009) InterPro: The integrative protein signature database. Nucleic Acids Res 37, 211–15.
30. Cesareni, G. et al. (2005) Comparative interactomics. FEBS Lett 579, 1828–33.
31. Brazma, A. et al. (2001) Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat Genet 29, 365–71.
32. Levy, S. et al. (2007) The diploid genome sequence of an individual human. PLoS Biol 5, 2113–44.
33. Wheeler, D.A. et al. (2008) The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–76.
34. Venter, J.C. et al. (2001) The sequence of the human genome. Science 291, 1304–51.
35. Spencer, C.C. et al. (2006) The influence of recombination on human genetic diversity. PLoS Genet 2, e148.
36. Brazma, A. et al. (2003) ArrayExpress – A public repository for microarray gene expression data at the EBI. Nucleic Acids Res 31, 68–71.
37. Edgar, R., Domrachev, M., and Lash, A.E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30, 207–10.
38. Ikeo, K. et al. (2003) CIBEX: Center for information biology gene expression database. C R Biol 326, 1079–82.
39. Parkinson, H. et al. (2009) ArrayExpress update – From an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 37, 868–72.
40. Aranda, B. et al. (2009) The IntAct molecular interaction database. Nucleic Acids Res 1–7, doi:10.1093/nar/gkp878.
41. Orchard, S. et al. (2007) The minimum information required for reporting a molecular interaction experiment (MIMIx). Nat Biotechnol 25, 894–8.
42. Kiemer, L., and Cesareni, G. (2007) Comparative interactomics: Comparing apples and pears? Trends Biotechnol 25, 448–54.
43. Kiemer, L. et al. (2007) WI-PHI: A weighted yeast interactome enriched for direct physical interactions. Proteomics 7, 932–43.
44. Joyce, A.R., and Palsson, B.Ø. (2006) The model organism as a system: Integrating 'omics' data sets. Nat Rev Mol Cell Biol 7, 198–210.
45. Akula, S.P. et al. (2009) Techniques for integrating -omics data. Bioinformation 3, 284–6.
46. Haider, S. et al. (2009) BioMart Central Portal – Unified access to biological data. Nucleic Acids Res 37, W23–27.
47. Li, P. et al. (2008) Performing statistical analyses on quantitative data in Taverna workflows: An example using R and maxdBrowse to identify differentially-expressed genes from microarray data. BMC Bioinformatics 9, 334.

Chapter 2

Data Standards for Omics Data: The Basis of Data Sharing and Reuse

Stephen A. Chervitz, Eric W. Deutsch, Dawn Field, Helen Parkinson, John Quackenbush, Phillipe Rocca-Serra, Susanna-Assunta Sansone, Christian J. Stoeckert Jr., Chris F. Taylor, Ronald Taylor, and Catherine A. Ball

Abstract

To facilitate sharing of Omics data, many groups of scientists have been working to establish the relevant data standards. The main components of data sharing standards are experiment description standards, data exchange standards, terminology standards, and experiment execution standards. Here we provide a survey of existing and emerging standards that are intended to assist the free and open exchange of large-format data.

Key words: Data sharing, Data exchange, Data standards, MGED, MIAME, Ontology, Data format, Microarray, Proteomics, Metabolomics

1. Introduction

The advent of genome sequencing efforts in the 1990s led to a dramatic change in the scale of biomedical experiments. With the comprehensive lists of genes and predicted gene products that resulted from genome sequences, researchers could design experiments that assayed every gene, every protein, or every predicted metabolite. When exploiting transformative Omics technologies such as microarrays, proteomics or high-throughput cell assays, a single experiment can generate very large amounts of raw data as well as summaries in the form of lists of sequences, genes, proteins, metabolites, or SNPs. Managing, analyzing, and sharing the large data sets from Omics experiments present challenges





because the standards and conventions developed for single-gene or single-protein studies do not accommodate the needs of Omics studies (1) (see Note 1). The development and applications of Omics technologies are evolving rapidly, and so is awareness of the need for, and value of, data-sharing standards in the life sciences community. Standards that become widely adopted can help scientists and data analysts better utilize, share, and archive the ever-growing mountain of Omics data sets. Also, such standards are essential for the application of Omics approaches in healthcare environments. This chapter provides an introduction to the major Omics data sharing standards initiatives in the domains of genomics, transcriptomics, proteomics, and metabolomics, and includes summaries of goals, example applications, and references for further information. New standards and organizations for standards development may well arise in the future that will augment or supersede the ones described here. Interested readers are invited to further explore the standards described in this chapter (as well as others not mentioned) and keep up with the latest developments by visiting the website http://biostandards.info.

1.1. Goals and Motivations for Standards in the Life Sciences

Standards within a scientific domain have the potential to provide uniformity and consistency in the data generated by different researchers, organizations, and technologies. They thereby facilitate more effective reuse, integration, and mining of those data by other researchers and third-party software applications, as well as enable easier collaboration between different groups. Standards-compliant data sets have increased value for scientists who must interpret and build on earlier efforts. And, of course, software analysis tools which – of necessity – require some sort of regularized data input are very often designed to process data that conform to public data formatting standards, when such are available for the domain of interest. Standard laboratory procedures and reference materials enable the creation of guidelines, systems benchmarks, and laboratory protocols for quality assessment and cross-platform comparisons of experimental results that are needed in order to deploy a technology within research, industrial, or clinical environments. The value of standards in the life sciences for improving the utility of data from high-throughput post-genomic experiments has been widely noted for some years (2–6). To understand how the conclusions of a study were obtained, not only do the underlying data need to be available, but also the details of how the data were generated need to be adequately described (i.e., samples, procedural methods, and data analysis). Depositing data in public repositories is necessary but not sufficient for this purpose. Several standard types of associated data are also needed. Reporting, or "minimum information," standards



are needed to ensure that submitted data are sufficient for clear interpretation and querying by other scientists. Standard data formats greatly reduce the amount of effort required to share and make use of data produced by different investigators. Standards for the terminology used to describe the study and how the data were generated enable not only improved understanding of a given set of experimental results but also improved ability to compare studies produced by different scientists and organizations. Standard physical reference materials as well as standard methods for data collection and analysis can also facilitate such comparisons as well as aid the development of reusable data quality metrics. Ideally, any standards effort would take into account the usability of the proposed standard. A standard that is not widely used is not really a standard, and the successful adoption of a standard by end-user scientists requires a reasonable cost-benefit ratio. The effort of producing a new standard (development cost) and, more importantly, the effort needed to learn how to use the standard or to generate standards-conforming data (end-user cost) has to be outweighed by gains in the ability to publish experimental results, the ability to use other published results to advance one's own work, and higher visibility bestowed on standards-compliant publications (7). Thus, a major focus of standards initiatives is minimizing end-user usability barriers, typically done by educational outreach via workshops and tutorials as well as fostering the development of software tools that help scientists utilize the standard in their investigations. There also must be a means for incorporating feedback from the target community both at the initiation of standard development and on a continuing basis so that the standard can adapt to user needs that can change over time. Dr. Brazma and colleagues (8) discuss some additional factors that contribute to the success of standards in systems biology and functional genomics.

1.2. History of Standards for Omics

The motivation for standards for Omics initially came from the parallel needs of the scientific journals, which wanted standards for data publication, and the needs of researchers, who recognized the value of comparing the large and complex data sets characteristic of Omics experiments. Such data sets, often with thousands of data points, required new data formats and publication guidelines. Scientists using DNA microarrays for genome-wide gene expression analysis were the first to respond to these needs. In 2001, the Microarray and Gene Expression Data (MGED) Society (http://www.mged.org) published the Minimum Information About a Microarray Experiment (MIAME) standard (9), a guideline for the minimum information required to describe a DNA microarray-based experiment. The MIAME guidelines specify the information required to describe such an



experiment so that another researcher in the same discipline could either reproduce the experiment or analyze the data de novo. Adoption of the MIAME guidelines was expedited when a number of journals and funding agencies required compliance with the standard as a precondition for publication. In parallel with MIAME, data modeling and XML-based exchange standards called Microarray Gene Expression Object Model (MAGE-OM) and Markup Language (MAGE-ML) (10), and a controlled vocabulary called the MGED Ontology (11), were created. These standards facilitated the creation and growth of a number of interoperable databases and public data repositories. Use of these standards also led to the establishment of open-source software projects for DNA microarray data analysis. Resources such as the ArrayExpress database (12–14) at the European Bioinformatics Institute (EBI), the Gene Expression Omnibus (GEO) (15–18) at the National Center for Biotechnology Information (NCBI), and others were advertised as "MIAME-compliant" and capable of importing data submitted in the MAGE-ML format (10). Minimum information guidelines akin to MIAME then arose within other Omics communities. For example, the Minimum Information about a Proteomics Experiment (MIAPE) guidelines for proteomics studies (19) have been developed. More recent initiatives have been directed towards technology-independent standards for reporting, modeling, and exchange that support work spanning multiple Omics technologies or domains, and towards harmonization of related standards. These projects have, of necessity, required extensive collaboration across disciplines. The resulting standards have gained in sophistication, benefiting from insights gained in the use and implementation of earlier standards, in the use of formalisms imposed by the need to make the data computationally tractable and logically coherent, and in the experience in engagement of multiple academic communities in the development of these prior standards. Increasingly, the drive for standards in Omics is shifting from the academic communities to include the biomedical and healthcare communities as well. As application of Omics technologies and data expands into the clinical and diagnostic arena, organizations such as the US Food and Drug Administration (FDA) and technology manufacturers are becoming more involved in a range of standards efforts; for example, the MicroArray Quality Control (MAQC) consortium brings together representatives of many such organizations (20). Quality control/assurance projects and reference standards that support comparability of data across different manufacturer platforms are of particular interest as Omics technologies mature and start to play an expanded role in healthcare settings.



2. Materials

Omics standards are typically scoped to a specific aspect of an Omics investigation. Generally speaking, a given standard will cover either the description of a completed experiment, or will target some aspect of performing the experiment or analyzing results. Standards are further stratified to handle more specific needs, such as reporting data for publication, providing data exchange formats, or defining standard terminologies. Such scoping reflects a pragmatic decoupling that permits different standards groups to develop complementary specifications concurrently and allows different initiatives to attract individuals with relevant expertise or interest in the target area (8). As a result of this arrangement, a standard or standardization effort within Omics can be generally characterized by its domain and scope. The domain reflects the type of experimental data (transcriptomics, proteomics, metabolomics, etc.), while the scope defines the area of applicability of the standard or the methodology being standardized (experiment reporting, data exchange, etc.). Tables 1 and 2 list the different domains and scopes, respectively, which characterize existing Omics standardization efforts (see Note 2).

Table 1 Domains of Omics standards. The domain indicates the type of experimental data that the standard is designed to handle

Domain – Description
Genomics – Genome sequence assembly, genetic variations, genomes and metagenomes, and DNA modifications
Transcriptomics – Gene expression (transcription), alternative splicing, and promoter activity
Proteomics – Protein identification, protein–protein interactions, protein abundance, and post-translational modifications
Metabolomics – Metabolite profiling, pathway flux, and pathway perturbation analysis
Healthcare and Toxicogenomics (a) – Clinical, diagnostic, or toxicological applications
Harmonization and Multiomics (a) – Cross-domain compatibility and interoperability

(a) Healthcare, toxicological, and harmonization standards may be applicable to one or more other domain areas. These domains impose additional requirements on top of the needs of the pure Omics domains



Table 2 Scope of Omics standards. Scope defines the area of applicability or methodology to which the standard pertains. Scope-General: standards can be generally partitioned based on whether they are to be used for describing or executing an experiment. Scope-Specific: the scope can be further narrowed to cover more specific aspects of the general scope

Scope-General: Experiment description
- Reporting (minimum information) – Documentation for publication or data deposition
- Data exchange & modeling – Communication between organizations and tools
- Terminology – Ontologies and CVs to describe experiments or data

Scope-General: Experiment execution
- Physical standards – Reference materials, spike-in controls
- Data analysis & quality metrics – Analyze, compare, QA/QC experimental results

CV controlled vocabulary, QA/QC quality assurance/quality control

The remainder of this section describes the different scopes of Omics standards, listing the major standards initiatives and organizations relevant to each scope. The next section then surveys the standards by domain, providing more in-depth description of the relevant standards, example applications, and references for further information.

2.1. Experiment Description Standards

Experiment description standards, also referred to generally as “data standards”, concern the development of guidelines, conventions, and methodologies for representing and communicating the raw and processed data generated by experiments as well as the metadata for describing how an experiment was carried out, including a description of all reagents, specimens, samples, equipment, protocols, controls, data transformations, software algorithms, and any other factors needed to accurately communicate, interpret, reproduce, or analyze the experimental results. Omics studies and the data they generate are complex. The diversity of scientific problems, experimental designs, and technology platforms creates a challenging landscape of data for any descriptive standardization effort. Even within a given domain and technology type, it is not practical for a single specification to encompass all aspects of describing an experiment. Certain aspects are more effectively handled separately; for example, a description



of the essential elements to be reported for an experiment is independent of the specific data format in which that information should be encoded for import or export by software applications. In recognition of this, experiment description standardization efforts within the Omics community are further scoped into more specialized areas that address the distinct data handling requirements encountered during different aspects of an Omics study and for the different types of data it generates. Thus we have:
● Reporting.
● Data exchange & modeling.
● Terminology.

These different areas serve complementary roles and together provide a complete package for describing an Omics experiment within a given domain or technology platform. For example, a data exchange/modeling standard will typically have elements to satisfy the needs of a reporting standard, with a set of allowable values for those elements to be provided by an associated standard controlled vocabulary/terminology.

2.1.1. Reporting Standards: Minimum Information Guidelines

The scope of a reporting standard pertains to how a researcher should record the information required to unambiguously communicate experimental designs, treatments and analyses, to contextualize the data generated, and underpin the conclusions drawn. Such standards are also known as data content or minimum information standards because they usually have an acronym beginning with “MI” standing for “minimum information” (e.g. MIAME). The motivation behind reporting standards is to enable an experiment to be interpreted by other scientists and (in principle) to be independently reproduced. Such standards provide guidance to investigators when preparing to report or publish their investigation or archive their data in a repository of experimental results. When an experiment is submitted to a journal for publication, compliance with a reporting standard can be valuable to reviewers, aiding them in their assessment of whether an experiment has been adequately described and is thus worthy of approval for publication. A reporting specification does not normally mandate a particular format in which to capture/transport information, but simply delineates the data and metadata that their originating community considers appropriate to sufficiently describe how a particular investigation was carried out. Although a reporting standard does not have a specific data formatting requirement, the often explicit expectation is that the data should be provided using a technology-appropriate standard format where feasible, and that controlled vocabulary or ontology terms should be used in descriptions where feasible. Data repositories may impose such a requirement as a condition for data submission.


Omics experiments, in addition to their novelty, can be quite complex in their execution, analysis, and reporting. Minimum information guidelines help in this regard by providing a consistent framework within which scientists can think about and report the essential aspects of their experiments, with the ultimate aim of ensuring that the results are useful to scientists who want to understand or reproduce the study. Such guidelines also ease compliance with a related data exchange standard, which is often designed to support the requirements of a reporting standard (discussed below). Depending on the nature of a particular investigation, information beyond what is specified by a reporting standard may be provided as desired by the authors of the study or as deemed necessary by its reviewers. Table 3 lists the major reporting standards for different Omics domains. The MIBBI project (discussed later in this chapter) catalogues these and many other reporting standards and provides a useful introduction (21).

Table 3
Existing reporting standards for Omics

Acronym | Full name | Domain | Organization
CIMR | Core Information for Metabolomics Reporting | Metabolomics | MSI
MIAME | Minimum Information about a Microarray Experiment | Transcriptomics | MGED
MIAPE | Minimum Information about a Proteomics Experiment | Proteomics | HUPO-PSI
MIGS-MIMS | Minimum Information about a Genome/Metagenome Sequence | Genomics | GSC
MIMIx | Minimum Information about a Molecular Interaction eXperiment | Proteomics | HUPO-PSI
MINIMESS | Minimal Metagenome Sequence Analysis Standard | Metagenomics | GSC
MINSEQE | Minimum Information about a high-throughput Nucleotide Sequencing Experiment | Genomics, Transcriptomics (UHTS) | MGED
MISFISHIE | Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments | Transcriptomics | MGED

Acronyms and definitions of the major reporting standards efforts are shown, indicating their target domain and the maintaining organization, which are as follows: MGED, MGED Society, http://mged.org; GSC, Genomic Standards Consortium, http://gensc.org; HUPO-PSI, Human Proteome Organization Proteomics Standards Initiative, http://www.psidev.info; MSI, Metabolomics Standards Initiative, http://msi-workgroups.sourceforge.net


For some publishers, compliance with a reporting standard is increasingly becoming an important criterion for accepting or rejecting a submitted Omics manuscript (22). The journals Nature, Cell, and The Lancet have led the way in enforcing compliance for DNA microarray experiments by requiring submitted manuscripts to demonstrate compliance with the MIAME guidelines as a condition of publication. Currently, most journals that publish such experiments have adopted some form of this policy. Furthermore, publishers such as BioMed Central endorse, or are moving to endorse, the MIBBI project (described below) as a portal to the diverse set of available guidelines for the biosciences.

2.1.2. Data Exchange and Modeling Standards

The scope of a data exchange standard is the definition of an encoding format for use in sharing data between researchers and organizations, and for exchanging data between software programs or information storage systems. A data exchange standard delineates what data types can be encoded and the particular way they should be encoded (e.g., tab-delimited columns, XML, binary, etc.), but does not specify what the document should contain in order to be considered complete. There is an expectation that the content will be constructed in accordance with a community-approved reporting standard, and the data exchange standard itself is typically designed so that users can construct documents that are compliant with a particular reporting standard (e.g., MAGE-ML and MAGE-TAB contain placeholders that are designed to hold the data needed for the production of MIAME-compliant documents). A data exchange standard is often designed to work in conjunction with a data modeling standard, which defines the attributes and behaviors of key entities and concepts (objects) that occur within an Omics data set. The model is intended to capture the exchange format-encoded data for the purpose of storage or downstream data mining by software applications. The data model itself is designed to be independent of any particular software implementation (database schema, XML file, etc.) or programming language (Java, C++, Perl, etc.). The implementation decisions are thus left to the application programmer, to be made using the most appropriate technology or technologies for the target user base. This separation of the model (or "platform-independent model") and the implementation (or "platform-specific implementation") was first defined by the Object Management Group's Model Driven Architecture (http://www.omg.org/mda) and offers a design methodology that holds promise for building software systems that are more interoperable and adaptable to technological change. Such extensibility has been recognized as an essential feature of data models for Omics experiments (23). Data exchange and modeling standards are listed in Table 4.
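The model/implementation separation described above can be illustrated with a small, purely hypothetical Python sketch (the class and attribute names below are invented for illustration and are not taken from MAGE, FuGE, or any other specification): a single in-memory object is serialized once as XML and once as a tab-delimited row, two "platform-specific" encodings of the same platform-independent description.

import csv
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Hybridization:
    # Illustrative platform-independent object; names are invented for this sketch.
    sample_id: str
    array_design: str
    protocol: str

    def to_xml(self) -> str:
        # One possible platform-specific encoding: an XML element.
        elem = ET.Element("Hybridization", sample=self.sample_id,
                          arrayDesign=self.array_design, protocol=self.protocol)
        return ET.tostring(elem, encoding="unicode")

    def to_tab_row(self) -> str:
        # Another encoding of the same object: a tab-delimited row.
        buf = io.StringIO()
        csv.writer(buf, delimiter="\t").writerow(
            [self.sample_id, self.array_design, self.protocol])
        return buf.getvalue().rstrip("\n")

h = Hybridization("sample-01", "ArrayDesign-X", "labeling-protocol-3")
print(h.to_xml())
print(h.to_tab_row())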


Table 4
A sampling of data exchange and modeling standards for Omics

Data format | Object model | Full name | Domain | Organization
FuGE-ML | FuGE-OM | Functional Genomics Experiment Markup Language/Object Model | Multiomics | FuGE
ISA-TAB | – | Investigation Study Assay – Tabular | Multiomics | RSBI
MAGE-ML | MAGE-OM | MicroArray and Gene Expression Markup Language | Transcriptomics | MGED
MAGE-TAB | – | MicroArray and Gene Expression Tabular Format | Transcriptomics | MGED
MIF (PSI-MI XML) | – | Molecular Interactions Format | Proteomics | HUPO-PSI
mzML | – | Mass Spectrometry Markup Language | Proteomics | HUPO-PSI
mzIdentML | – | Mass Spectrometry Identifications Markup Language | Proteomics | HUPO-PSI
PML | PaGE-OM | Polymorphism Markup Language/Phenotype and Genotype Object Model | Genomics | GEN2PHEN
SDTM | – | Study Data Tabulation Model | Healthcare | CDISC

Acronyms and names of some of the major data exchange standards efforts are shown, indicating their target domain and the maintaining organization, which are as described in the legend to Table 3 with the following additions: RSBI, Reporting Structure for Biological Investigations, http://www.mged.org/Workgroups/rsb; FuGE, Functional Genomics Experiment, http://fuge.sourceforge.net; GEN2PHEN, Genotype to phenotype databases, http://www.gen2phen.org; CDISC, Clinical Data Interchange Standards Consortium, http://www.cdisc.org. Additional proteomics exchange standards are described on the HUPO-PSI website, http://www.psidev.info

2.1.3. Terminology Standards

The scope of a terminology standard is typically defined by the use cases it is intended to support and the competency questions it is designed to answer. An example of a use case is annotating the data generated in an investigation with regard to materials, procedures, and results, while associated competency questions would include those used in data mining (for example, "find all cancer studies done using Affymetrix microarrays"). Terminology standards generally provide controlled vocabularies and some degree of organization. Ontologies have become popular as mechanisms to encode terminology standards because they provide definitions for terms in the controlled vocabulary as well as properties of and relationships between terms. The Gene Ontology (24) is one such ontology, created to address the use case of providing consistent annotation of gene products across different species and enabling questions such as "return all kinases". The primary goal of a terminology standard is to promote consistent use of terms within a community and thereby facilitate knowledge integration by enabling better querying and data mining within and across data repositories as well as across domain areas. Use of standard terminologies by scientists working in different Omics domains can enable interrelation of experimental results from diverse data sets (see Note 3). For example, annotating results with standard terminologies could help correlate the expression profile of a particular gene, assayed in a transcriptomics experiment, to its protein modification state, assayed in a separate proteomics experiment. Using a suitably annotated metabolomics experiment, the gene/protein results could then be linked to the activity of the pathway(s) in which they operate, or to a disease state documented in a patient's sample record.

Consistent use of a standard terminology such as GO has enabled research advances. Data integration is possible across diverse data sets as long as they are annotated using terms from GO. Association of particular types of gene products with results from investigations is also made possible by the effort of the GO Consortium to consistently annotate gene products with GO terms. Numerous tools that support such analyses are listed at the Gene Ontology site, http://www.geneontology.org/GO.tools.microarray.shtml.

There is already quite a proliferation of terminologies in the life sciences. Key to their success is adoption by scientists, bioinformaticians, and software developers for use in the annotation of Omics data. However, the proliferation of ontologies that are not interoperable can be a barrier to integration (25) (see Note 4). The OBO Foundry targets this area and is delineating best practices for the construction of terminologies, maximizing their internal integrity, extensibility, and reuse. Easy access to standard terminologies is important and is being addressed through sites such as the NCBO BioPortal (http://bioportal.bioontology.org) and the EBI Ontology Lookup Service (http://www.ebi.ac.uk/ontology-lookup). These web sites allow browsing and downloading of ontologies. They also provide programmatic access through web services, which is important for integration with software tools and web sites that want to make use of them.
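As a concrete illustration of working with a standard terminology programmatically, the following minimal Python sketch (written for this chapter rather than taken from any official tool) reads [Term] stanzas from an ontology distributed in the OBO 1.2 flat-file format, such as the Gene Ontology; the file name is an assumption, and only the common id, name, and is_a tags are handled.

def parse_obo_terms(path):
    # Minimal parser for [Term] stanzas of an OBO 1.2 flat file.
    terms, current = {}, None
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line == "[Term]":
                current = {"is_a": []}
            elif not line and current is not None:
                if "id" in current:
                    terms[current["id"]] = current
                current = None
            elif current is not None and ":" in line:
                tag, value = line.split(":", 1)
                value = value.strip()
                if tag == "id":
                    current["id"] = value
                elif tag == "name":
                    current["name"] = value
                elif tag == "is_a":
                    # Keep only the parent accession; drop the trailing "! comment".
                    current["is_a"].append(value.split("!")[0].strip())
    if current is not None and "id" in current:
        terms[current["id"]] = current
    return terms

terms = parse_obo_terms("go-basic.obo")   # file name is an assumption
print(len(terms), "terms loaded")

A dictionary keyed by accession like this is already enough to answer simple competency questions (for example, walking is_a links to collect the descendants of a term), although real applications would normally rely on dedicated ontology libraries or the web services mentioned above.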

Table 5
Terminology standards

Acronym | Full name | Domain | Organization
EVS | Enterprise Vocabulary Services | Healthcare | NCI
GO | Gene Ontology | Multiomics | GOC
MS | Proteomics Standards Initiative Mass Spectrometry controlled vocabulary | Proteomics | HUPO-PSI
MO | MGED Ontology | Transcriptomics | MGED
OBI | Ontology for Biomedical Investigations | Multiomics | OBI
OBO | Open Biomedical Ontologies | Multiomics | NCBO
PSI-MI | Proteomics Standards Initiative Molecular Interactions ontology | Proteomics | HUPO-PSI
sepCV | Sample processing and separations controlled vocabulary | Proteomics | HUPO-PSI
SO | Sequence Ontology | Multiomics | GOC

Acronyms and names of some of the major terminology standards in use with Omics data are shown, indicating their target domain and the maintaining organization, which are as described in the legends to Tables 3 and 4 with the following additions: GOC, Gene Ontology Consortium, http://geneontology.org/GO.consortiumlist.shtml; NCI, National Cancer Institute, http://www.cancer.gov; NCBO, National Center for Biomedical Ontology, http://bioontology.org; OBI, Ontology for Biomedical Investigations, http://purl.obofoundry.org/obo/obi

Terms in ontologies are organized into classes and typically placed in a hierarchy. Classes represent types of entities for which there can be different instances. Terms can be given accession numbers so that they can be tracked, and can be assigned details such as who is responsible for the term and what the source of the definition was. If the ontology is based on a knowledge representation language such as OWL (Web Ontology Language, http://www.w3.org/TR/owl-ref), then restrictions on the usage of a term can be encoded; for example, one can require associations between terms (e.g., the inputs and outputs of a process). Building an ontology is usually done with a tool such as Protégé (http://protege.stanford.edu) or OBO-Edit (http://oboedit.org). These tools are also useful for navigating ontologies. Table 5 lists some of the ontologies and controlled vocabularies relevant to Omics. For a complete listing and description of these and related ontologies, see the OBO Foundry website (http://www.obofoundry.org).

2.2. Experiment Execution Standards

2.2.1. Physical Standards

The scope of a physical standard pertains to the development of standard reagents for use as spike-in controls in assays. A physical standard serves as a stable reference point that can facilitate the quantification of experimental results and the comparison of results between different runs, investigators, organizations, or technology platforms.


Table 6
Organizations involved in the creation of physical standards relevant to Omics experiments

Acronym | Full name | Domain | Website
ERCC | External RNA Control Consortium | Transcriptomics | http://www.cstl.nist.gov/biotech/Cell&TissueMeasurements/GeneExpression/ERCC.htm
LGC | Laboratory of the Government Chemist | Transcriptomics, Proteomics | http://www.lgc.co.uk
NIST | National Institute of Standards and Technology | Transcriptomics | http://www.cstl.nist.gov/biotech/Cell&TissueMeasurements/Main_Page.htm
NMS | National Measurement System (NMS) Chemical and Biological Metrology | Multiomics | http://www.nmschembio.org.uk
ATCC | American Type Culture Collection Standards Development Organization | Healthcare | http://www.atcc.org/Standards/ATCCStandardsDevelopmentOrganizationSDO/tabid/233/Default.aspx

Physical standards are essential for quality metrics purposes and are especially important for applications of Omics technologies in regulated environments such as clinical or diagnostic settings. In the early days of DNA microarray-based gene expression experiments, results from different investigators, laboratories, or array technologies were notoriously hard to compare despite the use of reporting and data exchange standards (26). The advent of physical standards and improved metrology promises to increase the accuracy of cross-platform and cross-investigator comparisons of experimental results. Such improvements are necessary for the adoption of Omics technologies in clinical and diagnostic applications within the regulated healthcare industry. Examples of physical standards are provided in Table 6.

2.2.2. Data Analysis and Quality Metrics

The scope of a data analysis or quality metrics standard is the delineation of best practices for algorithmic and statistical approaches to processing experimental results, as well as methods to assess and assure data quality. Methodologies for data analysis cover the following areas:

● Data transformation (normalization) protocols.
● Background or noise correction.
● Clustering.
● Hypothesis testing.
● Statistical data modeling.

Analysis procedures have historically been developed in a tool-specific manner by commercial vendors, and users of these tools would rely on the manufacturer for guidance. Yet efforts to define more general guidelines and protocols for data analysis best practices are emerging. Driving some of these efforts is the need for consistent approaches to measure data quality, which is critical for determining one's confidence in the results from any given experiment and for judging the comparability of results obtained under different conditions (days, laboratories, equipment operators, manufacturing batches, etc.). Data quality metrics rely on data analysis standards as well as the application of physical standards.
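To make the first bullet above concrete, here is a minimal sketch (in Python with numpy) of one widely used data transformation, quantile normalization of a genes-by-samples expression matrix; the method and the toy values are illustrative only and are not prescribed by any of the standards discussed in this chapter.

import numpy as np

def quantile_normalize(matrix):
    # Quantile-normalize the columns (samples) of a genes x samples matrix.
    ranks = np.argsort(np.argsort(matrix, axis=0), axis=0)  # rank of each value within its column
    reference = np.sort(matrix, axis=0).mean(axis=1)        # mean value at each rank across samples
    return reference[ranks]                                 # map ranks back to the reference values

expression = np.array([[5.0, 4.0, 3.0],
                       [2.0, 1.0, 4.0],
                       [3.0, 4.0, 6.0],
                       [4.0, 2.0, 8.0]])
print(quantile_normalize(expression))

Reporting which transformation was applied, and with which parameters, is exactly the kind of metadata that the reporting and terminology standards described above are meant to capture.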

Table 7
Data analysis and quality metrics projects

Acronym | Full name | Domain | Organization
arrayQualityMetrics | Quality assessment software package | Transcriptomics | BioConductor
CAMDA | Critical Assessment of Microarray Data Analysis | Transcriptomics | n/a
CAMSI | Critical Assessment of Mass Spectrometry Informatics | Proteomics | n/a
iPRG | Proteome Informatics Research Group | Proteomics | ABRF
MAQC | Microarray Quality Control Project | Transcriptomics | FDA
NTO | Normalization and Transformation Ontology | Transcriptomics | EMERALD

BioConductor's arrayQualityMetrics: http://bioconductor.org/packages/2.3/bioc/html/arrayQualityMetrics.html. CAMDA is managed by a local organizing committee at different annual venues: http://camda.bioinfo.cipf.es. EMERALD's NTO: http://www.microarray-quality.org/ontology_work.html. MAQC is described in Subheading 3.2.6


Collecting or assessing data quality using quality metrics is facilitated by data that conform to widely adopted reporting standards and are available in common data exchange formats. A number of data analysis and quality metrics efforts are listed in Table 7.

3. Methods

Here we review some of the more prominent standards and initiatives within the main Omics domains: genomics, transcriptomics, proteomics, and metabolomics. Of these, transcriptomics is the most mature in terms of standards development and community adoption, though proteomics is a close second.

3.1. Genomic Standards

Genomic sequence data is used in a variety of applications such as genome assembly, comparative genomics, DNA variation assessment (SNP genotype and copy number), epigenomics (DNA methylation analysis), and metagenomics (DNA sequencing of environmental samples for organism identification). Continued progress in the development of high-throughput sequencing technology has led to an explosion of new genome sequence data and new applications of this technology. A number of efforts are underway to standardize the way scientists describe and exchange these data, in order to facilitate the integration of data contributed by different laboratories using different sequencing technologies.

3.1.1. MIGS-MIMS

MIGS-MIMS stands for Minimum Information About a Genome Sequence/Minimum Information about a Metagenomic Sequence (http://gensc.org). MIGS is a minimum information checklist aimed at standardizing the description of a genomic sequence, such as the complete assembly of a bacterial or eukaryotic genome. It is intended to extend the core information that has traditionally been captured by the major nucleotide sequence repositories (GenBank, EMBL, and DDBJ) in order to accommodate the additional requirements of scientists working with genome sequencing project data. MIGS is maintained by the Genomic Standards Consortium (GSC, http://gensc.org), which has also developed an extension of MIGS for supporting metagenomic data sets, called MIMS (Minimum Information about a Metagenomic Sequence/Sample). MIMS allows for additional metadata particular to a metagenomics experiment, such as details about environmental sampling.


GCDML (Genomic Contextual Data Markup Language) is under development by the GSC to provide a MIGS/MIMS-compliant format for exchanging data from genomic and metagenomic experiments.

3.1.2. SAM Tools

The SAM format is an emerging data exchange format for efficiently representing large sequence alignments, driven by the explosion of data from high-throughput sequencing projects, such as the 1,000 Genomes Project (27). It is designed to be simple, compact, and to accommodate data from different alignment programs. The SAM Tools open source project provides utilities for manipulating alignments in the SAM format, including sorting, merging, indexing, and generating alignments (http://samtools.sourceforge.net).
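Because the SAM format fixes eleven mandatory tab-separated fields per alignment line, records can be unpacked even without the SAM Tools utilities; the short Python sketch below does exactly that (field names follow the SAM specification, and the example line itself is fabricated for illustration).

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
              "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    # Split one SAM alignment line into its 11 mandatory fields plus optional tags.
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = columns[11:]          # optional TAG:TYPE:VALUE fields
    return record

example = "read1\t0\tchr1\t11000\t37\t4M\t*\t0\t0\tACGT\tIIII"
print(parse_sam_line(example)["CIGAR"])    # prints: 4M

In practice, large alignment sets are stored in the compressed, indexed BAM form and manipulated with SAM Tools or equivalent libraries rather than parsed by hand.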

3.1.3. PML and PaGE-OM

The Polymorphism Markup Language (PML, http://www.openpml.org) was approved as an XML-based data format for the exchange of genetic polymorphism data (e.g., SNPs) in June 2005. It was designed to facilitate data exchange among different data repositories and researchers who produce or consume this type of data. The Phenotype and Genotype Experiment Object Model (PaGE-OM) is an updated, broader version of the PML standard that provides a richer object model and incorporates phenotypic information. It was approved as a standard by the OMG in March 2008. PaGE-OM defines a generic, platform-independent representation for entities such as alleles, genotypes, phenotype values, and relationships between these entities, with the goal of enabling the capture of the minimum amount of information required to properly report most genetic experiments involving genotype and/or phenotype information (28). Further refinements of the PaGE-OM object model, harmonization with object models from other domains, and generation of exchange formats are underway at the time of writing. PaGE-OM is maintained by JBIC (http://www.pageom.org) in partnership with the Gen2Phen project (http://www.gen2phen.org).

3.2. Transcriptomics Standards

This section describes the organizations and standards related to technologies that measure transcription, gene expression, or its regulation on a genomic scale. Transcriptomics standards pertain to the following technologies or types of investigation:

● Gene expression via DNA microarrays or ultra high-throughput sequencing.
● Tiling.
● Promoter binding (ChIP-chip, ChIP-seq).
● In situ hybridization studies of gene expression.

3.2.1. MIAME

The goal of MIAME (Minimum Information About a Microarray Experiment, http://www.mged.org/Workgroups/MIAME/miame.html) is to permit the unambiguous interpretation, reproduction, and verification of the results of a microarray experiment. MIAME was the original reporting standard and inspired similar "minimum information" requirements specifications in other Omics domains (9). MIAME defines the following six elements as essential for achieving these goals:

1. The raw data from each hybridization.
2. The final processed data for the set of hybridizations in the experiment.
3. The essential sample annotation, including experimental factors and their values.
4. The experiment design, including sample-data relationships.
5. Sufficient annotation of the array design.
6. Essential experimental and data processing protocols.

The MIAME standard has proven useful for microarray data repositories, which have used it both as a guideline for data submitters and as a basis for judging the completeness of data submissions. The ArrayExpress database provides a service to publishers of microarray studies wherein ArrayExpress curators will assess a dataset on the basis of how well it satisfies the MIAME requirements (29). A publisher can then choose whether to accept or reject a manuscript on the basis of the assessment. ArrayExpress judges the following aspects of a report to be the most critical toward MIAME compliance:

1. Sufficient information about the array design (e.g., reporter sequences for oligonucleotide arrays or database accession numbers for cDNA arrays).
2. Raw data as obtained from the image analysis software (e.g., CEL files for Affymetrix technology, or GPR files for GenePix).
3. Processed data for the set of hybridizations.
4. Essential sample annotation, including experimental factors (variables) and their values (e.g., the compound and dose in a dose response experiment).
5. Essential experimental and data processing protocols.

Several publishers now have policies in place that require MIAME compliance as a precondition for publication.
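A repository-side completeness check against such a checklist can be reduced, at its simplest, to verifying that every required element is present in a submission. The Python sketch below is a schematic illustration only: the element keys are assumptions invented for this example, not the fields used by ArrayExpress or any other repository.

MIAME_ELEMENTS = [
    "raw_data", "processed_data", "sample_annotation",
    "experiment_design", "array_design_annotation", "protocols",
]

def missing_miame_elements(submission):
    # Return the checklist elements absent from (or empty in) a submission dictionary.
    return [e for e in MIAME_ELEMENTS if not submission.get(e)]

submission = {"raw_data": ["hyb1.CEL"], "processed_data": ["matrix.txt"],
              "sample_annotation": {"dose": "10 mM"}, "experiment_design": "dose response"}
print(missing_miame_elements(submission))   # prints: ['array_design_annotation', 'protocols']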

3.2.2. MINSEQE

The Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (MINSEQE, http://www.mged.org/minseqe) provides a reporting guideline, akin to MIAME, that is applicable to high-throughput nucleotide sequencing experiments used to assay biological state. It does not pertain to traditional sequencing projects, where the aim is to assemble a chromosomal sequence or resequence a given genomic region, but rather to applications of sequencing in areas such as transcriptomics, where high-throughput sequencing is used to compare the populations of sequences between samples derived from different biological states, for example, sequencing cDNAs to assess differential gene expression. Here, sequencing provides a means to assay the sequence composition of different biological samples, analogous to the way that DNA microarrays have traditionally been used. MINSEQE is now supported by both the Gene Expression Omnibus (GEO) and ArrayExpress. ArrayExpress and GEO have entered into a metadata exchange agreement, meaning that UHTS sequence experiments will appear in both databases regardless of where they were submitted. This complements the exchange of underlying raw data between the NCBI and EBI short read archives, SRA and ERA.

3.2.3. MAGE

The MAGE project (MicroArray Gene Expression, http://www.mged.org/Workgroups/MAGE/mage.html) aims to provide a standard for the representation of microarray gene expression data to facilitate the creation of software tools for exchanging microarray information between different users and data repositories. The MAGE family of standards does not have direct support for capturing the results of higher-level analysis (e.g., clustering of expression data from a microarray experiment). MAGE includes the following sub-projects:

● MAGE-OM: MAGE Object Model
● MAGE-ML: MAGE Markup Language
● MAGEstk: MAGE Software Toolkit
● MAGE-TAB: MAGE Tabular Format

MAGE-OM is a platform-independent model for representing gene expression microarray data. Using the MAGE-OM model, the MGED Society has implemented MAGE-ML (an XML-based format) as well as MAGE-TAB (a tab-delimited values format). Both formats can be used for annotating and communicating data from microarray gene expression experiments in a MIAME-compliant fashion. MAGE-TAB evolved out of a need to create a simpler version of MAGE-ML. MAGE-TAB is easier to use and thus more accessible to a wider cross-section of the microarray-based gene expression community, which has struggled with the often large, structured XML-based MAGE-ML documents. A limitation of MAGE-TAB is that only single values are permitted for certain data slots that may in practice be multivalued. Data that cannot be adequately represented by MAGE-TAB can be described using MAGE-ML, which is quite flexible. MAGEstk is a collection of Open Source packages that implement the MAGE Object Model in various programming languages (10). The toolkit is meant for bioinformatics users who develop their own applications and need to integrate functionality for managing an instance of MAGE-OM. The toolkit facilitates easy reading and writing of MAGE-ML to and from the MAGE-OM, and all MAGE objects have methods to maintain and update the MAGE-OM at all levels. In essence, MAGEstk is the glue between a software application and the standard way of representing DNA microarray data in MAGE-OM as a MAGE-ML file.

3.2.4. MAGE-TAB

MAGE-TAB (30) is a simple tab-delimited format that is used to represent gene expression and other high-throughput data, such as high-throughput sequencing (see Note 5). It is the main submission format for the ArrayExpress database at the European Bioinformatics Institute and is supported by the BioConductor package ArrayExpress. There are also converters to MAGE-TAB from the GEO SOFT format and from MAGE-ML, as well as an open source template generation system (31). The MGED community maintains a complete list of applications using MAGE-TAB at http://www.mged.org/mage-tab (see Note 6).
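To give a flavor of the format, the sketch below writes a minimal SDRF-like (sample and data relationship) table with Python's csv module; the column headings shown are typical of MAGE-TAB SDRF files but are abbreviated and partly assumed here, so the official MAGE-TAB specification should be consulted for the authoritative layout.

import csv

# A few typical SDRF-style column headings (abbreviated; see the MAGE-TAB specification).
HEADER = ["Source Name", "Characteristics[organism]",
          "Protocol REF", "Assay Name", "Array Data File"]

rows = [
    ["sample 1", "Homo sapiens", "P-LABEL-1", "hyb 1", "hyb1.CEL"],
    ["sample 2", "Homo sapiens", "P-LABEL-1", "hyb 2", "hyb2.CEL"],
]

with open("example.sdrf.txt", "w", newline="") as handle:     # file name is an assumption
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(HEADER)
    writer.writerows(rows)

Because the result is plain tab-delimited text, the same file can be opened and edited in a spreadsheet program, which is a large part of MAGE-TAB's appeal over MAGE-ML.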

3.2.5. MO

The MGED Ontology (MO, http://mged.sourceforge.net/ontologies/index.php) provides standard terms for describing the different components of a microarray experiment (11). MO is complementary to the other MGED standards, MIAME and MAGE, which respectively specify what information should be provided and how that information should be structured; the specification of the terminology used for labeling that information has been left to MO. MO is an ontology with defined classes, instances, and relations. A primary motivation for the creation of MO was to provide terms wherever needed in the MAGE Object Model, which has led to MO being organized along the same lines as the MAGE-OM packages. A feature of MO is that it provides pointers to other resources as appropriate to describe a sample, biomaterial characteristics, or treatment compounds used in the experiment (e.g., NCBI Taxonomy, ChEBI), rather than importing, mapping, or duplicating those terms. A major revision of MO (currently at version 1.3.1.1, released in February 2007) was planned to address structural issues; however, such plans have recently been superseded by efforts aimed at incorporating MO into the Ontology for Biomedical Investigations (OBI).

The primary usage of MO has been for the annotation of microarray experiments. MO terms have been incorporated into a number of microarray databases (e.g., ArrayExpress, the RNA Abundance Database (RAD) (32), caArray (http://caarray.nci.nih.gov), the Stanford Microarray Database (SMD) (33–38), maxD (http://www.bioinf.manchester.ac.uk/microarray/maxd), and MiMiR (39)), enabling consistent retrieval of studies across these different sites. MO terms have also been used as part of column headers for MAGE-TAB (30), a tab-delimited form of MAGE. Example terms from MO v.1.3.1.1:

● BioMaterialPackage (MO_182): Description of the source of the nucleic acid used to generate labeled material for the microarray experiment (an abstract class taken from MAGE to organize MO).
● BioMaterialCharacteristics (MO_5): Properties of the biomaterial before treatment in any manner for the purposes of the experiment (a subclass of BioMaterialPackage).
● CellTypeDatabase (MO_141): Database of cell type information (a subclass of Database).
● eVOC (MO_684): Ontology of human terms that describe the sample source of human cDNA and SAGE libraries (an instance of CellTypeDatabase).

3.2.6. MAQC

The MAQC project (MicroArray Quality Control project, http://www.fda.gov/nctr/science/centers/toxicoinformatics/maqc) aims to develop best practices for executing microarray experiments and analyzing results in a manner that maximizes consistency between different vendor platforms. The effort is spearheaded by the U.S. Food and Drug Administration (FDA) and has participants spanning the microarray industry. The work of the MAQC project is providing guidance for the development of quality measures and procedures that will facilitate the reliable use of microarray technology within clinical practice and regulatory decision-making, thereby helping realize the promises of personalized medicine (40). The project consists of two phases:

1. MAQC-I demonstrated the technical performance of microarray platforms in the identification of differentially expressed genes (20).
2. MAQC-II is aimed at reaching consensus on best practices for developing and validating predictive models based on microarray data. This phase of the project includes genotyping data as well as gene expression data, which was the focus of MAQC-I. MAQC-II is currently in progress, with results expected soon (http://www.fda.gov/nctr/science/centers/toxicoinformatics/maqc).


3.2.7. ERCC

The External RNA Control Consortium (ERCC, http://www.cstl.nist.gov/biotech/Cell&TissueMeasurements/GeneExpression/ERCC.htm) aims to create well-characterized and tested RNA spike-in controls for gene expression assays. The consortium has worked with the U.S. National Institute of Standards and Technology (NIST) to create certified reference materials useful for evaluating sample and system performance. Such materials facilitate standardized data comparisons among commercial and custom microarray gene expression platforms, as well as with alternative expression profiling methods such as qRT-PCR. The ERCC originated in 2003 and has grown to include more than 90 organizations spanning a cross-section of industry and academic groups from around the world. The controls developed by this group have been based on contributions from member organizations and have undergone rigorous evaluation to ensure efficacy across different expression platforms.
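One routine use of such spike-in controls is to check that measured signal tracks the known input amounts; the Python sketch below fits a log-log regression of intensity against concentration, with all numbers fabricated for illustration (a slope near 1 would indicate an approximately linear dose response, but no specific acceptance threshold is implied here).

import numpy as np

# Fabricated spike-in concentrations (arbitrary units) and measured signal intensities.
concentration = np.array([0.5, 2.0, 8.0, 32.0, 128.0])
intensity = np.array([40.0, 150.0, 640.0, 2500.0, 9800.0])

# Fit log2(intensity) against log2(concentration); the slope summarizes the dose response.
slope, intercept = np.polyfit(np.log2(concentration), np.log2(intensity), 1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")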

3.3. Proteomic Standards

This section describes the standards and organizations related to technologies that measure protein-related phenomena on a genomic scale.

3.3.1. HUPO PSI

The primary digital communications standards organization in this domain is the Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI, http://www.psidev.info), which provides an official process for drafting, reviewing, and accepting proteomics-related standards (41). As with other standardization efforts, the PSI creates and promotes both minimum information standards, which define what metadata about a study should be provided, and data exchange standards, which define the standardized, computer-readable format for conveying the information. Within the PSI are six working groups, which define standards in subdomains representing different components in typical workflows or different types of investigations:

● Sample processing
● Gel electrophoresis
● Mass spectrometry
● Proteomics informatics
● Protein modifications
● Molecular interactions

3.3.2. MIAPE

MIAPE (Minimum Information About a Proteomics Experiment, http://www.psidev.info/index.php?q=node/91) is a reporting standard for proteomics experiments, analogous to the use of MIAME for gene expression experiments. The main MIAPE publication (19) describes the overall goals and organization of the MIAPE specifications. Each subdomain (e.g., sample processing, column chromatography, mass spectrometry, etc.) has been given a separate MIAPE module that describes the information needed for each component of the study being presented. The PSI has actively engaged the journal editors to refine the MIAPE modules to a level that the editors are willing to enforce.

3.3.3. Proteomics Mass Spectrometry Data Exchange Formats

Since 2003, several data formats for encoding data related to proteomics mass spectrometry experiments have emerged. Some early XML-based formats originating from the Institute for Systems Biology, such as mzXML (42) and pepXML/protXML (43), were widely adopted and became de facto standards. More recently, the PSI has built on these formats to develop official standards such as mzML (44) for mass spectrometer output, GelML for gel electrophoresis, and mzIdentML for the results of bioinformatics analyses of such data, among others. See Deutsch et al. (45) for a review of some of these formats. These newer PSI formats are accompanied by controlled vocabularies, semantic validator software, example instance documents, and in some cases fully open-source software libraries to enable swift adoption of these standards.
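Because mzML is XML in which most metadata are carried by cvParam elements referencing PSI controlled-vocabulary terms, even a generic XML parser can extract useful information; the Python sketch below streams a file, counts spectra, and collects cvParam names without assuming a particular namespace prefix (the file name is an assumption, and real pipelines would normally use a dedicated mzML library instead).

import xml.etree.ElementTree as ET

def summarize_mzml(path):
    # Stream an mzML file, counting spectrum elements and collecting cvParam names.
    n_spectra, cv_names = 0, set()
    for _, elem in ET.iterparse(path):
        tag = elem.tag.rsplit("}", 1)[-1]      # strip the XML namespace, if present
        if tag == "cvParam":
            cv_names.add(elem.attrib.get("name", ""))
        elif tag == "spectrum":
            n_spectra += 1
            elem.clear()                       # release memory for large files
    return n_spectra, cv_names

spectra, names = summarize_mzml("example.mzML")   # file name is an assumption
print(spectra, "spectra;", len(names), "distinct cvParam names")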

3.3.4. MIMIx

The PSI Molecular Interactions (MI) Working Group (http://www.psidev.info/index.php?q=node/277) has developed and approved several standards to facilitate the sharing of molecular interaction information. MIMIx (Minimum Information about a Molecular Interaction Experiment) (46) is the minimum information standard that defines what information must be present in a compliant list of molecular interactions. The PSI-MI XML (or MIF) standard is an XML-based data exchange format for encoding the results of molecular interaction experiments. A major component of the format is a controlled vocabulary (PSI-MI CV) that ensures the terms used to describe and annotate interactions are applied consistently by all documents and software. In addition to the XML format, a simpler tab-delimited data exchange format, MITAB2.5, has been developed. It supports a subset of the PSI-MI XML functionality and can be edited easily using widely available spreadsheet software (47).

3.4. Metabolomics Standards

This section describes the standards and organizations related to metabolomics, the comprehensive, genome-scale profiling of low molecular weight metabolites within a biological sample. Metabolomic standards initiatives are not as mature as those in the transcriptomic and proteomic domains, though there is growing community interest in this area. (Note that no distinction is made in this text between metabolomics and metabonomics; we use "metabolomics" to refer to both types of investigations, insofar as a distinction exists.)


Metabolomic standards pertain to the following technologies or types of investigations:

● Metabolic profiling of all compounds in a specific pathway
● Biochemical network modeling
● Biochemical network perturbation analysis (environmental, genetic)
● Network flux analysis

The metabolomics research community is engaged in the development of a variety of standards, coordinated by the Metabolomics Standards Initiative (48, 49). Under development are "minimum information" reporting standards (48, 50), data exchange formats (51), data models (52–54), and standard ontologies (55). A number of specific experiment description-related projects for metabolomics are described below.

3.4.1. CIMR

CIMR (Core Information for Metabolomics Reporting, http://msi-workgroups.sourceforge.net) is in development as a minimal information guideline for reporting metabolomics experiments. It is expected to cover all metabolomics application areas and analysis technologies. The MSI is also involved in collaborative efforts to develop ontologies and data exchange formats for metabolomics experiments.

3.4.2. MeMo and ArMet

MeMo (Metabolic Modelling, http://dbkgroup.org/memo) defines a data model and XML-based data exchange format for metabolomic studies in yeast (54). ArMet (Architecture for Metabolomics, http://www.armet.org) defines a data model for plant metabolomics experiments and also provides guidance for data collection (52, 56).

3.5. Healthcare Standards

The health care community has a long history of using standards to drive data exchange and submission to regulatory agencies. Within this setting, it is vital to ensure that data from assays pass quality assessments and can be transferred without loss of meaning and in a format that can be easily used by common tools. The drive to translate Omics approaches from a research to a clinical setting has provided strong motivation for the development of physical standards and guidelines for their use in this setting. Omics technologies hold much promise to improve our understanding of the molecular basis of diseases and to develop improved diagnostics and therapeutics tailored to individual patients (6, 57). Looking forward, the health care community is now engaged in numerous efforts to define standards important for clinical, diagnostic, and toxicological applications of data from high-throughput genomics technologies. The types and amount of data from a clinical trial or toxicogenomics study are quite extensive, incorporating data from multiple Omics domains. Standards development for electronic submission of these data is still ongoing, with best practices yet to emerge. While it is likely that high-throughput data will be summarized prior to transmission, it is anticipated that the raw files should be available for analysis if requested by regulators and other scientists. Standards-related activities pertaining to the use of Omics technologies within a health care setting can be roughly divided into three main focus areas: experiment description standards, reference materials, and laboratory procedures.

3.5.1. Healthcare Experiment Description Standards

Orthogonal to the experiment description standards efforts in the basic research and technical communities, clinicians and biologists have identified the need to describe the characteristics of an organism or specimen under study in a way that is understandable to clinicians as well as scientists. Under development within these biomedical communities are reporting standards to codify what data should be captured, and data exchange formats to permit reuse of the data by others. As with the other minimum information standards, the goal is to create a common way to describe the characteristics of the objects of a study and to identify the essential characteristics to include when publishing the study. Parallel work is underway in the arena of toxicogenomics (21, 58). Additionally, standard terminologies in the form of thesauri or controlled vocabularies, along with systematic annotation methods, are also under development. It is envisioned that clinically relevant standards (some of which are summarized in Table 8) will be used in conjunction with the experiment description standards being developed by the basic research communities that study the same biological objects and organisms. For example, ISA-TAB (described below) is intended to complement existing biomedical formats such as the Study Data Tabulation Model (SDTM), an FDA-endorsed data model created by CDISC to organize, structure, and format both clinical and nonclinical (toxicological) data submissions to regulatory authorities (http://www.cdisc.org/models/sds/v3.1/index.html). It is inevitable that some information will be duplicated between the two frameworks, but this is not generally seen as a major problem. Links between related components of ISA-TAB and SDTM could be created using properties of the subject source, for example.

3.5.2. Reference Materials

Developing industry-respected standard reference materials, such as a reagent for use as a positive or negative control in an assay, is essential for any work in a clinical or diagnostic setting. Reference materials are physical standards (see above) that provide an objective way to evaluate the performance of laboratory equipment, protocols, and sample integrity. The lack of suitable reference materials and guidelines for their use has been a major factor in slowing the adoption of Omics technologies such as DNA microarrays within clinical and diagnostic settings (6).


Table 8
Summary of healthcare experiment description standards initiatives

Acronym | Full name | Description | Scope | Website
BIRN | Biomedical Informatics Research Network | Collaborative informatics resources for medical/clinical data | Data analysis; terminology | http://www.nbirn.net
CDISC | Clinical Data Interchange Standards Consortium | Regulatory submissions of clinical data | Data exchange & modeling | http://www.cdisc.org
CONSORT | Consolidated Standards of Reporting Trials | Minimum requirements for reporting randomized clinical trials | Reporting | http://www.consort-statement.org
EVS | Enterprise Vocabulary Services | Controlled vocabulary by the National Cancer Institute in support of cancer research | Terminology | http://www.cancer.gov/cancertopics/terminologyresources
HL7 | Health Level 7 | Programmatic data exchange for healthcare applications | Data exchange | http://www.hl7.org
SEND | Standards for Exchange of Preclinical Data | Regulatory submissions of preclinical data; based on CDISC | Data exchange & modeling | http://www.cdisc.org/standards
ToxML | Toxicology XML | Toxicology data exchange; based on controlled vocabulary | Data exchange; terminology | http://www.leadscope.com/toxml.php

The ERCC (described above) and the LGC (http://www.lgc.co.uk) are the key organizations working on the development of standard reference materials, currently targeting transcriptomics experiments.

3.5.3. Laboratory Procedures

Standard protocols providing guidance in the application of reference materials, experiment design, and data analysis best practices are essential for performing high-throughput Omics procedures in clinical or diagnostic applications.


Table 9
CLSI documents most relevant to functional genomics technologies

Document | Description | Status
MM12-A | Diagnostic Nucleic Acid Microarrays | Approved guideline
MM14-A | Proficiency Testing (External Quality Assessment) for Molecular Methods | Approved guideline
MM16-A | Use of External RNA Controls in Gene Expression Assays | Approved guideline
MM17-A | Verification and Validation of Multiplex Nucleic Acid Assays | Approved guideline

The Clinical and Laboratory Standards Institute (CLSI, http://www.clsi.org) provides an infrastructure for ratifying and publishing guidelines for clinical laboratories. Working with organizations such as the ERCC (described above), it has produced a number of documents (see Table 9) applicable to the use of multiplex, whole-genome technologies such as gene expression and genotyping within a clinical or diagnostic setting.

3.6. Challenges for Omics Standards in Basic Research

A major challenge facing Omics standards is proving their value to a significant fraction of the user base and facilitating widespread adoption. Given the relative youth of the field of Omics and of the standardization efforts for such work, the main selling point for use of a standard has been that it will benefit future scientists and application/database developers, with limited added value for the users who are being asked to comply with the standard at publication time. Regardless of how well designed a standard is, if complying with it is perceived as difficult or complicated, widespread adoption is unlikely to occur. Some degree of enforcement of compliance by publishers and data repositories will most likely be required to inculcate the standard and build a critical mass within the targeted scientific community that then sustains its adoption. Significant progress has been achieved here: for DNA microarray gene expression studies, for example, most journals now require MIAME compliance and there is a broad recognition of the value of this standard within the target community. Here are some of the "pressure points" any standard will experience from its community of intended users:

● Domain experts, who want to ensure the comprehensiveness of the standard
● End-user scientists, who want the standard to be easy to comply with
● Software developers, who want tools for encoding and decoding standards-compliant data
● Standards architects, who want to ensure the formal correctness of the standard

Satisfying all of these interests is not an easy task. One complication is that the various interested groups may not be equally involved in the development of the standard. Balancing these different priorities and groups is the task of the group responsible for maintaining a standard, and it is an ongoing process that must remain responsive to user feedback. The MAGE-TAB data exchange format in the DNA microarray community provides a case in point: it was created largely in response to users who found MAGE-ML difficult to work with.

3.7. Challenges for Omics Standards in Healthcare Settings

The handling of clinical data adds challenges on top of the intrinsic complexities of Omics data. Investigators must respect regulations imposed by regulatory authorities. For example, the Health Insurance Portability and Accountability Act (HIPAA) mandates the de-identification of patient data to protect an individual's privacy. Standards and information systems used by the healthcare community therefore must be formulated to deal with such regulations (e.g., (59)). While the use of open standards poses risks of releasing protected health information, the removal of detailed patient metadata about samples can present barriers to research (60, 61). Enabling effective research while maintaining patient privacy remains an ongoing issue (Joe White, Dana-Farber Cancer Institute, personal communication).
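As a purely schematic illustration of the de-identification step mentioned above, the Python sketch below copies a tab-delimited sample annotation file while dropping columns treated as direct identifiers; the column and file names are assumptions, and real HIPAA de-identification (e.g., under the Safe Harbor provision) involves far more than removing a few columns.

import csv

# Columns assumed, for this sketch only, to contain direct identifiers.
IDENTIFYING_COLUMNS = {"patient_name", "date_of_birth", "medical_record_number"}

def deidentify(in_path, out_path):
    # Copy a tab-delimited annotation file, dropping the identifying columns.
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        kept = [c for c in reader.fieldnames if c not in IDENTIFYING_COLUMNS]
        writer = csv.DictWriter(dst, fieldnames=kept, delimiter="\t")
        writer.writeheader()
        for row in reader:
            writer.writerow({c: row[c] for c in kept})

deidentify("samples.txt", "samples_deidentified.txt")   # file names are assumptions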

3.8. Upcoming Trends: Standards Harmonization

The field of Omics is not suffering from a lack of interest in standards development, as the number of different standards discussed in this chapter attests. Such a complex landscape can have adverse effects on data sharing, integration, and systems interoperability – the very things that the standards are intended to help (62). To address this, a number of projects in the research and biomedical communities are engaged in harmonization activities that focus on integrating standards with related or complementary scope and aim to enhance interoperability in the reporting and analysis of data generated by different technologies or within different Omics domains. Some standards facilitate harmonization by having a sufficiently general-purpose design such that they can accommodate data from experiments in different domains. Such "multiomics" standards typically have a mechanism that allows them to be extended as needed in order to incorporate aspects specific to a particular application area. The use of these domain- and technology-neutral frameworks is anticipated to improve the interoperability of data analysis tools that need to handle data from different types of Omics experiments, as well as to reduce wheel reinvention by different standards groups with similar needs.


Table 10
Existing Omics standards harmonization projects and initiatives

Acronym | Full name | Scope | Organization
FuGE-ML/FuGE-OM | Functional Genomics Experiment Markup Language/Object Model | Data exchange & modeling | FuGE
ISA-TAB | Investigation Study Assay Tabular Format | Data exchange | RSBI, GSC, MSI, HUPO-PSI
HITSP | Healthcare Information Technology Standards Panel | (various) | ANSI
MIBBI | Minimum Information for Biological and Biomedical Investigations | Reporting | MIBBI
OBI | Ontologies for Biomedical Investigations | Terminology | OBI
OBO | Open Biomedical Ontologies | Terminology | NCBO
P3G | Public Population Project in Genomics | (various) | International Consortium

P3G covers harmonization between genomic biobanks and longitudinal population genomic studies, including technical, social, and ethical issues: http://www.p3gconsortium.org. The other projects noted in this table are described further in the chapter.

Harmonization and multiomics projects are collaborative efforts, involving participants from different domain-specific standards-developing organizations with shared interests. Indeed, the success of these efforts depends on continued broad-based community involvement. In Table 10, we describe the major current efforts in multiomics support and harmonization.

3.8.1. FuGE

The FuGE project (Functional Genomics Experiment, http://fuge.sourceforge.net) aims to build generic components that capture common facets of different Omics domains (63). Its contributors come from different standards efforts, primarily MGED and HUPO-PSI, reflecting the desire to build components that provide properties and functionalities common across different Omics technologies and application areas.


The vision of this effort is that, using FuGE-based components, a software developer will be better able to create and modify tools for handling Omics data without having to reinvent the wheel for common tasks in potentially incompatible ways. Further, tools based on such shared components are expected to be more interoperable. FuGE has several sub-projects, including the FuGE Object Model (FuGE-OM) and the FuGE Markup Language (FuGE-ML), a data exchange format. Technology-specific aspects can be added by extending the generic FuGE components, building on the common functionalities. For example, a microarray-specific framework equivalent to MAGE could be derived by extending FuGE, deriving microarray-specific objects from the FuGE object model.

3.8.2. HITSP

The Healthcare Information Technology Standards Panel (HITSP) is a public-private sector partnership of standards developers, healthcare providers, government representatives, consumers, and vendors in the healthcare industry. It is administered by the American National Standards Institute (ANSI, http://www.ansi.org) to harmonize healthcare-related standards and improve interoperability of healthcare software systems. It produces recommendations and reports contributing to the development of a Nationwide Health Information Network for the United States (NHIN, http://www.hhs.gov/healthit/healthnetwork/background). The HITSP is driven by use cases issued by the American Health Information Community (AHIC, http://www.hhs.gov/healthit/community/background). A number of use cases have been defined on a range of topics, such as personalized healthcare, newborn screening, and consumer adverse event reporting (http://www.hhs.gov/healthit/usecases).

3.8.3. ISA-TAB

The ISA-TAB format (Investigation Study Assay Tabular format, http://isatab.sourceforge.net) is a general-purpose framework with which to communicate both data and metadata from experiments involving a combination of functional technologies (64). ISA-TAB therefore has broader applicability and a more extended structure compared to a domain-specific data exchange format such as MAGE-TAB. An example where ISA-TAB might be applied would be an experiment looking at changes both in (1) the metabolite profile of urine and (2) gene expression in the liver in subjects treated with a compound inducing liver damage, using mass spectrometry and DNA microarray technologies, respectively. The ISA-TAB format is the backbone of the ISA Infrastructure, a set of tools that support the capture of multiomics experiment descriptions. It also serves as a submission format to compatible databases such as the BioInvestigation Index project at the EBI (http://www.ebi.ac.uk/bioinvindex).


It allows users to create a common structured representation of the metadata required to interpret an experiment for the purpose of combined submission to experimental data repositories such as ArrayExpress, PRIDE, and an upcoming metabolomics repository (64). Additional motivation comes from a group of collaborative systems, part of the MGED RSBI group (65), each of which is committed to pipelining Omics-based experimental data into EBI public repositories, willing to exchange data among themselves, or aiming to enable their users to import data from public repositories into their local systems.

ISA-TAB has a number of additional features that make it a more general framework that can comfortably accommodate multidomain experimental designs. ISA-TAB builds on the MAGE-TAB paradigm and shares its motivation for the use of tab-delimited text files, i.e., that they can easily be created, viewed, and edited by researchers using spreadsheet software such as Microsoft Excel. ISA-TAB also employs MAGE-TAB syntax as far as possible, to ensure backward compatibility with existing MAGE-TAB files. It was also important to align the concepts in ISA-TAB with some of the objects in the FuGE model. The ISA-TAB format could be seen as competing with XML-based formats such as FuGE-ML. However, ISA-TAB addresses the immediate need for a framework to communicate multiomics experiments, whereas all existing FuGE-based modules are still under development. When these become available, ISA-TAB could continue serving those with minimal bioinformatics support, as well as finding utility as a user-friendly presentation layer for XML-based formats (via an XSL transformation), i.e., in the manner of the HTML rendering of MAGE-ML documents. Initial work has been carried out to evaluate the feasibility of rendering FuGE-ML files (and FuGE-based extensions, such as GelML and Flow-ML) in the ISA-TAB format. Examples are available at the ISA-TAB website under the document section, along with a report detailing the issues faced during these transformations. When finalized, the XSL templates will also be released, along with XPath expressions and a table mapping FuGE objects and ISA-TAB labels. Additional ISA-TAB-formatted examples are available, including a MIGS-MIMS-compliant dataset (see http://isatab.sourceforge.net/examples.html).

The decision on how to regulate the use of ISA-TAB (marking certain fields as mandatory or enforcing the use of controlled terminology) is a matter for those who implement the format in their system. Although certain fields would benefit from the use of controlled terminology, ISA-TAB files with all fields left empty are syntactically valid, as are those where all fields are filled with free-text values rather than controlled vocabulary or ontology terms.
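In practice, an ISA-TAB submission is a directory of tab-delimited files whose names conventionally begin with i_ (investigation), s_ (study), and a_ (assay). The Python sketch below simply inventories such a directory and counts rows per file; the directory name is an assumption, and the official ISA tools should be used for any real validation.

import csv
from pathlib import Path

def inventory_isatab(directory):
    # List investigation, study, and assay files in an ISA-TAB directory and count their rows.
    summary = {}
    for prefix, kind in (("i_", "investigation"), ("s_", "study"), ("a_", "assay")):
        for path in sorted(Path(directory).glob(prefix + "*.txt")):
            with open(path, newline="", encoding="utf-8") as handle:
                n_rows = sum(1 for _ in csv.reader(handle, delimiter="\t"))
            summary[path.name] = (kind, n_rows)
    return summary

for name, (kind, rows) in inventory_isatab("my_isatab_study").items():   # directory name is an assumption
    print(f"{name}: {kind} file, {rows} rows")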


3.8.4. MIBBI

Experiments in different Omics domains typically share some reporting requirements (for example, specifying the source of a biological specimen). The MIBBI project (Minimum Information for Biological and Biomedical Investigations, http://mibbi.org; developers: http://mibbi.sourceforge.net) aims to work collaboratively with different groups to harmonize and modularize their minimum information checklists (e.g., MIAME, MIGS-MIMS, etc.), refactoring the common requirements to make it possible to use these checklists in combination (21). Additionally, the MIBBI project provides a comprehensive web portal offering registration of, and access to, the different minimum information checklists for different types of Omics (and other) experiments.

3.8.5. OBI

An excellent description of the OBI project comes from its home web page: The Ontology for Biomedical Investigations (OBI, http://purl.obofoundry.org/obo/obi) project is developing an integrated ontology for the description of biological and medical experiments and investigations. This includes a set of "universal" terms that are applicable across various biological and technological domains, and domain-specific terms relevant only to a given domain. This ontology will support the consistent annotation of biomedical investigations, regardless of the particular field of study. The ontology will model the design of an investigation, the protocols and instrumentation used, the material used, the data generated, and the type of analysis performed on it. This project was formerly called the Functional Genomics Investigation Ontology (FuGO) project (66). OBI is a collaborative effort of many communities representing particular research domains and technological platforms (http://obi-ontology.org/page/Consortium). OBI is meant to serve very practical needs rather than be an academic exercise; thus, it is very much driven by use cases and validation questions. The OBI user community provides valuable feedback about the utility of OBI and acts as a source of terms and use cases. As a member of the OBO Foundry (described below), OBI has made a commitment to be interoperable with other biomedical ontologies. Each term in OBI has a set of annotation properties, some of which are mandatory (minimal metadata defined at http://obi-ontology.org/page/OBI_Minimal_metadata). These include the term's preferred name, definition source, editor, and curation status.

3.8.6. OBO Consortium and the NCBO

The OBO Consortium (Open Biomedical Ontologies Consortium, http://www.obofoundry.org), a voluntary, collaborative effort among different OBO developers, has developed the OBO Foundry as a way to avoid the proliferation of incompatible ontologies in the biomedical domain (25). The OBO Foundry provides validation and assessment of ontologies to ensure interoperability. It also defines principles and best practices for ontology construction, such as the Basic Formal Ontology, which serves as a root-level ontology from which other domain-specific ontologies can be built, and the Relations Ontology, which defines a common set of relationship types (67). Incorporation of such elements within OBO is intended to facilitate interoperability between ontologies (i.e., for one OBO Foundry ontology to be able to import components of other ontologies without conflict) and the construction of "accurate representations of biological reality." The NCBO (National Center for Biomedical Ontology, http://bioontology.org) supports the OBO Consortium by providing tools and resources to help manage the ontologies and to help the scientific community access, query, visualize, and use them to annotate experimental data (68). The NCBO's BioPortal website provides searches across multiple ontologies and contains a large library of these ontologies spanning many species and many scales, from molecules to whole organisms. The ontology content comes from the model organism communities, biology, chemistry, anatomy, radiology, and medicine. Together, the OBO Consortium and the NCBO are helping to construct a consistent arsenal of ontologies to promote their application in annotating Omics and other biological experiments. This is the sort of community-based ontology building that holds great potential to help the life science community convert the complex and daunting Omics data sets into new discoveries that expand our knowledge and improve human health.

3.9. Concluding on the Need for Standards

A key motivation behind Omics standards is to foster data sharing, reuse, and integration, with the ultimate goal of producing new biological insights (within basic research environments) and better medical treatments (within healthcare environments). Widely adopted minimum information guidelines for publication, and formats for data exchange, are leading to increased and better reporting of results, greater submission of experimental data into public repositories, and more effective data mining of large Omics data sets. Standards harmonization efforts are in progress to improve data integration and the interoperability of software within both basic research and healthcare environments. Standard reference materials, and protocols for their use, are also under active development and hold much promise for improving data quality and systems benchmarking, and for facilitating the use of Omics technologies within clinical and diagnostic settings. High-throughput Omics experiments, with their large and complex data sets, have posed many challenges to the creation and adoption of standards. However, in recent years, the standards initiatives in this field have risen to the challenge and continue to engage their respective communities to improve the fit of the standards to user and market needs.

Omics communities have recognized that standards-compliant software tools can go a long way towards enhancing the adoption and usefulness of a standard by enabling ease of use. For data exchange standards, such tools can "hide the technical complexities of the standard and facilitate manipulation of the standard format in an easy way" (8). Some tools can themselves become part of standard practice when they are widely used throughout a community. Efforts are underway within organizations such as MGED and HUPO PSI to enhance the usefulness of tools for end-user scientists working with standard data formats, in order to ease the process of data submission, annotation, and analysis.

The widespread adoption of some of the more mature Omics standards by large numbers of life science researchers, data analysts, software developers, and journals has had a number of benefits. Adoption has promoted data sharing and reanalysis, facilitated publication, and spawned a number of data repositories to store data from Omics experiments. A higher citation rate and other benefits have been detected for researchers who share their data (7, 69). Estimates of the total volume of high-throughput data available in the public domain are complex to calculate, but a list of databases maintained by the Nucleic Acids Research journal (http://www3.oup.co.uk/nar/database/a) contained more than 1,000 databases in areas ranging from nucleic acid sequence data to experimental archives and specialist data integration resources (70). More public databases appear every year, and as technologies change and deep sequencing of genomes and transcriptomes becomes more cost-effective, the volume will undoubtedly rise even further. Consistent annotation of this growing volume of Omics data using interoperable ontologies and controlled vocabularies will play an important role in enabling collaborations and reuse of the data by third parties. More advanced forms of knowledge integration that rely on standard terminologies are beginning to be explored using semantic web approaches (71–73).

Adherence to standards by public data repositories is expected to facilitate data querying and reuse. Even in the absence of strict standards (such as compliance requirements upon data submission), useful data mining can be performed on large bodies of raw data originating from the same technology platform (74), especially if standards efforts make annotation guidelines available and repositories encourage their use. Approaches such as this may help researchers better utilize the limited amount of consistently annotated data in the public domain. It was recently noted that only a fraction of the data generated is deposited in public data repositories (75). Improvements in this area can be anticipated through the proliferation of better tools for bench scientists that make it easier for them to submit their data in a consistent, standards-compliant manner. The full value of Omics research will only be realized once scientists in the laboratory and the clinic are able to share and integrate large amounts of Omics data as easily as they can now do with primary biological sequence data.

4. Notes

1. Tools for programmers: Many labs need to implement their own tools for managing and analyzing data locally. There are a number of parsers and tool kits for common data formats that can be reused in this context. These are listed in Table 11.
2. Tips for using standards: Standards are commonly supported by tools and applications related to projects or to public repositories. One example is the ISA-TAB-related infrastructure described in Subheading 3.8.3; others are provided in Table 12. These include simple conversion tools for formats used by standards-compliant databases such as ArrayExpress and GEO, and tools that allow users to access these databases and load data into analysis applications; a minimal sketch of reading one such format appears below.
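As a minimal illustration of the kind of preprocessing such tools automate, the sketch below collects the "!key = value" metadata lines of a GEO SOFT-format file; the file name is a placeholder, and the snippet covers only this simple attribute-line case, not the full SOFT specification.

```python
def read_soft_attributes(path):
    """Collect '!key = value' metadata lines from a GEO SOFT-format file."""
    attributes = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("!") and "=" in line:
                key, _, value = line[1:].partition("=")
                # Keys may repeat (e.g., one line per sample), so keep a list of values.
                attributes.setdefault(key.strip(), []).append(value.strip())
    return attributes

# Placeholder file name; a real file would come from a GEO download.
meta = read_soft_attributes("GSE_example_family.soft")
print(meta.get("Series_title"))
```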

Table 11
Programmatic tools for dealing with standards, ontologies, and common data formats

Tool name | Language | Purpose | Website
Limpopo | Java | MAGE-TAB parser | http://sourceforge.net/projects/limpopo/
MAGEstk | Perl and Java | MAGE-ML toolkit | http://www.mged.org/Workgroups/MAGE/magestk.html
MAGE-Tab module | Perl | MAGE-TAB API | http://magetabutils.sourceforge.net/
OntoCat | Java | Ontology access tool for OWL, OBO format files and ontology web services | http://ontocat.sourceforge.net/
OWL-API | Java | Reading and querying OWL and OBO format files | http://owlapi.sourceforge.net/


Table 12
Freely available standards-related format conversion tools

Tool name | Language | Formats supported | Website
MAGETabulator | Perl | SOFT to MAGE-TAB | http://tab2mage.sourceforge.net
MAGETabulator | Perl | MAGE-TAB to MAGE-ML | http://tab2mage.sourceforge.net
ArrayExpress Package | R (Bioconductor) | MAGE-TAB to R objects | http://www.bioconductor.org/packages/bioc/html/ArrayExpress.html
GEOquery | R (Bioconductor) | GEO SOFT to R objects | http://www.bioconductor.org/packages/1.8/bioc/html/GEOquery.html
ISA-Creator | Java | ISA-TAB to MAGE-TAB | http://isatab.sourceforge.net/tools.html
ISA-Creator | Java | ISA-TAB to PRIDE XML | http://isatab.sourceforge.net/tools.html
ISA-Creator | Java | ISA-TAB to Short Read Archive XML | http://isatab.sourceforge.net/tools.html

Table 13
Standards-compliant data annotation tools

Tool name | Language | Purpose | Website
Annotare | Adobe Air/Java | Desktop MAGE-TAB annotation application | http://code.google.com/p/annotare/
MAGETabulator | Perl | MAGE-TAB template generation and related database | http://tab2mage.sourceforge.net
caArray | Java | MAGE-TAB data management solution | https://array.nci.nih.gov/caarray/home.action
ISA-Creator | Java | ISA-TAB annotation application | http://isatab.sourceforge.net/tools.html

3. Annotation tools for biologists and bioinformaticians: Annotation of data to be compliant with standards is supported by several open-source annotation tools. Some of these are related to repositories supporting standards, but most are available for local installation as well. These are described in Table 13.


4. Tips for using Ontologies: Further introductory information on the design and use of ontologies can be found at the Ontogenesis site (http://ontogenesis.knowledgeblog.org). Publicly available ontologies can be queried from the NCBO's website (http://www.bioportal.org), and tutorials for developing ontologies and using supporting tools such as the OWL-API are run by several organizations, including the NCBO, the OBO Foundry, and the University of Manchester, UK.
5. Format Conversion Tools: The MAGE-ML format described in Subheading 3.2.4 has been superseded by MAGE-TAB, and the different gene expression databases use different formats to express the same standards-compliant data. There are therefore a number of open source conversion tools that reformat data, or preprocess data for analysis application access. These are provided as downloadable applications and are summarized in Table 12. Support for understanding and applying data formats is often available from repositories that use these formats for data submission and exchange. Validation tools and supporting code may also be available. Email their respective helpdesks for support.
6. Tips for developing standards: Most standards bodies have affiliated academic or industry groups and fora who are developing applications and who welcome input from the community. For example, MGED has mailing lists, workshops, and an open source project that provides tools for common data representation tasks.

References
1. Boguski, M.S. (1999) Biosequence exegesis. Science 286(5439), 453–5.
2. Brazma, A. (2001) On the importance of standardisation in life sciences. Bioinformatics 17(2), 113–4.
3. Stoeckert, C.J., Jr., Causton, H.C., and Ball, C.A. (2002) Microarray databases: standards and ontologies. Nat Genet 32, 469–73.
4. Brooksbank, C., and Quackenbush, J. (2006) Data standards: a call to action. OMICS 10(2), 94–9.
5. Rogers, S., and Cambrosio, A. (2007) Making a new technology work: the standardization and regulation of microarrays. Yale J Biol Med 80(4), 165–78.
6. Warrington, J.A. (2008) Standard controls and protocols for microarray based assays in clinical applications, in Book of Genes and Medicine. Medical Do Co: Osaka.
7. Piwowar, H.A., et al. (2008) Towards a data sharing culture: recommendations for leadership from academic health center. PLoS Med 5(9), e183.
8. Brazma, A., Krestyaninova, M., and Sarkans, U. (2006) Standards for systems biology. Nat Rev Genet 7(8), 593–605.
9. Brazma, A., et al. (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29(4), 365–71.
10. Spellman, P.T., et al. (2002) Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol 3(9), RESEARCH0046.
11. Whetzel, P.L., et al. (2006) The MGED ontology: a resource for semantics-based description of microarray experiments. Bioinformatics 22(7), 866–73.
12. Parkinson, H., et al. (2009) ArrayExpress update – from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 37(Database issue), D868–72.
13. Parkinson, H., et al. (2007) ArrayExpress – a public database of microarray experiments and gene expression profiles. Nucleic Acids Res 35(Database issue), D747–50.
14. Parkinson, H., et al. (2005) ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 33(Database issue), D553–5.
15. Barrett, T., and Edgar, R. (2006) Gene expression omnibus: microarray data storage, submission, retrieval, and analysis. Methods Enzymol 411, 352–69.
16. Barrett, T., et al. (2005) NCBI GEO: mining millions of expression profiles – database and tools. Nucleic Acids Res 33(Database issue), D562–6.
17. Barrett, T., et al. (2007) NCBI GEO: mining tens of millions of expression profiles – database and tools update. Nucleic Acids Res 35(Database issue), D760–5.
18. Barrett, T., et al. (2009) NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res 37(Database issue), D885–90.
19. Taylor, C.F., et al. (2007) The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol 25(8), 887–93.
20. Shi, L., et al. (2006) The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24(9), 1151–61.
21. Taylor, C.F., et al. (2008) Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol 26(8), 889–96.
22. DeFrancesco, L. (2002) Journal trio embraces MIAME. Genome Biol 8(6), R112.
23. Jones, A.R., and Paton, N.W. (2005) An analysis of extensible modelling for functional genomics data. BMC Bioinformatics 6, 235.
24. Ashburner, M., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1), 25–9.
25. Smith, B., et al. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25(11), 1251–5.
26. Salit, M. (2006) Standards in gene expression microarray experiments. Methods Enzymol 411, 63–78.
27. Li, H., et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16), 2078–9.
28. Brookes, A.J., et al. (2009) The phenotype and genotype experiment object model (PaGE-OM): a robust data structure for information related to DNA variation. Hum Mutat 30(6), 968–77.
29. Brazma, A., and Parkinson, H. (2006) ArrayExpress service for reviewers/editors of DNA microarray papers. Nat Biotechnol 24(11), 1321–2.
30. Rayner, T.F., et al. (2006) A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 7, 489.
31. Rayner, T.F., et al. (2009) MAGETabulator, a suite of tools to support the microarray data format MAGE-TAB. Bioinformatics 25(2), 279–80.
32. Manduchi, E., et al. (2004) RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics 20(4), 452–9.
33. Ball, C.A., et al. (2005) The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res 33(Database issue), D580–2.
34. Demeter, J., et al. (2007) The Stanford Microarray Database: implementation of new analysis tools and open source release of software. Nucleic Acids Res 35(Database issue), D766–70.
35. Gollub, J., et al. (2003) The Stanford Microarray Database: data access and quality assessment tools. Nucleic Acids Res 31(1), 94–6.
36. Gollub, J., Ball, C.A., and Sherlock, G. (2006) The Stanford Microarray Database: a user's guide. Methods Mol Biol 338, 191–208.
37. Hubble, J., et al. (2009) Implementation of GenePattern within the Stanford Microarray Database. Nucleic Acids Res 37(Database issue), D898–901.
38. Sherlock, G., et al. (2001) The Stanford Microarray Database. Nucleic Acids Res 29(1), 152–5.
39. Navarange, M., et al. (2005) MiMiR: a comprehensive solution for storage, annotation and exchange of microarray data. BMC Bioinformatics 6, 268.
40. Allison, M. (2008) Is personalized medicine finally arriving? Nat Biotechnol 26(5), 509–17.
41. Orchard, S., and Hermjakob, H. (2008) The HUPO proteomics standards initiative – easing communication and minimizing data loss in a changing world. Brief Bioinform 9(2), 166–73.
42. Pedrioli, P.G., et al. (2004) A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol 22(11), 1459–66.
43. Keller, A., et al. (2005) A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol Syst Biol 1, 0017.
44. Deutsch, E. (2008) mzML: a single, unifying data format for mass spectrometer output. Proteomics 8(14), 2776–7.
45. Deutsch, E.W., Lam, H., and Aebersold, R. (2008) Data analysis and bioinformatics tools for tandem mass spectrometry in proteomics. Physiol Genomics 33(1), 18–25.
46. Orchard, S., et al. (2007) The minimum information required for reporting a molecular interaction experiment (MIMIx). Nat Biotechnol 25(8), 894–8.
47. Kerrien, S., et al. (2007) Broadening the horizon – level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol 5, 44.
48. Fiehn, O., et al. (2006) Establishing reporting standards for metabolomic and metabonomic studies: a call for participation. OMICS 10(2), 158–63.
49. Sansone, S.A., et al. (2007) The metabolomics standards initiative. Nat Biotechnol 25(8), 846–8.
50. Goodacre, R., et al. (2007) Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics 3(3), 231–41.
51. Hardy, N., and Taylor, C. (2007) A roadmap for the establishment of standard data exchange structures for metabolomics. Metabolomics 3(3), 243–8.
52. Jenkins, H., Johnson, H., Kular, B., Wang, T., and Hardy, N. (2005) Toward supportive data collection tools for plant metabolomics. Plant Physiol 138(1), 67–77.
53. Jenkins, H., et al. (2004) A proposed framework for the description of plant metabolomics experiments and their results. Nat Biotechnol 22(12), 1601–6.
54. Spasic, I., et al. (2006) MeMo: a hybrid SQL/XML approach to metabolomic data management for functional genomics. BMC Bioinformatics 7, 281.
55. Sansone, S.-A., Schober, D., Atherton, H., Fiehn, O., Jenkins, H., Rocca-Serra, P., et al. (2007) Metabolomics standards initiative: ontology working group work in progress. Metabolomics 3(3), 249–56.
56. Jenkins, H., Hardy, N., Beckmann, M., Draper, J., Smith, A.R., Taylor, J., et al. (2004) A proposed framework for the description of plant metabolomics experiments and their results. Nat Biotechnol 22(12), 1601–6.
57. Kumar, D. (2007) From evidence-based medicine to genomic medicine. Genomic Med 1(3–4), 95–104.
58. Fostel, J.M. (2008) Towards standards for data exchange and integration and their impact on a public database such as CEBS (Chemical Effects in Biological Systems). Toxicol Appl Pharmacol 233(1), 54–62.
59. Bland, P.H., Laderach, G.E., and Meyer, C.R. (2007) A web-based interface for communication of data between the clinical and research environments without revealing identifying information. Acad Radiol 14(6), 757–64.
60. Meslin, E.M. (2006) Shifting paradigms in health services research ethics. Consent, privacy, and the challenges for IRBs. J Gen Intern Med 21(3), 279–80.
61. Ferris, T.A., Garrison, G.M., and Lowe, H.J. (2002) A proposed key escrow system for secure patient information disclosure in biomedical research databases. Proc AMIA Symp, 245–9.
62. Quackenbush, J., et al. (2006) Top-down standards will not serve systems biology. Nature 440(7080), 24.
63. Jones, A.R., et al. (2007) The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nat Biotechnol 25(10), 1127–33.
64. Sansone, S.A., et al. (2008) The first RSBI (ISA-TAB) workshop: "can a simple format work for complex studies?" OMICS 12(2), 143–9.
65. Sansone, S.A., et al. (2006) A strategy capitalizing on synergies: the Reporting Structure for Biological Investigation (RSBI) working group. OMICS 10(2), 164–71.
66. Whetzel, P.L., et al. (2006) Development of FuGO: an ontology for functional genomics investigations. OMICS 10(2), 199–204.
67. Smith, B., et al. (2005) Relations in biomedical ontologies. Genome Biol 6(5), R46.
68. Rubin, D.L., et al. (2006) National Center for Biomedical Ontology: advancing biomedicine through structured organization of scientific knowledge. OMICS 10(2), 185–98.
69. Piwowar, H.A., and Chapman, W.W. (2008) Identifying data sharing in biomedical literature. AMIA Annu Symp Proc, 596–600.
70. Galperin, M.Y., and Cochrane, G.R. (2009) Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucleic Acids Res 37(Database issue), D1–4.
71. Ruttenberg, A., et al. (2007) Advancing translational research with the Semantic Web. BMC Bioinformatics (8 Suppl 3), S2.
72. Sagotsky, J.A., et al. (2008) Life Sciences and the web: a new era for collaboration. Mol Syst Biol 4, 201.
73. Stein, L.D. (2008) Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nat Rev Genet 9(9), 678–88.
74. Day, A., et al. (2007) Celsius: a community resource for Affymetrix microarray data. Genome Biol 8(6), R112.
75. Ochsner, S.A., et al. (2008) Much room for improvement in deposition rates of expression microarray datasets. Nat Methods 5(12), 991.

Chapter 3

Omics Data Management and Annotation

Arye Harel, Irina Dalah, Shmuel Pietrokovski, Marilyn Safran and Doron Lancet

Abstract

Technological Omics breakthroughs, including next generation sequencing, bring avalanches of data which need to undergo effective data management to ensure integrity, security, and maximal knowledge gleaning. Data management system requirements include flexible input formats, diverse data entry mechanisms and views, user friendliness, attention to standards, hardware and software platform definition, as well as robustness. Relevant solutions elaborated by the scientific community include Laboratory Information Management Systems (LIMS) and standardization protocols facilitating data sharing and managing. In project planning, special consideration has to be made when choosing relevant Omics annotation sources, since many of them overlap and require sophisticated integration heuristics. The data modeling step defines and categorizes the data into objects (e.g., genes, articles, disorders) and creates an application flow. A data storage/warehouse mechanism must be selected, such as file-based systems and relational databases, the latter typically used for larger projects. Omics project life cycle considerations must include the definition and deployment of new versions, incorporating either full or partial updates. Finally, quality assurance (QA) procedures must validate data and feature integrity, as well as system performance expectations. We illustrate these data management principles with examples from the life cycle of the GeneCards Omics project (http://www.genecards.org), a comprehensive, widely used compendium of annotative information about human genes. For example, the GeneCards infrastructure has recently been changed from text files to a relational database, enabling better organization and views of the growing data. Omics data handling benefits from the wealth of Web-based information, the vast amount of public domain software, increasingly affordable hardware, and effective use of data management and annotation principles as outlined in this chapter.

Key words: Data management, Omics data integration, GeneCards, Project life cycle, Relational database, Heuristics, Versioning, Quality assurance, Annotation, Data modeling


1. Introduction

1.1. What Is Data Management?

Data management is the development, execution, and supervision of policies, plans, data architectures, procedures, programs, and practices that control, protect, deliver, and enhance the value of data and information assets. Topics in data management include architecture, analysis, security, quality assurance, integration, and metadata management. Good data management allows one to work more efficiently, to produce higher quality information, to achieve greater exposure, and to protect the data from loss or misuse (1–4).

1.2. Why Manage Omics Data?

Technological breakthroughs in genomics and proteomics, including next-generation sequencing (5, 6), as well as in polychromatic flow cytometry and imaging, account for vastly accelerating data acquisition trends in all Omics fields (4–7). Even one biological sample may be used to generate many different, enormous Omics data sets in parallel (8). At the same time, these technologies improve the focus of biology, from reductionist analyses of parts to system-wide analyses and modeling, which consequently further increases this avalanche of data (9). For example, the complex 2.25-billion-base giant panda genome has been determined using 52-base reads, covering 94% of the genome at 56× coverage and probably excluding only its repeat regions (10). This state of affairs has caused a change in approaches to data handling and processing. Extensive computer manipulations are required for even basic analyses, and a focused data management strategy becomes pivotal (11, 12). Data management has been identified as a crucial ingredient in all large-scale experimental projects. Exploiting a well-structured data management system can leverage the value of the data supplied by Omics projects (13). The advantages of data management are as follows:
1. It ensures that once data is collected, information remains secure, interpretable, and exploitable.
2. It helps keep and maintain complete and accurate records obtained in parallel by many researchers who manipulate multiple objects in different geographical locations.
3. It can bring order to the complexity of experimental procedures, and to the diversity and high rate of development of protocols used in one or more centers.
4. It addresses needs that increase nonlinearly when analyses are carried out across several data sets.
5. It enables better research and the effective use of data mining tools.
6. It supports combining data that is mined from numerous, diverse databases.


2. Materials

2.1. Data Management System Requirements

Present-day Omics research environments embody multidimensional complexity, characterized by diverse data types stemming from multicenter efforts, a variety of software and technological platforms, and the need for detailed project planning that takes into account the complete life cycle of the project (Fig. 1). The ideal data management system should fulfill a variety of requirements, so as to facilitate downstream bioinformatics and systems biology analyses (6, 13, 14):
1. Flexible inputs, supporting source databases of different formats, a variety of data types, facile attainment of annotation about the experiments performed, and a pipeline for adding information at various stages of the project.
2. Flexible views of the database at each stage of the project (summary views, extended views, etc.), customized for different project personnel, including laboratory technicians and project managers.
3. User-friendliness, preferably with Web-based interfaces for data viewing and editing.

Fig. 1. Data management starts with project planning, and completes its initial cycle with the first public release of the system. (Figure panels: Omics project planning, covering data sources, programming technologies, data warehouse/modeling, information presentation, and de-novo insight; implementation and development, covering versioning, algorithms and heuristics, data integration, and quality assurance; and public releases, covering the user interface, search pages, and the integrated database.)


4. Interactive data entry, including the capabilities to associate entries with particular protocols, to trace the relationships between different types of information, to record time stamps and user/robot identifiers, to trace bottlenecks, and to react to the dynamic needs of laboratory managers (rather than software developers).
5. Computing capacity for routine calculations via the launching of large-scale computations, e.g., on a computer cluster or grid.
6. External depositions, i.e., a capacity to create valid depositions for relevant databases (such as GenBank (15), Ensembl (16), SwissProt (17)), including the tracking of deposition targets.
7. Robustness through residing on an industrial-strength, well-supported, high-quality database management system (preferably relational), ensuring security and organization.

2.2. Barriers for Implementing Data Management

Given the volume of information currently generated, and the relevant resources engaged, making the case for data management is relatively straightforward. However, the barriers that must be overcome before data management becomes a reality are substantial (4, 6, 8, 18):
1. Time and effort considerations. Data management is a long-term project. The development of data management solutions frequently stretches well beyond the initial implementation phase (instrumentation and laboratory workflow evolve continuously, thus making data management an ever-moving target).
2. Personnel recruitment. Data management requires a different set of skills and a different frame of mind as compared to data analysis, hence a recruitment challenge.
3. Search and display capacities. For large sets of complex data, and where a wide variety of results can be generated from a single experiment, project planning should include extensive search capacities and versatile display modes.
4. Proactive management. A capacity to accommodate newly discovered data, new insights, and remodeling needs is required. Planning upfront for migration to new versions is essential.
5. Optimized data sharing. In large projects encompassing several research centers, data sharing (19) poses several impediments that need to be overcome, including issues of shared funding, publication, and intellectual property, as well as legal and ethical issues, necessitating means to avoid unauthorized use of the data.
6. Community resources. Lack of informatics expertise poses a problem, and expert pools of scientists with the requisite skills must be developed, as well as a community of biocurators (12). Paucity of funding has been highlighted (20), necessitating new ways of balancing streams of support for the generation of novel data and the protection of existing data (8).
7. Data Storage. The cost of storing the hundreds of terabytes of raw data produced by next-generation sequencing has been estimated to be greater than the cost of generating the data in the first place (6). In this and other Omics examples, maintaining the streams of data in a readily usable and queryable form is an important challenge.
8. Redundancy. Whole genome resequencing, including the human 1,000 genomes project (http://www.1000genomes.org) (21), as well as multiple plant and other animal variation discovery programs, leads to a shift from central databases to interactive databases of genome variation, such as dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP) (22) and databases supporting the human and bovine HapMap projects (http://www.hapmap.org) (23). These pose data integration challenges that need to be addressed.
9. Ontologies and semantics. The use of ontologies (24, 25) and limited vocabularies across many databases is an invaluable aid to semantic integration. However, it seems that no single hierarchy is expressive enough to reflect the abundant differences in scientists' viewpoints. Furthermore, the complexity of ontologies introduces difficulties, since grappling with a very deep and complex ontology can sometimes be as confusing as not having one at all.
10. Integration. Since significant portions of the data incorporated in Omics databases represent information copied from other sources, complexities arise due to the lack of standardized formatting, posing an integration challenge.
11. Text mining vs. manual curation. Many databases lack objective yardsticks which validate their automatic data mining annotation. The alternatives are expert community-based annotation, or local team manual curation of information from the scientific literature. These are time-consuming processes which require highly trained personnel. While algorithms for computer-assisted literature annotation are emerging (26, 27), the field is still in its infancy.
12. The metagenomics challenge. The production of metagenomics data is yet another challenge for data management, because sequences cannot always be associated with specific species.
13. The laboratory-database hiatus. Information is often fragmented among hard drives, CDs, printouts, and laboratory notebooks. As a result, data entry is often incomplete or delayed, often accompanied by only minimal supporting information. New means to overcome this problem have to be urgently developed.


2.3. Omics Data Management

Since data management was recognized as crucial for exploiting large sets of data, considerable effort has been invested by the scientific Omics community to produce relevant computer-based systems, and to standardize rules that apply to worldwide users of such systems (1, 6, 9, 20, 21). Broadly accessed projects in genomics and proteomics include, among others, the international nucleotide sequence databases consisting of GenBank (http://www.ncbi.nlm.nih.gov/Genbank) (15), the DNA Databank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) (28), and the European Molecular Biology Laboratory (EMBL, http://www.embl.org) (29), as well as the Universal Protein Resource (UniProtKB, http://www.uniprot.org) (17) and the Protein Data Bank (PDB, http://www.pdb.org) (30). In addition to hosting text sequence data, they encompass basic annotation and, in many cases, the raw underlying experimental data. Although these projects are pivotal in the Omics field, they do not answer the complete variety of needs of the scientific community. To fill the gap, a number of other databases have been developed. Many are meta-databases, integrating the major databases and often also various others. Examples include GeneCards (http://www.genecards.org) (31–35) and Harvester (http://harvester.fzk.de) (36). In parallel, databases that focus on specific areas have emerged, including HapMap (23) for genetic variation, RNAdb (http://research.imb.uq.edu.au/rnadb) (37) for RNA genes, and OMIM (http://www.ncbi.nlm.nih.gov/omim) (38) for genetic disorders in humans.

2.4. Laboratory Information Management Systems

In order to manage large projects, it is possible to use a Laboratory Information Management System (LIMS), a well-known methodology in scientific data management. A LIMS is a software system used in laboratories for the management of samples, laboratory users, instruments, standards, and other laboratory functions, such as microtiter plate management, workflow automation, and even invoicing (13, 39, 40). A LIMS facilitates the day-to-day work of complex, high-dimensional projects (multiple users, multiple geographic locations, and many types of data, instruments, and input/output formats); it organizes information and allows its convenient retrieval, which improves searches. It also simplifies scientific management by centralizing the information and identifying bottlenecks from global data analyses. Finally, it allows data mining studies that could in turn help choose better targets, thus improving many project outcomes (13). Since Omics data management tasks are complex and cover a wide range of fields, a variety of implementation platforms have been developed. These range from general Perl modules supporting the development of LIMS for projects in genomics and transcriptomics (40), through a more sophisticated LIMS project (41) covering data management in several Omics fields (e.g., 2D gels, microarray, SNP, MS, and sequence data), to specialized projects for managing EST data (42, 43), transcriptome data (44), functional genomics (45), toxicology, and biomarkers (46). In addition, LIMS designed specifically for the proteomics arena include:

Xtrack (47). Designed to manage crystallization data and to hold the chemical compositions of the major crystallization screens, Xtrack stores sample expression and purification data throughout the process. After completion of the structure, all data needed for deposition in the PDB are present in the database.

Sesame (48). Designed for the management of protein production and structure determination by X-ray crystallography and NMR spectroscopy.

HalX (49). Developed for managing structural genomics projects; the Northeast Structural Genomics Consortium has developed a LIMS as part of their SPINE-2 software (50).

ProteinScape™ (51). A bioinformatics platform which enables researchers to manage proteomics data from generation and warehousing to storage in a central repository.

2.5. Data Viewers

As reference genome sequences have become available, several genome viewers have been developed to allow users effective access to the data. Common browsers include EnsEMBL (http://www.ensembl.org) (16), GBrowse (http://gmod.org/wiki/Gbrowse) (52), and the University of California, Santa Cruz genome browser (http://genome.ucsc.edu) (53) (see Note 1). These viewers increasingly include sequence variation and comparative genome analysis tools, and their implementations are becoming the primary information source for many researchers who do not wish to trawl through the underlying sequence data used to identify the sequence annotation, comparison, and variation. Genome viewers and their underlying databases are becoming both the visualization and interrogation tools of choice for sequencing data.

2.6. Standardization

Massive-scale raw data must be highly structured to be useful to downstream users. In all types of Omics projects, many targets are manipulated, and the results must be interpretable within the context of the experimental conditions. In such large-scale efforts, data exchange standardization is a necessity for facilitating and accelerating the collection of relevant metadata, reducing replication of effort, and maximizing the ability to share and integrate data. One effective solution is to develop a consensus-based approach (54). Standardized solutions are increasingly available for describing, formatting, submitting, sharing, annotating, and exchanging data. These reporting standards include minimum information checklists (55), ontologies that provide the terms needed to describe the minimal information requirements (25), and file formats (56, 57).


The "minimum information about a genome sequence" guideline published by the Genomic Standards Consortium (GSC) (9) calls for this critical information to be mandatory for all genome and metagenome submissions. The GSC is an open-membership, international working body formed in September 2005 that aims to promote mechanisms for standardizing the description of genomes and the exchange and integration of genomic data (58). Some pertinent cases in point are: the Minimum Information About a Microarray Experiment (MIAME) guideline, which has been largely accepted as a standard for microarray data; also, nearly all LIMSs developed for relevant experiments are now "MIAME-compliant" (59–61). Standards developed jointly by the US Protein Structure Initiative, the PDB, and the BioMagResBank were partially used by the European SPINE project and applied in designing its data model (62). The Proteomics Standards Initiative of the Human Proteome Organization (HUPO) aims to provide the data standards and interchange formats for the storage and sharing of proteomics data (63). The Metabolomics Standards Initiative recommended that metabolomics studies report the details of the study design, metadata, experimental, analytical, data processing, and statistical techniques used (64). The growing number of standards and policies in these fields has stimulated the generation of new databases that centralize and integrate these data: the Digital Curation Centre (DCC, http://www.dcc.ac.uk) tracks data standards and documents best practice (65), and the BioSharing database (http://biosharing.org) (66) centralizes bioscience data policies and standards, providing a "one-stop shop" for those seeking data policy information (8).

2.7. Project Planning

As in all disciplines, the proper start to an Omics project is defining requirements. Good planning makes for clean, efficient subsequent work. The following areas of a project require planning:

Type of data to be used. Should the project be restricted to only one type of Omics data (genomics, metabolomics, etc.), in order to get a highly specialized database and have simple relations among a limited number of business objects (see below)? Or does one prefer to include and connect numerous data types, potentially leading to new cross-correlation insights? Does one wish to focus on experimental work or include computational inference?

Data presentation. Most projects opt for a Web-based interface, but there are many advantages to stand-alone client-based applications, which users install on their own computers. Advantages of Web-based applications include fast propagation to worldwide users, easier deployment and maintenance, and ubiquity. Advantages of stand-alone applications include no developer data security issues (they all fall on the user) and no server load issues (e.g., too many simultaneous Web requests).

Design of business objects. To be managed, data must be categorized and broken into smaller units – customarily called business objects. How should those be designed? And what can be expected from the data? Is simple presentation to the user sufficient, or are searches required? Should there be an option for user contribution to the project's information base? Should the contributions be named or anonymous?

Data updates. Omics projects, especially those based on remote sites, need to take into account that information is not static: it is never sufficient to just create a data compilation; update plans are crucial.

Data integrity. No data management plan is satisfactory if it does not allow for constant maintenance of data integrity, security, and accuracy.

2.8. Choosing Omics Sources

What type of Omics data? With so many fields to choose from, and with such a high need for the analysis of large-volume data, the choice is never straightforward. A large variety exists (67, 68), including:

Genomics          The genes' sequences and the information therein
Transcriptomics   The presence and abundance of transcription products
Proteomics        The proteins' sequence, presence and function within the cell
Metabolomics      The complete set of metabolites within the cell
Localizomics      The subcellular localization of all proteins
Phenomics         High-throughput determination of cell function and viability
Metallomics       The totality of metal/metalloid species within an organism (69)
Lipidomics        The totality of lipids
Interactomics     The totality of the molecular interactions in an organism
Spliceomics       The totality of the alternative splicing isoforms
Exomics           All the exons
Mechanomics       The force and mechanical systems within an organism
Histomics         The totality of tissues in an organ


Many projects evolve from a field of interest and from curiosity-driven research; others are an outgrowth of a wider recognition of a scarcity of knowledge.

Scope of database. The choice is between a low-volume, highly specialized database, freed from the constraints of supporting general and generic requirements, wherein one can optimize the design to be as efficient as possible for the relevant application, and a larger, broader one. A specialized database presents the opportunity for the author to become the recognized expert in a specific Omics area. Larger databases involve more complex database management. Even for a focused database, experiments often yield too many results to be practically manageable for an online system; one must then decide what, how much, and how often to archive, and design tools to effectively access the archives and integrate them into query results when needed. In addition, one should realize that ongoing analyses of the data often provide insights which in turn impact the implementation of updated versions of the data management system.

Data integration and parsing. Data integration involves combining data residing in different sources and providing users with a unified and intelligent view of the merged information. One of the most important scientific decisions in data integration is which fields should be covered. There are many databases for each Omics field; designing a standard interface which also eliminates duplicates and unifies measurements and nomenclature is very important, but not easy to implement. An example of where multiarea coverage has been beneficial is the integration of metabolomics and transcriptomics data in order to help define prognosis characteristics of neuroendocrine cancers (70). Some of the challenges include the need to parse and merge heterogeneous source data into a unified format, and to import/merge it into the business objects of the project. This step includes taking text that is given in one format, breaking it into records (one or more fields or lines), choosing some or all of its parts, and writing it in another format, often adding original annotation. The main implementation hurdle is dealing with different formats from different sources. Some need to be parsed from the Web, preferably in an automatic fashion. Many sources provide users with data dumps from a relational database, while others opt for exporting extensible markup language (XML) files (see Note 2) or simple text files, e.g., in a comma-separated values (CSV) format. The more sources a project has, the higher the likelihood that more than one input format needs parsing. The project has to provide specific software for each of its mined sources; when developing in an object-oriented fashion, type-specific parsing classes can be used as a foundation and ease this work (71). Irrespective of programming language, and especially if there is more than one programmer working on a project, a source code version control system, like CVS (see Note 3), should be strongly considered. Also, for efficiency, most programmers use an integrated development environment, such as Eclipse (see Note 4).
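As an illustration of the type-specific parsing classes mentioned above, the following Python sketch shows two source-specific parsers that emit records in one unified dictionary form; the field names, element names, and file layouts are invented for this example and would differ for each real mined source.

```python
import csv
import xml.etree.ElementTree as ET

class SourceParser:
    """Base class: every source-specific parser yields uniform record dicts."""
    def records(self, path):
        raise NotImplementedError

class CsvGeneParser(SourceParser):
    """Parser for a hypothetical CSV dump with 'gene_symbol' and 'desc' columns."""
    def records(self, path):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Map source-specific column names onto project-wide field names.
                yield {"symbol": row["gene_symbol"], "description": row["desc"]}

class XmlGeneParser(SourceParser):
    """Parser for a hypothetical XML export with <gene symbol="..."> elements."""
    def records(self, path):
        for gene in ET.parse(path).getroot().iter("gene"):
            yield {"symbol": gene.get("symbol"),
                   "description": gene.findtext("description", default="")}
```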

2.9. Defining the Data Model Within the Project: Data, Metadata, and Application Flow

Once the sources of information are chosen, the system's data definition has to be developed. The volume of data, be it small or large, has to be broken down and categorized into business objects. How to achieve these goals and implement the objects and their relationships is called data modeling (72). An item may be defined as a business object if it has specific annotation. For example, a genomics project may choose to define genes, sequences, and locations among its many objects, but not proteins – even though they may be mentioned as the product of the gene. If the base information does not contain protein annotation, such as amino acid sequence or 3D images, there may be no need for the protein to be defined as an object. The names of the objects, their permitted actions within the project, their data types, and more are referred to as metadata.

Application flow is the sum of actions that can be performed within the project, and the order in which they are connected. It is important for the efficient management and integration of the data. Examples of questions to consider include: Is it sufficient for the project information to be merely presented to the user, or should it be searchable as well? What must the user be able to do with the results? Should there be a capability for users to add comments or even new data – or will the database as provided by the system be static until its next official update? Large projects, with complex relationships among their fields, are useful only if the data is searchable, so the application flow should contain: (a) a home page; (b) a page from which the user can enter their search terms and define the search parameters; and (c) a page listing the search results. Additionally, useful features would be for the search results page to: (a) sort by different criteria; (b) refine the search; (c) perform statistical analyses on the results set; and (d) export the results set to a file (say in Excel format) or to a different application, either internal or external to the system. Understanding the application flow is critical at the data modeling stage, since its requirements (both functional and performance) drive the design choices.
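To make the business-object idea concrete, here is a minimal Python sketch using the gene/article/disorder examples mentioned in this chapter; the specific attributes and relationships are assumptions chosen for illustration, not a prescribed model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Article:
    pubmed_id: str
    title: str

@dataclass
class Disorder:
    name: str

@dataclass
class Gene:
    symbol: str                                          # e.g., an approved gene symbol
    location: str = ""                                    # chromosomal location
    articles: List[Article] = field(default_factory=list)
    disorders: List[Disorder] = field(default_factory=list)

# The metadata, in the sense used above, is the set of object names, their fields,
# and their permitted relations; the instance below is purely illustrative.
brca1 = Gene(symbol="BRCA1", location="17q21")
brca1.disorders.append(Disorder(name="Breast cancer, familial"))
```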

2.10. Defining Data Warehouse Requirements

A data warehouse is defined as the central repository for a project's (or a company's) electronically stored data (73). Different databases allow mining via different means: parsing of Web pages, ad hoc text files (flat files), CSV files, XML files, and database dumps. For clean, efficient data use in the project, all of these sources should be integrated into one uniform project database. With the advent of reliable public domain solutions, a significant proportion of current Omics projects use relational database management systems (RDBMSs) as their data warehouse. Using a text file database is best suited for small projects with relatively low-volume annotations and relatively simple relations between the business objects. For example, an experimental data project specializing in cytoskeleton proteins only has a small volume of data, in comparison to, say, NCBI's EntrezGene project, so using text files may suffice for it. A text file database has some advantages:
1. It needs less specialized maintenance; simple system commands can be used to check how many files there are, how they are organized, and how much space they take up.
2. Implementation can be very quick: one can choose a favorite programming language and write the necessary code to create, extract, and search the data. If the files are in XML, or another popular format, there are many relevant public-domain modules, in a variety of programming languages, already written and freely accessible.
3. It is easier to deploy. One can use system commands again to organize and compress the set of files into one downloadable file.
However, the file system solution also has considerable disadvantages, especially for high-volume data with complex relationships among objects (74):
1. It could use more disk space; often data is stored redundantly for easier access.
2. Writing and maintaining custom-made code to access/add/analyze data may become cumbersome and/or slow.
3. Fewer analyses can be done on complex relationships, since they cannot be easily marked in text files.
Indeed, for high-volume, complex-relationship projects, a relational database offers the following important advantages:
1. Application program independence; data is stored in a standard, uniform fashion, as tables.
2. Multiple views of data, as well as expandability, flexibility, and scalability.
3. Self-describing metadata which elucidates the content of each field.
4. Reduced application development times once the system is in place.
5. Standards enforcement; all applications/projects using the same RDBMS have a ready-made solution for integration.
A widely used open source RDBMS is MySQL (75). The Web includes many tutorials, examples, free applications, and user groups relating to it. Complementing such a database system are programming languages to automate the insertion of data and to curate and analyze the included data, for example, employing Perl (76) for generating user MySQL extensions (77). To a lesser degree, the type of mined data also plays a role in choosing a data warehouse. If all other things are equal, then keeping the same type as the project's input data saves implementation time. RDBMSs provide basic querying facilities which are often sufficient to power the searches needed by Omics applications. Systems based on flat files, as well as relational databases for which speed and full-text searching are essential, typically need an external search engine (e.g., as can be provided by Google, or by specialized software like Glimpse or Lucene (78, 79), for efficient indexing and querying).
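As a small illustration of the relational approach, the sketch below uses Python's built-in sqlite3 module as a stand-in for a server-based RDBMS such as MySQL; the two-table gene/annotation schema and the inserted values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real project would use a server-based RDBMS
conn.executescript("""
    CREATE TABLE gene (
        gene_id INTEGER PRIMARY KEY,
        symbol  TEXT UNIQUE NOT NULL
    );
    CREATE TABLE annotation (
        annotation_id INTEGER PRIMARY KEY,
        gene_id       INTEGER NOT NULL REFERENCES gene(gene_id),
        source        TEXT NOT NULL,   -- e.g., the name of the mined database
        value         TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO gene (symbol) VALUES (?)", ("TP53",))
conn.execute(
    "INSERT INTO annotation (gene_id, source, value) VALUES (1, ?, ?)",
    ("example_source", "tumor suppressor"),
)
# One relational query replaces ad hoc code that would otherwise crawl many flat files.
for row in conn.execute(
    "SELECT g.symbol, a.source, a.value FROM gene g JOIN annotation a USING (gene_id)"
):
    print(row)
```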

2.11. Defining Versioning Requirements

Planning a project cannot end before accounting for its life cycle. Both in-silico and experimental data projects need to anticipate the future evolution of the data and its analyses. Integrated databases would rapidly become obsolete if they failed to deploy updates with new information from their sources. Programming development time forms a lower bound for version intervals, and the more sources a project mines, the more complicated this process becomes. Not all sources update at the same time, and frequencies vary. Planning version cycles must also take into account the interdependence of different sources. Data updates can be either complete (with all data mined and integrated from scratch) or incremental (updating only what has changed, e.g., new research articles that have just been published about a specific disease). Incremental updates of this sort, along with triggered reports to users, are very attractive. In practice, this is often extremely difficult to implement in Omics applications, due to widespread data complexities, exacerbated by unpredictable source data format changes, as well as the interdependencies of many of the major data sources. Finding full or partial solutions in this arena is an interesting research focus. Ensuring speedy deployment of data updates (complete or partial) is not the only reason for versioning. As time passes, more users send feedback about an Omics project and new scientific areas and technologies emerge; new features become desirable, and/or new application behaviors become necessary. These warrant a code update, and since the code services the data, it is often most convenient to provide joint data and feature updates.
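One possible (hypothetical) way to decide between complete and incremental updates is to record a release identifier per mined source at each build and compare it against the sources' current releases, as in this sketch; the source names and version strings are placeholders.

```python
def changed_sources(previous, current):
    """Return the sources whose release identifier differs from the last build."""
    return sorted(name for name, release in current.items()
                  if previous.get(name) != release)

# Placeholder release identifiers recorded at the previous build vs. now.
previous_build = {"source_a": "2010-09", "source_b": "release_60"}
current_state  = {"source_a": "2010-11", "source_b": "release_60"}

to_refresh = changed_sources(previous_build, current_state)
if to_refresh:
    print("Incremental update candidates:", to_refresh)
else:
    print("No source changes detected; no data update needed.")
```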

2.12. Data Quality Control and Assurance

Quality Assurance (QA) is crucial to all projects; it refers to the process used to ensure that deliverables comply with the designed models and specifications. Examples of quality assurance include process checklists,


project audits, and methodology and standards development. QA is usually done toward the end of project development, though it is extremely useful in intermediate stages as well, and it can be done by anyone who has access to the project manual or user interface. For an in-silico application, QA means checking that all of the data is represented as planned, that all of the application's functionality exists and behaves as designed, and that all of the application's required steps follow one another in the specified order. For a project based on experimental data, quality assurance also includes verification that the data in the project's database is identical to the data gathered by the scientists. Plans for QA should be designed in parallel with the implementation; running QA tests should be allotted their own time frames within milestone plans. Once shortcomings are uncovered, they should be returned to the implementation stages for correction and retesting. Some defects also lead to redesigning the test plan, often by adding more checks. For Omics projects in particular, QA has to ensure that both the business logic (the specific data entities and the relationships among them) and the science logic (e.g., no DNA sequence should contain the letter P, and a full protein sequence should be at least 30 amino acids long) are correct.
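Science-logic checks of this kind are easy to automate. The sketch below applies two plain Python checks to records assumed to have been pulled from the project database; the allowed DNA alphabet and the 30-amino-acid threshold follow the examples above, while the record layout is invented.

VALID_DNA = set("ACGTN")          # allow N for ambiguous base calls
MIN_PROTEIN_LENGTH = 30           # minimal length for a full protein sequence

def check_dna(seq):
    """A DNA sequence must not contain protein letters such as P."""
    bad = set(seq.upper()) - VALID_DNA
    return not bad, f"invalid DNA characters: {sorted(bad)}" if bad else "ok"

def check_protein(seq):
    """A full protein sequence should be at least 30 amino acids long."""
    ok = len(seq) >= MIN_PROTEIN_LENGTH
    return ok, "ok" if ok else f"protein too short ({len(seq)} aa)"

records = [
    ("ENTRY-001", "dna", "ATGGCGTACCCGN"),
    ("ENTRY-002", "protein", "MKTAYIAKQR"),   # fails the length check
]
for identifier, kind, seq in records:
    check = check_dna if kind == "dna" else check_protein
    ok, message = check(seq)
    print(identifier, "PASS" if ok else "FAIL", message)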

3. Methods

3.1. GeneCards Data and Its Sources

In this section, we use the GeneCards project as a case study for illustrating many of the data management concepts described above. GeneCards® (http://www.genecards.org) (31–35) is a comprehensive, authoritative, and widely used compendium of annotative information about human genes. Its gene-centric content is automatically mined and integrated from over 80 digital sources, including HGNC (80), NCBI (81), ENSEMBL (82), UniProtKB (83), and many more (84), resulting in a Web-based deep-linked card for each of >53,000 human gene entries, categorized as protein-coding, RNA genes, pseudogenes, and more. Figure 2 depicts the GeneCards project life cycle. The result is a comprehensive and searchable information resource of human genes, providing concise genome, proteome, transcriptome, disease, and function data on all known and predicted human genes, and successfully overcoming barriers of data format heterogeneity using standard nomenclature, especially approved gene symbols (85). GeneCards “cards” include distinct functional areas, encompassing a variety of topics, including the GIFtS annotation score (35), aliases and descriptions, summaries, location, proteins, domains and families, gene function, proteins


Fig. 2. The GeneCards project’s instantiation of data management planning, implementation, releases, and versioning, with examples of its sources, technologies, data models, presentation needs, de novo insights, algorithms, quality assurance, user interfaces, and data dumps.

and interactions, drugs and compounds, transcripts, expression, orthologs, paralogs, SNPs, disorders, publications, other genome-wide and specialized databases, licensable technologies, and products and services featuring a variety of gene-specific research reagents. A powerful search facility provides full-text and field-specific searches; set-specific operations are available via the GeneALaCart and GeneDecks (34) subsystems.

3.2. GeneCards Data Modeling and Integration

The GeneCards data model is complex. In legacy GeneCards Versions 2.xx, information is stored in flat files, one file per gene. Version 3.0 (V3), deployed in 2010, uses a persistent object/relational approach, attempting to model all of the data entities and relationships in an efficient manner, so that the diverse functions of displaying single genes, extracting various slices of attributes of sets of genes, and performing well on both full-text and field-specific searches are taken into account. Since the data is collected by interrogating dozens of sources, it is initially organized according to those sources. However, it is important for the data to


also be presented to users organized by topics of interest, e.g., with all diseases grouped together, whether mined from specialized disorder databases, literature text-mining tools, or the source of protein data. Data integration in GeneCards operates at a variety of additional levels, serving as a good example for such a process. In some cases, such integration manifests only as juxtaposition, such as sequentially presenting lists of pathways from seven different data sources, thereby allowing the user to perform comparisons. In other cases, further unification-oriented processing takes place, striving to eliminate duplicates and to perform prioritization. This is exemplified by the alias and descriptor list, by the genomic location information, which employs an original exon-based algorithm (86), and by the gene-related publications list. The latter integration is based on the prevalence of the associations and on the quality (manual versus automatic curation) of the association method. The functional view is provided on the Web and in the V3 data model. Each of the views (source or topic) is available in a corresponding set of XML files. The V2-to-V3 migration path uses these files as the input for loading the relational database. The administration of the database is facilitated by the use of the phpMyAdmin (87) tool (see Note 5). The data generation pipeline is completed by having the database serve as input to an indexing facility which empowers sophisticated and speedy searches.
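As an illustration of such unification-oriented processing (and explicitly not the actual GeneCards algorithm), the sketch below merges alias lists mined from several sources, removes duplicates, and keeps the highest-priority source as provenance; the source names, the ranking, and the aliases are examples only.

SOURCE_PRIORITY = {"HGNC": 0, "NCBI": 1, "Ensembl": 2}   # illustrative ranking

def merge_aliases(per_source):
    """per_source: {source_name: [aliases]} -> prioritized, de-duplicated list."""
    merged = {}
    for source, aliases in per_source.items():
        for alias in aliases:
            key = alias.upper()
            rank = SOURCE_PRIORITY.get(source, 99)
            # Keep the best-ranked source reporting this alias.
            if key not in merged or rank < merged[key][0]:
                merged[key] = (rank, alias, source)
    return [(alias, source) for _, alias, source in sorted(merged.values())]

per_source = {
    "HGNC": ["ACTB", "beta-actin"],
    "NCBI": ["ACTB", "BRWS1"],
    "Ensembl": ["beta-actin"],
}
for alias, source in merge_aliases(per_source):
    print(alias, "(first reported by", source + ")")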

3.3. GeneCards Business Objects and Application Flow

The GeneCards V3 database (in version 3.02) is very elaborate, with >55,000 gene entries, a count that results from a consensus among three different gene entry sources: HGNC (80), NCBI (81), and ENSEMBL (82). It further encompasses ~456,000 aliases, descriptions, and external identifiers, >800,000 articles, >900,000 authors, >3 million gene-to-article associations, and >7 million SNPs – just a sample of the myriads of annotations and relationships culled from the >80 data sources, populating 110 tables (data and system) and two views, interlinked by 81 foreign keys. The primary business object is the genes entity, with attributes that include symbol, GeneCards identifier, HGNC approved symbol (when available), and origin (HGNC, EntrezGene, or ENSEMBL). The data model parallels that of the Webcard, with some of the complex sections (e.g., gene function) represented by many tables. An off-line generation pipeline mines and integrates the data from all sources. The application flow of the online GeneCards Web site enables users to: (a) view a particular GeneCards gene and all of its attributes; (b) search the database from the home page or from any of the specific webcards; (c) analyze and/or export the search results; (d) use the GeneALaCart batch query facility to download a subset of the


descriptors for a specified set of genes; (e) use GeneDecks’s Partner Hunter (either from a card or from its home page) to find more genes that are similar to a given gene based on a chosen set of GeneCards attributes; (f) use GeneDecks’s Set Distiller to find GeneCards annotations that are the most strongly enriched within a given set of genes.

3.4. Different Views of the Data

GeneCards users are eclectic, and include biologists, bioinformaticians, and medical researchers from academia and industry, students, patent/IP personnel, physicians, and lay people. To address the varied individual needs of a multifaceted user base, GeneCards affords a variety of output formats, including the Web interface as described above, Excel files exported by batch queries and GeneDecks results, plain text files embodying the legacy V2 format, XML files organized by sources or by function, and MySQL (75) database dumps containing all the tabular information. Additionally, it provides a Solr (88)/Lucene (79) index, available for independent querying by the Solr analyzer; an object-oriented interface to the data, facilitated by Propel (89) (see Note 6); and a complete copy of the data and software, used by academic and commercial mirror sites. An Application Programing Interface (API), either developed in-house or adopted from projects of similar scope, is in the planning stage. Examples of useful algorithms implemented within GeneCards include integrated exon-based gene locations (86), and SNP filtering and sorting. In the latter, SNPs are hierarchically sorted (by default) by validation status, by location type (e.g., coding non-synonymous/synonymous, splice site), and by the number of validations.
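The hierarchical SNP ordering described above maps naturally onto a multi-key sort. In the sketch below, a few invented SNP records are ordered by validation status, then by location type, then by the number of validations; the rank tables are illustrative and do not reproduce the exact categories used by GeneCards.

STATUS_RANK = {"validated": 0, "provisional": 1, "unknown": 2}
LOCATION_RANK = {"coding non-synonymous": 0, "splice site": 1,
                 "coding synonymous": 2, "intronic": 3}

snps = [
    {"id": "rs0001", "status": "provisional", "location": "intronic", "n_validations": 1},
    {"id": "rs0002", "status": "validated", "location": "coding synonymous", "n_validations": 2},
    {"id": "rs0003", "status": "validated", "location": "coding non-synonymous", "n_validations": 5},
]

def snp_sort_key(snp):
    return (STATUS_RANK.get(snp["status"], 99),
            LOCATION_RANK.get(snp["location"], 99),
            -snp["n_validations"])          # more validations rank higher

for snp in sorted(snps, key=snp_sort_key):
    print(snp["id"], snp["status"], snp["location"], snp["n_validations"])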

3.5. Managing GeneCards Versions

Since 1997, the GeneCards project has released over 70 revisions, addressing numerous data format changes imposed by the mined sources. Timing of new releases has often been constrained by uneven source synchronization relative to the latest builds of root sources, such as NCBI. Mechanisms for incremental updates have been designed, but often found to be suboptimal solutions. The GeneCards generation process is embarrassingly parallelizable, so the time to generate all of the data from scratch into text files has been reduced to about 1 or 2 days, followed by about 1 week for XML data generation and the loading of the MySQL database, and a few hours for indexing.
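Because each gene entry can be generated independently of all others, the build distributes naturally over a worker pool, as in the sketch below; generate_card and the short symbol list are placeholders for the real mining and formatting work.

from multiprocessing import Pool

def generate_card(symbol):
    # Placeholder: mine all sources for this gene and write its text/XML card.
    return symbol, f"card for {symbol} generated"

if __name__ == "__main__":
    gene_symbols = ["ACTB", "TP53", "BRCA1", "EGFR"]   # in reality >50,000 entries
    with Pool(processes=4) as pool:
        for symbol, message in pool.imap_unordered(generate_card, gene_symbols):
            print(symbol, message)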

3.6. Quality Assurance

Consistent data integrity and correctness are a major priority, and we have developed a semi-automated system, GeneQArds, for instantiating a key data management quality assurance component. GeneQArds is an in-house mechanism that was established to: (a) assess the integrity of the migration from the V2 text file system into the MySQL database and (b) validate and quantify the results of the new V3 search engine. To ensure correctness,


we have developed a mechanism (using SQL queries and PHP) which builds a binary matrix for all gene entries, indicating the presence or absence of data from each one of the GeneCards sources in the database. Comparison of such matrices for two builds or software versions provides an assessment of database integrity and points to possible sources of error. A search engine comparison tool enables comparisons of single Web queries as well as batch (command line) queries. A report provides lists of genes showing search engine inconsistencies, and enables tracking of the particular source contributing to such discrepancies. Finally, an automated software development and bug tracking system, Bugzilla (90) (see Note 7), is used to record, organize, and update the status of feature and bug lists.
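The matrix comparison can be pictured with a few lines of code. In the sketch below, a binary gene-by-source presence table is built for two hypothetical builds and their differences are reported; the gene symbols and source names are invented, and the real GeneQArds mechanism works against the MySQL database rather than on in-memory lists.

def presence_matrix(annotations):
    """annotations: iterable of (gene, source) pairs observed in a build.
    The set of 1-cells is an equivalent encoding of the binary matrix."""
    return set(annotations)

build_old = presence_matrix([("ACTB", "HGNC"), ("ACTB", "NCBI"), ("TP53", "NCBI")])
build_new = presence_matrix([("ACTB", "HGNC"), ("TP53", "NCBI"), ("TP53", "Ensembl")])

lost = build_old - build_new     # present before, missing now: possible error source
gained = build_new - build_old   # newly appearing annotations

for gene, source in sorted(lost):
    print(f"{gene}: {source} data missing in new build - inspect that source or its loader")
for gene, source in sorted(gained):
    print(f"{gene}: gained {source} data in new build")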

3.7. Lessons Learned

Work on the GeneCards project provides a variety of data management lessons and insights. Regarding versioning, it shows the advantages of careful version planning, so as to ensure data release consistency. In parallel, GeneCards provides an example of the difficulties involved in finding the balance between full and incremental updates, and of how the problem can be partially offset by optimizing database generation speed. Database architecture migration is often considered difficult, and GeneCards' successful move from a flat file architecture to a relational database model demonstrates the feasibility, advantages, and hurdles. It also shows how an evolutionary transition helps prevent disruptions in service and addresses time and funding constraints. Finally, the GeneCards example shows the advantage of developing a comparison-based quality assurance system, and that developing quality assurance software results in useful side effects. This is exemplified by how the presence/absence vectors for the data sources, which are embedded in the GeneQArds QA effort, helped develop a novel gene annotation scoring mechanism (35).

3.8. Practical Issues Regarding Single-Researcher Data Management

Assembling, annotating, and analyzing a database of various Omics data is currently quite feasible for modest-sized labs and even single researchers. The key developments allowing this are: (a) the very large amounts of public Omics data; (b) free and simple accessibility of this data through the internet; (c) affordable personal computers with strong processors and very large storage capacity; (d) free or inexpensive software to construct databases and write programs to assemble, curate, and analyze data. We illustrate these points, and several potential difficulties and their possible solutions, using specific types of Omics data, resources, and computational approaches. Studying a particular gene or gene family is common for many biology and related research groups. Computational biology and bioinformatics analyses often accompany experimental


work on genes, in order to keep abreast of current research and to complement and direct the “wet” work at a fraction of its time and cost. A typical starting point is the studied gene sequence and data about its function. Usual questions on the gene function include active site(s) sequence location, tissue and subcellular expression site, and the occurrence and reconstructed evolution of its paralogs and orthologs (see Note 1). Sequence data can help address these questions and is readily accessible through public databases, such as the ones at the NCBI, EBI, and DDBJ. These databases are more than simple data repositories, offering diverse data search methods, easy data downloading, several data analysis procedures, and links between related data within and between these sites and others. Most researchers are well familiar with finding sequences related to their gene of interest by sequence-to-sequence searches (e.g., BLAST (91)) and simple keyword searches. The advantage of the sequence search is that it searches primary data (i.e., the sequenced nucleotide data) or its immediate derivative (i.e., the virtually translated protein sequences). However, these searches rely on detectable sequence similarity, which can be difficult to identify between quickly diverging and/or anciently separated sequences. The advantage of keyword searches is in accessing the sometimes rich annotations of many sequences, thus using experimental and computational analyses already available for the database sequences. The main disadvantages of keyword searches are missing and mis-annotated data. The former is usually the result of the deposit of large amounts of raw data that is yet to be processed in the database. Large genomic data sets are an example of such data. Raw sequence reads (e.g., the Trace Archive (92)) are an example of data which is not even meant to be annotated beyond its source and sequencing procedures. These sequence reads are meant to complement the assembled data or to be reassembled by interested users. Mis-annotations of data are a more severe problem, since they can mislead researchers and direct them to erroneous conclusions or even futile experiments. The cause of mis-annotations is often automatic annotation necessitated by the sheer amounts of deposited data (15), data contamination that can be present before the final assembly at late stages of large sequencing projects, hindrances in the process of metadata integration, and wrong implementation of the search engine. One way to reduce the pitfalls of these two sequence-finding approaches (by sequence similarity and by keywords) is to use both and then cross-reference their results, along with careful manual inspection of the retrieved data, employing critical assessment and considering all possible pitfalls. Once the sequence data is found and downloaded, it needs to be curated and stored. Following the discussion in the previous


parts of this chapter, it is clear that it should be organized in a database. Current standard personal computers feature enough computation power and storage space to easily accommodate databases of thousands of sequences and their accompanying data, and sequence assemblies of prokaryotes and even complex organisms with genome sizes of a few gigabases. Freely available, powerful database systems, such as MySQL, and applications to view and manage them (see Note 5), are not too difficult to install on personal computers, by some end users themselves or with the skilled assistance of computer support personnel. Using this approach, a research group can set up a database of biological data with the resources of an up-to-date personal computer and internet connection, the main effort being staff time. Depending on users' background, appreciable time might be needed to install and utilize the relevant database systems and programing languages. Relevant courses, free tutorials, examples, and other resources are available on the Web (74, 93). Researchers can then devote their thoughts and time to plan, construct, curate, analyze, and manage their Omics databases.
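As a concrete, hedged illustration of such a single-researcher setup, the sketch below cross-references accession identifiers assumed to have been exported from a sequence-similarity search (e.g., a BLAST report) with those returned by a keyword search, and stores the combined candidate list in a local SQLite database for later manual curation. The accession values and the database layout are invented for the example.

import sqlite3

blast_hits = {"ABC12345.1", "XYZ67890.1", "QRS11111.1"}
keyword_hits = {"XYZ67890.1", "QRS11111.1", "LMN22222.1"}

consensus = blast_hits & keyword_hits          # supported by both approaches
to_inspect = blast_hits ^ keyword_hits         # found by only one approach

con = sqlite3.connect("my_gene_family.db")
con.execute("""CREATE TABLE IF NOT EXISTS candidates (
                   accession TEXT PRIMARY KEY,
                   support   TEXT NOT NULL)""")
con.executemany("INSERT OR REPLACE INTO candidates VALUES (?, ?)",
                [(acc, "both") for acc in consensus] +
                [(acc, "single-method") for acc in to_inspect])
con.commit()
print("stored", con.execute("SELECT COUNT(*) FROM candidates").fetchone()[0],
      "candidate sequences; inspect the single-method ones manually")
con.close()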

4. Notes

1. Genome Viewing Tools. The UCSC (53) portal includes the popular Web-based UCSC Genome Browser, which zooms and scrolls over complete chromosomes, showing a variety of external annotations; GeneSorter, which elaborates expression, homology, and detailed information on groups of genes; and GenomeGraphs, which enables visualization of genome-wide data sets, among others. Artemis (94) is a free stand-alone genome viewer and annotation tool from the Sanger Institute, which allows visualization of sequence features and the results of analyses within sequence contexts.

2. XML. Extensible Markup Language (XML) (95) is a simple, flexible, standardized tag-based format for data exchange between projects, Web sites, institutions, and more. Originally designed to meet the challenges of large-scale electronic publishing, it has evolved to become a data formatting tool which is as widespread as relational databases.

3. CVS. Concurrent Versions System (CVS) (96) is a UNIX-based open-source code control system that allows many programers to work simultaneously on the same program files, reports on conflicts, and keeps track of code changes over time and by programer. This makes it easy to find which changes introduced bugs, and to revert to previous revisions if necessary.


4. Eclipse. An integrated development environment supporting a variety of programing languages, including C++, Java, Perl, and PHP, Eclipse (97) allows programers to easily develop, maintain, test, and manage all of the files associated with software projects. Eclipse is a free, open-source application, written by different individuals, affiliated with different institutions, and united under the Eclipse platform.

5. MySQL Graphical Interfaces. phpMyAdmin (87) is a public domain graphical user interface, written in PHP, that enables handling of MySQL database administration (75) via a Web browser. phpMyAdmin supports a wide range of operations with MySQL, including managing and querying databases, tables, fields, relations, and indexes, with results presented in a friendly and intuitive manner. Sequel Pro (98), another graphical interface for managing MySQL databases, is a free open-source application for the Macintosh OSX 10.5 system. It is a stand-alone program that can manage and browse both local and remote MySQL databases.

6. PHP Propel. An open-source Object-Relational Mapping (ORM) framework for the PHP programing language. Propel (89) is free software that helps automatically create object-oriented code as well as relational database structures based on the same input schema. With the help of Propel, a system can be based on a standard relational database model of normalized tables of scalars, and also allow easy access to persistent, complex data objects, enabling application writers to work with database elements in the same way that they work with other PHP objects.

7. Bugzilla. A popular open-source Web-based general-purpose bugtracker and testing tool (90). Each bug or enhancement request is assigned a unique number and, at any point in time, is attached to a particular project, component, version, owner, and status (migrating from new to assigned to resolved to verified or reopened, and eventually to closed).

Acknowledgments

We thank the members of the GeneCards team: Iris Bahir, Tirza Doniger, Tsippi Iny Stein, Hagit Krugh, Noam Nativ, Naomi Rosen, and Gil Stelzer. The GeneCards project is funded by Xennex Inc., the Weizmann Institute of Science Crown Human Genome Center, and the EU SYNLET (FP6 project number 043312) and SysKID (FP7 project number 241544) grants.


References 1. Liolios, K., Mavromatis, K., Tavernarakis, N., and Kyrpides, N. C. (2008) The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 36, 475–9. 2. Data Management International, http://www. dama.org/i4a/pages/index.cfm?pageid=1. 3. Tech FAQ. What is Data Management?, http://www.tech-faq.com/data-management. shtml. 4. Chaussabel, D., Ueno, H., Banchereau, J., and Quinn, C. (2009) Data management: it starts at the bench. Nat Immunol 10, 1225–7. 5. Aebersold, R., and Mann, M. (2003) Mass spectrometry-based proteomics. Nature 422, 198–207. 6. Batley, J., and Edwards, D. (2009) Genome sequence data: management, storage, and visualization. Biotechniques 46, 333–6. 7. Wilkins, M. R., Pasquali, C., Appel, R. D., Ou, K., Golaz, O., Sanchez, J. C., Yan, J. X., Gooley, A. A., Hughes, G., Humphery-Smith, I., Williams, K. L., and Hochstrasser, D. F. (1996) From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology (NY) 14, 61–5. 8. Field, D., Sansone, S. A., Collis, A., Booth, T., Dukes, P., Gregurick, S. K., Kennedy, K., Kolar, P., Kolker, E., Maxon, M., Millard, S., Mugabushaka, A. M., Perrin, N., Remacle, J. E., Remington, K., Rocca-Serra, P., Taylor, C. F., Thorley, M., Tiwari, B., and Wilbanks, J. (2009) Megascience. ‘Omics data sharing’. Science 326, 234–6. 9. Field, D., Garrity, G., Gray, T., Morrison, N., Selengut, J., Sterk, P., Tatusova, T., Thomson, N., Allen, M. J., Angiuoli, S. V., Ashburner, M., Axelrod, N., Baldauf, S., Ballard, S., Boore, J., Cochrane, G., Cole, J., Dawyndt, P., De Vos, P., DePamphilis, C., Edwards, R., Faruque, N., Feldman, R., Gilbert, J., Gilna, P., Glockner, F. O., Goldstein, P., Guralnick, R., Haft, D., Hancock, D., Hermjakob, H., Hertz-Fowler, C., Hugenholtz, P., Joint, I., Kagan, L., Kane, M., Kennedy, J., Kowalchuk, G., Kottmann, R., Kolker, E., Kravitz, S., Kyrpides, N., Leebens-Mack, J., Lewis, S. E., Li, K., Lister, A. L., Lord, P., Maltsev, N., Markowitz, V., Martiny, J., Methe, B., Mizrachi, I., Moxon, R., Nelson, K., Parkhill, J., Proctor, L., White, O., Sansone, S. A., Spiers, A., Stevens, R., Swift, P., Taylor, C., Tateno, Y., Tett, A., Turner, S., Ussery, D.,

10.

11. 12.

13.

14. 15. 16.

Vaughan, B., Ward, N., Whetzel, T., San Gil, I., Wilson, G., and Wipat, A. (2008) The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 26, 541–7. Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B., Bai, Y., Zhang, Z., Zhang, Y., Wang, W., Li, J., Wei, F., Li, H., Jian, M., Li, J., Zhang, Z., Nielsen, R., Li, D., Gu, W., Yang, Z., Xuan, Z., Ryder, O. A., Leung, F. C., Zhou, Y., Cao, J., Sun, X., Fu, Y., Fang, X., Guo, X., Wang, B., Hou, R., Shen, F., Mu, B., Ni, P., Lin, R., Qian, W., Wang, G., Yu, C., Nie, W., Wang, J., Wu, Z., Liang, H., Min, J., Wu, Q., Cheng, S., Ruan, J., Wang, M., Shi, Z., Wen, M., Liu, B., Ren, X., Zheng, H., Dong, D., Cook, K., Shan, G., Zhang, H., Kosiol, C., Xie, X., Lu, Z., Zheng, H., Li, Y., Steiner, C. C., Lam, T. T., Lin, S., Zhang, Q., Li, G., Tian, J., Gong, T., Liu, H., Zhang, D., Fang, L., Ye, C., Zhang, J., Hu, W., Xu, A., Ren, Y., Zhang, G., Bruford, M. W., Li, Q., Ma, L., Guo, Y., An, N., Hu, Y., Zheng, Y., Shi, Y., Li, Z., Liu, Q., Chen, Y., Zhao, J., Qu, N., Zhao, S., Tian, F., Wang, X., Wang, H., Xu, L., Liu, X., Vinar, T., Wang, Y., Lam, T. -W., Yiu, S. -M., Liu, S., Zhang, H., Li, D., Huang, Y., Wang, X., Yang, G., Jiang, Z., Wang, J., Qin, N., Li, L., Li, J., Bolund, L., Kristiansen, K., Wong, G. K., Olson, M., Zhang, X., Li, S., Yang, H., Wang, J., and Wang, J. (2009) The sequence and de novo assembly of the giant panda genome. Nature 463, 311–7. (2008) Big Data special issue. Nature 455. Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W., Hill, D. P., Kania, R., Schaeffer, M., St Pierre, S., Twigger, S., White, O., and Rhee, S. Y. (2008) Big data: the future of biocuration. Nature 455, 47–50. Haquin, S., Oeuillet, E., Pajon, A., Harris, M., Jones, A. T., van Tilbeurgh, H., Markley, J. L., Zolnai, Z., and Poupon, A. (2008) Data management in structural genomics: an overview. Methods Mol Biol 426, 49–79. Gribskov, M. (2003) Challenges in data management for functional genomics. OMICS 7, 3–5. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Wheeler, D. L. (2006) GenBank. Nucleic Acids Res 34, D16–20. Birney, E., Andrews, T. D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cuff, J., Curwen, V., Cutts, T., Down, T., Eyras, E., Fernandez-Suarez, X. M., Gane, P.,


17.

18.

19. 20. 21. 22.

23.

Gibbins, B., Gilbert, J., Hammond, M., Hotz, H. R., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Keenan, S., Lehvaslaiho, H., McVicker, G., Melsopp, C., Meidl, P., Mongin, E., Pettett, R., Potter, S., Proctor, G., Rae, M., Searle, S., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Ureta-Vidal, A., Woodwark, K. C., Cameron, G., Durbin, R., Cox, A., Hubbard, T., and Clamp, M. (2004) An overview of Ensembl. Genome Res 14, 925–8. Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., and Bairoch, A. (2007) UniProtKB/Swiss-Prot. Methods Mol Biol 406, 89–112. Schofield, P. N., Bubela, T., Weaver, T., Portilla, L., Brown, S. D., Hancock, J. M., Einhorn, D., Tocchini-Valentini, G., Hrabe de Angelis, M., and Rosenthal, N. (2009) Post-publication sharing of data and tools. Nature 461, 171–3. Pennisi, E. (2009) Data sharing. Group calls for rapid release of more genomics data. Science 324, 1000–1. Merali, Z., and Giles, J. (2005) Databases in peril. Nature 435, 1010–1. 1000 Human Genomes Project, http://www. 1000genomes.org. Smigielski, E. M., Sirotkin, K., Ward, M., and Sherry, S. T. (2000) dbSNP: a database of single nucleotide polymorphisms. Nucleic Acids Res 28, 352–5. Frazer, K. A., Ballinger, D. G., Cox, D. R., Hinds, D. A., Stuve, L. L., Gibbs, R. A., Belmont, J. W., Boudreau, A., Hardenbol, P., Leal, S. M., Pasternak, S., Wheeler, D. A., Willis, T. D., Yu, F., Yang, H., Zeng, C., Gao, Y., Hu, H., Hu, W., Li, C., Lin, W., Liu, S., Pan, H., Tang, X., Wang, J., Wang, W., Yu, J., Zhang, B., Zhang, Q., Zhao, H., Zhou, J., Gabriel, S. B., Barry, R., Blumenstiel, B., Camargo, A., Defelice, M., Faggart, M., Goyette, M., Gupta, S., Moore, J., Nguyen, H., Onofrio, R. C., Parkin, M., Roy, J., Stahl, E., Winchester, E., Ziaugra, L., Altshuler, D., Shen, Y., Yao, Z., Huang, W., Chu, X., He, Y., Jin, L., Liu, Y., Sun, W., Wang, H., Wang, Y., Xiong, X., Xu, L., Waye, M. M., Tsui, S. K., Xue, H., Wong, J. T., Galver, L. M., Fan, J. B., Gunderson, K., Murray, S. S., Oliphant, A. R., Chee, M. S., Montpetit, A., Chagnon, F., Ferretti, V., Leboeuf, M., Olivier, J. F., Phillips, M. S., Roumy, S., Sallee, C., Verner, A., Hudson, T. J., Kwok, P. Y., Cai, D., Koboldt, D. C., Miller, R. D., Pawlikowska, L., Taillon-Miller, P., Xiao, M., Tsui, L. C., Mak, W., Song, Y. Q., Tam, P. K., Nakamura,


Y., Kawaguchi, T., Kitamoto, T., Morizono, T., Nagashima, A., Ohnishi, Y., Sekine, A., Tanaka, T., Tsunoda, T., Deloukas, P., Bird, C. P., Delgado, M., Dermitzakis, E. T., Gwilliam, R., Hunt, S., Morrison, J., Powell, D., Stranger, B. E., Whittaker, P., Bentley, D. R., Daly, M. J., de Bakker, P. I., Barrett, J., Chretien, Y. R., Maller, J., McCarroll, S., Patterson, N., Pe’er, I., Price, A., Purcell, S., Richter, D. J., Sabeti, P., Saxena, R., Schaffner, S. F., Sham, P. C., Varilly, P., Stein, L. D., Krishnan, L., Smith, A. V., Tello-Ruiz, M. K., Thorisson, G. A., Chakravarti, A., Chen, P. E., Cutler, D. J., Kashuk, C. S., Lin, S., Abecasis, G. R., Guan, W., Li, Y., Munro, H. M., Qin, Z. S., Thomas, D. J., McVean, G., Auton, A., Bottolo, L., Cardin, N., Eyheramendy, S., Freeman, C., Marchini, J., Myers, S., Spencer, C., Stephens, M., Donnelly, P., Cardon, L. R., Clarke, G., Evans, D. M., Morris, A. P., Weir, B. S., Mullikin, J. C., Sherry, S. T., Feolo, M., Skol, A., Zhang, H., Matsuda, I., Fukushima, Y., Macer, D. R., Suda, E., Rotimi, C. N., Adebamowo, C. A., Ajayi, I., Aniagwu, T., Marshall, P. A., Nkwodimmah, C., Royal, C. D., Leppert, M. F., Dixon, M., Peiffer, A., Qiu, R., Kent, A., Kato, K., Niikawa, N., Adewole, I. F., Knoppers, B. M., Foster, M. W., Clayton, E. W., Watkin, J., Muzny, D., Nazareth, L., Sodergren, E., Weinstock, G. M., Yakub, I., Birren, B. W., Wilson, R. K., Fulton, L. L., Rogers, J., Burton, J., Carter, N. P., Clee, C. M., Griffiths, M., Jones, M. C., McLay, K., Plumb, R. W., Ross, M. T., Sims, S. K., Willey, D. L., Chen, Z., Han, H., Kang, L., Godbout, M., Wallenburg, J. C., L’Archeveque, P., Bellemare, G., Saeki, K., An, D., Fu, H., Li, Q., Wang, Z., Wang, R., Holden, A. L., Brooks, L. D., McEwen, J. E., Guyer, M. S., Wang, V. O., Peterson, J. L., Shi, M., Spiegel, J., Sung, L. M., Zacharia, L. F., Collins, F. S., Kennedy, K., Jamieson, R., and Stewart, J. (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–61. 24. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25–9. 25. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K., Ireland, A., Mungall, C. J., Leontis, N., Rocca-Serra, P., Ruttenberg, A.,


26. 27. 28. 29.

30.

31.

32.

33.

34.

35.

Harel et al. Sansone, S. A., Scheuermann, R. H., Shah, N., Whetzel, P. L., and Lewis, S. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25, 1251–5. ClearForest, Text Analytics Solutions, http:// www.clearforest.com/index.asp. novo|seek, http://www.novoseek.com/ Welcome.action. DDBJ: DNA Data Bank of Japan, http:// www.ddbj.nig.ac.jp. Cochrane, G., Aldebert, P., Althorpe, N., Andersson, M., Baker, W., Baldwin, A., Bates, K., Bhattacharyya, S., Browne, P., van den Broek, A., Castro, M., Duggan, K., Eberhardt, R., Faruque, N., Gamble, J., Kanz, C., Kulikova, T., Lee, C., Leinonen, R., Lin, Q., Lombard, V., Lopez, R., McHale, M., McWilliam, H., Mukherjee, G., Nardone, F., Pastor, M. P., Sobhany, S., Stoehr, P., Tzouvara, K., Vaughan, R., Wu, D., Zhu, W., and Apweiler, R. (2006) EMBL Nucleotide Sequence Database: developments in 2005. Nucleic Acids Res 34, D10–5. Sussman, J. L., Lin, D., Jiang, J., Manning, N. O., Prilusky, J., Ritter, O., and Abola, E. E. (1998) Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr 54, 1078–84. Rebhan, M., Chalifa-Caspi, V., Prilusky, J., and Lancet, D. (1998) GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 14, 656–64. Safran, M., Chalifa-Caspi, V., Shmueli, O., Olender, T., Lapidot, M., Rosen, N., Shmoish, M., Peter, Y., Glusman, G., Feldmesser, E., Adato, A., Peter, I., Khen, M., Atarot, T., Groner, Y., and Lancet, D. (2003) Human Gene-Centric Databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Res 31, 142–6. Safran, M., Solomon, I., Shmueli, O., Lapidot, M., Shen-Orr, S., Adato, A., Ben-Dor, U., Esterman, N., Rosen, N., Peter, I., Olender, T., Chalifa-Caspi, V., and Lancet, D. (2002) GeneCards 2002: towards a complete, objectoriented, human gene compendium. Bioinformatics 18, 1542–3. Stelzer, G., Inger, A., Olender, T., Iny-Stein, T., Dalah, I., Harel, A., Safran, M., and Lancet, D. (2009) GeneDecks: paralog hunting and gene-set distillation with GeneCards annotation. OMICS 13, 477–87. Harel, A., Inger, A., Stelzer, G., StrichmanAlmashanu, L., Dalah, I., Safran, M., and

Lancet, D. (2009) GIFtS: annotation ­landscape analysis with GeneCards. BMC Bioinformatics 10, 348. 36. Liebel, U., Kindler, B., and Pepperkok, R. (2004) ‘Harvester’: a fast meta search engine of human protein resources. Bioinformatics 20, 1962–3. 37. Pang, K. C., Stephen, S., Engstrom, P. G., Tajul-Arifin, K., Chen, W., Wahlestedt, C., Lenhard, B., Hayashizaki, Y., and Mattick, J. S. (2005) RNAdb – a comprehensive mammalian noncoding RNA database. Nucleic Acids Res 33, D125–30. 38. Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., and McKusick, V. A. (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33, D514–7. 39. Laboratory information management system, http://en.wikipedia.org/wiki/ Laborator y_information_management_ system. 40. Morris, J. A., Gayther, S. A., Jacobs, I. J., and Jones, C. (2008) A Perl toolkit for LIMS development. Source Code Biol Med 3, 4. 41. Genome Canada LIMS, http://wishart. biology.ualberta.ca/labm/index.htm 42. Parkinson, J., Anthony, A., Wasmuth, J., Schmid, R., Hedley, A., and Blaxter, M. (2004) PartiGene – constructing partial genomes. Bioinformatics 20, 1398–404. 43. Schmid, R., and Blaxter, M. (2009) EST processing: from trace to sequence. Methods Mol Biol 533, 189–220. 44. The maxd software: supporting genomic expression analysis, http://www.bioinf. manchester.ac.uk/microarray/maxd. 45. Gribskov, M., Fana, F., Harper, J., Hope, D. A., Harmon, A. C., Smith, D. W., Tax, F. E., and Zhang, G. (2001) PlantsP: a functional genomics database for plant phosphorylation. Nucleic Acids Res 29, 111–3. 46. Predict-IV, www.predict-iv.toxi.uni-wuerzburg. de/participants/participant_7. 47. Harris, M., and Jones, T. A. (2002) Xtrack – a web-based crystallographic notebook. Acta Crystallogr D Biol Crystallogr 58, 1889–91. 48. Zolnai, Z., Lee, P. T., Li, J., Chapman, M. R., Newman, C. S., Phillips, G. N., Jr., Rayment, I., Ulrich, E. L., Volkman, B. F., and Markley, J. L. (2003) Project management system for structural and functional proteomics: Sesame. J Struct Funct Genomics 4, 11–23. 49. Prilusky, J., Oueillet, E., Ulryck, N., Pajon, A., Bernauer, J., Krimm, I., QuevillonCheruel, S., Leulliot, N., Graille, M., Liger,


50.

51. 52.

53.

54. 55.

56.

D., Tresaugues, L., Sussman, J. L., Janin, J., van Tilbeurgh, H., and Poupon, A. (2005) HalX: an open-source LIMS (Laboratory Information Management System) for smallto large-scale laboratories. Acta Crystallogr D Biol Crystallogr 61, 671–8. Goh, C. S., Lan, N., Echols, N., Douglas, S. M., Milburn, D., Bertone, P., Xiao, R., Ma, L. C., Zheng, D., Wunderlich, Z., Acton, T., Montelione, G. T., and Gerstein, M. (2003) SPINE 2: a system for collaborative structural proteomics within a federated database framework. Nucleic Acids Res 31, 2833–8. ProteinScapeTM, http://www.protagen.de/ index.php?option=com_content&task=view &id=95&Itemid=288. Stein, L. D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J. E., Harris, T. W., Arva, A., and Lewis, S. (2002) The generic genome browser: a building block for a model organism system database. Genome Res 12, 1599–610. Karolchik, D., Baertsch, R., Diekhans, M., Furey, T. S., Hinrichs, A., Lu, Y. T., Roskin, K. M., Schwartz, M., Sugnet, C. W., Thomas, D. J., Weber, R. J., Haussler, D., and Kent, W. J. (2003) The UCSC Genome Browser Database. Nucleic Acids Res 31, 51–4. Brazma, A. (2001) On the importance of standardisation in life sciences. Bioinformatics 17, 113–4. Taylor, C. F., Field, D., Sansone, S. A., Aerts, J., Apweiler, R., Ashburner, M., Ball, C. A., Binz, P. A., Bogue, M., Booth, T., Brazma, A., Brinkman, R. R., Michael Clark, A., Deutsch, E. W., Fiehn, O., Fostel, J., Ghazal, P., Gibson, F., Gray, T., Grimes, G., Hancock, J. M., Hardy, N. W., Hermjakob, H., Julian, R. K., Jr., Kane, M., Kettner, C., Kinsinger, C., Kolker, E., Kuiper, M., Le Novere, N., Leebens-Mack, J., Lewis, S. E., Lord, P., Mallon, A. M., Marthandan, N., Masuya, H., McNally, R., Mehrle, A., Morrison, N., Orchard, S., Quackenbush, J., Reecy, J. M., Robertson, D. G., Rocca-Serra, P., Rodriguez, H., Rosenfelder, H., Santoyo-Lopez, J., Scheuermann, R. H., Schober, D., Smith, B., Snape, J., Stoeckert, C. J., Jr., Tipton, K., Sterk, P., Untergasser, A., Vandesompele, J., and Wiemann, S. (2008) Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol 26, 889–96. Jones, A. R., Miller, M., Aebersold, R., Apweiler, R., Ball, C. A., Brazma, A., Degreef, J., Hardy, N., Hermjakob, H., Hubbard, S. J., Hussey, P., Igra, M., Jenkins, H., Julian, R. K., Jr., Laursen, K., Oliver, S. G., Paton, N.

57.

58.

59.

60.

61.

62.


W., Sansone, S. A., Sarkans, U., Stoeckert, C. J., Jr., Taylor, C. F., Whetzel, P. L., White, J. A., Spellman, P., and Pizarro, A. (2007) The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nat Biotechnol 25, 1127–33. Sansone, S. A., Rocca-Serra, P., Brandizi, M., Brazma, A., Field, D., Fostel, J., Garrow, A. G., Gilbert, J., Goodsaid, F., Hardy, N., Jones, P., Lister, A., Miller, M., Morrison, N., Rayner, T., Sklyar, N., Taylor, C., Tong, W., Warner, G., and Wiemann, S. (2008) The first RSBI (ISA-TAB) workshop: “can a simple format work for complex studies?”. OMICS 12, 143–9. Field, D., Garrity, G., Morrison, N., Selengut, J., Sterk, P., Tatusova, T., and Thomson, N. (2005) eGenomics: cataloguing our complete genome collection. Comp Funct Genomics 6, 363–8. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C. A., Causton, H. C., Gaasterland, T., Glenisson, P., Holstege, F. C., Kim, I. F., Markowitz, V., Matese, J. C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., and Vingron, M. (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29, 365–71. Webb, S. C., Attwood, A., Brooks, T., Freeman, T., Gardner, P., Pritchard, C., Williams, D., Underhill, P., Strivens, M. A., Greenfield, A., and Pilicheva, E. (2004) LIMaS: the JAVA-based application and database for microarray experiment tracking. Mamm Genome 15, 740–7. Ball, C. A., Awad, I. A., Demeter, J., Gollub, J., Hebert, J. M., Hernandez-Boussard, T., Jin, H., Matese, J. C., Nitzberg, M., Wymore, F., Zachariah, Z. K., Brown, P. O., and Sherlock, G. (2005) The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res 33, D580–2. Pajon, A., Ionides, J., Diprose, J., Fillon, J., Fogh, R., Ashton, A. W., Berman, H., Boucher, W., Cygler, M., Deleury, E., Esnouf, R., Janin, J., Kim, R., Krimm, I., Lawson, C. L., Oeuillet, E., Poupon, A., Raymond, S., Stevens, T., van Tilbeurgh, H., Westbrook, J., Wood, P., Ulrich, E., Vranken, W., Xueli, L., Laue, E., Stuart, D. I., and Henrick, K. (2005) Design of a data model for developing laboratory information management and analysis systems for protein production. Proteins 58, 278–84.


63. Orchard, S., Hermjakob, H., Binz, P. A., Hoogland, C., Taylor, C. F., Zhu, W., Julian, R. K., Jr., and Apweiler, R. (2005) Further steps towards data standardisation: the Proteomic Standards Initiative HUPO 3(rd) annual congress, Beijing 25-27(th) October, 2004. Proteomics 5, 337–9. 64. Lindon, J. C., Nicholson, J. K., Holmes, E., Keun, H. C., Craig, A., Pearce, J. T., Bruce, S. J., Hardy, N., Sansone, S. A., Antti, H., Jonsson, P., Daykin, C., Navarange, M., Beger, R. D., Verheij, E. R., Amberg, A., Baunsgaard, D., Cantor, G. H., Lehman-McKeeman, L., Earll, M., Wold, S., Johansson, E., Haselden, J. N., Kramer, K., Thomas, C., Lindberg, J., Schuppe-Koistinen, I., Wilson, I. D., Reily, M. D., Robertson, D. G., Senn, H., Krotzky, A., Kochhar, S., Powell, J., van der Ouderaa, F., Plumb, R., Schaefer, H., and Spraul, M. (2005) Summary recommendations for standardization and reporting of metabolic analyses. Nat Biotechnol 23, 833–8. 65. Digital Curation Centre, http://www.dcc. ac.uk. 66. Biosharing, http://biosharing.org. 67. Joyce, A. R., and Palsson, B. Ø. (2006) The model organism as a system: integrating ‘omics’ data sets. Nat Rev Mol Cell Biol 7, 198–210. 68. Omes and Omics, http://omics.org/index. php/Omes_and_Omics. 69. Mounicou, S., Szpunar, J., and Lobinski, R. (2009) Metallomics: the concept and methodology. Chem Soc Rev 38, 1119–38. 70. Ippolito, J. E., Xu, J., Jain, S., Moulder, K., Mennerick, S., Crowley, J. R., Townsend, R. R., and Gordon, J. I. (2005) An integrated functional genomics and metabolomics approach for defining poor prognosis in human neuroendocrine cancers. Proc Natl Acad Sci USA 102, 9901–6. 71. Pefkaros, K. 2008 Using object-oriented analysis and design over traditional structured analysis and design. International Journal of Business Research. International Academy of Business and Economics. HighBeam Research. http://www.highbeam.com. 2 Jan. 2011. 72. Whitten, J. L., Bentley, L. D., and Dittman, K. C. (2004) Systems Analysis and Design Methods, 6th ed. McGraw-Hill Irwin, New York. 73. Todman, C. (2001) Designing a Data Warehouse: Supporting Customer Relationship Management, 1st ed., pp 25–58. PrenticeHall PTR, New Jersey.

74. CIS 3400 Database Management Systems Course – Baruch College CUNY, http:// cisnet.baruch.cuny.edu/holowczak/classes/ 3400. 75. MySQL, http://dev.mysql.com. 76. Perl, http://www.perl.org. 77. BioPerl, http://www.bioperl.org. 78. Glimpse, http://www.webglimpse.org. 79. Lucene, http://lucene.apache.org. 80. HGNC, http://www.genenames.org. 81. Entrez gene, http://www.ncbi.nlm.nih.gov/ sites/entrez?db=gene. 82. Ensembl, http://www.ensembl.org/index. html. 83. Universal Protein Resource (UniProtKB), http://www.uniprot.org. 84. GeneCards sources, http://www.genecards. org/sources.shtml. 85. Eyre, T. A., Ducluzeau, F., Sneddon, T. P., Povey, S., Bruford, E. A., and Lush, M. J. (2006) The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res 34, D319–21. 86. Rosen, N., Chalifa-Caspi, V., Shmueli, O., Adato, A., Lapidot, M., Stampnitzky, J., Safran, M., and Lancet, D. (2003) GeneLoc: exon-based integration of human genome maps. Bioinformatics 19, i222–4. 87. phpMyAdmin, http://www.phpmyadmin. net/home_page/index.php. 88. Solr, http://lucene.apache.org/solr. 89. Propel, http://propel.phpdb.org/trac. 90. Bugzilla – server software for managing software development, http://www.bugzilla.org. 91. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J Mol Biol 215, 403–10. 92. Trace at NCBI, http://www.ncbi.nlm.nih. gov/Traces. 93. Perl for bioinformatics and internet, http:// bip.weizmann.ac.il/course/prog. 94. Artemis, http://www.sanger.ac.uk/Software/ Artemis. 95. Extensible Markup Language (XML), http:// www.w3.org/XML. 96. Concurrent Versions System (CVS) Overview, http://www.thathost.com/wincvs-howto/ cvsdoc/cvs_1.html#SEC1. 97. Eclipse project, http://www.eclipse.org/ eclipse. 98. Sequel Pro, http://www.sequelpro.com.

Chapter 4

Data and Knowledge Management in Cross-Omics Research Projects

Martin Wiesinger, Martin Haiduk, Marco Behr, Henrique Lopes de Abreu Madeira, Gernot Glöckler, Paul Perco, and Arno Lukas

Abstract

Cross-Omics studies aimed at characterizing a specific phenotype on multiple levels are entering the scientific literature, and merging e.g. transcriptomics and proteomics data clearly promises to improve Omics data interpretation. Also for Systems Biology, the integration of multi-level Omics profiles (also across species) is considered a central element. Due to the complexity of each specific Omics technique, specialization of experimental and bioinformatics research groups has become necessary, in turn demanding collaborative efforts for effectively implementing cross-Omics. This setting places specific emphasis on data sharing platforms for Omics data integration and cross-Omics data analysis and interpretation. Here we describe a software concept and methodology fostering Omics data sharing in a distributed team setting which, next to the data management component, also provides hypothesis generation via inference, semantic search, and community functions. Investigators are supported in data workflow management and interpretation, supporting the transition from a collection of heterogeneous Omics profiles into an integrated body of knowledge.

Key words: Scientific data management, Cross-Omics, Biomedical knowledge management, Systems biology, Inference, Context

1. Introduction

Technological advancements in biomedical research, and here in particular the Omics revolution, have provided powerful measures for investigating complex molecular processes and phenotypes. Each single high-throughput Omics technique is accompanied by the generation of a significant amount of data, and their combination (cross-Omics) has started to become common practice in multidisciplinary team project settings. Hence, data


and information sharing has become essential. Virtual online environments address those issues and support critical processes in biomedical research (1–4). Complementary use of Omics methods, where each single Omics technique is driven by a specific group, demands data unification and harmonization. Additionally, the integration of internal and third-party (public domain) information is desired for supporting data interpretation. Unfortunately, effective and focused data/knowledge sharing is far from the norm in practice (5). Next to technical difficulties, further factors potentially hampering communication and information sharing are given by legal and cultural matters, as well as partially diverging interests of collaborating parties (6). Many technical and procedural barriers are caused by the peculiarity of scientific data, characterized by heterogeneity, complexity, and extensive volume (7). In any case, user acceptance of technologies supporting data and knowledge exchange is central. Accessibility, usability, and a comprehensive understanding of purpose and benefits for each participating team are prerequisites for an “added value” deployment of any scientific data management tool in the Omics field and beyond.
We introduce a concept dedicated to supporting a collaborative project setup, and specifically demonstrate the workflow for cross-Omics projects. Although publicly available repositories, as found e.g. for transcriptomics and proteomics data, do not meet these general requirements, they are valuable representatives of established scientific data sharing platforms. Examples of transcriptomics data repositories providing access to a multitude of profiles on various cellular conditions include caArray (https://cabig.nci.nih.gov/tools/caArray) (8), Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) (9), ArrayExpress (http://www.ebi.ac.uk/microarray-as/ae) (10), and the Stanford Microarray Database (http://smd.stanford.edu) (11). These repositories usually provide raw data files further characterized by metadata (12, 13) holding study aim, biological material used, experiment type and design, contributors, technology platform, and sample description, among others. The scope of these platforms is relatively narrow, and functions regarding data retrieval (and partially analysis) are similar in nature. Formats for metadata are usually rigid, and the purpose of these platforms is clearly focused on creating a public memory for specific Omics data rather than facilitating collaboration. For our envisaged data management this rigid concept has to be expanded:
(a) Management of diverse Omics data (next to transcriptomics also covering other Omics tracks) has to be taken care of.
(b) Given cross-Omics team efforts, a knowledge management system has to provide the necessary flexibility, allowing adaptation to the research workflow (which for scientific workflows frequently experiences significant changes over time).


(c) Such a concept has to respect nonfunctional requirements covering the needs of the group, such as access policy constraints.
(d) Next to mere data management, such a system shall support interlinking of data (context representation), e.g. by easily interrogating transcriptomics and proteomics feature lists and querying the resulting profile.
A defined level of flexibility enables adapting to changing project conditions, but in turn the framework still has to guarantee data consistency. These issues have to be respected under operational, regulatory, and productivity constraints, including security and versioning, further providing an audit trail and referencing. Based on the complex methodological environment and translational research aspects in molecular biology, collaborative projects are clearly seen as a way to design and implement research projects (5, 14, 15), and this trend is particularly obvious when combining Omics methods (7, 12).

2. Materials

2.1. Basic Considerations

We will discuss the knowledge management concept for a typical project setting involving collaborative application of different Omics techniques (transcriptomics and proteomics), driven by a team focusing on a specific phenotype. The data accumulated by the different groups participating in the effort shall be shared and integrated to provide the basis for cross-Omics analysis. Information may originate from human sources (e.g. a study plan document specifying a particular Omics study with respect to samples and associated sample descriptors) as well as machine sources (e.g. a transcriptomics raw data image file produced by a microarray scanner). In general, standardization of the data formats is important, particularly for information subjected to further automated processing (e.g. for bioinformatics workflows and statistics procedures). The example scenario respects the following major considerations and requirements:
(a) Project centered view. In contrast to an institution (or company) view on data maintenance and sharing, a project view implicitly assumes an end of activities at project completion. This strongly impacts the modus operandi regarding what information shall be shared and why. The integration with legacy systems is considered less critical.
(b) Confidentiality and data security. Unauthorized access to information must be prevented, thus a capable permission system and encrypted data transfer are


needed. The “need-to-know” principle may apply even in a collaborative research setting.
(c) Share and find data. Comprehensive search capabilities are required to facilitate quick data retrieval. Implementation of complex queries as well as full-text search options is recommended for supporting user needs. Infrastructure capable of supporting large-scale data exchange has to be in place.
(d) Flexible structure regarding data objects. Various types of data have to be considered, e.g. study plans, raw data (as generated by transcriptomics and proteomics instruments), as well as result profiles and associated analysis protocols. These data objects might change over time, thus flexible data modeling is required.
(e) Persistence and maintenance of data. Once data is accessible to other participants, it shall be persistent over time, with a unique identifier assigned for referencing. Versioning of information is needed to account for eventual changes and to support reproducing results at any given point in time.
(f) Inference as a mechanism for transforming data into knowledge. A data repository shall not necessarily be limited to the data management aspect. Establishing and modeling associations among related information has to be envisaged, either upon manual definition (the user explicitly specifies which data objects “belong” together) or by automated mining procedures retrieving implicit relations. Consequently, relations among experimental studies, samples, raw data (transcriptomics and proteomics), result protocols, and the like should be represented. In this way a data repository is transformed into an analysis and information retrieval environment (a minimal code sketch illustrating several of these requirements is given below).
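The sketch below illustrates, under invented class and field names, how several of these requirements (unique identifiers, versioning, a simple permission check, and explicit relations) could be reflected in code; it is a conceptual toy, not part of any existing system.

import itertools

_counter = itertools.count(1)

class Record:
    def __init__(self, record_type, metadata, allowed_groups, payload=None):
        self.identifier = f"REC-{next(_counter):05d}"   # unique, referenceable
        self.record_type = record_type                  # e.g. "study plan", "raw data"
        self.versions = [(1, metadata, payload)]        # versions are kept, not overwritten
        self.allowed_groups = set(allowed_groups)       # "need-to-know" access policy
        self.relations = []                             # links to other records (context)

    def update(self, metadata, payload=None):
        version = self.versions[-1][0] + 1
        self.versions.append((version, metadata, payload))

    def readable_by(self, group):
        return group in self.allowed_groups

    def relate(self, other, label):
        self.relations.append((label, other.identifier))

study = Record("study plan", {"title": "Transcriptomics study A"}, {"omics-lab"})
raw = Record("raw data", {"platform": "microarray"}, {"omics-lab", "bioinformatics"})
raw.relate(study, "generated under")
print(raw.identifier, "->", raw.relations, "readable by bioinformatics:",
      raw.readable_by("bioinformatics"))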

2.2. General Data Concept

A common challenge in scientific data management is the need for a flexible data structure. Even if the data structure is properly defined with respect to the requirements identified at project start, adaptations are frequently required due to changes in the experimental protocols, or even in larger parts of the workflow, arising during project life time. Unfortunately, increasing flexibility regarding data structure usually comes with higher risk regarding data consistency and standardized data querying and analysis. LIMS (Laboratory Information Management Systems) partially address customizability for adapting to changing procedures. However, these powerful software platforms are mainly designed for mirroring relatively rigid processes, which, e.g., come into place at later stages of biomedical product development. A major issue to be considered here is the granularity of data representation: representation may be fine grained, e.g. each


single expression value of an expression array is explicitly stored and addressable, or coarse grained, e.g. by storing the whole expression file as one data object. Certainly, more fine grained data representation eases specific querying, but on the other hand complicates adaptability, as all data instances have to be explicitly taken care of in database design. To cope with the requirement of flexible data structures for an integrated Omics-bioinformatics repository, we propose a mixed model for the data structure. The central components are (a) taxonomy, (b) records, and (c) relations. These components allow building up a structured collection of data objects, as schematically given in Fig. 1. This general data model can be easily adapted to accurately react upon requirements introduced by changing research procedures. In this concept, a user request allows specific access to a data object in its native format (e.g. a CEL file coming from Affymetrix GeneChips) together with associated metadata. On top, the user may specifically query also the native file as such, if the data provided in this file follow a strict standard (as given for machine-generated raw data files).
(a) Taxonomy component. Similar to a conventional file system, data objects are most easily organized in a user-definable taxonomy. This taxonomy may represent a hierarchical structure modeled according to the project requirements. Like conventional file systems, the taxonomy channels the assignment of single data objects to one unique location in the hierarchy. Examples

Fig. 1. Components of the general information concept. (a) Hierarchical taxonomy shown as file system-like folder structure resembling an Omics experiment covering protocols and raw data. (b) Example raw data record characterized by metadata and the native data file. (c) Explicit relations between records, where the raw data record is uniquely assigned to a specific study plan record.


Table 1
Selected taxonomy types

Work Breakdown Structure: In project management the Work Breakdown Structure is a deliverable-oriented grouping of tasks in a way that supports defining and organizing the scope of a project.

Product Breakdown Structure: The Product Breakdown Structure is a grouping of physical or functional components (items) that make up a specific deliverable (e.g. a product, which may be the experimentally validated features from the Omics-bioinformatics workflow).

Organizational Structure: An Organizational Structure represents a hierarchy of teams (contributors) on the level of organizations, departments, or individuals.

Procedural Structure: A Procedural Structure is a hierarchy of process related categories, such as the consecutive, stepwise maturation of deliverables described by status changes.

for taxonomies are listed in Table 1. Using, for example, a work breakdown structure (16) of a project as taxonomy means that the tasks are represented as a folder hierarchy (e.g. a main folder "study plans" with subfolders for specific studies) and data objects generated in the context of a specific task (study) are assembled in the respective task folder. Figure 2 shows an example of a work breakdown structure that could be used as taxonomy for covering an integrated Omics-bioinformatics project. Besides such a taxonomy built on the basis of a project workflow, alternative taxonomies such as biomedical terminologies or ontology terms from molecular biology may be used (17, 18).
(b) Record component
For representing and handling data we introduce a "record" as the fundamental information instance. A record can be understood as the smallest, self-contained entity of data or information stored in the repository. Records are comprised of a set of named properties, each containing either data files, references to other records, or just plain text. A typical record combines a data file (e.g. a study plan document) with a set of properties (metadata) providing further information about the data file (author, subject, study, etc.). In short, metadata can be understood as data about data (9). Diverse types of records (a study plan, an analysis report, etc.) can be modeled side by side, in the following referred to as "record type". Each record type is defined via a fixed set of descriptors



Fig. 2. Example of a Work Breakdown Structure for an integrated Omics-bioinformatics workflow. Main elements (next to project management) include sample collection, Omics, data analysis, and validation workflows.

for defining properties of this particular type (as a study plan record has a fundamentally different scope than an Omics raw data record). These properties are populated by the user upon creation of a new record assigned to a specific record type. This concept of records serves as a generic data object representation; however, all records of a particular type share the same purpose and the same metadata definition. This concept of metadata and records allows both rigidity and flexibility in data modeling, which is central for standardization efforts in a (to a large extent non-standardized) environment as found in a science project. Each record type at least has to hold a description of scope and purpose to allow a researcher to query the database either by explicitly searching in selected metadata fields or by doing a full text search in the record as such.
(c) Relation component
The relation component allows explicit or implicit linking of records, where directed and undirected relations are relevant. These relations allow the definition of context between records. A typical directed and explicit (user defined) relation would be a record being "associated" to another record, as given for "sample descriptor" records associated with the record "sample cohorts". An example of an undirected and implicit (system identified) relation would be given by two


records which share keywords in their metadata (e.g. two study plans both addressing a specific phenotype). Here the implicit relation is identified by a script that builds relations based on screening the metadata information of all records and record types. This concept can be further expanded to delineate complex sets of relations represented as semantics (19), e.g. on the basis of OWL (Web Ontology Language, http://www.w3.org/TR/owl-features) or RDF (Resource Description Framework, http://www.w3.org/RDF). The introduction of relations between records is part of the metadata functions.
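The record and relation idea is independent of any particular technology. Purely as an illustration (and not the JCR-based implementation described in Subheading 2.3 and the Notes), a record can be sketched in R as a typed bundle of a native file, metadata, and explicit relations; the identifiers and metadata values below are invented for the example.

```r
# Toy sketch of a record: a data file plus type-specific metadata and
# explicit, directed relations to other records (all values are made up).
new_record <- function(id, type, file = NULL, metadata = list(), relations = list()) {
  list(id = id, type = type, file = file,
       metadata = metadata, relations = relations)
}

raw_data <- new_record(
  id        = "RD-001", type = "Raw data", file = "expr02.cel",
  metadata  = list(author = "User B", subject = "transcriptomics"),
  relations = list(associated_study = "S-002")   # record-level, not table-level
)
```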

2.3. Technical Realization of the Knowledge Management Application

Modular implementation based on existing software building blocks is recommended for realizing a knowledge management system as introduced above. Significant infrastructure and utility functionality is offered in the public domain ready for use, clearly providing reduced development effort. Generally a Web application is favored for supporting easy handling on the client side in a multicenter setting, e.g. using the Java Enterprise Platform (http://java.sun.com/javaee). For supporting dynamic data models a post-relational approach as data foundation, as given by the Content Repository for Java (http://www.ibm.com/developerworks/java/library/j-jcr) technology, is recommended. The coarse components of the application are illustrated in Fig. 3. Additional details on software architecture and environment (see Note 1), data persistence (see Note 2), and presentation technology (see Note 3) have to be respected.

[Fig. 3 shows a layered Web application server with Presentation (JSF/ICEfaces), Web Service, Business Logic (EJB), and Data Access (Jackrabbit) components, backed by a relational database and the filesystem.]

Fig. 3. Prototypical software architecture for realizing a record management system.

2.4. Information Management Policy

Equally important as the general concept and its proper implementation is an appropriate information management policy specifically designed for the particular project setup. Such a policy shall at least clarify the following main issues:
(a) Who should provide information and When? It has to be clarified which team member is, at what point in time, expected to provide information to others. Automatic e-mail notification, e.g. when a record is added to a specific folder, supports efficient information flow.
(b) What information should be provided and Why? It should be clear which information is required for reaching the project goals. Therefore, also the purpose of information with respect to driving project specific processes has to be defined.
(c) In What format should information be provided? This definition is important for modeling of record types. For files considered for automated processing, the format has to be specified in a strict manner. The policy also has to define the type of taxonomy to be applied.
(d) What context definition should be applied to information? Rules regarding explicit (user driven) relations among records have to be clarified.
(e) Who is allowed to use information for What? Restrictions for data retrieval and use have to be clear and agreed on between parties. Freedom of operation, exploitation, intellectual property, and confidentiality issues have to be considered.

3. Methods

3.1. Application Example

In this section we outline a practical application example utilizing the concept given above, focusing on an integrated Omics-bioinformatics workflow.
User A: Investigator at laboratory A; defines studies and collects biological samples.
User B: Investigator at laboratory B; performs transcriptomics experiments.
User C: Investigator at laboratory C; performs proteomics experiments.
User D: Investigator at organization D; analyzes raw data and performs bioinformatics.
In our record concept, data is collected in the form of record stacks in the sense that data of the same type is stored using the same record type and metadata structure, assigned to storage locations in the given taxonomy. Records can therefore be distinguished according to source, generation methodology, or



Fig. 4. Example of an Omics-bioinformatics workflow realized in a team setting; all records are centrally linked to a study plan via explicit and directed relations.

particular scope (raw data, analysis report, etc.). Each study can be clearly described according to the hypothesis, the experimental methods and conditions, the materials and samples used, and finally the results generated. Therefore, record types have to be defined in advance to model these studies. The procedural creation of records and their relational organization is outlined in Fig. 4. The first step of the process for user A is the creation of a study plan document and of the respective study plan record. The study plan is a document defining the aims and procedures of the study, the major goals to be achieved, participating members, time lines, etc. User A thus creates a record of the record type "Study plan" by uploading the study plan document to the repository. User A may add further information to the specific study plan using the predefined metadata fields provided for this record type. The study plan record may also hold descriptions and label assignments of the samples used in the experimental protocol (see Note 4). Process constraints may be introduced at this step, e.g. by requiring a study plan as the first document before any downstream tasks can be started, like adding an Addendum record associated with the study plan that specifies in detail the experimental conditions for the proteomics experiment, followed by creating a new record for proteomics raw data. In parallel, user B performs transcriptomics experiments according to the specifications provided in the study plan, and creates records holding transcriptomics raw data. As soon as users B and C have provided data, user D is able to download and feed the


Omics raw data into an analysis workflow, subsequently generating results records appropriately stored as the record type "Analysis Report". Finally, all information generated in the course of the study refers to the respective study plan. Organizing cross-Omics data in the way depicted in Fig. 4 has numerous advantages. Raw data files or analysis reports can be easily recalled, data provided by different contributors can be distinguished, and via relations the raw data, analysis files, and samples used are all unambiguously linked to a study plan. In this way information can be documented and used (or reused) in a comprehensive manner.

3.2. Neighborhood and Context

The concept shown in the example leads to a repository of project specific content utilizing records conforming to record types. Browsing taxonomy folders and applying keyword searches are simple ways of navigating and identifying content in the repository. However, such search procedures make only limited use of relations. Relations among records allow deriving context networks (ontology), thereby offering entirely new opportunities regarding interpretation of data generated in their context. In a context network, relations with a specific meaning are defined among records. The "meaning" is given by semantics which provide edges (relations) between nodes (records). This concept is also found with relational databases, but in our system relations are defined between records and not explicitly during database design on the level of fine grained data tables. A visual representation of relations for a selected record of the set of records given in Fig. 4 is shown in Fig. 5. For the record "Study Plan A" (central node in Fig. 5a), relations to other records are given. Explicit relations indicating "associated" are depicted as solid lines, whereas implicit relations are indicated by dashed lines. Implicit relations are inferred automatically from records sharing similar attributes. In the example, a second relation instance is used to represent additional context, namely that study plans A and B are related, e.g. by using the same sample cohort. A simple computational procedure that mines metadata can be implemented to screen for such implicit relations. Yet another type of implicit relation may be present, as illustrated in Fig. 5b: one relation between two analysis reports (transcriptomics and proteomics) may, e.g., be derived on the basis of sharing a common list of features (provided as gene or protein identifiers) jointly identified as relevant. The second relation between "Analysis Report" (cross-omics) and "Study Plan B" may be based on a common disease term found in both records, either in the metadata fields or in the text of attached files (see Note 5). Such a flexible delineation of implicit neighborhood as discussed here facilitates an explorative approach, allowing for the



Fig. 5. Visual representation of neighborhood. (a) Explicit (solid line) and implicit (dashed line) relations are shown. The explicit relations resemble the record structure given in Fig. 4. The implicit relation in (a) is derived by analyzing record metadata, the implicit relations given in (b) demonstrate the neighborhood representation of records.

discovery of relations (context) among data objects (records). Browsing such neighborhoods provides entirely new ways of extracting information stored in the record repository going far beyond the mere data management aspect.

4. Notes

1. Modern server-side web technologies (often referred to as Web 2.0) enable access to complex applications regardless of location or equipment used. The requirements for using such web applications are mostly as simple as having access to the Internet via a standard web browser. Therefore, designing a data/knowledge management system as a web application becomes a method of choice for supporting distributed groups. Web application servers are used as platforms for deploying and executing web applications. Such platforms come with numerous powerful technologies and concepts supporting the development, maintenance, and operation of web applications. Recent platforms provide a broad range of infrastructure functionality, among them database connection management, transactions, and security. Hence, the effort for implementing such functionalities can be significantly reduced. Moreover, most application servers are built


in a modular style to encourage clean levels of abstraction and to support exchangeability of single components to a certain extent. One example is Glassfish v3 (https://glassfish.dev.java.net) as application server, supporting Java Enterprise 6 (Java EE) technologies and paradigms. Following the Enterprise Java Bean (EJB, http://java.sun.com/products/ejb) server-side component architecture (as included in Java EE) facilitates a separated architectural design, e.g. separating the application logic from the presentation logic. This implies that changing or adding presentation components to the system does not affect the application logic (e.g. when a dedicated client application needs to be introduced).

2. Data structures as well as persistence mechanisms form the basis of data centric applications. Knowledge management systems in particular demand database solutions with specific characteristics: the data management system (and data structures) is required to support handling of semi-structured data as well as changes of the data model during operation. Especially in the life sciences domain it is essential to have data solutions capable of handling large amounts of data. In contrast, database performance for considerably complex queries is secondary. We found it very convenient to use the Content Repository for Java (JCR, as defined in Java Specification Request JSR 170) as central technology for data access and persistence. Following a hierarchy centered post-relational database model, JCR allows accessing data at a very high level of abstraction that perfectly meets the requirements mentioned above. Specifically we use Apache Jackrabbit (http://jackrabbit.apache.org), which represents the reference implementation of JCR. Since Jackrabbit provides additional features such as versioning, full text search, and a dynamic data model, the development effort can be further reduced. A relational database (such as MySQL, http://www.mysql.com) and the filesystem serve as immediate persistence mechanisms; data access through JCR retains complete transparency, as the relational database beneath is never accessed directly. Jackrabbit integrates seamlessly with the application server environment in the shape of a connector module component.

3. Choosing a capable and easy to use presentation framework significantly influences the development effort regarding the user interface. Accessibility of the application largely depends on the technology used for presentation with respect to usability and technical compatibility. Building a concise user interface directly influences acceptance and user satisfaction. Naturally, it is of high importance not to exclude users because


of their client environment (web browser, operating system). The Java Server Faces (JSF, http://java.sun.com/javaee/javaserverfaces) based integrated application framework may be used as server-side technology to build a user interface. In particular, ICEfaces (http://www.icefaces.org) is a rich component framework based on JSF that utilizes Asynchronous JavaScript and XML (Ajax) technology for providing a responsive user interface while saving bandwidth. As opposed to other frameworks, using JSF as server-side technology does not require any specific software preinstalled at the clients. JSF-based solutions can easily be accessed via standard web browsers (having JavaScript enabled).

4. A properly designed system should provide the user (or system administrator) with the option of flexible record type definition. In our example case, the sample documentation may, for example, also be done by generating a separate record of the record type "Sample documentation", followed by explicitly introducing relations between the samples used and the study plan. Such an extension would allow adding specific metadata to individual samples (retrieval procedure/date of sample drawn), in contrast to having all samples organized in a tabular manner in a single record (where the metadata then only provide information valid for all samples, e.g. the cohort name).

5. Implementing the record management system in Jackrabbit (JCR reference system, see Note 2) provides full text search; consequently it is a straightforward procedure to search for any specific term in all records. Interrogation of the full text index with a term list of genes, proteins, diseases, pathway names, etc. is an easy computational task.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme under grant agreement n° HEALTH-F5-2008-202222.

References

1. Sagotsky, J. A., Zhang, L., Wang, Z., Martin, S., and Deisboeck, T. S. (2008) Life Sciences and the web: a new era for collaboration. Mol Syst Biol 4, 201.
2. Waldrop, M. (2008) Big data: Wikiomics. Nature 455, 22–25.
3. Ruttenberg, A., Clark, T., Bug, W., Samwald, M., Bodenreider, O., Chen, H., Doherty, D., Forsberg, K., Gao, Y., Kashyap, V., Kinoshita, J., Luciano, J., Marshall, M. S., Ogbuji, C., Rees, J., Stephens, S., Wong, G. T., Wu, E., Zaccagnini, D., Hongsermeier, T., Neumann, E., Herman, I., and Cheung, K. (2007) Advancing translational research with the Semantic Web. BMC Bioinformatics 8 Suppl 3, S2.
4. Stein, L. D. (2008) Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nat Rev Genet 9, 678–88.
5. Disis, M. L., and Slattery, J. T. (2010) The road we must take: multidisciplinary team science. Sci Transl Med 2, 22–9.
6. Nelson, B. (2009) Data sharing: empty archives. Nature 461, 160–3.
7. Moore, R. (2000) Data management systems for scientific applications. IFIP Conf Proc 188, 273–84.
8. Bian, X., Gurses, L., Miller, S., Boal, T., Mason, W., Misquitta, L., Kokotov, D., Swan, D., Duncan, M., Wysong, R., Klink, A., Johnson, A., Klemm, J., Fontenay, G., Basu, A., Colbert, M., Liu, J., Hadfield, J., Komatsoulis, G., Duvall, P., Srinivasa, R., and Parnell, T. (2009) Data submission and curation for caArray, a standard based microarray data repository system. Nat Proc doi:10.1038/npre.2009.3138.1.
9. Edgar, R., Domrachev, M., and Lash, A. E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30, 207–10.
10. Parkinson, H., Kapushesky, M., Kolesnikov, N., Rustici, G., Shojatalab, M., Abeygunawardena, N., Berube, H., Dylag, M., Emam, I., Farne, A., Holloway, E., Lukk, M., Malone, J., Mani, R., Pilicheva, E., Rayner, T. F., Rezwan, F., Sharma, A., Williams, E., Bradley, X. Z., Adamusiak, T., Brandizi, M., Burdett, T., Coulson, R., Krestyaninova, M., Kurnosov, P., Maguire, E., Neogi, S. G., Rocca-Serra, P., Sansone, S., Sklyar, N., Zhao, M., Sarkans, U., and Brazma, A. (2009) ArrayExpress update – from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 37, D868–72.
11. Hubble, J., Demeter, J., Jin, H., Mao, M., Nitzberg, M., Reddy, T. B. K., Wymore, F., Zachariah, Z. K., Sherlock, G., and Ball, C. A. (2009) Implementation of GenePattern within the Stanford Microarray Database. Nucleic Acids Res 37, D898–901.
12. Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., and Heber, G. (2005) Scientific data management in the coming decade. ACM SIGMOD Record 34, 34–41.
13. National Information Standards Organization (U.S.). (2004) Understanding metadata. NISO Press, Bethesda, MD.
14. Wuchty, S., Jones, B. F., and Uzzi, B. (2007) The increasing dominance of teams in production of knowledge. Science 316, 1036–39.
15. Gray, N. S. (2006) Drug discovery through industry-academic partnerships. Nat Chem Biol 2, 649–53.
16. Project Management Institute (2008) A Guide to the Project Management Body of Knowledge: PMBoK Guide, Fourth Edition. PMI.
17. Bodenreider, O. (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 32, D267–70.
18. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K., Ireland, A., Mungall, C. J., Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S., Scheuermann, R. H., Shah, N., Whetzel, P. L., and Lewis, S. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25, 1251–55.
19. Das, S., Girard, L., Green, T., Weitzman, L., Lewis-Bowen, A., and Clark, T. (2009) Building biomedical web communities using a semantically aware content management system. Brief Bioinformatics 10, 129–38.

Chapter 5

Statistical Analysis Principles for Omics Data

Daniela Dunkler, Fátima Sánchez-Cabo, and Georg Heinze

Abstract

In Omics experiments, typically thousands of hypotheses are tested simultaneously, each based on very few independent replicates. Traditional tests like the t-test were shown to perform poorly with this new type of data. Furthermore, simultaneous consideration of many hypotheses, each prone to a decision error, requires powerful adjustments for this multiple testing situation. After a general introduction to statistical testing, we present the moderated t-statistic, the SAM statistic, and the RankProduct statistic which have been developed to evaluate hypotheses in typical Omics experiments. We also provide an introduction to the multiple testing problem and discuss some state-of-the-art procedures to address this issue. The presented test statistics are subjected to a comparative analysis of a microarray experiment comparing tissue samples of two groups of tumors. All calculations can be done using the freely available statistical software R. Accompanying, commented code is available at: http://www.meduniwien.ac.at/msi/biometrie/MIMB.

Key words: Differential expression analysis, False discovery rate, Familywise error rate, Moderated t-statistics, RankProduct, Significance analysis of microarrays

1. Introduction

The recent developments of experimental molecular biology have led to a new standard in the design of Omics experiments: the so-called p ≫ n paradigm. Under this new paradigm, the number of independent subjects n (e.g., tissue samples) is much smaller than the number of variables p (e.g., number of genes in an expression profile) that is analyzed. While in classical settings few prespecified null hypotheses are evaluated, now we are confronted with simultaneous testing of thousands of hypotheses. Classical statistical methods typically require the number of independent subjects to be large and a multiple of the number of variables in order to avoid collinearity and overfit (1). However, the number



of subjects that can be considered for a high-throughput experiment is often limited due to the technical and economic limitations of the experiment. Hence, new statistical methods had to be developed to handle Omics data. These developments mainly focus on three issues: first, new test statistics (2, 3) and corrections for multiple testing of many hypotheses in one experiment (4, 5); second, avoiding collinearity in model estimation with more variables than subjects; and third, feature selection by optimizing predictive accuracy in independent samples. Some methods simultaneously address the latter two issues (6–8). In this contribution, we focus on statistical methods developed for the simultaneous comparison of continuous variables (e.g., gene expression profiles) between two conditions (e.g., two types of tumors). We do not discuss advanced issues of model building or feature selection that go beyond the scope of an introductory text. Thus, the remainder of the chapter is organized as follows: in Subheading 2, we explain basic statistical concepts and their use in an Omics data context. Subheading 3 exemplifies the most commonly used methods for hypothesis testing in Omics experiments by means of a microarray experiment comparing the gene expression of two types of tumor tissues (9).

2. Materials

2.1. Population and Sample

Any biological experiment that involves data collection seeks to infer properties of an underlying entity. This entity is called the target population of our experiment, and any conclusions we draw from the experiment apply to this target population. Conclusions that are restricted to the few subjects of our experiment must be considered scientifically worthless. The variation of gene expression among the members of the population constitutes a statistical distribution. Its characteristics can be described by statistical measures, e.g., its central tendency by the mean and its spread by the standard deviation. However, since the target population is not completely observable, we have to estimate its characteristics using data independently collected from a sample of few members of the population (the biological replicates or subjects, see Note 1). Statistical estimates are values computed from the subjects in a sample which are our best guesses for the corresponding population characteristics. Repeated sampling from a population yields different estimates of population characteristics, but the underlying population characteristics remain constant (as sampling has no influence on the target population).

2.2. The Principle of Statistical Testing

In statistical testing, we translate our biological hypothesis, e.g., “The expression of gene XY is different between basal-like human breast cancers (BLC) and non BLC (nBLC) tumors,” into a


statistical null hypothesis: "The mean expression of gene XY is equal among BLC and nBLC tumors." A statistical test results in a decision about this null hypothesis: either reject the null hypothesis and claim that there is a difference between the groups (a positive finding), or do not reject the null hypothesis because of insufficient evidence against it (a negative finding). Statistical tests do not prove or verify hypotheses, but instead reject or falsify postulated hypotheses if the data are implausible for a given hypothesis. The complementary hypothesis to the null is denoted as the alternative hypothesis, and corresponds to any condition not covered by the null hypothesis: "Mean gene expression in BLC tumors differs from mean gene expression in nBLC tumors." Our population now comprises BLC as well as nBLC tumors, and the parameter of interest is the difference in expected gene expression between these two groups. This parameter could be zero (corresponding to the null hypothesis) as well as different from zero (corresponding to the alternative hypothesis), and by our experiment we wish to infer its true value. A test statistic measures the deviation of the observed data from the null hypothesis. In our example, the test statistic could be defined as the difference in mean gene expression between BLC and nBLC tumors. We could now reject the null hypothesis as soon as the sample difference in means is different from zero. However, since we get a new difference in means each time we draw a new sample, we must take sampling variation into account. Therefore, we first assume that the null hypothesis is true, and estimate the distribution of the test statistic under this assumption. Then, we estimate a p-value, which measures the plausibility of the observed result, given the null hypothesis is true, on a probability scale. p-values that are less than the predefined significance level correspond to low plausibility and lead to rejection of the temporarily assumed null hypothesis. The steps of statistical testing can be formalized as follows:
1. Define a null hypothesis in terms of the population parameter of interest.
2. Define a significance level, i.e., the probability of a falsely rejected null hypothesis.
3. Select a test statistic that scores the data with respect to the null hypothesis.
4. Derive the distribution of the test statistic given the null hypothesis applies.
5. Compute the test statistic from the data at hand.
6. Compute the p-value as the probability, given the null hypothesis applies, of a test statistic as extreme as or more extreme than the value observed in the sample.
7. Reject the null hypothesis if the p-value is less than the significance level; otherwise, do not reject the null hypothesis.


The decision of a statistical test could be wrong: we could falsely reject the null hypothesis although the observed difference in means was only due to sampling variation. This type of wrong decision is called the type I error. We can control the probability of a type I error by choosing a small significance level. In gene expression comparisons, a type I error implies that we declare a gene differentially expressed in two groups of subjects although it is not. By contrast, a type II error occurs if the null hypothesis is falsely not rejected, i.e., if an experiment fails to declare a gene as important, although in truth this gene is differentially expressed. The principle of statistical testing provides control over the type I error rate, by fixing it at the significance level. Although a significance level of 5% has been established as a quasi-standard in many fields of science, this number has no other justification than that it was easy to use in the pre-computer era. In screening studies, investigators may accept a higher type I error rate in order to improve chances of identifying genes truly related to the experimental condition. In a confirmatory analysis, however, typically a low type I error rate is desirable. The type II error rate can be controlled by including a sufficient number of subjects, but, to estimate it precisely, a number of assumptions are needed.

2.3. Hypothesis Testing in Omics Experiments

In a typical analysis of an Omics experiment, we are interested in which of several thousand features are differentially expressed. Thus, hypothesis testing in Omics experiments results in a so-called gene list, i.e., a list of genes assumed to be differentially expressed. The principle of statistical testing as outlined in the last section can still be applied; however, error rates now refer to the gene list and not to a single gene. In particular, the type I error rate applicable to a single hypothesis is replaced by the familywise error rate (FWER), which is the probability to find differentially expressed genes although the global null hypothesis of no differential expression in any gene is true. Often the false discovery rate (FDR) (4) is considered as alternative to the FWER. The FDR is defined as the proportion of truly nondifferentially expressed genes among those in the gene list. Furthermore, modified test statistics have been developed which make use of similarities in the distributions across genes to enhance precision rather than treating each gene as a separate outcome variable.

2.4. Test Statistics for Omics Experiments

In order to present test statistics useful for Omics experiments, some basic notation is needed. We assume that gene expression values have been background corrected, normalized, and transformed by taking the logarithm to base 2 (see Subheading 3.2). In the following, the log2 gene expression of gene g ($g = 1, \dots, G$) in subject i ($i = 1, \dots, n_k$) belonging to group k ($k \in \{1, 2\}$) is denoted as $y_{gik}$. The sample mean and variance of gene g in group k are


given as $\bar{y}_{gk} = n_k^{-1} \sum_{i=1}^{n_k} y_{gik}$ and $s^2_{gk} = (n_k - 1)^{-1} \sum_{i=1}^{n_k} (y_{gik} - \bar{y}_{gk})^2$, respectively. The square root of the variance, $s_{gk}$, is also known as the standard deviation. The following test statistics are frequently used:

2.4.1. Fold Change Statistic

The mean difference is given by $M_g = \bar{y}_{g1} - \bar{y}_{g2}$. Transferring $M_g$ back to the original scale of gene expression yields $2^{M_g}$, a statistic denoted as the fold change of gene g between groups 1 and 2. Early analyses of microarray data used to define a threshold $M$ on $M_g$, and declared all genes significant for which $|M_g| > |M|$. This procedure makes the strong assumption that the variances are equal across all genes, which, however, is not plausible for gene expression data.

2.4.2. t-Statistic

In order to take different variance between genes into account, one may use the t-statistic $t_g = M_g / s_g$, with the so-called pooled within-group standard error defined as $s_g = \sqrt{s^2_{g1}/n_1 + s^2_{g2}/n_2}$. Under the null hypothesis, and given normal distribution of $y_{g1}$ and $y_{g2}$, $t_g$ follows Student's t-distribution with $n_1 + n_2 - 2$ degrees of freedom (see Note 2). The t-statistic uses a distinct pooled within-group standard error estimated for each gene separately, thereby taking into account different variances across different genes. However, in small samples the variance estimates are fairly unstable, and large values of $t_g$, associated with high significance, may arise even with biologically meaningless fold changes.
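To make the two statistics concrete, a small R sketch is given below; it is not the chapter's accompanying script, and the object names (expr, a genes x arrays matrix of log2 expression values; group, a factor with levels "BLC" and "nBLC") are assumptions.

```r
# Mean difference (log2 scale) and fold change per gene
M  <- rowMeans(expr[, group == "BLC"]) - rowMeans(expr[, group == "nBLC"])
fc <- 2^M                                  # fold change on the original scale

# Ordinary gene-wise t-tests (t.test uses the unequal-variance form by default)
p.raw <- apply(expr, 1, function(y) t.test(y ~ group)$p.value)
```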

2.4.3. Moderated t-Statistic

A general-purpose model proposed for microarray experiments with arbitrary experimental design (2) can be applied to a two-group comparison, resulting in the moderated t-statistic $\tilde{t}_g = M_g / \tilde{s}_g$, where $\tilde{s}_g^2 = (d_0 s_0^2 + d_g s_g^2)/(d_0 + d_g)$ is a weighted average of the gene-specific variance $s_g^2$ and a common variance $s_0^2$; $d_g = n_g - 1$ and $d_0$ are the degrees of freedom of $s_g^2$ and $s_0^2$, respectively. $s_0^2$ and $d_0$ are so-called hyperparameters and can be estimated, using empirical Bayes methods, from the distributions of gene expression of all genes (2). While $s_0^2$ serves as variance-stabilizing constant, $d_0$ can be interpreted as the relative weight assigned to this variance stabilizer. Under the null hypothesis, the moderated t-statistic $\tilde{t}_g$ follows a Student's t-distribution with $d_0 + d_g$ degrees of freedom. The approach is implemented in R's limma package (see Note 3).
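In practice, the moderated t-statistic can be obtained with a few limma calls, roughly as sketched below; expr and group are the same hypothetical objects as above, and the simple design matrix assumes a plain two-group comparison.

```r
# Gene-wise linear models with empirical Bayes variance moderation (limma)
library(limma)
design <- model.matrix(~ group)        # intercept plus group effect
fit <- eBayes(lmFit(expr, design))     # moderated t-statistics for each gene
topTable(fit, coef = 2, number = 10)   # top genes with raw and BH-adjusted p-values
```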

The SAM procedure (3), implemented in the R package samr, tries to solve the problem of unstable variance estimates by defining a test statistic $d_g = M_g / (s_0 + s_g)$, where, compared to the t-statistic, the denominator is inflated by an offset $s_0$, which is based on the distribution of $s_g$ across all genes. Common choices for $s_0$ are the 90th percentile of $s_g$ (10) or are based on more


refined estimation like the one applied in samr (11). Although related, $d_g$ (which offsets the standard errors) and $\tilde{t}_g$ (which employs an offset on variances) are not connected in any formal way. Due to the modification of its denominator, the distribution of $d_g$ is intractable even under the null hypothesis, and has to be estimated by resampling methods (see Subheading 2.6 and see Note 4).

2.4.5. Wilcoxon Rank-Sum Statistic

The classical nonparametric alternative to the t-test is the Wilcoxon rank-sum test, also known as the Mann–Whitney U-test, which tests the null hypothesis of equality of distributions in the two groups. Its test statistic is directly related to the probabilistic index $\Pr(y_{gi1} > y_{gj2})$, i.e., the probability that a gene expression $y_{gi1}$ randomly picked from group 1 is greater than a gene expression $y_{gj2}$ randomly picked from group 2. The Wilcoxon rank-sum test is based on the ranks of the original data values. Thus, it is invariant to monotone transformations of the data and does not impose any parametric assumptions. It is particularly attractive because of its robustness to outliers. However, it does not have the same interpretation as t-type statistics (fold change, t, moderated t, SAM), as it is sensitive to any differences in the location and shape of the gene expression distributions in the two groups. Assuming that data have been rank-transformed across groups, the Wilcoxon rank-sum test statistic $W_g$ is given by the sum of ranks in group 1. In its implementation in the R package samr, $W_g$ is standardized to $Z_g = [W_g - E(W_g)] / (s_0 + s_g)$, where $E(W_g)$ is the expected value of $W_g$ under the null hypothesis, and $s_g$ is the standard error of that rank sum. samr again uses an offset $s_0$ in the denominator, which is by default the 5th percentile of all $s_g$, $g = 1, \dots, G$.
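A plain gene-wise Wilcoxon rank-sum test (without the samr offset) can be run with base R alone; again, expr and group are placeholders.

```r
# Gene-wise Wilcoxon rank-sum (Mann-Whitney) tests; exact = FALSE avoids
# warnings caused by ties, which are common in expression data.
p.wilcox <- apply(expr, 1, function(y)
  wilcox.test(y ~ group, exact = FALSE)$p.value)
```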

2.4.6. RankProduct Statistic

Breitling et al. (12) proposed a simple nonparametric method to assess differential expression based on the RankProduct statistic $R_g = \prod_{i=1}^{n_1} \prod_{j=1}^{n_2} \mathrm{rank}_{\mathrm{up}}(y_{gi1} - y_{gj2}) / G$, where $\mathrm{rank}_{\mathrm{up}}(y_{gi1} - y_{gj2})$ sorts the gene-specific expression differences between two arrays i and j in descending fashion such that the largest positive difference is assigned a rank of 1. If a gene is highly expressed in group 1 only, $R_g$ will assume small values, as the differences $(y_{gi1} - y_{gj2})$ across all pairs of subjects $(i, j)$ from groups 1 and 2 are more likely to be positive. Similarly, $R_g$ will be large if gene g shows higher expression in group 2. For a gene not differentially expressed, the product terms $\mathrm{rank}_{\mathrm{up}}(y_{gi1} - y_{gj2}) / G$ fall around 1/2. The distribution of the RankProduct statistic can be assessed by permutation methods (see Subheading 2.6).
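A direct, naive R implementation of this formula (not the optimized code in samr or the RankProd package) may help to see how the statistic is built; x1 and x2 are hypothetical log2 expression matrices for groups 1 and 2 with genes in rows.

```r
# Naive rank product: small values of R_g indicate higher expression in group 1.
rank_product <- function(x1, x2) {
  G  <- nrow(x1)
  rp <- rep(1, G)
  for (i in seq_len(ncol(x1))) {
    for (j in seq_len(ncol(x2))) {
      d  <- x1[, i] - x2[, j]                  # per-gene difference for array pair (i, j)
      r  <- rank(-d, ties.method = "average")  # rank 1 = largest positive difference
      rp <- rp * (r / G)
    }
  }
  rp   # significance would be assessed by permutation (see Subheading 2.6)
}
```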

2.5. The Multiple Testing Problem

The main problem that we are confronted with in Omics analyses is the huge number of hypotheses to be tested simultaneously, leading to an inflation of the FWER if no adjustment for multiple tests is considered. Corrections have been developed which allow researchers to preserve the FWER at the significance level. p-value


adjustments inflate original p-values, thus resulting in an increased type II error compared to an unadjusted analysis. Therefore, in recent years more and more experiments aim at controlling the FDR (4), which allows for a less stringent p-value correction. The FDR is defined as the proportion of false positives among all genes declared differentially expressed, i.e., in contrast to the FWER, the reference set is no longer defined by the true status of the gene (differentially expressed or not), but by our decision about the true status. Here, it is no longer believed that the null hypothesis applies to all genes under consideration. Instead, we are quite sure that differential expression is present in at least a subset of the genes. Procedures to control the FDR assign a q-value to each gene, which is an analog to the p-value, but now refers to the expected proportion of false discoveries in a gene list obtained by calling all genes with equal or lower q-values differentially expressed. The q-values are typically lower than adjusted p-values.

2.6. Assessing Significance in Omics Experiments While Controlling the FDR

q-values can be obtained by the following two procedures:

2.6.1. Stepwise Adjustment

A recursive step-up procedure, proposed by Benjamini and Hochberg (4), starts at the highest p-value, which is directly translated into a q-value. The second-highest p-value is multiplied by G/(G − 1) to obtain the corresponding q-value, etc. Each q-value is restricted such that it cannot exceed the precedent q-value. Formally, the procedure is defined as follows:
1. Raw p-values are ordered such that $p_{(1)} \le \dots \le p_{(g)} \le \dots \le p_{(G)}$
2. $q_{(G)} = p_{(G)}$
3. $q_{(g)} = \min[q_{(g+1)}, \; p_{(g)} G / g]$
This procedure takes raw p-values as input and is therefore suitable for all test statistics for which p-values can be analytically derived: $t_g$ and $\tilde{t}_g$. It relies on the assumption of independent or positively correlated test statistics, an assumption that is likely to apply to microarray experiments (13). The Benjamini–Hochberg approach is the default p-value adjustment for the moderated t-statistics $\tilde{t}_g$ in R's limma package.
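In R, this adjustment is directly available through p.adjust; the vector p.raw of raw gene-wise p-values is a placeholder from the earlier sketches.

```r
# Benjamini-Hochberg q-values; genes with q <= 0.05 form the gene list at an FDR of 5%
q.bh <- p.adjust(p.raw, method = "BH")
gene.list <- which(q.bh <= 0.05)
```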

2.6.2. Simulation-Based Adjustment

Storey (5) proposed a direct permutation-based approach to q-values to control the FDR. This adjustment is based on two ideas: first, we assume that some genes are truly differentially expressed, and second, that the FDR, associated with a p-value cutoff of, e.g., 0.05, equals the expected proportion of false ­findings among all genes declared significant at that cutoff. The number of false findings can be estimated from, e.g., 100 data sets that emerge from random reassignment of the group labels to the arrays. In each of these permuted data sets, the number of


genes declared significant is counted. The average count estimates the number of false findings. Assuming that genes are ordered by their raw p-values, i.e., $p_{(1)} \le \dots \le p_{(g)} \le \dots \le p_{(G)}$, and given the permuted p-values $P_{lb}$ ($l = 1, \dots, G$; $b = 1, \dots, B$) from B permuted data sets as input, the Storey adjustment can be formalized as

$$q^S_{(g)} = \hat{p}_0 \sum_{b=1}^{B} \sum_{l=1}^{G} I[P_{lb} < p_{(g)}] \, / \, [\mathrm{rank}(g) \, B].$$

Here, $\mathrm{rank}(g)$ denotes the index of gene g in the ordered sequence of unadjusted p-values such that, e.g., $\mathrm{rank}(g) = 1$ for the most significant gene, etc. $\mathrm{rank}(g)$ is essentially the number of genes declared significant at a p-value threshold of $p_{(g)}$, and $\sum_{l=1}^{G} I[P_{lb} < p_{(g)}] / \mathrm{rank}(g)$ is the estimated false positive rate among nondifferentially expressed genes in permutation b, which is averaged over B permutations. Alternatively, one may also apply the adjustment to the ordered absolute test statistics $T_{(1)} \ge \dots \ge T_{(g)} \ge \dots \ge T_{(G)}$, yielding $q^S_{(g)} = \hat{p}_0 \sum_{b=1}^{B} \sum_{l=1}^{G} I[\,|T_{lb}| > |T_{(g)}|\,] / [\mathrm{rank}(g) \, B]$. In both variants, $\hat{p}_0$ is the estimated proportion of genes not differentially expressed. Various suggestions have been made about how to estimate this proportion. Based on the empirical distribution function of raw p-values $\hat{F}(p_g) = \sum_{l=1}^{G} I(p_l \le p_g)/G$, $\hat{p}_0 = [1 - \hat{F}(0.5)]/0.5$, i.e., the ratio of observed and expected proportions of p-values exceeding 0.5. The approach is applicable to both p-values and test statistics, thus it works with all proposed test statistics. Since it employs permutation, it relies on the assumption of subset pivotality (see Note 5).
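A minimal sketch of this permutation-based adjustment is given below. It assumes that p holds the raw p-values and perm_p is a G x B matrix of p-values from B random relabelings of the arrays (both objects are hypothetical); the final monotonicity step is a common convention rather than part of the formula above.

```r
storey_q <- function(p, perm_p) {
  G <- length(p); B <- ncol(perm_p)
  pi0 <- min(1, mean(p > 0.5) / 0.5)      # estimated proportion of non-differential genes
  o <- order(p)                           # o[k] is the gene with rank k
  q <- numeric(G)
  for (k in seq_len(G)) {
    g <- o[k]
    q[g] <- pi0 * sum(perm_p < p[g]) / (k * B)   # average false positives / rank(g)
  }
  q[o] <- rev(cummin(rev(q[o])))          # enforce monotone q-values (convention)
  pmin(q, 1)
}
```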

2.7. Improving Statistical Power by Unspecific Filtering

Adjusted p-values are always inflated compared to raw p-values. The magnitude of this inflation directly depends on G, the number of hypotheses to be considered, irrespective of the type of adjustment. In particular, control of the FDR depends on the number of hypotheses rejected, not on the total number of hypotheses. Hence, it is desirable to identify a priori genes that are very unlikely to be differentially expressed and that could be excluded from statistical testing. Some unspecific filtering criteria, i.e., criteria that do not make use of the group information, are defined as follows (14, 15):
1. Exclude those genes whose variation is close to zero. Consider the interquartile range, defined as IQR = (75th percentile − 25th percentile). The IQR filter excludes those genes where the IQR is lower than a prespecified minimum relevant log fold change, e.g., 1, here corresponding to a fold change of 2. If groups are perfectly separated, then the median log2


expression in the groups will be equal to the 75th and 25th percentiles, computed over the combined groups. Thus, for genes that do not pass this filter, even under perfect separation of the groups, the minimum relevant log fold change (approximately equal to the difference in medians) cannot be achieved.
2. Exclude genes with low average expressions. If nearly all expression values are low, there is no chance to detect overexpression in one group.
3. Exclude genes with a high proportion of undetectable signals or high background noise.
The thresholds used for these filtering rules may depend on the platform used and on individual taste, such that the number of genes excluded by filtering is highly arbitrary. Lusa et al. (16) proposed a method to combine the results from a variety of filtering methods, without the need to prespecify filter thresholds. It is important to note that only unspecific filtering does not compromise the validity of results. If a filtering rule is applied that uncovers the group information (e.g., $|M_g| > M$, with some arbitrary threshold M), then this must be considered also in resampling-based assessment of statistical significance.
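Such unspecific filters are easy to express in R (the genefilter package used in Subheading 3.4 provides equivalent functionality); expr is again the hypothetical log2 expression matrix, and the average-expression cutoff of 5 is an arbitrary, platform-dependent choice.

```r
# Keep genes whose IQR reaches the minimum relevant log2 fold change of 1
# and whose average expression is not uniformly low (threshold 5 is arbitrary).
keep <- apply(expr, 1, IQR) >= 1 & rowMeans(expr) > 5
expr.filtered <- expr[keep, ]
```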

3. Methods

This section presents a complete workflow of a differential gene expression analysis, exemplified by comparing tissues of BLC patients to tissues of nBLC cancer patients (9). We compute test statistics and assess significance comparing eight BLC to eight nBLC subjects. Each tissue sample was hybridized to an Affymetrix Human Genome U133 Plus 2.0 Array and hence methods for the preprocessing of Affymetrix data are briefly described. All analyses were performed using R (http://www.r-project.org) and Bioconductor (http://www.bioconductor.org) software (17, 18). The data set and the R script to perform all analytical steps can be found at http://www.meduniwien.ac.at/msi/biometrie/MIMB.

3.1. Experimental Design Considerations

3.1.1. Sample Size Assessment for Microarray Experiments

The necessary sample size can be estimated by assuming values for the following quantities:
1. The number of genes investigated (NG).
2. The number of genes assumed to be differentially expressed (this number is often assumed to be equal to the number to be detected by testing) (NDE).
3. The acceptable number of false positives (FP).


4. The relevant mean difference (or log fold change), often assumed as 1.0.
5. The per gene variation of gene expression within groups (usually expressed as standard deviation), often assumed as 0.7.
The FDR that is to be controlled is given by FP/NDE. High variation of gene expression and a low (relevant) mean difference increase the necessary sample size. Given that the NDE equals the number of genes detected by testing, FDR equals the type II error rate. One minus the type II error rate is defined as the statistical power. The per gene type I error is also denoted as false negative rate and is given by FP/(NG − NDE), given all other numbers remain constant. Sample size assessment for microarrays can be done using the MD Anderson sample size calculator available at http://bioinformatics.mdanderson.org/MicroarraySampleSize. Table 1 gives some impression about how the required sample size varies with different assumptions. R's samr package also provides an assessment of sample size, given some pilot data, and can take into account different standard deviations across genes as well as correlation between genes.

3.1.2. Bias

Bias may arise as a result of using arrays from different production batches, using different scanners to produce intensity images, or analyzing samples at different times. To minimize these sources of bias, stratification or randomization of these factors between experimental conditions is performed.

Table 1
Sample size calculations performed by the MD Anderson microarray sample size calculator (http://bioinformatics.mdanderson.org/MicroarraySampleSize)

NG      NDE    FP   FDR   FC   SD   FNR      N
54,000  1,000  100  0.10  2    0.7  0.0186   19
5,400   1,000  100  0.10  2    0.7  0.1887   13
54,000  1,000  100  0.10  1.5  0.7  0.0186   56
54,000  1,000  100  0.10  2    1.4  0.0186   76
54,000  1,000  50   0.05  2    0.7  0.0186   25
54,000  300    30   0.10  2    0.7  0.00056  22

Provided with the numbers of the first six columns, the calculator supplies estimates of the false negative rate and the necessary sample size per group.
NG number of genes investigated, NDE number of genes truly/declared differentially expressed, FP number of false positive findings, FDR false discovery rate (= 1 − power), FDR = FP/NDE, FNR false negative rate (per gene type I error), FNR = FP/(NG − NDE), FC fold change to be detected, SD per gene standard deviation (on log2 expressions), N necessary number of arrays per group


As an example, consider two experimental conditions 1 and 2, and assume that arrays stem from two production batches B1 and B2 of sizes 30 and 10, respectively, and are hybridized and analyzed at days D1–D5. On each day, a maximum of eight samples can be processed. Let Design A be characterized by the sequence of conditions {1111111111111111-11112222-22222222-222222}, bold figures standing for arrays from the production batch B1. This design does not allow separating the biological effect of BLC vs. nBLC from the effects of day and production batch. By contrast, design B {12122121-12112112-21211212-21121221-121121} randomizes the two latter effects such that the biological effect is approximately uncorrelated to the nuisance factors.

3.2. Data Preprocessing

Prior to statistical analysis of the data, several steps have to be performed in order to make the data comparable from array to array: background correction, normalization, and summarization. The most popular method for preprocessing of Affymetrix data is the Robust Multichip Average (RMA) (19) method, which serves all three purposes: first, the background is estimated based on the optical noise and nonspecific signal, ensuring that the background corrected signal is always positive. Second, quantile normalization (20) is applied to all arrays on the background corrected intensities. This type of normalization forces the gene expressions to have a similar distribution in all arrays under study, which is plausible assuming that typically only few genes show differential expression. Third, the multiple probes per gene are summarized by assuming that the log2 intensity $x_{gij}$ of gene g, array i, and probe j is a sum of the log2 expression level $y_{gi}$, some probe-specific affinity effects $a_{gj}$ common to all arrays, and some residual error. Assuming that the probe affinity effects sum up to zero, they are estimated by a robust alternative to linear regression minimizing the sum of absolute residuals, the median polish (21), and $y_{gi}$ results as estimated log2 expression level.
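With Bioconductor's affy package, RMA preprocessing of a set of CEL files can be run roughly as follows; the working directory and object names are assumptions, not part of the chapter's script.

```r
# Read Affymetrix CEL files from the working directory and apply RMA
# (background correction, quantile normalization, median polish summarization).
library(affy)
abatch <- ReadAffy()      # AffyBatch with raw probe-level intensities
eset   <- rma(abatch)     # ExpressionSet with log2 expression values
expr   <- exprs(eset)     # probe sets x arrays matrix for downstream analysis
```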

3.3. Quality Checks

Before and after normalization, quality checks need to be performed in order to discover arrays for which data quality may be questionable. Assuming that gene expression should be similar for the majority of genes, a useful and simple method is to compute concordance correlation coefficients and Spearman correlation coefficients between all arrays. Principal components analysis (see Subheading 3.5) can also be helpful to detect spurious arrays. After outlier detection and removal, the normalization step must be repeated, as outlying arrays may have a disproportional impact on the results of normalization. Furthermore, a plot of the difference against the mean of gene expression of two arrays (MA-plot), or a plot of the standard deviation against the mean of gene expression of all genes, is useful (see Note 6).
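Simple versions of these checks, using the expr matrix from the sketch above, might look as follows.

```r
# Pairwise Spearman correlations between arrays; unusually low values may
# flag questionable arrays.
array.cor <- cor(expr, method = "spearman")

# MA-plot for the first two arrays: M = difference, A = mean of log2 expression.
a.mean <- (expr[, 1] + expr[, 2]) / 2
m.diff <- expr[, 1] - expr[, 2]
plot(a.mean, m.diff, pch = ".", xlab = "A (mean)", ylab = "M (difference)")
abline(h = 0, col = "red")
```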


3.4. Unspecific Filtering

For our selected example gene expression profiles, we employ R's genefilter package to filter out genes with an IQR

>FW8YT1Q01B9VMY length=380 xy=0815_2008 region=1 run=R_XXXX
TGATCTTACTATCATTGCAAAGCCACTTAAAGACCACACACTACGTCACTGGAAAAGAGT
TCAATAGAGGCCTCCTACGAGTAACACCCTTACACTTCTGCTACAGAAACTACACCTTTT

Quality values for 454 reads are given in a separate file. The assignment of reads and quality values is possible via the header line.

>FW8YT1Q01B9VMY length=380 xy=0815_2008 region=1 run=R_XXXX
37 39 39 39 39 28 28 31 33 37 37 35 35 35 35 35 35 37 27 31 31 37 37 37 37 37 37 38 37 37 37 37 39 39 39 39 39 39 39 39 39 37 37 33 32 32 30 32 32 32 19 19 15 15 15 15 23 16 30 29 …

The FASTQ format is a slight modification of the FASTA format.

@solexaY:7:1:6:1669/1
GCCAGGNTCCCCACGAACGTGCGGTGCGTGACGGGC
+solexaY:7:1:6:1669/1
``a`aYDZaa``aa_Z_`a[`````a`_P][[\_\V

The FASTQ header begins with an "@" symbol. Again, all following lines hold the sequence itself. For Illumina reads, the header informs about the name of the instrument (solexaY), the flow cell lane (7), the tile number within the flow cell (1), its x- and y-coordinates (6:1669), and has a flag indicating whether the read is single-end (/1) or belongs to a mate-pair or paired-end run (/2). The "+" sign followed by the same sequence identifier indicates the beginning of the quality value string. Note that the qualities are given in ASCII. The quality values give an estimate of the accuracy of the base calling. Nowadays, most sequencing platforms report a Phred quality score. The score, originally developed in the context of the Human Genome project, is given by

$Q = -10 \cdot \log_{10} p$, where p is the probability that the reported base is incorrect. Illumina initially decided to deviate from this scoring and instead used the formula



Q Solexa = −10·log10

p . 1− p



While the Illumina quality score Q Solexa is asymptotically identical to Q for low error probabilities, it is typically smaller for higher error probabilities. Since the Illumina quality scores can become negative, a conversion to real phred scores using

(

Q = 10·log10 1 + 10Q Solexa /10

)

may be necessary. While high Illumina quality scores have been reported to overestimate the base calling accuracy, low scores underestimated the base calling accuracy (10, 12). Since version 1.3, the proprietary Solexa pipeline uses Phred scores. It is important to note that also the encoding of the quality string has been subject to changes. The new pipeline encodes the phred qualities from 0 to 62 in a non-standard way using the ASCII characters 64 to 126. Due to the fact that Phred scores from 0 to 93 are normally encoded using ASCII characters 33–126, a conversion might be necessary.
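The following sketch illustrates these conversions (the helper names are illustrative; dedicated toolkits handle the different encodings automatically):

import math

def solexa_to_phred(q_solexa):
    # convert an old-style Solexa quality score to a standard Phred score
    return 10 * math.log10(1 + 10 ** (q_solexa / 10.0))

def decode_quality(qual_string, offset=33):
    # decode an ASCII-encoded quality string; offset=33 corresponds to the
    # standard Phred encoding, offset=64 to the Illumina 1.3+ encoding
    return [ord(c) - offset for c in qual_string]

# decoding the quality line of the FASTQ record shown above,
# assuming (for illustration) the 64-based encoding
quals = decode_quality("``a`aYDZaa``aa_Z_`a[`````a`_P][[\\_\\V", offset=64)
print(quals[:10])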


3.2. Mapping

In genome informatics, mapping describes the process of generating a (mostly heuristic) alignment of query sequences to a reference genome. It is the basis for qualitative as well as quantitative analyses. To map HTS sequences, algorithms have to address three problems at once: in addition to the tremendous amount of sequences, the methods have to deal with lower data quality and shorter read lengths. Particularly for short (and erroneous) reads, it is often not possible to unambiguously determine the original position in the reference genome, since a read may align equally well to several genomic locations. Sequencing of repetitive regions complicates this problem even further. The methods presented here apply different mapping policies to tackle these problems. To cope with the huge amount of data, most short read alignment programs use index structures for either the reads or the reference.

3.2.1. Mapping with Hash Tables

Heng Li et al. developed MAQ, one of the first read mappers for Illumina sequences, based on hash tables. Although the tool is no longer supported, a look into the core of this approach reveals some basic principles and policies of short read mapping. The focus of MAQ is to incorporate quality values to facilitate and improve read mapping (13). By default, MAQ indexes only the first 28 bp of the reads (the seed) in six different hash tables, ensuring that all reads with at most two mismatches in the seed can be found in the genome. For illustration, for a seed of 8 bp the hash tables would be built from three pairs of complementary templates, 11110000, 00001111, 11000011, 00111100, 11001100, and 00110011, where a 1 indicates a base that is included in the hash key generation. After this indexing step, MAQ proceeds by scanning the reference sequence once for each pair of complementary templates. Each time a seed hit is encountered, MAQ attempts to extend the hit beyond the seed and scores it according to the quality values. It has been reported that the use of quality values during read alignment can improve the mapping results substantially (14). By default, MAQ reports all hits with up to two mismatches, but its algorithm is able to find only 57% of the reads with three mismatches. Hits with insertions and deletions (indels) are not reported. Furthermore, for reads with multiple equally scoring best hits, only one hit is reported.
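The template idea can be illustrated with a simplified sketch (for brevity, the reference is indexed here, whereas MAQ indexes the reads and scans the reference; quality-aware extension and scoring are omitted):

TEMPLATES = ["11110000", "00001111", "11000011",
             "00111100", "11001100", "00110011"]

def seed_keys(seed):
    # one key per template: the seed bases at the template's '1' positions
    return ["".join(b for b, t in zip(seed, tpl) if t == "1")
            for tpl in TEMPLATES]

def index_reference(ref, seed_len=8):
    # one hash table per template, mapping keys to reference positions
    tables = [dict() for _ in TEMPLATES]
    for pos in range(len(ref) - seed_len + 1):
        for table, key in zip(tables, seed_keys(ref[pos:pos + seed_len])):
            table.setdefault(key, []).append(pos)
    return tables

def candidate_hits(read, tables, seed_len=8):
    # two mismatches fall into at most two of the four 2-bp blocks, so at
    # least one template excludes both and yields an exact key match
    hits = set()
    for table, key in zip(tables, seed_keys(read[:seed_len])):
        hits.update(table.get(key, []))
    return hits

reference = "ACGTACGTTGCAACGTTGCA"
tables = index_reference(reference)
print(sorted(candidate_hits("ACGTACGATTTT", tables)))  # seed carries one mismatch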

3.2.2. Mapping with Suffix Arrays and the Burrows–Wheeler Transform

A second approach to short read alignment uses the Burrows–Wheeler Transform (BWT). In brief, the BWT of a text T, e.g., a reference genome, is a permutation of T obtained by sorting all of its cyclic rotations and concatenating their last characters. Its main advantage is that the BWT of T contains stretches of repeated symbols, making compression of T more effective. The backward search algorithm (15) on a compressed BWT simulates a fast traversal of a prefix tree for T without explicitly representing the tree in memory. It only requires two arrays to efficiently access the compressed BWT, which is the key to the speed and low memory footprint of read aligners such as BWA (16), Bowtie (17), and SOAP2 (18).

Because the backward search only finds exact matches, additional algorithms for inexact searches had to be devised. BWA, for example, solves this problem by enumerating alternative nucleotides to find mismatches, insertions, and deletions, while SOAP2 employs a split alignment strategy: the read is split into two parts to allow a single mismatch, insertion, or deletion, which can then lie in at most one of the two fragments. Likewise, the read is split into three fragments to allow two mismatches, and so forth. Other tools such as Bowtie do not allow short read alignments with gaps. BWT-based read mappers are the speed champions among short read aligners, with an exceptionally low memory footprint. However, for all of the tools described above, the user has to carefully choose a threshold for the maximum number of acceptable errors. For error thresholds above two mismatches, insertions, or deletions, the speed decreases significantly. While these thresholds seem to be sufficient for mapping genomic DNA, mapping transcriptome data or data containing contaminating sequences (e.g., linkers) may be more difficult. In contrast, the tool segemehl (19), based on enhanced suffix arrays, aims to find a best local alignment with increased sensitivity. In a first step, exact matches of all substrings of a read with the reference genome are computed. These exact substring matches are then modified by a limited number of mismatches, insertions, and deletions. The set of exact and inexact substring matches is subsequently evaluated using a fast and accurate alignment method. While the program shows good recall rates of 80% for high error rates of around 10%, it has a significantly larger memory footprint than the BWT and hashing methods. A practical example is given at the end of this chapter (see Note 1).

The selection of an appropriate mapping method depends on various criteria (see Note 2). Due to their different indexing techniques, some short read aligners are limited to certain read lengths; such tools cannot be used if long reads or reads of different sizes need to be aligned. Furthermore, for speed reasons some aligners report only one hit per read, regardless of whether multiple equally good hits could be obtained. This may be a problem if repetitive regions are sequenced. The user has to assess carefully which degree of sensitivity is needed: a method that discards reads with multiple hits (or reports a random one) or with high error rates may be suitable for SNP detection, while mapping of transcriptome (RNA-seq) data may require higher sensitivity. A selection of mapping tools is given at the end of the chapter (see Note 3).
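A toy sketch of the backward search principle (not how BWA, Bowtie, or SOAP2 are actually implemented; real FM-index implementations replace the naive occurrence counting below with compressed rank structures):

def bwt(text):
    # naive BWT via sorted cyclic rotations; '$' is the sentinel character
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def backward_search(bwt_str, pattern):
    # C[c]     = number of characters in the text smaller than c
    # occ(c,i) = number of occurrences of c in bwt_str[:i]
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)

    def occ(c, i):
        return bwt_str[:i].count(c)   # O(n) here; rank tables make this O(1)

    lo, hi = 0, len(bwt_str)          # current suffix-array interval
    for c in reversed(pattern):       # extend the match to the left
        lo = C.get(c, 0) + occ(c, lo)
        hi = C.get(c, 0) + occ(c, hi)
        if lo >= hi:
            return 0                  # no exact match
    return hi - lo                    # number of exact occurrences

b = bwt("ACGTACGT")
print(backward_search(b, "ACG"))      # -> 2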

3.2.3. SAM/BAM Mapping Output Format


Because most read mapping tools have their own output formats, a standard output format for short read aligners was developed in the context of the 1000 Genomes Project (http://www.1000genomes.org) (20). The Sequence Alignment/Map (SAM) format is a human-readable, tab-delimited format; its binary equivalent (BAM) is intended to facilitate parsing by computer programs. The SAM format contains a header and an alignment section. A typical header section starts with a header line (@HD) that holds the file format version (VN:1.0). Sequence dictionary lines (@SQ) hold the names (SN:chr20) and lengths (LN:62435964) of the reference sequences to which the reads in the alignment section are mapped.

@HD  VN:1.0
@SQ  SN:chr20  LN:62435964
@RG  ID:L1  PU:SC_1_10  LB:SC_1  SM:NA12891
@RG  ID:L2  PU:SC_2_12  LB:SC_2  SM:NA12891

To identify different biological samples, the SAM file may also hold one or more read groups (@RG). Each read group has to have a unique identifier (ID:L1, ID:L2) and the name of the sample (SM:NA12891) from which the reads were obtained. Additionally, the platform unit (PU:SC_1_10), e.g., the lane of the Illumina flow cell, or the library name (LB:SC_1) can be given. The alignment section holds all read alignments. A typical alignment line looks like this:

read_28833_29006_6945  99  chr20  28833  20  10M1D25M  =  28993  195  AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG
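As a small illustration (a simplified sketch, not a substitute for dedicated libraries such as SAMtools), the mandatory fields and the CIGAR string of such a line can be interpreted as follows:

import re

# the example alignment line from above; the quality column that would
# normally follow the sequence is truncated in the excerpt and omitted here
line = ("read_28833_29006_6945\t99\tchr20\t28833\t20\t10M1D25M"
        "\t=\t28993\t195\tAGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG")

qname, flag, rname, pos, mapq, cigar = line.split("\t")[:6]
flag, pos, mapq = int(flag), int(pos), int(mapq)

# decode a few bits of the FLAG field (bit meanings per the SAM specification)
is_paired     = bool(flag & 0x1)    # read is paired
is_proper     = bool(flag & 0x2)    # both mates aligned properly
mate_reverse  = bool(flag & 0x20)   # mate maps to the reverse strand
first_in_pair = bool(flag & 0x40)   # this is the first read of the pair

# walk the CIGAR string 10M1D25M: 10 aligned bases, a 1-base deletion
# relative to the reference, and another 25 aligned bases
ref_span = sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MDN=X")    # operations that consume the reference

print(qname, "maps to", rname, pos, "-", pos + ref_span - 1, "with MAPQ", mapq)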