Computational methods for understanding bacterial and archaeal genomes [1 ed.] 9781860949821, 1860949827

Over 500 prokaryotic genomes have been sequenced to date, and thousands more have been planned for the next few years. W


English Pages 494 Year 2008


COMPUTATIONAL METHODS FOR UNDERSTANDING BACTERIAL AND ARCHAEAL GENOMES

SERIES ON ADVANCES IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY

ISSN: 1751-6404

Series Editors:
Ying XU (University of Georgia, USA)
Limsoon WONG (National University of Singapore, Singapore)

Associate Editors:
Ruth Nussinov (NCI, USA)
Rolf Apweiler (EBI, UK)
Ed Wingender (BioBase, Germany)
See-Kiong Ng (Inst for Infocomm Res, Singapore)
Kenta Nakai (Univ of Tokyo, Japan)
Mark Ragan (Univ of Queensland, Australia)

Vol. 1: Proceedings of the 3rd Asia-Pacific Bioinformatics Conference
        Eds: Yi-Ping Phoebe Chen and Limsoon Wong
Vol. 2: Information Processing and Living Systems
        Eds: Vladimir B. Bajic and Tan Tin Wee
Vol. 3: Proceedings of the 4th Asia-Pacific Bioinformatics Conference
        Eds: Tao Jiang, Ueng-Cheng Yang, Yi-Ping Phoebe Chen and Limsoon Wong
Vol. 4: Computational Systems Bioinformatics 2006
        Eds: Peter Markstein and Ying Xu
        ISSN: 1762-7791
Vol. 5: Proceedings of the 5th Asia-Pacific Bioinformatics Conference
        Eds: David Sankoff, Lusheng Wang and Francis Chin
Vol. 6: Proceedings of the 6th Asia-Pacific Bioinformatics Conference
        Eds: Alvis Brazma, Satoru Miyano and Tatsuya Akutsu
Vol. 7: Computational Methods for Understanding Bacterial and Archaeal Genomes
        Eds: Ying Xu and J. Peter Gogarten


Series on Advances in Bioinformatics and Computational Biology - Volume 7

COMPUTATIONAL METHODS FOR UNDERSTANDING BACTERIAL AND ARCHAEAL GENOMES

Editors

Ying Xu
University of Georgia, USA

J. Peter Gogarten
University of Connecticut, USA

Imperial College Press

Published by
Imperial College Press
57 Shelton Street, Covent Garden, London WC2H 9HE

Distributed by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.

COMPUTATIONAL METHODS FOR UNDERSTANDING BACTERIAL AND ARCHAEAL GENOMES Series on Advances in Bioinformatics and Computational Biology — Vol. 7 Copyright © 2008 by Imperial College Press All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

ISBN-13 978-1-86094-982-1 ISBN-10 1-86094-982-7

Typeset by Stallion Press Email: [email protected]

Printed in Singapore.

PREFACE

Sequencing technology has advanced to such a level that large sequencing centers such as the Joint Genome Institute (JGI) of the US Department of Energy can sequence a prokaryotic genome within a day. As of this writing, ∼800 prokaryotic genomes have been sequenced and at least 1,000 are in the pipeline being sequenced. Given a few large sequencing efforts currently being planned, we could possibly see over 10,000 complete prokaryotic genomes within the next few years. In addition, the sequencing efforts of metagenomes have produced over one billion base-pairs of sequence fragments since 2000, and the efforts are expected to scale up rapidly, soon to produce substantially more genomic sequence data than what we have seen in the past twenty years. While these genomic sequence data have provided unprecedented opportunities for biologists to study and to understand these organisms, they have also raised some highly challenging problems regarding how to “mine” the genomes, that is, how to extract the information encoded in the genomes much more effectively than the existing tools can, simply to keep up with the pace of world-wide genome sequencing efforts.

Compared to eukaryotic genomes, which are more complex in general, prokaryotic genomes pose a set of unique challenging problems. First, prokaryotic genomes are much more dynamic in terms of their gene compositions than the typical eukaryotic genomes since horizontal gene transfers take place substantially more often in prokaryotes. Second, prokaryotes evolve at much faster rates than eukaryotes in general, hence making their genomes diverge faster. Third, prokaryotes have been found in broader environments than eukaryotes, suggesting their more flexible adaptability to the environments. Fourth, prokaryotes often co-exist with other prokaryotes as a mutually dependent community, parts of whose metabolisms are inter-species, making studies of their biochemistry and their genomes rather unique.

Comparative genomics has been the most effective approach to mining the genomes and deciphering the information encoded in genomes. Numerous computational techniques have been developed based on comparative strategies, to predict genes and the functions of their protein products, to elucidate operonic structures, to detect previously unknown structures such as uber-operons, and to predict regulatory elements and interactions of biochemical pathways. Fundamental to these computational techniques is our understanding about the evolution of genes, molecular interactions, biological processes and genomes. It is the
theoretical framework for studying these evolutionary processes that has guided our computational studies of the prokaryotic genomes and their structures. For this reason, we have designed this book in such a way that it tightly integrates computational studies with theoretical discussions of prokaryotic genomes. We believe that it is essential to have a good understanding about both the evolutionary theories for prokaryotes and the possibilities and limitations of computational genome analysis techniques, in order to carry out in-depth computational studies that may generate new insights about these genomes and the information they encode.

In this book, we included a collection of cohesively written chapters on prokaryotic genomes, their organization and evolution, the information they encode, and computational approaches to deriving such information and to understanding their organization and evolution. When appropriate, we attempt to provide a comparative view of the bacterial and archaeal genomes and of how information is differently encoded in these two domains of life.

This book is intended to be used as an introductory textbook for a graduate-level microbial genomics and bioinformatics course as well as a reference book for researchers working in the area of prokaryotic genome studies. While the chapters are organized in a logical order, each chapter in the book is a self-contained review of a specific subject. Hence a reader does not necessarily have to read through the chapters in their sequential order.

Since this is a rapidly evolving field that encompasses an exceptionally wide range of research topics, it is difficult for any individual to write a comprehensive textbook on the entire field. Most of the chapters are written by members of our two labs at the University of Georgia and the University of Connecticut, respectively. The remaining chapters, which, we feel, will help to fill gaps in covering the field, are written by experts who are actively doing research at the forefront of the selected topical areas.

Chapter 1 (General characteristics of prokaryotic genomes) explains the notion of a genome as a collection of replicons in a cell, and provides an overview of different types of replicons and their distinguishing characteristics. This is followed by a review of the diversity of prokaryotic chromosomes with respect to their size, gene content, G+C content, codon usage, oligonucleotide composition, amino acid usage, repeat content, and intragenomic compositional heterogeneity, emphasizing contrasts between eukaryotes and prokaryotes.

Chapter 2 (Genes in prokaryotic genomes and their computational prediction) begins with a historical account of the development of gene prediction methods. These methods exploit either intrinsic information, that is, nucleotide ordering patterns intrinsic to a DNA sequence, or extrinsic information gained from sequence conservation in evolution. The basic idea and models underlying the prediction programs utilizing either intrinsic information or extrinsic information or both are described, followed by a critical comparative assessment of their performance on experimentally validated datasets. The strengths and weaknesses of the current programs and the future challenges are discussed, with suggestions for addressing the remaining core issues in this field.


Chapter 3 (Evolution of the genetic code: computational methods and inferences) focuses on the evolution of genetic codes. Ever since its discovery, the genetic code has been one of the most difficult puzzles for evolutionary biology to unravel. As the evolution of early life on Earth is inextricably linked to the evolution of the genetic code, understanding its history has been the focus of numerous investigations over the last several years. We present several of the computational methods that have been applied to this problem, including a brief discussion of their results and significance.

Chapter 4 (Dynamics of prokaryotic genome evolution) addresses the dynamics of genome evolution. The ever-increasing amount of genome sequences available in public databases allows a better comprehension of how evolutionary forces act on prokaryotic species. Several methods exist to categorize genes based on their frequency of occurrence among genomes, and are used to predict the genes’ roles and modes of evolution. This chapter describes practical approaches used in the comparison of multiple genomes, and discusses the current status of the field of prokaryotic genome evolution.

In Chapter 5 (Mobile genetic elements and their prediction), we review mobile genetic elements, with a focus on those that can be computationally predicted within genome sequences. We discuss the features of these elements ranging from the small neutral Insertion Sequence (IS) elements, to the large genomic islands and prophage regions that can encode antibiotic resistance and virulence-related functions. The chapter focuses on the in silico prediction of these elements, including the discussion of several existing tools and databases. In addition, we point out the strengths and weaknesses of the existing methods and suggest future avenues of research in this area.

Chapter 6 (Horizontal gene transfer: its detection and role in microbial evolution) provides a brief history of the discovery of gene transfer and how it impacted attempts to develop a natural taxonomy for prokaryotes. We review species concepts as applied to bacteria and archaea, look at examples of multilevel selection acting on genes, organisms and communities, and examine biological processes and artifacts that can create conflicts between gene and genome phylogenies. We describe phylogenetic and surrogate approaches that aim to identify transferred genes, discuss the advantages and problems associated with different methods, and provide an outlook on future developments that will allow us to trace the history of genes and pathways through the network of organismal history.

In Chapter 7 (Genome reduction during prokaryotic evolution), we review the phenomenon of genome reduction, which has taken place independently many times in several prokaryotic lineages. This chapter describes methods to reconstruct ancestral genome contents and to infer genome expansion and contraction. The mutational and selective hypotheses to explain these changes are discussed. Finally, we describe the gene content of the smallest genomes and the fuzzy boundary between cells and organelles.

Chapter 8 (Comparative mechanisms on transcription and regulatory signals in archaea and bacteria) describes our current knowledge about regulation of
transcription in archaea and bacteria. This overview emphasizes the main cis-regulatory signals, trans-regulatory factors (TFs) and alternative RNA polymerase types constituting the machinery needed for proper regulation of genes in response to changing environmental conditions at the level of transcription initiation. In archaea the basal machinery for transcription is related to RNAP II from eukaryotes, while the use of TFs for modulating gene transcription is similar to that in bacteria. We describe the promoter at different levels, including the activity of the basal machinery for transcription and the use of specific transcription factors sensing and responding to particular effector signals. We also summarize regulation at other levels beyond transcription initiation. There is an inherent limitation in this comparative approach given the limited amount of knowledge of the regulation mechanisms in archaea. Furthermore, these comparisons also bring eukaryotic transcription to the scene when searching for an evolutionary understanding of the diverse puzzle of similarities and differences in gene organization, conservation of proteins and their mechanisms involved in bacteria, archaea and eukaryotes.

Chapter 9 (Computational techniques for orthologous gene prediction in prokaryotes) discusses the definitions of homologs, orthologs, paralogs and their subtypes, and reviews the existing gene family databases and the approaches to homology assignment as well as family-superfamily classification systems. The chapter focuses on the automated methods of orthologous gene prediction. It describes the reciprocal best BLAST hit method, reciprocal smallest distance method, tree reconciliation algorithm, and the phylogenetic clustering algorithm BranchClust.

In Chapter 10 (Computational elucidation of operons and uber-operons), we describe the basics of operons as the basic units of transcription regulation as well as their ties to biological pathways and networks. In addition, we discuss a relatively new and less well-studied layer of genomic structure, called uber-operons. The chapter presents a number of basic ideas as well as computational methods for operon and uber-operon prediction, plus relevant prediction servers publicly available on the Internet. We also showcase one study on the evolution of operons, suggesting possible rules for operon evolution.

Chapter 11 (Prediction of regulons through comparative genome analyses) first introduces the typical structure of a regulon in prokaryotes, and the special regulon prediction problem as well as the genome-wide de novo regulon prediction problem. It then links these problems to the prediction problem of cis-regulatory binding sites. The chapter details a widely used phylogenetic foot-printing motif finding algorithm as well as a genome-wide scanning procedure for solving the special regulon prediction problem. For the more challenging genome-wide de novo prediction of regulons, the chapter introduces a recently developed phylogenetic foot-printing based algorithm. Examples of practical applications are also given for each described computational procedure.

Chapter 12 (Prediction of biological pathways through data mining and information fusion) presents a few approaches to biological pathway construction
using multiple sources of information, including transcriptomic data, genomic data, proteomic data as well as protein-protein and protein-DNA interaction data. Several computational algorithms for these predictions are reviewed in this chapter. A comprehensive description of how to estimate parameters in metabolic pathways is also included in this chapter.

Chapter 13 (Microbial pathway models) emphasizes the importance of mathematical modeling approaches to understanding microbial systems. It reviews traditional approaches of enzyme kinetics and leads the reader to the state of the art in metabolic modeling. Particular focus is placed on Biochemical Systems Theory, which in several hundred publications has proven an invaluable tool for designing, analyzing, manipulating, and optimizing biological systems, even in situations where information on the target network is uncertain or partially missing.

Chapter 14 (Metagenomics) reviews a new scientific endeavor, metagenomics, which is emerging with the convergence of sequencing and computational technologies focused on communities of microbes. The techniques and strategy of metagenomics enable the exploration of the genomes and the natural world of microbes without their prior isolation and cultivation. While building on pioneering research over roughly the past two decades and its enhancement by the convergent technologies, this exceptionally interdisciplinary endeavor, which also brings in biogeochemistry and ecological, evolutionary and environmental biology, as well as the wide range of research in microbiology and genomics, is just being defined. To invite the reader to participate in this endeavor, the chapter provides a brief overview of the current state of knowledge and the opportunities that create the excitement and bring together scientists from such diverse disciplines. Research observations in metagenomics, for example, have already demonstrated the vast diversity of microbial proteins and suggest extensive implications for applied life sciences (from environmental science to medicine), as well as for our fundamental understanding of biology.

Ying Xu and J. Peter Gogarten


CONTENTS

Preface
List of Contributors
Acknowledgments

1. General Characteristics of Prokaryotic Genomes
   Jan Mrázek and Anne O. Summers

2. Genes in Prokaryotic Genomes and Their Computational Prediction
   Rajeev K. Azad

3. Evolution of the Genetic Code: Computational Methods and Inferences
   Greg Fournier

4. Dynamics of Prokaryotic Genome Evolution
   Pascal Lapierre

5. Mobile Genetic Elements and Their Prediction
   Morgan G.I. Langille, Fengfeng Zhou, Amber Fedynak, William W.L. Hsiao, Ying Xu and Fiona S.L. Brinkman

6. Horizontal Gene Transfer: Its Detection and Role in Microbial Evolution
   J. Peter Gogarten and Olga Zhaxybayeva

7. Genome Reduction During Prokaryotic Evolution
   Francisco J. Silva and Amparo Latorre

8. Comparative Mechanisms for Transcription and Regulatory Signals in Archaea and Bacteria
   Agustino Martínez-Antonio and Julio Collado-Vides

9. Computational Techniques for Orthologous Gene Prediction in Prokaryotes
   Maria Poptsova

10. Computational Elucidation of Operons and Uber-operons
    Phuongan Dam, Fenglou Mao, Dongsheng Che, Ping Wan, Thao Tran, Guojun Li and Ying Xu

11. Prediction of Regulons Through Comparative Genome Analyses
    Zhengchang Su, Guojun Li and Ying Xu

12. Prediction of Biological Pathways Through Data Mining and Information Fusion
    Fenglou Mao, Phuongan Dam, Hongwei Wu, I-Chun Chou, Eberhard Voit and Ying Xu

13. Microbial Pathway Models
    Siren R. Veflingstad, Phuongan Dam, Ying Xu and Eberhard O. Voit

14. Metagenomics
    Kayo Arima and John Wooley

References
Index

LIST OF CONTRIBUTORS

Agustino Martínez-Antonio
Departamento de Ingeniería Genética
Instituto Politécnico Nacional
CINVESTAV, IPN. Irapuato, Gto. México
[email protected]

Kayo Arima
University of California San Diego
La Jolla, CA 92093-0043, USA
[email protected]

Rajeev Azad
Department of Biological Sciences
University of Pittsburgh
Pittsburgh, PA 15260, USA
[email protected]

Fiona S.L. Brinkman
Department of Molecular Biology and Biochemistry
Simon Fraser University
Burnaby, BC, Canada
[email protected]

Dongsheng Che
Department of Biochemistry & Molecular Biology
University of Georgia
Athens, GA 30602-7229, USA
[email protected]

I-Chun Chou
Department of Biomedical Engineering
Georgia Institute of Technology
Atlanta, GA 30332, USA
[email protected]

Julio Collado-Vides
Computational Genomics Program
Universidad Nacional Autónoma de México
Av. Universidad s/n, Col. Chamilpa, C.P. 62210
Cuernavaca, Morelos, México
[email protected]

Phuongan Dam
Department of Biochemistry & Molecular Biology
University of Georgia
Athens, GA 30602-7229, USA
[email protected]

Amber Fedynak
Department of Molecular Biology and Biochemistry
Simon Fraser University
Burnaby, BC, Canada
[email protected]

Greg Fournier
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269-3125, USA
[email protected]

J. Peter Gogarten
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269-3125, USA
[email protected]

William W.L. Hsiao
Department of Molecular Biology and Biochemistry
Simon Fraser University
Burnaby, BC, Canada
[email protected]

Morgan G.I. Langille
Department of Molecular Biology and Biochemistry
Simon Fraser University
Burnaby, BC, Canada
[email protected]

Pascal Lapierre
Bioinformatics Facility, Bioservices Center
University of Connecticut
Storrs, CT 06269-3149, USA
[email protected]

Amparo Latorre
Institut Cavanilles de Biodiversitat i Biologia Evolutiva and Departament de Genètica
Universitat de València, 46071 Valencia, Spain
[email protected]

Guojun Li
Department of Biochemistry & Molecular Biology
University of Georgia
Athens, GA 30602-7229, USA
[email protected]

Fenglou Mao
Department of Biochemistry & Molecular Biology
University of Georgia
Athens, GA 30602-7229, USA
[email protected]

Jan Mrázek
Department of Microbiology and Institute of Bioinformatics
University of Georgia
Athens, GA 30602, USA
[email protected]

Maria Poptsova
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269-3125, USA
[email protected]

Francisco J. Silva
Institut Cavanilles de Biodiversitat i Biologia Evolutiva and Departament de Genètica
Universitat de València, 46071 Valencia, Spain
[email protected]

Zhengchang Su
Department of Computer Science
University of North Carolina
Charlotte, NC 28233, USA
[email protected]

Anne Summers
Department of Microbiology
University of Georgia
Athens, GA 30602, USA
[email protected]

Thao Tran
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, GA 30332, USA
[email protected]

Siren R. Veflingstad
Department of Biomedical Engineering
Georgia Institute of Technology
Atlanta, GA 30332, USA
and
Max Planck Institute for Biochemistry
Martinsried, Germany
[email protected]

Eberhard O. Voit
Department of Biomedical Engineering
Georgia Institute of Technology
Atlanta, GA 30332, USA
[email protected]

Ping Wan
College of Life Science
Capital Normal University
Beijing, P. R. China
[email protected]

John Wooley
University of California San Diego
La Jolla, CA 92093-0043, USA
[email protected]

Hongwei Wu
School of Electrical and Computer Engineering
Georgia Institute of Technology
Savannah, GA 31407, USA
[email protected]

Ying Xu
Department of Biochemical and Molecular Biology and Institute of Bioinformatics
University of Georgia
Athens, GA 30602, USA
[email protected]

Olga Zhaxybayeva
Department of Biochemistry and Molecular Biology
Dalhousie University
Halifax, Nova Scotia B3H 1X5, Canada
[email protected]

Fengfeng Zhou
Department of Biochemical and Molecular Biology and Institute of Bioinformatics
University of Georgia
Athens, GA 30602, USA
ff[email protected]

ACKNOWLEDGMENTS

During the editing of this book, we, the editors, have received tremendous help from many friends, colleagues and family members, to whom we would like to take this opportunity to express our deep gratitude and appreciation. First we would like to thank Dr. Limsoon Wong of the National University of Singapore, who encouraged us to start this book project on behalf of the World Scientific Publishing Company. We are very grateful to the following colleagues who have critically reviewed the drafts of the chapters of the book at various stages: Siv G.E. Andersson, S.J.W. Busby, Luis Delay, Dan Drell, Jonathan Filee, Greg Fournier, Derrick Fouts, Maria Pilar Francino, Robert Friedman, Joel Graber, Roy Gross, Loren Hauser, Jinling Huang, Anders Krogh, Pascal Lapierre, John Leigh, Shoudan Liang, Bill Majoros, Trina Montgomery, Andres Moya, Jan Mrazek, Thane R. Papke, Haluk Resat, Heidi Sofia, Victor Solovyev, Kristen Swithers, Ariane Toussaint, Kalyan Vinnakota, Xiufeng Wan, Yuri Wolf, Alex Worden, Dong Xu, and Olga Zhaxybayeva. Their invaluable input on the scientific content, on the pedagogical style, and on the writing style helped to improve these book chapters significantly. We also want to thank Ms Kristen Swithers of the University of Connecticut for providing artwork for the book cover; and Ms Joan Yantko of the University of Georgia for her tireless help on numerous fronts in this book project, including taking care of a large number of email communications between the editors and the authors, chasing busy authors to get their revisions and other material, helping to put together the index list, and integrating the reference list. Last but not least, we want to thank our families for their constant support and encouragement while we worked on this book project.


CHAPTER 1 GENERAL CHARACTERISTICS OF PROKARYOTIC GENOMES

JAN MRÁZEK and ANNE O. SUMMERS

1. Introduction

1.1. The Replicon Concept and Classification of Replicons

The term chromosome for “colored body” from the Greek roots “chroma” (color) and “soma” (body) was first applied by 19th century biologists to a structure in the center of animal and plant cells that stained deeply with the basic dye fuchsin. Later microscopic studies revealed that this apparently singular structure was actually composed of numerous smaller bodies, each consisting of two strands of equal length, and the term was pluralized to refer to all of these structures as “the chromosomes”. In work extending into the 20th century, eukaryotic chromosomes were found to contain DNA, RNA, and proteins, and because they replicated and were passed to each daughter cell it was believed they contained the information for inheritance. By the mid-20th century it was determined with the use of bacteria and their viruses (called bacteriophages) that it was the DNA that actually carried the genetic information; the RNA and proteins had a structural role only. In this time-frame it was also found that, unlike the linear structures of eukaryotic chromosomes, the chromosome of the model bacterium E. coli (and many others examined since) is a large, covalently closed circle of double stranded DNA. In due course, the genetic material of phages and other viruses came to be called chromosomes although many of these obligate intracellular parasites have RNA rather than DNA as their genetic material (Alberts et al., 2002).

It was first discovered by genetic crosses that bacteria also have multiple distinct genetic linkage groups, a term that had been applied to the independently assorting chromosomes in animals, plants and fungi. However, unlike the chromosomes of eukaryotes, these bacterial plasmids, as they came to be called (Lederberg et al., 1952), were not present in all strains of the same species, and those strains that carried plasmids typically had different ones. Plasmids conferred on the cells that had them interesting properties such as antibiotic resistances (Foster, 1983; Tenover, 2006). Moreover, plasmids can move themselves as well as genes of the bacterial chromosome from one cell to another by a mechanism called conjugation (Frost et al., 1994). With the advent of molecular biology in the early 1970’s, the physical
nature of most bacterial plasmids was demonstrated to be covalently closed circles of double stranded DNA, of different sizes ranging from a few thousand base pairs to hundreds of thousands of base pairs, that is generally from 1000- to 10-fold smaller than the chromosome of the cell. The reality of distinct replicating entities in bacterial cells led to the adoption of the general term “replicon”, which covers them all (Jacob and Brenner, 1963). Thus, a replicon is any genetic entity that controls its own replication by virtue of encoding genes (both proteins, functional RNAs, and sites on its DNA with which they interact) that determine when and how often it will initiate replication. The term has also been applied to eukaryotic chromosomes (Vanicek and Klimek, 1971). Some replicons (notably viruses and bacteriophages in addition to chromosomes) also encode the polymerases needed for copying their genetic material (RNA or DNA); in contrast, plasmids typically rely on the endogenous chromosome-encoded polymerases to complete their replication (Actis et al., 1999). When it became possible to sequence the entire genetic complements of cells, the term genome was used to cover the suite of all stable replicons in a given organism. For eukaryotes this includes all of their linear chromosomes, organellar DNA in mitochondria and chloroplasts and also, for fungi, linear RNA (Schaffrath and Breunig, 2000) and linear and circular DNA (Fukuhara, 1995; Futcher, 1988) plasmids. Similarly, the bacterial genome consists of one or more chromosomes and linear and circular plasmids and temperate bacteriophages (whether inserted into the chromosome or replicating separately; Fig. 1). When the first bacterial genomes

Fig. 1. Hierarchy of replicons based on chemistry and topology. *Ends of linear replicons can either be free (unblocked) 5′ phosphates and 3′ hydroxyls or blocked in short hairpins (in SS) or covalently crosslinked (DS). **SS RNA replicons can either be directly translatable mRNA’s (+) or complementary to mRNA (−), needing to be copied before used as mRNA. ?? Indicates there are as yet no examples of these categories of replicons.

General Characteristics of Prokaryotic Genomes

3

Fig. 2. Plasmids in strains of the Salmonella Reference Collection B. The dark bands in each lane are plasmids of various sizes stained with a fluorescent dye. The right 3 lanes are sequenced plasmids used for molecular weight standards. Over 60% of the SARB collection have more than one plasmid, typically larger than 50 kb. The same is true of the E. coli Reference Collection, ECOR (data not shown).

were sequenced the strains used had been kept in culture so long that they had lost all or most of the natural plasmids typically found in wild strains. Presently, only 30 % of published prokaryotic genomes include one or more plasmids. This fraction is likely an underestimate as the majority of strains in carefully kept standard reference collections of the enterobacteriaceae have several plasmids each (Fig. 2) and freshly isolated staphylococci typically have 5 or 6 plasmids in a wide range of sizes (data not shown). Indeed, some bacteria such as Borrelia have a mixture of over 20 linear and circular plasmids (Beaurepaire and Chaconas, 2007; Schwan et al., 1988). Presently, the term genome refers to the entire genetic complement of an organism, which typically consists of multiple replicons including chromosomes, plasmids, and other genetic elements. Chromosomes carry genes for basic cell functions (housekeeping genes). In contrast, plasmids are not typically required for laboratory survival of the cells, although they can provide essential resistance to stresses and ability to survive in particular environments. However, the definition of the genome as the entire collection of replicons in the cell carries practical difficulties. Plasmids and phages can be lost from a strain during laboratory cultivation on artificial medium and, as noted above, different strains of a bacterial species might


J. Mrázek & A. O. Summers

be very similar at the chromosomal level yet have a completely different complement of plasmids. Thus, the term “genome” or even “complete genome” must be taken with the caveat that it can only refer to the DNA sequence of a particular strain at the time it was sequenced. The chromosomes and very large (> 400 kb) plasmids can be quite stable in many genera. However, the variable (or mobile) fraction which can constitute 10 to 12% of the genetic information in freshly cultivated isolates of some genera will likely not be the same in every sequenced example of the species.

1.2. Physical Organization of Replicons in the Cell

Eukaryotic cells are compartmentalized and the chromosomes are located in the nucleus, which has been extensively studied for a century and a half. On the other hand, prokaryotic cells are not compartmentalized and investigations of the detailed physical structure and organization of chromosomes in prokaryotic cells are more recent. Most well studied bacteria and archaea have circular chromosomes that replicate bidirectionally from a single origin, although spirochetes, Streptomyces, and a few other bacteria (Hinnebusch and Tilly, 1993) have linear chromosomes, and some archaea have more than one origin of replication (Mrázek and Karlin, 1998; Myllykallio et al., 2000; Robinson et al., 2004). Prokaryotic chromosomes are not enclosed in a separate compartment, but their physical structure is nonetheless highly ordered. The organized DNA structure in prokaryotic cells is referred to as the nucleoid and includes DNA as well as proteins that help stabilize the nucleoid structure. Compared to eukaryotic chromatin the bacterial nucleoid has a low protein content. The emerging model involves a DNA molecule assembled in dynamic supercoiled loops emanating radially from a protein-organized core. In rod-shaped bacterial cells, the origin of replication is located near one of the poles at the start of replication, and the segregation of newly synthesized chromosomes in dividing cells is achieved by the origins of the two chromosomes being pushed to the opposite poles of the cell. Several recent reviews summarize the current state of knowledge regarding molecular mechanisms of chromosome organization and segregation in model organisms (Draper and Gober, 2002; Gitai et al., 2005; Thanbichler and Shapiro, 2006). Physical structures of plasmids and phages are diverse and involve both circular and linear DNA in single stranded or double stranded forms.
Both plasmids and bacteriophages are of importance because they mediate two of the three known forms of horizontal gene transfer in prokaryotes, conjugation and transduction, respectively. In addition to their ability to replicate themselves, plasmids and phages each carry specialized proteins that enable the physical transfer of their DNA between host bacterial cells. For the transfer of plasmids between cells a single strand of the plasmid DNA passes through a conjugative pore established by a complex structure that also assembles the pilus, a tubular hair-like grappling device that holds the donor cell and recipient cell surfaces together (Lawley et al., 2003). As this strand enters the recipient cell it

is copied by the recipient’s DNA polymerase and eventually circularizes and begins to replicate. Such conjugative plasmids can also transfer a single strand of the entire donor cell chromosome or parts thereof, which does not circularize upon entering the recipient but can recombine with the recipient chromosome via homologous or illegitimate recombination. For bacteriophages the proteins involved in gene transfer are their viral coat proteins and other proteins involved in packing DNA into the viral capsid. As a virus replicates it can accidentally package host DNA into a phage capsid that is released upon cell lysis. Such a “transducing particle” can infect another cell because the infection machinery is built into the phage capsid. The phage-borne donor DNA can recombine with the homologous regions of the recipient’s chromosome, thus contributing to horizontal gene transfer. Phages and plasmids have been most extensively studied in the Enterobacteriaceae and the staphylococci but similar replicons have been observed and studied in a more limited manner in nearly every defined bacterial or archaeal genus. Bioinformatic analysis of these mobile elements per se has been very limited (Frost et al., 2005) but has recently been increasing (Suzuki et al., 2008), facilitated by databases such as the U.S. Dept. of Energy Joint Genome Institute’s Integrated Microbial Genomes site (http://img.jgi.doe.gov/cgi-bin/pub/main.cgi) (Markowitz et al., 2006a; Markowitz et al., 2006b) and ACLAME (http://aclame.ulb.ac.be) (Leplae et al., 2006).

2. Overall Properties of Prokaryotic Chromosomes

2.1. Size and Gene Content

Eukaryotic genomes vary widely in size, largely due to variable amounts of noncoding DNA. For example, the human genome of 3 Gb size is about 250 times larger than the yeast Saccharomyces cerevisiae genome, yet contains fewer than five times as many genes as the yeast genome. The main difference is in the amount of noncoding DNA: about 70% of the yeast genome is protein-coding but only 2–3% of the human genome is estimated to encode proteins. In contrast, all prokaryotic genomes are rather compact and the chromosome sizes correlate well with the number of genes (Fig. 3). Protein coding regions typically occupy about 85% of a prokaryotic genome and the average gene density is a little less than one gene per 1 kb of DNA sequence. Exceptions involve host-dependent pathogens or symbionts that recently lost their ability to survive outside a host. Adaptation to a host-dependent lifestyle is accompanied by loss of genes that are no longer needed (Moran, 2002). Mycobacterium leprae, which unlike its close relatives Mycobacterium tuberculosis and Mycobacterium bovis has never been successfully cultivated outside a host, is an example (Cole et al., 2001). Only about 50% of the M. leprae chromosome encodes proteins, the lowest protein-coding fraction among completely sequenced prokaryotic genomes to date. The M. leprae chromosome contains a large number

Fig. 3. Correlation of chromosome size and gene content in prokaryotic chromosomes. Each point represents a prokaryotic chromosome where the abscissa signifies the chromosome length and ordinate the number of annotated genes, both in logarithmic scale. Outliers are identified by species names.

of pseudogenes — former functional genes that have been inactivated by mutations. The differences in the fraction of protein-coding sequences among prokaryotes are nowhere close to the nearly hundred-fold differences among eukaryotic genomes. Archaea are similar to Bacteria in terms of genome and chromosome sizes and high gene density (Fig. 3). The relative invariance of the protein-coding fraction of prokaryotic genomes does not mean that all prokaryotic genomes are similar in size. However, unlike eukaryotes, the genome size correlates very well with gene content (Fig. 3). Table 1 shows a list of the smallest prokaryotic genomes sequenced to date. They all involve parasites or symbionts adapted to the host environment. Some of these organisms can grow in laboratory conditions but they require specifically formulated media and are not known to grow in a natural environment outside a host. In contrast, some of the largest prokaryotic chromosomes are found in Myxococcus xanthus and Streptomyces species, free-living bacteria that undergo a complex developmental cycle, requiring a large gene repertoire. In fact, the largest bacterial genomes carry more protein coding genes than some eukaryotes. In general, specialist species — those living in specialized niches — have small genomes and fewer genes than generalist species — those able to grow in diverse environments — which have large genomes and large number of genes (Shimkets, 1998).

Table 1. Fifteen prokaryotes with smallest genomes among those completely sequenced.∗

Organism                                   Genome size (kb)   G+C content (%)   G+C at codon site 3 (%)   Number of annotated genes
Carsonella rudii                           160                16.6              14.9                      182
Nanoarchaeum equitans                      491                31.6              24.7                      536
Mycoplasma genitalium                      580                31.7              22.8                      477
Buchnera aphidicola (3 strains sequenced)  616–641            25.3–26.3         12.4–14.8                 504–564
Baumannia cicadellinicola                  686                33.2              19.5                      595
Wigglesworthia glossinidia                 698                22.5              11.6                      611
Candidatus Blochmannia floridanus          706                27.4              15.7                      583
Aster yellows witches-broom phytoplasma    707                26.9              19.5                      671
Ureaplasma parvum                          752                25.5              12.2                      614
Mycoplasma mobile                          777                24.9              10.8                      633
Candidatus Blochmannia pennsylvanicus      792                29.6              19.9                      610
Mesoplasma florum                          793                27.0              12.5                      682
Mycoplasma synoviae                        799                28.5              16.2                      672
Mycoplasma pneumoniae                      816                40.0              41.7                      689
Neorickettsia sennetsu                     859                41.1              37.4                      932
Onion yellows phytoplasma                  861                27.7              19.9                      754

∗ The list includes complete genomes available in August 2006, except for the outlier C. rudii, whose genome was subsequently added to the list and included in Figs. 3–8.
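The gene densities implied by Table 1 can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the helper names `genes_per_kb` and `density_table` are hypothetical, and the dictionary holds a subset of rows copied from Table 1 (Buchnera aphidicola is left out because the table reports ranges for its three strains).

```python
# Rows copied from Table 1: organism -> (genome size in kb, annotated genes).
# Only a subset is shown; the selection is arbitrary and for illustration only.
TABLE_1 = {
    "Carsonella rudii": (160, 182),
    "Nanoarchaeum equitans": (491, 536),
    "Mycoplasma genitalium": (580, 477),
    "Wigglesworthia glossinidia": (698, 611),
    "Mycoplasma pneumoniae": (816, 689),
    "Neorickettsia sennetsu": (859, 932),
}


def genes_per_kb(size_kb, n_genes):
    """Average gene density in annotated genes per kb of sequence."""
    if size_kb <= 0:
        raise ValueError("genome size must be positive")
    return n_genes / size_kb


def density_table(rows):
    """Return (organism, genes per kb) pairs sorted by increasing density."""
    densities = [
        (name, genes_per_kb(size, genes)) for name, (size, genes) in rows.items()
    ]
    return sorted(densities, key=lambda item: item[1])


for name, density in density_table(TABLE_1):
    print(f"{name:28s} {density:.2f} genes/kb")
```

Dividing the annotated gene count by the genome size ignores differences in gene length and annotation practice, so the resulting densities are only a rough guide.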

2.2. Why Are Prokaryotic Chromosomes Small?

High gene density and small sizes of prokaryotic chromosomes suggest that prokaryotes tend to lose DNA segments that they do not need. This is most obvious in comparisons of obligate parasites or symbionts with closely related free-living bacteria. A proposed evolutionary model dominated by genome reduction has emerged from comparisons among genomes of symbiotic and free-living γ-proteobacteria (Dale et al., 2003; Moran, 2003; Moran and Plague, 2004). The initial period of adaptation to obligate symbiotic lifestyle is characterized by loss of some DNA repair capacities, in particular the RecBCD pathway involved in recombinational repair. This is accompanied by proliferation of repeats, which facilitate frequent genome rearrangements and further loss of genes via homologous recombination, and by a decrease in the genomic G+C content (Moran, 2002) (see also Fig. 4). At least two mechanisms can facilitate the latter. Spontaneous deaminations convert cytosine to uracil, and if not corrected these give rise to G−C → A−T mutations during the next round of replication (Glass et al., 2000). In addition, early in vitro experiments showed that adenine is most often inserted opposite abasic lesions by DNA polymerase (Randall et al., 1987), possibly because adenine is more hydrophobic than other bases and thus might have higher affinity for the active site of the DNA polymerase (Kypr, 1988). The initial period of rapid changes and gene loss is succeeded by a period of relative stability and a slow decay of additional genes. This evolutionary scenario of genome reduction probably applies to both obligate symbionts and obligate pathogens (Moran, 2002). It is also consistent with the data in Fig. 3, where the correlation between the chromosome size and gene content is

Fig. 4. Relationship between chromosome size and overall G+C content in prokaryotic chromosomes.

strong and linear but a clear shift occurs at chromosome sizes between 2 and 3 Mb. Chromosomes smaller than 2 Mb have fewer genes per unit length than expected on the basis of the gene density of the chromosomes larger than 3 Mb. Exceptions include Nanoarchaeum equitans, an archaeon which only grows in co-culture with another archaeon Ignicoccus (Waters et al., 2003), and Neorickettsia sennetsu, for which the ratio of chromosome size and gene content matches the projection from larger chromosomes. Carsonella rudii, whose genome contains a single chromosome of only 160 kb length (Nakabachi et al., 2006) represents an extreme case of genome reduction. However, it shares some characteristics with organellar genomes and might represent an evolutionary stage between a bacterium and an organelle (Galperin, 2006). M. leprae and Sodalis glossinidius with chromosomes of 3.3 Mb and 4.2 Mb length, respectively, contain fewer genes than expected compared to other prokaryotes (Fig. 3). Both these bacteria depend on their respective hosts for growth and might be in the early stage of genome reduction. Consistent with this hypothesis, the M. leprae chromosome is smaller than those of closely related mycobacteria but not yet of the typical size of obligate intracellular pathogens (usually around 1 Mb). The proposed evolutionary scenario of genome reduction relates to the order of events accompanying adaptation to the host-dependent lifestyle but on its own does not explain why all prokaryotic genomes apparently carry little DNA that is not beneficial. The compact character of prokaryotic genomes can be explained by selection (if smaller genomes increase the fitness of the organism) or by a

mutational bias (if deletions are more frequent than insertions). One potential source of selection could relate to bacterial replication from a single origin. With only two replication forks progressing through the chromosome, replication of the complete chromosome takes a significant amount of time — about 37 minutes for E. coli K12 (Churchward and Bremer, 1977). Smaller chromosomes could reduce the time required for replication and thus provide a selective advantage to bacteria with smaller chromosomes. However, if this hypothesis were correct then one might expect that small genomes would be most beneficial to fast growing bacteria and that the chromosome size would correlate with doubling times in exponential growth phase, but such a correlation has not been observed (Mira et al., 2001). There might be other, less direct selective constraints on chromosome size. For example, pathogenic bacteria might benefit from eliminating unnecessary genes encoding antigens that could aid the host immune system in recognizing their cells. Keeping the genomes small could also be a means to reduce the overall amount of repetitive DNA and concomitantly reduce the potential for illegitimate recombination. There is, however, evidence supporting the deletion bias hypothesis. Comparisons between functional genes and homologous pseudogenes allow assessments of direction of the mutations, i.e., in which lineage the mutation occurred. Ochman and coworkers (Mira et al., 2001) used such comparisons to determine that deletions are more frequent than insertions in bacterial genomes. In addition, homologous recombination can lead to deletions of thousands of base pairs comprising one or more complete genes (Aras et al., 2003). The deletion bias generates a tendency towards genome reduction, which is balanced by selective constraints against the loss of beneficial genes, and by acquisition of new genes by lateral gene transfer and gene duplications. 
The size of a prokaryotic genome is therefore determined by equilibrium between the acquisition and loss of genetic material within the limits imposed by selection. In eukaryotes, the genome size and fraction of intergenic DNA sequences have been correlated with the rate of spontaneous DNA loss (Petrov, 2001; Petrov et al., 2000). More details on mechanisms and consequences of genome reduction and related evidence are provided in Chapter 7 of this book.

2.3. G+C Content

Prokaryotic genomes are remarkably variable in their G+C content. Interestingly, the G+C content correlates with the genome and chromosome size (Moran, 2002) (Fig. 4). Specifically, the small genomes of obligate pathogens and symbionts are A+T rich, consistent with the genome reduction scenario described in the previous section. In all genera, protein coding genes tend to have slightly higher G+C content (about 5–10% on average) than intergenic sequences (Fig. 5). However, the G+C content in genes varies significantly among the three codon positions. The second codon position is the least variable (Fig. 6). The reasons for the low variance of the G+C content at codon site 2 likely relate to selective constraints and the

Fig. 5. Relationship between G+C content in genes and intergenic regions.

Fig. 6. Relationship between G+C content in intergenic regions and codon site 2.

requirement that genes encode functional proteins. Not only are all replacements between A−T and G−C base pairs at codon site 2 nonsynonymous (i.e., lead to amino acid substitutions) but the second codon position also has the largest effect on the chemical properties of the amino acid side chains. In particular, hydrophobic amino acids often have codons with a T at site 2 whereas strongly hydrophilic amino acids mostly use A at the same site (Kypr and Mrázek, 1987). Hydrophobic interactions are essential for stable protein structures and the need to maintain a balance between hydrophobic and hydrophilic amino acids imposes limits on the variance of G+C content at the second codon position. G+C content at codon site 1 varies almost to the same extent as G+C in intergenic regions (Fig. 7). Interestingly, the relationship is not completely linear, with a tendency to flatten off at high G+C values. In contrast, the G+C at codon site 3 varies even more than G+C in the intergenic regions (Fig. 8). Codon site 3 is under weaker selective constraints as most site 3 substitutions are synonymous. Hence, the codon site 3 G+C content is more likely to reflect the general mutational biases of the cell. However, similar assumptions can be made about intergenic regions. The lower variance of G+C content in intergenic regions suggests that they are subject to selective constraints, possibly related to periodic patterns of short A and T runs that might affect chromosome organization in the cell (Herzel et al., 1999; Mrázek, 2006; Tolstorukov et al., 2005). Alternatively, the high intergenomic variance of G+C content at codon site 3 could be in part due to compensatory effects for the low variance at site 2. This hypothetical mechanism requires that selective

Fig. 7. Relationship between G+C content in intergenic regions and codon site 1.

Fig. 8. Relationship between G+C content in intergenic regions and codon site 3.

constraints — perhaps in addition to biased mutation rates — maintain the overall G+C balance and the high G+C variance at the synonymous codon sites serves to bring the overall G+C content close to some optimal overall G+C level.
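The position-specific G+C values discussed in this section can be computed directly from annotated coding sequences. The following Python sketch is illustrative only: the function name `gc_by_codon_site` and the toy sequence are assumptions introduced here, and the input is assumed to be an in-frame coding sequence.

```python
def gc_by_codon_site(cds):
    """Return the G+C fractions at codon sites 1, 2, and 3 of a coding sequence.

    The sequence is assumed to start in frame; bases of an incomplete
    final codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    fractions = []
    for site in range(3):
        # Every third base, starting at the given codon site.
        bases = cds[site:usable:3]
        gc_count = sum(base in "GC" for base in bases)
        fractions.append(gc_count / len(bases))
    return tuple(fractions)


# Toy example with five codons: ATG GCG CTT AAA GGC
gc1, gc2, gc3 = gc_by_codon_site("ATGGCGCTTAAAGGC")
print(f"site 1: {gc1:.2f}, site 2: {gc2:.2f}, site 3: {gc3:.2f}")
```

For a whole genome, the same counts would be accumulated over all annotated genes before dividing, so that long genes contribute in proportion to their length.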

2.4. Oligonucleotide Composition and Genome Signature

If two nucleotides X and Y in a DNA sequence occur independently of each other then the dinucleotide XY should be found approximately at the frequency $f_{XY} \approx f_X f_Y$. Significant deviations of $f_{XY}$ from $f_X f_Y$ suggest that the particular dinucleotide tends to be avoided (under-represented) or used more often than expected (over-represented) for some biological reasons. Karlin and coworkers (Blaisdell et al., 1996; Campbell et al., 1999; Karlin and Burge, 1995; Karlin et al., 1998a; Karlin and Cardon, 1994; Karlin and Ladunga, 1994; Karlin et al., 1997) defined dinucleotide relative abundances as $\rho_{XY} = f_{XY} / (f_X f_Y)$. For double-stranded DNA, a symmetrized version is often used in the form $\rho^*_{XY} = f^*_{XY} / (f^*_X f^*_Y)$, where the frequencies $f^*_{XY}$, $f^*_X$, and $f^*_Y$ are calculated from the DNA sequence concatenated with its inverted complement. Based on statistical tests with real and random DNA sequences these authors proposed thresholds 0.78 and 1.23 as suitable benchmarks for assessing whether a dinucleotide is significantly under- or over-represented, respectively, in a DNA sequence (Karlin and Cardon, 1994; Karlin and Ladunga, 1994). Comparisons among DNA segments from the same genome and from different genomes revealed that the dinucleotide relative abundances can vary significantly among different genomes but are remarkably stable within a

genome, and therefore constitute a “genome signature” (Campbell et al., 1999; Karlin and Burge, 1995; Karlin et al., 1997). In this context, the genome signature refers to the vector of the sixteen dinucleotide relative abundances $\{\rho^*_{XY}\}$. The high intragenomic stability of dinucleotide relative abundances was actually discovered long before the era of high-throughput DNA sequencing. In the 1960s and 1970s, Kornberg and coworkers (Josse et al., 1961) followed by Subak-Sharpe and coworkers (Russell et al., 1973; Russell and Subak-Sharpe, 1977; Russell et al., 1976) performed biochemical experiments to measure the same dinucleotide relative abundances that were later assessed computationally from the DNA sequences, and concluded that they constitute “general designs” for DNA extracted from the same or similar organisms, analogous to the genome signature. Table 2 displays the $\rho^*_{XY}$ values for several genomes, emphasizing the dinucleotide compositional diversity among prokaryotic genomes. Larger datasets are available in the literature (Blaisdell et al., 1996; Campbell et al., 1999; Karlin et al., 1997) and dinucleotide relative abundances can be obtained for any DNA sequence at http://www.cmbl.uga.edu/software/signature.html. Note that the dinucleotide relative abundances factor out the mononucleotide frequencies and are therefore independent of G+C content. The exact molecular mechanisms that generate and maintain the genome signature are not known but probably involve a combination of selective constraints and context-dependent mutational biases (Blaisdell et al., 1996; Campbell et al., 1999; Karlin and Burge, 1995; Karlin et al., 1997). The concept of dinucleotide relative abundances can be extended to longer oligonucleotides. This generally involves application of Markov chain models to compare the observed frequency of an oligonucleotide $f^{\mathrm{Obs}}$ and its expected frequency $f^{\mathrm{Exp}}$ estimated from the known frequencies of shorter oligonucleotides.
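As an illustration of how the symmetrized dinucleotide relative abundances can be obtained from a sequence, a minimal Python sketch follows. The function names are hypothetical and this is not the software referenced above. Two simplifications are assumptions of the sketch: instead of literally concatenating the sequence with its inverted complement, each strand is counted separately, which avoids one artificial dinucleotide at the junction, and characters other than A, C, G, and T are skipped.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def inverted_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def symmetrized_relative_abundances(seq):
    """Return a dict mapping each dinucleotide XY to its symmetrized
    relative abundance, f*_XY / (f*_X f*_Y), counted over both strands."""
    seq = seq.upper()
    mono, di = Counter(), Counter()
    for strand in (seq, inverted_complement(seq)):
        mono.update(base for base in strand if base in "ACGT")
        pairs = (strand[i:i + 2] for i in range(len(strand) - 1))
        di.update(pair for pair in pairs if pair in DINUCLEOTIDES)
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence contains no valid dinucleotide")
    rho = {}
    for dinucleotide in DINUCLEOTIDES:
        x, y = dinucleotide
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[dinucleotide] / n_di
        # Undefined when one of the two nucleotides is absent.
        rho[dinucleotide] = observed / expected if expected > 0 else float("nan")
    return rho


def flag_extremes(rho, low=0.78, high=1.23):
    """Label values at or beyond the thresholds quoted in the text."""
    flags = {}
    for dinucleotide, value in rho.items():
        if value <= low:
            flags[dinucleotide] = "under-represented"
        elif value >= high:
            flags[dinucleotide] = "over-represented"
    return flags
```

Because both strands are counted, complementary dinucleotides such as AG and CT receive identical values, which is why only ten of the sixteen values are distinct.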
A Markov chain is a stochastic process such that the probability of a future state of the studied system only depends on the immediately preceding state but not any other past states. In DNA sequence analysis, there are four possible states represented by the four nucleotides A, C, G, and T, and the Markov chain is defined by the sixteen transition probabilities $p(X \to Y) = \Pr(x_i = Y \mid x_{i-1} = X)$, where $\{x_i\}$ is a sequence of letters A, C, G, or T. The transition probabilities can be estimated from a DNA sequence as $p(X \to Y) \approx f_{XY} / f_X$. These considerations can be immediately used to estimate the expected frequency of a trinucleotide XYZ based on the Markov chain model as $f^{\mathrm{Exp}}_{XYZ} = f_{XY} f_{YZ} / f_Y$. In a generalization of the Markov chain model, the nth order Markov chain describes a series of states where the next state depends on n preceding states. A 2nd order Markov chain can be used to estimate expected frequencies of tetranucleotides while factoring in the observed frequencies of the embedded trinucleotides, $f^{\mathrm{Exp}}_{XYZW} = f_{XYZ} f_{YZW} / f_{YZ}$, and higher order Markov chains apply for longer oligonucleotides. This technique was used by Trifonov and coworkers to identify frequent or rare (and thus presumably meaningful) “words” in DNA sequences (Brendel et al., 1986; Trifonov and Brendel, 1986). Comparisons among DNA sequences based on Markov chain or similar models were also used to assess sequence similarity and for phylogenetic reconstructions where sequence alignments were problematic or unavailable, or simply to

Table 2. Symmetrized dinucleotide relative abundances ($\rho^*$ values) for selected prokaryotic genomes.

Organism                               G+C (%)   CG     GC     TA     AT     CC/GG   TT/AA   GA/TC   AG/CT   TG/CA   AC/GT
Escherichia coli K12                   50.8      1.16   1.28   0.75   1.10   0.91    1.21    0.92    0.82    1.12    0.88
Acinetobacter ADP1                     40.4      0.87   1.31   0.73   1.03   0.96    1.15    0.93    0.86    1.25    0.88
Pseudomonas aeruginosa PAO1            66.6      1.10   1.17   0.54   1.17   0.84    1.07    1.10    1.02    1.10    0.86
Burkholderia mallei ATCC23344          68.1      1.55   1.32   0.47   1.50   0.60    1.18    1.28    0.79    0.89    0.89
Neisseria meningitidis MC58            51.5      1.31   1.28   0.64   1.04   0.97    1.45    0.90    0.70    1.01    0.84
Bradyrhizobium japonicum USDA110       64.1      1.28   1.20   0.44   1.42   0.77    1.10    1.25    0.90    1.01    0.81
Hyphomonas neptunium ATCC15444         61.9      1.13   1.20   0.49   1.38   0.89    1.21    1.11    0.90    1.07    0.74
Myxococcus xanthus                     68.9      1.06   1.08   0.39   1.01   0.86    0.97    1.14    1.03    1.15    0.98
Desulfotalea psychrophila LSv54        46.8      0.70   1.09   0.77   1.02   1.17    1.15    1.01    1.03    1.08    0.77
Helicobacter pylori 26695              38.9      0.93   1.56   0.73   0.86   1.17    1.37    0.87    0.97    0.97    0.67
Bacillus subtilis                      43.5      1.04   1.27   0.65   1.02   0.97    1.24    1.06    0.91    1.08    0.75
Clostridium perfringens                28.6      0.25   1.21   0.94   0.92   1.31    1.07    1.02    1.24    0.93    0.77
Streptomyces coelicolor A3             72.1      1.14   0.97   0.50   0.94   0.88    0.82    1.26    0.95    1.01    1.14
Synechocystis PCC6803                  47.7      0.75   1.02   0.75   1.00   1.36    1.32    0.86    0.85    1.05    0.79
Anabaena variabilis ATCC29413          41.4      0.78   1.16   0.85   0.93   1.11    1.15    0.92    0.99    1.09    0.89
Deinococcus radiodurans R1             67.0      1.07   1.16   0.49   0.89   0.87    1.25    1.01    1.00    1.12    0.93
Thermotoga maritima MSB8               46.3      0.92   0.69   0.50   0.83   0.99    1.19    1.40    1.11    0.97    0.87
Aquifex aeolicus VF5                   43.5      0.87   0.75   0.82   0.66   1.24    1.29    1.12    1.18    0.74    0.89
Methanococcus maripaludis S2           33.1      0.89   1.11   0.72   0.91   1.22    1.24    1.01    0.88    1.07    0.82
Methanocaldococcus jannaschii DSM2661  31.4      0.32   1.12   0.83   0.94   1.38    1.14    1.05    1.11    1.03    0.72
Pyrococcus furiosus DSM3638            40.8      0.50   0.95   0.80   0.87   1.24    1.14    1.12    1.23    0.95    0.75
Sulfolobus solfataricus P2             35.8      0.67   0.95   1.00   0.95   1.24    1.04    1.05    1.17    0.88    0.85

Note: Significantly high (≥ 1.23) or low (≤ 0.78) values are shown in bold face and underlined.

complement standard phylogenetic methods (Blaisdell, 1986; Blaisdell, 1989a; Blaisdell, 1989b; Campbell et al., 1999; Karlin et al., 1999; Kirzhner et al., 2003; Pietrokovski et al., 1990). Interest in these and similar techniques has recently been renewed as a consequence of shotgun sequencing of environmental samples or “metagenomes” (Venter et al., 2004). In this approach, DNA fragments are sequenced directly from samples extracted from the environment, which involve a mixture of many different organisms. Subsequent analysis of such data is significantly facilitated by clustering the fragments into bins that likely came from a group of phylogenetically related organisms. New methods are being developed for this task, often utilizing some form of the genome signature approach (McHardy et al., 2007).

2.5. Amino Acid Composition and Adaptation to Growth at High Temperatures

In addition to nucleotide and oligonucleotide composition, prokaryotic genomes vary significantly in the overall amino acid composition of the encoded proteins. The G+C content of the chromosome has a major effect on amino acid composition (Karlin et al., 1992; Karlin and Bucher, 1992). Amino acid usages in different genomes also reflect the metabolic cost of synthesis of each particular amino acid, given the availability of nutrients and different metabolic pathways (Akashi, 2003). Amino acid composition also relates to optimal growth temperature. Thermophiles, organisms that grow optimally at temperatures higher than 45°C, consistently use more charged and hydrophobic amino acids and fewer polar uncharged amino acids than mesophiles, which grow optimally at moderate temperatures between 20 and 45°C (Kreil and Ouzounis, 2001; Suhre and Claverie, 2003). This difference is sufficiently strong to distinguish thermophiles unambiguously from mesophiles.
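The compositional contrast described above can be explored by tallying amino acid classes over a set of predicted proteins. The Python sketch below is a hedged illustration rather than a published classifier: the class memberships (charged D, E, K, R; polar uncharged N, Q, S, T; and one possible hydrophobic set) and the function names are assumptions made here, and no decision threshold is implied.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Class memberships are assumptions of this sketch, not taken from the text.
CLASSES = {
    "charged": set("DEKR"),          # Asp, Glu, Lys, Arg
    "polar_uncharged": set("NQST"),  # Asn, Gln, Ser, Thr
    "hydrophobic": set("AFILMVW"),   # one possible choice of hydrophobic residues
}


def class_fractions(proteins):
    """Return the fraction of residues in each class, pooled over all proteins.

    Characters that are not one of the 20 standard amino acids are ignored.
    """
    counts = dict.fromkeys(CLASSES, 0)
    total = 0
    for protein in proteins:
        for residue in protein.upper():
            if residue not in STANDARD:
                continue
            total += 1
            for name, members in CLASSES.items():
                if residue in members:
                    counts[name] += 1
    if total == 0:
        raise ValueError("no standard amino acid residues found")
    return {name: count / total for name, count in counts.items()}


def charged_minus_polar(proteins):
    """Difference between the charged and polar uncharged fractions."""
    fractions = class_fractions(proteins)
    return fractions["charged"] - fractions["polar_uncharged"]
```

Applied to the complete predicted proteomes of two organisms, a larger `charged_minus_polar` value would point in the direction the text associates with thermophiles, but any cutoff would have to be calibrated against genomes with known growth temperatures.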
Other genome properties have been proposed as characteristics of thermophiles but these are poor indicators of growth at high temperature (Suhre and Claverie, 2003). For example, there is only weak correlation between the genomic G+C content and optimal growth temperature, and many thermophiles or even hyperthermophiles have low G+C content whereas many mesophiles have high G+C content. The thermophily index (Nakashima et al., 2003) in the form $(f_{RR} + f_{YY}) - (f_{RY} + f_{YR})$, which measures dominance of purine-purine and pyrimidine-pyrimidine dinucleotides over purine-pyrimidine and pyrimidine-purine dinucleotides, is similarly inaccurate in distinguishing thermophiles from mesophiles.

3. Heterogeneity of Prokaryotic Chromosomes

3.1. Intrachromosomal Variance of Nucleotide and Oligonucleotide Composition

We know that G+C content and oligonucleotide composition vary, in some cases quite dramatically, between genomes. But how homogeneous are these values within

a chromosome? Even before the first completely sequenced genomes, analysis of DNA contigs from E. coli and human genomes showed that nucleotide composition varies significantly more than could be reproduced by homogeneous stochastic models (Fickett et al., 1992). In fact, the high variance of G+C content in human and other mammalian genomes was first revealed not by sequence analysis but by centrifugation of DNA fragments in CsCl density gradients (Bernardi, 1989). The intragenomic variance of G+C content is demonstrated in Fig. 9, which shows sliding window G+C plots for E. coli and random DNA sequences generated by a Bernoulli model (reproducing only the overall base composition of the original sequence), 1st order Markov model (reproducing the dinucleotide composition), and 5th order Markov model (with the same hexanucleotide composition as the E. coli genome). The G+C variance along the chromosome is much higher in the real sequence than in the random sequences. In addition to high fluctuations of G+C content at all scales, most prokaryotic genomes also feature subtle but systematic changes in nucleotide composition when traversing the chromosome from the origin of replication towards the terminus. The exact character of this G+C variation differs among different genomes but generally involves decreased G+C content near the terminus of replication (Daubin and Perrière, 2003; Karlin et al., 1998a). The intragenomic variance of the genome signature consisting of the 16 dinucleotide relative abundances is lower than the variance between genomes but higher than that of a homogeneous random sequence (Karlin and Burge, 1995; Karlin et al., 1998a; Karlin and Cardon, 1994; Karlin et al., 1997). Figure 10 shows average $\delta^*$-distances within and between several proteobacterial chromosomes. Note that the two chromosomes of B. mallei are indistinguishable by genome signature.
This is generally true for most replicons from a single cell and genome signatures of plasmids tend to be similar to those of the host chromosomes (Campbell et al., 1999). Analogously, temperate phages, which reside in the host for an extended period of time, have similar signatures to their hosts, whereas lytic phages’ signatures are often dissimilar from their hosts’ signatures (Blaisdell et al., 1996).
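The within- and between-replicon comparisons described here rest on the $\delta^*$-distance, which the caption of Fig. 10 defines as the mean absolute difference between the sixteen symmetrized dinucleotide relative abundances of two samples. A minimal Python sketch of that calculation follows. All function names are hypothetical, the 50 kb default sample size follows the figure caption, and each strand is counted separately rather than being concatenated, which is a simplification made for this sketch.

```python
from itertools import combinations, product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def rho_star(seq):
    """Symmetrized dinucleotide relative abundances of one sequence sample."""
    seq = seq.upper()
    mono = dict.fromkeys("ACGT", 0)
    di = dict.fromkeys(DINUCLEOTIDES, 0)
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for i, base in enumerate(strand):
            if base in mono:
                mono[base] += 1
            pair = strand[i:i + 2]
            if pair in di:
                di[pair] += 1
    if min(mono.values()) == 0 or sum(di.values()) == 0:
        raise ValueError("sample must contain all four nucleotides")
    n_mono, n_di = sum(mono.values()), sum(di.values())
    return {
        d: (di[d] / n_di) / ((mono[d[0]] / n_mono) * (mono[d[1]] / n_mono))
        for d in DINUCLEOTIDES
    }


def delta_star(seq_a, seq_b):
    """Mean absolute difference over the 16 relative abundances."""
    rho_a, rho_b = rho_star(seq_a), rho_star(seq_b)
    return sum(abs(rho_a[d] - rho_b[d]) for d in DINUCLEOTIDES) / 16


def split_samples(seq, size=50_000):
    """Cut a sequence into non-overlapping samples, dropping the remainder."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def average_delta(samples_a, samples_b=None):
    """Average distance over all sample pairs, within one set of samples
    (samples_b is None) or between two sets."""
    if samples_b is None:
        pairs = list(combinations(samples_a, 2))
    else:
        pairs = list(product(samples_a, samples_b))
    if not pairs:
        raise ValueError("at least one pair of samples is required")
    return sum(delta_star(a, b) for a, b in pairs) / len(pairs)
```

Calling `average_delta(split_samples(chromosome))` would give a within-chromosome value, and passing the samples of a second replicon as `samples_b` would give a between-replicon value.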

3.2. Synonymous Codon Usage

It was noted soon after the first DNA sequences became available that synonymous codons in genes are not used with equal frequency. Like genome signature, synonymous codon usage often differs considerably between genomes (Grantham et al., 1981; Grantham et al., 1980). However, significant differences exist in synonymous codon usage even among genes from the same genome. In most unicellular and some multicellular organisms, the codon frequencies in genes relate to gene expression levels (Grantham et al., 1981; Sharp and Li, 1986). Table 3 compares synonymous codon frequencies in the collection of all annotated E. coli K12 genes and the genes encoding ribosomal proteins. Ribosomal protein genes are highly expressed in many organisms and have often been used as a standard for highly expressed genes (Karlin and Mrázek, 2000; Karlin et al., 1998b; Sharp

Fig. 9. Sliding window plots of G+C content in the E. coli K12 chromosome (top) and random sequences preserving the nucleotide composition (Bernoulli model), dinucleotide composition (1st order Markov model), and hexanucleotide composition (5th order Markov model) of the chromosome. The G+C content is calculated within overlapping windows of 50 kb (black) and 10 kb (gray) size and the values are assigned to the position of the center of the window.


J. Mrázek & A. O. Summers

Fig. 10. Average δ*-distances within and between several proteobacterial chromosomes and random sequences. The δ*-distance measures similarity of two DNA sequences A and B in terms of the symmetrized dinucleotide relative abundances $\rho^*_{XY}$ as a Manhattan distance

$$\delta^*(A, B) = \frac{1}{16} \sum_{X,Y \in \{A,C,G,T\}} \left| \rho^*_{XY}(A) - \rho^*_{XY}(B) \right|,$$

where $\rho^*_{XY}(A)$ and $\rho^*_{XY}(B)$ are the symmetrized 16 dinucleotide relative abundances in the two sequences A and B, respectively (Karlin et al., 1999; Karlin et al., 1998a; Karlin et al., 1997). Each of the chromosomes compared was divided into non-overlapping 50 kb samples and the figure shows the average δ*-distances for all pairwise comparisons between samples from a single chromosome (diagonal entries) or from two different chromosomes (non-diagonal entries). Gray backgrounds correspond to the δ*-distances: lighter more similar, darker more distant. The sequences compared in the table are E. coli K12 (Eco), Salmonella typhimurium LT2 (Sty), Haemophilus influenzae Rd (Hin), Acinetobacter ADP1 (Aba), Pseudomonas aeruginosa PAO1 (Pae), Burkholderia mallei ATCC23344 chromosome 1 (Bm1) and chromosome 2 (Bm2), Neisseria meningitidis MC58 (Nme), Bradyrhizobium japonicum USDA110 (Bja), Caulobacter crescentus CB15 (Ccr), Hyphomonas neptunium (Hne), Myxococcus xanthus DK1622 (Mxa), Bdellovibrio bacteriovorus HD100 (Bba), Desulfotalea psychrophila LSv54 (Dps), Helicobacter pylori 26695 (Hpy), a random sequence generated from the E. coli chromosome by the Bernoulli model and preserving nucleotide composition (rB), a random sequence generated by 1st order Markov model and preserving dinucleotide composition (rM1), and a random sequence generated by 5th order Markov model and preserving hexanucleotide composition (rM5). The three random sequences are the same as in Fig. 9.

and Li, 1986; Sharp and Li, 1987). The bias towards the use of a small set of preferred codons is common in highly expressed genes of many microbial genomes, particularly those of fast-growing microbes. What generates the strong synonymous codon bias in highly expressed genes? The most cited theory postulates that the codon bias in highly expressed genes arises from selective constraints related to translation efficiency (Ikemura, 1981a; Ikemura, 1981b; Ikemura, 1985). Ikemura noticed that the most frequently used codons relate to the cognate tRNA molecules that occur in the cells at highest concentrations. He proposed that the preferential use of such codons with abundant cognate tRNAs avoids translational stalling due to unavailability of the charged tRNAs. However, other mechanisms might


Table 3. Synonymous codon frequencies (normalized to 100 within each synonymous group of codons) in the collection of all annotated genes in the E. coli K12 chromosome and restricted to ribosomal protein genes.

Amino acid  Codon  % usage in all genes  % usage in ribosomal protein genes
Ala         GCA    21.31                 27.31
            GCC    26.97                  8.84
            GCG    35.58                 16.64
            GCT    16.14                 47.20
Arg         CGA     6.46                  0.32
            CGC    39.80                 31.21
            CGG     9.84                  0.48
            CGT    37.80                 67.20
            AGA     3.85                  0.80
            AGG     2.25                  0.00
Asn         AAC    54.97                 86.38
            AAT    45.03                 13.62
Asp         GAC    37.24                 61.88
            GAT    62.76                 38.13
Cys         TGC    55.61                 75.00
            TGT    44.39                 25.00
Gln         CAA    34.74                 23.40
            CAG    65.26                 76.60
Glu         GAA    68.87                 74.95
            GAG    31.13                 25.05
Gly         GGA    10.88                  0.87
            GGC    40.34                 38.13
            GGG    15.10                  1.21
            GGT    33.68                 59.79
His         CAC    42.95                 72.19
            CAT    57.05                 27.81
Ile         ATA     7.31                  0.51
            ATC    41.99                 73.74
            ATT    50.70                 25.76
Leu         CTA     3.67                  0.20
            CTC    10.44                  3.99
            CTG    49.55                 83.63
            CTT    10.38                  4.99
            TTA    13.09                  2.99
            TTG    12.87                  4.19
Lys         AAA    76.50                 70.55
            AAG    23.50                 29.45
Met         ATG   100.00                100.00
Phe         TTC    42.62                 75.47
            TTT    57.38                 24.53
Pro         CCA    19.10                 13.91
            CCC    12.45                  1.30
            CCG    52.54                 69.57
            CCT    15.90                 15.22
Ser         TCA    12.37                  2.48
            TCC    14.86                 26.63
            TCG    15.40                  1.55
            TCT    14.55                 38.70
            AGC    27.67                 26.01
            AGT    15.14                  4.64
Thr         ACA    13.18                  3.83
            ACC    43.41                 43.44
            ACG    26.78                  4.92
            ACT    16.64                 47.81
Trp         TGG   100.00                100.00
Tyr         TAC    43.06                 75.59
            TAT    56.94                 24.41
Val         GTA    15.48                 26.71
            GTC    21.69                  9.35
            GTG    36.87                 12.46
            GTT    25.97                 51.48

contribute to biased codon usage. In addition to selection at the level of translation, codon choices are affected by genome-wide biases in nucleotide and oligonucleotide composition such as G+C content and genome signature (Karlin and Mrázek, 1996; Sharp et al., 1993). Codon usage might be further influenced by mutational biases related to spontaneous deaminations in the non-transcribed strand and transcription-coupled repair (Francino et al., 1996; Francino and Ochman, 2001), as well as selective constraints related to transcription and mRNA structure, and possibly the structure and folding of the encoded proteins (Kahali et al., 2007; Thanaraj and Argos, 1996). There is an ongoing debate regarding the evolution of codon usage in different organisms and many review articles are available to readers interested in this topic, e.g. (Akashi, 2001; Duret, 2002; Ermolaeva, 2001; Sharp et al., 1993). Several methods have been proposed to measure synonymous codon usage bias (Karlin et al., 1998a; Karlin et al., 1998b; Sharp and Li, 1987; Wright, 1990), which can be used to predict whether a gene is highly expressed. While this approach has limitations and is not equally applicable to all genomes (Futcher et al., 1999; Jansen et al., 2003; Sharp et al., 2005), it can in some cases provide insights into metabolism and physiology of the microbes (Karlin and Mrázek, 2000; Karlin and Mrázek, 2001; Karlin et al., 2001; Mrázek et al., 2006). Danchin and coworkers (Médigue et al., 1991) applied clustering techniques and principal component analysis to E. coli genes and found that the genes can be classified by codon frequencies into three classes. The bulk of genes were included in the first class, the second class comprised highly expressed genes, and the third class


was dominated by genes of apparent foreign origin. Genes of the latter class were probably acquired by lateral gene transfer, i.e., transfer of genetic material between unrelated cells by other means than inheritance from the parent cell. Lawrence and Ochman proposed a model of evolution of laterally transferred genes (Lawrence and Ochman, 1997). In their scenario, laterally transferred genes initially carry the G+C content and codon preferences of the donor genome. However, being subject to the same genome-wide mutational biases and selective constraints as the rest of the host genome, the codon usage of laterally transferred genes is gradually ameliorated to resemble the genes of the new host. Hence the genes acquired recently from genomes with different codon usage and/or G+C content might still resemble the donor genome and differ from the rest of the genes in the host genome, whereas older insertions generally resemble the composition of the host genome (Vernikos et al., 2007).

3.3. Identification of Genomic Islands and Lateral Gene Transfer Events

Lateral gene transfer is considered a major factor in the evolution of prokaryotic chromosomes. Laterally transferred chromosomal genes (apart from those carried by mobile genetic elements such as plasmids and phages) are usually identified by atypical phylogenetic trees compared to indigenous chromosomal genes, reflecting their foreign origin. However, recent lateral gene transfer events can often be detected by comparisons of genes within the same genome. The idea is rather simple: if codon usage and/or nucleotide composition of a gene significantly differs from the bulk of genes of a chromosome, that gene might have been acquired laterally (Lawrence and Ochman, 1998; Mrázek and Karlin, 1999). Lateral transfer events often involve DNA segments comprising several genes, which give rise to genomic islands characterized by DNA composition different from the rest of the chromosome (see also below and Table 5).
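The comparison of a gene's codon usage with the bulk of genes can be sketched as follows. This is a minimal illustration, not the procedure of any cited paper: it normalizes codon frequencies within each synonymous group (to 1 rather than to 100 as in Table 3) and scores a gene by an amino-acid-weighted sum of absolute frequency differences from a reference gene set, in the spirit of the codon bias measures cited above. The codon table is the standard genetic code, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks stops.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(genes):
    """Pool sense-codon counts over in-frame coding sequences."""
    counts = Counter()
    for cds in genes:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            # Unknown codons (e.g. containing N) and stop codons are ignored.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    return counts


def synonymous_frequencies(counts):
    """Codon frequencies normalized to 1 within each amino acid present."""
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: counts[codon] / totals[aa]
            for codon, aa in CODON_TABLE.items()
            if aa != "*" and totals[aa] > 0}


def codon_bias(gene_counts, ref_counts):
    """Amino-acid-weighted L1 difference between gene and reference usage."""
    f = synonymous_frequencies(gene_counts)
    g = synonymous_frequencies(ref_counts)
    n_total = sum(gene_counts.values())
    weight = defaultdict(float)
    for codon, n in gene_counts.items():
        weight[CODON_TABLE[codon]] += n / n_total
    return sum(weight[CODON_TABLE[c]] * abs(f[c] - g.get(c, 0.0)) for c in f)
```

A gene whose `codon_bias` relative to all genes of the chromosome is unusually high, and which is also distant from the highly expressed reference set (for example ribosomal protein genes), would be a candidate for the putative alien class. Thresholds are genome specific and are deliberately not suggested here.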
Pathogenicity islands are special cases of genomic islands that carry genes required for virulence. Genomic islands are often identified in G+C sliding window plots similar to that in Fig. 9. Other techniques have been proposed including genome signature, codon usage, and clusters of putative alien genes (Dufraigne et al., 2005; Karlin, 2001; Merkl, 2004). Figure 11 shows sliding window plots for G+C content, δ*-differences between the sequence in the sliding window and the complete chromosome, and codon bias relative to an average gene (Karlin, 2001) for Vibrio cholerae chromosome 1. The highest peaks in all plots refer to the location of a recognized pathogenicity island carrying genes for the biosynthesis of a toxin-coregulated pilus (Karaolis et al., 1998). See Chapter 5 in this volume for further discussion.

3.4. G–C Skew

In the 1950s and 1960s, Chargaff and coworkers formulated two rules governing base composition of DNA molecules. The first Chargaff rule postulates that DNA


Fig. 11. Sliding window plots for Vibrio cholerae chromosome 1. The plots show G+C content (top), δ*-differences between the sequence in the sliding window and the complete chromosome (middle), and codon bias relative to the average gene (bottom) as described in ref. (Karlin, 2001) in a sliding window of 50 kb (black) and 10 kb (gray).

molecules contain the same amount of A and T bases, and the same amount of G and C bases (Chargaff, 1950). Note that this work preceded the discovery of DNA structure (Watson and Crick, 1953). In fact, the first Chargaff rule is a simple consequence of base pairing rules in double-stranded DNA. The second Chargaff rule postulates that analogous balance between A and T amounts and G and C amounts applies to each individual strand of DNA (Karkas et al., 1968). In other words, the two DNA strands are symmetrical in terms of base composition. The second Chargaff rule does not apply locally in individual genes, where the base composition and mutation biases differ between the transcribed and non-transcribed strands (Francino and Ochman, 1997; Mrázek and Kypr, 1994a) but it was believed that the two strands are compositionally symmetrical over large DNA segments containing many genes. Only after the first complete prokaryotic genomes were sequenced did it become apparent that the two DNA strands are asymmetric in many bacteria (Lobry, 1996). This compositional asymmetry is particularly strong with respect to G and C and it is commonly referred to as G−C skew (Fig. 12). G−C skew relates to the asymmetry of the replication fork. Most bacteria replicate bidirectionally from a single origin of replication. The leading strand is synthesized in the same direction as the progress of the replication fork, whereas the lagging strand is synthesized in the opposite direction via Okazaki fragments. The asymmetry of the replication fork can result in different mutational biases between the two strands. Biased orientation of genes, which in some genomes strongly prefer


Fig. 12. Sliding window plots of (G − C)/(G + C) counts in the Bacillus subtilis chromosome in a sliding window of 50 kb (black) and 10 kb (gray). The circular chromosome is represented in a linear plot and the origin of replication is located at both ends of the plot, and the terminus is near the center. The bases were counted in the “top” DNA strand, that is, with the 5′ end at the left and 3′ end at the right. Hence, the left half of the plot corresponds to the leading strand and features an excess of G over C, whereas the right half corresponds to the lagging strand and features an excess of C over G.
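The quantity plotted in Fig. 12 is simple to compute. The following Python sketch evaluates (G − C)/(G + C) in sliding windows of the given strand and reports where the sign changes between consecutive windows; treating such sign changes as candidate positions of the origin or terminus is an assumption of this illustration, and the helper names are hypothetical.

```python
def gc_skew(seq, window, step):
    """(G - C) / (G + C) per window, as (window center, skew) pairs."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window without any G or C is assigned a skew of zero.
        skew = (g - c) / (g + c) if g + c else 0.0
        values.append((start + window // 2, skew))
    return values


def sign_changes(skews):
    """Centers of windows whose skew has the opposite sign of the previous one."""
    changes = []
    for (_, previous), (position, current) in zip(skews, skews[1:]):
        if previous * current < 0:
            changes.append(position)
    return changes
```

With small windows the skew fluctuates around zero, so in practice the window size (for example 50 kb, as in the figure) has to be large enough for the replication-related trend to dominate local noise.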

the leading strand, can also contribute to G−C skew (Mrázek and Karlin, 1998). Notably, both the G−C skew and the gene orientation bias are particularly strong in bacteria that use different DNA polymerases to replicate the leading and lagging strands (Rocha, 2002). In contrast, many archaeal and some bacterial genomes lack the G−C skew, possibly due to absence of a single origin of replication (Mrázek and Karlin, 1998). Indeed, multiple origins of replication were detected in the archaeon Sulfolobus solfataricus (Lundgren et al., 2004; Robinson et al., 2004), whereas species of another archaeon, Pyrococcus, apparently possess a single origin of replication (Myllykallio et al., 2000).

4. Repeats in Prokaryotic Genomes

4.1. Large Repeats and Duplications

Karlin and Ost developed formulas that allow estimating the largest expected exact repeat in a random sequence of letters (Karlin and Ost, 1988). For a random DNA sequence of the size of a typical prokaryotic genome, the longest expected sequence occurring at least twice is about 22–26 bp long (Rocha et al., 1999a). How does this estimate compare with real genomes? Table 4 lists the largest exact repeats found in selected prokaryotic chromosomes. All chromosomes investigated contain at least one exact repeat of size > 1000 bp. In most cases the largest repeats relate to duplicated rRNA genes and transposons. The very long (> 40 kb) repeat in E. coli probably represents a very recent duplication of a chromosomal segment. However, even when rRNA genes are excluded, exact repeats far exceeding the expected length are rampant (Rocha et al., 1999a; Rocha et al., 1999b). Do such large repeats have a role in the organisms’ physiology and/or evolution? Based on a survey of complete genomes available at the time, Rocha and coworkers

Table 4. Largest exact DNA repeats in several prokaryotic chromosomes.

Organism                                     Repeat length (bp)  Starting positions and direction            Annotated features
Escherichia coli O157:H7 EDL933              41786               1058578 → 1454185 →                         Multiple genes, mostly hypothetical, urease operon
Acinetobacter ADP1                           5555                18233 → ← 3564965                           rRNA operon
Burkholderia mallei ATCC23344, chromosome 1  1935                1881643 → 2674278 →                         partial 23S rRNA
Hyphomonas neptunium ATCC15444               1316                474873 → 2780557 → ← 2095831 and 2094515 →  transposons
Myxococcus xanthus                           2816                2505730 → ← 4215240 7763522 →               transposons
Helicobacter pylori 26695                    4852                441575 → ← 1480567                          three hypothetical proteins and partial 23S rRNA
Bacillus subtilis                            2957                32145 → 92221 →                             23S rRNA
Synechocystis PCC6803                        5361                2448637 → ← 3330091                         rRNA operon
Deinococcus radiodurans R1, chromosome 1     1768                251204 → ← 2587072                          intergenic region upstream of 23S rRNA
Methanococcus maripaludis S2                 5011                16 → 5057 →                                 rRNA operon duplicated in tandem
Pyrococcus furiosus DSM3638                  1350                374527 → 1154119 →                          oligopeptide ABC transporter gene
Sulfolobus solfataricus P2                   6512                1551160 → ← 1628511                         transposons and hypothetical genes

Note: Repeats were identified by the repeat-match program distributed with MUMmer (Kurtz et al., 2004).

argued that large repeats play important roles with respect to genome plasticity, gene transfer, and antigenic variation (Rocha et al., 1999b). Interestingly, the highest density of large repeats was detected in some of the smallest genomes (Mycoplasma genitalium and Mycoplasma pneumoniae), where large repeats in conjunction with simple sequence repeats (see below) contribute to increased antigenic variation of the pathogen population that aids evasion of the host immune system. DNA repeats in B. subtilis might facilitate integration of DNA fragments taken up by competent cells into the chromosome (Rocha et al., 1999a). In a more


recent analysis of repeats in prokaryotic genomes, Aras and coworkers confirmed and further specified the role of repeats in genome plasticity (Aras et al., 2003). They investigated in detail the distribution of repeats in two complete Helicobacter pylori genomes and noted an abundance of repeats separated by < 5 kb or > 100 kb and a scarcity of repeats more than 5 kb but less than 100 kb apart. In addition, repeats 5kb in length are more likely to contain essential genes and consequently subject to negative selection. In addition to deletions, repeat-facilitated recombination often leads to gene duplications and amplifications, which can be beneficial in adaptations to environmental changes, and in pathogens can contribute to resistance to antibiotics, virulence and evasion of the host immune system (Craven and Neidle, 2007).
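To make the contrast between expected and observed repeat lengths in this section concrete, the Python sketch below gives a back-of-the-envelope estimate of the longest exact repeat expected in a random sequence, together with a simple search for the longest exact direct repeat in a given sequence. The estimate is a first-moment heuristic (it solves (n²/2)·p^k = 1 for k) and is not the Karlin and Ost formula itself; the search considers only the given strand, unlike the repeat-match program used for Table 4. Function names are illustrative.

```python
import math


def expected_longest_repeat(n, base_freqs=(0.25, 0.25, 0.25, 0.25)):
    """Heuristic length k at which about one repeated k-mer pair is expected.

    p is the probability that two independently drawn bases are identical;
    about n*n/2 position pairs each match over k bases with probability p**k.
    """
    p = sum(f * f for f in base_freqs)
    return math.log(n * n / 2) / math.log(1 / p)


def longest_exact_repeat(seq):
    """Length and two start positions of the longest substring occurring twice.

    Only direct repeats on the given strand are found; copies may overlap.
    Returns (0, None) if no letter occurs twice.
    """
    def first_repeat(k):
        seen = {}
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if word in seen:
                return seen[word], i
            seen[word] = i
        return None

    # If a repeat of length k exists, one of length k - 1 exists as well,
    # so the maximal length can be located by binary search.
    low, high, best = 1, len(seq) - 1, (0, None)
    while low <= high:
        mid = (low + high) // 2
        hit = first_repeat(mid)
        if hit:
            best = (mid, hit)
            low = mid + 1
        else:
            high = mid - 1
    return best
```

For n = 4,000,000 and equal base frequencies the heuristic gives a value between 21 and 22 bp, and somewhat more for skewed base composition, which is of the same order as the 22–26 bp quoted above. The dictionary-based search is adequate for small examples; for whole genomes a suffix-array or suffix-tree tool is the practical choice.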

4.2. Transposons and Insertion Sequences

The most common large repeats in prokaryotes are mobile elements called transposons, which consist of segments of DNA encoding a suite of adjacent genes and DNA sites which control the movement of the DNA segment by recombination among the replicons in the same cell. Transposons range in size from a little less than 1 kb to over 50 kb and may occur only once in a cell or in several different replicons in the same cell (Craig et al., 2002). All transposons replicate when the chromosome, plasmid, or bacteriophage into which they are inserted replicates, and a subset of transposons also controls its own replication. In addition, all transposons control their own recombination in a site-dependent manner. The recombination process is called transposition and can proceed via two distinct mechanisms: conservative and replicative. The minimum requirements for both prokaryotic transposition mechanisms are an enzyme called a transposase encoded within the transposon and specific sequences (sites) at the ends of the transposon DNA at which the transposase acts (Fig. 13A). Conservative transposition employs a cut-and-paste mechanism; the transposase makes breaks at each end of the transposon in the donor site and makes a third break in the recipient site (in the same or different replicon) and inserts the transposon into this break, often generating a short direct duplication (typically 5 or 9 bp) of a short stretch of recipient DNA at each end of the newly inserted transposon (Fig. 13A). The smallest transposons, also called insertion sequences (IS; Filée et al., 2007; Mahillon and Chandler, 1998), are conservative transposons (http://www-is.biotoul.fr/); they encode only the transposase and many have inverted repeats of 10–40 bp at each end where the transposase acts.
However, two copies of the same IS can flank genes encoding selectable traits such as antibiotic resistance thereby generating a composite transposon which can be very stable. A prominent example of a composite transposon is Tn10, encoding tetracycline resistance genes flanked by copies of IS10. IS’s and composite transposons have


Fig. 13. The two general modes of bacterial transposition. (A) Conservative transposition takes place via a cutting and pasting mechanism. The transposon is removed from one strand of the donor replicon and inserted into a nick on one strand on the recipient replicon. Subsequently, the other two strands in each replicon are nicked and the new junctions in the recipient replicon are closed. (B) Replicative transposition also begins with transfer of one copy of the transposon from a nick in the donor to a nicked strand in the recipient. However, it differs in that the transposon remains covalently attached to both the donor and recipient replicons and is completely duplicated, resulting in a cointegration of both plasmids with one copy of the transposon at each junction. The cointegrate is processed into two independent plasmids via resolvase-mediated recombination at the res site (striped bar in the center of the transposon), yielding physically separate replicons each with a copy of the transposon. Nicks are indicated by half-thickness strands adjacent to the transposon. These simple models are not meant to be all inclusive or to represent the myriad variations on these general processes as they apply to specific transposons.

little or no preference for their insertion sites and move with varying degrees of randomness to new positions. In replicative transposition, the transposase generates a fusion (cointegrate) of the donor and recipient sites by making a full copy of the transposon at each fusion joint (Fig. 13B). Resolution (separation) of the cointegrate requires an additional


transposon-encoded enzyme called a resolvase, which acts at a resolution (res) site within the transposon. The resolvase carries out a single crossover between the res sites in each copy of the transposon, restoring the independent replicons, each with a copy of the transposon. Many replicative transposons are identified by additional genes they encode which confer some novel phenotype on the replicon carrying them, for example antibiotic resistance. A prominent example of a replicative transposon is Tn3, which encodes ampicillin resistance. The 38 kb bacteriophage Mu is also a replicative transposon. Replicative transposition is also very non-specific in its preference for target sites. Both types of transposons cause deletions and inversions of the DNA adjacent to where they are inserted, and promoters within them can affect expression of genes adjacent to the insertion site. Of course, the random insertion of an IS or transposon into a gene inactivates it.

4.3. Integrons

Integrons are not themselves repeats, but they contain a site called attI into which multiple 600–1500 bp segments of DNA can be inserted by an integrase enzyme whose gene (intI) is immediately adjacent to attI (Hall, 1997; MacDonald et al., 2006; Mazel, 2006; Ploy et al., 2000). The short segments that are inserted in tandem are called gene cassettes; they typically contain a single open reading frame and beyond the 3′ terminus of this open reading frame they have a ~60 to 120 bp loosely conserved region called attC (Fig. 14). The integron-specific integrase gene

Fig. 14. The basic structure of an integron. IntI1 is the integrase gene; attI is the insertion site and attC is the recombination signal carried by each gene cassette. The gene cassettes are represented by open squares; many that have been characterized encode antibiotic resistance. Pc is the promoter for expression of the cassette genes; Pi is the promoter for expression of the integrase gene. The double-slashes at each end mean that the integron locus is incorporated into a larger, self-replicating genetic element such as a plasmid or chromosome; it cannot replicate itself. Integrons that lie within large transposons on plasmids or the chromosome have been described.


can excise gene cassettes as free circular molecules and insert them at another attI in a different integron elsewhere in the cell. It can also reposition a cassette within the same integron. This latter activity is important because expression of the genes encoded by the cassettes depends on a promoter (Pc) in the integron as most gene cassettes do not include a promoter for the cassette gene. The closer a cassette is to attI the better its gene will be expressed. The first integrons studied (now called Classes 1–3) were borne by large conjugative plasmids and carried many antibiotic resistance gene cassettes. More recently large arrays of as many as 90 cassettes and integrases distinct from those carried by plasmids have been observed in so-called superintegrons (Fluit and Schmitz, 2004; Rowe-Magnus et al., 1999) in chromosomes. In the few instances described so far the integrases of chromosomal superintegrons seem to be lineage specific and able to operate best with the co-resident attC sites, unlike the more liberal recombination preferences of the peripatetic, plasmid-borne integrons (Rowe-Magnus et al., 2003; Vaisvila et al., 2001). See Chapter 5 in this volume for a computational analysis of integrons and transposons.

4.4. Chimeric Mobile Elements: Conjugative Transposons, ICEs, Plasmid-Prophages, Transposon-Prophages, Genomic Islands, and Genetic Litter

The above mobile elements engage in promiscuous recombination with all other DNA in a cell leading to chimeric arrangements, some of which are more stable and successfully evolve their own lineages within a given species or genus (Table 5) (Burrus and Waldor, 2004; Osborn and Boltner, 2002). Among the more prominent of the robust chimerae are plasmid-prophages such as the P1 bacteriophage of Shigella; the bacteriophage Mu, which replicates by transposition (Canchaya et al., 2004); and the conjugative transposons such as Tn916 of the Firmicutes and Bacteroidetes (Salyers et al., 1995; Scott and Churchward, 1995). The largest of these aggregates of disparate genes are genomic islands characterized by collections

Table 5. Basic and hybrid mobile genetic elements.

Agents of HGT, MGE (prototypes):
  Plasmids (RP4, RK2, R100, ColE1)
  Phages (lambda)
  Transposons (Tn5, Tn10, Tn501)

Hybrid or mosaic MGEs:
  P1: A temperate phage that is a plasmid
  Mu: A phage that replicates by transposition
  Conjugative transposon: A transposon that conjugates to other cells, e.g. Tn916
  Genomic islands (20 kb to >100s of kb) variously have conjugative, phage-derived, and/or transposon-derived components all clustered together on the main chromosome. Those shown to move by conjugation are called integrative conjugative elements (ICE). Pathogenicity islands (PAIs) have virulence genes.


of both phage-like and conjugative-plasmid-like genes interspersed with virulence genes and genes of unknown functions (Brussow et al., 2004; Hacker et al., 2003). These seemingly adventitious arrangements can comprise a substantial fraction of the chromosome (Karch et al., 1999; Lim et al., 2001). Only very recently has it been possible to demonstrate the movement of an entire chromosomal island from one bacterial strain to another (Qiu et al., 2006). Many assemblages of various replication, transfer, or recombination genes result in dead-ends, visible as vestigial genes in all prokaryotic genomes. Prokaryotes also have a plethora of presumably mobile short repeats generally of unknown function, which have been identified by random DNA amplification or by sequence inspection (Bachellier et al., 1999; Filée et al., 2007). They are frequently used in epidemiological typing of pathogenic bacteria (De Gregorio et al., 2005).

4.5. Retrons

An unusual satellite nucleic acid molecule has been observed in several bacteria and archaea (Lampson et al., 2005). This molecule, called msDNA (for multicopy, single stranded DNA) consists of a single strand of DNA of differing lengths, typically less than 200 nucleotides in various examples, that is covalently linked via a 2′-OH group near the 5′-end of the DNA strand to the 5′-end single strand of RNA, also distinct in each instance. Free msDNA is made in hundreds of copies by a reverse transcriptase, the distal gene of a small operon whose 5′-region includes the short templates for the msDNA and the RNA molecule. Retrons have been found in multiple copies in chromosomes, sometimes adjacent to large prophages. How they are assembled and what they do for the cell is unknown.

4.6. Short Dispersed Repeats

In contrast to large repeats which stand out by the length of the repeated sequence, significant short dispersed repeats stand out by the number of copies of a particular sequence motif or pattern found in the genome.
Short dispersed repeats can have various roles in the organisms. For example, some naturally competent bacteria take up only DNA fragments of their own species (e.g., Haemophilus influenzae and other Pasteurellaceae). The DNA uptake apparatus recognizes the DNA fragments of its own species by uptake signal sequences (USS), a short sequence motif occurring at high frequency in the genome. The core USS motif of H. influenzae consists of the 9-bp word AAGTGCGGT (ACCGCACTT in the complementary strand) that occurs 1461 times in the H. influenzae Rd chromosome (Karlin et al., 1996; Smith et al., 1995). In a random sequence, one would expect to find on average about 4–5 copies of this particular word, and the probability of finding 1461 copies by chance is very close to zero. No other 9-bp word (excluding those overlapping with the USS motif) occurs more than 230 times. However, similar sequence motifs featuring > 1000 copies in a chromosome are rare and most dispersed repeats do not exceed hundreds of copies. Some other examples of highly overrepresented


short dispersed repeats include the highly iterated palindrome HIP1 in the form GCGATCGC or GGCGATCGCC found in many cyanobacterial genomes (Karlin et al., 1996; Robinson et al., 1995), Chi-sites (GCTGGTGG in E. coli but different sequences in other bacteria) which promote recombination via the RecBCD system (Krawiec and Riley, 1990), or E. coli REP (Repeated Extragenic Palindrome) elements (Blaisdell et al., 1993; Higgins et al., 1988). Most methods for finding significant short dispersed repeats in a DNA sequence center on overrepresented words, that is, oligonucleotides that occur significantly more often than expected by chance. Typically a large number of words (e.g., all possible oligonucleotides of a given length) are investigated for unusually frequent occurrence. A common approach involves estimating an expected number of copies of the word at hand based on some stochastic model that serves as a null hypothesis and assessing statistical significance of the difference between the observed and expected counts. Many different methods based on this general approach have been proposed, e.g. (Karlin and Cardon, 1994; Karlin et al., 1996; Leung et al., 1996; Pesole et al., 1992; Reinert et al., 2000; Schbath, 1997; Trifonov and Brendel, 1986). The methods differ by the stochastic model and approximations employed. However, even the most sophisticated among these methods is limited by the intrinsic inaccuracy of the underlying stochastic model (generally some form of Markov model), which on its own represents only an approximation of the real DNA sequence. For example, none of the commonly used stochastic models reflects the heterogeneity of DNA sequences where different segments are under different selective constraints and might have different evolutionary history. When using these methods one should keep in mind that statistical significance does not always imply biological significance and vice versa. 
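The general approach described above (observed word counts compared with an expectation from a null model) can be illustrated with the simplest possible null hypothesis. The Python sketch below counts all words of length k on both strands and compares each count with its expectation under a Bernoulli model in which bases are independent and strand-symmetric. This is an assumption-laden illustration: the published methods cited above use Markov models and proper significance tests, and the ratio threshold and function names here are hypothetical.

```python
import math
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def word_counts(seq, k):
    """Counts of all k-letter words on both strands (overlaps included)."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    counts = Counter()
    for strand in (seq, reverse):
        for i in range(len(strand) - k + 1):
            counts[strand[i:i + k]] += 1
    return counts


def expected_bernoulli(word, seq):
    """Expected two-strand count of word if bases were independent."""
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    # Pooling both strands makes A and T (and G and C) equally frequent.
    p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    probability = math.prod(p[base] for base in word)
    return 2 * (len(seq) - len(word) + 1) * probability


def overrepresented(seq, k, min_ratio=5.0):
    """Words whose observed/expected ratio is at least min_ratio."""
    hits = []
    for word, observed in word_counts(seq, k).items():
        if not set(word) <= set("ACGT"):
            continue  # skip words containing ambiguous symbols
        expected = expected_bernoulli(word, seq)
        if observed / expected >= min_ratio:
            hits.append((word, observed, expected))
    return sorted(hits, key=lambda h: h[1] / h[2], reverse=True)
```

Because both strands are counted, a word and its reverse complement always receive the same count, so motifs such as the H. influenzae USS appear as a pair. The ratio is only a ranking device; as noted above, a high ratio under a crude model does not by itself establish statistical or biological significance.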
Identification of overrepresented words in a genome on its own rarely provides sufficient leads to generate hypotheses about biological roles of the repeats. In some cases, analysis of distribution of the dispersed repeats in the genome can provide additional hints with respect to their function. One may ask the following questions: Are the repeats more often in genes or intergenic regions? Are they associated with genes related to a specific cellular process or metabolic pathway? Are they significantly more often near the 5′ or 3′ end of genes than elsewhere? Are they periodically spaced or often found at a specific distance from each other? Are they clustered in a particular segment of the chromosome? If yes, what other features of interest are in that segment? Or the opposite, are the repeats missing in specific segments of the chromosome? Several software tools designed to help answer such questions are available on our web site at http://www.cmbl.uga.edu/software.html. We show next some examples of how the distribution of sequence motifs can provide useful information. The highly repetitive motif (HRM) of Lactococcus lactis in the form WWNTTACTGACRR and its inverted complement YYGTCAGTAANWW (W stands for A or T, N for any base, R for A or G, and Y for C or T) was identified by the frequent word analysis and features 916 copies in the chromosome (Mrázek


et al., 2002). The analysis of distribution of these motifs revealed two interesting anomalies. First, distances between pairs of HRM sequences are not random. Pairs of HRM sequences often occur at distances that are multiples of approximately 10 bp (i.e., ∼20, ∼30, ∼40, etc.) but almost never at 25, 35, 45, 55 bp, etc. A spacing of 10 bp is close to the helical period of the DNA in the canonical B conformation (Wang, 1979). Sequences distributed in phase with the helical period face the same side of the double helix. Such regular spacing can play a role in DNA interactions with proteins or other molecules and suggests that the HRM might be involved in such interactions. The second distributional anomaly is in the HRM positions with respect to genes. The HRM sequences often occur in dyads just downstream of stop codons, reminiscent of Rho-independent transcription terminators (Henkin, 1996). We proposed that the L. lactis HRM has a dual role in the cell as a binding site for an unknown protein and as a transcription terminator (Mrázek et al., 2002). In fact, it is probably not unusual that a particular frequent motif primarily used for some other purpose also functions as a transcription terminator. Such roles were also proposed for the uptake signal sequences of H. influenzae (Karlin et al., 1996; Kroll et al., 1992).

The r-scan statistic is designed to determine distributional anomalies in clustering, overdispersion, or even distribution. Consider an array of N consecutive sequence markers (e.g., dispersed repeats) in a sequence of length L located at positions $x_i$, $1 \le i \le N$. Let $R_i^{(r)}$ denote the distance between the markers i and i+r (Fig. 15). Formulas are available to estimate the probabilities that the minimum distance $\min_i R_i^{(r)}$ and maximum distance $\max_i R_i^{(r)}$ exceed a given threshold (Dembo and Karlin, 1988; Karlin and Brendel, 1992; Karlin et al., 1996).
A lower than expected minimum distance indicates significant clustering, a higher than expected maximum distance indicates a significant overdispersion, whereas higher than expected minimum distance or lower than expected maximum distance reflects a significantly even distribution of the markers (Karlin et al., 1996). An online tool for r-scan analysis is available as a component of Pattern Locator (Mrázek and Xie, 2006) at http://www.cmbl.uga.edu/software/patloc.html. For example, use of r-scan statistics has revealed that: (i) the DnaA (replication initiation protein) binding site TTATACACA forms a statistically significant cluster at the origin of replication in E. coli (data not shown). (ii) The DNA uptake signal sequences in H. influenzae have two areas of overdispersion (significantly low density). One coincides with a cryptic bacteriophage Mu inserted in the chromosome and the other with a cluster

Fig. 15. r-scan statistics. x1, x2, ..., xn are positions of markers in a nucleotide sequence of length L. For r = 1, the distances between adjacent markers are considered. For r = 2, the distances skipping one marker are analyzed, etc. Formulas are available to assess whether the minimum and maximum distance are unusually large or small, indicating nonrandom distribution of the markers.


J. Mrázek & A. O. Summers

of ribosomal protein genes and other genes involved in translation and transcription (Karlin et al., 1996). (iii) The HIP1 sequence GGCGATCGCC (Karlin et al., 1996; Robinson et al., 1995) in Synechocystis and most other cyanobacteria is significantly evenly distributed with minimum r-scan length larger than expected.
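The r-scan distances defined above can be computed directly from a list of marker positions. The following is a minimal sketch, not the Pattern Locator implementation: the function names and the toy sequence are hypothetical, only one strand is searched, and the significance formulas of Dembo and Karlin are deliberately left out.

```python
import re

# IUPAC codes used in the text, written as regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def motif_positions(seq, motif):
    """0-based start positions of an IUPAC motif; the lookahead allows overlapping hits."""
    pattern = re.compile("(?=" + "".join(IUPAC[ch] for ch in motif) + ")")
    return [m.start() for m in pattern.finditer(seq)]

def rscan_distances(positions, r=1):
    """R_i^(r) = x_(i+r) - x_i for every marker i that has an r-th successor."""
    xs = sorted(positions)
    return [xs[i + r] - xs[i] for i in range(len(xs) - r)]

# Toy sequence (invented for illustration) with three HRM copies at 0, 20 and 60.
toy = "AATTTACTGACAG" + "C" * 7 + "TTGTTACTGACGA" + "G" * 27 + "ATCTTACTGACAA"
hits = motif_positions(toy, "WWNTTACTGACRR")
for r in (1, 2):
    d = rscan_distances(hits, r)
    print(r, min(d), max(d))
```

The minimum and maximum of each distance list are the quantities whose probabilities the r-scan formulas evaluate; the inverted complement of a motif can be searched the same way and its positions merged before computing the distances.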

4.7. Simple Sequence Repeats

Simple sequence repeats (SSRs) are composed of tandem iterations of a single nucleotide or a short oligonucleotide. SSRs have some interesting properties that differentiate them from “regular” DNA sequences. First, they are polymorphic in length. DNA polymerase slippage and/or recombination within the repeat frequently lead to changes in the number of repetitive units within the SSR. Such mutations can be beneficial in some cases. In some pathogenic bacteria, SSRs are associated with genes encoding antigens located on the surface of the cell where they can be recognized by the host immune system. SSRs located in protein coding regions or in the upstream regulatory regions of such genes can reversibly activate and deactivate these genes, thus contributing to antigenic variation of the pathogen population and facilitating avoidance of the host immune response (Groisman and Casadesus, 2005; Moxon et al., 1994; Rocha, 2003; Rocha and Blanchard, 2002). One of the well characterized cases of such regulation involves the pMGA (also referred to as VlhA) family of lipoproteins in Mycoplasma gallisepticum. Different strains possess from 32 to 70 pMGA proteins and most of them feature an SSR consisting of GAA iterations upstream of the gene (Baseggio et al., 1996; Papazisi et al., 2003). Expression of the pMGA genes is controlled by the length of the GAA repeats (Glew et al., 1998) and mutations in the GAA repeats determine which of the pMGA genes is expressed, and consequently the antigenic configuration of the cell. Depending on the actual sequence of the repeated unit, SSRs can also promote formation of unusual DNA structures. For example, the CGG repeat whose expansion causes fragile X syndrome in humans can adopt a G-DNA (quadruplex) conformation (Shafer and Smirnov, 2000).
The GC repeats and to a lesser extent general purine-pyrimidine alternating patterns easily adopt the left-handed Z-DNA conformation (Nordheim and Rich, 1983; Sinden, 1994). AG or TC repeats form triple-helical H-DNA structures under favorable conditions (Htun and Dahlberg, 1989). In addition, trinucleotide and hexanucleotide repeats in protein coding sequences translate into amino acid runs or alternating patterns, which can affect the structure and function of the encoded proteins (Dunker et al., 2005; Karlin et al., 2002a; Perutz et al., 2002). Considering all the unusual properties of SSRs, one may ask how common they are in genomes. Long SSRs (those unlikely to occur by chance in a random sequence) are abundant in eukaryotes but rare in most prokaryotes (Field and Wills, 1998; Kashi and King, 2006; Mrázek and Kypr, 1994b; Tautz and Schlötterer, 1994; Tóth et al., 2000). Counts of mono-, di-, tri-, and tetranucleotide SSRs of varying lengths


Fig. 16. Mono- (top left), di- (top right), tri- (bottom left), and tetranucleotide (bottom right) simple sequence repeats in human chromosome 22. The ordinate shows counts of SSRs of the exact length shown by the abscissa. Full circles signify the counts in the actual sequence and the gray lines refer to random sequences generated by different stochastic models. The SSR length is measured in nucleotides rather than the number of repeated units, which allows accounting for partial copies. All SSRs of length > 50 bp are reported at the length 50 bp. See references (Mrázek, 2006; Mrázek et al., 2007) for details.

in human chromosome 22 and in the E. coli chromosome are shown in Figs. 16 and 17 and compared to expected counts based on stochastic models of varying complexity (Mrázek, 2006; Mrázek et al., 2007). The abundance of long SSRs is obvious in the human DNA. In contrast, the SSR counts in E. coli are congruent with the random models, and mononucleotide SSRs of length > 8 bp are underrepresented (less abundant than expected). Most prokaryotes exhibit SSR representations similar to those of E. coli, including the underrepresentation of mononucleotide SSRs longer than 8 bp. However, there are some notable exceptions. Several prokaryotic genomes have an excess of long mono-, di-, tri- and tetranucleotide SSRs (Mrázek et al., 2007). Some examples are shown in Fig. 18. These plots sometimes have a bimodal character where the SSR counts initially follow the expected counts or even drop below the expected counts but include a secondary peak at greater lengths. The bimodality suggests that different mechanisms affect SSRs of different lengths and that the long SSRs that constitute the separate peak are maintained by positive selection. Moreover, most genomes with long mono-, di-, tri-, and tetranucleotide SSRs are those of host-adapted pathogens, which is consistent with the role of SSRs in immune avoidance. However, even the SSRs in pathogens are not always located in the protein coding regions or upstream regulatory regions where they can directly


Fig. 17. Simple sequence repeats in the E. coli O157:H7 EDL933 chromosome. See legend to Fig. 16.

Fig. 18. Examples of anomalous SSR length distributions in some genomes: mononucleotide SSRs in M. hyopneumoniae, dinucleotide SSRs in L. intracellularis, trinucleotide SSRs in M. gallisepticum, and dinucleotide SSRs in M. leprae. See legend to Fig. 16.


influence the activity of a gene and the role of SSRs in avoidance of host immune response might be indirect (Mrázek, 2006; Rocha and Blanchard, 2002). Unlike long SSRs composed of mono- through tetranucleotides, long SSRs composed of iterations of pentanucleotides and longer oligonucleotides are found in nonpathogens and opportunistic pathogens and could arise from spontaneous expansion of shorter SSRs rather than due to selective constraints (Mrázek et al., 2007).
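Counting SSRs by their length in nucleotides, as in Figs. 16 to 18, can be sketched as follows. This is an illustrative routine rather than the code used for the figures: the function name and example sequences are hypothetical, only exact repeats are detected, and a run is reported for a given unit size only if the unit is not itself a repeat of a shorter unit.

```python
def find_ssrs(seq, unit_len, min_len):
    """Return (start, length, unit) for maximal exact tandem repeats.

    Length is measured in nucleotides, so a trailing partial copy of the
    unit is included in the reported length.
    """
    ssrs = []
    n = len(seq)
    i = 0
    while i + unit_len < n:
        j = i
        # Extend while each base equals the base one period earlier.
        while j + unit_len < n and seq[j + unit_len] == seq[j]:
            j += 1
        length = (j - i) + unit_len
        unit = seq[i:i + unit_len]
        # A unit such as "AAA" is really a mononucleotide run: skip it.
        has_shorter_period = unit in (unit + unit)[1:-1]
        if j > i and length >= min_len and not has_shorter_period:
            ssrs.append((i, length, unit))
        i = j + 1 if j > i else i + 1
    return ssrs

print(find_ssrs("CCGAAGAAGAAGATT", unit_len=3, min_len=9))
print(find_ssrs("GGAAAAAAAACC", unit_len=1, min_len=6))
```

Tabulating the reported lengths over a whole chromosome, separately for each unit size, gives observed counts of the kind that are compared with random-sequence expectations in the figures.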

4.8. CRISPR Repeats

This remarkable and widespread type of repeat was initially discovered in the archaea Haloferax volcanii and Haloferax mediterranei before the first complete genomes became available (Mojica et al., 1995). When the first archaeal genome, Methanococcus jannaschii, was sequenced, the authors reported an interesting family of repeats consisting of a long repeat element of ∼400 bp followed by a series of regularly spaced short repeats of ∼30 bp length (Bult et al., 1996). It took several more years and many more sequenced genomes before it was fully appreciated how widespread such repeats are among both archaea and bacteria (Karlin et al., 2002b; Karlin et al., 1998a; Mojica et al., 2000) and before they were given their commonly used name: Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) (Jansen et al., 2002). Figure 19 shows a part of the CRISPR locus in the Acinetobacter ADP1 chromosome. The main characteristic of CRISPR sequences is that the repeat itself is virtually invariant whereas the spacer sequences are variable. Likewise, the spacing of the repeats is highly conserved. For example, of the 91 spacers in the Acinetobacter ADP1 CRISPR locus, one is 31 bp long, one is 33 bp long, and all others are exactly 32 bp. The repeats have been dubbed “palindromic” but the palindromic character can be rather weak in some cases. Some genomes have multiple distinct CRISPR loci and others have a single cluster of CRISPR repeats. Although the CRISPR repeats are virtually invariant within a genome or at least within each CRISPR locus, they vary significantly between different species (Godde and Bickerton, 2006; Karlin et al., 2002b; Karlin et al., 1998a; Mojica et al., 2000). Two key observations provided hints about possible roles of CRISPR sequences in prokaryotic genomes. First, it was noted that several protein coding genes termed cas (CRISPR-associated genes) frequently occur near CRISPR loci (Jansen et al., 2002).
The second discovery was that the variable spacers between the conserved repeats often match fragments of phage and plasmid DNA sequences (Mojica et al., 2005). This led to a theory that the CRISPR repeats protect the cells from phages and other foreign nucleic acids by a mechanism analogous to RNA interference in eukaryotes (Makarova et al., 2006; Mojica et al., 2005). In this model, the variable spacers provide resistance to phages in association with the Cas proteins (Makarova et al., 2006). Recent experiments with Streptococcus thermophilus confirmed that the spacers indeed provide resistance to bacteriophages


Fig. 19. The CRISPR locus in Acinetobacter ADP1. The Acinetobacter ADP1 CRISPR locus contains 92 copies of the conserved 28 bp repeat (gray box) separated by 32 bp variable spacers. Only the first 30 CRISPR copies are shown. Nucleotides matching the palindromic character of the repeat are underlined in the top line.

and that cells that survived phage infection have acquired new spacers (Barrangou et al., 2007). These authors compared the CRISPR–Cas system to the immune system of eukaryotes where the CRISPR spacers provide specificity and the Cas proteins confer the phage resistance and probably also mechanisms for inserting new spacers into a CRISPR locus. Earlier hypotheses proffered possible roles of CRISPR repeats in chromosome partitioning (Mojica et al., 1995) and in DNA


repair (Makarova et al., 2002). Phylogenetic analyses of the cas genes suggest that they propagate among prokaryotes via lateral gene transfer (Godde and Bickerton, 2006; Makarova et al., 2002). However, lateral gene transfer does not explain the intergenomic variance of CRISPR repeats combined with their invariance within the same CRISPR locus. Perhaps only the cas genes are laterally transferred but not the repeats themselves. Alternatively, there might be a yet unknown mechanism that alters the CRISPR sequences in a coordinated manner after the lateral transfer event.
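The regular architecture of a CRISPR locus (an invariant repeat separated by variable spacers of nearly constant length) can be examined with a few lines of code once the repeat sequence is known. The sketch below is only illustrative: the function name, the 12 nt repeat and the spacers are invented, and only exact repeat copies are matched, whereas dedicated CRISPR finders must also discover the repeat itself and tolerate mismatches.

```python
from collections import Counter

def crispr_spacers(seq, repeat):
    """Return the spacers lying between consecutive exact copies of `repeat`."""
    starts = []
    i = seq.find(repeat)
    while i != -1:
        starts.append(i)
        i = seq.find(repeat, i + len(repeat))  # non-overlapping copies only
    return [seq[a + len(repeat):b] for a, b in zip(starts, starts[1:])]

# Hypothetical toy locus: four repeat copies separated by three spacers.
repeat = "GTTCACTGCCGT"
locus = (repeat + "AAAACCCC" + repeat + "TTTTGGGG" +
         repeat + "ACGTACGTA" + repeat)
spacers = crispr_spacers(locus, repeat)
print(Counter(len(s) for s in spacers))
```

Applied to a real locus, the resulting length histogram would show the kind of conservation described for Acinetobacter ADP1, and the spacer sequences themselves could be compared against phage and plasmid databases.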

5. Further Reading

Bushman, F. (2002) Lateral DNA Transfer. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
Craig, N.L., Craigie, R., Gellert, M., Lambowitz, A.M. (eds.) (2002) Mobile DNA II. ASM Press, Washington, DC. 1204 pp.
Funnell, B.E. and Phillips, G.J. (eds.) (2004) Plasmid Biology. ASM Press, Washington, DC. 614 pp.
Groisman EA, Casadesus J (2005) The origin and evolution of human pathogens. Mol Microbiol 56:1–7.
Karlin S, Brendel V (1992) Chance and statistical significance in protein and DNA sequence analysis. Science 257:39–49.
Lawrence JG, Ochman H (1998) Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci U S A 95:9413–9417.
Moran NA, Plague GR (2004) Genomic changes following host restriction in bacteria. Curr Opin Genet Dev 14:627–633.
Moxon ER, Rainey PB, Nowak MA, Lenski RE (1994) Adaptive evolution of highly mutable loci in pathogenic bacteria. Curr Biol 4:24–33.
Thanbichler M, Shapiro L (2006) Chromosome organization and segregation in bacteria. J Struct Biol 156:292–303.
Toussaint, A. and Merlin, C. (2002) Mobile elements as a combination of functional modules. Plasmid 47:26–35.

Acknowledgments

We appreciate thoughtful critiques of the manuscript by Laura Frost, Eva Top, Ariane Toussaint, two anonymous reviewers, and the editors. Work in JM’s lab is supported by funding from the University of Georgia Research Foundation and Oak Ridge Associated Universities; work in AOS’ lab is supported by the US Dept. of Energy Genomes to Life Program grant DEFG0204ER63770; and work in both labs is supported by the US National Science Foundation Microbial Genome Sequencing Program grant 0626940.


CHAPTER 2 GENES IN PROKARYOTIC GENOMES AND THEIR COMPUTATIONAL PREDICTION

RAJEEV K. AZAD

1. Introduction

As of now, complete genomes of over 500 prokaryotes are available in GenBank, and sequencing of as many genomes is in progress in different sequencing centers worldwide. The ever increasing pace of genome sequencing has brought significant advancement in our understanding of the microbial world. One of the major challenges in the post genome sequencing era is to identify the functional parts of a genome, mainly the regions that code for proteins, the fundamental components of living cells. Experimental means of identifying these sequence segments, the so-called protein-coding genes, in a sequenced genome still remain tedious, time consuming and prohibitively expensive. As a consequence, computational methods continue to serve as the main tools for annotating genes in genome sequencing projects. Prokaryotes have highly compact genomes: nearly 90% of a genome is packed with genes. Dense packing causes many genes to overlap their boundaries with neighboring genes, and the non-overlapping genes have very short intergenic regions in between. Gene overlap increases the organizational complexity, making precise gene identification difficult; on the other hand, the absence of introns (non-coding segments often found inside eukaryotic genes) makes gene detection a relatively easier task. Genes are not just random assortments of nucleotides; their organizational structure has certain definitive patterns that make their detection possible through computational means. The first step in identifying genes is to scan all open reading frames (ORFs) in a genome. An ORF is a continuous stretch of nucleotide triplets that starts with one of the four triplets ATG, CTG, GTG, TTG and ends at the first in-frame occurrence of one of the three triplets TAA, TAG and TGA. An ORF signifies a potential coding region delimited by one of the possible start triplets downstream of the ORF start and the stop triplet. The triplets define the codons that encode amino acids for proteins.
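The ORF definition just given translates into a simple scan of the three reading frames of one strand. The sketch below is an assumption-laden illustration rather than any published gene finder: the function name and the `min_codons` parameter are hypothetical, each ORF is reported from its first (most upstream) start triplet, and the complementary strand would be handled by running the same scan on the reverse complement.

```python
STARTS = {"ATG", "CTG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) for ORFs on the given strand.

    `start` is the 0-based index of the first start triplet after the
    previous in-frame stop; `end` is the index just past the stop triplet.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i                      # open a new ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it at the stop
    return orfs

toy = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(toy):
    print(frame, toy[start:end])
```

Real annotation pipelines typically impose a much larger minimum length, precisely because short ORFs arise by chance at a high rate.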
The stop codons are also called non-sense codons as they do not encode an amino acid and only signal the end of translation. But not all ORFs


encode real proteins, so the prokaryotic gene prediction problem is formulated as discriminating the protein-coding ORFs from the ‘random’, non-coding ORFs. The history of using statistical patterns inherent in nucleotide ordering to distinguish protein-coding regions from non-coding regions is more than two decades old. The earliest methods measured the match between the codon usage pattern observed in known protein-coding sequences and that of the test sequence. The first gene prediction algorithm, TESTCODE, developed by Fickett in 1982, exploited the non-uniform nucleotide compositional pattern observed at the three positions of a codon. The coding potential was measured through eight parameters, the first four quantifying the abundance of nucleotides at specific codon positions and the other four quantifying the nucleotide content. The probability of being a protein-coding sequence was assessed using the values determined for each of these eight parameters as observed in known protein-coding sequences. Later, the bias in nucleotide composition at the third codon position was used explicitly in an information theoretic method proposed by Almagor (1985). Staden (1984) used each of three discriminant criteria (the amino acid usage bias, the codon usage bias and the codon position specific nucleotide composition) independently to measure the coding potential of test sequences. The probability of being coding in each of the three reading frames was computed by using parameters estimated from experimentally validated protein-coding sequences. The difference in codon usage pattern between protein-coding sequences and random sequences was quantified through a likelihood ratio test by Gribskov et al. (1983).
The parameters used to compute the likelihood of a sequence being coding were estimated from the codon frequencies in the gene pool of highly expressed genes, whereas the parameters for random sequences were estimated from the general nucleotide composition of these genes. ‘Codon preference plots’ were then obtained in three frames from the scaled value of the likelihood ratio for a window moved over a sequence; a consistently high value of this measure at a sequence location indicated the presence of a protein-coding gene. These plots were also used to detect frameshifts caused by sequencing errors. Most methods developed around this time attempted to assess either the bias in codon usage or the nucleotide compositional bias at the codon positions, while a few made use of other discriminant criteria. For example, Claverie and Bougueleret (1986) exploited the bias in ‘k-tuple’ usage (words of size k in DNA or protein sequences) in a heuristic information theoretic method to measure coding potential in a sequence. Research was also done to develop methods that could dispense with the use of reference datasets of known genes or proteins; one such effort, by Fichant and Gautier (1987), was based on correspondence analysis of codon usage. All these methods used the information embodied in the statistical patterns of nucleotide ordering in a sequence and are termed intrinsic methods. The development of extrinsic methods happened much later, after substantial amounts of sequence data had accumulated in the genomic databases (Robison et al., 1994; Borodovsky et al., 1994). These methods exploited the conservation


of protein-coding sequences in evolution; methods using both intrinsic and extrinsic information were developed subsequently. The ab initio or intrinsic methods use mainly the following three steps. First, statistical determinants of sequence types, for example protein-coding and non-coding, are determined. Second, models are built for each sequence category. Finally, these models are integrated in a pattern recognition algorithm. Among the statistical determinants of coding potential assessed by Fickett and Tung (1992), the frame dependent (positional) hexamer frequencies were found to be most informative. Inhomogeneous Markov chain models provided a rigorous mathematical framework to account for in-frame oligomer statistics (Borodovsky et al., 1986). This led to the development of the GeneMark gene prediction program (Borodovsky and McIninch, 1993). Subsequently developed prokaryotic as well as eukaryotic gene finders made extensive use of inhomogeneous Markov models (Krogh, 1997; Burge and Karlin, 1997; Salzberg et al., 1998; Lukashin and Borodovsky, 1998; Krogh, 2000; Larsen and Krogh, 2003). The introduction of hidden Markov models, a class of probabilistic tools initially applied to speech recognition, into gene prediction was another landmark development, pioneered by Krogh et al. in their ECOPARSE program (Krogh et al., 1994). Around this time, efforts were also being made to adapt other methods for use in gene prediction, including artificial neural networks (Uberbacher and Mural, 1991; Snyder and Stormo, 1993; Xu et al., 1994) and linear discriminant analysis (Solovyev et al., 1994), though these programs mainly targeted genes in eukaryotes.
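As a concrete illustration of the frame dependent hexamer statistics mentioned above, the following sketch tabulates hexamers according to the codon position of their first nucleotide. It is a hypothetical helper written for this purpose, not code from any of the cited programs, and it assumes each input sequence starts at codon position 1.

```python
from collections import Counter

def positional_hexamer_counts(cds_list):
    """Count hexamers by the codon position (1, 2 or 3) of their first base."""
    counts = {1: Counter(), 2: Counter(), 3: Counter()}
    for cds in cds_list:
        for i in range(len(cds) - 5):
            counts[i % 3 + 1][cds[i:i + 6]] += 1
    return counts

counts = positional_hexamer_counts(["ATGAAATAA"])
for position in (1, 2, 3):
    print(position, dict(counts[position]))
```

Normalizing such counts within each codon position gives the frame-specific oligomer frequencies from which the transition probabilities of an inhomogeneous Markov model can be estimated.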

2. Inhomogeneous Markov Models

2.1. The GeneMark Program

Borodovsky and colleagues’ seminal work in the mid 1980s (Borodovsky et al., 1986) laid the foundation for the development of the GeneMark gene prediction program (Borodovsky and McIninch, 1993). Around this time, as the genome sequencing of several bacterial species was just coming to completion, this algorithm was readily adapted for use in many gene annotation projects (see, for example, Fleischmann et al., 1995; Bult et al., 1996; Blattner et al., 1997; Kunst et al., 1997; Tomb et al., 1997). Its initial success was a precursor to the development of several programs explicitly using inhomogeneous Markov chain models. Gene prediction in GeneMark follows two steps: the training step and the test step. The training step entails learning the model parameters from given training sequences. Specifically, the parameters of a homogeneous Markov model of non-coding sequences and of a three-periodic inhomogeneous Markov model of protein-coding sequences are learnt from training sets of validated non-coding sequences and protein-coding sequences, respectively. The test step entails integration of the models in a Bayesian formalism to calculate the posterior probability of a sequence segment being a part of a protein-coding region or a non-coding region. The posterior probability profiles are then used to score an ORF.


Unlike previous methods which predicted genes on one strand first followed by another, the GeneMark algorithm was designed to score ORFs on both strands in parallel using DNA sequence of one strand only. The main motivation was to eliminate false predictions on a strand opposite to the strand where the actual gene lies; the source of these false predictions is the abundant ‘RNY’ codons in protein-coding regions (‘R’ and ‘Y’ stand for purine and pyrimidine respectively). These codons self-complement on the other strand causing ‘shadow’ regions complementary to protein-coding regions to be falsely predicted. To predict genes on both strands in parallel, GeneMark additionally uses a coding shadow model for the complement of protein-coding genes. A moving window statistics is obtained for the six reading frames (corresponding to three codon positions in ‘direct’ sequence and three in complementary sequence) as follows. For a DNA sequence segment in a moving window, $S = \{s_1, s_2, \ldots, s_n\}$, where $n$ is a multiple of 3, the likelihood of $S$ to be a part of the protein-coding regions (in one of three frame settings) or shadow of a protein-coding region (again in one of the three frames) is obtained using the inhomogeneous Markov models, $M_i$ ($i = 1, \ldots, 6$, defining the first three reading frames for coding regions in direct strand and the other three for the shadow regions). The likelihood of $S$ to be a part of protein-coding region in reading frame 1 is computed as

$$P(S \mid M_1) = P^1(s_1^h)\, P^1(s_{h+1} \mid s_1^h)\, P^2(s_{h+2} \mid s_2^{h+1})\, P^3(s_{h+3} \mid s_3^{h+2}) \cdots P^b(s_n \mid s_{n-h}^{n-1}), \qquad (1)$$

where $P^i(s_1^h)$ is the initial probability of oligomer $s_1^h$ situated in phase $i$ (that is, the first nucleotide of the oligomer situated in codon position $i$), $h$ defines the model order, $s_k^l$ denotes an oligomer starting in position $k$ and ending in position $l$. $P^b(s_k \mid s_{k-h}^{k-1})$ is the transition probability of nucleotide $s_k$ to succeed the oligomer $s_{k-h}^{k-1}$ situated in frame $b$, $b = 2, 1, 3$ if $(h \bmod 3) = 1, 2, 0$ respectively. Similarly, the probabilities $P(S \mid M_i)$, $i = 2, \ldots, 6$ can be obtained for other frames. The probability of $S$ to be a part of a non-coding region is obtained using the frame independent homogeneous Markov model, $M_7$. The value of $P(S \mid M_7)$ is computed as described above [Eq. (1)] with the superscripts (reading frames) omitted. The values of initial and transition probabilities are estimated from the training sequences using maximum likelihood approach. Following Bayes’ theorem, the a posteriori probabilities $P(M_i \mid S)$, $i = 1, \ldots, 7$ for $S$ to be a part of protein-coding, coding shadow or non-coding region is obtained as

$$P(M_i \mid S) = \frac{P(S \mid M_i)\, P(M_i)}{\sum_{j=1}^{7} P(S \mid M_j)\, P(M_j)}, \qquad (2)$$

where $P(M_i)$ is the a priori probability of each of the seven events specified by model $M_i$. A sliding window is moved over a DNA sequence of interest (the default window size used in GeneMark is 96 nt and step size is 12 nt) and the a posteriori probability


is computed for each of the seven events for sequence segment S inside the window. Finally an ORF is scored by taking the average of the a posteriori probabilities for S being protein-coding for the windows that fall within the ORF and have the same reading frame as the ORF. The ORFs that have scores lying above an established threshold are predicted as protein-coding genes. A typical GeneMark output depicting the a posteriori probability profiles for a segment of E. coli K12 genome is shown in Fig. 1. The six panels show the profiles corresponding to the six reading frames, the ‘non-coding’ profile is not shown. In addition to detecting protein-coding regions, these profiles are also used to detect frameshifts caused by sequencing errors or other factors.
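The interplay of Eqs. (1) and (2) can be illustrated with a deliberately small sketch. It is not the GeneMark implementation and simplifies in several labeled ways: the model order is h = 1, the initial-oligomer term of Eq. (1) is dropped, the three shadow models are omitted (four models instead of seven), add-one pseudocounts replace pure maximum likelihood estimates, and the training sequences are invented. All function names are hypothetical.

```python
import math

def train_markov(seqs, h, n_phases):
    """Transition counts keyed by (phase of first context base, context).

    n_phases=3 gives a three-periodic model, n_phases=1 a homogeneous one.
    Rows start at 1.0 for every base (add-one pseudocount, an assumption).
    """
    counts = {}
    for s in seqs:
        for k in range(h, len(s)):
            key = ((k - h) % n_phases, s[k - h:k])
            counts.setdefault(key, dict.fromkeys("ACGT", 1.0))[s[k]] += 1
    return counts

def log_likelihood(s, counts, h, n_phases, offset=0):
    """log P(S | M) from transition terms only; offset selects the frame."""
    ll = 0.0
    for k in range(h, len(s)):
        key = ((k - h + offset) % n_phases, s[k - h:k])
        row = counts.get(key, dict.fromkeys("ACGT", 1.0))  # unseen: uniform
        ll += math.log(row[s[k]] / sum(row.values()))
    return ll

def posteriors(log_liks, priors):
    """Bayes' theorem as in Eq. (2), evaluated stably in log space."""
    joint = [ll + math.log(p) for ll, p in zip(log_liks, priors)]
    top = max(joint)
    weights = [math.exp(j - top) for j in joint]
    total = sum(weights)
    return [w / total for w in weights]

coding = train_markov(["GCT" * 8], h=1, n_phases=3)
noncoding = train_markov(["ATATTAATTATAATATTATAATTA"], h=1, n_phases=1)
window = "GCT" * 4
lls = [log_likelihood(window, coding, 1, 3, offset=o) for o in range(3)]
lls.append(log_likelihood(window, noncoding, 1, 1))
print(posteriors(lls, [0.25] * 4))  # frames 1-3, then non-coding
```

Averaging the coding posterior over all windows that lie inside an ORF and share its frame would then give an ORF score of the kind described above.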

3. Interpolated Markov Models

3.1. The Glimmer Program

GeneMark uses 5th order Markov models accounting for hexamer statistics at three codon positions in protein-coding sequences. An $m$th order model has $4^{m+1} - 4^m = 3 \cdot 4^m$ free transition probability parameters (three free probabilities for each of the $4^m$ contexts) to be estimated from the training data. For 5th order models, this means 3,072 transition probability parameters for each of the three codon positions for coding sequences as well as coding shadow sequences, and another 3,072 parameters for non-coding sequences. These parameters are estimated from the counts of oligomers in the training sequence, meaning that, for a 5th order model, each of the hexamers should occur often enough to provide a reliable estimate of the probability parameters. In practice, however, some hexamers may be rare while others may not occur at all in the training data. The corresponding probability parameters thus cannot be estimated reliably, adversely affecting the predictive ability of the model. Using lower order models may not be the optimal solution, as they may compromise the predictive power imparted by more informative longer oligomers that occur frequently enough to provide reliable statistics. The trade-off between reliable statistics from shorter oligomers and prediction power gained from longer oligomers is hard to resolve in favor of a single model order, though Borodovsky et al. (1999) suggested the ‘optimal’ model order appropriate for a range of training sequence sizes by studying the error rate in detecting short coding sequence segments as a function of the length of the training sequence. It was shown later that both shorter and longer oligomer statistics can be used together in one model framework to enhance the overall ability to detect protein-coding genes. These models, called interpolated Markov models (IMM), were implemented in the Glimmer gene prediction program (Salzberg et al., 1998).
This technique was adapted from the methodologies applied in language modeling, first used for speech recognition in the early 1980s (Jelinek and Mercer, 1980; Bahl et al., 1983). An interpolated Markov model of order h combines models of order 0 and higher, up to order h; an interpolated model of 8th order was used in the Glimmer algorithm. In this model framework, the probability of a nucleotide b to follow an


Fig. 1. GeneMark graphical output for a short region of the E. coli K12 genome. Each panel represents the a posteriori probability profile for being coding in one of the six reading frames (Adapted from Azad and Borodovsky, 2004b).


oligonucleotide of length $h$, termed context $c_h$, is defined by a recursive equation,

$$P^{\mathrm{IMM}}(b \mid c_h) = \lambda(c_h)\, P(b \mid c_h) + (1 - \lambda(c_h))\, P^{\mathrm{IMM}}(b \mid c_{h-1}), \qquad (3)$$

where $P(b \mid c_h)$ is the transition probability defined for a regular Markov chain model and $\lambda(c_h)$ is the interpolation parameter or weight assigned to $c_h$, $0 \le \lambda(c_h) \le 1$. This recursion has the initialization $P(b \mid c_{-1}) = 0.25$, assuming equidistribution. Glimmer scores an ORF as the sum of the log likelihood of each base computed as above. The success of this class of models depends on the reliable estimation of interpolation parameters. Intuitively, $\lambda(c_h)$ should take a value close to 1 if the count of the context $c_h$ is sufficiently high in the training sequence, implying that the use of shorter contexts for the prediction of nucleotide $b$ is not preferred; on the other hand, $\lambda(c_h)$ should be close to 0 for contexts $c_h$ with very low count, and thus the ‘interpolated’ transition probability of $b$ is estimated using its shorter contexts. Optimization of this model to derive ‘informative’ interpolation parameters is a non-trivial issue. Glimmer uses a heuristic approach to estimate these parameters: if the count of $c_h$, $N(c_h)$, exceeds a certain threshold $T$ (default = 400), $\lambda(c_h) = 1$, otherwise, an additional parameter, $C$, defining the $\chi^2$ test’s confidence that the frequency distribution of $b$ following the context $c_h$ is different from the distribution estimated using the interpolated probabilities corresponding to the next shorter context $c_{h-1}$, is used:

$$\lambda(c_h) = C\, \frac{N(c_h)}{T}. \qquad (4)$$

In this set-up, the λ-parameter is thus considered to be a function of the predictive power imparted by use of longer contexts, quantified through C, and the frequency of the context, N. Later versions of Glimmer used a more sophisticated model, the interpolated context model, a class of probabilistic decision trees using mutual information for branching. The earlier proposed interpolated Markov model was shown to be a special case of this. Additional modules to resolve overlapping genes were used in this version. The most recent version, Glimmer 3.0, reverts to the use of interpolated Markov models and is reported to significantly reduce false predictions, particularly in regions with overlapping genes, and to improve gene start detection (Delcher et al., 2007). A unique feature of this version is the ability to discriminate a bacterial endosymbiont genome from the host genome.
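The recursion of Eq. (3) with the heuristic weights of Eq. (4) can be written compactly as below. This is a sketch under stated assumptions, not Glimmer's code: the function name is hypothetical, the χ² confidence C is supplied by the caller as a function `chi2_conf` rather than computed, contexts are plain strings shortened by dropping their most distant base, the codon-position dependence is ignored, and the counts are invented.

```python
def imm_prob(b, context, counts, chi2_conf, T=400):
    """P_IMM(b | c_h) by the recursion of Eq. (3).

    counts maps a context string to a dict of next-base counts;
    counts[""] holds the order-0 (context-free) base counts.
    """
    if context is None:                       # c_{-1}: uniform initialization
        return 0.25
    shorter = context[1:] if context else None
    p_shorter = imm_prob(b, shorter, counts, chi2_conf, T)
    row = counts.get(context, {})
    n = sum(row.values())
    if n == 0:                                # unseen context: lambda = 0
        return p_shorter
    lam = 1.0 if n > T else chi2_conf(context) * n / T   # Eq. (4)
    return lam * row.get(b, 0) / n + (1.0 - lam) * p_shorter

# Invented toy counts: a well-sampled order-0 row and a sparse order-1 row.
counts = {"": {"A": 500, "C": 300, "G": 100, "T": 100},
          "G": {"A": 10, "C": 30, "G": 5, "T": 5}}
confidence = lambda context: 0.8              # placeholder for the chi-square C
print(imm_prob("C", "G", counts, confidence))
```

Because each level is a convex combination of two probability distributions, the interpolated probabilities over the four bases still sum to one, which is what allows an ORF to be scored by summing the logarithms of these values.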

3.2. Using Deleted Interpolation in Gene Prediction

Glimmer was reported to be highly sensitive in predicting protein-coding ORFs though it incurred many false positives. Later it was shown that by adjusting the threshold parameter, GeneMark produces similar accuracy (Borodovsky et al., 1999). This stimulated extensive studies of different model structures for gene


prediction. Another technique of interpolation, ‘Deleted Interpolation’, was adapted for use in gene prediction (Azad and Borodovsky, 2004a). This approach has a strong theoretical underpinning and uses optimization techniques to estimate interpolation parameters. Briefly, the training set is partitioned into a development set and a held-out set. The development set is used for estimating the probability parameters $P(b \mid c_h)$ to be utilized later in deriving interpolation parameters. The contexts for each model order are binned according to their frequencies in the development set, and the contexts in each bin or ‘frequency bucket’ are assigned a single value of λ. For model order h, the value of the λ-parameter for a bin is the one that maximizes the $\log P^{\mathrm{IMM}}(b \mid c_h)$ per nucleotide for this bin in the held-out set. This procedure ensures better generalization of the model to the yet unseen test set, that is, the values of the interpolation parameters carry sufficient predictive ability for deciphering the desired features in unseen data. Optimization is done successively, starting with $P^{\mathrm{IMM}}(b \mid c_0)$ and ending with $P^{\mathrm{IMM}}(b \mid c_h)$ for an interpolated model of order h. The initialization is done by assuming an equal distribution of nucleotides, i.e. $P^{\mathrm{IMM}}(b \mid c_{-1}) = P(b \mid c_{-1}) = 0.25$. At the final step, the development and held-out sets are combined to re-estimate $P(b \mid c_h)$ and then the desired $P^{\mathrm{IMM}}(b \mid c_h)$, using the previously obtained set of interpolation parameters.
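The core optimization step, choosing one λ per frequency bucket so that held-out data are predicted as well as possible, can be sketched as follows. This is a simplified illustration and not the published procedure: a plain grid search stands in for whatever optimizer was actually used, a single bucket and a single model order are shown, and the function name, probabilities and held-out events are invented.

```python
import math

def best_lambda(heldout_events, p_high, p_low, grid=101):
    """Grid-search lambda in [0, 1] for one frequency bucket.

    heldout_events: (context, next_base) pairs from the held-out set.
    p_high(b, ctx): P(b | c_h) estimated on the development set.
    p_low(b, ctx):  already interpolated P_IMM(b | c_{h-1}).
    """
    best_ll, best_lam = float("-inf"), 0.0
    for step in range(grid):
        lam = step / (grid - 1)
        ll = 0.0
        for ctx, b in heldout_events:
            p = lam * p_high(b, ctx) + (1.0 - lam) * p_low(b, ctx)
            ll += math.log(p) if p > 0 else float("-inf")
        if ll > best_ll:
            best_ll, best_lam = ll, lam
    return best_lam

# Invented numbers: the development set suggests C follows G 90% of the
# time, while the held-out set shows it only 8 times out of 10.
dev = {"G": {"A": 1 / 30, "C": 0.9, "G": 1 / 30, "T": 1 / 30}}
heldout = [("G", "C")] * 8 + [("G", "A")] * 2
lam = best_lambda(heldout, lambda b, c: dev[c][b], lambda b, c: 0.25)
print(lam)
```

In this toy case the optimum lies strictly between 0 and 1: the higher order estimate is informative but slightly overconfident, so some weight is shifted to the shorter context.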

Fig. 2. Error rates of identification of coding sequence segments of size 96 nt of a model B. subtilis genome as a function of model order and training set size. The coding sequence was generated by 6th order inhomogeneous Markov model. Error rates are shown for models built by deleted interpolation (DI), by χ2 -confidence based interpolation (Chi-I) and fixed order models (FO) (Adapted from Azad and Borodovsky, 2004a).


The performance of different models when used in a Bayesian pattern recognition algorithm for detecting coding and non-coding sequence segments of length 96 nt is shown in Fig. 2. Notably, as the training data become sparser, the model built by deleted interpolation outperforms the fixed order models, which could not be outperformed by the other interpolation model using the heuristic estimation procedures discussed above. A comparison of the distributions of the λ-parameter corresponding to the three codon positions reveals a remarkable variation in the case of the deleted interpolation model, which unexpectedly was absent in the case of the heuristically derived interpolation model. Further analysis showed that whereas for the former model the λ-parameter is an increasing step function of the context frequency, in the latter model it can take multiple values for a given context frequency, even both 0 and 1 for a context frequency close to the threshold T. Even for a deleted interpolation model of order h, h cannot be chosen arbitrarily large, as the reliable estimation of interpolation parameters depends on sufficient occurrence of the contexts $c_h$ and the lower order contexts in the training sequences. Clearly, the selection of an interpolation model, as well as of a model order that can impart significant prediction power, should be done with caution. For genomes with high or low G+C content, interpolation models were in some cases outperformed by fixed order models. These observations led the authors to conclude that ‘the optimal choice of model structure and model order is species specific’ (Azad and Borodovsky, 2004a).

4. Hidden Markov Models

The development of the theory of hidden Markov models (HMMs) started in the late 1960s, and the HMM soon became one of the most effective tools for natural language processing (Baum and Petrie, 1966; Rabiner, 1989; Durbin et al., 1998). It was adapted for use in gene prediction much later, when probabilistic methods were demonstrated to be quite promising in detecting the coding potential. The gene prediction problem was recast as segmentation of a DNA sequence into mainly protein-coding and non-coding regions. Given a DNA sequence $S = \{s_1, s_2, \ldots, s_N\}$, an HMM is used to label each nucleotide $s_i$ for its functional role. These labels define the ‘hidden’ states, $h_i$, of an HMM underlying the ‘observable’ states, $s_i$. The gene prediction problem is reformulated as finding the sequence of hidden functional states, $H = \{h_1, h_2, \ldots, h_N\}$, associated with the DNA sequence S. The hidden state can be, for example, a protein-coding state or a non-coding state, and the aim is to assign each base in S to either a protein-coding region or a non-coding region. As the occurrence of events is governed by probabilistic rules in an HMM, this problem is solved through one of the following approaches:

(a) Choose the most likely hidden state, $h^*$, at each sequence position; protein-coding regions are then inferred from the runs of contiguous coding states.

(b) Maximize the conditional probability $P(H \mid S, \Theta)$ over H (Θ is the given model) to obtain the most likely hidden path $H^*$; protein-coding regions are then inferred as described above.


HMM based gene finders decipher the protein-coding regions using either of the above two approaches. This is accomplished by using dynamic programming algorithms (Rabiner, 1989; Durbin et al., 1998), the former using the forward-backward algorithm and the latter using the Viterbi algorithm. Note that the forward-backward algorithm is also frequently used in estimating HMM parameters. These algorithms are at the core of an HMM driven prediction program; a brief description of each follows.

4.1. The Forward-Backward Algorithm

The forward-backward (FB) method uses dynamic programming like recursions that can be explained in terms of two variables: the forward variable $\alpha_t(i)$ and the backward variable $\beta_t(i)$. For a given DNA sequence $S = \{s_1, \ldots, s_N\}$ and the model Θ, $\alpha_t(i)$ is the probability of the nucleotide sequence $s_1, \ldots, s_t$ and state i at position t; $\beta_t(i)$ is the probability of the nucleotide sequence $s_{t+1}, \ldots, s_N$, given state i at position t. For a DNA sequence S and a model Θ, the following recursive equations can be used to compute the variables $\alpha_t(i)$ and $\beta_t(i)$ (Rabiner, 1989; Durbin et al., 1998):

$$\alpha_t(i) = \Big[ \sum_j \alpha_{t-1}(j)\, T_{j,i} \Big] P_i(s_t), \tag{5}$$

$$\beta_t(i) = \sum_j T_{i,j}\, P_j(s_{t+1})\, \beta_{t+1}(j), \tag{6}$$

where $T_{i,j}$ is the probability of transition from state i to state j, and $P_i(s_t)$ is the probability of the nucleotide $s_t$ in state i. The forward recursion [Eq. (5)] is initialized as $\alpha_1(j) = Q_j P_j(s_1)$, where $Q_j$ is the initial probability of being in state j. The backward variable [Eq. (6)] is initialized as $\beta_N(j) = 1$. The forward and backward variables can be used to compute the probability of a nucleotide $s_t$ being in state i, given the model Θ and the sequence S,

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(S)}, \tag{7}$$

where $P(S)$ is the probability of the sequence S given the model Θ. Its value follows from the recursive relation defining the forward variable, as $P(S) = \sum_i \alpha_N(i)$.
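Equations (5)-(7) translate almost line by line into code. The following Python sketch assumes a toy two-state model (a coding-like state "C" and a non-coding-like state "N") whose parameter values are invented for illustration. It also omits the scaling or log-space arithmetic that a real implementation would need to avoid numerical underflow on long sequences.

```python
def forward_backward(seq, states, Q, T, P):
    """Posterior state probabilities gamma_t(i) and P(S), following Eqs. (5)-(7)."""
    N = len(seq)
    # Forward pass: alpha_1(i) = Q_i P_i(s_1), then Eq. (5).
    alpha = [{i: Q[i] * P[i][seq[0]] for i in states}]
    for t in range(1, N):
        alpha.append({i: sum(alpha[t - 1][j] * T[j][i] for j in states) * P[i][seq[t]]
                      for i in states})
    # Backward pass: beta_N(i) = 1, then Eq. (6).
    beta = [dict.fromkeys(states, 1.0) for _ in range(N)]
    for t in range(N - 2, -1, -1):
        beta[t] = {i: sum(T[i][j] * P[j][seq[t + 1]] * beta[t + 1][j] for j in states)
                   for i in states}
    p_s = sum(alpha[N - 1][i] for i in states)  # P(S | model)
    gamma = [{i: alpha[t][i] * beta[t][i] / p_s for i in states} for t in range(N)]
    return gamma, p_s

# Hypothetical toy parameters: "C" prefers G/C, "N" prefers A/T.
states = ("C", "N")
Q = {"C": 0.5, "N": 0.5}
T = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
P = {"C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
gamma, p_s = forward_backward("GCGCATAT", states, Q, T, P)
```

Under approach (a) of the previous section, the predicted label at position t would simply be the state with the largest `gamma[t][i]`.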

4.2. The Viterbi Algorithm

Maximization of $P(H \mid S, \Theta)$, or equivalently of $P(S, H \mid \Theta)$, over all possible sequences of hidden states is done using the following dynamic programming method. Let $\eta_t(i)$ denote the maximum likelihood of the first t nucleotides and the state i at the t-th sequence position, along a single hidden state path H; this is computed using the recursive relation

$$\eta_t(j) = \max_i \big[ \eta_{t-1}(i)\, T_{i,j}\, P_j(s_t) \big], \tag{8}$$

initialized as $\eta_1(j) = Q_j P_j(s_1)$. The argument that maximizes the above expression is stored in the array $\phi_t(j)$. At the final step, the argument that maximizes $\eta_N(i)$ is obtained, and the array is then used to backtrack the hidden states at each position, thus yielding the most likely state sequence for the given DNA sequence.
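The recursion of Eq. (8) and the backtracking step can be sketched in Python as follows. The sketch works with logarithms to avoid underflow, which is a common practical choice and not something stated in the text, and it assumes that all model probabilities are strictly positive. The two-state toy model is hypothetical.

```python
import math

def viterbi(seq, states, Q, T, P):
    """Most likely hidden state path, using Eq. (8) in log space."""
    eta = [{i: math.log(Q[i]) + math.log(P[i][seq[0]]) for i in states}]
    phi = []  # phi[t-1][j]: best predecessor of state j at position t
    for t in range(1, len(seq)):
        row, back = {}, {}
        for j in states:
            best = max(states, key=lambda i: eta[t - 1][i] + math.log(T[i][j]))
            back[j] = best
            row[j] = eta[t - 1][best] + math.log(T[best][j]) + math.log(P[j][seq[t]])
        eta.append(row)
        phi.append(back)
    # Backtrack from the best final state.
    path = [max(states, key=lambda i: eta[-1][i])]
    for back in reversed(phi):
        path.append(back[path[-1]])
    return path[::-1]

# Hypothetical toy parameters: "C" prefers G/C, "N" prefers A/T.
states = ("C", "N")
Q = {"C": 0.5, "N": 0.5}
T = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
P = {"C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
path = viterbi("GCGCGCATATAT", states, Q, T, P)
```

For this toy input the compositional shift is strong enough to pay for one state transition, so the decoded path switches from "C" to "N" exactly where the sequence changes from GC-rich to AT-rich.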

4.3. HMM Training

The probability of a nucleotide being in state i at position t, followed by a nucleotide in state j, can be obtained using the forward and backward variables:

$$\xi_t(i, j) = \frac{\alpha_t(i)\, T_{i,j}\, P_j(s_{t+1})\, \beta_{t+1}(j)}{P(S)}. \tag{9}$$

Summing $\gamma_t(i)$ and $\xi_t(i, j)$ over position t gives the expected number of transitions from state i and the expected number of transitions from state i to state j, respectively. These values can be used to re-estimate the model parameters, i.e. the initial probabilities of being in state i, the transition probability from state i to state j, and the observation probability of a nucleotide, $P_i(s_t)$, while in state i. It was shown that the probability of the sequence given the re-estimated model $\bar{\Theta}$ is either equal to or greater than the probability of the sequence given the model Θ, i.e. $P(S \mid \bar{\Theta}) \geq P(S \mid \Theta)$ (Dempster et al., 1977; Rabiner, 1989). This process is repeated to iteratively refine the model until $P(S \mid \bar{\Theta})$ saturates, that is, until no further increase in the probability of the sequence is observed. This procedure is often used to obtain the maximum likelihood estimate of the model parameters, although it does not guarantee the globally optimal solution.

The procedure described above, also known as the Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Rabiner, 1989), is implicit in the self learning gene prediction algorithms suited for raw (yet uncharacterized) genome sequences (Baldi, 2000). The EM algorithm is formulated in two main steps: the expectation step and the maximization step. The first step is the expectation with respect to the hidden variables, given the current estimate of the model parameters and the observation sequence. The second step is maximization over the model. For a given data set S and model Θ, the EM algorithm thus proceeds by maximizing, with respect to the model Θ′, the expectation of the log-likelihood function:

$$F(\Theta, \Theta') = \sum_H P(H \mid S, \Theta)\, \log P(S, H \mid \Theta').$$

Here H denotes the sequence of hidden states underlying the observation sequence S. The model Θ′ that maximizes this function increases the likelihood of the data under this model compared to the previous model Θ. The expectation and maximization steps are repeated alternately until convergence is reached.
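One re-estimation step based on Eqs. (7) and (9) can be sketched as below. The sketch trains on a single short sequence without scaling, so it is meant only to show how the expected counts are turned into new parameters. The starting parameters, the toy sequence and the function name are assumptions.

```python
def baum_welch_step(seq, states, Q, T, P, alphabet="ACGT"):
    """One EM iteration; returns (Q', T', P', P(S | current model))."""
    N = len(seq)
    a = [{i: Q[i] * P[i][seq[0]] for i in states}]  # forward variables
    for t in range(1, N):
        a.append({i: sum(a[t - 1][j] * T[j][i] for j in states) * P[i][seq[t]]
                  for i in states})
    b = [dict.fromkeys(states, 1.0) for _ in range(N)]  # backward variables
    for t in range(N - 2, -1, -1):
        b[t] = {i: sum(T[i][j] * P[j][seq[t + 1]] * b[t + 1][j] for j in states)
                for i in states}
    p_s = sum(a[-1].values())
    gamma = [{i: a[t][i] * b[t][i] / p_s for i in states} for t in range(N)]
    # Expected number of i -> j transitions: xi_t(i, j) of Eq. (9) summed over t.
    xi = {i: {j: sum(a[t][i] * T[i][j] * P[j][seq[t + 1]] * b[t + 1][j]
                     for t in range(N - 1)) / p_s
              for j in states} for i in states}
    new_Q = {i: gamma[0][i] for i in states}
    new_T = {i: {j: xi[i][j] / sum(xi[i].values()) for j in states} for i in states}
    new_P = {i: {x: sum(g[i] for g, s in zip(gamma, seq) if s == x)
                    / sum(g[i] for g in gamma)
                 for x in alphabet} for i in states}
    return new_Q, new_T, new_P, p_s

# Hypothetical starting model and toy training sequence.
states = ("C", "N")
Q = {"C": 0.6, "N": 0.4}
T = {"C": {"C": 0.8, "N": 0.2}, "N": {"C": 0.3, "N": 0.7}}
P = {"C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
seq = "GCGCGCGCATATATATGCGCGCGC"
lik = []
for _ in range(5):
    Q, T, P, p = baum_welch_step(seq, states, Q, T, P)
    lik.append(p)  # likelihood under the model before this update
```

The list `lik` records $P(S \mid \Theta)$ before each update and should never decrease, in line with the inequality $P(S \mid \bar{\Theta}) \geq P(S \mid \Theta)$.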

[Fig. 3 diagram: coding states 1, 2, 3; one non-coding state; shadow states 3, 2, 1; each hidden state emits a single nucleotide (nt).]

Fig. 3. A standard HMM architecture. There are three hidden states each for coding model and coding shadow model, and one hidden state for non-coding model. Each state emits a nucleotide and makes transition to next allowed state. Oval and square represent respectively a hidden state and an emission state. The allowed hidden state transitions are shown by line arrows and emissions by block arrows (Adapted from Azad and Borodovsky, 2004b).

HMM based gene prediction algorithms use either a standard HMM or a generalized HMM (also called an HMM with duration; see Rabiner, 1989; Durbin et al., 1998; Azad and Borodovsky, 2004b). In the former, each hidden state emits a single nucleotide and then makes either a self transition or a transition to a different state. Figure 3 shows a standard HMM with 3 hidden states corresponding to the three codon positions for the coding model, 3 hidden states for the shadow model and 1 hidden state for the non-coding model. A generalized HMM provides a framework to explicitly model the length distributions of protein-coding regions and non-coding regions. Each hidden state emits a sequence of nucleotides and makes a transition to a different state. Instead of the seven states mentioned above for a classic HMM, this HMM requires only 3 hidden states (Fig. 4; hidden states for start and stop codons as well as for their complements have also been added).

4.4. The ECOPARSE Program

The application of hidden Markov models in gene finding was pioneered by Krogh et al. (1994). Their ECOPARSE program was designed to find protein-coding genes in E. coli, where it could identify 90% of known genes, and several other predictions were supported by the presence of homologs in the database. The components of this HMM were mainly a coding model derived from the codon statistics and an intergenic model accounting for the nucleotide distribution in non-coding regions as well as the probabilities of start and stop codons. In addition to the coding state, the coding model also contained hidden states for insertion and deletion.

[Fig. 4 diagram: hidden states start, coding, stop, r-stop, shadow, r-start and noncoding, emitting respectively a start codon, coding sequence, stop codon, r-stop codon, shadow sequence, r-start codon and noncoding sequence.]

Fig. 4. A generalized HMM architecture. There are seven hidden states representing coding regions, coding shadow regions, non-coding regions, start codon, stop codon, and reverse complement of start and stop codon. Each state emits a string of nucleotides and makes transition to another state. Oval and square represent respectively a hidden state and an emission state. The allowed hidden state transitions are shown by line arrows and emissions by block arrows (for details, see Azad and Lawrence, 2005).

This allowed the model to not just ‘emit’ a triplet or codon from the coding state, but also to insert or delete nucleotides in the triplets though with very small probabilities. This feature helped in identifying frameshifts in coding sequences. The intergenic model was more elaborate, there were separate models for short intergenic regions ( 4 then the window of sequence is predicted as protein-coding. Successive windows in same frame with SNR > 4 thus identify a protein-coding region which is searched further upstream and downstream to locate the start and stop codon, respectively. In the post-processing step, each of the predicted ORFs is verified again for the presence of distinctive peak at f = 1/3 to minimize the false predictions. Main advantages of this method are its independence of the training sequences and robustness to sequencing errors. However, due to the relatively higher window size (= 351 nt) needed to accumulate sufficient coding signal in the power spectrum of ‘coding windows’, this method does not perform satisfactorily in detecting the short genes.

5.2. The Lengthen-Shuffle Program

To address the issue of low SNR observed in short coding sequences, Yan et al. (1998) suggested lengthening the sequence to amplify the 3-base periodicity signal. In contrast to GeneScan, their Lengthen-Shuffle program is based on a Z curve representation of a DNA sequence and requires training sets. A window of sequence was repeated many times until the sequence length became 1,024 nt or greater, and was then randomly shuffled many times with a triplet as the unit. The Z transform of a sequence is obtained as follows. Let $N_A$, $N_T$, $N_C$, $N_G$ denote the counts of the four nucleotides up to position n in a given sequence of length L; the Z transform in a three dimensional space is obtained as

$$X_n = 2(N_A + N_G) - n, \quad Y_n = 2(N_A + N_C) - n, \quad Z_n = 2(N_A + N_T) - n,$$

where n = 0, 1, . . . , L. The differences $\Delta X_n = X_n - X_{n-1}$, and similarly $\Delta Y_n$ and $\Delta Z_n$, were obtained for each n. Each of these three represents a binary sequence of values 1 and −1. $\Delta X_n$ equals 1 if the n-th nucleotide is a purine (A/G) and −1 if it is a pyrimidine (C/T). Likewise, $\Delta Y_n$ represents the distribution of amino (A/C) and keto (G/T) bases along a sequence, and $\Delta Z_n$ represents the distribution of weak (A/T) and strong (G/C) bases. The power spectrum of each of the three binary sequences was then obtained, and their values at frequency f = 1/3 represent the components of a vector in a three dimensional space. The Fisher linear discriminant algorithm is used to first ‘learn’ a plane discriminating between coding and non-coding sequences, and an appropriate threshold was established for prediction (see Sec. 8.1 for more detail).
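As a small illustration of the Z transform and of the periodicity signal at f = 1/3, consider the Python sketch below. The normalization of the power by sequence length and the handling of characters other than A, C, G, T (they count as −1 here) are assumptions of this sketch; the lengthening, shuffling and Fisher discriminant steps are not reproduced.

```python
import cmath

def z_curve(seq):
    """Cumulative Z-curve coordinates (X_n, Y_n, Z_n) for n = 0..L."""
    X, Y, Z = [0], [0], [0]
    for b in seq:
        X.append(X[-1] + (1 if b in "AG" else -1))  # purine vs pyrimidine
        Y.append(Y[-1] + (1 if b in "AC" else -1))  # amino vs keto
        Z.append(Z[-1] + (1 if b in "AT" else -1))  # weak vs strong
    return X, Y, Z

def power_at_one_third(curve):
    """Power of the step sequence (the Delta values) at frequency f = 1/3."""
    steps = [curve[n] - curve[n - 1] for n in range(1, len(curve))]
    s = sum(d * cmath.exp(-2j * cmath.pi * n / 3) for n, d in enumerate(steps))
    return abs(s) ** 2 / len(steps)

X, Y, Z = z_curve("AGCAGC")
periodic = power_at_one_third(X)                # clear 3-base pattern in Delta X
flat = power_at_one_third(z_curve("AAAAAA")[0])  # no 3-base pattern
```

A sequence with a repeating purine-purine-pyrimidine pattern gives a pronounced value at f = 1/3, whereas a homopolymer gives essentially zero; the three values obtained from X, Y and Z form the vector that is passed to the discriminant.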


It was reported that the lengthening and shuffling of a sequence improved the prediction accuracy by 6–7%. The caveats to Fourier transform based approaches include the absence of 3-base periodicity signal in several known protein-coding sequences; notwithstanding these limitations, the Fourier transform methodologies offer complementary tools for identifying yet undiscovered genes.

6. Self-Organizing Maps

6.1. The RescueNet Program

This program was developed with the objective of identifying yet undetected genes, particularly those of atypical composition (Mahony et al., 2004). At the center of this program is a Self-Organizing Map (SOM), a class of artificial neural network that can decipher the intra-genomic compositional variation; as a consequence, multiple gene models can be built to identify genes of distinct compositional features. Note that other popular programs like GeneMark or EasyGene typically use 2 or 3 gene models, assuming the genes to form mainly two classes, typical and atypical (one of these classes is sometimes further divided into two). Without any preconceived notion of the number of gene classes, RescueNet automatically identifies multiple gene classes characterizing the compositional variation in a genome. Relative synonymous codon usage (RSCU), the ratio of the observed count of a codon to that expected assuming a uniform distribution of codons in a synonymous codon group (representing an amino acid), is used to measure the protein-coding potential of a given sequence. RescueNet uses long genes (>750 bp) with homology evidence reported in the literature for the SOM training. Each of these genes is represented as a 59-dimensional RSCU vector, u (excluding the three stop codons and two other codons with no synonymous alternative). The SOM is a 15 × 15 node lattice; each node represents a model (or a vector, v, in RSCU space) whose elements are initialized to random values. Proximity of a gene to a node is assessed by the cosine of the angle between the RSCU vectors for the gene and the node (value 1 means exactly similar and 0 means exactly dissimilar). The gene is assigned to its closest node, whose vector v is then updated as v + η(u − v), where η is the learning rate, which is initialized to ∼1 and decreases linearly during training.
Vectors for certain neighboring nodes surrounding this node are also allowed to be updated similarly. Assignment of all genes completes one cycle, which is done recursively many times and leads to the clustering of similar genes. This trained SOM is then used to measure the coding potential of a sequence in all six possible frames (in terms of its proximity to gene classes). In practice, if the probabilistic score for a moving window of 110 triplets exceeds a certain threshold, this is predicted as part of a protein-coding region. Genes are thus identified by processing the consecutive windows predicted as coding in same reading frame. RescueNet was reported to perform well in predicting genes in GC-rich genomes and also in detecting several novel genes otherwise missed by previous programs.
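The RSCU representation and a single SOM update can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the standard genetic code, updates only the best-matching node (the neighbourhood update and the 15 × 15 lattice are omitted), and all function names are hypothetical and not taken from RescueNet.

```python
import math
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
FAMILY = Counter(CODE.values())  # number of synonymous codons per amino acid
# 59 codons: stop codons and the single-codon amino acids (Met, Trp) are excluded.
CODONS = sorted(c for c, aa in CODE.items() if aa != "*" and FAMILY[aa] > 1)

def rscu_vector(gene):
    """59-dimensional RSCU vector: observed count / count expected under uniform usage."""
    counts = Counter(gene[i:i + 3] for i in range(0, len(gene) - 2, 3))
    total = Counter()
    for c in CODONS:
        total[CODE[c]] += counts[c]
    return [counts[c] * FAMILY[CODE[c]] / total[CODE[c]] if total[CODE[c]] else 0.0
            for c in CODONS]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0

def som_update(nodes, u, eta):
    """Assign u to its closest node and move that node: v <- v + eta * (u - v)."""
    best = max(range(len(nodes)), key=lambda k: cosine(u, nodes[k]))
    nodes[best] = [v + eta * (x - v) for v, x in zip(nodes[best], u)]
    return best

u = rscu_vector("GCTGCTGCCAAA")       # codons GCT, GCT, GCC, AAA
nodes = [[0.0] * 59, [1.0] * 59]      # two toy nodes instead of a lattice
winner = som_update(nodes, u, eta=1.0)
```

With η = 1 the winning node is replaced by the gene vector itself; smaller values of η move the node only part of the way, which is what happens later in training as η decreases.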


7. Directed Acyclic Graphs

7.1. The FrameD Program

FrameD was designed to minimize the over-prediction of overlapping genes, specifically in GC-rich genomes where long overlaps are more probable (Schiex et al., 2003). This program is based on a weighted directed acyclic graph and models gene overlaps explicitly. The six coding frames and the non-coding state are represented by seven parallel tracks. Here a track should be visualized as consisting of equally spaced vertices; alternating ‘content edges’ and ‘signal edges’ connect consecutive vertices. A content edge represents a nucleotide at a given position in a sequence, and a signal edge signifies a transition from that track either to itself (self transition) or to one of the other six tracks. Given a sequence in which overlaps of coding regions are not assumed, the occurrence of a start codon in a frame i implies a transition from the non-coding track to the track representing frame i. When a stop codon is encountered in this path, a transition occurs from this coding track to the non-coding track. A ‘bi-coding’ track was incorporated to model gene overlaps in all possible frames. Frameshifts were accounted for by including additional edges or allowing jumps over edges. A weight is associated with each edge, defined as the logarithm of the probability of an edge in a selected path. The probability of a content edge is simply the emission probability of the corresponding nucleotide estimated in the framework of an interpolated Markov model (Salzberg et al., 1998). The weights of signal edges are estimated by obtaining the RBS binding energy for start codons, appropriate penalties for stop codons and frameshifts, and the probability of transition between tracks for other signal edges. The shortest path in this directed graph represents the parse of a sequence into protein-coding and non-coding regions.
The other attributes of the FrameD program are its ability to detect frameshift and the flexibility to include the homology information in prediction.

8. Linear Discriminant Function

8.1. The ZCURVE Program

Guo et al. (2003) proposed a gene finder which is based on the Z curve representation of DNA sequences (Zhang and Zhang, 1991). This method derives a set of 33 parameters taking nucleotide and dinucleotide compositions at the three codon positions into account. Briefly, for a codon position i, the Z transform of a DNA sequence is obtained as

$$X^i = (P_A^i + P_G^i) - (P_C^i + P_T^i),$$

$$Y^i = (P_A^i + P_C^i) - (P_G^i + P_T^i),$$

$$Z^i = (P_A^i + P_T^i) - (P_G^i + P_C^i),$$

where P denotes the probability of a nucleotide; note that $X^i, Y^i, Z^i \in [-1, 1]$. This represents a point in a three dimensional space. Considering all codon positions i = 1, 2, 3, the Z transform thus gives 9 parameters; the other 24 parameters are obtained by considering the dinucleotides situated at codon positions 1 and 2. A given ORF can thus be represented as a point in a 33 dimensional space, visualized as a super-cube with side length 2. In this program, a set of seed ORFs was used as positive samples in the training; negative samples were not obtained directly from the non-coding sequences but by randomly shuffling the nucleotides in the seed ORFs many times. These training samples were then used in a Fisher linear discriminant algorithm, wherein a super plane in the 33 dimensional space was learnt to discriminate between the positive and negative samples. This super plane is represented by a vector v with 33 elements. A threshold T was established by equalizing the false positive rate and the false negative rate. The Z score of an ORF represented by a 33 element vector u was obtained as $Z(u) = v \cdot u - T$. If $Z(u) > 0$, then the ORF was predicted as protein-coding.
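The nine position-specific parameters and the final scoring rule can be sketched in a few lines of Python. The 24 dinucleotide parameters and the training of the Fisher discriminant are left out, so the weight vector `v` and threshold used below are placeholders invented for illustration, not values from ZCURVE.

```python
def zcurve_params(orf):
    """Nine Z-curve parameters: (X^i, Y^i, Z^i) for codon positions i = 1, 2, 3."""
    params = []
    for i in range(3):
        column = orf[i::3]  # nucleotides found at codon position i + 1
        p = {b: column.count(b) / len(column) for b in "ACGT"}
        params += [(p["A"] + p["G"]) - (p["C"] + p["T"]),
                   (p["A"] + p["C"]) - (p["G"] + p["T"]),
                   (p["A"] + p["T"]) - (p["G"] + p["C"])]
    return params

def z_score(u, v, T):
    """Z(u) = v . u - T; a positive score means 'predicted protein-coding'."""
    return sum(a * b for a, b in zip(v, u)) - T

u = zcurve_params("ACGACG")
score = z_score(u, v=[1.0] * 9, T=0.5)  # placeholder weights and threshold
```

For the toy ORF every codon position carries a single nucleotide type, so each parameter sits at a corner of the cube, i.e. at +1 or −1.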

9. Unsupervised Model Training: The Self-Learning Algorithms

The earlier versions of prokaryotic gene finders using Markov models trained the models on experimentally validated sets of data. Later, the fast pace of sequencing resulted in multiple genomes being sequenced before even a fraction of their protein-coding genes had been experimentally determined. This posed a challenge to such prediction methods, but Audic and Claverie (1998) found a way to solve this problem. They suggested an unsupervised procedure for learning the model parameters. A given genome sequence was randomly partitioned into three sets, which were then used for learning the parameters of three Markov models characterizing the functional categories: protein-coding, protein-coding shadow and non-coding. Using these models, the a posteriori probability of a sequence segment representing each of the above categories was computed as in the GeneMark program. The subsequent steps involved using the predictions to refine the models, which were used again to classify the sequences. This iterative procedure was repeated until convergence, that is, until further iterations reassigned the sequences to the same clusters. This ‘self-training’ method thus derives the final set of model parameters and, in the process, generates the final prediction. The performance of this program was found to be comparable to that of GeneMark, with reported gene prediction accuracy of up to 90%.

9.1. The GeneMark-Genesis Program

The necessity to predict genes in mostly anonymous genome sequences led to the development of several machine learning methods around this time. GeneMark-Genesis, an extension of GeneMark, used an unsupervised training procedure for deriving the parameters of multiple gene models (Hayes and Borodovsky, 1998). It was based on the premise that long ORFs are unlikely to be non-coding; thus ORFs with length >700 nt were used to build a Markov model of protein-coding sequences of order up to five, while a model of non-coding sequences was represented by the frequency distribution of nucleotides in the genome. Gene predictions by the algorithm using these models were then used to refine the models and, following the next iteration, the predicted long ORFs were assigned to the typical gene class, and the remaining low scoring long ORFs were assigned to the atypical gene class. This provided initial seeds for the classification of genes into two classes by a k-means clustering algorithm. The clustering algorithm used the Kullback-Leibler divergence to quantify the difference in codon usage patterns between genes or gene classes. At each iteration of this algorithm, genes were reassigned to the classes with the closest class centers, and the class centers were recomputed. This was repeated until convergence. The gene models trained on these classes were then employed along with the non-coding model in the Bayesian algorithm for scoring ORFs. Note that the Glimmer algorithm also selects non-overlapping long ORFs above a certain threshold length for learning the parameters of gene models. The threshold is determined by maximizing the number of non-overlapping ORFs in a genome.
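The clustering step can be illustrated with a generic k-means loop that uses the Kullback-Leibler divergence as its distance. The sketch below is not the GeneMark-Genesis implementation: it assumes that each gene is already summarized as a frequency vector with strictly positive entries (for example, pseudocount-smoothed codon frequencies), it takes the class center to be the arithmetic mean of its members, and it uses 3-dimensional toy vectors in place of real codon usage.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q); q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean(vectors):
    """Component-wise arithmetic mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kl_kmeans(genes, centers, max_iter=100):
    """Reassign genes to the closest center and recompute centers until stable."""
    centers = [list(c) for c in centers]  # work on a copy of the seeds
    labels = None
    for _ in range(max_iter):
        new = [min(range(len(centers)), key=lambda k: kl(g, centers[k]))
               for g in genes]
        if new == labels:  # converged: no gene changed class
            break
        labels = new
        for k in range(len(centers)):
            members = [g for g, l in zip(genes, labels) if l == k]
            if members:
                centers[k] = mean(members)
    return labels, centers

# Toy 'usage' vectors and seed centers (hypothetical values).
genes = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.2, 0.2, 0.6]]
labels, centers = kl_kmeans(genes, centers=[[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])
```

Here the two seed centers play the role of the initial typical and atypical classes, and the loop stops once no gene changes class.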

9.2. The GeneMarkS Program

A heuristic method for building models ‘on the fly’ was suggested by Besemer and Borodovsky (1999). It was shown that reliable models of protein-coding and non-coding sequences can be built even from a very small amount of data (∼400 nt). This approach derives parameters for an anonymous genome sequence using linear functions relating nucleotide frequencies at the three codon positions to the global nucleotide frequencies, and amino acid frequencies to the G+C content. The heuristic models were used as seed models in an iterative gene-finding algorithm, GeneMarkS (Besemer et al., 2001). At each iteration step, the parse of a sequence was used to redefine the models, and this process was continued until the parse obtained at a step did not differ significantly from the parse of the previous step. Essentially this was a new version of GeneMark.hmm, with unsupervised iterative model training replacing the supervised training, and with the parameters of the RBS model also refined iteratively along with the other models.

9.3. The MED Program

Another recently proposed unsupervised gene prediction program, MED 2.0, is based on an information theoretic measure of amino acid composition in protein-coding genes (Zhu et al., 2007). The discrimination between protein-coding ORFs and non-coding ORFs is based on an entropic function, the ‘Entropy Density Profile’ (EDP). In contrast to the frequently used codon usage pattern as a statistical discriminant, this program exploits amino acid usage instead, by obtaining a 20-dimensional vector whose elements are defined as

$$\frac{p_i \log p_i}{\sum_{i=1}^{20} p_i \log p_i},$$

where $p_i$ is the probability of amino acid i. This quantifies the information content of each amino acid normalized by the Shannon information entropy (the denominator with a negative sign). It was shown that protein-coding and non-coding ORFs form two distinct clusters in the EDP vector space, which was exploited to classify the ORFs in a given genome. Starting with sets of root protein-coding and non-coding ORFs obtained by using ‘universal’ EDP cluster centers, an ORF is classified into one of the two classes based on the Euclidean distance between its EDP vector and the vector representing a cluster center. ORF classification is refined iteratively, together with translation start site refinement, until a converging set of predictions is obtained.

10. Using Similarity Search in Gene Prediction

Similarity of the protein product of a query sequence to protein sequences in a protein sequence database is the most compelling evidence of the protein-coding potential of the query sequence. However, the success of this approach depends on the breadth and depth of the sequence database. Also, the ‘orphan’ genes unique to a species, i.e. those having no homologues in the database, cannot be detected by such extrinsic approaches. Hence a combination of intrinsic and extrinsic approaches to gene prediction can be more powerful than either approach alone. Methods using both intrinsic and extrinsic measures were developed later, when a sufficient amount of sequence data had accumulated in the protein sequence database. The main features of some popular algorithms using either extrinsic information alone or both extrinsic and intrinsic evidence are described below.

10.1. The ORPHEUS Program

Earlier, extrinsic evidence was often utilized as a measure of confidence in the predictions made by ab initio gene prediction programs. In contrast, the ORPHEUS program prioritizes extrinsic evidence over intrinsic information. This program starts by searching for high scoring sequence segments in a genome sequence against a database of protein sequences. The highly conserved sequence segments thus identified were extended at both ends to the first start and stop codons in the correct frame. These ORFs comprised the first set of predictions. This set of ORFs was then used for training the gene model, whose parameters characterized the summary statistics of codon usage observed in the ORFs. The coding potential of ORFs whose protein products did not show significant similarity to database proteins was measured by the following quantity:

$$\Omega = Q(a_1 a_2 \cdots a_N) - \max\{Q(b_1 b_2 \cdots b_N),\; Q(c_1 c_2 \cdots c_N)\},$$

where $Q(a_1 a_2 \cdots a_N)$ denotes the log likelihood of the sequence of codons $(a_1 \cdots a_N)$, $\sum_{i=1}^{N} \log f(a_i)$, normalized in standard deviation units to be independent of sequence length; $(b_1 \cdots b_N)$ and $(c_1 \cdots c_N)$ denote the other two frame settings for the sequence of codons; and $f(a_i)$ is the frequency of the codon $a_i$. If Ω exceeds a certain threshold, the ORF in question is predicted to be a gene. Following the above procedure, a second set of predicted ORFs was obtained. The final set was the union of the two sets of predicted ORFs, with the start codon moved to the farthest start codon at the 5′-end if the sequence downstream of this start codon (size ∼99 nt) has sufficient coding potential as measured through Ω. In the last step, parameters of an RBS model were derived to further refine the gene starts.

10.2. The CRITICA Program

Log likelihood scores computed using both extrinsic and intrinsic approaches were combined in the CRITICA algorithm (Badger and Olsen, 1999). The BLASTN program was used to search for homologs of the sequence segments in a query genome sequence. The high scoring segment pairs were then assimilated. The similarity in the amino acid translation of the sequence segments in each pair was assessed using a log likelihood function in all six reading frames. The log likelihood function for aligned triplets was defined as the logarithm of the ratio of the probability of the aligned triplets in a coding frame setting to that in a non-coding setting. This score is a function of both the nucleotide similarity in the aligned triplets and the corresponding amino acid match. In this scheme, aligned triplets with a greater nucleotide difference but encoding the same amino acid are assigned a higher score, and aligned identical triplets are given a score of zero, as they do not carry any comparative value. This score is combined with a ‘dicodon’ frequency score that depends on the log likelihood function

$$\ln \frac{f_{\text{coding}}(t_i \mid t_{i-1})}{f_{\text{non-coding}}(t_i \mid t_{i-1})},$$

where $f(t_i \mid t_{i-1})$ is the frequency of triplet $t_i$ given the preceding triplet $t_{i-1}$. The score for a sequence segment is obtained by simply summing the combined scores for all aligned triplets. If this score is found to be statistically significant, the sequence segment is predicted as protein-coding. The 3′-end of the sequence segment was extended, terminating at the first stop codon encountered. The 5′-end of the predicted coding sequence was determined through a log odds score for all possible start codons, combined with a score for a potential Shine-Dalgarno sequence. Note that CRITICA also uses an unsupervised training procedure; here the dicodon usage statistics are learnt in an iterative fashion, which makes this program suitable for anonymous genome sequences.
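The dicodon part of the score is a plain sum of log likelihood ratios over consecutive triplets, as the short Python sketch below shows. The comparative (alignment-based) part of the CRITICA score is not reproduced, and the two frequency tables, keyed here by the pair (preceding triplet, triplet), contain invented values used only to make the example runnable.

```python
import math

def dicodon_score(seq, f_coding, f_noncoding):
    """Sum of ln[f_coding(t_i | t_{i-1}) / f_noncoding(t_i | t_{i-1})] over a segment."""
    triplets = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return sum(math.log(f_coding[(prev, cur)] / f_noncoding[(prev, cur)])
               for prev, cur in zip(triplets, triplets[1:]))

# Hypothetical conditional triplet frequencies for the two models.
f_coding = {("ATG", "AAA"): 0.04, ("AAA", "TAA"): 0.01}
f_noncoding = {("ATG", "AAA"): 0.02, ("AAA", "TAA"): 0.02}
score = dicodon_score("ATGAAATAA", f_coding, f_noncoding)
```

Positive contributions come from triplet pairs that are more frequent in coding than in non-coding sequence; in this toy example one pair contributes ln 2 and the other ln 0.5, so the total is zero.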

10.3. The BDGF Program

Shibuya and Rigoutsos (2002) developed the BDGF (Bio-Dictionary Gene Finder) program, which exploits the presence of conserved sequence patterns in a query sequence. This program used the Bio-Dictionary, a database of patterns termed ‘seqlets’ that were derived from known proteins using the Teiresias algorithm, an unsupervised pattern discovery algorithm (Rigoutsos and Floratos, 1998). The basic idea behind this gene finder is straightforward: search for seqlets in the protein product of the ORF in question. If matching seqlets occur frequently in the ORF, declare the ORF a protein-coding gene. An appropriate threshold was established to optimize the performance of this algorithm.
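The seqlet-counting idea can be sketched as follows. The seqlets, the use of `.` as a wildcard position, and the density threshold are all hypothetical choices for illustration; the actual Bio-Dictionary patterns and the BDGF decision rule are not reproduced here.

```python
import re

def seqlet_hits(protein, seqlets):
    """Count occurrences (overlaps included) of each seqlet in a protein sequence."""
    hits = 0
    for pattern in seqlets:
        # A zero-width lookahead lets overlapping matches be counted separately.
        hits += len(re.findall(f"(?={pattern})", protein))
    return hits

def looks_coding(protein, seqlets, min_hits_per_100_residues=2.0):
    """Declare an ORF product coding if its seqlet density reaches a threshold."""
    if not protein:
        return False
    density = 100.0 * seqlet_hits(protein, seqlets) / len(protein)
    return density >= min_hits_per_100_residues

# Hypothetical seqlets; '.' stands for any residue.
toy_seqlets = ["G.GKS", "DE.D"]
print(looks_coding("MAGSGKSTLDEADKK", toy_seqlets))
print(looks_coding("MLLLLLLLLL", toy_seqlets))
```

Normalizing by length is one simple way to avoid favoring long ORFs; other normalizations are equally plausible.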

10.4. The EasyGene Program

An elaborate HMM of a gene is at the core of the EasyGene program (Larsen and Krogh, 2003). The parameters of this non-looped HMM are learnt from training data assembled using database similarity search. Specifically, the longest possible ORFs between two in-frame stop codons whose protein products have significant matches in a protein database form the training set for the gene model, after paralogs are removed. The parameters of the HMM are estimated using the Baum-Welch algorithm. Unlike HMMs used previously, this HMM has models for several distinctive sequence features around the start and stop codons of a gene (Fig. 5). Just before the model of start codons, there is a null model followed by an RBS model. The former describes the general composition of the genome and also the coding regions on the complementary strand, while the latter represents the ribosomal binding site with seven hidden states and a spacer with a variable number of states. To account for distinctive patterns in codon frequencies near the gene start and end, codon models specific to these regions are incorporated in the HMM. There are three models for the sequence of internal codons, each representing one of three distinct gene classes. The model order is variable, with the highest order, the 4th, being used for the model of coding regions. EasyGene uses the forward-backward (FB) algorithm to compute the posterior probability of each nucleotide being emitted by a state, and each ORF is scored using a log-odds function. The score of a sequence $S$ containing an ORF is defined as $\log \frac{P(S \mid M)}{P(S \mid N)}$, where $P(S \mid M)$ and $P(S \mid N)$ are the probabilities of $S$ given that the generating model is the hidden Markov model $M$ for a gene with flanks and the null hidden Markov model $N$, respectively. To reduce false positives arising mainly from over-prediction of short genes, EasyGene computes the statistical significance of the score of a putative gene. This is done by obtaining the expected number of ORFs predicted with the same length-adjusted score or higher in a random genome sequence of 1 Mbp. A third order homogeneous Markov model of the genome was used to generate the random sequence. The most recent version of EasyGene includes modules for the prediction of alternative start codons (Nielsen and Krogh, 2005).

Fig. 5. The non-looped HMM architecture of a gene used in the EasyGene program. Each box represents a sub-model with more than one hidden state. The number above a box represents the number of nucleotides emitted by hidden states of the sub-model (Larsen and Krogh, 2003).

10.5. The GISMO Program

The most recent gene finder in this class, GISMO, employs an HMM and a supervised learning method, the Support Vector Machine (SVM), to distinguish protein-coding ORFs from non-coding ORFs (Krause et al., 2007). In contrast to pairwise sequence similarity search for identifying evolutionarily conserved ORFs, GISMO uses an HMM-based protein domain search, which yields a better signal to noise ratio and also readily identifies new orderings of domains in genes. A given genome sequence is translated in all possible reading frames, and the translated regions are searched for significant matches in the Pfam database of protein domains. The ORFs with conserved domains form the initial set of predicted genes. This set is then used for learning the parameters of the SVM. This classifier uses non-linear functions, mainly the Gaussian kernel function, to learn a hyperplane that separates the protein-coding ORFs from the non-coding ORFs in the high dimensional parameter space. In this setup, the normalized codon frequencies were found to have the highest predictive power. The ORFs were scored by measuring their distance from the ‘trained’ hyperplane, and an appropriate threshold was established to optimally classify the ORFs.
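The two ingredients named above, normalized codon frequencies as features and a Gaussian kernel as the similarity measure, can be sketched in a few lines. This is only an illustration: the SVM training step is omitted (in practice a library would be used), and the function names and the `gamma` value are assumptions rather than GISMO's actual settings.

```python
import math
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_frequencies(orf):
    """Normalized codon frequencies of an in-frame ORF as a 64-dimensional vector."""
    counts = dict.fromkeys(CODONS, 0)
    n = 0
    for i in range(0, len(orf) - len(orf) % 3, 3):
        codon = orf[i:i + 3]
        if codon in counts:  # skip codons containing ambiguous bases
            counts[codon] += 1
            n += 1
    return [counts[c] / n if n else 0.0 for c in CODONS]

def gaussian_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma is a free parameter."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

x = codon_frequencies("ATGGCTGCTTAA")
y = codon_frequencies("ATGAAAAAATAA")
print(gaussian_kernel(x, x), gaussian_kernel(x, y))
```

ORFs with similar codon usage give kernel values close to 1, which is what lets the classifier group them on the same side of the separating hyperplane.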

11. Gene Start Prediction

The earlier gene prediction programs attempted to identify the ORFs where coding regions reside; that is, identifying the reading frame and stop codon of a gene was considered a success. While the first in-frame stop codon encountered in the reading frame of a gene unambiguously defines its 3′-end, determination of its 5′-end is often confounded by the presence of several potential start codons. The earlier versions of GeneMark and Glimmer chose the 5′-most start codon, i.e. among all possible ORFs for a stop codon, the longest ORF with sufficient coding potential was considered the most likely protein-coding gene candidate. The experimentally validated N-termini of proteins show that this is true for the majority of genes. However, a significant fraction of genes have their actual start codon downstream of the 5′-most start codon. Determination of the exact gene start is important for analyzing the protein product as well as for predicting promoters, operons, non-coding RNAs etc. Gene identification is deemed complete only when both the 5′- and 3′-ends are determined precisely.

Stormo et al. (1982) pioneered the use of computational methods for detecting translation initiation sites (TIS) in prokaryotes in the early 1980s. However, substantial efforts towards computational identification of the 5′-end of genes were made only in the 1990s, when many gene prediction programs had already achieved high accuracy in detecting the 3′-end of genes. The HMM of the ECOPARSE program incorporated a gene start model defined by the probability distribution of start codons in known protein-coding genes (Krogh et al., 1994). Subsequently developed gene prediction methods used more sophisticated models for gene start detection; these methods were based on several distinctive features around the gene start, the most prominent being the conserved patterns typically observed 8-10 bases upstream of the gene start. Prior to translation of an mRNA transcript, the ribosome binds to a region upstream of the start codon; this region is known as the ribosomal binding site (RBS). The earlier methods located the RBS by searching for a motif, a 6 bp consensus sequence complementary to the 3′-end of the 16S rRNA sequence. Schurr et al. (1993) suggested a method based on computing the binding energy between 16S rRNA and the potential RBS regions; the region with the maximum free energy gain upon binding to 16S rRNA was identified as the putative RBS. Different approaches to locating the RBS proposed later were soon adapted to refine the predicted gene starts at a post-processing step; alternatively, models of the informative features around the TIS were integrated with the main model in a modular framework such as an HMM to predict the 5′-end and 3′-end simultaneously. Hannenhalli et al.
(1999) defined a score for a potential gene start site as a linear combination of weighted log likelihood scores for five parameters: 1) distance between the actual start codon and the 5′-most start codon, 2) the start codon, 3) ribosomal binding energy, 4) spacer length, and 5) a start codon score (likelihood of a non-coding window upstream of the TIS and a protein-coding window just downstream). Among all possible start codons for a gene, the one with the maximum value of this score was selected as the gene start. An optimization technique, Mixed Integer Programming, was used to obtain an optimal set of weight parameters.

Both GeneMark.hmm and the later versions of Glimmer use models of the RBS at the post-processing stage. In GeneMark.hmm, a multiple sequence alignment of known RBS sequences (each of length 16 nt) from regions located upstream of start codons (from −19 to −4) was built using a simulated annealing algorithm (Lukashin and Borodovsky, 1998). In this procedure, a score computed for an alignment within a window of size 5 was maximized by moving a randomly selected sequence by a few nucleotides each time. Successive iterations result in a converging alignment represented by a matrix of positional nucleotide frequencies. A consensus RBS sequence is read off as the sequence of nucleotides with the highest probability at each position in the matrix. The potential RBS sequences upstream of each possible start codon are then scored by their likelihood of being generated by the RBS model. The most likely RBS sequence, and thus the start codon, is then determined using an appropriate threshold. In contrast to GeneMark.hmm’s post-processing, Suzek et al. (2001) suggested a probabilistic model for the post-processing of Glimmer predictions, which is derived by first finding the consensus or seed RBS sequence. The seed sequence represents the reverse complement of a sequence at the 3′-end of 16S rRNA with the maximum frequency of occurrence upstream of annotated start codons. A window is then moved upstream of start codons to locate sequences matching the seed sequence significantly, and these sequences are subjected to multiple sequence alignment to derive a position weight matrix, which is then used to compute the likelihood of a subsequence upstream of all possible start codons for a gene. The final score for an RBS is computed as the product of the likelihood score and a probabilistic score for spacers. The start codon of a gene is identified by the highest scoring RBS.

Around this time efforts were also made to integrate an RBS model in an HMM framework for gene prediction. The EasyGene program used an RBS model with seven hidden states and a spacer model with a variable number of hidden states to model lengths ranging from 3 to 12 nucleotides between the RBS and the start codon (Larsen and Krogh, 2003). To obtain a training set for models of the RBS and other distinctive features upstream of a gene start, only those ORFs with a database match extending to the region near their 5′-most start codon were selected; this generated a training set of ORFs with validated gene starts. The parameters were estimated using a simulated annealing technique, which involved adding noise of decreasing intensity to the parameters in each iteration of the Baum-Welch optimization procedure. Note that the GeneMarkS program also derives parameters of an RBS model and a spacer model in parallel with parameter derivation for other models (Besemer et al., 2001).
In this self-training iterative procedure, the Gibbs sampling method was used to align sequences upstream of the predicted gene starts, and the identified conserved regions were then used to obtain a positional nucleotide frequency matrix and a spacer length distribution. The start codons identified by this two-component model redefined the training set for the next iteration, and this process was iterated until the algorithm converged. Recent methods for gene start prediction have attempted to incorporate other parameters that may correlate with translation initiation. For example, the four-component program MED-Start by Zhu et al. (2004) models the correlation between the translation termination site and the TIS, and between the TIS and the upstream consensus signal, in addition to the models for the consensus sequence around the RBS and the start codon. The parameters of this program are also estimated using an iterative self-training procedure. In contrast, the most recent gene start finder, Hon-yaku, uses a supervised procedure for training models of the RBS, spacer, start codon, nucleotide composition downstream of the start codon, protein length distribution and operon orientation (Makita et al., 2007). The component models are integrated in a Bayesian algorithm to compute the posterior probability for a gene to start from one of the possible TIS.
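Several of the approaches in this section share a common core: a positional nucleotide frequency matrix for the RBS, a spacer length distribution, and selection of the start codon with the best combined score. The sketch below illustrates that core under simplifying assumptions. The aligned sites, pseudocount, spacer probabilities and function names are hypothetical, and the alignment step itself (simulated annealing or Gibbs sampling) is taken as already done.

```python
from collections import Counter

def frequency_matrix(aligned_sites, pseudocount=1.0):
    """Positional nucleotide frequencies from equal-length aligned RBS sites."""
    total = len(aligned_sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(aligned_sites[0])):
        counts = Counter(site[pos] for site in aligned_sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return matrix

def consensus(matrix):
    """Most probable nucleotide at each position."""
    return "".join(max("ACGT", key=col.get) for col in matrix)

def site_likelihood(site, matrix):
    p = 1.0
    for base, col in zip(site, matrix):
        p *= col[base]
    return p

def rbs_score(upstream, matrix, spacer_prob):
    """Best (site likelihood x spacer probability) over window placements upstream of a start."""
    width, best = len(matrix), 0.0
    for offset in range(len(upstream) - width + 1):
        spacer = len(upstream) - (offset + width)  # bases between site and start codon
        score = site_likelihood(upstream[offset:offset + width], matrix) * spacer_prob.get(spacer, 0.0)
        best = max(best, score)
    return best

def pick_start(candidates, matrix, spacer_prob):
    """candidates maps each possible start position to the sequence just upstream of it."""
    return max(candidates, key=lambda s: rbs_score(candidates[s], matrix, spacer_prob))

matrix = frequency_matrix(["AGGAGG", "AGGAGG", "AGGAGA", "AAGAGG"])
spacers = {5: 0.2, 6: 0.3, 7: 0.3, 8: 0.2}  # hypothetical spacer length distribution
starts = {100: "TTTAGGAGGACATTCA", 160: "CTCTCTCTCTCTCTCT"}
print(consensus(matrix), pick_start(starts, matrix, spacers))
```

Multiplying the site likelihood by the spacer probability mirrors the product form described for the Glimmer post-processor; a log-space sum would be the more numerically robust choice for long motifs.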

12. Resolving Overlapping Genes

Partial overlaps of genes are ubiquitous in prokaryotes (Johnson and Chisholm, 2004). Overlapping regions carry a double genetic-code load for genes in different reading frames. Though short overlaps are frequent, gene finders often have to contend with the spurious detection of genes on the strand complementary to the one where the real genes lie. This is particularly the case when the strands are scanned one after the other. The ‘shadow’ of a gene is a consequence of self-complementary RNY codons found in abundance in highly expressed genes. As a rule, the shorter of two long overlapping ORFs is eliminated to minimize false predictions. To tackle this problem more effectively, GeneMark incorporated a model of gene shadow, which made it possible to scan genes on both strands in parallel by analyzing one strand only. If ORFs in different reading frames overlap by more than a specified length, Glimmer scores the overlap regions separately. If the overlap region scores higher in the reading frame of the longer ORF, and moving the shorter ORF’s start site sequentially to alternative downstream start codons does not resolve the overlap, the shorter ORF is eliminated. Often the overlapping regions do not provide sufficient data to train models for the four different orientations of overlapping gene pairs (Fig. 6). Gene overlap was not considered in the HMM architecture of GeneMark.hmm, and the Viterbi parse of a genome thus consists of coding segments interrupted by non-coding segments of at least one base. This implies that the predicted genes may be shorter than they actually are, or that some genes may even be missed. Use of an RBS model at the post-processing step, however, allows the 5′-ends of predicted genes to be extended so that they correspond to the actual gene overlaps. The most recent version of GeneMark.hmm is equipped with overlap models and thus can detect gene overlaps at the prediction step.
To deal with the issue of overlapping genes, both the GeneHacker Plus and EasyGene programs use a non-looped HMM of the gene structure with flanking regions included. Each gene is thus scored independently of its overlap with neighboring genes.
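The simple rule mentioned above, eliminating the shorter of two substantially overlapping ORFs, can be sketched as follows. This is a deliberately reduced illustration: strand, reading frame and coding scores are ignored, ORFs are plain inclusive coordinate pairs, and the `max_overlap` value is a hypothetical threshold. It is not the procedure used by Glimmer or any other specific program.

```python
def overlap_length(a, b):
    """Number of positions shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def resolve_overlaps(orfs, max_overlap=30):
    """Visit ORFs from longest to shortest; drop any ORF that overlaps an
    already kept ORF by more than max_overlap positions."""
    kept = []
    for orf in sorted(orfs, key=lambda o: o[1] - o[0], reverse=True):
        if all(overlap_length(orf, k) <= max_overlap for k in kept):
            kept.append(orf)
    return sorted(kept)

candidates = [(1, 900), (850, 1000), (880, 1500), (2000, 2300)]
print(resolve_overlaps(candidates))
```

Short overlaps are tolerated because, as noted above, they are common among real prokaryotic genes; only long overlaps trigger elimination.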

Fig. 6. Four possible gene overlaps. Right (left) pointing arrows represent genes on direct (complementary) strand.

13. Non-coding RNA Gene Prediction

Often the phrase ‘gene prediction’ is used to imply prediction of genes that encode proteins. In addition to identifying the protein-coding genes, an important task in genome sequencing projects is to identify those genes that yield functional RNA products instead of coding for proteins. RNA is a single stranded molecule composed of four nucleotide subunits: adenine, guanine, cytosine and uracil. Unlike messenger RNAs, which are intermediaries in protein formation, non-coding RNAs form complex three dimensional structures through intramolecular base pairings (G pairing with C and A pairing with U). Base pairs are often stacked on top of one another to form A-form double helices called stems. They are interrupted by structures formed by unpaired bases, called loops. A typical secondary structure of an RNA molecule is shown in Fig. 7. The non-coding RNAs are components of cellular machines, including the ribosome, and some even catalyze biochemical reactions. The most common examples are ribosomal RNA, transfer RNA, small nucleolar RNA and micro-RNAs. Identification of non-coding RNA genes poses a significant challenge to computational biologists, primarily because of the difficulty in recognizing signals such as promoters, initiators or terminators, the absence of specific nucleotide ordering patterns, and the low sequence conservation in the evolution of these genes.

Fig. 7. tRNA secondary structure. Nucleotides that are always present are shown in circles, while oval indicates their presence depending on a structure (reprinted from Laslett and Canback, 2004 (original version at Sprinzl et al., 1996), with permission from Oxford University Press).

Several methods have been developed in recent years to identify non-coding RNA genes; as in protein-coding gene prediction, these methods exploit intrinsic evidence, extrinsic evidence, or both. Carter et al. (2001) suggested a machine learning method based on neural networks and support vector machines. This method involved training several networks: a network with parameters learnt from the nucleotide and dinucleotide composition of known RNA genes and non-coding sequences, and a network trained on structural motifs found mainly in RNA tetraloops. Trained networks were used for detecting RNA genes in bacteria and hyperthermophilic archaea. This intrinsic approach was reported to achieve an accuracy of over 90% on bacterial genomes and a still better performance on hyperthermophilic archaeal genomes. Other methods were aimed at detecting specific RNA transcription signals and supporting the signal-based predictions by finding homologues in the database. The strategy of Argaman et al. (2001) was to search for transcription initiation and termination signals in intergenic regions; the predicted sequences were then matched against the RNA sequence database, and those with significant hits were deemed putative non-coding RNA genes. Application to the E. coli genome yielded 24 putative small RNA genes, and experimental tests confirmed 14 as novel RNA genes. Wassarman et al. (2001) combined sequence conservation with genomic microarray probing for RNA transcripts to find putative RNA genes in the E. coli genome; this approach discovered 17 novel non-coding RNA genes. The recently proposed sRNAPredict program also searches for sequence conservation and transcription signals in the intergenic regions; when applied to the Vibrio cholerae genome, it detected 32 novel non-coding RNAs in addition to 9 out of 10 known non-coding RNAs (Livny et al., 2005).
Unlike protein-coding genes, non-coding RNA genes show low conservation at the sequence level. Rivas et al. (2001) exploited the conservation of RNA secondary structure in their comparative genomic screen for non-coding RNAs. Three models were assumed for the conserved regions in a genome: an RNA model to account for the conserved base-paired RNA secondary structure, a coding model accounting for the codon position dependent conserved patterns, and a third model for the position independent conserved patterns of all other conserved sequences. A log-odds function is used to compute the likelihood of a sequence being generated by the RNA model as opposed to the other two models; a positive score is indicative of a conserved RNA secondary structure at the locus of interest. Application of this algorithm to the E. coli genome yielded 275 putative non-coding RNA loci, a few of which were shown to express small RNA transcripts of unknown function. Profile hidden Markov models are often invoked to score sequence alignments, particularly at the level of amino acid residue conservation. In contrast, scoring the secondary structure and primary sequence alignment of non-coding RNAs is non-trivial. This requires models that can account for the long-range correlations arising from the base pairing in the RNA secondary structure. Such a model adapted from computational linguistics is known as ‘stochastic context-free grammars’ or SCFGs (Durbin et al., 1998). SCFGs have been used in the tRNA gene prediction program tRNAscan-SE by Lowe and Eddy (1997). Griffiths-Jones et al. (2005) employed profile SCFGs to compile a database of non-coding RNA families in all three kingdoms of life. Some non-coding RNAs, such as ribosomal RNA, show high conservation at the primary sequence level and may even be modeled by HMMs. Recently Lagesen et al. (2007) developed an HMM-based program, RNAmmer, for predicting rRNA genes. Though comparative genome analysis is the theme of most RNA gene prediction programs, some recent programs like GPboostReg have focused on using solely intrinsic evidence to decipher non-coding RNA genes (Saetrom et al., 2005). GPboostReg is a machine learning algorithm that combines genetic programming and a boosting algorithm to create classifiers for non-coding RNA genes (positive sequences) and intergenic regions (negative sequences). Genetic programming guides the simulated evolution of candidate solutions for classifying the sequences; in each round, candidate solutions with high fitness are selected and then perturbed through noise, and this process is repeated until an optimal classifier, one that performs best on the training set, is obtained. The boosting algorithm is used to combine the classifiers using appropriate weights to optimize the performance of the prediction algorithm.

14. Assessing Gene Prediction Programs

A number of prokaryotic gene prediction programs have emerged in the last two decades; assessing these programs is confounded by the absence of datasets of experimentally validated protein-coding and non-coding ORFs, even for a single genome. Generally the annotated genes are taken as the reference, which is not strictly valid since genes are often annotated using a gene prediction program and thus reflect the biases of the methods used. Still, carefully done annotations represent experts’ opinion and judgment, and are thought to be closer to reality. Any program that can reproduce most of the annotations is assumed to be an effective predictor. The accuracy of gene prediction programs is generally assessed using two parameters: sensitivity (Sn) and specificity (Sp). Sensitivity is the percentage of real genes in a test genome that are correctly identified by a program. Specificity is the percentage of predicted genes that match the real genes. Sn and Sp have a reciprocal relationship: an increase in Sn in general causes a decrease in Sp, and vice versa. A prediction program uses a threshold to balance these two parameters, and the performance is often assumed to be optimized when the average value of these parameters is maximized. Sometimes the complements of these parameters, that is, (100% − Sn) and (100% − Sp), are used as error rate quantifiers to assess the performance of a program.

Most gene prediction programs developed since the beginning of the last decade have reported a high level of sensitivity and specificity in predicting the annotated genes. A survey of papers published in the last ten years presents an impressive record of accuracy achieved by programs based on probabilistic models of gene structure. Gene prediction algorithms such as GeneMarkS, GeneHacker Plus and EasyGene have attained sensitivity and specificity of over 90% on nearly all available prokaryotic genomes. Glimmer reports a sensitivity of 98–99%, but generates many false positives. The most recent version, Glimmer 3.0, shows a significant improvement in specificity. FrameD was shown to perform comparably with GeneMarkS on six representative genomes spanning the GC range in prokaryotes. The most recent gene finders, MED 2.0 and GISMO, also perform comparably with the above programs. There is no definitive evidence that the new programs have achieved better accuracy; however, the prediction programs have shown complementary strengths, and the development of new programs therefore remains necessary until this problem is completely solved.

For a fair comparison of gene prediction algorithms, efforts have been made to generate test sets of ORFs having significant matches in the database. Additionally, sets of experimentally validated genes have also been used where available. Larsen and Krogh (2003) compiled a set of 2,042 putative E. coli genes whose protein translations have at least one significant match in the protein database. Note that these genes have only their 3′-end precisely defined; a gene is thus considered identified if its 3′-end is correctly predicted. The performance of five gene finders (EasyGene, Glimmer, ORPHEUS, GeneMarkS and Frame-by-Frame) was assessed on this test set (Table 1). False predictions by these programs were assessed by the number of predictions in the gene shadow regions. This test showed the sensitivity of all programs to lie in the range ∼96–98%. Only EasyGene, GeneMarkS and Frame-by-Frame made no false predictions in the shadow region.
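The two measures defined in this section can be computed directly once a matching criterion is fixed. The sketch below assumes that a gene is identified by its strand and 3′-end coordinate, in line with the 3′-end matching used for the test set above; the coordinates are made up for illustration.

```python
def sensitivity_specificity(annotated, predicted):
    """Return (Sn, Sp) in percent.
    Sn = matched genes / annotated genes; Sp = matched genes / predicted genes."""
    annotated, predicted = set(annotated), set(predicted)
    matched = len(annotated & predicted)
    sn = 100.0 * matched / len(annotated) if annotated else 0.0
    sp = 100.0 * matched / len(predicted) if predicted else 0.0
    return sn, sp

# Each gene is keyed by (strand, 3'-end position); values are hypothetical.
annotated = {("+", 900), ("+", 1500), ("-", 2000), ("+", 3100)}
predicted = {("+", 900), ("+", 1500), ("-", 2000), ("-", 2500), ("+", 4000)}
print(sensitivity_specificity(annotated, predicted))  # (75.0, 60.0)
```

Note that specificity in this gene-finding sense is the fraction of predictions that are correct, which is the quantity often called precision elsewhere.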
Table 1. Performance of gene prediction programs on test sets of conserved E. coli K12 genes (adapted from Larsen and Krogh, 2003).

Test set                                       Measure                                    EasyGene  Glimmer  Orpheus  GeneMarkS  Frame-by-Frame
Conserved genes with confirmed 3′-end          % 3′-end match                             98.1      98.4     95.6     96.3       96.1
Conserved genes with confirmed 5′- and 3′-end  % 5′- and 3′-end match                     93.8      95.3     92.4     88.0       93.2
Shadow of conserved genes                      # false predictions in gene shadow region  0         21       9        0          0

In order to evaluate the programs in detecting both the 5′- and 3′-end (that is, the accuracy in gene start prediction), ORFs with matches extending up to their 5′-most start codon were selected. This set comprised 1,136 genes with precisely defined gene boundaries. The sensitivity ranged in the interval ∼90–95%, though this test set is biased in favor of programs with a tendency to predict the longest ORFs.

Another comprehensive comparison of different programs, namely EasyGene, Glimmer, GeneMarkS, MED and ZCURVE, was carried out in a recent paper by Zhu et al. (2007) (Table 2). Test sets were created mainly from the validated genes of E. coli and B. subtilis, perhaps the best characterized bacterial genomes available. The Link dataset is a compilation of 195 N-terminally confirmed E. coli genes (Link et al., 1997). Another, larger set of 854 experimentally validated E. coli genes is compiled in the EcoGene dataset by Rudd (2000). EcoGene has 58 genes shorter than 300 nt; this provided another test set, EcoGene short, for testing the efficiency in detecting short genes. Bsub123, Bsub72 and Bsub51 are the three sets of annotated short genes (93% in detecting 3′-end of GC-rich genes in sets Mtub66 and Paer107, the difficulty in identifying gene starts was obvious particularly when tests were done on the M. tuberculosis genes (Table 2). Hon-yaku, a most recent program designed to predict only gene starts, outperformed other programs in this category, particularly when tested on GC-rich genomes (Makita et al., 2007).

Table 2. Performance of gene prediction programs on test sets of confirmed genes. See text for detail (adapted from Zhu et al., 2007).

% 3′-end match
Test set       EasyGene  Glimmer  GeneMarkS  MED   ZCURVE
EcoGene        99.4      99.4     99.9       99.1  98.8
Link           100.0     100.0    100.0      99.0  100.0
EcoGene short  93.3      96.6     100.0      93.1  86.2
Bsub123        73.0      87.8     97.6       95.1  91.9
Bsub72         82.4      87.5     98.6       94.4  93.1
Bsub51         84.8      82.3     98.0       92.2  90.2
Psaer107       100.0     95.3     93.5       97.2  95.3
Mtub66         97.5      97.0     98.5       95.5  97.0

% 5′- and 3′-end match
Test set       EasyGene  Glimmer  GeneMarkS  MED   ZCURVE
EcoGene        91.1      91.9     93.8       92.0  89.2
Link           92.1      94.4     94.4       93.3  92.3
EcoGene short  90.0      89.7     98.3       91.4  77.6
Bsub123        66.0      77.2     87.8       85.4  78.0
Bsub72         76.5      77.8     93.1       87.5  86.1
Bsub51         81.8      78.4     94.1       90.2  84.3
Psaer107       88.0      90.6     85.0       93.5  91.6
Mtub66         82.5      80.3     80.3       87.9  75.8

The critical assessment of non-coding RNA gene prediction programs is notoriously difficult. The most common experimental procedure for validation is the Northern blot, which tests for RNA expression. RNAs are expressed under different conditions, which often leads to an underestimation of sensitivity. Most methods were tested on the E. coli genome; thus the only up to date comparison of four methods available is from tests in detecting non-coding RNA genes in this genome. These methods, by Argaman et al. (2001), Wassarman et al. (2001), Carter et al. (2001) and Rivas et al. (2001), have been described in the section on non-coding RNA gene prediction. They identified 24, 60, 370 and 275 non-coding RNA genes, respectively; of the fraction of these genes tested for expression by Northern blot, 31 were found to be expressed under various conditions. The highest sensitivity was achieved by the method of Rivas et al., based on RNA secondary structure conservation, which predicted 22 of these genes, 6 of which were unique to this approach. The method of Argaman et al. showed the highest specificity, with 14 of its 24 predictions matching the expressed RNAs.
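As a quick check of the figures quoted above, and assuming that the 31 expressed RNAs are taken as the reference set (only a fraction of the predictions were tested, so these are rough estimates), the best sensitivity and specificity values work out as follows:

```latex
\[
\mathrm{Sn}_{\text{Rivas}} = \frac{22}{31} \approx 71\%, \qquad
\mathrm{Sp}_{\text{Argaman}} = \frac{14}{24} \approx 58\%
\]
```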

15. Discussion

A survey of the computational methods for prokaryotic gene prediction presents a picture of significant accomplishments in the rather short span of the last ten years. In fact this has become a showcase of the achievements of computational techniques in molecular biology. Most methods, mainly those using probabilistic models, have raised the accuracy bar above the 90% level. Earlier this degree of accuracy was attained in 3′-end detection; now, with the use of elaborate RBS models, 5′-ends are also detected with nearly the same precision. The new algorithms developed in the last couple of years have added little, if anything, in terms of improvement in gene prediction. One may be tempted to say that prokaryotic gene prediction has reached a saturation level, and that this problem appears solved with an error margin of ∼5–10%. In hindsight, this might seem an exaggeration considering the fact that prediction programs have been assessed on test sets of annotated genes that are yet to be validated, or on only a handful of experimentally confirmed genes. A robust evaluation platform remains elusive; until we have even a single genome with all protein-coding genes precisely determined, a fair comparison of prediction methods will not be realized. This is a significant challenge, and the current status of prokaryotic gene prediction calls for investment of considerable effort here. Construction of reliable test beds is vital not just for robust and consistent application of existing programs, but also in guiding the development of novel methods that can take the accuracy bar to higher levels. Development in these directions is also required to exploit other avenues of accuracy improvement, including utilizing the complementarities of prediction methods, which have hitherto not been given sufficient attention for obvious reasons.

Precise identification of genes is often confounded by the presence of overlapping genes in prokaryotes. In fact a significant fraction of false predictions has been attributed to this factor. Programs like Glimmer have devised several strategies to resolve the over-prediction of overlapping ORFs (Delcher et al., 1999, 2007). Overlaps also make it difficult to detect translation initiation signals. The last few years have seen concerted efforts in addressing this problem, though there is still significant room for improvement. There is a tendency towards achieving higher sensitivity in gene identification; however, this objective is achieved only at the cost of a sizeable number of ‘unwanted’ or false predictions by most programs. By studying the length distributions of known proteins and annotated ones, Skovgaard et al. (2001) have argued that most genomes are over-annotated and that this tendency increases with increasing G+C content of a genome. The stop codons TAA, TAG and TGA, being AT-rich, occur less frequently in GC-rich genomes, and this causes an increase in the abundance of long non-coding ORFs. To assess the extent of over-annotation, a ‘unique gene set’ containing no mutually similar genes was created from the set of annotated genes in a genome, and the length distribution of the known proteins in this set (those found in the SWISS-PROT database) was then compared with that of the remaining genes translated to ‘proteins’ (no matches in SWISS-PROT). A clear distinction between the two distributions was observed (Fig. 8); overall, the genes with no matches in the database were shorter than the genes with matches to known proteins.
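The effect of G+C content on stop codon frequency noted above can be made concrete with a back-of-the-envelope calculation. The sketch assumes independent nucleotides with P(G) = P(C) and P(A) = P(T), which is a crude simplification of real genomes; the function names are illustrative only.

```python
def stop_codon_probability(gc):
    """Probability that a random triplet is TAA, TAG or TGA under an
    independent-nucleotide model with the given G+C fraction."""
    at_base = (1.0 - gc) / 2.0  # P(A) = P(T)
    gc_base = gc / 2.0          # P(G) = P(C)
    p_taa = at_base ** 3
    p_tag = at_base ** 2 * gc_base
    p_tga = at_base ** 2 * gc_base
    return p_taa + p_tag + p_tga

def expected_codons_to_first_stop(gc):
    """Mean number of triplets up to and including the first stop (geometric distribution)."""
    return 1.0 / stop_codon_probability(gc)

for gc in (0.3, 0.5, 0.7):
    print(gc, round(expected_codons_to_first_stop(gc), 1))
```

Under this model the expected distance to the first stop codon grows as G+C content increases, which is consistent with the greater abundance of long non-coding ORFs in GC-rich genomes.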
This meticulous study by Skovgaard et al. (2001) highlighted the main source of over-annotations of genes, namely, the short non-coding ORFs which are frequently predicted as genes by

Fig. 8. Length distribution of unique proteins in E. coli that match SWISS-PROT entries (red line) and those which do not match (blue line) (printed with permission from Skovgaard et al., 2001).

Genes in Prokaryotic Genomes and Their Computational Prediction

almost all programs. This tendency of over-annotation grows with the increasing G + C content of an organism. Over-prediction of short genes was one of the main motivations for developing the EasyGene program by Larsen and Krogh (2003), who attempted to address this problem by obtaining the statistical significance of the ‘coding’ score assigned to an ORF. This should be considered one of the significant developments in prokaryotic gene prediction in recent times. In addition to the challenge of correctly predicting the reading frame or the 3′-end of a gene in a GC-rich genome, predicting the exact gene start is made more difficult by the larger number of possible start codons (GTG being the most frequent) per stop codon. The frequent long overlaps of ORFs severely confound accurate gene detection in such genomes. This is one of the biggest issues with current bacterial and archaeal gene modelers. Furthermore, great caution is needed in dealing with short ORFs, as they do not provide sufficient statistical signal for robust predictions. Different methods have shown varying sensitivities in accurate prediction of short genes. It is important to assess the limitations of a method in dealing with short genes before applying it to a whole genome; artificial genome sequences provide a test bed for probabilistic methods (Azad and Borodovsky, 2002; Azad and Lawrence, 2005). Conserved short genes provide another test set (Besemer et al., 2001). Indeed, using these test sets, it was shown that combining models of different orders through appropriate interpolation techniques has the potential to improve short gene detection (Azad and Borodovsky, 2002). There is a need to devise other strategies for handling short ORFs. Unlike eukaryotes, prokaryotes have a significant fraction of their genes acquired through horizontal gene transfer (Ochman et al., 2000).
These genes reflect the mutational proclivities of donor organisms and thus exhibit distinct compositional bias in the recipient genome context. These so-called atypical genes often escape detection, and so most gene finders employ more than one gene model to address this problem. Typically three gene models are used, corresponding to native, highly expressed and horizontally transferred genes. Use of multiple gene models was observed to improve considerably the detection of atypical genes (Hayes and Borodovsky, 1998). Yet it is believed that some of the false negatives are such genes, and further improvement of methods is needed to identify genes of significantly distinct compositions. Although the application of neural networks to automatic generation of multiple gene classes was a step forward in this direction, there is still a need for alternative methods that address this problem satisfactorily. A gene clustering method proposed recently (Azad and Lawrence, 2007) is a potentially powerful tool to segregate genes of similar compositions into classes representing the genic variability in a genome. It might be possible to build as many gene models as there are distinct gene classes, and this can help in the identification of genes with different histories.

R. K. Azad

Lastly, the most important resource for identifying and validating genes is the public databases. Earlier programs based on extrinsic evidence were developed at a time when only a few tens of complete genomes were available. Now, with over 500 complete prokaryotic genomes available, there is a need to revisit and fine-tune these programs. This has also given us the opportunity to develop novel, more accurate methods. Future strategies should be devoted to combining intrinsic and extrinsic evidence, in addition to utilizing the complementary strengths of different methods, to bring prokaryotic gene prediction closer to a complete solution.

16. Further Reading

For those interested in understanding the basics of HMMs, Rabiner (1989) and Rabiner and Juang (1986) are excellent introductions. For a more comprehensive review and the latest developments in this field, readers are referred to Ephraim and Merhav (2002). In the context of gene prediction and other biological applications of HMMs, the book by Durbin et al. (1998) is highly recommended. Those interested in the theory of SVMs are referred to the book by Vapnik (1995). Some reviews in the field worth reading are Fickett and Tung (1992), Eddy (2002), Azad and Borodovsky (2004b), Overbeek et al. (2007) and Meyer (2007).

Acknowledgments

The author thanks Eric Polinko and Lydia Daniel for critical reading of the manuscript. Several valuable suggestions from three anonymous reviewers are also gratefully acknowledged.

CHAPTER 3 EVOLUTION OF THE GENETIC CODE: COMPUTATIONAL METHODS AND INFERENCES

GREG FOURNIER

1. Introduction

The three known domains of life (bacteria, archaea, and eukarya) show a striking amount of physiological and morphological variation, the consequence of thousands of gene families simultaneously evolving over the 4 billion year history of life on Earth. However, despite this tremendous amount of accumulated diversity, every existing living organism still conforms to the central dogma of molecular biology: DNA sequences (genes) are transcribed into RNA, which is then translated by ribosomes into polypeptide chains that fold into functional proteins (Fig. 1). At the heart of this process is the genetic code, the set of nucleotide triplets (codons) on messenger RNA (mRNA), each representing one of twenty amino acids, as interpreted by transfer RNA (tRNA) molecules. This code has largely remained unchanged, and is one of the major supporting arguments for a singular origin of cellular life, with existing domains sharing a common ancestor. Nevertheless, some differences in the genetic code and the machinery of its implementation have evolved within specific groups of organisms; by studying these we can gain insights into the very ancient evolution and the origin of the genetic code itself. Here we discuss the organization and features of the genetic code and its variants, as well as the various computational and statistical methods that have been employed in reconstructing its ancient and obscured evolutionary history.

1.1. The Amino Acids

All organisms make use of at least twenty amino acids in their genetic code, the so-called “canonical code” (Table 1). During protein synthesis, the ribosome forms polypeptide chains via a peptidyltransferase reaction, forming a peptide bond between the nascent amino acid chain and an amino acid attached to the tRNA recognizing the next codon in the messenger RNA sequence. Each amino acid differs in its sidechain structure, which can have various physicochemical properties. It is these sidechain properties that determine the chemical properties of a protein, including its folded structure, enzymatic activity, and binding affinities.

Fig. 1. Schematic of protein translation.

Amino acids can be classified into groups having similar structural or chemical properties; often, these amino acids will be found performing similar roles in a protein, and are often substituted for one another in the course of evolution. However, these sets differ depending on the property that one is interested in. For example, serine (Ser), threonine (Thr) and tyrosine (Tyr) all have a terminal hydroxyl group that often undergoes phosphorylation, a critical process for inducing structural changes in a protein or storing chemical energy for use in metabolic pathways. Alternatively, Tyr could be included in the group of aromatic amino acids, along with phenylalanine (Phe) or tryptophan (Trp), because of its hydrophobic phenyl moiety, often conserved for its role in providing a large planar hydrophobic surface for substrate binding or interactions deep within a protein structure. The relevant property of the sidechain of each amino acid depends largely on its context within a given protein structure.

1.2. Codon Designations

There are 64 possible codons that can be generated as triplets of the nucleotides found in RNA: adenine (A), guanine (G), cytosine (C), and uracil (U). Three of these codons (UGA, UAA, UAG) are reserved as “termination” signals that indicate the end of a protein coding sequence. The code is degenerate, with amino acids encoded by either one, two, three, four or six of the remaining 61 codons. However, these partitions are not random. The codons encoding specific amino acids are organized into blocks, typically with the same nucleotides used in the first and second codon positions. Furthermore, when these blocks are partitioned into pairs of codons corresponding to different amino acids, the pairs are always grouped

Table 1. Variations on the canonical genetic code in specific organismal lineages.

Amino acid | Codons
Alanine (Ala)-A | GCN
Cysteine (Cys)-C | UGY
Aspartate (Asp)-D | GAY
Glutamate (Glu)-E | GAR
Phenylalanine (Phe)-F | UUY
Glycine (Gly)-G | GGN
Histidine (His)-H | CAY
Isoleucine (Ile)-I | AUY, AUA
Lysine (Lys)-K | AAR
Leucine (Leu)-L | CUN, UUR
Methionine (Met)-M | AUG
Asparagine (Asn)-N | AAY
Proline (Pro)-P | CCN
Glutamine (Gln)-Q | CAR
Arginine (Arg)-R | CGN, AGR
Serine (Ser)-S | UCN, AGY
Threonine (Thr)-T | ACN
Valine (Val)-V | GUN
Tryptophan (Trp)-W | UGG
Tyrosine (Tyr)-Y | UAY
Pyrrolysine (Pyl) | UAG*
Selenocysteine (Sel) | UGA*

Variants (Nuclear/Prokaryotic)

Variants (Mitochondrial) +UAG in some chlorophytes

+UGA in Euplotes +UAA in some ciliates +AGR in Vertebrates −AUA in Micrococcus

−CUG in Candida

−AUA in Yeast, Triploblasts, Echinoderms −AAA in Platyhelminths, Echinoderms −CUN in Yeast; +UAG in some chlorophytes −AUA in Yeast, Triploblasts, Echinoderms +AAA in Platyhelminths, Echinoderms

+UAR in ciliates, some green algae −CGG in some Mollicutes; −CGY, CGA in Yeast; −AGA in Micrococcus −CGG in Candida, Prototheca; −AGR in Tunicates, Vertebrates, Bilateria, Drosophila +CUG in Candida +AGR in Bilateria +CUN in Yeast +UGA in Mycoplasma

+UAG in Dictyostelium, Plants, Chondrus crispus, some prymnesophytes +UAA in Platyhelminths

N = any nucleotide; Y = pyrimidines (C,U); R = purines (G,A).

by having either a purine (A, G) or pyrimidine (C, U), in the third position. Tables of the genetic code are organized using these principles (Fig. 2).
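The block structure described in this section can be checked directly. The following is a minimal sketch that builds the canonical code from the standard 64-letter amino acid string (codons enumerated with bases in the order U, C, A, G), counts how many amino acids have one, two, three, four or six codons, and tests whether the two pyrimidine-ending codons of every block share a meaning. The variable names are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Canonical code, one letter per codon, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons per amino acid, ignoring the three termination codons ("*").
codons_per_amino_acid = Counter(aa for aa in CODE.values() if aa != "*")

# How many amino acids fall into each degeneracy class.
degeneracy_classes = Counter(codons_per_amino_acid.values())

# Within each first/second-position block, do NNU and NNC always agree?
pyrimidine_pairs_agree = all(
    CODE[first + second + "U"] == CODE[first + second + "C"]
    for first, second in product(BASES, repeat=2)
)

print(sorted(degeneracy_classes.items()))
print(pyrimidine_pairs_agree)
```

In the canonical code the pyrimidine pair is never split, whereas the purine pair is split in two blocks (AUA/AUG and UGA/UGG), which is where several of the variants listed in Table 1 arise.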

1.3. Transfer RNA

Transfer RNAs (tRNAs) are a set of structurally conserved molecules that create the direct physical link between codons and their cognate amino acids. Typically 75–100 nucleotides in length, after transcription they fold into a characteristic “clover leaf” design, with three stem-loops (Fig. 3). These loops are important for a variety of molecular interactions, including codon-anticodon interaction on mRNA, binding of elongation factors for recruitment into the ribosome, specific recognition by

Fig. 2. The canonical genetic code. Dark grey boxes (e.g., Phe) indicate amino acids with class II cognate synthetases; light grey boxes (e.g., Leu) indicate class I. Medium grey boxes (Lys) indicate both class I and class II.

Fig. 3. A typical tRNA molecule. This tRNA is specific for decoding Phe, as the GAA anticodon recognizes both UUU and UUC codons.

aaRS proteins, and, in some cases, recognition by tRNA-dependent amino acid biosynthesis enzymes for modification of a bound amino acid. A single tRNA molecule is often capable of recognizing many different codons of a specific amino acid, due to the nature of the hydrogen-bonding interactions in the 3rd codon position, the so-called “wobble position” (Crick, 1968). However, this alone is insufficient for translating the genetic code. Post-transcriptional

Table 2. tRNA wobble rules and modified bases.

Anticodon (3rd position) | Codon (3rd position) | Use (if rare)
Adenine (A) | U | Mycoplasma tRNAThr
Inosine (I) | A, G, U | -
Guanosine (G) | U, C | -
2-thiouridine (U) | A, G | -
Uridine (unmodified) (U) | A, G, U, C | Mycoplasma and mitochondrial tRNAs
Cytidine (C) | G | -
Lysidine (L) | A, U, C | Universally used in tRNAIle for proper specificity

modification of the anticodon at the wobble position is often needed in order for a properly functioning tRNA to further restrict or expand recognition of nucleotides at this position (Agris et al., 2007). The enzymes responsible for these modifications are found throughout all three domains of life, further supporting the view that their most recent common ancestor (MRCA) had a complete and well-developed tRNA-based translation system. The chemical nature and consequences of these modifications are often extensive and complex; however, a few are especially important with respect to the translation of the genetic code (Table 2). Adenosine in the wobble position is almost always deaminated to inosine, and uridine in this position is almost always thiolated to 2-thiouridine, and occasionally subsequently modified to 2-selenouridine (Veres et al., 1994; Veres and Stadtman, 1994). Exceptions occur in some species of Mycoplasma and in mitochondria, where an unmodified uridine in the wobble position of several tRNA species allows a single tRNA to recognize an entire 4-codon block. Presumably, in both cases this is a consequence of genome minimization. Also in Mycoplasma, an unmodified adenine in tRNAThr specifically recognizes only the highly-used AGU codon (Andachi et al., 1987). Much more specific is the universal modification of tRNAIle, permitting the specific recognition of AUA, AUC, and AUU without also recognizing AUG, which codes for Met. In this case, a cytidine in the wobble position is modified to lysidine via the addition of a positively charged lysine group (Muramatsu et al., 1988). Nucleotide bases in other regions of the tRNA molecule undergo substantial modifications as well, although these are not directly related to codon-anticodon recognition.
For example, the conserved purine nucleotide immediately adjacent on the 3′ side of the anticodon is almost always modified in some way, in order both to maintain the correct conformation of the anticodon loop and to prevent translational frameshifting (Agris et al., 2007).
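The way a single anticodon expands into a set of codons can be sketched as follows. This is an illustrative example only: it encodes a subset of the wobble pairings discussed in this section (A, G, C, unmodified U, and 2-thiouridine), writes anticodons 5′ to 3′, and uses the label "s2U" for 2-thiouridine as a naming assumption made here. Inosine and lysidine are deliberately left out.

```python
from typing import List, Optional

WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Codon third-position bases read by each anticodon wobble base (subset only).
WOBBLE = {
    "A": "U",
    "G": "UC",
    "C": "G",
    "U": "AGUC",   # unmodified uridine, as in Mycoplasma and mitochondria
    "s2U": "AG",   # 2-thiouridine; "s2U" is a label chosen for this sketch
}


def codons_read(anticodon: str, wobble_base: Optional[str] = None) -> List[str]:
    """Return the codons recognized by an anticodon written 5'->3'.

    The first anticodon base is the wobble base; pass wobble_base to
    override it with a modified base such as "s2U".
    """
    first, second, third = anticodon
    wobble = wobble_base if wobble_base is not None else first
    # Codon positions 1 and 2 pair strictly with anticodon positions 3 and 2.
    prefix = WATSON_CRICK[third] + WATSON_CRICK[second]
    return sorted(prefix + base for base in WOBBLE[wobble])


print(codons_read("GAA"))          # the Phe tRNA of Fig. 3
print(codons_read("UGA"))          # unmodified U reads a whole four-codon block
print(codons_read("UUC", "s2U"))   # 2-thiouridine restricts reading to purines
```

With the GAA anticodon the sketch returns the two Phe codons UUC and UUU, matching the caption of Fig. 3, and an unmodified U in the wobble position returns all four codons of a block.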

1.4. Aminoacyl-tRNA Synthetases

The genetic code is largely enforced by the activity and specificity of aminoacyl-tRNA synthetase (aaRS) enzymes, which “load” tRNA with the correct amino

acid for protein translation to proceed correctly (Fig. 2). While several different tRNA molecules often recognize the codons for a specific amino acid, a single aaRS recognizes and binds all tRNA molecules that correspond to its amino acid affinity. This is accomplished by recognition of various parts of the tRNA molecule, usually including the acceptor stem where the amino acid is transferred, and the anticodon loop that recognizes the correct codon on the mRNA. One exception to this rule is SerRS. Since Ser is the only amino acid associated with two sets of completely dissimilar codons (UCN and AGY), it is not possible for SerRS to recognize tRNASer via the anticodon. Rather, a long variable arm on tRNASer is the identity element, recognized by a unique helical arm of SerRS (Hartlein and Cusack, 1995). There are two major classes of aaRS, class I and class II (Figs. 4 and 5). While each class performs the same function of aminoacylating tRNA, they are nonhomologous, and structurally quite different (Eriani et al., 1990; Arnez and Moras, 1997). Class I aaRS uses a Rossmann fold catalytic site to aminoacylate the 2′ OH position of the terminal nucleotide in the 3′ tRNA acceptor stem, while Class II aaRS uses an antiparallel fold to aminoacylate the 3′ OH position (with the exception of PheRS). These enzymes form various complexes, including monomers, homodimers, and heterodimers, as well as more complex arrangements. However, within each class, the core catalytic domain shows sequence and structural conservation.

Fig. 4. Phylogenetic relationships of class I aminoacyl-tRNA synthetases. A maximum-likelihood phylogenetic tree was constructed using class I aaRS protein sequences. Dotted lines represent clades with low (100kb), relatively unstable (excision frequencies ∼10−4 to 10−6 ) regions containing clusters of virulence associated genes (Knapp et al., 1986; Hochhut et al., 2006). As more genomes were sequenced, it became clear that genetic elements which share similar structural features with PAIs can encode other important functions (new metabolic capabilities, etc.). Hence PAIs were grouped with other similar elements and referred to collectively as GIs. GIs appear to contribute to the adaptation of microbes in two ways. First, genes acquired in GIs have been shown to allow the microbes to explore new niches and to improve fitness. For example, many rhizobia species harbour symbiotic islands containing nitrogen fixation and nodulation genes to allow their interaction with plant hosts (Sullivan et al., 1998). For pathogens, GIs encoding iron uptake functions, type III secretion systems, toxins, and adhesins augment their abilities to survive and cause diseases in the host (Dobrindt et al., 2004; Gal-Mor et al., 2006). The second type of contribution of GIs to microbial adaptation is less well studied but may play an equally important role. New studies are emerging that show selective loss and possible regaining of islands may provide an additional means to modulate pathogenicity (Lawrence, 2005; Manson et al., 2006). Spontaneous excisions of PAIs have been observed in various pathogens resulting in distinct pathogenic phenotypes compared to wild types (Bueno et al., 2004; Middendorf et al., 2004). In the case of Salmonella enterica serovar Typhi pathogenicity island 7, called SPI7, deletion of this GI is associated with more rapid invasion in-vitro and reduced resistance to complement attack (Bueno et al., 2004). 
As the genetic requirements for initiation of infection and long-term infection can be quite different, the capability to lose or alter certain genes, such as surface antigens, after the initial infection has been postulated as a means to establish long-term colonization and avoid immune detection (Finlay et al., 1997; Gogol et al., 2007). GIs share some sequence and structural features that help to distinguish them from the rest of a given prokaryotic genome. These features are summarized below and in Table 1. First, GIs are sporadically distributed in closely related species or strains of the same species. For example, most PAIs are present in pathogen genomes but are absent from their non-pathogenic relatives. However, it is important to keep in mind that the concept of virulence is context specific, and a particular virulence factor (e.g. factors involved in iron uptake) may contribute to pathogenic potential in one species but act as an important factor for survival and replication in other ecological niches not susceptible to infection. In such non-pathogenic hosts or environments,

Mobile Genetic Elements and Their Prediction

Table 1. List of features associated with genomic and pathogenicity islands.

Feature associated with GIs | Possible method(s) to detect these features
Sporadic distribution | Comparative genomics to identify unique and shared regions
Sequence composition bias | Various tools have been developed to detect bias (see section Detection of Genomic Islands)
Adjacent to tRNA | Detect full or partial tRNA using BLAST (Basic Local Alignment Search Tool) (Altschul et al., 1997) or tRNAscan-SE (Lowe et al., 1997)
Usually relatively large (>10 kb) | Comparative genomics to identify large insertions
Contain genes of unknown functions | Compare to functional databases such as COG (Clusters of Orthologous Groups) (Tatusov et al., 1997)
Contain mobility genes or elements | Similarity search of mobility genes using Hidden Markov Models (HMMs) or BLAST
Flanked by direct repeats | Use repeat finders such as REPuter (Kurtz et al., 1999) to identify repeats
Unstable and can excise spontaneously | Comparative genomics to identify unique regions; targeted PCR or hybridization to detect altered regions

these islands may more appropriately be called “fitness islands” or “ecological islands” (Hacker et al., 1997). Second, GIs often exhibit sequence composition bias compared to the core genome. The classic measure of sequence composition bias is G + C content (%G + C). However, due to its limited sensitivity (Hsiao et al., 2005), additional measures using oligonucleotides (k-mers), have been more recently used. Since the majority of a given genome exhibits consistent sequence composition, the average composition from the entire genome is often used as a substitute for the core genome. Third, GIs are frequently found adjacent to tRNA genes or flanked with direct repeats (Hacker et al., 1997). tRNA genes are known phage integration sites and therefore may serve as integration sites for MGEs that become PAIs (Reiter et al., 1989). GIs that use tRNAs as insertion sites often carry an “identity block” of several nucleotides-long that is identical to the 5′ or 3′ end of a tRNA; and upon insertion, the tRNA is reconstituted by the identity block generating a pair of direct repeats (one from the identity block and the other from the tRNA gene) at the opposite ends of the inserted fragment (Williams, 2002). The sites have the added benefit of being highly conserved. tRNA genes are often reconstituted upon insertion or excision, and as a result, GIs do not abrogate tRNA function. Fourth, most GIs discovered to date are relatively large, ranging from 10 to 200 kb (Hacker et al., 1997). They often contain clusters of functionally synergetic

M. G. I. Langille et al.

genes, an observation that led to the formulation of the selfish operon hypothesis (Lawrence et al., 1996). This hypothesis postulates that HGT provides a mechanism for weakly selected clusters of functionally synergetic genes to spread and survive better than unlinked genes, since horizontal transfer is size-limited. In the long run, this selective pressure leads to the operon structures common in prokaryotes. Fifth, while many GIs have been characterized, previous studies have shown that islands contain a disproportionate number of genes with no known homologs or with unknown function. In addition, while certain gene functional classes such as cell surface proteins, host-interaction proteins, and DNA-binding proteins are more often observed in GIs (Nakamura et al., 2004; Merkl, 2006), and others, such as genes involved in information processing, are rarely observed, it appears that in the long run all genes are subject to HGT. Sixth, GIs often contain functional or cryptic mobility genes (those genes related to the movement of MGEs) such as integrases and transposases. These mobility genes may indicate that a GI is autonomous, or they could be remnants of other embedded MGEs such as IS elements that are frequently found in GIs (Hacker et al., 1997). Non-autonomous GIs can also depend on host-encoded recombination enzymes, or recombination can occur among highly similar or identical copies of the embedded mobile elements (e.g. IS elements), resulting in rearrangement, translocation or deletion of GIs (Hacker et al., 1997). Lastly, many GIs are unstable and have been reported to be sporadically excised; therefore, certain isolates may not contain the GI (O’Shea et al., 2002; Middendorf et al., 2004; Hochhut et al., 2006). While it is not necessary for every feature to be present in a region for that region to be called a GI, the simultaneous presence of a subset of these features is generally viewed as strong evidence for the region’s horizontal origin.
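The second feature above, sequence composition bias measured against the genome-wide average, lends itself to a simple sliding-window scan. The sketch below is a toy illustration rather than any published GI predictor: the window size, step and deviation threshold are arbitrary values chosen for a small synthetic sequence, and real analyses would use windows of several kilobases and usually k-mer statistics rather than %G + C alone.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def atypical_windows(genome: str, window: int = 200, step: int = 200,
                     min_deviation: float = 0.3) -> List[int]:
    """Start positions of windows whose G+C fraction departs from the
    whole-genome average by more than min_deviation (hypothetical cutoff)."""
    genome_gc = gc_fraction(genome)  # genome average stands in for the core genome
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        window_gc = gc_fraction(genome[start:start + window])
        if abs(window_gc - genome_gc) > min_deviation:
            flagged.append(start)
    return flagged


# Synthetic example: an AT-rich "core" with one GC-rich insert of 200 bp.
toy_genome = "AT" * 500 + "GC" * 100 + "AT" * 500
print(atypical_windows(toy_genome))
```

Using the whole-genome average as the reference mirrors the substitution for the core genome described above; it works only because the inserted region is a small part of the total sequence.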

2.2. Prophage

A prophage is the latent form of a prokaryotic virus known as bacteriophage or simply phage. The movement of DNA between prokaryotic cells via a phage is referred to as transduction. Phage can be divided into two general groups depending on whether they possess the ability to become dormant (temperate phage) or whether, upon infection of the host, their only choice is to enter a lytic cycle, i.e. the production of phage progeny (virulent phage) (Lwoff, 1953). The dormant phage, upon invading the bacterial cell, will often integrate its own DNA into the bacterium’s genome (Freifelder et al., 1970) and will be replicated for numerous generations along with the bacterial genome. Induction provokes dormant prophage to enter a complete lytic cycle, and this may happen spontaneously or as a consequence of a change in the bacterium’s environmental conditions. These integrated prophages account for a large portion of the variation seen between bacterial strains (Ohnishi et al., 2001) and can represent a substantial number of genes in a bacterial genome (Casjens et al., 2000). Furthermore, virulence factors that contribute to a

bacterium’s pathogenicity can be mobilized by phage and are seen as a key factor in the evolution of new pathogens (Boyd et al., 2002). Non-computational methods for identifying prophage within a bacterial species depend on whether or not a susceptible host is available. If such a host exists then spontaneous induction is sufficient for proliferation of the phage; otherwise, the induction of a prophage would require special protocols such as the addition of mitomycin C in Yersinia and Streptococcus strains (Yamamoto, 1967; Huggins et al., 1977; Popp et al., 2000) along with electron microscopy (EM) and genome analysis. Prophage regions typically contain an integrase and several phage associated genes. However, they can often carry other genes that are not associated with the proliferation of the phage. Similarly to GIs, the presence of a tRNA or a flanking direct repeat (described above) is supportive evidence that phage integration may have occurred in a region.

2.3. Integrons

Integrons are genetic elements that utilize site-specific recombination to capture and direct expression of exogenous open reading frames (ORFs). They were first identified in the late 1980s for their important role in the capture and spread of antibiotic resistance genes (Stokes et al., 1989). Bacteria harboring integrons possess the ability to incorporate and express genes with potentially adaptive functions, including antibiotic resistance genes, and therefore pose a major problem for treatment of infectious diseases (Rowe-Magnus et al., 2002). Furthermore, some bacteria become resistant to multiple antibiotics by harboring integrons that have captured multiple antibiotic resistance genes and, potentially, genes encoding other traits which give the bacteria an adaptive advantage. Additionally, integrons are often linked with other MGEs, such as plasmids and transposons, leading to rapid dissemination of such traits within a population. A recent study reported that up to 9% of bacteria harbor integrons (Boucher et al., 2007), making them an important player in the acquisition and spread of adaptive traits and antibiotic resistance in bacterial populations. Integrons consist of three key elements necessary for the capture and expression of exogenous ORFs: an integrase gene (intI) and a recombination site (attI) are necessary for acquisition of genes, and a promoter (Pc) ensures their expression. intI, attI and Pc comprise the 5′ conserved segment (5′-CS), and the 3′ conserved segment (3′-CS) contains known genes that confer resistance to various compounds (Fig. 1). IntI catalyzes the recombination between attI and a recombination site at the 3′ end of the gene called attC or the 59-base element (59-be). The 59-be consists of a variable region spanning 45–128 nucleotides in length, flanked by imperfect inverted repeats at the ends designated R′ (GTTRRRY) and R′′ (RYYYAAC), where R is a purine and Y a pyrimidine. The recombination site in the 59-be recognized by IntI is between the G and T bases of R′. An ORF and its associated 59-be is termed

Fig. 1. Schematic representation of a class 1 integron. IntI, integrase gene; attI, integration site; Pc, promoter for expression of integrated gene cassettes; 59-be (attC ), site adjacent to ORF recognized by intI; sul, sulphonamide resistance; qacE, quaternary ammonium compound resistance; ORF, open reading frame; 59-be, 59 base element. Note that the circular cassette comes from excision of the integrated form (not shown).

a gene cassette. These gene cassettes have been shown to be excised as covalently closed circles that may contain more than one gene cassette linked together (Collis et al., 1992). All integrons characterized to date are classified as either integrons or superintegrons. Integrons are defined as gene cassettes associated with MGEs such as insertion sequences, transposons, and conjugative plasmids, which serve to disseminate genes through mechanisms of HGT. Five classes of integrons have been described, classified based on sequence homology of their integrase genes (Mazel, 2006). Class 1 integrons are the most clinically relevant and are frequently isolated from patients with bacterial infections. Bacteria harboring class 1 integrons often exhibit multi-antibiotic resistance and possess gene cassettes conferring resistance to a wide variety of antibiotics, including all known β-lactam antibiotics (Mazel, 2006). One such class 1 integron identified in E. coli contains 8 different antibiotic resistance cassettes, including a broad-spectrum β-lactamase gene of clinical importance (Naas et al., 2001). Association with MGEs can lead to rapid dissemination of integrons and their associated gene cassettes through both intraspecies and interspecies transfer. In support of this, extensive reports have identified integrons in diverse Gram-negative bacteria and also in some Gram-positives (Hall et al., 1999; Mazel, 2006). Superintegrons differ from integrons in that they are chromosomally located and not linked to MGEs. They also differ in that their cassette arrays can be quite

large in size; one unique superintegron identified in Vibrio cholerae harbors over 170 cassettes (Mazel et al., 1998; Rowe-Magnus et al., 1999). In addition to antibiotic resistance genes, integron and superintegron gene cassettes have also been shown to encode proteins involved in other adaptive functions, including virulence factors, metabolic genes, and restriction enzymes (Ogawa et al., 1993; Rowe-Magnus et al., 2001; Vaisvila et al., 2001). However, a recent study reported that 78% of cassette-encoded genes are uncharacterized or have no known homologs to date (Boucher et al., 2007). Therefore, more investigation into the function and diversity of genes encoded in integrons is needed to gain a better understanding of their adaptive importance in microbial evolution.
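The description of the 59-be given earlier in this section (a 45–128 nucleotide variable region flanked by the degenerate repeats GTTRRRY and RYYYAAC) translates naturally into a pattern search. The following is a rough sketch under stated assumptions, not a validated attC finder: it assumes that R′′ lies upstream of R′ on the strand being scanned, it treats the consensus motifs as exact degenerate matches although the repeats are described as imperfect, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

# Degenerate symbols used in the consensus: R = purine, Y = pyrimidine.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]"}


def consensus_to_regex(consensus: str) -> "re.Pattern[str]":
    """Compile a degenerate DNA consensus into a regular expression."""
    return re.compile("".join(IUPAC[symbol] for symbol in consensus))


R_PRIME = consensus_to_regex("GTTRRRY")         # R'
R_DOUBLE_PRIME = consensus_to_regex("RYYYAAC")  # R''


def candidate_59be(seq: str, min_spacer: int = 45,
                   max_spacer: int = 128) -> List[Tuple[int, int]]:
    """Return (start, end) spans running from an R'' match to a downstream
    R' match separated by a spacer of min_spacer..max_spacer nucleotides.
    Assumes R'' precedes R' on the scanned strand."""
    spans = []
    for left in R_DOUBLE_PRIME.finditer(seq):
        for right in R_PRIME.finditer(seq, left.end()):
            spacer = right.start() - left.end()
            if min_spacer <= spacer <= max_spacer:
                spans.append((left.start(), right.end()))
    return spans


# Synthetic example: R'' + 50 nt spacer + R', padded with A's on both sides.
toy = "AAAA" + "GCCTAAC" + "A" * 50 + "GTTAGGC" + "AAAA"
print(candidate_59be(toy))
```

Because the two core motifs are short and degenerate, a scan of this kind would return many spurious candidates on a real genome; it is meant only to make the structure of the element concrete.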

2.4. Transposons and IS Elements

Barbara McClintock was the first to observe recurring chromosomal breakages in the same region caused by a genetic element, Ds (Dissociation), in maize in the early 1940s (McClintock, 1941). She later found another element, Ac (Activator), in maize that must be present for the Ds element to cause chromosomal breakage. These two elements were later proposed to be the autonomous (Ac) and non-autonomous (Ds) members of the same transposon family (Fedoroff et al., 1983). More generally, transposons are DNA elements, ranging in length from a few hundred base pairs (bps) to more than 65,000 bps, that proliferate in the host genome and have been observed in all three domains of life: bacteria, archaea and eukaryotes. Each group of transposons may consist of autonomous and non-autonomous members. An autonomous transposon encodes transposition-catalyzing enzymes, called transposases, and is able to transpose itself. A non-autonomous transposon does not encode such proteins and relies on its autonomous counterparts with similar cis signals to transpose it. Movement of transposons is usually limited to within a single cell, but they are often contained within other MGEs such as GIs and prophages that allow for cell-to-cell transfer. Of course, as with any genomic region, transposons could also be transferred between naturally competent cells via transformation. In addition, some transposons, called conjugative transposons, can move via conjugation; we discuss these at the end of this section. A transposon consists of one or more overlapping genes, one of which may be a transposase (Mahillon et al., 1998; Chandler et al., 2002; Siguier et al., 2006a), as shown in Fig. 2. For a transposon with more than one gene, the upstream gene encodes a DNA recognition domain, while a second overlapping gene encodes the catalytic domain in most cases (Wicker et al., 2003). Additional genes may follow, which may alter the host phenotype.
These include antibiotic resistance genes (Stokes et al., 2007). Most transposons carry a pair of terminal inverted repeats (TIRs), each shorter than 50 bps, at their two termini and are termed TIR transposons (Fig. 2A), whereas a non-TIR transposon (Fig. 2B) does not harbor such TIR signals at the termini. Linker sequences are located between each terminal signal and the ORF region.

Fig. 2. Structures of two types of transposons in prokaryotes. (A) TIR transposon and (B) non-TIR transposon. Both types have autonomous and non-autonomous members. (C) A transposon may also encode proteins other than a transposase.

The relocation of transposons can be deleterious to the host, as they may disrupt host genes by inserting into them and may alter the expression of neighboring genes through their endogenous promoters (Mahillon et al., 1998; Chandler et al., 2002). Also, homologous recombination between two transposons contributes to reorganization and deletion of chromosomal regions in the host genome (Toussaint et al., 2002). Later studies suggested that transposons are also able to introduce beneficial mutations into the host genome through insertion and recombination (Blot, 1994). For example, several studies have shown that transposons can give a selective advantage to the host in specific environments by introducing recombinations in E. coli (Zambrano et al., 1993; Naas et al., 1994; Lenski, 2004). Taking advantage of such mutagenic capabilities, transposons have been used extensively in genetic engineering to mediate global insertional mutagenesis of bacteria (Ely et al., 1982; Berg et al., 1984; Zink et al., 1984; Rella et al., 1985). Transposons also served as mobile priming sites for sequencing DNA segments in the 1980s (Ahmed, 1985; Adachi et al., 1987).

Insertion sequences (IS elements) are similar to autonomous DNA transposons in that they encode a transposase, but unlike transposons they do not encode any genes contributing to the phenotype of the host (Adhya et al., 1969; Shapiro, 1969; Shapiro et al., 1969). To date, more than 1,500 IS elements have been identified, and they are classified into 20 families, with some families subdivided into groups, based on their genetic structures and the sequence similarities of the encoded transposases (Siguier et al., 2006b). Recent studies suggest that ~99% of known IS

elements in prokaryotes have fewer than 100 copies in their host genomes (Siguier et al., 2006b). Two adjacent IS elements, plus the intervening DNA sequence, can form a composite transposon, as shown in Fig. 3, which may carry its own protein-encoding genes within the linking DNA sequence, e.g. the antibiotic resistance genes in Tn5 (Berg et al., 1989; Berg, 1989; Reznikoff, 2002) and Tn10 (Haniford, 2002). Several other transposons with much more complex structures, e.g. Tn3 (Haniford, 2002) and Tn7 (Craig, 2002), have also been characterized in prokaryotes.

Fig. 3. Structure of a composite transposon, Tn5.

Conjugative transposons (CTns) are MGEs that have features of transposons, plasmids and phages. As with transposons, conjugative transposons excise and integrate themselves into the genome and are traditionally named under the nomenclature of transposons, e.g. Tn916 (Franke et al., 1981) and Tn1545 (BuuHoi et al., 1980; Courvalin et al., 1987). However, conjugative transposons are similar to plasmids in that they have a covalently closed circular transfer intermediate that can be transferred by conjugation. This allows conjugative transposons to integrate elsewhere within the same cell or to move between organisms. In contrast to plasmids, conjugative transposons in their circular form cannot replicate autonomously and must become integrated into a prokaryotic genome to survive (Scott et al., 1988; Rice et al., 1994). Some conjugative transposons integrate site-specifically and have integrases that are highly similar to those of lambdoid phages (Poyart-Salmeron et al., 1989; Poyart-Salmeron et al., 1990). However, they differ from phages in several respects, including that they do not form viral particles and are not transferred by transduction. As far as we know, no computational method for predicting conjugative transposons has been published. Reviews on conjugative transposons may be found elsewhere (Clewell et al., 1993; Scott et al., 1995).

2.5. Other Mobile Elements

As we have already shown in this section, MGEs are complex elements that, owing to their mosaic nature and multiple modes of movement, are not easily classified or defined. Indeed, the differences between transposons and IS elements, or between prophages and GIs, are not always clear, which reflects the dynamic nature of biology and research. Besides the most common MGEs outlined above, many other rare elements exist, such as inteins, intron-like regions that are spliced out after translation (Gogarten et al., 2006), and group II introns (Dai et al., 2003; Fedorova et al., 2007). We will not discuss these elements further here, but we recommend Gogarten et al. (2006) and Dai et al. (2003) as starting points for the identification of inteins and group II introns, respectively.

In addition to the rare MGEs mentioned above, we will also not discuss MGEs that do not integrate into the host genome, such as plasmids and replicative forms of phage. In contrast to the rare elements just mentioned, these MGEs (especially plasmids) are prevalent in prokaryotes and are of medical importance due to their ability to pass multiple antibiotic resistance factors between organisms. However, because there is limited need for computational prediction of these MGEs, they are beyond the scope of this chapter.

3. Computational Methods for Mobile Element Prediction

Since the formation of mobile elements is a macroevolutionary event, we rarely have the luxury of observing it in real time. It is therefore necessary to rely on the evidence available to us in the present to infer the history of an organism's genomic evolution. The features associated with mobile elements, as outlined above, can thus be leveraged for carrying out bioinformatics analyses and for building bioinformatics tools to detect mobile elements.

Although the types and features of mobile elements are varied, many tools use common approaches for their detection. In particular, similarity searches conducted with tools such as BLAST (Altschul et al., 1997) or FASTA (Pearson et al., 1988) are often used to query previously curated databases of mobile elements to identify putative new elements. However, different cutoff criteria and parameters, along with additional requirements such as a minimum number of genes in a contiguous cluster, are often used to produce tools that are optimal for identifying a particular mobile element type. The following sections describe in further detail the methods used to identify particular mobile elements. Because new methods are constantly being published, we discuss only selected methods that appear to be commonly used. In addition, we highlight mobile element features that may provide additional means of detection and that have not been exploited previously.

3.1. Detection of Genomic Islands

Below, we discuss methods in the context of the two main bioinformatics approaches to identifying GIs: sequence composition and comparative genomics.

3.1.1. Sequence Composition-based Approaches

Sequence composition-based approaches rely on the assumption that different organisms exhibit different nucleotide pattern preferences that constitute their signatures. Phylogenetically closer organisms share similar preferences and, therefore, have more similar sequence composition signatures. As a result, if the signature of a gene or gene cluster deviates from the genome signature, a plausible explanation is that this gene (or gene cluster) has a foreign origin and its signature reflects that of the original donor. The basic form of a genome signature is the G + C content (%G + C) of the genome, which can be thought of as mononucleotide frequencies. Prokaryotic genomes sequenced to date have a %G + C range of approximately 20% to 75%. A large number of studies using dinucleotide frequencies, and many other studies based on codon usage as genome signatures, have largely confirmed that closely related species share more similar signatures than more distantly related species (Karlin et al., 1998; Sandberg et al., 2001; Carbone et al., 2003). Higher-order DNA patterns, such as tetra-, hexa-, and octanucleotide frequencies, have also been proposed as useful genome signatures (see Chapter 1 for further discussion).

Factors other than HGT may contribute to observed sequence composition bias and cause false HGT detection. For example, gene expression level has been linked to codon usage and therefore also affects trinucleotide frequencies. In 1982, using then-available sequences, Gouy and Gautier analyzed codon usage patterns in bacterial genes and confirmed that codon composition is correlated with mRNA expression level (Gouy et al., 1982). Later, it was shown that highly expressed genes, such as those encoding ribosomal proteins, exhibit atypical composition bias (Karlin, 2001). Also, natural variation in coding sequences can produce bias, especially if the sample size is too small (i.e. the sequence is too short) to generate a reliable signal.
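To see why short sequences give an unreliable signal, a rough back-of-the-envelope model can help. The independence assumption and the two example lengths below are introduced here purely for illustration; neighboring bases in real sequences are correlated, so this is a sketch rather than an exact error model. If each of the n bases in a sequence is independently G or C with probability p, the observed G + C fraction has standard error:

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}, \qquad
p = 0.5:\quad
n = 300 \;\Rightarrow\; \mathrm{SE} \approx 0.029, \qquad
n = 20\,000 \;\Rightarrow\; \mathrm{SE} \approx 0.0035 .
```

Under this simplification, a single 300 bp gene can deviate from the genome average by several percentage points of %G + C by chance alone, whereas a 20 kb region is much less likely to do so.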
As a consequence, while sequence composition has been used to detect HGT in single genes (Nakamura et al., 2004; Tsirigos et al., 2005), it is perhaps more suitable for detecting GIs, because a cluster of genes is less likely to exhibit sequence composition bias throughout due to random noise (Karlin, 2001; Hsiao et al., 2003; Waack et al., 2006). Another issue associated with using composition bias to detect horizontally acquired genes is that mutational pressure acting on a foreign gene may cause it to adapt to the host genome signature over time, in a process termed "amelioration" (Lawrence et al., 1997). It is believed that, over a period of time, the signature from the donor is lost and is replaced by the recipient's signature. Therefore, sequence composition bias is more suitable for detecting recent HGT. Lastly, based on sequence composition alone, genes acquired from another organism with the same or a very similar genome signature (presumably due to relatedness) would not be detectable (see Chapter 6 for further discussion of HGT detection).

Despite these issues, sequence composition-based approaches for detecting horizontally acquired genetic material have been developed and improved in the past few years and have been shown to be capable and versatile tools for detecting GIs. All of the methods described below essentially calculate the k-mer frequencies (k is usually from 1 to 9) for a sub-region of a genome and compare these results with the expected frequencies from that genome. Deviation from the genome frequencies is scored, and if the score is above a certain cut-off, the region is marked as a putative GI. Below, we highlight the advantages and disadvantages of a selection of representative tools.

3.1.2. SIGI and SIGI-HMM

SIGI and SIGI-HMM both use codon usage (the frequency of a trinucleotide normalized by synonymous codons) as a genome signature (Merkl, 2004; Waack et al., 2006). The codon usage frequency table of an organism is derived either from its whole genome, if available, or from its species' entry in the CUTG codon usage database (Nakamura et al., 1999; Tu et al., 2003). For each gene, the multiplicative product of the codon usage frequencies of all codons in the gene is determined using the organism's own codon frequency table (the host table). The same multiplicative product is also calculated for the same gene sequence using other organisms' frequency tables (the donor tables) instead of the organism's own table. Lastly, a score in the form of a normalized odds ratio is calculated from each pairwise comparison between the product derived from the host table and that derived from a donor table. The score can be used to decide whether the codon usage of a gene more closely resembles that of the host species or that of another (putative donor) species. In cases where the resemblance is closer to the latter, the gene, if it meets a custom cut-off, is marked as a putative foreign gene. Using a BLAST-like extension mechanism, non-contiguous clusters of putative foreign genes are combined to form a putative GI until the frequency of putative foreign genes within a region falls below a predetermined cut-off. In the original SIGI paper, a local frequency of 2 times the genome frequency was used. So if the frequency of putative foreign genes in a genome is determined to be 10%, the frequency of foreign genes within a putative GI has to be at least 20%.
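The per-gene scoring described above can be sketched as follows. This is a minimal illustration and not the SIGI implementation: the function names, the pseudocount, the per-codon normalization, the zero cut-off and the toy frequency tables are all assumptions made for the example, and the normalization used in the published method is not reproduced here. Products of frequencies are computed as sums of logarithms to avoid numerical underflow.

```python
import math

def codons(seq):
    """Split a coding sequence into complete codons (trailing bases are dropped)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def log_product(seq, table, pseudo=1e-6):
    """Log of the multiplicative product of codon frequencies for one gene.
    `pseudo` (an assumption) stands in for codons missing from the table."""
    return sum(math.log(table.get(c, pseudo)) for c in codons(seq))

def log_odds(seq, host_table, donor_table):
    """Per-codon log odds ratio; values > 0 mean the gene looks more donor-like."""
    n = len(codons(seq))
    if n == 0:
        return 0.0
    return (log_product(seq, donor_table) - log_product(seq, host_table)) / n

def putative_foreign(genes, host_table, donor_tables, cutoff=0.0):
    """Map each flagged gene to the donor table it resembles most (hypothetical cutoff)."""
    flagged = {}
    for name, seq in genes.items():
        scores = {d: log_odds(seq, host_table, t) for d, t in donor_tables.items()}
        if not scores:
            continue
        best = max(scores, key=scores.get)
        if scores[best] > cutoff:
            flagged[name] = best
    return flagged

# Toy data: invented relative frequencies for two synonymous alanine codons.
host = {"GCG": 0.6, "GCA": 0.4}
donor = {"GCG": 0.1, "GCA": 0.9}
genes = {"geneA": "GCGGCGGCG", "geneB": "GCAGCAGCA"}
print(putative_foreign(genes, host, {"donorX": donor}))
```

In this toy case only geneB, which uses the codon favored by the hypothetical donor table, is flagged. Assembling flagged genes into islands (the extension step with the 2-fold local frequency criterion) is not shown.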
In SIGI-HMM, the odds-ratio scores are determined as in the SIGI process described above to produce a list of putative foreign genes. However, instead of using a BLAST-like extension mechanism to construct putative GIs from putative foreign genes, the updated program uses a hidden Markov model (HMM). The HMM incorporates an alternative probabilistic model based on randomly generated nucleotide sequences that encode the same amino acid sequence as the real gene product. This alternative model provides a baseline measure of the random noise in the sample. Moreover, an additional filter to remove potentially highly expressed genes is incorporated into the HMM, using the codon usage of ribosomal proteins as a reference. Using a path-generating algorithm for the HMM, a final list of GIs is predicted. All genes assigned to a putative foreign state (i.e. more similar to a donor frequency table) are considered to lie in GIs, and these regions are further combined if there are fewer than 4 native (not foreign) genes between them.

One unique feature of the SIGI and SIGI-HMM approach is its ability to suggest the putative donor of a GI from its pairwise comparison scores (the more a gene resembles the codon usage of another organism, the more likely that organism is related to the donor). Due to the current limited sampling of the Earth's biomes, the donor often cannot be precisely predicted. However, based on the developers' own preliminary analysis, the rate of false predictions at the domain level (i.e. Archaea versus Bacteria) is less than 1%, suggesting that better sampling can help to improve the accuracy of donor prediction. One potential shortcoming of SIGI is its use of codon usage as the genome signature, because this measure is subject to the influence of gene expression level.

3.1.3. PAI-IDA

The PAI-IDA program (PAthogenicity Island–Iterative Discriminant Analysis) uses iterative discriminant analysis on 3 different genome signatures: %G + C, dinucleotide frequency, and codon usage (Tu et al., 2003). An initial window size of 20 kb and a step size of 5 kb were used to calculate the DNA signature of a window compared to the whole genome. A small list of known PAIs from 7 genomes was used as the initial training data to generate the parameters of the linear functions that discriminate anomalous regions from the rest of the genome. Then, through iteration, the discriminant function is improved by taking additional (predicted) anomalous regions into account. The iteration ends when the status of each region stops changing. This algorithm was the first to demonstrate that it is possible to combine multiple genome signatures for the detection of GIs.

3.1.4. Alien-Hunter

Alien-Hunter uses "Interpolated Variable Order Motifs" (IVOMs), an approach that exploits variable-length k-mers and prefers longer k-mers over shorter ones as long as there is enough information (Vernikos et al., 2006). The length k ranges from 1 to 8. The program assigns a weight to each k-mer based on its length in order to linearly combine all the k-mer frequencies into a score. The weights are necessary because shorter k-mers are more likely to appear than longer k-mers, but longer k-mers contain more information and are more specific. The initial sliding window size is 5 kb and the step size is 2.5 kb.
IVOM vectors from a region are compared to IVOM vectors of the genome to derive a distance score. An HMM is then used to refine the boundaries of the HGT regions. The advantage of this approach lies in its ability to incorporate variable-length k-mers; based on the developers' own analyses, longer k-mers provide better sensitivity and specificity than shorter k-mers alone (Vernikos et al., 2006).

3.1.5. Z-Curve (GC Profile)

An approach called Z-curve plots the cumulative ratio of A + T versus G + C along a genomic sequence and uses a segmentation algorithm to detect break points where the A + T to G + C ratio changes abruptly (Zhang et al., 2004). These break points are hypothesized to correspond to the insertion points of a GI. Segments between large break points are, therefore, putative GIs. This approach has been incorporated into a web-based tool named GC Profile and is also available for download as a software package (Gao et al., 2006). It should be noted that this approach does not produce a list of GIs and relies on users to interpret the graphical output. As a result, it is not suitable for automated detection of GIs.

3.1.6. IslandPath

Like Z-Curve, IslandPath provides a visual interface to aid researchers in the detection of GIs (Hsiao et al., 2003). Each gene in the genome is represented as a small circle whose color depends on whether the gene deviates significantly from the genome averages for GC content and dinucleotide frequency. In addition, any mobility genes and tRNAs are given special markers on the gene circle. The end result is a whole-genome graphical view that highlights features associated with GIs and allows for manual identification of putative GIs.

3.1.7. Wn-SVM

Tsirigos and Rigoutsos improved their original approach (called Wn) by incorporating a support vector machine (SVM) to classify a gene as either "native" or "foreign" (Tsirigos et al., 2005). SVMs have been used in many other bioinformatics tools to classify biological entries into different classes with very good sensitivity and specificity (for a good example and explanation of the use of SVMs in bioinformatics see Gardy et al., 2005). While the SVM version of the Wn approach (Wn-SVM) indeed showed improved sensitivity over the original approach (Tsirigos et al., 2005), the evaluation relied on a simulated data set. As a result, the actual improvement under realistic biological settings is not clear. Nevertheless, using an SVM to detect GIs is a novel strategy.

3.1.8. Comparative Genomics-based Approaches

Comparative genomics-based approaches entail the use of multiple genomes to detect GIs. In these methods, GIs are often defined as clusters of genes in one genome that are not present in a related genome (see Table 1).
They are based on the observation that GIs are sporadically distributed among closely related species and can sometimes be found in very distantly related species, as judged by the degree of sequence divergence in 16S rRNAs or other orthologs (Ragan, 2001). An example of a GI shared between distantly related species is a 16 kb region that is 99% identical between some strains of Pyrococcus furiosus and some strains of Thermococcus litoralis (Diruggiero et al., 2000). These methods can roughly be divided into gene content-based approaches and whole-genome (nucleotide) alignment-based approaches (Ragan, 2001). However, due to complications in automatically picking reference strains for the comparison and the difficulty of interpreting the comparative results, there is currently no published, publicly available software package for detecting GIs using comparative genomic approaches.
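The working definition given above, clusters of genes in one genome that are absent from a related genome, can be sketched in a few lines. This is a hypothetical illustration rather than a published tool: the function name, the minimum cluster size and the tolerated gap are assumptions, and the set of genes with homologs is taken as given (for example, from a prior similarity search).

```python
def strain_specific_clusters(gene_order, has_homolog, min_genes=3, max_gap=1):
    """Group neighboring genes that lack a homolog in the reference strain.

    gene_order  : gene identifiers in chromosomal order
    has_homolog : set of identifiers with a detectable homolog in the reference
    min_genes   : minimum strain-specific genes per cluster (assumed value)
    max_gap     : consecutive conserved genes tolerated inside a cluster (assumed value)
    """
    clusters, current, gap = [], [], 0
    for gene in gene_order:
        if gene not in has_homolog:
            current.append(gene)      # strain-specific gene extends the cluster
            gap = 0
        elif current:
            gap += 1                  # conserved gene following a specific one
            if gap > max_gap:         # too many conserved genes: close the cluster
                if len(current) >= min_genes:
                    clusters.append(current)
                current, gap = [], 0
    if len(current) >= min_genes:     # cluster still open at the end of the list
        clusters.append(current)
    return clusters

# Toy genome of ten genes; g3, g4, g6 and g7 have no homolog in the reference.
order = ["g%d" % i for i in range(1, 11)]
conserved = {"g1", "g2", "g5", "g8", "g9", "g10"}
print(strain_specific_clusters(order, conserved))
```

The two thresholds trade sensitivity against noise and have no canonical values; they would need to be tuned for the genomes being compared.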


In general, gene content-based approaches use BLAST or other similarity search tools to detect variation in gene content between two strains. For example, 528 genes in E. coli K12 do not have homologs in E. coli O157:H7, while 1387 E. coli O157:H7 genes cannot be detected in E. coli K12 (Perna et al., 2001). These strain-specific genes are often found in clusters and may correspond to putative GIs. Unusual sequence similarity between genes found in distantly related organisms has also been reported as an indicator of HGT. Such practice, in the absence of full genomes for the organisms compared, is not recommended, because it is impossible to ascertain the orthology of genes when a complete genome is not available. Moreover, unusual sequence similarity between two species can be due to purifying selection pressure acting on the species, compounded by lineage-specific gene loss in the intervening species.

Comparison at the nucleotide sequence level can be carried out using genome aligners such as MAUVE and MUMmer (Delcher et al., 2002; Darling et al., 2004). These approaches typically use a BLAST-like strategy to find short conserved regions between genomes and then extend the conserved blocks by aligning intervening regions using more robust alignment strategies such as the Smith-Waterman algorithm. Regions that cannot be aligned may then represent putative GIs. In general, unless the direction of evolution is known, which is rarely the case, it is difficult to distinguish an insertion from a deletion using comparative approaches. Moreover, finding strains that are within an appropriate phylogenetic distance and with which a reasonable whole-genome alignment can be achieved can be a difficult challenge.

Comparative genomic approaches may be augmented by using additional evidence associated with HGT. For example, a strategy developed by Ou and colleagues used tRNAs to anchor putative GIs (Ou et al., 2006). They first identified shared tRNAs among strains of E. coli.
Then, extracting regions up- and downstream of orthologous tRNAs, the authors used MAUVE to align these regions and identify conserved blocks. Regions that fell between aligned upstream and downstream blocks were investigated further as possible GIs, using several filters to remove false positives. The incorporation of tRNAs, which are often used as insertion sites (see above), as anchors for finding GIs can aid their identification. However, not all GIs use tRNAs as insertion sites, and thus this strategy is limited in the types of GIs that it can detect. Better sequence or structural characterization of other insertion site types could provide additional anchoring points for GI detection. In summary, the exponential increase in the number of available genomes for comparison now makes it possible to develop automated methods for comparative genomics-based detection of GIs.

3.2. Detection of Prophages

Since prophages have several features in common with GIs, they can often be identified using many of the same approaches. Base composition that deviates from that of the host genome, including GC content, codon usage, and dinucleotide bias, is a common signature of prophage regions. In addition, the presence of mobility genes (e.g. integrases, site-specific tyrosine and serine recombinases, lysases, etc.), direct repeats, and tRNAs supports the inference that the region has been recently integrated (see above).

Approaches dedicated solely to the detection of prophages have also been developed; these depend on features that are unique to prophages. First, many of the genes found in phages are quite different from those that would be natively used by bacteria. Genes for virion structural proteins, such as head, tail, and tail fiber proteins, that appear in close proximity to each other can be strong indicators of a prophage region. Since phage structural and regulatory genes are strong indicators of prophages, the most common approach for detection is to search for these genes based on sequence similarity. This approach usually starts by taking every gene within a bacterial genome and searching it against a database of known phage genes derived from previously sequenced phages. To reduce the number of false identifications, multiple genes with significant matches to the database are required to occur in a cluster. The most stringent criterion would require that every gene within a window of a certain size be identified as a phage-like gene. However, there are a couple of scenarios that could lead to a prophage region containing a gene without a significant hit. One is that the gene truly is of phage origin, but the corresponding phage gene does not exist in the phage database (i.e. it is a novel phage gene). Although many phage genomes have been sequenced and recent metagenomic studies have rapidly increased the number of phage genes in phage databases (Casas et al., 2007), these databases should still not be considered a complete list of all phage genes. Another scenario is that genome rearrangement has occurred since the phage integration, resulting in a mixing of phage and bacterial genes.
To allow for this noise, a clustering technique or a sliding window is often used to find regions with a significant number of phage-like genes. For example, Prophage Finder (Bose et al., 2006) clusters any hits within a certain distance cutoff, ranging from 3 to 6 kb, and applies a further cutoff requiring between 5 and 10 phage hits per prophage. Phage Finder (Fouts, 2006), on the other hand, uses a sliding window with a fixed size of 10 kb and a step size of 5 kb, searching for windows with at least four hits. These windows are then extended gene by gene if the annotated gene is known to be associated with prophages, such as tRNAs, integrases, etc.

3.3. Detection of Integrons

Computational identification of integrons in genomic sequence is complicated by the considerable diversity of integron sequences between the different classes and by the diversity of their associated gene cassette arrays. Identification of integrons in clinically isolated bacteria initially involves both in vitro methods to identify the presence of integrons and downstream bioinformatics tools to functionally annotate genes. In silico, there is no single generally adopted method for identifying integrons in genomic sequence. Often, multiple bioinformatics approaches are combined to detect integrons in genomic sequence, again followed by functional annotation of genes.


Usually, bioinformatics tools or scripts are used to detect the conserved integron features (see above). For example, some studies use BLAST-based similarity searches to detect 59-be, integrases, transposons, or known species-specific gene cassettes (Holmes et al., 2003; Gillings et al., 2005). One study used a more high-throughput approach to identify integrons in a global analysis encompassing multiple diverse sequenced bacterial genomes. In this case, integrons were identified through a BLASTP similarity search (e-value cutoff 10^-25) against previously described integrases from Vibrio and Escherichia species (Szekeres et al., 2007).

Various software tools other than BLAST are used to identify these conserved regions. One study used Transact-SQL, an extension of the SQL query language, to identify 59-be (Boucher et al., 2006). Another study used a software package called Sequence Analysis (developed by the Genetics Computer Group at the University of Wisconsin Biotechnology Center) to identify direct repeats and the conserved R′ and R″ regions of 59-be (Vaisvila et al., 2001). Another study combined multiple approaches, initially using MAP software (Genetics Computer Group, Madison, Wisconsin) to detect ORFs, followed by a BLAST search of predicted coding regions and 59-be (Holmes et al., 2003). Finally, one report used characteristics of integron structure to identify superintegrons in Vibrionaceae. In this case, the authors developed custom Perl scripts to detect 59-be and genes, and included various length constraints, such as the maximum length of genes and attC sites (Rowe-Magnus et al., 2003).

Subsequent identification and annotation of genes in gene cassettes are primarily performed with a BLAST similarity search against the GenBank and/or GenPept databases from the National Center for Biotechnology Information (NCBI). Sometimes additional sequence databases such as EMBL, UniProt and the NCBI Microbial Genome database are also used (see Sec. 4).
Additionally, in some analyses open reading frames are predicted using ORF Finder or WebGeneMark.HMM (Lukashin et al., 1998), also available through the NCBI. Similarly, one study used the following criteria to identify hypothetical genes that may not necessarily be identified through homology search: a reading frame in the opposite orientation to intI; a start codon within 30 bp of attI or the 59-be; a stop codon in or adjacent to the next 59-be; and being the largest ORF bounded by two 59-be (Gillings et al., 2005). Another study specified the longest coding region between two 59-be as a probable gene (Vaisvila et al., 2001).

Most methods used to detect integrons are designed as in-house solutions and are never fully developed into reusable tools. Unfortunately, this results in methods that are almost impossible to compare. However, identification of integrons will hopefully improve and allow for further tool development.

3.4. Detection of Transposons and IS Elements

Like integron prediction, transposon and IS element prediction is fairly limited. Identification is based primarily on sequence similarity searches against known transposons and IS elements. Fortunately, these elements have previously been collected into web-accessible databases such as ISfinder and ACLAME (Leplae et al., 2004; Siguier et al., 2006b). ACLAME (Leplae et al., 2004) is a database that was started in 2003 to deal with MGEs from plasmids and viruses, and it provides high-quality classification of all the encoded proteins through clustering. The current version, ACLAME 0.2, provides browsing interfaces for individual mobile elements, mobile proteins clustered in families, host organisms and functions defined in ACLAME. In addition, Mahillon and Chandler collected and characterized ~500 IS elements in prokaryotes in 1998 (Mahillon et al., 1998) and organized the information into the ISfinder database (Mahillon et al., 1998; Chandler et al., 2002; Siguier et al., 2006b). Currently the database has ~1,500 IS elements and has evolved into one of the most comprehensive databases for IS elements in prokaryotes.

A recent example of the use of these databases was provided by Touchon and Rocha (Touchon et al., 2007), who scanned the genomes of 262 sequenced prokaryotic organisms against the ~1,500 IS elements in the ISfinder database (Siguier et al., 2006b). The identified proteins were further grouped into the 20 IS families based on their best-matching IS elements in the ISfinder database (Siguier et al., 2006b). They showed that an IS element could have multiple consecutive and possibly overlapping ORFs. The family assignments were based on protein-level comparison, as the linker sequences and the TIR signals of each IS element in ISfinder were not considered by Touchon and Rocha. From this genome-scale annotation of IS elements in 262 prokaryotic organisms, the authors reported several interesting observations, including that genome size is the only significant predictor of the number and density of IS elements in a host genome.
A limitation of this method is that only the coding sequences of transposases were used, and not other signals associated with IS elements such as the terminal signals, possibly leading to high false-positive prediction rates (Zhou et al., 2007). One alternative to similarity searching is to search for TIRs. A pair of TIRs together with other features, such as a coding region in between, would strongly suggest that a region is a transposon, and would also provide the boundaries of the transposon (Prosseda et al., 2006; Alavi et al., 2007). This approach would be limited to finding TIR transposons only, which could be one of the reasons that no such algorithm has yet been reported.
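To make the idea of a TIR search concrete, the following is a deliberately naive sketch, not a reported algorithm. The function names, the repeat length k, the element length bounds and the toy sequence are all assumptions chosen for illustration. It looks for a k-bp word whose reverse complement occurs downstream within a plausible element length and reports the span between the two as a candidate element. Only exact repeats are found, whereas real TIRs are often imperfect, and no check for an intervening coding region is made.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_tir_pairs(seq, k=12, min_len=200, max_len=3000):
    """Return (start, end) spans flanked by exact k-bp inverted repeats.

    start is 0-based inclusive and end is exclusive; k, min_len and max_len
    are illustrative values, not parameters taken from any published tool.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - k + 1):
        right = revcomp(seq[i:i + k])
        # The right repeat must start late enough for the element to reach
        # min_len and must end within max_len of the left repeat's start.
        lo = max(i + k, i + min_len - k)
        hi = min(i + max_len, len(seq))
        j = seq.find(right, lo, hi)
        while j != -1:
            hits.append((i, j + k))
            j = seq.find(right, j + 1, hi)
    return hits

# Toy sequence: a 300 bp filler flanked by a 12 bp inverted repeat.
tir = "GGCTAGTCCATG"
seq = "C" * 20 + tir + "A" * 300 + revcomp(tir) + "C" * 20
print(find_tir_pairs(seq))
```

A repeat longer than k produces several overlapping hits that would need to be merged, and low-complexity sequence can produce spurious pairs, so additional evidence such as a transposase-like ORF between the repeats would be needed in practice.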

4. Resources

ACLAME — http://aclame.ulb.ac.be/

Sequence Resources
NCBI — http://www.ncbi.nlm.nih.gov/
NCBI Microbial Genomes — http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi
EMBL — http://www.ebi.ac.uk/embl/
UniProt — http://www.ebi.ac.uk/uniprot/

Mobile Genetic Elements and Their Prediction


Phage
Phage Finder — http://phage-finder.sourceforge.net/
Prophage Finder — http://bioinformatics.uwp.edu/~phage/ProphageFinder.php
NCBI Phage Database — www.ncbi.nlm.nih.gov/genomes/static/phg.html

Genomic Islands
SIGI-HMM — http://www.g2l.bio.uni-goettingen.de
PAI-IDA — http://compbio.sibsnet.org/projects/pai-ida/
Alien-Hunter — http://www.sanger.ac.uk/Software/analysis/alien_hunter/
Z-Curve — http://tubic.tju.edu.cn/GC-Profile/
IslandPath — http://www.pathogenomics.sfu.ca/islandpath/
HGT-SVM — http://cbcsrv.watson.ibm.com/HGT_SVM/

Insertion Elements
ISFinder — http://www-is.biotoul.fr/

Other tools
tRNAscan-SE — http://lowelab.ucsc.edu/tRNAscan-SE/
Group II introns — http://www.fp.ucalgary.ca/group2introns/

5. Discussion

Each class of mobile element has its own set of features, which allows different detection methods to be used. In light of this, published algorithms and methods usually focus on detecting a single type of mobile element, avoiding the complexity of designing a method that detects all mobile elements. However, there are some general approaches that can be extended to the detection of various mobile elements and may allow for future integration of multiple approaches into a single detection tool. The most common approach to identifying any genomic element is to use similarity searches against a dataset of known genetic elements. Quite often a similarity search tool such as BLAST (Altschul et al., 1997) is used to find genes or genomic regions with sequence similarity to an entry in a previously curated database. These methods are usually quite successful in finding mobile elements in unexamined genomes; however, this type of approach has several limitations. The largest limitation is that the sensitivity of the tool depends heavily on the completeness of the known dataset of mobile elements. Any mobile element that is not similar to a previously known mobile element will not be detected by this approach. For example, if we are searching for prophages using a database of phage genes, we are limited to finding only those prophages that are similar to phage genomes that have been previously sequenced. Novel phage genes cannot be detected using this approach.
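The dependence of similarity searching on database completeness can be illustrated with a small sketch. It deliberately replaces BLAST with a crude shared k-mer score, so it is a stand-in for the idea and not a description of how BLAST works; the function names, the word length `k`, and the `min_shared` threshold are assumptions made for this example.

```python
def kmers(seq, k=8):
    """Return the set of all words of length k in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_database_hit(query, database, k=8, min_shared=0.3):
    """Return (name, score) for the best database entry, or None.

    score is the fraction of the query's k-mers that also occur in the
    database entry. database maps element names to sequences.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return None
    best = None
    for name, reference in database.items():
        score = len(query_kmers & kmers(reference, k)) / len(query_kmers)
        if score >= min_shared and (best is None or score > best[1]):
            best = (name, score)
    return best
```

A query region that contains a known element is reported, whereas an element with no counterpart in the database returns `None`, however mobile it may be. This is the limitation described above.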


M. G. I. Langille et al.

The detection of compositional bias in genomic regions can aid in the identification of regions that have been horizontally transferred, and this approach does not depend on sequence similarity. However, these methods have their own limitations (see Sec. 3.1) and usually bias detection toward relatively recent transfers. As the number of genomes increases, comparative approaches will become increasingly important, yet such methods remain underdeveloped to date compared with the plethora of sequence composition-based approaches. As with all mobile elements, transposons are actively involved in genome evolution and can introduce many types of changes to the genome, including gene rearrangement, insertion and deletion. Hence it is interesting as well as important to study the distributions of transposons across the sequenced genomes to understand the factors that might affect these distributions in their host genomes (Frost et al., 2005; Wagner, 2006). Another question of interest that could be asked based on the annotation of transposons is how transposons affect the cellular machineries of the host organism by affecting the neighboring genes with their endogenous promoters. Currently only ∼220 out of the ∼1,500 IS elements in the ISfinder database (Siguier et al., 2006b) are reported to appear in more than one organism. The general distributions of IS elements across a genome or multiple genomes are not well understood, and identifying all the known IS elements in a genome would be difficult using experimental techniques. Therefore, prediction and analysis programs with improved capabilities could help in the annotation of all known transposable elements in all sequenced genomes and lead to an improved understanding of their distribution. In addition to more tools, well-curated and updated databases of mobile elements are needed. 
Often, the most beneficial databases are those that are successful in obtaining submissions from many researchers. ISfinder is a good example of this as a number of journals, including Microbiology and Journal of Bacteriology, now require authors reporting new IS elements to deposit them into ISfinder. Comprehensive databases also allow for thorough testing of existing and newly developed tools. Many in silico tools for the detection of mobile elements are being published in the scientific literature; however, quite often accuracy measurements are not reported or comparisons between tools are based on different criteria. Balanced and public evaluations of tools are needed to allow researchers to effectively evaluate each tool’s capabilities. In particular, no study to date has been performed to compare the accuracy (sensitivity and specificity) of the different in-silico approaches for identifying integrons in genomic sequence. Furthermore, more investigation is needed into which features are best used to predict mobile elements like integrons. Most current integron-prediction approaches seem to take advantage of conserved regions of 59-be to detect integrons. However, there are known resistance gene cassettes that harbor different 59-be regions


(Mazel, 2006). Therefore, more investigation into the diversity of these elements is needed. In addition, to our knowledge, no paper has yet reported approaches to identify both integrons and superintegrons in genomic sequence. With the continued increase in genomic data, from metagenomic studies for example, there is a growing interest in identifying all MGEs in newly sequenced genomes, and therefore more research into producing more standardized and accurate approaches is needed.

6. Summary

Mobile genetic elements are regions of DNA that are able to integrate themselves into other genomic locations or even into other hosts. They may carry a single gene, such as a transposase in an IS element, or a large number of genes contributing to a common function such as pathogenicity or antimicrobial resistance. These elements are important because they enable rapid phenotypic changes to occur, through the insertion of novel genes or the disruption of existing genes. As with the detection of many other genomic elements, similarity search is commonly used to identify MGEs. This approach can be used successfully for the detection of MGEs because these regions contain genes such as transposases, integrases, or phage-like genes that are not common in other genomic regions. False positives, however, can arise through genome rearrangement. Also, similarity search for these hallmark genes alone may not be sufficient to identify the boundaries of the mobile elements. Since several types of mobile elements exhibit sequence composition bias, computational approaches that measure differences in DNA sequence composition provide one alternative way to identify these elements. Additionally, comparative genomic approaches may be increasingly useful. Currently, each in silico detection method has fairly significant limitations. Future methods will need to tackle these limitations and integrate many of the current approaches into universal tools. 
Also, the development of robust databases of MGEs will provide critical datasets for training and testing of the computational methods developed. In section 5.2, Features of Mobile Elements, we have described the importance of MGEs in the development of adaptive changes in bacteria of medical or environmental interest. Hopefully, this will stimulate the development of more computational tools and databases that will address current limitations and facilitate new insights regarding the evolution and function of these mobile genetic regions.

7. Further Reading

Frost LS, Leplae R, Summers AO, Toussaint A (2005) Mobile genetic elements: the agents of open source evolution. Nat Rev Microbiol 3:722–32.
Chandler M, Mahillon J (2002) Insertion sequences revisited. In: Craig NL, Craigie R, Gellert M, Lambowitz AM (eds) Mobile DNA II. American Society for Microbiology, Washington, DC, pp 631–662.
Craig NL, Craigie R, Gellert M, Lambowitz AM (eds) (2002) Mobile DNA II. ASM Press, Washington, DC.
Dawkins R (2007) The selfish gene, 30th anniversary edition. Oxford University Press.


Acknowledgments

MGIL and WWLH are Michael Smith Foundation for Health Research (MSFHR) Trainee awardees and FSLB is a MSFHR Senior Scholar. WWLH and FSLB were also awarded a Canadian Institutes of Health Research Scholarship and New Investigator award, respectively. FZ and YX are supported in part by the National Science Foundation (NSF/DBI-0354771, NSF/ITR-IIS-0407204, NSF/DBI-0542119, NSF/CCF-0621700) and a Distinguished Scholar grant from the Georgia Cancer Coalition.

Glossary

Compound transposon — A transposable element formed when two IS elements insert on either side of a non-transposable segment of DNA.
Conjugation — Gene transfer that is mediated by certain plasmids and requires direct cell contact.
Conjugative transposon — A transposon that encodes functions allowing transfer of the transposon DNA between donor and recipient bacterial cells.
Cryptic genes — Phenotypically silent DNA sequences, not normally expressed during the life cycle of the organism.
Genomic island — A cluster of genes in a prokaryotic genome that shows evidence of horizontal origin.
Horizontal gene transfer — Any process in which an organism transfers genetic material to another cell that is not its offspring.
Insertion sequence (IS) element — A short mobile DNA sequence similar to a transposon except that it encodes only genes for its own transposition.
Integrase — An enzyme that is used by phage to integrate one DNA molecule into another.
Integron — A genetic element that encodes an integrase enzyme, which can assemble tandem arrays of genes and provide them with a promoter for expression. Integrons are often contained within other mobile elements, which allows them to be mobile.
Phage — A virus that infects a prokaryotic organism.
Plasmid — A self-replicating (autonomous) circle of DNA distinct from the chromosomal genome of bacteria. A plasmid contains genes normally not essential for cell growth or survival.
Prophage — The dormant stage of a phage life cycle, usually integrated in the host genome.
Superintegrons — Integrons that are not linked to a mobile element and often have much larger gene cassette arrays.


Transduction — Gene transfer that is mediated by a phage.
Transformation — Gene transfer that is mediated by the uptake of naked DNA.
Transposase — An enzyme that promotes cutting of the DNA at the ends of a transposable element and joining to the DNA molecule into which the element is to be inserted.
Transposon — A mobile DNA element that can relocate within the genome of its host.


CHAPTER 6 HORIZONTAL GENE TRANSFER: ITS DETECTION AND ROLE IN MICROBIAL EVOLUTION

J. PETER GOGARTEN and OLGA ZHAXYBAYEVA

1. Introduction

1.1. The Early History of Gene Transfer

Gene transfer played a crucial role at the birth of molecular biology: using killed pathogenic Pneumococci, Griffith was able to transform nonpathogenic Pneumococcus strains into pathogens (Griffith, 1928). The same approach, with additional treatments of the killed pathogens, was later used to demonstrate that DNA is the transforming factor (Avery et al., 1944). Gene transfer between prokaryotic organisms became an intense focus of interest when it was recognized that bacteria can acquire antibiotic resistance genes not only from members of the same species, but also from only distantly related organisms (Gray and Fitch, 1983; Trieu-Cuot et al., 1985). The ability to share genetic information and the absence of a clear barrier to gene flow even led to the suggestion that all bacteria could be considered a single species (Margulis and Sagan, 2002) or, as suggested by Sonea (1988b), could be seen as a single super-organism: “all bacteria on Earth contribute to and draw benefits from, a common gene pool, which constitutes the communication network of a single super-organism whose continually shifting components are dispersed across the surface of the planet.”

1.2. Towards a Natural Taxonomy

Traditionally, single celled organisms without a nucleus were placed into a group called monera (Haeckel, 1866) or prokaryotes (Stanier and Van Niel, 1962). Classification of single celled organisms within this group was initially based on morphology (e.g., cell shape), physiology and biochemistry (e.g., fermentation type and temperature ranges for growth) (for further discussion see Rossello-Mora and Amann (2001), Olendzenski et al. (2004), Sapp (2005)). The goal of systematic classification is to develop a natural taxonomic system, in which named groups are defined based on shared ancestry (Hennig, 1966). The absence of clear morphological and biochemical markers, the absence of


an agreed upon species concept for prokaryotes (beyond one for convenient categorization, cf. Cohan and Perry (2007)), and the realization that genetic exchange could occur between divergent partners argued against the possibility of a natural systematic system for prokaryotes (Winogradsky, 1952; van Niel, 1955; Sapp, 2005). This pessimistic view changed, when Woese and Fox introduced the small ribosomal RNA (rRNA) as a taxonomic marker (Woese and Fox, 1977). Comparisons of this molecule for the first time revealed a large-scale structure of phylogenetic relationships. Furthermore, rRNA based groupings were often in agreement with morphological (e.g., Spirochetes), and biochemical characteristics (e.g., type of photosynthesis, properties of cell walls and membranes) (Woese, 1987). The use of small subunit rRNA as a taxonomic indicator revolutionized microbiology. Ribosomal RNAs contain both highly conserved and highly variable regions that can be compared between organisms of varying degrees of divergence. Furthermore, the ability to amplify rDNA (ribosomal RNA encoding DNA) outside the living organism allows analyses not only from cultured organisms, but also from environmental samples (Ley et al., 2006; Sogin et al., 2006), leading to the realization that much of the microbial biosphere is currently uncultured and largely unknown (Schloss and Handelsman, 2004; Sogin et al., 2006).

1.3. Ribosomal RNA versus Other Molecular Markers

Ribosomal RNA became the gold standard for microbial taxonomy (Rossello-Mora and Amann, 2001). Over 380,000 aligned and annotated small subunit ribosomal RNAs are included in release 9.51 of the ribosomal databank project (Cole et al., 2007). However, as more sequences from other molecules became available, different gene trees were reconstructed and compared to rRNA trees. While some markers were in agreement with rRNA based phylogenies, there was also conflict (Gogarten, 1995). For example, archaeal type ATP synthases and a euryarchaeal lysyl-tRNA synthetase were found in bacteria (Hilario and Gogarten, 1993; Ibba et al., 1997; Wolf et al., 1999). When the first genome wide analyses became possible, different molecules reflected rather different relationships between the Archaea and Bacteria (Pennisi, 1998). One explanation for the incongruence of phylogenetic trees constructed from different markers is horizontal gene transfer (HGT) (Doolittle, 1999; Gogarten and Townsend, 2005). However, artifacts of phylogenetic reconstruction are another important consideration. The study of deep branching eukaryotic lineages shows that one should not lose sight of the fact that the small subunit rRNA based tree of life at best depicts the evolutionary history of a single molecule. In some lineages this molecule evolved faster than in others. Many phylogenetic reconstruction algorithms tend to group long branches together, even if they are not specifically related (Felsenstein, 1978). For example, the microsporidia, considered a deep branching lineage based on small subunit rRNA (Vossbrinck et al., 1987), have now been recognized as more recently emerging, rapidly evolving relatives of the fungi (Embley and Hirt, 1998).


Concerns that rRNA based taxonomy might be insufficient to capture microbial evolution were fueled by the realization that other molecular markers usually did not agree with the rRNA-based tree of life (Hilario and Gogarten, 1993; Pennisi, 1998; Doolittle, 1999; Gogarten et al., 2002). Even if a few genes might provide a phylogeny that resembles the organismal history, this history is embedded in a web formed by the exchange of genetic information between divergent organisms (Gogarten, 1995; Martin, 1999). Describing microbial evolution as a single steadily bifurcating tree is insufficient to capture microbial evolution (Dagan and Martin, 2006).

1.4. Patterns Created Through Gene Sharing

One complication in reconstructing genome phylogenies is that highways of gene sharing (Beiko et al., 2005) can create a signal indistinguishable from a signal created through shared ancestry (Gogarten et al., 2002). These signals can pertain to groups of closely related organisms (e.g., marine cyanobacteria, Zhaxybayeva et al., 2006), or they might be due to interdomain gene transfers, as was suggested for Thermotoga (Gogarten and Townsend, 2005; Gophna et al., 2005). In the former case the situation is similar to that frequently observed in the evolution of eukaryotes, where gene flow between sympatric species can make their genomes more similar to one another without erasing the species characteristics (Grant et al., 2004; Arnold, 2006). In the case of transfer from very divergent species, the recipient might be pulled to the base of its taxonomic group in a phylogenetic reconstruction, as was suggested for Thermotoga. The use of weights or filters might help to minimize the signal due to gene transfer (Daubin et al., 2002; Gophna et al., 2005; Ciccarelli et al., 2006); however, the resulting phylogenies should be considered hypotheses pertinent to only a small section of the genome (Dagan and Martin, 2006). Advances in detecting and pinpointing transfer events promise to untangle at least a few of the intertwined histories of individual genes within organismal phylogenies.

1.5. Patterns Created Through Convergent Evolution

Patterns created through highways of gene sharing are one problem in reconstructing genome evolution. A related, potentially even more devastating problem is artifacts shared by most gene families included in an analysis. The evolution of endosymbiotic bacteria (see Chapter 7) might provide an example of convergent evolution that created an overwhelming signal. Most rRNA and genome based studies of the gamma proteobacteria (e.g., Gil et al., 2003; Lerat et al., 2003; Canback et al., 2004; Poptsova and Gogarten, 2007, compare Fig. 2) group two genera of endosymbiotic bacteria, Buchnera and Wigglesworthia, together. In particular, the monophyly of these and other insect endosymbionts was supported by several studies that were based on analyses of the available


genome sequences. However, the endosymbionts are characterized by reduced and AT rich genomes. A study (Herbeck et al., 2005) that utilized increased taxon sampling for only two genes, small subunit rRNA and groEL, in conjunction with a non-equilibrium model for phylogenetic reconstruction came to a very different conclusion. This study found that the Buchnera sequences form a group that is distinct from that of Wigglesworthia and other insect endosymbionts. If these two groups of insect endosymbionts indeed have independent origins, and are more closely related to different non-symbiotic gamma proteobacteria, then the strong signal found in several whole genome studies in support of a Wigglesworthia-Buchnera clade (Lerat et al., 2003; Zhaxybayeva et al., 2004b) was due to convergent evolutionary processes only.

1.6. Other Artifacts That Can Lead to Misleading Genome Phylogenies

The problem of long branch attraction (Felsenstein, 1978) was already illustrated in Sec. 1.3. This artifact is positively misleading: the more sequence data are included in the analysis, the more confidently the wrong phylogeny is recovered. The long branches that give rise to this artifact can be due to faster rates of substitution (as in the case of the microsporidia, see Sec. 1.3), or due to the absence of speciation events along a branch. In either case, the recommended solution is to add operational taxonomic units (genomes, sequences, species) to break up the long branches, and to use reconstruction algorithms that are less prone to this artifact (Bergsten, 2005). A similarly misleading artifact can be due to incomplete lineage sorting (Degnan and Rosenberg, 2006). Multiple alleles can exist in a population and survive through successive speciation events. The phylogeny of these alleles will reflect the divergence of the alleles and not the more recent evolution of the species. Degnan and Rosenberg showed that gene trees, in addition to being different from the organismal history, can be positively misleading in some instances, i.e., the tree topology recovered from the plurality of gene trees that reflect incomplete lineage sorting can be different from the organismal history. The potential for artifacts due to incomplete lineage sorting increases with shorter internode distances and with larger populations. In the case of a long internode distance, one or the other allele will be fixed due to random genetic drift, and the smaller a population, the faster genetic drift operates (Li, 1997). The frequency with which incomplete lineage sorting might pose a problem in reconstructing microbial evolution depends on how frequently microbial populations encounter bottlenecks that purge multiple alleles from populations. 
The latter question is closely related to the process by which new prokaryotic species originate: Does a single organism that acquires an adaptive trait become the single parent of new species? If this were the case, then lineage sorting would not likely pose a problem. However, prokaryotes were suggested to possess large effective populations (Lynch, 2006). If these can divide without loss of allelic diversity (see Sec. 2.2) then


lineage sorting might be a serious hurdle for attempts to reconstruct the history of genomes.
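The dependence of incomplete lineage sorting on internode distance and population size can be made concrete for the simplest case of three lineages. Under the standard neutral coalescent, the probability that a gene tree topology differs from the species tree is (2/3)·exp(−T), where T is the internode length in coalescent units. The sketch below assumes a Wright-Fisher population without recombination within the gene and without gene flow, and treats one coalescent unit as `ploidy * effective_population_size` generations. These are textbook simplifications that prokaryotic populations may well violate, and the function name and parameters are hypothetical.

```python
import math


def discordance_probability(internode_generations, effective_population_size,
                            ploidy=1):
    """Probability that a three-taxon gene tree disagrees with the species tree.

    Assumes the neutral coalescent; ploidy=1 (haploid) is an assumption made
    here for prokaryotes.
    """
    coalescent_units = internode_generations / (ploidy * effective_population_size)
    # With probability exp(-T) the two lineages fail to coalesce within the
    # internode; two of the three equally likely topologies are then wrong.
    return (2.0 / 3.0) * math.exp(-coalescent_units)
```

For a fixed internode length, increasing the effective population size moves the value toward its maximum of 2/3, and lengthening the internode moves it toward zero, in line with the qualitative statement above.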

2. Gene Transfer, Species, and the Units of Selection

2.1. Species Concepts

Microbial organisms do not form a continuum. Rather, microbiologists can reliably classify organisms according to domain, phylum, family, genus and species. Which phenomena generate this cohesion within prokaryotic groups? Three very different, but not mutually exclusive, mechanisms have been suggested. High levels of gene transfer followed by homologous recombination could play the role that sexual reproduction plays for gene flow in multicellular eukaryotes (Dykhuizen and Green, 1991). In this case, cohesion would be maintained by high levels of genetic exchange within the group and lower rates of recombination between groups. The situation under this model would be similar to the biological species concept (Mayr, 1942), which defines a species as a potentially interbreeding group of organisms capable of producing fertile offspring. Within such groups, gene phylogenies are seldom congruent due to high rates of gene flow and recombination (Gogarten et al., 2002; Lawrence, 2002). Cohesion could also be generated through selective sweeps, which occur if a gene that provides a selective advantage to its carrier arises through mutation or gene transfer (Cohan, 2002b). In the absence of recombination the advantageous gene would carry the whole genome along, erasing all diversity within the population and leading to more similarity within than between populations. Under this model, the extent of a species is defined by the margins of the selective sweep. At the species boundary high rates of recombination will restrict the sweep of an adaptive gene to the gene itself and prevent the annihilation of a related species (Majewski and Cohan, 1999; Cohan, 2002a). However, the observed clustering of organisms into groups of high within-group similarity might not be due to any specific biological process but could be caused by random processes only. 
For example, the tree resulting from a random extinction-bifurcation process follows Kingman’s coalescence (Kingman, 1982), where the deepest split in a clade on average covers half the time the clade is in existence (Zhaxybayeva and Gogarten, 2004). This process will thus lead to much higher within than between clade similarity (Zhaxybayeva, Doolittle, Gogarten unpublished; Gevers et al., 2005; Cohan and Perry, 2007).
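The statement about the deepest split can be checked with a small simulation of Kingman's coalescent. In this model the waiting time while k lineages remain is exponentially distributed with rate k(k−1)/2 (in coalescent time units), so the expected time with two lineages is 1 and the expected total tree height is 2(1 − 1/n). The sketch below estimates the ratio of these two mean times; the function names, sample size, and seed are arbitrary choices made for illustration.

```python
import random


def simulate_coalescent_times(n_lineages, rng):
    """Return waiting times {k: T_k} for k = n_lineages, ..., 2."""
    times = {}
    for k in range(n_lineages, 1, -1):
        rate = k * (k - 1) / 2.0  # number of lineage pairs that can coalesce
        times[k] = rng.expovariate(rate)
    return times


def deepest_split_share(n_lineages=50, replicates=20000, seed=1):
    """Estimate mean(T_2) / mean(tree height) over simulated genealogies."""
    rng = random.Random(seed)
    total_t2 = 0.0
    total_height = 0.0
    for _ in range(replicates):
        times = simulate_coalescent_times(n_lineages, rng)
        total_t2 += times[2]          # time spanned by the deepest split
        total_height += sum(times.values())
    return total_t2 / total_height
```

For 50 lineages the theoretical ratio is 1 / (2 · (1 − 1/50)) ≈ 0.51, that is, roughly half of the clade's history is spent on the two branches below the deepest split.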

2.2. What Is the Unit of Selection in Microbial Evolution?

Genetic information can be transferred across species boundaries. As a consequence the target of natural selection remains controversial. The following have been


considered as targets of natural selection (simultaneous selection at multiple levels appears likely):

• Individuals in a population: According to Darwin’s theory of natural and sexual selection (Darwin, 1859), evolution operates at the population level, where the “fittest” organisms produce more offspring and as a consequence the traits that produce this fitness become dominant in the population.
• Selfish genes: Dawkins introduced a gene centered view of evolution that considers organisms as vehicles created by selfish genes, most of which cooperate in creating the organism (Dawkins, 1976). In contrast, parasitic genes do not improve the vehicle; they just catch a ride into the future and are ready to leave the sinking ship (Gogarten and Hilario, 2006).
• Group selection: Larger units of selection have been proposed as well. Wynne-Edwards suggested that a population or a group as a whole could be the subject of selection, out-competing groups with less optimal properties (Wynne-Edwards, 1962). The process of evolution is itself subject to evolution (Rossler, 1979), thus it would not be surprising to find that lineages with better ways to evolve are favored over the long run. Conceivably, groups that adapt faster to new environments would out-compete lineages that adapt to new niches more slowly.
• Communities as units of selection: Through gene transfer, members of a microbial community have access to a common shared gene pool. Sonea (1988a) suggested that whole microbial communities, consisting of populations that traditionally are assigned to different species, could be considered as individuals and units of selection.

The interplay between selection at the group and the selfish gene level is best illustrated through concrete examples: Agrobacteria that carry a tumor inducing plasmid (Ti plasmid) can transform plant cells with a T DNA (short for transfer DNA). 
As a result of a successful transformation the plant cell has integrated the T DNA into its genome and expresses the encoded genes. This results in the transformed cells forming a tumor, and, in addition, the transformed plant cells also produce a strange amino acid that cannot be utilized by the plant cells, but that serves as a carbon and nitrogen source for the Agrobacteria. In the presence of this strange amino acid, the genes responsible for transferring the Ti plasmid between different Agrobacteria (tra genes) are under the control of quorum sensing (Oger et al., 1998; Luo et al., 2003). The effect is that if one Agrobacterium strain has successfully transformed a plant, and now lives on the plant-produced strange amino acid, other Agrobacteria can receive the Ti plasmid, which contains the T DNA transferred into the plant and in addition encodes enzymes that allow the metabolism of the strange amino acid. The Agrobacteria that receive the Ti plasmid thus participate in the utilization of the plant-produced carbon and nitrogen source. This observation might be described as group selection: the population of Agrobacteria avoids a selective sweep and carries larger genetic diversity into the


population living on the transformed plant. The increased diversity will facilitate future adaptations to a changing environment, and will avoid the fixation of slightly deleterious mutations that might have been carried by the Agrobacterium that transformed the plant cell. On the other hand, one can consider this process the outcome of the “selfishness” of the tra genes and of the Ti plasmid. These genes manage to move themselves into the growing part of the population, and they will benefit from a more diverse group of host organisms. A similar observation, connecting selfish genes to the selection of communities, was made by Peter Hirsch (personal communication) in studying microbial communities inside rocks in the dry valleys of Antarctica. These rocks have high concentrations of toxic heavy metals. The endolithic microbial community readily shares heavy metal resistance genes with microbes that might be able to become part of the community. At the community level the outcome is a higher diversity and a richer network of metabolic reactions. Presumably the more diverse communities are more stable towards perturbations, and provided the community can propagate as a whole, this would provide a selective advantage to the community. However, from the selfish gene point of view, the resistance gene increases its chances of long term survival by invading as many additional species as possible.

3. HGT Detection

Several types of methods for detection of HGT events are being developed and constantly improved. Due to varying underlying assumptions, different methods detect HGTs at different phylogenetic distances and of different age. It is therefore not surprising that the different approaches often return non-overlapping sets of HGT candidates (Ragan, 2001a; Lawrence and Ochman, 2002; Ragan, 2002; Ragan et al., 2006). In addition, all methods are imperfect and suffer from high error rates. While the rate of false positives is often reflected in the significance of detection, the estimation of false negatives requires either simulations or in silico transfers (Cortez et al., 2005; Poptsova and Gogarten, 2007).
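The idea of estimating false negatives through in silico transfers can be sketched as follows. Known "foreign" genes are planted into a simulated host genome, a detector is run, and the fraction of planted genes that goes undetected is reported. The detector used here is a deliberately simple GC-content outlier test that stands in for any HGT detection method; it is not the procedure of the cited studies, and every function name, parameter, and threshold is an assumption made for illustration.

```python
import random
import statistics


def random_gene(length, gc, rng):
    """Generate a random sequence with expected GC fraction gc."""
    return "".join(
        rng.choice("GC") if rng.random() < gc else rng.choice("AT")
        for _ in range(length)
    )


def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_atypical(genes, z_cutoff=2.0):
    """Return indices of genes whose GC content is an outlier in the genome."""
    values = [gc_content(g) for g in genes]
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return set()
    return {i for i, v in enumerate(values) if abs(v - mean) / sd > z_cutoff}


def false_negative_rate(host_gc, donor_gc, n_host=200, n_planted=10,
                        gene_len=900, seed=0):
    """Plant donor genes in a host genome; return the fraction not detected."""
    rng = random.Random(seed)
    genes = [random_gene(gene_len, host_gc, rng) for _ in range(n_host)]
    planted = set()
    for _ in range(n_planted):
        genes.append(random_gene(gene_len, donor_gc, rng))
        planted.add(len(genes) - 1)  # remember where the transfers were put
    detected = flag_atypical(genes)
    return len(planted - detected) / n_planted
```

Calling `false_negative_rate(0.5, 0.3)` simulates transfers from a compositionally distant donor, while `false_negative_rate(0.5, 0.5)` simulates a donor with the same composition as the host, where a composition-based detector is expected to miss most planted genes.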

3.1. Surrogate Methods

Since a horizontally transferred gene comes from a different genomic background, its nucleotide sequence can contain signatures of its previous “home” genome. One group of HGT detection methods uses atypical nucleotide composition (Lawrence and Ochman, 1997; Karlin, 2001) and/or atypical codon usage patterns (Lawrence and Ochman, 1998) to infer which genes in a genome are instances of HGT. Since these methods do not rely on phylogenetic reconstruction, they are sometimes called surrogate methods (Ragan, 2001b). Because genes “ameliorate” (that is, adapt to the signatures of their new genome) quickly (Lawrence and Ochman, 1997), these methods are applicable to the detection of very recent transfers only. While easily applicable to completely sequenced genomes, these methods were criticized

J. P. Gogarten & O. Zhaxybayeva

for returning high rates of false positives and negatives (Koski et al., 2001; Azad and Lawrence, 2005; Cortez et al., 2005). An application of a compositional approach to 116 available genomes revealed that the number of recently transferred genes ranges from 0.5% in the pea aphid endocellular symbiont Buchnera sp. APS to 25.2% in the anaerobic methane-producing archaeon Methanosarcina acetivorans C2A (Nakamura et al., 2004).

Other surrogate methods are applicable only to very closely related organisms. The extent of HGT among closely related organisms can be judged through a comparison of the gene content of their genomes. For example, three sequenced E. coli genomes (Welch et al., 2002) each harbor a substantial proportion of genes absent from the two other strains (585 genes in non-pathogenic E. coli K12, 1623 genes in uropathogenic E. coli CFT073 and 1346 genes in enterohaemorrhagic E. coli O157:H7); only 39.2% of their common gene pool is found in all three genomes. The genes that are present in only one of the three strains are assumed to have been introduced into E. coli through HGT. These numbers, which reflect dynamic and open pangenomes (see Chapter 4), are not unique to E. coli. Another example is provided by Frankia, actinobacteria that are nitrogen-fixing symbionts of plants. Three Frankia strains (Normand et al., 2007) are less than 3% divergent in their small subunit ribosomal RNA, but less than 20% of the common shared gene pool is represented in all three Frankia genomes, and the individual genomes have 1112, 1703, and 581 genes, respectively, that do not have any detectable homolog in the other two genomes. While gene loss and gene duplication make important contributions to changes in genome size, the latter numbers appear to mainly reflect acquisition of genes by gene transfer (Normand et al., 2007). This comparison of closely related genomes also could be considered a phyletic approach (see Sec. 3.3); however, the only phylogenetic information used is the close relationship between the analyzed genomes.

Another surrogate method that uses some phylogenetic information is the comparison of substitution rates. Novichkov et al. (2004) reason that if a gene evolved vertically in a lineage, then it should usually accumulate substitutions in a monotone and steady fashion. As a result, a plot of the distance between such a gene and its orthologs in related organisms against a measure of the relationship between the organisms should result in a straight line going through the origin of the coordinate system (see Fig. 1). In contrast, if the gene was recently acquired from an organism outside the group under consideration (i.e., the donating organism represents a deeper branch compared to the organisms that carry the orthologs being used in the comparison), then the relationship between gene and organismal distances should approximate a line parallel to the X-axis, i.e., all gene distances to the transferred gene are approximately the same, irrespective of the distance between the organisms. An advantage of this method is that it can be used to test both the hypothesis of vertical inheritance (“does the confidence interval for the Y-axis intercept exclude the origin?”) and the hypothesis of recent horizontal acquisition (“does the confidence interval for the slope exclude zero?”). A problem

Horizontal Gene Transfer: Its Detection and Role in Microbial Evolution

Fig. 1. Use of an approximate molecular clock to detect horizontally transferred genes. For each gene the distance between the gene and its orthologs from closely related genomes is calculated and plotted against the evolutionary distance separating the organisms. The latter can be approximated by ribosomal RNAs or by a genome average. If the gene was inherited vertically, and if the substitution rate remained approximately constant (panel A), then the points will fall on a straight line through the origin, with a slope depending on the substitution rate of the individual gene. If the gene was acquired from outside the organisms considered in the analysis (organism X) (panel B), then all gene distances will be approximately the same and independent from the distance between the organisms. If the transfer occurred to a deeper branch in the tree, part of the points will fall on the diagonal, and part on a parallel line to the abscissa. Modified from (Novichkov et al., 2004).

is that significant departures from a straight line through the origin can be due to violation of the clock assumption. Another shortcoming that this method shares with phylogenetic approaches is that it only considers genes that have recognizable orthologs (cf. Sec. 3.3) in other genomes. Because of these limitations, this approach might be best suited for sub-phylum analyses. In the case of the Bacillus-Clostridium group, the gamma-Proteobacteria, and the alpha-Proteobacteria, the clock-like null hypothesis could not be rejected for 70% of the analyzed gene sets (Novichkov et al., 2004). For over one-half of the remaining gene sets, the authors detect evidence for orthologous replacement, i.e., a gene is displaced by an ortholog from a different lineage.
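The two tests of the approximate molecular clock approach can be illustrated with an ordinary least-squares fit of gene distance against organismal distance. The following Python sketch is not the implementation of Novichkov et al. (2004): the function name clock_test, the default critical value and the example distances are assumptions made for illustration only.

```python
from math import sqrt

def clock_test(org_dist, gene_dist, t_crit=1.96):
    """Fit gene_dist = intercept + slope * org_dist by least squares.

    t_crit is the critical value used for the confidence intervals; 1.96 is a
    normal approximation (an assumption), and for few points the Student t
    quantile with n - 2 degrees of freedom should be supplied instead.
    """
    n = len(org_dist)
    if n != len(gene_dist) or n < 3:
        raise ValueError("need at least three paired distances")
    mean_x = sum(org_dist) / n
    mean_y = sum(gene_dist) / n
    sxx = sum((x - mean_x) ** 2 for x in org_dist)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(org_dist, gene_dist))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(org_dist, gene_dist))
    resid_sd = sqrt(sse / (n - 2))
    se_slope = resid_sd / sqrt(sxx)
    se_intercept = resid_sd * sqrt(1.0 / n + mean_x ** 2 / sxx)
    slope_ci = (slope - t_crit * se_slope, slope + t_crit * se_slope)
    intercept_ci = (intercept - t_crit * se_intercept, intercept + t_crit * se_intercept)
    return {
        "slope": slope,
        "intercept": intercept,
        "slope_ci": slope_ci,
        "intercept_ci": intercept_ci,
        # Vertical inheritance predicts a line through the origin.
        "reject_vertical": not (intercept_ci[0] <= 0.0 <= intercept_ci[1]),
        # Recent acquisition from outside the group predicts a flat line.
        "reject_recent_transfer": not (slope_ci[0] <= 0.0 <= slope_ci[1]),
    }

# Hypothetical distances for six genome pairs (not data from the chapter).
organism_d = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
vertical_gene = [0.052, 0.098, 0.151, 0.203, 0.248, 0.301]
transferred_gene = [0.81, 0.79, 0.80, 0.82, 0.78, 0.80]

T_975_DF4 = 2.776  # Student t quantile (97.5%, 4 degrees of freedom)
vertical_fit = clock_test(organism_d, vertical_gene, t_crit=T_975_DF4)
transfer_fit = clock_test(organism_d, transferred_gene, t_crit=T_975_DF4)
```

In this toy example the first gene is compatible with vertical inheritance only (the intercept interval contains zero, the slope interval does not), whereas the second gene is compatible with recent acquisition only. As noted in the text, a rejected fit may also reflect a violated clock assumption rather than gene transfer.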

3.2. Unusual Phyletic Patterns A different way to assess whether a gene could have been transferred is to do a BLAST (Altschul et al., 1997) (or any other similarity or clustering algorithm) search of a sequence database (such as NCBI’s non-redundant nr database) to find homologs to the query gene, define a gene family using this information and to look at the taxonomic distribution of members of the gene family (so called phyletic patterns). A significant top-scoring BLAST hit itself may suggest the most similar

sequence in the database; this has been used to obtain rough estimates for the number of horizontally transferred genes in a genome (for example, the Thermotoga maritima genome was proposed to have 24% of horizontally transferred genes from Archaea based on the top-scoring BLAST hits (Nelson et al., 1999)). However, a top-scoring BLAST hit might not represent a sequence that in a phylogenetic reconstruction would group with the query sequence (Koski and Golding, 2001); therefore, the phylogenetic affiliation of the top scoring BLAST hits is not a reliable approach to HGT detection. Phyletic patterns, however, can be further used to infer whether the patchy distribution of a gene is most parsimoniously explained by HGTs or by gains and losses (Snel et al., 2002; Kunin and Ouzounis, 2003; Mirkin et al., 2003). The outcome of the inferences depends on a value of “HGT penalty” (a ratio between HGT events and gene losses), which is not known, but has to be estimated or set a priori, and different studies disagree on what value to use. A recent attempt to apply this type of approach to 165 microbial genomes resulted in an inference of ∼40,000 horizontal gene transfers, ∼90,000 gene losses and over ∼600,000 vertical transfers in all analyzed gene families (Kunin and Ouzounis, 2003). While the numbers given above may be interpreted as showing only a limited number of HGTs among the 165 genomes, one should not forget that those estimates do not consider HGTs resulting in orthologous replacement, which could constitute a substantial part of a genome (see Sec. 3.3). And yet another problem comes from the lack of firm definition of “gene absence” — at what stage of gene decay the gene should be declared absent? Consideration of gene remnants within a genome as absent genes can lead to systematic overestimation of within-species HGT events (Zhaxybayeva et al., 2007). 
Patterns of gene presence and absence have also been used to estimate the frequency of gene transfer (Dagan and Martin, 2007). As discussed in the previous paragraph, a given phyletic pattern can be explained either by vertical inheritance and gene loss, or through gene transfer. The former explanation alone, without consideration of gene transfer, forces one to assume that a gene present in two organisms was already present in the ancestor of the two genomes. In particular, under this assumption any gene present in at least one archaeon and one bacterium would have to be assumed present in the ancestral “Garden of Eden” genome (Doolittle et al., 2003). Using 190 present day genomes Dagan and Martin calculated the size of the genome at the base of the bacterial domain to encode 53,658 proteins under the no-transfers assumption. For genomes in the past to have had about the same size as today’s genomes about one gene transfer is required to have occurred per gene family (Dagan and Martin, 2007).
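The dependence of such inferences on the HGT penalty can be made concrete with a small weighted-parsimony calculation. The sketch below is an illustration only and not the procedure of any of the cited studies: the tree, the taxon names, the function names and the convention that the first gain of a gene family is its origin (so that the number of transfers equals the number of gains minus one) are all assumptions.

```python
INF = float("inf")

def subtree_costs(node, present, gain, loss):
    """Return {state: (cost, gains, losses)} for the subtree rooted at node.

    A node is a taxon name (leaf) or a tuple of child nodes; state 1 means the
    gene is present at that node, state 0 that it is absent.
    """
    if isinstance(node, str):
        observed = 1 if node in present else 0
        return {s: (0.0, 0, 0) if s == observed else (INF, 0, 0) for s in (0, 1)}
    child_tables = [subtree_costs(child, present, gain, loss) for child in node]
    table = {}
    for state in (0, 1):
        total = (0.0, 0, 0)
        for child in child_tables:
            options = []
            for child_state in (0, 1):
                cost, gains, losses = child[child_state]
                if state == 0 and child_state == 1:    # gene gained on this branch
                    cost, gains = cost + gain, gains + 1
                elif state == 1 and child_state == 0:  # gene lost on this branch
                    cost, losses = cost + loss, losses + 1
                options.append((cost, gains, losses))
            best = min(options)  # cheapest option; ties prefer fewer gains
            total = tuple(a + b for a, b in zip(total, best))
        table[state] = total
    return table

def explain_pattern(tree, present, hgt_penalty, loss_cost=1.0):
    """Most parsimonious numbers of transfers and losses for one phyletic pattern."""
    table = subtree_costs(tree, present, hgt_penalty, loss_cost)
    cost1, gains1, losses1 = table[1]
    # Presence at the root is charged as one gain: the origin of the family.
    root_present = (cost1 + hgt_penalty, gains1 + 1, losses1)
    cost, gains, losses = min(table[0], root_present)
    return {"cost": cost, "transfers": max(gains - 1, 0), "losses": losses}

# Hypothetical eight-taxon tree; the gene is found only in taxa A and E.
TREE = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
PATTERN = {"A", "E"}
low = explain_pattern(TREE, PATTERN, hgt_penalty=1.0)
high = explain_pattern(TREE, PATTERN, hgt_penalty=5.0)
```

With a penalty of 1 the patchy pattern is explained by one transfer and no loss; with a penalty of 5 the same pattern is explained by four losses and the gene is placed in the common ancestor of all eight taxa. The second outcome is the kind of inference that inflates ancestral genome sizes when transfers are disallowed or heavily penalized.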

3.3. Phylogenetic Incongruence These methods rely on reconstruction of phylogenetic trees for sets of orthologous genes and comparison of them to each other, assuming that trees with unexpected (that is topologically incongruent) phylogenetic histories are results of horizontal

gene transfer. Most of these approaches require using an expected phylogenetic history (organismal tree) as a reference tree for the comparison. One of the earliest such analyses came from comparison of gene families from the Aquifex aeolicus genome to the rRNA tree, with the conclusion that one gets “different phylogenetic placements based on what genes are used” (Pennisi, 1998). Later, 205 gene families from 13 gamma-proteobacteria were compared to their concatenated phylogeny (Lerat et al., 2003; Beiko et al., 2005) and 22,432 gene families from 144 prokaryotic genomes were compared to the supertree constructed from compatible bipartitions (Beiko et al., 2005). Choices of reference trees (as a proxy of organismal trees) include rRNA trees, genome trees, trees derived from concatenation of selected datasets, or trees (possibly only partially resolved) supported by a plurality of sets of orthologous genes (consensus trees or supertrees). Ideally, if the organismal tree is not known, all possible tree topologies should be tried as a reference tree (examples of such methodologies to analyze four and five genomes are in Zhaxybayeva and Gogarten (2002), Zhaxybayeva et al. (2004a) and Hamel et al. (2005)). However, due to the vast number of possible tree topologies, this approach is computationally infeasible for large-scale analyses. As an alternative, the trees to be analyzed could be broken into smaller pieces (e.g., bipartitions or quartets). There is a significantly smaller number of possible bipartitions/quartets than trees for a given number of analyzed genomes, and therefore all possibilities can be evaluated, giving rise to bipartition (Zhaxybayeva et al., 2004b) and quartet decomposition analyses (Zhaxybayeva et al., 2006) (see Figs. 2 and 3, and text below). A bipartition corresponds to a branch or split in a phylogenetic tree. If an edge between two nodes in a tree is removed from the tree, the tree is split into two unconnected trees.
The two sets of leaves of these two trees represent the bipartition corresponding to the edge. A given tree represents a set of bipartitions, i.e. all those bipartitions corresponding to the edges of the tree. Two bipartitions are considered compatible, if they could coexist on a single tree, and considered incompatible if they cannot. Given a single data set of aligned sequences, support values for the different bipartitions (bootstrap support values or Bayesian posterior probabilities) can be calculated (Felsenstein, 1988; Zhaxybayeva and Gogarten, 2003). Bipartition spectra, also known as Lento plots (Lento et al., 1995), summarize the statistical support for bipartitions in form of a histogram. Support for a bipartition is given as a column above the x-axis, and conflict as a column below the x-axis. In case of comparative genomics, one can use the number of gene families that significantly support the different bipartitions, and the total number of gene families significantly supporting conflicting bipartitions, as measures for support and conflict, respectively (see Fig. 2 and Zhaxybayeva et al. (2004b)). Bipartition based analyses are useful to find gene families with conflicting phylogeny, without requiring a completely resolved reference tree. However, the applicability of this approach depends on at least some bipartitions being significantly supported by the plurality of gene families (Poptsova and Gogarten, 2007). This is a problematic requirement, because the more leaves are added to

Fig. 2. Example of a bipartition spectrum (Zhaxybayeva et al., 2004b). This spectrum, or “Lento”-plot (Lento et al., 1995), summarizes all bipartitions that were found in the phylogenetic analyses of gene families that were represented in each of 13 Gamma proteobacterial genomes (236 families). The bars above the x-axis give the number of gene families that support a bipartition with more than 70% bootstrap support, higher support values are color coded. The numbers of gene families that support a conflicting bipartition are depicted below the x-axis. Each number can be greater than the number of gene families, because a single gene family can support several conflicting bipartitions. Note that the first eight bipartitions are supported by the majority of gene families, and that only three datasets conflict with these plurality bipartitions at the 99% bootstrap support level. Only the 29 bipartitions that are supported by at least one gene family with at least 70% bootstrap support are included (from a total of 4082 possible bipartitions). The sixth bipartition (counting from the left) corresponds to the grouping of Wigglesworthia with Buchnera, a grouping that might reflect an artifact rather than shared ancestry. (Figure modified from Poptsova and Gogarten, 2007.)

a tree, the shorter the internal edges and the smaller their support values tend to become (Wainright et al., 1993). A solution to this conundrum is quartet decomposition analysis. In this type of analysis a gene tree is “decomposed” into sets of all possible embedded quartets (Zhaxybayeva and Gogarten, 2003; Zhaxybayeva et al., 2006). An embedded quartet is any subset of a tree consisting of four leaves (taxa). The phylogeny of the tree is calculated using all sequences; however, to calculate support for each quartet embedded in the tree, the remaining taxa on the tree are ignored and only relationships between four taxa constituting the quartet are evaluated for each bootstrap sample (that is why the quartet is called “embedded”). Such an approach not only avoids problems of short internal branches and taxon sampling, but also allows us to combine phylogenetic signals from gene trees containing various numbers of taxa. Similarly to the bipartition approach described above, putative HGTs are determined by examination of gene families that give rise to conflicting quartet topologies.
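The combinatorial argument for decomposing trees, and the notions of compatible bipartitions and embedded quartets, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the cited analyses: the function names and the five-taxon example tree are hypothetical, a tree is represented simply by one side of each of its internal bipartitions, and bootstrap support is ignored.

```python
from itertools import combinations
from math import comb

def count_unrooted_trees(n):
    """Number of fully resolved unrooted topologies for n taxa: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

def count_bipartitions(n):
    """Number of non-trivial bipartitions (both sides hold at least two taxa)."""
    return 2 ** (n - 1) - n - 1

def count_quartet_topologies(n):
    """Three possible topologies for each set of four taxa."""
    return 3 * comb(n, 4)

def compatible(side_a, side_b, taxa):
    """Two bipartitions can coexist on one tree if two of their sides are disjoint."""
    a1, b1 = frozenset(side_a), frozenset(side_b)
    a2, b2 = taxa - a1, taxa - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))

def embedded_quartets(tree_splits, taxa):
    """Map every four-taxon subset to the pairing the tree implies (None if unresolved)."""
    result = {}
    for quartet in combinations(sorted(taxa), 4):
        members = frozenset(quartet)
        topology = None
        for side in tree_splits:
            inside = members & frozenset(side)
            if len(inside) == 2:  # this edge separates the quartet two against two
                topology = frozenset({inside, members - inside})
                break
        result[quartet] = topology
    return result

# Hypothetical example: the unrooted tree ((A,B),(C,D),E).
TAXA = frozenset("ABCDE")
TREE_SPLITS = [{"A", "B"}, {"C", "D"}]
quartets = embedded_quartets(TREE_SPLITS, TAXA)
```

For 13 genomes these formulas give 4082 non-trivial bipartitions, the total quoted in the legend of Fig. 2, and 2145 quartet topologies, against more than 1.3 × 10^10 fully resolved unrooted trees. In an actual quartet decomposition analysis each embedded quartet would in addition carry a bootstrap support value, which this sketch omits.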

Fig. 3. Example of a quartet decomposition analysis (Zhaxybayeva et al., 2006). Panel A illustrates results for a single quartet. The black dot indicates the number of data sets containing this embedded quartet. The vertical bar shows the number of data sets having the topology of the quartet that is supported by a plurality of gene families (value above zero) and the number of data sets having one of the other two quartet topologies (value below zero). The bar is color-coded with respect to bootstrap support. Panel B shows the quartet spectrum of 1128 sets of orthologous genes from 11 completely sequenced cyanobacterial genomes. Columns are sorted according to the number of supporting data sets with at least 80% bootstrap support. Quartets above the x-axis are combined into a plurality signal. Quartets below the x-axis are conflicting with the plurality and are embedded in 60.7% of the analyzed gene families. For further discussion see text and (Zhaxybayeva et al., 2006). (Figure modified from Zhaxybayeva et al., 2006, copyright Cold Spring Harbor Laboratory Press.)

For example, analysis of 1128 gene families in 11 cyanobacterial genomes resulted in 685 gene families with phylogenetic trees incongruent with a reference tree supported by a plurality of gene families (Zhaxybayeva et al., 2006), thus providing candidates for HGTs. While most gene families deviated from the plurality consensus, all the quartet topologies were compatible with one another and resulted in a completely resolved consensus phylogeny. One drawback of phylogenetic approaches (aside from artifacts of phylogenetic reconstruction, which are not discussed here) is that HGTs between neighboring taxa on the reference tree are invisible to these methods, because these transfers do not result in a change of tree topology. Another drawback is that weak phylogenetic signal in a set of orthologous genes often results in an unresolved (or unsupported) tree topology. The latter topologies cannot be used to delineate HGT events, but

they also should not be used (although unfortunately they are sometimes used, e.g., by Snel et al. (2002)) as evidence for the absence of HGT. The third drawback is that the choice of reference tree may bias the results of HGT quantification. This is particularly a problem when a reference tree is obtained as a plurality tree (or supertree) from the same sets of genes that are subject to HGT detection in the study. The underlying assumption is that the number of HGTs should be minimized, and that the plurality of genes therefore reflects organismal evolution but not a recurring pattern of HGT. This assumption is not always justified: in addition to the organismal history, the plurality consensus might in part reflect highways of gene sharing, or artifacts, i.e., trees that result more frequently from lineage sorting, long branch attraction or convergent evolution. For example, in the abovementioned analysis of cyanobacterial genomes, the different Prochlorococcus marinus strains included in the study do not form a monophyletic group and the inferred relationships likely represent some of the “highways of gene sharing”. In the future, it will be important to develop algorithms that use genome data to reconstruct the reticulate history of genomes rather than providing only a single phylogenetic tree. Ideally, these methods would explore a priori non-tree-like approaches (such as SplitsTree and Neighbor-Net (Huson and Bryant, 2006) do for single gene phylogenies) to reconstruct the reticulate evolutionary genomic history and not begin with incompatible trees that are merged into networks only later.

4. Summary Advances in genome and meta-genome sequencing have revolutionized our understanding of microbial evolution. Reconstructing the evolutionary history of organisms turned out to be more difficult than anticipated two decades ago. Unequal substitution rates, substitution bias, and lineage sorting create artifacts whose magnitude previously was underappreciated. Exchange of genetic information between organisms now is recognized as an important process in microbial evolution. This has changed the view of microbial evolution from a tree-like process to one where lines of vertical descent are embedded in a tangle of gene transfers. Deciphering the details of this web of life remains a challenge; however, already the recognition of the major connections will yield information about the order (and timing) in which metabolic pathways were assembled and shared between different organisms and thus provide rich detail on the history of Earth’s biosphere.

5. Further Reading

Arnold, M. (2006). Evolution through genetic exchange. Oxford, Great Britain, Oxford University Press. This book places genetic exchange between organisms into a wider evolutionary framework. The focus of this book is gene flow in eukaryotes, not prokaryotic organisms.

Dawkins, R. (1976). The Selfish Gene, Oxford University Press. A classic that introduces a gene-centered view of evolution by natural selection.

Doolittle, W.F. (1999). Phylogenetic classification and the universal tree. Science 284(5423):2124–9. This article is one of the first reviews on horizontal gene transfer and the tree of life.

Fitch, W.M. (2000). Homology: a personal view on some of the problems. Trends Genet 16(5):227–31. A good overview of the often-confusing intricacies of homology, paralogy and orthology.

Focus on horizontal gene transfer, Nature Reviews Microbiology volume 3, number 9 (2005). This issue combines several reviews on different aspects of horizontal gene transfer, ranging from the mechanisms and vehicles of gene transfer to divergent views on prokaryotic species concepts.

Acknowledgments Work in JPG’s lab was supported through the NSF (MCB-0237197), the NASA Applied Information Systems Research (NNG04GP90G) and NASA Exobiology Programs (NNX07AK15G). OZ is a CIHR Postdoctoral Fellow.

CHAPTER 7 GENOME REDUCTION DURING PROKARYOTIC EVOLUTION

FRANCISCO J. SILVA and AMPARO LATORRE

1. Introduction It has long been known that the genome size of living organisms varies by several orders of magnitude. This variation is not uniformly distributed among taxonomic groups because it is mainly due to that observed within eukaryotes, which have a range of eight orders of magnitude. The sizes of the prokaryotic genomes are not so variable, but after the complete sequencing of 37 archaeal and 450 bacterial genomes (April 2007), the observed range of variation is 0.49–5.75 Mb and 0.16–9.77 Mb, respectively (Fig. 1). The size distribution in the different bacterial taxonomic groups shows that species with large and small genomes coexist in lineages such as the gamma-Proteobacteria or the Firmicutes (Fig. 2). Important variations may even be observed among strains of the same species such as, for example, Escherichia coli (4.6–5.6 Mb), Prochlorococcus marinus (1.7–2.7 Mb) or Pseudomonas fluorescens (6.4–7.1 Mb). All these observations suggest that genome size is a highly variable characteristic of prokaryotic species and that, in contrast to eukaryotes, it may change drastically even in the short divergence time between strains of one species. The sequencing and analysis of eukaryotic and prokaryotic genomes revealed that the larger differences observed in the former were a consequence of variations in the amount of nongenic DNA, including intronic but mainly intergenic DNA. In contrast, bacterial and archaeal genomes are highly compact, with genic DNA making up more than 85–90% of the genome in most species. Only a few exceptions to this high compactness have been reported in prokaryotes and include organisms containing hundreds or even thousands of pseudogenes, such as Mycobacterium leprae (Cole et al., 2001) or Sodalis glossinidius (Toh et al., 2006). The sizes of the intergenic regions in prokaryotes are very small, with median values as small as 3 bp in Pelagibacter ubique, 85 bp in E. coli, or 151 bp in Yersinia pestis (Giovannoni et al., 2005). Intergenic regions included in operons are even smaller and genes are frequently overlapping. All these observations show that compactness is a characteristic feature of prokaryotic genomes.

[Figure 1: three scatter plots of genome size (Mb) against GC content (%); the bottom panel labels the genomes smaller than 1 Mb with the species codes given in the caption.]

Fig. 1. Genome sizes in completely sequenced archaeal and bacterial species. From top to bottom, archaeal genomes, bacterial genomes, and a selection of genomes smaller than 1 Mb. Species code: Candidatus Carsonella ruddii PV (crp), Buchnera aphidicola BCc endosymbiont of Cinara cedri (BCc), Mycoplasma genitalium G37 (mge), Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis (wgl), Candidatus Blochmannia floridanus (bfl), Neorickettsia sennetsu str. Miyayama (nse), Tropheryma whipplei str. Twist (twh) and Nanoarchaeum equitans Kin4-M (nqe). Genome sizes and GC contents were obtained from the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi).

[Figure 2: scatter plot of genome size (Mb) against G+C content (%).]

Fig. 2. Genome sizes in Gamma-Proteobacteria and Firmicutes. Genome size is plotted against GC content.

It has also been observed that in many higher eukaryotic species, a large proportion of intergenic regions is composed of interspersed or tandemly repeated DNA (44% in the human genome and 12% in Drosophila melanogaster). In prokaryotic genomes, most of the DNA is single copy but it is possible to detect some repeated sequences. Insertion sequences (ISs) are the most frequent transposable elements. These types of elements are detected in the genome of at least one species from all the prokaryotic phyla, although in 63 out of 262 analyzed genomes they were absent (Touchon and Rocha, 2007). In a few species they constitute a large fraction of the DNA (up to 8%). The bacteria Shigella sonnei and Bordetella pertussis and the archaeon Sulfolobus solfataricus, with 342, 246 and 123 IS copies of different types, respectively, are species that show remarkably high numbers. Although prokaryote genome sizes are not strongly dependent on the number of IS elements, it has been observed that part of genome size variance (40%) is explained by IS abundance (Touchon and Rocha, 2007). The most probable reason is that large genomes contain a huge proportion of IS target sites, because many genes are not essential and some are even completely dispensable. However, the smallest genomes have lost most of the nonessential genes, and disruption of the retained genes by IS elements may be lethal or at least detrimental. The smallest known organismal genome sizes have been detected in two prokaryotic symbionts, the archaeon Nanoarchaeum equitans (0.49 Mb), which is an exosymbiont (an organism living outside its symbiont host) living attached to another archaeon (Ignicoccus sp.) (Huber et al., 2002), and the gammaproteobacterium B. aphidicola strain BCc (0.42 Mb), which lives intracellularly in

the bacteriocytes of the aphid Cinara cedri (Perez-Brocal et al., 2006). Recently, the genome of Carsonella ruddii, a gamma-proteobacterial endosymbiont (an organism living inside its symbiont host, and in some cases intracellularly) of the psyllid Pachypsylla venusta, has been sequenced (Nakabachi et al., 2006). Its small genome (0.16 Mb), the small number of genes, and the loss of several apparently essential genes cast doubt on its status (see Sec. 5). Independently of whether C. ruddii has just crossed the line between a living cell and an organelle, we know that other bacteria did so a long time ago. The endosymbiont theory, which explains the origin of mitochondria and chloroplasts, is widely accepted (Gray, 1999). These organelles have drastically reduced the original alpha-proteobacterial and cyanobacterial genomes, respectively, with a portion of the protein-coding genes and even RNA genes transferred to the eukaryotic nuclear genome. Chloroplast genome sizes are not very variable among eukaryotes, but we may detect a range of between approximately 35 and 224 Kb (Fig. 3). On the other hand, the size of the mitochondrial genomes is variable and does not show a correlation with the complexity of the organism. Thus, multicellular animals show very compact mitochondrial genomes of around 16 Kb, while in fungi they are usually larger and with a higher range of variation (Saccharomyces cerevisiae 78 Kb and 46 genes). Plants show the largest known mitochondrial genome sizes, with Zea mays subsp. parviglumis (680 Kb) presenting the largest completely sequenced mitochondrial genome. A plot comparing gene content and genome size (Fig. 3) shows the lack of correlation between plant mitochondrial gene number and genome size.

[Figure 3: scatter plot of genome size (Kb) against gene number; series: Metazoa mit., Plants mit., Fungi mit., Other eukaryota mit., Chloroplast.]

Fig. 3. Sizes of chloroplast and mitochondrial genomes. Genome sizes and gene numbers were obtained from the NCBI (http://www.ncbi.nlm.nih.gov/genomes/ORGANELLES/organelles.html).

In this review we will provide an overview of the current knowledge on the phenomenon of genome reduction in prokaryotes. We will start by showing that this evolutionary reduction is not restricted to specific taxonomic groups, but may be considered a broad phenomenon. We will describe how to reconstruct ancestral genome contents and how to use them, together with the present genomes, to identify the extent of these reductive processes. We will consider the different hypotheses that have been proposed to explain genome reduction. Finally, we will describe several organisms that have lost a large proportion of their genes, approaching what could be considered minimal genomes, and we will discuss the uncertain limits between cells and organelles.

2. Genome Reduction as a General Phenomenon The knowledge of the genome sizes of many prokaryotes and the reconstruction of their phylogenetic histories based on well-supported phylogenetic trees showed that several bacterial lineages with small genomes were closely related to others with larger genomes. The idea that the evolution of life on Earth consists of a progressive increase in the complexity of organisms, from the most simple to some specialized lineages, could suggest that the ancestors contained small genomes that were maintained in some lineages, while others evolved by increasing their sizes. The alternative was that the reduced-genome lineages were derived by degeneration from an ancestor with a larger genome. Many examples have led to the consideration of this second hypothesis as the most likely explanation for these small genomes, genome reduction being a general phenomenon that has taken place many times during prokaryotic evolution. Here we report some of these examples belonging to different taxonomic groups. The class Mollicutes is a group of prokaryotes with small genome sizes that includes several genera such as Mycoplasma, Anaeroplasma, Spiroplasma or Ureaplasma. Its origin was controversial and it was suggested that all of them derived from a type of primitive organism with a small genome size that later evolved in some lineages to produce other classes of bacteria with more evolved and larger genomes (Wallace and Morowitz, 1973). However, phylogenetic analyses showed that the Mollicutes were not a coherent group, and only some of its genera formed a lineage that diverged many million years ago from a clostridial ancestry that also led to Bacillus and Lactobacillus (Woese et al., 1980). Thus, the mycoplasmas are considered merely as degenerate clostridia, and their special characteristics are related to their special type of evolution, which has led to a strong decrease in size and to a fast nucleotide substitution rate.
The class Chlamydiae has long been considered a unique coherent group, comprising a few closely related pathogenic species, whose divergence from other known groups took place at the origin of bacteria (possibly 2 billion years ago). They show small genomes, in the range of 1.0–1.2 Mb. However, several other nonpathogenic chlamydiae have recently been detected in the environment. One

of them, Protochlamydia amoebophila UWE25, is an endosymbiont of free-living amoebae and possesses a larger genome (2.4 Mb) (Horn et al., 2004). These findings suggest that pathogenic chlamydiae reduced their genome size after adapting to pathogenic life. Several examples of genome reduction have been proposed in the group Actinobacteria. The case of Mycobacterium leprae is remarkable because, in spite of its large genome (3.2 Mb), it is considered a recent evolutionary example of gene decay and genome reduction. It contains more than 1,000 pseudogenes (Cole et al., 2001) and after divergence from M. tuberculosis it has lost more than 1,500 ancestral genes (Gomez-Valero et al., 2007a). For a few hundred ancestral M. tuberculosis genes there is no similar DNA sequence in M. leprae, while other genes remain as pseudogenes in M. leprae. The smallest characterized genome in actinobacteria belongs to Tropheryma whipplei. Its genome (0.93 Mb) is extremely reduced and encodes around 800 proteins. It contains a few pseudogenes, a small proportion of noncoding sequences and little sign of ongoing degradation (Raoult et al., 2003). In cyanobacteria, the smallest genome sizes are found in the strains of the free-living Prochlorococcus marinus. The high-light-adapted strain MED4 and the low-light-adapted strain SS120 show genome sizes of 1.66 and 1.75 Mb, respectively. However, the closely related strain MIT9313 possesses a genome of 2.4 Mb. Comparative analysis and the identification of orthologs in a strain with a larger genome size are compatible with the assumption that massive gene loss has occurred in both strains during their evolution from a Prochlorococcus ancestor (Dufresne et al., 2005). Within spirochaetes, the smallest genome (1.14 Mb) corresponds to the pathogen Treponema pallidum. The sequencing of the closely related species Treponema denticola has shown that 442 genes of the latter, which are absent in T. pallidum, have homologs in other spirochaetes (Borrelia burgdorferi or Leptospira interrogans), suggesting that these may have been lost by genome reduction in the T. pallidum lineage. Finally, proteobacteria are the group with the most extensively studied examples of genome reduction. The large range of variation in the sizes of the sequenced alpha-proteobacterial genomes and the presence of several genomes of around 1 Mb, such as those of the genera Rickettsia, Wolbachia, or Bartonella, suggested that massive reductions have taken place independently in at least two of these lineages (Boussau et al., 2004). The complete genome sequence of Rickettsia prowazekii revealed an unusually (for prokaryotes) high noncoding DNA content and some pseudogenes (Andersson et al., 1998). The study of a pseudogene region in several Rickettsia spp. showed that, in spite of their small genome sizes, the process of losing DNA and reducing the size of the genome is still active (Andersson and Andersson, 1999). Subsequent studies comparing the gene content and order of many alpha-proteobacterial species with large and reduced genomes have revealed that gene losses and gains, affecting in some cases more than 1,000 genes, have taken place several times during the evolution of this taxonomic

Genome Reduction During Prokaryotic Evolution


group (Boussau et al., 2004; Sallstrom and Andersson, 2005). For example, the genomes of Rickettsia, Wolbachia and Ehrlichia usually contain fewer than 1,500 genes but they had a common alpha-proteobacterial ancestor with around 3,500 genes. A similar case of reduction took place to produce the small genome sizes of Bartonella spp.

Gamma-proteobacteria constitute the taxonomic group in which the most examples of genome reduction, to varying extents, have been reported. Many species of this group live in association with insect species. The extent of this association is variable and includes obligate and facultative symbionts. The genomes of several of these strains or species have been completely sequenced, while in others the size has been estimated indirectly by Pulsed-Field Gel Electrophoresis or other methods (Table 1).

Genome reduction in some bacterial lineages is correlated with the evolution towards a completely host-dependent life. The most likely dynamics would start with bacterial symbionts living in a facultative association with a eukaryotic host. With time, this association would become obligatory and, in some cases, life would even be restricted to the inside of a specific host cell type. Symbionts may have commensal, parasitic or mutualistic relationships with their hosts. The comparative analysis of several bacterial symbionts shows differential features associated with both the initial and final stages of the evolution towards obligate symbiosis (Table 2). At initial stages, drastic changes take place in the genome. Hundreds or thousands of genes are lost. Many of them are detected as pseudogenes whose DNA sequences progressively disappear from the genome through small or large deletions. This produces large decreases in genome size. IS elements proliferate and, due to them or to other causes, many chromosomal rearrangements take place. At initial stages horizontal gene transfer (HGT) is still possible. At these stages, bacterial symbionts may not be restricted to specific hosts or tissues and, because the association is not obligate for either of the two partners, individuals without these symbionts may be detected in the populations.

Some of these initial associations have progressed to the present obligate symbiosis. Estimates of the age of some obligate associations range from around 29–36 My for Blochmannia spp. (the endosymbiont of carpenter ants) (Degnan et al., 2004) to at least 100–150 My for B. aphidicola (the bacterial endosymbiont of aphids) and Baumannia cicadellinicola (the endosymbiont of sharpshooters) (Baumann, 2005). Obligate symbionts continue to evolve, losing genes and reducing their genome size. But in some cases, genome sizes are so small (Table 1) that only a few more genes may be lost without the extinction of the species.

3. Reconstruction of Ancestral Genomes

Reconstructing ancestral genomes implies the identification of the ancestors’ gene content and, less frequently, their gene order. Because of the greater difficulty, very often gene order reconstruction is not performed, or it is at least restricted


Table 1. Genome sizes in gamma-proteobacterial symbionts. Sizes were estimated in the genome project after the complete sequencing of the genome (GP), with Pulsed-Field Gel Electrophoresis (PFGE), or indirectly from the expected gene content and the partial sequences in an ongoing genome project (OGP). Genome size includes, when available, the size of the plasmids.

Organism | Insect host | Type of symbiosis | Genome size (Kb) | Reference | Estimation
Buchnera aphidicola strain BCc | Cinara cedri (aphid) | Obligate | 420 | Perez-Brocal et al. (2006) | GP
Buchnera aphidicola strain BCt | Cinara tujafilina (aphid) | Obligate | 475 | Gil et al. (2002) | PFGE
Buchnera aphidicola strain BCp | Chaitophorus populeti (aphid) | Obligate | 520 | Gil et al. (2002) | PFGE
Buchnera aphidicola strain BTs | Thelaxes suberi (aphid) | Obligate | 550 | Gil et al. (2002) | PFGE
Buchnera aphidicola strain BTc | Tetraneura caerulescens (aphid) | Obligate | 565 | Gil et al. (2002) | PFGE
Buchnera aphidicola strain BBp | Baizongia pistaceae (aphid) | Obligate | 618 | van Ham et al. (2003) | GP
Buchnera aphidicola strain BAp | Acyrthosiphon pisum (aphid) | Obligate | 652 | Shigenobu et al. (2000) | GP
Buchnera aphidicola strain BSg | Schizaphis graminum (aphid) | Obligate | 653 | Tamas et al. (2002) | GP
Baumannia cicadellinicola str. Hc* | Homalodisca coagulata (sharpshooter) | Obligate | 686 | Wu et al. (2006) | GP
Wigglesworthia glossinidia | Glossina brevipalpis (tsetse fly) | Obligate | 698 | Akman et al. (2002) | GP
Blochmannia floridanus* | Camponotus floridanus (carpenter ant) | Obligate | 706 | Gil et al. (2003) | GP
Blochmannia pennsylvanicus* | Camponotus pennsylvanicus (carpenter ant) | Obligate | 792 | Degnan et al. (2004) | GP
Ruthia magnifica* | Calyptogena magnifica (giant clam) | Obligate | 1,161 | Newton et al. (2007) | GP
Serratia symbiotica SCc* | Cinara cedri (aphid) | Facultative/Obligate | ∼1,300 | Latorre (unpublished) | OGP
Hamiltonella defensa* | Acyrthosiphon pisum (aphid) | Facultative | 1,700 | Moran et al. (2005) | PFGE
Arsenophonus arthropodicus* | Pseudolynchia canariensis (pigeon louse fly) | Facultative | 3,510 | Dale et al. (2006) | PFGE
Sodalis glossinidius | Glossina morsitans (tsetse fly) | Facultative | 4,293 | Toh et al. (2006) | GP

* These species appear in databases as Candidatus.


Table 2. General features associated with the evolution towards obligate symbiosis in bacteria. The following features take place during the evolution of bacterial symbionts but some exceptions may be observed. Symbionts at initial steps include facultative symbionts that are not completely dependent on their partner and can live outside symbiosis.

Initial stages | Final stages
Many genes may be lost | Ongoing gene loss but at a slow rate
DNA loss is spread along the genome and produces fast reduction of genome size | Ongoing DNA loss but affecting a small number of regions in the genome
Many pseudogenes may be detected | A few pseudogenes detected
IS proliferation, producing gene inactivation in some cases | No IS elements
HGT acquisition still possible | No HGT
Bacterial symbionts may not be restricted to a specific tissue | Bacterial symbionts are restricted to a specific tissue and type of cell
Bacterial symbionts may not be restricted to a single host | Bacterial symbionts are restricted to a single host
Coevolution of host and endosymbiont lineages is altered by occasional transfers between hosts | Host and endosymbiont lineages coevolved
Variable presence in individuals of the same host species | Presence in all the individuals of the host species

to the identification of syntenic blocks. For both analyses, a first step is required in which the truly orthologous relationships among genomes must be determined based on comparative genomics. The definition of orthology from an evolutionary point of view is clear: orthologous genes are those that evolved from a common ancestral gene by speciation. They differ from paralogs and xenologs in their form of origin, which is duplication or acquisition by HGT, respectively. In practice, the identification of orthologs is tremendously complicated (see Chapter 9) and the term orthology is frequently, and incorrectly, replaced by a more practical definition based exclusively on sequence similarity. The recognition of orthologs is easier the closer the compared genomes are. For that reason, in the initial years of the genomic era, these studies were extremely difficult due to the small number of closely related genomes. However, over time the number of closely related genomes has increased spectacularly and, at the end of 2006, more than 300 completely sequenced genomes were not the only representative of their taxonomic genus.

Several types of events alter the content and order of genes in the genomes. The reconstruction of the gene content of ancestral genomes requires distinguishing between two kinds of situations: gene gain and gene loss. Prokaryotic genomes may increase their number of genes by means of at least two mechanisms: (1) duplications of either individual genes or blocks of genes and (2) acquisition through horizontal gene transfer. Gene duplications may be recognized because they are often observed in tandem or in duplicated segments, although HGT may also result in tandem repeats through one-sided homologous recombination. Phylogenetic analyses may also corroborate this scenario, although genes acquired by HGT from closely related genomes may produce similar tree topologies. To determine whether a gene


present in the genome of one species and absent in a second, was present in their last common ancestor and was lost in the second species, or on the contrary was not ancestral and was gained in the first species, is a difficult question that requires the availability of other related genomes, especially others that may be used as outgroups based on a previously known evolutionary divergence.

The inference of the structure and content of ancestral genomes and of the scenarios of genome evolution requires three steps:

(1) Production of a table of orthologous gene groups in the compared genomes.
(2) Determination of the gene content of one or several ancestors.
(3) Determination of the order of genes in the ancestral genome(s).

3.1. Production of a Table of Orthologous Gene Groups in the Compared Genomes

The first decision is to determine the types of elements that will be included in the table. If the goal is to reconstruct the gene content of an ancestor, both the genes and the pseudogenes will be useful. In fact, the groups should include not only recent pseudogenes (sometimes as long as the functional genes) but any remnant DNA sequence that permits the inference that a specific gene existed in the lineage. Regarding the definition of pseudogene, several authors use this term to refer to any degraded portion of a gene detected in the genome (Silva et al., 2001, 2003; Gomez-Valero et al., 2004a; Blanc et al., 2007), while others do not assign the status of pseudogenes to these degraded and frequently fragmented remnants of the genes (Andersson and Andersson, 1999). We also have to decide whether the table will only include protein-coding genes or, in addition, RNA genes. The problem with the latter is that most of them are very small and frequently present in several copies (for example, the tRNA genes with an average of 77 bp). For that reason, although convenient, the inclusion of RNA genes may be restricted to closely related genomes (Silva et al., 2003; Gomez-Valero et al., 2004a; Withers et al., 2006). Finally, genes included in transposable elements are best removed from orthologous gene tables. For example, some active IS elements show 100% identity to each other in the same genome, and the application of the term ortholog to the relationships among them and to those of closely related species does not seem appropriate.

The second problem is to decide which annotations to use. We may base this on the original genome annotations, but it is widely assumed that in prokaryotes a certain degree of error exists (see Chapter 2). Some true genes may not have been included because they were not detected by computational gene finding methods for several reasons, such as small sizes or unusual codon usage characteristics. In contrast, others are false positives, for example small orphan open reading frames or recent pseudogenes annotated as several contiguous independent genes due to several frameshift mutations. This implies that we should search for


these absent genes in the genome with the TBLASTN algorithm to detect either the non-annotated gene or its pseudogene. However, with false positive small orphan genes, TBLASTN would identify the corresponding DNA in closely related genomes, even though in both cases the genes were not real. Finally, a third problem is associated with the fact that many protein-coding genes are modular and the possibility of domain loss or domain shuffling exists.

The first approach to identifying orthologs was based on the method of circular best hits, which is represented in the Clusters of Orthologous Groups of proteins (COG) database (Tatusov et al., 1997) (see Chapter 9). When three genes from three different genomes are related in a triangle of best hits, they are considered orthologs. However, there are many examples in which, after differential gene losses, triangles connecting paralogous genes may be generated. To distinguish between orthologous and nonorthologous trios, additional methods have been applied. For example, the orthologous gene groups in three archaeal genomes (Pyrococcus spp.) (Fig. 4) were determined using the ratio between two pair-wise genetic distances (Lecompte et al., 2001). The dAF/dHF ratio was estimated for each group of orthologous trios. The average was around 1, and trios with unusual values (high or low) were discarded. With this strategy around 15% of initial hits were considered spurious. Unfortunately, unusual distance ratios may also be produced in groups of orthologous trios when the gene in one species evolves at a faster rate for one of several reasons, such as the relaxation of natural selection pressure or adaptation to new environmental conditions. Orthologous classification may be carried out making use of any of the databases of orthologous groups constructed with curated methods, such as COG (Tatusov et al., 2003) or MBGD (Uchiyama, 2007).
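The triangle-of-best-hits idea and the distance-ratio filter can be illustrated with a minimal sketch. It assumes that the best hit of every gene in every other genome has already been computed (for example from BLAST output) and, as a simplification, requires each side of the triangle to be a reciprocal best hit. The function names, the toy genes and the ratio thresholds (0.5 and 2.0) are hypothetical and are not taken from COG or from Lecompte et al. (2001).

```python
def reciprocal_best_hits(hits, g1, g2):
    """Return the set of (gene_in_g1, gene_in_g2) pairs that are each other's best hit.

    hits[(query_genome, subject_genome)] maps each query gene to its best-hit gene.
    """
    forward = hits[(g1, g2)]
    backward = hits[(g2, g1)]
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


def best_hit_triangles(hits, g1, g2, g3):
    """Return trios (a, b, c) in which all three gene pairs are reciprocal best hits."""
    rbh12 = reciprocal_best_hits(hits, g1, g2)
    rbh13 = dict(reciprocal_best_hits(hits, g1, g3))
    rbh23 = reciprocal_best_hits(hits, g2, g3)
    trios = []
    for a, b in sorted(rbh12):
        c = rbh13.get(a)
        if c is not None and (b, c) in rbh23:
            trios.append((a, b, c))
    return trios


def filter_by_distance_ratio(trios, d_af, d_hf, low=0.5, high=2.0):
    """Keep trios whose dAF/dHF ratio lies inside [low, high] (thresholds are assumptions)."""
    return [t for t in trios if low <= d_af[t] / d_hf[t] <= high]


# Toy best-hit tables for three genomes labelled A, H and F (hypothetical genes).
hits = {
    ("A", "H"): {"a1": "h1", "a2": "h2"},
    ("H", "A"): {"h1": "a1", "h2": "a2"},
    ("A", "F"): {"a1": "f1", "a2": "f2"},
    ("F", "A"): {"f1": "a1", "f2": "a2"},
    ("H", "F"): {"h1": "f1", "h2": "f3"},
    ("F", "H"): {"f1": "h1", "f3": "h2"},
}

trios = best_hit_triangles(hits, "A", "H", "F")
# Hypothetical pairwise distances for the surviving trio.
d_af = {("a1", "h1", "f1"): 0.30}
d_hf = {("a1", "h1", "f1"): 0.28}
kept = filter_by_distance_ratio(trios, d_af, d_hf)
print(kept)
```

In this toy data the genes a2, h2 and f2 do not close a triangle, because the best hit of h2 in genome F is f3 rather than f2, so only one trio survives. As the text warns, a ratio filter of this kind can also discard true orthologs that evolve unusually fast in one lineage.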
Most of these classification systems suffer from the high level of complexity of gene families. Phylogenetic tree topologies have sometimes been used, instead of simple BLAST hit results, to distinguish among orthologous, paralogous and xenologous relationships. However, many gene genealogies do not recover the true phylogeny. This is mainly observed when the phylogeny includes one or several fast-evolving lineages. For example, the assignment of orthology for many B. aphidicola genes could not be exclusively supported by phylogeny because many B. aphidicola fast-evolving genes tend to show aberrant positions in the phylogenetic tree. One of the best ways to validate orthologous groups is to examine the gene neighborhood. As a general rule, gene

Fig. 4. Phylogenetic tree of Pyrococcus spp. (tip labels: P. abyssi (A), P. horikoshii (H), P. furiosus (F)). Abbreviations for each species are in brackets.


Table 3. Assignment of orthology for argG genes in some gamma-Proteobacteria.

Species* | Gene name | Reciprocal best hit | Phylogenetic clade | Genome context
eco | B3172 | Yes | 1 | Type 1
stt | T3208 | Yes | 1 | Type 1
eca | ECA0104 | Yes | 1 | Type 2
ype | YPO1570 | Yes | 1 | Type 3
buc | BU050 | Yes | 2 | Type 4
plu | PLU4742 | Yes | 2 | Type 4
vch | VC2642 | Yes | 2 | Type 4

*Taxonomic name abbreviations: Buchnera aphidicola BAp (buc), Erwinia carotovora subsp. atroseptica SCRI1043 (eca), Escherichia coli K12 MG1655 (eco), Photorhabdus luminescens subsp. laumondii TTO1 (plu), Salmonella enterica subsp. enterica serovar Typhi Ty2 (stt), Vibrio cholerae O1 biovar eltor str. N16961 (vch) and Yersinia pestis CO92 (ype).

order conservation decreases consistently over time (Tamames, 2001; Belda et al., 2005). However, some lineages show exceptional rates of genome rearrangement, which in some cases are very low and in others extremely high (Belda et al., 2005). An example of how reciprocal best hit, phylogeny and gene order conservation may lead to different conclusions about orthology is shown in Table 3 for the argG genes from some gamma-Proteobacteria. The argG gene, which encodes argininosuccinate synthase, is included in the middle of a cluster of several arg genes in B. aphidicola, P. luminescens and V. cholerae but is completely isolated in other enterics (Fig. 5). The chromosomal position of the isolated argG genes reveals that the type 4 position was the ancestral one and that type 1, type 2 and type 3 probably represent three independent HGT insertions which led to the replacement of the ancestral argG gene in these lineages. In contrast, phylogenetic analysis divides the argG gene sequences into two clades (Fig. 6). Finally, all seven genes are considered orthologs when the reciprocal best hit method is used.

3.2. Determination of Gene Content of One or Several Ancestors

To determine the gene content of the last common ancestors of present genomes, a maximum parsimony criterion can be used. To apply this criterion the correct phylogeny is required. The first studies on the inference of genome contents were done with the minimal number of three genomes. Knowing which of them was the outgroup, the content of the last common ancestor of the ingroup species could be inferred. Based on this approach, the minimal gene content of the ancestral genome of the endosymbiotic bacteria B. aphidicola was reconstructed (Silva et al., 2001). The completely sequenced genomes of B. aphidicola BAp, E. coli K12 and V. cholerae were compared (Fig. 7). Genes present in both E. coli and B. aphidicola

Fig. 5. Genomic context of argG and argH genes.

were considered ancestral. E. coli genes absent in B. aphidicola were considered ancestral when they were also present in the outgroup genome of V. cholerae. Only orthologous genes were included for this inference. The few genes unique to B. aphidicola were also considered to be present in the free living bacterium that initiated the endosymbiotic life, since HGT was considered very improbable in this lineage. With this method, the minimal gene content of 1818 genes was established for the ancestor. An increased ancestral gene number was obtained, if the criterion used included as ancestral those genes present in at least E. coli and Y. pestis (Moran and Mira, 2001). However, because well-supported phylogenies show that Y. pestis is a closer relative of E. coli than of B. aphidicola (Gil et al., 2003), some of the genes considered ancestral under this criterion could have been gained by the ancestor of E. coli and Y. pestis after divergence from B. aphidicola. The reconstruction of the last common symbiont ancestor of B. aphidicola was performed by comparing the gene content and gene order of three B. aphidicola strains (Shigenobu et al., 2000; Tamas et al., 2002; Silva et al., 2003; van Ham


Fig. 6. Phylogenetic trees for argG (top) and argH (bottom) genes. Note that an additional argG-like gene is present in P. luminescens. Colors show the four argG gene types. Tip labels in the argG tree: eco:B3172, stt:T3208, eca:ECA0104, ype:YPO1570, vch:VC2642, plu:PLU4742, buc:BU050 and plu:PLU4567. Tip labels in the argH tree: stt:T3501, eco:B3960, ype:YPO3924, eca:ECA0194, plu:PLU4741, buc:BU051 and vch:VC2641.

Fig. 7. Reconstruction of the gene content of the ancestor of E. coli and B. aphidicola. LCA is the last common ancestor of these two species. LCSA is the last common symbiont ancestor of the present B. aphidicola strains. LCSA gene content was smaller than 1,000. Tree labels: B. aphidicola BAp (608 genes), E. coli K12 (4398 genes), V. cholerae (3957 genes), LCA (1818 genes), LCSA, > 1000 lost genes. Approximate time scale in My (ticks at 150, 100 and 50).
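The three-genome rule used for the LCA in Fig. 7 can be written as simple set operations. The following is a minimal sketch, assuming that each genome has already been reduced to a set of orthologous group identifiers; the function name and the toy identifiers are hypothetical, and the treatment of genes unique to the reduced genome follows the assumption stated in the text that HGT is very improbable in that lineage.

```python
def infer_ancestral_content(ingroup_large, ingroup_reduced, outgroup):
    """Infer a minimal ancestral gene set for two ingroup genomes and one outgroup.

    Each argument is a set of orthologous group identifiers.
    Returns (ancestral, lost_in_reduced).
    """
    # Genes present in both ingroup genomes are considered ancestral.
    shared = ingroup_large & ingroup_reduced
    # Genes of the large genome missing from the reduced one are ancestral
    # only when the outgroup also carries them.
    supported_by_outgroup = (ingroup_large - ingroup_reduced) & outgroup
    # Genes unique to the reduced genome are kept as ancestral, assuming
    # that this lineage does not acquire genes by HGT.
    unique_to_reduced = ingroup_reduced - ingroup_large
    ancestral = shared | supported_by_outgroup | unique_to_reduced
    lost_in_reduced = ancestral - ingroup_reduced
    return ancestral, lost_in_reduced


# Toy example with hypothetical orthologous groups.
large = {"og1", "og2", "og3", "og4"}      # plays the role of E. coli
reduced = {"og1", "og5"}                  # plays the role of B. aphidicola
outgroup = {"og1", "og2", "og3", "og6"}   # plays the role of V. cholerae

ancestral, lost = infer_ancestral_content(large, reduced, outgroup)
print(sorted(ancestral), sorted(lost))
```

In the toy data og4 is excluded because it is found only in the large ingroup genome and could have been gained after the split, which is why the result is a minimal estimate of the ancestral gene content.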

et al., 2003; Gomez-Valero et al., 2004a). In spite of being considered strains, as they evolve with their hosts, their divergences have been estimated at around 86–164 My for the first split (Von Dohlen and Moran, 2000) and 50–70 My for the second (Clark et al., 1999), based on the divergence of their aphid hosts (Fig. 8). The strategy for the reconstruction of the gene content of the B. aphidicola ancestor that established the symbiotic relation with an ancestral aphid species was based on several special characteristics of this type of bacterium. Because of the extreme stasis of the genome (Tamas et al., 2002) with almost perfect gene order conservation and without acquisition of foreign genes, any gene detected in one strain was considered ancestral and its position considered to be the same as


Fig. 8. Gene losses during the evolution of B. aphidicola lineages. See Table 1 for the descriptions of the three B. aphidicola strains. Period I is the time interval between the divergence of BBp and that of BAp and BSg. Period II is the time interval after the divergence of BAp and BSg.

that of the ancestral genome. To reconstruct the gene content of the last common ancestor of BAp and BSg strains the parsimony criterion was used. Ancestral genes have three statuses in the present genomes: retained gene, pseudogene, or absent gene. In the latter case, the position of the gene was occupied, in some cases, by a DNA sequence (generally shorter than the gene) with no significant similarity to the functional gene. Depending on the combinations of the three types of status, the gene content of the internal ancestor and the number of losses in segments I and II in the lineages were determined (Fig. 8) (Silva et al., 2003; Gomez-Valero et al., 2004a). Thus, for example, the 9 genes retained in BSg and with the status of absent gene or pseudogene in BAp were considered losses in segment II in the lineage of BAp. Also, because of the fast rate of nucleotide substitution in this bacterium, it was considered that present pseudogenes could not have been inactivated more than 50 My ago (the age of divergence between BAp and BSg). So, genes present in BBp and with the status of absent gene in BAp and BSg were considered losses in segment I.

It is worth mentioning that a fourth B. aphidicola genome has been sequenced (B. aphidicola BCc, Pérez-Brocal et al., 2006). As the synteny among the four genomes is maintained, and as this new genome possesses five exclusive genes, the LCSA would have had at least 645 genes. However, as the phylogenetic relationship of this new B. aphidicola strain is not clear, probably due to the high acceleration of evolutionary rates of the retained genes, it is not possible to know when the losses occurred and the genes were not incorporated into Fig. 8. It is likely that as more


B. aphidicola genomes are sequenced, the number of genes inferred for the LCSA will increase towards the real number.

The inference of the gene content of ancestral genomes and of the evolutionary scenarios of gains and losses has been analyzed in some other prokaryotic lineages. The analysis of alpha-proteobacteria and, within them, the Rickettsia genus has been especially interesting (Boussau et al., 2004; Blanc et al., 2007). The gene sets of the ancestors of 13 alpha-proteobacteria were computationally inferred, starting from a well-supported topology. The most parsimonious scenarios of genome evolution were reconstructed by character mapping using generalized parsimony as implemented in PAUP* (http://paup.csit.fsu.edu/index.html), using the ancillary criteria of accelerated transformation (ACCTRAN) or delayed transformation (DELTRAN). Penalty values were assigned to the three different types of events: duplication, deletion, or gene genesis. With this strategy, the authors established that the last common ancestor of these bacteria would contain 3,000–5,000 genes and that massive gene gains and losses took place in several lineages (Boussau et al., 2004).

Blanc et al. (2007) have recently performed a study with 7 Rickettsia species. Based on phylogeny they divided the species into three groups: the spotted fever group (SFG), the typhus group (TG) and the outgroup (R. bellii) (Fig. 9). Based on the gene repertoires (including pseudogenes) of the present genomes, the gene content of the ancestors R0 and R1 (Fig. 9) was determined. A gene was considered present in the R0 genome if it was found in a full-length or pseudogene state in R. bellii and in at least one species of the SFG or TG groups. A few genes not fulfilling this requirement were also considered ancestral if they were only present in R. bellii or the TG/SFG clade, but the encoded protein showed a better hit with a protein from other alpha-Proteobacteria than with the remaining organisms.
For the R1 ancestor, genes present in R0, and in at least one genome from the TG/SFG groups, were also considered present in R1. Genes absent in R0 were considered

Fig. 9. Gene contents in the Rickettsia phylogeny. The spotted fever group (SFG) and the typhus group (TG) are indicated. Genome reconstruction showed that 1,252 genes were present in the ancestor. It would also have contained 211 genes specific to R. bellii, and 404 genes specific to the SFG and TG groups (Blanc et al., 2007). Arrows show two massive gene loss events. Tree labels: R. felis, R. massiliae, R. conorii and R. africae (SFG); R. typhi and R. prowazekii (TG); R. bellii; internal nodes R0 (1,252 genes) and R1.


present in R1, when they were present in at least one species of the TG group and another of the SFG group. Finally, for more than 200 genes from R. bellii and 400 from the TG/SFG genomes, it could not be determined whether they had been acquired by HGT or, alternatively, were originally present in the R0 genome and were lost in one of the two lineages.
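The presence rules described for the R0 and R1 ancestors can be sketched with set operations. This is only an illustration under stated assumptions: each genome is represented as a set of orthologous group identifiers that counts pseudogenes as present, the exception based on better hits to other alpha-Proteobacteria is left out, and the function name, the choice of genomes and the toy gene identifiers are hypothetical rather than taken from Blanc et al. (2007).

```python
def ancestral_rickettsia_sets(presence, outgroup, sfg, tg):
    """Apply the R0 and R1 presence rules to a presence table.

    presence maps a genome name to the set of orthologous groups found in it
    (full-length genes or pseudogenes). Returns (r0, r1).
    """
    in_outgroup = presence[outgroup]
    in_sfg = set().union(*(presence[g] for g in sfg))
    in_tg = set().union(*(presence[g] for g in tg))

    # R0: present in the outgroup and in at least one SFG or TG species.
    r0 = in_outgroup & (in_sfg | in_tg)
    # R1, first rule: present in R0 and in at least one TG/SFG genome.
    r1_from_r0 = r0 & (in_sfg | in_tg)
    # R1, second rule: absent in R0 but present in at least one TG species
    # and at least one SFG species.
    r1_new = (in_sfg & in_tg) - r0
    return r0, r1_from_r0 | r1_new


# Toy presence table with hypothetical orthologous groups g1 to g5.
presence = {
    "R. bellii": {"g1", "g2", "g5"},
    "R. conorii": {"g1", "g3", "g4"},
    "R. felis": {"g1", "g3"},
    "R. typhi": {"g2", "g3"},
    "R. prowazekii": {"g2"},
}
r0, r1 = ancestral_rickettsia_sets(
    presence,
    outgroup="R. bellii",
    sfg=["R. conorii", "R. felis"],
    tg=["R. typhi", "R. prowazekii"],
)
print(sorted(r0), sorted(r1))
```

In the toy data g4 (found only in one group) and g5 (found only in the outgroup) are assigned to neither ancestor, which mirrors the undetermined genes mentioned at the end of the paragraph above.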

3.3. Gene Order Reconstruction in Ancestral Genomes

As stated previously, ancestral gene order conservation decreases over time, but at a rate that depends on the lineage. In a three-genome comparison, we can reconstruct the order of the genes in the ancestral genome by applying the criterion that when collinear order is observed in two out of the three genomes, this is the most probable ancestral order. We have to take into account that the ancestor we are reconstructing is that of the two evolutionarily closer genomes in a rooted tree (for example, LCA in Fig. 7), or the internal node in an unrooted tree. This is also called the median of the three genomes (Bourque and Pevzner, 2002). When, in some cases, the phylogenetic reconstruction produces a polytomy, we can consider that the reconstructed ancestor is that of the three genomes. Using this strategy we can see the steps for obtaining the ancestral order in the following example (Fig. 10). In step 1, we follow the collinear order of the three genomes up to the breakpoint after gene G5. Then we follow with genomes E and F up to the breakpoints after genes E10 and F10. We see that the collinearity between genomes E and G was reestablished after the genes E6 and G2500, respectively. In step 2, we continue extending the ancestral (A) block of genes, now following the collinear segments up to positions E14 and G2492. Because we observe that after the breakpoint in F10/F2300 the genome F has recovered the collinearity with the two others, we need to reorder the table to test whether either E or G genomes continue the collinearity with F or, at this point, each genome shows a breakpoint.

Step 1

A1    A2    A3    A4    A5    A6    A7    A8    A9    A10
E1    E2    E3    E4    E5    E6    E7    E8    E9    E10   E11   E12   E13   E14   E15   E16
F1    F2    F3    F4    F5    F6    F7    F8    F9    F10   F2300 F2299 F2298 F2297 F2820 F2821
G1    G2    G3    G4    G5    G2500 G2499 G2498 G2497 G2496 G2495 G2494 G2493 G2492 G2610 G2611

Step 2

A1    A2    A3    A4    A5    A6    A7    A8    A9    A10   A11   A12   A13   A14
E1    E2    E3    E4    E5    E6    E7    E8    E9    E10   E11   E12   E13   E14   E15   E16
F1    F2    F3    F4    F5    F6    F7    F8    F9    F10   F2300 F2299 F2298 F2297 F2820 F2821
G1    G2    G3    G4    G5    G2500 G2499 G2498 G2497 G2496 G2495 G2494 G2493 G2492 G2610 G2611

Step 3

A1    A2    A3    A4    A5    A6    A7    A8    A9    A10   A11   A12   A13   A14   A15   A16
E1    E2    E3    E4    E5    E6    E7    E8    E9    E10   E11   E12   E13   E14   E2800 E2799
F1    F2    F3    F4    F5    F6    F7    F8    F9    F10   F2300 F2299 F2298 F2297 F2296 F2295
G1    G2    G3    G4    G5    G2500 G2499 G2498 G2497 G2496 G2495 G2494 G2493 G2492 G2491 G2490

Fig. 10. Gene order reconstruction of the ancestor of species E, F and G. The numbers after letters E, F and G indicate the gene order in each genome. The ancestral gene order is represented by the letter A. The genes in each column are orthologs.
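The two-out-of-three collinearity criterion illustrated in Fig. 10 can be approximated by counting gene adjacencies. The sketch below is a simplification and not the authors' procedure: genes are represented by ortholog column numbers, orientation is ignored, an adjacency is accepted as ancestral when it occurs in at least two genomes, and genes with more than two accepted neighbors are conservatively dropped. The function names and the toy gene orders (loosely modeled on Fig. 10, with genes 70 to 91 standing for unrelated neighbors) are hypothetical.

```python
from collections import Counter


def adjacencies(order):
    """Return the unordered pairs of neighboring genes in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def ancestral_blocks(genomes, min_support=2):
    """Chain adjacencies found in at least min_support genomes into linear blocks."""
    support = Counter()
    for order in genomes:
        support.update(adjacencies(order))

    neighbors = {}
    for adj, count in support.items():
        if count >= min_support:
            a, b = sorted(adj)
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    # Genes with more than two supported neighbors are ambiguous: drop them.
    for gene in [g for g, ns in neighbors.items() if len(ns) > 2]:
        for other in neighbors.pop(gene):
            if other in neighbors:
                neighbors[other].discard(gene)

    # Walk each path from one of its ends; circular components are skipped.
    blocks, seen = [], set()
    for start in sorted(neighbors):
        if start in seen or len(neighbors[start]) != 1:
            continue
        block, prev, cur = [start], None, start
        seen.add(start)
        while True:
            nxt = [n for n in neighbors[cur] if n != prev]
            if not nxt:
                break
            prev, cur = cur, nxt[0]
            block.append(cur)
            seen.add(cur)
        blocks.append(block)
    return blocks


# Toy gene orders; numbers 1 to 16 are ortholog columns shared by the genomes.
genome_e = list(range(1, 15)) + [90, 91, 16, 15]
genome_f = list(range(1, 11)) + [80, 81, 16, 15, 14, 13, 12, 11]
genome_g = [1, 2, 3, 4, 5, 70, 71, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6]

blocks = ancestral_blocks([genome_e, genome_f, genome_g])
print(blocks)
```

In this toy case every adjacency between columns 1 and 16 is supported by at least two genomes, so a single ancestral block is recovered, while the genes that are neighbors in only one genome are left out.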


In step 3, after reordering the table according to the F genome, we see that we can extend the ancestral block with F2297 and G2492. With this strategy it is possible to reconstruct the gene order of several blocks of ancestral genes. The way in which these blocks are connected in the ancestor is unknown, although the inclusion of additional genomes may generate larger ancestral gene blocks. In some cases, in closely related genomes, the complete ancestral order may be reconstructed. This was the case for the 2977 genes of the ancestor of M. leprae and M. tuberculosis, where a block of 2975 genes could be inferred using M. avium as the outgroup genome and the ongoing genome project of M. marinum as a support (Gomez-Valero et al., 2007a).

Finally, in recent years different algorithms and programs have been developed to solve the pairwise genome rearrangement problem, through the determination of the simplest scenario required to pass from one genome to another with the minimal number of rearrangement events or breakpoints (see, for example, GRAPPA (Moret et al., 2002) or GRIMM (Tesler, 2002)). These programs consider several types of changes in the genome, such as the frequent inversions (reversals), translocations or gene fusions/fissions. Algorithms for reconstructing scenarios from multiple species are extremely time consuming, even for the simple case of three genomes. The Multiple Genome Rearrangement program (MGR) implements an algorithm which, given a set of genomes, seeks a tree such that the sum of the rearrangements is minimized over all the edges of the tree. It can be used for inferring the order of the ancestral genes (Bourque and Pevzner, 2002).
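As a small illustration of the breakpoint concept mentioned above, the following sketch counts breakpoints in a signed gene order relative to a reference genome whose genes are numbered 1 to n in order. It is not the algorithm implemented in GRAPPA, GRIMM or MGR; it assumes a linear chromosome, that both genomes contain exactly the same genes, and that a negative sign marks a gene on the opposite strand. The function name is hypothetical.

```python
def count_breakpoints(signed_order):
    """Count breakpoints of a signed gene order relative to the identity 1..n.

    The order is framed by 0 and n + 1; every consecutive pair (a, b) with
    b - a != 1 is a breakpoint.
    """
    n = len(signed_order)
    extended = [0] + list(signed_order) + [n + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if b - a != 1)


# A single inversion of genes 3 and 4 creates two breakpoints.
print(count_breakpoints([1, 2, -4, -3, 5]))
```

Counting breakpoints gives only a lower-level summary of how different two gene orders are; finding the shortest series of inversions or translocations that explains them is the harder problem addressed by the programs cited in the text.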

4. Evolutionary Process of Genome Reduction

The compactness of the genome sizes of unicellular species, particularly prokaryotes, relative to the larger genomes of higher eukaryotes, has been a matter of great controversy during recent decades (Lynch and Conery, 2003; Daubin and Moran, 2004; Lynch, 2006). In general, in prokaryotes, a reduction in the genome size may be a consequence of the change to a new lifestyle. Thus, changes from free-living to host-associated, from multiple to specific hosts or from many to one specific tissue are signals to start the process of genome reduction. But, why do such changes have this consequence? Prokaryotes show a high level of correlation between gene number and genome size. Thus, reductions of genome size and gene number are almost always connected. This leads us to conclude that the process of genome reduction is associated with the loss of genes. There are two ways to lose a gene: one requires a point mutation producing the inactivation of the gene (nucleotide substitution or indels either at coding or regulatory regions), the other implies an event affecting a large section of DNA (deletion of hundreds or thousands of nucleotides) which can either completely remove the DNA sequence of the gene, or at least a large segment, whose loss will be responsible for gene inactivation. However, mutations take place in the cells or in the individuals and only have evolutionary importance if they are fixed in

the population. The main force opposing gene loss is natural selection. The fixation of a gene loss producing only a very small decrease in fitness has a higher probability of occurring than that of the loss of a more important or essential gene. From population genetics theory, we know that the rate of fixation of a gene loss depends on three parameters: effective population size (Ne), rate of mutation (µ) and selection coefficient (s). Gene loss mutations with positive s values will be fixed at a faster rate than neutral or mildly deleterious losses. For example, several deletion events removing antivirulence genes were required for the adaptation of Shigella flexneri to the virulence state (Maurelli et al., 1998). On the other hand, neutral gene losses are fixed at the same rate as the rate of mutation, independently of population size. Finally, the rate of fixation of mildly deleterious gene losses depends on the three parameters (Ne, µ, s), with small population sizes favoring their fixation. After a change in lifestyle, especially those associated with intracellular life, s values of loss-of-function mutations for many genes will change from s < 0 to s ≥ 0. A few mutations will have positive values and the loss of the gene will be quickly fixed in the population. Others will have s = 0 and the gene loss will be fixed by genetic drift. Natural selection will be unable to preserve the functional genes because their absence or presence will not contribute to fitness. For example, the loss of the genes encoding the flagellar apparatus and various adhesins of Y. pestis probably took place because they became unnecessary for a systemic pathogen (Parkhill et al., 2001).
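
The dependence of the fixation rate on Ne, µ and s can be made concrete with a small sketch. It is not taken from the chapter: it assumes a haploid Wright-Fisher population whose census size equals Ne and uses Kimura's diffusion approximation for the fixation probability of a single new mutation. The function names and all numerical values are hypothetical.

```python
import math


def fixation_probability(s, n_e):
    """Fixation probability of one new mutation (haploid, census size = n_e).

    Kimura's diffusion approximation: (1 - exp(-2s)) / (1 - exp(-2*n_e*s)).
    """
    if s == 0:
        return 1.0 / n_e              # neutral case
    x = 2.0 * n_e * s
    if x < -700.0:
        return 0.0                    # avoids overflow; effectively never fixed
    return math.expm1(-2.0 * s) / math.expm1(-x)


def substitution_rate(mu, s, n_e):
    """Gene losses fixed per generation: n_e * mu new losses arise, each is
    fixed with the probability given above."""
    return n_e * mu * fixation_probability(s, n_e)


mu = 1e-8  # hypothetical gene-loss mutation rate per generation
for n_e in (100, 1000, 10**6):
    print(n_e,
          substitution_rate(mu, 0.0, n_e),     # neutral loss
          substitution_rate(mu, -1e-3, n_e),   # mildly deleterious loss
          substitution_rate(mu, 1e-3, n_e))    # advantageous loss
```

Under these assumptions the neutral rate stays at µ for every population size, the rate for a mildly deleterious loss falls as Ne grows, and the rate for an advantageous loss exceeds µ, which matches the three cases described in the text.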
Finally, most of the more dramatic gene content reductions (Akman et al., 2002; Gil et al., 2003; Perez-Brocal et al., 2006; Wu et al., 2006) involve the loss of many advantageous genes, although nonessential, at least in this new environment (for example, genes involved in DNA repair processes). Natural selection will act against the fixation of these gene losses and the selection pressure will be higher for genes whose losses are more detrimental. Muller’s ratchet (Muller, 1964), a phenomenon that in asexual organisms decreases fitness over time due to the accumulation of mutations, to the lack of acquisition of genes horizontally, and to the lack of recombination has been proposed as a major factor in genome degradation of the bacterial endosymbiont B. aphidicola (Moran, 1996). The decrease in population size drastically increases the fixation rate for mildly deleterious mutations. Several bacterial lineages with small genome sizes are known to present small population sizes or to pass through population bottlenecks in each host generation. For example, several endosymbionts of insects infect the new generation of eggs with a small number of bacterial cells (sometimes smaller than 1,000) (Mira and Moran, 2002). Their small population sizes favor the fixation of disadvantageous gene losses. Protein-coding genes with smaller constraints (i.e., those evolving with larger non-synonymous substitution rates) will be lost earlier than genes with larger constraints, as has been stated for insect endosymbionts (Delmotte et al., 2006). The rate of fixation of gene losses will increase over time because some of the inactivated genes are involved in repair-DNA pathways (Moran and

Wernegreen, 2000). This, in turn, will produce an increase in the mutation rate, which will increase the fixation rate of any type of mildly deleterious mutation. However, we cannot forget that a genome becomes reduced in size because segments of DNA are lost. In fact, a few bacterial species show an exceptional lack of correlation between genome size and gene content. In a recent study, the human pathogen M. leprae was compared with other Mycobacterium spp. to reconstruct the gene content of the ancestor and the dynamics of gene loss and genome reduction (Gomez-Valero et al., 2007a). The analysis showed the loss of more than 1,500 ancestral genes. In the M. leprae genome most of them are pseudogenes, while a few hundred have completely lost any sequence similarity with the functional gene (absent gene status). The percentage of DNA lost for absent genes included in completely syntenic regions was in most cases higher than 80%. However, a large set of lost genes (more than 1,000), with the status of pseudogenes and derived from recent inactivations, had lost on average only 15% of their DNA sequence. Another example of a species with a large genome (4.2 Mb) containing close to 1,000 pseudogenes is Sodalis glossinidius, the secondary bacterial endosymbiont of the tsetse fly (Toh et al., 2006), which represents an evolutionary intermediate in the transition from a free-living to a mutualistic lifestyle. These and other examples indicate that, although reductions in genome size and gene content are usually correlated in prokaryotic lineages, an explanation is required for why the DNA of inactivated genes is lost, and why these losses are not compensated for by increases in other parts of the genome through duplications, HGT insertions or the proliferation of mobile elements. The fate of the nonfunctional DNA after pseudogene formation has been analyzed in the lineage of B.
aphidicola, showing that the half-life of a pseudogene is 23.9 million years and that the nonfunctional DNA sequence of an inactivated gene may be completely lost in close to 50 million years (Gomez-Valero et al., 2004a). The pattern of events for DNA loss in this lineage is mainly associated with the actions of two opposite types of indels (insertions vs. deletions). Most of them were very small, affecting one or a few nucleotides. They cannot account for the observed half-life of pseudogenes, but it was observed that larger deletion events (>100 bp) may occur infrequently, producing a faster genome reduction (Gomez-Valero et al., 2007b). The evolution of the size of a genome is subject to two antagonistic types of events, producing an increase or a decrease in size. In prokaryotes, the genome may increase with gene duplications, insertions of HGT genes, insertions of phages, duplications of transposable elements, plasmid acquisitions, and small scale indels introducing a few nucleotides. On the other hand, it may decrease through the loss of plasmids and phages, the deletion of large segments, frequently connected to unequal crossing-over, and small scale indels deleting a few nucleotides (Fig. 11). To understand the evolutionary dynamics of prokaryotic genomes we need to know two types of rates: that of insertion or deletion mutations and that of their fixation in the population. The average number of nucleotides in each type of event is

Fig. 11. Genome size evolution in prokaryotes. Types of events increasing (top) and decreasing (bottom) genome size.

also very important, because events involving a large DNA segment may contribute dramatically to the variation in the genome size of a lineage, even if they are rare. Related to the rate and extension of the insertion events, it is considered that differences in bacterial genome sizes are not associated with the duplication of the complete genome. Recent analyses have revealed that most duplication events are small, frequently affecting a single gene and producing tandem duplications (Gevers et al., 2004). The importance of HGT in prokaryotic evolution is tremendous (Ochman et al., 2000), although several intracellular and host-associated lineages have lost their ability to increase the genome through this mechanism. Plasmid acquisition and phage insertions have been revealed in the analysis of bacterial strains and through the sequencing of complete genomes. As stated, the presence of transposable elements such as ISs has been documented in many prokaryotic genomes. However, they constitute a low percentage of the genome in many species, although their proliferation has been detected in recent or facultative pathogens (Ochman and Davalos, 2006). Obligate bacterial endosymbionts and pathogens

have completely lost these elements in association with the compactness of the genome. Finally, small scale insertions have been detected during the evolution of pseudogenes. In B. aphidicola they are fixed at a slightly lower rate than small scale deletions (Gomez-Valero et al., 2007b), while in Rickettsia spp. they are at least three times more frequent than deletions (Andersson and Andersson, 1999). In both cases, deletions involved on average a larger number of base pairs than insertions. On the other hand, both large and small deletions have been detected when the genomes of strains or species are compared. Because over time several overlapping or adjacent deletions produce the same result as a large one, it is not possible to determine, after many millions of years of evolution, whether the lack of a large region in a syntenic part of a chromosome between two divergent species is the result of a large chromosomal deletion or a series of overlapping small deletions. The first argument to explain that prokaryotic genomes tend towards compactness is that the rates of deletion are higher than those of insertion or, in other words, that there is a mutational deletion bias that keeps genomes streamlined without the requirement for natural selection (Mira et al., 2001). This means that in a large population of a free-living bacterium, natural selection will avoid the fixation of deletion mutations with negative s values, but the fixation of neutral mutations will depend exclusively on the mutation rates. In species with small Ne , not only neutral mutations but also mildly deleterious mutations will be fixable. It is important to remark that the required bias is mutational, not substitutional. A criticism of the reported bacterial deletion bias is that many studies (see Lynch (2006)) measure the substitution rates of deletion or insertion mutations and not the real mutational rate and, for that reason, may be biased by selection. 
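
The argument about deletion bias combines two quantities mentioned above: how often insertions and deletions occur, and how many base pairs each involves on average. The following sketch shows the arithmetic. The function name and all numbers are hypothetical and are not estimates from the studies cited.

```python
def net_bp_change_per_indel(ins_rate, ins_mean_bp, del_rate, del_mean_bp):
    """Expected change in sequence length per indel event.

    Rates may be in any common unit (only their ratio matters);
    a negative result means the sequence shrinks on average.
    """
    total_rate = ins_rate + del_rate
    if total_rate <= 0:
        raise ValueError("at least one rate must be positive")
    return (ins_rate * ins_mean_bp - del_rate * del_mean_bp) / total_rate


# Hypothetical values: insertions three times more frequent than deletions,
# but deletions remove more base pairs per event.
print(net_bp_change_per_indel(ins_rate=3.0, ins_mean_bp=2.0,
                              del_rate=1.0, del_mean_bp=10.0))
```

With these illustrative values the sequence still shrinks on average, which shows why event frequencies alone are not enough to decide whether a lineage has a net bias towards DNA loss.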
In fact, differences between insertion and deletion mutations may be small, as was shown in a reporter-construct study with E. coli, where (excluding single nucleotide indels) 29 different deletion mutations (ranging from 3 to 544 bp) and 29 different insertion mutations (ranging from 2 to 312 bp) were found (Schaaper and Dunn, 1991). It has also been argued by several authors that compact microbial genomes are a consequence of selection for high replication rates (Cavalier-Smith, 2005; Giovannoni et al., 2005). The implicit idea is that natural selection would be able to detect the differences in fitness of individuals carrying genomes differing by a few base pairs, because of the cost of maintaining and replicating the largest ones. The large Ne of many prokaryotic species may enhance the effect of natural selection. However, this explanation is difficult to apply to intracellular pathogens and endosymbionts because of their small Ne values (Daubin and Moran, 2004), although the polyploidy of these cells, described at least for B. aphidicola (Komaki and Ishikawa, 1999), may compensate for the small population sizes. Nevertheless, it is not clear whether DNA replication is the rate-limiting factor for cell division, and different arguments against this hypothesis have been put forward (see the review by Lynch (2006)).

Finally, M. Lynch and coworkers have proposed an explanation based on fundamental population-genetic principles (Lynch and Conery, 2003; Lynch, 2006). The hypothesis is that weak natural selection acts on nonfunctional DNA because this DNA increases the mutational target sizes of associated genes. Spacer DNA may experience gain-of-function mutations from several causes, such as the creation of new transcription regulatory DNA sites or new ribosome binding sites. In polycistronic transcription units, long intergenic spacers may experience mutations generating premature transcription termination, etc. Because nearly all forms of excess DNA are considered a mutational burden, minimization of the genome size is expected to be favorable. However, population-genetic conditions will determine which lineages are more capable of purging the excess DNA by natural selection. Considering Ng, the effective number of genes residing at a locus at the time of reproduction, µ, the rate of mutation, and n, the number of nucleotides that should be preserved unchanged to avoid deleterious effects on the associated genes, it was estimated that a species with 2Ngµ ≫ 1/n is essentially immune to the fixation of a hazardous nonfunctional DNA segment (Lynch, 2006). This means that lineages with a large effective number of genes and/or high mutation rates will tend to have smaller genomes. Although free-living prokaryotic lineages show lower mutation rates than higher eukaryotes, they compensate for these values with large Ng values. Estimates of 2Ngµ in prokaryotes and higher eukaryotes show an average value that is ten-fold higher in the former (Lynch, 2006). The small population sizes of mutualists and parasites argue against this hypothesis (Daubin and Moran, 2004). However, endosymbionts have higher mutation rates than free-living bacteria and, at least in some of them, Ng is much higher than Ne because of polyploidy.
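
The condition 2Ngµ ≫ 1/n can be checked numerically by rewriting it as 2Ngµn ≫ 1. The sketch below does this. The function names, the `margin` used to stand in for "much greater than", and the parameter values are all assumptions made for illustration and are not estimates from Lynch (2006).

```python
def mutational_burden_index(n_g, mu, n):
    """Return 2 * Ng * mu * n; values far above 1 satisfy 2*Ng*mu >> 1/n."""
    return 2.0 * n_g * mu * n


def resists_excess_dna(n_g, mu, n, margin=10.0):
    """True if the index exceeds `margin`, an arbitrary stand-in for '>>'."""
    return mutational_burden_index(n_g, mu, n) >= margin


# Hypothetical lineages: same mutation rate and n, different Ng.
mu, n = 1e-9, 100
for label, n_g in (("large Ng", 1e8), ("small Ng", 1e4)):
    print(label, mutational_burden_index(n_g, mu, n), resists_excess_dna(n_g, mu, n))
```

Because the index is a product, a lineage with small Ng can still reach the threshold if its mutation rate is high enough, which is the point made above about endosymbionts.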

5. Minimal Gene Sets

As soon as one becomes aware of the existence of bacteria with genomes as reduced as those found in species adapted to intracellular life, the question emerges of how knowledge of the gene content of a minimal natural genome can help us to approach the definition of the minimal genome required for life and, from this, the goal of making a minimal living cell. This represents an important scientific challenge that a century ago was already considered the “ideal goal” of biology (Loeb, 1906), and that has strongly re-emerged in recent times. Nowadays there is a reasonable degree of consensus in defining life as the property of a system that displays three features simultaneously: homeostasis, self-reproduction and evolution (Luisi, 2002). To understand life it is first necessary to understand its main nonliving components, proteins and RNA molecules, as well as the instructions encoded by genes for making them. Thus, it is possible to come up with a definition of the elements necessary for keeping a minimal cell alive by knowing its complete gene set, which has been called a minimal genome (Mushegian and Koonin, 1996; Mushegian, 1999). According to Koonin (2000), the idea of a minimal gene set refers to “the smallest possible group of (protein-coding) genes

that would be sufficient to sustain a cellular life form under the most favorable conditions imaginable, that is, in the presence of a full complement of essential nutrients, and in the absence of environmental stress”. The phenomenon of genome downsizing observed in endosymbionts and intracellular parasites can be used to approach this minimal gene number. The hypothesis is that the reduced genomes must retain all genes involved in housekeeping functions, as well as a minimum amount of metabolic pathways for cellular survival and replication.

5.1. Minimal Natural Genomes: Nature’s Four Most Remarkable Experiments on Genome Minimization

Phylogenetic analysis has shown that obligatory intracellular bacteria and archaea are derived from free-living ancestors. Their genomes have undergone a process of reduction parallel to the adaptation to the obligatory parasitic or mutualistic lifestyle. As a consequence of this reductive evolution, many genes involved in DNA recombination and repair, along with specific biosynthetic and metabolic genes, usually are lost. In contrast, genes involved in informational processes (transcription, translation and replication) are retained. Regarding RNA genes, in general, reduced genomes are characterized by the presence of only one or two rRNA gene sets, a small number of tRNA genes and some snRNA genes. However, different bacteria possess a set of particular genes related to their particular environment determined by the host’s needs, but probably also to the particular process of gene loss, which has occurred throughout adaptation to intracellular life. It is then conceivable that naturally evolved, nearly minimal gene sets may contain substantial differences. At present, the smallest sequenced microbial genomes are a human pathogen (M. genitalium) and an archaeal ectosymbiont (N. equitans); and two endosymbionts of insects: an aphid and a psyllid (B. aphidicola BCc and C. ruddii, respectively). See Fig.
12 for a comparison of the differences in protein-coding genes by COG functional categories. The human pathogen M. genitalium, with a genome of 580 Kb and 482 protein-coding genes, has the smallest genome of any organism that can be grown in pure culture (Fraser et al., 1995). Its genome was the second to be completely sequenced, soon after that of H. influenzae (Fleischmann et al., 1995), and its publication attracted a lot of attention to the possibility of defining the minimal gene set of a living cell. Since then, this genome has been included in all attempts to reconstruct a minimal genome, both experimental and computational. Its genome contains all the essential genes needed for DNA replication, transcription and translation. The drastic economization in genetic information must be associated with its parasitic mode of life. Thus, the M. genitalium genome carries only one gene involved in amino acid biosynthesis, and very few genes for the synthesis of vitamin and nucleic acid precursors. It carries a minimal set of energy metabolism genes, with a restricted supply of ATP. In accordance with its lifestyle, a significant number of mycoplasmal genes are devoted to adhesins, attachment organelles and variable membrane surface antigens directed towards evasion of the host’s immune system.

[Fig. 12 bar chart. x-axis: COG category (A to V); y-axis: gene number (0 to 250); series: BCc, Crud, Mge, Neq, Minimal.]

Fig. 12. COG functional category distribution of protein-coding genes in species with small genomes. The number of genes in each COG category is shown for the genomes of B. aphidicola BCc (BCc), C. ruddii (Crud), M. genitalium (Mge) and N. equitans (Neq). The number in the minimal genome set (Gil et al., 2004) is also shown. The total number of protein-coding genes is 362 for BCc, 182 for Crud, 484 for Mge, 536 for Neq and 206 for the Minimal set. A few genes have been assigned to more than one COG category. The number of genes for each COG category in Crud, Mge and Neq was obtained from the MicrobesOnline database (http://www.microbesonline.org/). The code S was assigned to those genes without a COG code in this database. COG category codes are as follows: A, RNA processing and modification; B, Chromatin structure and dynamics; C, Energy production and conversion; D, Cell division and chromosome partitioning; E, Amino acid transport and metabolism; F, Nucleotide transport and metabolism; G, Carbohydrate transport and metabolism; H, Coenzyme metabolism; I, Lipid metabolism; J, Translation, ribosomal structure and biogenesis; K, Transcription; L, DNA replication, recombination, and repair; M, Cell envelope biogenesis, outer membrane; N, Cell motility and secretion; O, Posttranslational modification, protein turnover, chaperones; P, Inorganic ion transport and metabolism; Q, Secondary metabolites biosynthesis, transport, and catabolism; R, General function prediction only; S, Function unknown; T, Signal transduction mechanisms; U, Intracellular trafficking and secretion; V, Defense mechanisms.

The hyperthermophilic archaeon N. equitans, with a genome of 490 Kb and 487 coding genes, represents a basal archaeal lineage, and is the only known archaeon exhibiting a parasitic lifestyle (Waters et al., 2003). For survival, it must attach itself to the surface of the crenarchaeon Ignicoccus in submarine hot vents. In spite of being the third smallest sequenced microbial genome, it has a high coding density, encoding 536 genes. A remarkable characteristic of N. equitans is the complexity of its information processing system, contrasting with the simplicity of its metabolic apparatus. Like the reduced bacterial genomes, it conserves the genes encoding the complete genetic machinery for transcription, translation and DNA replication. Also in agreement with the other symbionts, it lacks many metabolism-specific genes. Thus, its genome lacks many genes for

central metabolism, primary biosynthesis and bioenergetics apparatus. However, a remarkable difference with other obligatory intracellular bacteria is that N. equitans possesses most of the DNA repair and archaeal recombination enzymes. The unusual genome reduction and genome composition of this microorganism are the consequence of the dual adaptation of N. equitans to high temperature and to an obligate parasitic lifestyle (Das et al., 2006). Bacterial endosymbionts of phloem-sap feeding insects, B. aphidicola BCc (422 Kb and 362 protein-coding genes) (Perez-Brocal et al., 2006) and C. ruddii (160 Kb and approx. 182 protein-coding genes) (Nakabachi et al., 2006) are the primary endosymbionts of aphids and psyllids, respectively. They are the smallest microbial genomes sequenced to date. The original role of the bacterium in the symbiosis is to supply the nutrients that the insect diet lacks, mainly essential amino acids and vitamins. The cedar aphid Cinara cedri symbiont, B. aphidicola BCc, represents an extreme reduction process with a genome that is about 200 Kb smaller than the other three sequenced genomes from B. aphidicola strains. The genome comparisons have shown that this reduction is mainly due to the loss of protein-coding genes and not to a reduction in the sizes of the intergenic regions or open reading frames. Compared with already highly compacted genomes of the other strains, B. aphidicola BCc has additionally lost the genes responsible for the biosynthesis of nucleotides, cofactors such as riboflavin, most of the transporters, as well as all genes for the peptidoglycan and ATPase subunit biosynthesis. However, despite its extremely reduced genome it still retains the complete machinery for DNA replication, transcription and translation, and a simplified metabolic network for energy production. Thus, with only 362 protein-coding genes, B. aphidicola BCc represents the minimal known gene set able to support cellular life. 
It also synthesizes the essential amino acids needed by its aphid host, with the exception of tryptophan. Regarding its mutualistic relationship, it has apparently lost its role as a tryptophan and riboflavin supplier to its host. Thus, these compounds need to be supplied from another source, not only for aphid growth but also for B. aphidicola. These facts, together with evolutionary analysis and microscopic data, led Perez-Brocal et al. (2006) to propose that B. aphidicola BCc is gradually being taken over by S. symbiotica SCc, the second bacterial symbiont massively present in this aphid (Gomez-Valero et al., 2004b), and might end up being replaced. The genome sequence of S. symbiotica SCc (in progress) will tell us if the hypothesis is correct, or if this symbiont has also lost some essential functions and needs to be complemented by B. aphidicola BCc. In the latter case, a microbial consortium would be established in the aphid C. cedri. The hackberry petiole gall psyllid Pachypsylla venusta symbiont, C. ruddii, has been proposed as the bacterial endosymbiont with the smallest genome known to date, with only 182 predicted protein-coding genes (Nakabachi et al., 2006). This number is much lower than previous proposals for minimal genomes and is almost half the number of genes identified in B. aphidicola BCc. More than half the genes

are devoted to only two categories, translation and amino acid metabolism (J and E COG categories; Fig. 12). However, there is a total absence of genes for numerous categories, including biogenesis and metabolism of nucleotides and lipids and cell envelope (F, I and M categories). A remarkable feature of this bacterium is the extremely compact genome, with overlapping adjacent open reading frames and a few tRNA genes (28). The small number of genes casts doubts on the character of C. ruddii as a living cell, as it lacks many genes for bacterium-specific processes. The authors proposed that some of the lost genes were transferred from the genome of a Carsonella ancestor to the genome of a psyllid ancestor, now being expressed under the control of the host nucleus (Nakabachi et al., 2006). However, in order to consider that C. ruddii is a living organism with a symbiotic relationship with its host, the genes involved in essential living functions, as well as those needed for the maintenance of host fitness must be preserved. A detailed analysis of the gene content of C. ruddii (Tamames et al., 2007) showed that the extensive degradation of the genome is not compatible with its consideration as a living organism. Most of the functions considered essential for a cell to be alive are heavily impaired. E.g., it lacks some essential genes involved in DNA replication, transcription and even in translation (L, K and J in Fig. 12). In addition, some of the pathways towards the biosynthesis of essential amino acids are completely or partially lost, indicating that it has lost, at least partially, its role in the symbiotic relationship. Although the transition of genes from C. ruddii to the host nucleus has been proposed, it would be also possible to consider the implication of the mitochondrial machinery encoded in the insect nucleus. Thus, this strain of C. ruddii could be transformed into a new subcellular entity between living cells and organelles (Tamames et al., 2007). 
These two genomes make us think about the border between a living cell and an organelle. However, this is difficult to define. We may consider the criterion of being able to grow in a cell-free medium to differentiate a cell from an organelle (Galperin, 2006). However, with this criterion not only C. ruddii and B. aphidicola but many other unculturable environmental species would be considered nonliving organisms.

5.2. Extreme Genome Reduction: Cell Versus Organelle

Although an answer to the fate of B. aphidicola BCc and C. ruddii (replacement or organelle) is not possible at present, the understanding of the process undergone by intracellular bacteria with such minimal genomes can give some clues to the understanding of the symbiogenic events between prokaryotes and primitive eukaryotes that took place around 2,000 and 1,000 million years ago and gave rise to the mitochondria and chloroplasts, respectively, of modern eukaryotes. Subsequent to the symbiotic event, the mitochondrial and chloroplast ancestors in a host-restricted intracellular environment underwent a massive reduction in their genome size to their current size (see Sec. 1 and Fig. 3, as a summary of genome sizes

in mitochondrial and chloroplast genomes). The process of genome shrinkage was similar to that which is ongoing in obligate intracellular parasites and symbionts. However, in the case of organelles, besides massive gene loss, a one-way transfer of genetic information contained in the symbiont to the host nuclear genome took place. Later, as a compensatory process, protein import machinery evolved in the symbiont to recover the protein-products of those transferred genes that fulfill an essential metabolic role. For the symbionts, they implied an irreversible loss of control over their own cellular processes (Van Ham et al., 2004). Genome sequences and comparative analyses reveal that since the origin of organelles, recurrent transfer events from the mitochondria and chloroplast genomes to the nucleus have occurred at different times in the past, and the process is continuing. Moreover, the transfer of genes from an organelle genome to nuclear chromosome has been demonstrated under laboratory conditions (Timmis et al., 2004). Beyond shaping the primal eukaryotic cells, symbiogenesis has continued to play a major role in the evolution of life. In particular, prokaryotic symbionts bring with them from their previous free-living state a broad range of metabolic capabilities. Some of these are specifically utilized by the host and enable it to occupy new ecological niches. As stated above, a massive transfer of genes from C. ruddii to the nucleus has been proposed (Nakabachi et al., 2006). This could indicate that C. ruddii is on the way of becoming an organelle. However, the genomic integration of newly acquired symbionts into a host, such as that seen in the primordial eukaryotic symbioses, is thought to occur rarely between bacteria and multicellular eukaryotes. 
Until recently, the only documented cases were the transfer of genome fragments of the Wolbachia endosymbiont to the nucleus of the beetle host Callosobruchus chinensis (Kondo et al., 2002) and to the nematode host Onchocerca volvulus (Fenn and Blaxter, 2006). Recent work provided evidence for gene transfer events from Wolbachia to the nuclear genome of some of its hosts (four insects and four nematode species) and has even shown that some of the inserted genes are transcribed (Hotopp et al., 2007). Thus, although lateral transfer from Wolbachia to its eukaryotic hosts may be facilitated by its presence in developing gametes, similar events in other associations cannot be ruled out. The sequencing of the genome (or of selected parts) of the C. ruddii host (P. venusta) would be necessary to resolve the status of this symbiont.

5.3. Defining the Minimal Gene Complement of a Living Cell by Comparative Genomics

The sequencing of the M. genitalium genome prompted a variety of researchers to look into the problem of defining the minimal gene complement of a living cell. The goal of defining and reconstructing the minimal gene set in bacteria has been pursued through two complementary strategies: computational and experimental methods. The difference between the two approaches is that the former identifies a set of essential genes that is shared among diverse taxa, whereas the latter searches for individual genes that are essential for growth in a single species.

Three different experimental approaches have been used to identify genes that are essential under particular growth conditions: massive transposon mutagenesis strategies (the most widely used approach), the use of antisense RNA to inhibit gene expression, and the inactivation of each individual gene of a particular genome (reviewed in Gil et al., 2004). All these approaches gave gene sets compatible with the ones inferred by comparative genomics. Recently, Glass et al. (2006), using transposon mutagenesis in M. genitalium, identified 382 of 482 protein-coding genes as essential, a number higher than that previously estimated by the same authors (265 to 350) with a similar approach (Hutchison et al., 1999) but without the isolation and characterization of pure clonal populations to prove gene dispensability. It is remarkable that in all these experiments, genes encoding proteins with unknown functions constitute around 28% of the essential protein-coding gene set. The analyses based on comparative computational genomics have an evolutionary basis, because they assume that genes conserved across large phylogenetic distances are good candidates to be considered essential. These analyses have proved to be very useful for understanding the functions necessary for a living cell, even though the minimal gene sets obtained may be a compromise between necessity and historical contingencies. The first comparative genome analysis was performed on the first two completely sequenced bacterial genomes, H. influenzae and M. genitalium (Mushegian and Koonin, 1996), both with reduced genomes as a consequence of their parasitic lifestyle. The two bacteria belong to lineages that diverged more than 1.5 billion years ago.
The reconstruction of the minimal gene set was performed in three steps: (i) orthologous genes between the two genomes were identified, (ii) non-orthologous genes that encode proteins with a similar function (non-orthologous gene displacement) were included and (iii) genes that appeared to be functionally redundant or parasite-specific were removed. The authors obtained a minimal gene set of only 256 genes that included the key genes necessary for translation, DNA replication, recombination and repair, transcription, protein folding and core metabolism. Although this first approach was based on only two genomes, it appeared to be a close approximation of a minimal gene set for bacterial life. However, the experimental study based on massive transposon insertion in M. genitalium and M. pneumoniae (see above) showed that some of the genes included could be disrupted, and therefore could not be considered essential (Hutchison et al., 1999). More recent approaches to the minimal gene set included the reduced, completely sequenced genomes of five insect endosymbionts (three B. aphidicola strains, B. floridanus, and W. glossinidia) (Shigenobu et al., 2000; Akman et al., 2002; Tamas et al., 2002; Gil et al., 2003; van Ham et al., 2003) and the parasite M. genitalium (Fraser et al., 1995). The six genomes shared only 180 housekeeping protein-coding genes, half of them involved in informational processes (Gil et al., 2003), revealing once more the essentiality of the genes in this functional category. When two additional intracellular parasites (R. prowazekii and Chlamydia trachomatis) were included in the comparisons,
the number of shared genes was reduced to 156 (Klasson and Andersson, 2004). However, computational approaches to the minimal gene set have important limitations that lead to an underestimate of the number of candidate genes to be included in a minimal genome. On the one hand, the identification of orthologous genes among very divergent lineages is difficult in the presence of paralogous genes arising from ancestral gene duplications, or even xenologous ones. On the other hand, non-orthologous genes can carry out similar functions in different organisms. Thus, even under the same environmental conditions, different versions of minimal genomes can be obtained. All the experimental and computational approaches to the minimal genome yielded a relative enrichment in genes that encode components of genetic-information processing systems, mainly genes involved in translation, and relatively few operational genes devoted to metabolic processes. These data indicate that genes involved in informational processes are not exchangeable, whereas the operational ones are more flexible, owing to the existence of alternative metabolic pathways leading to a specific product. We may conclude that it is not possible to speak of a single form of minimal bacterial cell, at least from a metabolic point of view, as different essential functions can be defined depending on the environmental conditions. Moreover, even for the same set of conditions, different sets of genes could fulfill the same functions. Yet the minimal genome must include the genes necessary to maintain a minimal metabolism in order to achieve metabolic homeostasis, one of the essential functions that define life (Pereto, 2005). In a combined study of all published research using computational or experimental methods, Gil et al. (2004) defined the minimal core of essential genes for a free-living bacterium thriving in a chemically rich environment.
The main difference between this study and previous efforts was the emphasis on the coherence of the minimal metabolism encoded by the proposed gene repertoire. The analysis rendered a minimal genome containing 206 protein-coding genes (see Fig. 12). Two thirds of them are involved in genetic information storage and processing, including a virtually complete DNA replication machinery, a rudimentary system for DNA repair, a virtually complete transcriptional machinery without any transcriptional regulator, a nearly complete translational system, plus protein processing, folding, secretion and degradation functions. Among cellular functions, a single gene can drive cell division, taking into account that in a protected environment a cell wall might not be necessary for cellular structure. The remaining genes encode proteins implicated in the transport and use of nutrients from the environment to obtain energy and the basic molecules needed to maintain cellular structures. Eight genes with poorly characterized functions, but present in all the compared genomes, were also included. A general conclusion of the different approaches is that the genes involved in nutrient transport and metabolic functions vary depending on the minimal genome proposal. Evidence of this was recently reported by Pal et al. (2006). Simulating host environments where two endosymbionts (B. aphidicola and W. glossinidia) had evolved from an ancestral genome represented by E. coli (a free-living related
bacterium), numerous possible minimal metabolic networks variable in both gene content and number were obtained (Pal et al., 2006). This work showed that when several alternative networks are possible, evolution towards the minimal genome can give rise to the retention of one or another depending to a certain extent on stochastic processes.
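The comparative procedure described in this section can be read as a short sequence of set operations. The following is a minimal sketch under stated assumptions, not the method used in any of the cited studies: genes are represented by labels of orthologous groups, and the toy genomes, the label func_X and the function names are hypothetical.

```python
from functools import reduce


def shared_genes(genomes):
    """Return the orthologous groups present in every genome (the shared core)."""
    return reduce(set.intersection, (set(g) for g in genomes.values()))


def minimal_gene_set(genome_a, genome_b, displacements, dispensable):
    """Three-step reconstruction from a two-genome comparison.

    genome_a, genome_b: sets of orthologous-group labels (hypothetical data)
    displacements: functions carried out by non-orthologous genes in both genomes
    dispensable: groups judged functionally redundant or parasite-specific
    """
    core = set(genome_a) & set(genome_b)  # (i) shared orthologs
    core |= set(displacements)            # (ii) non-orthologous gene displacement
    core -= set(dispensable)              # (iii) remove redundant/specific genes
    return core


# Toy data only; the gene labels are illustrative, not real gene contents.
genomes = {
    "genome_1": {"rpoB", "dnaA", "ftsZ", "pgk", "glnA"},
    "genome_2": {"rpoB", "dnaA", "ftsZ", "pgk"},
    "genome_3": {"rpoB", "dnaA", "ftsZ"},
}

core_of_two = shared_genes({k: genomes[k] for k in ("genome_1", "genome_2")})
core_of_three = shared_genes(genomes)
proposal = minimal_gene_set(
    genomes["genome_1"], genomes["genome_2"],
    displacements={"func_X"}, dispensable={"pgk"},
)
```

Adding a genome to the comparison can only keep the shared core the same size or shrink it, which mirrors the drop from 180 to 156 shared genes reported above when two more parasites were included.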

6. Summary

The observed range of variation in the sizes of completely sequenced prokaryotic genomes is 0.16–9.77 Mb. This range narrows slightly (0.42–9.77 Mb) if Carsonella ruddii is considered an organelle. The presence of species with small and large genomes within many taxonomic groups indicates that genome expansions and contractions have occurred several times during the evolution of prokaryotes. To identify massive genome reduction in a lineage, it is necessary to have information about genome sizes, gene contents and orthologous relationships among several related genomes. Knowing their phylogenetic history makes it possible to reconstruct the gene content and order of the ancestors and to identify the lineages in which genomes expanded and those in which they were reduced. The production of a table of orthologous genes is a prerequisite for such ancestral genome reconstructions. Studies of genome reduction also require that pseudogenes at any level of degradation be identified. Orthologous groups of genes and pseudogenes may be detected by several methods, including similarity searches (usually reciprocal best BLAST hits), genetic distances based on nucleotide or amino acid substitutions, phylogenetic reconstruction, and genome context. Once the groups of orthologs, the paralogs and the xenologs are identified, the reconstruction of either gene content or gene order can be performed based on the criterion of maximum parsimony. Frequently, to reconstruct the ancestral genome of a species that has undergone genome reduction, a trio comparison is performed between the species with the reduced genome, a phylogenetically closely related species with a larger genome and an outgroup species. Orthologous genes/pseudogenes present in two out of the three species are considered ancestral.
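Two of the rules summarized above, reciprocal best hits for ortholog detection and the two-out-of-three criterion of the trio comparison, are simple enough to sketch in a few lines. This is an illustration under assumptions rather than a full pipeline: the best-hit dictionaries stand in for parsed similarity-search results, and all identifiers (a1, og1, and so on) are hypothetical.

```python
def reciprocal_best_hits(best_ab, best_ba):
    """Pairs (a, b) such that b is the best hit of a and a is the best hit of b.

    best_ab maps each gene of genome A to its best hit in genome B;
    best_ba is the same mapping in the opposite direction.
    """
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def ancestral_groups(reduced, relative, outgroup):
    """Orthologous groups present in at least two of the three genomes."""
    groups = set(reduced) | set(relative) | set(outgroup)
    return {
        g for g in groups
        if (g in reduced) + (g in relative) + (g in outgroup) >= 2
    }


def lost_in_reduced(reduced, relative, outgroup):
    """Ancestral groups that are missing from the reduced genome."""
    return ancestral_groups(reduced, relative, outgroup) - set(reduced)


# Toy data; a2 and a3 both hit b2, but only a3 is the reciprocal best hit.
best_ab = {"a1": "b1", "a2": "b2", "a3": "b2"}
best_ba = {"b1": "a1", "b2": "a3"}
orthologs = reciprocal_best_hits(best_ab, best_ba)

reduced = {"og1", "og2"}
relative = {"og1", "og2", "og3", "og4"}
outgroup = {"og1", "og3", "og5"}
ancestral = ancestral_groups(reduced, relative, outgroup)
lost = lost_in_reduced(reduced, relative, outgroup)
```

In the toy trio, og3 is inferred to be ancestral and lost in the reduced genome, whereas og4 and og5 occur in a single genome and are not counted as ancestral.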
The reason why prokaryotic genomes are smaller than those of higher eukaryotes is a matter of debate, with two main hypotheses: A) They are small because a mutational bias towards deletion exists in prokaryotes. This bias and genetic drift produce small genomes, especially when natural selection is relaxed. B) Their small size is mainly due to natural selection favoring small genomes in prokaryotes. Although several reasons have been proposed to explain the different effect of natural selection on prokaryotes and eukaryotes, M. Lynch and coworkers have recently proposed an explanation based on fundamental population genetic principles. Independently of the causes leading to genome reduction, it has been stated that in many lineages with small genomes the signal that started the reductive process was a change in lifestyle. Following this change, many genes became unnecessary, disposable or even deleterious. The relaxation of natural
selection resulted in the loss of many of them. Later, the effects of mutations, genetic drift and natural selection led to the loss of DNA and to the compaction of the genome. One of the most attractive problems in biology is the definition of life. Regarding gene content, the question may be stated as that of the minimal gene set required to sustain a living cell. Many studies using computational methods and experimental approaches have been carried out with the aim of identifying this minimal gene set. These studies have concluded that a single minimal set does not exist, because several alternative solutions may be obtained to perform the same function. A minimum of around 200 protein-coding genes is apparently required for the life of a cell living in symbiosis. The recent sequencing of the genomes of Buchnera aphidicola BCc and C. ruddii, with 362 and 182 protein-coding genes, respectively, has shown that some natural living species are approaching this minimal set. The case of C. ruddii is controversial, since it lacks many apparently essential genes, especially for translation, suggesting that this organism has crossed the line between a living cell and an organelle.

7. Further Reading

Blanc G, Ogata H, Robert C, Audic S, Suhre K, Vestris G, Claverie JM, Raoult D (2007) Reductive genome evolution from the mother of Rickettsia. PLoS Genet 3:
Bourque G, Pevzner PA (2002) Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res 12:26–36.
Boussau B, Karlberg EO, Frank AC, Legault BA, Andersson SGE (2004) Computational inference of scenarios for alpha-proteobacterial genome evolution. Proc Natl Acad Sci USA 101:9722–9727.
Gil R, Silva FJ, Pereto J, Moya A (2004) Determination of the core of a minimal bacterial gene set. Microbiol Mol Biol Rev 68:518–537.
Gomez-Valero L, Rocha EPC, Latorre A, Silva FJ (2007) Reconstructing the ancestor of Mycobacterium leprae: The dynamics of gene loss and genome reduction. Genome Res 17:1178–1185.
Koonin EV (2003) Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol 1:127–136.
Lynch M (2006) Streamlining and simplification of microbial genome architecture. Annu Rev Microbiol 60:327–349.
Nakabachi A, Yamashita A, Toh H, Ishikawa H, Dunbar HE, Moran NA, Hattori M (2006) The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science 314:267–267.
Perez-Brocal V, Gil R, Ramos S, Lamelas A, Postigo M, Michelena JM, Silva FJ, Moya A, Latorre A (2006) A small microbial genome: The end of a long symbiotic relationship? Science 314:312–313.
Silva FJ, Latorre A, Moya A (2001) Genome size reduction through multiple events of gene disintegration in Buchnera APS. Trends Genet 17:615–618.

Acknowledgments

Financial support was provided by projects BFU2006-06003/BMC from the Ministerio de Educación y Ciencia and Grupos03/204 from the Generalitat Valenciana, Spain.

CHAPTER 8 COMPARATIVE MECHANISMS FOR TRANSCRIPTION AND REGULATORY SIGNALS IN ARCHAEA AND BACTERIA

AGUSTINO MARTÍNEZ-ANTONIO and JULIO COLLADO-VIDES

1. Introduction

The human genome and the microbial genomes represent two major frontiers in genomic studies. With the rapid advancement of genome sequencing techniques, the sequencing of a large number of bacterial and archaeal genomes has produced the largest increase in the number of newly identified genes, a reflection of the diversity of microbial life on Earth (Rusch et al., 2007; Yooseph et al., 2007). Microbial systems span at least 3.8 billion years of evolution. They are persistent and ubiquitous, and they are essential components of all ecosystems. The geo- and physicochemical composition of Earth’s biosphere has been continuously molded by microbial activities. Until recently, there had been no agreement among investigators on the diversity of microorganisms, the fraction of them that has been cultivated, or the way they evolve. The development of massive genome sequencing and molecular phylogenetic projects has recently enabled the characterization of naturally occurring microbial biota even without cultivation. Indeed, molecular sequences have enabled us to follow the way in which evolution works (Zuckerkandl and Pauling, 1965). Molecular structures and nucleic acid sequence analyses have suggested that life on this planet forms three distinct groups of organisms: archaea, bacteria and eukarya (Woese et al., 1990). In general terms, archaea resemble bacteria in being single-celled organisms with similar cellular ultrastructure, genome size and genome organization. The evolution of archaea and bacteria is difficult to reconstruct, and it has been speculated that chronic energy stress might be the only driving force of their cellular evolution (Valentine, 2007). The gene order maintained between archaea and bacteria depends to some extent on the conservation of superoperonic elements encoding ribosomal proteins, the proton-pump ATPase and ABC transport systems (Wolf et al., 2001).
However, archaea manage their genetic information through a mixture of the molecular mechanisms found in eukarya and in bacteria (Bell and Jackson, 1998; Leigh, 1999; Bell and Jackson, 2001; Bell et al., 2001; Ouhammouch, 2004). The basal transcriptional machinery in archaea, that is, the core promoter elements and the general factors for their recognition, is similar to that of eukarya, whereas
gene organization and the sequence-specific transcription factors (TFs), which regulate transcription in response to environmental and cellular conditions, resemble those of bacteria. In this comparative study we describe and review current knowledge about the transcriptional mechanisms and the regulatory signals involved in gene expression in the archaeal and bacterial domains. We focus on the distinct molecular mechanisms that ensure transcription: i) the basal machinery for the recognition of the core promoters, including DNA sequences (cis-regulatory elements) and the general TFs (trans-regulatory factors) required for transcription initiation, and ii) the way gene transcription is modulated depending on nucleoid structure and through the activity of small ligands or environmental signals that specific TFs can bind or sense. Where possible, we describe in detail the biological processes from the beginning to the end of transcription, as well as some additional mechanisms that regulate gene expression before mRNA translation.

2. Gene Organization in Archaea and Bacteria

The decision about which genes should be expressed at a given time is very important because the biochemical abilities and limitations of the cell are determined by the activities of its gene products. As transcription is a costly and time-consuming process, it is not surprising that gene expression is tightly controlled at the transcriptional level. Indeed, a recent analysis measuring mRNA and protein content concluded that ∼70% and ∼50% of the cellular proteome profile could be explained solely by transcriptional regulation in yeast and E. coli, respectively, while the rest is possibly regulated at the translational and post-translational levels. The weaker correlation observed in E. coli might be explained by the fact that, in the bacterial operon structure, genes are co-transcribed but often differentially translated (Lu et al., 2007).

2.1. Operons and Transcription Units

In prokaryotes (archaea and bacteria), most genes tend not to be dispersed across the genome but are frequently organized in functional units of expression, first described as “operons” by Jacob and Monod (1961). They initially defined an operon as a set of several genes that are transcribed as a single polycistronic unit (Fig. 1A). An operon has associated cis-regulatory elements: one or several operator sites (the DNA sequences bound by specific TFs to modulate promoter activity), a promoter (the DNA region where the RNA polymerase, RNAP, starts transcription) upstream of all genes in the operon, and a terminator downstream (the region where transcription ends and RNAP dissociates from the DNA). In all cases studied so far, a gene belongs to only one operon. It is also relatively common to find operons with several promoters, some of them internally located, and thus transcribing only a partial group of the genes in the operon.

Fig. 1. Graphical representation of the genetic organization in bacteria and archaea. (A) The pdhR-aceEF-lpd operon of Escherichia coli K12 and their cis-regulatory elements, and (B) the transcription units included in this operon: pdhR-aceEF-lpd, aceEF and lpd encoding genes respectively (data from Regulon DB database, URL: http://regulondb.ccg.unam.mx/) (Salgado et al., 2006).

This partial group of genes constitutes a transcription unit (TU), that is, a set of one or more genes transcribed from a single promoter. A TU may also include binding sites for regulatory proteins that affect the transcriptional activity of the promoter, and a terminator signaling the end of transcription. A complex operon with several promoters therefore contains several TUs (Fig. 1B). Given the definition of an operon, at least one TU must include all the genes present in the operon. Thus, a TU should be considered the minimal unit of genetic information to be transcribed (as few as one gene, Fig. 1B), and an operon the maximal group of genes to be cotranscribed, with the possibility of including more than one TU (Fig. 1A). Individually transcribed genes should be considered TUs and, in agreement with the original definition, not operons. Finally, a gene is the segment of DNA involved in producing a polypeptide chain or stable RNA; it includes regions preceding and following the coding region (leader and trailer) (Scherrer and Jost, 2007). More detailed discussions on operons are presented in Chapter 10.
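The relationship between operons and TUs described above can be captured in a small data structure. The sketch below is only an illustration: the gene lists follow the pdhR-aceEF-lpd example of Fig. 1, but the promoter labels and the class and function names are hypothetical and do not reproduce the RegulonDB schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptionUnit:
    """One or more genes transcribed from a single promoter."""
    promoter: str   # hypothetical label, not an annotated promoter name
    genes: tuple    # genes in transcription order


def operon_genes(units):
    """Ordered union of the genes covered by the TUs of one operon."""
    seen = []
    for unit in units:
        for gene in unit.genes:
            if gene not in seen:
                seen.append(gene)
    return seen


def has_full_length_unit(units):
    """Check the rule that at least one TU includes every gene of the operon."""
    all_genes = set(operon_genes(units))
    return any(set(unit.genes) == all_genes for unit in units)


# The three TUs of Fig. 1B, with placeholder promoter labels.
pdh_operon = [
    TranscriptionUnit("p_pdhR", ("pdhR", "aceE", "aceF", "lpd")),
    TranscriptionUnit("p_ace", ("aceE", "aceF")),
    TranscriptionUnit("p_lpd", ("lpd",)),
]

genes = operon_genes(pdh_operon)
valid = has_full_length_unit(pdh_operon)
internal_only = has_full_length_unit(pdh_operon[1:])
```

Dropping the full-length TU leaves two internal TUs that no longer satisfy the definition, which is why the check on `pdh_operon[1:]` fails.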

3. Basal Factors and cis-Regulatory Elements for Core Promoter Recognition

The metabolic, defensive, communicative and pathogenic capabilities of microorganisms depend primarily on the repertoire of genes encoded in their genomes and on their ability to regulate the expression of this information in response to changing environments. Thus, depending on the endogenous and exogenous physicochemical setting, microorganisms must turn specific sets of genes on or off to adapt to new conditions. To regulate gene expression properly, the cellular machinery must read the genetic information encoded in the genome; for this, organisms use a set of Transcription Associated Proteins (TAP) that includes the constituents of RNA polymerase (RNAP), basal or general factors for promoter recognition, and specific TFs. Turning genes on or off successfully therefore requires that the transcriptional machinery correctly “interpret” the “cis-regulatory code” present at the beginning and end of the transcription units. In the following sections, we describe the general or basal factors and the cis-regulatory signals necessary for the proper recognition of promoter sequences for transcription initiation in archaea and bacteria, respectively.

3.1. The Basal Transcription Machinery in Archaea

In vitro experiments, mainly with hyperthermophilic archaea (the most genetically engineered archaeal model organisms), show that the basal transcription machinery in archaea is basically similar to that of RNAP II in eukaryotes (Bell et al., 1999, 2001; Hickey et al., 2002; Bartlett, 2005). In addition to the ∼12 subunits of RNAP, archaea need two general TFs: TBP, which is homologous to the eukaryotic TATA-box binding protein (TBP), and TF B, which corresponds to eukaryotic TF IIB (Fig. 2A). A third general TF, TF E, has also been found in archaea; it is homologous to the minimal functional region of TF IIE from eukarya. Although it is not required for transcription in vitro, it can assist promoter recognition by TBP when this is sub-optimal. The function of the basal factors is to recognize and bind specific DNA sequences in the promoter regions of the transcription units, and to recruit RNAP to form the protein complex required to initiate transcription in a unidirectional way. The DNA elements in the regulatory region needed for transcription include: i) the TATA-box, an AT-rich DNA sequence 8 bp in length (formerly called box A) located around 24 bp upstream of the transcription initiation site. The TATA-box is recognized and contacted by TBP; and ii) the BRE (TF B-responsive element), an additional purine-rich element located immediately upstream of the TATA-box. This site is important for promoter strength, since it is recognized and contacted by the carboxy-terminal domain of TF B. BRE is the principal element

Fig. 2. Basal transcription machinery in archaea. (A) cis-regulatory elements for basal promoter recognition. (B) Recruitment of general trans-regulatory factors to initiate transcription. BRE, TF B-responsive element; TBP, TATA-box binding protein; TF E, transcription factor E; TF B, transcription factor B.

for governing the oriented assembly of the transcription pre-initiation complex, since TF B recruits RNAP through its N-terminus and ensures the proper orientation on the promoter for initiating transcription. The molecular events responsible for transcription are not yet fully understood. The following sequential events are postulated:

1. TBP binds to the TATA-box, sometimes assisted by TF E.
2. Subsequently, TF B binds to the TBP-DNA complex by making contact with the BRE site.
3. Then, the N-terminus of TF B recruits RNAP and orients it properly to form the pre-initiation complex. The corresponding DNase footprint extends from −43 to at least +8 bp with respect to the transcription initiation site.
4. In this pre-initiation complex, RNAP catalyzes the isomerization of the promoter DNA by melting a region of about 13 bp extending from around −10 to +3 bp.
5. Subsequently, RNAP initiates RNA synthesis, resulting in an RNAP-DNA-RNA complex around the first 5–6 nucleotides of the transcript.
6. Finally, the transition from a promoter-bound to an elongating form of RNAP clears the promoter. The escape of RNAP from the promoter seems to be facilitated by a conserved N-terminal sequence of TF B located between the zinc ribbon and the core domain (Fig. 2B).

It is particularly interesting that some archaea encode multiple paralogs of TF B and/or TBP. This finding raises the possibility of differential usage of these basal factors, which would provide a general mechanism for regulating gene transcription similar to the multiple σ factors present in bacteria (see below). This would be the case if individual paralogs of TF B recognized different BRE sequences, allowing discrimination between different promoter subtypes. Consistent with this hypothesis, one TF B has been found to be expressed following heat shock in Haloferax volcanii, which might indicate condition-specific expression and activity of this TF B (Facciotti et al., 2007).
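The geometry of the archaeal core promoter described in this section (an 8 bp AT-rich TATA-box around 24 bp upstream of the initiation site, with a purine-rich BRE immediately upstream of it) can be turned into a simple composition check. This is a rough sketch under assumptions, not a promoter predictor: the 6 bp BRE length, the choice to center the TATA window near −24, the function names and the test sequence are all illustrative and are not taken from the text.

```python
def fraction(seq, letters):
    """Fraction of bases in seq that belong to the given set of letters."""
    return sum(base in letters for base in seq) / len(seq)


def describe_core_promoter(upstream, tata_center=24, tata_len=8, bre_len=6):
    """Extract the presumed TATA-box and BRE from the region upstream of a TSS.

    upstream: DNA sequence ending at position -1 (its last base precedes the TSS).
    tata_center: assumed approximate distance of the TATA-box from the TSS.
    bre_len: assumed BRE length (an illustrative value, not from the text).
    """
    upstream = upstream.upper()
    # With the defaults, the TATA window spans positions -28 to -21.
    start = len(upstream) - (tata_center + tata_len // 2)
    if start - bre_len < 0:
        raise ValueError("upstream sequence is too short")
    tata = upstream[start:start + tata_len]
    bre = upstream[start - bre_len:start]  # immediately upstream of the TATA-box
    return {
        "tata_box": tata,
        "tata_at_fraction": fraction(tata, "AT"),
        "bre": bre,
        "bre_purine_fraction": fraction(bre, "AG"),
    }


# Synthetic upstream region: 4 bp flank + BRE-like + TATA-like + 20 bp to the TSS.
example = "CGCG" + "GAAAGG" + "TTTATATA" + "GCGCCGTCGGCACGTCGGCA"
report = describe_core_promoter(example)
```

A real analysis would scan several offsets around −24 and use position-specific scores; the fixed window here only makes the spatial arrangement of BRE and TATA-box explicit.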

3.2. The Basal Transcription Machinery in Bacteria

The basal molecular machinery for transcription in bacteria is different from that present in archaea and eukarya. The main component responsible for transcription in bacteria is the multi-subunit RNAP, although the core enzyme does not act in a promoter-specific manner: it carries out transcription but cannot by itself initiate it at specific promoters (Collado-Vides et al., 1991; Browning and Busby, 2004). The RNAP core consists of only 5 subunits: αI, αII, β, β′ and ω, with a combined molecular mass of ∼400 kDa. The β and β′ subunits (1,342 and 1,407 residues, respectively, in Escherichia coli) are the largest RNAP components and contain the active site, which binds both the template DNA and the RNA product during transcription. The two α subunits (329 residues each) each consist of two independently folded domains joined by a flexible linker of ∼20 amino acids. The larger N-terminal domain (α-NTD) dimerizes and is responsible for the assembly of the β and β′ subunits. The smaller carboxy-terminal domain (α-CTD) is a DNA-binding domain with an important role at certain promoters that contain UP elements (see Sec. 3.4). Finally, the small ω subunit (91 residues) seems to play a chaperone role in the assembly of the β′ subunit. In order to initiate transcription at a particular promoter, the core RNAP must first interact with a σ factor to form the RNAP holoenzyme (Fig. 3). In this way, different promoter sequences correspond to the different σ factors that can associate with the RNAP core (Stragier and Losick, 1990; Gourse et al., 2000; Wade et al., 2006). The σ factor has three main functions: (i) to recognize specific promoter sequences, (ii) to position the RNAP holoenzyme at specific target promoters, and (iii) to facilitate unwinding of the DNA duplex near the transcription initiation site.

3.3. The σ70 and σ54 Factor Families in Bacteria

Most free-living bacteria contain multiple σ factors that help them adapt to changing environmental conditions. The number of σ factors ranges from one in Mycoplasma genitalium to seven in Escherichia coli and around 60 in Streptomyces (Ishihama, 2000; Gruber and Gross, 2003; Paget and Helmann, 2003). Each σ factor preferentially recognizes a different variant of the promoter sequence, which makes the choice of σ factor a primary mechanism of differential gene transcription. Except for the σ54 family, σ factors belong to a single large phylogenetic group, the σ70 family. This family includes the dominant σ factor for promoter recognition in bacteria (the housekeeping σ factor). σ factors are multi-domain proteins in which domains 2, 3 and 4 are known to be involved in promoter recognition; domain 1 is absent from many σ factors. The σ70 family can in turn be classified into four groups: the housekeeping sigmas, sigmas that respond to general stress conditions, sigmas responsible for heat shock, motility and sporulation, and sigmas known as extracytoplasmic function (ECF) factors, originally identified as σ factors associated with extracellular or membrane functions and frequently co-transcribed with anti-σ regulatory factors.

3.4. Cis-Elements in σ70-Dependent Promoters

Four different DNA-sequence elements for basal transcription have been identified in bacteria, the first two being the most important: (i) a hexamer located at −10 bp with respect to the transcription initiation site (the Pribnow-Schaller box). This −10 element resembles the TATA-box and is recognized by domain 2 of the σ factor of the RNAP holoenzyme; (ii) the −35 promoter element, which is recognized by domain 4 of the σ factor. The two other promoter signals are the extended −10 element and the UP element. The extended −10 element is a 3–4 bp sequence located immediately upstream of the −10 hexamer and is recognized by domain 3 of the σ factor. The UP element is a ∼20 bp sequence located upstream of the −35 hexamer that is recognized by the C-terminal domains of the RNAP α subunits. The UP element, initially found in a few strong promoters and shown to be the point of contact for the RNAP α-CTD, may well be present in a larger number of promoters (Haugen et al., 2006) (Fig. 3A). In bacteria, the combination of −10, −35, extended −10 and UP elements provides the basis for a great diversity of specific sequences for the initial binding of RNAP. As in archaea, the role of these elements in bacteria is to recruit the RNAP machinery and to form an open complex that is structurally ready to initiate transcription. In general, a deficiency in one of these elements can be compensated for by the others (Gruber et al., 2001; Mooney and Landick, 2003; Miroslavova and Busby, 2006; Miroslavova et al., 2006). The transcription process involves three main steps: initiation, elongation, and termination.

Fig. 3. cis-elements and trans-regulatory factors for basal transcriptional machinery in bacteria (A) in σ70 promoters, and (B) in σ54 promoters. The mode of activation at a distance in σ54 promoters is illustrated. Note that the AAA+ domain-containing activators can interact with σ54 directly.

1. Transcription begins with the binding of the RNAP holoenzyme to the −10, −35, extended −10 and UP elements to form the closed promoter complex.
2. After the initial binding of RNAP to the promoter region, the DNA from around −10 to +2 bp with respect to the transcription initiation site is unwound to form the open-complex bubble.
3. This strand isomerization allows the DNA template strand to move into the active site of RNAP, which initiates RNA synthesis in the presence of nucleoside triphosphates (NTPs).
4. After the synthesis of ∼9–12 nucleotides of RNA, the RNAP escapes from the promoter region and the elongation stage begins.
5. At this point the RNAP undergoes a conformational change that leads to the loss of the contacts of the σ factor with RNAP and DNA.
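The −35 and −10 hexamers discussed in Sec. 3.4 lend themselves to a simple consensus search. The sketch below is an illustration under assumptions: the consensus hexamers TTGACA and TATAAT and the 16–18 bp spacer are the commonly cited values for E. coli σ70 promoters and are not given in the text above, and the mismatch threshold and function names are arbitrary choices.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sigma70_like_sites(seq, minus35="TTGACA", minus10="TATAAT",
                            spacers=(16, 17, 18), max_mismatches=2):
    """Find -35/-10 hexamer pairs resembling the assumed consensus.

    Returns tuples (start_of_-35, start_of_-10, spacer, total_mismatches),
    sorted so that the closest match to the consensus comes first.
    """
    seq = seq.upper()
    n = len(minus35)
    hits = []
    for i in range(len(seq) - n + 1):
        m35 = mismatches(seq[i:i + n], minus35)
        for spacer in spacers:
            j = i + n + spacer  # start of the candidate -10 hexamer
            if j + len(minus10) > len(seq):
                continue
            total = m35 + mismatches(seq[j:j + len(minus10)], minus10)
            if total <= max_mismatches:
                hits.append((i, j, spacer, total))
    return sorted(hits, key=lambda hit: hit[3])


# Synthetic sequence: perfect hexamers separated by a 17 bp GC-rich spacer.
example = "GCGC" + "TTGACA" + "GCCGGCGCCGGCGCCGG" + "TATAAT" + "GCGCGCA"
hits = find_sigma70_like_sites(example)
```

Because deficiencies in one element can be compensated for by the others, a total mismatch count is only a crude proxy for promoter strength; weight matrices that also score the extended −10 and UP elements would be a more faithful model.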

3.5. Bacterial σ54-Dependent Promoters

The σ54 family of factors differs from the σ70 family in amino acid sequence and in transcription mechanism (Xu and Hoover, 2001; Cases et al., 2003b). Despite the lack of significant sequence similarity between these two groups of σ factors, both bind to the core RNAP, although they produce holoenzymes that work by different mechanisms. The mechanism of action associated with the σ54 factor involves transcriptional activation at a distance (earlier considered to be unique to eukarya) by Enhancer Binding Proteins (EBP; see below). Recently, these
TFs have been classified into the AAA+ protein superfamily (ATPases associated with various cellular activities) (Burrows, 2003; Lilja et al., 2004; Sallai and Tucker, 2005; De Carlo et al., 2006; Dago et al., 2007). Bacterial σ54-RNAP is the target of sophisticated signal transduction pathways involving activation through remote enhancer cis-elements, most commonly located upstream of the promoter. Activation of σ54-RNAP promoters requires specialized transcription factors (EBP) that recognize the enhancer-binding sites, as well as regulated ATP or GTP hydrolysis to trigger transcription. Contrary to σ70, the σ54 factor can recognize and bind its promoters in the absence of the core polymerase. The cis-regulatory elements necessary for recognition by the σ factor at σ54 promoters include short DNA segments at −12 and −24 bp from the transcription start. Mutational analyses suggest that the −24 element contributes more to promoter recognition than the other elements. The transcription initiation rates at these promoters are controlled by regulating the DNA melting step (open-complex formation). The σ54 holoenzyme forms a closed complex and occupies the promoter in this state prior to activation; this closed complex is unusually stable and does not spontaneously isomerize into an open complex. In this way, transcription initiated at σ54 promoters is tightly controlled, with a low level of escape. Therefore, these kinds of promoters are not subjected to repression, and their transcriptional activity can be set at varying levels depending on the presence of the cognate activator protein (EBP) and the environmental stimuli. Thus, at σ70 promoters the TFs recruit RNAP to DNA, whereas at σ54 promoters the TFs work on a stably bound RNAP.
Unlike the σ70 factors, which are present as many homologous copies per bacterial genome, the σ54 factor is commonly found as a single copy; it is absent in some bacteria with reduced genomes (e.g., intracellular parasites) and is rarely found in two copies, as in Bradyrhizobium, Rhodobacter and Rhizobium. In principle, the main disadvantage of σ54 promoters is that the looping mechanism of activation requires long stretches of intergenic DNA, and thus larger chromosomes. However, a prevalent organization involves IHF binding sites, at which IHF bends the DNA and brings the bound RNAP and the upstream activator into a specific physical proximity. This oriented bending can, in principle, provide the basis for the specificity with which regulators activate particular promoters, even in chromosomes as small as those of bacteria (Gralla and Collado-Vides, 1996) (see the section on nucleoid-associated proteins below). Indeed, this is needed for the proper regulation of σ54 promoters, which must be completely isolated from one another to avoid cross-talk. Thus, evolutionary mechanisms favoring chromosome compactness might in turn constrain the activity of σ54 promoters in bacteria. Their kinetic properties make σ54 regulons and stimulons the equivalent of closed complexes: sets of genes whose expression may be fully absent for a long time without any effect. Only when the corresponding activator(s) appear does their expression begin, in contrast to σ70 transcription, which is leaky and would therefore need constant repression to remain silent for a long time.

4. Transcription Termination

RNAP responds to two distinct sets of mRNA sequence signals for transcription termination: (i) the Rho-dependent terminators, which require the Rho protein as an auxiliary factor; and (ii) the Rho-independent, or intrinsic, terminators, which are functional in the absence of additional factors (Abe et al., 1999; de Hoon et al., 2005; Mooney et al., 2005; Adelman et al., 2006; Banerjee et al., 2006; Ciampi, 2006; Kingsford et al., 2007).

4.1. Rho-Dependent Transcription Termination in Bacteria

Rho-mediated transcription termination requires both cis-regulatory elements on the mRNA and a trans-regulatory factor. The only features common to the cis-regulatory elements of all Rho-dependent terminators are that they are generally unstructured, C-rich and G-poor, with little sequence conservation. The trans-regulatory factor, the Rho protein encoded by the rho gene, is a homohexamer that is widely distributed and generally essential in bacteria. Much of the mechanism of Rho activity remains unresolved; the initial evidence suggests that Rho is an RNA/DNA helicase or translocase that binds untranslated, naked RNA and terminates mRNA synthesis by dissociating RNA polymerase from the DNA template, thereby releasing the mRNA. The energy used in the process comes from ATP hydrolysis by its RNA-dependent ATPase activity. It has been proposed that Rho acts on RNAP through an interaction with the NusG protein; NusG is an essential bacterial modulator of transcription elongation and termination that interacts directly with both RNAP and the Rho protein (Banerjee et al., 2006; Ciampi, 2006; Landick, 2006).
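Since the only shared feature of Rho-dependent terminators described above is a C-rich, G-poor composition, a simple compositional scan gives a feel for how candidate regions might be flagged. The sketch below is only an illustration: the window size and the two thresholds are hypothetical values chosen for the example, not parameters from the literature, and the function name is invented here.

```python
def c_rich_g_poor_windows(rna, window=30, min_c_fraction=0.3, max_g_fraction=0.15):
    """Return (start, C fraction, G fraction) for windows that are C-rich and G-poor.

    All thresholds are illustrative assumptions, not published cut-offs.
    """
    rna = rna.upper()
    hits = []
    for start in range(len(rna) - window + 1):
        chunk = rna[start:start + window]
        c_frac = chunk.count("C") / window
        g_frac = chunk.count("G") / window
        if c_frac >= min_c_fraction and g_frac <= max_g_fraction:
            hits.append((start, round(c_frac, 2), round(g_frac, 2)))
    return hits


# Hypothetical transcript fragment: a C-rich, G-free stretch flanked by G-rich sequence.
fragment = "GGGGGGGGGG" + "CACUCCACUCCAUCCACCUACCUCCAUCAC" + "GGGGGGGGGG"
for start, c_frac, g_frac in c_rich_g_poor_windows(fragment):
    print(start, c_frac, g_frac)
```

A composition filter of this kind cannot assess whether the RNA is unstructured or untranslated, both of which the text identifies as relevant to Rho binding.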

4.2. Rho-Independent Transcription Termination in Bacteria

The intrinsic (Rho-independent) terminators, in contrast, contain a short G:C-rich palindromic sequence followed by a run of 6–8 uracil (U) residues. The G:C-rich sequence triggers the formation of a stem-loop secondary structure in the mRNA immediately upstream of the run of Us, leading to termination near the end of that run. The stem-loop promotes pausing of the RNA polymerase, and the terminator has been proposed to destabilize the interaction between the paused RNAP and the template DNA. This destabilization could result from a decrease in the stability of the RNA:DNA hybrid in the transcription bubble, from direct interactions between the RNA hybrid and the RNA polymerase, or from a combination of these effects (Abe et al., 1999; Yachie et al., 2006).
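The two features just described, a G:C-rich palindrome followed by a run of Us, can be sketched as a naive pattern search. The following Python example is an illustration under simplifying assumptions: it requires a perfect Watson-Crick stem of fixed length, a loop of 3 to 8 nt, and a U-run starting directly after the stem. These parameter values and the function names are hypothetical; published terminator predictors score hairpin stability thermodynamically instead.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return rna.translate(COMPLEMENT)[::-1]


def find_intrinsic_terminators(rna, stem=5, loop_range=(3, 8), min_gc=0.6, min_u_run=6):
    """Return (start, left_arm, loop_length, right_arm) for simple terminator-like hairpins.

    A hit needs a GC-rich left arm, a perfectly complementary right arm after
    a short loop, and a run of Us immediately downstream of the right arm.
    """
    rna = rna.upper().replace("T", "U")
    hits = []
    for i in range(len(rna) - 2 * stem - loop_range[0] - min_u_run + 1):
        left = rna[i:i + stem]
        gc = (left.count("G") + left.count("C")) / stem
        if gc < min_gc:
            continue
        for loop in range(loop_range[0], loop_range[1] + 1):
            j = i + stem + loop
            right = rna[j:j + stem]
            tail = rna[j + stem:j + stem + min_u_run]
            if right == reverse_complement(left) and tail == "U" * min_u_run:
                hits.append((i, left, loop, right))
    return hits


# Hypothetical transcript end: stem GCCGC / GCGGC, loop GAAA, then seven Us.
transcript = "AAGCCGCGAAAGCGGCUUUUUUUAA"
print(find_intrinsic_terminators(transcript))
```

Requiring a perfect stem keeps the example short; it will miss real terminators with mismatches or G:U pairs in the stem.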

4.3. Transcription Termination in Archaea

Knowledge about transcription termination in archaeal genomes is rather limited. Evidence obtained from experiments on Methanothermobacter thermautotrophicus suggests that RNAP is subject to intrinsic termination signaled by an intergenic sequence. These preliminary results suggest that termination can be triggered by an upstream sequence that renders the archaeal RNAP responsive to termination at a remote downstream sequence. However, these hypotheses still need validation. The Rho protein, for its part, seems to be absent from archaeal genomes. A bacterial-like NusG protein has been suggested to be present in archaea on the basis of sequence similarity with the Thermus thermophilus NusG. In addition, this protein shares stretches of sequence similarity with the eukaryotic transcription elongation factor Spt5. Analysis of the three-dimensional structure of this factor suggests that, although there is a clear evolutionary and functional relationship between the archaeal and bacterial NusG proteins, the structural, sequence and biochemical data reveal many differences in their binding specificities for both nucleic acids and other proteins (Reay et al., 2004; Santangelo and Reeve, 2006). Therefore, conclusive functional evidence is still required to establish whether Rho-dependent termination occurs in archaea as well.

5. Transcriptional Regulation by TFs

Bacteria and archaea couple the transcription of genes to external and internal conditions through regulatory proteins, specifically the TFs (Ptashne and Gann, 2002; Adhya, 2003; Cases et al., 2003a; Barnard et al., 2004; Miroslavova and Busby, 2006). A TF is a protein (more precisely a protein complex, since it can be a dimer or multimer) that activates or represses the expression of transcription units upon binding to specific DNA-binding sites (cis-regulatory elements). Historically, binding sites for transcriptional regulators were called operator sites. In their most general definition, operator sites are sites on the genomic DNA where repressors and activators bind. In bacteria, specifically for σ54 promoters, the term “UAS” (upstream activator site) also refers to activator sites that function remotely. A related term is enhancer, more commonly used for activator sites in eukaryotes. An enhancer was initially defined as an activator site for EBPs that functions from far upstream, in either orientation relative to the promoter.

5.1. Transcriptional Signal Sensing

TFs have long been considered “two-headed” molecules (Jacob, 1970), with one domain for DNA sequence recognition (formally considered equivalent to the enzymatic site) and the other domain for binding/sensing small-molecule effectors. TFs can thus be regarded as molecular switches that have an input module for signal sensing and an output module for the transcriptional response (Wall et al., 2004; Seshasayee et al., 2006). In bacteria, most TFs carry both “heads” in a single protein, except in the two-component signal transduction systems, where the modules belong to two different proteins: a protein containing the sensor component, mostly localized in the periplasm for sensing exogenous signals, and a response regulator localized in the cytoplasm (Aravind et al., 2005; Ulrich et al., 2005; Alm et al., 2006). As a consequence, the mechanism of signal-dependent regulation differs between these two groups of TFs: in the first group the TF is regulated allosterically through the binding of specific signals, whereas in the second group a phosphorylation cascade from the sensor to the response regulator modifies the TF covalently.

5.2. Exogenous and Endogenous Signal Sensing by TFs

All organisms sense and respond to environmental conditions, and most of their physiology is remodeled according to changes in the milieu (Martinez-Antonio et al., 2003; Balazsi et al., 2005). Consequently, most cellular modifications occur in response to changes in the exterior. At least two main kinds of exogenous signals can be distinguished: physicochemical signals that stress or damage the cell (e.g., osmotic pressure, high/low temperature, high/low pH) and compounds that are taken up by the cell and metabolized as fuel or used as building blocks. To sense the first kind of signal, unicellular organisms mostly use the sensor proteins of the two-component systems; these proteins sense exogenous signals or their effect on the membrane structure (damage) and communicate this condition, through a phosphorylation cascade, to the cognate response regulator in the cytoplasm, which generates a transcriptional/cellular response to alleviate the stress. For the second kind, the cell mostly uses transport systems inserted in the membrane that sense and import exogenous compounds. Both sensing mechanisms are used to monitor exogenous conditions, and the genetic elements involved (sensor/response (TF) components and TF/transport systems) tend to be encoded together in the E. coli chromosome (Martinez-Antonio et al., 2006; Janga et al., 2007b). Imported metals, metabolites and some small diffusible compounds are then used by the cellular metabolism (catabolism and biosynthesis) required for cell functions. During these processes, different metabolites are produced as intermediates or waste products. The concentrations of some intermediate metabolites and of potentially dangerous waste products merit cellular monitoring.
Indeed, there is a set of TFs that sense/bind these key metabolites (e.g., pyruvate, tryptophan, ATP) and dangerous conditions (e.g., redox potential, DNA damage, acetate accumulation); these are considered TFs that sense endogenous conditions. Because metabolism depends on external conditions or compounds, it is not surprising that the transcriptional regulation of both processes is coordinated (Balaji et al., 2007; Janga et al., 2007a). Thus, the TFs responsible for sensing internal conditions frequently regulate the TFs that govern the functioning of transport systems in E. coli (Martinez-Antonio et al., 2006). TFs for internal sensing are those whose allosteric metabolite or equivalent signal is generated inside the cell. Interestingly, internal-sensing TFs are mostly global regulators that affect the transcription of a large number of genes, including other TFs for external sensing as well as those for dual sensing, whose signal may be either synthesized internally or imported from the outside. Thus, global regulators in bacteria (at least those of the E. coli regulatory network) modulate a large number of genes in response to changes in the internal milieu of the cell. They alter the functional state of many genes directly, by affecting their transcription, and also by regulating the expression of alternative TFs and additional sigma (σ) factors (Martinez-Antonio and Collado-Vides, 2003).

5.3. Additional Mechanisms Regulating the TF Activity

In addition to the regulation of TF activity by effector signals or metabolites, there are other, less well-known mechanisms for controlling the transcriptional activity of TFs: (i) Cellular sequestration is exemplified by MalT (a regulator of maltose degradation in E. coli); in the absence of the transportable sugar, some components of the transport system (MalK, MalY) sequester the MalT regulator in the cellular periplasm, preventing the transcription of more transport components until the cell faces a deficiency of favorable/degradable sugars in the milieu and maltose is present for transport and degradation (Schlegel et al., 2002). (ii) A second mechanism relies on the spatial/temporal availability of TFs, as occurs in the unequally dividing cells of α-proteobacteria. One well-studied model is Caulobacter crescentus, in which a master response regulator, CtrA, directly controls the initiation of chromosome replication as well as several aspects of polar morphogenesis and cell division. CtrA activity is temporally and spatially regulated by multiple, partially redundant control mechanisms, such as transcription, phosphorylation and targeted proteolysis (McAdams and Shapiro, 2003; Ryan and Shapiro, 2003). (iii) Another mechanism (related to the one described in point i) makes activity depend on the sub-cellular location of the regulator, as in the case of PutA (Proline utilization A), a multifunctional protein that in solution represses transcription of the genes for proline utilization, whereas when bound to proline it is mobilized to the membrane, where it catalyzes the oxidation of proline to glutamate (Zhu et al., 2002; Zhu and Becker, 2003). A similar case is that of Mlc, a transcriptional regulator of carbon source utilization (Seitz et al., 2003).
In eukaryotes, where TFs must enter the nucleus to be effective, sub-cellular location is believed to play a more important role (Holstege et al., 1998; Jans and Hassan, 1998). (iv) In yet another mechanism, the cellular concentration of a TF reflects a balance between protein expression and degradation, as is apparently the case for the MarA and SoxS regulators in E. coli. In fact, in many cases the precise mechanism that degrades the active TFs and reduces their concentration once their task is accomplished is unknown. One possible mechanism is simple dilution by cell division, as has been demonstrated for the LacI regulator (Elf et al., 2007). MarA and SoxS appear to interact with RNAP in solution, in the absence of promoters, “capturing” RNAP for use at certain transcription units. It is worth noting that this mechanism is similar to the one used by alternative σ factors (Browning and Busby, 2004).

6. Mechanisms of Gene Regulation by TFs

6.1. Transcriptional Regulation by Activators in σ70 Promoters

Activation is required when promoter recognition by RNAP is inefficient; an activator compensates by attracting RNAP to the promoter. In general, activators bind target DNA sites located in or near the upstream region of promoters (Gralla and Collado-Vides, 1996; Barnard et al., 2004). Some of these binding sites lie between the −35 and −10 elements, where the bound activator realigns these cis-elements so that the RNAP holoenzyme can recognize them properly (Gralla, 1996; Rhodius and Busby, 1998; Lloyd et al., 2001). Others lie upstream of the −35 element, where they permit contact between the TF and the αCTD of RNAP to position it at the promoter. A related mechanism of activation, by recruitment, involves a TF that targets a region adjacent to the promoter −35 element, where the bound activator interacts with domain 4 of σ70.

6.2. Transcriptional Regulation by Repressors in σ70 Promoters

The role of repressor proteins is to reduce transcription initiation. The simplest and most common mechanism is binding to a DNA region in a way that interferes with promoter recognition by RNAP. In a second mechanism, a TF repressor binds to regions distant from the promoter and, through polymerization, induces DNA looping that makes promoter recognition by RNAP inefficient, as with the GalR and DeoR TFs. The third mechanism described is anti-activation, in which a repressor binds to an activator and impairs its contact with RNAP, particularly with the αCTD region (Rojo, 2001; Browning and Busby, 2004).

6.3. Transcriptional Activation in Bacterial σ54 Promoters

The mechanisms described above apply to σ70-dependent promoters, which are the most common promoters in the bacterial cell. The mechanisms available for regulating σ54 promoters are rather different (Kustu et al., 1989; Studholme and Dixon, 2003; Rappas et al., 2007), since, as described earlier, RNAP coupled to a σ54 factor requires EBPs to form the open promoter complex and initiate transcription. In contrast to σ70 promoters, σ54 promoters show no basal transcription in the absence of activation. EBPs can bind cooperatively as dimers to DNA sequences 70 to 150 bp upstream of σ54 promoters, but they can be active from as far as 3 kb upstream. Oligomerization of the EBP induces a DNA loop, which allows the EBP to contact the σ54-RNAP holoenzyme (Fig. 3B). The DNA loop could form randomly and transiently, but intrinsic DNA curvature or the presence of nucleoid-associated proteins (e.g., IHF) bound in the intervening region helps to direct DNA looping. Many EBP activators are modular in structure, generally containing three modules or domains: an amino-terminal module required for signal sensing, a central domain for ATP hydrolysis and transcription activation, and a carboxy-terminal domain for DNA binding/recognition. As expected, the amino-terminal domain is the least conserved, since it is adapted to sensing diverse signals. Sometimes this amino-terminal domain serves as the receiver module for phosphorylation by the sensor protein of a two-component system. Either phosphorylation or the binding of small molecules induces oligomerization and increases the rate of ATP hydrolysis. The released energy is coupled to the formation of the open promoter complex, possibly through a conformational change in the RNAP holoenzyme. The activity of σ54 promoters thus requires the activity of an EBP. Notably, EBPs can in some cases also regulate σ70 promoters, most importantly by repression.

7. The Regulation of Gene Transcription by Nucleoid Structure

In both archaea and bacteria, RNAP is present in the cell in limited quantities, and most of it is devoted to the synthesis of stable ribosomal RNA, necessary for translation (accounting for up to 95% of total RNA synthesis under conditions of rapid growth). In addition, some RNAP is bound to the DNA non-productively and nonspecifically. Thus, the RNAP available for transcribing the thousands of genes in a prokaryotic cell is limited. The σ factors and their accessory factors are also limited, which generates intense competition between promoters for the RNAP holoenzyme. These limitations explain why transcription requires tight regulation and why, depending on promoter strength and TF availability, some promoters produce many transcripts and others little or none. Here, we describe the main mechanisms that modulate transcription initiation in archaea and bacteria.

7.1. Transcriptional Regulation by Nucleoid-Structure in Bacteria

All cells have to compact a long genome to fit it into a small space while keeping the DNA accessible for replication, transcription, repair and other DNA-related biochemical processes. In eukaryotes the basic repeating unit of chromatin is the nucleosome, which consists of dimers of each of the following histones: H2A, H2B, H3 and H4, wrapping a DNA segment of ∼200 bp; a further segment of ∼50 bp of DNA is bound by histone H1, linking adjacent nucleosomes (Nemeth and Langst, 2004; Bernstein and Hake, 2006; Robinson and Rhodes, 2006; Woodcock, 2006). This packaging of DNA inside the cell helps maintain a high level of DNA compaction and controls gene expression by blocking the access of RNAP to promoters whose expression is not needed. For transcription to occur, nucleosomes must be restructured by remodeling enzymes that destabilize these structures and expose the dsDNA. In prokaryotes, with genomes ranging from ∼0.5 to ∼6 Mbp, the DNA must be compacted ∼1000-fold to form the nucleoid, which occupies a very small cellular space (∼1 µm). A bacterial nucleoid contains 50–400 negatively supercoiled DNA loops of ≥10 kb each (Carpentier et al., 2005; Hashimoto et al., 2005; Thanbichler et al., 2005; Zimmerman, 2006). These DNA loops might be topologically discrete chromosomal territories, independent of each other. The nucleoid is believed to be composed mostly of DNA (60%), together with ribosomal complexes and nucleoid-associated proteins or TFs. Given the compactness of the folded genome, it has been postulated that gene regulation in eukaryotes and bacteria faces two different scenarios: a restrictive and a nonrestrictive ground state, respectively (Struhl, 1999). In other words, bacterial DNA is thought to be always available for transcription and accessible to the competitive binding of TFs and RNAP at operators and promoter regions, whereas eukaryotic DNA is densely packaged into repressive chromatin and is therefore practically inaccessible to both TFs and RNAP. Consistent with this view, bacterial RNAP associated with some σ factors can recognize strong promoters and trigger transcription from them without requiring additional TFs to mediate the process.
The eukaryotic RNAPs, however, have no intrinsic promoter-binding ability, and the initial role of a TF in eukaryotes has been postulated to be the recruitment of additional nucleosome-remodeling proteins (e.g., RSC, SWI/SNF) that unwind the DNA and make the promoters accessible to the RNAP complex. The σ54 promoters of bacteria resemble eukaryotic promoters in this respect: they require TFs to initiate transcription, since RNAP cannot transcribe them even weakly without a TF. However, at the σ70 promoters, which dominate transcription in bacteria, TFs are used to strengthen the recognition of weak promoters, whereas in eukaryotes TFs are necessary to recruit the RNAP complex for transcription (Ptashne and Gann, 1997; Ptashne and Gann, 2002). These different scenarios, attributed to the degree of compactness of the folded genome, correlate with different mechanisms of transcription regulation. Bacterial genes are subject to activation as much as to repression, whereas the role of TFs in eukaryotes is mostly activation.
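The order of magnitude of the ∼1000-fold compaction quoted in this section can be checked with a short calculation. The sketch below assumes a rise of 0.34 nm per base pair for B-form DNA and takes 4.6 Mbp as the E. coli genome size; neither figure is given in the chapter, and the function names are hypothetical. It compares the contour length of the extended chromosome with the ∼1 µm nucleoid dimension stated above.

```python
RISE_PER_BP_NM = 0.34  # assumed axial rise of B-form DNA, in nanometres per base pair


def contour_length_um(genome_bp):
    """Length of the fully extended DNA in micrometres."""
    return genome_bp * RISE_PER_BP_NM / 1000.0


def linear_compaction(genome_bp, nucleoid_um=1.0):
    """Ratio of extended DNA length to the linear dimension of the nucleoid."""
    return contour_length_um(genome_bp) / nucleoid_um


# Genome sizes: the range given in the text plus an assumed E. coli value.
for label, size_bp in [("0.5 Mbp", 500_000), ("E. coli (assumed 4.6 Mbp)", 4_600_000), ("6 Mbp", 6_000_000)]:
    print(label, round(contour_length_um(size_bp)), "um extended,",
          round(linear_compaction(size_bp)), "fold")
```

Under these assumptions the extended E. coli chromosome is about 1.5 mm long, so a linear compaction on the order of a thousand-fold is consistent with the figure in the text.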

7.2. Transcriptional Regulation by Nucleoid-Structure in Archaea

The previous section reviewed transcriptional regulation in eukaryotes and bacteria with respect to nucleoid structure; it is interesting to examine the nature of the same mechanism in archaea. The archaeal RNAP resembles, in simplified form, the eukaryotic RNAP, whereas genome packaging and TF regulators are more similar to those of bacteria. All archaeal species studied to date contain abundant quantities of small DNA-binding proteins (Sul7d, Alba, histones, HTa, MC1), whose binding seems to be sequence-independent, and in both crenarchaea and euryarchaea these have been localized in vivo to the nucleoid (Herrmann and Soppa, 2002; Long and Faguy, 2004; Majernik et al., 2005). Nonetheless, there is no evidence for a universal mechanism of archaeal chromatin structure (Hayat and Mancarella, 1995; Poplawski and Bernander, 1997; Hoppert and Mayer, 1999; Malandrin et al., 1999; Bernander, 2000; Majernik et al., 2005). For euryarchaeal genomes, it has been suggested that the histone, a small protein, is conserved only in its folding domain and lacks the accessory domains that are acetylated or methylated in its eukaryal counterparts (Reddy and Suryanarayana, 1989; Grayling et al., 1996). In addition, bioinformatics analyses indicate that the DNA sequences that facilitate binding and wrapping by histones are statistically over-represented in euryarchaeal genomes, but there is no definitive evidence for the formation of chromatin structures. Furthermore, there is no conclusive evidence as to whether these chromosomal regions (those predicted to bind histones in euryarchaeal genomes) are related to zones of transcription. Thus, transcription regulation in archaea seems to be closer to that of bacteria, depending mostly on TF activity and following the nonrestrictive model. However, the role of these small DNA-binding proteins in transcriptional regulation is still not clear.

7.3. Nucleoid-Associated Proteins in Bacteria

In bacteria, the so-called Nucleoid-Associated Proteins (NAPs) contribute to the compaction of DNA as well as to transcriptional regulation; they include the DNA-bridging protein H-NS and the DNA-bending proteins IHF, HU and FIS (Kepes, 2004; Dame, 2005; Luijsterburg et al., 2006). H-NS (Histone-Like Nucleoid Structuring Protein) is a 15.4 kDa protein conserved among Gram-negative bacteria. The active form of H-NS is believed to be a dimer or a larger oligomer. This protein binds DNA without clear sequence specificity but with some bias toward curved or unusually flexible DNA. Its binding results in bridges between adjacent DNA duplexes that provide the structural basis for transcriptional repression (Navarre et al., 2006). In exponentially growing cells (when it is maximally expressed in E. coli), the protein reaches a concentration of one dimer per 1400 bp of DNA. IHF (Integration Host Factor) functions as an accessory factor in a wide variety of processes such as replication, site-specific recombination and transcription. It is a heterodimer of two subunits (with 25% sequence homology between them): an α subunit of 11 kDa and a β subunit of 9 kDa. This heterodimeric protein binds specific DNA sequences with nanomolar affinity. Expression of IHF is maximal during stationary growth (1 IHF heterodimer/335 bp of DNA), and it is predicted to have ∼1000 binding sites on the E. coli genome. IHF binding sites occur frequently upstream of σ54 promoters in Klebsiella and other bacteria, and IHF is essential for the bending required to bring the remotely bound TF or enhancer-like regulator physically close to the polymerase bound at the σ54 promoter. It thereby provides a mechanism of specificity that diminishes cross-talk, or undesired activation of other σ54 promoters nearby (Gralla and Collado-Vides, 1996). The histone-like protein from E. coli strain U93 (HU) is present in most bacteria as α and β homodimers (each 9.5 kDa, with 70% homology between them). HU binds DNA non-specifically but has a higher affinity for supercoiled and distorted DNA. The DNA bend induced by HU is suggested to be less rigid than that induced by IHF and is considered a flexible hinge. Interestingly, HU activity can be substituted by the eukaryotic High Mobility Group Box 1 protein (HMGB1), an extranuclear cytokine with chromosome remodeling properties (Perez-Martin and De Lorenzo, 1997). HU reaches its maximal level during the exponential phase (1 HU/550 bp). The bacterial FIS protein (Factor for Inversion Stimulation) is a 22 kDa homodimer that binds DNA in a sequence-specific manner. Strong binding sites for FIS are located upstream of a number of stable RNA operons, where it acts as a transcriptional activator. FIS is the most abundant NAP at the beginning of exponential growth (1 FIS/450 bp) and is completely absent during the stationary phase. Some nucleoid-associated proteins are expressed during all phases of bacterial growth (Grainger et al., 2006; Luijsterburg et al., 2006). All these proteins favor DNA packing and can, at the same time, affect transcriptional regulation. In contrast to eukaryotic genomes, the DNA-binding sites for the bacterial NAPs are biased toward non-coding parts of the genome.
These proteins, also recognized as regulators of transcription, are not modulated allosterically through the direct binding/sensing of environmental or metabolic signals, as the majority of the classic TFs are. It is postulated, however, that the NAPs sense the state of the chromosome and act as global regulators by permitting or restricting the access of RNAP or additional TFs to their binding sites. In this context, it is interesting that a large number of CRP sites have been found by ChIP-chip experiments (Grainger et al., 2005). Judging by their genomic positions, only a fraction of them correspond to CRP acting as a transcriptional regulator, whereas others, located within coding regions, may have different roles. The large number of CRP sites suggests that this and other global regulators multimerize on contiguous DNA-binding sites and could function by specifically modifying some chromosomal regions depending on the presence or absence of environmental signals (such as cAMP for CRP). Taking this idea a step further, it is interesting to note that most of the proposed global regulators of E. coli (Martinez-Antonio and Collado-Vides, 2003) multimerize and show some DNA-structuring properties, binding DNA in a manner similar to the nucleoid-associated proteins. Earlier, all TFs had been considered as transcription regulators, but the function of CRP has opened new avenues or, more precisely, has revealed an additional, indirect means of globally affecting transcription beyond the direct role in modulating transcription initiation. Furthermore, transcription of the NAPs is regulated by other TFs that sense environmental or endogenous signals, which leads to the conclusion that NAP activity is indirectly coupled to cellular or milieu conditions. In this way, one can envision alternative states or configurations of the bacterial chromosomal structure that, depending on the growth conditions, determine the global pattern of gene transcription, e.g., exponential versus stationary phase (Travers and Muskhelishvili, 2005).
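The NAP abundances quoted in this section are given as one protein unit per stretch of DNA. To make these figures more tangible, the short sketch below converts them into approximate numbers of units per chromosome equivalent. The 4.6 Mbp genome size is an assumption (the chapter does not state it), the dictionary and function names are hypothetical, and the calculation ignores multiple replicating chromosome copies, so the results are only rough orders of magnitude.

```python
# Spacing values as quoted in the text, each at the growth phase of maximal expression.
NAP_SPACING_BP = {
    "H-NS dimer": 1400,
    "IHF heterodimer": 335,
    "HU": 550,
    "FIS": 450,
}

ASSUMED_GENOME_BP = 4_600_000  # assumed E. coli genome size, not given in the chapter


def units_per_chromosome(genome_bp, spacing_bp):
    """Whole number of protein units needed for one unit per spacing_bp of DNA."""
    return genome_bp // spacing_bp


estimates = {name: units_per_chromosome(ASSUMED_GENOME_BP, spacing)
             for name, spacing in NAP_SPACING_BP.items()}
for name, count in estimates.items():
    print(name, count)
```

Each value applies to the growth phase in which that protein peaks, so the four numbers do not describe a single physiological state.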

8. Post-transcriptional Control Mechanisms for Regulating Gene Expression

8.1. Riboswitches

Riboswitches (RNA switches) are found in the 5′ untranslated leader regions (UTRs) of some mRNA molecules. They bind a small target molecule directly, and this binding can affect the gene’s expression. Thus, an mRNA that contains a riboswitch is directly involved in regulating its own activity, depending on the presence or absence of its target molecule (Winkler and Breaker, 2005; Batey, 2006; Nudler, 2006). The small-molecule-sensing region of the mRNA is called the aptamer, and it undergoes structural changes when bound to its specific small molecule. These structural changes affect the expression of a gene in the following ways: 1. by the formation of transcription termination hairpins; 2. by sequestering the ribosome-binding site (Shine-Dalgarno), thereby blocking translation; and 3. by self-cleavage (i.e., the riboswitch contains a ribozyme that cleaves itself in the presence of sufficient concentrations of its metabolite). The small metabolites sensed by these mRNA structures include vitamins, cofactors, amino acids, sugars and nucleotides. Their presence has been experimentally proven in bacteria and in some cases in archaea as well as eukarya.

8.2. Attenuators

An attenuator is an RNA sequence that regulates the expression of certain genes by terminating transcription. A transcriptional attenuator usually depends on the formation of mutually exclusive secondary structures (Matsumoto et al., 1986; Switzer et al., 1999). It requires a “rho-independent” (or intrinsic) terminator, which, when formed, causes the RNA polymerase to stop transcribing prematurely. An anti-terminator is a structure that impedes the formation of the terminator, and an anti-anti-terminator is a structure that impedes the formation of the anti-terminator. In this fashion, the most stable structure in a given condition will form and govern the expression of the gene(s) immediately downstream. Different structures may form under different conditions, such as when a protein, the ribosome or a small molecule stabilizes one of the structures.

8.3. Non-coding RNAs

A non-coding RNA (ncRNA) is any RNA molecule that is not translated into a protein (see Chapter 2 for more details). ncRNAs have been found to have roles in a great variety of biological processes, including transcriptional regulation, chromosome replication, RNA processing and modification, messenger RNA stability and translation, and even protein degradation and translocation (Gottesman, 2005; Vogel and Papenfort, 2006). Recent studies indicate that ncRNAs are far more abundant and important than initially imagined. Some of the known mechanisms of ncRNA activity include the following:

(i) Direct base-pairing with target RNA or DNA molecules is central to the function of some ncRNAs, which repress translation by forming base pairs with the Shine-Dalgarno sequence and occluding ribosome binding.
(ii) Some ncRNAs mimic the structure of other nucleic acids: bacterial RNA polymerase may recognize the 6S RNA as an open promoter, and bacterial ribosomes recognize tmRNA as both a tRNA and an mRNA.
(iii) Some ncRNAs function as an integral part of a larger RNA-protein complex, such as the signal recognition particle, whose structure has been partially determined (Gottesman, 2005; Winkler, 2005; Vogel and Papenfort, 2006).

In addition, the conserved RNA-binding protein Hfq functions as a pleiotropic regulator that modulates the stability or translation of an increasing number of mRNAs (Valentin-Hansen et al., 2004; McNealy et al., 2005; Sittka et al., 2006). RegulonDB, the database on transcriptional regulation in E. coli, now contains the set of small RNAs and has been curated to include information on their corresponding target sites and regulated genes.

8.4. mRNA Half-Life

The life of an mRNA molecule in prokaryotes begins with transcription and ultimately ends in degradation. In E. coli, around 80% of mRNAs have half-lives between 3 and 8 min (Bernstein et al., 2002; Selinger et al., 2003). The stability of mRNA in prokaryotes depends on multiple cis-elements and trans-factors. The trans-acting proteins are mainly nucleases, and the cis-acting elements that protect mRNA from degradation are stable stem-loops at the 5′ end. However, it seems that transcription is the dominant factor in determining the mRNA steady-state level in E. coli and that the variation in half-life may have an alternative biological role, perhaps to facilitate transient changes in mRNA abundance in response to specific environmental perturbations or as bacteria progress through the cell-division cycle. For archaea there is now a first indication from global measurements of mRNA half-life in Sulfolobus and Halobacterium: their average mRNA half-lives are 5 and 10 min, respectively, similar to those of their bacterial counterparts (Andersson et al., 2006; Hundt et al., 2007).
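To make the relation between half-life and abundance concrete, the following minimal sketch assumes simple first-order decay, a standard textbook model that this chapter does not itself specify. The function names and the synthesis rates are illustrative assumptions only.

```python
import math


def fraction_remaining(t_min, half_life_min):
    """Fraction of an mRNA pool left after t_min minutes (first-order decay assumed)."""
    return 0.5 ** (t_min / half_life_min)


def decay_rate(half_life_min):
    """First-order decay constant (per minute) that corresponds to a half-life."""
    return math.log(2) / half_life_min


def steady_state_level(synthesis_rate, half_life_min):
    """Steady-state abundance when constant synthesis balances first-order decay."""
    return synthesis_rate / decay_rate(half_life_min)


# Half-lives of 3 and 8 min bracket most E. coli mRNAs according to the text.
for half_life in (3.0, 8.0):
    print(half_life, fraction_remaining(10.0, half_life), steady_state_level(1.0, half_life))
```

Under this model the steady-state level scales linearly with both the synthesis rate and the half-life, which is one way to read the statement that transcription can dominate the steady-state level while half-life shapes how fast abundance changes.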

9. Regulatory Network Integration

9.1. Global and Local TFs

Biological networks typically show a scale-free connectivity distribution, in which a few elements are highly connected while most are sparsely connected (Barabasi and Albert, 1999; Hartwell et al., 1999). In the transcriptional network, this distribution is marked by the presence of a few TFs, defined as global TFs, that present the following properties: (i) they have few paralogs; (ii) they regulate many genes; (iii) they regulate several genes encoding other TFs; (iv) they cooperate with numerous TFs and together regulate other genes; (v) they directly affect the expression of a variety of promoters that use different σ factors; and (vi) they regulate genes from different functional classes (Martinez-Antonio and Collado-Vides, 2003). Local TFs, on the other hand, mostly respond to defined environmental signals, constitute small regulons, and normally fall at the lowest level of the TF hierarchy. Global TFs regulate, and co-regulate with, local TFs, coordinating the cellular transcriptional response with general and specific conditions (Covert et al., 2004; Barrett et al., 2005; Martinez-Antonio et al., 2006; Yu and Gerstein, 2006; Janga et al., 2007b).
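As a rough illustration of properties (ii) and (iii), the sketch below tallies how many genes each TF regulates and which of its targets are themselves TFs. The network and all names in it are hypothetical toy data, not taken from any database.

```python
from collections import Counter

# Hypothetical toy network: each TF maps to the set of genes it regulates.
NETWORK = {
    "tfG": {"tfL1", "tfL2", "g1", "g2", "g3", "g4"},
    "tfL1": {"g1", "g5"},
    "tfL2": {"g6"},
}


def out_degrees(network):
    """Number of regulated genes per TF (property ii)."""
    return {tf: len(targets) for tf, targets in network.items()}


def regulated_tfs(network):
    """Targets of each TF that are themselves TFs (property iii)."""
    tfs = set(network)
    return {tf: sorted(targets & tfs) for tf, targets in network.items()}


degrees = out_degrees(NETWORK)
degree_histogram = Counter(degrees.values())  # how many TFs have each out-degree
print(degrees, regulated_tfs(NETWORK), degree_histogram)
```

In a real network the degree histogram would be the quantity inspected for a heavy tail; with three TFs it only shows the bookkeeping.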

9.2. Noisy Gene Transcription

The availability of new methodologies has made it possible to follow gene expression patterns even at the single-cell level (Swain et al., 2002; Becskei et al., 2005; Golding et al., 2005; Hooshangi et al., 2005; Rosenfeld et al., 2005; Cai et al., 2006). Previous studies have shown that gene expression occurs as pulses or bursts of transcription, and that the level of response of a gene is highly variable when individual cells in a population are compared. A cell population therefore exhibits heterogeneous transcriptional states. As a consequence, more than one cellular configuration, identified as a phenotype, exists in a population even under the same growth conditions. These numerous phenotypes contribute to the plasticity that prokaryotes require for environmental adaptation and population robustness.

9.3. Combinatorial Regulation of TFs

Though regulation by multiple TFs is more frequent in eukaryotes than in prokaryotes, it is also the dominant form of gene regulation in prokaryotes, where multiple TFs work in collaboration (Adhya, 2003; Martinez-Antonio and Collado-Vides, 2003). In this way, multiple signaling pathways and cellular conditions can converge on the regulation of transcription units whose protein products can respond to these conditions. For example, the sodA operon of E. coli is regulated by 6 different TFs sensing internal and external signals. For most genes, transcriptional co-regulation is coupled with the activity of internal and external sensing TFs or with nucleoid-associated proteins. The following groups of genes have been defined based on their co-regulation and co-expression.

9.3.1. Regulon

A regulon is the set of genes subject to regulation by only one regulator (Maas and McFall, 1964). The initial definition was derived from studies of the arginine biosynthetic genes, which, in contrast to operons, were found to be scattered (non-contiguously located) in the chromosome of E. coli. To better describe this type of group of co-regulated genes, we call it a simple regulon (as opposed to a complex regulon). A complex regulon is a group of genes subject to regulation by two or more regulators, where all genes are subject to regulation by exactly the same TFs. A strict complex regulon, in turn, is a set of genes where the effect of each regulator (activator or repressor) is the same for all the regulated genes (Gutierrez-Rios et al., 2003). For a more detailed discussion of regulons, we refer the reader to Chapter 11.

9.3.2. Sigmulon and Stimulon

A sigmulon is the group of genes transcribed by a common σ factor (i.e. σ70, σ54, etc.). The term implies that the encoding TUs have promoters recognized by the same σ factor, but not necessarily that they are transcribed at the same time. A related term is stimulon, which refers to the collection of genes (or operons, regulons) regulated by the same stimulus and is generally used for prokaryotic systems, for example quorum sensing, heat shock, etc. (Cases et al., 2003b; Schumann, 2003).
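The regulon definitions above amount to grouping genes by their set of regulators. The following sketch shows one possible reading, in which a simple regulon contains the genes controlled by exactly one TF. The interaction list and all TF and gene names are hypothetical.

```python
from collections import defaultdict

# Hypothetical interactions: (TF, target gene, effect), "+" activation, "-" repression.
INTERACTIONS = [
    ("TF1", "geneA", "-"), ("TF1", "geneB", "-"),
    ("TF2", "geneC", "+"), ("TF3", "geneC", "-"),
    ("TF2", "geneD", "+"), ("TF3", "geneD", "-"),
    ("TF2", "geneE", "+"), ("TF3", "geneE", "+"),
]


def group_regulons(interactions):
    """Return (simple, complex, strict complex) regulons as dictionaries of gene sets."""
    regulators = defaultdict(dict)            # gene -> {TF: effect}
    for tf, gene, effect in interactions:
        regulators[gene][tf] = effect

    simple = defaultdict(set)                 # keyed by the single TF
    complex_ = defaultdict(set)               # keyed by the set of TFs
    strict = defaultdict(set)                 # keyed by the set of (TF, effect) pairs
    for gene, tf_effects in regulators.items():
        tfs = frozenset(tf_effects)
        if len(tfs) == 1:
            simple[next(iter(tfs))].add(gene)
        else:
            complex_[tfs].add(gene)
            strict[frozenset(tf_effects.items())].add(gene)
    return dict(simple), dict(complex_), dict(strict)


simple, complex_, strict = group_regulons(INTERACTIONS)
print(simple, complex_, strict)
```

In this toy input geneC, geneD and geneE form one complex regulon (same TFs), but geneE falls into a separate strict complex regulon because TF3 activates it while repressing the other two.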

9.4. Transcriptional Regulatory Network

A typical prokaryotic cell might be regarded as a small bag of thousands of cellular components, each devoted to different tasks but interacting with the others (Hartwell et al., 1999; Barabasi and Oltvai, 2004). These cellular interactions are commonly represented as a network in which the nodes are the cellular components and the edges their functional interactions. Focusing on transcriptional regulation, we can draw the transcribed products of TUs as the nodes and the regulatory activities of TFs on the transcription of those genes as the edges. In this manner we can represent the regulatory interactions between TFs and their target genes; because some of those genes regulate themselves (auto-regulation) or regulate other TFs, we obtain a hierarchical transcriptional regulatory network (Martinez-Antonio and Collado-Vides, 2003; Resendis-Antonio et al., 2005; Yu and Gerstein, 2006). The availability of many individually described regulatory interactions for TUs, together with high-throughput experimental data (micro-arrays and ChIP-chip experiments), mainly in E. coli and other organisms, has provided a wealth of data that permits analyses of organisms from a biological point of view: an approach that combines mathematical modeling and the prediction of molecular interactions, to be verified experimentally, in an integrated way. In these reconstructed regulatory networks, it is possible to define which TFs work together, jointly defining regulatory modules, because they constitute groups of TFs that process similar signals and generate coordinated transcriptional responses; their biological redundancy may contribute to the system’s robustness (Thieffry and Romero, 1999; Ihmels et al., 2002; Segal et al., 2003). It is also possible to define small topological units, named network motifs, that contain three or four genetic elements and are overrepresented relative to a randomly constituted network (Milo et al., 2002; Shen-Orr et al., 2002). Understanding how these topological structures evolve and function in genetic networks helps in better understanding biological systems. In addition to the genetic components, we can add to the network the effector metabolites, signaling cascades, epigenetic regulation, etc., which complicate the network analysis but are indispensable for a better comprehension of the functions of biological systems.
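As a small illustration of a network motif, the sketch below enumerates feed-forward loops (X regulates Y, Y regulates Z, and X also regulates Z), one of the three-element motifs reported for the E. coli network by Shen-Orr et al. (2002). The toy network is hypothetical, and the comparison against randomized networks that establishes overrepresentation is deliberately left out.

```python
from itertools import permutations

# Hypothetical toy network: regulator -> set of genes it regulates.
# Every node appears as a key, even if it regulates nothing.
EDGES = {
    "W": {"X"},
    "X": {"Y", "Z"},
    "Y": {"Z"},
    "Z": set(),
}


def feed_forward_loops(edges):
    """List ordered triples (x, y, z) with edges x->y, y->z and x->z."""
    loops = []
    for x, y, z in permutations(edges, 3):
        if y in edges[x] and z in edges[y] and z in edges[x]:
            loops.append((x, y, z))
    return loops


print(feed_forward_loops(EDGES))
```

Enumerating all ordered triples is cubic in the number of nodes, which is acceptable for a sketch but would need a neighbor-based search on a genome-scale network.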

10. Summary

In this chapter, we have summarized the diversity of mechanisms with a range of combinatorial components that increase the flexibility both of what constitutes a specific promoter, for instance in the σ70 architecture, and of what constitutes a particular regulatory region, with its combined set of regulating TFs. These combinations provide room for highly selected specific systems in terms of their integration with the system and the kinetics of their responses. The work of evolution also provides a unique rationale for understanding gene regulation, be it in bacterial, archaeal or eukaryotic cells, where regulation serves as a molecular switch executed by specific molecules that adopt alternative conformations (active and inactive), which are themselves connected through “sensing” pathways to changing internal and/or external conditions. We have also described how genes are organized for transcription in prokaryotic organisms, and the main machinery for processing cis-regulatory signals and trans-regulatory factors that is needed for the proper regulation of genes in response to new or changing environmental conditions. In archaea, the basal machinery for transcription is similar to the RNAP II of eukarya, while the use of TFs for modulating gene transcription is similar to bacteria. Our knowledge about transcriptional regulation in archaea is rather limited compared to bacteria because of the difficulty of growing them in the laboratory. Both archaea and bacteria represent simple, successful cellular systems adaptable to many environmental conditions, and understanding their functions is fundamental to understanding our changing environment throughout the history of life on earth.

11. Further Reading

Browning DF, Busby SJ (2004) The regulation of bacterial transcription initiation. Nat Rev Microbiol 2(1):57–65.
Collado-Vides J, Magasanik B, Gralla JD (1991) Control site location and transcriptional regulation in Escherichia coli. Microbiol Rev 55(3):371–394.
Gruber TM, Gross CA (2003) Multiple sigma subunits and the partitioning of bacterial transcription space. Annu Rev Microbiol 57:441–466.
Ptashne M, Gann A (1997) Transcriptional activation by recruitment. Nature 386(6625):569–577.
Ptashne M, Gann A (2002) Genes and Signals. Cold Spring Harbor Laboratory Press.
Struhl K (1999) Fundamentally different logic of gene regulation in eukaryotes and prokaryotes. Cell 98(1):1–4.
Luijsterburg MS, Noom MC, Wuite GJ, Dame RT (2006) The architectural role of nucleoid-associated proteins in the organization of bacterial chromatin: a molecular perspective. J Struct Biol 156(2):262–272.

Acknowledgments

The authors acknowledge two anonymous referees as well as the editors, whose suggestions and comments helped to improve this chapter. J.C.-V. acknowledges NIH grant R01 GM71962-03.

CHAPTER 9 COMPUTATIONAL TECHNIQUES FOR ORTHOLOGOUS GENE PREDICTION IN PROKARYOTES

MARIA POPTSOVA

All happy families are alike; every unhappy family is unhappy in its own way.
    Leo Tolstoy

All happy families are more or less dissimilar; all unhappy ones are more or less alike.
    Vladimir Nabokov

1. Introduction

The terminological debate over homology, as applied to the field of molecular evolution, continued for several decades, from the 1970s until the beginning of the millennium (Fitch, 2000; Jensen, 2001; Koonin, 2001; Petsko, 2001; Theissen, 2002). Today the terms homologs, orthologs and paralogs are finally recognized as logical and intrinsic to the field of comparative genomics. Indeed, the correct usage of the terms greatly facilitates the description and analysis of the evolutionary relations between genes (Koonin, 2005). This terminology traces back to the field of evolutionary biology, where the term “homologs” initially applied to similar morphological or physiological structures that originated from a common ancestor (for example, the forelimbs of humans, bats, and deer). Homologs were opposed to analogs, which show apparent functional similarity but do not share a common origin (for example, the wings of birds and of insects) (Henning, 1966). When protein sequences became available, the notions of homology and analogy were reintroduced to molecular evolution in the classic paper of Walter Fitch (Fitch, 1970). Fitch made a further distinction between two subcategories of homologs according to the way they came into existence. He distinguished between orthologs (from “ortho-”, meaning straight or vertical), which originated via vertical descent, or speciation, and paralogs (from “para-”, meaning alongside of), which were brought about by duplication in a genome. The terminology was not accepted immediately and, at the beginning, led to misunderstanding and misuse (Koonin, 2001; Petsko, 2001). Fifteen terminological problems were summarized by Fitch himself in his review paper “Homology: A personal view on some of the problems” thirty years after his initial publication (Fitch, 2000).

Today, with hundreds of complete genomes available from different species, the need for terminology that describes evolutionary relations between different genes from different species is no longer in doubt. Fortunately, the events that can happen to an individual gene during its evolution are discrete, which allows for a precise classification of these events, and hence of the genes that underwent them. In this respect, the terms ortholog and paralog introduced by Fitch proved to be of special value for comparative genomics. The terminology was further refined by introducing xenologs (Gray et al., 1983) and synologs (Gogarten, 1994), and by specifying orthologous and paralogous subcategories such as in- and outparalogs (Sonnhammer et al., 2002) and pseudoorthologs and pseudoparalogs (Koonin, 2005). An important outcome of the orthology-paralogy paradigm is theoretical protein function prediction (Baker et al., 2001; Whisstock et al., 2003; Friedberg, 2006). Orthologs are believed to perform the same function, while paralogs usually develop a new one. The statement seems to hold true for the majority of experimentally tested cases, but there are always exceptions. Nevertheless, the delineation between orthologs and paralogs is the basis for annotating a newly sequenced genome. In this chapter I discuss the existing methods for orthologous gene prediction in general, and those for prokaryotes in particular. I describe approaches that were implemented in different gene family databases. I also review the fully automated methods for orthologous gene selection. Finally, I describe the BranchClust algorithm, a fully automated phylogenetic method to select orthologous gene families for any number of different taxa.

2. Basics of Orthologs, Inparalogs and Outparalogs, and Xenologs

By definition, homologs are genes that share a common evolutionary origin. This statement can be reformulated as follows: for any set of homologous genes from any set of different species (i.e. taxa), there existed one common ancestral gene from which the entire considered set of homologous genes originated. Note that the common ancestral gene did not necessarily exist in the common ancestor of the considered species, as follows from the case of a gene that originated in one species (some time after the species diverged) and then was horizontally transferred to another. Under this definition, evolutionary scenarios for a set of homologous genes can be traced to their emergence from one ancestral gene (see Fig. 1). Assume that we have a set of homologous genes {E1, A1, B1, C1, D1, A2, A2′, B2, C2, D2} assembled from five different species. The capital letters A, B, C, D and E signify species, and the indices 1 and 2 signify genes. The organismal tree for taxa A, B, C, D and E is given in Fig. 1B. A tree reconstructed from the homologous genes {E1, A1, B1, C1, D1, A2, A2′, B2, C2, D2} is presented in Fig. 1C. Figure 1A depicts the history of the species (thick tubes) together with the histories of the individual genes inside the lineages (thin lines inside the tubes). The unfolded history of the genes inside the history of the species, as depicted in Fig. 1A, corresponds to the homologous gene tree shown in Fig. 1C.

Fig. 1. (A) Organismal tree containing individual gene trees, (B) organismal tree, (C) homologous gene tree. Nodes corresponding to speciation, duplication, and horizontal gene transfer (HGT) events are marked by asterisks, pound signs, and dollar signs, respectively.

A phylogenetic tree representing the history of the homologous genes, as depicted in Fig. 1C, can be composed of three elementary events: speciation, duplication and horizontal (sometimes termed lateral) transfer. In Fig. 1 the elementary evolutionary events correspond to the nodes in a phylogenetic tree and are designated by signs (see the caption of Fig. 1). The first elementary event is speciation (asterisk sign), when two homologous genes start to evolve separately in different species. The second event is duplication (pound sign), when a second copy of the gene is created in the same genome and the two genes start to evolve separately, but in one and the same species. The third event is horizontal transfer (dollar sign), when a homologous gene is acquired from a different species. Depending on the types of events that happened to homologous genes, their relations to each other are expressed in terms of orthologs, paralogs and xenologs. Orthologs are pairs of genes that evolved according to the first scenario, the speciation event. Paralogs are pairs of genes that emerged as a result of the second type of event, duplication. Finally, xenologs are pairs of genes in which at least one member emerged in its species via the third scenario, horizontal transfer.

In the example of Fig. 1, tracking of the evolutionary scenarios begins with one ancestral gene, gene 1. The lineage carrying gene 1 underwent a speciation event that gave rise to taxon E and to the common ancestor of taxa A, B, C and D. Later on, gene 1 was duplicated in the common ancestor of taxa A, B, C and D, and a new gene 2 emerged as a result of the duplication event. After this duplication (which occurred before the speciation of taxa A, B, C and D), speciation gave rise first to species D and, later, to species A, B and C. All four taxa (A, B, C and D) inherited the two ancestral genes 1 and 2. Gene 2 was duplicated once again, but only in taxon A, resulting in two similar genes in the same species: A2 and A2′. Gene 1 from taxon D, or gene D1, requires special attention. In the homologous gene tree it is shown to be a close relative of gene E1 (Fig. 1C), but in the organismal tree taxon D groups with the branch holding species A, B and C (Fig. 1B). In this hypothetical example, gene 1 came to taxon D from taxon E at some later point in evolution, even after the speciation events for gene 1 that resulted in genes A1, B1 and C1.

After the elementary (evolutionary) events for a given set of homologous genes are determined, the assignment of orthologous, paralogous and xenologous relationships is straightforward. Note that relationships are established between pairs of genes: a single gene cannot be just an ortholog, a paralog or a xenolog; it is always an ortholog (paralog or xenolog) to some other gene(s). To determine the relationship within a pair of homologous genes, one should trace their history back to the nearest common node in the phylogenetic tree. If the nearest common node is a speciation event and there is no HGT event along the history of these genes, then the genes are orthologs. If the nearest common node is a duplication event, then the genes are paralogs. If an HGT event happened to either of the considered genes, the genes are xenologs. To illustrate how these rules work, let us consider the example depicted in Fig. 1C. Genes 1 and 2, which originated as a result of the duplication event, are paralogs regardless of the species under comparison.
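The pairwise rules stated above can be sketched in a few lines of code. The tree encoding, the function names and the branching order among taxa A, B and C are assumptions made for illustration (the figure itself is not reproduced here), and the sketch assigns a single label per pair, with HGT taking precedence. The prime in A2′ is written as an ASCII apostrophe.

```python
from collections import Counter
from itertools import combinations

# Gene tree loosely following Fig. 1C. Internal nodes are tuples
# (event, child, child); leaves are gene names.
# Events: "S" speciation, "D" duplication, "H" horizontal transfer.
# The branching order among A, B and C is an assumption.
GENE_TREE = (
    "S",
    ("H", "E1", "D1"),                     # D1 was acquired from the E lineage
    ("D",
     ("S", ("S", "A1", "B1"), "C1"),       # descendants of gene 1
     ("S", "D2",                           # descendants of gene 2
      ("S", ("S", ("D", "A2", "A2'"), "B2"), "C2"))),
)
TRANSFERRED = {"D1"}  # genes that reached their genome by HGT


def leaf_trails(node, position=(), trail=()):
    """Map each leaf to the (position, event) steps from the root to its parent."""
    if isinstance(node, str):
        return {node: trail}
    event, *children = node
    trail = trail + ((position, event),)
    trails = {}
    for index, child in enumerate(children):
        trails.update(leaf_trails(child, position + (index,), trail))
    return trails


def relation(gene_a, gene_b, trails, transferred):
    """Classify one pair of homologs using the three rules from the text."""
    if gene_a in transferred or gene_b in transferred:
        return "xenolog"
    event = None
    for step_a, step_b in zip(trails[gene_a], trails[gene_b]):
        if step_a != step_b:
            break
        event = step_a[1]  # event at the deepest node shared so far
    return {"S": "ortholog", "D": "paralog"}[event]


trails = leaf_trails(GENE_TREE)
pairs = {pair: relation(*pair, trails, TRANSFERRED)
         for pair in combinations(sorted(trails), 2)}
counts = Counter(pairs.values())
print(counts)
```

For this tree the sketch yields 20 orthologous, 16 paralogous and 9 xenologous pairs out of the 45 possible pairs. A fuller treatment would also record HGT on internal branches and allow a pair to be both paralogous and xenologous.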
In particular, gene A1 (or B1, or C1, but not the horizontally acquired gene D1) is a paralog of any of the genes A2, A2′, B2, C2 and D2. Vice versa, gene A2 (or A2′, or B2, or C2, or D2) is a paralog of any of the genes A1, B1 and C1. The nearest common node of any of those pairs of genes is a duplication event. On the other hand, any two genes from the set {A1, B1, C1, E1} are orthologs, because they originated through speciation events. One can easily verify that the same is true for any pair from the sets {A2, B2, C2, D2, E1} and {A2′, B2, C2, D2, E1} (see Table 1).

Usually it is not pairs of orthologs that are of interest but sets of orthologs, because only orthologs are suitable for reconstructing the species phylogeny, whereas mixing in paralogs can lead to a wrong species tree. A set of orthologous genes can be defined as a set of genes in which every pairwise relationship is orthologous. In our example these sets are {A1, B1, C1, E1}, {A2, B2, C2, D2, E1} and {A2′, B2, C2, D2, E1}. It is evident that the set {A1, B2, C2, D2, E1} is not a set of orthologs, because gene A1 is a paralog of the genes B2, C2 and D2.

The term inparalog was introduced to specify genes that originated via a duplication event that happened after speciation, as is the case with genes A2 and A2′. Thus A2 is an inparalog of A2′, and A2′ is an inparalog of A2. Outparalogs are genes that arose via duplication before the speciation event. In our example the speciation events for genes A1, B1 and C1 happened after the duplication that produced genes 1 and 2, so genes A1, B1 and C1 are outparalogs of A2, A2′, B2, C2 and D2. Lastly, gene D1 is a xenolog of every other homologous gene, as it underwent an HGT event. The terminological assignments are summarized in Table 1.

Table 1. Orthologs, paralogs and xenologs according to the phylogenetic tree depicted in Fig. 1.

Orthologs: genes that arise via a speciation event (the nearest common node for any pair of genes is a speciation)
    {A1, B1}, {A1, C1}, {A1, E1}, {B1, C1}, {B1, E1}, {C1, E1}
    {A2, B2}, {A2, C2}, {A2, D2}, {A2, E1}
    {A2′, B2}, {A2′, C2}, {A2′, D2}, {A2′, E1}
    {B2, C2}, {B2, D2}, {B2, E1}, {C2, D2}, {C2, E1}, {D2, E1}
    Resulting sets of orthologs: {A1, B1, C1, E1}, {A2, B2, C2, D2, E1}, {A2′, B2, C2, D2, E1}

Paralogs: genes that arise via a duplication event (the nearest common node for any pair of genes is a duplication)
    {A1, A2}, {A1, A2′}, {A1, B2}, {A1, C2}, {A1, D2}
    {B1, A2}, {B1, A2′}, {B1, B2}, {B1, C2}, {B1, D2}
    {C1, A2}, {C1, A2′}, {C1, B2}, {C1, C2}, {C1, D2}
    {A2, A2′}

Inparalogs: genes that arise via a duplication event after the speciation event
    {A2, A2′}

Outparalogs: genes that arise via a duplication event before the speciation event
    {A1, A2}, {A1, A2′}, {A1, B2}, {A1, C2}, {A1, D2}
    {B1, A2}, {B1, A2′}, {B1, B2}, {B1, C2}, {B1, D2}
    {C1, A2}, {C1, A2′}, {C1, B2}, {C1, C2}, {C1, D2}

Xenologs: pairs in which one or both genes arise via horizontal gene transfer (the nearest common node is either a speciation or a duplication, but one of the genes underwent an HGT event)
    {D1, A1}, {D1, B1}, {D1, C1}, {D1, E1}
    {D1, A2}, {D1, A2′}, {D1, B2}, {D1, C2}, {D1, D2}

For 10 homologous genes there are C(10, 2) = 10!/(2! × 8!) = 45 possible pairs. In our example these 45 pairs are divided into three non-intersecting sets of relationships: 20 pairs of orthologs, 16 pairs of paralogs and 9 pairs of xenologs. However, whereas orthology and xenology are always mutually exclusive relationships, paralogy and xenology are not. Genes can be paralogs and xenologs at the same time. Imagine that an HGT event involved a species that originated after a duplication event; for example, suppose gene D1 had been horizontally transferred from species B, and not from E. In this case the xenologous relation of gene D1 to the other homologs would remain the same as in Table 1. In addition, gene D1 would be a paralog of any gene from group 2, {A2, A2′, B2, C2, D2}, as their nearest common node is a duplication event.

Finally, synologs are homologs that occur in a single organism through the fusion of two independent lines of descent. Bacterial genes that appear in a eukaryotic cell as a result of endosymbiosis are synologs to the corresponding homologs in the nuclear eukaryotic genome. For example, genes encoding the mitochondrial ATP-synthase, the chloroplast ATP-synthase, and the vacuolar type ATPase in eukaryotic species are synologs with respect to each other, since mitochondria and plastids are

Fig. 2. Examples of (A) pseudoorthologs and (B) pseudoparalogs.

thought to have evolved from free-living bacteria (Margulis, 1995). The distinction between xenologs and synologs is exactly this: in the case of xenologs, one of the genes is recognized as “foreign,” whereas in the case of synologs the organismal tree is reticulate, so neither of the homologs is foreign; they reflect the two lineages that merged.

Two additional terms, pseudoorthologs and pseudoparalogs, also deserve attention. The prefix pseudo- stands for a wrong assignment of either orthologs or paralogs. Pseudoorthologs can arise if different paralogs were lost in different taxa, and the remaining genes are considered orthologs when they are actually paralogs (Fig. 2A). Pseudoparalogs are genes that seem to be paralogs but, in reality, one of the seeming paralogs was acquired via HGT or as a result of the fusion of lineages (Fig. 2B).

A correct understanding of this terminology is important for developing algorithms for the automated selection of orthologous genes and the separation of orthologs from paralogs. It is almost impossible to make a reliable distinction between paralogs and orthologs based on sequence similarity alone. One needs to conduct a phylogenetic study to make the decision. Such a study is not easy to automate, so almost all methods for ortholog detection use semi-automated approaches in which similarity searches are done by machines and phylogenetic studies are performed manually by curators. The delineation between paralogs and orthologs can be crucial for the correct prediction of protein function. The functions of ninety-nine percent of all annotated proteins are inferred theoretically, on the principle of orthology and paralogy. In the following section we briefly review the approaches implemented in different gene family databases.

3. Current Methods for Ortholog Selection

3.1. Sequence Similarity as Basis for Ortholog Selection Methods

Evolution of genes in genomes, if considered as progressing in a series of elementary events, started from a single common ancestral gene that, by way of mutation, gave rise to the entire variety of homologous genes we observe today in different species. This notion of evolution bears the name of divergent evolution, whereby genes, once identical, diverged from each other by accumulating mutations. As a result we observe a number of gene sequences that are not completely identical but are similar to “a certain degree,” so that we can recognize them as related. The statistical theory of sequence similarity is beyond the scope of this chapter, and the reader is referred to the original articles (Karlin et al., 1990, 1993) or to related books (Korf et al., 2003; Baxevanis et al., 2005). The “degree of similarity” that allows us to conclude that sequences are related is based not only on the percent identity of two sequences, but also on the statistical significance of the matching fragments. The statistics developed by Karlin and Altschul were later employed in the program BLAST (Altschul et al., 1990), which is now the most popular program for homology searches. Most of the known methods for the detection of homologous genes are based on sequence similarity and genome-specific best hits. Once two sequences are significantly similar to each other, we consider them homologous.

When more than one complete bacterial genome sequence became available, the following questions arose immediately: How many common genes are shared between different species (see Chapter 4 for a detailed discussion of core genes)? Can common genes be hierarchically organized into gene families? These questions required methods for ortholog detection for any given set of different taxa, and criteria for assembling genes into gene families. As we will see, there is no single such method, just as there is no single such criterion. Attempts to develop such methods and find such criteria led to the creation of many gene family databases. Each database was composed according to its own set of rules.
We will briefly review the approaches employed in some of them.

3.2. Existing Databases of Protein Families

A number of databases of protein families have been developed in the past decade or so and are publicly available on the Internet. They include the following.

PIR (http://pir.georgetown.edu/pirwww/index.shtml), or the Protein Information Resource, was created by Margaret Dayhoff’s group (Barker et al., 1998). The original classification was based on sequence similarity only (Dayhoff, 1976, 1979). Sequences that show more than 50% identity were combined into the same family. Later on PIR extended its superfamily concept and developed the PIRSF system (Wu et al., 2004). Superfamilies in PIRSF are divided into two major classes: homeomorphic and domain superfamilies. In each homeomorphic superfamily, sequences must share similarity over their entire length and have the same domain architecture, meaning that the number, order and types of core domains must be the same. Domain superfamilies combine families linked by common domain regions that do not extend over the entire protein length. No distinction is made between paralogs and orthologs.

M. Poptsova

COG (http://www.ncbi.nlm.nih.gov/COG/), or the Clusters of Orthologous Groups database (Tatusov et al., 1997, 2001, 2003), is built upon BLAST hits. The minimum cluster in COG is a triangle whose sides correspond to best BLAST hit relationships. First, triangles formed by one-sided circular best BLAST hits are constructed, and then triangles with common sides are merged to form a cluster. The COG clusters consist of an undifferentiated mixture of orthologs and paralogs and are limited to a certain set of taxa.

Inparanoid (http://inparanoid.sbc.su.se/) (O’Brien et al., 2005) contains pairwise orthologs assembled by the reciprocal best BLAST hit method. Inparalogs are added to the orthologous pairs by the Inparanoid clustering method, which is based on sequence similarity scores (Remm et al., 2001). Only two taxa are considered at a time.

SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/), or the Structural Classification of Proteins database (Murzin et al., 1995), divides proteins into families based on their three-dimensional structures. Gene families generally comprise proteins whose sequences are 30% or more identical; in some cases, however, less identical sequences are included in the same family on the basis of structure and function (such as globins, which may share only 15% identity). Superfamilies are formed from sequences with low sequence identity whose structure and function suggest a common evolutionary origin. No distinction is made between paralogs and orthologs.

Pfam (http://www.sanger.ac.uk/Software/Pfam/) (Sonnhammer et al., 1998) is a protein family database based on multiple sequence alignments of known families. The initial BLAST analysis is complemented by HMM profiles and, as a result, each family is represented by four files: a seed multiple alignment with its HMM profile, and the resulting multiple alignment with its HMM profile.
Later, Pfam families were grouped into clans (Finn et al., 2006) based on similar structures, related functions, significant matching of the same sequence to HMMs from different families and profile–profile comparisons. “The main distinction between Pfam and most other protein family databases is that for all of Pfam, both the family definition and the search method span the entire domains, including not only conserved motifs but also less-conserved regions, insertions and deletions” (Finn et al., 2006). No distinction is made between paralogs and orthologs.

HOGENOM (http://pbil.univ-lyon1.fr/databases/hogenom.html), or the homologous gene family database, includes families from bacteria, archaea and eukaryota. Genes are grouped into families on the basis of BLAST similarity searches. The BLAST output is further filtered to remove homologous segment pairs (HSPs) that are not compatible with the global alignment. Two sequences in a pair are included in the same family if the remaining HSPs cover at least 80% of the protein lengths, their similarity is greater than or equal to 50%, and both sequences are complete. Partial sequences (longer than 100 amino acids, or at least 50% of the length of the complete proteins) are also included in the classification. No distinction is made between paralogs and orthologs.
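The HOGENOM inclusion rule described above can be sketched as a simple pairwise test. This is only an illustration under stated assumptions: the record type `PairAlignment`, its field names and the function `same_family` are hypothetical, coverage is assumed to be required on both proteins, and the separate rule for partial sequences is not modelled.

```python
from dataclasses import dataclass


@dataclass
class PairAlignment:
    """Hypothetical summary of the HSPs retained for one protein pair."""
    len_a: int            # length of protein A in residues
    len_b: int            # length of protein B in residues
    covered_a: int        # residues of A covered by the retained HSPs
    covered_b: int        # residues of B covered by the retained HSPs
    similarity: float     # percent similarity over the aligned regions
    complete_a: bool = True
    complete_b: bool = True


def same_family(pair, min_cov=0.80, min_sim=50.0):
    """Apply the thresholds quoted in the text (80% coverage, 50% similarity)."""
    if not (pair.complete_a and pair.complete_b):
        return False  # partial sequences follow a separate rule, not shown here
    cov_a = pair.covered_a / pair.len_a
    cov_b = pair.covered_b / pair.len_b
    return cov_a >= min_cov and cov_b >= min_cov and pair.similarity >= min_sim


# Made-up numbers: the first pair is covered well on both sides,
# the second only on the shorter protein.
good = same_family(PairAlignment(300, 310, 270, 265, 62.0))
poor = same_family(PairAlignment(300, 600, 270, 270, 62.0))
```

The second pair fails because only 45% of the longer protein is covered, which is the typical situation when two proteins share a single domain.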

Computational Techniques for Orthologous Gene Prediction in Prokaryotes

CATH (http://www.cathdb.info/latest/index.html) is a database of protein domain structures (Orengo et al., 1997). It classifies structures according to their (C)lass, (A)rchitecture, (T)opology or fold, and (H)omologous superfamily. Class is the simplest level; it describes the secondary structure composition of each domain. Architecture takes into account the shape revealed by the orientations of the secondary structure units, such as barrels and sandwiches. The topology level considers sequential connectivity, so that members of the same architecture may have quite different topologies. Proteins whose structures belong to the same T-level and show high similarity are put into the same homologous superfamily.

CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml), or the Conserved Domain Database, is based on domain alignments collected from Pfam, SMART and COG. Each domain in CDD is represented by a position-specific scoring matrix (PSSM) calculated from the domain multiple alignment. The search for conserved domains in a query protein sequence is implemented as a sequence search against a database of PSSM models using reverse position-specific BLAST (RPS-BLAST). Many domain families are linked to three-dimensional structures. All proteins in Entrez, as well as web-based BLAST, have links to the entries in CDD. The classification of domains into families is manually curated and is based on the classification used in the source databases for the multiple alignments: Pfam, SMART and COG.

4. Gene Family and Superfamily Classification

It would have been much easier if we had a single criterion that allowed us to combine homologous genes into families. For example, this criterion could be formulated as follows: “all genes that show more than 50% sequence similarity belong to one family”. That was actually the starting point for the classification proposed by Margaret Dayhoff’s group and implemented in the PIR database. The statement holds in many cases, but exceptions are abundant. Genes with low sequence similarity have been shown to have similar 3D structures, while eye lens proteins and the active lactate dehydrogenase and enolase found in other tissues are examples of proteins that are almost identical in sequence but perform totally unrelated functions (Wistow, 1995).

At present there is no agreement on the definition of the terms “gene family” and “superfamily”. Usually the term gene family implies homology, i.e. descent from a common ancestral gene, but this is a very sparse definition, because it does not indicate how far back in time a common origin should be traced. Each of the databases uses different criteria for homology assignment. Usually sequence similarity, though defined differently, is complemented by alignment profiles. Complex cases are always resolved manually by a curator. Table 2 gives a summary of the family-superfamily classification employed by the above-described databases.
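To make the single-threshold criterion concrete, the sketch below computes percent identity from a pair of already aligned sequences and compares it with a 50% cut-off. The function name, the toy sequences and the choice to ignore gapped columns are assumptions made for illustration; databases differ in how they normalize identity.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over alignment columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gapped columns are skipped (one of several conventions)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment: 6 ungapped columns, 4 of them identical.
identity = percent_identity("MKT-AYI", "MKSLAYV")
in_same_family = identity > 50.0  # the illustrative one-number criterion
```

A rule of this kind is easy to apply, which is why it was a natural starting point, but it cannot capture the exceptions discussed above.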

Table 2. Summary of the family-superfamily classification employed in different gene family databases.

PIR
  Family: > 50% sequence identity.
  Superfamily: (1) Homeomorphic superfamilies: sequences must share similarity over entire length and have the same domain architecture (this is actually what most researchers would call a family). (2) Domain superfamily: families linked by a single domain.

COG
  Family: Cluster is the analog of a family. Best BLAST hits organized into merged triangles.
  Superfamily: Clusters connected by PSI-BLAST searches.

SCOP
  Family: Generally, sequence identity of 30% and greater. Other evidence of similar functions and structures with lower sequence identities.
  Superfamily: “Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.” (Murzin et al., 1995)

Pfam
  Family: BLAST searches, complemented by HMM profiles.
  Superfamily: “We use up to four independent pieces of evidence to help assess whether families are related: related structure, related function, significant matching of the same sequence to HMMs from different families and profile–profile comparisons.” (Finn et al., 2006)

HOGENOM
  Family: BLAST similarity searches with the requirement that HSPs cover at least 80% of the protein lengths and that their similarity is greater than or equal to 50%.
  Superfamily: None.

CATH
  Family: Domains within each H-level are subclustered into sequence families using multi-linkage clustering combining sequence identity overlap of 80% and 35%, 50%, 95% and 100% sequence identity.
  Superfamily: Homologous superfamily, or H-level. The H-level groups together protein domains according to high sequence identity or structure comparison using SSAP.

CDD
  Family: Multiple alignments of families from Pfam, SMART and COG converted into PSSMs.
  Superfamily: Classification is inherited from Pfam, SMART and COG.

4.1. Attempt to Unify Different Family-Superfamily Classifications

With the ever-increasing number of available protein sequences, the need for a unified classification of gene families is apparent. One of the efforts in this direction is a document entitled “A Proposal For The PIRSF (PIR SuperFamily) Classification System” (http://pir.georgetown.edu/pirwww/about/doc/PIRSF.pdf). In the original PIR classification a homeomorphic superfamily was similar to what most researchers would call a family. Indeed, the requirement of similarity over the entire
length and the same domain architecture implies a very close relationship between genes. This issue was emphasized in the Proposal, and an improved version of PIRSF changed the terminology from homeomorphic superfamily to homeomorphic family. In its present form the proposed classification seems to be much better structured and better defined than any other classification scheme.

If we consider proteins as consisting of smaller units such as domains, then the evolution of proteins should be considered as the evolution of domains. If one were to compare proteins from homeomorphic families, one would see that the conserved regions correspond to domain regions, while inter-domain sequences are subject to many mutations. This property is also captured by the HMM profiles and seed alignments in the Pfam database. Another acknowledged fact derived from the practice of assembling multiple gene families is that proteins can be linked together by a single domain. The numerous BLAST hits that one obtains for a gene are often due to one common, well-conserved domain. In the PIRSF classification system, genes linked together by at least a single domain are grouped into so-called “domain superfamilies”, which are then classified into families according to their similarity over the entire length. An important property of the PIRSF database is that it is cross-referenced to many other classification schemes. Domain-based retrieval can identify common Pfam domains and compare the classification with CATH homology levels and SCOP superfamily levels.

Many proteins (approximately 65% in eukaryotes and 40% in prokaryotes according to a recent re-estimation) are multi-domain proteins (Ekman et al., 2005). Domain shuffling and rearrangements can potentially produce many protein variations. The conserved nature of domain combinations was revealed by Apic et al. (2001, 2003) and Bornberg-Bauer et al. (2005). Since protein domains can be treated as conserved evolutionary units, a classification of gene families based on the classification of domains seems more logical than one based on percent similarity or BLAST hits only. All protein queries submitted to NCBI BLAST are now searched for the presence of conserved domains.

In light of the above discussion, it is relevant to ask about the comparative advantages of the protein family databases and protein family classification systems. There is no definite answer to this question. Depending on the type of research and the individual tasks, some approaches can be preferred over others. The assembly of protein families is still a work in progress, but despite a large pool of non-overlapping gene sequences a unified system of gene family classification will eventually develop. Meanwhile, an individual researcher may find that none of the existing protein family databases resolves the questions he or she poses. Thus, while complete genome sequences for the species of interest may be available, the sets of orthologs for the given species may not be present in any of the existing databases. In the next section I describe which automated methods are used today for selecting sets of orthologs without accessing gene family databases.

5. Automated Methods for Selecting Families of Orthologs

Despite the existence of various gene family databases, one would like to be able to select all possible sets of orthologous genes from a particular set of target genomes. The databases may not contain the species of interest, or may not support the extraction of all orthologous sets for this particular set of taxa, or the researcher may for some other reason need to select gene families of the target species in the lab. Several methods are often used for this purpose.

5.1. Reciprocal Best BLAST Hit (RBH) Method

Since the release of the first complete bacterial genome sequence (of Haemophilus influenzae) in 1995, a widely used in-house method for identifying orthologs in a set of species has been the reciprocal best BLAST hit method (e.g. Montague et al., 2000; Zhaxybayeva et al., 2002). The method requires strongly conserved relationships among the orthologs: if a gene from species 1 has a gene from species 2 as its best hit when genome 1 is searched with BLAST against genome 2, then gene 2 must also have gene 1 as its best hit when genome 2 is searched against genome 1. For a set of species, the method requires all pairwise reciprocal connections between all species, as depicted in Fig. 3A. The reciprocal best BLAST hit method is very stringent and succeeds in selecting conserved orthologs with low false positive rates (Zhaxybayeva et al., 2003), but it often fails to identify orthologs in the presence of paralogs. Figure 3B illustrates how reciprocity is broken in the presence of a paralogous gene 2′ closely related to gene 2. Genes 2 and 2′ could be inparalogs that resulted from a recent gene duplication. In this example, gene 3 has gene 2′ instead of gene 2 as its best BLAST hit, which prevents both paralogs from being appropriately recognized as orthologs.

Fig. 3. The reciprocal best BLAST hit (RBH) method. Each node represents a gene from a different genome (numbers correspond to genomes); arrows signify best BLAST hits.
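The all-pairs reciprocity requirement can be sketched as follows. The data layout is an assumption made for illustration: `best[(x, y)]` is a hypothetical dictionary mapping each gene of genome x to its best BLAST hit in genome y, and the gene names are invented. The second call mimics the situation of Fig. 3B, where gene 3 prefers the paralog 2′ (here `g2p`).

```python
from itertools import combinations


def is_reciprocal(best, ga, a, gb, b):
    """True if gene a (genome ga) and gene b (genome gb) are each other's best hit."""
    return best[(ga, gb)].get(a) == b and best[(gb, ga)].get(b) == a


def rbh_families(best, genomes):
    """Families with one gene per genome in which every pair is a reciprocal best hit."""
    first = genomes[0]
    families = []
    for seed in best[(first, genomes[1])]:
        members = {first: seed}
        for g in genomes[1:]:
            hit = best[(first, g)].get(seed)
            if hit is None:
                break  # the seed gene has no hit in genome g
            members[g] = hit
        else:
            # Strict RBH: every pair of members must be reciprocal, not just seed pairs.
            if all(is_reciprocal(best, ga, members[ga], gb, members[gb])
                   for ga, gb in combinations(genomes, 2)):
                families.append(members)
    return families


genomes = ["G1", "G2", "G3"]
best = {
    ("G1", "G2"): {"g1": "g2"}, ("G2", "G1"): {"g2": "g1", "g2p": "g1"},
    ("G1", "G3"): {"g1": "g3"}, ("G3", "G1"): {"g3": "g1"},
    ("G2", "G3"): {"g2": "g3", "g2p": "g3"}, ("G3", "G2"): {"g3": "g2"},
}
complete = rbh_families(best, genomes)   # all connections are reciprocal

best[("G3", "G2")]["g3"] = "g2p"         # gene 3 now prefers the paralog
broken = rbh_families(best, genomes)     # the family is lost
```

A single changed best hit is enough to discard the whole family, which illustrates why the strict method is stringent but sensitive to inparalogs.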

Fig. 4. Best BLAST hit relationships between ATP synthases from Escherichia coli, Bacillus subtilis, Methanosarcina mazei and Sulfolobus solfataricus: (A) subunit A, (B) subunit B.

Figure 4 gives an example in which reciprocity fails for four species (two archaea and two bacteria) for a conserved, anciently duplicated protein, the ATP-binding subunits of ATP synthase. In the case of the catalytic subunits (ATP-A, Fig. 4A), reciprocity is broken because ATP-A from Escherichia coli and from Bacillus subtilis have the more conserved subunit B (ATP-B) from Sulfolobus solfataricus as their best BLAST hit. In the case of the ATP-B family (Fig. 4B), the situation is further complicated by the presence of a third paralog frequently found in bacterial species, a paralog involved in the assembly of the bacterial flagella (Vogler et al., 1991) (here denoted ATP-F), which is the best hit for ATP-B from the archaeon Methanosarcina mazei. As a result, neither ATP-A nor ATP-B is selected as a gene family when a strict reciprocal best BLAST hit method is applied. In many bacteria, additional ATP-A paralogs exist that make the recognition of orthologs even more difficult: a Rho transcription termination factor involved in unwinding the RNA transcript from the encoding DNA, and an ATPase that is part of type III secretion systems and is similar to ATP-F. In contrast to the reciprocal best BLAST hit approach, a phylogenetic tree reconstructed for all the genes collected from both diagrams of Fig. 4 places ATP-A, ATP-B and ATP-F on separate branches, forming three distinct clusters that represent the three gene families (see Fig. 5). Because of these complications, the RBH method is sometimes replaced by triangular (as in COG) or similar circular best BLAST hit methods. The disadvantage of the so-called circular methods is the inclusion of paralogs, whereas the strict RBH approach frequently excludes orthologs when inparalogs exist in some of the genomes. The main problem of the RBH method is that it considers only local alignments, i.e. the parts of the sequences that were well conserved and thus detected by BLAST as significantly similar.
Phylogenetic reconstruction takes into account global alignment, and groups sequences according to the full-length sequence data. The idea that, in a mixture of orthologs and paralogs, more closely related genes are placed on one branch emerging from one node of a tree was implemented in the phylogenetic clustering algorithm BranchClust (Poptsova et al., 2007) (more discussion can be found later in this chapter).

Fig. 5. Phylogenetic tree reconstructed for a superfamily of ATP synthase subunits. The sequences from Sulfolobus solfataricus, Methanosarcina mazei, Bacillus subtilis and Escherichia coli form three clusters: the families of ATP-A, ATP-B and ATP-F.

5.2. Reciprocal Smallest Distance (RSD) Method

To overcome the problems inherent in the reciprocal best BLAST hit approach, the Reciprocal Smallest Distance (RSD) method was proposed by Wall et al. (2003). The method relies on global sequence alignment and maximum likelihood estimation of evolutionary distances to detect orthologs between two genomes. It starts with BLAST as the first step but, rather than confining itself to the best hit, it retains all significant hits below some cutoff value. All of the selected sequences are then globally aligned with ClustalW (Thompson et al., 1994), and for each pair the evolutionary distance is calculated by PAML (Yang, 1997). The evolutionary distance is defined as the maximum likelihood estimate of the number of amino acid substitutions separating the two protein sequences, given an empirical amino acid substitution matrix. Of all the selected sequences, only the one yielding the shortest distance to the original query is retained. This sequence is then used for a reciprocal BLAST search against the first genome. If, in this reciprocal search, the selected protein sequence again produces the shortest distance with the original query sequence, both are selected as a putative pair of orthologs. The RSD method finds more putative orthologs than RBH (Wall et al., 2003), because it is less likely to be misled by the presence of a close paralog. The advantage of RSD over RBH is that it uses global rather than local sequence alignments. The generated evolutionary distances are maximum
likelihood estimates of the number of amino acid substitutions separating any two protein sequences, given an empirical amino acid substitution matrix. The RSD method is implemented in RoundUp (http://rodeo.med.harvard.edu/tools/roundup) (Deluca et al., 2006), a web-based tool for selecting putative orthologs from any number of species. A user can obtain a set of orthologs for selected genomes together with their phylogenetic profiles (Pellegrini et al., 1999), based on pre-computed pairs of sequences selected by the RSD method.
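The control flow of the RSD procedure can be sketched as below. This is not the RoundUp implementation: the function `rsd_orthologs` and its arguments are hypothetical, the lists of significant BLAST hits are supplied as plain dictionaries, and `distance` is a stand-in for the ClustalW alignment followed by the PAML maximum likelihood distance. The distance values are invented.

```python
def rsd_orthologs(queries, hits_1_to_2, hits_2_to_1, distance):
    """Return (gene1, gene2) pairs that are each other's smallest-distance hit.

    hits_x_to_y maps a gene to the list of its significant BLAST hits in the
    other genome; distance(a, b) returns an evolutionary distance estimate.
    """
    pairs = []
    for q in queries:
        forward = hits_1_to_2.get(q, [])
        if not forward:
            continue
        # Keep the hit with the smallest distance, not the top-scoring BLAST hit.
        closest = min(forward, key=lambda s: distance(q, s))
        backward = hits_2_to_1.get(closest, [])
        if not backward:
            continue
        closest_back = min(backward, key=lambda s: distance(closest, s))
        if closest_back == q:
            pairs.append((q, closest))
    return pairs


# Toy data: BLAST ranks the paralog b_par first, but b is closer in distance.
table = {frozenset(("a", "b")): 0.3, frozenset(("a", "b_par")): 0.5}


def toy_distance(x, y):
    return table[frozenset((x, y))]


pairs = rsd_orthologs(
    ["a"],
    {"a": ["b_par", "b"]},
    {"b": ["a"], "b_par": ["a"]},
    toy_distance,
)
```

In this toy case a best-hit method would have paired `a` with the paralog `b_par`, while the distance-based choice recovers the pair (`a`, `b`).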

6. Phylogenomic-based Approaches

6.1. Tree Reconciliation Method

Tree-based methods for selecting orthologs originated with the problem of reconciling gene trees with species trees. One of the first tree reconciliation problems was the task of reconciling the haemoglobin gene history with that of mammalian species (Goodman et al., 1979). In the case of mammalian species with a “non-questionable” species tree, the history of a particular gene may differ from that of the species only because of gene duplication, gene loss and different expression of distinct sister genes in different species. The reconciliation task is a parsimony problem: finding the minimum number of events (gene duplications, gene losses, gene expression) that reconciles two incongruent trees. Figure 6 depicts a hypothetical species tree, a gene tree and the resulting reconciled tree. To explain the different positions of species B and C in the gene tree and the species tree, one duplication and three loss events are inferred from the reconciled tree.

The theory of reconciled trees and tree mapping algorithms was developed by Page (1994), Mirkin et al. (1995), Eulenstein (1997), Page et al. (1997), Zmasek et al. (2001) and Dufayard et al. (2005). A modification of Eulenstein’s linear algorithm for tree mapping was implemented in the program GeneTree (Page, 1998). The program calculates the reconciled tree and depicts it with a series of inferred gene duplication and gene loss events.

If we consider, for example, a worm, a fish, a bird and a mammal, then the species tree will not raise any controversies. In many cases, however, especially in the case of prokaryotes, or even for closely related taxa of plants and animals, the species tree is unknown. If the species tree selected for the reconciliation task is wrong, then all the inferences of gene duplication and gene loss will also be wrong. The problem of the reliability of the species tree and its usage in tree mapping algorithms was raised by Storm et al. (2002). These authors proposed a method that uses a set of bootstrap trees instead of one species tree. The frequency of an orthology assignment among the bootstrap trees can be used as a support value for possible orthology.

The main problem that restricts the applicability of tree reconciliation methods to prokaryotic genomes is the requirement for a known species tree. Another serious limitation is that the incongruence between a species tree and a gene tree is explained only by means of gene duplications and losses, whereas for prokaryotes it often results from HGT events.

Fig. 6. Tree reconciliation method. A, B, C and D denote different taxa in the species tree and, at the same time, the genes from the corresponding species.
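One common way to infer duplications during reconciliation is to map every gene-tree node to the last common ancestor (LCA) of its species in the species tree; a node that maps to the same species-tree node as one of its children is then counted as a duplication. The sketch below illustrates only this step and is not the GeneTree algorithm: trees are written as nested tuples with species names as leaves, all function names are hypothetical, and loss events are not counted. The example trees are invented and do not reproduce Fig. 6.

```python
def clades(tree):
    """Return the leaf set (frozenset) of every node of a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def lca(species_clades, species):
    """Smallest clade of the species tree that contains all the given species."""
    return min((c for c in species_clades if species <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes that map to the same species-tree node as a child."""
    species_clades = clades(species_tree)
    duplications = 0

    def walk(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca(species_clades, frozenset([node]))
        child_maps = [walk(child) for child in node]
        here = lca(species_clades, frozenset().union(*child_maps))
        if any(here == m for m in child_maps):
            duplications += 1
        return here

    walk(gene_tree)
    return duplications


species_tree = (("A", "B"), ("C", "D"))
congruent = count_duplications((("A", "B"), ("C", "D")), species_tree)
incongruent = count_duplications((("A", "C"), ("B", "D")), species_tree)
```

With a wrong species tree the same procedure would report spurious duplications, which is the reliability problem discussed above; HGT is not modelled at all.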

6.2. Phylogenetic Clustering Method: BranchClust

A phylogenetic approach to assembling orthologous sets that does not require a known species tree was proposed in BranchClust (http://www.bioinformatics.org/BranchClust) (Poptsova et al., 2007). BranchClust is a clustering algorithm that parses trees to delineate families of orthologs within a superfamily containing several paralogous gene families. The underlying idea is that closely related genes are placed on one branch emerging from one node of a tree (see Fig. 5), so the task of detecting families for different taxa is simply the task of detecting branches that contain groups of genes from all, or almost all, species. For a detailed description of the algorithm the reader is referred to the original paper. Here I will briefly outline the major steps and assumptions.

The method starts with the selection of so-called superfamilies, i.e., sets of genes containing mixtures of orthologs and paralogs. Superfamilies are collected by running BLAST for all n genomes against a database composed of the same n genomes, and including all significant hits for each gene from every genome in the superfamilies. The sequences from the resulting superfamilies are aligned and a phylogenetic tree is reconstructed by any preferred method of tree reconstruction (currently, ClustalW 1.83 is used by default, both for sequence alignment with default parameters and for phylogenetic reconstruction with correction for multiple substitutions). The superfamily trees are then parsed to split them into sub-trees containing families of orthologs. The algorithm starts the selection from the most distant leaf in a tree and then descends along the tree node by node, adding branches and calculating the total number of different species in the sub-trees. The process continues until the node either becomes complete or is shown to contain a side branch that serves as a “stopper”. The branch that serves as a “stopper” is defined by the number of different taxa it covers, which can be adjusted via the parameter MANY.

When a tree containing several clusters (i.e., families of orthologs) is submitted to the BranchClust algorithm, it is arbitrarily rooted: it can be rooted inside any cluster or anywhere in between (see the example below). To avoid artifacts caused by the placement of the initial root, the selection is repeated for the tree rooted at the opposite end. We report as the final clustering the one that minimizes the number of paralogs. The process of selection with tree re-rooting is illustrated in Fig. 7. Figure 7A shows a hypothetical unrooted tree for a set of 5 taxa A, B, C, D and E. The parameter MANY, or the number of species required for a branch to be a “stopper”, is set to 4 (i.e., a branch containing 4 different taxa will serve as a “stopper”).
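The bookkeeping behind the terms “complete” and “stopper” can be sketched as a count of distinct taxa below a node. This is not the BranchClust code: the nested-tuple tree, the leaf label convention `taxon|gene`, the function names and the reading of MANY as “at least MANY taxa” are all assumptions made for illustration.

```python
def taxa_below(node):
    """Set of distinct taxa among the leaves below a node.

    Leaves are strings of the (hypothetical) form 'taxon|gene'; internal
    nodes are tuples of child nodes.
    """
    if isinstance(node, str):
        return {node.split("|")[0]}
    taxa = set()
    for child in node:
        taxa |= taxa_below(child)
    return taxa


def branch_status(node, n_taxa, many):
    """Classify a branch by the number of different taxa it covers."""
    covered = len(taxa_below(node))
    if covered == n_taxa:
        return "complete"   # every taxon is represented
    if covered >= many:
        return "stopper"    # assumed rule: at least MANY different taxa
    return "growing"        # keep adding branches


# Five taxa A to E with MANY = 4, as in the example of Fig. 7.
small = ("A|1", "B|1")
side = (("A|1", "B|1"), ("C|1", ("D|1", "A|2")))
full = (side, "E|1")
statuses = [branch_status(b, n_taxa=5, many=4) for b in (small, side, full)]
```

Note that the duplicated taxon A in `side` is counted once, so a branch with five leaves can still cover only four taxa.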
The algorithm runs twice with two different roots, which are chosen as the two nodes most distant from each other. The process of root selection for the two independent runs is shown in Figs. 7B–D. Figures 7C and 7E–H show how BranchClust works for the tree rooted at root 1. The algorithm is then applied to the tree rooted at root 2 (Fig. 7D), and the results of the two runs are compared by calculating the number of paralogs in each. The clustering that contains the smallest number of paralogs is selected.

Using two trees rooted at opposite ends helps to solve a problem that arises in the case of two incomplete clusters. This problem, and how the implemented approach addresses it, are illustrated by the clustering of the penicillin-binding protein superfamily for a set of 13 gamma proteobacteria (Fig. 8A). The superfamily consists of 28 members that form two distinct clusters in the tree: one is a branch with 15 leaves and 13 different taxa, forming a complete cluster; the other cluster is incomplete, containing only 13 members from 12 different species. The results of applying the BranchClust algorithm in this case depend on the starting point, or the root of the tree. If we start selection inside Cluster 1, we will select the complete Cluster 1 and remove it from the tree, and the remaining tree will be the incomplete Cluster 2.
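The comparison between the two rooted runs reduces to simple arithmetic, sketched below. Each cluster is written as a (leaves, taxa) pair, and the number of paralogs in a clustering is the sum of leaves minus taxa over its clusters. The function name is hypothetical; the numbers are the cluster sizes given in this section for the penicillin-binding protein superfamily (15:13 and 13:12 when starting inside Cluster 1, 23:13 and 5:5 when starting inside Cluster 2).

```python
def paralog_count(clustering):
    """Sum over clusters of (number of leaves - number of different taxa)."""
    return sum(leaves - taxa for leaves, taxa in clustering)


run_root1 = [(15, 13), (13, 12)]   # selection started inside Cluster 1
run_root2 = [(23, 13), (5, 5)]     # selection started inside Cluster 2

counts = [paralog_count(run) for run in (run_root1, run_root2)]
selected = min((run_root1, run_root2), key=paralog_count)
```

The first run yields 3 paralogs and the second 10, so the clustering obtained from root 1 is the one reported.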

Fig. 7. An example of the BranchClust selection steps for a superfamily tree for 5 different taxa with 3 clusters.

However, if we start selection inside Cluster 2, we will skip the node containing Cluster 2 and continue selection to form a complete branch. This results in the following clustering: 23:13, 5:5, meaning that one branch contains 23 leaves with 13 different taxa and the second contains 5 leaves with 5 different taxa. The number of paralogs is given by the difference between the number of leaves on a branch and the number of different taxa. In the latter case this number is (23 − 13) + (5 − 5) = 10, whereas in the first run the difference is (15 − 13) + (13 − 12) = 3. We select the run that yields the minimum number of paralogs.

Fig. 8. (A) Superfamily of penicillin-binding proteins for 13 gamma proteobacteria, forming Cluster 1 (15:13) and Cluster 2 (13:12). (B) BranchClust output:

------------ CLUSTER 1 ------------
---------- FAMILY -----------
>gi|27904705| peptidoglycan synthetase FtsI [Buchnera aphidicola str. Bp (Baizongia pistaciae)]
>gi|26246017| Peptidoglycan synthetase ftsI precursor [Escherichia coli CFT073]
>gi|16273058| penicillin-binding protein 3 [Haemophilus influenzae Rd KW20]
>gi|15602001| FtsI [Pasteurella multocida subsp. multocida str. Pm70]
>gi|15599614| penicillin-binding protein 3 [Pseudomonas aeruginosa PAO1]
>gi|16763512| division specific transpeptidase [Salmonella typhimurium LT2]
>gi|15642404| penicillin-binding protein 3 [Vibrio cholerae O1 biovar eltor str. N16961]
>gi|32490961| hypothetical protein WGLp212 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis]
>gi|21230194| penicillin-binding protein 3 [Xanthomonas campestris pv. campestris str. ATCC 33913]
>gi|21241544| penicillin-binding protein 3 [Xanthomonas axonopodis pv. citri str. 306]
>gi|15837394| penicillin binding protein 3 [Xylella fastidiosa 9a5c]
>gi|16120877| penicillin-binding protein 3 [Yersinia pestis CO92]
>gi|22127506| peptidoglycan synthetase [Yersinia pestis KIM]
COMPLETE: 13
>>>>> IN-PARALOGS ----------
>gi|16765177| putative penicillin-binding protein 3 [Salmonella typhimurium LT2]
>gi|15597468| penicillin-binding protein 3A [Pseudomonas aeruginosa PAO1]
------------ CLUSTER 2 ------------
---------- FAMILY -----------
>gi|26246616| Penicillin-binding protein 2 [Escherichia coli CFT073]
>gi|16272007| penicillin-binding protein 2 [Haemophilus influenzae Rd KW20]
>gi|15603789| Pbp2 [Pasteurella multocida subsp. multocida str. Pm70]
>gi|15599198| penicillin-binding protein 2 [Pseudomonas aeruginosa PAO1]
>gi|16764017| cell elongation-specific transpeptidase [Salmonella typhimurium LT2]
>gi|15640966| penicillin-binding protein 2 [Vibrio cholerae O1 biovar eltor str. N16961]
>gi|32490921| hypothetical protein WGLp172 [Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis]
>gi|21232896| penicillin-binding protein 2 [Xanthomonas campestris pv. campestris str. ATCC 33913]
>gi|21241430| penicillin-binding protein 2 [Xanthomonas axonopodis pv. citri str. 306]
>gi|15837913| penicillin binding protein 2 [Xylella fastidiosa 9a5c]
>gi|16122817| penicillin-binding protein 2 [Yersinia pestis CO92]
>gi|22125081| peptidoglycan synthetase, penicillin-binding protein 2 [Yersinia pestis KIM]
INCOMPLETE: 12
>>>>> IN-PARALOGS ----------
>gi|16765252| putative penicillin-binding protein [Salmonella typhimurium LT2]

Once a cluster is isolated, a family containing one representative from each taxon is selected, and duplicated genes inside that cluster are identified as inparalogs. For example, in the case of the penicillin-binding protein superfamily (Figs. 8A and 8B), cluster 1 contains two inparalogs, one from Salmonella typhimurium, gi 16765177, and the other from Pseudomonas aeruginosa, gi 15597468. The selection starts from the most distant leaf on a tree, and the gene copy that is closest to the top of the branch is reported as part of the family, while other copies are reported as inparalogs. We call genes “out-of-cluster paralogs” if they are located inside a superfamily, but not on the branch containing a selected cluster. Note that for a given cluster all other genes from the same superfamily are “outparalogs”. We do not include all of these outparalogs in our clustering reports, because this would just list all genes in the superfamily not included in that cluster. The concept of “out-of-cluster paralogs” is illustrated in Fig. 9, which depicts the superfamily of DNA-binding proteins and integration host factors for 13 gamma proteobacteria. The second copy of the gene from Pseudomonas aeruginosa, gi 15600541, is reported as an out-of-cluster paralog (Fig. 9) because Pseudomonas contains one copy inside each cluster.

The BranchClust algorithm was initially tested on four different sets of genomes: 2 bacteria and 2 archaea, 13 gamma proteobacteria, 14 archaea, and 30 bacteria and
Computational Techniques for Orthologous Gene Prediction in Prokaryotes


[Figure 9: gi-numbered sequences grouped into Cluster 1 (11:11), Cluster 2, Cluster 3 (12:11) and Cluster 4; Clusters 2 and 4 carry the labels 9:9 and 10:10.]

Fig. 9. Superfamily of DNA-binding proteins and integration host factors for 13 gamma proteobacteria.

Table 3. Comparison of the best BLAST hit method and the BranchClust algorithm. Entries give the number of selected families (A: Archaea, B: Bacteria).

Number of taxa    Reciprocal best BLAST hit    BranchClust
2A2B              80                           414 (all complete)
13B               236                          2066 (369 complete, 1690 with n ≥ 8)
14A               125                          1431 (300 complete, 1131 with n ≥ 8)
14B 16A           12                           195 (80 complete, 195 with n ≥ 24)

archaea together. Table 3 compares the number of families of orthologs selected by the reciprocal best BLAST hit method and by the BranchClust algorithm. The homologs of ATPase/ATP synthase catalytic subunits provide a good test case for exploring the limits of the algorithms in assembling families of orthologs. This superfamily includes ancient paralogs and recent gene duplications, and the homologs that are part of the type III secretion system include genes that are frequently horizontally transferred and found in pathogenicity islands (Winstanley et al., 2001; Dobrindt et al., 2004). Examples of clustering the ATP synthase superfamily for 13 gamma proteobacteria, for 30 bacteria and archaea, and for 317 bacteria and archaea are given in Fig. 10. In all cases BranchClust recognizes complete clusters for ATP-A and ATP-B, as well as clusters for Rho-termination factor and ATP-F and type III secretion


M. Poptsova

Fig. 10. ATP synthase superfamilies for (A) 13 gamma proteobacteria, (B) 30 bacteria and archaea and (C) 317 bacteria and archaea.

system ATPases. More examples of the BranchClust analysis can be found on the BranchClust website.

As was discussed earlier in this chapter, there is no agreement on the definitions of the terms gene family and superfamily. In the BranchClust algorithm all recognized homologs (para- and orthologs) were assembled under the label superfamily. The term “gene family” in BranchClust denotes a collection of orthologs, where each species contributes one or, in the case of inparalogs, several genes to the family. Superfamilies are composed of families related to each other via significant BLAST hits, i.e., superfamilies correspond to single-linkage clusters or domain superfamilies in the PIRSF classification system. As a consequence, some superfamilies will contain families of orthologs that are joined via a single fusion protein. The implemented phylogenetic reconstruction of such a superfamily places the families of orthologs in distinct branches of the superfamily tree. The BranchClust algorithm is not restricted to the proposed method of assembling superfamilies. Rather, BranchClust allows one to analyze superfamilies assembled under any other selection criteria; e.g., the pre-computed families from PIRSF, Pfam, COG, or HOBACGEN could be submitted for further BranchClust clustering.

7. Summary

The majority of the existing gene family databases contain an undifferentiated mixture of orthologs and paralogs, which require further separation to perform a


comprehensive analysis of orthologs in the requested prokaryotic species. The other problem that restricts the usability of these databases is that complete prokaryotic genomes are now being sequenced much faster than the databases can update their contents. Besides, the criteria for homology assignment and gene family-superfamily classification vary from database to database, and no agreement has been achieved so far, though an attempt in this direction was made in the PIRSF proposal for a classification system. Researchers in the field of comparative genomics of prokaryotes are in need of tools that would allow them to select all orthologous gene sets from any set of prokaryotic species.

Automated methods for ortholog selection are still under development, though the existing algorithms can be used effectively for certain purposes, given a clear understanding of the restrictions and applicability of each method. The RBH method selects very conservative sets of orthologs, but frequently fails to assemble families containing paralogs, in particular inparalogs. The RSD method performs much better, and the developed Rodeo tool allows for easy extraction of orthologs for a requested set of species. The orthologous sets selected by the RSD method contain the RBH-selected orthologs as a subset, but the RSD method neither distinguishes nor keeps track of inparalogs. The phylogenetic methods based on reconciliation of a species tree with a gene tree are not applicable to prokaryotes in most cases, as the species tree is unknown and incongruence in tree topology could be caused by HGT rather than by sequential gene duplication-loss events. A phylogenetic clustering method such as BranchClust overcomes the requirement of a known species tree by analyzing unrooted phylogenetic trees reconstructed from the mixture of orthologs and paralogs. 
It effectively selects complete and incomplete clusters of putatively orthologous genes, including inparalogs arising through lineage-specific gene amplification. Because it distinguishes incomplete clusters, it makes it easy to track lineage-specific gene loss events. A limitation of the BranchClust method is that if the clustering parameter MANY is too large, two sets of orthologs will be fused, while if it is too small, a cluster may be split into smaller groups. The optimal value can be found by multiple test runs, but individual superfamilies will always remain as exceptions and require manual curation. Growing data on 3D protein structures, together with 3D structure domain databases, are likely to lead to a new breakthrough in our understanding of homology, which, in turn, will demand improved automated methods for ortholog prediction.
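The conservative behaviour of the RBH method summarized above can be illustrated with a minimal sketch. It assumes the best BLAST hit of every gene has already been computed in both directions and stored in two dictionaries; the function name `reciprocal_best_hits` and the gene identifiers are hypothetical, chosen only for illustration, and are not taken from any published tool.

```python
from typing import Dict, List, Tuple


def reciprocal_best_hits(best_a_to_b: Dict[str, str],
                         best_b_to_a: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    pairs = []
    for gene_a, gene_b in best_a_to_b.items():
        # Keep the pair only if the relationship holds in both directions.
        if best_b_to_a.get(gene_b) == gene_a:
            pairs.append((gene_a, gene_b))
    return sorted(pairs)


# Hypothetical best-hit tables for genomes A and B (assumed input, not real data).
# a2 and a3 both hit b2, as would happen if a3 were an inparalog of a2.
best_a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
best_b_to_a = {"b1": "a1", "b2": "a2", "b3": "a3"}

rbh_pairs = reciprocal_best_hits(best_a_to_b, best_b_to_a)
print(rbh_pairs)
```

In this toy input the gene a3 is dropped even though it hits b2, which mirrors the point made above that RBH tends to lose inparalogs.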

8. Further Reading

On Orthology-Paralogy Paradigm:
Gogarten, J. P. (1994). Which is the most conserved group of proteins? Homology-orthology, paralogy, xenology, and the fusion of independent lineages. J Mol Evol 39(5):541–543.


Fitch, W. M. (2000). Homology: a personal view on some of the problems. Trends Genet 16(5):227–231.
Koonin, E. V. (2005). Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 39:309–338.

On Tree-Mapping Algorithms:
Eulenstein, O. (1997). A linear time algorithm for tree mapping. Arbeitspapiere der GMD 1046.
Zmasek, C. M. and S. R. Eddy (2001). A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 17(9):821–828.

On Phylogenetic Clustering Algorithm:
Poptsova, M. S. and J. P. Gogarten (2007). BranchClust: a phylogenetic algorithm for selecting gene families. BMC Bioinformatics 8:120.

On Protein Function Prediction:
Whisstock, J. C. and A. M. Lesk (2003). Prediction of protein function from protein sequence and structure. Q Rev Biophys 36(3):307–340.
Friedberg, I. (2006). Automated protein function prediction–the genomic challenge. Brief Bioinform 7(3):225–242.

On Protein Domains as Evolutionary Units:
Bornberg-Bauer, E., F. Beaussart, S. K. Kummerfeld, S. A. Teichmann and J. Weiner, 3rd (2005). The evolution of domain arrangements in proteins and interaction networks. Cell Mol Life Sci 62(4):435–445.
Apic, G., W. Huber and S. A. Teichmann (2003). Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J Struct Funct Genomics 4(2–3):67–78.

CHAPTER 10 COMPUTATIONAL ELUCIDATION OF OPERONS AND UBER-OPERONS

PHUONGAN DAM, FENGLOU MAO, DONGSHENG CHE, PING WAN, THAO TRAN, GUOJUN LI and YING XU

1. Introduction

The operon model was first proposed by Monod and Jacob (Jacob et al., 1960; Jacob et al., 1961) in 1960 to describe the transcriptional regulation of genes involved in the metabolism of lactose in E. coli. Jacob and Monod made key experimental and theoretical discoveries that suggested how a set of genes may be coordinately transcribed in response to the presence of lactose, providing a new model for transcriptional regulation in prokaryotes (Lewis, 2005). For this work, they were awarded the Nobel Prize in Physiology or Medicine in 1965. Further studies showed that the lac operon consists of three genes, lacZ, lacY and lacA, that are co-transcribed in a single polycistronic RNA when E. coli is grown in the presence of lactose. The RNA is called polycistronic because it contains multiple genes, in contrast to a monocistronic RNA, which contains one gene.

Subsequently, the operonic organization of genes working in the same biological process was found to be a common theme in both bacteria and archaea (Leigh, 2000; Xie et al., 2003). For example, operon structures have been observed for bacterial genes encoding enzymes involved in amino acid biosynthesis, ribosomal proteins and RNA molecules, RNA polymerase subunits, or membrane transporter subunits. In archaea, operons containing genes encoding the ribosomal RNA molecules, enzymes involved in tryptophan biosynthesis or the acetyl-CoA decarbonylase/synthase multi-enzyme complex have been observed (Grahame et al., 2005). Furthermore, polycistronic transcripts in eukaryotes were observed for the first time in 1988 in trypanosomes (Agabian, 1990), and later in C. elegans (Spieth et al., 1993), flatworms (Davis et al., 1997) and primitive chordates (Ganot et al., 2004).

Attempts at genome-wide operon prediction, however, did not begin until the mid-to-late 1990s, after a number of prokaryotic genomes had been sequenced. The earliest prediction methods for operons relied on known promoter and terminator information (Yada et al., 1999). Salgado et al. (2000) were the first to use intergenic distance in operon prediction, which has proved to be generally applicable for


prokaryotic genomes (Moreno-Hagelsieb et al., 2002). The tendency for genes within the same operon to have shorter intergenic distances than genes in adjacent operons has made intergenic distance a useful feature for operon prediction. Furthermore, conservation of gene pairs from the same operon across multiple genomes represents another important feature for operon prediction (Ermolaeva et al., 2001). Subsequent papers explored various other features for operon prediction, including codon usage (Bockhorst et al., 2003; Price et al., 2005), short DNA motifs (Dam et al., 2007), functional similarities among genes in the same operons (Chen et al., 2004b; Dam et al., 2007; Tran et al., 2007; Wang et al., 2004; Zheng et al., 2002), and similar text-based functional annotations (Westover et al., 2005). The details of feature selection and methods used for operon prediction will be discussed later.

The term uber-operon, a related concept, was coined by Lathe et al. (2000). Essentially, an uber-operon is a group of functionally related operons whose union is conserved across multiple genomes. One interesting observation about uber-operons is that genes of the same uber-operon tend to have lower functional relatedness than genes from the same operon, while they tend to have higher functional relatedness than genes from the same regulon (Che et al., 2006); here lower and higher refer to the scoring values computed by the authors. This suggests that uber-operons could represent an important layer of genomic structure. Lathe et al. (2000) were the first to develop a computational procedure to identify uber-operons, using two genomes and assuming that orthologous gene relationships across the two genomes are given. Several other groups have since proposed various approaches to predicting uber-operons (Campillos et al., 2006; Che et al., 2006; Figeac et al., 2004; Janga et al., 2005). The methods primarily differ in how they derive orthologous gene relationships. 
The details of these algorithms will be discussed later in this chapter.

It is generally believed that cellular machineries in prokaryotic cells are organized into multiple layers. At the bottom layer are operons (Jacob et al., 1960; Jacob et al., 1961), each of which contains a set of genes co-transcribed into one RNA. Each uber-operon (Lathe et al., 2000) contains a set of functionally related operons whose union is conserved across multiple genomes, while a regulon (Kremling et al., 2000; Wagner, 2000) is composed of a group of transcriptionally co-regulated operons; regulon prediction is discussed in Chapter 11. Groups of regulons are in turn organized into modulons (Kremling et al., 2000; Wagner, 2000), which are regulated transcriptionally by global regulators, typically in response to changes in general physiological states (Kremling et al., 2000; Wagner, 2000). Further, multiple operons, regulons and modulons that respond to a common environmental stimulus are organized into stimulons (Kremling et al., 2000; Wagner, 2000). Readers are referred to other sources for more detailed discussions on these topics.


2. Characteristics of Operons and Uber-operons

2.1. Structures of Operons

The concept of operon describes a set of genes transcribed together (Jacob et al., 1960; Jacob et al., 1961). Typically, an operon has a promoter, an operator and a terminator (Fig. 1). All genes of the same operon share a common promoter that binds to RNA polymerases and a common regulatory region called the operator. In some cases, an operon may have multiple promoters. For example, the E. coli operon for galactose utilization (gal) contains a glucose-dependent and a glucose-independent promoter (Aki et al., 1997; Hua et al., 1972; Kuhnke et al., 1986). The E. coli tryptophan (trp) (Fig. 2) and isoleucine-valine (ilv) operons have internal promoters leading to the expression of some but not all of the genes in the operons under certain conditions (Calhoun et al., 1985; Johnson et al., 1983; Johnson et al., 1984; Subrahmanyam et al., 1980; Yanofsky et al., 1981). When an operon has multiple promoters, a subset of its genes can be present in several different transcription units, though none of the existing operon prediction programs are able to deal with this issue, and only limited attempts have been made to address this problem (Dam et al., 2007). Operators contain DNA sequences that specifically interact with particular proteins whose binding can either interfere with or enhance the activity of RNA polymerases. In addition, some operators act as the sites of regulation by attenuation, either with or without the involvement of a regulatory protein. Further details of the mechanism for attenuation and termination of transcription are discussed in the following subsection.

2.2. Transcriptional Regulation of Genes in an Operon

Transcription regulation of genes in the same operon is achieved through regulating the activity of the promoter or through the formation of RNA secondary structure

[Figure 1 layout, left to right: P, O, E, D, C, B, T]

Fig. 1. Schematic drawing of an operon containing a promoter (P), an operator (O), genes (B, C, D, E) and a terminator (T).

[Figure 2 labels: P1, P2, L, E, D, C, B, A; transcripts RNA 1 and RNA 2]

Fig. 2. The trp operon involved in tryptophan biosynthesis has two promoters (P1 and P2). Transcription starting from the P2 promoter in vivo is approximately 15% of the primary P1 promoter (Horowitz et al., 1982).


within the operator region. Such regulation can have three possible outcomes, namely repression, induction or attenuation of transcription. We briefly review the mechanism of transcriptional regulation in bacteria. For more detailed discussion on the subject, we refer the reader to Chapter 8 and other reviews (Busby et al., 1994; Cenatiempo, 1986; Henkin, 1996; Huffman et al., 2002; Lindahl et al., 1992; Magasanik, 1989; McClure, 1985; Perez-Martin et al., 1994; Platt, 1986).

Transcription of the lac operon is a good model for studying the mechanism of transcriptional regulation through repression (Borukhov et al., 2005; Lewis, 2005; Wilson et al., 2007). A regulatory protein called the lactose repressor is encoded by the lacI gene, which lies near the lac operon and is constitutively expressed in E. coli. When lactose is not present, the lactose repressor binds tightly to the lac operator, which is located just downstream of the lac promoter, near the beginning of the lacZ gene. The binding of the lacI protein to the operator interferes with the binding of RNA polymerase (RNAP) to the promoter, resulting in low production of RNA. In the presence of lactose, a lactose metabolite named allolactose binds to the lacI repressor and causes a change in the repressor’s shape. As a result, the repressor is unable to bind to the operator, allowing RNAP to bind effectively to the promoter, thereby enabling a high level of RNA expression.

The catabolite activator protein (CAP) is a good model for studying the mechanism of transcriptional regulation through induction. E. coli can use CAP to activate the transcription of genes that allow the cell to utilize alternative carbon sources (Busby et al., 1999; Hernday et al., 2004; Lawson et al., 2004). When the glucose level decreases, the cyclic AMP level increases, and cyclic AMP binds to the CAP protein. 
The binding between the cyclic AMP and the CAP protein enables the CAP protein to bind to the promoter of the operons regulated by CAP, and thus helps in recruiting RNAP to the promoter to transcribe the genes in these operons. Transcriptional attenuation is another mechanism for regulating transcription in bacteria. An example is the transcriptional regulation of the enzymes involved in the tryptophan biosynthesis encoded by the trp operon. In this case, gene expression is regulated by the level of tryptophan in the cell through the formation of alternative RNA secondary structures. In Bacillus subtilis, two alternative RNA secondary structures regulate transcription termination in an untranslated leader region, upstream of the structural genes of the trp operon (Babitzke, 1997; Babitzke, 2004; Gollnick et al., 2005). A trans-acting regulatory protein called TRAP (trp RNA-binding Attenuation Protein) controls which RNA structure forms. In the presence of tryptophan, TRAP is activated to bind to a specific target in the leader RNA. This binding results in formation of the transcription terminator, thereby halting the transcription of the operon. In the absence of tryptophan, TRAP does not bind to RNA and the alternative anti-terminator structure forms, allowing transcription of the operon. Transcriptional termination in bacteria is generally accomplished through two mechanisms, namely Rho-dependent and Rho-independent termination (Banerjee et al., 2006; Gollnick et al., 2002; Henkin et al., 2002; Nudler et al., 2002;


Richardson, 2003). The Rho-independent termination signal is a sequence of 30–40 bp in length that is rich in G and C residues, followed by a stretch of U residues. This sequence forms a stem-loop structure that interacts with the RNAP to terminate transcription. The Rho-dependent mechanism is involved in ∼50% of the E. coli transcriptional terminations. Previous studies showed that it has two essential components: the upstream Rho loading site and the downstream termination site. The Rho complex, which contains six identical subunits, begins by binding to the RNA transcript at the Rho loading site, which is 70–80 bp long and rich in C. Then the Rho hexamer moves along the RNA in the 3′ direction. When the Rho hexamer catches up with the RNAP, its ATPase activity helps to unwind the DNA-RNA complex and induces the release of the RNAP from the DNA.

Transcription of an operon generally generates an RNA transcript that contains all the genes of the operon downstream of the promoter. The RNA transcript is then translated into distinct polypeptides by the ribosomes.
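The sequence signature of a Rho-independent terminator described above (a GC-rich stretch followed by a run of U residues) suggests a very simple scan, sketched here as a toy heuristic. It works on the DNA coding strand, so U appears as T; the window length, GC threshold and minimum T-run length are illustrative assumptions rather than values from the literature, the function names are hypothetical, and the sketch does not test whether the GC-rich region can actually fold into a stem-loop.

```python
import re
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


def candidate_terminators(dna: str,
                          window: int = 30,        # assumed length of the GC-rich region
                          min_gc: float = 0.6,     # assumed GC threshold
                          min_t_run: int = 4       # assumed minimum T-run length
                          ) -> List[Tuple[int, int]]:
    """Return (start, end) spans: a GC-rich window immediately followed by a T-run."""
    dna = dna.upper()
    hits = []
    for match in re.finditer(r"T{%d,}" % min_t_run, dna):
        start = match.start() - window
        if start < 0:
            continue  # not enough upstream sequence to evaluate
        if gc_fraction(dna[start:match.start()]) >= min_gc:
            hits.append((start, match.end()))
    return hits


# Synthetic example sequence (not a real terminator).
example = "ATATATAT" + "GCGCCGGCGC" * 3 + "TTTTTT" + "ATATAT"
terminator_hits = candidate_terminators(example)
print(terminator_hits)
```

A practical terminator finder would also score the stability of the stem-loop; this sketch only shows how the two sequence properties named in the text can be checked together.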

2.3. Functional Relatedness Among Genes in the Same Operon

In general, there is some functional relationship among the genes within an operon. Experimental data suggest that genes of the tryptophan (trp) operon encode enzymes that are responsible for the biosynthesis of tryptophan from chorismate (Crawford et al., 1980; Imamoto, 1973; Yanofsky, 1971). Similarly, genes in the lac operon encode a membrane-bound transport protein that moves lactose into the cell, and two enzymes that are involved in the degradation of lactose (Lewis, 2005). However, genes of some operons do not seem to have an obvious functional relationship. For example, one E. coli operon contains three genes encoding ribosomal protein S21 (rpsU), DNA primase (dnaG), and the sigma subunit of RNA polymerase (rpoD). All three genes are involved in starting up the synthesis of macromolecules, but they are not involved in the same pathway. While operons like this exist, in general, genes in the same operon are functionally related.

2.4. Uber-operons as Conserved Groups of Operons

While some operons may not be conserved across genomes, researchers have noticed that unions of some operons could be conserved across these genomes. Bork’s group (Lathe et al., 2000) was the first to study this phenomenon and called such groups of operons “uber-operons”. A precise definition of uber-operon is given by Che et al. (2006) as follows. An uber-operon, U, is a group of operons in a genome whose component operons are transcriptionally or functionally related, and U is conserved across multiple genomes, measured as follows: the orthologous genes of U’s genes in each of these reference genomes form a group of operons, which are (approximately) made of these orthologous genes only (i.e., these operons neither contain other genes nor miss genes). Here ‘transcriptionally related’ refers to operons that are transcriptionally co-regulated; and ‘functionally related’ refers to operons that are


made of genes from the same pathway or with highly similar Gene Ontology (GO) annotations. Che et al. (2006) also studied various properties of the predicted uber-operons, and found that the average size of an uber-operon is about 3.5 times the average size of an operon (in terms of the number of genes). They also found that in E. coli K12, the predicted uber-operons cover ∼50% of the genes in the genome. Though the origins of uber-operons are still unclear, Che et al. (2006) suggested that they could be the footprints of operon evolution from larger operons in more primitive organisms to smaller operons in more “advanced” organisms (also see Sec. 7).

Figure 3 shows the flagella uber-operon in four different genomes. In B. burgdorferi, genes in the flagella uber-operon are grouped into 8 operons and a single-gene transcription unit. Although the order and composition of genes in each B. burgdorferi operon are not well conserved in T. maritima, B. subtilis and E. coli, the overall uber-operon composed of these operons is well conserved across the four genomes. An interesting observation made by Che et al. (2006) is that genes predicted to be in the same uber-operons tend to have lower functional relatedness compared to genes in the same operons while having higher functional relatedness compared to genes in the same regulons. This seems to suggest that uber-operons, regardless of how they have evolved, carry information that may be useful for functional prediction. For example, they could contain useful information for the elucidation of biological pathways in a similar fashion to how operons and regulons have been used for prediction of pathways (Mao et al., 2006).

There are other concepts related to uber-operons. A gene cluster, a concept proposed by Kolesov et al. (2001) and other researchers, is defined as a group

Fig. 3. The uber-operon containing flagellar genes. Genes involved in flagella formation and function from four different genomes are shown. In B. burgdorferi, the genes are arranged into 8 operons and a single-gene transcript. Although the order and composition of genes in each B. burgdorferi operon are not well conserved in T. maritima, B. subtilis and E. coli, the overall uber-operon composed of these operons is fairly conserved across the four genomes.


of genes with multiple pair-wise interactions among genes in the same cluster. The interactions range from physical protein-protein interactions to functional interactions such as co-evolution, co-regulation, etc. By comparing uber-operons and regulons, we noticed that uber-operons are mostly identified through computational approaches, which are not easy to verify by experiments. In contrast, regulons can be readily identified by both experimental and computational approaches. Furthermore, physical interactions or co-regulation between gene pairs in a gene cluster can be experimentally verified, while co-evolution relationships can only be inferred computationally.

3. Experimental Determination of Operons

The presence of multiple genes in the same RNA transcript can be experimentally detected using several techniques, including Northern blot (Alwine et al., 1977), RT-PCR (Reverse Transcription Polymerase Chain Reaction) (Ra et al., 1996; Tummuru et al., 1995) and gene expression arrays.

3.1. Northern Blot

The Northern blot procedure (Alwine et al., 1977) for operon identification can be summarized as follows. The first step involves isolation of RNA transcripts from the cell; then the RNA transcripts are separated by size and transferred onto a membrane. To confirm that a gene is transcribed into RNA, a single-stranded DNA sequence of the gene in question, labeled for easy detection, is used as a probe. The membrane is exposed to the probe for hybridization. If the membrane contains the RNA transcripts of the gene in question, the probe will bind to the RNA transcripts, and the interaction can be detected. Positive binding of probes from multiple genes to the same RNA transcript suggests that they are co-transcribed.

3.2. RT-PCR

RT-PCR (or Reverse Transcription PCR) (Wylie et al., 1996), which is based on the Polymerase Chain Reaction (PCR), has been used to amplify the RNA transcripts of interest. First, the RNA transcripts are collected from a cell, and complementary DNA is made from the RNA transcripts by an enzyme named reverse transcriptase. The DNA template can then be amplified by the normal PCR reaction, which requires a pair of primers that flank the DNA fragment. If a pair of primers flanking a region that contains two or more genes yields an amplification product, the genes in question are deemed co-transcribed. Otherwise, if no product is amplified, the genes in question are not co-transcribed. RT-PCR is less labor-intensive than the Northern blot procedure because the same collected RNA sample can be used with different primers to confirm the presence of multiple operons at the same time. However, the procedure is costly due to the use of multiple primer pairs and other reagents.


3.3. Gene Expression Arrays

Recent advances have enabled experimentalists to collect genome-wide gene expression data under designed conditions using the DNA microarray technique. Detailed discussion of the technique can be found in numerous reviews (Joyce et al., 2007; Lee et al., 2007; Mocellin et al., 2007; Petersen et al., 2007; Wilkes et al., 2007). The data gathered from such experiments are the RNA expression profiles for a large number of genes encoded in a genome. One can infer co-expression relationships among genes based on the similarity of their time-course expression data, though co-expression is only a necessary, not a sufficient, condition for genes to be in the same operon. Such co-expression data, though not sufficient for predicting operons, have been widely used as one type of supporting evidence in operon prediction programs (Sabatti et al., 2002).

4. Computational Methods for Operon Prediction

While experimental techniques such as RT-PCR or Northern blot are effective for identification of genes in the same operon, they are generally too expensive or too labor-intensive for genome-scale applications. For example, as of now, only ∼285 multi-gene operons in E. coli (less than 50% of the estimated number of operons) have been experimentally validated (Salgado et al., 2006) since the first paper describing the E. coli lac operon was published over 40 years ago (Jacob et al., 1961). Out of the 500+ bacterial and archaeal genomes with complete sequences by 2007, only E. coli K12 and B. subtilis 168 have relatively large numbers of operons experimentally evaluated, while A. tumefaciens str. C58, P. aeruginosa and L. lactis each have over ten experimentally verified operons (Okuda et al., 2006). Clearly, there is a large gap between the number of prokaryotes with completely sequenced genomes and the number of prokaryotes with (even limited) experimentally validated operons, and this gap is widening very rapidly due to the fast rate of genome sequencing. This gap has hindered the full utilization and application of the available genome sequence data.

To bridge this gap, numerous computational methods have been developed for prediction of operons. Most of these prediction programs rely on the idea of supervised learning: an operon predictor is trained on distinguishing features of known operons and non-operons, and the trained predictor is then applied to new genome sequences for operon prediction. Such a protocol generally consists of three key steps: (a) feature selection, (b) predictor training, and (c) predictor evaluation.

4.1. Feature Selection

With a few exceptions (Yada et al., 1999), operon prediction programs are generally designed to classify adjacent gene pairs on the same genomic DNA strand into two classes: (i) operonic gene pairs and (ii) boundary gene pairs (between consecutive operons). Various features have been examined to distinguish between


such gene pairs. These features include (a) intergenic distance, (b) similarity between the phylogenetic profiles of the involved genes, (c) conservation of the gene pair (or more generally gene neighborhood) across multiple genomes, (d) functional assignments of the genes, (e) known information about genes working in the same pathways, protein complexes or with physical interactions, and (f) correlations among the genes’ expression patterns based on microarray data. Among these features, the intergenic distance versus inter-operonic distance is by far the most useful feature for prediction of operons, as first discovered by Salgado et al. (2000). They found that the distance between adjacent genes within an operon in general is shorter than the distance between two adjacent genes of the same genome but in different operons. While other features have been used in operon prediction, their discerning power seems to be more situation-dependent, i.e., less universally applicable than intergenic distance information. For example, while genes in the same operons should be transcriptionally co-expressed in general, co-expression of two adjacent genes in a genome does not necessarily mean that these genes are in the same operon. Figure 4 shows the distribution of the Pearson correlation coefficient calculated from expression data of adjacent gene pairs in E. coli K12, which indicates that sometimes adjacent gene pairs not in the same transcript can also be co-expressed, reflected by the high Pearson correlation coefficient.

4.2. Prediction Methods

The core part of most operon prediction methods is to predict whether an adjacent gene pair on the same genomic strand is in the same operon (forming an operonic pair) or in adjacent operons (forming a boundary pair). 
While different techniques have been employed to predict operonic versus boundary pairs, the majority of the methods attempt to classify these two classes of gene pairs using various features and their combinations as outlined above.

Fig. 4. Distribution of the Pearson correlation coefficient of E. coli adjacent gene pairs calculated from the gene expression data (Dam et al., 2007). [Plot not reproduced. x-axis: Pearson correlation coefficient (0 to 1); y-axis: percentage (0.0 to 60.0); series: randomly chosen pairs, known boundary pairs, known operonic pairs.]
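Two of the features discussed in Sec. 4.1, intergenic distance and expression correlation, can be illustrated with a minimal sketch. The gene records, coordinates and expression values below are hypothetical toy data, and the coordinate convention (1-based, inclusive) is an assumption; none of this is taken from the cited programs.

```python
from math import sqrt

# Hypothetical toy annotation: 1-based inclusive coordinates, sorted by start.
GENES = [
    {"name": "geneA", "start": 100, "end": 1000, "strand": "+"},
    {"name": "geneB", "start": 1011, "end": 1900, "strand": "+"},
    {"name": "geneC", "start": 2300, "end": 3100, "strand": "-"},
]

# Hypothetical expression profiles (one value per microarray condition).
EXPRESSION = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}


def adjacent_same_strand_pairs(genes):
    """Return neighbouring gene pairs that lie on the same strand."""
    return [(g1, g2) for g1, g2 in zip(genes, genes[1:])
            if g1["strand"] == g2["strand"]]


def intergenic_distance(first, second):
    """Number of bases between two genes; negative if they overlap."""
    return second["start"] - first["end"] - 1


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def pair_features(genes, expression):
    """List (gene1, gene2, intergenic distance, expression correlation)."""
    return [
        (g1["name"], g2["name"], intergenic_distance(g1, g2),
         pearson(expression[g1["name"]], expression[g2["name"]]))
        for g1, g2 in adjacent_same_strand_pairs(genes)
    ]
```

Calling `pair_features(GENES, EXPRESSION)` yields one row per same-strand adjacent pair; pairs on opposite strands are skipped because they cannot belong to the same operon.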


A wide range of classification methods have been used in operon prediction programs, including (a) hidden Markov model-based methods (Yada et al., 1999), (b) decision tree-based methods (Che et al., 2007; Dam et al., 2007), (c) simple statistical methods (Ermolaeva et al., 2001), (d) Bayesian methods (Bockhorst et al., 2003; de Hoon et al., 2004; Sabatti et al., 2002; Westover et al., 2005), (e) graph-theoretic approaches (Chen et al., 2004a; Edwards et al., 2005; Zheng et al., 2002), (f) neural networks (Chen et al., 2004b; Tran et al., 2007), (g) support vector machines (Zhang et al., 2006), and (h) others (Jacob et al., 2005; Moreno-Hagelsieb et al., 2002; Romero et al., 2004).

OPERON, developed by Ermolaeva et al. (2001), is a statistical method for estimating the likelihood that an adjacent gene pair is in the same operon. It counts how often each adjacent gene pair on the same genomic strand occurs across multiple genomes and, using a probabilistic model, identifies gene pairs that appear together more often than expected. Operons are then predicted from these conserved gene pairs. Using this idea, OPERON derived 7,699 conserved gene pairs across 34 bacterial genomes, and the authors' test results indicate that 98% of such gene pairs are actually in the same operons. While the algorithm has high prediction specificity, it suffers from relatively low prediction sensitivity: on their test set, OPERON could detect only ∼50% of the known operons. Numerous computational methods, such as OFS (Westover et al., 2005) and VIMSS (Price et al., 2005), have included this feature of conserved gene pairs in their operon prediction.

JPOP (Chen et al., 2004b) and the meta-learner approach (Tran et al., 2007) both use a neural network to incorporate different features into operon prediction.
In JPOP, the log-likelihood of intergenic distance, phylogenetic similarity, and COG-based functional similarity between adjacent gene pairs were used as inputs to a neural network-based classifier. The inputs used by the meta-learner (Tran et al., 2007) consist of the prediction scores for operonic gene pairs from three prediction programs (hence the name), namely JPOP, OFS, and VIMSS, along with a GO similarity score and a pathway-based distance between the genes of a candidate pair. In both programs, the neural networks were trained to maximally separate true operonic gene pairs from boundary gene pairs. The JPOP neural network-based predictor reaches ∼83.8% prediction accuracy [Eq. (4)] in E. coli. Interestingly, the meta-learner approach (Tran et al., 2007) reaches 86.5% prediction accuracy on the same data set, suggesting an improvement over the accuracies of its three input programs.

Combinations of various features used in other prediction programs are also worth mentioning. Westover et al. (2005) used a naïve Bayesian approach (OFS) with features including the intergenic distance and common functional annotation. Price et al. (2005) applied a Bayesian approach (VIMSS) to a combination of intergenic distance, codon usage, comparative genomic information and functional similarity information (i.e., COG). Recently, Bergman (2007) developed a hidden


Markov model that incorporates phylogenetic information with intergenic distances. Jacob et al. (2005) developed a fuzzy genetic algorithm-based approach with four features: intergenic distance, participation in the same metabolic pathway, phylogenetic similarity, and COG-based functional annotation. Another machine learning approach with high prediction accuracy in both E. coli and B. subtilis 168 has also been developed (Dam et al., 2007).
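Several of the methods above feed a distance-based score into a classifier. As an illustration only, the sketch below computes a log-likelihood ratio for intergenic distance from two binned training distributions. The bin layout, the pseudocount and the training distances are assumptions made for this example, and the score is not claimed to match the exact definition used in JPOP or any other cited program.

```python
from math import log

BIN_WIDTH = 20   # base pairs per bin (assumed)
MIN_DIST = -50   # smaller distances are clamped into the first bin (assumed)
N_BINS = 15      # larger distances are clamped into the last bin (assumed)


def bin_index(distance):
    """Map an intergenic distance in bp to a histogram bin."""
    idx = (distance - MIN_DIST) // BIN_WIDTH
    return max(0, min(N_BINS - 1, idx))


def train_histogram(distances):
    """Estimate bin probabilities, with a pseudocount of 1 per bin."""
    counts = [1.0] * N_BINS
    for d in distances:
        counts[bin_index(d)] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def distance_log_likelihood(distance, p_operonic, p_boundary):
    """log P(distance | operonic pair) - log P(distance | boundary pair)."""
    i = bin_index(distance)
    return log(p_operonic[i] / p_boundary[i])


# Hypothetical training distances (bp) for known pairs of each class.
OPERONIC_DISTANCES = [-4, 0, 5, 10, 12, 30]
BOUNDARY_DISTANCES = [80, 90, 120, 150, 200, 250]

P_OPERONIC = train_histogram(OPERONIC_DISTANCES)
P_BOUNDARY = train_histogram(BOUNDARY_DISTANCES)
```

A positive score favours the operonic class and a negative score the boundary class. In a multi-feature predictor, such a score would be one input among several rather than a classifier on its own.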

4.3. Prediction Evaluation

In general, the performance of a predictor can be assessed based on operons that have been experimentally validated. Furthermore, if a predictor is trained on one genome and applied to another, it is critical to evaluate how well the program generalizes to other genomes. The prediction performance of a program can be assessed in terms of prediction sensitivity ($S_T$), prediction specificity ($P_T$), prediction accuracy ($A$), and prediction error rate ($E$), defined as follows.

Sensitivity:

$$S_T = \frac{\mathrm{TP}}{\mathrm{WO}} \tag{1}$$

Specificity:

$$P_T = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \tag{2}$$

Accuracy:

$$A = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{WO} + \mathrm{TUB}} \tag{3}$$

or

$$A = \frac{1}{2}\,(S_T + P_T) \tag{4}$$

Error rate:

$$E = 1 - A, \tag{5}$$

where TP (true positive) is the number of operonic pairs predicted correctly, TN (true negative) is the number of boundary pairs predicted correctly, FP (false positive) is the number of boundary pairs predicted to be operonic pairs, and FN (false negative) is the number of operonic pairs predicted to be boundary pairs. WO (operon) and TUB (transcription unit boundary) are the total numbers of operonic pairs and boundary pairs under consideration, respectively, so that WO = TP + FN and TUB = TN + FP. In general, the higher the sensitivity and specificity, the better the prediction, although sensitivity and specificity are often inversely related, as shown in Fig. 5. In this figure, the ROC (Receiver Operating Characteristic) curve (Tran et al., 2007) indicates that an increase in sensitivity is accompanied by a decrease in specificity. The larger the area under the ROC curve, the better the overall performance of a prediction program. It should be noted that to apply these assessment measures, we need not only a set of known operons but also a set of known non-operons. While the operon set is often easy to obtain from the experimentally validated operons, to get


Fig. 5. The relationship between the specificity and sensitivity of an operon prediction program using a neural network framework (Tran et al., 2007).

a non-operon data set is sometimes difficult because not all non-operon gene pairs have been rigorously checked experimentally, and some presumed non-operons may not be real non-operons. The most widely used negative (non-operon) sets typically comprise the first or the last gene of a known operon together with the adjacent gene upstream or downstream of it, provided that the two genes are transcribed in the same direction. Other types of negative data sets have also been used (Craven et al., 2000; Price et al., 2005). Caution should be taken when comparing the performance statistics of different prediction programs, because their negative data sets may be defined differently.

One critical assessment of any prediction program is its capability to generalize, i.e., to predict operons in a genome on which the predictor was not trained. While many methods have been shown to be effective when trained and tested on the same genome, such as E. coli K12 or B. subtilis 168, they often do not generalize well if the training and testing sets are from different genomes (Dam et al., 2007; Romero et al., 2004). The problem could be that some features are organism-specific (Dam et al., 2007), resulting in a substantial performance reduction when a predictor is tested on a genome different from its training genome (e.g., E. coli versus B. subtilis). In an attempt to address this problem, Edwards et al. (2005) developed a universally applicable prediction program based on conserved genomic context information. While promising, the method suffers from low prediction sensitivity (49.1%) (Edwards et al., 2005). In another approach to the same problem, Dam et al. (2007) used two different classification methods for operon prediction,


depending on whether known operons from the target genome are available to train the classifiers. Using a non-linear decision tree-based method and half of the known operons from either E. coli or B. subtilis for training, their trained predictors achieve 91% and 95% prediction accuracy on E. coli and B. subtilis, respectively, when tested on the other half of the known operons of the same genome. However, this predictor does not perform well if it is trained on E. coli data and then tested on B. subtilis data. The authors therefore proposed a linear logistic function-based (LFB) classification method that uses a smaller set of features. When the LFB predictor is trained on known operons from E. coli, it achieves 85% accuracy when tested on B. subtilis data. The lesson learned is that when operon data are available for a target genome, more sophisticated learning techniques can be used, and one can expect higher accuracies when the classifiers are tested on the target genome. However, when no such data are available and training has to be done on data from other genomes, less sophisticated learning techniques are preferred, since they should make the classifier more generalizable.
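The measures of Eqs. (1) to (5) can be computed directly from labelled gene pairs. The following is a minimal sketch; the function name and the toy labels are hypothetical. Note that the quantity called specificity in Eq. (2), TP/(TP + FP), is what is often called precision elsewhere.

```python
def evaluate_predictions(true_labels, predicted_labels):
    """Compute the measures of Eqs. (1)-(5).

    Labels are booleans: True = operonic pair, False = boundary pair.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)

    wo = tp + fn    # all known operonic pairs
    tub = tn + fp   # all known boundary pairs

    sensitivity = tp / wo if wo else 0.0                  # Eq. (1)
    specificity = tp / (tp + fp) if (tp + fp) else 0.0    # Eq. (2)
    accuracy = (tp + tn) / (wo + tub) if pairs else 0.0   # Eq. (3)
    mean_accuracy = 0.5 * (sensitivity + specificity)     # Eq. (4)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mean_accuracy": mean_accuracy,
        "error_rate": 1.0 - accuracy,                     # Eq. (5), using Eq. (3)
    }


# Hypothetical toy evaluation set: four operonic and four boundary pairs.
TRUE_LABELS = [True, True, True, True, False, False, False, False]
PREDICTED = [True, True, True, False, True, False, False, False]
RESULT = evaluate_predictions(TRUE_LABELS, PREDICTED)
```

Because the two accuracy definitions generally differ when the classes are unbalanced, it is worth stating which one is reported when comparing programs.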

5. Applications of Operon Prediction

Because operons are the basic functional units in the hierarchical structure of a prokaryotic genome, which includes operons, regulons, modulons and stimulons, accurate prediction of operons is essential to the prediction of any of the higher-level genomic structures. In addition, results of operon prediction have been used to improve the prediction of pathways and orthologous genes.

5.1. Prediction of Pathways

It is generally true (with a few exceptions) that a metabolic pathway can be decomposed into a collection of operons (Mao et al., 2006). Hence accurate prediction of operons forms a basis for accurate prediction of the component genes of a metabolic pathway. When attempting to elucidate a biological pathway (see Chapter 12), operons can provide information complementary to that derived from other data sources such as protein-protein interactions or functional annotation of genes. For example, if gene A and gene B are predicted to be in the same operon, it is likely that they work in the same biological process (Mao et al., 2006).

5.2. Prediction of Orthologous Genes

Prediction of orthologous genes across genomes is a highly important as well as highly challenging problem. Numerous techniques have been developed, including the traditional bi-directional best hit approach (Mushegian et al., 1996) and its variations (Wall et al., 2003), the COG approach (Tatusov et al., 1997), and the phylogenetic approaches discussed in detail in Chapter 9. Operons


provide an important piece of information for orthologous gene prediction. Previous observations have shown that a pair of homologous genes across two genomes is more likely to be orthologous if the two corresponding operons contain a second pair of homologous genes across the two genomes; for example, the five tryptophan biosynthesis genes trpA, trpB, trpC, trpD and trpE are observed to be in the same operon in many bacterial genomes (Xie et al., 2003). Based on this observation, Wu et al. (2005) developed a classification scheme for orthologous genes based on predicted operons.
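The operon-context observation above can be sketched as a simple filter: a homologous gene pair is kept if the two operons it lies in share at least one other homologous pair. This only illustrates the observation and is not the classification scheme of Wu et al. (2005). All gene and operon identifiers are hypothetical.

```python
def operon_supported_pairs(homolog_pairs, operon_of_a, operon_of_b):
    """Return homolog pairs supported by a second pair in the same two operons.

    homolog_pairs: list of (gene_in_genome_A, gene_in_genome_B)
    operon_of_a / operon_of_b: dict mapping gene -> operon identifier
    """
    supported = []
    for a, b in homolog_pairs:
        op_a, op_b = operon_of_a.get(a), operon_of_b.get(b)
        if op_a is None or op_b is None:
            continue  # a gene outside any predicted operon has no context
        for a2, b2 in homolog_pairs:
            if (a2 != a and b2 != b
                    and operon_of_a.get(a2) == op_a
                    and operon_of_b.get(b2) == op_b):
                supported.append((a, b))
                break
    return supported


# Hypothetical example: genome B carries a second trpB-like homolog
# located in an unrelated operon.
OPERON_OF_A = {"trpA": "opA1", "trpB": "opA1"}
OPERON_OF_B = {"b_trpA": "opB1", "b_trpB": "opB1", "b_trpB_like": "opB2"}
HOMOLOG_PAIRS = [
    ("trpA", "b_trpA"),
    ("trpB", "b_trpB"),
    ("trpB", "b_trpB_like"),
]
SUPPORTED = operon_supported_pairs(HOMOLOG_PAIRS, OPERON_OF_A, OPERON_OF_B)
```

In this toy case the pair involving `b_trpB_like` is not supported, because no other homologous pair links operon `opA1` to operon `opB2`.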

6. Evolution of Operons

One important and interesting application of operon prediction is the study of operon evolution. With accurate operon predictors, one can apply them to all the sequenced prokaryotic genomes, compare the predicted operons across organisms with different characteristics and living conditions, and start asking important questions such as “what rules govern the evolution of operons?” or “what are the major differences, if any, between bacterial operons and archaeal operons?”

6.1. Origins of Gene Clusters

Models for explaining why genes are grouped together into gene clusters, such as operons, fall into five classes: (1) the Natal model, (2) the Fisher model, (3) the modularity model, (4) the co-regulation model, and (5) the selfish operon model. The Natal model suggests that gene clusters came into being through gene duplication and divergence (Horowitz, 1965; Lewis, 1951). The Fisher model proposes that if certain genes work well together, selection would favor increased linkage among them, and hence gene clusters, to reduce deleterious recombination (Bodmer et al., 1962; Fisher, 1930). The co-regulation model proposes that genes are clustered together because co-regulation through a common promoter is beneficial (Jacob et al., 1960; Pardee et al., 1959), though this model has problems explaining some known facts. For example, many genes scattered across a genome can be co-regulated without being clustered together. Also, many (conserved) gene clusters contain genes that are not co-transcribed. In addition, while genes working in a common metabolic pathway are often grouped into operons, (seemingly) unrelated enzymes also appear in common operons (Lawrence et al., 2000; Lawrence et al., 2003). The selfish operon model maintains that gene clusters may arise and propagate owing to horizontal gene transfer (Lawrence et al., 1996), and that genes are clustered into groups to increase their likelihood of being distributed among organisms, whether or not their functions are essential to those organisms. In theory, genes organized into clusters can propagate together through both vertical inheritance and horizontal transfer, while unclustered genes can be inherited together


only through vertical transmission. The selfish operon model suggests how a new operon could be formed after a horizontal transfer event. The initial operon could contain all genes transferred together to the new host in the same event, after which intervening genes without a relevant function could be deleted. This model therefore provides an intermediate step in operon formation without requiring rare beneficial rearrangements. Under this model, operons are formed through repeated gain and loss of genes, and gene clusters are initially beneficial to the genes themselves, not necessarily to their host organisms. While the model can explain why many operons have been acquired by horizontal gene transfer, it also suggests that essential genes should not be in operons, since these genes cannot undergo the cycles of gene loss and gain that the model requires. However, in contradiction to this model, it has been observed that essential genes are often conserved without paralogs across multiple species and are rarely transferred horizontally across species (Pal et al., 2004). These observations suggest that the organization of genes into operons could be due to multiple underlying mechanisms, which are not yet fully understood.

6.2. Characteristics of Operon Evolution

6.2.1. Few Bacterial Operons Remain Intact Across Multiple Genomes

Despite the benefits conferred by transcriptional co-regulation, the gene composition of a bacterial operon is not stable across multiple genomes (Itoh et al., 1999), suggesting that operons are not stable during evolution. As shown in Fig. 6, only a small number of operons in E. coli are found to be conserved across multiple prokaryotic genomes. Furthermore, the most stable gene clusters are reportedly those whose protein products interact physically (Dandekar et al., 1998). For example, operons containing ribosomal proteins, such as S10-spc-α, are well conserved across all prokaryotic genomes (Siefert et al., 1997; Watanabe et al., 1997). Several other operons, namely the atp, groE, nusA-infB, and pheST operons, are also well conserved within eubacterial genomes.

6.2.2. Gene Order Is Not Conserved During Bacterial Evolution

Not only is the gene composition of an operon poorly conserved, as discussed in Sec. 6.2.1, but the order of the genes in a bacterial operon is also poorly conserved during evolution (Mushegian et al., 1996). Comprehensive analyses of available genomes have shown that the positions of genes in an operon are shuffled frequently during evolution, although the positions of genes in a small number of operons, including the ribosomal protein operons, are well conserved (Siefert et al., 1997; Watanabe et al., 1997). Together, these analyses suggest that (a) shuffling of gene positions is virtually neutral in long-term evolution, (b) there is no absolute requirement for juxtaposition of any pair of genes in a bacterial genome, and (c) there seems to be strong positive selection for clustering of genes that encode physically interacting


Fig. 6. Cluster analysis of operon conservedness. The horizontal and vertical axes represent 312 genomes and 283 operons, respectively. The colors from red to black correspond to the operon structural conservedness from high to low. A green spot represents a genome that does not have that particular operon.

proteins. The lack of conservation of gene order is in sharp contrast with the generally high level of sequence similarity between orthologs, i.e., genes in different species that evolved from a common ancestral gene by speciation, suggesting that conservation of sequence is more critical than conservation of gene position in a genome during evolution. Furthermore, variation in the positions of functionally linked genes suggests that the mechanisms regulating the expression of functionally linked genes could be dramatically different in distantly related bacteria.

6.2.3. Adaptive Evolution of Bacterial Operons

Why do many bacterial operons not remain intact over a long period of time? A recent study (Wan et al., 2007) of the gene order and composition of 283 operons of E. coli K12 across 312 bacterial genomes suggests that only a few operons, called house-keeping operons, remain intact across all genomes, and that they are related to the biosynthesis of ribosomal proteins (Fig. 6). Furthermore, when the similarity of operons is used as a feature to cluster genomes, the results suggest that genomes clustered into the same group usually have the same or similar lifestyles, although they may not be the closest relatives in the phylogenetic tree, suggesting that operons can be shared between genomes that are not closely related.


7. Computational Prediction of Uber-operons

Unlike for operons, there are currently no experimental data for the validation of predicted uber-operons. While uber-operons are believed to be footprints of operon evolution, it is not clear what functional roles they may have in current cells, making it difficult to design experiments to validate the predicted uber-operons. Still, computational predictions have provided strong evidence for the existence of uber-operons and have suggested the functional relatedness of genes in the same uber-operon (Che et al., 2006). While it is clear that further investigation is needed to link the predicted uber-operons to possible cellular functional roles, previous studies have suggested that this information can be used for studies of metabolic pathways and regulons (Lathe et al., 2000). Hence we briefly go through a few methods for the prediction of uber-operons.

Lathe et al. (2000) developed the first algorithm for predicting uber-operons by detecting conserved unions of operons across multiple bacterial genomes. They assume that orthologous gene relationships across the underlying genomes are provided by other programs. Their algorithm employs a heuristic strategy to find groups of operons in a genome whose union is approximately conserved across multiple genomes.

Kolesov et al. (2001) developed a method called SNAP (Similarity-Neighborhood APproach) to find gene clusters. In their method, two genes are defined to be S-related if their sequence similarity is higher than a threshold, and N-related if they are consecutive on the same genomic strand and their intergenic distance is less than a threshold. An SN-graph is defined to have all genes as vertices and all gene pairs that are either S-related or N-related as edges.
It was shown that a cycle with alternating S-related and N-related edges forms a gene cluster, and an algorithm was developed to find such SN-cycles (Kolesov et al., 2001).

Martin et al. (2004) defined an uber-operon as a maximal set of operons across two genomes that share common homologous genes. The problem was formulated as finding connected components in a graph G whose vertex set contains the genes that each have at least one homolog in the other genome and whose edge set contains the gene pairs in the same operon. The authors developed a method called ‘Hierarchical Union of Genes from Operons’ (HUGO) to derive uber-operons using this definition.

Recently, Che et al. (2006) developed a novel approach that predicts uber-operons by identifying groups of functionally or transcriptionally related operons whose gene sets are conserved across a target genome and multiple reference genomes. The method consists of two prediction steps. In the first step, multiple versions of uber-operons in the target genome are predicted, each based on an individual reference genome. The basic idea of this step is to iteratively connect operon components in a bipartite graph using an approach based on maximum cardinality bipartite matching. The connected operon components are considered to be a putative uber-operon. In the second step, all putative uber-operons based on multiple reference genomes are


unified into a final version of uber-operons by using the Markov cluster algorithm (MCL) (Dongen, 2000).

Future studies in this area on (a) generating experimentally verified uber-operon datasets, (b) improving the accuracy of prediction, and (c) analyzing the conservation of uber-operons across genomes will be critical for improving our understanding of operon evolution and of the hierarchical organization of bacterial genomes in general.

8. Resources and Databases on the Internet

A large number of databases and computer programs have been developed for the prediction and analysis of operons and uber-operons; they provide a set of highly useful resources for comparative genome studies and pathway prediction.

8.1. Internet Resources for Known Operons

| Resource, ref. and link | Organisms | Description |
|---|---|---|
| The Operon Database (ODB) (Okuda et al., 2006) (Okuda, 2004) | 203+ | ODB contains known operons from prokaryotes and eukaryotes curated from the literature. Putative operons are also identified based on orthologous gene prediction. This database also enables the user to predict operons in 194 organisms using several features. |
| RegulonDB (Salgado et al., 2006) (Salgado, 2006) | E. coli | RegulonDB is a comprehensive database of transcriptional regulation data for E. coli K12, including operons, their terminators and promoters, and relevant regulatory pathways. The data are curated from experimental evidence. |
| Database of Transcriptional Regulation in Bacillus subtilis (DBTBS) (Makita et al., 2004) (Makita, 2004) | B. subtilis | DBTBS is a useful database of experimentally confirmed regulatory networks in B. subtilis. The current version contains data for binding factors and gene functional classification. Predictions of operons, terminators, and regulons are also provided for B. subtilis. |


8.2. Internet Resources for Operon Prediction

| Resource, ref. and link | Organisms evaluated | # Pred. genomes | Description |
|---|---|---|---|
| Joint Prediction of Operons (JPOP) (Chen et al., 2004a; Chen et al., 2004b) (Chen et al., 2004b) | E. coli, S. sp. WH8102 | 1 | JPOP applies intergenic distances, COG functional annotations, and phylogenetic profiles to perform operon prediction. The executable code and the operon prediction results for Synechococcus sp. WH8102 are available for download. |
| Decision-tree and logistic function based classifier for operon prediction (Dam et al., 2007) (Dam, 2007) | E. coli, B. subtilis | 256 | The method uses phylogenetic profiles, ratio of gene lengths, intergenic distances, motif frequency, conservation score, and gene ontology similarity score to predict operons. The source code and training files are available. |
| Neural network operon prediction (Tran et al., 2007) (Tran, 2007) | E. coli, B. subtilis, P. furiosus | 3 | The method uses a meta-learner to combine the operon predictions from JPOP, OFS, and VIMSS along with Gene Ontology similarity and KEGG pathway score to predict operons. The MATLAB source code is available. |
| UNIPOP (Li et al., 2007) (Che, 2007) | E. coli, B. subtilis | 365 | The predictor applies a graph theory framework to predict operons without a training set. The source code and operon predictions for 365 prokaryotic organisms are available. |
| E. coli K12 operon prediction (Bockhorst et al., 2003; Craven et al., 2000) (Bockhorst, 2003) | E. coli | 1 | The predicted transcription units for E. coli K12 based on sequence information and microarray data are available for download. |
| Metabolic biochemical pathways for operon prediction (Zheng et al., 2002) (Zheng, 2002) | E. coli, B. subtilis | 40 | The method uses graph representations of biochemical pathways from KEGG to predict operons. The predicted operons from 40 microbial organisms are available. |
| Operon Finding Software (OFS) (Westover et al., 2005) (Westover, 2005) | E. coli, B. thetaiotaomicron | 2 | OFS makes operon predictions using intergenic distance, functional annotation of genes, and conserved gene order information. The Perl scripts for the OFS program are available. |
| OPERON hosted by the Institute for Genomic Research (TIGR) (Ermolaeva et al., 2001) (Ermolaeva, 2001) | E. coli | 228 | OPERON analyzes gene pair conservation across multiple genomes in a probabilistic framework. The results of the predicted gene pairs in 228 organisms are available. |
| Virtual Institute for Microbial Stress and Survival (VIMSS) (Price et al., 2005) (Price, 2005) | E. coli, B. subtilis, H. NRC-1, H. pylori, C. trachomatis, S. PCC 6803 | 416 | VIMSS uses intergenic distance, COG-based functional similarity, and similarity in synonymous codon usage to make operon predictions. The source code and pre-computed predictions are available for download. |
| FGENESB Suite of Bacterial Operon and Gene Finding Programs (Anonymous A) | E. coli, H. sp. NRC-1, M. jannaschii, B. anthracis | 4 | The FGENESB server provides a suite of tools for the annotation of genes and operons. Operon prediction is initially based on intergenic distance and iteratively refined using conservation of gene pairs and predicted promoters and terminators. Full functionality of the program is not available at the website and needs to be executed locally. Pre-computed predictions for several organisms are available. |
| Distinctive signatures of operon junctions across Prokaryotes (Janga et al., 2006) (Janga, 2006) | E. coli, B. subtilis | 330 | The method uses tri-nucleotide signatures conserved across genomes to predict operons. The pre-computed predictions are available for download. |

8.3. Internet Resources for Uber-operon Prediction

| Resource, ref. and link | Organisms evaluated | # Pred. genomes | Description |
|---|---|---|---|
| Nebulon (Janga et al., 2005) (Janga, 2005) | E. coli, B. subtilis, S. meliloti | 197 | The Nebulon web server enables a user to view the functional relationships among genes based on their organization into predicted operons across 197 prokaryotic genomes. A graphical display illustrates the interaction information of the selected genes. |
| Hierarchical Union of Genes from Operons (HUGO) (Figeac et al., 2004) (Figeac, 2004) | E. coli, B. subtilis, Y. pestis, C. acetobutylicum | 0 | HUGO detects a maximal set of operons across two genomes that share homologous genes. The binaries and source code for HUGO are available for download. |
| Uber-operon Database (Che et al., 2006) (Che, 2006) | E. coli | 87 | The method predicts uber-operons by identifying groups of functionally or transcriptionally related operons conserved across multiple genomes. Pre-computed uber-operon predictions in 87 genomes are available. |
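The connected-component formulation of uber-operons described in Sec. 7 (maximal sets of operons across two genomes that share homologous genes) can be sketched with a union-find structure. This is a hedged illustration, not the HUGO implementation. In particular, linking homologous genes across the two genomes by an edge is an assumption made here so that operons from both genomes can fall into one component, and all identifiers are hypothetical.

```python
def uber_operon_components(operons_a, operons_b, homologs):
    """Group operons of two genomes into connected components.

    operons_a / operons_b: dict mapping operon id -> list of gene names
    homologs: list of (gene_in_genome_A, gene_in_genome_B)
    Gene names are assumed to be unique across both genomes.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # Vertices: only genes with at least one homolog in the other genome.
    with_homolog = {gene for pair in homologs for gene in pair}
    operon_of = {}
    for operons in (operons_a, operons_b):
        for operon_id, genes in operons.items():
            kept = [g for g in genes if g in with_homolog]
            for g in kept:
                operon_of[g] = operon_id
                find(g)
            for g in kept[1:]:
                union(kept[0], g)  # edges between genes of the same operon

    # Assumed additional edges: homologous genes across the two genomes.
    for gene_a, gene_b in homologs:
        if gene_a in operon_of and gene_b in operon_of:
            union(gene_a, gene_b)

    groups = {}
    for gene, operon_id in operon_of.items():
        groups.setdefault(find(gene), set()).add(operon_id)
    return sorted(sorted(ops) for ops in groups.values())


# Hypothetical toy input.
OPERONS_A = {"A1": ["a1", "a2"], "A2": ["a3"], "A3": ["a4", "a5"]}
OPERONS_B = {"B1": ["b1", "b3"], "B2": ["b4"]}
HOMOLOGS = [("a1", "b1"), ("a3", "b3"), ("a4", "b4")]
COMPONENTS = uber_operon_components(OPERONS_A, OPERONS_B, HOMOLOGS)
```

In the toy input, operons `A1` and `A2` are separate in genome A but their homologs sit in the single operon `B1` of genome B, so the three operons end up in one component.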

9. Challenges Ahead

A major challenge in operon and uber-operon prediction is the lack of experimentally validated data. To date, only two large operon datasets with experimental validation, from E. coli and B. subtilis, are available, while no validated data are available for uber-operons. This raises serious issues when attempting to derive accurate operon information for genomes other than these two organisms, and even for their closely related organisms (similar in uber-operons), including how to assess (a) prediction performance in other genomes, and (b) the generalizability of the features used in operon prediction programs.

Improving the accuracy of existing operon prediction programs from the current state of the art of ∼90% to something closer to 100% remains a challenge. New insights into operon structure (i.e., gene composition and order), evolution, and transcription regulation might be needed before we can make the next major leap forward. Potential breakthroughs may come when several known operon sets besides those of E. coli and B. subtilis become available for comparison. A related challenge arises from our desire to develop more universally applicable operon prediction tools that require only a small number of input parameters while yielding consistent prediction accuracy across a large number of genomes.

Another key challenge in operon prediction is to predict all possible transcripts for each operon under different conditions. As we discussed earlier in


this chapter, an operon could have multiple transcripts due to the presence of multiple promoters and terminators. Improvements in predicting promoters and terminators, and the incorporation of such information into operon prediction, may lead to higher accuracy in predicting overlapping transcription units.

It is a highly interesting problem to study how operons have evolved from simple to complex organisms, since some relatively simple eukaryotes, such as yeast and even worms, are known to have operons (Agabian, 1990; Davis et al., 1997; Spieth et al., 1993), and to understand how operon structures ultimately faded away in the more complex eukaryotes. Evidence suggests that eukaryotes diverged from archaea after bacteria did, i.e., that eukaryotes and archaea are more closely related than bacteria and archaea. However, archaea and bacteria carry numerous operons containing the same genes in the same order, while only a small number of eukaryotes have operons, suggesting that the progenitor of the eukaryotes had these operons but that they must have been lost relatively early in eukaryotic evolution. While such studies have focused on a small number of conserved operons such as the ribosomal protein operons, there have been very few studies using a large number of operons, in part because of the low accuracy of operon prediction across genomes. Improvement in the performance of operon prediction programs will therefore be critical for generating the reliable data needed for such large-scale studies of operons across all three superkingdoms.

Although it has been suggested, with strong evidence, that genes in the same uber-operon in general tend to work in the same pathway, there have been cases where genes in the same uber-operon are not involved in the same pathway. Clearly, there is a need for a better understanding of the biological meaning of uber-operons. To date, only a few papers have been published on uber-operon prediction.
Further research is needed not only to improve prediction but also to investigate the functional roles of uber-operons in cellular processes. In addition, while experimental data may become available for validation of some uber-operons, we expect that most of the validation of the predicted uber-operons will come from computational approaches in the near future. Hence more powerful computational validation techniques for uber-operons will need to be developed.

10. Summary

The operon model was first proposed by Monod and Jacob in 1960 to describe the transcriptional regulation of genes involved in the metabolism of lactose in E. coli. Subsequently, operons have been found to be a common theme in bacteria and archaea. Furthermore, polycistronic transcripts in eukaryotes have been observed in trypanosomes, nematodes such as C. elegans, flatworms and primitive chordates. Although there are experimental methods to find co-transcribed genes in any genome, the procedure is often labor-intensive and/or expensive, as discussed in this chapter. Among the 500+ completely sequenced bacterial and archaeal genomes, only two genomes, E. coli K12 and B. subtilis 168, have large numbers


P. Dam et al.

of experimentally validated operons. To bridge this gap, various computational techniques have been developed to predict operons. Most of these methods consist of three main steps: (i) feature selection for distinguishing operonic from boundary gene pairs, (ii) training a classifier to separate operonic from boundary gene pairs based on selected features, and (iii) evaluation of the prediction results. In general, the performance of a predictor can be assessed by testing on known operons. The sensitivity, specificity, accuracy, and ROC curve have been used to evaluate the performance of an operon predictor. One important application of operon prediction is to study the evolution of operons. Analyses of predicted operons suggest that (a) only a handful of bacterial operons remain intact during the long course of evolution, (b) gene order within operons is generally not conserved throughout bacterial evolution, and (c) operon structure is probably a reflection of adaptive advantage based on the selfish operon model. Since Lathe et al. (2000) first coined the term “uber-operons” to describe sets of operons that are conserved across multiple genomes, there have been a few attempts to predict and analyze uber-operons. Preliminary studies suggest that uber-operons probably represent the footprints of operon evolution, and contain highly useful functional information that is relevant to the elucidation of biological pathways. We expect that efforts to understand uber-operons and apply them to functional studies of biological processes will continue to increase over the next few years.

11. Further Reading

Operon Prediction:

Salgado, H., Moreno-Hagelsieb, G., Smith, T.F. and Collado-Vides, J. (2000) Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci USA, 97, 6652–6657.
Ermolaeva, M.D., White, O. and Salzberg, S.L. (2001) Prediction of operons in microbial genomes. Nucleic Acids Res, 29, 1216–1221.
Okuda, S., Katayama, T., Kawashima, S., Goto, S. and Kanehisa, M. (2006) ODB: a database of operons accumulating known operons across multiple genomes. Nucleic Acids Res, 34, D358–362.

Uber-operon Prediction:

Lathe, W.C., 3rd, Snel, B. and Bork, P. (2000) Gene context conservation of a higher order than operons. Trends Biochem Sci, 25, 474–479.
Janga, S.C., Collado-Vides, J. and Moreno-Hagelsieb, G. (2005) Nebulon: a system for the inference of functional relationships of gene products from the rearrangement of predicted operons. Nucleic Acids Res, 33, 2521–2530.
Che, D., Li, G., Mao, F., Wu, H. and Xu, Y. (2006) Detecting uber-operons in prokaryotic genomes. Nucleic Acids Res, 34, 2418–2427.


Operon Evolution:

Blumenthal, T. (2004) Operons in eukaryotes. Brief Funct Genomic Proteomic, 3, 199–211.
Lawrence, J.G. (2003) Gene organization: Selection, selfishness, and serendipity. Annu Rev Microbiol, 57, 419–440.
Woese, C.R. (1987) Bacterial evolution. Microbiol Rev, 51, 221–271.
Price, M.N., Huang, K.H., Arkin, A.P. and Alm, E.J. (2005) Operon formation is driven by co-regulation and not by horizontal gene transfer. Genome Res, 15, 809–819.

Acknowledgments

This work was supported in part by NSF IIS-0407204, NSF DBI-0542119, NSF DBI-0354771 and NSF CCF-0621700 and a “Distinguished Scholar” grant from the Georgia Cancer Coalition. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsoring institutions.


CHAPTER 11 PREDICTION OF REGULONS THROUGH COMPARATIVE GENOME ANALYSES

ZHENGCHANG SU, GUOJUN LI and YING XU

1. Introduction

Genes in a prokaryotic cell are not expressed at a constant level; rather, different genes may have different expression levels, and the same gene may be expressed differently at different stages in the cell’s life cycle and/or under different conditions. In other words, a gene’s expression level is determined by the combination of the cell’s physiological requirements and its intra- and extra-cellular environments. As discussed in Chapter 10, in prokaryotes, several adjacent genes on the same strand of DNA could form an operon and be transcribed as a polycistronic mRNA. At the molecular level, one important determinant for a gene’s expression level is the transcription initiation process (Beckett, 2001). In bacteria, gene transcription initiation is controlled by the σ-factor of the RNA polymerase (RNAP) together with other specific transcription factors (TFs), each of which binds to a distinct region of DNA located in the upstream region of an operon (Busby and Ebright, 1994). The region to which the σ-factor of the RNAP binds is called a promoter, and the region recognized by a TF is called a cis-regulatory element or a binding site. Under certain conditions, the transcription of some genes can be initiated by the binding of a σ-factor to its promoter alone, while the transcription initiation of some other genes requires not only the binding of a σ-factor, but also the binding of at least one TF to its cis-regulatory element and the interactions between the σ-factor and the TF(s). Transcription termination is governed either by specific sequences located downstream of the operon or by a ρ-factor (Henkin, 1996). Genes in an operon in general share the same transcription initiation and termination control machineries. Typically, a genome encodes far fewer TFs than it has operons, so each TF could regulate multiple operons. The collection of the operons that are regulated by a TF is called the regulon of the TF.
Some operons are regulated by more than one TF. Thus an operon can belong to different regulons. Binding of a TF to its cis-regulatory elements for an operon may either up- or down-regulate the transcription of genes in the operon, depending on the structure and the location of the involved cis-regulatory binding sites. A TF that only up-regulates gene expression is called


an activator, one that only down-regulates gene expression is called a repressor, and one that can either up- or down-regulate gene expression is called a dual regulator. A TF is also called a response regulator in prokaryotes, since it is usually activated through phosphorylation by a sensor kinase that can “sense” a specific change in the intracellular or extracellular environment, which together constitute a so-called two-component signal transduction system (Stock et al., 2000). Thus, environmental changes sensed by a specific two-component system will lead to changes in gene expression of the corresponding regulon (Balazsi and Oltvai, 2005). Regulons are often considered the basic functional unit of gene transcriptional regulation. Elucidation of regulons encoded in a genome represents one of the most fundamental problems in the determination of gene transcriptional regulatory networks in a prokaryotic cell. Traditionally, experimental characterization of the members of a regulon starts with identifying the operons and genes that are differentially expressed when the corresponding TF is activated or inhibited, followed by characterization of the cis-regulatory elements in the promoter regions of the differentially expressed genes (Jiang et al., 1995; Neidhardt et al., 2002). Taking advantage of the large number of sequenced genomes, a high-throughput technique termed chromatin immunoprecipitation coupled with DNA microarray hybridization (or ChIP-chip) has recently been developed and widely applied to identify the binding sites of a TF at a genome scale (Ren et al., 2000; Wade et al., 2007). However, it is currently difficult, if not impossible, to experimentally characterize all the regulons encoded in a genome using any of these experimental methods due to the amount of work involved and the high costs that they may incur. 
Computational elucidation of regulons has become feasible because of the development of a number of computational techniques for modeling and identification of DNA binding sites over the past two and a half decades (Stormo, 2000; Stormo and Hartzell, 1989) and the development of several computational methods that have been used to predict the members of specific regulons. The earlier methods for predicting new cis-regulatory binding sites started by compiling known binding sites of interest, and then employed the sequence profiles of the known binding sites to search for additional sites in the genome of interest (Stormo et al., 1982a; Stormo et al., 1982b). Recently, Gelfand and colleagues (Gelfand, 1999; Mironov et al., 1999) introduced the phylogenetic footprinting technique (Tagle et al., 1988) for predicting specific regulons. This method and its variations have been adapted for prediction of regulons conserved across closely related bacterial and archaeal genomes (Bulyk et al., 2004; Gerdes et al., 2003; Laikova et al., 2001; Makarova et al., 2001; McGuire et al., 2000; Panina et al., 2001; Rodionov et al., 2001; Tan et al., 2001; Thomas et al., 2003; Vitreschak et al., 2002). Structure-based algorithms have also been developed to predict new binding sites of a TF whose tertiary structure is known (Becker et al., 2006; Endres et al., 2004; Havranek et al., 2004; Kaplan et al., 2005; Kono and Sarai, 1999; Liu et al., 2005; Morozov et al., 2005; Robertson and


Varani, 2007; Sarai et al., 2005). For example, Liu and colleagues (2005) developed a knowledge-based potential function to model the interactions between protein residues in the DNA binding domain of a TF and its interacting nucleotides in the binding sites. Using this potential function, they were able to predict all known E. coli binding sites of CRP (cAMP receptor protein) in the top 15% of all candidate sequences (Liu et al., 2005). Though promising, these methods have only limited applications since accurate structures of most TFs are not yet available. The first genome-wide regulon prediction was carried out by van Nimwegen and coworkers (van Nimwegen et al., 2002) using Monte Carlo sampling of the putative binding sites to partition thousands of short conserved DNA sequences into clusters identified using the phylogenetic footprinting method. In another study, Qin et al. (2003) used a Gibbs sampling-based Bayesian motif clustering algorithm to cluster hundreds of predicted putative cis-regulatory binding sites identified through genome-wide phylogenetic footprinting analyses. They predicted a collection of genes that are associated with a cluster of binding sites to form a regulon (Qin et al., 2003). More recently, Alkema et al. (2004) proposed another phylogenetic footprinting-based algorithm to tackle the genome-wide regulon prediction problem. In this chapter, we first introduce the structural features of regulons and a mathematical formulation of the regulon prediction problem. We then introduce and discuss an algorithm for predicting the regulon of a specific TF about which some facts are known, followed by a discussion of an algorithm for genome-wide prediction of regulons.

2. Structure of Prokaryotic Regulons

Our current understanding of regulon structures in prokaryotes mainly comes from the studies of regulons in model bacteria, in particular, E. coli, whose genome encodes ∼4,237 genes, ∼310 of which are characterized or predicted as TFs (Madan Babu and Teichmann, 2003; Perez-Rueda and Collado-Vides, 2000). A database called RegulonDB has been compiled to store the known genes/operons regulated by a TF in E. coli as well as their corresponding cis-regulatory binding sites (Salgado et al., 2001). As of now, the database contains 154 TFs, 1,463 cis-regulatory binding sites, and 2,110 regulatory interactions. Most TFs in RegulonDB regulate a few genes, while three TFs (IHF, FNR and CRP) regulate more than 100 genes (Madan Babu and Teichmann, 2003). The average number of genes per (known) regulon in E. coli is 7. Most operons in E. coli are regulated by a single TF, while some operons are regulated by more than two TFs (Madan Babu and Teichmann, 2003), which is in contrast to the situation in eukaryotes where genes are all regulated by multiple TFs (Levine and Tjian, 2003). The largest number of TFs known so far to co-regulate a single operon in E. coli is 7 (Madan Babu and Teichmann, 2003). Most of the known TFs in E. coli can work with other TFs to co-regulate the transcription of operons, and only a few TFs work alone (Madan Babu and Teichmann, 2003). Furthermore, some TFs can work with more than one type of

[Figure 1 image. Labels recoverable from the figure: plasma membrane; sensor kinase PhoR; response regulator PhoB; transporters PitA, PitB, PstS, PstC, PstA, PstB, PstU, PhnC, PhnD, PhnE; PhnGHIJKL (C-P lyase complex); Pi, ATP, ADP; PhoB regulon operons phoBR, phoA, phoE, phoH, iciA, pstSCABU, phnCDEFGHIJKL, pitA, pitB. Legend: chemical conversion or translocation; regulation.]

Fig. 1. A schematic of the PhoB regulon and the phosphorus assimilation pathway in E. coli. The operons enclosed in the dotted box show the organization of the known members of the PhoB regulon. The positions and orientations of the operons are arbitrarily arranged, and thus do not reflect their actual locations and orientations on the chromosome.

σ-factors, while others are known to work with only one type of σ-factor (Madan Babu and Teichmann, 2003). Some TFs can form a homo-oligomer, and thus bind to direct or inverted repeats of similar binding sites, possibly separated by short spacers in the genomic sequence. The binding of an oligomeric TF to a string of tandem binding sites is believed to be one of the ways for the TF to accomplish more flexible control of transcription (Blanco et al., 2002).

Figure 1 shows the phosphorus assimilation pathway and the corresponding PhoB regulon in E. coli to illustrate the typical structure of a regulon and the pathway it encodes in bacteria. The regulator PhoB is a dual regulator. Upon being phosphorylated by the sensor kinase PhoR, it can either activate or repress the transcription of the relevant genes. PhoB positively regulates its own transcription. In addition, PhoB can form a homo-dimer, trimer or tetramer, and thus can bind to two, three or four direct repeats of the consensus sequence CTGTCACA. The PhoB regulon constitutes the core of the phosphorus assimilation pathway in E. coli. When inorganic phosphate (Pi) is abundantly available in the environment, PhoB stays inactive, and Pi is taken up by the constitutively expressed low-affinity transporters PitA and PitB. When Pi becomes limiting in the environment, PhoB is activated by PhoR, and the expression of the high-affinity ABC-type Pi transporter (the PstSCAB complex), as well as of the phosphonate transporter PhnCDE and the C-P lyase, is activated to scavenge traces of Pi in the environment and to switch to the alternative phosphorus source, the organic phosphonates.


3. The Problem of Regulon Prediction

As mentioned above, experimental characterization of the members of a regulon and the corresponding cis-regulatory binding sites in a genome is often laborious and expensive. Consequently, even for the most extensively studied bacterium, E. coli, we know only a small fraction of its entire collection of regulons (Salgado et al., 2001), not to mention other less-studied organisms. The rapidly increasing pool of finished bacterial genome sequences creates an urgent need for much improved capabilities in computational prediction of cis-regulatory elements and their associated TFs, since the experimental characterization of regulons and their binding sites clearly cannot keep up with the rate of world-wide genome sequencing of prokaryotes. We expect that computation will play a key and much expanded role in the elucidation of regulon structures in prokaryotic genomes in the years to come. Computationally, the regulon prediction problem can be formulated as follows: given a genome sequence and the annotation of its genes and operon structures, find all the members of each regulon and its corresponding cis-regulatory binding sites for each TF encoded in the genome. It should be noted that for the vast majority of target genomes for regulon elucidation, the annotation of their genes and operons (see Chapters 2 and 10) will mainly come from computational predictions, which will probably continue to be the case for the foreseeable future. We expect that predictions of operons or even genes will not be 100% correct, which will in turn pose challenges for accurate prediction of regulons. For the convenience of discussion throughout this chapter, we also call a single-gene transcription unit an operon, specifically a singleton operon.
Operationally, since operons in a regulon share similar binding sites, the regulon prediction problem boils down to finding all maximal groups of operons such that operons in each group share at least one set of similar cis-regulatory elements; each such group is deemed to be a regulon.
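
As a toy illustration of this operational view, the sketch below groups operons by the motif cluster to which their predicted binding sites have been assigned. The input mapping, the labels `op1`, `m1`, etc., and the function name `group_operons_by_motif` are all hypothetical; the hard part in practice, deciding which binding sites count as "similar", is assumed to have been done already.

```python
from collections import defaultdict

def group_operons_by_motif(operon_to_motifs):
    """Invert an operon -> motif-cluster mapping into motif-cluster -> operons.

    Each returned group is a candidate regulon. An operon that carries sites
    from several motif clusters appears in several groups, mirroring the fact
    that an operon can belong to more than one regulon.
    """
    regulons = defaultdict(set)
    for operon, motifs in operon_to_motifs.items():
        for motif in motifs:
            regulons[motif].add(operon)
    return {motif: sorted(operons) for motif, operons in regulons.items()}

# Hypothetical assignments of operons to motif clusters (not real data).
example = {
    "op1": {"m1"},
    "op2": {"m1"},
    "op3": {"m1", "m2"},
    "op4": {"m2"},
}
candidate_regulons = group_operons_by_motif(example)
```

Here `op3` ends up in both candidate regulons because it is assumed to carry binding sites from both motif clusters.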

4. Representation of cis-Regulatory Elements

The binding of a TF to its cis-regulatory elements on the double-stranded DNA is determined by the unique 3D structure of the DNA binding domain of the TF as well as the 3D structure of a cis-regulatory element (McKay and Steitz, 1981). These structures are determined by their amino-acid and DNA sequences, respectively. The cis-regulatory elements of a TF in a genome can nevertheless be degenerate, although they are more conserved than the flanking non-functional sequences. The cis-regulatory elements of the same TF generally have the same length, ranging from 6 to 25 base pairs. When all the cis-regulatory elements of a TF are aligned, some positions are more conserved than the others (Stormo, 2000), often corresponding to the key binding interactions between the TF and a cis-regulatory element. When modeling and characterizing the set of known cis-regulatory elements of a TF, they are often represented using a 4 × l matrix called their sequence profile, computed


based on the (gapless) multiple sequence alignment of these cis-regulatory elements (l is the length of the cis-regulatory elements). Each column in this profile represents the relative frequencies of the four letters, {A, C, G, T}, in this aligned position (Stormo, 2000). We call the collection of all the binding sites of a TF a motif. A position weight matrix (PWM) can also be computed for a motif using the following formula:

\[ w_{i,j} = \log_2 \frac{p_{i,j}}{q_i}, \tag{1} \]

where $p_{i,j}$ is the relative frequency of base i at column j in the multiple sequence alignment, and $q_i$ is the frequency of base i in the genome. The conservation of each column j in the profile can be measured by the relative information (entropy) content defined as

\[ I(j) = \sum_{i \in \{A,C,G,T\}} p_{i,j} \log_2 \frac{p_{i,j}}{q_i}. \tag{2} \]

The significance of a motif represented by the profile M can be defined as the summation of I(j) over all the columns in the multiple sequence alignment,

\[ I(M) = \sum_{j=1}^{l} \sum_{i \in \{A,C,G,T\}} p_{i,j} \log_2 \frac{p_{i,j}}{q_i}, \tag{3} \]

where l is the length of the motif.
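
To make Eqs. (1) to (3) concrete, the following is a minimal sketch in plain Python. The function names, the dictionary-based layout of the profile and the treatment of zero frequencies (a PWM entry of minus infinity, and the usual convention 0 · log 0 = 0 in the information content) are choices made here for illustration and are not prescribed by the text.

```python
import math

BASES = "ACGT"

def sequence_profile(sites):
    """Relative frequency of each base in each column of a gapless alignment.

    Returns a dict base -> list of l frequencies (the 4 x l sequence profile).
    """
    l = len(sites[0])
    if any(len(s) != l for s in sites):
        raise ValueError("binding sites must all have the same length")
    n = len(sites)
    return {b: [sum(s[j] == b for s in sites) / n for j in range(l)]
            for b in BASES}

def position_weight_matrix(profile, background):
    """Eq. (1): w[i][j] = log2(p[i][j] / q[i]); -inf where p[i][j] is zero."""
    return {b: [math.log2(p / background[b]) if p > 0 else float("-inf")
                for p in profile[b]]
            for b in BASES}

def column_information(profile, background, j):
    """Eq. (2): relative information content of column j (0 * log 0 := 0)."""
    return sum(profile[b][j] * math.log2(profile[b][j] / background[b])
               for b in BASES if profile[b][j] > 0)

def motif_information(profile, background):
    """Eq. (3): sum of the column information over all l columns."""
    l = len(profile["A"])
    return sum(column_information(profile, background, j) for j in range(l))

# Made-up aligned sites and a uniform background, for illustration only.
sites = ["CTGTCACA", "CTGTCATA", "CTGTAACA", "TTGTCACA"]
background = {b: 0.25 for b in BASES}
profile = sequence_profile(sites)
pwm = position_weight_matrix(profile, background)
total_information = motif_information(profile, background)
```

In practice a pseudo-count is usually added before taking logarithms so that no PWM entry is infinite; the scanning procedure of Section 5.3.3 does exactly that.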

5. Prediction of Regulons

As mentioned earlier, the cis-regulatory elements of a TF could potentially be predicted from the 3D structure of the TF, and several attempts have been made with some level of success (Becker et al., 2006; Endres et al., 2004; Havranek et al., 2004; Kaplan et al., 2005; Kono and Sarai, 1999; Liu et al., 2005; Morozov et al., 2005; Robertson and Varani, 2007; Sarai et al., 2005). However, a number of issues have limited the applications of such a strategy, including that such methods generally require a high-resolution (e.g., better than 2 Å) structure of the TF-DNA complex, which may not be available for most of the TFs. Hence, the current methods for predicting cis-regulatory motifs and regulons are mainly based on sequence information only. It should be noted that when only sequence information is used, it is very difficult, if not impossible, to predict a cis-regulatory binding site accurately in a single genomic sequence, since any segment of the genomic sequence could potentially be a binding site. Virtually all the existing sequence-based algorithms work by finding over-represented sequence segments among a set of input sequences, each of which is likely to contain a cis-regulatory site and which collectively contain possibly conserved regulatory motifs, based on experimental data or prior biological knowledge (GuhaThakurta, 2006; Siggia, 2005).


5.1. Phylogenetic Footprinting

Identification of a set of genes that are potentially regulated by the same TF could be achieved through a clustering analysis of genes based on their microarray gene-expression profiles collected under different culture conditions, some of which are known to alter the activity of the TF. However, this method has only limited applications, because microarray gene expression data are in general not readily and sufficiently available for most of the newly sequenced prokaryotic genomes. The phylogenetic footprinting method has proven to be a powerful technique for identifying cis-regulatory elements for a TF in a prokaryotic genome, given that one or more genes in the genome or their orthologs in closely related genomes are known to be regulated by the TF or its orthologs (Blanchette et al., 2002; Manson and Church, 2000; McCue et al., 2002; McGuire et al., 2000; Tan et al., 2001; Tompa, 2001). The premise for the phylogenetic footprinting method to work is that the DNA binding domains of orthologous TFs in closely related organisms are generally well conserved, and hence their respective cis-regulatory binding sites should also, to a great extent, be conserved. When applying a phylogenetic footprinting procedure, genes known to be regulated by a TF in the target genome along with their orthologs in a group of closely related genomes (reference genomes) are identified, their upstream regulatory regions are extracted and pooled, and then over-represented sequence segments are identified in this pool of sequences using a motif finding algorithm. It should be noted that even with this powerful technique, precise identification of cis-regulatory elements is still notoriously difficult, due to the short lengths and the degenerate nature of the cis-regulatory elements (Hu et al., 2005; Tompa et al., 2005).
In practice, two versions of the regulon prediction problem of increasing difficulty have been addressed using phylogenetic footprinting-based algorithms: (1) prediction of all the cis-regulatory elements for a known TF, for which a few target genes or operons are known (Manson and Church, 2000; McCue et al., 2002; McGuire et al., 2000; Tan et al., 2001), and (2) de novo prediction of all the cis-regulatory elements for each of the TFs encoded in a genome (Alkema et al., 2004; van Nimwegen et al., 2002). We call the first problem the special regulon prediction problem, which utilizes the prior knowledge about a regulon to be predicted, and the second problem the general regulon prediction problem, which attempts to predict all the groups of operons that share similar cis-regulatory binding sites for the same TF.

5.2. Regulon Prediction for a Known TF Through Comparative Genome Analyses

This problem arises because, in practice, biologists often have some prior knowledge about the TF, including (1) the way by which it is activated, (2) the biological process(es) in which it plays a role, (3) some operons that it regulates, and (4) some


of its cis-regulatory elements. In such a case, biologists are interested in knowing all the members of the regulon and the related cis-regulatory elements of the TF in the genome of interest. The computational approach to this problem can be divided into two steps: (1) identification of a subset of the cis-regulatory motif from a set of regulatory regions of operons that are known to be regulated by the TF; and (2) scanning the regulatory regions of the target genome using the sequence profile of the known binding sites to find additional ones, and thus additional members of the regulon.

5.3. Identification of cis-Regulatory Motif Through Phylogenetic Footprinting Analyses

For the convenience of discussion, we also call the collection of similar cis-regulatory elements for a TF and its orthologs in closely related genomes a motif. A subset of such a motif can be identified using the following phylogenetic footprinting analyses, which have been demonstrated to achieve rather high prediction accuracy compared to the other existing methods (Su et al., 2006).

5.3.1. Choice of Reference Genomes for Phylogenetic Footprinting

Reference genomes used for phylogenetic footprinting can be chosen based on phylogenetic tree analyses of the DNA binding domains of the orthologous TFs among all the sequenced prokaryotic genomes (Su et al., 2006). We have noted that evolutionarily too distant genomes will not help to identify cis-regulatory elements, since the binding sites might have diverged too much, and therefore their level of sequence conservation might be below the level of detection using existing motif finding methods. Hence the reference genomes should include only genomes that are closely related to the target genome. However, too closely related genomes, such as different strains of the same species, should not be considered since their sequences might not have diverged sufficiently for the regulatory motifs to stand out from the background genomic sequences. Typically one should choose as reference genomes those from the same genus as the target genome but not from the same species, or operationally, choose genomes (not from the same species) whose orthologous TFs share at least 80% sequence identity with the TF of interest in the target genome. In terms of the size of the reference genome pool, our own experience has been that the more reference genomes are included in the analysis, the better the representation of the subset of the cis-regulatory motif one can expect to produce.
In general, at least five reference sequences are needed to achieve a statistically significant result.

5.3.2. Prediction of cis-Regulatory Elements and Possible σ-Factor Binding Sites

A motif finding program such as MEME (Bailey and Gribskov, 1998), BioProspector (Liu et al., 2001) or CUBIC (Olman et al., 2003) can be used to identify potential


cis-regulatory binding sites of the TF or its orthologs from the pool of the collected promoter sequences as discussed above. If some cis-regulatory binding sites are already known for the TF, they can be used to validate the prediction result (Su et al., 2006). If no such information is available, then multiple motif finding tools should be used to confirm or reject a motif prediction result, based on some voting scheme over the results predicted by the multiple motif-finding programs (Hu et al., 2005; Tompa et al., 2005). Furthermore, if it is known that the TF can form an oligomer and can bind to tandem repeats of binding sites, then multiple similar binding sites should be sought (Su et al., 2007). Similarly, if the TF is known to work in concert with other TFs to co-regulate the transcription of genes, then the cis-regulatory binding sites of the latter TFs should also be identified (Bulyk et al., 2004). It should also be noted that in addition to the binding of a TF to its cis-regulatory elements, the initiation of transcription in bacteria requires the binding of a σ-factor of the RNAP to a specific sequence in the promoter region as well (see Chapter 8 for details). Therefore, besides the cis-regulatory elements of the TF, one should also attempt to identify the possible σ-factor binding sites. Often, a σ-factor binding site is located at a fixed distance from the cis-regulatory binding sites (Su et al., 2006). Simultaneous identification of a cis-regulatory element(s) and a σ-factor binding site in the regulatory region of a gene/operon can greatly increase the reliability of the predicted binding sites, as was previously demonstrated (Bulyk et al., 2004; Su et al., 2006).
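
The text leaves the voting scheme open. One simple possibility, sketched below purely as an assumption, is to keep a predicted site only if at least `min_votes` programs report it at the same position. The tool names, the `(sequence_id, start)` representation of a site and the function name `vote_on_sites` are hypothetical, and the sketch does not call any real motif-finding program.

```python
from collections import Counter

def vote_on_sites(predictions, min_votes=2):
    """Keep sites reported by at least `min_votes` of the tools.

    predictions: dict mapping a tool name to an iterable of
                 (sequence_id, start) tuples predicted by that tool.
    Sites are matched by exact position; a real scheme would probably
    tolerate small offsets between tools.
    """
    votes = Counter(site
                    for sites in predictions.values()
                    for site in set(sites))
    return {site for site, count in votes.items() if count >= min_votes}

# Hypothetical outputs of three motif finders on three promoter sequences.
predictions = {
    "toolA": [("seq1", 10), ("seq2", 5)],
    "toolB": [("seq1", 10)],
    "toolC": [("seq2", 5), ("seq3", 7)],
}
supported_sites = vote_on_sites(predictions, min_votes=2)
```

With these made-up inputs the sites on `seq1` and `seq2` are retained because two tools agree on each, while the site on `seq3`, reported by a single tool, is dropped.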

5.3.3. Prediction of Additional Members of a Regulon by Scanning the Target Genome

Once some of the cis-regulatory binding sites of a TF and its orthologs are identified by a phylogenetic footprinting procedure as described above, the sequence profile of these sites can be used to search for additional cis-regulatory binding sites by scanning all the regulatory regions of the target genome and finding the “matched” sequences. For each predicted cis-regulatory binding site, its immediate downstream operon will be predicted as a member of the regulon. The following provides an algorithm for the scanning process for additional cis-regulatory elements. Other algorithms exist (Gelfand et al., 2000; Manson and Church, 2000; McGuire et al., 2000; Tan et al., 2001), but they either lack a cohesive way to incorporate information from multiple sources or have relatively high false positive rates. Given the profile of a cis-regulatory motif M with length l, the score for a sequence v of length l matching M can be defined as

\[ s_M(v) = \sum_{j=1}^{l} I_j \ln \frac{p_{j,v_j}}{q_{v_j}}, \tag{4} \]

\[ I_j = \frac{1}{a} \sum_{b \in \{A,C,G,T\}} p_{j,b} \ln \frac{p_{j,b}}{q_b}, \tag{5} \]

\[ a = \frac{n+1}{n+4} \ln(n+1) - \ln(n+4) - \frac{1}{n+4} \sum_{b \in \{A,C,G,T\}} \ln q_b - \frac{n}{n+4} \ln \min_{b \in \{A,C,G,T\}} q_b, \tag{6} \]

where $v_j$ is the base at position j of v, $p_{j,b}$ the relative frequency of base b at position j in M, $q_b$ the relative frequency of base b occurring in the background, and n is the number of sequences in M. A pseudo-count of 1 is added to the frequency of each base at each position in the profile when computing $p_{j,b}$. The coefficient a is used to normalize the relative information content so that $I_j$ will be in the range [0, 1]. When a sequence t from a genome is scanned against the profile M, the substring (with length l) of t that maximizes the scoring function of Eq. (4) will be returned as a primarily predicted binding site; the score of the predicted binding site found from t when scanned using M is defined as

\[ s_M(t) = \max_{h \subset t} \sum_{j=1}^{l} I_j \ln \frac{p_{j,h_j}}{q_{h_j}}, \tag{7} \]

where h is any substring of t with length l, and $h_j$ is the base at position j of h. Thus, at most one possible binding site in each sequence t is considered; the definition could easily be generalized to handle a t with multiple binding sites. Due to the short lengths and degenerate nature of cis-regulatory motifs, such scanning of the entire regulatory regions of a target genome may result in many false positive predictions (GuhaThakurta, 2006). One possible way to reduce the false predictions is to scan the genome together with a companion binding motif (Bulyk et al., 2004; Su et al., 2007) or a co-located σ-factor binding motif (Su et al., 2006), since the co-occurrence of multiple (predicted) binding sites in a sequence by chance should be substantially less likely than the occurrence of a single (predicted) binding site by chance. When multiple profiles $M_1, \ldots, M_z$ are used for scanning the genome, the score of co-occurrence of multiple binding sites in a sequence t can be defined as

$$s_{M_1,\ldots,M_z}(t) = \sum_{j=1}^{z} s_{M_j}(t). \qquad (8)$$
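The scoring scheme of Eqs. (4) to (8) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_profile`, `site_score`, `scan`, `cooccurrence_score`) and the toy sites are hypothetical, and the sketch assumes that all training sites have the same length and contain only the bases A, C, G and T.

```python
import math

BASES = "ACGT"

def build_profile(sites, background):
    """Return (p, info): per-position base frequencies with a pseudo-count
    of 1, and the normalized information content I_j of Eqs. (5)-(6)."""
    n, l = len(sites), len(sites[0])
    # Normalizing constant a of Eq. (6).
    a = ((n + 1) / (n + 4) * math.log(n + 1) - math.log(n + 4)
         - sum(math.log(background[b]) for b in BASES) / (n + 4)
         - n / (n + 4) * math.log(min(background[b] for b in BASES)))
    p, info = [], []
    for j in range(l):
        counts = {b: 1 for b in BASES}          # pseudo-count of 1 per base
        for site in sites:
            counts[site[j]] += 1
        pj = {b: counts[b] / (n + 4) for b in BASES}
        p.append(pj)
        info.append(sum(pj[b] * math.log(pj[b] / background[b])
                        for b in BASES) / a)    # Eq. (5)
    return p, info

def site_score(v, p, info, background):
    """Score of a candidate site v of length l, Eq. (4)."""
    return sum(info[j] * math.log(p[j][v[j]] / background[v[j]])
               for j in range(len(v)))

def scan(t, p, info, background):
    """Best-scoring window of t, Eq. (7). Returns (score, start index).
    Assumes len(t) >= profile length."""
    l = len(p)
    starts = range(len(t) - l + 1)
    best = max(starts, key=lambda i: site_score(t[i:i + l], p, info, background))
    return site_score(t[best:best + l], p, info, background), best

def cooccurrence_score(t, profiles, background):
    """Sum of the best scores of several profiles on t, Eq. (8)."""
    return sum(scan(t, p, info, background)[0] for p, info in profiles)
```

With a uniform background and the toy sites `["GTAC", "GTAC", "GTAC", "GTAT"]`, the first three positions are perfectly conserved, so their normalized information content reaches the upper bound of 1 that the coefficient a is designed to enforce.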

When the sequence profiles of a cis-regulatory motif and a σ-factor binding motif were used together for new motif scanning, the false positive rate could in general be reduced about twofold compared with using the cis-regulatory binding motif alone, as demonstrated previously (Su et al., 2006). To reduce the false positive rate further, one can consider the co-occurrence of similar multiple binding sites in the regulatory regions of orthologous genes in some closely related genomes (Su et al., 2006; Tan et al., 2001). The finding of similar multiple binding sites in the regulatory regions of the orthologous genes should further reduce the false predictions. Let t be the upstream sequence of an operon $U(g_1 \cdots g_n)$ in the target genome T. If $g_i$ has orthologs in $m_i$ closely related genomes $G_1, \ldots, G_{m_i}$, let $o_k(g_i)$ be the upstream sequence of the ortholog of $g_i$ in genome $G_k$. Then the score of co-occurrence of the multiple binding sites in t can be redefined as

[…] $\sum^{m_i} \sum^{z}\; l_j - d_{i,j,k}\;\; s_{M_j}(o_k(g_i)),$ […] (9)

[…] $p(S_I > s)$ and $p(S_C > s)$ to represent the cumulative probabilities that I and C contain sequences bearing putative binding sites with scores s(v) > s and s(w) > s, respectively. To avoid possible biased sampling of C, multiple (e.g., 1000) $C_{U(g_1,\ldots,g_n)}$'s should be generated for each operon $U(g_1, \ldots, g_n)$ in the target genome, and $p(S_C > s)$ can be computed from these pooled sequences. The following log-odds ratio (LOR) function can then be used to estimate the confidence of the predictions in a genome:

$$\mathrm{LOR}(s) = \ln \frac{p(S_I > s)}{p(S_C > s)}. \qquad (10)$$
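The log-odds ratio of Eq. (10) can be estimated from two empirical score samples, as in the following minimal sketch. It assumes, for illustration only, that the best site scores from the intergenic set and from the pooled coding-region control set are already available as plain lists; the function names and the rule used to pick a cutoff are hypothetical and are not taken from the original work.

```python
import math

def tail_probability(scores, s):
    """Empirical p(S > s): fraction of scores strictly greater than s."""
    return sum(1 for x in scores if x > s) / len(scores)

def log_odds_ratio(intergenic_scores, coding_scores, s):
    """LOR(s) of Eq. (10); returns None when either tail is empty,
    because the logarithm is then undefined."""
    p_i = tail_probability(intergenic_scores, s)
    p_c = tail_probability(coding_scores, s)
    if p_i == 0 or p_c == 0:
        return None
    return math.log(p_i / p_c)

def cutoff_for_p_value(coding_scores, alpha):
    """Smallest observed coding-region score s with p(S_C > s) <= alpha
    (an assumed, simple way to turn a p-value into a score cutoff)."""
    for s in sorted(coding_scores):
        if tail_probability(coding_scores, s) <= alpha:
            return s
```

For example, with intergenic scores `[2, 5, 8, 11, 13, 14]` and coding scores `[1, 2, 3, 4, 5, 6, 7, 8, 9, 12]`, the tails above s = 7 are 4/6 and 3/10, so LOR(7) = ln((4/6)/0.3), a positive value indicating enrichment of high-scoring sites in the intergenic set.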

The higher LOR(s) is, the more likely it is to find binding sites with a score greater than s in the intergenic regions than in the coding regions, and thus the more likely it is that such sites found in the intergenic regions are true binding sites. Since $p(S_C > s)$ is the probability of a type I error for testing the null hypothesis that $I_U$ does not contain a binding site when $S_I$ is greater than a cutoff s, it can be used to estimate the false positive rate of the predictions, i.e., the p-value. A cutoff score corresponding to a certain p-value can be used for binding site predictions in a genome.

5.3.4. An Application: Prediction of NtcA Regulon in Cyanobacteria

Nitrogen control in cyanobacteria is mediated by NtcA, a transcriptional regulator that belongs to the CRP (cAMP receptor protein) family (Reitzer, 2003). NtcA is known to bind to a palindromic motif GTAN8TAC (Herrero et al., 2001). In addition to this motif, the promoter regions of known NtcA-activated genes also contain a −10, E. coli σ70-like box of the form TAN3T, located ∼22 bp downstream (Herrero et al., 2001). Since the DNA binding domain of the available NtcA sequences from nine sequenced cyanobacteria (Gloeobacter violaceus PCC 7421, Nostoc sp. PCC 7120, Prochlorococcus marinus CCMP1375, Prochlorococcus marinus MED4, Prochlorococcus marinus MIT9313, Synechococcus elongatus PCC 6301, Synechococcus sp. WH8102, Synechocystis sp. PCC 6803 and Thermosynechococcus elongatus BP-1) is highly conserved, it is likely that NtcA will recognize similar DNA sequences in different cyanobacteria; thus all nine sequenced genomes were used in the phylogenetic footprinting analyses. We pooled the upstream regions of the orthologs, in each of the nine cyanobacterial genomes, of 11 genes that are known to be regulated by NtcA in at least one cyanobacterium.
These genes include ammonia permease amt, nitrogen global regulator ntcA, glutamine synthetase glnA, signal transduction protein PII glnB, urea transporter subunit A urtA, nitrite reductase nirA, heterocyst differentiation protein hetC, heterocyst-specific ABC transporter devB, group 2 σ70 factor rpoD-V, nitrate assimilation transcriptional activator ntcB and isocitrate dehydrogenase icd. Fifty-one putative NtcA binding sites were identified from a total of 65 pooled upstream sequences. The 31 bp regions downstream of these predicted NtcA binding sites were further searched for 6 bp, −10, E. coli σ70-like boxes; the identified sites can then be used to create the sequence profile for the −10 σ-factor binding motif. The logo representations of the profiles of the NtcA binding sites and −10-like boxes are shown in Figs. 2A and 2B, respectively. These two sequence profiles were used to scan each target genome, with the remaining genomes serving as the reference genomes [see scoring function (9)]. As shown in Fig. 3, the LORs for all nine genomes increase monotonically beyond a score of around 12, indicating that this scoring function can differentiate well between the putative binding sites found in regulatory regions and those found in coding regions. Since most, if not all, of the high-scoring sites found in the coding regions are presumed to occur by chance, the much higher probability of high-scoring sites in the regulatory regions implies that these sites must have been under selection pressure,

Fig. 2. Sequence logos of the binding motifs of NtcA (A) and σ-factor (B).

[Fig. 3: nine panels, one per genome (CCMP1375, MIT9313, PCC6803, PCC7421, MED4, PCC7120, PCC6301, Thermosynechococcus, WH8102); x-axis: Score (0 to 16); y-axis: Probability or LOR; legend: CU, IU, LOR, training.]

Fig. 3. The probabilities (p(S > s)) of the scores of 52 promoters in the training set from nine cyanobacteria (cyan), and those of the scores of putative promoters found in the IU’s (pink) and CU’s (blue), and their LOR (red), when both the profiles of NtcA binding sites and −10 σ70-like boxes are used for scanning; the presence of similar sites in the regulatory regions of orthologs in other genomes is also considered. The dotted vertical line in each panel shows the largest score cutoff for the genome that still includes all the binding sites from that genome in the training set.

and thus they are biologically meaningful. In this case, they are more likely to have regulatory functions. In order to predict new members of the NtcA regulons in the nine sequenced cyanobacterial genomes, one can choose for each genome a cutoff of the combined score s(t) as defined in Eq. (9) so that a predefined statistical significance level can be achieved. In general, the lower the p-value used, the higher the confidence one has in the predictions. Specifically, we consider predictions with a p-value < 0.01 as highly statistically significant, and those with a p-value < 0.05 as statistically significant. Table 1 summarizes the prediction results in WH8102. As can be seen from these results, genes bearing a high-scoring putative NtcA promoter (e.g., p-value < 0.05) always have at least one ortholog in other genomes. Putative novel NtcA promoters are found for many genes involved in nitrogen assimilation. Intriguingly, high-scoring NtcA promoters are also found for many genes involved in the various stages of the photosynthetic process, suggesting that these genes serve as the regulatory points to orchestrate nitrogen assimilation and photosynthesis in a cyanobacterial cell (Su et al., 2006).

5.4. De novo Prediction of Regulons in a Genome Through Comparative Genomics Analysis

The approach described in the above section can only predict members of a regulon for which a few genes are already known in the target genome or in at least one of its closely related genomes. In general, such information about the cis-regulatory binding sites may not be available. Moreover, biologists may be more interested in learning about the unknown cis-regulatory systems encoded in a genome. This leads to what we call the genome-wide de novo regulon prediction problem. Phylogenetic footprinting-based approaches have been proposed to address this problem as well (Alkema et al., 2004; van Nimwegen et al., 2002). We now outline the main idea of a modified algorithm for the problem based on the work of Alkema et al. (2004). The algorithm starts by identifying orthologs for each gene in a target genome across multiple reference genomes. A motif finding algorithm is then applied to find putative binding sites from the pooled upstream sequences of each gene in the target genome and of its orthologs in the reference genomes, assuming that orthologous genes in closely related genomes share similar cis-regulatory binding sites. Putative motifs of low quality are discarded and similar ones are merged. The resulting non-redundant motifs are used to scan the entire target genome for additional cis-regulatory binding sites using the algorithm described in Sec. 5.3.3. The details of the algorithm follow.

5.4.1. Prediction of Orthologous Groups (OGs) in Closely Related Genomes

To predict all possible cis-regulatory elements in a target genome, T, a set of closely related reference genomes, denoted as R, is selected; these can be identified using a phylogenetic tree analysis based on small-subunit ribosomal RNA sequences (Alkema et al., 2004). To find orthologs for as many genes in T as possible, one should try to use as many closely related (but not too closely related, as discussed in Sec. 5.3.1) genomes as possible. Then, orthologs for each gene in T across all the genomes in R are predicted using an orthologous gene prediction algorithm such as the bi-directional best hits method

Table 1. Predicted NtcA and σ-factor binding sites in WH8102.

Rank | NtcA site | Downstream of NtcA binding site and −10 like box | NtcA site position | Score
1 | GTTccggttgaTAC | CAAAGCGGTGGGGGGCCCTTTTTTACCTTCC | −52 | 14.1879
2 | GTGcgcgttgaTAC | AAAACAGGGCATAACGGCTCCTTACGGTCGT | −60 | 13.7574
3 | GTAtcacctgaTAC | AACATCCGCGTTCGCTTTCCAACTATAAATA | −51 | 13.692
4 | GTAgtttaggaAAC | ATATGGGTTAAAGATTTTTCGTTATCAAGAG | −39 | 13.1547
5 | GTGttagttaaTAC | ACAAGCATGTACTAGACTGCGCTAGTTTAAT | −64 | 13.1128
6 | GTAgtcgccgcTAC | ATCTGGTGGGGTGGGCAGACCGTCCTCCACC | −62 | 13.1126
7 | GTAgctaatttTAC | TCTAGCCTTGTTTTCTATATAGCTAGCACTA | −173 | 13.0991
8 | GTAacaacaccTAC | AGCCTGAACCAGTCCACTCGGTAACACTATG | −143 | 12.7911
9 | GTAattccatcAAC | AGAACAACTTTTGAGTACGAACTAGAAAAGG | −349 | 12.7111
10 | GTTcagtcggaTAC | ACCATCCGGCGTGACCAGCAGCTCTGCACTC | −37 | 12.7063
11 | GTAgctgatcaCAC | CGCGCGTGCCACCGGTGCCGCAGACAGTGGA | −69 | 12.6782
12 | GTTgatggaatTAC | GATTCCGCTCGTATTGCCGTCTGTAACTCTT | −199 | 12.5851
13 | GTAataaagacTGC | GGAATTAATATTTCGGCAATACTTATACCTT | −111 | 12.5376
14 | GTAgcggcgacTAC | CGATATCGGCGCTCCTGACGGGGCTGGCGGG | −527 | 12.4768
15 | GTCgtatttcgTAC | ATTTTTTGTGGGCCGACCGGAGCCAGTCTTT | −52 | 12.4149
16 | GTTacaggggcTAC | CCACACCGCCACCATTCACGTCATGCTTAAT | −51 | 12.2099
17 | GTCatggatacTAC | CCTTGCCCACCTCTGTACACTTTCGGGTAGC | −56 | 12.1677

Transcription unit (as listed, ranks 1 to 17): synw2442 synw1073 synw2485 synw2486 synw2487 synw0165 synw1105 synw0153 synw0154 synw0347 synw1434 synw1435 synw2477 synw0253 synw1412 synw1413 synw1414 synw1415 synw1416 synw1417 synw1418 synw2475 synw2476 synw1507 synw1508 synw0152 synw1422 synw1423 synw1424 synw1425 synw1426 synw0462 synw2171

Name (as listed; "-" marks an unnamed gene): ranks 1 to 3: urtA glnA cynA cynB cynC; ranks 4 to 11: - - - - - nirA amt1 - hypA2 hypB; ranks 12 to 15: - cobA - mc -; ranks 16 to 17: - glnB -

(Overbeek et al., 1999) or the ones presented in Chapter 9. Each gene in T and its orthologs in the genomes of R form an orthologous group (OG). For each OG o, the whole upstream inter-operonic region of each gene in o is extracted according to the predicted operons in the genomes. If a gene in T has fewer than four orthologs in R, it should not be considered, in order to ensure that the predicted cis-regulatory motifs will be statistically significant.

5.4.2. Prediction of Conserved Motifs from the Pooled Sequences of Each OG

Although different binding motifs may have different lengths, we consider only a fixed motif length l (e.g., l = 16) to simplify our discussion. The best putative motif of length l is found from the pooled sequences for each OG using a motif finding algorithm such as MEME (Bailey and Gribskov, 1998), BioProspector (Liu et al., 2001) or CUBIC (Olman et al., 2003). For the set of pooled inter-operonic sequences of each OG, the same number of coding sequences with the same length is randomly selected from the reference genomes, and the best scoring motif is identified using the same motif finding program. Motifs found in the upstream regions of OGs are filtered according to the distribution of the scores of motifs found from the randomly selected coding sequences; those with low scores (below some predefined cutoff) are discarded.

5.4.3. Prediction of Regulons by Clustering Conserved Motifs

Some of the identified putative motifs from the upstream regions of different OGs may be similar to each other and thus are presumably bound by the same TF. These similar motifs should be identified and merged. For this purpose, we define the dissimilarity between two motifs $M_i$ and $M_j$ by

$$d(M_i, M_j) = \frac{1}{n_i n_j} \sum_{u=1}^{n_i} \sum_{v=1}^{n_j} \frac{l - l' + H(s_u, t_v)}{l}, \qquad (11)$$

where $s_u$ and $t_v$ are the putative binding sites in $M_i$ and $M_j$, respectively, $n_i$ and $n_j$ the numbers of putative binding sites in $M_i$ and $M_j$, respectively, l′ the length of the non-gapped aligned portions of $s_u$ and $t_v$ computed by the Needleman–Wunsch algorithm (Needleman and Wunsch, 1970), and $H(s_u, t_v)$ the Hamming distance between the aligned portions of $s_u$ and $t_v$. Similar motifs can then be clustered using the UPGMA algorithm (Durbin et al., 1998), and similar sequence profiles can be merged into a larger one. The resulting sequence profiles are then used to scan the whole genome for additional binding sites, and thus additional members of the regulons, using the algorithm described in Sec. 5.2.

5.4.4. An Application: Prediction of Regulons in Staphylococcus aureus

Alkema and coworkers (Alkema et al., 2004) applied their algorithm to S. aureus to predict all possible regulons of the genome, using 11 reference genomes in their prediction. For the 2,594 annotated genes in S. aureus, they found that 1,818 have orthologs in at least four reference genomes, and thus form a valid OG for cis-regulatory binding site identification. They identified 1,430 over-represented sequence patterns from the promoter regions of these OGs, using motif finding algorithms. After removing the low-scoring patterns and clustering the similar patterns using the UPGMA algorithm (Durbin et al., 1998), 125 patterns were kept. The profiles of these 125 motifs were used to scan the intergenic regions of the S. aureus genome for additional sites.
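The motif dissimilarity of Eq. (11), which feeds the clustering step used above, can be sketched in Python as follows. This is a deliberately simplified illustration rather than the published procedure: it assumes that all sites have the same length l and are compared position by position without gaps, so that l′ = l and no Needleman–Wunsch alignment is needed. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def motif_dissimilarity(sites_i, sites_j):
    """Eq. (11) under the assumption of ungapped, full-length alignments
    (l' = l), i.e. the mean normalized Hamming distance over all site pairs."""
    l = len(sites_i[0])
    total = 0.0
    for s in sites_i:
        for t in sites_j:
            aligned = l                      # l': assumed full-length, no gaps
            total += (l - aligned + hamming(s, t)) / l
    return total / (len(sites_i) * len(sites_j))
```

For instance, `motif_dissimilarity(["GTAC", "GTAT"], ["GTAC"])` averages the two pairwise distances 0 and 1/4 and returns 0.125. The resulting pairwise values could then be passed to any average-linkage (UPGMA) clustering routine.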

6. Discussion

Regulons are the basic functional units of transcriptional regulatory networks in prokaryotes. Complete elucidation of the regulon structures in a genome is a prerequisite for elucidating the entire gene regulatory network of the genome. Due to the limitations of structure-based approaches, sequence-based algorithms have been and will continue to be the major tools for the prediction of cis-regulatory binding sites and regulons in the foreseeable future. Although significant progress has been made in the last few years in the computational prediction of regulons using comparative genomics approaches, many problems remain unsolved. First, all the current motif finding programs have only a very low prediction specificity (30–50%) and sensitivity (30–50%), as summarized in two recent survey articles (Hu et al., 2005; Tompa et al., 2005). Very often, the most significant scores of a prediction algorithm do not correspond to real motifs (Hu et al., 2005); thus more effective scoring schemes are urgently needed to capture the essence of true cis-regulatory binding sites. In the meantime, one should try to use multiple complementary algorithms for predicting motifs, and multiple hits from each algorithm should be considered. Second, the coverage and accuracy of genome-wide de novo prediction of cis-regulatory binding sites are still low. For example, the algorithm of van Nimwegen and coworkers (van Nimwegen et al., 2002), which uses Monte Carlo sampling, predicted only ∼100 regulons in the E. coli genome, which is predicted to encode ∼310 TFs (Madan Babu and Teichmann, 2003; Perez-Rueda and Collado-Vides, 2000) and thus about the same number of regulons. This suggests that the algorithm predicted only ∼33% (van Nimwegen et al., 2002) of the regulons, though this number may be improved to some extent when using more reference genomes. The prediction accuracy of the algorithm by Alkema and coworkers (Alkema et al., 2004) was not high either. When tested on the E. coli data set, it predicted 55% of known binding sites while having a false positive rate of ∼13% (Alkema et al., 2004). Novel algorithms with substantially improved accuracy and coverage are clearly needed. Third, the results of cis-regulatory binding site predictions based on phylogenetic footprinting also depend on the accuracy of the gene and operon prediction methods used; improvement of these methods, in particular of operon prediction, will help improve the accuracy of cis-regulatory binding site predictions. Lastly, computational assignment of predicted cis-regulatory binding sites to the corresponding TF remains a virtually

Table 2. Common tools for motif finding and operon prediction.

Program | Description | URL

Motif Finding Tools
MEME | An expectation maximization algorithm | http://meme.sdsc.edu/meme/intro.html
AlignACE | A Gibbs sampling based algorithm | http://atlas.med.harvard.edu/
CONSENSUS | A greedy algorithm | http://bifrost.wustl.edu/consensus/
BioProspector | A Gibbs sampling strategy followed by Monte Carlo simulation | http://robotics.stanford.edu/~xsliu/BioProspector/
CUBIC | A minimal spanning tree based algorithm | http://csbl.bmb.uga.edu/downloads/#cubic
Gibbs Motif Sampler | The original Gibbs sampling strategy | http://bayesweb.wadsworth.org/gibbs/gibbs.htm
MotifSampler | Gibbs sampling strategy with a higher order background model | http://homes.esat.kuleuven.be/~thijs/Work/MotifSampler.html

Motif Matching Tools
Mast | Uses the p-value to evaluate the score of a sequence for its match to a motif profile | http://meme.sdsc.edu/meme/mast-intro.html
Match | Uses the matrix similarity score and the core similarity score to evaluate how well a sequence matches a motif profile | http://www.gene-regulation.com/pub/programs.html#match
MBscan | Evaluates the match of segments of a sequence to a set of motif profiles by considering the appearance of similar sites in the orthologous intergenic regions of closely related genomes | http://www.cs.uncc.edu/~zcsu/tools
MSCAN | Evaluates the combined statistical significance for a set of sequences to match a set of binding motif profiles | http://www.nada.kth.se/~ojvind/?page=mscan

Display Motif Logo
Weblogo | Web server to generate graphic sequence logo representations of motifs | http://weblogo.berkeley.edu/
enoLOGOS | Web server to generate graphic sequence logo representations of motifs | http://biodev.hgen.pitt.edu/cgi-bin/enologos/enologos.cgi

Operon Prediction Tools
JPOP | Trains a neural network using intergenic distance, phylogenetic profile, COG annotation and conserved gene order across multiple reference genomes to predict operons | http://csbl.bmb.uga.edu/downloads/#jpop
Operon Finder | Integrates several sources of information, including intergenic distance, conserved gene order across multiple reference genomes and common gene annotation, to predict operons | http://www.cse.wustl.edu/~jbuhler/research/operons/

Table 3. Databases for bacterial transcriptional regulation.

Database | Description | URL
GenBank | All known nucleotide and protein sequences | http://www.ncbi.nlm.nih.gov/Genbank/index.html
KEGG | Pathway annotation for all sequenced genomes | http://www.genome.jp/kegg/
SMD | Microarray data sets and analysis tools | http://genome-www5.stanford.edu/index.shtml
DBD | Predicted TFs for sequenced prokaryotes | http://dbd.mrc-lmb.cam.ac.uk/DBD/index.cgi?Home
ODB | All known operons in more than 50 genomes | http://odb.kuicr.kyoto-u.ac.jp/
TRACTOR_DB | Predicted regulons in sequenced gamma-proteobacterial genomes | http://www.bioinfo.cu/Tractor_DB/
PRODORIC | A comprehensive database about gene transcription regulation in prokaryotes, with the largest available list of transcription factor binding sites in prokaryotes | http://prodoric.tu-bs.de/
DBTBS | Promoters, operons, regulons, TFs and TFBSs in Bacillus subtilis | http://dbtbs.hgc.jp/
RegulonDB | Promoters, operons, regulons, TFs and TFBSs in E. coli | http://regulondb.ccg.unam.mx/
GOLD | Information regarding complete and ongoing genome projects around the world | http://www.genomesonline.org/
EcoCyc | Experimentally verified information of all kinds about transcriptional regulation, transporters and metabolic pathways in E. coli K-12 MG1655 | http://www.ecocyc.org/

untouched and very challenging problem, though two papers have reported some efforts towards this direction (Birnbaum et al., 2001; Tan et al., 2005). Final solutions to these problems might rely on the development of new technologies for detecting protein–DNA interactions at a genome scale as well as the development of more accurate computational algorithms for such purposes.

7. Related Resources on the Internet

In Tables 2 and 3 we list the relevant computational tools and databases, respectively, available on the Internet for regulon prediction.

8. Further Reading

Gelfand MS: Recognition of regulatory sites by genomic comparison. Res Microbiol 1999, 150:755–771.

McCue LA, Thompson W, Carmack CS, Lawrence CE: Factors influencing the identification of transcription factor binding sites by cross-species comparison. Genome Res 2002, 12(10):1523–1532.

GuhaThakurta D: Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res 2006, 34(12):3585–3598.

van Nimwegen E, Zavolan M, Rajewsky N, Siggia ED: Probabilistic clustering of sequences: inferring new bacterial regulons by comparative genomics. Proc Natl Acad Sci USA 2002, 99(11):7323–7328.

Alkema WB, Lenhard B, Wasserman WW: Regulog analysis: detection of conserved regulatory networks across bacteria: application to Staphylococcus aureus. Genome Res 2004, 14(7):1362–1373.

Su Z, Olman V, Mao F, Xu Y: Comparative genomics analysis of NtcA regulons in cyanobacteria: regulation of nitrogen assimilation and its coupling to photosynthesis. Nucleic Acids Res 2006, 33(16):5156–5171.

Acknowledgments

This research was supported in part by the US Department of Energy’s Genomes to Life program under the project “Carbon Sequestration in Synechococcus sp.: From Molecular Machines to Hierarchical Modeling”, by the National Science Foundation (NSF/DBI-0354771, NSF/ITR-IIS-0407204, NSF/DBI-0542119, NSF/CCF-0621700) and by a start-up fund from the University of North Carolina at Charlotte to Z.C.S.


CHAPTER 12 PREDICTION OF BIOLOGICAL PATHWAYS THROUGH DATA MINING AND INFORMATION FUSION

FENGLOU MAO, PHUONGAN DAM, HONGWEI WU, I-CHUN CHOU, EBERHARD VOIT and YING XU

1. Introduction

A key to understanding how living organisms work is detailed insight into how genes (and their protein products) carry out their functions in the global context of cellular machineries. Biochemical (or biological) pathways provide an effective framework for carrying out studies on natural or designed molecular and cellular functions of gene products. Biological pathways can be generally categorized into three classes based on their functionalities: (a) metabolic pathways, (b) regulatory pathways, and (c) signaling pathways. These pathways, together with pathways responsible for transportation across membranes, comprise the main working machinery of a living cell.

1.1. Metabolic Pathways

The earliest studies of metabolism can be traced back to about 400 years ago, when Santorio published his book Ars de statica medicina (Eknoyan, 1999), describing how he weighed himself after eating, fasting, sleeping, working, drinking and excreting. In the 19th century, studies by Pasteur on the fermentation of sugar to alcohol by yeast (Manchester, 1995) as well as the studies by Wöhler on the chemical synthesis of urea (Kinne-Saffran and Kinne, 1999) demonstrated that the organic compounds and chemical reactions found in cells of a living organism were not different in principle from those observed elsewhere. The discovery of enzymes by Buchner at the turn of the 20th century marks the beginning of the research discipline that we now call biochemistry (Herring et al., 2005), while the discoveries of the urea cycle, the citric acid cycle, and the glyoxylate cycle by Krebs and Kornberg (Kornberg, 2000; Kornberg and Krebs, 1957; Krebs and Johnson, 1937) laid the scientific foundation for the study of metabolism. Chemical chain reactions involved in an organism’s metabolism are generally referred to as metabolic pathways, each of which is a sequence of biochemical reactions catalyzed by enzymes to maintain a cell’s homeostasis. The collection of all metabolic pathways, possibly interacting, in a cell forms the cell’s metabolic network. At the chemical level, many of the metabolic processes are the same or similar across different organisms, although the detailed implementation in terms of which genes encode which enzymes may be different. Radioactive tracers have been used to identify pathways from the precursors to the final products via radioactively labeled intermediates and products (Rennie, 1999), whereas enzymatic activities can be examined through analyses of cell extracts, enzyme assays, gene mutation, and culture enrichment and deprivation. With the rapid emergence of a wide range of omics techniques, such as genomics, transcriptomics, proteomics, interactomics, metabolomics and phenomics, for the rapid generation of large quantities of experimental data, it is now possible to reconstruct metabolic pathways in a systematic manner (Gianchandani et al., 2006). Computation is playing a major role in such endeavors. Prediction and modeling of metabolic pathways can be done at either the stoichiometric or the kinetic level. Stoichiometric models describe a metabolic network using a set of stoichiometric equations, each of which represents a biochemical reaction, together with mass balance constraints on the metabolites at the steady state of the system, which are typically used to determine the metabolic fluxes (Stephanopoulos, 1999). When detailed information is available about the kinetics of the cellular processes, a kinetic model can instead be used to describe their dynamic properties (Gombert and Nielsen, 2000). Such pathway models allow analyses of general properties of these pathways. For example, analyses of the changes of metabolic fluxes in response to different genetic or environmental perturbations can be used to elucidate metabolic flux control and provide a theoretical basis for metabolic engineering (Stephanopoulos, 1999).
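The mass balance constraint behind a stoichiometric model can be made concrete with a small Python sketch. The branched toy network below (uptake of A, conversion of A to B or C, and export of B and C) is invented for illustration and does not come from the chapter; the function names are likewise hypothetical. At steady state, the stoichiometric matrix S and the flux vector v must satisfy S v = 0.

```python
# Rows: metabolites A, B, C. Columns: reactions
# v1: -> A (uptake), v2: A -> B, v3: A -> C, v4: B -> (export), v5: C -> (export)
S = [
    [1, -1, -1,  0,  0],   # A
    [0,  1,  0, -1,  0],   # B
    [0,  0,  1,  0, -1],   # C
]

def net_production(S, v):
    """Net production rate of each metabolite, i.e. the product S v."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]

def is_steady_state(S, v, tol=1e-9):
    """True if every metabolite is mass-balanced (S v = 0 within tol)."""
    return all(abs(x) <= tol for x in net_production(S, v))
```

With an uptake flux of 10 and a measured export of 6 for B, mass balance fixes the remaining fluxes: `[10, 6, 4, 6, 4]` is a steady state, whereas `[10, 6, 3, 6, 3]` leaves one unit of A unaccounted for per unit time.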

1.2. Regulatory Pathways

A metabolic process can be regulated at various components of the underlying pathway, which may control the timing of activation or deactivation of specific genes as well as the amount of functional products of genes and their functional states. Such regulation can in principle take place at the transcriptional, translational or post-translational level; in prokaryotes, the most important regulation takes place at the transcriptional level (Voet, 2005). The (transcription) regulatory machinery in a prokaryotic cell is made of a network of transcription regulators and the genes that they regulate. The research field of gene transcription and regulation was ignited by the discovery of RNA polymerase in the late 1950s, when Weiss (Weiss, 1959) and Stevens (Stevens, 1960) described RNA polymerases for eukaryotes and prokaryotes, respectively. In 1961, Jacob and Monod proposed a general model and established many of the key principles of bacterial gene regulation, among which the most important is the existence of trans-acting factors that control gene transcription by binding to cis-acting DNA motifs near the genes (Jacob and Monod, 1961). Since 1978, a model of gene transcription, regulated through interactions between trans-acting regulators and cis-acting DNA motifs, has gradually emerged (Brent and Ptashne, 1985; Dynan and Tjian, 1983; Engelke et al., 1980; McKnight and Kingsbury, 1982; Payvar et al., 1981; Tjian, 1978). Unlike in the study of metabolic pathways, there had not been simple techniques for effectively “tracing” the pathways of a transcription regulatory network until recently, when microarray techniques started to be widely used for measuring gene expression at a large scale (Schena et al., 1995). This may partially explain why the number of well-characterized regulatory pathways is substantially lower than the number of well-characterized metabolic pathways, as stored in the most comprehensive pathway database, KEGG (Kanehisa, 2002). The situation has started to change since the rapid emergence of microarray gene expression techniques in the mid-1990s. A number of regulatory networks, for instance, the global transcription regulation network of E. coli K12, have been published through integrated studies of large-scale transcription profile analyses under controlled mutagenesis and computational genome analyses (Gardner et al., 2003; Ishii et al., 2007; Magasanik, 2000). Various computational techniques have been developed for the prediction, analysis and simulation of regulatory networks, ranging from static to dynamic models, and from deterministic to stochastic models (de Jong, 2002), using techniques such as ordinary differential equations (Alfieri et al., 2007; Bennett et al., 2007), Boolean networks (Kauffman, 1969), Petri nets (Chen and Hofestadt, 2003; Hardy and Robillard, 2004), Bayesian networks (Hartemink et al., 2001; Ong et al., 2002; Perrin et al., 2003) and others. These computational techniques allow scientists to develop predictive network models, and to study different network properties, such as critical points, bottleneck paths and bifurcations (de Jong, 2002).
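As one concrete instance of the model classes listed above, a Boolean network treats each gene as ON or OFF and updates all genes synchronously from logical rules. The three-gene network in this minimal Python sketch is purely hypothetical (A is repressed by C, B is activated by A, and C requires both A and B); it is meant only to show how such a model is simulated and how its attractor is found.

```python
def step(state):
    """One synchronous update of the hypothetical network (A, B, C)."""
    a, b, c = state
    return (not c,      # A is ON unless repressed by C
            a,          # B follows its activator A
            a and b)    # C needs both A and B

def find_attractor(state):
    """Iterate from `state` until a state repeats; return the cycle
    (the attractor) as a list of states."""
    seen = {}
    trajectory = []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state)
    return trajectory[seen[state]:]
```

Starting from `(True, False, False)`, this toy network never settles into a fixed point; it cycles through five states, which is the kind of oscillatory attractor a negative feedback loop can produce in a synchronous Boolean model.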

1.3. Signaling Pathways

The third class of biological pathways comprises signaling pathways, which are cellular processes that communicate external or internal stimuli (signals) to the relevant response systems via a sequence of biochemical reactions or physical interactions. While eukaryotic organisms use more complex signaling cascades for information communication, prokaryotes often rely on a simple mechanism for signal transduction, i.e., two-component systems (Alex and Simon, 1994; Bekker et al., 2006). A typical two-component system consists of a sensor kinase and a response regulator, with the former activating the latter through phosphorylation in response to a specific environmental or internal signal. Ninfa and Magasanik (1986) first discovered the two-component signaling system for nitrogen regulation in E. coli. Since then, two-component systems have been found in many response systems in both bacteria and archaea (Bourret et al., 1991; Parkinson and Kofoid, 1992; Stock et al., 1990). Because of their simplicity and their direct interactions with the regulatory machinery in prokaryotes, signaling systems are often modeled together with the regulatory pathways in computational studies of prokaryotic regulatory systems.

2. Experimental Data for Pathway Studies

The ever-increasing ability to sequence whole genomes of living organisms has fundamentally changed biological science. The rapidly expanding pool of sequenced genomes (over 500 at the time of publication) has provided an exceptionally rich source of information for deciphering the secrets of life. Through direct comparisons of these genomes, one can derive enormous amounts of information about the biology of living organisms. In addition, the advent of different high-throughput omics techniques in the past decade has made it practically feasible to infer cellular machineries at a systems level, particularly for prokaryotes.

2.1. Genome Sequence Data

The first complete genome, that of a viral RNA (bacteriophage MS2), was sequenced by the Fiers group in 1976 (Fiers et al., 1976). In 1977, Sanger and colleagues completed the first DNA-genome sequencing project, on phage Φ-X174, consisting of 5,386 base pairs (bps) (Sanger et al., 1977). The first bacterial genome was sequenced in 1995, when TIGR completed the sequencing project on H. influenzae (Fleischmann et al., 1995). Since then, 515 prokaryotic and 26 eukaryotic organisms have had their whole genomes sequenced, and over 1,000 are currently in the pipeline to be fully sequenced. By simply comparing these genomes in terms of sequence similarity, gene organization and conserved DNA motifs, a great amount of information, such as protein-encoding regions (see Chapter 2), operons (see Chapter 10), regulons (see Chapter 11), transposable elements (see Chapter 5), horizontally transferred genes (see Chapter 6), and protein–protein interactions (Marcotte et al., 1999), can be derived.
Since genes working in the same prokaryotic metabolic pathways are typically organized in a few operons or regulons, comparative genome analyses can provide information about the component genes of a target pathway.

2.2. Transcriptomic Data

The transcriptome is the set of all messenger RNA (mRNA) molecules produced at a given time in a cell or a population of cells. Transcriptomic data, which measure the relative abundance of these mRNAs, reflect the activity levels of the genes at a given time and under designed conditions. By comparing transcriptomic data collected under different conditions, one may be able to derive information about the functions of individual genes and their functional associations. Such data have been widely used in studies of transcription machineries and regulation (Jenner and Young, 2005). Various types of microarrays have been used for collecting such
data, including full-length cDNA arrays, oligo arrays, Affymetrix arrays, and tiling arrays. Readers may refer to other reviews (Conway and Schoolnik, 2003; Dharmadi and Gonzalez, 2004) for more detailed discussions of these techniques. To date, microarray data have been used most extensively in the reconstruction of transcriptional regulation networks (see Chapter 11 for more details).

2.3. Proteomic Data

Proteomic data provide information about which proteins are present in the cell, their quantities, their functional states (often adjusted through post-translational modification (PTM)), and their interactions with other molecules. Various experimental techniques are available for collecting large-scale proteomic data, including 2-D gel electrophoresis, high-performance liquid chromatography (HPLC) and mass spectrometry. With the current state of the art, no more than 20% of the mass spectral peaks collected from simple bacterial cells can be interpreted (VerBerkmoes et al., 2006; Zhang et al., 2006), indicating that many of the proteins present may not be identifiable any time soon. There are a number of reasons for the difficulties in interpreting proteomic data. The foremost is the intrinsic complexity of the proteome compared to the genome and transcriptome of a cell, due to the dynamic nature of the functional states of individual proteins.

2.4. Measuring Metabolites

Metabolites are the intermediate or final products of metabolism, and the term usually refers to small molecules in a cell. The metabolome is the whole collection of metabolites found in a cell. Popular techniques for metabolite data collection include HPLC, mass spectrometry (e.g., Brauer et al., 2006), and in vivo nuclear magnetic resonance (e.g., Neves et al., 1999).
The data collected using such measurement techniques represent an instantaneous snapshot of the physiology of a cell, and provide essential information about a cell, such as the identities and quantities of many of the metabolites present. These data are critical to the systematic reconstruction of metabolic pathways (see Chapter 13).

2.5. Molecular Interaction Data

Two types of molecular interaction data are most useful for the elucidation of biological pathways, namely (a) protein-protein interaction data and (b) protein-DNA interaction data, which form the basic units in a pathway.

2.5.1. Protein–Protein Interactions

Traditionally, experimental methods for the identification of protein-protein interactions were low-throughput, typically focusing on a few specific proteins. Recent technological advances have enabled several high-throughput experimental
techniques such as the yeast two-hybrid system, phage display, affinity purification of protein complexes and protein chips (Gavin et al., 2002; Ho et al., 2002; Ito et al., 2001; Tong et al., 2002; Tucker et al., 2001; Uetz et al., 2000; Uetz and Hughes, 2000; Zhu et al., 2001; Cahill and Nordhoff, 2003). The yeast two-hybrid system (Uetz and Hughes, 2000) is among the most widely used techniques. It gained its popularity after being used to study protein interactions at a genome scale using a matrix of arrays. In such experiments, multiple pairs of interactions are screened simultaneously using colony arrays, with each colony expressing a pair of proteins. The information derivable from these techniques includes protein domain-domain interactions, protein-peptide (or ligand) interactions, and protein-small molecule interactions. These methods have been used to produce massive amounts of interaction data, which are being used in many computational studies of protein-protein interactions (Ito et al., 2001; Tucker et al., 2001; Uetz et al., 2000). Besides providing protein-protein interaction data, protein arrays have been used to provide quantitative measurements of proteins in cells (Cahill and Nordhoff, 2003; Cutler, 2003).

2.5.2. Protein–DNA Interactions

ChIP-chip (Horak and Snyder, 2002) is an experimental technique that combines chromatin immunoprecipitation (ChIP) and microarray technology (chip) to provide in vivo information about protein-DNA interactions. It has been widely used for the identification of interactions between transcription factors and their DNA binding sites, a key piece of information for the elucidation of transcription regulation networks. The basic idea of a ChIP-chip experiment is that proteins are first cross-linked in vivo to the DNA they bind, and the protein-bound DNA is then sheared into short fragments. The sheared protein-DNA complexes are then isolated through immunoprecipitation. After reversal of the cross-links in the isolated protein-DNA complexes, the remaining DNA fragments are purified and

Fig. 1. Generic results of a ChIP-chip experiment. (A) Illustrates the relationship between the protein binding site on the chromosome and the ChIP enriched DNA segments. The dark blue segment represents the binding site, and the green segments are the upstream and downstream regions of the binding site, covered by the ChIP enriched segments. (B) The y-axis roughly shows the relative intensity measurement, log2(intensity), for a series of sequential probes in a tiling array, centered on the binding site; the x-axis shows the probe position on the chromosome.

hybridized to genomic DNA fragments selected from some targeted regions in the genomic DNA and pre-fixed on a microarray chip. The genomic DNA regions that interact with the protein (e.g., a transcription factor) can then be identified as the spots on the microarray chip with high fluorescent intensities, just as on a typical gene expression microarray. One challenging issue in using ChIP-chip data is that the data do not directly provide the exact binding locations of a target protein on the genomic sequence; rather, they show a bell-curve distribution (see Fig. 1) around each such location, which is in general more or less centered on the binding site, due to the randomness in shearing the protein-bound DNA at different sites (Kim and Ren, 2006).
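To make the bell-curve issue concrete, the following is a minimal sketch of how a binding location might be estimated from tiling-array probe intensities. It is not one of the published ChIP-chip analysis methods: the function name `estimate_binding_site`, the moving-average smoothing and the intensity-weighted centroid are illustrative assumptions, and the probe values are made up.

```python
def estimate_binding_site(positions, log2_intensities, window=3):
    """Estimate a binding-site position from tiling-array probes (illustrative only).

    positions: probe positions on the chromosome, in ascending order
    log2_intensities: log2 intensity measured for each probe
    window: odd number of neighbouring probes averaged to smooth noise
    """
    if len(positions) != len(log2_intensities):
        raise ValueError("positions and intensities must have equal length")
    half = window // 2
    n = len(positions)
    # Moving-average smoothing; the window is truncated at the array ends.
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(sum(log2_intensities[lo:hi]) / (hi - lo))
    # The probe at the top of the smoothed bell curve is taken as the peak.
    peak = max(range(n), key=lambda i: smoothed[i])
    # Refine with an intensity-weighted centroid around the peak,
    # using only positive (enriched) signal as weights.
    lo, hi = max(0, peak - half), min(n, peak + half + 1)
    weights = [max(0.0, v) for v in log2_intensities[lo:hi]]
    total = sum(weights)
    if total == 0:
        return float(positions[peak])
    return sum(p * w for p, w in zip(positions[lo:hi], weights)) / total


# Hypothetical probes spaced 50 bp apart with a symmetric enrichment peak.
probes = [100, 150, 200, 250, 300, 350, 400]
signal = [0.1, 0.5, 1.5, 2.5, 1.5, 0.5, 0.1]
site = estimate_binding_site(probes, signal)
```

The estimate only narrows the location to the resolution of the probe spacing; as discussed in Sec. 4.3, motif finding within the enriched region is what pinpoints the site itself.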

3. Information Derivation through Genome Analysis

It is well understood that all information about pathways in a cell is encoded in its genome. Nonetheless, effectively recovering such information is a daunting task. Using comparative genome analysis techniques, in conjunction with the interpretation of high-throughput omics data (Sec. 4), such information is much more readily derivable for prokaryotes than for their eukaryotic counterparts. Through comparative analyses of prokaryotic genomes, one can derive the following types of information and possibly much more.

3.1. Functional Assignment of Genes

Functional assignment of genes in a genome represents the first step towards systematic derivation of biological pathways. We review a few basic techniques for carrying out such a task, and refer the reader to Chapter 9 for more in-depth discussions on orthology-based functional assignment. The function of a gene (or, more specifically, its protein product) can be characterized at both the molecular and cellular levels. The popular Gene Ontology (GO) (Ashburner et al., 2000) uses a set of controlled vocabularies for assigning a gene its molecular function as well as its role(s) in cellular processes, called biological processes. There are computational techniques for predicting the molecular functions of a gene. One basic approach is based on homology or orthology information derivable through sequence comparisons, using tools like BLAST (Altschul et al., 1990) or COG (Clusters of Orthologous Groups) (Tatusov et al., 1997). Remote homology can also be identified through sequence profile comparison, sequence-structure comparison, and structure-structure comparison (Wan and Xu, 2005). Another approach relies on searches against functional motif databases like PROSITE (Hulo et al., 2006) or BLOCKS (Pietrokovski et al., 1996). Cellular functions of a gene can be predicted by identifying its association with other genes with known cellular functions. There are a few computational techniques for accomplishing this, including the Rosetta stone approach (Enright et al., 1999; Marcotte and Marcotte, 2002; Marcotte et al., 1999) and phylogenetic profile analysis (Pellegrini et al., 1999b; Wu et al., 2003).

Identified molecular functions can be used for assigning a gene to a specific role in a (generic) metabolic pathway, possibly derived through biochemical studies (which have resulted in generic pathways as shown, for instance, in KEGG (Kanehisa, 2002)), and the predicted cellular functions can be used to group genes into the same biological processes (pathways).
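As an illustration of the homology-based approach to molecular function prediction described in this section, the sketch below transfers to each query gene the annotation of its best-scoring hit in BLAST tabular output (the 12-column format produced with `-outfmt 6`). The function name `transfer_annotations`, the E-value cutoff and the example records are assumptions made for illustration; a best hit is not necessarily an ortholog, so such transfers are only a first approximation.

```python
def transfer_annotations(blast_lines, subject_functions, max_evalue=1e-5):
    """Assign each query the function of its best-scoring annotated hit.

    blast_lines: lines in BLAST tabular format with the 12 standard columns:
        qseqid sseqid pident length mismatch gapopen
        qstart qend sstart send evalue bitscore
    subject_functions: dict mapping subject IDs to functional descriptions
    max_evalue: hits with a larger E-value are ignored (assumed cutoff)
    """
    best = {}  # query ID -> (bit score, subject ID)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue or subject not in subject_functions:
            continue
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return {q: subject_functions[s] for q, (_, s) in best.items()}


# Hypothetical hits: geneA has two significant hits, geneB only a weak one.
hits = [
    "geneA\tsubj1\t45.0\t300\t150\t5\t1\t300\t1\t298\t1e-50\t200",
    "geneA\tsubj2\t60.0\t300\t110\t3\t1\t300\t1\t300\t1e-60\t250",
    "geneB\tsubj3\t25.0\t80\t55\t4\t10\t90\t5\t85\t0.01\t30.5",
]
functions = {"subj1": "sugar kinase", "subj2": "hexokinase",
             "subj3": "ABC transporter"}
annotations = transfer_annotations(hits, functions)
```

In this example geneA receives the annotation of subj2, and geneB stays unannotated because its only hit fails the E-value cutoff.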

3.2. Association Prediction of Genes

One of the key application areas of comparative genome analyses is the derivation of functional associations among genes. We summarize a few applications below.

3.2.1. Operon and Regulon Prediction

A unique feature of prokaryotic genomes is that genes working in the same pathways are often encoded in the same operons or related operons (e.g., regulons). Hence by accurately identifying operons and regulons, one can derive genes working in the same pathways. Such information, when used in conjunction with molecular function prediction of genes, provides a powerful tool for identifying component genes of a target pathway. Various computational techniques have been developed for predictions of operons and regulons. We refer the reader to Chapters 10 and 11 for details.

3.2.2. Gene Fusion and Phylogenetic Profile Analysis

Eisenberg and colleagues pioneered a number of computational techniques for predicting protein-protein interactions, based on co-occurrence and co-evolutionary relationships among proteins (Pazos and Valencia, 2001; Pellegrini et al., 1999a) as well as identified gene fusion or fission events (Enright et al., 1999; Marcotte et al., 1999; Sali, 1999). For example, gene fusion-based methods (Enright et al., 1999; Marcotte et al., 1999; Sali, 1999) predict that if the genes for a pair of proteins in genome A are fused into a single gene in genome B, then the two proteins interact with each other in genome A (Marcotte et al., 1999). In practice, such a fusion event can be identified through a BLAST search (Enright and Ouzounis, 2001; Huynen et al., 2003; Mellor et al., 2002) or its variations, such as the reciprocal best hit method using BLAST (Suhre and Claverie, 2004). Based on the hypothesis that co-evolved genes are functionally related (Gaasterland and Ragan, 1998; Huynen and Bork, 1998; Pellegrini et al., 1999a), the phylogenetic profile method infers functional relatedness of proteins.
The method encodes each gene in a genome as a binary string, called a phylogenetic profile, based on a set of reference genomes. Specifically, if a gene has a (detectable) homologue in the ith reference genome, the phylogenetic profile of the gene has a “1” in its ith bit; otherwise the bit is “0”. The authors of the technique argued that two genes having highly similar phylogenetic profiles (i.e., genes that co-evolved) are generally functionally related, for instance, with them working
in the same pathway. Various distance measures between phylogenetic profiles have been proposed, including the Hamming distance (Dam et al., 2007; Pellegrini et al., 1999a; Su et al., 2003), Shannon’s entropy distance (Chen et al., 2004), Pearson’s correlation coefficient (Glazko and Mushegian, 2004), and mutual information content (Huynen et al., 2003). Others (Date and Marcotte, 2003; Pazos and Valencia, 2001) have later improved this technique by using more sophisticated distance measures between phylogenetic profiles.

3.2.3. Homology-based Protein Interaction Prediction

Several databases have been created for protein-protein interactions, including DIP (Xenarios et al., 2002), BIND (Bader et al., 2001), MIPS (Pagel et al., 2004) and STRING (von Mering et al., 2005). One can predict protein-protein interactions through homology searches against these databases (Shoemaker and Panchenko, 2007a, 2007b; Shoemaker et al., 2006), based on the observation that many interactions are conserved across different species; such conserved interactions are called interologs (Shoemaker and Panchenko, 2007a). For example, Matthews et al. reported that when the yeast protein-protein interaction network was used to predict protein-protein interactions in C. elegans, up to 31% of the protein-protein interactions in C. elegans could be identified (Matthews et al., 2001).
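The phylogenetic profile encoding of Sec. 3.2.2 and two of the profile comparison measures mentioned there (Hamming distance and mutual information) can be sketched as follows. This is a minimal illustration rather than any of the cited implementations; the function names, the genome labels G1 to G4 and the gene names are hypothetical, and homologue detection is assumed to have been done beforehand.

```python
from math import log2


def phylogenetic_profile(gene, homologs_by_genome, reference_genomes):
    """Return a tuple of bits; bit i is 1 if `gene` has a detectable
    homologue in the ith reference genome, and 0 otherwise."""
    return tuple(1 if gene in homologs_by_genome[g] else 0
                 for g in reference_genomes)


def hamming_distance(p, q):
    """Number of reference genomes in which the two profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def mutual_information(p, q):
    """Mutual information (in bits) between two binary profiles."""
    n = len(p)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for x, y in zip(p, q) if x == a and y == b) / n
            p_a = sum(1 for x in p if x == a) / n
            p_b = sum(1 for y in q if y == b) / n
            if p_ab > 0:
                mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical input: for each reference genome, the set of target-genome
# genes that have a detectable homologue in it.
reference_genomes = ["G1", "G2", "G3", "G4"]
homologs_by_genome = {
    "G1": {"geneA", "geneB"},
    "G2": {"geneC"},
    "G3": {"geneA", "geneB", "geneC"},
    "G4": set(),
}
profiles = {g: phylogenetic_profile(g, homologs_by_genome, reference_genomes)
            for g in ("geneA", "geneB", "geneC")}
d_ab = hamming_distance(profiles["geneA"], profiles["geneB"])
d_ac = hamming_distance(profiles["geneA"], profiles["geneC"])
mi_ab = mutual_information(profiles["geneA"], profiles["geneB"])
```

Here geneA and geneB have identical profiles (distance 0), so under the co-evolution hypothesis they would be candidates for functional relatedness, whereas geneC differs from geneA in two of the four reference genomes.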

3.3. Prediction of Functional Modules

While operon and regulon predictions provide effective tools for identifying component genes of a target pathway, previous studies have shown that such predictions alone may not be adequate to identify all component genes, as some of the relevant operons may not have easily detectable relationships that tie them together (Su et al., 2006). More general techniques are needed for the identification of possible functional relationships among the predicted operons and regulons. One approach is through prediction of functional modules (Wu et al., 2005). The basic idea is first to predict a functional linkage map among all genes encoded in a genome, where a functional linkage between a pair of genes is loosely defined as the two genes working in the same pathway or network, or forming a physical complex. An interesting observation made by Wu et al. (2005) is that such a relationship can be predicted based on the genes’ (a) co-occurrences in the same genomic contexts across multiple genomes (Snel et al., 2002), (b) co-evolutionary relationships (Marcotte et al., 1999), both of which are derivable through comparative genome analyses, and (c) functional relatedness based on their annotated functions. Using such information, one can predict a functional linkage map among all genes in a genome. Figure 2 shows a portion of a genome-scale functional linkage map predicted for E. coli K12. An interesting feature of this linkage map is that some genes form densely intra-linked clusters (sub-graphs) while others have much sparser linkages. Studies
Fig. 2. Examples of functional modules. On the left is a functional linkage map for a group of E. coli genes; on the right is a collection of identified functional modules in the linkage map that are known to consist only of genes working in the same E. coli pathways. See text for details.

on E. coli K12 show that most of these densely linked clusters each correspond to genes working in the same pathways (Wu et al., 2005). Various types of information and their combinations have been used for constructing similar functional linkage maps. For example, using conserved co-occurrence relationships of genes in the same operons across multiple genomes, Snel et al. predicted an interaction network containing 3,033 orthologous gene groups from 38 prokaryotic genomes, and then identified sets of genes possibly involved in the same biological processes (Snel et al., 2002). Von Mering et al. (2003) used conserved gene neighborhood information, gene fusion events and gene clusters consisting of genes with similar phylogenetic profiles to predict the modules in their functional interaction network. Numerous methods have been developed for identifying densely intra-linked clusters in an interaction network. The method developed by Yeger-Lotem et al. (2004) searches for recurring sub-networks, called network motifs, in interaction networks, and has identified numerous such motifs. Sen et al. (2006) applied an eigenvalue/eigenvector decomposition technique to their interaction network for cluster identification. Similar ideas, based on the spectral analysis of the connectivity matrix of the interaction network, have also been explored by Bu et al. (2003) and Ihmels et al. (2004).
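The notion of a densely intra-linked cluster in a functional linkage map can be made concrete with a deliberately simple sketch. It is a stand-in, not the procedure of Wu et al. (2005) or the spectral methods cited above: weak links are dropped, connected components are collected, and only components whose edge density exceeds a cutoff are reported. The function name `dense_modules`, the thresholds and the gene labels are all assumptions.

```python
from collections import defaultdict


def dense_modules(links, min_score=0.5, min_size=3, min_density=0.6):
    """Report connected groups of genes that are densely linked.

    links: iterable of (gene_a, gene_b, score) functional linkages
    min_score: linkages scoring below this value are discarded
    min_size: smallest group reported as a module
    min_density: minimum fraction of possible gene pairs that are linked
    """
    adj = defaultdict(set)
    for a, b, score in links:
        if score >= min_score and a != b:
            adj[a].add(b)
            adj[b].add(a)
    seen, modules = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        n = len(component)
        if n < min_size:
            continue
        # Each edge is counted from both endpoints, hence the division by 2.
        edges = sum(len(adj[g] & component) for g in component) // 2
        density = edges / (n * (n - 1) / 2)
        if density >= min_density:
            modules.append(sorted(component))
    return modules


# Hypothetical linkage map: a fully linked triangle (g1, g2, g3), a sparse
# chain (g4 to g7), and one weak link joining them that is filtered out.
links = [
    ("g1", "g2", 0.9), ("g2", "g3", 0.9), ("g1", "g3", 0.8),
    ("g3", "g4", 0.2),
    ("g4", "g5", 0.8), ("g5", "g6", 0.8), ("g6", "g7", 0.8),
]
modules = dense_modules(links)
```

The triangle is reported as a module, while the chain is not: it is connected, but only half of its possible gene pairs are linked.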

4. Mining Omics Data

New omics techniques, such as microarray chips, protein arrays, ChIP-chip and phenotype microarrays, have been emerging at an explosive rate in the past decade, and this trend seems likely to continue in the foreseeable future. These techniques have allowed, and will continue to allow, systematic elucidation and analyses of biological pathways. Here we highlight some analysis techniques for mining transcriptomic and interactomic data, two of the most useful data types for the prediction of pathway networks, while we refer the reader
to review articles such as (Bar-Joseph, 2004; Wu and Dewey, 2006) for applications of other omics data.

4.1. Analyses of Microarray Gene Expression Data

Microarray experiments can provide snapshots of the functional states of a cell through measuring its gene expression levels and patterns. Information derivable from the microarray data includes co-expression or even co-regulation of genes, genes possibly involved in a particular biological process (e.g., assimilation of phosphorus), and possible regulatory relationships between transcription factors and other genes. Particularly useful are time-series gene expression data as they provide information about the dynamic behavior of genes under designed experimental conditions over a period of time (Cho et al., 1998; Oliva et al., 2005). Various computational techniques have been developed for extracting such information.

4.1.1. Identification of Differentially Expressed Genes

One basic application of microarray data is to identify genes that are affected by changes of a particular environmental condition or by a genetic perturbation. This is done through the identification of genes with substantial changes in their expression levels or expression patterns over time collected under two conditions (e.g., nitrogen depletion versus nitrogen treatment). Genes with substantial changes in expression levels may be involved in a particular biological process, e.g., the response system to nitrogen. Such inference often needs to be done in conjunction with functional analyses of genes and possibly additional microarray experiments and associated data interpretation. For example, it may require first identifying genes involved in the general stress response systems before proceeding to elucidate genes that are directly involved in the nitrogen response system. It may take a few iterations before one can propose a set of genes that are directly involved in a particular biological process (a pathway or network).

4.1.2. Identification of Co-regulated Genes through Data Clustering

One of the most widely used applications of time-course microarray data is to derive possible co-regulation relationships among genes. This has been achieved through clustering genes with similar expression patterns. Eisen and colleagues developed the first clustering algorithm for predicting co-expressed genes (Eisen et al., 1998). Since then, the development of clustering algorithms for microarray data has been one of the most active research areas in bioinformatics (Quackenbush, 2001). Predictions of co-expressed genes can be used for inference of possible co-regulatory relationships among the involved genes. A popular approach is to examine the promoter regions of the co-expressed genes in order to identify whether they have conserved sequence motifs, using tools such as MEME (Bailey and Elkan, 1994), BioProspector (Liu et al., 2001) or CUBIC (Olman et al., 2003a, 2003b).
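The clustering step described above can be sketched as a small average-linkage agglomerative procedure that uses Pearson correlation between expression profiles as the similarity measure. This is an illustrative simplification, not the implementation of Eisen et al. (1998): the function names, the stopping threshold `min_corr` and the expression values are assumptions, and the quadratic search over cluster pairs is only suitable for small examples.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / (sx * sy)


def cluster_coexpressed(profiles, min_corr=0.8):
    """Average-linkage agglomerative clustering of expression profiles.

    profiles: dict mapping gene name -> list of expression values
    min_corr: merging stops when no pair of clusters has a mean pairwise
              correlation of at least this value (assumed threshold)
    """
    clusters = [[g] for g in sorted(profiles)]

    def linkage(c1, c2):
        pairs = [pearson(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        best, best_pair = min_corr, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                r = linkage(clusters[i], clusters[j])
                if r >= best:
                    best, best_pair = r, (i, j)
        if best_pair is None:
            break  # no remaining pair is similar enough to merge
        i, j = best_pair
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Hypothetical time-course data: two rising and two falling genes.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.0, 2.5],
}
clusters = cluster_coexpressed(expression)
```

The two rising genes and the two falling genes end up in separate clusters; the promoter regions of each cluster would then be the input to a motif finder.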

Genes with similar expression patterns and conserved DNA motifs in their promoter regions are deemed possible co-regulated genes. Using such analyses, a number of predicted regulatory networks have been found to be highly correlated with available experimental data (Harbison et al., 2004). We refer the reader to Chapter 11 for more detailed discussion on this subject area. Besides inferring regulatory networks, the gene clusters can also be mapped to the well-studied (generic) pathways in pathway databases such as KEGG or BioCyc to infer the corresponding pathways in the target genome.

4.1.3. Transcription Regulators Versus the Genes They Regulate

Birnbaum et al. (2001) have developed a technique for linking a transcription factor (TF) and predicted cis-acting regulatory motifs by correlating the expression profile of the TF and the profiles of the candidate target genes. This method predicts potential binding motifs in the genomic sequence for a TF by finding motifs with “composite” expression patterns that are highly correlated with the expression pattern of the TF. The composite expression pattern of a motif B over a time series, t = t_1, ..., t_k, is defined as follows: the composite expression value at t = t_i is the sum of the expression values at t = t_i of all the genes that have B as a cis-acting regulatory motif. The intuition is that when expressed, all copies of a TF are distributed to all genes it regulates, and the expression level of each such gene should correlate well with the number of the TF molecules used to regulate the gene. This approach is based on the assumption that there is a good correlation between the RNA expression level of a transcription factor and its protein concentration, although this correlation could be relatively weak in some cases (Anderson and Seilhamer, 1997; Chen et al., 2002; Lichtinghagen et al., 2002).

4.2. Interpretation of Protein–Protein Interaction Data

As mentioned in Sec. 2.5, protein-protein interaction information can be derived through analyses of experimental data such as yeast two-hybrid data. The information derivable from these data includes protein-protein or protein domain-domain interactions, protein-peptide (or ligand) interactions, and protein-domain/small molecule interactions. Several computational techniques have been developed for extraction of such information (Asthana et al., 2004; Deane et al., 2002, 2003; Goldberg and Roth, 2003).

4.2.1. Prediction of Protein–Protein Interaction Clusters

A key issue in interpreting protein interaction data from, say, yeast two-hybrid systems is that the data are generally noisy (Birnbaum et al., 2001). Systematic analyses of yeast two-hybrid data from different experiments have shown that overlaps among protein interactions inferred from different experiments are generally low (Deane et al., 2002), indicating the possibility of high false positive prediction rates. Hence the prediction techniques mostly focus on low-resolution predictions rather than attempting to interpret individual interaction relationships
among proteins. For example, Asthana et al. (2004) developed a probabilistic model for inferring protein complexes through the identification of groups of proteins with highly dense interactions among themselves in interaction networks that were derived based on yeast two-hybrid data (Gavin et al., 2002; Ho et al., 2002; Ito et al., 2001; Uetz et al., 2000). Similar ideas have been used in the work of Deng et al. (2003) and of Bader and Hogue (2003).

4.2.2. Prediction of Domain–Domain Interactions

More detailed information about which specific domains of interacting proteins interact with each other can be derived through more detailed analyses of an interaction network. A pair of interacting proteins can be represented as a collection of interacting pairs between each protein’s domains. For example, when a pair of proteins is predicted to interact and the two proteins have domains {a, b} and {c, d}, respectively, the goal is to figure out which of the four possible domain pairs, {(a, c), (a, d), (b, c), (b, d)}, interact. One way to make the prediction is based on the relative frequencies of the domain pairs in the overall protein interaction network. For example, each pair of domains can have an (interaction) score based on its observed frequency in the above expanded domain interaction network versus its expected frequency in such a network based on the frequency distribution of each protein domain (Park et al., 2003). Deng et al. (2002) proposed a global optimization approach to simultaneously estimate the probabilities of all domain-domain interactions.

4.2.3. Feature-based Prediction of Protein–Protein Interactions

Based on known protein-protein interaction data, a number of classification techniques have been developed for predicting novel protein-protein interactions (Albert and Albert, 2004; Bader et al., 2004; Ben-Hur and Noble, 2005; Bradford and Westhead, 2005; Chen and Liu, 2005; Dohkan et al., 2006; Fariselli et al., 2002; Gilchrist et al., 2004; Jansen et al., 2003; Lee et al., 2004; Yamanishi et al., 2004; Zhang et al., 2004). The general idea is to identify a set of useful features relevant to protein interactions and train a classifier to distinguish interactions from non-interactions between proteins, based on the values of the selected features. Features used in current classification algorithms include hydrophobicity, charge and surface tension, co-expression, co-essentiality, co-localization in the cellular compartment, and phylogenetic profiles of the involved proteins. Random Forest Decision (RFD) and Support Vector Machines (SVM) (Bock and Gough, 2001, 2003; Dohkan et al., 2006; Martin et al., 2005; Yamanishi et al., 2004) are two examples of classifiers used to implement such algorithms.

4.2.4. Analysis of Protein–Protein Interaction Networks

Analyses of predicted protein (or domain) interaction networks can reveal interesting new insights about genes and biological processes. For example, Uetz and Hughes (2000) predicted the first genome-scale protein interaction network
containing 1,004 yeast proteins with 957 interactions, based on yeast two-hybrid data. The authors found that dense clusters of interacting proteins are generally involved in the same cellular functions, suggesting that this approach can be used to gain insights into the functions of un-annotated proteins. Similar work has been done on another large protein-protein interaction dataset by Ito et al. (2001), which contains 4,549 interactions among 3,278 yeast proteins. Their analyses suggest that it is possible to (i) assign functions to un-annotated proteins based on the functions of their network neighbors (Brun et al., 2003; Rain et al., 2001), (ii) elucidate the functions of duplicated genes (Baudot et al., 2004; Wagner, 2001), (iii) predict protein complexes (Bader and Hogue, 2003; King et al., 2004; Przulj et al., 2004; Spirin and Mirny, 2003), and (iv) construct biological pathways (Dam et al., 2007; Su et al., 2003, 2006).
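Point (i) above, assigning a function to an un-annotated protein from the functions of its network neighbors, can be illustrated with a simple majority vote. This sketch is an assumption-laden simplification of the cited approaches: the function name `predict_function`, the alphabetical tie-breaking rule and the example network are hypothetical.

```python
from collections import Counter


def predict_function(protein, interactions, annotations):
    """Predict a function by majority vote among annotated partners.

    interactions: iterable of (protein_a, protein_b) pairs
    annotations: dict mapping annotated proteins to a function label
    Returns None when no interaction partner is annotated.
    """
    neighbours = set()
    for a, b in interactions:
        if a == protein and b != protein:
            neighbours.add(b)
        elif b == protein and a != protein:
            neighbours.add(a)
    votes = Counter(annotations[n] for n in neighbours if n in annotations)
    if not votes:
        return None
    top = max(votes.values())
    # Ties are broken alphabetically so that the result is deterministic.
    return min(f for f, count in votes.items() if count == top)


# Hypothetical network: X interacts with two repair proteins and one
# transporter; P9 has no interactions at all.
interactions = [("P1", "X"), ("P2", "X"), ("X", "P3"), ("P4", "P1")]
annotations = {"P1": "DNA repair", "P2": "DNA repair", "P3": "transport"}
predicted_x = predict_function("X", interactions, annotations)
predicted_p9 = predict_function("P9", interactions, annotations)
```

Because interaction data are noisy (Sec. 4.2.1), a vote over several neighbors is generally more trustworthy than an inference from any single interaction.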

4.3. ChIP-chip Data for Protein–DNA Interaction

Whole-genome ChIP-chip studies allow one to determine the entire spectrum of in vivo DNA binding sites for a DNA-binding protein. While interpreting ChIP-chip data for mammalian systems is challenging due to the complexity of the genomes (such as the prevalence of repetitive sequences), it is relatively easy to interpret the data for prokaryotic genomes. Numerous computational methods have been developed for predicting the approximate regions of the DNA binding sites of a specific protein across the whole genome, including TileMap (Ji and Wong, 2005), Chipper (Gibbons et al., 2005), TiMAT (http://bdtnp.lbl.gov/TiMAT), MAT (Johnson et al., 2006) and Hidden Markov Model based methods (Du et al., 2006; Li et al., 2005). Based on the predicted approximate regions for the binding sites, one can further pinpoint the exact binding locations using a motif finding program like MEME (Bailey and Elkan, 1994), BioProspector (Liu et al., 2001) or CUBIC (Olman et al., 2003a, 2003b). This technique has been applied successfully many times to identify transcription regulatory networks for microbial organisms. A whole-cell transcription regulatory network for yeast was identified in 2004 (Harbison et al., 2004). The technique has also been used to identify the spatial patterns of transcriptional activities (Jeong et al., 2004) and to map RNA polymerase binding sites in E. coli (Herring et al., 2005).

5. Pathway Prediction through Pathway Mapping

Many metabolic pathways are essential for the survival of an organism, and they exist in very similar form in many different species (Kanehisa, 2002). These pathways employ the same or similar network structures as well as the same or similar set of enzymes. For example, the glycolysis/gluconeogenesis pathway, which is associated with the utilization of glucose in a cell, exists in virtually all organisms with sequenced genomes, ranging from E. coli to H. sapiens, and this pathway can

Fig. 3. Generic pathway model for glycolysis and gluconeogenesis in KEGG.

be represented by a generic pathway model as shown in Fig. 3. The detailed pathway structures differ across organisms in that each organism may use only a portion of this generic pathway, but all contain a common core. Based on this observation, a set of pathway prediction programs has been developed. These programs map a pathway onto the target genome based on known or partially known models of
similar pathways in other organisms. We refer to members of each such class of similar pathways as homologous pathways. It should be noted that, though these homologous pathways have similar network structures and similar sets of enzymes, the genes that encode each of these enzymes may not necessarily be homologues or even remote homologues. Below we summarize a few popular computational methods that have been developed for mapping known pathways onto specified target genomes.

5.1. Speciation of KEGG Generic Pathways by KAAS Based on Sequence Information

The KEGG pathway database currently contains 255 generic pathways (Kanehisa, 2002), accumulated through many years of studies on metabolism by the whole research community, and collected and manually curated by Kanehisa’s group at Kyoto University. Each generic pathway is represented as a set of biochemical reactions and the associated enzymes that catalyze the reactions. KEGG also provides a suite of computational tools in support of its pathway mapping. For instance, one of the tools maps KEGG generic pathways to a specific genome, which we call speciation of a generic pathway. KEGG does this through another database called KEGG Orthology (KO), which classifies genes into orthologous groups, each given a K-number (Kanehisa, 2002). KEGG associates one or multiple K-numbers with each reaction of a generic pathway, specifying the enzymes that are legitimate for that reaction. To map such a pathway to a target genome, KEGG assigns K-numbers to the genes of the target genome, so that genes with the same enzymatic roles in the generic pathway receive the same K-numbers. The subsystem of KEGG that accomplishes this mapping task is KAAS (Moriya et al., 2007), whose overall workflow is illustrated in Fig. 4.
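Once genes of the target genome carry K-numbers, the speciation step itself reduces to a lookup, which the sketch below illustrates. It does not reproduce KAAS or any KEGG interface: the function names, the reaction labels (R1 to R3) and the K-number placeholders (KA to KZ) are hypothetical and are not real KEGG identifiers, and the K-number assignment is assumed to have been done already.

```python
def speciate_pathway(generic_pathway, gene_to_ko):
    """Map a generic pathway onto a target genome through K-numbers.

    generic_pathway: dict mapping reaction ID -> set of K-numbers
                     accepted for that reaction
    gene_to_ko: dict mapping target-genome gene -> assigned K-number
    Returns a dict mapping each reaction to the sorted list of target
    genes that can fill it (an empty list if none can).
    """
    ko_to_genes = {}
    for gene, ko in gene_to_ko.items():
        ko_to_genes.setdefault(ko, []).append(gene)
    mapped = {}
    for reaction, kos in generic_pathway.items():
        mapped[reaction] = sorted(
            g for ko in kos for g in ko_to_genes.get(ko, []))
    return mapped


def coverage(mapped):
    """Fraction of reactions that have at least one candidate gene."""
    if not mapped:
        return 0.0
    return sum(1 for genes in mapped.values() if genes) / len(mapped)


# Placeholder identifiers only; these are not real KEGG entries.
generic = {"R1": {"KA"}, "R2": {"KB", "KC"}, "R3": {"KD"}}
gene_to_ko = {"t1": "KA", "t2": "KC", "t3": "KC", "t4": "KZ"}
mapped = speciate_pathway(generic, gene_to_ko)
covered = coverage(mapped)
```

In this example reaction R3 has no candidate gene, which reflects the observation that an organism may use only a portion of a generic pathway, although a missing reaction can also result from a missed K-number assignment.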

Fig. 4. Diagram of the overall procedure of KAAS. [Figure not reproduced; box labels: target gene sequence; BLAST to KEGG gene database; homologs of target gene; ortholog candidates; assignment to KO groups; KO groups; ranking KOs for pathway assignment; ranked KOs.]

5.2. Pathway Construction by PathoLogic Based on Gene Annotation

PathoLogic is a computational module in the Pathway Tools package (Karp et al., 2002), which can map well-characterized metabolic pathways in MetaCyc (Caspi et al., 2006) to a target genome. Unlike KEGG pathway mapping, PathoLogic maps a (template) pathway known for a particular organism, instead of a generic pathway in KEGG, to a target genome, by identifying candidate genes in the target genome for each enzyme in the template pathway. It uses a dictionary to select candidate genes, if any exist, from the target genome for each gene in the template pathway. The dictionary consists of information extracted from the MetaCyc database, the ENZYME database, an enzyme name table developed by Pangea Systems, and a user-provided name table that represents the user’s additional knowledge of the system. PathoLogic does not use sequence information but rather annotated gene functions. It can be used to predict pathways for a specific genome by mapping a set of well-characterized pathways from one organism (say, E. coli) to the target genome.
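To make the idea of annotation-based (rather than sequence-based) candidate selection concrete, the following Python sketch matches annotated function names against a small name dictionary. It is only an illustration under stated assumptions: the dictionary contents, the gene identifiers, and the helpers `normalize` and `candidates_by_name` are hypothetical and do not reproduce PathoLogic's actual matching rules.

```python
import re
from typing import Dict, List


def normalize(name: str) -> str:
    """Lower-case a function name and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


# Hypothetical name dictionary: enzyme role -> accepted names and synonyms.
NAME_DICTIONARY: Dict[str, List[str]] = {
    "histidinol dehydrogenase": ["histidinol dehydrogenase", "HDH"],
    "imidazoleglycerol-phosphate dehydratase": [
        "imidazoleglycerol-phosphate dehydratase", "IGPD"],
}

# Hypothetical annotation of the target genome: gene -> annotated function.
TARGET_ANNOTATION: Dict[str, str] = {
    "tg_0101": "Histidinol  dehydrogenase",
    "tg_0102": "hypothetical protein",
    "tg_0103": "Imidazoleglycerol phosphate dehydratase",
}


def candidates_by_name(dictionary: Dict[str, List[str]],
                       annotation: Dict[str, str]) -> Dict[str, List[str]]:
    """Collect, for every enzyme role, the genes whose annotation matches a known name."""
    lookup = {normalize(s): role for role, syns in dictionary.items() for s in syns}
    hits: Dict[str, List[str]] = {role: [] for role in dictionary}
    for gene, function in sorted(annotation.items()):
        role = lookup.get(normalize(function))
        if role is not None:
            hits[role].append(gene)
    return hits


hits = candidates_by_name(NAME_DICTIONARY, TARGET_ANNOTATION)
```

Normalizing both sides before the lookup lets spelling variants such as a missing hyphen or doubled space still match; genes annotated only as "hypothetical protein" yield no candidate, which is one way holes arise in a mapped pathway.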

5.3. Pathway Mapping by PMAP Based on Multiple Data Resources

Among the existing pathway mapping programs, PMAP (Mao et al., 2006) is unique in that it explicitly relies on genomic structure information such as operons and regulons, in addition to functional similarity information, when mapping a template pathway to a target genome. The underlying assumption is that genes in the same operon tend to work in the same pathway. This assumption is supported by a recent simulation study (Mao et al., 2006), which suggests that the probability for two genes to be in the same operon when working in the same pathway is at least two orders of magnitude higher than the probability for two arbitrary genes to be in the same operon. Based on this observation, PMAP was designed to map a pathway to the target genome based on a template pathway from some organism, for instance, E. coli. The following criteria are considered when mapping a template pathway:

1. The overall sequence similarity or functional similarity (when homology cannot be detected) between the genes in the template pathway and their mapped genes in the target genome is as high as possible.
2. The number of operons covered by the mapped genes is as small as possible.
3. The number of regulons covered by the involved operons is as small as possible; it is measured by the shared conserved cis-acting regulatory motifs in the promoter regions of the involved operons.

An integer programming-based algorithm (Mao et al., 2006) was employed to model and solve this optimization problem. Unlike the other pathway mapping programs, PMAP is not restricted to mapping metabolic pathways; it treats all types of pathway mapping problems in the same manner.
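The trade-off between the first two criteria can be sketched in a few lines of Python. This is a toy illustration, not PMAP's integer program: it enumerates all assignments exhaustively, ignores the regulon criterion, and uses an assumed linear objective (total similarity minus a penalty per operon). All gene names, scores and the weight `OPERON_WEIGHT` are hypothetical.

```python
from itertools import product
from typing import Dict, Optional

# Hypothetical candidates: template gene -> {target gene: similarity in [0, 1]}.
CANDIDATES: Dict[str, Dict[str, float]] = {
    "tA": {"g1": 0.9, "g7": 0.8},
    "tB": {"g2": 0.7, "g8": 0.75},
    "tC": {"g3": 0.6},
}
# Hypothetical operon membership of the target genes.
OPERON_OF = {"g1": "op1", "g2": "op1", "g3": "op1", "g7": "op2", "g8": "op3"}
OPERON_WEIGHT = 0.2  # assumed penalty for every distinct operon used


def score(assignment: Dict[str, str]) -> float:
    """Total similarity minus a penalty proportional to the number of operons."""
    similarity = sum(CANDIDATES[t][g] for t, g in assignment.items())
    operons = {OPERON_OF[g] for g in assignment.values()}
    return similarity - OPERON_WEIGHT * len(operons)


def best_mapping() -> Optional[Dict[str, str]]:
    """Exhaustively search all assignments (feasible only for tiny examples)."""
    templates = sorted(CANDIDATES)
    best = None
    for genes in product(*(sorted(CANDIDATES[t]) for t in templates)):
        if len(set(genes)) < len(genes):
            continue  # a target gene may fill at most one role
        assignment = dict(zip(templates, genes))
        if best is None or score(assignment) > score(best):
            best = assignment
    return best


best = best_mapping()
```

With these numbers, g8 is the most similar candidate for tB, yet the selected mapping uses g2 because g1, g2 and g3 share one operon. Exhaustive enumeration grows exponentially with pathway size, which is one reason an integer programming formulation is attractive for realistic inputs.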

5.4. Pathway Construction Using SEED

SEED (Overbeek et al., 2005) is a computational environment in which a group of domain experts manually reconstructs pathway models (called sub-systems). The developers of SEED consider it better to have domain experts who understand a specific sub-system annotate that sub-system than to have the generalists who sequence a genome also annotate it. SEED represents the effort of both developing the computational environment for annotation and bringing in domain experts to annotate with it. The computational environment provides both a set of comparative genomics tools for easy calculation and analysis and a database for comprehensive management of the annotations. As of May 2007, 659 pathways (sub-systems) have been constructed through the SEED platform, covering 504 organisms (http://theseed.uchicago.edu/FIG/index.cgi). This database of reconstructed pathways has expanded the existing pool of (experimentally validated) pathway models, which have been mostly compiled from the published literature. SEED represents a new community-wide effort in generating new knowledge about pathways, extending the pool of pathway models that tools like PMAP can map to various target genomes.

It is generally difficult at this point to assess the accuracy of any pathway mapping program rigorously, since the overall available information about “pathways” is incomplete and sporadic. Hence we have not seen any comprehensive study comparing the prediction accuracy of pathway prediction programs. However, there is some limited information that can be used to assess pathway mapping performance. KAAS and PMAP are fully automatic methods, while PathoLogic needs gene annotation information for the target genome, and SEED needs domain experts to do annotation manually. When mapping three well-studied pathways from B. subtilis to E. coli, PMAP does better than traditional methods such as reciprocal best BLAST hit and COG (Mao et al., 2006). KAAS claims that its prediction accuracy varies from 85.5% to 98.0% for different organisms when compared with the KEGG GENES database (Moriya et al., 2007). Of 98 pathways mapped by PathoLogic from MetaCyc to H. pylori, 40 are claimed to be accurate based on manual examination (Paley and Karp, 2002). SEED also claims that it has accurately predicted several sub-systems, such as the Histidine Degradation Subsystem.

6. Pathway Inference through Information Fusion

We have shown how microarray gene expression data, in conjunction with functional prediction of genes, can be used to identify an initial list of genes that are possibly involved in a target pathway. This list can be enhanced and refined by identifying genes whose protein products may have physical interactions or functional associations with the protein products of genes already in the candidate gene pool, through prediction of operons, regulons, protein-protein and protein-DNA interactions, and functional associations. Typically there is still a large gap between this list of candidate genes along with their predicted interactions (in general not complete) and the final pathway model, due to the complexity of biological pathways and the limitations of individual measurement techniques such as microarrays. This gap could be reduced by mapping some of the well-characterized pathways from other organisms, or the generic KEGG pathways, to the list of candidate genes, thereby adding candidate genes as well as their connectivity information. As of October 2007, the KEGG database contains 351 reference pathways, suggesting that a substantial fraction of the candidate genes in a genome can be assigned to these reference pathways using pathway mapping tools such as PMAP. After mapping, a pair of genes mapped to the same pathway can be considered connected, providing a functional connection map. In this section, we discuss additional methods and algorithms used to integrate multiple sources of information in order to identify missing components, to add new components, and to combine multiple partial models into a comprehensive model.

6.1. Identification of Missing Reactions and Gene Functional Assignment

One challenge in deriving a “complete” pathway model lies in the reality that none of the methods outlined in this chapter promises to find all component genes of a target pathway. A common problem in pathway prediction is that there are “holes” in the predicted pathways, where some enzymes or other functional roles could not be filled by any genes identifiable using the methods outlined above. Palsson’s group (Reed et al., 2006) recently proposed a systematic approach to predicting missing reactions in a metabolic pathway network by reconciling the predicted models with the available growth-related experimental data. The approach consists of four steps: (a) computational analyses of initial pathway models to identify discrepancies between predicted models and growth phenotypes, (b) identification of enzymatic and transport reactions missing from the current metabolic pathway models, (c) identification of candidates for missing components, and (d) experimental validation. In their study, E. coli growth data were compared to the results of a metabolic pathway reconstruction, using flux balance analysis. The comparison led to the identification of numerous discrepancies. To address these discrepancies, a new algorithm was proposed to calculate the minimum number of reactions, selected from a pre-collected pool of reactions, that need to be added to the model to account for the observed growth. Candidate genes were then identified based on a literature search, sequence homology search, context-based homology search and microarray data analyses (Reed et al., 2006).

Various other attempts have been made to identify “missing” genes using more localized approaches. For example, PathoLogic (Green and Karp, 2004) applies the “guilt by association” rule to identify candidate genes for a “hole” in a pathway model, using information on predicted operons and conserved genomic neighborhoods.
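The "minimum number of added reactions" idea can be illustrated with a deliberately simplified Python sketch. It is not the algorithm of Reed et al.: flux balance analysis is replaced here by a plain reachability test (a metabolite counts as producible once all substrates of some reaction are available), and the search is exhaustive. The reaction names, the metabolite names and both functions are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Optional, Set, Tuple

Reaction = Tuple[Set[str], Set[str]]  # (substrates, products)

# Hypothetical current model with a gap between metabolite "a" and "b".
MODEL: Dict[str, Reaction] = {
    "r1": ({"glc"}, {"a"}),
    "r2": ({"b"}, {"biomass"}),
}
# Hypothetical pre-collected pool of candidate reactions.
CANDIDATE_POOL: Dict[str, Reaction] = {
    "c1": ({"a"}, {"b"}),
    "c2": ({"a"}, {"x"}),
    "c3": ({"x"}, {"y"}),
    "c4": ({"y"}, {"b"}),
}


def producible(reactions: Iterable[Reaction], nutrients: Set[str]) -> Set[str]:
    """Metabolites reachable from the nutrients by repeatedly firing reactions."""
    reactions = list(reactions)
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available


def minimal_additions(model: Dict[str, Reaction], pool: Dict[str, Reaction],
                      nutrients: Set[str], target: str) -> Optional[List[str]]:
    """Smallest set of pool reactions that makes the target producible, if any."""
    for size in range(len(pool) + 1):
        for names in combinations(sorted(pool), size):
            reactions = list(model.values()) + [pool[n] for n in names]
            if target in producible(reactions, nutrients):
                return list(names)
    return None


added = minimal_additions(MODEL, CANDIDATE_POOL, {"glc"}, "biomass")
```

In this example the single reaction c1 closes the gap, so the longer detour through c2, c3 and c4 is never selected. Trying subsets in order of increasing size guarantees minimality but scales poorly, so a real pool of reactions would call for an optimization formulation instead.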
In another approach, Yamanishi et al. (2005) suggested that metabolic networks can be inferred through integration of multiple genomic data and chemical information. In their approach, chemical compatibility is used as a constraint in refining predicted metabolic pathways. The EC (Enzyme Commission) numbers were used to compute the chemical compatibility between two enzymes. A global enzyme network consisting of 1,120 enzymes in yeast was obtained in this study (Yamanishi et al., 2005). The validity of the approach has been documented through a number of case studies.

6.2. Adding Complementary Pieces to Pathway Models

Sometimes additional sub-systems can be added to a predicted pathway model when additional data become available. Numerous attempts have been made towards accomplishing this. Using a Bayesian statistics model, Lee et al. (2004) proposed a scoring scheme for protein-protein interaction prediction that unifies data from multiple sources, including microarray data, protein interaction data, gene-fusion analysis, phylogenetic profile analysis and co-citation in the literature. Each of these data sources is benchmarked individually and integrated into one statistical model, through parameter optimization on a large training dataset, to give one numerical score to each predicted interaction relationship. All scores are on the same scale and hence can be directly compared with each other. As a result, this approach discovered new interactions involving genes in chromatin modification and ribosome biogenesis (Lee et al., 2004).

Another method, based on a stochastic algorithm, was proposed by Tu et al. (2006) for integrating genotype information, gene expression, protein-protein interaction, protein phosphorylation, and protein-DNA interaction information to predict the causality relationships among genes. The focus is on finding a functional pathway that links a transcription factor, or an upstream gene that regulates the activity of a transcription factor, and downstream genes regulated by this causal gene. The basic idea is that given a gene network, a “walk” starts from this causal gene and follows the edges of the network. The decision of which edge to take is based on the gene expression profiles. Genes that are visited frequently are more likely to be causal genes, and the most traveled paths between the two are regarded as the underlying regulatory pathways.
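The expression-guided walk can be sketched as follows. This is a minimal illustration and not the published algorithm of Tu et al.: the transition rule (probability proportional to the absolute Pearson correlation between a neighbour's expression profile and that of the starting gene), the toy network, the expression values and the function names are all assumptions made for the example.

```python
import random
from collections import Counter
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


# Hypothetical directed network (node -> neighbours) and expression profiles.
NETWORK: Dict[str, List[str]] = {
    "target": ["tf1", "tf2"], "tf1": ["up1"], "tf2": ["up2"], "up1": [], "up2": [],
}
EXPRESSION: Dict[str, List[float]] = {
    "target": [1, 2, 3, 4, 5],
    "tf1": [2, 4, 6, 8, 10],   # perfectly correlated with the start gene
    "tf2": [3, 1, 4, 1, 5],    # weakly correlated with the start gene
    "up1": [1, 1, 2, 3, 5],
    "up2": [5, 3, 4, 2, 1],
}


def walk_counts(network, expression, start, n_walks=2000, max_steps=2, seed=0):
    """Count how often each gene is visited over many expression-weighted walks."""
    rng = random.Random(seed)
    visits: Counter = Counter()
    for _ in range(n_walks):
        node = start
        for _ in range(max_steps):
            neighbours = network.get(node, [])
            if not neighbours:
                break
            # Small constant keeps every edge selectable when a correlation is zero.
            weights = [abs(pearson(expression[n], expression[start])) + 1e-9
                       for n in neighbours]
            node = rng.choices(neighbours, weights=weights)[0]
            visits[node] += 1
    return visits


visits = walk_counts(NETWORK, EXPRESSION, "target")
```

Because tf1 tracks the starting gene's expression much more closely than tf2, walks pass through tf1 (and hence up1) more often, which is the sense in which frequently visited genes and heavily traveled paths are singled out.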

6.3. Integration of Multiple Partial Pathway Models

When different data sources are used, different portions of a pathway model can be inferred, each with its component genes and their interaction relationships. One approach to integrating multiple partial pathway models was recently developed by Su et al. (2003, 2006). In their study to build response networks to different nutrients, e.g., phosphorus in cyanobacteria, the authors first mapped the corresponding known pathways from E. coli, B. subtilis and S. typhimurium to their target cyanobacterial genome to build the initial target pathway models. By mapping the (partial) phosphorus assimilation pathways from each of the three organisms, they have

three pathway models (Su et al., 2003), which include consistent, complementary as well as conflicting parts. Consistent and complementary predictions are then simply merged to generate a larger model, while conflicting predictions are resolved through application of operon and regulon information in the target genome as well as through application of the majority rule (Su et al., 2003, 2006). Using this simple qualitative procedure, they have predicted a number of pathway models, including phosphorus assimilation pathways, nitrogen assimilation pathways, and carbon assimilation and fixation pathways. Clearly such a qualitative approach could be made more rigorous and more general by using the quantitative methods to be presented in Sec. 7. Various other attempts at integrating predicted partial pathways from different data sources have been carried out, including combining metabolic networks with transcriptional data to infer the transcriptional regulation machinery of the metabolic networks (Patil and Nielsen, 2005) and predicting transcriptional networks based on gene expression and protein-DNA interaction (Lee et al., 2002).

6.4. Examples of Network Topology Prediction

Using the techniques and data resources discussed above, numerous (novel) metabolic and regulatory networks have been predicted. Though these pathway/network models are far from perfect, they are expected to have some predictive power in providing experimentalists with testable hypotheses that can then be validated experimentally and used for guiding further experimental designs. In addition, further refinements that use more general information to help fill gaps in the models can be carried out with mathematical optimization techniques (Sec. 7). For example, if the model describes a metabolic network, it should be mass-balanced. If a model describes a transcriptional network, it should be consistent with available microarray gene expression data.

6.4.1. Prediction of the Nitrogen Assimilation in Synechococcus sp. WH8102

Su et al. developed a computational protocol for inference of metabolic and regulatory networks in bacteria (Dam et al., 2004; Su et al., 2003, 2005, 2006). Their protocol consists of four main steps: (a) construction of a template pathway based on searches of the literature and genomic databases in organisms with substantial experimental data as well as the current understanding of the target pathways; (b) mapping the template pathways to the target genome using PMAP; (c) refinement of the mapped pathway models through model integration and through recruiting additional genes into the integrated model based on protein-protein interaction prediction, co-expressed gene information, operon and regulon information, and predicted functional modules; and (d) experimental validation. Using this protocol, they constructed a number of pathway models in Synechococcus sp. WH8102, including nitrogen assimilation. Their predicted pathway for nitrogen assimilation revealed some interesting biology of

cyanobacteria, for example that nitrogen assimilation affects the expression of many genes involved in photosynthesis, suggesting a tight coordination between the nitrogen assimilation and photosynthesis processes (Su et al., 2006).

6.4.2. Prediction of the E. coli Pathways

Many large datasets have been created for E. coli K12, including genomic, transcriptomic, proteomic and protein-protein interaction data. Using these data, Covert et al. (2001) developed an algorithm for constructing genome-scale regulatory and metabolic networks in E. coli. Their prediction result is the first genome-scale model of an integrated transcriptional regulatory and metabolic network. Among the 1,010 E. coli genes covered in the model, 906 genes are involved in metabolism and 104 are regulatory genes that regulate 479 genes in the reconstructed metabolic network. The model could possibly be used to (a) predict the outcomes of growth phenotype and gene expression experiments, and (b) identify knowledge gaps and add previously unknown components and interactions to the regulatory and metabolic networks (Covert et al., 2004). The success of this approach illustrates the power of computational prediction and modeling in the systematic integration of multiple types of experimental data to gain a systems-level understanding of an organism.

7. Estimation of Parameters for Metabolic Models

We have, in the previous sections, demonstrated how genomic and proteomic data can be mined to establish component genes and patterns of connectivity in metabolic pathway networks of a prokaryotic organism of interest. This section complements the earlier sections by discussing two closely related topics, namely (a) the inference of metabolic networks and their regulation from metabolic data, and (b) the determination of numerical parameter values that quantify the connections within the network model. These connections may be in the form of material fluxes or regulatory signals.
Both topics are difficult, requiring mathematical innovation and computational efficiency. The technical challenges become much more approachable if one uses “canonical” models, whose structure is fixed and whose individuality comes from their parameter values. Canonical models are described in greater detail in Chapter 13, but it is useful to mention some of their salient features, because this will streamline the following discussion. The simplest canonical model in the context of metabolic pathways is a stoichiometric model, which describes the collection of all fluxes in a metabolic system in such a way that they are balanced at each metabolite pool. Thus, all material flowing into a pool exactly equals all material flowing out of this pool. Under the assumption that the metabolic system operates at a steady state and that the flux rates are constant, the mathematical representation consists of a linear system that is relatively easily analyzed with methods of linear algebra and operations research. More complex canonical models, which account for regulation in

addition to connectivity, are the foundation of Biochemical Systems Theory (BST) (Savageau, 1969a, 1969b). BST models are nonlinear, which creates much greater mathematical challenges than in the case of stoichiometric steady-state systems, but they capture the essence of true metabolic pathways much more accurately. One key feature of all canonical models is that their parameters can be mapped almost uniquely onto the structure of a pathway model under investigation. Therefore, if one is able to determine the parameter values of a canonical model, one immediately gains insight into the structure (and regulation) of the pathway. This conclusion shifts the heavy burden of identifying the structure of a pathway onto the estimation of parameter values and, indeed, this estimation is the bottleneck of very many model analyses.

Figure 5 illustrates the generic features of a metabolic pathway and its control. Reading from bottom to top, the metabolites are coded as variables Mi, and the arrows connecting them represent reactions that are typically catalyzed by enzymes Ej (see Chapter 13 for details). As proteins, the enzymes are subject to the expression of appropriate genes Gk, along with their transcription, translation and posttranslational modifications. Enzymes may furthermore be degraded through the action of proteases or diluted by growth, or their activities may be adjusted through posttranslational modifications. As is also indicated in Fig. 5, metabolites can regulate each other directly at the metabolic level (Rm). This regulation, for instance in the form of product inhibition, is much faster than regulatory mechanisms based on gene expression. Not shown in the figure is that gene products as well as metabolites may affect the expression of the genes, thereby creating a feedback regulatory mechanism that is much slower than purely biochemical regulation.

It is plausible to infer from inspection of Fig. 5 that there is a certain degree of redundancy in these types of systems. For instance, if we know the values of all metabolites at all time points of an experiment, we might be able to deduce the reaction rates. Similarly, if we know all reaction rates at all times, we can infer the metabolite concentrations. In fact, this is the manner in which most standard model analyses work: The parameters of the reaction steps numerically determine the right-hand sides of a system of differential equations, and numerically integrating these equations produces the dynamics of all metabolites. If we take the possibilities

Fig. 5. Schematic representation of a metabolic pathway and its control. Metabolites (Mi) are the nodes of the pathway. Reaction steps (solid arrows) describe conversions between them, and regulatory signals (Rm) in turn affect the reaction rates. Enzymes (Ej) that catalyze the reactions are subject to gene (Gk) expression and degradation processes.

of inference one step further, we can imagine that accurate measurements of gene expression may give us quantitative insights into the availability of enzymes, which in turn lets us deduce metabolite concentrations. One must be aware, though, that this latter chain of causes and effects is somewhat uncertain, because mRNAs and enzymes have differing half-lives and because it rests on the assumption that posttranslational modifications can be ignored. Nonetheless, gene expression is often used as a coarse substitute for measuring metabolic activities, which is generally considered reasonable for prokaryotic systems (see Sec. 4.1.3).

The redundancy in the metabolic scheme of Fig. 5 has an important consequence. Namely, if we have data available on one level (for instance at the genome level, obtained with methods described before), we might be able to use a mathematical model to infer what is happening on the other levels, such as the metabolic pathway level. Depending on what is known and what is mathematically deduced, this process of inference can take on different forms and becomes a matter of computer simulation, bottom-up or top-down parameter estimation, or reverse network engineering. The case of computer simulation was already mentioned in the previous paragraph and is discussed in Chapter 13 in greater detail.
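The computer-simulation case, in which known parameters fix the right-hand sides of the differential equations and numerical integration yields the metabolite dynamics, can be sketched as follows. The example is only illustrative: it assumes a closed linear chain M1 -> M2 -> M3 with mass-action rate laws, hypothetical rate constants `k1` and `k2`, and a simple fixed-step Euler scheme rather than a production-quality integrator.

```python
from typing import List


def rates(m: List[float], k1: float = 1.0, k2: float = 0.5) -> List[float]:
    """Right-hand sides dM/dt for the assumed chain M1 -> M2 -> M3."""
    m1, m2, _m3 = m
    v1 = k1 * m1  # flux of the step M1 -> M2
    v2 = k2 * m2  # flux of the step M2 -> M3
    return [-v1, v1 - v2, v2]


def simulate(m0: List[float], t_end: float = 10.0, dt: float = 0.01) -> List[List[float]]:
    """Integrate the system with explicit Euler steps and return the trajectory."""
    m = list(m0)
    trajectory = [list(m)]
    for _ in range(int(round(t_end / dt))):
        dm = rates(m)
        m = [x + dt * d for x, d in zip(m, dm)]
        trajectory.append(list(m))
    return trajectory


trajectory = simulate([1.0, 0.0, 0.0])
final = trajectory[-1]
```

Because the three rates sum to zero, total material is conserved along the trajectory, which gives a quick sanity check on the model equations; by the end of the simulated interval almost all material has accumulated in M3.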

7.1. Forward or Bottom-Up Modeling

Until recently, most metabolic pathway models were developed from “local” kinetic information, which was combined with experimental measurements of biochemical or physiological responses. Specifically, biologists around the world worked on characterizing a particular enzyme (Ej in Fig. 5) in the traditional, reductionist manner. They purified the enzyme, studied its characteristics, determined optimal temperature and pH ranges, and quantified cofactors, modulators, and secondary substrates. Isolated from these laboratory experimenters, modelers later converted this information into the mathematical description of a “rate law” describing the activity of, say, E1, as well as possible regulatory effectors, such as R1. This conversion constituted the first and most important step of traditional parameter estimation in metabolic pathway analyses. Finally, once enough information had been collected on many or all Ej and Rm, the modeler attempted to merge all this information into an integrative mathematical model. “Solving” this model (i.e., integrating the differential equations) with a set of initial conditions produced outcomes in the form of time courses of the metabolites Mi.

While the approach is theoretically straightforward, it has several disadvantages. The main issue is that, more often than not, the “integrated result” is not consistent with biological observations. One reason is that quite a bit of kinetic information is needed and that this information is often obtained from different organisms or species and collected under different experimental conditions, so that it is unclear to what degree the measured features of the enzymes are compatible. This process of construction and refinement is also very labor intensive and requires a combination of biological and computational expertise that is still rare and will require changes in higher education in biology (Goel et al., 2006; Teusink et al., 2000; Voit, 2004).

7.2. Inverse or Top-Down Modeling

High-throughput omic techniques of biology have begun to offer additional, distinctly different options for modeling metabolic systems. Of particular interest are new in vivo data that are generated with techniques such as the ones outlined in Sec. 2, as well as flow cytometry and nuclear magnetic resonance (NMR). In contrast to the “local” data of traditional experimentation, the “global” data are obtained under the same experimental condition, within the same species, and sometimes in vivo. By their very nature they account for all processes within the organism that could possibly have an effect on the variables of a system under investigation and thus describe the system in its integrity. The global data contain enormous information about the functional connectivity and regulation of the biological networks they describe. However, this information is mostly implicit and requires strategies of appropriate analysis and interpretation that must combine biological with mathematical and computational techniques. Because the ultimate reward is high, many groups around the world have begun to develop mathematical methods of reverse network engineering that attempt to identify the structural and regulatory features of pathway systems from these global data that describe the entire state of a system (for a recent review see Srividhya et al. (2007)). The top-down approaches will not replace the bottom-up approach (Sec. 7.1); instead, both should be executed, because their results complement each other in distinct ways (e.g., Bruggeman and Westerhoff, 2007; Voit, 2004). As stated here, the inverse task refers to the metabolic level. However, it is clear that this task can (and should) be combined with inverse tasks at the proteomic and the genome levels, as described in earlier sections.

7.3. Reverse Network Engineering

The potential advantages of inverse approaches are significant, but there are also numerous significant challenges, some of which are readily anticipated, while others are surprising and puzzling (Voit et al., 2005, 2006). The technical issues of inverse modeling and reverse network engineering depend very much on the model structure that is supposed to capture the system of interest. Three classes of models and inverse approaches have been developed.

The first is of a statistical nature and uses Bayesian ideas for assessing the probability that the dynamics of one metabolite directly depends on the dynamics of another metabolite. Specifically, a Bayesian network (Pearl, 1988) is a graph model that represents a set of nodes (variables) and their conditional probabilistic dependencies upon each other. Its analysis permits the explicit detection of causal associations among nodes in the system, as long as there are no structural or functional cycles, for instance in the form of material recycling or feedback signals. While the latter exclusions are clearly restrictive, Sachs and coworkers (Sachs et al., 2005) successfully used Bayesian network methods to investigate the structure of a protein-signaling network from single-cell flow-cytometry data. The computational methods confirmed and elucidated most of the previously reported causalities and revealed new relationships between the involved signaling proteins. Bayesian network methods have not been applied extensively to metabolic pathway systems, but more often to genomic networks, where the task was to reconstruct networks of expression traits, or networks comprising both expression and disease traits (Zhu et al., 2007).

The second class of models amenable to reverse network engineering consists of stoichiometric systems. These are based upon the law of conservation of mass and focus, in particular, on the distribution of fluxes and transport steps that connect the “pools” in the network. As described in more detail in Chapter 13, stoichiometric systems are typically studied at the network’s nominal steady state, and it is assumed that all flux rates can be considered constant. Because of these restrictions, the differential equations describing the pathway reduce to a system of linear algebraic equations. This reduction is of great advantage, because well-established matrix techniques can be applied to their analysis, including the estimation of internal flux rates from some measured influxes and effluxes. Flux balance analysis (FBA) is an extension of the stoichiometric approach that imposes mathematical constraints on the possible distribution of fluxes. These reflect that flux rates and ratios at branch points are in reality bounded in capacity by physical, biochemical, and physiological limitations (see, e.g., Beard et al., 2002).
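The estimation of internal fluxes from measured influxes and effluxes at steady state amounts to solving the linear balance equations S v = 0 for the unmeasured entries of v. The Python sketch below works this through for a hypothetical three-pool branch (pools A, B, C; flux names, stoichiometric matrix and measured values are all invented), under the assumption that the number of unknown fluxes equals the number of pools so that the system is square and uniquely solvable. No FBA-style optimization or capacity constraint is included.

```python
from fractions import Fraction
from typing import Dict, List

FLUXES = ["v_in", "v1", "v2", "v_b", "v_c"]
# Rows are the pools A, B, C; columns follow FLUXES (+1 produces, -1 consumes).
S = [
    [1, -1, -1, 0, 0],   # A: fed by v_in, drained by v1 and v2
    [0, 1, 0, -1, 0],    # B: fed by v1, drained by v_b
    [0, 0, 1, 0, -1],    # C: fed by v2, drained by v_c
]
MEASURED: Dict[str, Fraction] = {"v_in": Fraction(10), "v_b": Fraction(4)}


def solve_square(a: List[List[Fraction]], b: List[Fraction]) -> List[Fraction]:
    """Gauss-Jordan elimination for a square, non-singular system a x = b."""
    n = len(a)
    aug = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def internal_fluxes(s, names, measured) -> Dict[str, Fraction]:
    """Move measured fluxes to the right-hand side and solve for the rest."""
    unknown = [n for n in names if n not in measured]
    a = [[row[names.index(n)] for n in unknown] for row in s]
    b = [-sum(row[names.index(n)] * v for n, v in measured.items()) for row in s]
    return dict(zip(unknown, solve_square(a, b)))


estimated = internal_fluxes(S, FLUXES, MEASURED)
```

For the assumed measurements (influx 10, efflux from B equal to 4), the balance equations force 4 units through the A-to-B branch and 6 units through the A-to-C branch. Exact rational arithmetic is used only to keep the toy example free of rounding issues; underdetermined networks, which are the usual case, need additional measurements or constraints.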
Because FBA is constructed to retain the linearity of stoichiometric descriptions, it provides an effective tool for analyzing metabolic networks in a quantitative manner (for reviews see Bonarius et al. (1997); Edwards et al. (1999); and Palsson (2006)). A main goal of FBA is the manipulation and optimization of flux distributions, for instance, with the goal of maximizing a microbe’s yield with respect to a biological metabolite of interest, while minimizing nutrient utilization. FBA has been successful in assessing the theoretical capabilities and operative modes of metabolic systems in the absence of kinetic information (Bono et al., 1998; Palsson, 2006; Selkov et al., 1997). The main drawback of both the stoichiometric and FBA approaches is that they do not and cannot take nonlinearities into account, which would destroy the linearity of the describing equations. This insistence on linearity is at odds with natural systems, which are intrinsically nonlinear, thereby rendering linear approaches inadequate in many cases.

A distinct alternative to the linear approaches is provided by nonlinear modeling frameworks, which form the third class. As discussed in detail in Chapter 13, it is often beneficial to search for such models within the restricted domain of “canonical” nonlinear models, which are designed within a rigorously defined mathematical structure. The most prevalent of these canonical nonlinear models in the context of metabolic modeling are S-systems and Generalized Mass Action systems, which are the core of Biochemical Systems Theory (BST). These models are constructed by approximating fluxes with products of power-law functions, which at once are supported by rigorous approximation theory, have very convenient mathematical features, and allow for a surprising flexibility in what they can represent. One aspect of these models that is crucial for structure identification, that is, for the characterization of the connectivity and regulation of metabolic pathway systems (or other biological systems), is the fact that structural and functional features are mapped one-to-one onto the parameters of these systems. Thus, it is immediately clear what the meaning of any given parameter in a BST model is. An important consequence is that structure identification becomes a matter of parameter estimation. The latter task is still very complex, but much simpler than a general structure identification task. Specifically, it is theoretically (and to some degree practically) possible to design a symbolic BST model that contains all realistically possible structural and functional connections between system components. If subsequent parameter estimation from actual data reveals that a particular parameter is zero, it is immediately known that the corresponding feature does not exist in the particular biological system under investigation.

The inference of the structure and regulation of biological systems from time series data, using BST models, faces various challenges that need to be overcome, and which are currently being addressed by a number of research groups around the world. On the biological side, data typically contain noise and are seldom complete. In many cases, potentially important system components are not measured, or the uncertainty in data and experimental conditions is high. On the computational side, the estimation process itself is very complicated since the describing models consist of nonlinear differential equations.
As a consequence, the optimization of parameter values is much more difficult than in linear systems, and numerical algorithms often run into problems of lacking convergence or convergence to local minima, especially when the systems grow larger. Other computational challenges include the distinction between direct and indirect effects, the characterization of intermediate steps and time delays, and the internal heterogeneity and stochasticity of biological systems, which is seldom addressed in nonlinear models. In spite of these challenges, inverse methods for biological systems are badly needed, because the data describing these systems at a high level are in some sense the most comprehensive reflections of what cells and organisms really do. Computational solutions to biological reverse engineering problems usually require a combination of techniques, which typically include: methods for reducing the complexity of the inference task; methods for solving the differential equations; and efficient algorithmic means for determining optimal estimates.
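As a minimal sketch of how structure maps onto parameters, the following Python fragment evaluates the right-hand side of an S-system, dX_i/dt = α_i ∏ X_j^g_ij − β_i ∏ X_j^h_ij, for a hypothetical two-variable system. The function name, the parameter values and the example system are illustrative assumptions and are not taken from the chapter; the only point is that a kinetic order of zero removes the corresponding influence, whereas a negative kinetic order represents inhibition.

```python
def s_system_rhs(x, alpha, g, beta, h):
    """Evaluate dX_i/dt = alpha_i * prod_j X_j**g[i][j] - beta_i * prod_j X_j**h[i][j]."""
    rates = []
    for i in range(len(x)):
        production, degradation = alpha[i], beta[i]
        for j, xj in enumerate(x):
            production *= xj ** g[i][j]    # g[i][j] == 0: X_j has no effect on production of X_i
            degradation *= xj ** h[i][j]   # h[i][j] == 0: X_j has no effect on degradation of X_i
        rates.append(production - degradation)
    return rates


# Hypothetical parameter values, chosen only for illustration.
alpha = [2.0, 1.0]
beta = [1.0, 1.0]
h = [[0.5, 0.0], [0.0, 0.5]]                # each variable drives its own degradation
g_no_feedback = [[0.0, 0.0], [0.5, 0.0]]    # X1 feeds X2; X2 does not act on X1
g_feedback = [[0.0, -0.5], [0.5, 0.0]]      # negative kinetic order: X2 inhibits production of X1

# Net rate of X1 for two different levels of X2.
rate_low = s_system_rhs([1.0, 1.0], alpha, g_no_feedback, beta, h)[0]
rate_high = s_system_rhs([1.0, 4.0], alpha, g_no_feedback, beta, h)[0]
rate_inhibited = s_system_rhs([1.0, 4.0], alpha, g_feedback, beta, h)[0]
```

With the zero kinetic order, the net rate of X1 is the same for both levels of X2; with the negative kinetic order, raising X2 lowers it. An estimated value close to zero would therefore be read as the absence of that regulatory connection.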

7.4. Reducing the Complexity of the Inference Task

All methods of parameter estimation and structure identification eventually run into problems caused by “combinatorial explosion,” which simply means that the computational techniques will eventually be overwhelmed by the rapidly increasing

number of possible interactions between variables in large systems. Fortunately, there is a counteracting and very beneficial feature of biology: namely a real biological system is very rarely fully connected. For instance, most metabolites can only be converted into a rather limited number of other metabolites (Jeong et al., 2000). To take advantage of this fact of nature, it must therefore be our goal to precede any estimation attempt with a concerted effort to limit objectively the number of candidate (structural and functional) connections within a system, thereby a priori reducing the parameter space that must be searched. For instance, the lack of any direct interaction between two sets of variables allows the immediate elimination of all corresponding kinetic orders in a BST model (see Chapter 13) before the estimation begins. In this context, Veflingstad et al. (2004) suggested that information on the connectivity of a pathway may be gleaned through customized methods of piecewise linearization (see also Torralba et al. (2003)). Marino and Voit (2006) proposed an estimation method based on reconstructing equations from the simplest possible model to increasingly more involved equations. Specifically for linear parts of pathways, Lall and Voit (2005) applied a technique of “term peeling” to models in BST to convert the nonlinear parameter estimation task into a series of linear regression tasks.
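The effect of such a priori pruning on the size of the search space can be illustrated with a small count. The sketch below assumes an S-system with one production and one degradation term per variable, and a hypothetical five-metabolite linear chain with end-product feedback; the masks and the function name are illustrative assumptions only.

```python
def count_kinetic_orders(g_mask, h_mask):
    """Count kinetic orders that still have to be estimated (True = candidate connection)."""
    return sum(flag for mask in (g_mask, h_mask) for row in mask for flag in row)


n = 5  # hypothetical number of metabolites

# Without prior knowledge, every variable may affect every production and degradation term.
full = [[True] * n for _ in range(n)]
unrestricted = count_kinetic_orders(full, full)

# Assumed prior knowledge: a linear chain X1 -> ... -> X5 with feedback from X5 on the first step.
g_mask = [[False] * n for _ in range(n)]
h_mask = [[False] * n for _ in range(n)]
for i in range(n):
    h_mask[i][i] = True          # X_i affects its own degradation
    if i > 0:
        g_mask[i][i - 1] = True  # X_i is produced from X_(i-1)
g_mask[0][n - 1] = True          # the end product may modulate the first step

restricted = count_kinetic_orders(g_mask, h_mask)
print(unrestricted, restricted)  # prints: 50 10
```

In this toy case the number of kinetic orders to be estimated drops from 50 to 10 before any data are consulted; all other kinetic orders are fixed at zero.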

7.5. Solving Differential Equations

To appreciate the need for speeding up the evaluation of differential equations, consider a direct attempt to estimate the parameters of a five-variable system of ordinary differential equations from noise-free time series data (Kikuchi et al., 2003). This group used a cluster of 1,040 CPUs, which ran for about 10 hours for each iteration of the estimation program. Needing seven iterations, the entire estimation time thus was roughly 70,000 PC-hours. Obvious strategies for handling this situation have been the use of massive computing power, the subdivision of the task into smaller subtasks that are run in parallel, and the decomposition of the large network inference problem into subproblems (e.g., Kikuchi et al., 2001; Kimura et al., 2005; Maki et al., 2001, 2002). Analyzing this dire situation, we clocked optimization tasks in detail and determined that parameter searches involving differential equations are very time consuming and easily spend 90% of the used CPU time on integrating the equations, while relatively little time is used to compute gradients toward the optimal estimates (Voit and Almeida, 2004). In fact, if the underlying model is stiff, the share of computation time spent on integration may increase to almost 100%, and even if the model is not stiff, the likelihood is high that some trial solutions during the algorithmic process could make it stiff (Voit and Almeida, 2004). With this insight, we returned to an old idea of estimating slopes from observed time series data and substituted them for the derivatives in the differential equations (Voit, 2000; Voit and Savageau, 1982a, 1982b). This substitution entirely eliminates the need to integrate differential equations, because the estimation is now executed on systems of algebraic equations. Furthermore, the

equations become uncoupled so that they can be assessed in parallel or one at a time. The slope-estimation-decoupling strategy requires good estimates for the slopes, which are not always easy to obtain. If the data are more or less noise-free, simple linear interpolation, splines (de Boor, 1978; de Boor et al., 1993; Green and Silverman, 1994), B-splines (Seatzu, 2000), or even the simple three-point method (Burden and Faires, 1993) are effective. If the data are noisy, it is useful to smooth them, because the noise tends to be magnified in the slopes. Established smoothing methods again include splines, as well as different types of filters, such as the Whittaker filter (see Eilers (2003) for a review) and artificial neural networks (Almeida, 2002; Almeida and Voit, 2003). An alternative approach avoiding numerical integration is a modified collocation approximation combined with hybrid differential evolution (HDE), which Tsai and Wang (2005) proposed for determining the global solution of an estimation task. Again, applying this type of “uncoupling” strategy in combination with other estimating methods reduced the computation time dramatically.
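The slope-substitution idea can be sketched for a single equation. The example below is a deliberately simple assumption and not one of the systems analyzed in the cited studies: a one-variable degradation process dX/dt = −β X^h, with synthetic noise-free data. Slopes are estimated with the three-point (centered-difference) method, and because the power-law term is linear in logarithmic space, the estimation reduces to an ordinary linear regression with no numerical integration inside the fitting step. All names and values are hypothetical.

```python
import math


def three_point_slopes(t, x):
    """Centered-difference slope estimates at the interior time points."""
    return [(x[k + 1] - x[k - 1]) / (t[k + 1] - t[k - 1]) for k in range(1, len(x) - 1)]


def fit_line(u, v):
    """Ordinary least squares for v = a + b*u; returns (a, b)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    sxy = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    b = sxy / sxx
    return mean_v - b * mean_u, b


# Synthetic, noise-free "observations" of dX/dt = -0.5 * X (assumed test case).
k_true, dt = 0.5, 0.1
t = [i * dt for i in range(41)]
x = [math.exp(-k_true * ti) for ti in t]

# Step 1: replace the derivative by slopes estimated from the data.
slopes = three_point_slopes(t, x)

# Step 2: the differential equation becomes algebraic:
#   log(-slope) = log(beta) + h * log(X)
log_x = [math.log(xi) for xi in x[1:-1]]
log_neg_slope = [math.log(-s) for s in slopes]
log_beta, h_est = fit_line(log_x, log_neg_slope)
beta_est = math.exp(log_beta)
```

For these data the regression should return a kinetic order close to 1 and a rate constant close to 0.5, up to the small bias of the finite-difference slopes. With noisy data the slopes would first have to be smoothed, as discussed above.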

7.6. Efficient Algorithms for Determining Optimal Estimates

As for most optimization tasks, the prominent methods for parameter estimation from time series data are either gradient based or not. Several articles have been published in the recent literature describing these computational methods for the inverse problem using BST models, but no method so far has risen to the top as the clear general winner in terms of efficiency, robustness and reliability. Evolutionary search techniques, such as genetic algorithms, genetic programming or differential evolution as well as many of their variants, have been applied because of their global optimization ability (Cho et al., 2006; Kikuchi et al., 2001, 2003; Kim et al., 2006; Kimura et al., 2003, 2004, 2005; Koza et al., 2001; Maki et al., 2001, 2002; Moles et al., 2003; Noman and Iba, 2005, 2006; Park et al., 1997; Sakamoto and Iba, 2001; Spieth et al., 2005, 2006; Sugimoto et al., 2005; Tsai and Wang, 2005; Voit and Almeida, 2004; Zhang et al., 1996). Other methods have included simulated annealing (Gonzalez et al., 2007), interval analysis (Tucker et al., 2006; Tucker and Moulton, 2006), gradient-based nonlinear optimization methods (Mendes and Kell, 1998), radial basis function networks (RBFN) (Matsubara et al., 2006; Rank, 2003), branch-and-reduce methods (Polisetty et al., 2006), network component analysis (NCA) or general NCA (gNCA) (Liao et al., 2003; Tran et al., 2005), singular value decomposition (SVD) with robust regression (D’Haeseleer et al., 1999; Yeung et al., 2002) and a number of hybrid methods (Tsai and Wang, 2005). Most of these are nonlinear methods that are not straightforward to apply and that run into challenging problems of non-convergence or convergence to local minima as soon as a model becomes moderately large. Some of the methods have been reviewed by Crampin and coworkers (Crampin et al., 2004).

Two new intriguing methods are the determination of Newton flows and alternating regression. In the former case, Tucker and collaborators conjectured that most (or maybe even all) good parameter solutions of a BST model lie on one-dimensional manifolds within the high-dimensional parameter space and proposed methods for computing this manifold (Tucker et al., 2006; Tucker and Moulton, 2006). Optimization along this curve then became comparatively easy. Chou et al. (2006) proposed a method of alternating regression (AR), which is specific to S-systems within BST and utilizes the fact that power-law functions are linear in logarithmic space. The AR method dissects the complex nonlinear estimation task into iterations of simple tasks of linear regression. Some puzzling issues of convergence with this method still need to be resolved, but the authors showed very good performance in a variety of systems.

8. Major Resources of Relevant Tools on the Internet

The following is a list of computational tools relevant to prediction of pathways and networks, all available on the Internet.
InterPreTS (Interaction Prediction through Tertiary Structure): Interaction prediction through 3D structure analysis http://www.russell.embl.de/interprets/

PIP (Potential Interactions of Proteins): Comparative genomics method for protein-protein interaction prediction http://bmm.cancerresearchuk.org/~pip/

DIP (Database of Interacting Proteins): A database that documents experimentally determined protein–protein interactions http://dip.doe-mbi.ucla.edu/

FusionDB: A database for analysis of prokaryotic gene fusion events http://www.igs.cnrs-mrs.fr/FusionDB/

BioGRID: A database of genetic and physical interactions http://biodata.mshri.on.ca/grid/servlet/Index/

IntAct: EMBL-EBI Protein Interaction http://www.ebi.ac.uk/intact/

SPiD (Subtilis Protein Interaction Database): Bacillus subtilis Protein Interaction Database http://genome.jouy.inra.fr/cgi-bin/spid/index.cgi/

DPIDB: DNA-Protein Interaction Database http://www.dpidb.genebee.msu.ru/

TRANSFAC: A commercial database for transcription factors and their genomic binding sites for eukaryotes http://www.biobase-international.com/

MAT (Model-based Analysis of Tiling-array): A model-based ChIP-chip data analysis software package http://chip.dfci.harvard.edu/~wli/MAT/

TiMAT (Tiling Microarray Analysis Tools): An open-source, Java-based set of scripts used for processing ChIP-chip tiling array data at Lawrence Berkeley Lab http://bdtnp.lbl.gov/TiMAT/

TileMap: ChIP-chip data analysis software based on a hierarchical empirical Bayesian model http://biogibbs.stanford.edu/~jihk/TileMap/index.htm/

BDTNP (Berkeley Drosophila Transcription Network Project): ChIP-chip Meta Database https://bdtnp.lbl.gov/Chipper/index.jsp/

Protein-DNA Recognition Database: A system to help researchers understand the mechanism of nucleic acid recognition by proteins http://gibk26.bse.kyutech.ac.jp/jouhou/3dinsight/recognition.html/

KEGG (Kyoto Encyclopedia of Genes and Genomes): A comprehensive pathway database developed at Kyoto University, Japan, that also includes information about genes and orthologs http://www.genome.jp/kegg/

KAAS (KEGG Automatic Annotation Server): Prediction software developed at KEGG, which can be used to infer pathways directly; it also has an online server http://www.genome.jp/kegg/kaas/

EcoCyc, MetaCyc, BioCyc, Pathway Tools, PathoLogic: A collection of databases and software developed by SRI International http://www.biocyc.org/

Reactome: A curated knowledgebase of biological pathways http://www.reactome.org/

PathFinder: Reconstruction and dynamic visualization of biochemical pathways http://bibiserv.techfak.uni-bielefeld.de/pathfinder/

PMAP (Pathway MAPping): A software package that predicts pathways through mapping known pathways from other organisms in conjunction with application of genomic context information http://csbl.bmb.uga.edu/pmap2/

SEED: Gene annotation environment developed by the Fellowship for Interpretation of Genomes http://theseed.uchicago.edu/FIG/index.cgi/

Pathway Miner: A software package that can extract gene association networks from molecular pathways for predicting the biological significance of gene expression microarray data http://www.biorag.org/pathway.html/

Path-A (Pathway Analyst): A pathway prediction software package using machine learning techniques http://path-a.cs.ualberta.ca/

MapMan: A user-driven tool that displays large datasets (e.g. gene expression data from Arabidopsis Affymetrix arrays) onto diagrams of metabolic pathways or other processes http://gabi.rzpd.de/projects/MapMan/

Pathway Studio: A commercial software package that can construct pathways through literature mining http://www.ariadnegenomics.com/products/pathway-studio/

Ingenuity Pathways Analysis (IPA): A commercial software package that can help to search the scientific literature, build dynamic pathway models, and analyze experimental data http://ingenuity.com/

VitaPad (Visualization Tools for the Analysis of Pathway Diagrams): A software package for pathway visualization and analysis developed at Yale http://bioinformatics.med.yale.edu/group/

BioPAX (Biological Pathways Exchange): A collaborative effort to create a data exchange format for biological pathway data; currently BioCyc and KEGG are available in BioPAX format http://www.biopax.org/

9. Concluding Remarks

The availability of a large number of sequenced genomes and different types of omics data, in conjunction with the latest developments in network modeling techniques, has made it possible for the first time to infer biological pathways and networks in a systematic manner and to simulate them accurately. Nonetheless, new, efficient techniques, both computational and experimental, along with new modeling frameworks are needed. We are clearly in an exciting period of searching for new and more effective strategies and techniques for understanding what information about cellular structures and processes can be derived from the explosively increasing omics data, and for converting such understanding into cellular models with predictive power, using advanced computational modeling techniques. To get there, we see a few major tasks ahead.

9.1. Gaps between omics Data and Accurate Pathway Models

Although numerous techniques, such as microarray chips, ChIP-chip, and yeast two-hybrid, have been developed to generate large quantities of omics data, and computational techniques have been developed to mine these data and to model the pathways that are most consistent with these and other data, there are clearly gaps between where we are now and where we want to be in terms of accurate prediction of biological pathways. We expect that reducing the gaps requires improved understanding about how different classes of pathways and networks are encoded in prokaryotic genomes as well as improved abilities and strategies to better utilize the available omics techniques for pathway inference in a more systematic manner.

9.2. Model-Driven Experimental Design

With our increased ability for predicting pathways, we should seriously start thinking about and investigating (pathway) model-driven investigation protocols for the elucidation of complex biological systems. By comparing the predictions of pathway models that have predictive power with the collected experimental data, hypotheses could be developed for refining the models and for designing future experiments. It is foreseeable that an integrated investigation cycle of alternating computation and experiment could be employed in the near future. Such a cycle might look like this: pathway prediction based on public high-throughput omics data and computational modeling → pathway simulation with predictions → hypotheses generation → new experimental design and data collection → hypothesis verification and model refinement. This approach has the potential to substantially improve the efficiency of biological investigation.

9.3. Training a New Generation of Systems Biologists

The next generation of systems biologists should not only know computational mining and modeling techniques for pathway generation and analysis in silico as well as hypothesis generation and experimental design but should also know experimental techniques that are central to molecular biology and enzyme kinetics. New courses along these directions clearly need to be developed and taught at both the graduate and undergraduate levels.

10. Further Reading

The special issue of Nature Reviews Molecular Cell Biology (vol. 7, no. 3, March, 2006) contains a series of review papers on cellular systems, ranging from subcellular systems such as signaling networks, multi-protein complexes and organelles, to cells, tissues, and even entire organisms, describing different approaches that can be used to model them.

E.O. Voit’s book Computational Analysis of Biochemical Systems. A Practical Guide for Biochemists and Molecular Biologists (Cambridge University Press, 2000) provides an easy introduction into the modeling and analysis of pathway systems.

U. Alon’s book An Introduction to Systems Biology: Design Principles of Biological Circuits (Chapman and Hall/CRC, 2006) takes a unique perspective on systems biology and investigates biological networks by unveiling the basic principles that underlie them at different levels.

B.O. Palsson’s book Systems Biology: Properties of Reconstructed Networks (Cambridge University Press, 2006) describes how to model networks, determine their properties, and relate these to phenotypic functions.

Acknowledgments

This work was supported in part by the US Department of Energy’s Genomes to Life Program under project ‘Carbon Sequestration in Synechococcus sp.: From Molecular Machines to Hierarchical Modeling’ (http://www.genomes-to-life.org). The work is also supported, in part, by the National Science Foundation (#NSF/DBI-0354771, #NSF/ITR-IIS-0407204, DBI-0542119, #NSF/MCB-0517135), the Georgia Research Alliance, and a Distinguished Scholar grant from the Georgia Cancer Coalition. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsoring institutions.

CHAPTER 13 MICROBIAL PATHWAY MODELS

SIREN R. VEFLINGSTAD, PHUONGAN DAM, YING XU and EBERHARD O. VOIT

1. Introduction

When we think about microbial biology, genomics immediately comes to mind. Modern techniques allow us to assess the status of all genes in a microbe at once; we have dozens of methods to manipulate gene expression, and just as many methods to insert a gene into a micro-organism or to knock it out. While the importance of studying genes and genomes is beyond any doubt, we should not forget that it is metabolism that yields much of the utility of microbes. Whether it is the fermentation of sugars to alcohol, the production of biofuels, or the clean-up of some undesired substances in the environment, the microbial genes provide a frame of what might be possible, but it is the metabolic reactions that actually do the job. Without metabolism, no microbe would survive, even under the most conducive conditions. Without metabolism there would be no signal transduction converting an external signal into a genomic, and ultimately physiological, response. It is therefore evident that a good understanding of microbial functioning is intimately related to how metabolism works.

We tend to think about metabolism as dissected into pathways. The pathways are usually represented as linear chains of enzyme catalyzed reactions, with interspersed branch points. In other cases, metabolites may be organized in the structure of a cycle where the initial substrate is regenerated in a number of steps, while the cell gains advantage, for instance, in the form of energy production. In reality, life is complicated. Very many metabolites have several possible fates, and often different production routes, which immediately suggests that there must be many more branches than simple pathway representations lead us to believe. For instance, ATP and water are involved in several hundred reactions.
Complicating this “network of pathways” is the fact that even the simplest organisms have a large repertoire of tools allowing them to regulate the flux through one branch or the other at about every branch point. Further complicating this regulatory potential, the control mechanisms easily adapt to changes in the state of the organism or its environment, thereby making the flux patterns both state and time dependent. Thus, the network of pathways can be depicted as a huge graph, for instance as on the Boehringer Map or in KEGG (Kanehisa, 2002; Kanehisa et al., 2004), but if we

want to understand how much material is flowing through this network at a given point in time, we are faced with the very complex task of analyzing a dynamical system, where the formerly static graph becomes a time dependent structure, for instance in the form of a system of nonlinear differential equations. We describe in this chapter some of the concepts that are presently applied to metabolic pathway systems, discuss some of the challenges and partial solutions, and keep examples as simple as possible to convey a particular point, knowing full well that realistic pathway systems are many times larger and more complex. Before we go into details, it is useful to recall where the field is coming from. Mathematical pathway analysis, and in fact, most modeling in biology, has two major roots that go back to the beginning of the twentieth century and are still of direct importance to microbial pathways. The first root consists of mathematical descriptions of isolated (bio)-chemical processes and the second of small ecological systems. The early analyses of (bio)-chemical processes were based on concepts from thermodynamics and elemental chemical kinetics. The former describes whether a reaction is energetically feasible, while the focus of the latter is on temporal features of the process. Of course, the two are not independent, and the latter can be derived as the consequence of the former. Specifically, considerations from statistical thermodynamics led to mathematical descriptions of elementary chemical reactions, originally in a discrete probabilistic formulation. In this line of thought, one computes the probability of a reaction taking place within a short time interval. This probabilistic representation is mathematically rather inconvenient, but allows approximation in the form of the so-called chemical master equation. 
The master equation, in turn, leads to a streamlined continuous and deterministic representation, in which the rate of a reaction is expressed as an ordinary differential equation, and thus as a kinetic rate law. These rate law representations were then used to derive descriptions of the dynamics of enzyme catalyzed reactions and ligand binding processes that are still in use today. The best known among them is undoubtedly the Henri-Michaelis-Menten rate law (HMMRL; Henri, 1903; Michaelis and Menten, 1913) that describes how an enzyme aids the conversion of a substrate into a product. Notably, HMMRL addresses reactions that by themselves, in the absence of an enzyme catalyst, do not take place and assesses the speed with which such reactions occur. The main achievements underlying HMMRL were (a) the correct postulation that enzyme and substrate form an intermediate complex, which in turn breaks apart into enzyme and product, and (b) the mathematical formulation of the process, including a clever approximation, known as the quasi-steady-state approximation, which ultimately yielded a simple function describing enzyme kinetics. The proposed mechanism is shown in Fig. 1 and the typical formulation of HMMRL is

$$v_p = \frac{V_{\max}\, S}{K_M + S}, \tag{1}$$

Fig. 1. Schematic representation of a simple enzyme catalyzed reaction.

where vp is the rate with which product P is generated, S is the substrate concentration, and Vmax and KM are parameters. Further details and discussions of extensions and limitations can be found in many biochemistry or metabolic engineering books (e.g., Torres et al., 2002). Note that HMMRL expresses the speed of the reaction, or rate of change, which in a more general systems analysis would be written as a differential equation describing the increase in product, for instance as

$$\frac{dP}{dt} = v_p. \tag{2}$$
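Eqs. (1) and (2) can be explored numerically with a few lines of Python. The sketch below assumes a closed system in which the substrate is consumed at the same rate at which the product is formed (dS/dt = −vp), uses a simple forward Euler scheme, and picks arbitrary parameter values; none of these choices comes from the chapter.

```python
def hmm_rate(s, v_max, k_m):
    """Henri-Michaelis-Menten rate, Eq. (1): v_p = V_max * S / (K_M + S)."""
    return v_max * s / (k_m + s)


def simulate(s0, v_max, k_m, dt, steps):
    """Integrate dP/dt = v_p (Eq. 2) together with the assumed dS/dt = -v_p by forward Euler."""
    s, p = s0, 0.0
    for _ in range(steps):
        v = hmm_rate(s, v_max, k_m)
        s -= v * dt   # substrate is used up ...
        p += v * dt   # ... and appears as product
    return s, p


# Hypothetical parameter values for illustration only.
half_max = hmm_rate(2.0, v_max=1.0, k_m=2.0)   # at S = K_M the rate is V_max / 2
s_end, p_end = simulate(s0=10.0, v_max=1.0, k_m=2.0, dt=0.01, steps=1000)
```

The first call illustrates the usual reading of KM as the substrate concentration at which the rate is half of Vmax; the simulation shows how the rate law is embedded in a differential equation for the product.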

This formulation is the basis for most pathway analyses, in which vp is replaced with more complicated functions. In particular, many reactions require two substrates and are furthermore affected by modulators that increase or decrease the reaction speed. For instance, many reactions can be partially or fully inhibited by other metabolites and sometimes even by the final product of its own reaction chain, a mechanism that slows down the production of something that is available in sufficient quantities already. Needless to say, accounting for inhibitory and other modulating effects requires rate functions that depend on several variables and are mathematically more complicated. A good description can be found in Schultz (1994). When HMMRL was developed, all computations were of course done by hand. This situation changed in the 1960s when computers slowly became accessible to a wider group of researchers and not just to a very few specialized computer scientists. Seizing the opportunity, Garfinkel (1968) and others proposed to program computers for analyzing complicated biochemical pathways and implicitly thought they had the problem nailed. They expected that it was only a matter of time before thousands of HMMRLs could be evaluated simultaneously with these fascinating computing machines. While it is true that we can now simulate essentially any size of system consisting of HMMRLs and their extensions, the problem of biochemical pathway analysis is far from being solved. The reasons are several-fold, but two stand out. First, in spite of incredible advances in computer software and hardware, some analyses are still too complex to be solved. Leading the list of such problems is parameter estimation, that is, the determination of parameters like Vmax and KM from biological data. 
Sure, for one or a few reactions, this is easy to do, but it is currently simply not possible to estimate several hundred of these parameters simultaneously from experimental data with reliability. Secondly, not all problems

are purely technical. A significant other type of problem is the interpretation of results. Heinrich and Rapoport (1974) acutely described the issue: Firstly, from the computer output it appears difficult to differentiate between important and unimportant effects, enzymes, metabolites, etc. Secondly, it is difficult to see how some effects are brought about. Thirdly, such a computation is often impracticable for experimentalists. Fourthly, many ad hoc assumptions are even now necessary for the mechanisms of several of the constituent enzymes of a chain. The strong dependence of the model of an enzymatic chain on the detailed mechanisms of single enzymes is unfavorable. Analytical approaches, on the other hand, circumvent some of these difficulties, as they are usually connected with a reduction to the essential parameters. They may show clearly the reasons for regulatory effects and they are more practicable for experimentalists. As a solution to some of these issues of interpretation and analysis, as well as to some of the technical problems, several groups (e.g., Savageau, 1969a, 1969b; Kacser and Burns, 1973; Heinrich and Rapoport, 1974) introduced new mathematical approaches that were based on general approximation theory and came to be known as “canonical” models (e.g., Voit, 1991). We will discuss some of them later in this chapter. The second root of mathematical pathway analysis comes from a long tradition in ecology and, in particular, the seminal work of Lotka (1924). Interestingly, even describing animal (predator) species consuming other (prey) species, Lotka referred to their dynamics as “kinetics,” a term that nowadays is reserved for chemical and biochemical reactions. Lotka (1924), Volterra (1926) and many others after them, such as Ludwig von Bertalanffy (1968) and Robert May (1976), developed methodological frameworks for biological systems that have become cornerstones of any comprehensive pathway analysis.

2. Why Do We Need Models to Understand Pathway Systems?

Our minds are trained to think in chains of causes and effects, and we have little problem following such chains and making accurate predictions. However, as soon as regulation enters the picture, if there are branches, or if components have opposing effects, with one inhibiting a process and the other activating it, our unaided mind is prone to fail. The main reason is that the quantitative features of each process or effect drive the response, and that even slight changes in one of the contributing factors might be sufficient to turn a particular outcome on its head. In other words, numerical characteristics have to be integrated in a dynamic manner, and we are simply overwhelmed with such tasks, unless we can rely on the use of computational models.

As an illustration, consider the simple pathway, where the metabolite X1 is derived from some external source X0 and converted to a product Y in a series of intermediate steps. If the input is constitutive, the pathway reaches a steady state in which the flows in and out of the different metabolite pools are perfectly balanced. As an experiment, assume that we suddenly get a burst of the input signal (for example through an injection). How will the system react? Biochemical intuition tells us that we will get a burst in the concentration of X1, followed by a burst in the subsequent intermediates and the product Y as well. Eventually all the components will return to their steady state values. While these responses are quite intuitive, we will have problems predicting the finer details of the behavior, e.g., the shape and timing of the bursts or the time it takes to return to the steady state, without the help of a mathematical model.

Now let us modify the system by adding inhibition exerted by the end product of the pathway (Fig. 2). In other words, Y inhibits the formation of X1. How will this system react if we alter the input? Again we may predict that a burst would propagate through the system, but what happens to the concentration of X1 as the concentration of Y peaks? That is not as easy to predict from the interaction diagram alone. Suppose the strength of inhibition is p = −0.75. As the figure shows, a change in input leads to oscillations that ultimately disappear. For stronger inhibition (p = −1) the fluctuations decrease more slowly in amplitude, and for even stronger inhibition (p = −1.25) their amplitudes increase, and it is easy to foresee

Fig. 2. Linear pathway with end-product inhibition. Depending on the strength of inhibition, the dynamics of the system changes significantly.

that soon one of the variables will vanish altogether. It is impossible to predict with intuition where this change to growing oscillations will happen. These examples illustrate that even small metabolic networks may be too complex to allow intuitive predictions. A main reason is that the relationships between the metabolites are nonlinear. Considering that realistic biochemical systems are characterized by much larger numbers of components that are coupled through multiple reactions and regulatory interactions, there is no doubt that mathematical models and tools for their analysis are crucial for understanding their behaviors. A mathematical model can serve several purposes. It can be used to execute what-if scenarios (exploratory purpose), or to predict the behavior of the system under altered conditions (predictive purpose). A model may also enhance our understanding of the underlying mechanisms of the system behavior and thus serve an explanatory purpose. It allows us to screen for sensitive components, which may be prone to failure or, in the case of metabolic systems, may be promising drug targets. Finally, a good model helps us to develop focused manipulations and optimization strategies that change the pathway in a desired fashion, for instance, in the context of increasing yield of organic compounds in metabolic engineering. These and other uses are elaborated in Sec. 10.
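The kind of computational experiment behind Fig. 2 can be sketched as follows. The chapter does not state the equations or parameter values used for the figure, so the model below is an assumption: a three-variable power-law pathway X0 → X1 → X2 → Y in which Y acts on the first step with kinetic order p, all rate constants set to 1 and all other kinetic orders set to 0.5. The input X0 is doubled for a short burst and then returned to its original value, and a fixed-step fourth-order Runge-Kutta scheme is used for the integration.

```python
def rhs(state, x0, p):
    """Assumed power-law model of a linear pathway with end-product inhibition."""
    x1, x2, y = state
    return (
        x0 * y ** p - x1 ** 0.5,   # production of X1 is inhibited by Y when p < 0
        x1 ** 0.5 - x2 ** 0.5,
        x2 ** 0.5 - y ** 0.5,
    )


def rk4_step(state, x0, p, dt):
    """One classical fourth-order Runge-Kutta step with constant input x0."""
    def shifted(k, factor):
        return tuple(s + factor * ki for s, ki in zip(state, k))

    k1 = rhs(state, x0, p)
    k2 = rhs(shifted(k1, dt / 2), x0, p)
    k3 = rhs(shifted(k2, dt / 2), x0, p)
    k4 = rhs(shifted(k3, dt), x0, p)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(p, t_end=60.0, dt=0.01, burst_until=5.0):
    """Start at the steady state, double the input during the burst, then restore it."""
    state = (1.0, 1.0, 1.0)   # steady state of the assumed model for x0 = 1 and any p
    trajectory = [state]
    for k in range(int(round(t_end / dt))):
        x0 = 2.0 if k * dt < burst_until else 1.0
        state = rk4_step(state, x0, p, dt)
        trajectory.append(state)
    return trajectory


# Hypothetical scan over the inhibition strengths mentioned in the text.
trajectories = {p: simulate(p) for p in (-0.75, -1.0, -1.25)}
```

Plotting the three trajectories shows how the response to the same burst depends on p. Because the kinetic orders here are assumed, this sketch should not be expected to reproduce the thresholds between damped and growing oscillations reported for Fig. 2; it only illustrates that such questions are answered by simulation and not by inspection of the diagram.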

3. The Modeling Process

Almost independent of the subject area and the ultimate choice of a particular modeling framework, the process of modeling consists of several phases that are distinct with respect to input requirements and techniques. The main phases are briefly outlined here and will be elaborated in the subsequent sections.

I. Problem definition. The main goal of this phase is to establish specific purposes of the model and to explore whether the model construction is realistic given the available data and information (Sec. 5).

II. Collecting information on network structure. If a pathway system consists of n metabolites, it could theoretically contain roughly n² fluxes of material. This number could even be higher, because some biochemical conversions can be catalyzed by several enzymes or isozymes. In addition, many constituents of the system could exert regulatory effects, which are distinctly different from material fluxes. Given these large numbers, concerted effort should be expended to learn as much as possible about the structure and regulation of a given pathway system. In many cases, some of this information can be mined from genomes (Sec. 6).

III. Model design. Having decided what the model should accomplish, the next step is to choose the appropriate modeling framework and to implement it with appropriate parameter values (Sec. 7).

Microbial Pathway Models


IV. Model diagnostics. Before we can rely on the model, we must analyze it with respect to its responses and its consistency with the current biological knowledge (Sec. 9).

V. Model application. At this stage it is time to put the model to the test in terms of making predictions, generating hypotheses and possibly designing additional biological experiments (Sec. 10).

The process looks quite simple, but it is not as straightforward as it seems. Modeling is essentially cyclic in nature. During the evaluation of the model and interpretation of the results, one often finds that the model is not optimal in its given form. Discrepancies between model and observations, or lack of robustness, are indicators of potential problems with the model structure or the parameter values, and analyses of these discrepancies may suggest changes in the model design. The redesigned model is then subjected to the same analyses as the former model, and should hopefully exhibit improved performance.

4. Case Study: Sugar Metabolism in Lactococcus lactis

The modeling process outlined in the previous section will be illustrated throughout this chapter using the initial steps of carbohydrate metabolism in the lactic acid bacterium Lactococcus lactis. Lactic acid bacteria (LAB) have a long tradition in industrial fermentations, in which they are used as starters in the manufacture of fermented foods and beverages, such as buttermilk, cheese and yogurt, sausages, bread, pickles and olives, and wine. L. lactis is widely recognized as the model organism for the study of lactic acid bacteria, and a large set of efficient tools for the genetic manipulation of this bacterium has been developed (de Vos, 1999). Moreover, the complete genome sequences of three L. lactis strains are available for comparative genomic studies (Bolotin et al., 2001). Proteomic data are also available in several illustrative works (Hartke et al., 1996; Budin-Verneuil et al., 2005).
Finally, modern methods of in vivo nuclear magnetic resonance (NMR) allow the generation of very accurate metabolomic data of L. lactis in the form of simultaneous concentration measurements of several metabolites, and these experiments can be executed in rapid sequence, thus leading to metabolic time courses (e.g., Neves et al., 1999) that allow unique means of analysis (e.g., Voit et al., 2006a). Sugar metabolism in L. lactis has been the topic of intense research for several decades and all key enzymes of the glycolytic pathway and many others have been well characterized. Nevertheless, there are still open questions regarding the regulation of these pathways. Complicating research in the area is the fact that glycolysis does not occur in isolation but is connected directly or indirectly to multiple branches of metabolism and, for instance, ebbs and flows with the redox (e.g., NADH/NAD+ ) state and with the availability of energy related cofactors (e.g., ATP/ADP/Pi ), which themselves are subject to regulation at several levels. Thus, beyond the apparent simplicity of energy metabolism in L. lactis lies a complex


regulatory design that renders the system robust and versatile, and in doing so poses a challenging puzzle of biological discovery. Sugar metabolism in L. lactis has been the object of several modeling studies. The presentation here is primarily based on three recent studies (Hoefnagel et al., 2002; Voit et al., 2006a; Voit et al., 2006b) and references therein.

5. Defining the Problem

The first step towards making a mathematical model of a system should always be to define its purpose. We need to determine very specifically which questions we want the model to answer. Are we modeling the system in order to determine the mechanism of a particular feature or do we want to choose parameters in order to optimize a specific property of the system? All good modeling processes begin with a good set of questions. Two issues must be kept in mind when we derive a good set of questions: time and organizational scales, and data availability. Why?

Consider the time and organizational scales first. Suppose our goal is to study trees at a number of different levels. To model the growth of a forest or plantation, the scale of organization is typically the individual tree. The corresponding time scale could be years or even decades. Contrast this with photosynthesis in a leaf or needle. Now the organizational level is biochemical or physiological, and the corresponding time scale is that of seconds or minutes. It is easy to imagine time and organizational scales between or outside these two scenarios. A common modeling mistake is the attempt to account for too many different scales within the same model. It is simply not feasible (or useful) to study the details of light conductance and the long-term growth of forests in the same model. Having issued this warning, the real strength of mathematical models lies in their ability to explain phenomena at one level through the integration of processes at a lower level. Thus, a good model may span two or maybe three levels of biological organization and, for instance, describe the formation of a leaf in terms of genetic and biochemical processes. In defining the purpose of the model we thus need to be quite specific; “I want to study the growth of trees” is not crisp enough.

The second critical issue during model design and the determination of purpose is the availability of data.
If only yearly biomass measurements of the tree stand are available, one should not even attempt to model the biochemistry of tree growth. Conversely, if the motivation of the model rests on light conductance data, good as they may be, they would hardly support a model of the growth of the tree stand over many years.

Case study. The questions that have been driving the investigations of the control and regulation of sugar metabolism typically fall into two categories. First, which mechanisms govern the metabolic shift from homolactic fermentation (lactate production) to mixed acid fermentation (ethanol, acetate and formate)? And second, what controls or regulates carbohydrate uptake and glycolysis?


6. Developing Hypotheses Regarding Network Structure

Having defined the purpose(s) of the model as specifically as we can, we need to identify the components and the interactions of the system that are to be included in the model. This information can be extracted from various types of experimental data (see Chapter 12 or, for a brief summary, Sec. 8). It is often useful to represent this information in a graphical map (for an example, see Fig. 3). Although seemingly straightforward, this is nonetheless a crucial step of the modeling process. If important variables are missing from the model design or if interactions between variables are misrepresented, later analyses and our interpretations of the results will be affected, although this might not be obvious at this stage. It is therefore important to put considerable time and effort into this first step in the modeling process. A seemingly good strategy could be “the more the merrier”: let’s include everything possible in the model. However, this generosity will soon backfire, as the addition of every variable comes with a number of parameters that need to be

Fig. 3. Simplified graphic representation of the initial steps of carbohydrate metabolism in L. lactis. Black arrows denote flow of material and grey arrows are regulatory signals. The minus sign indicates that the regulation is inhibitory, while the plus sign indicates an activating effect. The fluxes are numbered consecutively (1–9) to simplify the translation of the map into the mathematical model (see Figs. 4 and 6). Abbreviations: PEP – phosphoenolpyruvate, G6P – glucose-6-phosphate, FBP – fructose 1,6-bisphosphate, 3-PGA – 3-phosphoglycerate.


estimated from biological data and also with added analytical and computational costs.

Case study. Figure 3 shows a graphical representation of glycolysis and lactate production in L. lactis. The main metabolites are shown and will form the basis of the model. Other metabolites are ignored even though we know that they exist. An example is fructose 6-phosphate (F6P), which is an intermediate between glucose 6-phosphate (G6P) and fructose 1,6-bisphosphate (FBP). The reason for omission in this case is the very fast equilibration between F6P and G6P, which allows us to consider G6P as the sole representative of both pools. We notice arrows of two different types. The black arrows show flow of material, while the gray arrows indicate signals. It is important to distinguish the two, as one carries material, but the other one does not. The map also contains a question mark: the regulatory role of G6P on the uptake of glucose has so far not been observed experimentally in L. lactis, but it has been shown to be important in yeast (Galazzo et al., 1990). It is safe to include it in the pathway and later on to test with the model whether this inhibition may also play an essential role in L. lactis.

While this pathway has been studied in some detail in L. lactis, other relevant pathways of carbohydrate metabolism could be considered as natural expansions. Four important expansions to this model could be: (1) the fate of pyruvate; (2) the fate of G6P toward uridine 5′-diphosphate glucose (UDPG) and the pentose shunt; (3) the use of galactose as substrate under conditions of glucose depletion; and (4) the production of the secondary metabolite mannitol.
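The distinction between material flow and signals can be kept explicit when a map is stored in a computer. The following is a minimal sketch, assuming a hypothetical data structure (`Flux`, `affecting_variables`, and `fluxes_touching` are names invented here) and using the small linear pathway with end-product inhibition from Fig. 2 rather than the full map of Fig. 3; the single intermediate `X2` is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Flux:
    """One process of the map (hypothetical representation)."""
    name: str
    substrates: list   # pools that lose material (tail of a material arrow)
    products: list     # pools that gain material (head of a material arrow)
    modifiers: dict = field(default_factory=dict)  # signal source -> '+' or '-'

    def affecting_variables(self):
        """Variables with a direct effect on this flux: substrates plus signal sources."""
        extra = [m for m in self.modifiers if m not in self.substrates]
        return list(self.substrates) + extra

# Linear pathway X0 -> X1 -> X2 -> Y -> (out), with Y inhibiting the formation of X1.
pathway = [
    Flux("v1", ["X0"], ["X1"], {"Y": "-"}),
    Flux("v2", ["X1"], ["X2"]),
    Flux("v3", ["X2"], ["Y"]),
    Flux("v4", ["Y"], []),
]

def fluxes_touching(pool, fluxes):
    """Return the names of the fluxes that produce and that consume a given pool."""
    produced = [f.name for f in fluxes if pool in f.products]
    consumed = [f.name for f in fluxes if pool in f.substrates]
    return produced, consumed

if __name__ == "__main__":
    for f in pathway:
        print(f.name, "depends on", f.affecting_variables())
    print("X1 is produced by / consumed by:", fluxes_touching("X1", pathway))
```

Keeping signals in a separate field means that they influence the rate of a flux without ever entering the mass balance of the pool that sends them.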

7. Choosing a Modeling Approach

Having a clear understanding of the system we want to model and what the model is supposed to accomplish, we are now ready for the next step: determining the structure of the model. The aim is to develop a model that is a valid description of the system, while at the same time convenient for analysis and manipulation. The main challenge is that there are no strict, generally accepted guidelines for mapping reality onto a mathematical model. Initially, four fundamental properties should be considered. Should the model be:

• dynamic or static?
• deterministic or stochastic?
• continuous or discrete?
• spatially heterogeneous or homogeneous?

Answers to these questions are not associated with different degrees of sophistication, usefulness, or quality of the model. We are simply required to make decisions, again based on the data and the purpose of the model. For instance, if


the data exhibit a dependency on time, we will probably prefer a dynamical model that captures the time trends over a static model, such as a regression model, in which the dependences of one variable on others are explored. In the real world, nearly everything occurs in three spatial dimensions. Nonetheless, if the spatial aspects are not of particular importance, it makes life much easier if we ignore them. To use the physiological example of diabetes, it may or may not be important to study the three-dimensional distribution of glucose and insulin throughout the body. As another dichotomy, biological datasets always contain noise, and their behaviors contain a certain level of unpredictability. Does that mean we have to use a stochastic model? Not necessarily. If we are primarily interested in “average” model responses, a deterministic model will do just fine. On the other hand, if we need to explore the worst case, best case, and likelihood of other possible outcomes from a system with a lot of uncertainty and noise, we may be best advised to choose a stochastic model.

If we assume a dynamic, homogeneous, continuous and deterministic model, we arrive at a set of ordinary differential equations. These equations have had numerous applications in metabolic pathway modeling. In general terms, an ordinary differential equation for the temporal change in metabolite $X_i$ can be formulated as

$$
\dot{X}_i = V_i^{+} - V_i^{-} = V_i^{+}(X_1, X_2, \ldots, X_n) - V_i^{-}(X_1, X_2, \ldots, X_n), \qquad i = 1, \ldots, n, \tag{3}
$$

where $\dot{X}_i$ denotes the derivative of $X_i$ with respect to time (i.e., $dX_i/dt$) and n is the number of metabolites in the system (as an illustration, revisit Fig. 2). The key to deriving a valid model for a metabolic network is to specify appropriate functions for $V_i^{+}$ and $V_i^{-}$, which for a given metabolite summarily represent the production and degradation rates, respectively, and could consist of several terms each. There are various alternatives, all carrying with them a set of underlying assumptions and approximations. Some of the modeling approaches are briefly described in the following sections.
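The structure of Eq. (3) carries over directly into code: every pool has one production function and one degradation function, and the derivative is their difference. The following is a minimal sketch, assuming a hypothetical two-pool chain with first-order rates and a simple forward Euler integrator; the rate constants `k1` and `k2` and the constant input are illustrative values, not taken from the chapter.

```python
def derivatives(x, v_plus, v_minus):
    """Eq. (3): dX_i/dt = V_i^+(X) - V_i^-(X) for every pool i."""
    return [vp(x) - vm(x) for vp, vm in zip(v_plus, v_minus)]

def euler(x0, v_plus, v_minus, dt, n_steps):
    """Forward Euler integration (simple, but adequate for this smooth toy system)."""
    x = list(x0)
    for _ in range(n_steps):
        dx = derivatives(x, v_plus, v_minus)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x

# Hypothetical chain: constant input (1.0) -> X1 -> X2 -> out, first-order kinetics.
k1, k2 = 0.5, 0.25
v_plus = [lambda x: 1.0,          # production of X1
          lambda x: k1 * x[0]]    # production of X2 = degradation of X1
v_minus = [lambda x: k1 * x[0],   # degradation of X1
           lambda x: k2 * x[1]]   # degradation of X2

final = euler([0.0, 0.0], v_plus, v_minus, dt=0.01, n_steps=10000)

if __name__ == "__main__":
    # At steady state, inflow equals outflow for each pool: X1 = 1/k1, X2 = 1/k2.
    print(final)
```

The modeling approaches discussed next differ only in how the functions playing the role of `v_plus` and `v_minus` are chosen.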

7.1. Stoichiometric Models

At the lowest level of detail we can focus only on the topology of the biochemical system, stating which metabolite may be converted into which other metabolite. The result is a simple “wiring diagram,” which we can translate into a balance equation that is formulated as a set of linear ordinary differential equations. This set expresses the dynamics of the metabolite concentrations, collected in a vector S, in terms of a stoichiometric matrix N and a vector v of reaction rates, which in principle may be constant, time dependent, and/or dependent on other metabolites (see, for instance, Eq. (7)). In typical notation, such a balance equation reads

$$
\frac{dS}{dt} = N \cdot v. \tag{4}
$$


Fig. 4. Stoichiometric matrix of the pathway in Fig. 3, with metabolites in rows and fluxes vi in columns.

The stoichiometry matrix N (Fig. 4) consists of one row for each metabolite and one column for each reaction. If a reaction generates a metabolite, the corresponding element is +1, and if the reaction uses a metabolite as substrate, the matrix element is −1. If a metabolite and a reaction are unrelated, the corresponding matrix element is zero. The matrix elements also indicate stoichiometric relationships if these are not one-to-one. For instance, if two substrate molecules are used to form one product molecule, the loss in substrate is coded as −2. Generally, the element Nij represents the stoichiometric coefficient of the ith chemical species in the jth reaction. The main advantage of the stoichiometric model is its mathematical representation in terms of a matrix equation, which has the consequence that there are uncounted theorems and analytical methods that support its analysis. One application where the stoichiometric model has proven especially useful is in determining the feasible or optimal flux distributions throughout a microbial metabolic network under the assumption that the vector v consists of constant flux rates. If the system operates at a steady state (i.e., N · v = 0), flux balance analysis seeks to find the distribution that is optimal relative to some criterion, such as maximal growth rate (for an introduction, see Palsson (2006)). Flux balance analysis has been shown to provide meaningful predictions in the metabolic network of Escherichia coli (Edwards et al., 2000) and other microbial systems. The approach has also been generalized to take into account various physico-chemical as well as thermodynamical constraints (Beard et al., 2004; Palsson, 2006). The disadvantage of the steady-state stoichiometric model with constant rates is that it focuses almost exclusively on the connectivity structure of the system.


Kinetic information (for instance, regarding inhibition signals and other nonlinear dynamic interactions) is not, and cannot be, taken into account.

Case study. The stoichiometric matrix of the glycolytic pathway in L. lactis can be derived directly from the graphical representation in Fig. 3 without the need for any detailed kinetic information. The resulting matrix is given in Fig. 4. Note that only the arrows corresponding to flow of mass are represented, whereas the signal arrows remain unaccounted for in this representation.
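The construction of a stoichiometric matrix and the balance equation of Eq. (4) can be sketched in a few lines. The example below uses a small hypothetical network (it is not the matrix of Fig. 4) that includes one reaction consuming two substrate molecules, so that a coefficient of −2 appears; the metabolite names and flux values are assumptions made for illustration.

```python
# Hypothetical network:  -> A (v1),  A -> B (v2),  2 B -> C (v3),  C -> (v4)
metabolites = ["A", "B", "C"]
reactions = {
    "v1": {"A": +1},
    "v2": {"A": -1, "B": +1},
    "v3": {"B": -2, "C": +1},   # two molecules of B form one molecule of C
    "v4": {"C": -1},
}

def stoichiometric_matrix(metabolites, reactions):
    """One row per metabolite, one column per reaction; unrelated pairs get 0."""
    return [[reactions[r].get(m, 0) for r in reactions] for m in metabolites]

def rate_of_change(N, v):
    """Eq. (4): dS/dt = N . v, written out as plain sums."""
    return [sum(n_ij * v_j for n_ij, v_j in zip(row, v)) for row in N]

N = stoichiometric_matrix(metabolites, reactions)

v_balanced = [2.0, 2.0, 1.0, 1.0]     # satisfies N . v = 0 (steady state)
v_unbalanced = [2.0, 1.0, 1.0, 1.0]   # A accumulates, B is depleted

if __name__ == "__main__":
    for name, row in zip(metabolites, N):
        print(name, row)
    print(rate_of_change(N, v_balanced))
    print(rate_of_change(N, v_unbalanced))
```

Flux balance analysis searches among all flux vectors with N · v = 0 for one that is optimal with respect to a chosen criterion; the sketch only checks the steady-state condition for given vectors and does not perform that optimization.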

7.2. Kinetic Models

Incorporation of the kinetic properties of the processes and the total concentrations of species present in the system yields a kinetic (dynamic) model of the system. The best-known example in this category is HMMRL, which we discussed before. HMMRL in its original form is compact and easy to use, analyze, and interpret. Also, various generalizations (such as different types of inhibition) are easily implemented in such a way that the result retains its simple algebraic character. However, if several substrates or reactions are involved, and if several modulators affect the pathway, the comprehensive rate law quickly becomes unwieldy (Schulz, 1994). As we discussed before, some of the resulting issues are of a technical nature, while others cloud our insights into what the model results really mean. In addition, the extended rate laws often render it difficult to obtain sufficient experimental data to estimate all parameters.

Case study. Hoefnagel et al. (2002) proposed a kinetic model for carbohydrate metabolism in L. lactis. The model focuses on the distribution of carbon at the pyruvate branch point, so that only some reactions overlap with the reactions in Fig. 3. As an example, let us consider the conversion of pyruvate to lactate by lactate dehydrogenase. The kinetic representation of this reaction is given by

$$
v_8 = \frac{V_{\max}\,\dfrac{1}{K_{m,\mathrm{PYR}}\,K_{m,\mathrm{NADH}}}\left(\mathrm{PYR}\times\mathrm{NADH} - \dfrac{\mathrm{LAC}\times\mathrm{NAD}}{K_{eq}}\right)}{\left(1 + \dfrac{\mathrm{PYR}}{K_{m,\mathrm{PYR}}} + \dfrac{\mathrm{LAC}}{K_{m,\mathrm{LAC}}}\right)\left(1 + \dfrac{\mathrm{NADH}}{K_{m,\mathrm{NADH}}} + \dfrac{\mathrm{NAD}}{K_{m,\mathrm{NAD}}}\right)}, \tag{5}
$$

where $V_{\max}$, the different $K_m$ and $K_{eq}$ are parameters. In order to arrive at this expression, the authors assumed a reversible Henri-Michaelis-Menten equation with non-competing substrate-product couples (Hoefnagel et al., 2002).
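A rate law of the type in Eq. (5) is straightforward to evaluate once parameter values are available. The following is a minimal sketch, assuming that the two constants in the prefactor are the Michaelis constants of the substrates pyruvate and NADH; the function name `ldh_rate` and all numerical values in the usage part are hypothetical and are not the parameters of Hoefnagel et al. (2002).

```python
def ldh_rate(pyr, nadh, lac, nad, vmax, km_pyr, km_nadh, km_lac, km_nad, keq):
    """Reversible Henri-Michaelis-Menten rate with non-competing
    substrate-product couples (PYR/LAC and NADH/NAD), following Eq. (5)."""
    numerator = vmax / (km_pyr * km_nadh) * (pyr * nadh - lac * nad / keq)
    denominator = ((1 + pyr / km_pyr + lac / km_lac)
                   * (1 + nadh / km_nadh + nad / km_nad))
    return numerator / denominator

# Hypothetical parameter values, chosen only to exercise the function.
PARAMS = dict(vmax=10.0, km_pyr=1.5, km_nadh=0.1, km_lac=100.0, km_nad=2.4, keq=100.0)

if __name__ == "__main__":
    # Products absent: the reaction runs forward.
    print(ldh_rate(pyr=1.0, nadh=0.5, lac=0.0, nad=0.0, **PARAMS))
    # Mass-action ratio equal to Keq: the net rate vanishes.
    print(ldh_rate(pyr=1.0, nadh=0.5, lac=10.0, nad=5.0, **PARAMS))
```

The ten arguments of a single reaction illustrate the point made in the text: for larger systems, the number of parameters that must be estimated grows quickly.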

7.3. Canonical Models

The discussion in the previous sections has shown that, on one hand, Henri-Michaelis-Menten models may become complicated even for moderately large biochemical systems and suggests that their implementation requires a very


considerable amount of information. On the other hand, the stoichiometric models are often simply too coarse, because they do not account for regulation. The question thus is whether we can find a good compromise, giving us a simpler mathematical form which can readily be scaled up to large systems, while at the same time retaining some of the advantages of the stoichiometric approach, such as analytical tractability. The envisioned approach will certainly be an approximation because it is impossible to account for every process in mechanistic detail. Because linear functions are not feasible for our purposes, we must look for nonlinear approximations. Below we discuss two alternatives.

7.3.1. Power-Law Approximation

A canonical approach that has proven very useful is the use of power-law approximations. Mathematically speaking these are equivalent to linear approximations in a logarithmic coordinate system. In other words, one approximates log(rate) as a function of log(metabolites). The power-law approximation is the foundation of a modeling framework called Biochemical Systems Theory (BST) (Savageau, 1969a; Savageau, 1969b; Savageau, 1970; Voit, 2000). Two formulations within BST have been applied successfully to metabolic networks: S-systems and Generalized Mass Action (GMA) systems. In the S-system formalism, the rate of change in each pool (variable; metabolite) is represented as the difference between the influx into the pool and the efflux out of the pool [cf. Eq. (3)]. Each term is approximated by a product of power-law functions, so that the generic form is

$$
\frac{dX_i}{dt} = \alpha_i \prod_{j=1}^{n} X_j^{g_{ij}} - \beta_i \prod_{j=1}^{n} X_j^{h_{ij}}, \qquad i = 1, \ldots, n, \tag{6}
$$

where n is the number of state variables, whose values may change dynamically over time. The exponents $g_{ij}$ and $h_{ij}$ are called kinetic orders and describe the quantitative effect of $X_j$ on the production or degradation of $X_i$, respectively. They may have any real values, and the magnitude reflects the strength of the effect. A kinetic order of zero implies that the corresponding variable $X_j$ does not have an effect on $X_i$. If the kinetic order is positive, the effect is activating or augmenting, and if it is negative, the effect is inhibiting. The multipliers $\alpha_i$ and $\beta_i$ are rate constants that quantify the turnover rate of the production or degradation, respectively, and their values must be greater than or equal to zero. Each term consists only of those variables, together with their kinetic orders, that have a direct effect on this term. When we develop a model within the S-system format, all processes affecting the production of a variable are aggregated into one process, and all processes affecting the degradation of the variable are similarly grouped into another process (Fig. 5). These overall processes are then approximated by power-law terms. The result is that the right-hand side of the S-system equations [Eq. (6)] contains exactly one


Fig. 5. Canonical modeling, whether with S-systems, GMA models, or in the lin-log format, begins with translating a graphical map into differential equations. The graphical map in this example shows a simple branched pathway with two regulatory interactions. As an illustration, consider the differential equation for X1 . Common to all three approximations is that we only include the variables that directly affect a given process, along with the corresponding parameters. The example also illustrates that the only difference between S-system and GMA occurs at branch points; in the former case, the two reactions depleting X1 are aggregated into a single process, which is then approximated by a product of power-law terms, while the two reactions are approximated individually in the GMA and lin-log formats.

difference of products of power-law functions. The alternative representation of a Generalized Mass Action (GMA) system is obtained when we do not aggregate all influxes and all effluxes into one term each. Instead, in the GMA format each process entering or leaving a variable or pool is represented individually with a product of power-law functions such that

$$
v_i = \gamma_i \prod_{j=1}^{n} X_j^{f_{ij}}, \tag{7}
$$

where $v_i$ is the flux through reaction i. By collecting all influxes and effluxes associated with a variable $X_i$ we obtain a differential equation model, where each right-hand side is a sum of power-law functions of the type

$$
\frac{dX_i}{dt} = \gamma_{i1} \prod_{j=1}^{n} X_j^{f_{ij1}} \pm \gamma_{i2} \prod_{j=1}^{n} X_j^{f_{ij2}} \pm \cdots \pm \gamma_{ik} \prod_{j=1}^{n} X_j^{f_{ijk}}, \qquad i = 1, \ldots, n, \tag{8}
$$

(see Fig. 5), where the number of terms equals the number of fluxes entering and leaving this variable or pool. As in the S-system form, the rate constants are positive or zero and the kinetic orders may have any real values. Compared to S-systems, GMA systems are often closer to biochemical intuition, because each process is explicitly represented and easily defined. However, this form


also has drawbacks. Most importantly, the GMA form does not permit the algebraic calculation of steady states, which the S-system form does. This property is very beneficial if, for example, the model is used for optimization (see Sec. 10). It should be noted that differences between the two formulations only exist at branch points, where either two or more independent processes converge or diverge. All other steps are identical, even if they are highly modulated by activators or inhibitors. For more details on the mutual advantages and disadvantages of GMA and S-system representations, see for example Voit (2000).

The translation of a map into an S-system or GMA model is a straightforward process (Fig. 5) and can in principle be accomplished automatically (Goel et al., 2006). It begins with a listing of all variables and all processes. Then, one determines which variables directly affect each of the processes in the system. These, and only these, variables enter the power-law representation of each process in the system. A kinetic order is assigned to each variable, and a rate constant is assigned to each term. As for S-systems, we know that rate constants are always positive or zero. Theory also confirms that kinetic orders describing an increasing effect are positive, while kinetic orders describing an inhibitory effect are negative. The magnitude of a kinetic order reflects the strength of its effect.

7.3.2. Lin-Log Approximation

An alternative canonical form was introduced by Hatzimanikatis and Bailey (1996) (for a more recent review, see Visser et al., 2002; Heijnen et al., 2005). This form is based on taking the logarithm of each metabolite concentration and enzyme activity in relationship to a corresponding reference value.
The resulting “lin-log model” constitutes an extension of Metabolic Control Analysis (MCA), a theoretical framework for analyzing control and regulation in metabolic networks close to their steady state (Kacser et al., 1973; Heinrich et al., 1974). Specifically, each relative rate (rate divided by reference flux) is written as

$$
\frac{v_i}{J_i^{0}} = \frac{e_i}{e_i^{0}} \left( 1 + \sum_{j=1}^{n} \varepsilon_{ij}^{0} \log \frac{X_j}{X_j^{0}} \right), \tag{9}
$$

where J is the flux through the reaction, e is the enzyme activity, ε is the elasticity, which is an important element of MCA, and the superscript 0 denotes the value at the reference state. Each elasticity is equivalent to the corresponding kinetic order in BST and thus is a measure of how much a rate will change given a change in a metabolite contributing to the process. The reliance on a reference state can create challenges when the lin-log approach is applied to real systems without a real steady state or where any of the concentration values are (close to) zero (cf. Wang et al., 2007).
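The role of the reference state can be illustrated by approximating a known rate law locally. The following is a minimal sketch, assuming a hypothetical Michaelis-Menten rate as the "true" kinetics and an unchanged enzyme level (e = e0); the kinetic order, which equals the elasticity, is then Km/(Km + S0) at the reference concentration S0. The function names and parameter values are illustrative assumptions.

```python
import math

def michaelis_menten(s, vmax=1.0, km=1.0):
    """Hypothetical 'true' rate law that is to be approximated."""
    return vmax * s / (km + s)

def local_approximations(s0, vmax=1.0, km=1.0):
    """Power-law and lin-log approximations around the reference concentration s0."""
    v0 = michaelis_menten(s0, vmax, km)
    order = km / (km + s0)   # d ln v / d ln S at s0 (kinetic order = elasticity)

    def power_law(s):
        return v0 * (s / s0) ** order

    def lin_log(s):
        # Eq. (9) with unchanged enzyme activity (e = e0)
        return v0 * (1.0 + order * math.log(s / s0))

    return power_law, lin_log

if __name__ == "__main__":
    power_law, lin_log = local_approximations(s0=1.0)
    for s in (0.01, 0.5, 1.0, 2.0, 100.0):
        print(s, michaelis_menten(s), power_law(s), lin_log(s))
```

Both approximations agree with the original rate at the reference point. Far below it, the lin-log rate turns negative, and far above it, the power-law rate overshoots the saturating rate.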


In simplified notation without reference states, the lin-log rate is an approximation of the type

$$
v_i = e_i \left( a_{i0} + \sum_{j=1}^{n} a_{ij} \log X_j \right), \tag{10}
$$

which resembles the GMA format, because each reaction rate ($v_i$) is approximated individually (Fig. 5). However, the mathematical format is different, and it is unclear whether this form has the same mathematical flexibility as models in BST (cf. Savageau, 1995). As in the case of GMA, the right-hand side of the differential equation for a metabolite $X_k$ is obtained by collecting all influxes and effluxes associated with it, resulting in a sum of logarithmic terms (see Fig. 5). The constants $a_{ij}$ can be positive or negative, depending on whether the effect of the corresponding variable is activating or inhibiting. The translation of a map into equations follows the same steps as outlined for the S-system and GMA models (Fig. 5).

In comparison to the power-law approximations in BST, the lin-log model has the advantage of the GMA format in that the sum of terms is close to biochemical intuition. Simultaneously, it has the benefit of the S-system format by allowing algebraic calculations of its steady states. The two main problems with the lin-log model are that its structure is essentially linear, which precludes certain nonlinear behaviors (cf. Savageau, 1995), and that low concentration values can cause the corresponding rates to become negative (cf. Wang et al., 2007; del Rosario et al., 2008). In assessing these advantages and drawbacks, it should be kept in mind that both the BST and lin-log approaches are local approximations and are guaranteed to perform well only as long as the variables stay within a reasonable range around the operating point. In the case of power-law approximations, the representations become more inaccurate for very high substrate concentrations, while the lin-log approximation results in greater errors for substrate values close to zero.

Case study. With the graphical representation and the guidelines outlined above it is fairly straightforward to set up a canonical model of the pathway in Fig. 3. As an example, let us consider the differential equation for pyruvate.
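There are two fluxes, $v_1$ and $v_6$, that increase the pool of pyruvate, as well as two fluxes, $v_8$ and $v_9$, that deplete it. In GMA format the equation for pyruvate will therefore be given by

$$
\dot{\mathrm{PYR}} = v_1 + v_6 - v_8 - v_9, \tag{11}
$$

where the individual fluxes are defined as

$$
\begin{aligned}
v_1 &= \gamma_1\, \mathrm{GLC}^{f_{1,\mathrm{GLC}}}\, \mathrm{G6P}^{f_{1,\mathrm{G6P}}}\, \mathrm{PEP}^{f_{1,\mathrm{PEP}}} \\
v_6 &= \gamma_6\, \mathrm{FBP}^{f_{6,\mathrm{FBP}}}\, \mathrm{PEP}^{f_{6,\mathrm{PEP}}}\, \mathrm{P_i}^{f_{6,\mathrm{P_i}}} \\
v_8 &= \gamma_8\, \mathrm{FBP}^{f_{8,\mathrm{FBP}}}\, \mathrm{PYR}^{f_{8,\mathrm{PYR}}}\, \mathrm{NAD}^{f_{8,\mathrm{NAD}}} \\
v_9 &= \gamma_9\, \mathrm{PYR}^{f_{9,\mathrm{PYR}}}.
\end{aligned} \tag{12}
$$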


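The corresponding S-system model for pyruvate is

$$
\dot{\mathrm{PYR}} = V_{\mathrm{PYR}}^{+}(v_1, v_6) - V_{\mathrm{PYR}}^{-}(v_8, v_9),
$$

where the aggregated fluxes are given by

$$
\begin{aligned}
V_{\mathrm{PYR}}^{+} &= \alpha_{\mathrm{PYR}}\, \mathrm{GLC}^{g_{\mathrm{PYR,GLC}}}\, \mathrm{G6P}^{g_{\mathrm{PYR,G6P}}}\, \mathrm{FBP}^{g_{\mathrm{PYR,FBP}}}\, \mathrm{PEP}^{g_{\mathrm{PYR,PEP}}}\, \mathrm{P_i}^{g_{\mathrm{PYR,P_i}}} \\
V_{\mathrm{PYR}}^{-} &= \beta_{\mathrm{PYR}}\, \mathrm{FBP}^{h_{\mathrm{PYR,FBP}}}\, \mathrm{PYR}^{h_{\mathrm{PYR,PYR}}}\, \mathrm{NAD}^{h_{\mathrm{PYR,NAD}}}.
\end{aligned} \tag{13}
$$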

In other words, the two influxes are combined into one term and the same is done for the two effluxes. For the lin-log approach, the differential equation for pyruvate would again be given by Eq. (11) but the fluxes would be defined differently. As an example, consider $v_8$, which would have the format

$$
v_8 = e_8 \left( a_{8,0} + a_{8,\mathrm{FBP}} \log(\mathrm{FBP}) + a_{8,\mathrm{PYR}} \log(\mathrm{PYR}) + a_{8,\mathrm{NAD}} \log(\mathrm{NAD}) \right). \tag{14}
$$
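The GMA equation for pyruvate, Eqs. (11) and (12), can be written down almost verbatim in code. The following is a minimal sketch; the helper `power_law_flux`, the dictionary layout, and in particular all rate constants and kinetic orders are hypothetical placeholders, not the estimates of Voit et al. (2006a; 2006b).

```python
def power_law_flux(gamma, concentrations, orders):
    """Product of power-law functions: gamma * prod(X_j ** f_j), cf. Eq. (7)."""
    rate = gamma
    for name, f in orders.items():
        rate *= concentrations[name] ** f
    return rate

# Hypothetical rate constants and kinetic orders (signs and magnitudes are assumptions).
GMA_PARAMS = {
    "v1": (1.0, {"GLC": 0.5, "G6P": -0.2, "PEP": 0.5}),
    "v6": (0.8, {"FBP": 0.3, "PEP": 0.6, "Pi": -0.1}),
    "v8": (1.2, {"FBP": 0.4, "PYR": 0.7, "NAD": -0.2}),
    "v9": (0.1, {"PYR": 0.5}),
}

def dpyr_dt(conc, params=GMA_PARAMS):
    """Eq. (11): two fluxes fill the pyruvate pool and two fluxes drain it."""
    v = {name: power_law_flux(gamma, conc, orders)
         for name, (gamma, orders) in params.items()}
    return v["v1"] + v["v6"] - v["v8"] - v["v9"]

if __name__ == "__main__":
    unit = {m: 1.0 for m in ("GLC", "G6P", "FBP", "PEP", "PYR", "Pi", "NAD")}
    # With all concentrations at 1, every flux equals its rate constant.
    print(dpyr_dt(unit))
```

An S-system version would replace the four calls by two, one aggregated influx and one aggregated efflux, each with its own rate constant and kinetic orders as in Eq. (13).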

The remaining fluxes are derived in a similar manner. The full GMA-model is shown in Fig. 6 (Voit et al., 2006a; Voit et al., 2006b). In principle, the derivation of equations for all metabolites is as straightforward as for pyruvate. However, some metabolites require additional attention, namely the ubiquitous metabolites, such as ATP, inorganic phosphate (Pi ), NAD and NADH. As is clear from the model in Fig. 6, these are not included in the model as dependent variables (i.e., they do not have a differential equation describing their dynamics). The reason is that these metabolites are associated with many reactions throughout

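$$
\begin{aligned}
\dot{X}_1 &= \gamma_1 X_1^{f_{11}} X_4^{f_{14}}\, \mathrm{GLC}^{f_{1,\mathrm{GLC}}} - \gamma_2 X_1^{f_{21}}\, \mathrm{ATP}^{f_{2,\mathrm{ATP}}} \\
\dot{X}_2 &= \gamma_2 X_1^{f_{21}}\, \mathrm{ATP}^{f_{2,\mathrm{ATP}}} - \gamma_3 X_2^{f_{32}}\, \mathrm{P_i}^{f_{3,\mathrm{P_i}}}\, \mathrm{NAD}^{f_{3,\mathrm{NAD}}} \\
\dot{X}_3 &= \gamma_3 X_2^{f_{32}}\, \mathrm{P_i}^{f_{3,\mathrm{P_i}}}\, \mathrm{NAD}^{f_{3,\mathrm{NAD}}} + \gamma_5 X_4^{f_{54}} - \gamma_4 X_3^{f_{43}} \\
\dot{X}_4 &= \gamma_4 X_3^{f_{43}} - \gamma_1 X_1^{f_{11}} X_4^{f_{14}}\, \mathrm{GLC}^{f_{1,\mathrm{GLC}}} - \gamma_5 X_4^{f_{54}} - \gamma_6 X_2^{f_{62}} X_4^{f_{64}}\, \mathrm{P_i}^{f_{6,\mathrm{P_i}}} - \gamma_7 X_4^{f_{74}} \\
\dot{X}_5 &= \gamma_1 X_1^{f_{11}} X_4^{f_{14}}\, \mathrm{GLC}^{f_{1,\mathrm{GLC}}} + \gamma_6 X_2^{f_{62}} X_4^{f_{64}}\, \mathrm{P_i}^{f_{6,\mathrm{P_i}}} - \gamma_8 X_2^{f_{82}} X_5^{f_{85}}\, \mathrm{NAD}^{f_{8,\mathrm{NAD}}} - \gamma_9 X_5^{f_{95}} \\
\dot{X}_6 &= \gamma_8 X_2^{f_{82}} X_5^{f_{85}}\, \mathrm{NAD}^{f_{8,\mathrm{NAD}}}
\end{aligned}
$$

Fig. 6. Symbolic power-law model (GMA type) of the pathway in Fig. 3. Variables are as follows: X1 – G6P, X2 – FBP, X3 – 3-PGA, X4 – PEP, X5 – Pyruvate, X6 – Lactate. Fluxes are numbered consecutively, with the numbers in the graph corresponding to the subscripts of the associated rate constants (γi). Details of the derivation of the model can be found in Voit et al. (2006a; 2006b).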


the entire metabolic network of L. lactis and thus, their dynamics is partially controlled by processes outside the current system, making their inclusion in the model problematic. A possible strategy, which we use here, is to consider these metabolites as input functions that are not explicitly modeled but directly taken from experimental time series for the respective metabolites. For other strategies, see for example Voit (2000). The concentration of external glucose is also considered as an input function as the observed glucose uptake follows a sigmoidal function that is incompatible with the structure of the pathway (Voit et al., 2006a). The power-law model has been implemented successfully (Voit et al., 2006a; Voit et al., 2006b), but as already shown, other approaches such as lin-log or Michaelis-Menten type models could also be explored (e.g., see Hoefnagel et al., 2002). Furthermore, a combination of these models could be best suited as it has been shown that “pure” Michaelis-Menten models are sometimes difficult to parameterize (Curto et al., 1998). The lin-log model seems less suited for this system, as some variables approach 0 (del Rosario et al., 2008). A possible expansion of the model is to connect the dynamic model with information from existing stoichiometric and mass action models, as it is at present unrealistic to establish a fully kinetic model for the entire metabolic network in L. lactis. Specifically, inputs and outputs of the dynamic model can be replaced with constant fluxes out of, or into, a flux distribution model. Such a model results from the stoichiometry of the system and from the assumption of (unregulated) mass action kinetics and requires considerably less input information, which sometimes is more easily obtained. 
The initial, fundamental assumption for merging the kinetic models with these simpler models is that, given a specific stimulus, significant metabolic changes occur relatively close by on the pathway map, while much of the remainder of metabolism may stay rather close to a normal operating point. If so, it would justify the essentially linear representation that underlies these models for less affected pathways.

7.3.3. Summary: Canonical Models

Selecting a canonical modeling approach over some ad hoc model has several advantages. Canonical models always have the same, relatively simple mathematical format, where it is crisply prescribed how changes in all variables over time are to be represented. In addition to allowing for computer-aided model design, it is easy to modify the models in terms of adding or removing components or interactions. The requirement for detailed kinetic information is also less than for traditional kinetic rate laws, and the interpretation of a canonical model and its features is immediate. The restriction to canonical models can however limit the questions the model may answer. As the mechanistic details are not taken into account, we will not be able to use the model to determine the effect of different reaction mechanisms, for example. It might be possible to design a canonical model at a level of finer detail, but at the level we discussed, mechanistic details may be lost. In other words, the choice of framework depends on the purpose of the model, as was stated before.


S. R. Veflingstad et al.

7.4. Determining Parameter Values

Independent of the choice of modeling structure, the result of the model design outlined in this section is a symbolic model. It is symbolic because the equations contain symbols for most or all parameters and we have not yet determined their numerical values. In a deterministic model, we usually need sizes or quantities of variables, rates of processes, and characteristic features of components like the Michaelis constant of an enzyme or the kinetic orders in a power-law approximation. Although some diagnostic tools do not necessarily require a numerical model, a complete assessment of the validity and usefulness of the model can seldom be obtained without numerical values. As discussed in Chapter 12 (and briefly summarized below), parameter values may be obtained in numerous ways, none of which is foolproof or simple. In the remainder of this chapter, we discuss parameter estimation at a conceptual level, but ultimately assume that all the parameter values have been identified.

8. Information from Experimental Data

As explained in Secs. 2 and 7, a numeric model, i.e., a model with parameter values, is needed to help us understand possible behaviors of a metabolic network and to predict the effects of perturbations. No matter which methods for parameter estimation are used, the required information must be derived from experimental data. The information may directly lead to specific numerical values for some of the parameters or necessitate additional methods of extraction. Because the topic of “information mining” has been discussed in detail in Chapter 12, we here simply summarize the various types of information that can be derived from genomic, transcriptomic, proteomic and metabolomic data.

Genomic data resulting from comparative analyses of multiple genomes can provide a rich amount of information regarding gene functions at both molecular and cellular levels, co-transcription relationships, co-occurrence relationships, coevolution relationships of genes, protein-protein and protein-DNA interactions and functional associations among proteins (Enright et al., 1999; Marcotte et al., 1999; Overbeek et al., 1999; Pellegrini et al., 1999; Huynen et al., 2000; Marcotte et al., 2002; Wu et al., 2003). This information has proven to be essential for the construction of initial network models that ultimately link together all genes that are involved with the same cellular functions and needed to perform their specific designed tasks (Snel et al., 2002; von Mering et al., 2003; Yeger-Lotem et al., 2004; Wu et al., 2005). Furthermore, this information has been used in combination with metabolic network analyses to discover missing components in the network (Reed et al., 2006).

Genome-wide gene expression data collected through microarray experiments have revolutionized the way through which biological systems can be probed. Gene expression data extracted from such experiments give a global view about many


of the biological processes happening in the cell under designed conditions instead of focusing on a small set of genes. Through analyses of collected microarray gene expression data under well-designed conditions, one can often quickly obtain a rough idea about which genes are possibly involved in a target pathway. For example, if our goal is the identification of genes that are likely to be involved in nitrogen metabolism in E. coli, we can collect gene expression data under two conditions: E. coli cells cultured with or without nitrogen sources. Genes displaying substantial differences in their expression patterns under these two conditions may provide an initial list of target genes for key processes associated with nitrogen metabolism. Time-course gene expression data are particularly useful for guiding the design of the topology of a pathway network (Lin et al., 2003) and for the parameterization of the network. One useful observation that has been made previously is that enzyme concentrations in prokaryotic cells can be roughly approximated by the gene expression levels (Almeida et al., 2003; Kitayama et al., 2006; Srividhya et al., 2007), which is particularly useful for modeling changes of enzyme concentrations in a target pathway or network. Another important application of microarray data is that they allow us to obtain information about the transcription regulation network (Ogura et al., 2002; Sayyed-Ahmad et al., 2007). For example, as discussed in Chapter 12, the integration of microarray data and metabolic networks has been used to derive new insights about transcriptional regulation of metabolic subnetworks that are not readily discovered by traditional methods (Patil et al., 2005). 
Proteomic data represent another highly useful source of information for network studies, including network reconstruction, analysis and simulation, although proteomic data are not as readily useable as microarray gene expression data, since this field is still in its infancy. Nonetheless, proteomic data provide useful information concerning various aspects of protein molecules, including (a) the presence of proteins in the cell under the designed experimental conditions, (b) the level of protein expressions, (c) post-translational modifications of proteins, which often reflect the functional states of proteins, and (d) protein-protein interactions and associations that are looser and possibly mediated through other biomolecules. While protein expression levels and modifications at a large scale may provide essential information for network simulation studies, protein-protein interactions are highly useful for reconstructing the topology of a pathway network. An example of the use of proteomic data in network modeling is the development of computational models of mitochondria, the proteome of which is available for many species, including human, rat, mouse, fruit fly, yeast, Neurospora crassa, rice, Arabidopsis thaliana, pea, and soybean (McDonald et al., 2003; Verma et al., 2003; Lindholm et al., 2004; Douette et al., 2006; van der Laan et al., 2006; Vo et al., 2007) using both top-down and bottom-up approaches. For example, in a bottom-up approach, a network of 189 biochemical reactions and 230 metabolites in mitochondria has been reconstructed, using genomic and proteomic data in addition to previous knowledge from the literature (Ozawa et al., 2003). Furthermore, every metabolite and reaction in the network was associated to one of the three


cellular components: mitochondrial, cytosolic or extracellular. In one of the top-down approaches, 591 mitochondrial proteins were identified using samples from various tissues, and hierarchically clustered according to gene expression data (Mootha et al., 2003). Obviously, manually curated metabolic networks in mitochondria and various functional modules identified from the top-down approaches provided an additional valuable asset for further computational and experimental studies.

Finally, metabolomic data can provide a wealth of information concerning the status of the metabolites in a cell under designed conditions. The availability of large quantities of metabolites measured under the same condition can provide a global view of all cellular processes. Metabolomic data have recently been used for metabolic flux analysis (MFA; Fisher et al., 2005; Bajad et al., 2006; McNally et al., 2006; Saito et al., 2006; Maharjan et al., 2007) as well as fully dynamic metabolic models (Voit et al., 2006a; Voit et al., 2006b).

Case study. The availability of three L. lactis strains and 26 Streptococcus strains (www.ncbi.nih.gov) in the same family facilitates comparative genomic studies, which allow accurate predictions of operons (Chapter 10), cis-acting regulatory motifs for transcription factors (Chapter 11), functional associations of proteins based on identified co-occurrence and co-evolution relationships of genes and on phylogenetic profile analysis (Chapters 11 and 12), and protein-protein interactions (Chapter 12). Furthermore, all related metabolic pathways in KEGG (Kanehisa, 2002; Kanehisa et al., 2004) can be mapped to L. lactis, using pathway mapping tools discussed in Chapter 12. If there are gaps in the pathway models after the mapping process, various approaches can be used to find candidate genes to fill the gaps (see Chapter 12 Sec. 6).
Furthermore, the results of comparative genomic data can be integrated with microarray data and information resulting from literature searches using the approach described in Chapter 12 Sec. 6, which can suggest a functional network of involved genes. Analyses of densely intra-connected subnetworks can subsequently result in the identification of functional modules. This information can be used to propose candidates for filling the gaps in pathway models. Finally, optimization techniques described in Chapter 12 Sec. 7 can be used to refine the pathway models further and even to parameterize dynamic models. The end results of a comparative genomic study thus are: a functional network that links the possibly involved genes through protein-protein interactions and associations; and a transcriptional regulatory network consisting of transcription factors and the regulatory relationships between transcription factors and the genes they regulate. Once the topology has been derived, literature information and databases like Brenda can be used to obtain estimates of specific kinetic parameters, such as KM and KI values. In addition, a detailed analysis of time series data on metabolites can help to validate the model and to refine its details (Voit et al., 2006a; Voit et al., 2006b).


9. Methods of Model Diagnostics

Before we start using the model, we should test and validate it in order to assess whether the model is a reasonable representation of the system under study and is likely to allow reliable predictions regarding situations not yet tested experimentally. Model diagnostics can be a lengthy process, especially if we find errors or flaws that may require substantial changes to the model. The diagnostic tools are partly biological and partly mathematical.

On the biological side, we need to confirm and possibly refine the internal structure and consistency of the model. For instance, we should ask: Are all pathways active in the organism under study? Does this particular organism possess secondary pathways that are not included in the model but could become important? Are the moieties of interest preserved? Are conserved pools really conserved? Are the underlying assumptions still reasonable and consistent with each other? Are we confident that the regulatory structure (feedback signals, modulation) is reasonable?

Much of the mathematical diagnostics is rather straightforward, and even though it is the most technical phase of the modeling process, it is not necessarily difficult, because it is usually accomplished according to rather strict guidelines using off-the-shelf computer software. Typical tools of mathematical model diagnostics are described below.

9.1. Steady-State, Stability, and Sensitivity Analysis

Most biochemical systems will after some time attain a steady state, which is a condition of the system that is characterized by the fact that none of the variables changes in value. Material is still flowing through the system, but for each variable, the influx equals the efflux. Mathematically, this observation is equivalent to the requirement that the right-hand sides of all differential equations (describing changes over time) are equal to zero. For a model in S-system or lin-log form, the steady state can actually be computed algebraically, with modest effort. In principle, this computation can even be executed with paper and pencil and sometimes leads to insights that are much more general than what computer simulations can yield (e.g., Savageau, 1976). While such manual analyses are rare, the real practical advantage of explicit algebraic solutions lies in the fact that their known existence greatly widens the repertoire of computational methods. In particular, steady-state solutions do not have to be computed with iterative search algorithms, but can be computed efficiently with methods of computational matrix algebra.

Two applications of such computational tools are stability and sensitivity analysis. If the system can tolerate small perturbations from a steady state, we say that the steady state is stable. Most biochemical systems are expected to have a stable steady state, because they are constantly exposed to variations in their internal and external environment and should not be derailed by these natural fluctuations. For instance, the pH within microbes is controlled quite tightly, leading to a stable “normal” state to which the organism returns after small perturbations.
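The algebraic steady-state computation for S-systems can be sketched in a few lines. The example below is only an illustration under stated assumptions: the two-variable system, its rate constants and kinetic orders, and the function names `solve_linear` and `s_system_steady_state` are hypothetical and are not taken from the L. lactis model. It assumes the usual S-system form, in which each equation is a difference of one production and one degradation power-law term, so that taking logarithms at steady state turns the problem into a linear system.

```python
import math

def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: no unique steady state")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def s_system_steady_state(alpha, beta, g, h, g_in, x_in):
    """Steady state of dXi/dt = alpha_i*prod(Xj^g_ij)*prod(Uk^g_in_ik) - beta_i*prod(Xj^h_ij).

    g, h  : kinetic orders with respect to the dependent variables (n x n)
    g_in  : kinetic orders of the production terms with respect to inputs (n x k)
    x_in  : constant input values Uk (assumed to enter production terms only)
    """
    n = len(alpha)
    # In log coordinates y_j = ln X_j the steady-state equations are linear.
    a = [[g[i][j] - h[i][j] for j in range(n)] for i in range(n)]
    b = [math.log(beta[i] / alpha[i])
         - sum(g_in[i][k] * math.log(x_in[k]) for k in range(len(x_in)))
         for i in range(n)]
    return [math.exp(y) for y in solve_linear(a, b)]

# Hypothetical system:
#   dX1/dt = 2 * U * X2^-0.5 - X1^0.5   (X2 inhibits its own precursor)
#   dX2/dt = X1^0.5 - X2^0.5
steady = s_system_steady_state(
    alpha=[2.0, 1.0], beta=[1.0, 1.0],
    g=[[0.0, -0.5], [0.5, 0.0]],
    h=[[0.5, 0.0], [0.0, 0.5]],
    g_in=[[1.0], [0.0]], x_in=[1.0],
)
print(steady)
```

For this toy system both variables settle at the value 2, which can be confirmed by substituting back into the two equations. No iterative search is involved; the only numerical work is one linear solve.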


The simplest mathematical test of stability is only guaranteed for very small variations, but usually holds for larger variations as well. This “local stability” is characterized by so-called eigenvalues, which are complex numbers whose real and imaginary parts characterize the system at the steady state. It is beyond the scope of this review to discuss their mathematical features, but it suffices to state that the most important characteristics of eigenvalues can be computed with standard software and that their interpretation is simple: For local stability, all real parts must be negative. Even one positive real part, out of the whole, potentially very large set, destroys stability.

Stability analysis explores the tolerance of the system with respect to small, temporary changes in the state variables. Sensitivity analysis, on the other hand, explores the effect of small changes in parameter values. In a biological system, a parameter change of this type could reflect a mutation that alters the activity of an enzyme. For most metabolic models, sensitivity values should be of small magnitude, for instance, between −5 and 5. This is so because a value of, say, 100 would mean that a 1% change in a parameter would lead to a 100% change (doubling) in a component of the steady state. Thus, sensitivity analysis provides an impression of how robust a model is. If all sensitivities are small in magnitude, the model is quite tolerant against structural perturbations. There are exceptions to small sensitivity values. Prominently, signal transduction systems might have high sensitivities, because their specific role is to amplify small signals into robust responses.
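Both diagnostics can be illustrated on a small example. The sketch below is a hedged illustration, not the procedure used for the L. lactis model: the two-variable power-law system, its parameter values and all function names are hypothetical, and the closed-form steady state in `steady_state` holds only for this particular toy system. The Jacobian is approximated by central differences, the eigenvalues of the 2 x 2 matrix are obtained from its trace and determinant, and the sensitivity is estimated as the relative change of the steady state per relative change of a parameter.

```python
import cmath

def rates(x, alpha1=2.0, u=1.0):
    """Hypothetical system: dX1/dt = alpha1*u*X2^-0.5 - X1^0.5, dX2/dt = X1^0.5 - X2^0.5."""
    x1, x2 = x
    return [alpha1 * u * x2 ** -0.5 - x1 ** 0.5,
            x1 ** 0.5 - x2 ** 0.5]

def steady_state(alpha1=2.0, u=1.0):
    """Closed-form steady state of this toy system only: X1 = X2 = alpha1 * u."""
    return [alpha1 * u, alpha1 * u]

def jacobian(f, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of f at x."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        step = eps * max(1.0, abs(x[j]))
        up, down = list(x), list(x)
        up[j] += step
        down[j] -= step
        f_up, f_down = f(up), f(down)
        for i in range(n):
            jac[i][j] = (f_up[i] - f_down[i]) / (2.0 * step)
    return jac

def eigenvalues_2x2(m):
    """Eigenvalues of a 2 x 2 matrix from its trace and determinant."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return [(trace + root) / 2.0, (trace - root) / 2.0]

def relative_sensitivity(alpha1=2.0, change=0.01):
    """Relative change in steady-state X1 per relative change in alpha1."""
    base = steady_state(alpha1)[0]
    perturbed = steady_state(alpha1 * (1.0 + change))[0]
    return (perturbed / base - 1.0) / change

x_ss = steady_state()
eigs = eigenvalues_2x2(jacobian(rates, x_ss))
is_locally_stable = all(ev.real < 0.0 for ev in eigs)
sensitivity = relative_sensitivity()
print(eigs, is_locally_stable, sensitivity)
```

In this example both eigenvalues have negative real parts, so the steady state is locally stable, and the sensitivity of X1 with respect to the rate constant is 1: a 1% change in the parameter moves the steady state by about 1%, well inside the range considered unproblematic above.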

9.2. Bifurcation Analysis

Local stability and sensitivities both refer to properties of a given steady state. The study of structural stability on the other hand seeks to answer the question: Under which combinations of parameter values will the behavior of the system change drastically? A drastic change may be the transition from a stable steady state to a state of ongoing oscillations, as shown in Fig. 2. Structural stability is analyzed using bifurcation analysis, and the threshold value of the parameter(s) at which such a transition happens is called a bifurcation point. Unless one uses S-systems (Lewis, 1991), the determination of bifurcation points is generally difficult and one may be forced to resort to a massive simulation study that evaluates the system for very many different combinations of parameter values and initial settings. A potential indicator for structural instability is a real eigenvalue close to zero. Changes that make this value equal to zero could affect the system quite dramatically, although this is not necessarily so.
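As an illustration of how a bifurcation point can be located numerically, the following sketch tracks the largest real part of the eigenvalues while one parameter is varied and brackets the value at which it crosses zero. The model is an assumption made purely for illustration: it is the Brusselator, a textbook chemical oscillator that is not part of this chapter, chosen because its steady state and Jacobian are known in closed form. The function names are hypothetical.

```python
import cmath

def brusselator_jacobian(a, b):
    """Jacobian of dx/dt = a - (b+1)x + x^2 y, dy/dt = bx - x^2 y at its steady state (a, b/a)."""
    return [[b - 1.0, a * a],
            [-b, -a * a]]

def max_real_eigenvalue(m):
    """Largest real part among the eigenvalues of a 2 x 2 matrix."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return max(((trace + root) / 2.0).real, ((trace - root) / 2.0).real)

def find_bifurcation(a, b_low, b_high, tol=1e-9):
    """Bisect on b for the point where the steady state loses local stability."""
    f_low = max_real_eigenvalue(brusselator_jacobian(a, b_low))
    f_high = max_real_eigenvalue(brusselator_jacobian(a, b_high))
    if not (f_low < 0.0 <= f_high):
        raise ValueError("the interval does not bracket a loss of stability")
    while b_high - b_low > tol:
        mid = 0.5 * (b_low + b_high)
        if max_real_eigenvalue(brusselator_jacobian(a, mid)) < 0.0:
            b_low = mid   # still stable: threshold lies above mid
        else:
            b_high = mid  # already unstable: threshold lies at or below mid
    return 0.5 * (b_low + b_high)

b_critical = find_bifurcation(a=1.0, b_low=1.0, b_high=3.0)
print(b_critical)
```

For this model the transition from a stable steady state to sustained oscillations is known to occur at b = 1 + a^2, which gives a simple check on the numerical result. For larger pathway models no such formula is available, and the same scan has to be repeated over many parameter combinations, which is the massive simulation study mentioned above.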

9.3. External Consistency

Stability and robustness are important properties of a model that characterize its internal structure. Of course, robustness is only a prerequisite and not our real goal. Of overriding importance is whether the model captures the experimental data in a


satisfactory fashion. Ideally, the model provides a good fit to all sets of experimental data available. In this case, the validation may be accomplished intuitively or statistically. In the former (obviously simpler) approach, one compares the results of simulations with actual data and judges whether the two are sufficiently close. In the latter case, various statistical tests may be applied to assess whether or not the differences between model and data are significant. While a quantitative data fit is obviously desirable, one might sometimes be satisfied with less. Especially if the input data are scarce, one might be pleased with a qualitative agreement. For instance, it may be sufficient that a simulation of increased input to the system results in the same variables going up or down as it is observed in nature. In other cases, one may expect a model to be semi-quantitative: “If input to the system is doubled, variable 3 should increase to between 120% and 140%, variable 4 should decrease to about 50%, and variable 5 should essentially be unchanged.” The standard of comparison depends in large part on the complexity of the phenomenon under investigation and the quality and quantity of data used to implement the model.
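A semi-quantitative criterion of this kind is easy to encode as a check on ratios between a perturbed and a baseline simulation. The sketch below is only illustrative: the variable names, the simulated values, and the tolerance bands chosen for “about 50%” and “essentially unchanged” are assumptions, not data from any experiment.

```python
def semi_quantitative_check(baseline, perturbed, expectations):
    """Compare perturbed/baseline ratios with expected ranges.

    expectations maps a variable name to (low, high) bounds on the ratio.
    Returns {name: (ratio, passed)}.
    """
    report = {}
    for name, (low, high) in expectations.items():
        ratio = perturbed[name] / baseline[name]
        report[name] = (ratio, low <= ratio <= high)
    return report

# Expected responses to a doubled input, following the example in the text.
expectations = {
    "X3": (1.20, 1.40),  # should increase to between 120% and 140%
    "X4": (0.45, 0.55),  # "about 50%": the +/- 5 point band is an assumption
    "X5": (0.95, 1.05),  # "essentially unchanged": the band is an assumption
}

# Made-up simulation outputs, for illustration only.
baseline = {"X3": 2.0, "X4": 1.0, "X5": 3.0}
perturbed = {"X3": 2.6, "X4": 0.52, "X5": 3.06}

report = semi_quantitative_check(baseline, perturbed, expectations)
model_is_acceptable = all(passed for _, passed in report.values())
print(report, model_is_acceptable)
```

A purely qualitative test would replace the ranges by a check on the direction of change only, whereas a fully quantitative test would compare complete time courses with the data by statistical means.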

9.4. Monte-Carlo Simulations

Monte-Carlo simulation is a technique for exploring possible and likely behaviors of systems through massive simulations, and thus often allows assessments that we cannot obtain by algebraic means. The name derives from the famous gambling casino on the French Riviera, because Monte-Carlo simulations make abundant use of random numbers. Specifically, suppose the model has twenty parameters, which are all uncertain to some degree. Suppose further that we have a reasonable idea about the possible range of each parameter and, maybe, the likelihood of each particular value within this range. For instance, one could surmise that the values of some parameter are, like the height of adults, roughly normally distributed with some mean and some variance. Then, instead of using simply the mean and evaluating the model once, one draws successively from the normal distribution, every time generating a slightly different model output.

As a specific example, consider a system with three dependent variables X1–X3 and two input variables Y1 and Y2. The two input values have “normal” values Y1N and Y2N, which may correspond to the best estimates a panel of experts would assign. Feeding these normal values into the model leads to the steady state (X1N, X2N, X3N). Now we consider that the input variables are subject to variability and that both have some distribution of possible values. For each simulation of the system, an appropriate random value is drawn for each of the input variables, and we repeat the process thousands of times. Each simulation yields three output values, so that all simulations taken together yield three distributions of output values, one each for X1, X2 and X3. Typical outcome measures are often steady-state values, but they could in fact be any other features of the system (e.g., steady-state fluxes, peak concentrations, transition times, existence or absence of oscillations).


Typical questions asked with the Monte-Carlo approach are: What are the most likely outcomes for X1, X2 and X3? What are the extreme outcomes (i.e., best-case or worst-case scenarios)? Are there particular combinations of input variables that lead to the majority of undesired outcomes?
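The example with two inputs and three dependent variables can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the power-law mapping in `steady_state` is a hypothetical stand-in for a real model, the nominal input values and the 10% relative standard deviation are arbitrary, and non-positive draws are simply discarded so that concentrations stay positive.

```python
import random
import statistics

def steady_state(y1, y2):
    """Hypothetical steady state (X1, X2, X3) as a power-law function of the inputs."""
    return (2.0 * y1 ** 0.8 * y2 ** 0.2,
            1.5 * y1 ** 0.5 * y2 ** -0.3,
            0.5 * y1 ** 0.1 * y2 ** 0.9)

def draw_positive(rng, mean, sd):
    """Draw from a normal distribution, rejecting non-positive values."""
    while True:
        value = rng.gauss(mean, sd)
        if value > 0.0:
            return value

def monte_carlo(n_runs=5000, y1_nominal=1.0, y2_nominal=2.0, rel_sd=0.1, seed=0):
    """Return three lists holding the simulated values of X1, X2 and X3."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    outputs = ([], [], [])
    for _ in range(n_runs):
        y1 = draw_positive(rng, y1_nominal, rel_sd * y1_nominal)
        y2 = draw_positive(rng, y2_nominal, rel_sd * y2_nominal)
        for distribution, value in zip(outputs, steady_state(y1, y2)):
            distribution.append(value)
    return outputs

def summarize(distribution):
    """Median and approximate 5th and 95th percentiles of one output distribution."""
    ordered = sorted(distribution)
    n = len(ordered)
    return {"median": statistics.median(ordered),
            "p05": ordered[int(0.05 * n)],
            "p95": ordered[int(0.95 * n)]}

distributions = monte_carlo()
for name, distribution in zip(("X1", "X2", "X3"), distributions):
    print(name, summarize(distribution))
```

The median addresses the question of the most likely outcome, while the outer percentiles give a rough picture of best-case and worst-case scenarios. To find input combinations responsible for undesired outcomes, one would additionally store the drawn inputs alongside each output.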

10. How to Use the Model?

If the model passes the various diagnostic tests, we are led to assume that it represents a reasonable description of our system, and it is time to use it for its intended purpose. While exploring its responses, we must keep in mind that every model is an approximation, and that new discrepancies between the model and the real system are likely to occur in formerly untested situations.

10.1. Typical Uses

In the most routine exploration of the model, we like the model to answer questions of the type: What happens to the system following a transient change in an input signal? Can certain inputs cause the model to fail? How will the system react to a change in one of its parameters, such as a kinetic order representing enzyme affinity? There is an infinite number of scenarios one could explore. Since many of them would be boring, some sort of guidance is beneficial. Indeed, existing experimental data or hypotheses, not necessarily used in the model design, may help us to decide what may be most important or most interesting to explore. In this case, the explorations may confirm or validate an existing hypothesis, or they can explain the rationale underlying the observed output. In addition, explorations like these might eventually lead to new hypotheses, or predictions of system behavior, which can then be tested mathematically, biologically or both. Caution should be taken when it comes to predictions though, as large alterations to the model may lead into domains in which the approximations are no longer valid and result in undue discrepancies between model and reality.

In the explorations considered so far, we often test the effect of only one or two changes at a time. By using Monte-Carlo simulations, we may test the effect of variability in multiple parameter values or input signals simultaneously (see Sec. 9.4). These large-scale simulations enable us to determine the overall possibilities and limitations of the model, as we may identify overall best-case or worst-case scenarios or the most likely behavior.

Mathematical models are not only good tools for exploring dynamic behavior. They may also give us a greater understanding of how the behavior is dictated by the network architecture and, thus, may indicate why the network is constructed in the fashion we encounter it.
Uncovering such design principles is an important explanatory role of a model. For example, to increase the concentration of a metabolite, one could increase substrate that is used for its production, increase the production rate, or decrease its degradation. By comparatively analyzing these


options within the context of the entire model, it is possible to assign advantages and disadvantages of one strategy or the other, given other contextual information. For instance, it might occur that increases in substrate destabilize the system or that decreases in degradation would lead to longer response times in cases of higher product demand.

Case study. The result of a typical simulation with the GMA-model of carbohydrate metabolism is shown in black solid lines in Fig. 7. The system is initiated at t = 0 when a specific amount of glucose is provided in the medium. This initial burst of glucose immediately results in a burst of G6P and slightly later, in FBP, before their concentrations decrease as the concentration of glucose in the medium also decreases. The levels of 3-PGA and PEP on the other hand, initially decrease before they reach a maximum. The reason for this is that PEP is used in the conversion of external glucose into G6P, and 3-PGA is the precursor for PEP. The behavior of pyruvate is similar to G6P and FBP, while the level of lactate increases before it settles at some maximum level. As an example of an exploratory use of the model, we can study what happens when the initial amount of glucose is doubled (grey solid lines in Fig. 7). Clearly, the qualitative behavior (i.e., the shape of the curves) does not change much, but the maximum levels increase. As mentioned in Sec. 5, one of the main goals for developing this model is to gain a better understanding of how glycolysis and carbohydrate uptake are regulated in L. lactis. One of the questions to be answered with the current model is the regulatory role of G6P. Previous studies of glycolysis in yeast have shown

Fig. 7. Comparison of dynamic behaviors of the model in Fig. 6 for different initial values of external glucose. The grey solid lines show the system for an initial concentration that is two times the initial concentration for the black solid lines.


that G6P inhibits the uptake and utilization of glucose, but this phenomenon has not been experimentally observed in L. lactis. Thus, two alternative models were implemented, but the model fits were essentially indistinguishable (Voit et al., 2006a). In other words, more analysis is needed in order to state whether G6P inhibits the uptake or not. As a second example, the role of the modulation of pyruvate kinase was analyzed. This enzyme (step 6 in Fig. 3) converts PEP into pyruvate and is activated by FBP and inhibited by Pi. A comparative analysis suggested that the effect of FBP is much more important than the effect of Pi. For a more detailed discussion on other regulatory aspects that have been studied, see Voit et al. (2006a; 2006b).

10.2. Drug Targeting

The identification of possible drug targets is a specific example of intense pathway modeling. In this application, the methods are actually quite similar whether the focus is on human metabolism, which is compromised by disease, or on microbial metabolism if the microbe is causing a disease. In both cases a metabolic systems model is developed and diagnosed with respect to stability, robustness, and other criteria, as discussed above.

In the case of human metabolism, one or several variables (and/or fluxes) differ in concentration from healthy individuals. The specific task in this case is to discover “drug targets” in the pathway whose modulation has the best chance of ameliorating the pathological deviations. Such points must have a relatively high level of sensitivity, because otherwise very strong alterations would be needed to show any effect. However, sensitivity is not sufficient. In addition to improving the disease variables, modulations of the drug targets must not lead to undue deviations in the “healthy” variables and fluxes (see Chapter 10 in Voit (2000) for a specific example).

In the case of microbial metabolism, the ideal target is an enzymatic step in a critically necessary pathway that is not found in the human host. As an example one might think of the enormous success of penicillin, which interferes with the ability of the micro-organism to form a cell wall, a process that is not found in humans. Again the target step should be reasonably sensitive so that effects can be achieved with moderate drug doses.

Screening for drug targets may begin with a complete sensitivity analysis, which would eliminate parameters with very low sensitivity or with sensitivity with the wrong sign (which would make the disease worse!).
One could continue such an analysis with a Monte-Carlo simulation study, in which the system parameters are varied within the normal or disease bounds to mimic inter-personal variability, and where the target parameters are artificially elevated or lowered.

10.3. Optimization

“Optimization of pathway models” may refer to a variety of tasks. It could imply improving the fit to observed data by optimizing the parameter values or refer to


a streamlined design. In most cases, though, the optimization task describes the intent to increase the yield of some output metabolite or flux as much as possible, within the physiological constraints that allow the micro-organism to thrive. This task is quite typical in metabolic engineering, where we have learned that microbes are often much more efficient producers of organic compounds than a chemistry lab. For instance, many pharmacological compounds have to have a certain steric conformation to be efficacious and safe. However, the typical construction of a molecule like tryptophan by means of organic chemistry often yields a mixture of L- and D-forms, and it is very costly to separate the two forms, for example, for the production of sleeping pills. Microbes by contrast usually only produce one of the two forms, without contamination by the other. It is therefore in the end cheaper to let microbes produce certain compounds, either because these are needed in high purity or because they are needed in large quantities. An example of the latter is citric acid, which is produced world-wide at a rate of about one million tons per year, most of which is accomplished by a fungus (e.g., Torres and Voit, 2002).

Mathematically, the simplest type of optimization of a pathway in the sense of metabolic engineering requires continued operation of a microbial culture at a steady-state level where the desired product is generated at maximal rate. To assure viability of the microbes, many metabolites have to be maintained within certain concentration ranges and the fluxes in the pathway system must not be too high or too low. These conditions, along with the requirement of ongoing steady-state production, can be translated into constrained optimization tasks. These are usually complex, unless the underlying model is conveniently structured.
Indeed, if one uses S-systems for these purposes, the constrained optimization task becomes very simple, and many methods are available for its solution. The reason for this simplicity is the linearity of the steady-state equations of S-systems that we discussed above (e.g., Voit, 1992; Torres and Voit, 2002). Even for the very similar GMA systems possible solutions are much more complex.
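The reason why the S-system case reduces to a linear problem can be written out explicitly. The following is a sketch in the standard S-system notation, which is assumed here rather than quoted from this section: rate constants are written as alpha_i and beta_i, kinetic orders as g_ij and h_ij, and V_k denotes the flux to be maximized. The index j runs over all variables, including independent ones such as enzyme activities that may be altered.

```latex
% S-system equations for the n dependent variables
\dot{X}_i \;=\; \alpha_i \prod_{j} X_j^{\,g_{ij}} \;-\; \beta_i \prod_{j} X_j^{\,h_{ij}},
\qquad i = 1,\dots,n .

% Steady state (\dot{X}_i = 0) in logarithmic coordinates y_j = \ln X_j: linear equations
\sum_{j} \left( g_{ij} - h_{ij} \right) y_j \;=\; \ln\frac{\beta_i}{\alpha_i},
\qquad i = 1,\dots,n .

% Maximizing a flux V_k = \alpha_k \prod_j X_j^{g_{kj}} is equivalent to maximizing its logarithm
\max_{y}\;\; \ln V_k \;=\; \ln \alpha_k + \sum_{j} g_{kj}\, y_j

% subject to the steady-state equations above and to bounds on metabolites and fluxes
\ln X_j^{\min} \;\le\; y_j \;\le\; \ln X_j^{\max},
\qquad
\ln V_i^{\min} \;\le\; \ln \alpha_i + \sum_{j} g_{ij}\, y_j \;\le\; \ln V_i^{\max} .
```

The objective and all constraints are linear in the y_j, so the task is a linear program. In a GMA system each equation may contain more than two power-law terms, the logarithm of a sum no longer separates, and the steady-state conditions stay nonlinear, which is why the otherwise very similar GMA case is harder.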

11. Online Resources for Modeling

Pathway modeling is not new, and different types of specific tools are freely available. A wealth of information about the structure of many pathways is provided in KEGG (Kanehisa, 2002; Kanehisa et al., 2004) and MetaCyc (Caspi et al., 2006). Kinetic parameters needed to parameterize metabolic models can be found, for instance, in Brenda (Schomburg et al., 2004). Modeling software specifically developed for pathway analyses within BST includes PLAS (Ferreira, 2000), BSTLab (Schwacke et al., 2003) and BSTBox (Goel et al., 2006). More generic pathway analysis software was recently reviewed in Alves et al. (2006). Very useful for the conversion of traditional models in the format of HMMRL and its extensions are general computer algebra packages like MAPLE (for BST models, see also BSTLab) and Mathematica.


12. Further Reading

Voit EO (2000) Computational Analysis of Biochemical Systems. A Practical Guide for Biochemists and Molecular Biologists. Cambridge University Press, Cambridge, UK.
Torres NV, Voit EO (2002) Pathway analysis and optimization in metabolic engineering. Cambridge University Press, Cambridge, UK.
Palsson BØ (2006) Systems Biology. Properties of Reconstructed Networks. Cambridge University Press, Cambridge, UK.

Acknowledgments

This work was supported in part by a Molecular and Cellular Biosciences Grant from NSF (E.O.V., PI), NSF IIS-0407204, NSF DBI-0542119, NSF DBI-0354771 and NSF CCF-0621700, an endowment from the Georgia Research Alliance (E.O.V.) and a “Distinguished Scholar” grant from Georgia Cancer Coalition (Y. X., PI). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsoring institutions.

CHAPTER 14 METAGENOMICS

KAYO ARIMA and JOHN WOOLEY

1. Overview and Scientific Context

1.1. Introduction

This is an extraordinary time for biology; among other transformations, all fields of biology are becoming data rich. Most notably, the volume of genomic data in public repositories for DNA sequence has already passed the milestone of 100 gigabases of nucleotide sequence and continues to double nearly every year. Metagenomics, a rapidly emerging field, will further accelerate this trend through sequencing entire environmental samples en masse using novel sequencing technologies, which, among other information, provide sequence data about the vast numbers of microbes that have never been cultured or grown in the laboratory. What is metagenomics? Why are this field and its associated research technologies rapidly growing? What are the promises and challenges? The answers to these questions are themselves only emerging. To articulate and then address the fundamental questions, a nascent, interdisciplinary and increasingly international community is now engaged in establishing priorities and setting a research agenda in preparation for the increasingly informative genomic studies about microbial communities in situ. Earlier, the microbiology scientific community had recognized the need to characterize populations and communities, not just individual species, and that there are unique attributes to environmental communities (Buckley, 2004). Microbial communities live in all domains of the physical environment and also within eukaryotes, a biologically constrained environment termed a microbiome (which typically also has exchanges with external populations, that is, communities living in a physical environment not constrained by a specific host). This chapter provides a brief summary of the scope and current knowledge of metagenomics; its breadth, growing depth, and expected impact already indicate that entire books will soon be dedicated to the field. 
Along with the broad, exceptionally interdisciplinary aspects of metagenomics, including the value of continuing functional experimentation (without sequencing) on environmental samples, the authors strongly concur with and endorse the public reports (cited in the text) about the unique opportunities being presented to microbiology and
microbial ecology; correspondingly, we attempt to communicate the extraordinary enthusiasm generated as the field is expanding ever more rapidly. While providing wide coverage of metagenomics, we emphasize sub-domains within metagenomics for which there have not yet been extensive reviews and which offer a wide range of research challenges. More generally, to illustrate the importance for researchers in bioinformatics and computational biology, along with all researchers engaged in any subfield of microbiology, to engage in metagenomics, our perspective for this review is to provide the readers with a general introduction and an overview as to these exceptionally important, emerging opportunities for both basic and applied biology.

1.2. Definitions

Metagenomics, as a rapidly expanding, relatively new, interdisciplinary research field, can be defined as the DNA sequencing, together with either a sequence-based or a functional analysis, of a sample collected directly from the natural environment without prior culturing or isolation of individual microbes. The research and the field have also been called community genomics, ecological genomics and environmental population genomics; at this point, metagenomics is the most commonly used term to describe such studies and the resultant data. Microbes exhibit high genomic diversity, and the vast majority of microbes have never been cultured (described below) and, indeed, have not been directly observed. Metagenomics also describes a suite of new methods and technology that address these problems for microbiology and reflect the recent recognition that obtaining a pure culture of a microbe is not an inevitable or essential step. Instead, important information that would otherwise not be known can be derived by studying environmental samples directly. This recognition constitutes a scientific revolution for microbiology and overturns a core philosophy in which only after obtaining a pure culture could effective analysis commence. 
A report from the American Academy of Microbiology (2002) provides a more detailed definition of metagenomics; namely, research that “entails large-scale sequencing of pooled, community genomic material, with either random or targeted approaches, assembly of sequences into unique genomes or genome clusters, determination of variation in community gene and genome content or expression over space and time, and inference of global community activities, function, differentiation, and evolution from community genome data.” In 2007, the National Research Council of the National Academy of Sciences issued a report entitled “The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet” (Handelsman et al., 2007); this report, which has had an exceptional impact within the community and on federal agencies, characterizes metagenomics similarly, noting that “meta” means transcendent in Greek and that in terms of research methods, metagenomics “seeks to understand biology at the aggregate level, transcending the individual organism to focus on the genes in the community
and how genes might influence each other’s activities in serving collective functions.” The report notes as well that computational advances will be needed to provide insight into “the genetic composition and activities of communities so complex that they can only be sampled, never fully characterized,” and furthermore states that metagenomics is “a radically new way of doing microbiology.” The first paper that used the term “metagenome” was published in 1998 (Handelsman et al., 1998; Rondon et al., 2000), and the explosive growth in this new field has largely occurred in this century. However, the recognition that organisms can be studied and identified without cultivation by retrieving and sequencing them directly from nature is much older. Indeed, metagenomic approaches to capture microbial diversity in natural habitats have been employed by many researchers for many years. The terms used to describe the methods include environmental DNA libraries, zoo libraries (Hughes et al., 1997), soil DNA libraries (MacNeil et al., 2001), recombinant environmental libraries (Courtois et al., 2003a), whole genome treasures (Oh et al., 2003), community (environmental) genomes (Tyson et al., 2004), whole genome shotgun sequencing (Venter et al., 2004), random community genomics (Edwards and Rohwer, 2005), and others (Schloss and Handelsman, 2005b).

1.3. The Promises

The field of metagenomics allows scientists to sample microbial genomes directly, that is, without growing them in pure (clonal) cultures, and as such, provides the means to answer questions that were difficult or impossible to address with traditional culture-based genomics. For a given time and environmental location, these questions include how many and what types of species are present (ecological and phylogenetic questions), which and how many genes are present and what the genes’ relationships are (genetic and genomic questions), what the organisms are doing (metabolic and functional questions), what resources are used (biogeochemical questions), and what changes are taking place over time or due to environmental influences (the dynamics of community composition). There is, thus, the potential to enable discovery of novel organisms, novel evolutionary branches, novel host-symbiont relationships, and novel symbiont-symbiont relationships. Metagenomics also offers the potential for discovering new proteins with known functions, new proteins with novel functions, known proteins with unique functions, and novel natural products. This level of diverse novelty, already seen in extant metagenomics sampling, suggests great potential for drug discovery, for advancing green industrial processes, and for expanding the repertoire of biology in supplying energy sources (bioenergy). The major benefits of metagenomics, toward our achieving a better understanding of biology in general and for applying biology to societal problems, lie in comparative studies, such as those comparing one environmental community to another (e.g., in terms of phylogenetic diversity, protein diversity, strain diversity,
or metabolic activity), comparing different host environments (e.g., how host diversity affects community diversity), and comparing (or tracking) populations and environmental changes over time. For species that have not been grown and might not grow in pure culture, these approaches will be effective to find the key species for natural habitats (defined, for example, with respect to nutrient cycles and ecosystem stability), including those of humans in healthy and diseased states. Overall, among the attractions of metagenomics for advancing our knowledge of basic microbiology and the role of microbes in the environment, excitement arises from the potential to contribute to many other basic and applied fields, including earth science, life science, biomedical science, bioenergy, bioremediation, biotechnology, bioagriculture, biodefense and forensics.

1.4. The Challenges

Establishing and growing bacteria in pure culture has traditionally and exclusively been the first step in investigating bacterial processes, since investigators can easily control the culture conditions, manipulate those conditions, and obtain time-series data. By contrast, a typical environment contains multiple organisms, and metagenomic samples have been collected from numerous natural and human-influenced sites comprising multi-species environments, such as coral reefs, sludge farms, lakes, EPA Superfund contaminated sites, animal guts, whale falls, and the air. The sequence data obtained will inevitably reflect the DNA sequences of the major species, but the minor members of communities are less likely to be identified through sequencing directly from the environment. Current sequencing methods are too expensive or too time-consuming for time-course and/or replicate experiments on the scale required for rigorous characterization, although continued technological advances will address this problem. The existing reference genomes do not cover many habitats, so that it is difficult to understand the full metabolic capacity of microbes, their interactions and metabolic pathways in specific communities. Further refinements and other advances in large-scale culturing of environmental organisms (generally without clonal purification) under a wide range of conditions will be required in order to do adequate population sampling (to identify minor species and establish quantitative abundance) and to enable a type of reverse genetics, namely, to ascertain, and potentially to culture, the organism from which a given gene of interest came. 
Novel efforts to culture additional microbes, even in mixed cultures and with poorly defined media such as sterilized sea water, remain important for providing reference data sets and for allowing additional exploration and independent validation of metagenomics data (reviewed in Giovannoni et al., 2007; and Giovannoni and Stingl, 2007). Similarly, the high complexity of metagenomics sequence data has created a range of new computational challenges, including data management and standardization, genome assembly, and the development of other bioinformatics tools.
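
The limited visibility of minor community members can be illustrated with a simple sampling calculation. The sketch below assumes, as an idealization, that reads are drawn independently and that a species contributes a fixed fraction of the DNA in the sample. The function names and the example fraction are hypothetical and are not taken from any study cited here.

```python
import math


def detection_probability(dna_fraction, n_reads):
    """Probability that at least one of n_reads comes from the species."""
    return 1.0 - (1.0 - dna_fraction) ** n_reads


def reads_needed(dna_fraction, target_probability):
    """Smallest number of reads reaching the target detection probability."""
    if not 0.0 < dna_fraction < 1.0:
        raise ValueError("dna_fraction must lie strictly between 0 and 1")
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must lie strictly between 0 and 1")
    n = math.ceil(math.log(1.0 - target_probability)
                  / math.log(1.0 - dna_fraction))
    n = max(n, 1)
    # Guard against floating-point rounding at the boundary.
    while detection_probability(dna_fraction, n) < target_probability:
        n += 1
    return n


# Hypothetical example: a species contributing 0.1% of the community DNA.
rare_fraction = 0.001
n_required = reads_needed(rare_fraction, 0.95)
print(n_required, detection_probability(rare_fraction, n_required))
```

Observing a single read is a much weaker requirement than assembling a genome, so numbers obtained this way are best read as lower bounds on the sequencing effort needed for a minor species.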

2. Goals of Metagenomics

As this new field emerges, the breadth and specific goals of metagenomics are still being defined. Although there are many challenges to overcome, one of the major goals is assembling microbial genomes directly from their natural environmental habitat. Metagenomics has the potential to access 100% of the genetic resources in our ecosystem and thus to cover all phyla of bacteria and archaea. Where reference genomes can be established, the increased assembly of complete environmental genomes and the better taxonomic coverage and assessment of population distribution obtained through the use of reference genomes will increase the power of metagenomics analyses in many ways, such as for decoding and characterizing the unknown sequences (the vast majority of sequences), for understanding the function of microbial communities in ecosystems, for discovering exceptional variation and diversity, and for seeking insight into the evolution and origin of life.

3. Motivation

3.1. The Predominant Form of Life and Its Population and Community Characteristics

Not only do microbes represent the vast majority of life in our world, their activities as populations are essential for the chemical cycles that recycle key elements of life, such as carbon, nitrogen, oxygen and sulfur. All living creatures on the earth rely heavily on microbial communities, which make nutrients, metals and vitamins available to their hosts, and break down pollutants (e.g. chemical and oil spills) in the environment. We humans are also closely associated with various microbial communities, which degrade toxins and harvest nutrients from foods, excrete a variety of their metabolic products, and interact with our immune system. Indeed, humans contain on the order of ten times more microbial cells than human cells and may have on the order of 100 to 200 times more microbial genes than human genes (Handelsman et al., 2007). The implications for homeostasis, wellness and disease in humans arising from the ecology of microbes living on or within humans led the community as early as 2005 to urge a major, organized scientific project, the human gut microbiome initiative (HGMI), to characterize the populations (microbiomes) and their relationship to human health status (Gordon et al., 2005); in a modified and broader form covering populations beyond the gut, this goal is now being implemented by the NIH (Handelsman et al., 2007). Since the first microscopic observations of microbes in the late 17th century, scientists have conducted extensive visualization studies on microbes, which had previously been invisible. What is considered traditional microbiology is largely “laboratory knowledge” obtained from individual species in pure culture in artificial (well defined and constructed) growth media, conditions that are very different from any natural environment. In addition, in the pure-culture paradigm, the presence of more than
one species in the same culture medium means contamination, so that a species that requires metabolic products from other species would be impossible to detect, study or catalog. Microbes live in communities in which individual members interact and support each other. Indeed, more than 99% of microorganisms observable in nature typically cannot be isolated or are extremely difficult to grow in laboratory culture (Handelsman et al., 2007). Specifically, the percentage of cultivability of microorganisms in diverse habitats is estimated as 0.25% in sediment, 0.3% in soil, 0.01–0.1% in seawater, 0.25% in freshwater, and 1–1.5% in activated sludge (Amann et al., 1995). Until the advent of the new sequencing technologies that enabled metagenomics to emerge, however, there were no efficient tools for investigating microbial communities and their interactions in the conditions under which they actually live, i.e. their natural habitats.

3.2. The Complexity of Genomes and Biology

Efforts to understand populations in their ecological setting and characterize their genomes face a range of difficulties not previously encountered. The approach and technology of genomics have made great contributions to cataloging and understanding culturable microorganisms and higher organisms. The genome sequences of 399 bacteria, 29 archaea and almost 30 eukaryotic microbes are publicly available at the time of this writing. However, isolating the target organism in pure culture is the essential first step in conventional genomics for sequencing a microbial genome. In other words, researchers have had to assess properties of microbial communities without genomic information from key organisms, since these could not be grown in laboratory culture; thus, our extant genomics research has had limitations in decoding the natural microbial world. Our ecosystem is highly complex; understanding the interplay of organisms in the environment will remain a major challenge for biology for this century. However, even at the level of individual organisms, considerable complexity exists. The genetic diversity even within an individual species is far greater than had been recognized until recently. Recombination constantly reshuffles the alleles of genes in the population to generate new adaptive combinations. Bacteria, viruses and archaea transfer DNA to other microbes, including ones that are distantly related or even in different phyla, in part due to proximity and to the ease of uptake of exogenous DNA. It appears that gene transfer continuously occurs between microorganisms, members of a given environmental community, the virosphere (the comprehensive collection of all viruses on earth) and other organisms throughout the earth’s environment. Consequently, microbes at a population level have a unique capacity to reconstruct their genomes by incorporating and/or discarding genes (through transformation, i.e. 
uptake and integration, and/or selection) to adapt to their environment. The ease and extent of lateral (or horizontal) gene transfer (LGT), as described above (and also see Chapter 6), complicates the phylogenetic analysis of microbes.

Furthermore, in microbial population settings, LGT might challenge the concept of individual species or of organisms with precisely defined genetics (Fraser-Liggett, 2005; Xu, 2006). Research on microbial strains has already demonstrated that the genome of a microbial species, whether from a cloned isolate in lab culture or from a patient, is not defined in a unique way; that is, not all genes of a given species are present in all its strains. For example, Perna and colleagues discovered that Escherichia coli O157:H7 contained more than 1300 strain-specific genes as compared with E. coli K-12 (Perna et al., 2001). This striking example showed that two members of the same species could differ in gene content by almost 30%. Numerous observations that clonal isolates from an individual microbial species share most but not all of the genes found in that species led to the concept of a microbial pan-genome, namely, that a species can be described by a core set of genes shared by all isolates plus a dispensable set of genes present in only some of them (Tettelin et al., 2005; see also overview in Muzzi et al., 2007) (and also see Chapter 4). An environmental genome, as a consequence, is not just a collection of genes from multiple organisms, but represents the metabolic activity of a community, the chemical processes going on in the ecosystem, and the metabolic pathways in which molecules and energy are exchanged between the microbes and the rest of the ecosystem. Metagenomic analysis of the data from a given environment should provide insight into the role played by the many different microbial communities in creating an ecosystem.
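
The pan-genome idea discussed in this section reduces to simple set operations once each isolate is represented by the set of gene families it carries. The following minimal sketch shows that bookkeeping; the function name, the isolate labels and the gene-family identifiers are hypothetical toy values, not data from the studies cited, and real analyses additionally require a prior step that clusters genes into families.

```python
def summarize_pan_genome(strains):
    """Split gene families into core, accessory and strain-specific sets.

    `strains` maps an isolate name to the set of gene-family identifiers
    found in that isolate.
    """
    if not strains:
        raise ValueError("at least one strain is required")
    gene_sets = [set(genes) for genes in strains.values()]
    pan = set().union(*gene_sets)                       # seen in any isolate
    core = gene_sets[0].intersection(*gene_sets[1:])    # seen in every isolate
    accessory = pan - core                              # seen in only some
    strain_specific = {}
    for name, genes in strains.items():
        others = [set(g) for other, g in strains.items() if other != name]
        strain_specific[name] = set(genes) - set().union(*others)
    return {
        "pan": pan,
        "core": core,
        "accessory": accessory,
        "strain_specific": strain_specific,
    }


# Toy input: three hypothetical isolates and six hypothetical gene families.
TOY_STRAINS = {
    "isolate_A": {"fam1", "fam2", "fam3", "fam4"},
    "isolate_B": {"fam1", "fam2", "fam3", "fam5"},
    "isolate_C": {"fam1", "fam2", "fam3", "fam5", "fam6"},
}

summary = summarize_pan_genome(TOY_STRAINS)
print("core:", sorted(summary["core"]))
print("accessory:", sorted(summary["accessory"]))
for isolate, genes in summary["strain_specific"].items():
    print(isolate, "specific:", sorted(genes))
```

In this representation, the strain-specific genes reported for a pair of isolates correspond to the set difference between the two gene-family sets.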

4. Exploring Novel Cultivation and Growth Conditions to Extend Metagenomics

Without laboratory culturing, the ecological observations on microbial populations and the behavior of communities of microbes can in principle be extended through bulk growth methods under different conditions that provide a broader dynamic range for sampling, that is, that create conditions under which the population of minor members of a sampled community increases due to changes in pH, salt, temperature or other environmental and chemical conditions. This could be the case for either functional or sequence analysis. Cultivation remains valuable, given the ease of working with cultures and their potential for providing easily interpreted data to test working hypotheses about communities and their resident microbes. Even the establishment of reproducible cocultures or low-complexity cultures would facilitate sequence and functional analyses and help with mining metagenomics data. Previous efforts have succeeded in isolating species that had long resisted cultivation, albeit sometimes with poorly defined media such as bulk sea water (Giovannoni and Stingl, 2007), yet even with remarkable advances in technology and considerable commitment of resources, we will never cultivate the extraordinary number of microbial species in the environment. The many parameters ascertained about the environment and nutritional requirements for a given community can provide
input for efforts at culturing their microbes. Given the value of cultures for testing hypotheses and extending molecular and ecological observations from metagenomics experiments, marine microbiology researchers have emphasized that increased attention should be paid to novel methods, including single-cell manipulation devices and high-throughput screening (Giovannoni et al., 2007).

5. Landmark Advances Toward Metagenomics

The concept of metagenomics derives from the remarkable insight of Norman Pace in 1985 that organisms can be phylogenetically identified directly in their environmental niche (Lane et al., 1985), an advance that created a new subfield of microbial ecology (reviewed in Pace et al., 1986) and that has been termed a paradigm shift in microbiology (Handelsman, 2004). To characterize phylogenetic diversity, Pace and colleagues cloned and sequenced 16S ribosomal RNA (rRNA) genes directly from the natural environment without laboratory culturing (Stahl et al., 1985). Pace’s cultivation-independent molecular phylogenetic approach extended the molecular evolution research described in a series of landmark papers by Woese and colleagues in the late 1970s and early 1980s, which showed convincingly, by analyzing genes isolated from cultivated microorganisms, that 16S rRNA genes are sufficient to determine the phylogenetic diversity of bacteria (summarized in Woese, 1987). In brief, the 16S rRNA gene is an ideal phylogenetic marker because it is universally conserved, is a mosaic of highly conserved and highly variable regions, is not laterally transferred (with very few exceptions), and has a known estimated divergence rate. Another breakthrough occurred in 1990 when Giovannoni and colleagues first amplified 16S rRNA genes via PCR with suitable primers to analyze clone libraries (Giovannoni et al., 1990); the first application characterized the phylogenies of natural populations of Sargasso Sea picoplankton and also served to highlight the importance and abundance of microbes in picoplankton. 
Subsequently, additional PCR-based environmental 16S rRNA gene studies emerged in the early 1990s (DeLong, 1992; Fuhrman et al., 1992; Schmidt et al., 1991), but culture-independent sequences only became of notable significance in 1997, when GenBank entries of 16S rRNA gene sequences derived from environmental clones began to exceed those from cultivated Bacteria and Archaea (Rappe and Giovannoni, 2003). Since Pace’s cultivation-independent molecular phylogenetic approach, researchers in microbial ecology have increasingly and widely employed the growing collection of molecular techniques able to capture the novel features of environmental microbiology, including microbial diversity, phenotypic and genomic characteristics, biological and biogeochemical function(s), and species interactions in natural habitats (see extensive reviews on the history, experimental observations and implications in Handelsman, 2004; Riesenfeld et al., 2004b; Azam and Worden, 2004; Schloss and Handelsman, 2005b; Tringe and Rubin, 2005; Allen and Banfield, 2005; DeLong and Karl, 2005; Schmidt, 2006; Whitaker and
Banfield, 2006; Falkowski and Oliver, 2007; Jorgensen and Boetius, 2007; Karl, 2007; Moran and Miller, 2007; Moran, 2007; Hallam et al., 2007; Moran and Armbrust, 2007). However, only in the past few years has whole-genome shotgun sequencing (comprehensive sequencing of small inserts omitting creation of libraries of long inserts: Fleischmann et al., 1995) been extended to environmental samples, termed environmental shotgun sequencing or ESS (see review and commentary in Eisen, 2007, and Edwards and Dinsdale, 2007); ESS (used alone followed by extensive computational assembly and analysis, or in combination with other sequencing, cloning and/or functional approaches) represents another paradigm shift for microbial ecology; its early applications are described in more detail below. Environmental sequencing presents a number of challenges, such as managing a large dynamic range in populations, and notably, sequence length. Recovery of DNA sequences longer than a few thousand base pairs (bp) directly from environmental samples has been very difficult. However, several recent advances in molecular biological techniques, such as those related to constructing libraries in bacterial artificial chromosomes (BACs), have provided better vectors for molecular cloning. Similarly, sequencing the longer inserts possible in fosmids has enabled the establishment of entire biochemical pathways (DeLong et al., 2006). Today, emerging technologies such as multiple displacement amplification (MDA), the 454 pyrosequencer and the Solexa 1G Genome Analyzer, while still being made robust and gaining acceptance as validated methods, have accelerated the pace of sequence acquisition and will stretch the capacity of community databases (GenBank), as well as of specialized sequence resources (Edwards and Rohwer, 2005); this rapid explosion of DNA sequence information, combined with extensive environmental as well as biological annotation, also suggests directions for newly formed community cyber resources (Seshadri et al., 2007).

6. Recent Contributions in High-Impact Research Domains

6.1. High Population Complexity: Soil Metagenomics

Soil is obviously one of the most important components on the earth. It is the source of nutrients, pharmaceuticals, gases and energy, which are essential for all living organisms. What is not as widely recognized is that microorganisms are the most abundant organisms in soil and can form the largest component of the soil biomass. One gram of forest soil contains ∼ 4 × 10⁷ prokaryotic cells, whereas one gram of cultivated (agricultural) soils or of typical grasslands contains ∼ 2 × 10⁹ cells (Daniel, 2005). It seems likely that today we can only barely imagine the complexity of signaling and other biochemical and population interactions going on in soil, including not just microbe-microbe but microbe-plant and presumably other microbe-eukaryote interactions. Obtaining more extensive experimental evidence for the interactions coupled with new bioinformatics tools and the construction of theoretical models for the processes should provide a basis for soil microbiology
and metagenomics to contribute to a number of environmental challenges, including bioremediation, carbon sequestration and bioenergy production. Microbial communities play a critical, essential role in degrading natural and human-made hazardous substances, such as waste and oil spillage, and in transforming many of the hazards (but obviously not metals) into organic matter. In soil, as in other environments, the biodiversity of the microbial life is extremely complex and the population structure is unknown. Traditional culture-based approaches have failed to answer how microbial communities interact with environmental factors, including the availabilities of water, oxygen and other nutrients, pH and temperature. Therefore, culture-independent methods based on the isolation and analysis of DNA directly from soil are being developed as an alternative tool for exploring soil microbial diversity and providing access to the genetic information of uncultured soil microorganisms. In one of the earliest studies, Rondon et al. constructed a BAC library of genomic DNA isolated directly from soil at the West Madison Agricultural Research Station in Madison, Wisconsin, and discovered that some clones expressed phenotypes including antibacterial, lipase, amylase, nuclease, and hemolytic activities (Rondon et al., 2000). Riesenfeld et al. constructed four libraries containing 4.1 gigabases of cloned soil DNA, and identified nine clones expressing resistance to aminoglycoside antibiotics, and one expressing tetracycline resistance, from these and two previously reported libraries (Riesenfeld et al., 2004a). The same approach has been used to clone genes from soil microbial communities that code for certain functions, such as antibiotics and antibiotic-resistance enzymes (Courtois et al., 2003b; Gillespie et al., 2002; Riesenfeld et al., 2004a; Wang et al., 2000). The soil-based libraries currently available have been summarized elsewhere (Daniel, 2005). 
Microbial resistance to antibiotics, and notably, the recently increased microbial resistance to many antibiotics, is a major issue for quality health care. Hospital ecology is a rich source for many microbial populations, microbial mutations and genomic variation and concomitant establishment of resistant pathogenic bacterial strains by plasmid transfer, which has inevitably led to increased risks associated with hospitalization, and given the mobility of former patients, to increased risks in travel and other social settings. Soil-dwelling actinomycetes are antibiotic producers that have resistance elements for self-protection; orthologues to these self-protecting resistance elements in actinomycetes have been found on plasmids. Since most antibiotics now used in clinical settings were discovered in actinomycetes living in soil, soil is recognized as an important and possibly as a primary reservoir of antibiotic-resistance genes (D’Costa et al., 2006). For example, vancomycin resistance can be found in numerous soil bacteria, and the alternative peptidoglycan biosynthetic machinery that confers resistance to this antibiotic has been hypothesized to originate in soil-dwelling antibiotic producers (Marshall et al., 1998). The general presence of antibiotics in the soil also selects for the presence
of specific resistance elements even in bacteria that are not producers of the relevant antibiotic. Inspired by this potential and its implications for health care, D’Costa and colleagues searched soil bacteria for antibiotic-resistance genes, termed the “resistome”, through the use of function-driven metagenomics (D’Costa et al., 2006). Morphologically diverse bacteria were isolated by their selectable phenotype, that is, under conditions where only the clones of interest grew in the presence of a given antibiotic. A plasmid library of 480 strains was constructed and each clone was screened for expression of resistance activity against 21 antibiotics. This study revealed that the resistance genes encoding a group of acetyltransferases are more closely related to each other than to any previously discovered genes in this family, and that the genes encoding resistance to β-lactam antibiotics were phylogenetically distant from the known genes in this family. The difference between this project and the others described in this section is that the genes of interest were recognized by screening clonal antibiotic resistance rather than by an analysis of the genes’ sequences. In addition, the study is a successful example of function-driven metagenomics overcoming the challenges in expressing gene functions via currently available expression vectors. Of particular relevance for health care and applied microbiology, an immediate application of the study of the resistome would be to watch for changes in the environmental reservoir of resistance determinants; this would serve as “an early warning system” for future resistance mobilization and the emergence of resistance to additional antibiotics or of new resistance mechanisms (D’Costa et al., 2006).

6.2. Low Population Complexity: The Acid Mine-Drainage (AMD) Project

Acidic metal-rich solutions, referred to as acid mine drainage (AMD), that form as a result of mining activities are a major environmental problem at sites around the world. The extremely acidic environment creates a microbial assemblage of very low complexity, one strikingly different from most microbial communities in nature. In order to study the linkages between the geochemistry of AMD systems and their microbial communities, Banfield and her colleagues began, in 2002, sampling the genomes of an entire microbial community without cultivation. A biofilm was collected from six locations (pH ∼0.6–1.2, 45–50 °C) in Iron Mountain, California. Since the AMD biofilms represent a very simple community, which consists of 5 dominant species (three bacterial and two archaeal species), only about 100 Mbp of shotgun sequence already covered more than 95% of the dominant species. The extraordinary simplicity of AMD microbial communities made it possible to reconstruct two near-complete genomes and to obtain partial recovery of three others (Tyson et al., 2004). The low species richness made it possible to study inter-population diversity and the crucial role of lateral gene transfer (LGT) in altering the adaptive fitness of


K. Arima & J. Wooley

population members over short time scales. Similarly, the genome of the acidophilic archaeon Ferroplasma acidarmanus, fer1 (env), has been reconstructed from 8 megabases (Mb) of environmental sequence data by comparison with the 1.94 Mb genome sequence of the previously isolated fer1 (Allen et al., 2007). Comparative population genomic studies of the same species from the same site over 4 years revealed that the population diverged rapidly through frequent genetic recombination and that this most likely enhances the genetic potential of the population relative to that of the individuals within it (Allen et al., 2007; Edwards et al., 2000).
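The coverage figure quoted earlier in this section (about 100 Mbp of shotgun sequence covering most of each dominant genome) can be made plausible with a Lander-Waterman style estimate. The sketch below is an illustration only, not the calculation used in the AMD study: the genome size and the share of community DNA assigned to one species are hypothetical values, and reads are assumed to be sampled uniformly at random.

```python
import math

def expected_coverage_fraction(total_bp, genome_bp, abundance_fraction):
    """Lander-Waterman style estimate of the fraction of one genome
    touched by at least one read, assuming uniform random sampling."""
    # Mean fold coverage for this genome: bases drawn from it / genome length.
    depth = total_bp * abundance_fraction / genome_bp
    # Probability that a given base is covered at least once.
    return 1.0 - math.exp(-depth)

# Hypothetical illustration (not values from the study): 100 Mbp of reads,
# a 2 Mbp genome whose DNA makes up 10% of the community sample.
fraction = expected_coverage_fraction(100e6, 2e6, 0.10)
print(f"expected fraction covered: {fraction:.3f}")
```

Under these assumptions the mean depth is five-fold, which already leaves less than 1% of the genome unsampled; species that are ten times rarer would be covered far less completely.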

6.3. The Sargasso Sea Survey: An Assay of Marine Metagenomics

Venter and colleagues showed the power and potential of shotgun sequencing to characterize the diversity and evolution of microbial populations in situ through analyses of shotgun sequencing performed on random microbial community samples taken from the Sargasso Sea (Venter et al., 2004). Populations from surface seawater samples were collected via tangential flow and serial impact filters to size-fractionate the samples, and the DNA for shotgun sequencing was then obtained from the final, or microbial, filter. At the large scale of this study and of the GOS study described below, random sampling of sequences of the microbial communities reduces the biases caused by culturing and selection, but without further high-throughput culturing, it is largely the sequences from the more prevalent species that will be observed. Along with opening a new window on environmental microbiology, this work itself contributed 1,045,970 entries to the public sequence database, GenBank, including almost two million sequencing reads and a total of 1.6 Gbp from uncultured organisms in a composite 1.5 m^3 of the surface-water community. The 1,001,987 proteins predicted from the sequence data set almost doubled the number of proteins in GenBank. Despite the controversies generated by the nonmarine origin of some sequence reads, the work provided new insight into ocean microbiology, such as clues that marine microbes, by way of photoreception and photosynthesis, may contribute more than previously recognized to ocean productivity. The unexpected number of new genes revealed surprising diversity, as much as had been found in all previous sequencing studies. The data are also being used to explore the genomics of picoeukaryotes, whose small size led them to be captured with microbes; thus, their sequences are among the Sargasso Sea data set (Piganeau and Moreau, 2007).
The data are likely to provide a basis for continuing computational analysis; e.g., statistical and other bioinformatic analyses of the Sargasso Sea data are currently being employed to extend our understanding of ocean microbiology (Rodriguez-Brito et al., 2006). Indeed, the large size and the metadata attributes of the sampling data served to introduce the challenges of large-scale database management (Tress et al., 2006) and data mining to marine biology and to the study of microbial communities.


6.4. Marine Metagenomic Horizontal Sampling: A Global Ocean Sampling (GOS) Project

The Global Ocean Sampling expedition is an ongoing environmental metagenomics project aimed at characterizing environmental and biological attributes of marine microbes by sequencing the total environmental DNA samples per se, i.e., without any purification and culturing (Rusch et al., 2007). As in the Sargasso Sea study, samples in this project were collected mainly from surface water, at approximately 320-km intervals, and microbes were collected by serial filtration. A total of 41 different samples (including four Sargasso Sea samples) were taken from a wide variety of aquatic habitats (estuaries, lakes and open ocean) over 8,000 km, from the North Atlantic through the Panama Canal and ending in the South Pacific. The expedition has generated much enthusiasm for both the technology and the research field of metagenomics. At the same time, such an effort strictly samples variation among environments; in other words, the data from the marginal amount of ocean sampled provide guidance and inspiration, and yield insight for future experimental and computational research, but do not represent a survey of ocean microbiology. Based on ribosomal RNA analysis (Sogin et al., 2006), the sampling might well have underestimated the genetic diversity of ocean microbes (Nicholls, 2007). Sequence diversity in marine viruses is also very extensive; 60% to 80% of the open reading frames of cultured marine virus genomes are not represented in GenBank, some virus genomes have no isology to any GenBank sequence, and metagenomic viral analyses provide comparable data on diversity (for a comprehensive discussion of the extensive diversity of the ocean’s viruses, see Breitbart, 2007). Between 44,000 and 420,000 clones per sample were constructed and end-sequenced to generate mated sequencing reads.
The dataset of the initial release, the first round of fully sequenced samples, includes 6.25 Gbp of sequence data (about twice the size of the human genome), totaling ∼5.9 Gbp of nonredundant sequence, from these 41 different locations. The shotgun sequencing generated 7.7 million sequencing reads (6.3 billion bp) of an incredible diversity and heterogeneity. An example of the extent of diversity arises from an examination of the rRNA: the annotation of the GOS dataset revealed 4,125 16S ribosomal small subunit gene fragments, which cluster into 811 distinct ribotypes. The sequence data indicate that related microbial species are present at very disparate sites, despite the large physical distances, but certain trends in population distribution were observed; for example, Rusch et al. (2007) found microbes related to SAR11 and SAR86 in virtually all samples, although there were some differences between near polar and equatorial samples, such that temperate and tropical microbial populations differ in composition, consistent with the population distributions of larger non-microbial plankton. Similar microbial population distributions have been found with more focused sampling of nine regions of the ocean (Pommier et al., 2007).
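Grouping 16S gene fragments into ribotypes amounts to clustering sequences at a chosen identity cutoff. The following is a minimal sketch of one common approach, greedy clustering against cluster representatives; it is not the procedure reported for the GOS data. The function names are hypothetical, the sequences are assumed to be pre-aligned to equal length, and the 97% default cutoff is an assumed convention rather than a value taken from the text.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, cutoff=0.97):
    """Place each sequence in the first cluster whose representative
    (its first member) meets the identity cutoff; otherwise start a new cluster."""
    representatives = []
    clusters = []
    for seq in seqs:
        for i, rep in enumerate(representatives):
            if identity(seq, rep) >= cutoff:
                clusters[i].append(seq)
                break
        else:
            # No existing representative is similar enough: new ribotype.
            representatives.append(seq)
            clusters.append([seq])
    return clusters

# Toy example with four 10-base fragments and a deliberately loose 90% cutoff.
toy = ["ACGTACGTAC", "ACGTACGTAT", "TTTTACGTAC", "TTTTACGTAC"]
print(len(greedy_cluster(toy, cutoff=0.9)))
```

Greedy clustering is order dependent, so real pipelines usually sort the input (for example by length or abundance) before clustering; that step is omitted here for brevity.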


Conventional computational assembly of extended contiguous regions of DNA sequence from the experimental fragments failed, even though extensive sequence coverage exists within the GOS data. To reconstitute extensive portions of either cultured or uncultured microbial genomes, new bioinformatics approaches were developed; these include “fragment recruitment” and “extreme assembly” (Rusch et al., 2007; see also Eisen, 2007; and a less technical overview in Gross, 2007). The diversity of microbial species observed was established in terms of the sampling location for each sample and the known environmental pressures at that location. The DNA sequence data from the GOS samples also offer insight into the total number of distinct protein sequences and their functional and evolutionary relatives. Yooseph and colleagues (Yooseph et al., 2007) clustered the GOS sequences via amino acid sequence isology using known proteins as reference standards; pairwise comparison of all-versus-all non-redundant sequences was followed by profiling to unite and expand the clusters, and to enable the removal of probable spurious ORFs. Their analysis of the GOS sequence data identified nearly 6 million proteins, nearly twice the number predicted from the data in GenBank, and also pointed to nearly 2,000 clusters of unique protein families in the GOS samples as well as the presence of virtually all known microbial and eukaryotic protein families (Yooseph et al., 2007). Many novel biochemical and biophysical features of the GOS proteins were found. For example, differences in 70% of the protein domains were seen between cultured, terrestrial microbes and marine microbes. Protein families previously thought to exist only in eukaryotes were observed, and about 6,000 GOS members were found to pair with ORFans, proteins for which no sequence similarity had been found among previously known proteins.
The large collection of data from the GOS, in particular, serves to extend the functional diversity of known protein families, and provides the basis for further functional analysis and clues as to the evolution of a given family. A higher proportion of phage sequences than expected was found in the GOS-only clusters, leading the team to suggest very high sequence diversity for marine viruses (Yooseph et al., 2007). As in earlier studies with marine metagenomics libraries (DeLong et al., 2006), significant numbers of viral-like proteins were observed in the GOS data: virus genetic diversity would provide additional options and flexibility for microbial environmental responses. At the same time, the presence of many viral sequences in the DNA from the bacterial filter might be one contributing factor to the difficulty in assembly. As pointed out by Edwards and Dinsdale, the presence of numerous ubiquitous microbial species suggests that successful assembly would have been expected (Edwards and Dinsdale, 2007). In contrast, the ten- to fifteen-fold abundance of phage compared to bacteria has two consequences. First, predation and burst size might cause population collapse of successive microbial species as they achieve significant density. Second, extensive genetic variation among ocean microbes is enabled and perhaps driven by phage transduction. Very extensive transduction/gene transfer occurs in the marine world,
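The first stage of the protein clustering described above, turning all-versus-all pairwise comparisons into groups, can be illustrated with single-linkage clustering over a list of pairwise hits. This is a simplified sketch and not the published pipeline, which also used profile-based expansion and filtering of spurious ORFs; the identifiers, the score scale and the `min_score` threshold are hypothetical.

```python
def cluster_from_hits(ids, hits, min_score):
    """Single-linkage clustering: two sequences share a cluster if they are
    connected by a chain of pairwise hits scoring at least min_score."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in hits:
        if score >= min_score:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    # Largest clusters first; ties broken by member names for stable output.
    return sorted(groups.values(), key=lambda g: (-len(g), g))

# Hypothetical hits: (sequence A, sequence B, similarity score between 0 and 1).
ids = ["p1", "p2", "p3", "p4", "p5"]
hits = [("p1", "p2", 0.9), ("p2", "p3", 0.8), ("p4", "p5", 0.3)]
print(cluster_from_hits(ids, hits, min_score=0.5))
```

Sequences left as singletons by such a procedure are the natural candidates for the ORFan and spurious-ORF checks mentioned in the text.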


with perhaps as many as 100 transduction events per day per liter of environmental seawater (Jiang and Paul, 1998). The biology of phage in the ocean, phage predation, and the very high prevalence (40%) of lysogenized microbes (Virios) in one already characterized microbial population (Jiang and Paul, 1998), and perhaps similar complexities in other highly heterogeneous metagenomics environments, are likely to complicate conventional and/or accurate assembly of complete genomes from environmental samples (Breitbart, 2007; Edwards, 2007). The GOS data have also altered our view of the kinases in microbes. Most kinases in microbes were believed to be histidine kinases, structurally quite different from the eukaryotic protein kinases (ePKs), whereas very few had been found to be ePK-like kinases (ELKs), which share the same structural fold and a similar enzymatic mechanism with ePKs. From an analysis of the GOS data, Kannan et al. identified over 45,000 proteins with a protein kinase-like (PKL) fold, in 20 families, and discovered that these kinases with a PKL fold, not the histidine kinases, appear to play the major role in kinase function in microbes (Kannan et al., 2007). The PKL families are very diverse in functional properties as well as in sequence; many of those with known kinase activity appear to be involved in regulation, not metabolism. Much remains to be discovered, but researchers have already demonstrated that there is a small set of essential or core residues, ten amino acids in the catalytic domain, that gives rise to the level of functional diversity observed. These residues are conserved across the three domains of life and within the highly diverse families, suggesting a functional role. Six of them are already known to be involved in catalysis and the binding of ATP.
Overall, the observations on the kinases show that extensive variation in biological function can exist even in the presence of a common catalytic fold and biochemical catalytic function (Kannan et al., 2007).

6.5. Marine Metagenomic Vertical Sampling: A Time-Dependent Alternative Sampling Strategy

Extensive research on time- and depth-sampled microbial community genomics has been conducted in the HOT/ALOHA project, i.e., at the Hawaii Ocean Time-series (HOT) station ALOHA (22° 45′ N, 158° W). This is a well-characterized open-ocean site with extensive data interconnecting microbial populations and environmental conditions. In contrast to large-scale horizontal sampling, which inevitably reflects a specific temporal snapshot of the specific environment, a vertical sampling strategy provides for the potential analysis of time-dependent phenomena, both during a diurnal cycle and over longer periods, and also allows a deeper characterization of microbial ecology and of the microbial impact on, or structuring of, ecosystems (Azam and Malfatti, 2007). An extensive characterization of the genomics of known microbial species at ALOHA has recently been completed at seven ocean depths (DeLong et al., 2006).


To obtain long inserts with complete gene sequences and even sets of genes from biochemical pathways, a fosmid library was created from microbes sampled from the seven depths, which ranged from the surface to 4 km deep in the ocean. Approximately 5,000 clones from each depth were then ‘end-sequenced’ on both termini, which at each depth yielded from about 7 to 11 million bp of microbial genome sequence. Although most of the analyses have not yet been completed, the initial analysis revealed insights into depth-dependent microbial variation in carbon and energy metabolism, gene mobility, and host-viral interactions. Studying ocean microbiology in situ over time and at many depths provides a perspective on microniches in thinking about microbial ecology, establishes a robust model for looking at global processes, and might serve as an integrating approach for models, simulations, and the testing of specific hypotheses about the role and impact of microbes on marine ecosystems, thus leading to novel insights and broader theoretical considerations (via abstractions and generalizations) into the regulation of primary production and carbon cycling in the open ocean (Azam and Malfatti, 2007).

6.6. Marine Bacteriophages: Viral Metagenomics and Global Diversity

There are ten to fifteen viruses per microbial cell in all ecosystems and on the order of 10^30 viruses on the planet, most notably in the ocean; for ocean viral populations, the experimental observations and the estimates and their limitations have been recently reviewed (Suttle, 2007). Given the size of the host population, most of the world’s viruses are evidently phages (often termed bacteriophages), which infect bacteria. Phages are thus the most abundant biological entities on the planet. Population estimates suggest that in the ocean there are 10^23 virus infections every second (Suttle, 2007), and up to about half of ocean microbes are lysed per day by phage (Fuhrman, 1999; Weinbauer, 2003). Similarly, phage growth following infection uses additional viral genes for metabolism (Angly et al., 2006; Sullivan et al., 2006), further altering bacterial impact on biogeochemistry. In turn, phage predation plays an important, perhaps critical, role in controlling the composition of planktonic microbial communities and, consequently, in nutrient and energy cycling and other aspects of ocean biogeochemistry. Since most phages have not been studied, let alone isolated, purified and grown clonally, very little is known about their genetic or biological diversity. Rohwer and colleagues conducted the first metagenomic analyses of uncultured viral communities in environments to determine what types of phage are present in a particular environment and subsequently developed methods for modeling the viral community structure (Angly et al., 2005; Edwards and Rohwer, 2005). The phage sequences are not similar to those of terrestrial phage already cultured and sequenced; this observation suggests a “marine-ness” set of sequence attributes for marine viruses.


The culture-independent approach used in these studies helped to predict very high phage diversity on the local level, but also suggested that global phage diversity might be relatively limited if phage can move readily between environments. However, more recently, this team has constructed models from phage metagenomic datasets to explore the extent of diversity; the models yield extrapolations of hundreds of thousands of marine viral genotypes (Angly et al., 2006). This work also further supported the widespread distribution of most phage species, even though certain phages dominate some local environments, such as cyanophages and a newly discovered clade of single-stranded DNA phages in the Sargasso Sea and prophage sequences in the Arctic; regional richness was found to vary on a north-south latitudinal gradient. There is clearly as exciting a suite of biological properties to explore with ocean phages as with their microbial hosts, including studies on phage metabolism and the use of alternative pathways during infection, phage sequence diversity, the capacity of phage to act as genetic reservoirs for microbial hosts, and the role of phage in ocean ecosystems. The same can, of course, be said for the viruses infecting marine picoeukaryotes and other viruses with non-microbial hosts. Overall, in the context of the discussions in this and previous sections, the level of diversity and the complexity of viral sequences and of their microbial hosts present many opportunities and challenges for metagenomics and for contemporary molecular and computational tools.
These have been summarized (Breitbart, 2007) as: “Future challenges include the development of genetic tools for tracking all major marine groups (e.g., in situ hybridization sequence-based assays using signature genes), the expansion of “snapshot” metagenomic characterization to evaluate the temporal and spatial dynamics of natural communities and the development of a robust theoretical framework to enhance our ability to model and predict the impacts of viruses on global ecosystem function.”

6.7. Microbiomes: Microbiota within Biological Hosts and the Human Microbiome Project (HMP)

We have approximately ten times more microbial cells than human cells and at least on the order of one hundred times more (unique) microbial than human genes (Versalovic and Relman, 2006). The gut microbiota is the largest microbial community in the human body and, at around 10^11 to 10^12 bacteria per ml, the densest bacterial population known. The composition of the ten to perhaps even 100 trillion (10^14) microbes in the human intestinal tract is highly affected by genotype and by the gut environment, including the immune system and the diet. These bacterial symbionts are likely to maintain their presence despite the dynamics of the intestine and the host immune system through their interactions with the mucus gel layer overlying the intestinal epithelium as well as a process for obtaining host tolerance (Sonnenburg et al., 2004). Today, there is very little information with which to characterize the differences and similarities of the communities in


open environments with those of symbionts; future metagenomics research, one presumes, will provide very informative comparisons between physically-constrained microbiota, or environmental metagenomic communities, and microbiota living in multicellular eukaryotic hosts, or biologically-constrained microbiota. The bacterial population in the adult gut consists predominantly of members of just two bacterial divisions, the Bacteroidetes (48%) and the Firmicutes (51%) (Eckburg et al., 2005), which by rDNA analyses constitute more than 98% of the microbes; in all, the microbes arise from ten bacterial divisions, including Proteobacteria, Verrucomicrobia, Fusobacteria, Cyanobacteria and Spirochaetes (Ley et al., 2006a). Phylogenetic analysis of the microbiota shows shallow, fan-like radiations, yet the microbiota of the gut are highly diverse, with significant inter-individual variability; the majority of bacterial sequences found arise from novel microbes and uncultivated species, which represent over 80% of the species found (Eckburg et al., 2005). Extrapolating from the current human population, humans contain a total gut population or “microbial reservoir” of about 10^23 to 10^24 cells (Ley et al., 2006a). Despite the large microbial population size, however, little is known about how selective adaptation has shaped and may still be shaping the gut community or communities (Turnbaugh et al., 2007). Similarly, to date only a few clinical and animal model studies point to the impact of microbial communities, the mammalian microbiota, on the metabolism and nutritional requirements of the host, on cellular to systemic physiology, on development and aging, and on immunity.
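The extrapolation to a global “microbial reservoir” is an order-of-magnitude calculation: the per-person estimate of ten to 100 trillion gut microbes given earlier in this section is multiplied by the number of people. The sketch below reproduces that arithmetic; the world population figure of roughly 6.6 billion is an assumption for the period in which the chapter was written and is not taken from the text.

```python
import math

def total_gut_microbes(per_person, population):
    """Order-of-magnitude estimate of all gut microbes across a host population."""
    return per_person * population

# Assumption: approximate world population around 2007 (not stated in the text).
WORLD_POPULATION = 6.6e9

low = total_gut_microbes(1e13, WORLD_POPULATION)   # ten trillion per person
high = total_gut_microbes(1e14, WORLD_POPULATION)  # 100 trillion per person
print(f"{low:.1e} to {high:.1e} cells")
print("nearest powers of ten:", round(math.log10(low)), "and", round(math.log10(high)))
```

Rounded to the nearest power of ten, the two bounds fall at 10^23 and 10^24, in line with the range quoted from Ley et al. (2006a).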
Baselines for exploring the potential, from surveying the members and numbers in the population, have been provided through rDNA sequencing and through complete genomes of Bacteroides. A total of 13,335 16S rRNA genes (the largest dataset to date from any ecosystem) were analyzed from gut mucosal biopsy samples taken from the proximal to the distal gut of healthy subjects, plus their stool samples (Eckburg et al., 2005). Using a 99% sequence identity cutoff, 395 bacterial phylotypes and one archaeal phylotype were identified. In addition, the first finished genome sequence, which serves as a reference, was obtained for Bacteroides thetaiotaomicron (Xu et al., 2003), which comprises 12% of all Bacteroidetes and 6% of all bacteria in the 11,831-member human colonic 16S rRNA dataset. The genomes of B. vulgatus (31% and 15%, respectively) and B. distasonis (0.8% and 0.4%) have also been finished (referenced in Gordon (2005)). Described below are implications from the initial studies on mammalian microbiota combined with the information gleaned from the human genome project and concomitant advances in instrumentation and techniques; these technologies have initiated a scientific revolution in biology and microbiology parallel to that initiated on the environmental communities of microbes.
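The paired percentages quoted for each Bacteroides species (share of all Bacteroidetes, share of all bacteria) can be cross-checked against one another: dividing the second by the first gives the share of the dataset that the Bacteroidetes as a whole must occupy. A short sketch of that check follows; it uses only the rounded figures in the text, so small disagreements are to be expected.

```python
def implied_division_share(pct_of_division, pct_of_all):
    """Share of the whole dataset taken by the division, implied by one species'
    share of the division and its share of all bacteria."""
    return pct_of_all / pct_of_division

# (percent of all Bacteroidetes, percent of all bacteria), as quoted in the text.
species = {
    "B. thetaiotaomicron": (12, 6),
    "B. vulgatus": (31, 15),
    "B. distasonis": (0.8, 0.4),
}

for name, (of_division, of_all) in species.items():
    share = 100 * implied_division_share(of_division, of_all)
    print(f"{name}: Bacteroidetes would be about {share:.0f}% of the dataset")
```

All three species imply that Bacteroidetes make up roughly half of the colonic dataset, which is compatible with the 48% reported for this division earlier in the section, although that figure may come from a differently composed dataset.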


The microbes associated with humans live in very complex communities, as do other microbes associated with any multicellular eukaryote; the presence of such microbiota has led to the term “microbiome” to describe the microbial communities or metagenomic populations in an environment constrained or delineated by a host (even though these populations are often in contact with other populations in the external environment). A wide range of microbiomes are being characterized, including those associated with plants, termites, tube worms, insects and mammals, and these studies suggest, in general, that the microbiomes of an individual depend on genotype and on environmental experiences. The impact of microbes on wellness and on disease processes has led to a rapid increase in interest specifically in the microbiota associated with mammals and notably, the human microbiome. Few details are known about the microbiome ecology, that is, about mutualism, commensalism, and symbiosis for beneficial (and presumably, obligate) microbes within humans or about the biome’s microbe-microbe and microbe-host interactions in general, and much remains to be determined about the pathology of other microbes or about conditions that can lead to pathological events and disease. 
The early findings, in a striking fashion, have introduced ecology and evolution as important perspectives in understanding human microbiota even from a molecular perspective (Dethlefsen et al., 2007; Gill et al., 2006; Furrie, 2006; Ley et al., 2006; Turnbaugh et al., 2007; Weng et al., 2006; Xu et al., 2007). At the same time, the extant knowledge, albeit limited and as yet short on specifics, demonstrates clearly that the various communities of microbes play an important role in health and disease in various anatomical locations, such as the skin, the oral cavity, the female reproductive tract, the respiratory tract and the digestive tract, among others (Bik et al., 2006; Eckburg et al., 2005; Gao et al., 2007; Pei et al., 2004). Initial efforts at genome sequencing centers to sequence populations from these locations in healthy humans will provide a basis for functional studies, for understanding the interplay between host immunity and its microbiota, and for an analysis of typical microbial populations in sustaining wellness and in the development of disease, as well as their presence and contributions in more advanced human disease. Microbes begin to colonize the human GI tract immediately after birth, create potentially beneficial symbiotic host-microbe relationships and are involved in various aspects of human health and physiology. Babies primarily acquire their initial microbiota from the vagina and feces of their mothers (Mandar and Mikelsaar, 1996), and having a shared mother may be a stronger contributing factor than genotype (Zoetendal et al., 2001), based on a comparison of monozygotic and dizygotic twins as adults.
Similar results have been found for mice, in that mothers who are sibs share microbiota with each other as well as with their offspring, and thus, the vertically inherited microbiome is stable enough over time and generations that kinship relationships are observed in the composition of gut populations (Ley et al., 2005).


Colonization, overall, likely involves significant contributions to the flora from parental, environmental and dietary sources. For example, the composition and temporal patterns of establishment of the intestinal microbiota have been found, surprisingly, to vary from baby to baby (Palmer et al., 2007). At the same time, genetic factors appear to shape the gut populations as well, since the patterns in a pair of fraternal twins were found to be strikingly similar to each other and much less so to those of other babies. The individual profiles, despite their variability, were found to be consistent, such that each baby could be identified by a distinctive microbial profile over weeks and months and retained some individual characteristics at one year of age, even though by then the profiles had converged toward the more general, adult-like microbial profile (Allen et al., 2007). These experiments were conducted with a novel, custom-made small subunit ribosomal DNA microarray, which was validated using sequencing data. Molecular profiling of the fecal bacterial composition in European children has demonstrated that various factors associated with different lifestyles have a significant impact on the intestinal microbiota and their diversity (Dicksved et al., 2007). Specifically, children with an anthroposophic lifestyle (which involves, for example, restricted use of antibiotics and greater consumption of fermented vegetables) had a significantly higher diversity of microbes in their feces than children living on a farm, where there would be greater consumption of farm milk, contact with animals, and other lifestyle differences.
The novelty or distinctness of individual microbiota becomes more notable with age, presumably due to the accumulated influences of diverse lifestyles, diets, and colonization histories; in general, an individual’s diet, genotype, social group and medical history, as well as advanced age, influence the microbial population distribution (Dethlefsen et al., 2007; Turnbaugh et al., 2007). Gill and colleagues have published a detailed metagenomic survey of the human distal gut microbiome (Gill et al., 2006), which points to a wide range of metabolic capacities that humans utilize rather than having evolved them within their own genomic capacity. In this study, DNA libraries created from the fecal flora of two healthy humans generated 78 million bp of sequence data. A comparative analysis with the human genome and reference microbial genomes revealed that the human gut microbiota is enriched in genes involved in the breakdown and fermentation of otherwise indigestible plant-derived polysaccharides, methanogenesis, vitamin and essential amino acid synthesis, and xenobiotics metabolism. Thus, among other consequences, the microbial genetic assemblage provides humans with large numbers of genes that effectively increase the human physiological repertoire. This enrichment of the human physiological repertoire has impacts that might well have been selected for earlier in human evolution but are not desirable now, although the discovery of the role of microbes points to the potential for intervention. An early study on the physiological role of microbes and their impact on human energy balance examined the nature of mid-term changes in the human gut microbiome while exploring the relationships between gut microbial ecology and
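Statements that one group has a “higher diversity of microbes” than another rest on a quantitative diversity measure computed from the abundance of each taxon or profile peak. The text does not say which measure was used in the study cited, so the following sketch uses the Shannon index purely as a familiar stand-in; the abundance counts are invented for illustration.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero abundance."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Invented abundance profiles: an even community versus one dominated by a single taxon.
even_community = [10, 10, 10, 10]
skewed_community = [37, 1, 1, 1]

print(f"even:   {shannon_index(even_community):.3f}")
print(f"skewed: {shannon_index(skewed_community):.3f}")
```

The index rises both with the number of taxa and with the evenness of their abundances, so two samples with the same number of taxa can still differ substantially in diversity.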


body fat in humans (Ley et al., 2006b). Twelve obese people were randomly assigned to either a fat-restricted or a carbohydrate-restricted diet and the composition of their gut microbiota was monitored for one year by sequencing 16S rRNA genes from stool samples. The two dominant groups of bacteria in the human gut, Bacteroidetes and Firmicutes, changed their relative abundances in direct correlation with the percentage loss of body weight in both diet groups. Moreover, one can expect that further sampling, among populations of diverse peoples, locations and diets, will provide many novel insights into the contributions of microbes to human energy balance and obesity, among other things. Among the trillions of microbes in the gut are methanogenic, hydrogen-consuming Archaea, whose role in human wellness and metabolism is not established; one Euryarchaeote species, Methanobrevibacter smithii, can make up as much as ten percent of the anaerobes in adult intestines, while Methanosphaera stadtmanae and Crenarchaeotes are minor members (Eckburg et al., 2005). The microbial population of the gut extracts energy from polysaccharides (indigestible dietary fiber) that humans cannot digest, and their genomes are enriched in metabolic pathways for glycan degradation (Gill et al., 2006). Hydrogen consumption by Archaea would enhance ATP production by bacterial NADH dehydrogenases and facilitate energy production via dietary fiber fermentation by the principal bacteria in the gut; at the same time, colonization of adult germ-free mice with gut microbiota yields an increase in adiposity without an increase in food intake. Such considerations led to an analysis of how archaea such as M. smithii could affect the ecological physiology (notably, the food web, metabolism and product excretion) of the gut bacteria and whether archaeal distribution impacts host energy balance (Samuel and Gordon, 2006). Having created a much simplified gut microbiota, with Bacteroides thetaiotaomicron with or without M.
smithii or a sulfate-reducing bacterium, Desulfovibrio piger, the study, via MS and whole-genome transcriptional profiling, permitted the first analysis of the impact of a methanogen and a saccharolytic bacterium on each other’s transcriptomes and metabolomes. Cocolonization of germ-free mice with B. thetaiotaomicron and M. smithii increased their gut population density, and M. smithii enhanced both B. thetaiotaomicron’s capacity to degrade polyfructose-containing glycans and the mouse host’s ability to harvest and store calories. The study also provides insights into the mutualism between the two microbes and its metabolic basis; overall, the data indicate the capacity of M. smithii, unlike Desulfovibrio piger, to regulate the specificity of polysaccharide fermentation and to influence caloric deposition in fat stores. The Archaeon thus impacts host energy balance and prioritizes the bacterial metabolism of polysaccharides typical of human diets (Samuel and Gordon, 2006). While the extent to which the ecology of eukaryotes and macroscopic populations can be directly applied to microbial communities is not established, fundamental principles of ecology and evolution derived from observations on populations living in physical environments are observed in the constraints of the

human microbiome. Functional redundancy within the genomes of the divergent bacterial lineages in the microbiome would guard against disruption of food webs, per ecology-based prediction. Such predictions include the general or top-down selection that would yield a population of distantly related members whose genomes have converged on functionally similar suites of genes (Ley et al., 2006). LGT would be one means by which a shared metabolome and other cellular capacities would be achieved, whereas the competition among the microbiota populations would select on individual functionalities (bottom-up evolution) to yield specialized genomes with distinct suites of gene functions, which would define ecological niches (or microbial professions) and would be maintained by limitations on homologous recombination (Ley et al., 2006). Ley and colleagues provided a very detailed review of the ecological and evolutionary forces that determine the microbial diversity of the human intestine, which includes the nature of the patterns observed (low levels of deep diversity and the radiation of a few lineages); initial studies on how colonization occurs; the nature of the selection pressures in the gut that impact mutualistic and pathogenic microbes; the selection for emergent functionality and pressure on hosts; functional redundancy that reduces the need for traditional keystone species for an environment; the role of the immune system in microbial selection and evolutionary drivers in the host; and the nature of pathogenic communities (Ley et al., 2006). Their list of key questions to be answered and suggestions for future studies are described as part of the discussion of the newly implemented Human Microbiome Project below.
Gordon and his team, to probe further into the ecological and evolutionary properties of the microbiome, have determined the complete genomic sequences of two Bacteroidetes with highly divergent 16S rRNA phylotypes that are well represented in the distal gut of healthy humans, namely Bacteroides vulgatus and Bacteroides distasonis (or Parabacteroides distasonis), which diverged from the last common ancestor of other Bacteroides prior to the division’s differentiation (Xu et al., 2007). Comparison with extant Bacteroidetes sequences from gut and non-gut populations provides insights into the niche and habitat adaptations of these two normal gut species. LGT, mobile elements and gene amplification have affected the ability of these microbes to vary their cell surface, sense the environment, and harvest the resources in the intestine. These processes have been shown to be the driving forces in the adaptation to the environment of the distal intestine and, in particular, point to the need to consider the evolution of humans (and presumably all mammals and even all animals) in terms of the evolution of the microbiome and thus the evolution of the supraorganism (Xu et al., 2007). The challenge that this adds to tree construction and to establishing the best approaches for comparative genomics and other comparative studies in biology is obvious. The shared fate of the human supraorganism (the evolutionary trajectory of humans and their symbiotic bacteria) implies selection for mutualistic interactions essential for human health. (The famous dictum from the molecular biology of the

50s and 60s, “what is true for E. coli is true for an elephant,” derives from that mutualism and the underlying selection pressures.) Disruption or uncoupling of this shared fate would seem likely to result in disease (Dethlefsen et al., 2007). This connection greatly increases the importance of ecological and evolutionary analyses for medicine, should change the ways in which ecology, and microbial ecology in particular, is viewed and utilized in the biomedical and clinical communities, and should also serve to accelerate research on human-microbe mutualism and its relationships to human wellness and disease. Similarly, research into mutualism and the status of humans as supraorganisms per se suggests that core biological principles, that is, principles from ecological, environmental and evolutionary biology as well as molecular cell biology and genomics, are likely to be far more important for clinical and preventive medicine than is recognized today. Gordon and his colleagues (reviewed in Gordon et al., 2005; Turnbaugh et al., 2007) have made extensive use of gnotobiotic (“known life”) model organisms such as mice and zebrafish to understand the function of the microbiota, coupled with gut transplants to characterize the role of host habitat selection (that is, the impact of habitat on community structure and function, and the reciprocal impact on the host), the impact on metabolism and obesity, and other features of host energy balance. Gnotobiotic refers to animals that have been raised without any micro-organisms (germ-free, or GF, animals) or that harbor only defined components of the normal mouse or human gut microbiota.
Overall, the comparison of colonized and GF mice and the effects of colonization of adult germ-free mice with defined components of mouse or human microbiota show that the gut microbiota facilitates the regulation of energy balance, through both calorie extraction from otherwise indigestible diet components and the control of mammalian genes that promote storage of the extracted energy in adipocytes. The microbiota also directs numerous biological transformations, including detoxification of carcinogens, the metabolism of ingested xenobiotics and endogenously produced lipids, and the synthesis of essential vitamins. The microbiota further modulates the immune system through actions on the maturation and activity of both the innate and the adaptive immune systems, which include the induction of tolerance toward microbial diversity and presumably offer a selective advantage by enabling the microbiota to function effectively under environmental stress. Similarly, the microbiota affects the cardiovascular system (Ordovas and Mooser, 2006; Turnbaugh et al., 2006). Another novel set of explorations in fundamental medical microbiology and human metagenomics involves the study of foods influencing the microbiome (prebiotics) and the impact of exogenous bacteria on health (probiotics), which are becoming more common elements of the human diet. The functional interactions within the gut of an intestinal mutualist, Bacteroides thetaiotaomicron, and the most common probiotic microbe, a Bifidobacterium species (Bifidobacterium longum), in a mouse model for host-symbiont interactions have been characterized in terms of genomics and metabolic studies (Sonnenburg et al., 2006). Global functional

changes were probed using genomic methods for monitoring transcriptional changes and through characterizing the habitat-associated carbohydrates. The gut mutualist degrades a greater diversity of polysaccharides when the probiotic microbe is present; the presence of the probiotic microbe also induces host innate immunity genes. Commensal bacteria alter their own genetic program to respond to the environment. This work suggests that commensal and probiotic bacteria can alter each other’s genetic program in the mammalian gut. One consequence of the observations on the microbiota of humans, and of the initial recognition of the consequences of microbes for healthy or normal physiology and for disease, has been a paradigm shift away from viewing multicellular eukaryotes (and notably, humans) as self-reliant, self-contained and restricted biological entities, and in the meaning attached to the number of genes within the genome of an individual species. Continuing whole genome sequencing will allow genes to be identified that have different extents of similarity (and presumed “relatedness”) and in turn will identify genes inherited through vertical transmission (including paralogs from prior duplication events) or acquired through LGT events (xenologs). Whole genome sequencing will also permit insight into the distribution of gene families among lineages and suggest which are essential for survival in the gut, and into the extent of LGT between distant versus close relatives in the exceptionally densely populated distal intestine and the relationship of these genomic phenomena to the evolution and the functional stability of the “collective metabolome” of the gut microbiota. Similarly, the structure of these gut microbial genomes and their evolution might serve as biomarkers for health or disease susceptibility (reviewed in Gordon et al., 2005; Turnbaugh et al., 2007).
Individual differences in microbial community composition create difficulties for correlating particular members of the intestinal microbiota to human health. At the same time, it is becoming increasingly apparent that host genetics has an impact on the susceptibility of individuals to different intestinal disorders including Crohn’s disease (CD) (Jansson, 2006). Recently, Jansson and co-workers (personal communication) have relied on the use of identical twins to uncouple the impact of host genetics on Crohn’s disease, enabling them to focus on the microbial community composition. Of particular value was a set of monozygotic twins from the Swedish twin registry, including pairs of twins that were discordant for Crohn’s; i.e. one individual had CD and the other was healthy. Therefore, the discordant twins served as each other’s genetically matched control, enabling differences between twin pairs to be closely scrutinized. To date, fecal and biopsy samples have been analyzed using a variety of molecular approaches, including DNA-based molecular fingerprinting techniques and proteomics. Preliminary results indicate that whereas healthy identical twins have very similar microbial communities in their intestines, even when they have lived apart for years or decades, the opposite is true in the case of discordant twins (Janet Jansson, personal communication). Apparently, the disease state results in an altered intestinal microbiota, or the composition of the gut microbiota may be a contributing factor leading to disease incidence. Although

it is difficult at this stage to determine whether changes in the gut microbiota are causative or merely indicative of Crohn’s disease, the results are intriguing and may lead towards useful diagnostic markers at the bacterial or protein level. Currently, one limitation is access to the metagenome data needed to analyze metaproteome data from the twin set. Therefore, efforts are underway to match metaproteomes and metagenomes from the same individuals. If successful, this should result in a Crohn’s metagenome and metaproteome database that would be accessible to others interested in elucidating the role of specific microorganisms or proteins in disease development or diagnostics. The findings on the human microbiome clearly indicate that gut metagenomics, along with the study of other human microbiomes, will become an essential field of medicine. The microbiota of the gut are both effectors of and reporters for physiology, could provide a new, improved means for defining health more quantitatively, and might even serve to warn of the need for early intervention (predictive and preventive medicine). At a minimum, research on the metagenomics of human microbes starts to define the role of microbes in wellness and disease, better characterizes the nutritional status of obese or starved people, predicts the bioavailability of orally administered drugs, and helps predict the susceptibilities of individuals or populations to particular diseases.
The range of unexpected research findings outlined in brief above has led to a major international initiative, termed the Human Microbiome Project (HMP); in particular, it is one of the top priorities of the National Institutes of Health (NIH), with participation from Institutes across the agency and an especial interest and commitment by the National Institute of Allergy and Infectious Diseases (NIAID), the National Human Genome Research Institute (NHGRI), and the National Institute of General Medical Sciences (NIGMS). Overall, the factors that created the necessary excitement and focus include the significance of the truly novel observations on the impact of microbiome ecology on humans as explored for gut populations (Gordon et al., 2005); the prominent recognition (following years of controversy due to the expectations of traditional microbiology for laboratory culturing and related criteria) of the bacterial cause of many ulcers (Marshall and Warren, 1984); the completion of the NRC study and the release of its report on “The New Science of Metagenomics” (Handlesman, 2007); and numerous, disparate clinical model systems and initial sequence and phylogenetic observations from laboratories around the world. The HMP seeks to understand the microbial components and the factors that influence the distribution and evolution of the constituent microorganisms in the human genetic and metabolic landscape, and how they contribute to normal physiology and predisposition to disease (Turnbaugh et al., 2007). It will do so especially through exhaustive sequencing of the genomes of populations in their disparate environments within and on healthy humans, along with subsequent computational and functional analyses, and later through similar approaches to characterize the microbial status of specific disease states. The Human Microbiome Project will focus initial

research efforts on ascertaining to what extent (1) individual humans share the same microbial genes and populations (distribution of species), i.e., share a core microbiome; and (2) responses of and changes in the microbiome connect to changes in the health status of humans. To implement these goals, the HMP will also seek to advance the relevant technology and provide new software approaches and tools for bioinformatics, and, at the same time, will address any ethical, legal and social implications arising from the research and its findings.

6.8. The Higher Termite Hindgut (3rd Proctodeal Segment, P3) Microbiome

Termites, which are spread around the world, are fascinating as metabolic engines that can degrade wood, and do so via obligate mutualisms. While termites harbor highly diverse gut microbiota, the lower termites also contain flagellate protozoa in their hindgut that produce cellulases and hemicellulases. Termites provide, in effect, one of the promising bridges from biology to alternative energy and sustainability, given their large global population as wood-feeding organisms, which implies their importance for environmental carbon turnover and their potential as sources of enzymes for bioconversion of wood into alternative (non-petrochemical) fuels. While it is possible to cure or manipulate lower termites to remove protozoa, the most abundant and species-rich termites, the higher termites, lack such protozoa and harbor only a novel collection of hindgut microbes. It is possible after dissection to isolate and examine separately the lumen of the hindgut paunch or third proctodeal segment, opening a window on a unique population of microbes (Warnecke et al., 2007; also see Layers of Symbiosis in Sec. 14, the Online Resources). In what amounts to a new landmark paper in the field, a combined metagenomic sequence analysis, functional or proteomic analysis, and subsequent in vitro tests for cellulase activity of putative, relevant enzymes within workers of a higher, arboreal Nasutitermes species demonstrated a diverse, extensive collection of microbial genes for xylan and cellulose hydrolysis; many of the genes are expressed in vivo or were found to have cellulase activity in vitro. Phylogenetic analysis predominantly uncovered not only the expected fibrobacters (phylum Fibrobacteres), but also spirochetes from the genus Treponema.
No archaeal species were found but, in contrast to the bovine and human gut microbes, mechanisms for chemotaxis are present, consistent with the motile bacteria previously observed in the termite; chemotaxis is suggested to be important for maintaining compartmentalization and niche selection (Warnecke et al., 2007). The putative enzymes did not include those associated with lignin degradation but included more than 100 gene modules corresponding to pathways for the hydrolysis of cellulose, as expected from earlier work. At the same time, the group reported new information on hydrogen metabolism, a central interest for bioenergy production, and, given the small size of the hindgut, pointed out that the remarkable species and gene diversity and the observed range of symbiotic functions such

as nitrogen fixation and carbon dioxide reductive acetogenesis “underscore how complex even a one microliter environment can be” (Warnecke et al., 2007).

6.9. A Minimal Microbiome: Symbiosis in Olavius algarvensis

A study by Woyke and collaborators (Woyke et al., 2006) of the symbiotic community living in Olavius algarvensis, a marine worm lacking a mouth, gut and anus, demonstrates the power of a combination binning approach in a shotgun sequence dataset. Four clusters of scaffolds were binned based on intrinsic DNA signatures (e.g., GC-content and word frequencies) from the approximately 204 million bp of data sequenced from small and large clone libraries. Four symbionts were identified by the presence of the corresponding rRNA operons, which represented the only rRNA operons found in these bins. The key to success in this project was the presence of a closely related, fully sequenced reference genome. Using the reference genome as a guide, four symbiotic species genomes were partially assembled and their metabolic pathways were reconstructed. This project yielded insight into symbiotic relationships between the host and bacterial symbionts, represents the minimal metagenomic population and the simplest microbiome and simplest supraorganism, and provides inspiration for further analyses of other such relationships.
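
The intrinsic DNA signatures mentioned above (GC-content and word frequencies) can be illustrated with a minimal Python sketch. It is not the procedure used by Woyke et al.; the function names, the word length k = 4 and the GC cut-offs passed to `bin_by_gc` are assumptions chosen only for illustration.

```python
from bisect import bisect_right
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G+C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def kmer_frequencies(seq: str, k: int = 4) -> list:
    """Relative frequency of every DNA word of length k, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)  # words containing N etc. are ignored
    return [counts[w] / total if total else 0.0 for w in words]


def bin_by_gc(scaffolds: dict, edges: list) -> dict:
    """Assign each scaffold to a bin index given sorted GC cut-offs (hypothetical)."""
    return {name: bisect_right(edges, gc_content(seq))
            for name, seq in scaffolds.items()}
```

In practice the word-frequency vectors of all scaffolds would be passed to a clustering method, and the resulting clusters checked against marker genes such as the rRNA operons described above.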

7. Methodologies

In an emerging field, any effort at a comprehensive list of technical approaches can only be a snapshot in time. Among methods commonly used today are gene surveys via PCR of 16S rRNA genes, the fluorescence-based fingerprinting approaches termed T-RFLP and ARISA, and notably, the sequencing of environmental populations by shotgun whole genome approaches and also newer technologies for sequencing.

7.1. 16S rRNA-PCR

A PCR-based rRNA gene survey (16S rRNA-PCR) is a rapid and cost-efficient method to identify the types of microbes present in an environmental sample, and to estimate their numbers, on a phylogenetic basis. PCR amplification with universal primers that hybridize to highly conserved regions in bacterial and archaeal 16S rRNA genes, followed by cloning and sequencing, yields the first critically important steps in analyzing a microbial community. The phylotypes are determined by sequence similarities in rRNA genes, which exist in all cell-based organisms. The numbers of individual microbial types are estimated from the number of times the same rRNA gene is seen. For example, the public Ribosomal Database Project had 262,030 aligned and annotated public rRNA sequences, of which 84,442 were derived from

cultivated bacterial strains, while 177,588 were from environmental samples, as of September 2006 (RDP-II; see Sec. 14). The limitations of 16S rRNA gene surveys are the sampling issue (i.e., a single sampling may not describe the relative abundances of a microbial community), PCR efficiency (i.e., not all rRNA genes are amplified with equal efficiency), and a copy number issue (i.e., the copy numbers of rRNA genes are not consistent across bacterial and archaeal taxa, which potentially leads to overrepresentation of some species in 16S libraries). Described below are two similar, extant techniques for molecular fingerprinting of microbial communities: terminal-restriction fragment length polymorphism (T-RFLP) and automated ribosomal intergenic spacer analysis (ARISA).

7.2. T-RFLP

In conducting terminal-restriction fragment length polymorphism, a selected region of 16S rRNA genes is amplified from total community DNA with two differentially fluorescently-labeled primers. The mixture of dually labeled amplicons is digested with a restriction enzyme to release the 5′ and 3′ ends (terminal restriction fragments, T-RFs) of each amplicon. The fluorescence-labeled fragments are detected using an automated capillary DNA sequencer and the GeneScan software (Applied Biosystems).

7.3. ARISA

Since the length of the intergenic transcribed spacer (ITS) region between the prokaryotic 16S and 23S rRNA genes is inherently variable among organisms, automated ribosomal intergenic spacer analysis exploits this length heterogeneity by using PCR amplification across the ITS region to produce DNA fragment lengths characteristic of the taxa present in the sample. The fluorescence-labeled PCR products are detected using an automated capillary electrophoresis system. Individual peaks are considered to be operational taxonomic units (ARISA-OTU). ARISA can also be coupled with 16S-ITS rRNA gene clone libraries to simultaneously assess multi-level microbial diversity patterns (Brown et al., 2005).

7.4. Environmental Shotgun Sequencing (ESS)

The methodologies employed in environmental shotgun sequencing (ESS) are analogous to those routinely applied in single organism genomics. Specifically, this relates to the use of the whole genome shotgun sequencing approach, a well established method to sequence a single genome. The key difference between the two, however, is the composition of the source genomic DNA. In single organism genomics, the source DNA is derived from a single individual organism or a collection of clonally-derived individuals. In contrast, heterogeneous environmental

source DNA is comprised of multiple organisms of varying phylogenetic relatedness and thus represents a spectrum of genotypic compositions. Ultimately, the heterogeneous nature of mixed assemblages brings to bear a multitude of unique considerations for manipulating and analyzing the resulting genome sequence information. Total community DNA obtained directly from an environmental sample is initially sheared into smaller fragments that are used to construct a DNA clone library. Large insert libraries (e.g. BACs; 40–250 kb) are commonly used to screen for clones possessing phylogenetic or functional marker genes of interest (such as the 16S rRNA or recA genes), but they can also be used for paired-end sequencing via the shotgun sequencing approach. Paired-end sequencing is a method to sequence two ends of an insert. Since the length of the clone can be estimated, the sequence information is used to know where the two reads can be placed in an assembly. Small insert libraries (e.g. plasmid libraries; 2–8 kb) are routinely used for massive paired-end shotgun sequencing owing in part to the relative ease and efficiency with which such libraries can be constructed. Environmental DNA fragments harbored in a small insert library are randomly sequenced bi-directionally and, based upon the sequence overlap amongst the generated reads, assembled into larger contiguous fragments called “contigs.” The resulting ESS data is comprised of contigs of varying lengths, as well as shorter, unassembled single-read fragments, which are typically less than 1000 bp in length. The resulting ESS data can provide considerable insight into the functional capabilities (phenotypic composition) of a given community as well as extensive information about community diversity (e.g. viruses, bacteria, archaea and eukaryotic species) at a level of resolution impossible to capture with 16S rRNA gene surveys alone.
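
The assembly step described above, in which reads are joined into contigs on the basis of sequence overlap, can be sketched as a toy greedy procedure in Python. This is an illustration only, under strong assumptions: reads are error-free, all come from the same strand, and the minimum overlap `min_len` is an arbitrary choice. Production assemblers also handle sequencing errors, repeats, reverse complements and paired-end constraints, none of which is modeled here.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list, min_len: int = 3) -> list:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            # No remaining overlaps: what is left are contigs and single reads.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
```

Reads that share no sufficient overlap with any other sequence are returned unchanged, which mirrors the unassembled single-read fragments mentioned in the text.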

8. Genomics and Proteomics Technologies

8.1. Challenges in Sequencing Technologies

The most time-consuming and expensive part of metagenomics research is random shotgun sequencing (environmental shotgun sequencing; ESS) and the subsequent sequence analysis. Since genome coverage is proportional to abundance in the library, massive amounts of ESS data are required to see the less abundant members in a complex community. Therefore, higher-throughput and lower-cost sequencing technologies promise higher-quality metagenomics studies. Sanger-based capillary sequencing methods, in general, yield fewer than 2 million base pairs in a single run, and the length of each read is approximately 600 to 900 bps. Emerging sequencing technologies, on the other hand, provide alternative strategies for generating substantially more sequence data at a lower cost than currently available Sanger-based methods. For example, the 454 Life Sciences Genome Sequencer FLX (the 454 pyrosequencer) eliminates the need for library construction and generates more than 100 million bps of DNA sequence in a single run (Margulies et al., 2005) and the Solexa 1G Genome Analyzer

generates almost 1 billion bps per run without library construction (Bentley, 2006). A downside of these new technologies is their short read lengths (about 200 bp for the 454 pyrosequencer and about 30 bp for the Solexa 1G) in comparison with Sanger capillary sequencing. An additional drawback is the inability to perform paired-end sequencing. Both are disadvantages for assembling environmental sequence data. However, state-of-the-art sequencing is in a high level of flux, and the rate of technological progress is particularly high for new technologies. It appears that the length of reads and the ease and rate of sequencing will continue to improve, that other new technologies are likely to appear, and that the cost of sequencing will continue to drop.
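
The statement that coverage is proportional to abundance can be turned into a rough calculation. The Python sketch below assumes the idealized Lander-Waterman model of random sequencing (expected covered fraction 1 - e^(-c) at coverage c), which the text does not invoke; the 4 Mb genome size and the 1% share of community DNA are hypothetical values, while the per-run yields are the approximate figures quoted above.

```python
import math


def expected_coverage(total_bp: float, dna_fraction: float, genome_size: float) -> float:
    """Mean coverage of one member, given its share of the community DNA."""
    return total_bp * dna_fraction / genome_size


def fraction_covered(coverage: float) -> float:
    """Expected fraction of a genome hit at least once (Lander-Waterman assumption)."""
    return 1.0 - math.exp(-coverage)


def bp_needed(target_coverage: float, dna_fraction: float, genome_size: float) -> float:
    """Total community sequence needed to reach a target coverage of one member."""
    return target_coverage * genome_size / dna_fraction


# Hypothetical member: 4 Mb genome contributing 1% of the community DNA.
runs = {"Sanger": 2e6, "454": 100e6, "Solexa 1G": 1e9}  # approximate bp per run
for name, bp in runs.items():
    c = expected_coverage(bp, 0.01, 4e6)
    print(f"{name}: {c:.3f}x coverage, {fraction_covered(c):.1%} of genome sampled")
```

Under these assumptions a single 100 Mbp run gives only about 0.25x coverage of the rare member, which is why low-abundance organisms require far deeper sequencing than the dominant ones.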

8.2. Using Emerging Technologies for Environmental Studies

The first metagenomics study using the 454 pyrosequencer analyzed two samples taken from the Soudan Iron Mine in Minnesota (Edwards et al., 2006). The 454 sequencing data analysis revealed remarkable differences between the two microbial communities. In order to validate the 454 sequencing approach, a 16S clone library was created from one of the samples and sequenced on a traditional Sanger-based sequencer. The resulting sequence data were remarkably similar to those generated from the same sample with 454 pyrosequencing. This study suggests that 454 pyrosequencing is a low cost, high yield alternative when it is combined with effective statistical analyses. However, 454 has trouble accurately calling the number of bases in homopolymeric regions, especially for single-base runs longer than 5–8 bases (Margulies et al., 2005; Moore et al., 2006). A novel solution, with intriguing suggestions for the entire field and the potential to accelerate the generation of metagenomic sequence data, has recently been provided by a combined technology approach that circumvents the difficulties: the Sulcia muelleri genome was primarily sequenced using a 454 instrument, but Solexa instrumentation was used to overcome the base calling problem, since Solexa can contribute short yet accurate reads in a single run (McCutcheon and Moran, 2007). The combination of emerging technologies improved sequence quality and completed the Sulcia genome, which turned out to be the smallest known Bacteroidetes genome and among the smallest of any cellular organism (McCutcheon and Moran, 2007).
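
Because the base-calling problem described above is tied to homopolymer length, one simple check is to locate the homopolymer runs in a consensus sequence so that they can be verified with a second technology. The Python sketch below is a hypothetical helper, not part of any cited pipeline; the default threshold of 6 bases is an assumption chosen from within the 5–8 base range mentioned in the text.

```python
import re


def homopolymer_runs(seq: str, min_len: int = 6) -> list:
    """Return (start, base, length) for each single-base run of at least min_len.

    min_len must be at least 2; positions are 0-based.
    """
    if min_len < 2:
        raise ValueError("min_len must be at least 2")
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), m.end() - m.start())
            for m in pattern.finditer(seq.upper())]
```

Regions flagged in this way are the ones where short but accurate reads, such as those used for the Sulcia genome, would be most useful for confirming the base count.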

8.3. Single-Cell Sequencing

Assembling an individual genome from an ESS dataset is one of the greatest challenges in metagenomics. As mentioned previously, heterogeneity within species is commonly observed in microbial genomes. From a pan-genome standpoint, a single genome may not represent the given species genome. However, knowing individual cell genomes is a powerful approach to understand the function(s) of

a single species in a mixed community. Thus, single-cell sequencing is an alternative to traditional, often difficult, culturing of microorganisms prior to an analysis. Isothermal multiple displacement amplification (MDA) was originally developed for human samples by Lasken and colleagues (Dean et al., 2002). This method uses the phage φ29 DNA polymerase, which is highly processive and has a strong strand displacement activity. Typically, small numbers of single cells (or single members of a community) are isolated by dilution or cell sorting and the cellular DNA is amplified more than 10^9-fold by MDA using random primers. The average size of the products was reported as 70 kb (Hutchison and Venter, 2006). Amplification from small amounts of DNA, such as the amount in a single bacterial cell, has been a big challenge because MDA tends to synthesize high background from contaminating DNA templates. Hutchison et al. suppressed background synthesis by reducing the reaction volume while using the same amount of template (Hutchison et al., 2005). This approach allowed cell-free cloning of genomic DNA isolated from bacteria. Zhang et al. developed a fluorescence-activated cell sorting system to isolate a single bacterial cell in order to reduce background synthesis (Zhang et al., 2006). The authors amplified DNA clones (termed plones) from single cells of E. coli and Prochlorococcus marinus, for which complete genome sequences are already available. Shotgun sequencing of two P. marinus plones showed coverage of about two-thirds of the genome for each. The genomic regions not covered in the initial round of sequencing were recovered by sequencing PCR amplicons derived from the plonal DNA.
Although several challenges must be met to complete whole genome sequences from single cells, the ability to amplify faithfully from single cells will provide a basis for an improved analysis of shotgun sequencing data and genome assembly as well as ecological analysis and identification of the specific metabolism and environmental function of members of microbial communities. Along with long insert or fosmid libraries as well as continued efforts to extend the conditions for culturing and expand the number of cultured microbes, the single cell sequencing technology will also serve as a bridge between the tradition of culturing in microbiology and the data obtained from pure cultures and environmental metagenomic data.

8.4. Array Technologies

8.4.1. Probing Large or Genome Scale Expression in Metagenomic Context

Gene microarrays have become a widely used technology for detecting or probing the activity of thousands of target genes simultaneously. Despite the difficulties due to the complexities of environmental samples, the approach has already been used widely to probe the biological and biochemical activities and dynamics of microbial populations in natural settings (reviewed in Gentry et al. (2006)). Although microarrays provide for the rapid generation

K. Arima & J. Wooley

of large datasets on functional details and could provide essential information complementary to that derived from environmental sequencing, there remain many specific challenges in the application of such arrays to environmental samples. The amount of material available for an environmental sample is likely to be limiting and, as with other broad instrumental probes, microarray data will reflect only the most prevalent species or populations in the community. In addition, microarray analysis is based solely on genes and pathways that have been revealed through the study of laboratory isolates (and are now readily available). One of the most difficult complications is the design of an array sensitive enough to interpret the microbial community in a given environment. The advent of environmental shotgun sequencing may point to genes whose activities should be probed by microarray technology. However, the lack of reference databases of annotated sequences is another obstacle to designing comprehensive probe sets that would provide the coverage needed to detect and characterize the functional attributes of various environments and identify the microbes responsible for specific functions. The following two approaches aim to improve array technology for the study of populations in natural environments. The first is a 16S rRNA survey and the other is a functional assay of a microbial community.

8.4.2. 16S rRNA-Microarray (PhyloChip)

Brodie et al. designed a 16S rRNA microarray, named the PhyloChip, representing 8,741 taxa, in order to explore and characterize microbial diversity in urban aerosols (Brodie et al., 2007). The probes were designed from diverse regions of 16S rRNA genes. Each taxonomic group has a minimum of 11 different probes, so that a combination of the probes is specific enough to represent each taxon. In particular, the resulting arrays were able to detect various species in the phyla of Bacteria and Archaea, including known pathogens.
Samples were collected from aerosols from two cities over 17 weeks. 16S rRNA genes were amplified by PCR and hybridized to the PhyloChip. The analysis revealed a high complexity for the urban aerosols, which contained more than 1,800 diverse bacterial types, and also included bacterial families with pathogenic members. Since the chip analysis can be completed in a day, this approach is ideal for monitoring targeted microbial populations (e.g., pathogens) in the air.

8.4.3. Functional Gene Array (FGA)

Functional gene arrays (FGAs) have been used for a number of environmental studies, but more probes corresponding to genes involved in biogeochemical functions have been required to characterize fully the environments under study. The FGA termed GeoChip carries 24,243 oligonucleotide (50-mer) probes covering approximately 10,000 genes in more than 150 functional groups involved in nitrogen, carbon, sulfur and phosphorus cycling, metal reduction and resistance, and organic

Metagenomics

contaminant degradation (He et al., 2007). As of the writing of this review, this study is the most comprehensive application of FGAs.

8.5. Metaproteomics

Since functional assays in microbiology mainly rely on laboratory studies in pure cultures, there has been no established method for the study of complex microbial functions in situ after discovery of an environmental genome or population. An emerging field, metaproteomics, responds to this experimental limitation by providing the missing link from discovery to analysis in order to investigate the range of functions in mixed microbial communities (Wilmes and Bond, 2006). The key to improving this technology is the combination of robust, multidimensional, nano-liquid chromatography with rapid scanning tandem mass spectrometry. To establish a model system for developing this approach, VerBerkmoes and collaborators used a multidimensional protein separation system followed by mass spectrometry (MS) (Lo et al., 2007; Ram et al., 2005). They detected 2,033 proteins from the five most abundant species in the AMD biofilm. The proteins include 48% of the predicted proteins from the dominant biofilm organism, Leptospirillum group II. The metaproteomics analysis followed by cellular fractionation analysis revealed the cellular localization of 357 (apparently) unique and 215 conserved novel proteins of the AMD biofilm metaproteome (Lo et al., 2007; Ram et al., 2005). Another application of metaproteomics (by way of matrix-assisted laser desorption ionization-time of flight MS analysis following an initial 2D gel electrophoretic separation of protein samples) has been conducted on the microbiota in the human infant intestinal tract, as assayed in fecal samples. The infant samples were chosen due to the apparent simplicity of the Metabiome, with about half to two-thirds of the bacteria being bifidobacteria in the fecal samples from the two infants in the study.
While the data in this initial study is quite limited in terms of sampling (over the course of the age of the infant and the number of subjects), the change observed in the metaproteome raises interest in more detailed studies with more subjects and more time points for sampling (Klaassens et al., 2007).

9. Computational Challenges

Bioinformatics approaches to understanding genes and genomes are well established and indeed, bioinformatics became established as a separate research field in order to manage the information inherent in high-throughput DNA sequencing projects and to predict the explicit functions implicit in DNA sequences through comparative and transitive genetic analysis. Establishing the public resources and software tools to manage the DNA sequence data deposited in the International Nucleotide Sequence Database Collaboration later provided an essential platform for the genome project. Today, such genomic efforts are well established and highly routine although many

challenges exist, such as accurately identifying structural and regulatory genes, including non-coding RNA, and effectively predicting function from sequence. At the same time, all information- or knowledge-driven analyses in biology depend on data type standardization, effective ontologies and other related efforts toward the integration of data of disparate types, with different means of measurement and collection and biological sources, among other sources of variation. The general case for standardization has begun to be addressed (Field and Kyrpides, 2007), but the metagenomics field will need its own efforts beyond what has already been contributed for genomics. A wide range of new data, information and knowledge resources, with more emphasis on metadata than in traditional sequence databases, have already been established and seek novel paths that can bring together the many disparate disciplines that need to mine the experimental observations; the major resources are briefly described in Sec. 13. Metagenomics efforts, in contrast, are not routine or established, and the data themselves are far more complex than genomic data, with a highly discontinuous nature (the observed sequence fragments are not assembled into elements of complete genomes), and the data become of maximum interest only with the inclusion of extensive metadata, that is, information about the sequence data, such as the geospatial location and environmental conditions for the sample. The general case (described in more detail below) for the new challenges introduced in bioinformatics research to deal with metagenomic data can also be considered in a life cycle, workflow or information flow analysis (see Chen and Pachter, 2005; Handelsman, 2007; Tress et al., 2006; Raes et al., 2007) and new software tools for various analyses, including statistical methods, will have to be developed (Rodriguez-Brito et al., 2006).
In brief, DNA sequencing yields fragments which are assembled to increase length and reliability. Genes are predicted and translated in order to do protein-level functional analysis; for example, the proteins can be clustered by similarity metrics such as described for the GOS data. Deeper analyses attempt to provide biological parameters such as the population distribution among species and functionalities, and these are put into the ecological and environmental contexts provided in the metadata. Metagenomics data, however, will have higher error rates due to low sequence coverage (Tress et al., 2006). We expect that newer technology (described above) will continue to improve the rate of sequence acquisition but will also lead to higher error rates. The higher intrinsic error rates and fragmented DNA segments necessitate new ways of thinking about the data, and thus new directions for computational efforts, such as for data models, algorithms and software tools, in order to examine the changes in populations and their role in the environment. The sheer size of the data also means that most research groups will require algorithms that can run efficiently on modest computing resources, that is, can provide protein sequence clustering or other models in a reasonable time. An early such software tool is CD-HIT, which utilizes a series of filters to reduce the size of the input data (Li and Godzik, 2006). As the effective macromolecular physiological processes

in communities are ascertained, metabolic models that generate community-scale metabolic and regulatory networks could be constructed for metagenomic samples, which in turn would allow network reconstructions to predict deeper aspects of community processes, as has been done for E. coli and other cultured microbes (Palsson, 2006). Another difference to which computational methods will have to respond is the uniqueness of most metagenomics data. The complete genome of a cultured laboratory species or even of humans can in principle be obtained again, a technical issue now subject only to costs, but many ecological and environmental observations by metagenomics will be one-time observations. Sampling is already being done repeatedly at individual sites at a series of depths and times, but that does not change the uniqueness of the individual sampling. The relationship of microbes to the ocean’s productivity, to nutrient recycling, to climate processes, as examples, requires a range of information beyond sequence in order to make use of metagenomic data for anything other than the study of macromolecular details. The nature of the environment or a population’s habitat, as well as physical descriptions of sampling parameters (e.g., depth, pressure, pH, temperature, weather/meteorological information, geospatial location, oceanographic information, and perhaps, even the means for sampling) provide the details needed to understand the biological implications; this leads to the central requirement for metadata as part of metagenomic sequence resources. A general question requiring metadata concerns the role of individual species versus the contribution from genes themselves (no matter in what microbe) in the total population in terms of metabolic potential and primary productivity for a community.
A representative list of molecular, ecological and environmental questions that can only be addressed through the use of metadata has been provided in the NRC New Science of Metagenomics (2007) report. For example, do community-specific ORFs track specific biogeochemical gradients? Do gene richness and evenness patterns correlate with environmental characteristics? As such, there are more parameters and phenomena about which the metagenomics community must converge on standards than for genomics per se. Open reading frames can be readily predicted from metagenomic data, but the putative proteins will largely be of unknown function. While only at most a few percent of traditional gene function has been experimentally validated for completely sequenced genomes, only an extremely small proportion of predictions from metagenomics data will ever be experimentally tested; this is particularly true for the time at which an initial sequence deposit would be made. There are already challenges in the way in which traditional annotation of sequences, genes and genomes has been done, since the annotation by the original team or investigator remains primary even when third party annotation is allowed, but the problem will be much more severe for metagenomes, for which almost exclusively third party annotation will be required and that annotation will need to go on for an indefinite period into the future (see Chapter 5 of the NRC

2007 Report; Handelsman, 2007). The most likely solution to such problems would seem to be routine, indefinitely on-going, extensive community engagement through open access editing tools, such as restricted wiki environments, which would allow authored, dated commentary and routine updates and editorial approval. Such approaches have been proposed and adopted for specialized genome databases like the Maize database (Lawrence et al., 2005) and for the structures and functions of proteins solved through the Protein Structure Initiative and the Open Protein Structure Annotation Network (TOPSAN; https://www.topsan.org; see Sec. 14).

9.1. Assembling Whole Genomes from ESS Data

One of the main goals of metagenomic approaches is the assembly of the complete genomes for some or all of the organisms present in a community. The availability of completely sequenced reference genomes from single microbial isolates can aid significantly in this endeavor for closely related environmental genomes via the concept of comparative genome assembly, that is, using a reference genome as a scaffold upon which to “hang” or organize environmentally derived contigs and reads. However, the relatively sparse phylogenetic extent of available reference genomes combined with the dynamic plasticity of microbial genomes makes the use of comparative genome assembly approaches possible only under limited circumstances. ESS data represent a collection of random genome fragments from multiple organisms gathered in a manner that does not depend on laboratory-based cultivation. Consequently, ESS datasets are tremendous resources for the discovery of novel genes, gene products, physiologies, and potentially, novel organisms which are not represented in current genomic databases. In low complexity communities, such as those associated with acid mine drainage, ESS can yield enough overlapping fragments of the dominant organisms to reconstruct near-complete genomes from about 100 Mbp of shotgun sequence. In contrast, Tringe et al. observed that less than 1% of 100 Mbp of sequence generated from their soil library showed overlap with reads from independent clones. They estimated that around 2 × 10^9 bp of sequence would be required to obtain 8-fold coverage (which has been traditionally targeted for draft genome assemblies) for the single most predominant genome (Tringe et al., 2005).
In highly complex communities, genomic fragments from even the most dominant organisms are sufficiently diluted with respect to the total DNA pool such that assembly is precluded and the fragments from minority community members will be represented by singleton sequences alone (i.e., the unassembled reads would not be incorporated into a contig).
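
The relationship between sequencing effort, community complexity and coverage described above can be made concrete with a short calculation. The following is a minimal sketch, assuming that reads are sampled uniformly at random and that a genome receives sequence in proportion to its share of the total DNA pool; the function names, the 5 Mbp genome size and the 2% abundance are hypothetical inputs chosen only to illustrate the arithmetic (they are not taken from Tringe et al.). The covered fraction uses the standard Lander-Waterman approximation.

```python
import math


def genome_coverage(total_bp, dna_fraction, genome_size_bp):
    """Expected fold coverage of one genome in a shotgun metagenome.

    total_bp       -- total sequence generated from the community
    dna_fraction   -- assumed share of the DNA pool belonging to this genome
    genome_size_bp -- assumed size of the genome
    """
    return total_bp * dna_fraction / genome_size_bp


def required_sequence(target_coverage, dna_fraction, genome_size_bp):
    """Total sequence needed to reach a target fold coverage of one genome."""
    return target_coverage * genome_size_bp / dna_fraction


def fraction_covered(coverage):
    """Lander-Waterman estimate of the fraction of bases hit by >= 1 read."""
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical dominant soil genome: 5 Mbp, 2% of the community DNA.
    needed = required_sequence(8, 0.02, 5e6)
    print(f"sequence needed for 8x coverage: {needed:.2e} bp")
    # With only 100 Mbp of sequence, the same genome is barely sampled.
    c = genome_coverage(100e6, 0.02, 5e6)
    print(f"coverage from 100 Mbp: {c:.2f}x, covered: {fraction_covered(c):.0%}")
```

With these made-up inputs the requirement comes to 2 × 10^9 bp, the same order of magnitude as the estimate quoted above, which shows how quickly the sequencing effort grows when even the dominant genome is a small share of the DNA pool.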

9.2. Comparative Metagenomics

Many aspects of a microbial ecological niche or microbial environment can be interpreted without doing the massive amount of sequencing needed to assemble

all members of a community, which is only possible in limited instances. Tringe et al. used singleton sequences that they termed environmental gene tags (EGTs) to identify putative proteins encoded by a microbial community (Tringe et al., 2005). Quantitative gene content analyses of different environmental samples revealed habitat-specific fingerprints that reflect known characteristics of the sampled environments. The identification of environment-specific genes through a gene-centric comparative analysis presents new opportunities for interpreting and diagnosing environments. Estimating the species richness and evenness is one of the biggest challenges in metagenomics. The highly complex community and the difficulty in culturing make it difficult to characterize the species’ richness and the evenness or diversity of population/species distribution within soil. To estimate species’ richness, Schloss and Handelsman developed a computer program, DOTUR, to assign 16S rRNA sequences rapidly to operational taxonomic units (OTUs) by using all possible pairwise distances between sequences (Schloss and Handelsman, 2005a). DOTUR is also useful to assess the completeness of a sequencing effort and the reliability of richness estimates. Estimating the richness of OTUs at the level of 3% sequence distance from each other (which was the cutoff for their definition of a same species), 690 sequences were sufficient for the Sargasso Sea sample while more than 10,000 sequences would be required for a soil community. More recently, the richness of bacteria in 0.5-g soil samples from Alaska and Minnesota were estimated to be 5,000 and 2,000 OTUs, respectively; a subsequent estimate is that only 18,000 sequences would be required for the Alaskan sample. A census of bacteria in an environmental sample is the significant first step in understanding the biology of an ecosystem. However, very few contigs from environmental samples carry 16S rRNA genes. 
For example, only 0.06% in the Sargasso Sea sample, and 0.017% in the Minnesota soil sample, did so. Thus, a large dataset is required to assess the membership of the community (McHardy et al., 2007). A new algorithm called PHACCS (Phage Communities from Contig Spectrum), developed by Rohwer and colleagues, infers genotype richness and evenness in uncultured viral communities by making a contig spectrum from a shotgun sequence library or 454 pyrosequences (Angly et al., 2005). The contig spectrum is a vector containing the number of q-contigs (a group of q overlapping sequences) found by the assembly of the DNA fragments. Since the sizes of the contigs in the contig spectrum reflect the number of copies and the abundance of a genotype, the spectrum provides important information about the abundance and diversity of genotypes within a community. Thus, assembling quality contigs is the most critical step for this mathematical modeling system.
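
The contig spectrum defined above is simple to tabulate once an assembly reports how many reads went into each contig. The sketch below is illustrative only and is not the PHACCS implementation: it assumes the assembly is summarized as a list of reads-per-contig counts (with unassembled reads listed as contigs of size 1), and the function name is hypothetical. The model-fitting step that PHACCS performs on the spectrum is not shown.

```python
from collections import Counter


def contig_spectrum(reads_per_contig):
    """Return [c_1, c_2, ..., c_Q], where c_q is the number of q-contigs.

    reads_per_contig -- iterable of positive integers, one per contig;
                        a singleton (unassembled read) is a 1-contig.
    """
    counts = Counter(reads_per_contig)
    if not counts:
        return []
    largest = max(counts)
    # Position q-1 of the vector holds the number of contigs built from q reads.
    return [counts.get(q, 0) for q in range(1, largest + 1)]


if __name__ == "__main__":
    # Hypothetical assembly: four singletons, two 2-contigs and one 3-contig.
    spectrum = contig_spectrum([1, 1, 1, 1, 2, 2, 3])
    print(spectrum)
    # The spectrum accounts for every read that entered the assembly.
    print(sum(q * c for q, c in enumerate(spectrum, start=1)))
```

A spectrum dominated by singletons indicates a rich, even community, whereas many large contigs point to a few abundant genotypes.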

9.3. Binning ESS Data

The process of categorically assigning environmental sequence data to specific organism types or phylogenetic affiliations is called “binning”. In other words,

binning attempts to assign each individual sequence to the organism from which it is derived. Various methods have been developed and each possesses its own relative merits and shortcomings. One common procedure involves the identification of intrinsic DNA sequence signatures (e.g., the percent GC content, codon usage, sequence coverage, and the frequency of short n-mers of nucleotides) that are potentially unique to a particular species and thus distinguish it from other organisms. The frequency of di-, tri-, tetra-, and n-mer nucleotide sequences can vary sufficiently amongst different species to serve as a basis for discriminating sequences and assigning sequence fragments to bins of similar composition. Ultimately, these nucleotide frequencies are utilized to sort ESS data into clusters approximating singular or similar organisms. While nucleotide composition-based analyses are useful for large genome fragments, their relative efficacy drops precipitously with ever smaller fragments owing to a lack of signal. If the ESS dataset consists of many short fragments or chimeric contigs composed of different species, these approaches will not be useful. Two examples of composition-based classifier programs that have been applied to ESS datasets are TETRA (Teeling et al., 2004) and PhyloPythia (McHardy et al., 2007). A second approach commonly used for binning ESS data relies upon the similarity of environmental sequences to reference sequences of known taxonomic affiliation. These methods simply involve assigning fragments to their closest phylogenetic neighbor based upon nucleotide or coding sequence identity (e.g., via BLAST analysis).
Similarity-based approaches are reliant upon the completeness of the reference database (e.g., GenBank) and thus, due notably to cultivation bias, the representation of sequences from uncultured environmental organisms spanning the range of phylogenetic lineages is minimal, which makes this approach potentially unreliable in many circumstances. To date, there is no metagenome-specific assembler or binning method, although the binning process can be improved by combining different approaches, such as using one method to sort in a relaxed manner and then using another to subdivide the bins provided by the first method (Eisen, 2007). How, then, can the resulting data be evaluated? Mavromatis et al. constructed three simulated datasets (metagenomes) of varying complexity by combining sequencing reads randomly selected from 113 isolate genomes to evaluate the commonly used methods for binning, assembly and gene prediction (Mavromatis et al., 2007). Evaluating the appropriateness and accuracy of currently available tools using reference-simulated metagenomes provides practical information on metagenomics for both the users and the developers of the relevant software tools. The simulated datasets are available at the IMG/M website (see below), at which researchers can check the most recent results, compare their dataset from a given metagenome against the simulated metagenome, and receive guidance as to which are the optimal tools for analysis. ESS data do not represent simply an artificial assemblage or a “bag of genes,” but are made up of subsets or compartments. A well-articulated description of

the challenges and importance of binning for metagenomics has been provided recently (Eisen, 2007). Successful data binning is an essential step in understanding the functional roles of members in a metabolic pathway in a host-symbiont relationship as well as a microbial community. For instance, Eisen and colleagues have successfully binned ESS data into their origin from the two symbionts living in the gut of an insect, which has a diet exceptionally poor in organic nutrients (Wu et al., 2006). They could infer from those data that one of the symbionts synthesizes amino acids for the host while the other synthesizes the needed vitamins and cofactors.
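
The composition-based signatures discussed in this section (GC content and short n-mer frequencies) can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions, not the method used by TETRA or PhyloPythia: it computes raw tetranucleotide frequencies and compares two fragments by Euclidean distance, whereas the published tools apply their own statistics and classifiers. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_frequencies(seq, k=4):
    """Relative frequency of every k-mer over ACGT, in a fixed order.

    Windows containing ambiguous bases (e.g., N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


if __name__ == "__main__":
    # Toy fragments; real fragments would need to be kilobases long
    # for the signal to be meaningful, as noted in the text.
    frag_a = "ATATTTAAATATTTTAAATTATATAAT" * 20
    frag_b = "GCGGCCGCGCCGGGCGCGCCGCGGCGC" * 20
    print(gc_content(frag_a), gc_content(frag_b))
    print(euclidean(kmer_frequencies(frag_a), kmer_frequencies(frag_b)))
```

Fragments with a small distance between their frequency vectors would be candidates for the same bin; the loss of signal on short fragments shows up as noisy, sparse vectors.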

10. The Interconnection of Microbiology Research with Metagenomics

An especially prominent, while perhaps unexpected, impact of metagenomics will, we believe, be bringing microbial ecology into a central position in basic and applied biology, including the practice of clinical medicine. Even traditional evolutionary, environmental and ecological studies have underrepresented microbial populations and the insights and challenges coming from microbial ecology, let alone the failure of medicine until the last few years to recognize that the human body is among our most important ecosystems for study and that the principles of ecology and related deep biological research ideas and philosophies need to be engaged in understanding human wellbeing and the attributes of the human supraorganism. Early steps in a clinical direction have been the recognition of the role of microbes in human energy balance, of microbial contributions to human metabolism, of the influence of microbes on obesity, and other surprising and unexpected findings described above. Microbes have been called the unseen or invisible majority (e.g., Whitman et al., 1998). This invisible majority also had not been widely recognized until very recently for its essential role in many processes; for example, microbes, not plants, contribute the majority of photosynthesis on Earth. More generally, all life on Earth, and every process in the biosphere and throughout the physical world, has been shown (over the past few decades) to involve microbes. This central role has been especially revealed through biological oceanography and marine microbiology. The entire field of microbiology will impact progress in metagenomics and metagenomics will contribute to every subfield of microbiology; a direct interplay will come from NSF’s Microbial Observatories (described in Sec. 14).
In this regard, we provide a suite of overview and technical references on related topics outside the scope of this review in the suggested reading list (Sec. 15). The interdisciplinary breadth and the newness of the ongoing revolution leads to a longer list than typical for the other chapters in this book; the list includes substantive reviews of recent research developments, including most notably, those in biological oceanography, microbial ecology, and applied microbiology, and includes an analysis of Archaeal populations in the ocean.

11. Designing Metagenomics Research Projects

11.1. Science Strategies for Functional and Sequence Metagenomics

11.1.1. Complementary Techniques for Characterizing Metagenomic Samples

The potential from an organized, international effort, presumably for a wide range of model organisms as well as human and for environmental samples as well as microbiomes, presents exciting opportunities while setting a range of expectations and considerations, which have been explored in detail in the NRC study, The New Science of Metagenomics (Handelsman, 2007). We consider a few complementary aspects that would need consideration in any major science community effort. The metagenomics research community distinguishes two fundamental approaches for obtaining information from metagenomic libraries; namely, function-driven analysis and sequence-driven analysis. This distinction was first introduced by Schloss and Handelsman (2003); given the nature of the research, this appears to be a very practical distinction to keep in mind in order to ensure balanced progress within the field. Function-driven analysis starts with the identification of clones that express a desired trait, followed by the characterization of the active clones by sequence and biochemical analysis. This approach is ideal for discovering natural products or proteins that have useful activities in medicine, agriculture, or industry. However, it is also challenging to find a host cell that includes all the genes required for expressing the function of interest and actually expresses the function. A typical example is the soil resistome project (see Sec. 6.1). Sequence-driven analysis uses conserved DNA sequences, such as the 16S rRNA, nif and recA genes, to design hybridization probes or PCR primers to screen metagenomic libraries for clones that contain sequences of interest. Alternatively, metagenomic clones can be analyzed by random sequencing.
The previously cited Sargasso Sea study (Venter et al., 2004) and the reconstructed genome paper (Tyson et al., 2004) fall into this category. More recently, gene distributions over an open-ocean depth gradient in the North Pacific Subtropical Gyre have been characterized as well by using this type of analysis (DeLong et al., 2006; Riesenfeld et al., 2004b).

11.1.2. Habitat Selection

Researchers, of course, select the microbial community in which they are particularly interested. However, choosing well-characterized habitats is a key to success in a metagenomics study. The information about the habitat is useful to create specific hypotheses or questions to ask about specific gene functions in the community. For example, the geochemical condition of the AMD environment was well characterized before the metagenomics study. The insights from the earlier results led the authors to be interested in studying metabolic pathways involved in nitrogen fixation and sulfur and iron oxidation. The extremely acidic condition resulted in a community of very low complexity, which allowed the authors to reconstruct genomes relatively quickly.

The Sargasso Sea was also known to be one of the best-studied and well-characterized low-nutrient waters of the global ocean. The accumulated knowledge about this region/environment motivated Venter and colleagues to select the habitat for a pilot study to interpret the environmental genome in the oceanographic context. The whole-genome shotgun sequencing analysis generated a total of 1.045 billion bp of nonredundant sequence, which has been annotated to more than 1800 genomic species in this single study. Unlike the AMD project, the presence of a highly complex community in the marine environment resulted in only 3% of the sequence being accounted for at threefold (3×) coverage or more, although eightfold (8×) coverage has been traditionally required for draft genome assemblies. These contrasting examples indicate that different habitats require different strategies for the analyses of the sequences observed.

11.1.3. Sampling Challenges

Obtaining representative sampling is another major challenge for a metagenomics study. The type, size, scale, number, and timing of sampling can be selected depending on what one wants to know about which environment. At the same time, many experimental conditions limit what can actually be determined, such as in a general spatial survey versus an extended sampling over time of a given site. If insightful conclusions about a habitat are to be drawn, the samples need to represent the microbial community in the specific habitat, and at the same time, the complexity and the heterogeneity of the community over time and space need to be considered. In many cases, following habitat changes over time is one of the most informative approaches for understanding a community's structure, function and robustness. The researcher will have to ask if each sample captures a one-time snapshot of the environment and how many samples are going to be necessary to represent the many conditions of the community.
An original example of the requirement for careful considerations on sampling is the long-term changes in the oceanic microbial community in the San Pedro Channel, California, conducted by the USC Microbial Observatory (see Sec. 14). The series of monthly observations over 4.5 years showed some taxa clearly had repeatable seasonal patterns in the community composition within the 171 operational taxonomic units of marine bacterioplankton (Fuhrman et al., 2006). These patterns in distribution and abundance of microbial taxa were highly predictable, but significantly influenced by a broad range of both abiotic and biotic factors. An inevitable consequence is that horizontal studies or large scale surveys will have some level of limitations for ecological implications outside of sampling diversity and discovering novelty.

11.2. Study of Natural Habitats Constructed in the Laboratory

While the majority of microorganisms cannot be cultured by standard techniques, uncultured cells can be isolated by creating an optimal growth condition. The AMD

project has found such a case in their discovery that one of the least abundant members of the AMD community, Leptospirillum ferrodiazotrophum, carried the nif operon. Using this insight, they isolated the bacterium from the environment in nitrogen-free liquid culture, in which only nitrogen-fixing bacteria could grow (Tyson et al., 2004).

12. Considerations on the Future Implications of Metagenomics

In the mid 1980s to the mid 1990s, during the origins and early efforts on the human genome project, many biologists objected to such discovery science and to the extensive focus on genome as opposed to genetic research. Careful attention to sustaining individual science projects and balancing such projects with the larger programmatic sequencing efforts, among other attributes of the human genome sequencing initiative, sustained the work through major payoffs. At the onset of the 21st century, every aspect of biology has been transformed and all biologists use genomic methods or information derived from the genomic methods. Few transformations in science have been comparable, although certainly, within microbiology, the improvements of the light microscope and the corresponding evidence for ultra-small life, for microbes, had a similar impact. After decades of focus on eukaryotes and the use of all species, by way of their evolutionary relationships, as models for the study of mankind, the discovery and establishment of a third domain of life, Archaea, brought new interest in microbes, and increasingly, basic and applied microbiology has been undergoing a scientific renaissance. A key step was the recognition that microbes dominate the biomass, are essential for life itself, and touch every process in the biosphere and many in the physical world. Metagenomics, through integrating genomics, bioinformatics and systems biology and the use of a series of technologies to enable research beyond individual species, enables the characterization of populations and communities, and in turn, has provided an extraordinary acceleration of microbiology. Like genome sequencing projects, one can anticipate that the discoveries and novel advances will be routine and even unanticipated.
Already, a deep appreciation of the vast diversity of genes and their proteins has emerged from metagenomics, as well as the recognition of extreme microbial genome complexity and unlimited variation and diversity in microbial life. The research has already opened up serious challenges about the nature of species, the nature of a genome, the relationship between microbes and their hosts, the diversity of life forms, the differences between community processes and evolution and those of laboratory species, and the role of viruses, among many others. The NRC Study, The New Science of Metagenomics: Revealing the Secrets of our Microbial Planet (Handelsman, 2007), provides, in a chapter entitled “Epilogue,” a detailed and carefully considered analysis and prediction of what the next twenty years will bring for metagenomics, and of how metagenomics will influence basic and applied biology in numerous ways, from research in ecology and evolution to drug discovery, bioenergy and human healthcare. Anyone interested in pursuing metagenomics further would do well to start with the entire report, but the epilogue provides a very succinct explanation of why there is such excitement and what extraordinary progress can be expected. In sum, the report argues that metagenomics will become a concept-driven computational and experimental biology, one that will represent the “systems biology of the most inclusive biological system we know about: the biosphere of the planet.” Just as genomics turned out not to be about a scale of efficiency in research but about a new and radical way of asking questions about life, looking at a generalized and integrated view of processes, proteins and genes, the report notes that metagenomics will provide a radical new step in asking questions about life, having fused large scale “omics” research with more traditional disciplines such as environmental and clinical microbiology, bioengineering, theoretical ecology, and so on (as per discussions earlier in the text). The disciplines will be transformed, and questions will focus on a level below (genes and genomes) or above (communities and ecosystems) the current focus on organisms and species. The report further argues that biologists will “understand ecosystems in terms of the collective activities and interactions of the genes they contain, how they are distributed and expressed in space and time and how they function together.” Major technical advances are expected for sequencing, transcriptomic and proteomic applications, and in culturing what have previously been unculturable microbes. Similarly, a transition from an experimental science to a computational one is predicted for microbial ecology.
The report, while admitting that predicting what we will know about the biosphere in twenty years would be an exercise in science fiction, outlines some guesses as to progress expected in understanding viruses, cells and genomes, species, biogeography, and community structure and function and interactions within and between communities. An improved ability to communicate to the public and teach microbiology and metagenomics even in the K-12 curriculum is suggested as another important advance that can certainly be expected. Similarly, advances for applied sciences include the use of metagenomics for earth, life and biomedical science and agriculture, in which fundamental knowledge will open up new processes from remediation to preventive medicine. Besides the inevitable expectation for new antibiotics and treatment and diagnosis strategies and methods, particularly striking and hopeful is their assertion that the health field will transition to a focus on maintaining wellness given advances in probiotic therapy and an enhanced ability to manage chronic inflammatory and infectious disease. Bioenergy and novel pathways, bioremediation and soil microbiology, biotechnology and green chemistry, and biodefense and microbial forensics are also examined for likely impact. While the specific research trajectories cannot be projected in detail, and it is not possible to do better than the group of experts assembled for the NRC study, so many advances lie ahead that we hope the readers will see how their own expertise can contribute.

Toward engaging the readers in research on metagenomics, we provide our own suggestions, following the existing research contributions in the field and the new directions and goals described above in the chapter; we expect that along with yielding many paradigm shifts in the conduct of research and in our understanding of life on earth, metagenomics most notably will drive and/or succeed in the following:

• Ascertaining the extent to which macroscopic ecology principles describe microbial processes;
• Characterizing populations and communities at the level of detail previously only available for defined laboratory cultures;
• Providing a new perspective on symbiosis, mutualism, commensalism and parasitism, not just for the microbial populations but also in the context of their eukaryotic hosts;
• Characterizing microbiomes for a selected set of multicellular eukaryotes by way of microbial ecological principles, as these principles emerge or are extended from extant knowledge;
• Defining the similarities and differences between physically-constrained or environmental metagenomic communities and biologically-constrained or microbiome metagenomic communities, and using those comparative attributes in an integrated fashion in order to advance applied life sciences;
• Establishing the role of microbes in eukaryotic wellness and disease, which must include evolutionary and ecological understanding through characterizing microbe-microbe and microbe-host interactions and the role of the host immune system;
• Intervening directly in the human microbiome to detect and prevent disease processes such as colon cancer, including the slowing or prevention of some aspects of aging;
• Providing answers to the controversies around a potential web for microbial evolution and a tree for eukaryotic evolution, as well as ascertaining the specific extent and implications of lateral gene transfer;
• Changing how the evolutionary history of life is viewed, in that all multicellular organisms, all animals and all plants, are supraorganisms and as such, evolutionary studies, including modeling/simulations, must simultaneously take into account both host and microbiota evolution.

The opportunities presented in understanding the ecology of the human microbiome/the human ecosystem and establishing a truly personalized medicine, the role of marine and soil and other microbes in the “homeostasis” of the Earth, and the reorientation of research on the “tree of life” to include symbiont co-evolution, seem to us especially profound consequences of what is being called the new science of metagenomics. Some larger projects that aim in these directions are described in Sec. 13.


13. Overview of Large Metagenomics Projects for Data Delivery and Knowledge Management

13.1. CAMERA: Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis

The aim of this project is to serve the needs of the microbial ecology research community by creating a rich, distinctive data repository and a bioinformatics tools resource that will address many of the unique challenges of metagenomic analysis. More generally, the project will help to build the metagenomics community and develop a suite of software tools and computing resources enabling the entire scientific community to use the rapidly growing treasure of metagenomic data and, in particular, their associated metadata. The project also involves ab initio research efforts to develop new tools that will aid in the interpretation and presentation of environmental genome sequence data. The first version of the CAMERA website, launched in March 2007, provides three data sets: all the metagenomic data being collected by the J. Craig Venter Institute’s “Sorcerer II” Global Ocean Sampling (GOS) expedition; a large-scale metagenomic survey of marine viral organisms collected from sites around the North American continent by Forest Rohwer and his research team at San Diego State University; and a vertical profile of marine microbial communities collected at the Hawaii Ocean Time-Series (HOTS) station ALOHA by Ed DeLong and his research team at MIT.

13.2. C-MORE: The Center for Microbial Oceanography: Research and Education

The C-MORE project is led by David Karl at the University of Hawaii at Manoa and is composed of experts in microbial biology and oceanography from a variety of institutions. C-MORE research is organized around four interconnected themes: microbial biodiversity; metabolism and C-N-P-energy flow; remote and continuous sensing and links to climate variability; and ecosystem modeling, simulation and prediction. Databases such as the nascent genomic depth series, linked to ongoing oceanographic time series data, serve as a foundation, resource, and springboard for further study of microbial biodiversity and function in the ocean ecosystem. The Center has initiated a partnership with CAMERA to make C-MORE metagenomic datasets, together with their associated metadata, available to the broader community.

13.3. ICoMM: International Census of Marine Microbes

There is, to date, one international metagenomics project. The International Census of Marine Microbes (ICoMM), led by Mitchell Sogin, seeks to facilitate the inventory of marine microbial diversity. This project is developing a strategy to catalog all known diversity of single-cell organisms, that is, Bacteria, Archaea, Protista and their associated viruses, in order to explore and characterize the extent and details of as yet unknown microbial diversity, and to place that knowledge into appropriate ecological and evolutionary contexts. ICoMM emphasizes international collaborations, and forges linkages with existing and new field projects for collecting samples and contextual information and for testing and implementing new technologies. ICoMM employs a 454 pyrosequencer to accelerate extensive sampling of microbial populations (Sogin et al., 2006). Sequence data from PCR amplicons spanning the V6 hypervariable region of 16S rRNA genes allow the measurement of both the relative abundance and the diversity of community members. ICoMM data are available at MICROBIS (see Sec. 14).
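As a hedged illustration of how amplicon tag counts translate into relative abundance and diversity, the following minimal Python sketch tallies identical tag sequences, converts the tallies to relative abundances, and computes the Shannon index. The function names and the toy tag strings are hypothetical, and the Shannon index is simply one common diversity measure chosen here for illustration; it is not necessarily the statistic used by ICoMM.

```python
import math
from collections import Counter


def count_tags(reads):
    """Tally identical tag sequences (hypothetical stand-ins for V6 tags)."""
    return Counter(reads)


def relative_abundance(tag_counts):
    """Return the fraction of all reads contributed by each distinct tag."""
    total = sum(tag_counts.values())
    if total == 0:
        raise ValueError("no tags were counted")
    return {tag: n / total for tag, n in tag_counts.items()}


def shannon_index(tag_counts):
    """Shannon diversity H = -sum(p * ln p) over the observed tags."""
    freqs = relative_abundance(tag_counts).values()
    return -sum(p * math.log(p) for p in freqs if p > 0)


if __name__ == "__main__":
    # Toy input: four reads carrying three distinct (made-up) tags.
    counts = count_tags(["AAC", "AAC", "GTT", "CCA"])
    print(relative_abundance(counts))
    print(shannon_index(counts))
```

A community dominated by a single tag yields an index near zero, whereas evenly distributed tags yield the maximum value, ln(number of distinct tags), for that richness.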

13.4. IMG/M: An Experimental Metagenome Data Management and Analysis System

The Department of Energy’s Joint Genome Institute (JGI) developed IMG/M based on the Integrated Microbial Genomes (IMG) platform, a set of data management and analysis tools for use with cultured, fully sequenced microbial genomes (Markowitz et al., 2006a; Markowitz et al., 2006b). IMG/M provides tools and viewers for analyzing both metagenomes and isolate genomes individually or in a comparative context. Version 2.0 of the IMG system includes a total of 2,301 isolate genomes consisting of 595 bacterial, 32 archaeal, 13 eukaryotic, and 1,661 virus genomes. In addition, IMG/M includes three of the simulated metagenome data sets employed for benchmarking several methods for assembly, gene prediction, and binning (see FAMeS in Sec. 14).

13.5. Megx.net: Database Resources for Marine Ecological Genomics

Megx.net was developed by the Microbial Genomics Group at the Max Planck Institute for Marine Microbiology in Bremen, Germany (Lombardot et al., 2006). It consists of marine microbial genome databases and tools for genomic and metagenomic analysis in their environmental contexts. Megx.net includes (i) a geographic information system for systematically storing and analyzing marine genomic and metagenomic data in conjunction with contextual information; (ii) an environmental genome browser with fast search functionalities; (iii) a database with pre-computed analyses for selected complete genomes; and (iv) a database and tool to bin metagenomic fragments based on oligonucleotide signatures.
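To make the idea of binning by oligonucleotide signature concrete, here is a minimal Python sketch under stated assumptions: each sequence is summarized by its normalized tetranucleotide (k = 4) frequencies, and a fragment is assigned to the reference whose signature lies closest in Euclidean distance. The function names, the toy sequences, and the choice of distance are all hypothetical; this is a generic illustration of the technique and not the algorithm implemented in the Megx.net tool.

```python
import math
from itertools import product


def kmer_signature(seq, k=4):
    """Normalized frequency of every DNA k-mer, in fixed lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skips words containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence too short or contains no valid k-mers")
    return [counts[w] / total for w in kmers]


def euclidean(a, b):
    """Euclidean distance between two signatures of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_bin(fragment, references, k=4):
    """Return the name of the reference with the nearest k-mer signature."""
    sig = kmer_signature(fragment, k)
    return min(
        references,
        key=lambda name: euclidean(sig, kmer_signature(references[name], k)),
    )


if __name__ == "__main__":
    # Toy references with deliberately extreme composition (illustrative only).
    refs = {
        "at_rich": "ATATATATTTAAATATATAT" * 5,
        "gc_rich": "GCGCGGCCGCGCCGGCGC" * 5,
    }
    print(assign_bin("ATATTTAAATATTA", refs))
```

Real fragments are far longer than these toy strings; short fragments give noisy signatures, which is one reason composition-based binning is usually applied to longer contigs.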

13.6. SEED: The Subsystems Approach to Genome Annotation

The original SEED Project was started in 2003 by the Fellowship for Interpretation of Genomes (FIG) as a largely unfunded open source effort to annotate 1000 microbial genomes (Overbeek et al., 2005). Argonne National Lab and the University of Chicago joined the project, and now much of the activity occurs at those two institutions as well as the University of Illinois at Urbana-Champaign, Hope College, San Diego State University, the Burnham Institute, and a number of other institutions. The cooperative effort focuses on the development of the comparative genomics environment called the SEED and, more importantly, on the development of curated genomic data. This project has a specific concept of how to approach high-throughput annotation: the effort is organized around subsystem experts, i.e., individuals who master the details of a specific biological subsystem (e.g., operons or components of a metabolic pathway) and then analyze and annotate the genes that make up that given subsystem over the entire collection of genomes available. The initial release of data includes 180,177 distinct proteins with 2133 distinct functional roles, which came from 173 subsystems and 383 different organisms.

13.7. Microbial Observatories (MOs)

The Microbial Observatories explore the microbial processes occurring in different natural habitats in order to understand microbial diversity and interactions over time and across environmental gradients. Scientists leverage advances in molecular biology, genomics, metagenomics and cultivation technologies to discover and characterize novel microorganisms, microbial consortia, communities, activities and other novel properties, and to study their roles in diverse environments. Some MOs now established are listed in Table 1.

Table 1.

| Microbial Observatory | Website | Site | Environment |
|---|---|---|---|
| Alaskan Soil: A Cold Microbial Observatory | www.plantpath.wisc.edu/fac/joh/mo.htm | Bonanza Creek Experimental Forest, AK | Soil |
| Alpine Microbial Observatory (AMO) | amo.colorado.edu | Rocky Mountains, CO | Soil |
| Anaerobic Bacteria and Methanogens in Northern Peatland Ecosystems | www.micro.cornell.edu/cals/micro/research/nsf-observatories/yavitt-zinder/index.cfm | North American Peatland (NY, IL, MI) | Peatland/wetland/marsh |
| Cabo Rojo Salterns Microbial Observatory | www1.uprh.edu/salterns/index.html | Cabo Rojo Salterns, Puerto Rico | Hypersaline |
| Cedar Creek Microbial Observatory | www.cedarcreek.umn.edu/microbo/index.htm | Cedar Creek, MN | Soil |
| Center for Microbial Oceanography (C-MORE) | cmore.soest.hawaii.edu | Hawaii ocean, HI | Marine |
| Duke Forest Mycological Observatory | www.biology.duke.edu/fungi/mycolab/DFMO.html | Duke Forest, NC | Soil |
| Methylotrophic Microbial Observatory | depts.washington.edu/microobs/ | Lake Washington, WA | Lake |
| International Census of Marine Microbes (ICoMM) | icomm.mbl.edu | International sea | Marine |
| Itasca State Park Microbial Observatory | www.ndsu.nodak.edu/instruct/fawley/coccoids/Itasca/itascadata.htm | Itasca State Park, MN | Lake |
| Kamchatka Microbial Observatory | www.gly.uga.edu/kamchatka/ | Kamchatka Peninsula, Russia | Hot spring |
| Iron Microbial Observatory at Loihi Volcano (FeMO) | www.usc.edu/dept/LAS/biosci/Edwards lab/Research/FeMO.html | Hawaiian Seamount, HI | Volcano |
| McMurdo Dry Valley Lakes Microbial Observatory | www.mcmlter.org/index.html | McMurdo Dry Valleys, Antarctica | Lake |
| Microbial Diversity in Lakes of the Hawaiian Archipelago | www.hawaii.edu/microbiology/MO/ | Hawaiian Archipelago, HI | Lake |
| Microbial Diversity of Prokaryotes in Marine Sponges of the Class Demospongiae | serc.carleton.edu/microbelife/microbservatories/marinesponges/index.html | Conch Reef, Key Largo, FL | Marine |
| Microbial Observatory at the HJ Andrews LTER | cropandsoil.oregonstate.edu/HJA mo/default.html | HJ Andrews Experimental Forest, OR | Ectomycorrhizal mats |
| Microbial Observatory at Zodletone Spring | facultystaff.ou.edu/K/Lee.R.Krumholz-1/nsfzodletonepage03.html | Zodletone Mountain, OK | Sulfur/methane springs |
| Microbial Observatory for Virioplankton Ecology (MOVE) | www.dbi.udel.edu/MOVE/MO index.htm | Chesapeake Bay | Marine |
| Mono Lake Microbial Observatory | www.monolake.uga.edu/index.htm | Mono Lake, CA | Alkaline/saline lake |
| Monterey Bay Coastal Ocean Microbial Observatory | www.tigr.org/tdb/MBMO/MBMO.shtml | Monterey Bay, CA | Marine |
| Nevada Hot Springs Microbial Observatory | www.uga.edu/srel/Nevada Hot Springs/index.htm | Great Basin valleys, NV | Hot spring |
| North Inlet Microbial Observatory (NIMO) | www.nimo-sc.org | North Inlet Estuary in Georgetown, SC | Salt marsh |
| North Temperate Lakes Microbial Observatory | microbes.limnology.wisc.edu/ | North Temperate Lakes, WI | Lake |
| Nyack Microbial Observatory Project | dbs.umt.edu/facilities/nyack observatory/people.htm | Nyack Flood Plains, MT | Flood plain |
| Contaminated Aquifer Microbial Observatory | serc.carleton.edu/microbelife/microbservatories/aquifer/index.html | South Glens Falls, NY | Contaminated aquifer |
| Oceanic Microbial Observatory | www.lifesci.ucsb.edu/~carlson/index.html | Northwestern Sargasso Sea | Marine |
| Plum Island Estuary Microbial Observatory | ecosystems.mbl.edu/PIMO/ | Plum Island Estuary, MA | Salt marsh |
| Prairie Restoration Impacts on Soil Microbial Communities at Illinois Nachusa Grassland | jove.geol.niu.edu/faculty/lenczewski/Nachusa.html | Nachusa Grassland, IL | Pedosphere and soil |
| Red Layer Microbial Observatory | www.wou.edu/~boomers/research/allresearch.html | Yellowstone National Park, ID, MT, WY | Hot spring |
| RUI Microbial Observatory at Soap Lake | www.cwu.edu/~pinkarth | Soap Lake, WA | Alkaline/saline lake |
| Salt Plains Microbial Observatory | www.okstate.edu/artsci/SPMO/ | Salt Plains Nat. Wildlife Refuge, OK | Hypersaline |
| San Salvador Microbial Observatory | www.unc.edu/ims/paerllab/research/ansalmo/index.htm | San Salvador, Bahamas | Hypersaline |
| Sapelo Island Microbial Observatory (SIMO) | simo.marsci.uga.edu/index.htm | Sapelo Island, GA | Salt marsh |
| Sulfide Mineral Weathering Acid Mine Drainage Research | seismo.berkeley.edu/~jill/amd/AMDresearch.html#molecular | Iron Mountain Mine, CA | Acid mine drainage |
| The USC Microbial Observatory | www.usc.edu/dept/LAS/biosci/Caron lab/MO/ | San Pedro Channel, CA | Marine |
| Tropical Caterpillars Microbial Observatory (MOCAT) | alrlab.pdx.edu/research/mocat/ | Area de Conservacion Guanacaste (ACG), Costa Rica | Caterpillar gut |
| Yellowstone Thermal Viruses | serc.carleton.edu/microbelife/yellowstone/index.html | Yellowstone National Park, ID, MT, WY | Hot spring |

14. Online Resources

AMD project: http://seismo.berkeley.edu/~jill/amd/AMDhome.html/
CAMERA: http://camera.calit2.net/
C-MORE: http://cmore.soest.hawaii.edu/index.htm/
FAMeS: http://fames.jgi-psf.org/
Human Microbiome Project (HMP): http://nihroadmap.nih.gov/hmp/
ICoMM: http://icomm.mbl.edu/
IMG/M: http://img.jgi.doe.gov/
Layers of Symbiosis: http://www.jove.com/index/Details.stp?ID=197/
MaizeGDB (Maize Genome Database): http://www.maizegdb.org/
Megx.net: http://www.megx.net/
Metagenomics: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=metagenomics/
Metagenomics and Our Microbial Planet: http://dels.nas.edu/metagenomics/
MICROBIS: http://icomm.mbl.edu/microbis/
RDP-II: http://rdp.cme.msu.edu/
SEED: http://www.theseed.org/wiki/Main Page/
TOPSAN (The Open Protein Structure Annotation Network): https://www.topsan.org/
USC Microbial Observatory: http://www.usc.edu/dept/LAS/biosci/Caron lab/MO/

15. Further Reading

Special Journal Issues Dedicated to Metagenomics and Related Research:

PLoS Biology, 5, March 2007, Special Issue on “the Ocean Metagenomics Collection”
Oceanography, 20, June 2007, Special Issue on “A Sea of Microbes”
Nature Reviews Microbiology, 5, October 2007, Special Issue on “Marine Microbiology”

Extensive Overviews of the Scientific Opportunities:

A Report from the American Academy of Microbiology: “Microbial Genomes: Blueprints for Life,” ed. Relman D and Strauss E, 1999.
National Science Foundation OCE Directorate Report: “Ecological Genomics: The Application of Genomic Sciences to Understanding the Structure and Function of Marine Ecosystems,” ed. Cary C and Chisholm S, 2000.
National Science Foundation BIO Directorate Report: “The Microbe Project,” ed. Fraser CM and Wooley J, 2002.
A Report from the American Academy of Microbiology: “Marine Microbial Diversity: the Key to Earth’s Habitability,” ed. Hunter-Cevera J, Karl D and Buckley M, 2005.
A Report from the American Academy of Microbiology: “Probiotic Microbes: The Scientific Basis,” ed. Walker R and Buckley M, 2006.
National Research Council: “Understanding our Microbial Planet: The New Science of Metagenomics,” 2007.
National Research Council Study Report: “The New Science of Metagenomics: Revealing the Secrets of our Microbial Planet,” co-chairs Handelsman J and Tiedje J, 2007.

Selected Technical Papers and Reviews, Beyond the Scope of This Chapter:

Abulencia CB et al. (2006) Environmental Whole-Genome Amplification to Access Microbial Populations in Contaminated Sediments. Applied and Environmental Microbiology 72: 3251–3301.
Azam F, Malfatti F (2007) Microbial structuring of marine ecosystems. Nature Reviews Microbiology 5: 782–791.
Cullen JJ et al. (2007) Patterns and Predictions in Microbial Oceanography. Oceanography 20: 34–46.
DeLong EF, Karl DM (2005) Genomic perspectives in microbial oceanography. Nature 437: 336–342.
DeLong EF et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.
DeLong EF (2007) Microbial Domains in the Ocean: A Lesson from Archaea. Oceanography 20: 124–129.
Hallam SJ et al. (2006) Pathways of carbon assimilation and ammonia oxidation suggested by environmental genomic analyses of marine Crenarchaeota. PLoS Biol 4: e95.
Hood RR et al. (2007) Modeling and Prediction of Marine Microbial Populations in the Genomic Era. Oceanography 20: 155–165.
Jorgensen BB, Boetius A (2007) Feast and famine: microbial life in the deep sea bed. Nature Reviews Microbiology 5: 770–781.
Karl DM (2007) Microbial Oceanography: Paradigms, Processes and Promise. Nature Reviews Microbiology 5: 759–769.
McCutcheon JP, Moran NA (2007) Parallel genomic evolution and metabolic interdependence in an ancient symbiosis. Proc Natl Acad Sci USA 104: 19392–19397.
Moran MA, Miller WL (2007) Resourceful heterotrophs make the most of light in the coastal ocean. Nature Reviews Microbiology 5: 792–800.
Moran MA, Armbrust EV (2007) Genomes of Sea Microbes. Oceanography 20: 47–55.
Pomeroy LR et al. (2007) The Microbial Loop. Oceanography 20: 28–33.
Pushker R et al. (2005) Micro-Mar: a database for dynamic representation of marine microbial biodiversity. BMC Bioinformatics [electronic resource] 6: 222.
Worden AZ, Cuvelier ML, Bartlett DH (2006) In-depth analyses of marine microbial community genomics. Trends in Microbiology 14: 331–336.

Acknowledgments

The authors wish to thank the Gordon and Betty Moore Foundation for their support and for enabling the CAMERA project described in Sec. 13. Since the implementation of CAMERA and our direct engagement in community cyberinfrastructure and community building over the past couple of years, our ideas about and understanding of metagenomics have also benefited greatly from our CAMERA colleagues and from the speakers and participants in our International Metagenomics Congresses in 2006 and 2007.

REFERENCES

Abascal F, Zardoya R, Posada D (2006) GenDecoder: genetic code prediction for metazoan mitochondria. Nucleic Acids Res 34(Web Server issue):W389–W393.
Abe H, Abo T, Aiba H (1999) Regulation of intrinsic terminator by translation in Escherichia coli: transcription termination at a distance downstream. Genes Cells 4(2):87–97.
Abulencia CB, Wyborsky DL, Garcia JA, Podar M, Chen W, Chang SH, Chang HW, Watson D, Brodie EL, Hazen TC, Keller M (2006) Environmental Whole-Genome Amplification to Access Microbial Populations in Contaminated Sediments. Applied and Environmental Microbiology 72:3251–3301.
Acquisti C, Kleffe J, Collins S (2007) Oxygen content of transmembrane proteins over macroevolutionary time scales. Nature 445(7123):47–52.
Actis LA, Tolmasky ME, Crosa JH (1999) Bacterial plasmids: replication of extrachromosomal genetic elements encoding resistance to antimicrobial compounds. Front Biosci 4:D43–D62.
Adachi T, Mizuuchi M, Robinson EA, Appella E, O’Dea MH, Gellert M, Mizuuchi K (1987) DNA sequence of the E. coli gyrB gene: application of a new sequencing strategy. Nucleic Acids Res 15:771–784.
Adelman JL, Jeong YJ, Liao JC, Patel G, Kim DE et al. (2006) Mechanochemistry of transcription termination factor Rho. Mol Cell 22(5):611–621.
Adhya S (2003) Suboperonic regulatory signals. Sci STKE (185):pe22.
Adhya SL, Shapiro JA (1969) The galactose operon of E. coli K-12. I. Structural and pleiotropic mutations of the operon. Genetics 62:231–247.
Agabian N (1990) Trans splicing of nuclear pre-mRNAs. Cell 61(7):1157–1160.
Agris PF, Vendeix FA, Graham WD (2007) tRNA’s wobble decoding of the genome: 40 years of modification. J Mol Biol 366(1):1–13.
Ahmed A (1985) A rapid procedure for DNA sequencing using transposon-promoted deletions in Escherichia coli. Gene 39:305–310.
Akashi H (2001) Gene expression and molecular evolution. Curr Opin Genet Dev 11:660–666.
Akashi H (2003) Metabolic economics and microbial proteome evolution. Bioinformatics 19 Suppl 2:II15.
Akerley BJ, Rubin EJ et al. (2002) A genome-scale analysis for identification of genes required for growth or survival of Haemophilus influenzae. Proc Natl Acad Sci USA 99(2):966–971.
Aki T, Adhya S (1997) Repressor induced site-specific binding of HU for transcriptional regulation. EMBO J 16(12):3666–3674.
Akman L, Yamashita A, Watanabe H, Oshima K, Shiba T, Hattori M, Aksoy S (2002) Genome sequence of the endocellular obligate symbiont of tsetse flies, Wigglesworthia glossinidia. Nat Genet 32:402–407.


Alavi SM, Poussier S, Manceau C (2007) Characterization of ISXax1, a novel insertion sequence restricted to Xanthomonas axonopodis pv. phaseoli (variants fuscans and non-fuscans) and Xanthomonas axonopodis pv. vesicatoria. Appl Environ Microbiol 73:1678–1682.
Albert I, Albert R (2004) Conserved network motifs allow protein-protein interaction prediction. Bioinformatics 20(18):3346–3352.
Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P (2002) The molecular biology of the cell, 4th ed. Garland Press, New York, NY.
Alex LA, Simon MI (1994) Protein histidine kinases and signal transduction in prokaryotes and eukaryotes. Trends Genet 10(4):133–138.
Alfieri R, Mosca E, Merelli I, Milanesi L (2007) Parameter estimation for cell cycle ordinary differential equation (ODE) models using a grid approach. Stud Health Technol Inform 126:93–102.
Alkema WB, Lenhard B, Wasserman WW (2004) Regulog analysis: detection of conserved regulatory networks across bacteria: application to Staphylococcus aureus. Genome Res 14(7):1362–1373.
Allen EE, Banfield JF (2005) Community genomics in microbial ecology and evolution. Nature Reviews Microbiology 3:489–498.
Allen EE, Tyson GW, Whitaker RJ, Detter JC, Richardson PM, Banfield JF (2007) Genome dynamics in a natural archaeal population. Proc Natl Acad Sci USA 104:1883–1888.
Alm E, Huang K, Arkin A (2006) The evolution of two-component systems in bacteria reveals different strategies for niche adaptation. PLoS Comput Biol 2(11):e143.
Almagor H (1985) Nucleotide distribution and the recognition of coding regions in DNA sequences: an information theory approach. J Theor Biol 117:127–136.
Almeida JS (2002) Predictive non-linear modeling of complex data by artificial neural networks. Curr Opin Biotechnol 13(1):72–76.
Almeida JS, Voit EO (2003) Neural-network-based parameter estimation in S-system models of biological networks. Genome Inform 14:114–123.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410.
Alves R, Antunes F, Salvador A (2006) Tools for kinetic modeling of biochemical networks. Nat Biotechnol 24:667–672.
Alwine JC, Kemp DJ, Stark GR (1977) Method for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes. Proc Natl Acad Sci USA 74(12):5350–5354.
Amann RI, Ludwig W, Schleifer KH (1995) Phylogenetic identification and in-situ detection of individual microbial-cells without cultivation. Microbiological Reviews 59:143–169.
Amirnovin R (1997) An analysis of the metabolic theory of the origin of the genetic code. J Mol Evol 44(5):473–476.
Andachi Y, Yamao F, Iwami M, Muto A, Osawa S (1987) Occurrence of unmodified adenine and uracil at the first position of anticodon in threonine tRNAs in Mycoplasma capricolum. Proc Natl Acad Sci USA 84(21):7398–7402.


Anderson L, Seilhamer J (1997) A comparison of selected mRNA and protein abundances in human liver. Electrophoresis 18(3–4):533–537.
Andersson AF, Lundgren M, Eriksson S, Rosenlund M, Bernander R et al. (2006) Global analysis of mRNA stability in the archaeon Sulfolobus. Genome Biol 7(10):R99.
Andersson JO, Andersson SGE (1999) Genome degradation is an ongoing process in Rickettsia. Mol Biol Evol 16:1178–1191.
Andersson SGE, Zomorodipour A, Andersson JO, Sicheritz-Ponten T, Alsmark UCM, Podowski RM, Naslund AK, Eriksson AS, Winkler HH, Kurland CG (1998) The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396:133–140.
Angly F, Rodriguez-Brito B, Bangor D, McNairnie P, Breitbart M, Salamon P, Felts B, Nulton J, Mahaffy J, Rohwer F (2005) PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics 6:41.
Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, Chan AM, Haynes M, Kelley S, Liu H, Mahaffy JM, Mueller JE, Nulton J, Olson R, Parsons R, Rayhawk S, Suttle CA, Rohwer F (2006) The marine viromes of four oceanic regions. PLoS Biol 4:e368.
Anonymous A FGENESB Suite of Bacterial Operon and Gene Finding Programs. http://www.softberry.com/berry.phtml.
Apic G, Gough J, Teichmann SA (2001) An insight into domain combinations. Bioinformatics 17 Suppl 1:S83–S89.
Apic G, Huber W, Teichmann SA (2003) Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J Struct Funct Genomics 4(2–3):67–78.
Aras RA, Kang J, Tschumi AI, Harasaki Y, Blaser MJ (2003) Extensive repetitive DNA facilitates prokaryotic genome plasticity. Proc Natl Acad Sci USA 100:13579–13584.
Aravind L, Anantharaman V, Balaji S, Babu MM, Iyer LM (2005) The many faces of the helix-turn-helix domain: transcription regulation and beyond. FEMS Microbiol Rev 29(2):231–262.
Archetti M, Di Giulio M (2007) The evolution of the genetic code took place in an anaerobic environment. J Theor Biol 245(1):169–174.
Argaman L, Hershberg R, Vogel J, Bejerano G, Wagner EG, Margalit H, Altuvia S (2001) Novel small RNA-encoding genes in the intergenic regions of Escherichia coli. Curr Biol 11:941–950.
Arnez JG, Moras D (1997) Structural and functional considerations of the aminoacylation reaction. Trends Biochem Sci 22(6):211–216.
Arnold M (2006) Evolution through genetic exchange. Oxford, Great Britain, Oxford University Press.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29.
Asthana S, King OD, Gibbons FD, Roth FP (2004) Predicting protein complex membership using probabilistic network reliability. Genome Res 14(6):1170–1175.
Audic S, Claverie JM (1998) Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci USA 95:10026–10031.
Avery OT, MacLeod CM, McCarty M (1944) Studies on the chemical nature of the substance inducing transformation of pneumococcal types: induction of transformation by a desoxyribonucleic acid fraction isolated from pneumococcus type III. J Exp Med 79(2):137–158.


Azad RK, Borodovsky M (2004a) Effects of choice of DNA sequence model structure on gene identification accuracy. Bioinformatics 20:993–1005. Azad RK, Borodovsky M (2004b) Probabilistic methods of identifying genes in prokaryotic genomes: connections to the HMM theory. Brief Bioinform 5:118–130. Azad RK, Lawrence JG (2005) Use of artificial genomes in assessing methods for atypical gene detection. PLoS Comput Biol 1(6):e56. Azad RK, Lawrence JG (2007) Detecting laterally-transferred genes: use of entropic clustering methods and genome position. Nucleic Acids Res 35:4629–4639. Azam F, Malfatti F (2007) Microbial structuring of marine ecosystems. Nature Reviews 5:782–791. Azam F, Worden AZ (2004) Oceanography. Microbes, molecules, and marine ecosystems. Science 303:1622–1624. Babitzke P (1997) Regulation of tryptophan biosynthesis: trp-ing the TRAP or how Bacillus subtilis reinvented the wheel. Mol Microbiol 26(1):1–9. Babitzke P (2004) Regulation of transcription attenuation and translation initiation by allosteric control of an RNA-binding protein: the Bacillus subtilis TRAP protein. Curr Opin Microbiol 7(2):132–139. Bachellier S, Clément JM, Hofnung M (1999) Short palindromic repetitive DNA elements in enterobacteria: a survey. Res Microbiol 150:627–639. Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hogue CW (2001) BIND–The Biomolecular Interaction Network Database. Nucleic Acids Res 29:242–245. Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4:2. Bader JS, Chaudhuri A, Rothberg JM, Chant J (2004) Gaining confidence in high-throughput protein interaction networks. Nat Biotechnol 22(1):78–85. Badger JH, Olsen GJ (1999) CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol 16:512–524. Bahl LR, Jelinek F, Mercer RL (1983) A maximum likelihood approach to continuous speech recognition. IEEE Trans Pattern Anal Machine Intell 5:179–190. 
Bailey TL, Elkan C (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2:28–36. Bailey TL, Gribskov M (1998) Methods and statistics for combining motif match scores. J Comput Biol 5:211–221. Bajad SU, Lu W, Kimball EH, Yuan J, Peterson C, Rabinowitz JD (2006) Separation and quantitation of water soluble cellular metabolites by hydrophilic interaction chromatography-tandem mass spectrometry. J Chromatogr A 1125:76–88. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294(5540):93–96. Balazsi G, Barabasi AL, Oltvai ZN (2005) Topological units of environmental signal processing in the transcriptional regulatory network of Escherichia coli. Proc Natl Acad Sci USA 102(22):7841–7846. Balazsi G, Oltvai ZN (2005) Sensing your surroundings: how transcription-regulatory networks of the cell discern environmental signals. Sci STKE 2005(282):pe20. Baldi P (2000) On the convergence of a clustering algorithm for protein-coding regions in microbial genomes. Bioinformatics 16:367–371. Banerjee S, Chalissery J, Bandey I, Sen R (2006) Rho-dependent transcription termination: more questions than answers. J Microbiol 44(1):11–22. Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512. Barabasi AL, Oltvai ZN (2004) Network biology: understanding the cell’s functional organization. Nat Rev Genet 5(2):101–113.


Bar-Joseph Z (2004) Analyzing time series gene expression data. Bioinformatics 20(16):2493–2503. Barker WC, Garavelli JS, Haft DH, Hunt LT, Marzec CR, Orcutt BC, Srinivasarao GY, Yeh LS, Ledley RS, Mewes HW, Pfeiffer F, Tsugita A (1998) The PIR-International Protein Sequence Database. Nucleic Acids Res 26(1):27–32. Barnard A, Wolfe A, Busby S (2004) Regulation at complex bacterial promoters: how bacteria use different promoter organizations to produce different regulatory outcomes. Curr Opin Microbiol 7(2):102–108. Barrangou R, Fremaux C, Deveau H, Richards M, Boyaval P, Moineau S, Romero DA, Horvath P (2007) CRISPR provides acquired resistance against viruses in prokaryotes. Science 315:1709–1712. Barrett CL, Herring CD, Reed JL, Palsson BO (2005) The global transcriptional regulatory network for metabolism in Escherichia coli exhibits few dominant functional states. Proc Natl Acad Sci USA 102(52):19103–19108. Bartlett MS (2005) Determinants of transcription initiation by archaeal RNA polymerase. Curr Opin Microbiol 8(6):677–684. Baseggio N, Glew MD, Markham PF, Whithear KG, Browning GF (1996) Size and genomic location of the pMGA multigene family of Mycoplasma gallisepticum. Microbiology 142 (Pt 6):1429–1435. Batey RT (2006) Structures of regulatory elements in mRNAs. Curr Opin Struct Biol 16(3):299–306. Baudot A, Jacq B, Brun C (2004) A scale of functional divergence for yeast duplicated genes revealed from analysis of the protein-protein interaction network. Genome Biol 5(10):R76. Baum LE, Petrie T (1966) Statistical inference for probabilistic functions of finite state Markov chains. Ann Math Stat 37:1554–1563. Baumann P (2005) Biology of bacteriocyte-associated endosymbionts of plant sap-sucking insects. Annu Rev Microbiol 59:155–189. Baxevanis AD, Ouellette BFF (2005) Bioinformatics: a practical guide to the analysis of genes and proteins. Hoboken, NJ Wiley. Beard DA, Babson E, Curtis E, Qian H (2004) Thermodynamic constraints for biochemical networks. 
J Theor Biol 228:327–333. Beaurepaire C, Chaconas G (2007) Topology-dependent transcription in linear and circular plasmids of the segmented genome of Borrelia burgdorferi. Mol Microbiol 63:443–453. Becker HD, Kern D (1998) Thermus thermophilus: a link in evolution of the tRNA-dependent amino acid amidation pathways. Proc Natl Acad Sci USA 95(22):12832–12837. Becker NB, Wolff L, Everaers R (2006) Indirect readout: detection of optimized subsequences and calculation of relative binding affinities using different DNA elastic potentials. Nucleic Acids Res 34(19):5638–5649. Beckett D (2001) Regulated assembly of transcription factors and control of transcription initiation. J Mol Biol 314(3):335–352. Becskei A, Kaufmann BB, van Oudenaarden A (2005) Contributions of low molecule number and chromosomal positioning to stochastic gene expression. Nat Genet 37(9):937–944. Beiko RG, Harlow TJ, Ragan MA (2005) Highways of gene sharing in prokaryotes. Proc Natl Acad Sci USA 102(40):14332–14337. Bekker M, Teixeira de Mattos MJ, Hellingwerf KJ (2006) The role of two-component regulation systems in the physiology of the bacterial cell. Sci Prog 89(Pt 3–4):213–242.


Belda E, Moya A, Silva FJ (2005) Genome rearrangement distances and gene order phylogeny in gamma-proteobacteria. Mol Biol Evol 22:1456–1467. Bell SD, Jackson SP (1998) Transcription and translation in Archaea: a mosaic of eukaryal and bacterial features. Trends Microbiol 6(6):222–228. Bell SD, Jackson SP (2001) Mechanism and regulation of transcription in archaea. Curr Opin Microbiol 4(2):208–213. Bell SD, Kosa PL, Sigler PB, Jackson SP (1999) Orientation of the transcription preinitiation complex in archaea. Proc Natl Acad Sci USA 96(24):13662–13667. Bell SD, Magill CP, Jackson SP (2001) Basal and regulated transcription in Archaea. Biochem Soc Trans 29(Pt 4):392–395. Ben-Hur A, Noble WS (2005) Kernel methods for predicting protein-protein interactions. Bioinformatics 21 Suppl 1, i38–i46. Bennett MR, Volfson D, Tsimring L, Hasty J (2007) Transient dynamics of genetic regulatory networks. Biophys J 92(10):3501–3512. Bentley DR (2006) Whole-genome re-sequencing. Current Opinion in Genetics & Development 16:545–552. Berg CM, Berg DE, Groisman EA (1989) Transposable elements and the genetic engineering of bacteria. In: Berg DE and Howe MM (eds) Mobile DNA. ASM, Washington, DC, pp. 879–925. Berg DE (1989) Transposon Tn5. In: Berg DE and Howe MM (eds) Mobile DNA. ASM, Washington, DC, pp. 185–210. Berg DE, Berg CM, Sasakawa C (1984) Bacterial transposon Tn5: evolutionary inferences. Mol Biol Evol 1:411–422. Bergman NH, Passalacqua KD, Hanna PC, Qin ZS (2007) Operon prediction for sequenced bacterial genomes without experimental information. 73:846–854. Bergsten J (2005) A review of long-branch attraction. Cladistics 21(2):163–193. Bernander R (2000) Chromosome replication, nucleoid segregation and cell division in archaea. Trends Microbiol 8(6):278–283. Bernardi G (1989) The isochore organization of the human genome. Annu Rev Genet 23:637–661. Bernstein E, Hake SB (2006) The nucleosome: a little variation goes a long way. Biochem Cell Biol 84(4):505–517. 
Bernstein JA, Khodursky AB, Lin PH, Lin-Chao S, Cohen SN (2002) Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proc Natl Acad Sci USA 99(15):9697–9702. Berry MJ, Banu L, Harney JW, Larsen PR (1993) Functional characterization of the eukaryotic SECIS elements which direct selenocysteine insertion at UGA codons. EMBO J 12(8):3315–3322. Besemer J, Lomsadze A, Borodovsky M (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29:2607–2618. Bik EM, Eckburg PB, Gill SR, Nelson KE, Purdom EA, Francois F, Perez-Perez G, Blaser MJ, Relman DA (2006) Molecular analysis of the bacterial microbiota in the human stomach. Proc Natl Acad Sci USA 103:732–737. Birnbaum K, Benfey PN, Shasha DE (2001) cis element/transcription factor analysis (cis/TF): a method for discovering transcription factor/cis element relationships. Genome Res 11(9):1567–1573. Blaisdell BE (1986) A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci USA 83:5155–5159. Blaisdell BE (1989a) Average values of a dissimilarity measure not requiring sequence alignment are twice the averages of conventional mismatch counts requiring sequence alignment for a computer-generated model system. J Mol Evol 29:538–547.


Blaisdell BE (1989b) Effectiveness of measures requiring and not requiring prior sequence alignment for estimating the dissimilarity of natural sequences. J Mol Evol 29: 526–537. Blaisdell BE, Campbell AM, Karlin S (1996) Similarities and dissimilarities of phage genomes. Proc Natl Acad Sci USA 93:5854–5859. Blaisdell BE, Rudd KE, Matin A, Karlin S (1993) Significant dispersed recurrent DNA sequences in the Escherichia coli genome. Several new groups. J Mol Biol 229: 833–848. Blanc G, Ogata H, Robert C, Audic S, Suhre K, Vestris G, Claverie JM, Raoult D (2007) Reductive genome evolution from the mother of Rickettsia. PLoS Genet 3: Blanchette M, Schwikowski B, Tompa M (2002) Algorithms for phylogenetic footprinting. J Comput Biol 9:211–223. Blanco AG, Sola G, Gomis-Ruth FX, Coll M (2002) Tandem DNA recognition by PhoB, a two-component signal transduction transcriptional activator. Structure 10:701–713. Blattner FR, Plunkett G, Bloch CA (1997) The complete genome sequence of Escherichia coli K-12. Science 277:1453–1474. Blot M (1994) Transposable elements and adaptation of host bacteria. Genetica 93:5–12. Bock JR, Gough DA (2001) Predicting protein-protein interactions from primary structure. Bioinformatics 17(5):455–460. Bock JR, Gough DA (2003) Whole-proteome interaction mining. Bioinformatics 19(1):125–134. Bockhorst J (2003) E. coli K12 operon prediction. http://www.biostat.wisc.edu/generegulation. Bockhorst J, Craven M, Page D, Shavlik J, Glasner J (2003) A Bayesian network approach to operon prediction. Bioinformatics 19(10):1227–35. Bodmer WF, Parsons PA (1962) Linkage and recombination in evolution. Adv Genet 11: 1–100. Bolotin A, Quinquis B, et al. (2004) Complete sequence and comparative genome analysis of the dairy bacterium Streptococcus thermophilus. Nat Biotechnol 22(12):1554–1558. Bolotin A, Wincker, Mauger, Jaillon, Malarme, Weissenbach, Ehrlich, Sorokin (2001) The complete genome sequence of lactic acid bacterium L. lactis ssp. lactis IL1403. 
Genome Res 11:731–753. Bonarius HPJ, Schmid G, Tramper J (1997) Flux analysis of underdetermined metabolic networks: the quest for the missing constraints. Trends in Biotechnology 15:308–314. Bono H, Ogata H, Goto S, Kanehisa M (1998) Reconstruction of amino acid biosynthesis pathways from the complete genome sequence. Genome Res 8(3):203–210. Bornberg-Bauer E, Beaussart F, Kummerfeld SK, Teichmann SA, Weiner J (2005) The evolution of domain arrangements in proteins and interaction networks. Cell Mol Life Sci 62(4):435–445. Borodovsky M, Hayes WS, Lukashin AV (1999) Statistical predictions of coding regions in prokaryotic genomes by using inhomogeneous Markov models. In: Charlebois RL (ed) Organization of Prokaryotic Genomes. ASM Press, pp. 11–33. Borodovsky M, McIninch J (1993) GeneMark: parallel gene recognition for both DNA strands. Computers Chem 17:123–133. Borodovsky M, Rudd KE, Koonin EV (1994) Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res 22:4756–4767. Borodovsky MY, Sprizhitsky YA, Golovanov EI, Alexandrov AA (1986a) Statistical patterns in the primary structures of functional regions of the genome in Escherichia coli: I. Frequency characteristics. Mol Biol 20:826–833. Borodovsky MY, Sprizhitsky YA, Golovanov EI, Alexandrov AA (1986b) Statistical patterns in the primary structures of functional regions of the genome in Escherichia coli: II. Nonuniform Markov models. Mol Biol 20:833–840.


Borodovsky MY, Sprizhitsky YA, Golovanov EI, Alexandrov AA (1986c) Statistical patterns in the primary structures of functional regions of the genome in Escherichia coli: III. Computer recognition of coding regions. Mol Biol 20:1144–1150. Borukhov S, Lee J (2005) RNA polymerase structure and function at lac operon. C R Biol 328(6):576–587. Bose M, Barber R (2006) Prophage Finder: a prophage loci prediction tool for prokaryotic genome sequences. In Silico Biology 6:223–227. Boucher Y, Labbate M, Koenig JE, Stokes HW (2007) Integrons: mobilizable platforms that promote genetic diversity in bacteria. Trends Microbiol 15:301–309. Boucher Y, Nesbo CL, Joss MJ, Robinson A, Mabbutt BC, Gillings MR, Doolittle WF, Stokes HW (2006) Recovery and evolutionary analysis of complete integron gene cassette arrays from Vibrio. BMC Evol Biol 6:3. Bourque G, Pevzner PA (2002) Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res 12:26–36. Bourret RB, Borkovich KA, Simon MI (1991) Signal transduction pathways involving protein phosphorylation in prokaryotes. Annu Rev Biochem 60:401–441. Boussau B, Karlberg EO, Frank AC, Legault BA, Andersson SGE (2004) Computational inference of scenarios for alpha-proteobacterial genome evolution. Proc Natl Acad Sci USA 101:9722–9727. Boyd EF, Brussow H (2002) Common themes among bacteriophage-encoded virulence factors and diversity among the bacteriophages involved. Trends Microbiol 10: 521–529. Bradford JR, Westhead DR (2005) Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics 21(8):1487–1494. Breitbart M, Thompson LR, Suttle CA and Sullivan MB (2007) Exploring the Vast Diversity of Marine Viruses Oceanography 20:135. Brendel V, Beckmann JS, Trifonov EN (1986) Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J Biomol Struct Dyn 4:11–21. 
Brent R, Ptashne M (1985) A eukaryotic transcriptional activator bearing the DNA specificity of a prokaryotic repressor. Cell 43(3 Pt 2):729–736. Brodie EL, DeSantis TZ, Parker JP, Zubietta IX, Piceno YM, Andersen GL (2007) Urban aerosols harbor diverse and dynamic bacterial populations. Proc Natl Acad Sci USA 104:299–304. Brooks DJ, Fresco JR (2002) Increased frequency of cysteine, tyrosine, and phenylalanine residues since the last universal ancestor. Mol Cell Proteomics 1(2):125–131. Brooks DJ, Fresco JR, Lesk AM, Singh M (2002) Evolution of amino acid frequencies in proteins over deep time: inferred order of introduction of amino acids into the genetic code. Mol Biol Evol 19(10):1645–1655. Brooks DJ, Fresco JR, Singh M (2004) A novel method for estimating ancestral amino acid composition and its application to proteins of the Last Universal Ancestor. Bioinformatics 20(14):2251–2257. Brown J, Doolittle W (1995) Root of the universal tree of life based on ancient aminoacyl-tRNA synthetase gene duplications. Proc Natl Acad Sci USA 92(7):2441–2445. Brown MV, Schwalbach MS, Hewson I, Fuhrman JA (2005) Coupling 16S-ITS rDNA clone libraries and automated ribosomal intergenic spacer analysis to show marine microbial diversity: development and application to a time series. Environmental Microbiology 7:1466–1479. Browning DF, Busby SJ (2004) The regulation of bacterial transcription initiation. Nat Rev Microbiol 2(1):57–65. Bruggeman FJ, Westerhoff HV (2007) The nature of systems biology. Trends Microbiol 15(1):45–50.


Brun C, Chevenet F, Martin D, Wojcik J, Guenoche A, Jacq B (2003) Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol 5(1):R6. Brussow H, Canchaya C, Hardt WD (2004) Phages and the evolution of bacterial pathogens: from genomic rearrangements to lysogenic conversion. Microbiol Mol Biol Rev 68:560–602, table of contents. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H et al. (2003) Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Res 31(9): 2443–2450. Buckley MR. (2004) The Global Genome Question: Microbes as the Key to Understanding Evolution and Ecology. In: DeLong EF, Relman D (eds). American Academy of Microbiology, Washington DC. Budin-Verneuil A, Pichereau V, Auffray Y, Ehrlich DS, Maguin E (2005) Proteomic characterization of the acid tolerance response in L. lactis MG1363 Proteomics 5: 4794–4807. Bueno SM, Santiviago CA, Murillo AA, Fuentes JA, Trombert AN, Rodas PI, Youderian P, Mora GC (2004) Precise excision of the large pathogenicity island, SPI7, in Salmonella enterica serovar Typhi. J Bacteriol 186:3202–3213. Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, FitzGerald LM, Clayton RA, Gocayne JD, Kerlavage AR, Dougherty BA, Tomb JF, Adams MD, Reich CI, Overbeek R, Kirkness EF, Weinstock KG, Merrick JM, Glodek A, Scott JL, Geoghagen NS, Venter JC (1996) Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273:1058–1073. Bulyk ML, McGuire AM, Masuda N, Church GM (2004) A motif co-occurrence approach for genome-wide prediction of transcription-factor-binding sites in Escherichia coli. Genome Res 14:201–208. Burden RL, Faires JD (1993) Numerical Analysis (5th ed.). Boston, MA: PWS Publishing Co. Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78–94. 
Burrows PC (2003) Investigating protein-protein interfaces in bacterial transcription complexes: a fragmentation approach. Bioessays 25(12):1150–1153. Burrus V, Waldor MK (2004) Shaping bacterial genomes with integrative and conjugative elements. Res Microbiol 155:376–386. Busby S, Ebright RH (1994) Promoter structure, promoter recognition, and transcription activation in prokaryotes. Cell 79(5):743–746. Busby S, Ebright RH (1999) Transcription activation by catabolite activator protein (CAP). J Mol Biol 293(2):199–213. Buu-Hoi A, Horodniceanu T (1980) Conjugative transfer of multiple antibiotic resistance markers in Streptococcus pneumoniae. J Bacteriol 143:313–320. Cahill DJ, Nordhoff E (2003) Protein arrays and their role in proteomics. Adv Biochem Eng Biotechnol 83:177–187. Cai L, Friedman N, Xie XS (2006) Stochastic protein expression in individual cells at the single molecule level. Nature 440(7082):358–362. Calhoun DH, Wallen JW, Traub L, Gray JE, Kung HF (1985) Internal promoter in the ilvGEDA transcription unit of Escherichia coli K-12. J Bacteriol 161(1):128–132. Campbell A, Mrázek J, Karlin S (1999) Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc Natl Acad Sci USA 96:9184–9189. Campillos M, von Mering C, Jensen LJ, Bork P (2006) Identification and analysis of evolutionarily cohesive functional modules in protein networks. Genome Res 16(3):374–382.


Canback B, Tamas I, Andersson SG (2004) A phylogenomic study of endosymbiotic bacteria. Mol Biol Evol 21(6):1110–1122. Canchaya C, Fournous G, Brussow H (2004) The impact of prophages on bacterial chromosomes. Mol Microbiol 53:9–18. Canchaya C, Fournous G, et al. (2003) Phage as agents of lateral gene transfer. Curr Opin Microbiol 6(4):417–424. Carbone A, Zinovyev A, Kepes F (2003) Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19:2005–2015. Carpentier AS, Torresani B, Grossmann A, Henaut A (2005) Decoding the nucleoid organisation of Bacillus subtilis and Escherichia coli through gene expression data. BMC Genomics 6(1):84. Carter RJ, Dubchak I, Holbrook SR (2001) A computational approach to identify genes for functional RNAs in genomic sequences. Nucleic Acids Res 29:3928–3938 Casas V, Rohwer F (2007) Phage metagenomics. Methods Enzymol 421:259–268. Cases I, de Lorenzo V, Ouzounis CA (2003b) Transcription regulation and environmental adaptation in bacteria. Trends Microbiol 11(6):248–253. Cases I, Ussery DW, de Lorenzo V (2003a) The sigma54 regulon (sigmulon) of Pseudomonas putida. Environ Microbiol 5(12):1281–1293. Casjens S, Palmer N, van Vugt R, Huang WM, Stevenson B, Rosa P, Lathigra R, Sutton G, Peterson J, Dodson RJ, Haft D, Hickey E, Gwinn M, White O, Fraser CM (2000) A bacterial genome in flux: the twelve linear and nine circular extrachromosomal DNAs in an infectious isolate of the Lyme disease spirochete Borrelia burgdorferi. Mol Microbiol 35:490–516. Caspi R, Foerster H, Fulcher CA, Hopkinson R, Ingraham J, Kaipa P, Krummenacker M, Paley S, Pick J, Rhee SY, Tissier C, Zhang P, Karp PD (2006) MetaCyc: A multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res 34: D511-D516. Cavalcanti AR, Leite ES, Neto BB, Ferreira R (2004) On the classes of aminoacyl-tRNA synthetases, amino acids and the genetic code. Orig Life Evol Biosph 34(4):407–420. 
Cavalier-Smith T (2005) Economy, speed and size matter: Evolutionary forces driving nuclear genome miniaturization and expansion. Ann Bot 95:147–175. Cenatiempo Y (1986) Prokaryotic gene expression in vitro: transcription-translation coupled systems. Biochimie 68(4):505–515. Chandler M, Mahillon J (2002) Insertion sequences revisited. In: Lambowitz AM (eds) Mobile DNA. American Society for Microbiology, Washington, DC, pp. 631–662. Chargaff E (1950) Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Experientia 6:201–209. Charlebois RL, Doolittle WF (2004) Computing prokaryotic gene ubiquity: rescuing the core from extinction. Genome Res 14(12):2469–2477. Charron C, Roy H, Blaise M, Giege R, Kern D (2003) Non-discriminating and discriminating aspartyl-tRNA synthetases differ in the anticodon-binding domain. EMBO J 22(7):1632–1643. Chaudhuri BN, Yeates TO (2005) A computational method to predict genetically encoded rare amino acids in proteins. Genome Biol 6(9):R79. Che D (2006) Uber-Operon Database. http://csbl.bmb.uga.edu/uber. Che D (2007) UNIPOP. http://csbl.bmb.uga.edu/~dongsheng/UNIPOP/. Che D, Li G, Mao F, Wu H, Xu Y (2006) Detecting uber-operons in prokaryotic genomes. Nucleic Acids Res 34(8):2418–2427. Che D, Zhao J, Cai L, Xu Y (2007) Operon prediction in microbial genomes using decision tree approach. Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, Honolulu.


Chen G, Gharib TG, Huang CC, Taylor JM, Misek DE, Kardia SL et al. (2002) Discordant protein and mRNA expression in lung adenocarcinomas. Mol Cell Proteomics 1(4): 304–313. Chen LA, DeVries L et al. (1997) Convergent evolution of antifreeze glycoproteins in Antarctic notothenioid fish and Arctic cod. Proc Natl Acad Sci USA 94(8):3817–3822. Chen M., Hofestadt R (2003) Quantitative Petri net model of gene regulated metabolic networks in the cell. In Silico Biol 3(3):347–365. Chen X, Su Z, Dam P, Palenik B, Xu Y, Jiang T (2004a) Operon prediction by comparative genomics: an application to the Synechococcus sp. WH8102 genome. Nucleic Acids Res 32(7):2147–2157. Chen X, Su Z, Xu Y, Jiang T (2004b) Computational prediction of operons in Synechococcus sp. WH8102. Proceedings of 15th International Conference on Genome Informatics 15(2):211–222. Chen XW, Liu, M (2005) Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 21(24):4394–4400. Chevenet F, Brun C, Banuls AL, Jacq B, Christen R (2006) TreeDyn: towards dynamic graphics and annotations for analyses of trees. BMC Bioinformatics 7:439. Cho DY, Cho KH, Zhang BT (2006) Identification of biochemical networks by S-tree based genetic programming. Bioinformatics 22(13):1631–1640. Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, et al. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 2(1): 65–73. Chou IC, Martens H, Voit EO (2006) Parameter estimation in biochemical systems models with alternating regression. Theor Biol Med Model 3:25 Churchward G, Bremer H (1977) Determination of deoxyribonucleic acid replication time in exponentially growing Escherichia coli B/r. J Bacteriol 130:1206–1213. Ciampi MS (2006) Rho-dependent terminators and transcription termination. Microbiology 152(Pt 9):2515–2528. 
Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P (2006) Toward automatic reconstruction of a highly resolved tree of life. Science 311(5765):1283–1287. Clark MA, Moran NA, Baumann P (1999) Sequence evolution in bacterial endosymbionts having extreme base compositions. Mol Biol Evol 16:1586–1598. Claverie JM, Bougueleret L (1986) Heuristic informational analysis of sequences. Nucleic Acids Res 14:179–196. Clewell DB, Flannagan SE (1993) The conjugative transposons of Gram-positive bacteria. In: Clewell DB (eds) Bacterial Conjugation. Plenum, New York, pp. 369–393. Cohan FM (2002a) Sexual isolation and speciation in bacteria. Genetica 116(2–3):359–370. Cohan FM (2002b) What are bacterial species? Annu Rev Microbiol 56:457–487. Cohan FM, Perry EB (2007) A systematics for discovering the fundamental units of bacterial diversity. Curr Biol 17(10):R373–R386. Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen AS, McGarrell DM, Bandela AM, Cardenas E, Garrity GM, Tiedje JM (2007) The ribosomal database project (RDP-II): introducing myRDP space and quality controlled public data. Nucleic Acids Res 35(Database issue):D169–D172. Cole ST, Eiglmeier K, Parkhill J, James KD, Thomson NR, Wheeler PR, Honore N, Garnier T, Churcher C, Harris D, Mungall K, Basham D, Brown D, Chillingworth T, Connor R, Davies RM, Devlin K, Duthoy S, Feltwell T, Fraser A, Hamlin N, Holroyd S, Hornsby T, Jagels K, Lacroix C, Maclean J, Moule S, Murphy L, Oliver K, Quail MA, Rajandream MA, Rutherford KM, Rutter S, Seeger K, Simon S, Simmonds M, Skelton J, Squares R, Squares S, Stevens K, Taylor K, Whitehead S, Woodward JR, Barrell BG (2001) Massive gene decay in the leprosy bacillus. Nature 409:1007–1011.


Collado-Vides J, Magasanik B, Gralla JD (1991) Control site location and transcriptional regulation in Escherichia coli. Microbiol Rev 55(3):371–394. Collis CM, Hall RM (1992) Gene cassettes from the insert region of integrons are excised as covalently closed circles. Mol Microbiol 6:2875–2885. Connell GJ, Illangesekare M, Yarus M (1993) Three small ribooligonucleotides with specific arginine sites. Biochemistry 32(21):5497–5502. Conway T, Schoolnik GK (2003) Microarray expression profiling: capturing a genome-wide portrait of the transcriptome. Mol Microbiol 47(4):879–889. Cortez DQ, Lazcano A, Becerra A (2005) Comparative analysis of methodologies for the detection of horizontally transferred genes: a reassessment of first-order Markov models. In Silico Biol 5(5–6):581–592. Courtois S, Cappellano CM, Ball M, Francou FX, Normand P, Helynck G, Martinez A, Kolvek SJ, Hopke J, Osburne MS, August PR, Nalin R, Guerineau M, Jeannin P, Simonet P, Pernodet JL (2003a) Recombinant environmental libraries provide access to microbial diversity for drug discovery from natural products. Applied and Environmental Microbiology 69:49–55. Courvalin P, Carlier C (1987) Tn1545: a conjugative shuttle transposon. Mol Gen Genet 206:259–264. Covert MW, Knight EM, Reed JL, Herrgard MJ, Palsson BO (2004) Integrating high-throughput and computational data elucidates bacterial networks. Nature 429(6987):92–96. Covert MW, Shilling CH, Palsson B (2001) Regulation of Gene Expression in Flux Balance Models of Metabolism. Journal of Theoretical Biology 213(1):73–88. Craig NL (2002) Tn7. In: Craig NL, Craigie R, Gellert M and Lambowitz AM (eds) Mobile DNA II. Washington, DC: ASM Press, pp. 423–456. Craig NL, Craigie R, Gellert M, Lambowitz AM (2002) Mobile DNA II. ASM Press, Washington, DC, pp. 1204. Crampin EJ, Schnell S, McSharry PE (2004) Mathematical and computational techniques to deduce complex biochemical reaction mechanisms. Prog Biophys Mol Biol 86(1): 77–112. 
Craven M, Page D, Shavlik J, Bockhorst J, Glasner J (2000) A probabilistic learning approach to whole-genome operon prediction. Proc Int Conf Intell Syst Mol Biol 8: 116–127. Craven SH, Neidle EL (2007) Double trouble: medical implications of genetic duplication and amplification in bacteria. Future Microbiol 2:in press. Crawford IP, Stauffer GV (1980) Regulation of tryptophan biosynthesis. Annu Rev Biochem 49:163–195. Crick FH (1968) The origin of the genetic code. J Mol Biol 38(3):367–379. Cuadros-Orellana S, Martin-Cuadrado A et al. (2007) Genomic plasticity in prokaryotes: the case of the square haloarchaeon. ISME J 1(1):235–245. Curto R, Voit EO, Sorribas A, Cascante M (1998) Mathematical models of purine metabolism in man. Math Biosc 151:1–49. Cutler P (2003) Protein arrays: the current state-of-the-art. Proteomics 3(1):3–18. Dagan T, Martin W (2006) The tree of one percent. Genome Biol 7(10):118. Dagan T, Martin W (2007) Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc Natl Acad Sci USA 104(3):870–875. Dago AE, Wigneshweraraj SR, Buck M, Morett E (2007) A role for the conserved GAFTGA motif of AAA+ transcription activators in sensing promoter DNA conformation. J Biol Chem 282(2):1087–1097. Dai L, Toor N, Olson R, Keeping A, Zimmerly S (2003) Database for mobile group II introns. Nucleic Acids Res 31:424–426.


Dale C, Beeton M, Harbison C, Jones T, Pontes M (2006) Isolation, pure culture, and characterization of “Candidatus Arsenophonus arthropodicus,” an intracellular secondary endosymbiont from the hippoboscid louse fly Pseudolynchia canariensis. Appl Environ Microbiol 72:2997–3004. Dale C, Wang B, Moran N, Ochman H (2003) Loss of DNA recombinational repair enzymes in the initial stages of genome degeneration. Mol Biol Evol 20:1188–1194. Dam P (2007) Decision-tree and Logistic function based classifier for operon prediction. http://csbl.bmb.uga.edu/~phd/operons/operon index.html. Dam P, Olman V, Harris K, Su Z, Xu Y (2007) Operon prediction using both genome-specific and general genomic information. Nucleic Acids Res 35(1):288–298. Dam P, Su Z, Olman V, Xu Y (2004) In silico construction of the carbon fixation pathway in Synechococcus sp. WH8102. Journal of Biological Systems 12:97–125. Dame RT (2005) The role of nucleoid-associated proteins in the organization and compaction of bacterial chromatin. Mol Microbiol 56(4):858–870. Danchin A, Fang G et al. (2007) The extant core bacterial proteome is an archive of the origin of life. Proteomics 7(6):875–889. Dandekar T, Snel B, Huynen M, Bork P (1998) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 23(9):324–328. Daniel R (2005) The metagenomics of soil. Nature Reviews 3:470–478. Darling ACE, Mau B, Blattner FR, Perna NT (2004) Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Research 14:1394–1403. Darwin C (1859) On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. London, John Murray, Albemarle Street. Das S, Paul S, Bag SK, Dutta C (2006) Analysis of Nanoarchaeum equitans genome and proteome composition: indications for hyperthermophilic and parasitic adaptation. BMC Genomics 7:186. 
Date SV, Marcotte EM (2003) Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotechnol 21(9):1055–1062. Daubin V, Gouy M, Perriere G (2002) A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res 12(7):1080–1090. Daubin V, Lerat E et al. (2003) The source of laterally transferred genes in bacterial genomes. Genome Biol 4(9):R57. Daubin V, Moran NA (2004) Comment on “The origins of genome complexity”. Science 306:978. Daubin V, Ochman H (2004) Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli. Genome Res 14(6):1036–1042. Daubin V, Ochman H (2004) Start-up entities in the origin of new genes. Curr Opin Genet Dev 14(6):616–619. Daubin V, Perrière G (2003) G+C3 structuring along the genome: a common feature in prokaryotes. Mol Biol Evol 20:471–483. Davidson AL, Chen J (2004) ATP-binding cassette transporters in bacteria. Annu Rev Biochem 73:241–268. Davis BK (1999) Evolution of the genetic code. Prog Biophys Mol Biol 72(2):157–243. Davis RE, Hodgson S (1997) Gene linkage and steady state RNAs suggest trans-splicing may be associated with a polycistronic transcript in Schistosoma mansoni. Mol Biochem Parasitol 89(1):25–39. Dawkins R (1976) The Selfish Gene, Oxford University Press. Dayhoff MO (1976) The origin and evolution of protein superfamilies. Fed Proc 35:2132–2138.


Dayhoff MO (1979) Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, DC. D’Costa VM, McGrann KM, Hughes DW, Wright GD (2006) Sampling the antibiotic resistome. Science 311:374–377. de Boor C (1978) A practical guide to splines. New York: Springer-Verlag. de Boor C, Höllig K, Riemenschneider SD (1993) Box splines. New York; Hong Kong: Springer-Verlag. De Carlo S, Chen B, Hoover TR, Kondrashkina E, Nogales E et al. (2006) The structural basis for regulated assembly and function of the transcriptional activator NtrC. Genes Dev 20(11):1485–1495. De Gregorio E, Silvestro G, Petrillo M, Carlomagno MS, Di Nocera PP (2005) Enterobacterial repetitive intergenic consensus sequence repeats in yersiniae: genomic organization and functional properties. J Bacteriol 187:7945–7954. de Hoon MJ, Imoto S, Kobayashi K, Ogasawara N, Miyano S (2004) Predicting the operon structure of Bacillus subtilis using operon length, intergene distance, and gene expression information. Pac Symp Biocomput 276–287. de Hoon MJ, Makita Y, Nakai K, Miyano S (2005) Prediction of transcriptional terminators in Bacillus subtilis and related species. PLoS Comput Biol 1(3):e25. de Jong H (2002) Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol 9(1):67–103. de Vos WM (1999) Gene expression systems for LAB. Curr Opin Microbiol 2:289–295. Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, Sun Z, Zong Q, Du Y, Du J, Driscoll M, Song W, Kingsmore SF, Egholm M, Lasken RS (2002) Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci USA 99:5261–5266. Deane CM, Salwinski L, Xenarios I, Eisenberg D (2002) Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 1(5):349–356. Degnan JH, Rosenberg NA (2006) Discordance of species trees with their most likely gene trees. PLoS Genet 2(5):e68.
Degnan PH, Lazarus AB, Brock CD, Wernegreen JJ (2004) Host-symbiont stability and fast evolutionary rates in an ant-bacterium association: cospeciation of Camponotus species and their endosymbionts, Candidatus Blochmannia. Syst Biol 53:95–110. del Rosario RCH, Mendoza E, Voit EO (2008) Challenges in lin-log modeling of glycolysis in Lactococcus lactis. IET Systems Biol, in press. Delcher AL, Bratke KA, Powers EC, Salzberg SL (2007) Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23:673–679. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27:4636–4641. Delcher AL, Phillippy A, Carlton J, Salzberg SL (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res 30:2478–2483. Delmotte F, Rispe C, Schaber J, Silva FJ, Moya A (2006) Tempo and mode of early gene loss in endosymbiotic bacteria from insects. BMC Evol Biol 6:56. DeLong EF (1992) Archaea in Coastal Marine Environments. Proceedings of the National Academy of Sciences of the United States of America 89:5685–5689. DeLong EF, Karl DM (2005) Genomic perspectives in microbial oceanography. Nature 437:336–342. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard NU, Martinez A, Sullivan MB, Edwards R, Brito BR, Chisholm SW, Karl DM (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.


Deluca TF, Wu IH, Pu J, Monaghan T, Peshkin L, Singh S, Wall DP (2006) Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioinformatics 22(16):2044–2046. Dembo A, Karlin S (1988) Poisson approximations for r-scan processes. Ann Appl Prob 2:329–357. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc 39:1–38. Deng M, Mehta S, Sun F, Chen T (2002) Inferring domain-domain interactions from protein-protein interactions. Genome Res 12(10):1540–1548. Deng M, Sun F, Chen T (2003) Assessment of the reliability of protein-protein interactions and protein function prediction. Pac Symp Biocomput 140–151. Dethlefsen L, McFall-Ngai M, Relman DA (2007) An ecological and evolutionary perspective on human-microbe mutualism and disease. Nature 449:811–818. D’Haeseleer P, Wen X, Fuhrman S, Somogyi R (1999) Linear modeling of mRNA expression levels during CNS development and injury. Pac Symp Biocomput 41–52. Dharmadi Y, Gonzalez R (2004) DNA microarrays: experimental issues, data analysis, and application to bacterial systems. Biotechnol Prog 20(5):1309–1324. Di Giulio M (2001) The universal ancestor was a thermophile or a hyperthermophile. Gene 281(1–2):11–17. Di Giulio M (2003) The universal ancestor and the ancestor of bacteria were hyperthermophiles. J Mol Evol 57(6):721–730. Di Giulio M (2003) The universal ancestor was a thermophile or a hyperthermophile: tests and further evidence. J Theor Biol 221(3):425–436. Di Giulio M (2005) A comparison of proteins from Pyrococcus furiosus and Pyrococcus abyssi: barophily in the physicochemical properties of amino acids and in the genetic code. Gene 346:1–6. Di Giulio M (2005) Structuring of the genetic code took place at acidic pH. J Theor Biol 237(2):219–226. Di Giulio M (2005) The ocean abysses witnessed the origin of the genetic code. Gene 346:7–12. Di Giulio M (2005) The origin of the genetic code: theories and their relationships, a review. 
Biosystems 80(2):175–184. Dicksved J, Floistrup H, Bergstrom A, Rosenquist M, Pershagen G, Scheynius A, Roos S, Alm JS, Engstrand L, Braun-Fahrlander C, von Mutius E, Jansson JK (2007) Molecular fingerprinting of the fecal microbiota of children raised according to different lifestyles. Applied and Environmental Microbiology 73:2284–2289. Diruggiero J, Dunn D, Maeder DL, Holley-Shanks R, Chatard J, Horlacher R, Robb FT, Boos W, Weiss RB (2000) Evidence of recent lateral gene transfer among hyperthermophilic archaea. Mol Microbiol 38:684–693. Dobrindt U, Hochhut B, Hentschel U, Hacker J (2004) Genomic islands in pathogenic and environmental microorganisms. Nat Rev Microbiol 2:414–424. Dohkan S, Koike A, Takagi T (2006) Improving the performance of an SVM-based method for predicting protein-protein interactions. In Silico Biol 6(6):515–529. Dongen SV (2000) A cluster algorithm for graphs. Amsterdam, National Research Institute for Mathematics and Computer Science in the Netherlands. Doolittle WF (1999) Phylogenetic classification and the universal tree. Science 284(5423):2124–2129. Doolittle WF, Boucher Y, Nesbø CL, Douady CJ, Andersson JO, Roger AJ (2003) How big is the iceberg of which organellar genes in nuclear genomes are but the tip? Philosophical Transactions of the Royal Society B: Biological Sciences 358(1429): 39–58.


Douette P, Sluse FE (2006) Mitochondrial uncoupling proteins: new insights from functional and proteomic studies. Free Radic Biol Med 40:1097–1107. Draper GC, Gober JW (2002) Bacterial chromosome segregation. Annu Rev Microbiol 56:567–597. Driscoll DM, Copeland PR (2003) Mechanism and regulation of selenoprotein synthesis. Annu Rev Nutr 23:17–40. Du J, Rozowsky JS, Korbel JO, Zhang ZD, Royce TE, Schultz MH et al. (2006) A supervised hidden markov model framework for efficiently segmenting tiling array data in transcriptional and chIP-chip experiments: systematically incorporating validated biological knowledge. Bioinformatics 22(24):3016–3024. Dufayard JF, Duret L, Penel S, Gouy M, Rechenmann F, Perriere G (2005) Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics 21(11):2596–2603. Dufraigne C, Fertil B, Lespinats S, Giron A, Deschavanne P (2005) Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res 33:e6. Dufresne A, Garczarek L, Partensky F (2005) Accelerated evolution associated with genome reduction in a free-living prokaryote. Genome Biol 6(2):R14. Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN (2005) Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J 272:5129–5148. Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge. Duret L (2002) Evolution of synonymous codon usage in metazoans. Curr Opin Genet Dev 12:640–649. Dykhuizen DE, Green L (1991) Recombination in Escherichia coli and the definition of biological species. J Bacteriol 173(22):7257–7268. Dynan WS, Tjian R (1983) The promoter-specific transcription factor Sp1 binds to upstream sequences in the SV40 early promoter. Cell 35(1):79–87.
Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA (2005) Diversity of the human intestinal microbial flora. Science 308:1635–1638. Eddy SR (2002) Computational genomics of noncoding RNA genes. Cell 109:137–140. Edwards JS, Palsson BO (2000) The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics and capabilities. Proc Natl Acad Sci USA 97:5528–5533. Edwards JS, Ramakrishna R, Schilling CH, Palsson BO (1999) Metabolic flux balance analysis. In: Lee SY, Papoutsakis (eds). Metabolic Engineering, Marcel Dekker, pp. 13–57. Edwards KJ, Bond PL, Gihring TM, Banfield JF (2000) An archaeal iron-oxidizing extreme acidophile important in acid mine drainage. Science 287:1796–1799. Edwards MT, Rison SC, Stoker NG, Wernisch L (2005) A universally applicable method of operon map prediction on minimally annotated genomes using conserved genomic context. Nucleic Acids Res 33(10):3253–3262. Edwards RA, Dinsdale EA (2007) Marine environmental genomics: unlocking the ocean’s secrets. Oceanography 20:56. Edwards RA, Rodriguez-Brito B, Wegley L, Haynes M, Breitbart M, Peterson DM, Saar MO, Alexander S, Alexander EC, Jr., Rohwer F (2006) Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics 7:57. Edwards RA, Rohwer F (2005) Viral metagenomics. Nature Reviews 3:504–510. Eigen M, Winkler-Oswatitsch R (1981) Transfer-RNA, an early gene? Naturwissenschaften 68(6):282–292.


Eilers PHC (2003) A perfect smoother. Analytical Chemistry 75(14):3631–3636. Eisen JA (2007) Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biol 5:e82. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. PNAS 95(25):14863–14868. Ekman D, Bjorklund AK, Frey-Skött J, Elofsson A (2005) Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J Mol Biol 348(1):231–243. Eknoyan G (1999) Santorio Sanctorius (1561–1636) — founding father of metabolic balance studies. Am J Nephrol 19(2):226–233. Elf J, Li GW, Xie XS (2007) Probing transcription factor dynamics at the single-molecule level in a living cell. Science 316(5828):1191–1194. Ellington AD, Khrapov M, Shaw CA (2000) The scene of a frozen accident. RNA 6(4):485–498. Ely B, Croft RH (1982) Transposon mutagenesis in Caulobacter crescentus. J Bacteriol 149:620–625. Embley TM, Hirt RP (1998) Early branching eukaryotes? Curr Opin Genet Dev 8(6):624–629. Endres RG, Schulthess TC, Wingreen NS (2004) Toward an atomistic model for predicting transcription-factor binding sites. Proteins 57(2):262–268. Engelke DR, Ng SY, Shastry BS, Roeder RG (1980) Specific interaction of a purified transcription factor with an internal control region of 5S RNA genes. Cell 19(3):717–728. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402:86–90. Enright AJ, Ouzounis CA (2001) Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions. Genome Biology 2:341–347. Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30(7):1575–1584. Ephraim Y, Merhav N (2002) Hidden Markov processes. IEEE Trans Inform Theory 48:1518–1569.
Eriani G, Delarue M, Poch O, Gangloff J, Moras D (1990) Partition of tRNA synthetases into two classes based on mutually exclusive sets of sequence motifs. Nature 347(6289):203–206. Ermolaeva MD (2001) The Institute for Genomic Research (TIGR). http://www.tigr.org/tigr-scripts/operons/operons.cgi. Ermolaeva MD (2001) Synonymous codon usage in bacteria. Curr Issues Mol Biol 3:91–97. Ermolaeva MD, White O, Salzberg SL (2001) Prediction of operons in microbial genomes. Nucleic Acids Res 29(5):1216–1221. Eulenstein O (1997) A linear time algorithm for tree mapping. Arbeitspapiere Der GMD 1046. Facciotti MT, Reiss DJ, Pan M, Kaur A, Vuthoori M et al. (2007) General transcription factor specified global gene regulation in archaea. Proc Natl Acad Sci USA 104(11):4630–4635. Falkowski PG, Oliver MJ (2007) Mix and match: how climate selects phytoplankton. Nature Reviews 5:813–819. Fariselli P, Pazos F, Valencia A, Casadio R (2002) Prediction of protein–protein interaction sites in heterocomplexes with neural networks. Eur J Biochem 269(5):1356–1361. Fedoroff N, Wessler S, Shure M (1983) Isolation of the transposable maize controlling elements Ac and Ds. Cell 35:235–242.


Fedorova O, Zingler N (2007) Group II introns: structure, folding and splicing mechanism. Biol Chem 388:665–678. Felsenstein J (1978) Cases in which parsimony and compatibility methods will be positively misleading. Syst Zool 27:401–410. Felsenstein J (1988) Phylogenies from molecular sequences: inference and reliability. Annu Rev Genet 22:521–565. Feng DF, Doolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25(4):351–360. Feng L, Sheppard K, Namgoong S, Ambrogelly A, Polycarpo C, Randau L, Tumbula-Hansen D, Soll D (2004) Aminoacyl-tRNA synthesis by pre-translational amino acid modification. RNA Biol 1(1):16–20. Feng L, Tumbula-Hansen D, Toogood H, Soll D (2003) Expanding tRNA recognition of a tRNA synthetase by a single amino acid change. Proc Natl Acad Sci USA 100(10):5676–5681. Fenn K, Blaxter M (2006) Wolbachia genomes: revealing the biology of parasitism and mutualism. Trends in Parasitology 22:60–65. Ferreira A (2000) Power Law Analysis and Simulation, www.dqb.fc.ul.pt/docentes/aferreira/plas.html. Fichant G, Gautier C (1987) Statistical method for predicting protein coding regions in nucleic acid sequences. Comput Appl Biosci 3:287–295. Fickett JW (1982) Recognition of protein coding regions in DNA sequences. Nucleic Acids Res 10:5303–5318. Fickett JW, Torney DC, Wolf DR (1992) Base compositional structure of genomes. Genomics 13:1056–1064. Fickett JW, Tung CS (1992) Assessment of protein coding measures. Nucleic Acids Res 20:6441–6450. Field D, Kyrpides N (2007) The positive role of the ecological community in the genomic revolution. Microb Ecol 53:507–511. Field D, Wills C (1998) Abundant microsatellite polymorphism in Saccharomyces cerevisiae, and the different distributions of microsatellites in eight prokaryotes and S. cerevisiae, result from strong mutation pressures and a variety of selective forces. Proc Natl Acad Sci USA 95:1647–1652.
Fiers W, Contreras R, Duerinck F, Haegeman G, Iserentant D, Merregaert J et al. (1976) Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature 260(5551):500–507. Figeac M (2004) HUGO. http://bioinfo.lifl.fr/HUGO. Figeac M, Varre JS (2004) Detecting über-operons in Prokaryotics Genomes. Laboratoire d’Informatique Fondamentale de Lille (LIFL). Filée J, Siguier P, Chandler M (2007) Insertion sequence diversity in archaea. Microbiol Mol Biol Rev 71:121–157. Finlay BB, Falkow S (1997) Common themes in microbial pathogenicity revisited. Microbiol Mol Biol Rev 61:136–169. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A (2006) Pfam: clans, web tools and services. Nucleic Acids Res 34(Database issue):D247–D251. Fisher E, Sauer U (2005) Large-scale in vivo analysis shows rigidity and suboptimal performance of Bacillus subtilis metabolism. Nat Genet 37:636–640. Fisher RA (1930) The Genetical Theory of Natural Selection. Oxford, Oxford University Press. Fitch WM (1970) Distinguishing homologous from analogous proteins. Syst Zool 19(2):99–113.


Fitch WM (2000) Homology: a personal view on some of the problems. Trends Genet 16(5):227–231. Fitch WM, Upper K (1987) The phylogeny of tRNA sequences provides evidence for ambiguity reduction in the origin of the genetic code. Cold Spring Harb Symp Quant Biol 52:759–767. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269(5223):496–512. Fluit AC, Schmitz FJ (2004) Resistance integrons and super-integrons. Clin Microbiol Infect 10:272–288. Fondi M, Brilli M, Emiliani G, Paffetti D, Fani R (2007) The primordial metabolism: an ancestral interconnection between leucine, arginine, and lysine biosynthesis. BMC Evol Biol 7(Suppl 2):S3. Forchhammer K, Bock A (1991) Selenocysteine synthase from Escherichia coli. Analysis of the reaction sequence. J Biol Chem 266(10):6324–6328. Foster TJ (1983) Plasmid-determined resistance to antimicrobial drugs and toxic metal ions in bacteria. Microbiological Reviews 47:361–409. Fournier GP, Gogarten JP (2007) Signature of a Primitive Genetic Code in Ancient Protein Lineages. J Mol Evol. Fouts D (2006) Phage Finder: automated identification and classification of prophage regions in complete bacterial genome sequences. Nucleic Acids Research 34:5839–5851. Francino MP, Chao L, Riley MA, Ochman H (1996) Asymmetries generated by transcription-coupled repair in enterobacterial genes. Science 272:107–109. Francino MP, Ochman H (1997) Strand asymmetries in DNA evolution. Trends Genet 13:240–245. Francino MP, Ochman H (2001) Deamination as the basis of strand-asymmetric evolution in transcribed Escherichia coli sequences. Mol Biol Evol 18:1147–1150. Franke AE, Clewell DB (1981) Evidence for a chromosome-borne resistance transposon (Tn916) in Streptococcus faecalis that is capable of “conjugal” transfer in the absence of a conjugative plasmid. J Bacteriol 145:494–502.
Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, Fritchman JL, Weidman JF, Small KV, Sandusky M, Fuhrmann J, Nguyen D, Utterback TR, Saudek DM, Phillips CA, Merrick JM, Tomb JF, Dougherty BA, Bott KF, Hu PC, Lucier TS, Peterson SN, Smith HO, Hutchison CA, Venter JC (1995) The Minimal Gene Complement of Mycoplasma genitalium. Science 270:397–403. Fraser-Liggett CM (2005) Insights on biology and evolution from microbial genome sequencing. Genome research 15:1603–1610. Freifelder D, Meselson M (1970) Topological relationship of prophage lambda to the bacterial chromosome in lysogenic cells. Proc Natl Acad Sci USA 65:200–205. Friedberg I (2006) Automated protein function prediction–the genomic challenge. Brief Bioinform 7(3):225–242. Frishman D, Mironov A, Mewes H-W, Gelfand M (1998) Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res 26:2941–2947. Frost LS, Ippen-Ihler K, Skurray RA (1994) Analysis of the sequence and gene products of the transfer region of the F sex factor. Microbiol Rev 58:162–210. Frost LS, Leplae R, Summers AO, Toussaint A (2005) Mobile genetic elements: the agents of open source evolution. Nat Rev Microbiol 3:722–732. Fuhrman JA (1999) Marine viruses and their biogeochemical and ecological effects. Nature 399:541–548.


Fuhrman JA, Hewson I, Schwalbach MS, Steele JA, Brown MV, Naeem S (2006) Annually reoccurring bacterial communities are predictable from ocean conditions. Proceedings of the National Academy of Sciences of the United States of America 103:13104–13109. Fuhrman JA, Mccallum K, Davis AA (1992) Novel major archaebacterial group from marine plankton. Nature 356:148–149. Fukami-Kobayashi K, Tateno Y et al. (2003) Parallel evolution of ligand specificity between LacI/GalR family repressors and periplasmic sugar-binding proteins. Mol Biol Evol 20(2):267–277. Fukuhara H (1995) Linear DNA plasmids of yeasts. FEMS Microbiol Lett 131:1–9. Furrie E (2006) A molecular revolution in the study of intestinal microflora. Gut 55:141–143. Futcher AB (1988) The 2 micron circle plasmid of Saccharomyces cerevisiae. Yeast 4:27–40. Futcher B, Latter GI, Monardo P, McLaughlin CS, Garrels JI (1999) A sampling of the yeast proteome. Mol Cell Biol 19:7357–7368. Gaasterland T, Ragan MA (1998) Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb Comp Genomics 3(4):199–217. Galagan JE, Nusbaum C, Roy A, Endrizzi MG, Macdonald P, FitzHugh W, Calvo S, Engels R, Smirnov S, Atnoor D, Brown A, Allen N, Naylor J, Stange-Thomann N, DeArellano K, Johnson R, Linton L, McEwan P, McKernan K, Talamas J, Tirrell A, Ye W, Zimmer A, Barber RD, Cann I, Graham DE, Grahame DA, Guss AM, Hedderich R, Ingram-Smith C, Kuettner HC, Krzycki JA, Leigh JA, Li W, Liu J, Mukhopadhyay B, Reeve JN, Smith K, Springer TA, Umayam LA, White O, White RH, Conway de Macario E, Ferry JG, Jarrell KF, Jing H, Macario AJ, Paulsen I, Pritchett M, Sowers KR, Swanson RV, Zinder SH, Lander E, Metcalf WW, Birren B (2002) The genome of M. acetivorans reveals extensive metabolic and physiological diversity. Genome Res 12(4):532–542. Galazzo JL, Bailey JE (1990) Fermentation pathway kinetics and metabolic flux control in suspended and immobilized Saccharomyces cerevisiae. Enzyme Microbiol.
Technol 12:162–172. Gal-Mor O, Finlay BB (2006) Pathogenicity islands: a molecular toolbox for bacterial virulence. Cell Microbiol 8:1707–1719. Galperin MY (2006) The fuzzy border between a cell and an organelle. Environ Microbiol 8:2062–2067. Ganot P, Kallesoe T, Reinhardt R., Chourrout D, Thompson EM (2004) Spliced-leader RNA trans splicing in a chordate, Oikopleura dioica, with a compact genome. Mol Cell Biol 24(17):7795–7805. Gao F, Zhang C-T (2006) GC-Profile: a web-based tool for visualizing and analyzing the variation of GC content in genomic sequences. Nucleic Acids Research 34:686–691. Gao Z, Tseng CH, Pei Z, Blaser MJ (2007) Molecular analysis of human forearm superficial skin bacterial biota. Proc Natl Acad Sci USA 104:2927–2932. Gardner TS, di Bernardo D, Lorenz D, Collins JJ (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301(5629): 102–105. Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, Ester M, Brinkman FS (2005) PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics 21:617–623. Garfinkel D (1968) The role of computer simulation in biochemistry. Comput Biomed Res 2:i–ii. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M., Bauer A et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415:141–147.


Gelfand MS (1999) Recognition of regulatory sites by genomic comparison. Res Microbiol 150:755–771. Gelfand MS, Novichkov PS, Novichkova ES, Mironov AA (2000) Comparative analysis of regulatory patterns in bacterial genomes. Brief Bioinform 1:357–371. Gentry TJ, Wickham GS, Schadt CW, He Z, Zhou J (2006) Microarray applications in microbial ecology research. Microb Ecol 52:159–175. Gerdes SY, Scholle MD et al. (2003) Experimental determination and system level analysis of essential genes in Escherichia coli MG1655. J Bacteriol 185(19):5673–5684. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, Feil EJ, Stackebrandt E, Van de Peer Y, Vandamme P, Thompson FL, Swings J (2005) Opinion: Re-evaluating prokaryotic species. Nat Rev Microbiol 3(9):733–739. Gevers D, Vandepoele K, Simillion C, de Peer YV (2004) Gene duplication and biased functional retention of paralogs in bacterial genomes. Trends Microbiol 12:148–154. Gianchandani EP, Brautigan DL, Papin JA (2006) Systems analyses characterize integrated functions of biochemical networks. Trends Biochem Sci 31(5), 284–291. Gibbons FD, Proft M, Struhl K, Roth FP (2005) Chipper: discovering transcription-factor targets from chromatin immunoprecipitation microarrays using variance stabilization. Genome Biol 6(11):R96. Gil R, Sabater-Munoz B, Latorre A, Silva FJ, Moya A (2002) Extreme genome reduction in Buchnera spp.: toward the minimal genome needed for symbiotic life. Proc Natl Acad Sci USA 99:4454–4458. Gil R, Silva FJ, Zientz E, Delmotte F, Gonzalez-Candelas F, Latorre A, Rausell C, Kamerbeek J, Gadau J, Holldobler B, van Ham RCHJ, Gross R, Moya A (2003) The genome sequence of Blochmannia floridanus: comparative analysis of reduced genomes. Proc Natl Acad Sci USA 100:9388–9393. Gilchrist MA, Salter LA, Wagner A (2004) A statistical framework for combining and interpreting proteomic datasets. Bioinformatics 20(5):689–700. 
Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE (2006) Metagenomic analysis of the human distal gut microbiome. Science 312:1355–1359. Gillespie DE, Brady SF, Bettermann AD, Cianciotto NP, Liles MR, Rondon MR, Clardy J, Goodman RM, Handelsman J (2002) Isolation of antibiotics turbomycin A and B from a metagenomic library of soil microbial DNA. Applied and Environmental Microbiology 68:4301–4306. Gillings MR, Holley MP, Stokes HW, Holmes AJ (2005) Integrons in Xanthomonas: a source of species genome diversity. Proceedings of the National Academy of Sciences of the United States of America 102:4419–4424. Giovannoni S, Stingl U (2007) The importance of culturing bacterioplankton in the ‘omics’ age. Nature Reviews 5:820–826. Giovannoni SJ, Britschgi TB, Moyer CL, Field KG (1990) Genetic Diversity in Sargasso Sea Bacterioplankton. Nature 345:60–63. Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, Baptista D, Bibbs L, Eads J, Richardson TH, Noordewier M, Rappe MS, Short JM, Carrington JC, Mathur EJ (2005) Genome streamlining in a cosmopolitan oceanic bacterium. Science 309:1242–1245. Gitai Z, Thanbichler M, Shapiro L (2005) The choreographed dynamics of bacterial chromosomes. Trends Microbiol 13:221–228. Glass JI, Assad-Garcia N et al. (2006) Essential genes of a minimal bacterium. Proc Natl Acad Sci USA 103(2):425–430. Glass JI, Lefkowitz EJ, Glass JS, Heiner CR, Chen EY, Cassell GH (2000) The complete sequence of the mucosal pathogen Ureaplasma urealyticum. Nature 407:757–762.


Glazko GV, Mushegian AR (2004) Detection of evolutionarily stable fragments of cellular pathways by hierarchical clustering of phyletic patterns. Genome Biol 5(5):R32. Glew MD, Baseggio N, Markham PF, Browning GF, Walker ID (1998) Expression of the pMGA genes of Mycoplasma gallisepticum is controlled by variation in the GAA trinucleotide repeat lengths within the 5’ noncoding regions. Infect Immun 66: 5833–5841. Godde JS, Bickerton A (2006) The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J Mol Evol 62:718–729. Goel G, Chou IC, Voit EO (2006) Biological systems modeling and analysis: a biomolecular technique of the twenty-first century. J Biomol Tech 17(4):252–269. Gogarten J, Hilario E (2006) Inteins, introns, and homing endonucleases: recent revelations about the life cycle of parasitic genetic elements. BMC Evolutionary Biology 6:94. Gogarten JP (1994) Which is the most conserved group of proteins? Homology-orthology, paralogy, xenology, and the fusion of independent lineages. J Mol Evol 39(5):541–543. Gogarten JP (1995) The early evolution of cellular life. Trends in Ecology and Evolution 10:147–151. Gogarten JP, Doolittle WF, Lawrence JG (2002) Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19(12):2226–2238. Gogarten JP, Hilario E (2006) Inteins, introns, and homing endonucleases: recent revelations about the life cycle of parasitic genetic elements. BMC Evol Biol 6:94. Gogarten JP, Taiz L (1992) Evolution of proton pumping ATPases: Rooting the tree of life. Photosynthesis Research 33:137–146. Gogarten JP, Townsend JP (2005) Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol 3(9):679–687. Gogol EB, Cummings CA, Burns RC, Relman DA (2007) Phase variation and microevolution at homopolymeric tracts in Bordetella pertussis. BMC Genomics 8:122. Goldberg DS, Roth FP (2003) Assessing experimentally derived interactions in a small world. 
PNAS 100(8):4372–4376. Golding I, Paulsson J, Zawilski SM, Cox EC (2005) Real-time kinetics of gene activity in individual bacteria. Cell 123(6):1025–1036. Gollnick P, Babitzke P (2002) Transcription attenuation. Biochim Biophys Acta 1577(2):240–250. Gollnick P, Babitzke P, Antson A, Yanofsky C (2005) Complexity in regulation of tryptophan biosynthesis in Bacillus subtilis. Annu Rev Genet 39:47–68. Gombert AK, Nielsen J (2000) Mathematical modelling of metabolism. Curr Opin Biotechnol 11(2):180–186. Gomez-Valero L, Latorre A, Silva FJ (2004a) The evolutionary fate of nonfunctional DNA in the bacterial endosymbiont Buchnera aphidicola. Mol Biol Evol 21:2172–2181. Gomez-Valero L, Rocha EP, Latorre A, Silva FJ (2007a) Reconstructing the ancestor of Mycobacterium leprae: the dynamics of gene loss and genome reduction. Genome Res 17:1178–1185. Gomez-Valero L, Silva FJ, Simon JC, Latorre A (2007b) Genome reduction of the aphid endosymbiont Buchnera aphidicola in a recent evolutionary time scale. Gene 389: 87–95. Gomez-Valero L, Soriano-Navarro M, Perez-Brocal V, Heddi A, Moya A, Garcia-Verdugo JM, Latorre A (2004b) Coexistence of Wolbachia with Buchnera aphidicola and a secondary symbiont in the aphid Cinara cedri. J Bacteriol 186:6626–6633. Gonzalez OR, Kuper C, Jung K, Naval Jr PC, Mendoza E (2007) Parameter estimation using Simulated Annealing for S-system models of biochemical networks. Bioinformatics 23(4):480–486.


Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G (1979) Fitting the Gene Lineage into its Species Lineage, a Parsimony Strategy Illustrated by Cladograms Constructed from Globin Sequences. Systematic Zoology 28(2):132–163. Gophna U, Doolittle WF, Charlebois RL (2005) Weighted genome trees: refinements and applications. J Bacteriol 187(4):1305–1316. Gordon JI (2005) A genomic view of our symbiosis with members of the gut microbiota. Journal of Pediatric Gastroenterology and Nutrition 40 Suppl 1:S28. Gotoh O (1983) Prediction of melting profiles and local helix stability for sequenced DNA. Adv Biophys 16:1–52. Gottesman S (2005) Micros for microbes: non-coding regulatory RNAs in bacteria. Trends Genet 21(7):399–404. Gourse RL, Ross W, Gaal T (2000) UPs and downs in bacterial transcription initiation: the role of the alpha subunit of RNA polymerase in promoter recognition. Mol Microbiol 37(4):687–695. Gouy M, Gautier C (1982) Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res 10:7055–7074. Grahame DA, Gencic S, DeMoll E (2005) A single operon-encoded form of the acetyl-CoA decarbonylase/synthase multienzyme complex responsible for synthesis and cleavage of acetyl-CoA in Methanosarcina thermophila. Arch Microbiol 184(1):32–40. Grainger DC, Hurd D, Goldberg MD, Busby SJ (2006) Association of nucleoid proteins with coding and non-coding segments of the Escherichia coli genome. Nucleic Acids Res 34(16):4642–4652. Grainger DC, Hurd D, Harrison M, Holdstock J, Busby SJ (2005) Studies of the distribution of Escherichia coli cAMP-receptor protein and RNA polymerase along the E. coli chromosome. Proc Natl Acad Sci USA 102(49):17693–17698. Gralla JD (1996) Activation and repression of E. coli promoters. Curr Opin Genet Dev 6(5):526–530. Gralla JD, Collado-Vides J (1996) Organization and Function of Transcription Regulatory Elements. 
In: Neidhardt FC, Curtiss III R, Ingraham J, Lin ECC, Low KB, Magasanik B, Reznikoff W, Schaechter M, Umbarger HE, Riley M (eds) Cellular and Molecular Biology: Escherichia coli and Salmonella Chap. 79 (2nd ed). Washington, DC: American Society for Microbiology, pp. 1232–1245. Grant PR, Grant BR, Markert JA, Keller LF, Petren K (2004) Convergent evolution of Darwin’s finches caused by introgressive hybridization and selection. Evolution 58(7):1588–1599. Grantham R, Gautier C, Gouy M, Jacobzone M, Mercier R (1981) Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Res 9:r43–r74. Grantham R, Gautier C, Gouy M, Mercier R, Pavé A (1980) Codon catalog usage and the genome hypothesis. Nucleic Acids Res 8:r49–r62. Gray G, Fitch WM (1983) Evolution of Antibiotic Resistance Genes: The DNA Sequence of a Kanamycin Resistance Gene from Staphylococcus aureus. Mol Biol Evol 1:57–66. Gray MW (1999) Evolution of organellar genomes. Curr Opin Genet Dev 9:678–687. Grayling RA, Sandman K, Reeve JN (1996) Histones and chromatin structure in hyperthermophilic Archaea. FEMS Microbiol Rev 18(2–3):203–213. Green ML, Karp PD (2004) A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 5:76. Green PJ, Silverman BW (1994) Nonparametric regression and generalized linear models: a roughness penalty approach (1st ed.). London, New York: Chapman & Hall. Gressmann H, Linz B et al. (2005) Gain and loss of multiple genes during the evolution of Helicobacter pylori. PLoS Genet 1(4):e43.

Gribaldo S, Cammarano P (1998) The root of the universal tree of life inferred from anciently duplicated genes encoding components of the protein-targeting machinery. J Mol Evol 47(5):508–516. Gribskov M, Devereux J, Burgess RR (1984) The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nucleic Acids Res 12:539–549. Griffith F (1928) The significance of pneumococcal types. J Hyg 27:113–159. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33: D121–D124. Groisman EA, Casadesus J (2005) The origin and evolution of human pathogens. Mol Microbiol 56:1–7. Gross, L (2007) Untapped bounty: surveying the seas to survey microbial biodiversity. Plos Biology 5:e85. Gruber TM, Gross CA (2003) Multiple sigma subunits and the partitioning of bacterial transcription space. Annu Rev Microbiol 57:441–466. Gruber TM, Markov D, Sharp MM, Young BA, Lu CZ et al. (2001) Binding of the initiation factor sigma(70) to core RNA polymerase is a multistep process. Mol Cell 8(1):21–31. GuhaThakurta D (2006) Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res 34(12):3585–3598. Guo FB, Ou HY, Zhang CT (2003) ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Res 31:1780–1789. Gutierrez-Rios RM, Rosenblueth DA, Loza JA, Huerta AM, Glasner JD et al. (2003) Regulatory network of Escherichia coli: consistency between literature knowledge and microarray profiles. Genome Res 13(11):2435–2443. Hacker J, Blum-Oehler G, Hochhut B, Dobrindt U (2003) The molecular basis of infectious diseases: pathogenicity islands and other mobile genetic elements. A review. Acta Microbiol Immunol Hung 50:321–330. Hacker J, Blum-Oehler G, Muhldorfer I, Tschape H (1997) Pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution. 
Molecular Microbiology 23:1089–1097. Haeckel E (1866) Generelle Morphologie der Organismen: Allgemeine Grundzüge der organischen Formen-Wissenschaft mechanisch begründet durch die von Charles Darwin reformierte Descendenz-Theorie. Berlin, Georg Riemer. Hall RM (1997) Mobile gene cassettes and integrons: moving antibiotic resistance genes in gram-negative bacteria. Ciba Found Symp 207:192–202; discussion 202–205. Hall RM, Collis CM, Kim MJ, Partridge SR, Recchia GD, Stokes HW (1999) Mobile gene cassettes and integrons in evolution. Annals of the New York Academy of Sciences 870:68–80. Hallam SJ, Mincer TJ, Schleper C, Preston CM, Roberts K, Richardson PM, DeLong EF (2006) Pathways of carbon assimilation and ammonia oxidation suggested by environmental genomic analyses of marine Crenarchaeota. PLoS Biol 4:e95. Hamann CS, Sowers KR, Lipman RS, Hou YM (1999) An archaeal aminoacyl-tRNA synthetase missing from genomic analysis. J Bacteriol 181(18):5880–5884. Hamel L, Zhaxybayeva O, Gogarten JP (2005) PentaPlot: A software tool for the illustration of genome mosaicism. BMC Bioinformatics 6(1):139. Handelsman J (2004) Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68:669–685. Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM (1998) Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chemistry Biology 5:R245–R249.

Handelsman J, Tiedje JM, Alvarez-Cohen L, Ashburner M, Cann IKO, Delong EF, Doolittle WF, Fraser-Liggett CM, Godzik A, Gordon JI, Riley M, Schmidt TM (2007) The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet, Washington, DC: The National Academies Press. Haniford DB (2002) Transposon Tn10. In: Craig NL, Craigie R, Gellert M, Lambowitz AM (eds) Mobile DNA II. Washington, DC: ASM Press, pp. 457–483. Hannenhalli SS, Hayes WS, Hatzigeorgiou AG, Fickett JW (1999) Bacterial start site prediction. Nucleic Acids Res 27:3577–3582. Hao B, Gong W, Ferguson TK, James CM, Krzycki JA, Chan MK (2002) A new UAG-encoded residue in the structure of a methanogen methyltransferase. Science 296(5572):1462–1466. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature 431(7004): 99–104. Hardy S, Robillard PN (2004) Modeling and simulation of molecular biology systems using petri nets: modeling goals of various approaches. J Bioinform Comput Biol 2(4): 595–613. Harlow TJ, Gogarten JP, Ragan MA (2004) A hybrid clustering approach to recognition of protein families in 114 microbial genomes. BMC Bioinformatics 5:45. Harris JK, Kelley ST, Spiegelman GB, Pace NR (2003) The genetic core of the universal ancestor. Genome Res 13(3):407–412. Hartemink AJ, Gifford DK, Jaakkola TS, Young RA (2001) Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pac Symp Biocomput,:422–433. Hartke, Bouche, Giard, Benachour, Boutibonnes, Auffray (1996) The lactic acid stress response of L. lactis ssp. lactis. Curr Microbiol 33:194–199. Hartlein M, Cusack S (1995) Structure, function and evolution of seryl-tRNA synthetases: implications for the evolution of aminoacyl-tRNA synthetases and the genetic code. J Mol Evol 40(5):519–530. Hartwell LH, Hopfield JJ, Leibler S, Murray AW (1999) From molecular to modular cell biology. 
Nature 402(6761 Suppl):C47–C52. Hashimoto M, Ichimura T, Mizoguchi H, Tanaka K, Fujimitsu K et al. (2005) Cell size and nucleoid organization of engineered Escherichia coli cells with a reduced genome. Mol Microbiol 55(1):137–149. Hatzimanikatis V, Bailey JE (1996) MCA has more to say. J Theor Biol 182:233–242. Haugen SP, Berkmen MB, Ross W, Gaal T, Ward C et al. (2006) rRNA promoter regulation by nonoptimal binding of sigma region 1.2: an additional recognition element for RNA polymerase. Cell 125(6):1069–1082. Havranek JJ, Duarte CM, Baker D (2004) A simple physical model for the prediction and design of protein-DNA interactions. J Mol Biol 344(1):59–70. Hayat MA, Mancarella DA (1995) Nucleoid proteins. Micron 26(5):461–480. Hayes WS, Borodovsky M (1998) How to interpret an anonymous bacterial genome: machine learning approach to gene identification. Genome Res 8:1154–1171. He Z, Gentry TJ, Schadt CW, Wu L, Jost Liebich J, Chong SC, Huang Z, Wu W, Gu B, Jardine P, Criddle C, Zhou J (2007) GeoChip: a comprehensive microarray for investigating biogeochemical, ecological and environmental processes. The ISME Journal 1:67–77. Heijnen JJ (2005) Approximative kinetic formats used in metabolic network modeling. Biotechnol Bioeng 91:534–545. Heinrich R, Rapoport TA (1974) A linear steady-state treatment of enzymatic chains. General properties, control and effector strength. Eur J Biochem 42:89–95.

Henkin TM (1996) Control of transcription termination in prokaryotes. Annu Rev Genet 30:35–57. Henkin TM, Yanofsky C (2002) Regulation by transcription attenuation in bacteria: how RNA provides instructions for transcription termination/antitermination decisions. Bioessays 24(8):700–707. Hennig W (1966) Phylogenetic systematics, University of Illinois Press, Urbana. Henri MV (1903) Lois générales de l’action des diastases. Hermann, Paris. Hentschel U, Hacker J (2001) Pathogenicity islands: the tip of the iceberg. Microbes Infect 3:545–548. Herbeck JT, Degnan PH, Wernegreen JJ (2005) Nonhomogeneous model of sequence evolution indicates independent origins of primary endosymbionts within the enterobacteriales (gamma-Proteobacteria). Mol Biol Evol 22(3):520–532. Hernday A, Braaten B, Low D (2004) The intricate workings of a bacterial epigenetic switch. Adv Exp Med Biol 547:83–89. Herrero A, Muro-Pastor AM, Flores E (2001) Nitrogen control in cyanobacteria. J Bacteriol 183(2):411–425. Herring CD, Raffaelle M, Allen TE, Kanin EI, Landick R, Ansari AZ et al. (2005) Immobilization of Escherichia coli RNA polymerase and location of binding sites by use of chromatin immunoprecipitation and microarrays. J Bacteriol 187(17):6166–6174. Herring S, Ambrogelly A, Polycarpo CR, Soll D (2007) Recognition of pyrrolysine tRNA by the Desulfitobacterium hafniense pyrrolysyl-tRNA synthetase. Nucleic Acids Res 35(4):1270–1278. Herrmann U, Soppa J (2002) Cell cycle-dependent expression of an essential SMC-like protein and dynamic chromosome localization in the archaeon Halobacterium salinarum. Mol Microbiol 46(2):395–409. Herzel H, Weiss O, Trifonov EN (1999) 10–11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics 15:187–193. Hickey AJ, Conway de Macario E, Macario AJ (2002) Transcription in the archaea: basal factors, regulation, and stress-gene expression. Crit Rev Biochem Mol Biol 37(6):537–599.
Higgins CF, McLaren RS, Newbury SF (1988) Repetitive extragenic palindromic sequences, mRNA stability and gene expression: evolution by gene conversion? A review. Gene 72:3–14. Hilario E, Gogarten JP (1993) Horizontal transfer of ATPase genes — the tree of life becomes a net of life. Biosystems 31(2–3):111–119. Hinnebusch J, Tilly K (1993) Linear plasmids and chromosomes in bacteria. Mol Microbiol 10:917–922. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415(6868):180–183. Hochhut B, Wilde C, Balling G, Middendorf B, Dobrindt U, Brzuszkiewicz E, Gottschalk G, Carniel E, Hacker J (2006) Role of pathogenicity island-associated integrases in the genome plasticity of uropathogenic Escherichia coli strain 536. Mol Microbiol 61: 584–595. Hoefnagel MNH, Starrenburg MJC, Martens DE, Hugenholtz J, Kleerebezem M, Van Swam II, Bongers R, Westerhoff HV, Snoep JL (2002) Metabolic engineering of lactic acid bacteria, the combined approach: kinetic modelling, metabolic control and experimental analysis. Microbiology 148:1003–1013. Holmes AJ, Holley MP, Mahon A, Nield B, Gillings M, Stokes HW (2003) Recombination activity of a distinctive integron-gene cassette system associated with Pseudomonas stutzeri populations in soil. Journal of Bacteriology 185:918–928.

Holstege FC, Jennings EG, Wyrick JJ, Lee TI, Hengartner CJ et al. (1998) Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95(5):717–728. Hooshangi S, Thiberge S, Weiss R (2005) Ultrasensitivity and noise propagation in a synthetic transcriptional cascade. Proc Natl Acad Sci USA 102(10):3581–3586. Hoppert M, Mayer F (1999) Principles of macromolecular organization and cell function in bacteria and archaea. Cell Biochem Biophys 31(3):247–284. Horak CE, Snyder M (2002) ChIP-chip: a genomic approach for identifying transcription factor binding sites. Methods Enzymol 350:469–483. Horn M, Collingro A, Schmitz-Esser S, Beier CL, Purkhold U, Fartmann B, Brandt P, Nyakatura GJ, Droege M, Frishman D, Rattei T, Mewes HW, Wagner M (2004) Illuminating the evolutionary history of chlamydiae. Science 304:728–730. Horowitz H, Platt T (1982) Identification of trp-p2, an internal promoter in the tryptophan operon of Escherichia coli. J Mol Biol 156(2):257–267. Horowitz NH (1965) The evolution of biochemical syntheses-resprospect and prospect. Evolving genes and proteins. New York: Academic Press. Hotopp JCD, Clark ME, Oliveira DCSG, Foster JM, Fischer P, Torres MC, Giebel JD, Kumar N, Ishmael N, Wang SL, Ingram J, Nene RV, Shepard J, Tomkins J, Richards S, Spiro DJ, Ghedin E, Slatko BE, Tettelin H, Werren JH (2007) Widespread lateral gene transfer from intracellular bacteria to multicellular eukaryotes. Science 317:1753–1756. Hsiao W, Wan I, Jones SJ, Brinkman FS (2003) IslandPath: aiding detection of genomic islands in prokaryotes. Bioinformatics 19:418–420. Hsiao WW, Ung K et al. (2005) Evidence of a large novel gene pool associated with prokaryotic genomic islands. PLoS Genet 1(5):e62. Htun H, Dahlberg JE (1989) Topology and formation of triple-stranded H-DNA. Science 243:1571–1576. Hu J, Li B, Kihara D (2005) Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res 33(15):4899–4913. 
Hua SS, Markovitz A (1972) Multiple regulator gene control of the galactose operon in Escherichia coli K-12. J Bacteriol 110(3):1089–1099. Huber H, Hohn MJ, Rachel R, Fuchs T, Wimmer VC, Stetter KO (2002) A new phylum of Archaea represented by a nanosized hyperthermophilic symbiont. Nature 417:63–67. Huffman JL, Brennan RG (2002) Prokaryotic transcription regulators: more than just the helix-turn-helix motif. Curr Opin Struct Biol 12(1):98–106. Huggins AR, Sandine WE (1977) Incidence and properties of temperate bacteriophages induced from lactic streptococci. Appl Environ Microbiol 33:184–191. Hughes DS, Felbeck H, Stein JL (1997) A histidine protein kinase homolog from the endosymbiont of the hydrothermal vent tubeworm Riftia pachyptila. Applied and Environmental Microbiology 63:3494–3498. Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS et al. (2006) The PROSITE database. Nucleic Acids Res, 34(Database issue), D227–D230. Hundt S, Zaigler A, Lange C, Soppa J, Klug G (2007) Global analysis of mRNA decay in Halobacterium salinarum NRC-1 at single-gene resolution using DNA microarrays. J Bacteriol 189(19):6936–6944. Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 23(2):254–267. Hutchison CA, 3rd, Smith HO, Pfannkoch C, Venter JC (2005) Cell-free cloning using phi29 DNA polymerase. Proc Natl Acad Sci USA 102:17332–17336. Hutchison CA, 3rd, Venter JC (2006) Single-cell genomics. Nature biotechnology 24: 657–658. Hutchison CA, Peterson SN et al. (1999) Global transposon mutagenesis and a minimal Mycoplasma genome. Science 286(5447):2165–2169.

Huynen MA, Bork P (1998) Measuring genome evolution. Proc Natl Acad Sci USA 95(11):5849–5856. Huynen MA, Snel B (2000) Gene and context: integrative approaches to genome analysis. Adv Protein Chem 54:345–379. Huynen MA, Snel B, Mering CV, Bork P (2003) Function prediction and protein networks. Current Opinion in Cell Biology 15(2):191–198. Ibba M, Bono JL, Rosa PA, Soll D (1997) Archaeal-type lysyl-tRNA synthetase in the Lyme disease spirochete Borrelia burgdorferi. Proc Natl Acad Sci USA 94(26): 14383–14388. Ihmels J, Bergmann S, Barkai N (2004) Defining transcription modules using large-scale gene expression data. Bioinformatics 20(13):1993–2003. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y et al. (2002) Revealing modular organization in the yeast transcriptional network. Nat Genet 31(4):370–377. Ikemura T (1981a) Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J Mol Biol 146:1–21. Ikemura T (1981b) Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol 151:389–409. Ikemura T (1985) Codon usage and tRNA content in unicellular and multicellular organisms. Mol Biol Evol 2:13–34. Imamoto F (1973) Translation and transcription of the tryptophan operon. Prog Nucleic Acid Res Mol Biol 13:339–407. Ishihama A (2000) Functional modulation of Escherichia coli RNA polymerase. Annu Rev Microbiol 54:499–518. Ishii N, Nakahigashi K, Baba T, Robert M, Soga T, Kanai A et al. (2007) Multiple high-throughput analyses monitor the response of E. coli to perturbations. Science 316(5824):593–597. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y (2001) A comprehensive twohybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98:4569–4574. 
Itoh T, Takemoto K, Mori H, Gojobori T (1999) Evolutionary instability of operon structures disclosed by sequence comparisons of complete microbial genomes. Mol Biol Evol 16(3):332–346. Jacob E, Sasikumar R, Nair KN (2005) A fuzzy guided genetic algorithm for operon prediction. Bioinformatics 21(8):1403–1407. Jacob F (1970) La Logique du Vivant, Une Histoire de L’Hérédité. Paris: Gallimard. Jacob F, Brenner S (1963) On the regulation of DNA synthesis in bacteria: the hypothesis of the replicon. C R Hebd Seances Acad Sci 256:298–300. Jacob F, Monod J (1961) Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 3:318–356. Jacob F, Perrin D, Sanchez C, Monod J (1960) L’opéron: groupe de gènes à expression coordonnée par un opérateur. C. R. Seance Acad Sci 250:1727–1729. Janga SC (2006) Distinctive signatures of operon junctions across Prokaryotes. http://tikal.ccg.unam.mx/sarath/sig predictions/. Janga SC (2005) Nebulon. http://tikal.cifn.unam.mx/nebulon. Janga SC, Collado-Vides J, Moreno-Hagelsieb G (2005) Nebulon: a system for the inference of functional relationships of gene products from the rearrangement of predicted operons. Nucleic Acids Res 33(8):2521–2530. Janga SC, Lamboy WF, Huerta AM, Moreno-Hagelsieb G (2006) The distinctive signatures of promoter regions and operon junctions across prokaryotes. Nucleic Acids Res 34(14):3980–3987.

Janga SC, Salgado H, Collado-Vides J, Martinez-Antonio A (2007) Internal versus external effector and transcription factor gene pairs differ in their relative chromosomal position in Escherichia coli. J Mol Biol 368(1):263–272. Jans DA, Hassan G (1998) Nuclear targeting by growth factors, cytokines, and their receptors: a role in signaling? Bioessays 20(5):400–411. Jansen R, Bussemaker HJ, Gerstein M (2003) Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. Nucleic Acids Res 31: 2242–2251. Jansen R, Embden JD, Gaastra W, Schouls LM (2002) Identification of genes that are associated with DNA repeats in prokaryotes. Mol Microbiol 43:1565–1575. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S et al. (2003) A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302(5644):449–453. Jansson J, VerBerkmoes, N., Shah, M., Hettich, R., Dicksved, J., Halvarsson, J., Tysk, C., Engstrand, L., Rosenquist, M. (2006). Profiling of the Gut Microbiota of Identical with Crohn’s Disease using Proteomics and Microbiomics. 14th International Microbial Genomics Conference; September 24–28, 2006; UCLA Conference Center, Lake Arrowhead. Jelinek F, Mercer RL (1980) Interpolated estimation of Markov source parameters from sparse data. In: Gelsema, ES, Kanal, LN (eds), Pattern Recognition in Practice. NorthHolland Publishing Company, Amsterdam, pp. 381–397. Jenke-Kodama H, Borner T et al. (2006) Natural biocombinatorics in the polyketide synthase genes of the actinobacterium Streptomyces avermitilis. PLoS Comput Biol 2(10):e132. Jenner RG, Young RA (2005) Insights into host responses against pathogens from transcriptional profiling. Nat Rev Microbiol 3(4):281–294. Jensen RA (2001) Orthologs and paralogs — we need to get it right. Genome Biol 2(8): 1002. 
Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL (2000) The large-scale organization of metabolic networks. Nature 407(6804):651–654. Jeong K, Ahn J, Khodursky A (2004) Spatial patterns of transcriptional activity in the chromosome of Escherichia coli. Genome Biology 5(11):R86. Ji H, Wong WH (2005) TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics 21(18):3629–3636. Jiang SC, Paul JH (1998) Gene transfer by transduction in the marine environment. Applied and Environmental Microbiology 64:2780–2787. Jiang W, Metcalf WW, Lee KS, Wanner BL (1995) Molecular cloning, mapping, and regulation of Pho regulon genes for phosphonate breakdown by the phosphonatase pathway of Salmonella typhimurium LT2. J Bacteriol 177:6411–6421. Johnson DI, Somerville RL (1983) Evidence that repression mechanisms can exert control over the thr, leu, and ilv operons of Escherichia coli K-12. J Bacteriol 155(1):49–55. Johnson DI, Somerville RL (1984) New regulatory genes involved in the control of transcription initiation at the thr and ilv promoters of Escherichia coli K-12. Mol Gen Genet 195(1–2):70–76. Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M et al. (2006) Model-based analysis of tiling-arrays for ChIP-chip. PNAS 103(33):12457–12462. Johnson ZI, Chisholm SW (2004) Properties of overlapping genes are conserved across microbial genomes. Genome Res 14:2268–2272. Jorgensen BB, Boetius A (2007) Feast and famine–microbial life in the deep-sea bed. Nature Reviews 5:770–781.

Josse J, Kaiser AD, Kornberg A (1961) Enzymatic synthesis of deoxyribonucleic acid. VIII. Frequencies of nearest neighbor base sequences in deoxyribonucleic acid. J Biol Chem 236:864–875. Joyce T, Pintzas A (2007) Microarray analysis to reveal genes involved in colon carcinogenesis. Expert Opin Pharmacother 8(7):895–900. Kacser H, Burns JA (1973) The control of flux. Symp Soc Exp Biol 27:65–104. Kahali B, Basak S, Ghosh TC (2007) Reinvestigating the codon and amino acid usage of S. cerevisiae genome: a new insight from protein secondary structure analysis. Biochem Biophys Res Commun 354:693–699. Kanehisa M (2002) The KEGG database. Novartis Found Symp 247:91–101; discussion 101–103:119–128:244–152. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M (2004) The KEGG resource for deciphering the genome. Nucleic Acid Res 32:D277-D280. Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2007) Structural and functional diversity of the microbial kinome. PLoS Biol 5:e17. Kaplan T, Friedman N, Margalit H (2005) Ab initio prediction of transcription factor targets using structural knowledge. PLoS Comput Biol 1(1):e1. Karaolis DK, Johnson JA, Bailey CC, Boedeker EC, Kaper JB, Reeves PR (1998) A Vibrio cholerae pathogenicity island associated with epidemic and pandemic strains. Proc Natl Acad Sci USA 95:3134–3139. Karch H, Schubert S, Zhang D, Zhang W, Schmidt H, Olschlager T, Hacker J (1999) A genomic island, termed high-pathogenicity island, is present in certain non-O157 Shiga toxin-producing Escherichia coli clonal lineages. Infect Immun 67:5994–6001. Karkas JD, Rudner R, Chargaff E (1968) Separation of B. subtilis DNA into complementary strands. II. Template functions and composition as determined by transcription with RNA polymerase. Proc Natl Acad Sci USA 60:915–920. Karl DM (2007a) Microbial oceanography: paradigms, processes and promise. Nature Reviews Microbiology 5:759–769. Karl DM, Proctor, LM (2007) Foundations of microbial oceanography Oceanography 20: 16–27. 
Karlin S (2001) Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends in Microbiology 9(7):335–343. Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87(6):2264–2268. Karlin S, Altschul SF (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci USA 90(12):5873–5877. Karlin S, Blaisdell BE, Bucher P (1992) Quantile distributions of amino acid usage in protein classes. Protein Eng 5:729–738. Karlin S, Brendel V (1992) Chance and statistical significance in protein and DNA sequence analysis. Science 257:39–49. Karlin S, Brocchieri L, Bergman A, Mrázek J, Gentles AJ (2002a) Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci USA 99:333–338. Karlin S, Brocchieri L, Mrázek J, Campbell AM, Spormann AM (1999) A chimeric prokaryotic ancestry of mitochondria and primitive eukaryotes. Proc Natl Acad Sci USA 96:9190–9195. Karlin S, Brocchieri L, Trent J, Blaisdell BE, Mrázek J (2002b) Heterogeneity of genome and proteome content in bacteria, archaea, and eukaryotes. Theor Popul Biol 61:367–390. Karlin S, Bucher P (1992) Correlation analysis of amino acid usage in protein classes. Proc Natl Acad Sci USA 89:12165–12169.

Karlin S, Burge C (1995) Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 11:283–290. Karlin S, Campbell AM, Mrázek J (1998a) Comparative DNA analysis across diverse genomes. Annu Rev Genet 32:185–225. Karlin S, Cardon LR (1994) Computational DNA sequence analysis. Annu Rev Microbiol 48:619–654. Karlin S, Ladunga I (1994) Comparisons of eukaryotic genomic sequences. Proc Natl Acad Sci USA 91:12832–12836. Karlin S, Mrázek J (1996) What drives codon choices in human genes? J Mol Biol 262:459–472. Karlin S, Mrázek J (2000) Predicted highly expressed genes of diverse prokaryotic genomes. J Bacteriol 182:5238–5250. Karlin S, Mrázek J (2001) Predicted highly expressed and putative alien genes of Deinococcus radiodurans and implications for resistance to ionizing radiation damage. Proc Natl Acad Sci USA 98:5240–5245. Karlin S, Mrázek J, Campbell A, Kaiser D (2001) Characterizations of highly expressed genes of four fast-growing bacteria. J Bacteriol 183:5025–5040. Karlin S, Mrázek J, Campbell AM (1996) Frequent oligonucleotides and peptides of the Haemophilus influenzae genome. Nucleic Acids Res 24:4263–4272. Karlin S, Mrázek J, Campbell AM (1997) Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol 179:3899–3913. Karlin S, Mrázek J, Campbell AM (1998b) Codon usages in different gene classes of the Escherichia coli genome. Mol Microbiol 29:1341–1355. Karlin S, Ost F (1988) Maximal length of common words among random letter sequences. Ann Prob 16:535–563. Karp PD, Paley S, Romero P (2002) The pathway tools software. Bioinformatics 18(suppl 1), S225–S232. Kashi Y, King DG (2006) Simple sequence repeats as advantageous mutators in evolution. Trends Genet 22:253–259. Kauffman SA (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol 22(3):437–467. Kepes F (2004) Periodic transcriptional organization of the E. coli genome. J Mol Biol 340(5):957–964.
Kikuchi S, Tominaga D, Arita M, Takahashi K, Tomita M (2003) Dynamic modeling of genetic networks using genetic algorithm and S-system. Bioinformatics 19(5): 643–650. Kikuchi S, Tominaga D, Arita M, Tomita M (2001) Pathway finding from given timecourses using genetic algorithm. Genome Informatics 12:304–305. Kim HY, Gladyshev VN (2005) Different catalytic mechanisms in mammalian selenocysteine- and cysteine-containing methionine-R-sulfoxide reductases. PLoS Biol 3(12):e375. Kim K-Y, Cho D-Y, Zhang B-T (2006) Multi-stage evolutionary algorithms for efficient identification of gene regulatory networks. Paper presented at the EvoWorkshops 2006. Kim TH, Ren B (2006) Genome-wide analysis of protein-DNA interactions. Annual Review of Genomics and Human Genetics 7(1):81–102. Kimura S, Hatakeyama M, Konagaya A (2003) Inference of S-system models of genetic networks using a genetic local search. Paper presented at the Proceedings of the 2003 Congress on Evolutionary Computation (CEC2003), Canberra, Australia. Kimura S, Hatakeyama M, Konagaya A (2004) Inference of s-system models of genetic networks from noisy time-series data. Chem-Bio Informatics Journal 4(1):1–14.

Kimura S, Ide K, Kashihara A, Kano M, Hatakeyama M, Masui R et al. (2005) Inference of S-system models of genetic networks using a cooperative coevolutionary algorithm. Bioinformatics 21(7):1154–1163. King AD, Przulj N, Jurisica I (2004) Protein complex prediction via cost-based clustering. Bioinformatics 20(17):3013–3020. Kingman JFC (1982) The coalescent. Stochastic Processes and their Applications 13(3):235–248. Kingsford CL, Ayanbule K, Salzberg SL (2007) Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome Biol 8(2): R22. Kinne-Saffran E, Kinne RK (1999) Vitalism and synthesis of urea. From Friedrich Wohler to Hans A. Krebs. Am J Nephrol 19(2):290–294. Kirzhner V, Nevo E, Korol A, Bolshoy A (2003) A large-scale comparison of genomic sequences: one promising approach. Acta Biotheor 51:73–89. Kitayama T, Kinoshita A, Sugimoto M, Nakayama Y, Tomita M (2006) A simplified method for power-law modelling of metabolic pathways from time-course data and steady-state flux profiles. Theor Biol Med Model 3:1–9. Klaassens ES, de Vos WM, Vaughan EE (2007) Metaproteomics approach to study the functionality of the microbiota in the human infant gastrointestinal tract. Applied and Environmental Microbiology 73:1388–1392. Klasson L, Andersson SGE (2004) Evolution of minimal-gene-sets in host-dependent bacteria. Trends Microbiol 12:37–43. Knapp S, Hacker J, Jarchau T, Goebel W (1986) Large, unstable inserts in the chromosome affect virulence properties of uropathogenic Escherichia coli O6 strain 536. J Bacteriol 168:22–30. Knight RD, Freeland SJ, Landweber LF (1999) Selection, history and chemistry: the three faces of the genetic code. Trends Biochem Sci 24(6):241–247. Knight RD, Landweber LF (1998) Rhyme or reason: RNA-arginine interactions and the genetic code. Chem Biol 5(9):R215–R220. Knight RD, Landweber LF, Yarus M (2001) How mitochondria redefine the code. J Mol Evol 53(4–5):299–313. 
Kobayashi K, Ehrlich SD et al. (2003) Essential Bacillus subtilis genes. Proc Natl Acad Sci USA 100(8):4678–4683. Kolesov G, Mewes HW, Frishman D (2001) SNAPping up functionally related genes based on context information: a colinearity-free approach. J Mol Biol 311(4):639–656. Komaki K, Ishikawa H (1999) Intracellular bacterial symbionts of aphids possess many genomic copies per bacterium. J Mol Evol 48:717–722. Kondo N, Nikoh N, Ijichi N, Shimada M, Fukatsu T (2002) Genome fragment of Wolbachia endosymbiont transferred to X chromosome of host insect. Proc Natl Acad Sci USA 99:14280–14285. Kono H, Sarai A (1999) Structure-based prediction of DNA target sites by regulatory proteins. Proteins 35(1):114–131. Konstantinidis KT, Tiedje JM (2005) Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci USA 102(7):2567–2572. Koonin EV (2001) An apology for orthologs — or brave new memes. Genome Biol 2(4):COMMENT1005. Koonin EV (2003) Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol 1(2):127–136. Koonin EV (2005) Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 39:309–338. Koonin EV, Wolf YI et al. (2002) The structure of the protein universe and genome evolution. Nature 420(6912):218–223.


Korf I, Yandell M, Bedell J (2003) “BLAST”: An Essential Guide to the Basic Local Alignment Search Tool, O’Reilly. Kornberg H (2000) Krebs and his trinity of cycles. Nat Rev Mol Cell Biol 1(3):225–228. Kornberg HL, Krebs HA (1957) Synthesis of cell constituents from C2-units by a modified tricarboxylic acid cycle. Nature 179(4568):988–991. Koski LB, Golding GB (2001) The closest BLAST hit is often not the nearest neighbor. J Mol Evol 52(6):540–542. Koski LB, Morton RA, Golding GB (2001) Codon bias and base composition are poor indicators of horizontally transferred genes. Mol Biol Evol 18(3):404–412. Koza JR, Mydlowec W, Lanza G, Yu J, Keane MA (2001) Reverse engineering of metabolic pathways from observed data using genetic programming. Pac Symp Biocomput 434–445. Krause L, McHardy AC, Nattkemper TW, Puhler A, Stoye J, Meyer F (2007) GISMO– gene identification using a support vector machine for ORF classification. Nucleic Acids Res 35:540–549. Krawiec S, Riley M (1990) Organization of the bacterial chromosome. Microbiol Rev 54:502–539. Krebs HA, Johnson WA (1937) Metabolism of ketonic acids in animal tissues. Biochem J 31(4):645–660. Kreil DP, Ouzounis CA (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res 29:1608–1615. Kremling A, Jahreis K, Lengeler JW, Gilles ED (2000) The organization of metabolic reaction networks: a signal-oriented approach to cellular models. Metab Eng 2(3): 190–200. Krogh A (1997) Two methods for improving performance of an HMM and their application for gene finding. Proc Int Conf Intell Syst Mol Biol 5:179–186. Krogh A (2000) Using database matches with HMMGene for automated gene detection in Drosophila. Genome Res 10:523–528. Krogh A, Mian IS, Haussler D (1994) A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res 22:4768–4778. 
Kroll JS, Loynds BM, Langford PR (1992) Palindromic Haemophilus DNA uptake sequences in presumed transcriptional terminators from H. influenzae and H. parainfluenzae. Gene 114:151–152. Kryukov GV, Gladyshev VN (2004) The prokaryotic selenoproteome. EMBO Rep 5(5): 538–543. Krzycki JA (2004) Function of genetically encoded pyrrolysine in corrinoid-dependent methylamine methyltransferases. Curr Opin Chem Biol 8(5):484–491. Kuhnke G, Krause A, Heibach C, Gieske U, Fritz HJ, Ehring R (1986) The upstream operator of the Escherichia coli galactose operon is sufficient for repression of transcription initiated at the cyclic AMP-stimulated promoter. EMBO J 5(1): 167–173. Kunin V, Ouzounis CA (2003) GeneTRACE-reconstruction of gene content of ancestral species. Bioinformatics 19(11):1412–1416. Kunst F, Ogasawara N, Moszer I et al. (1997) The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature 390:249–256. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL (2004) Versatile and open software for comparing large genomes. Genome Biol 5:R12. Kurtz S, Schleiermacher C (1999) REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 15:426–427. Kustu S, Santero E, Keener J, Popham D, Weiss D (1989) Expression of sigma 54 (ntrA)-dependent genes is probably united by a common mechanism. Microbiol Rev 53(3): 367–376.


Kypr J (1988) Possible reason for the preferential insertion of adenine opposite abasic lesions in DNA. J Theor Biol 135:125–126. Kypr J, Mrázek J (1987) Occurrence of nucleotide triplets in genes and the secondary structure of the coded proteins. Int J Biol Macromol 9:49–53. Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35: 3100–3108. Laikova ON, Mironov AA, Gelfand MS (2001) Computational analysis of the transcriptional regulation of pentose utilization systems in the gamma subdivision of Proteobacteria. FEMS Microbiol Lett 205(2):315–322. Lake JA, Herbold CW, Rivera MC, Servin JA, Skophammer RG (2007) Rooting the tree of life using non-ubiquitous genes. Mol Biol Evol 24:130–136. Lall R, Voit EO (2005) Parameter estimation in modulated, unbranched reaction chains within biochemical systems. Comput Biol Chem 29(5):309–318. Lamour V, Quevillon S, Diriong S, N’Guyen VC, Lipinski M, Mirande M (1994) Evolution of the Glx-tRNA synthetase family: the glutaminyl enzyme as a case of horizontal gene transfer. Proc Natl Acad Sci USA 91(18):8670–8674. Lampson BC, Inouye M, Inouye S (2005) Retrons, msDNA, and the bacterial genome. Cytogenet Genome Res 110:491–499. Landick R (2006) The regulatory roles and mechanism of transcriptional pausing. Biochem Soc Trans 34(Pt 6):1062–1066. Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, Pace NR (1985) Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci USA 82:6955–6959. Lapierre P, Shial R, Gogarten JP (2006) Distribution of F- and A/V-type ATPases in Thermus scotoductus and other closely related species. Syst Appl Microbiol 29(1): 15–23. Larsen TS, Krogh A (2003) EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics 4:21. Laslett D, Canback B (2004) ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. 
Nucleic Acids Res 32:11–16. Lathe WC, Snel B, Bork P (2000) Gene context conservation of a higher order than operons. Trends Biochem Sci 25(10):474–479. Lawley TD, Klimke WA, Gubbins MJ, Frost LS (2003) F factor conjugation is a true type IV secretion system. FEMS Microbiol Lett 224:1–15. Lawrence CJ, Seigfried TE, Brendel V (2005) The maize genetics and genomics database. The community resource for access to diverse maize data. Plant physiology 138:55–58. Lawrence JG (2002) Gene Transfer in Bacteria: Speciation without Species? Theoretical Population Biology 61(4):449–460. Lawrence JG (2005) Common themes in the genome strategies of pathogens. Curr Opin Genet Dev 15:584–588. Lawrence JG, Hendrickson H (2005) Genome evolution in bacteria: order beneath chaos. Curr Opin Microbiol 8(5):572–578. Lawrence JG, Jeffrey G (2000) Clustering of antibiotic resistance genes: beyond the selfish operon. ASM News 66:281–286. Lawrence JG, Jeffrey G (2003) GENE ORGANIZATION: Selection, Selfishness, and Serendipity. Annual Review of Microbiology 57(1):419–440. Lawrence JG, Ochman H (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44(4):383–397. Lawrence JG, Ochman H (1998) Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci USA 95(16):9413–9417.


Lawrence JG, Ochman H (2002) Reconciling the many faces of lateral gene transfer. Trends Microbiol 10(1):1–4. Lawrence JG, Roth JR (1996) Selfish operons: horizontal transfer may drive the evolution of gene clusters. Genetics 143(4):1843–1860. Lawson CL, Swigon D, Murakami KS, Darst SA, Berman HM, Ebright RH (2004). Catabolite activator protein: DNA binding and transcription activation. Curr Opin Struct Biol 14(1):10–20. Lecompte O, Ripp R, Puzos-Barbe V, Duprat S, Heilig R, Dietrich J, Thierry JC, Poch O (2001) Genome evolution at the genus level: Comparison of three complete genomes of hyperthermophilic Archaea. Genome Res 11:981–993. Lederberg J, Cavalli LL, Lederberg EM (1952) Sex Compatibility in Escherichia Coli. Genetics 37:720–730. Lee I, Date SV, Adai AT, Marcotte EM (2004) A probabilistic functional network of yeast genes. Science 306(5701):1555–1558. Lee NH, Saeed AI (2007) Microarrays: an overview. Methods Mol Biol 353:265–300. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298(5594): 799–804. Leigh JA (1999) Transcriptional regulation in Archaea. Curr Opin Microbiol 2(2): 131–134. Leigh JA (2000) Nitrogen fixation in methanogens: the archaeal perspective. Curr Issues Mol Biol 2(4):125–131. Lenski RE (2004) Phenotypic and genomic evolution during a 20000 generation experiment with the bacterium Escherichia coli. Plant Breeding Reviews 24:225–265. Lento GM, Hickson RE, Chambers GK, Penny D (1995) Use of spectral analysis to test hypotheses on the origin of pinnipeds. Mol Biol Evol 12(1):28–52. Leplae R, Hebrant A, Wodak SJ, Toussaint A (2004) ACLAME: a Classification of Mobile Genetic Elements. Nucleic Acids Res 32:D45–D49. Leplae R, Lima-Mendez G, Toussaint A (2006) A first global analysis of plasmid encoded proteins in the ACLAME database. FEMS Microbiol Rev 30:980–994. Lerat E, Daubin V et al. 
(2005) Evolutionary origins of genomic repertoires in bacteria. PLoS Biol 3(5):e130. Lerat E, Daubin V, Moran NA (2003) From gene trees to organismal phylogeny in prokaryotes: The case of the gamma-proteobacteria. PLoS Biol 1(1): E19. Leung MY, Marsh GM, Speed TP (1996) Over- and underrepresentation of short DNA words in herpesvirus genomes. J Comput Biol 3:345–360. Levine M, Tjian R (2003) Transcription regulation and animal diversity. Nature 424(6945):147–151. Lewis DC (1991) A qualitative analysis of S-systems: Hopf-bifurcations. In: Voit EO (eds) Canonical Nonlinear Modeling: S-System Approach to Understanding Complexity. Van Nostrand Reinhold, New York. Lewis EB (1951) Pseudoallelism and gene evolution. Cold Spring Harbor Symp Quant Biol 16:159–174. Lewis M (2005) The lac repressor. C R Biol 328(6):521–548. Ley RE, Harris JK, Wilcox J, Spear JR, Miller SR, Bebout BM, Maresca JA, Bryant DA, Sogin ML, Pace NR (2006) Unexpected diversity and complexity of the Guerrero Negro hypersaline microbial mat. Appl Environ Microbiol 72(5):3685–3695. Ley RE, Peterson DA, Gordon JI (2006a) Ecological and evolutionary forces shaping microbial diversity in the human intestine. Cell 124:837–848. Ley RE, Turnbaugh PJ, Klein S, Gordon JI (2006b) Microbial ecology: human gut microbes associated with obesity. Nature 444:1022–1023.


Li G, Che D, Xu Y (2007) A universal operon predictor for prokaryotic genomes. Submitted. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (Oxford, England) 22:1658–1659. Li W, Meyer CA, Liu XS (2005) A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics 21(suppl 1), i274–i282. Li W-H (1997) Molecular Evolution. Sunderland, MA: Sinauer Associates. Liao JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP (2003) Network component analysis: reconstruction of regulatory signals in biological systems. Proc Natl Acad Sci USA 100(26):15522–15527. Lichtinghagen R, Musholt PB, Lein M, Romer A, Rudolph B, Kristiansen G et al. (2002) Different mRNA and protein expression of matrix metalloproteinases 2 and 9 and tissue inhibitor of metalloproteinases 1 in benign and malignant prostate tissue. Eur Urol 42(4):398–406. Lilja AE, Jenssen JR, Kahn JD (2004) Geometric and dynamic requirements for DNA looping, wrapping and unwrapping in the activation of E.coli glnAp2 transcription by NtrC. J Mol Biol 342(2):467–478. Lim A, Dimalanta ET, Potamousis KD, Yen G, Apodoca J, Tao C, Lin J, Qi R, Skiadas J, Ramanathan A, Perna NT, Plunkett G, 3rd, Burland V, Mau B, Hackett J, Blattner FR, Anantharaman TS, Mishra B, Schwartz DC (2001) Shotgun optical maps of the whole Escherichia coli O157:H7 genome. Genome Res 11:1584–1593. Lin X, Floudas CA, Wang Y, Broach JR (2003) Theoretical and computational studies of the glucose signaling pathways in yeast using global gene expression data. Biotechnol Bioeng 84:864–886. Lindahl L, Hinnebusch A (1992) Diversity of mechanisms in the regulation of translation in prokaryotes and lower eukaryotes. Curr Opin Genet Dev 2(5): 720–726. Lindholm D, Eriksson O, Korhonen L (2004) Mitochondrial proteins in neuronal degeneration. Biochem Biophys Res Commun 321:753–758. 
Link AJ, Robison K, Church GM (1997) Comparing the predicted and observed properties of proteins encoded in the genome of Escherichia coli K-12. Electrophoresis 18: 1259–1313. Liu X, Brutlag DL, Liu JS (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput 127–138. Liu Z, Mao F, Guo JT, Yan B, Wang P, Qu Y, Xu Y (2005) Quantitative evaluation of protein-DNA interactions using an optimized knowledge-based potential. Nucleic Acids Res 33(2):546–558. Livny J, Fogel MA, Davis BM, Waldor MK (2005) sRNAPredict: an integrative computational approach to identify sRNAs in bacterial genomes. Nucleic Acids Res 33:4096–4105. Lloyd G, Landini P, Busby S (2001) Activation and repression of transcription initiation in bacteria. Essays Biochem 37:17–31. Lo I, Denef VJ, Verberkmoes NC, Shah MB, Goltsman D, DiBartolo G, Tyson GW, Allen EE, Ram RJ, Detter JC, Richardson P, Thelen MP, Hettich RL, Banfield JF (2007) Strain-resolved community proteomics reveals recombining genomes of acidophilic bacteria. Nature 446:537–541. Lobry JR (1996) Asymmetric substitution patterns in the two DNA strands of bacteria. Mol Biol Evol 13:660–665. Loeb J (1906) The dynamics of living matter. Macmillan, New York. Lombardot T, Kottmann R, Pfeffer H, Richter M, Teeling H, Quast C, Glockner FO (2006) Megx.net–database resources for marine ecological genomics. Nucleic Acids Res 34:D390–D393.


Long M, Betran E et al. (2003) The origin of new genes: glimpses from the young and old. Nat Rev Genet 4(11):865–875. Long M, Langley CH (1993) Natural selection and the origin of jingwei, a chimeric processed functional gene in Drosophila. Science 260(5104):91–95. Long SW, Faguy DM (2004) Anucleate and titan cell phenotypes caused by insertional inactivation of the structural maintenance of chromosomes (smc) gene in the archaeon Methanococcus voltae. Mol Microbiol 52(6):1567–1577. Longstaff DG, Blight SK, Zhang L, Green-Church KB, Krzycki JA (2007) In vivo contextual requirements for UAG translation as pyrrolysine. Mol Microbiol 63(1): 229–241. Longstaff DG, Larue RC, Faust JE, Mahapatra A, Zhang L, Green-Church KB, Krzycki JA (2007) A natural genetic code expansion cassette enables transmissible biosynthesis and genetic encoding of pyrrolysine. Proc Natl Acad Sci USA 104(3):1021–1026. Lotka AJ (1924) Elements of Physical Biology. Williams and Wilkins, Baltimore. Lowe TM, Eddy SR (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25:955–964. Lu P, Vogel C, Wang R, Yao X, Marcotte EM (2007) Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat Biotechnol 25(1):117–124. Luijsterburg MS, Noom MC, Wuite GJ, Dame RT (2006) The architectural role of nucleoid-associated proteins in the organization of bacterial chromatin: a molecular perspective. J Struct Biol 156(2):262–272. Luisi PL (2002) Toward the engineering of minimal living cells. Anatomical Record 268: 208–214. Lukashin AV, Borodovsky M (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 26:1107–1115. Lundgren M, Andersson A, Chen L, Nilsson P, Bernander R (2004) Three replication origins in Sulfolobus species: synchronous initiation of chromosome replication and asynchronous termination. Proc Natl Acad Sci USA 101:7046–7051. 
Luo ZQ, Su S, Farrand SK (2003) In situ activation of the quorum-sensing transcription factor TraR by cognate and noncognate acyl-homoserine lactone ligands: kinetics and consequences. J Bacteriol 185(19):5665–5672. Lwoff A (1953) Lysogeny. Bacteriol Rev 17:269–337. Lynch M (2006) Streamlining and simplification of microbial genome architecture. Annu Rev Microbiol 60:327–349. Lynch M, Conery JS (2000) The evolutionary fate and consequences of duplicate genes. Science 290(5494):1151–1155. Lynch M, Conery JS (2003) The origins of genome complexity. Science 302:1401–1404. Maas WK, McFall E (1964) Genetic Aspects Of Metabolic Control. Annu Rev Microbiol 18:95–110. MacDonald D, Demarre G, Bouvier M, Mazel D, Gopaul DN (2006) Structural basis for broad DNA-specificity in integron recombination. Nature 440:1157–1162. MacNeil IA, Tiong CL, Minor C, August PR, Grossman TH, Loiacono KA, Lynch BA, Phillips T, Narula S, Sundaramoorthi R, Tyler A, Aldredge T, Long H, Gilman M, Holt D, Osburne MS (2001) Expression and isolation of antimicrobial small molecules from soil DNA libraries. Journal of Molecular Microbiology and Biotechnology 3: 301–308. Madan Babu M, Teichmann SA (2003) Evolution of transcription factors and the gene regulatory network in Escherichia coli. Nucleic Acids Res 31(4):1234–1244. Magasanik B (1989) Gene regulation from sites near and far. New Biol 1(3):247–251. Magasanik B (2000) Global regulation of gene expression. Proc Natl Acad Sci USA 97(26):14044–14045.


Maharjan RP, Seeto S, Ferenci T (2007) Divergence and redundancy of transport and metabolic rate-yield strategies in a single Escherichia coli population. J Bacteriol 189:2350–2358. Mahillon J, Chandler M (1998) Insertion sequences. Microbiol Mol Biol Rev 62: 725–774. Mahony S, McInerney JO, Smith TJ, Golden A (2004) Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models. BMC Bioinformatics 5:23. Majernik AI, Lundgren M, McDermott P, Bernander R, Chong JP (2005) DNA content and nucleoid distribution in Methanothermobacter thermautotrophicus. J Bacteriol 187(5):1856–1858. Majewski J, Cohan FM (1999) Adapt globally, act locally: the effect of selective sweeps on bacterial sequence diversity. Genetics 152(4):1459–1474. Makarova KS, Aravind L, Grishin NV, Rogozin IB, Koonin EV (2002) A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis. Nucleic Acids Res 30:482–496. Makarova KS, Grishin NV, Shabalina SA, Wolf YI, Koonin EV (2006) A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol Direct 1:7. Makarova KS, Mironov AA, Gelfand MS (2001) Conservation of the binding site for the arginine repressor in all bacterial lineages. Genome Biol 2(4):0013. Maki Y, Tominaga D, Okamoto M, Watanabe S, Eguchi Y (2001) Development of a system for the inference of large scale genetic networks. Pac Symp Biocomput 446–458. Maki Y, Ueda T, Okamoto M, Uematsu N, Inamura Y, Eguchi Y (2002) Inference of genetic network using the expression profile time course data of mouse P19 cells. Genome Informatics 13:382–383. Makita Y (2004) Database of Transcriptional Regulation in Bacillus subtilis (DBTBS). http://dbtbs.hgc.jp. 
Makita Y, de Hoon MJ, Danchin A (2007) Hon-yaku: a biology-driven Bayesian methodology for identifying translation initiation sites in prokaryotes. BMC Bioinformatics 8:47. Makita Y, Nakao M, Ogasawara N, Nakai K (2004) DBTBS: database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. Nucleic Acids Res 32(Database issue):D75–D77. Malandrin L, Huber H, Bernander R (1999) Nucleoid structure and partition in Methanococcus jannaschii: an archaeon with multiple copies of the chromosome. Genetics 152(4):1315–1323. Manchester KL (1995) Louis Pasteur (1822–1895)–chance and the prepared mind. Trends Biotechnol 13(12):511–515. Mandar R, Mikelsaar M (1996) Transmission of mother’s microflora to the newborn at birth. Biology of the Neonate 69:30–35. Manson JM, Gilmore MS (2006) Pathogenicity island integrase cross-talk: a potential new tool for virulence modulation. Mol Microbiol 61:555–559. Manson MA, Church GM (2000) Predicting regulons and their cis-regulatory motifs by comparative genomics. Nucleic Acids Res 28:4523–4530. Mao F, Su Z, Olman V, Dam P, Liu Z, Xu Y (2006) Mapping of orthologous genes in the context of biological pathways: An application of integer programming. PNAS 103(1):129–134. Marcotte CJ, Marcotte EM (2002) Predicting functional linkages from gene fusions with confidence. Appl Bioinformatics 1(2):93–100.


Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D (1999) Detecting protein function and protein-protein interactions from genome sequences. Science 285(5428):751–753. Marcotte EM, Pellegrini M, Thompson M, Yeates T, Eisenberg D (1999) A combined algorithm for genome-wide prediction of protein function. Nature 402(23):25–26. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–380. Margulis L (1995) Symbiosis in Cell Evolution: Microbial Communities in the Archean and Proterozoic Eons, WH Freeman & Co. Margulis L, Sagan D (2002) Acquiring genomes: A theory of the origin of species, Basic Books, New York. Marino S, Voit EO (2006) An automated procedure for the extraction of metabolic network information from time series data. J Bioinform Comput Biol 4(3):665–691. Markowitz VM, Ivanova N, Palaniappan K, Szeto E, Korzeniewski F, Lykidis A, Anderson I, Mavromatis K, Kunin V, Garcia Martin H, Dubchak I, Hugenholtz P, Kyrpides NC (2006a) An experimental metagenome data management and analysis system. Bioinformatics 22:e359–e367. 
Markowitz VM, Korzeniewski F, Palaniappan K, Szeto E, Werner G, Padki A, Zhao X, Dubchak I, Hugenholtz P, Anderson I, Lykidis A, Mavromatis K, Ivanova N, Kyrpides NC (2006b) The integrated microbial genomes (IMG) system. Nucleic Acids Res 34:D344–D348. Marshall BJ, Warren JR (1984) Unidentified curved bacilli in the stomach of patients with gastritis and peptic ulceration. Lancet 1:1311–1315. Marshall CG, Lessard IA, Park I, Wright GD (1998) Glycopeptide antibiotic resistance genes in glycopeptide-producing organisms. Antimicrobial Agents and Chemotherapy 42:2215–2220. Martignetti JA, Brosius J (1993) BC200 RNA: a neural RNA polymerase III product encoded by a monomeric Alu element. Proc Natl Acad Sci USA 90(24):11563–11567. Martin F, Jean-Stephane V (2004) Detecting uber-operons in prokaryotics genomes, Laboratoire d’Informatique Fondamentale de Lille, France. Martin S, Roe D, Faulon J-L (2005) Predicting protein-protein interactions using signature products. Bioinformatics 21(2):218–226. Martin W (1999) Mosaic bacterial chromosomes: a challenge en route to a tree of genomes. Bioessays 21(2):99–104. Martinez-Antonio A, Collado-Vides J (2003) Identifying global regulators in transcriptional regulatory networks in bacteria. Curr Opin Microbiol 6(5):482–489.


Martinez-Antonio A, Janga SC, Salgado H, Collado-Vides J (2006) Internal-sensing machinery directs the activity of the regulatory network in Escherichia coli. Trends Microbiol 14(1):22–27. Martinez-Antonio A, Salgado H, Gama-Castro S, Gutierrez-Rios RM, Jimenez-Jacinto V et al. (2003) Environmental conditions and transcriptional regulation in Escherichia coli: a physiological integrative approach. Biotechnol Bioeng 84(7):743–749. Matsubara Y, Kikuchi S, Sugimoto M, Tomita M (2006) Parameter estimation for stiff equations of biosystems using radial basis function networks. BMC Bioinformatics 7:230. Matsumoto Y, Shigesada K, Hirano M, Imai M (1986) Autogenous regulation of the gene for transcription termination factor rho in Escherichia coli: localization and function of its attenuators. J Bacteriol 166(3):945–958. Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J et al. (2001) Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”. Genome Res 11(12):2120–2126. Maurelli AT, Fernandez RE, Bloch CA, Rode CK, Fasano A (1998) “Black holes” and bacterial pathogenicity: A large genomic deletion that enhances the virulence of Shigella spp. and enteroinvasive Escherichia coli. Proc Natl Acad Sci USA 95: 3943–3948. Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, Kyrpides NC (2007) Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nature Methods 4:495–500. May RM (1976) Theoretical Ecology: Principles and Applications. Blackwell, Oxford. Mazel D (2006) Integrons: agents of bacterial evolution. Nat Rev Microbiol 4:608–620. Mazel D, Dychinco B, Webb VA, Davies J (1998) A distinctive class of integron in the Vibrio cholerae genome. Science 280:605–608. 
McAdams HH, Shapiro L (2003) A bacterial cell-cycle regulatory network operating in time and space. Science 301(5641):1874–1877. McClintock B (1941) The Stability of Broken Ends of Chromosomes in Zea Mays. Genetics 26:234–282. McClure WR (1985) Mechanism and control of transcription initiation in prokaryotes. Annu Rev Biochem 54:171–204. McCue LA, Thompson W, Carmack CS, Lawrence CE (2002) Factors influencing the identification of transcription factor binding sites by cross-species comparison. Genome Res 12(10):1523–1532. McCutcheon JP, Moran MA (2007) Parallel genomic evolution and metabolic interdependence in an ancient symbiosis. Proc Natl Acad Sci USA 104:19392–19397. McDonald TG, Van Eyk JE (2003) Mitochondrial proteomics. Undercover in the lipid bilayer. Basic Res Cardiol 98:219–227. McGuire AM, Hughes JD, Church GM (2000) Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Res 10:744–757. McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I (2007) Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 4:63–72. McKay DB, Steitz TA (1981) Structure of catabolite gene activator protein at 2.9 Å resolution suggests binding to left-handed B-DNA. Nature 290(5809):744–749. McKnight SL, Kingsbury R (1982) Transcriptional control signals of a eukaryotic protein-coding gene. Science 217(4557):316–324. McNally DJ, Hui JP, Aubry AJ, Mui KK, Guerry P, Brisson JR, Logan SM, Soo EC (2006) Functional characterization of the flagellar glycosylation locus in Campylobacter jejuni 81–176 using a focused metabolomics approach. J Biol Chem 281:18489–18498.


McNealy TL, Forsbach-Birk V, Shi C, Marre R (2005) The Hfq homolog in Legionella pneumophila demonstrates regulation by LetA and RpoS and interacts with the global regulator CsrA. J Bacteriol 187(4):1527–1532. Médigue C, Rouxel T, Vigier P, Hénaut A, Danchin A (1991) Evidence for horizontal gene transfer in Escherichia coli speciation. J Mol Biol 222:851–856. Medini D, Donati C et al. (2005) The microbial pan-genome. Curr Opin Genet Dev 15(6):589–594. Mellor JC, Yanai I, Clodfelter KH, Mintseris J, DeLisi C (2002) Predictome: a database of putative functional links between proteins. Nucleic Acids Res 30(1):306–309. Mendes P, Kell D (1998) Non-linear optimization of biochemical pathways: applications to metabolic engineering and parameter estimation. Bioinformatics 14(10): 869–883. Merkl R (2004) SIGI: score-based identification of genomic islands. BMC Bioinformatics 5:22. Meyer IM (2007) A practical guide to the art of RNA gene prediction. Brief Bioinform. Advance Access published online, doi:10.1093/bib/bbm011. Michaelis L, Menten ML (1913) Die Kinetik der Invertinwirkung. Biochem Zeitschrift 49:333–369. Middendorf B, Hochhut B, Leipold K, Dobrindt U, Blum-Oehler G, Hacker J (2004) Instability of pathogenicity islands in uropathogenic Escherichia coli 536. J Bacteriol 186:3086–3096. Miller SL (1953) A production of amino acids under possible primitive earth conditions. Science 117(3046):528–529. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D et al. (2002) Network motifs: simple building blocks of complex networks. Science 298(5594):824–827. Mira A, Moran NA (2002) Estimating population size and transmission bottlenecks in maternally transmitted endosymbiotic bacteria. Microb Ecol 44:137–143. Mira A, Ochman H, Moran NA (2001) Deletional bias and the evolution of bacterial genomes. Trends Genet 17:589–596. Mirkin B, Muchnik I, Smith TF (1995) A biologically consistent model for comparing molecular phylogenies. J Comput Biol 2(4):493–507. 
Mirkin BG, Fenner TI, Galperin MY, Koonin EV (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol Biol 3(1):2. Mironov AA, Koonin EV, Roytberg MA, Gelfand MS (1999) Computer analysis of transcription regulatory patterns in completely sequenced bacterial genomes. Nucleic Acids Res 27(14):2981–2989. Miroslavova NS, Busby SJ (2006) Investigations of the modular structure of bacterial promoters. Biochem Soc Symp 73:1–10. Miroslavova NS, Mitchell JE, Tebbutt J, Busby SJ (2006) Recruitment of RNA polymerase to Class II CRP-dependent promoters is improved by a second upstream-bound CRP molecule. Biochem Soc Trans 34(Pt 6):1075–1078. Mocellin S, Rossi CR (2007) Principles of gene microarray data analysis. Adv Exp Med Biol 593:19–30. Mojica FJ, Díez-Villaseñor C, García-Martínez J, Soria E (2005) Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol 60:174–182. Mojica FJ, Díez-Villaseñor C, Soria E, Juez G (2000) Biological significance of a family of regularly spaced repeats in the genomes of Archaea, Bacteria and mitochondria. Mol Microbiol 36:244–246.


Mojica FJ, Ferrer C, Juez G, Rodríguez-Valera F (1995) Long stretches of short tandem repeats are present in the largest replicons of the Archaea Haloferax mediterranei and Haloferax volcanii and could be involved in replicon partitioning. Mol Microbiol 17:85–93. Moles CG, Mendes P, Banga JR (2003) Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Res 13(11):2467–2474. Montague MG, Hutchison CA (2000) Gene content phylogeny of herpesviruses. Proc Natl Acad Sci USA 97(10):5334–5339. Mooney RA, Darst SA, Landick R (2005) Sigma and RNA polymerase: an on-again, off-again relationship? Mol Cell 20(3):335–345. Mooney RA, Landick R (2003) Tethering sigma70 to RNA polymerase reveals high in vivo activity of sigma factors and sigma70-dependent pausing at promoter-distal locations. Genes Dev 17(22):2839–2851. Moore MJ, Dhingra A, Soltis PS, Shaw R, Farmerie WG, Folta KM, Soltis DE (2006) Rapid and accurate pyrosequencing of angiosperm plastid genomes. BMC Plant Biology 6:17. Mootha VK, Bunkenborg J, Olsen JV, Hjerrild M, Wisniewski JR, Stahl E, Bolouri MS, Ray HN, Sihag S, Kamal M, Patterson N, Lander ES, Mann M (2003) Integrated analysis of protein composition, tissue diversity, and gene regulation in mouse mitochondria. Cell 115:629–640. Moran MA, E.V. (2007) Genomes of Sea Microbes. Oceanography 20. Moran MA, Miller WL (2007) Resourceful heterotrophs make the most of light in the coastal ocean. Nature Reviews 5:792–800. Moran NA (1996) Accelerated evolution and Muller’s rachet in endosymbiotic bacteria. Proc Natl Acad Sci USA 93:2873–2878. Moran NA (2002) Microbial minimalism: genome reduction in bacterial pathogens. Cell 108:583–586. Moran NA (2003) Tracing the evolution of gene loss in obligate bacterial symbionts. Curr Opin Microbiol 6:512–518. Moran NA, Degnan PH, Santos SR, Dunbar HE, Ochman H (2005) The players in a mutualistic symbiosis: Insects, bacteria, viruses, and virulence genes. 
Proc Natl Acad Sci USA 102:16919–16926. Moran NA, Mira A (2001) The process of genome shrinkage in the obligate symbiont Buchnera aphidicola. Genome Biol 2:0054. Moran NA, Plague GR (2004) Genomic changes following host restriction in bacteria. Curr Opin Genet Dev 14:627–633. Moran NA, Wernegreen JJ (2000) Lifestyle evolution in symbiotic bacteria: insights from genomics. Trends Ecol Evol 15:321–326. Morandi A, Zhaxybayeva O et al. (2005) Evolutionary and diagnostic implications of intragenomic heterogeneity in the 16S rRNA gene in Aeromonas strains. J Bacteriol 187(18):6561–6564. Moreno-Hagelsieb G, Collado-Vides J (2002) A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics 18 Suppl 1:S329–S336. Moret BME, Tang JJ, Wang LS, Warnow T (2002) Steps toward accurate reconstructions of phylogenies from gene-order data. J Comput Syst Sci 65:508–525. Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M (2007) KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res W182-W185. Morozov AV, Havranek JJ, Baker D, Siggia ED (2005) Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res 33(18):5781–5798. Moxon ER, Rainey PB, Nowak MA, Lenski RE (1994) Adaptive evolution of highly mutable loci in pathogenic bacteria. Curr Biol 4:24–33.

Mrázek J (2006) Analysis of distribution indicates diverse functions of simple sequence repeats in Mycoplasma genomes. Mol Biol Evol 23:1370–1385. Mrázek J, Gaynon LH, Karlin S (2002) Frequent oligonucleotide motifs in genomes of three streptococci. Nucleic Acids Res 30:4216–4221. Mrázek J, Guo X, Shah A (2007) Simple sequence repeats in prokaryotic genomes. Proc Natl Acad Sci USA 104:8472–8477. Mrázek J, Karlin S (1998) Strand compositional asymmetry in bacterial and large viral genomes. Proc Natl Acad Sci USA 95:3720–3725. Mrázek J, Karlin S (1999) Detecting alien genes in bacterial genomes. Ann NY Acad Sci 870:314–329. Mrázek J, Kypr J (1994a) Biased distribution of adenine and thymine in gene nucleotide sequences. J Mol Evol 39:439–447. Mrázek J, Kypr J (1994b) Length expansion is a general property of simple sequence repeats in eukaryotic genomes. Miami Biotechnology Short Reports 5:39. Mrázek J, Spormann AM, Karlin S (2006) Genomic comparisons among gammaproteobacteria. Environ Microbiol 8:273–288. Mrázek J, Xie S (2006) Pattern locator: a new tool for finding local sequence patterns in genomic DNA sequences. Bioinformatics 22:3099–3100. Muller HJ (1964) The relation of recombination to mutational advance. Mutat Res 1:2–9. Muramatsu T, Yokoyama S, Horie N, Matsuda A, Ueda T, Yamaizumi Z, Kuchino Y, Nishimura S, Miyazawa T (1988) A novel lysine-substituted nucleoside in the first position of the anticodon of minor isoleucine tRNA from Escherichia coli. J Biol Chem 263(19):9261–9267. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247(4):536–540. Mushegian A (1999) The minimal genome concept. Curr Opin Genet Dev 9:709–714. Mushegian AR, Koonin EV (1996) Gene order is not conserved in bacterial evolution. Trends in Genetics 12(8):289–290.
Mushegian AR, Koonin EV (1996) A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci USA 93:10268–10273. Muzzi A, Masignani V, Rappuoli R (2007) The pan-genome: towards a knowledge-based discovery of novel targets for vaccines and antibacterials. Drug Discovery Today 12:429–439. Myllykallio H, Lopez P, Lopez-Garcia P, Heilig R, Saurin W, Zivanovic Y, Philippe H, Forterre P (2000) Bacterial mode of replication with eukaryotic-like machinery in a hyperthermophilic archaeon. Science 288:2212–2215. Naas T, Blot M, Fitch WM, Arber W (1994) Insertion sequence-related genetic variation in resting Escherichia coli K-12. Genetics 136:721–730. Naas T, Mikami Y, Imai T, Poirel L, Nordmann P (2001) Characterization of In53, a class 1 plasmid- and composite transposon-located integron of Escherichia coli which carries an unusual array of gene cassettes. Journal of Bacteriology 183:235–249. Nagel GM, Doolittle RF (1995) Phylogenetic analysis of the aminoacyl-tRNA synthetases. J Mol Evol 40(5):487–498. Nakabachi A, Yamashita A, Toh H, Ishikawa H, Dunbar HE, Moran NA, Hattori M (2006) The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science 314:267. Nakamura Y, Gojobori T, Ikemura T (1999) Codon usage tabulated from the international DNA sequence databases; its status 1999. Nucleic Acids Res 27:292. Nakamura Y, Itoh T, Matsuda H, Gojobori T (2004) Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nature genetics 36:760–766.

Nakashima H, Fukuchi S, Nishikawa K (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J Biochem (Tokyo) 133:507–513. Nanavati DM, Nguyen TN et al. (2005) Substrate specificities and expression patterns reflect the evolutionary divergence of maltose ABC transporters in Thermotoga maritima. J Bacteriol 187(6):2002–2009. Navarre WW, Porwollik S, Wang Y, McClelland M, Rosen H et al. (2006) Selective silencing of foreign DNA with low GC content by the H-NS protein in Salmonella. Science 313(5784):236–238. Needleman SB, Wunsch Cd (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453. Neidhardt FC, Curtiss III R, Ingraham JL, Lin ECC, Low KB, Magasanik B, Reznikoff WS, Riley M, Schaechter M, Umbarger HE (2002) EcoSal : Escherichia coli and Salmonella : cellular and molecular biology. Washington DC: ASM Press. Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Nelson WC, Ketchum KA, McDonald L, Utterback TR, Malek JA, Linher KD, Garrett MM, Stewart AM, Cotton MD, Pratt MS, Phillips CA, Richardson D, Heidelberg J, Sutton GG, Fleischmann RD, Eisen JA, White O, Salzberg SL, Smith HO, Venter JC, Fraser CM (1999) Evidence for lateral gene transfer between archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399(6734): 323–329. Nemeth A, Langst G (2004) Chromatin higher order structure: opening up chromatin for transcription. Brief Funct Genomic Proteomic 2(4):334–343. Neves AR, Ramos A, Nunes MC, Kleerebezem M, Hugenholtz J, de Vos WM, Almeida J, Santos H (1999) In vivo nuclear magnetic resonance studies of glycolytic kinetics in Lactococcis lactis Biotechnol. Bioeng 64:200–212. 
Newton ILG, Woyke T, Auchtung TA, Dilly GF, Dutton RJ, Fisher MC, Fontanez KM, Lau E, Stewart FJ, Richardson PM, Barry KW, Saunders E, Detter JC, Wu D, Eisen JA, Cavanaugh CM (2007) The Calyptogena magnifica chemoautotrophic symbiont genome. Science 315:998–1000. Nicholls H (2007) Sorcerer II: the search for microbial diversity roils the waters. PLoS Biol 5:e74. Nielsen P, Krogh A (2005) Large-scale prokaryotic gene prediction and comparison to genome annotation Bioinformatics 21(24):4322–4329. Ninfa AJ, Magasanik B (1986) Covalent modification of the glnG product, NRI, by the glnL product, NRII, regulates the transcription of the glnALG operon in Escherichia coli. Proc Natl Acad Sci USA 83(16):5909–5913. Noman N, Iba H (2005) Reverse engineering genetic networks using evolutionary computation. Genome Inform 16(2):205–214. Noman N, Iba H (2006) Inference of genetic networks using S-system: information criteria for model selection. Paper presented at the Genetic and Evolutionary Computation Conference (GECCO), Seattle, Washington. Nordheim A, Rich A (1983) The sequence (dC-dA)n X (dG-dT)n forms left-handed Z-DNA in negatively supercoiled plasmids. Proc Natl Acad Sci USA 80:1821–1825. Normand P, Lapierre P et al. (2007) Genome characteristics of facultatively symbiotic Frankia sp. strains reflect host range and host plant biogeography. Genome Res 17(1):7–15. Novichkov PS, Omelchenko MV, Gelfand MS, Mironov AA, Wolf YI, Koonin EV (2004) Genome-wide molecular clock and horizontal gene transfer in bacterial evolution. J Bacteriol 186(19):6575–6585. Nudler E (2006) Flipping riboswitches. Cell 126(1):19–22.

Nudler E, Gottesman ME (2002) Transcription termination and anti-termination in E. coli. Genes Cells 7(8):755–768. Nurminsky DI, Nurminskaya MV et al. (1998) Selective sweep of a newly evolved spermspecific gene in Drosophila. Nature 396(6711):572–575. O’Brien KP, Remm M, Sonnhammer EL (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 33(Database issue): D476–D480. Ochman H, Davalos LM (2006) The nature and dynamics of bacterial genomes. Science 311:1730–1733. Ochman H, Lawrence JG, Groisman EA (2000) Lateral gene transfer and the nature of bacterial innovation. Nature 405(6784):299–304. Ogawa A, Takeda T (1993) The gene encoding the heat-stable enterotoxin of Vibrio cholerae is flanked by 123-base pair direct repeats. Microbiology and Immunology 37:607–616. Oger P, Kim KS, Sackett RL, Piper KR, Farrand SK (1998) Octopine-type Ti plasmids code for a mannopine-inducible dominant-negative allele of traR, the quorum-sensing activator that regulates Ti plasmid conjugal transfer. Mol Microbiol 27(2):277–288. Ogura M, Tanaka T (2002) Recent progress in Bacillus subtilis two-component regulation. Front Biosci 7:1815–1824. Oh HJ, Cho KW, Jung IS, Kim WH, Hur BK, Kim GJ (2003) Expanding functional spaces of enzymes by utilizing whole genome treasure for library construction. Journal of Molecular Catalysis B-Enzymatic 26:241–250. Ohnishi M, Kurokawa K, Hayashi T (2001) Diversification of Escherichia coli genomes: are bacteriophages the major contributors? Trends Microbiol 9:481–485. Okuda S (2004) Operon DataBase (ODB). http://odb.kuicr.kyoto-u.ac.jp. Okuda S, Katayama T, Kawashima S, Goto S, Kanehisa M (2006) ODB: a database of operons accumulating known operons across multiple genomes. Nucleic Acids Res 34(Database issue):D358–D362. Olendzenski L, Liu L, Zhaxybayeva O, Murphey R, Shin DG, Gogarten JP (2000) Horizontal transfer of archaeal genes into the Deinococcaceae: Detection by molecular and computer-based approaches. 
J Mol Evol 51(6):587–599. Olendzenski L, Zhaxybayeva O, Gogarten JP (2004) A Brief History on Views of Prokaryotic Evolution and Taxonomy. Microbial Genomes. C. M. Fraser, T. Read and K. Nelson, Humana Press:143–154. Oliva A, Rosebrock A, Ferrezuelo F, Pyne S, Chen H, Skiena S et al. (2005) The cell cycle-regulated genes of Schizosaccharomyces pombe. PLoS Biol 3(7):e225. Olman V, Xu D, Xu Y (2003a) CUBIC: identification of regulatory binding sites through data clustering. J Bioinform Comput Biol 1(1):21–40. Olman V, Xu D, Xu Y (2003b) Identification of regulatory binding sites using minimum spanning trees. Pac Symp Biocomput 327–338. Ong IM, Glasner JD, Page D (2002) Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics 18 Suppl 1, S241–S248. Ordovas JM, Mooser V (2006) Metagenomics: the role of the microbiome in cardiovascular diseases. Current opinion in lipidology 17:157–161. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM (1997) CATH–a hierarchic classification of protein domain structures. Structure 5(8):1093–1098. Osborn AM, Boltner D (2002) When phage, plasmids, and transposons collide: genomic islands, and conjugative- and mobilizable-transposons as a mosaic continuum. Plasmid 48:202–212. O’Shea YA, Boyd EF (2002) Mobilization of the Vibrio pathogenicity island between Vibrio cholerae isolates mediated by CP-T1 generalized transduction. FEMS Microbiol Lett 214:153–157.

Oshima K, Kakizawas, Nishigawa H, Jung HY, Wei W, Suzuki S, Arashida R, Nakata D, Miyata S, Ugaki M, Namba S (2004) Reductive evolution suggested from the complete genome sequence of a plant-pathogenic phytoplasma. Nat Genet 36(1):27–29. Ou HY, Chen LL, Lonnen J, Chaudhuri RR, Thani AB, Smith R, Garton NJ, Hinton J, Pallen M, Barer MR, Rajakumar K (2006) A novel strategy for the identification of genomic islands by comparative analysis of the contents and contexts of tRNA sites in closely related bacteria. Nucleic Acids Res 34:e3. Ouhammouch M (2004) Transcriptional regulation in Archaea. Curr Opin Genet Dev 14(2):133–138. Overbeek R, Bartels D, Vonstein V, Meyer F (2007) Annotation of bacterial and archaeal genomes: improving accuracy and consistency. Chem Rev 107:3431–3447. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de CrecyLagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch GD, Rodionov DA, Ruckert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V (2005) The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 33:5691–5702. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang H-Y, Cohoon M et al. (2005) The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucl Acids Res 33(17):5691–5702. Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N (1999) The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 96:2896–2901. Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N (1999) Use of contiguity on the chromosome to predict functional coupling. In Silico Biol 1:93–108. Ozawa T, Sako Y, Sato M, Kitamura T, Umezawa Y (2003) A genetic approach to identifying mitochondrial proteins. 
Nat Biotechnol 21:287–293. Page R (1994) Maps between Trees and Cladistic Analysis of Historical Associations Among Genes, Organisms, and Areas. Systematic Biology 43(1):58–77. Page RD (1998) GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics 14(9):819–820. Page RD, Charleston MA (1997) From gene to organismal phylogeny: reconciled trees and the gene tree/species tree problem. Mol Phylogenet Evol 7(2):231–240. Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G et al. (2004) The MIPS mammalian protein-protein interaction database. Bioinformatics bti115. Paget MS, Helmann JD (2003) The sigma70 family of sigma factors. Genome Biol 4(1):203. Pal C, Hurst LD (2004) Evidence against the selfish operon theory. Trends Genet 20(6):232–234. Pal C, Papp B, Lercher MJ, Csermely P, Oliver SG, Hurst LD (2006) Chance and necessity in the evolution of minimal metabolic networks. Nature 440:667–670. Paley SM, Karp PD (2002) Evaluation of computational metabolic-pathway predictions for Helicobacter pylori. Bioinformatics 18(5):715–724. Palmer C, Bik EM, Digiulio DB, Relman DA, Brown PO (2007) Development of the Human Infant Intestinal Microbiota. PLoS Biol 5:e177. Palsson B (2006) Properties of Reconstructed Networks, Cambridge: Systems Biology. Palsson BO (2006) Systems Biology: Properties of Reconstructed Networks: Cambridge University Press. Panina EM, Mironov AA, Gelfand MS (2001) Comparative analysis of FUR regulons in gamma-proteobacteria. Nucleic Acids Res 29(24):5195–5206.

Papazisi L, Gorton TS, Kutish G, Markham PF, Browning GF, Nguyen DK, Swartzell S, Madan A, Mahairas G, Geary SJ (2003) The complete genome sequence of the avian pathogen Mycoplasma gallisepticum strain R(low). Microbiology 149:2307–2316. Papke RT, Zhaxybayeva O et al. (2007) Searching for species in haloarchaea. Proc Natl Acad Sci USA 104(35):14092–14097. Pardee AB, Jacob F, Monod J (1959) The genetic control and cytoplasmic expression of “inducibility” in the synthesis of beta-galactosidase by E. coli. J Mol Biol 1: 165–178. Park B-H, Ostrouchov G, Gong-Xin Y, Geist A, Gorin A, Smatova NF (2003) Inference of Protein-Protein Interactions by Unlikely Profile Pair. Paper presented at the SIAM International Conference on Data Mining. Park LJ, Park CH, Park C, Lee T (1997) Application of genetic algorithms to parameter estimation of bioprocesses. Med Biol Eng Comput 35(1):47–49. Parkhill J, Wren BW, Thomson NR, Titball RW, Holden MT, Prentice MB, Sebaihia M, James KD, Churcher C, Mungall KL, Baker S, Basham D, Bentley SD, Brooks K, Cerdeno-Tarraga AM, Chillingworth T, Cronin A, Davies RM, Davis P, Dougan G, Feltwell T, Hamlin N, Holroyd S, Jagels K, Karlyshev AV, Leather S, Moule S, Oyston PC, Quail M, Rutherford K, Simmonds M, Skelton J, Stevens K, Whitehead S, Barrell BG (2001) Genome sequence of Yersinia pestis, the causative agent of plague. Nature 413:523–527. Parkinson JS, Kofoid EC (1992) Communication modules in bacterial signaling proteins. Annu Rev Genet 26:71–112. Patel A (2005) The triplet genetic code had a doublet predecessor. J Theor Biol 233(4): 527–532. Patil KR, Nielsen J (2005) Uncovering transcriptional regulation of metabolism by using metabolic network topology. Proc Natl Acad Sci USA 102(8):2685–2689. Payvar F, Wrange O, Carlstedt-Duke J, Okret S, Gustafsson JA, Yamamoto KR (1981) Purified glucocorticoid receptors bind selectively in vitro to a cloned DNA fragment whose transcription is regulated by glucocorticoids in vivo. 
Proc Natl Acad Sci USA 78(11):6628–6632. Pazos F, Valencia A (2001) Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng 14(9):609–614. Pearl J (1988) Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444–2448. Pei Z, Bini EJ, Yang L, Zhou M, Francois F, Blaser MJ (2004) Bacterial biota in the human distal esophagus. Proc Natl Acad Sci USA 101:4250–4255. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999b) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 96(8):4285–4288. Pennisi E (1998) Genome data shake tree of life. Science 280(5364):672–674. Pereto J (2005) Controversies on the origin of life. Int Microbiol 8:23–31. Perez-Brocal V, Gil R, Ramos S, Lamelas A, Postigo M, Michelena JM, Silva FJ, Moya A, Latorre A (2006) A small microbial genome: The end of a long symbiotic relationship? Science 314:312–313. Perez-Martin J, De Lorenzo V (1997) Coactivation in vitro of the sigma54-dependent promoter Pu of the TOL plasmid of Pseudomonas putida by HU and the mammalian HMG-1 protein. J Bacteriol 179(8):2757–2760. Perez-Martin J, Rojo F, de Lorenzo V (1994) Promoters responsive to DNA bending: a common theme in prokaryotic gene expression. Microbiol Rev 58(2):268–290.

Perez-Rueda E, Collado-Vides J (2000) The repertoire of DNA-binding transcriptional regulators in Escherichia coli K-12. Nucleic Acids Res 28(8):1838–1847. Perna NT, Plunkett 3rd G, Burland V, Mau B, Glasner JD, Rose DJ, Mayhew GF, Evans PS, Gregor J, Kirkpatrick HA, Posfai G, Hackett J, Klink S, Boutin A, Shao Y, Miller L, Grotbeck EJ, Davis NW, Lim A, Dimalanta ET, Potamousis KD, Apodaca J, Anantharaman TS, Lin J, Yen G, Schwartz DC, Welch RA, Blattner FR (2001) Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409: 529–533. Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J, D’Alche-Buc F (2003) Gene networks inference using dynamic Bayesian networks. Bioinformatics 19 Suppl 2, II138–II148. Perutz MF, Pope BJ, Owen D, Wanker EE, Scherzinger E (2002) Aggregation of proteins with expanded glutamine and alanine repeats of the glutamine-rich and asparaginerich domains of Sup35 and of the amyloid beta-peptide of amyloid plaques. Proc Natl Acad Sci USA 99:5596–5600. Pesole G, Prunella N, Liuni S, Attimonelli M, Saccone C (1992) WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucleic Acids Res 20:2871–2875. Petersen DW, Kawasaki ES (2007) Manufacturing of microarrays. Adv Exp Med Biol 593:1–11. Petrov DA (2001) Evolution of genome size: new approaches to an old problem. Trends Genet 17:23–28. Petrov DA, Sangster TA, Johnston JS, Hartl DL, Shaw KL (2000) Evidence for DNA loss as a determinant of genome size. Science 287:1060–1062. Petsko GA (2001) Homologuephobia. Genome Biol 2(2): COMMENT1002. Pietrokovski S, Henikoff JG, Henikoff S (1996) The blocks database–a system for protein classification. Nucleic Acids Res 24(1):197–200. Pietrokovski S, Hirshon J, Trifonov EN (1990) Linguistic measure of taxonomic and functional relatedness of nucleotide sequences. J Biomol Struct Dyn 7:1251–1268. 
Piganeau G, Moreau H (2007) Screening the Sargasso Sea metagenome for data to investigate genome evolution in Ostreococcus (Prasinophyceae, Chlorophyta). Gene 406:184–190. Platt T (1986) Transcription termination and the regulation of gene expression. Annu Rev Biochem 55:339–372. Ploy MC, Lambert T, Couty JP, Denis F (2000) Integrons: an antibiotic resistance gene capture and expression system. Clin Chem Lab Med 38:483–487. Polisetty PK, Voit EO, Gatzke EP (2006) Identification of metabolic system parameters using global optimization methods. Theor Biol Med Model 3:4. Polycarpo C, Ambrogelly A, Berube A, Winbush SM, McCloskey JA, Crain PF, Wood JL, Soll D (2004) An aminoacyl-tRNA synthetase that specifically activates pyrrolysine. Proc Natl Acad Sci USA 101(34):12450–12454. Pommier T, Canback B, Riemann L, Bostronm KH, Simu K, Lundberg P, Tunlid A, Hagstrom A (2007) Global Patterns of diversity and community structure in marine bacterioplankton. Molecular Ecology 16:867–880. Poplawski A, Bernander R (1997) Nucleoid structure and distribution in thermophilic Archaea. J Bacteriol 179(24):7625–7630. Popp A, Hertwig S, Lurz R, Appel B (2000) Comparative study of temperate bacteriophages isolated from Yersinia. Syst Appl Microbiol 23:469–478. Poptsova MS, Gogarten JP (2007) BranchClust: a phylogenetic algorithm for selecting gene families. BMC Bioinformatics 8:120.

Poptsova MS, Gogarten JP (2007) The power of phylogenetic approaches to detect horizontally transferred genes. BMC Evol Biol 7(1):45. Potamianos G, Jelinek F (1998) A study of N-gram and decision tree letter language modeling methods. Speech Comm 24:171–192. Poyart-Salmeron C, Trieu-Cuot P, Carlier C, Courvalin P (1989) Molecular characterization of two proteins involved in the excision of the conjugative transposon Tn1545: homologies with other site-specific recombinases. EMBO J 8:2425–2433. Poyart-Salmeron C, Trieu-Cuot P, Carlier C, Courvalin P (1990) The integration-excision system of the conjugative transposon Tn 1545 is structurally and functionally related to those of lambdoid phages. Mol Microbiol 4:1513–1521. Price MN (2005) Virtual Institute for Microbial Stress and Survival (VIMSS). http://www.microbesonline.org/operons. Price MN, Huang KH, Alm EJ, Arkin AP (2005) A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res 33(3):880–892. Prosseda G, Latella MC, Casalino M, Nicoletti M, Michienzi S, Colonna B (2006) Plasticity of the P junc promoter of ISEc11, a new insertion sequence of the IS1111 family. J Bacteriol 188:4681–4689. Przulj N, Wigle DA, Jurisica I (2004) Functional topology in a network of protein interactions. Bioinformatics 20(3):340–348. Ptashne M, Gaan A (2002) Genes and Signals: Cold Spring Harbor Laboratory Press. Ptashne M, Gann A (1997) Transcriptional activation by recruitment. Nature 386(6625): 569–577. Pushker R, D’Auria G, Alba-Casado JC, Rodriguez-Valera F (2005) Micro-Mar: a database for dynamic representation of marine microbial biodiversity. BMC Bioinformatics 6:222. Qin ZS, McCue LA, Thompson W, Mayerhofer L, Lawrence CE, Liu JS (2003) Identification of co-regulated genes through Bayesian clustering of predicted regulatory binding sites. 21:435–439. Qiu X, Gurkar AU, Lory S (2006) Interstrain transfer of the large pathogenicity island (PAPI-1) of Pseudomonas aeruginosa. 
Proc Natl Acad Sci USA 103:19830–19835. Quackenbush J (2001) Computational analysis of microarray data. Nat Rev Genet 2(6):418–427. Ra SR, Qiao M, Immonen T, Pujana I, Saris EJ (1996) Genes responsible for nisin synthesis, regulation and immunity form a regulon of two operons and are induced by nisin in Lactoccocus lactis N8. Microbiology 142(Pt 5):1281–1288. Rabiner LR (1989) A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc IEEE 77:257–286. Rabiner LR, Juang BH (1986) An introduction to hidden Markov models. IEEE ASSP Magazine 3:4–16. Ragan MA (2001a) Detection of lateral gene transfer among microbial genomes. Curr Opin Genet Dev 11(6):620–626. Ragan MA (2001b) On surrogate methods for detecting lateral gene transfer. FEMS Microbiol Lett 201(2):187–191. Ragan MA (2002) Reconciling the many faces of lateral gene transfer. Trends in Microbiology 10(1 SU -):4. Ragan MA, Harlow TJ, Beiko RG (2006) Do different surrogate methods detect lateral genetic transfer events of different relative ages? Trends in Microbiology 14(1):4–8. Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C, Simon S et al. (2001) The proteinprotein interaction map of Helicobacter pylori. Nature 409(6817):211–215.

Ram RJ, Verberkmoes NC, Thelen MP, Tyson GW, Baker BJ, Blake RC, 2nd, Shah M, Hettich RL, Banfield JF (2005) Community proteomics of a natural microbial biofilm. Science 308:1915–1920. Randall SK, Eritja R, Kaplan BE, Petruska J, Goodman MF (1987) Nucleotide insertion kinetics opposite abasic lesions in DNA. J Biol Chem 262:6864–6870. Randau L, Munch R, Hohn MJ, Jahn D, Soll D (2005) Nanoarchaeum equitans creates functional tRNAs from separate genes for their 5’- and 3’-halves. Nature 433(7025):537–541. Rank E (2003) Application of Bayesian trained RBF networks to nonlinear time-series modeling. Signal Process 83:1393–1410. Raoult D, Ogata H, Audic S, Robert C, Suhre K, Drancourt M, Claverie JM (2003) Tropheryma whipplei twist: A human pathogenic Actinobacteria with a reduced genome. Genome Res 13:1800–1809. Rappas M, Bose D, Zhang X (2007) Bacterial enhancer-binding proteins: unlocking sigma54-dependent gene transcription. Curr Opin Struct Biol 17(1):110–116. Rappe MS, Giovannoni SJ (2003) The uncultured microbial majority. Annual Review of Microbiology 57:369–394. Raymond J, Zhaxybayeva O et al. (2002) Whole-genome analysis of photosynthetic prokaryotes. Science 298(5598):1616–1620. Read TD, Myers GS et al. (2003) Genome sequence of Chlamydophila caviae (Chlamydia psittaci GPIC): examining the role of niche-specific genes in the evolution of the Chlamydiaceae. Nucleic Acids Res 31(8):2134–2147. Reay P, Yamasaki K, Terada T, Kuramitsu S, Shirouzu M et al. (2004) Structural and sequence comparisons arising from the solution structure of the transcription elongation factor NusG from Thermus thermophilus. Proteins 56(1):40–51. Reddy TR, Suryanarayana T (1989) Archaebacterial histone-like proteins. Purification and characterization of helix stabilizing DNA binding proteins from the acidothermophile Sulfolobus acidocaldarius. J Biol Chem 264(29):17298–17308. Reed JL, Patel TR, Chen KH, Joyce AR, Applebee MK, Herring CD et al. 
(2006) Systems approach to refining genome annotation. Proc Natl Acad Sci USA 103(46): 17480–17484. Reinert G, Schbath S, Waterman MS (2000) Probabilistic and statistical properties of words: an overview. J Comput Biol 7:1–46. Reiter WD, Palm P, Yeats S (1989) Transfer RNA genes frequently serve as integration sites for prokaryotic genetic elements. Nucleic Acids Res 17:1907–1914. Reitzer L (2003) Nitrogen assimilation and global regulation in Escherichia coli. Annu Rev Microbiol 57:155–176. Rella M, Mercenier A, Haas D (1985) Transposon insertion mutagenesis of Pseudomonas aeruginosa with a Tn5 derivative: application to physical mapping of the arc gene cluster. Gene 33:293–303. Remm M, Storm CE, Sonnhammer EL (2001) Automatic clustering of orthologs and inparalogs from pairwise species comparisons. J Mol Biol 314(5):1041–1052. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E et al. (2000) Genome-wide location and function of DNA binding proteins. Science 290(5500):2306–2309. Rennie MJ (1999) An introduction to the use of tracers in nutrition and metabolism. Proc Nutr Soc 58(4):935–944. Resendis-Antonio O, Freyre-Gonzalez JA, Menchaca-Mendez R, Gutierrez-Rios RM, Martinez-Antonio A et al. (2005) Modular analysis of the transcriptional regulatory network of E. coli. Trends Genet 21(1):16–20.

Rey MW, Ramaiya P et al. (2004) Complete genome sequence of the industrial bacterium Bacillus licheniformis and comparisons with closely related Bacillus species. Genome Biol 5(10):R77. Reznikof SW (2002) Tn5 Transposition. In: Craig NL, Craigie R, Gellert M and Lambowitz AM (eds) Mobile DNA II. Washington DC: ASM Press: pp. 403–422. Rhodius VA, Busby SJ (1998) Positive activation of gene expression. Curr Opin Microbiol 1(2):152–159. Rice LB, Carias LL (1994) Studies on excision of conjugative transposons in enterococci: evidence for joint sequences composed of strands with unequal numbers of nucleotides. Plasmid 31:312–316. Richardson JP (2003) Loading Rho to terminate transcription. Cell 114(2):157–159. Riesenfeld CS, Goodman RM, Handelsman J (2004a) Uncultured soil bacteria are a reservoir of new antibiotic resistance genes. Environmental Microbiology 6:981–989. Riesenfeld CS, Schloss PD, Handelsman J (2004b) Metagenomics: genomic analysis of microbial communities. Annual Review of Genetics 38:525–552. Rigoutsos I, Floratos A (1998) Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14:55–67. Rivas E, Klein RJ, Jones TA, Eddy SR (2001) Computational identification of noncoding RNAs in E. coli by comparative genomics. Curr Biol 11:1369–1373. Robertson TA, Varani G (2007) An all-atom, distance-dependent scoring function for the prediction of protein-DNA interactions from structure. Proteins 66(2):359–374. Robinson NJ, Robinson PJ, Gupta A, Bleasby AJ, Whitton BA, Morby AP (1995) Singular over-representation of an octameric palindrome, HIP1, in DNA from many cyanobacteria. Nucleic Acids Res 23:729–735. Robinson NP, Dionne I, Lundgren M, Marsh VL, Bernander R, Bell SD (2004) Identification of two origins of replication in the single chromosome of the archaeon Sulfolobus solfataricus. Cell 116:25–38. Robinson PJ, Rhodes D (2006) Structure of the ‘30 nm’ chromatin fibre: a key role for the linker histone. 
Curr Opin Struct Biol 16(3):336–343. Robison K, Gilbert W, Church GM (1994) Large scale bacterial gene discovery by similarity search. Nat Genet 7:205–214. Rocha E (2002) Is there a role for replication fork asymmetry in the distribution of genes in bacterial genomes? Trends Microbiol 10:393–395. Rocha EP (2003) An appraisal of the potential for illegitimate recombination in bacterial genomes and its consequences: from duplications to genome reduction. Genome Res 13:1123–1132. Rocha EP, Blanchard A (2002) Genomic repeats, genome plasticity and the dynamics of Mycoplasma evolution. Nucleic Acids Res 30:2031–2042. Rocha EP, Danchin A, Viari A (1999a) Analysis of long repeats in bacterial genomes reveals alternative evolutionary mechanisms in Bacillus subtilis and other competent prokaryotes. Mol Biol Evol 16:1219–1230. Rocha EP, Danchin A, Viari A (1999b) Functional and evolutionary roles of long repeats in prokaryotes. Res Microbiol 150:725–733. Rodionov DA, Mironov AA, Gelfand MS (2001) Transcriptional regulation of pentose utilisation systems in the Bacillus/Clostridium group of bacteria. FEMS Microbiol Lett 205(2):305–314. Rodriguez-Brito B, Rohwer F, Edwards RA (2006) An application of statistics to comparative metagenomics. BMC bioinformatics 7:162. Rojo F (2001) Mechanisms of transcriptional repression. Curr Opin Microbiol 4(2): 145–151.

Romero PR, Karp PD (2004) Using functional and organizational information to improve genome-wide computational prediction of transcription units on pathway-genome databases. Bioinformatics 20(5):709–717. Rondon MR, August PR, Bettermann AD, Brady SF, Grossman TH, Liles MR, Loiacono KA, Lynch BA, MacNeil IA, Minor C, Tiong CL, Gilman M, Osburne MS, Clardy J, Handelsman J, Goodman RM (2000) Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Applied and Environmental Microbiology 66:2541–2547. Rosenfeld N, Young JW, Alon U, Swain PS, Elowitz MB (2005) Gene regulation at the single-cell level. Science 307(5717):1962–1965. Rossello-Mora R, Amann R (2001) The species concept for prokaryotes. FEMS Microbiol Rev 25(1):39–67. Rossler OE (1979) Recursive evolution. Biosystems 11(2–3):193–199. Rowe-Magnus DA, Guerout AM, Biskri L, Bouige P, Mazel D (2003) Comparative analysis of superintegrons: engineering extensive genetic diversity in the Vibrionaceae. Genome Res 13:428–442. Rowe-Magnus DA, Guerout AM, Mazel D (1999) Super-integrons. Research in Microbiology 150:641–651. Rowe-Magnus DA, Guerout AM, Mazel D (2002) Bacterial resistance evolution by recruitment of super-integron gene cassettes. Molecular Microbiology 43:1657–1669. Rowe-Magnus DA, Guerout AM, Ploncard P, Dychinco B, Davies J, Mazel D (2001) The evolutionary history of chromosomal super-integrons provides an ancestry for multiresistant integrons. Proc Natl Acad Sci USA 98:652–657. Rudd KE (2000) EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res 28:60–64. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S et al. (2007) The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol 5(3):e77. 
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers YH, Falcon LI, Souza V, Bonilla-Rosso G, Eguiarte LE, Karl DM, Sathyendranath S, Platt T, Bermingham E, Gallardo V, Tamayo-Castillo G, Ferrari MR, Strausberg RL, Nealson K, Friedman R, Frazier M, Venter JC (2007) The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol 5:e77. Russell GJ, McGeoch DJ, Elton RA, Subak-Sharpe JH (1973) Doublet frequency analysis of bacterial DNAs. J Mol Evol 2:277–292. Russell GJ, Subak-Sharpe JH (1977) Similarity of the general designs of protochordates and invertebrates. Nature 266:533–536. Russell GJ, Walker PM, Elton RA, Subak-Sharpe JH (1976) Doublet frequency analysis of fractionated vertebrate nuclear DNA. J Mol Biol 108:1–23. Ryan KR, Shapiro L (2003) Temporal and spatial regulation in prokaryotic cell cycle progression and development. Annu Rev Biochem 72:367–394. Sabatti C, Rohlin L, Oh MK, Liao JC (2002) Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res 30(13):2886–2893. Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP (2005) Causal protein-signaling networks derived from multiparameter single-cell data. Science 308(5721):523–529. Saetrom P, Sneve R, Kristiansen KI, Snove Jr O, Grunfeld T, Rognes T, Seeberg E (2005) Predicting non-coding RNA genes in Escherichia coli with boosted genetic programming. Nucleic Acids Res 33:3263–3270.

Saito N, Robert M, Kitamura S, Baran R, Soga T, Mori H, Nishioka T, Tomita M (2006) Metabolomics approach for enzyme discovery. J Proteome Res 5:1979–1987. Sakamoto E, Iba H (2001) Inferring a system of differential equations for a gene regulatory network by using genetic programming. Paper presented at the Proceedings of the 2001 Congress on Evolutionary Computation (CEC2001), Seoul, South Korea. Saks ME, Sampson JR, Abelson J (1998) Evolution of a transfer RNA gene through a point mutation in the anticodon. Science 279(5357):1665–1670. Salgado H (2006) RegulonDB. http://regulondb.ccg.unam.mx/index.html. Salgado H, Gama-Castro S, Peralta-Gil M, Diaz-Peredo E, Sanchez-Solano F, Santos-Zavaleta A, Martinez-Flores I, Jimenez-Jacinto V, Bonavides-Martinez C, Segura-Salazar J, Martinez-Antonio A, Collado-Vides J (2006) RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res 34(Database issue):D394–D397. Salgado H, Moreno-Hagelsieb G, Smith TF, Collado-Vides J (2000) Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci USA 97(12):6652–6657. Salgado H, Santos-Zavaleta A, Gama-Castro A, Millan-Zarate D, Diaz-Peredo E, Sanchez-Solano F, Perez-Rueda E, Bonavides-Martinez C, Collado-Vides J (2001) RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res 29:72–74. Sali A (1999) Functional links between proteins. Nature 402:23. Sallai L, Tucker PA (2005) Crystal structure of the central and C-terminal domain of the sigma(54)-activator ZraR. J Struct Biol 151(2):160–170. Sallstrom B, Andersson SGE (2005) Genome reduction in the alpha-proteobacteria. Curr Opin Microbiol 8:579–585. Salyers AA, Shoemaker NB, Stevens AM, Li LY (1995) Conjugative transposons: an unusual and diverse set of integrated gene transfer elements. Microbiol Rev 59:579–590. 
Salzberg SL, Delcher AL, Kasif S, White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res 26:544–548. Samuel BS, Gordon JI (2006) A humanized gnotobiotic mouse model of host-archaeal-bacterial mutualism. Proc Natl Acad Sci USA 103:10011–10016. Sandberg R, Winberg G, Branden CI, Kaske A, Ernberg I, Coster J (2001) Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res 11:1404–1409. Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA et al. (1977) Nucleotide sequence of bacteriophage phi X174 DNA. Nature 265(5596):687–695. Santangelo TJ, Reeve JN (2006) Archaeal RNA polymerase is sensitive to intrinsic termination directed by transcribed and remote sequences. J Mol Biol 355(2):196–210. Sapp J (2005) The prokaryote-eukaryote dichotomy: meanings and mythology. Microbiol Mol Biol Rev 69(2):292–305. Sarai A, Siebers J, Selvaraj S, Gromiha MM, Kono H (2005) Integration of bioinformatics and computational biology to understand protein-DNA recognition mechanism. J Bioinform Comput Biol 3(1):169–183. Sauerwald A, Zhu W, Major TA, Roy H, Palioura S, Jahn D, Whitman WB, Yates 3rd JR, Ibba M, Soll D (2005) RNA-dependent cysteine biosynthesis in archaea. Science 307(5717):1969–1972. Savageau MA (1969a) Biochemical systems analysis. I. Some mathematical properties of the rate law for the component enzymatic reactions. J Theor Biol 25(3):365–369. Savageau MA (1969b) Biochemical systems analysis. II. The steady-state solutions for an n-pool system using a power-law approximation. J Theor Biol 25(3):370–379.

Savageau MA (1970) Biochemical systems analysis. 3. Dynamic solutions using a power-law approximation. J Theor Biol 26:215–226. Savageau MA (1976) Biochemical systems analysis: a study of function and design in molecular biology. Reading, MA: Addison-Wesley. Savageau MA (1995) Michaelis-Menten mechanism reconsidered: implications of fractal kinetics. J Theor Biol 176:115–124. Sayyed-Ahmad A, Tuncay K, Ortoleva PJ (2007) Transcriptional regulatory network refinement and quantification through kinetic modeling, gene expression microarray data and information theory. BMC Bioinformatics 8:20. Schaaper RM, Dunn RL (1991) Spontaneous mutation in the Escherichia coli LacI gene. Genetics 129:317–326. Schaffrath R, Breunig KD (2000) Genetics and molecular physiology of the yeast Kluyveromyces lactis. Fungal Genet Biol 30:173–190. Schbath S (1997) An efficient statistic to detect over- and under-represented words in DNA sequences. J Comput Biol 4:189–192. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270(5235):467–470. Scherrer K, Jost J (2007) The gene and the genon concept: a functional and information-theoretic analysis. Mol Syst Biol 3:87. Schiex T, Gouzy J, Moisan A, de Oliveira Y (2003) FrameD: A flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences. Nucleic Acids Res 31:3738–3741. Schlegel A, Bohm A, Lee SJ, Peist R, Decker K et al. (2002) Network regulation of the Escherichia coli maltose system. J Mol Microbiol Biotechnol 4(3):301–307. Schloss PD, Handelsman J (2003) Biotechnological prospects from metagenomics. Current Opinion in Biotechnology 14:303–310. Schloss PD, Handelsman J (2004) Status of the microbial census. Microbiol Mol Biol Rev 68(4):686–691. Schloss PD, Handelsman J (2005a) Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. 
Applied and Environmental Microbiology 71:1501–1506. Schloss PD, Handelsman J (2005b) Metagenomics for studying unculturable microorganisms: cutting the Gordian knot. Genome Biology 6:229. Schmidt TM (2006) The maturing of microbial ecology. Int Microbiol 9:217–223. Schmidt TM, Delong EF, Pace NR (1991) Analysis of a Marine Picoplankton Community by 16s Ribosomal-RNA Gene Cloning and Sequencing. Journal of Bacteriology 173:4371–4378. Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, Huhn G, Schomburg D (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res 32:D431–D433. Schulz AR (1994) Enzyme kinetics: from diastase to multi-enzyme systems. Cambridge University Press, Cambridge; New York. Schumann W (2003) The Bacillus subtilis heat shock stimulon. Cell Stress Chaperones 8(3):207–217. Schurr T, Nadir E, Margalit H (1993) Identification and characterization of E.coli ribosomal binding sites by free energy computation. Nucleic Acids Res 21:4019–4023. Schwacke JH, Voit EO (2003) BSTLab: A Matlab Toolbox for Biochemical Systems Theory. Eleventh International Conference on Intelligent Systems for Molecular Biology, Brisbane, Australia.

Schwan TG, Burgdorfer W, Garon CF (1988) Changes in infectivity and plasmid profile of the Lyme disease spirochete, Borrelia burgdorferi, as a result of in vitro cultivation. Infect Immun 56:1831–1836. Scott JR, Churchward GG (1995) Conjugative transposition. Annu Rev Microbiol 49:367–397. Scott JR, Kirchman PA, Caparon MG (1988) An intermediate in transposition of the conjugative transposon Tn916. Proc Natl Acad Sci USA 85:4809–4813. Seatzu C (2000) A fitting based method for parameter estimation in S-Systems. Dynam Systems Appl 9(1):77–98. Segal E, Shapira M, Regev A, Pe’er D, Botstein D et al. (2003) Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet 34(2):166–176. Seitz S, Lee SJ, Pennetier C, Boos W, Plumbridge J (2003) Analysis of the interaction between the global regulator Mlc and EIIBGlc of the glucose-specific phosphotransferase system in Escherichia coli. J Biol Chem 278(12):10744–10751. Selinger DW, Saxena RM, Cheung KJ, Church GM, Rosenow C (2003) Global RNA half-life analysis in Escherichia coli reveals positional patterns of transcript degradation. Genome Res 13(2):216–223. Selkov E, Maltsev N, Olsen GJ, Overbeek R, Whitman WB (1997) A reconstruction of the metabolism of Methanococcus jannaschii from sequence data. Gene 197(1–2):GC11–GC26. Sen TZ, Kloczkowski A, Jernigan RL (2006) Functional clustering of yeast proteins from the protein-protein interaction network. BMC Bioinformatics 7:355. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M (2007) CAMERA: a community resource for metagenomics. PLoS Biol 5:e75. Seshasayee AS, Bertone P, Fraser GM, Luscombe NM (2006) Transcriptional regulatory networks in bacteria: from input signals to output responses. Curr Opin Microbiol 9(5):511–519. Shafer RH, Smirnov I (2000) Biological aspects of DNA/RNA quadruplexes. Biopolymers 56:209–227. 
Shapiro JA (1969) Mutations caused by the insertion of genetic material into the galactose operon of Escherichia coli. J Mol Biol 40:93–105. Shapiro JA, Adhya SL (1969) The galactose operon of E. coli K-12. II. A deletion analysis of operon structure and polarity. Genetics 62:249–264. Sharp PM, Bailes E, Grocock RJ, Peden JF, Sockett RE (2005) Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Res 33:1141–1153. Sharp PM, Li WH (1986) An evolutionary perspective on synonymous codon usage in unicellular organisms. J Mol Evol 24:28–38. Sharp PM, Li WH (1987) The codon Adaptation Index – a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15:1281–1295. Sharp PM, Stenico M, Peden JF, Lloyd AT (1993) Codon usage: mutational bias, translational selection, or both. Biochem Soc Trans 21:835–841. Shen-Orr SS, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31(1):64–68. Shibuya T, Rigoutsos I (2002) Dictionary-driven prokaryotic gene finding. Nucleic Acids Res 30:2710–2725. Shigenobu S, Watanabe H, Hattori M, Sakaki Y, Ishikawa H (2000) Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp APS. Nature 407:81–86.

Shimkets LJ (1998) Structure and sizes of the genomes of the Archaea and Bacteria. In: de Bruijn FJ, Lupski JR, Weinstock GM (eds) Bacterial Genomes. Chapman & Hall, New York, pp. 5–11. Shmatkov AM, Melikyan AA, Chernousko FL, Borodovsky M (1999) Finding prokaryotic genes by the ‘frame-by-frame’ algorithm: targeting gene starts and overlapping genes. Bioinformatics 15:874–886. Shoemaker BA, Panchenko AR (2007a) Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Comput Biol 3(3):e42. Shoemaker BA, Panchenko AR (2007b) Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol 3(4):e43. Shoemaker BA, Panchenko AR, Bryant SH (2006) Finding biologically relevant protein domain interactions: conserved binding mode analysis. Protein Sci 15(2):352–361. Siefert JL, Martin KA, Abdi F, Widger WR, Fox GE (1997) Conserved gene clusters in bacterial genomes provide further support for the primacy of RNA. J Mol Evol 45(5):467–472. Siggia ED (2005) Computational methods for transcriptional regulation. Curr Opin Genet Dev 15(2):214–221. Siguier P, Filee J, Chandler M (2006a) Insertion sequences in prokaryotic genomes. Curr Opin Microbiol 9:526–531. Siguier P, Perochon J, Lestrade L, Mahillon J, Chandler M (2006b) ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Research 34:32–36. Silva FJ, Latorre A, Moya A (2001) Genome size reduction through multiple events of gene disintegration in Buchnera APS. Trends Genet 17:615–618. Silva FJ, Latorre A, Moya A (2003) Why are the genomes of endosymbiotic bacteria so stable? Trends Genet 19:176–180. Sinden RR (1994) DNA structure and function. San Diego: Academic Press. Sittka A, Pfeiffer V, Tedin K, Vogel J (2006) The RNA chaperone Hfq is essential for the virulence of Salmonella typhimurium. Mol Microbiol. 
Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A (2001) On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 17:425–428. Smith HO, Tomb JF, Dougherty BA, Fleischmann RD, Venter JC (1995) Frequency and distribution of DNA uptake signal sequences in the Haemophilus influenzae Rd genome. Science 269:538–540. Snel B, Bork P, Huynen MA (2002) Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Res 12(1):17–25. Snel B, Bork P, Huynen MA (2002) The identification of functional modules from the genomic association of genes. Proc Natl Acad Sci USA 99(9):5890–5895. Snyder EE, Stormo GD (1993) Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucleic Acids Res 21:607–613. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci USA 103(32):12115–12120. Solovyev VV, Salamov AA, Lawrence CB (1994) Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res 22:5156–5163.

Sonea S (1988a) A bacterial way of life. Nature 331(6153):216. Sonea S (1988b) The global organism: A new view of bacteria. The Sciences 28:38–45. Sonnenburg JL, Angenent LT, Gordon JI (2004) Getting a grip on things: how do communities of bacterial symbionts become established in our intestine? Nature Immunology 5:569–573. Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26(1):320–322. Sonnhammer EL, Koonin EV (2002) Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet 18(12):619–620. Sorek R, Zhu Y et al. (2007) Genome-wide experimental determination of barriers to horizontal gene transfer. Science 318(5855):1449–1452. Spieth C, Streichert F, Supper J, Speer N, Zell A (2005) Feedback memetic algorithms for modeling gene regulatory networks. Paper presented at the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). Spieth C, Worzischek R, Streichert F (2006) Comparing evolutionary algorithms on the problem of network inference. Paper presented at the Genetic and Evolutionary Computation Conference (GECCO 2006). Spieth J, Brooke G, Kuersten S, Lea K, Blumenthal T (1993) Operons in C. elegans: polycistronic mRNA precursors are processed by trans-splicing of SL2 to downstream coding regions. Cell 73(3):521–532. Spirin V, Mirny LA (2003) Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA 100(21):12123–12128. Sprinzl M, Steegborn C, Hübel F, Steinberg S (1996) Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res 24:68–72. Srinivasan G, James CM, Krzycki JA (2002) Pyrrolysine encoded by UAG in Archaea: charging of a UAG-decoding specialized tRNA. Science 296(5572):1459–1462. Srividhya J, Crampin EJ, McSharry PE, Schnell S (2007) Reconstructing biochemical pathways from time course data. Proteomics 7(6):828–838. 
Staden R (1984) Measurements of the effects that coding for a protein has on a DNA sequence and their use for finding genes. Nucleic Acids Res 12:551–567. Stahl DA, Lane DJ, Olsen GJ, Pace NR (1985) Characterization of a Yellowstone hot spring microbial community by 5S rRNA sequences. Applied and Environmental Microbiology 49:1379–1384. Stanier RY, Van Niel CB (1962) The concept of a bacterium. Arch Mikrobiol 42:17–35. Stephanopoulos G (1999) Metabolic fluxes and metabolic engineering. Metab Eng 1(1):1–11. Stevens A (1960) Incorporation of the adenine ribonucleotide into RNA by cell fractions from E. coli B. Biochem Biophys Res Commun 3:92–96. Stock AM, Robinson VL, Goudreau PN (2000) Two-component signal transduction. Annu Rev Biochem 69:183–215. Stock JB, Stock AM, Mottonen JM (1990) Signal transduction in bacteria. Nature 344(6265):395–400. Stokes HW, Elbourne LD, Hall RM (2007) Tn1403, a multiple-antibiotic resistance transposon made up of three distinct transposons. Antimicrob Agents Chemother 51:1827–1829. Stokes HW, Hall RM (1989) A novel family of potentially mobile DNA elements encoding site-specific gene-integration functions: integrons. Molecular Microbiology 3:1669–1683. Storm CE, Sonnhammer EL (2002) Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics 18(1):92–99.

Stormo GD (2000) DNA binding sites: representation and discovery. Bioinformatics 16(1):16–23. Stormo GD, Hartzell III GW (1989) Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci USA 86:1183–1187. Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982a) Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res 10(9):2997–3011. Stormo GD, Schneider TD, Gold LM (1982b) Characterization of translational initiation sites in E. coli. Nucleic Acids Res 10(9):2971–2996. Stragier P, Losick R (1990) Cascades of sigma factors revisited. Mol Microbiol 4(11):1801–1806. Struhl K (1999) Fundamentally different logic of gene regulation in eukaryotes and prokaryotes. Cell 98(1):1–4. Studholme DJ, Dixon R (2003) Domain architectures of sigma54-dependent transcriptional activators. J Bacteriol 185(6):1757–1767. Su D, Li Y, Gladyshev VN (2005) Selenocysteine insertion directed by the 3’-UTR SECIS element in Escherichia coli. Nucleic Acids Res 33(8):2486–2492. Su Z, Dam P, Chen X, Olman V, Jiang T, Palenik et al. (2003) Computational Inference of Regulatory Pathways in Microbes: An application to phosphorus assimilation pathways in Synechococcus WH8102. Genome Informatics 14:3–13. Su Z, Mao F, Dam P, Wu H, Olman V, Paulsen IT et al. (2006) Computational inference and experimental validation of the nitrogen assimilation regulatory network in cyanobacterium Synechococcus sp. WH 8102. Nucleic Acids Res 34(3):1050–1065. Su Z, Olman V, Mao F, Xu Y (2005) Comparative genomics analysis of NtcA regulons in cyanobacteria: regulation of nitrogen assimilation and its coupling to photosynthesis. Nucleic Acids Res 33(16):5156–5171. Su Z, Olman V, Mao F, Xu Y (2006) Comparative genomics analysis of NtcA regulons in cyanobacteria: regulation of nitrogen assimilation and its coupling to photosynthesis. Nucleic Acids Res 33(16):5156–5171. 
Su Z, Olman V, Xu Y (2007) Computational prediction of Pho regulons in cyanobacteria. BMC Genomics 8(1):156. Subrahmanyam CS, Noti JD, Umbarger HE (1980) Regulation of ilvEDA expression occurs upstream of ilvG in Escherichia coli: additional evidence for an ilvGEDA operon. J Bacteriol 144(1):279–290. Sugahara J, Yachie N, Arakawa K, Tomita M (2007) In silico screening of archaeal tRNA-encoding genes having multiple introns with bulge-helix-bulge splicing motifs. RNA 13(5):671–681. Sugahara J, Yachie N, Sekine Y, Soma A, Matsui M, Tomita M, Kanai A (2006) SPLITS: a new program for predicting split and intron-containing tRNA genes at the genome level. In Silico Biol 6(5):411–418. Sugimoto M, Kikuchi S, Tomita M (2005) Reverse engineering of biochemical equations from time-course data by means of genetic programming. Biosystems 80(2):155–164. Suhre K, Claverie J-M (2003) Genomic correlates of hyperthermostability, an update. J Biol Chem 278:17198–17202. Suhre K, Claverie J-M (2004) FusionDB: a database for in-depth analysis of prokaryotic gene fusion events. Nucleic Acids Res 32(90001):D273–D276. Sullivan JT, Ronson CW (1998) Evolution of rhizobia by acquisition of a 500-kb symbiosis island that integrates into a phe-tRNA gene. Proc Natl Acad Sci USA 95:5145–5149. Sullivan MB, Lindell D, Lee JA, Thompson LR, Bielawski JP, Chisholm SW (2006) Prevalence and evolution of core photosystem II genes in marine cyanobacterial viruses and their hosts. PLoS Biol 4:e234.

Suttle CA (2007) Marine viruses – major players in the global ecosystem. Nature Reviews Microbiology 5:801–812. Suzek BE, Ermolaeva MD, Schreiber M, Salzberg SL (2001) A probabilistic method for identifying start codons in bacterial genomes. Bioinformatics 17:1123–1130. Suzuki H, Lima-Mendez G, Brown C, Top E, Toussaint A (2008) Bioinformatics of the prokaryotic mobilome. In: Field D (ed) Comparative genomics and bioinformatics for the microbiologist. Horizon Scientific Press. Swain PS, Elowitz MB, Siggia ED (2002) Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc Natl Acad Sci USA 99(20):12795–12800. Switzer RL, Turner RJ, Lu Y (1999) Regulation of the Bacillus subtilis pyrimidine biosynthetic operon by transcriptional attenuation: control of gene expression by an mRNA-binding protein. Prog Nucleic Acid Res Mol Biol 62:329–367. Szekeres S, Dauti M, Wilde C, Mazel D, Rowe-Magnus DA (2007) Chromosomal toxin-antitoxin loci can diminish large-scale genome reductions in the absence of selection. Molecular Microbiology 63:1588–1605. Tagle DA, Koop BF, Goodman M, Slightom JL, Hess DL, Jones RT (1988) Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J Mol Biol 203(2):439–455. Tamames J (2001) Evolution of gene order conservation in prokaryotes. Genome Biol 2:0020. Tamames J, Gil R, Latorre A, Pereto J, Silva FJ, Moya A (2007) The frontier between cell and organelle: genome analysis of Candidatus Carsonella ruddii. BMC Evol Biol 7:181. Tamas I, Klasson L, Canback B, Naslund AK, Eriksson AS, Wernegreen JJ, Sandstrom JP, Moran NA, Andersson SGE (2002) 50 Million Years of Genomic Stasis in Endosymbiotic Bacteria. Science 296:2376–2379. Tan K, McCue LA, Stormo GD (2005) Making connections between novel transcription factors and their DNA motifs. Genome Res 15(2):312–320. 
Tan K, Moreno-Hagelsieb G, Collado-Vides J, Stormo GD (2001) A comparative genomics approach to prediction of new members of regulons. Genome Res 11(4):566–584. Tanaka T, Kikuchi Y (2001) Origin of the cloverleaf shape of transfer RNA – the double-hairpin model: Implication for the role of tRNA intron and the long extra loop. Viva Origino 29:134–142. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41. Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science 278(5338):631–637. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 29:22–28. Tautz D, Schlötterer C (1994) Simple sequences. Curr Opin Genet Dev 4:832–837. Tavare S, Song B (1989) Codon preference and primary sequence structure in protein-coding regions. Bull Math Biol 51:95–115. Teeling H, Waldmann J, Lombardot T, Bauer M, Glockner FO (2004) TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5:163.

Tenover FC (2006) Mechanisms of antimicrobial resistance in bacteria. Am J Med 119:S3–S10; discussion S62–S70. Tesler G (2002) GRIMM: genome rearrangements web server. Bioinformatics 18:492–493. Tettelin H, Masignani V et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci USA 102(39):13950–13955. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, Deboy RT, Davidsen TM, Mora M, Scarselli M, Margarit y Ros I, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O’Connor KJ, Smith S, Utterback TR, White O, Rubens CE, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR, Rappuoli R, Fraser CM (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci USA 102:13950–13955. Teusink B, Passarge J, Reijenga CA, Esgalhado E, van der Weijden CC, Schepper M et al. (2000) Can yeast glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry. European Journal of Biochemistry 267(17):5313–5329. Thanaraj TA, Argos P (1996) Protein secondary structural types are differentially coded on messenger RNA. Protein Sci 5:1973–1983. Thanbichler M, Shapiro L (2006) Chromosome organization and segregation in bacteria. J Struct Biol 156:292–303. Thanbichler M, Wang SC, Shapiro L (2005) The bacterial nucleoid: a highly organized and dynamic structure. J Cell Biochem 96(3):506–521. Theissen G (2002) Secret life of genes. Nature 415(6873):741. Thieffry D, Romero D (1999) The modularity of biological regulatory networks. Biosystems 50(1):49–59. 
Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC et al. (2003) Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424(6950):788–793. Thompson J, Higgins D, Gibson T (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680. Timmis JN, Ayliffe MA, Huang CY, Martin W (2004) Endosymbiotic gene transfer: Organelle genomes forge eukaryotic chromosomes. Nat Rev Genet 5:123–135. Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R (1997) Prediction of probable genes by Fourier analysis of genomic sequences. Comput Appl Biosci 13:263–270. Tjian R (1978) The binding site on SV40 DNA for a T antigen-related protein. Cell 13(1):165–179. Toh H, Weiss BL, Perkin SAH, Yamashita A, Oshima K, Hattori M, Aksoy S (2006) Massive genome erosion and functional adaptations provide insights into the symbiotic lifestyle of Sodalis glossinidius in the tsetse host. Genome Res 16:149–156. Tolstorukov MY, Virnik KM, Adhya S, Zhurkin VB (2005) A-tract clusters may facilitate DNA packaging in bacterial nucleoid. Nucleic Acids Res 33:3907–3918. Tomb J-F, White O, Kerlavage AR (1997) The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388:539–547. Tompa M (2001) Identifying functional elements by comparative DNA sequence analysis. Genome Res 11(7):1143–1144.

Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23(1):137–144. Tong AH, Drees B, Nardelli G, Bader GD, Brannetti B, Castagnoli L et al. (2002) A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 295(5553):321–324. Torralba AS, Yu K, Shen P, Oefner PJ, Ross J (2003) Experimental test of a method for determining causal connectivities of species in reactions. Proc Natl Acad Sci USA 100(4):1494–1498. Torres NV, Voit EO (2002) Pathway analysis and optimization in metabolic engineering. Cambridge University Press, New York. Tóth G, Gáspári Z, Jurka J (2000) Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res 10:967–981. Touchon M, Rocha EPC (2007) Causes of insertion sequences abundance in prokaryotic genomes. Mol Biol Evol 24:969–981. Toussaint A, Merlin C (2002) Mobile elements as a combination of functional modules. Plasmid 47:26–35. Tran LM, Brynildsen MP, Kao KC, Suen JK, Liao JC (2005) gNCA: a framework for determining transcription factor activity based on transcriptome: identifiability and numerical implementation. Metab Eng 7(2):128–141. Tran T (2007) Neural network operon prediction. http://csbl.bmb.uga.edu/~tran/operons. Tran T, Dam P, Su Z, Poole II FL, Adams MW, Zhou GT, Xu Y (2007) Operon prediction in Pyrococcus furiosus. Nucleic Acids Res 35(1):11–20. Travers A, Muskhelishvili G (2005) DNA supercoiling – a global transcriptional regulator for enterobacterial growth? Nat Rev Microbiol 3(2):157–169. Tress ML, Cozzetto D, Tramontano A, Valencia A (2006) An analysis of the Sargasso Sea resource and the consequences for database composition. BMC Bioinformatics 7:213. 
Trieu-Cuot P, Gerbaud G, Lambert T, Courvalin P (1985) In vivo transfer of genetic information between gram-positive and gram-negative bacteria. EMBO J 4(13A): 3583–3587. Trifonov EN (2000) Consensus temporal order of amino acids and evolution of the triplet code. Gene 261(1):139–151. Trifonov EN, Bettecken T (1997) Sequence fossils, triplet expansion, and reconstruction of earliest codons. Gene 205(1–2):1–6. Trifonov EN, Brendel V (1986) Genomic: a dictionary of genetic codes. Balaban Publishers, Rehovot, Philadelphia. Tringe SG, Rubin EM (2005) Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet 6:805–814. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM (2005) Comparative metagenomics of microbial communities. Science 308:554–557. Tsai KY, Wang FS (2005) Evolutionary optimization with data collocation for reverse engineering of biological networks. Bioinformatics 21(7):1180–1188. Tsirigos A, Rigoutsos I (2005) A sensitive, support-vector-machine method for the detection of horizontal gene transfers in viral, archaeal and bacterial genomes. Nucleic Acids Res 33:3699–3707. Tsutsumi S, Denda K, Yokoyama K, Oshima T, Date T, Yoshida M (1991) Molecular cloning of genes encoding major two Subunits of a eubacterial V-Type ATPase from Thermus thermophilus. Biochim. Biophys. Acta 1098(1):13–20.


Tu Q, Ding D (2003) Detecting pathogenicity islands and anomalous gene clusters by iterative discriminant analysis. FEMS Microbiology Letters 221:269–275. Tu Z, Wang L, Arbeitman MN, Chen T, Sun F (2006) An integrative approach for causal gene identification and gene regulatory pathway inference. Bioinformatics 22(14): e489–e496. Tucker CL, Gera JF, Uetz P (2001) Towards an understanding of complex protein networks. Trends Cell Biol 11(3):102–106. Tucker W, Kutalik Z, Moulton V (2006) Estimating parameters for generalized mass action models using constraint propagation. Math Biosci. Tucker W, Moulton V (2006) Parameter reconstruction for biochemical networks using interval analysis. Reliable Computing 12:1–14. Tumbula-Hansen D, Feng L, Toogood H, Stetter KO, Soll D (2002) Evolutionary divergence of the archaeal aspartyl-tRNA synthetases into discriminating and nondiscriminating forms. J Biol Chem 277(40):37184–37190. Tummuru MK, Sharma SA, Blaser MJ (1995) Helicobacter pylori picB, a homologue of the Bordetella pertussis toxin secretion protein, is required for induction of IL-8 in gastric epithelial cells. Mol Microbiol 18(5):867–876. Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, Gordon JI (2007) The human microbiome project. Nature 449:804–810. Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI (2006) An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444:1027–1031. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43. Uberbacher EC, Mural RJ (1991) Locating protein-coding regions in human DNA sequences by a multiple sensors-neural network approach. Proc Natl Acad Sci USA 88:11261–11265. 
Uchiyama I (2007) MBGD: a platform for microbial comparative genomics based on the automated construction of orthologous groups. Nucleic Acids Res 35: D343–D346. Uetz P, Giot L, Cagney G, Mansfield T A, Judson RS, Knight JR et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403(6770):623–627. Uetz P, Hughes RE (2000) Systematic and large-scale two-hybrid screens. Curr Opin Microbiol 3(3):303–308. Ulrich LE, Koonin EV, Zhulin IB (2005) One-component systems dominate signal transduction in prokaryotes. Trends Microbiol 13(2):52–56. Vaisvila R, Morgan RD, Posfai J, Raleigh EA (2001) Discovery and distribution of superintegrons among pseudomonads. Molecular Microbiology 42:587–601. Valentine DL (2007) Adaptations to energy stress dictate the ecology and evolution of the Archaea. Nat Rev Microbiol 5(4):316–323. Valentin-Hansen P, Eriksen M, Udesen C (2004) The bacterial Sm-like protein Hfq: a key player in RNA transactions. Mol Microbiol 51(6):1525–1533. van der Laan M, Rissler M, Rehling P (2006) Mitochondrial preprotein translocases as dynamic molecular machines. FEMS Yeast Res 6:849–861. van Dongen S (2000) Graph clustering by flow simulation. PhD thesis. University of Utrecht. van Ham RC, Kamerbeek J et al. (2003) Reductive genome evolution in Buchnera aphidicola. Proc Natl Acad Sci USA 100(2):581–586.


Van Ham RCHJ, Moya J, Latorre A (2004) The evolution of symbiosis in insects. Pp. 94–105 in A. Moya and E. Font, eds. The evolution of symbiosis in insects. Oxford University Press. van Niel CB (1955) Classification and taxonomy of the bacteria and blue green algae. A century of progress in the natural sciences, 1853–1953. E. L. Kessel. San Francisco, Ca, California Academy of Sciences: 89–114. van Nimwegen E, Zavolan M, Rajewsky N, Siggia ED (2002) Probabilistic clustering of sequences: inferring new bacterial regulons by comparative genomics. Proc Natl Acad Sci USA 99(11):7323–7328. Vanicek J, Klimek M (1971) A mathematical model of the course of the DNA synthesis in mammalian cells after ultraviolet irradiation and its use in the determination of the length of the replicon. Curr Mod Biol 3:347–352. Vapnik VN (1995) The Nature of Statistical Learning Theory, Springer. Veflingstad SR, Almeida J, Voit EO (2004) Priming nonlinear searches for pathway identification. Theor Biol Med Model 1:8. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304:66–74. VerBerkmoes NC, Shah MB, Lankford PK, Pelletier DA, Strader MB, Tabb DL et al. (2006) Determination and comparison of the baseline proteomes of the versatile microbe Rhodopseudomonas palustris under its major metabolic states. J Proteome Res 5(2):287–298.
Veres Z, Kim IY, Scholz TD, Stadtman TC (1994) Selenophosphate synthetase. Enzyme properties and catalytic reaction. J Biol Chem 269(14):10597–10603. Veres Z, Stadtman TC (1994) A purified selenophosphate-dependent enzyme from Salmonella typhimurium catalyzes the replacement of sulfur in 2-thiouridine residues in tRNAs with selenium. Proc Natl Acad Sci USA 91(17):8092–8096. Verma M, Kagan J, Sidansky D, Srivastava S (2003) Proteomic analysis of cancer-cell mitochondria. Nat Rev Cancer 3:789–795. Vernikos GS, Parkhill J (2006) Interpolated variable order motifs for identification of horizontally acquired DNA: revisiting the Salmonella pathogenicity islands. Bioinformatics 22:2196–2203. Vernikos GS, Thomson NR, Parkhill J (2007) Genetic flux over time in the Salmonella lineage. Genome Biol 8:R100. Versalovic J, Relman D (2006) How bacterial communities expand functional repertoires. PLoS Biol 4:e430. Vibranovski MD, Sakabe NJ et al. (2005) Signs of ancient and modern exon-shuffling are correlated to the distribution of ancient and modern domains along proteins. J Mol Evol 61(3):341–350. Visser D, Heijnen JJ (2002) The mathematics of metabolic control analysis revisited. Metabol Eng 4:114–123. Vitreschak AG, Rodionov DA, Mironov AA, Gelfand MS (2002) Regulation of riboflavin biosynthesis and transport genes in bacteria by transcriptional and translational attenuation. Nucleic Acids Res 30(14):3141–3151.


Vo TD, Palsson BO (2007) Building the power house: recent advances in mitochondrial studies through proteomics and systems biology. Am J Physiol Cell Physiol 292: 164–177. Voet D, Voet JG, Pratt CW (2005) Fundamentals of Biochemistry: Life at the Molecular Level (2nd ed.): John Wiley & Sons Inc. Vogel J, Papenfort K (2006) Small non-coding RNAs and the bacterial outer membrane. Curr Opin Microbiol 9(6):605–611. Vogler AP, Homma M, Irikura VM, Macnab RM (1991) Salmonella typhimurium mutants defective in flagellar filament regrowth and sequence similarity of FliI to F0F1, vacuolar, and archaebacterial ATPase subunits. J Bacteriol 173(11):3564–3572. Voit EO (1991) Canonical Nonlinear Modeling. S-System Approach to Understanding Complexity. Van Nostrand Reinhold, New York. Voit EO (1992) Optimization in integrated biochemical systems. Biotechn Bioengin 40: 572–582. Voit EO (2000) Computational Analysis of Biochemical Systems. A Practical Guide for Biochemists and Molecular Biologists. Cambridge, UK: Cambridge University Press. Voit EO (2004) The Dawn of a New Era of Metabolic Systems Analysis. Drug Discovery Today BioSilico 2(5):182–189. Voit EO, Almeida J (2004) Decoupling dynamical systems for pathway identification from metabolic profiles. Bioinformatics 20(11):1670–1681. Voit EO, Almeida JS, Marino S, Lall R, Goel G, Neves AR et al. (2006) Regulation of glycolysis in lactococcus lactis: an unfinished systems biological case study. IEE Proc Systems Biol 153:286–298. Voit EO, Marino S, Lall R (2005) Challenges for the identification of biological systems from in vivo time series data. In Silico Biol 5(2):83–92. Voit EO, Neves AR, Santos H (2006b) The intricate side of systems biology. Proc Natl Acad Sci USA 103:9452–9457. Voit EO, Savageau MA (1982a) Power-law approach to modeling biological systems; III. Methods of analysis. J Ferment Technol 60(3):223–241. Voit EO, Savageau MA (1982b) Power-law approach to modeling biological systems; II. 
Application to ethanol production. J Ferment Technol 60(3):229–232. Volterra V (1926) Variazioni e fluttuazioni del numero d’individui in specie animali conviventi. Mem. R. Accad. dei Lincei. 2. von Bertalanffy L (1968) General Systems Theory. George Braziller, New York. Von Dohlen CD, Moran NA (2000) Molecular data support a rapid radiation of aphids in the Cretaceous and multiple origins of host alternation. Biol J Linnean Soc 71:689–717. von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B (2003) STRING: a database of predicted functional associations between proteins. Nucleic Acids Res 31:258–261. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M et al. (2005) STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 33 Database Issue, D433–D437. von Mering C, Zdobnov EM, Tsoka S, Ciccarelli FD, Pereira-Leal JB, Ouzounis CA et al. (2003) Genome evolution reveals biochemical networks and functional modules. Proc Natl Acad Sci USA 100(26):15428–15433. Vossbrinck CR, Maddox JV, Friedman S, Debrunner-Vossbrinck BA, Woese CR (1987) Ribosomal RNA sequence suggests microsporidia are extremely ancient eukaryotes. Nature 326(6111):411–414. Waack S, Keller O, Asper R, Brodag T, Damm C, Fricke WF, Surovcik K, Meinicke P, Merkl R (2006) Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models. BMC Bioinformatics 7:142. Wade JT, Roa DC, Grainger DC, Hurd D, Busby SJ et al. (2006) Extensive functional overlap between sigma factors in Escherichia coli. Nat Struct Mol Biol 13(9):806–814.


Wade JT, Struhl K, Busby SJ, Grainger DC (2007) Genomic analysis of protein-DNA interactions in bacteria: insights into transcription and chromosome organization. Mol Microbiol 65(1):21–26. Wagner A (2001) The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol 18(7):1283–1292. Wagner A (2006) Periodic extinctions of transposable elements in bacterial lineages: evidence from intragenomic variation in multiple genomes. Mol Biol Evol 23:723–733. Wagner R (2000) Transcription Regulation in Prokaryotes, Oxford University Press, USA. Wainright PO, Hinkle G, Sogin ML, Stickel SK (1993) Monophyletic origins of the metazoa: an evolutionary link with fungi. Science 260(5106):340–342. Wall DP, Fraser HB, Hirsh AE (2003) Detecting putative orthologs. Bioinformatics 19(13):1710–1711. Wall ME, Hlavacek WS, Savageau MA (2004) Design of gene circuits: lessons from bacteria. Nat Rev Genet 5(1):34–42. Wallace DC, Morowitz HJ (1973) Genome Size and Evolution. Chromosoma 40:121–126. Wan P, Mao F, Olman V, Che D, Liu H, Zhou F, Xu Y (2007) “Operon structural diversity is a reflection of adaptive evolution.” submitted. Wan XF, Xu D (2005) Computational methods for remote homolog identification. Curr Protein Pept Sci 6(6):527–546. Wang FS, Ko CL, Voit EO (2007) Kinetic modeling using S-systems and lin-log approaches. Biochem Eng J 33:238–247. Wang GY, Graziani E, Waters B, Pan W, Li X, McDermott J, Meurer G, Saxena G, Andersen RJ, Davies J (2000) Novel natural products from soil DNA libraries in a streptomycete host. Org Lett 2:2401–2404. Wang JC (1979) Helical repeat of DNA in solution. Proc Natl Acad Sci USA 76:200–203. Wang L, Trawick JD, Yamamoto R, Zamudio C (2004) Genome-wide operon prediction in Staphylococcus aureus. Nucleic Acids Res 32(12):3689–3702. 
Warnecke F, Luginbuhl P, Ivanova N, Ghassemian M, Richardson TH, Stege JT, Cayouette M, McHardy AC, Djordjevic G, Aboushadi N, Sorek R, Tringe SG, Podar M, Martin HG, Kunin V, Dalevi D, Madejska J, Kirton E, Platt D, Szeto E, Salamov A, Barry K, Mikhailova N, Kyrpides NC, Matson EG, Ottesen EA, Zhang X, Hernandez M, Murillo C, Acosta LG, Rigoutsos I, Tamayo G, Green BD, Chang C, Rubin EM, Mathur EJ, Robertson DE, Hugenholtz P, Leadbetter JR (2007) Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature 450:560–565. Wassarman KM, Repoila F, Rosenow C, Storz G, Gottesman S (2001) Identification of novel small RNAs using comparative genomics and microarrays. Genes Dev 15:1637–1651. Watanabe H, Mori H, Itoh T, Gojobori T (1997) Genome plasticity as a paradigm of eubacteria evolution. J Mol Evol 44 Suppl 1:S57–S64. Waters E, Hohn MJ, Ahel I, Graham DE, Adams MD, Barnstead M, Beeson KY, Bibbs L, Bolanos R, Keller M, Kretz K, Lin XY, Mathur E, Ni JW, Podar M, Richardson T, Sutton GG, Simon M, Soll D, Stetter KO, Short JM, Noordewier M (2003) The genome of Nanoarchaeum equitans: Insights into early archaeal evolution and derived parasitism. Proc Natl Acad Sci USA 100:12984–12988. Watson JD, Crick FH (1953) Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171:737–738. Weinbauer MG, Brettar I, Hofle MG (2003) Lysogeny and virus-induced mortality of bacterioplankton in surface, deep, and anoxic marine waters. Limnol Oceanogr 48:1457–1465. Weiss SB, Gladstone L (1959) A mammalian system for the incorporation of cytidine triphosphate into ribonucleic acid. J Am Chem Soc 81:4118–4119.


Weissman KJ (2004) Polyketide biosynthesis: understanding and exploiting modularity. Philos Transact A Math Phys Eng Sci 362(1825):2671–2690. Welch RA, Burland V et al. (2002) Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci USA 99(26):17020–17024. Weng L, Rubin EM, Bristow J (2006) Application of sequence-based methods in human microbial ecology. Genome Res 16:316–322. Westover BP (2005) Operon Finding Software (OFS). http://www.cse.wustl.edu/~jbuhler/research/operons. Westover BP, Buhler JD, Sonnenburg JL, Gordon JI (2005) Operon prediction without a training set. Bioinformatics 21(7):880–888. Whisstock JC, Lesk AM (2003) Prediction of protein function from protein sequence and structure. Q Rev Biophys 36(3):307–340. Whitaker RJ, Banfield JF (2006) Population genomics in natural microbial communities. Trends in Ecology & Evolution 21:508–516. Whitman WB, Coleman DC, Wiebe WJ (1998) Prokaryotes: the unseen majority. Proc Natl Acad Sci USA 95:6578–6583. Wicker T, Guyot R, Yahiaoui N, Keller B (2003) CACTA transposons in Triticeae. A diverse family of high-copy repetitive elements. Plant Physiol 132:52–63. Wilcox M, Nirenberg M (1968) Transfer RNA as a cofactor coupling amino acid synthesis with that of protein. Proc Natl Acad Sci USA 61(1):229–236. Wilkes T, Laux H, Foy CA (2007) Microarray data quality – review of current developments. Omics 11(1):1–13. Williams KP (2002) Integration sites for genetic elements in prokaryotic tRNA and tmRNA genes: sublocation preference of integrase subfamilies. Nucleic Acids Res 30:866–875. Wilmes P, Bond PL (2006) Metaproteomics: studying functional gene expression in microbial ecosystems. Trends in microbiology 14:92–97. Wilson CJ, Zhan H, Swint-Kruse L, Matthews KS (2007) The lactose repressor system: paradigms for regulation, allosteric behavior and protein folding. Cell Mol Life Sci 64(1):3–16.
Winkler WC (2005) Riboswitches and the role of noncoding RNAs in bacterial metabolic control. Curr Opin Chem Biol 9(6):594–602. Winkler WC, Breaker RR (2005) Regulation of bacterial gene expression by riboswitches. Annu Rev Microbiol 59:487–517. Winogradsky S (1952) On the classification of bacteria. Ann Inst Pasteur (Paris) 82(2):125–131. Winstanley C, Hart CA (2001) Type III secretion systems and pathogenicity islands. J Med Microbiol 50(2):116–126. Wistow GJ (1995) Molecular biology and evolution of crystallins: gene recruitment and multifunctional proteins in the eye lens, New York, Austin, TX, Springer; R.G. Landes. Withers M, Wernisch L, Dos Reis M (2006) Archaeology and evolution of transfer RNA genes in the Escherichia coli genome. RNA 12:933–942. Woese CR (1987) Bacterial evolution. Microbiol Rev 51(2):221–271. Woese CR, Fox GE (1977) Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci USA 74(11):5088–5090. Woese CR, Kandler O, Wheelis ML (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci USA 87(12):4576–4579. Woese CR, Maniloff J, Zablen LB (1980) Phylogenetic Analysis of the Mycoplasmas. Proc Natl Acad Sci USA 77:494–498.


Wolf YI, Aravind L, Grishin NV, Koonin EV (1999) Evolution of aminoacyl-tRNA synthetases – analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events. Genome Res 9(8):689–710. Wolf YI, Rogozin IB, Kondrashov AS, Koonin EV (2001) Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res 11(3):356–372. Wong JT (1975) A co-evolution theory of the genetic code. Proc Natl Acad Sci USA 72(5):1909–1912. Woodcock CL (2006) Chromatin architecture. Curr Opin Struct Biol 16(2):213–220. Woyke T, Teeling H, Ivanova NN, Huntemann M, Richter M, Gloeckner FO, Boffelli D, Anderson IJ, Barry KW, Shapiro HJ, Szeto E, Kyrpides NC, Mussmann M, Amann R, Bergin C, Ruehland C, Rubin EM, Dubilier N (2006) Symbiosis insights through metagenomic analysis of a microbial consortium. Nature 443:950–955. Wright F (1990) The 'effective number of codons' used in a gene. Gene 87:23–29. Wu CH, Nikolskaya A, Huang H, Yeh LS, Natale DA, Vinayaka DR, Hu ZZ, Mazumder R, Kumar S, Kourtesis P, Ledley RS, Suzek BE, Arminski L, Chen Y, Zhang J, Cardenas JL, Chung S, Castro-Alvear J, Dinkov G, Barker WC (2004) PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Res 32(Database issue):D112–D114. Wu D, Daugherty SC, Van Aken SE, Pai GH, Watkins KL, Khouri H, Tallon LJ, Zaborsky JM, Dunbar HE, Tran PL, Moran NA, Eisen JA (2006) Metabolic complementarity and genomics of the dual bacterial symbiosis of sharpshooters. PLoS Biol 4:e188. Wu H, Mao F, Olman V, Xu Y (2005) Accurate prediction of orthologous gene groups in microbes. Proc IEEE Comput Syst Bioinform Conf:73–79.
Wu H, Mao F, Olman V, Xu Y (2007) Hierarchical classification of functionally equivalent genes in prokaryotes. Nucleic Acids Res 35(7):2125–2140. Wu H, Su Z, Mao F, Olman V, Xu Y (2005) Prediction of functional modules based on comparative analysis and Gene Ontology application. Nucleic Acids Res 33:2822–2837. Wu HL, Bagby S, van den Elsen JM (2005) Evolution of the genetic triplet code via two types of doublet codons. J Mol Evol 61(1):54–64. Wu J, Kasif S, DeLisi C (2003) Identification of functional links between genes using phylogenetic profiles. Bioinformatics 19:1524–1530. Wu X, Dewey TG (2006) From microarray to biological networks: Analysis of gene expression profiles. Methods Mol Biol 316:35–48. Wylie JL, Berry JD, McClarty G (1996) Chlamydia trachomatis CTP synthetase: molecular characterization and developmental regulation of expression. Mol Microbiol 22(4):631–642. Wynne-Edwards VC (1962) Animal Dispersion in Relation to Social Behavior. London, Oliver & Boyd. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D (2002) DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 30:303–305. Xia T, SantaLucia Jr J, Burkard ME, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 37(42):14719–14735.


Xie G, Keyhani NO, Bonner CA, Jensen RA (2003) Ancient origin of the tryptophan operon and the dynamics of evolutionary change. Microbiol Mol Biol Rev 67(3): 303–342. Xu H, Hoover TR (2001) Transcriptional regulation at a distance in bacteria. Curr Opin Microbiol 4(2):138–144. Xu J (2006) Microbial ecology in the age of genomics and metagenomics: concepts, tools, and recent advances. Molecular ecology 15:1713–1731. Xu J, Bjursell MK, Himrod J, Deng S, Carmichael LK, Chiang HC, Hooper LV, Gordon JI (2003) A genomic view of the human-Bacteroides thetaiotaomicron symbiosis. Science 299:2074–2076. Xu J, Mahowald MA, Ley RE, Lozupone CA, Hamady M, Martens EC, Henrissat B, Coutinho PM, Minx P, Latreille P, Cordum H, Van Brunt A, Kim K, Fulton RS, Fulton LA, Clifton SW, Wilson RK, Knight RD, Gordon JI (2007) Evolution of Symbiotic Bacteria in the Distal Human Intestine. PLoS Biol 5:e156. Xu Y, Mural RJ, Shah M, Uberbacher EC (1994) Recognizing exons in genomic sequence using GRAIL II. In: Setlow J (ed) Genetic Engineering. Principles and Methods, New York, Plenum Press: Vol.16, pp. 241–253. Yachie N, Arakawa K, Tomita M (2006) On the interplay of gene positioning and the role of rho-independent terminators in Escherichia coli. FEBS Lett 580(30):6909–6914. Yada T, Hirosawa M (1996) Detection of short protein coding regions within the cyanobacterium genome: application of the hidden Markov model. DNA Res 3: 355–361. Yada T, Nakao M, Totoki Y, Nakai K (1999) Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov models. Bioinformatics 15(12):987–993. Yada T, Totoki Y, Takagi T, Nakai K (2001) A novel bacterial gene-finding system with improved accuracy in locating start codons. DNA Res 8:97–106. Yamamoto N (1967) The origin of bacteriophage P221. Virology 33:545–547. Yamanishi Y, Vert JP, Kanehisa M (2004) Protein network inference from multiple genomic data: a supervised approach. Bioinformatics 20 Suppl 1:I363–I370. 
Yamanishi Y, Vert JP, Kanehisa M (2005) Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics 21 Suppl 1:i468–i477. Yan M, Lin ZS, Zhang CT (1998) A new Fourier transform approach for protein coding measure based on the format of the Z curve. Bioinformatics 14:685–690. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13(5):555–556. Yanofsky C (1971) Tryptophan biosynthesis in Escherichia coli. Genetic determination of the proteins involved. JAMA 218(7):1026–1035. Yanofsky C, Platt T, Crawford IP, Nichols BP, Christie GE, Horowitz H, VanCleemput M, Wu AM (1981) The complete nucleotide sequence of the tryptophan operon of Escherichia coli. Nucleic Acids Res 9(24):6647–6668. Yarus M (1988) A specific amino acid binding site composed of RNA. Science 240(4860):1751–1758. Yeger-Lotem E, Sattath S, Kashtan N, Itzkovitz S, Milo R, Pinter RY et al. (2004) Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction. Proc Natl Acad Sci USA 101(16):5934–5939. Yeung MK, Tegner J, Collins JJ (2002) Reverse engineering gene networks using singular value decomposition and robust regression. Proc Natl Acad Sci USA 99(9):6163–6168. Yooseph S, Sutton G et al. (2007) The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 5(3):e16.


Yu H, Gerstein M (2006) Genomic analysis of the hierarchical structure of regulatory networks. Proc Natl Acad Sci USA 103(40):14724–14731. Yuan J, Palioura S, Salazar JC, Su D, O’Donoghue P, Hohn MJ, Cardoso AM, Whitman WB, Soll D (2006) RNA-dependent conversion of phosphoserine forms selenocysteine in eukaryotes and archaea. Proc Natl Acad Sci USA 103(50):18923–18927. Zambrano MM, Siegele DA, Almiron M, Tormo A, Kolter R (1993) Microbial competition: Escherichia coli mutants that take over stationary phase cultures. Science 259: 1757–1760. Zhang B, VerBerkmoes NC, Langston MA, Uberbacher E, Hettich RL, Samatova NF (2006) Detecting differential and correlated protein expression in label-free shotgun proteomics. J Proteome Res 5(11):2909–2918. Zhang CT, Zhang R (1991) Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucleic Acids Res 19:6313–6317. Zhang GQ, Cao ZW, Luo QM, Cai YD, Li YX (2006) Operon prediction based on SVM. Comput Biol Chem 30(3):233–240. Zhang K, Martiny AC, Reppas NB, Barry KW, Malek J, Chisholm SW, Church GM (2006) Sequencing genomes from single cells by polymerase cloning. Nature Biotechnology 24:680–686. Zhang LV, Wong SL, King OD, Roth FP (2004) Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 5:38. Zhang R, Zhang C-T (2004) A systematic method to identify genomic islands and its applications in analyzing the genomes of Corynebacterium glutamicum and Vibrio vulnificus CMCP6 chromosome I. Bioinformatics 20:612–622. Zhang Y, Baranov PV, Atkins JF, Gladyshev VN (2005) Pyrrolysine and selenocysteine use dissimilar decoding strategies. J Biol Chem 280(21):20740–20751. Zhang Y, Gladyshev VN (2005) An algorithm for identification of bacterial selenocysteine insertion sequence elements and selenoprotein genes. Bioinformatics 21(11):2580–2589. 
Zhang Y, Gladyshev VN (2007) High content of proteins containing 21st and 22nd amino acids, selenocysteine and pyrrolysine, in a symbiotic deltaproteobacterium of gutless worm Olavius algarvensis. Nucleic Acids Res 35(15):4952–4963. Zhang Z, Voit EO, Schwacke LH (1996) Parameter estimation and sensitivity analysis of S-systems using a genetic algorithm. In T. Yamakawa & G. Matsumoto (Eds.), Methodologies for the Conception, Design, and Application of Intelligent Systems. Singapore: World Scientific, pp. 155–158. Zhaxybayeva O, Gogarten JP (2002) Bootstrap, Bayesian probability and maximum likelihood mapping: Exploring new tools for comparative genome analyses. BMC Genomics 3(1):4. Zhaxybayeva O, Gogarten JP (2003) An improved probability mapping approach to assess genome mosaicism. BMC Genomics 4(1):37. Zhaxybayeva O, Gogarten JP (2004) Cladogenesis, coalescence and the evolution of the three domains of life. Trends in Genetics 20(4):182–187. Zhaxybayeva O, Gogarten JP, Charlebois RL, Doolittle WF, Papke RT (2006) Phylogenetic analyses of cyanobacterial genomes: Quantification of horizontal gene transfer events. Genome Res 16(9):1099–1108. Zhaxybayeva O, Hamel L, Raymond J, Gogarten JP (2004a) Visualization of the phylogenetic content of five genomes using dekapentagonal maps. Genome Biol 5(3):R20. Zhaxybayeva O, Lapierre P, Gogarten JP (2005) Ancient gene duplications and the root(s) of the tree of life. Protoplasma 227(1):53–64. Zheng Y (2002) Metabolic biochemical pathways for operon prediction. http://genomics10.bu.edu/operons.


Zheng Y, Szustakowski JD, Fortnow L, Roberts RJ, Kasif S (2002) Computational identification of operons in microbial genomes. Genome Res 12(8):1221–1230. Zhou F, Olman V, Xu Y (2007) Towards construction of genome-scale maps of recently active Insertion Sequences in cyanobacteria and archaea. Submitted. Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P et al. (2001) Global analysis of protein activities using proteome chips. Science 293:2101–2105. Zhu H, Hu GQ, Yang YF, Wang J, She ZS (2007) MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes. BMC Bioinformatics 8:97. Zhu HQ, Hu GQ, Ouyang ZQ, Wang J, She ZS (2004) Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics 20: 3308–3317. Zhu J, Wiener MC, Zhang C, Fridman A, Minch E, Lum PY et al. (2007) Increasing the power to detect causal associations by combining genotypic and expression data in segregating populations. PLoS Comput Biol 3(4):e69. Zhu W, Becker DF (2003) Flavin redox state triggers conformational changes in the PutA protein from Escherichia coli. Biochemistry 42(18):5469–5477. Zhu W, Gincherman Y, Docherty P, Spilling CD, Becker DF (2002) Effects of proline analog binding on the spectroscopic and redox properties of PutA. Arch Biochem Biophys 408(1):131–136. Zimmerman SB (2006) Shape and compaction of Escherichia coli nucleoids. J Struct Biol 156(2):255–261. Zink RT, Kemble RJ, Chatterjee AK (1984) Transposon Tn5 mutagenesis in Erwinia carotovora subsp. carotovora and E. carotovora subsp. atroseptica. J Bacteriol 157:809–814. Zmasek CM, Eddy SR (2001) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 17(9):821–828. Zoetendal EG, Ben-Amor K, Akkermans AD, Abee T, de Vos WM (2001) DNA isolation protocols affect the detection limit of PCR approaches of bacteria in samples from the human gastrointestinal tract. Systematic and Applied Microbiology 24:405–410. 
Zoetendal EG, Collier CT, Koike S, Mackie RI, Gaskins HR (2004) Molecular ecological analysis of the gastrointestinal microbiota: a review. The Journal of Nutrition 134: 465–472. Zuckerkandl E, Pauling L (1965) Molecules as documents of evolutionary history. J Theor Biol 8(2):357–366.

INDEX

σ factor, 190–193, 198–200, 205, 206

absent gene, 106, 146, 163, 167, 172 accessory genes, 108, 110 accuracy, 45, 55, 57, 63, 67–72, 95, 125, 132, 242, 243, 245, 250, 254–256, 266, 275, 298, 382 acid mine drainage, 355 activation, 192, 193, 198–200, 202, 282 Agrobacterium, 142, 143 amino acid biosynthesis, 78, 93, 94, 176, 233 amino acid composition, 15, 58, 93, 95 aminoacyl-tRNA synthetase, 79–81, 85, 97 ancestral genome, 100, 157, 159, 161, 162, 164, 167–169, 182, 183 ancestral sequence reconstruction, 95 ancient gene duplication, 96 anticodon, 77–80, 89 antigenic variation, 24, 32 assimilation, 262, 270, 272, 291, 300, 301 attC, 27, 28, 117, 118, 129 attenuation, 235, 236 attI, 27, 28, 117, 118, 129 atypical nucleotide composition, 143

bacteriophage, 1, 2, 4, 5, 25, 28, 31, 35, 108, 116, 284, 360 bacteriophage Mu, 27, 28 base pairs, 2, 9, 11, 66, 119, 174, 204, 263, 284, 353, 373 Bayes’ theorem, 42 Bayesian method, 242, 306 BDGF, 60 bifurcation, 141, 283, 338 binding site, 31, 52, 61, 63, 91, 175, 187, 193, 195, 202, 203, 259–275, 286, 287, 294, 310 biochemical systems theory, 303, 306, 328 Bio-Dictionary, 60 biological species concept, 141 bipartition, 147, 148 BranchClust algorithm, 210, 225, 228–230 breakpoint, 169, 170 BST, 303, 306–310, 328, 330, 331, 343 Buchnera, 7, 103, 104, 139, 140, 144, 148, 154, 160, 164, 184, 228

CAMERA, 389, 394, 396 canonical model, 302, 303, 327, 331, 333 Carsonella ruddii, 7, 8 character genes, 106, 108, 109 Chargaff rule, 21, 22 ChIP-chip, 202, 207, 260, 286, 287, 290, 294, 311, 313 Chi-sites, 30 chromosome, 1–9, 11, 15–19, 21–25, 27–31, 33–36, 174, 177, 180, 193, 196, 197, 202, 204, 206, 262, 286, 353 circular chromosome, 4, 23 cis-regulatory elements, 186–189, 193–195, 260, 263–267, 272 cis-regulatory motif, 264, 266–269, 274 coalescence, 88 coding, 5, 6, 9, 32, 33, 35, 39–44, 46, 47, 50–56, 59–62, 65–67, 73, 76, 81, 82, 84, 93, 123, 129, 130, 162, 170, 177, 187, 202, 269, 270, 274, 382 codon, 7, 9–12, 16, 18–22, 31, 39, 40, 42, 43, 47, 50–65, 67, 70, 72, 73, 75–82, 84, 86, 89–94, 96, 97, 123–125, 127, 129, 143, 162, 234, 242, 252, 382 codon usage, 16, 20, 21, 40, 60, 123, 124, 125 commensal, 159, 368 community genomics, 346, 347, 359 complexity, 33, 39, 91, 95, 97, 156, 157, 163, 177, 221, 285, 294, 298, 307, 339, 348, 350, 351, 353, 355, 361, 376, 380, 382, 384–386 composite transposons, 25, 121 compositional stratigraphy, 96

conjugation, 1, 4, 28, 119, 121 conjugative transposon, 28, 119, 121 consensus model, 92 conservative transposition, 25, 26 continuous (system/model), 185, 325 convergent evolution, 96, 139, 150 core promoter, 185, 186, 188 co-regulation model, 246 co-regulatory genes, 291 CRISPR, 35–37 CRITICA, 60 cyanobacterium, 270 deamination, 7, 20 decision tree, 45, 242, 245 deleted interpolation, 45–47 deletion bias, 9, 174 deterministic (model), 325, 334 differential equation, 283, 303, 304, 306–308, 316, 317, 325, 329, 331, 332, 337 dinucleotide relative abundances, 12–14, 16, 18 directed acyclic graphs, 56 disambiguation of coding space, 96 DNA repeats, 24 DNA-dependent RNA polymerase (RNAP), 186, 188–195, 198–202, 208, 236, 237, 259, 267 domain superfamily, 218 double-stranded DNA (dsDNA), 1, 2, 12, 22, 200, 263 down-regulation, 259, 260 duplication event, 173, 212, 213 dynamic model, 283, 333, 336 E. coli, 1, 3, 14, 16–20, 23, 24, 30, 31, 33, 34, 43, 44, 50, 67, 69–72, 103, 104, 106, 114, 118, 120, 127, 144, 153, 164–166, 174, 182, 186, 187, 190, 191, 195–198, 201, 202, 204–207, 221, 228, 233, 235–238, 240–245, 247, 248, 250–255, 261–263, 270, 275, 289, 290, 294, 297, 299, 300, 302, 326, 335, 351, 367, 375, 379 EasyGene, 55, 61, 62, 64, 65, 69, 70, 73 ecological genomics, 346, 390 ECOPARSE, 41, 50, 51, 53, 63 effective population size, 171 effector signals, 197 elasticity, 330

elemental chemical kinetics, 316 embedded quartet, 148, 149 endosymbiont, 45, 103, 139, 140, 154, 156, 158, 159, 161, 171–176, 178, 180–182, 228 endosymbiotic bacteria, 139 enhancer binding protein (EBP), 192, 193, 195, 198, 199 entropy density profile, 58 environmental genomics, 346 environmental shotgun sequencing, 353, 372, 373, 376 evolution of operon, 246, 256 expectation-maximization, 95 Expectation-Maximization algorithm, 49 expression, 16, 27, 28, 32, 49, 71, 91, 117, 118, 120, 123, 125, 134, 181, 186, 188, 190, 193, 195, 197, 200, 201, 203–205, 223, 235, 236, 240, 241, 248, 259, 260, 262, 265, 283, 287, 291, 292, 298, 300–304, 306, 312, 315, 327, 334–336, 346, 355, 375 extended core, 106–109 facultative symbiont, 159, 161 Fisher model, 246 fixation, 84, 114, 143, 171, 172, 174, 175, 301, 371, 384 flux, 282, 302, 306, 315, 326, 329, 330, 333, 336, 343, 374 flux balance analysis, 299, 306, 326 Forward-Backward algorithm, 48 Fourier spectrum, 53, 54 FrameD, 56, 69 Frankia, 104, 144 frequency of gene transfer, 146 functional metagenomics, 351, 355 functional module, 289, 290, 301, 336 functional relatedness, 234, 237, 238, 249, 288, 289 G+C content, 7–13, 15–17, 20–22, 47, 58, 72, 73, 115, 123 G-C skew, 21–23 gene, 4–9, 16, 20–25, 27–29, 32, 35, 37, 39–43, 45–53, 55–75, 81, 86, 88, 89, 91, 96, 99–101, 103–111, 113, 115–119, 123, 124, 126–129, 132–134, 137–150, 156, 158–166, 168–173, 175, 176, 178–188, 190, 191, 194, 199, 200, 203–208, 210–221, 223–225, 227–231, 233, 234, 236, 238–242, 244–247, 249–254, 256,

259, 263, 265, 267, 272, 275, 281–284, 287, 288, 290–292, 296–304, 310–312, 315, 334–336, 346–348, 350, 352, 355, 357, 358, 360, 366, 368, 370–373, 375, 376, 379–382, 384, 388, 390 gene cluster, 123, 238, 239, 246, 247, 249, 290, 292 gene content, 5–8, 111, 126, 127, 144, 156, 158–162, 164–168, 171, 172, 175, 179, 183, 184, 351, 381 gene expression arrays, 239, 240 gene family, 75, 88, 96, 99, 100, 104–109, 111, 139, 145–149, 163, 210, 214–221, 224, 230, 368 gene flow, 105, 137, 139, 141 gene fusion, 170, 288, 290, 310 gene gain, 161, 168 gene loss, 7, 127, 144, 146, 158, 161, 168, 171, 172, 176, 180, 223, 231, 247 gene order, 159, 164–166, 169, 170, 185, 247, 248, 252, 256 gene organization, 186, 284 gene remnant, 146 gene superfamily, 217, 227, 230 gene transcription, 186, 188, 190, 191, 199, 203, 205, 208, 238, 259, 263, 282, 283 gene transfer, 4, 5, 21, 24, 37, 73, 88, 100, 110, 134, 137–139, 141, 142, 144, 146, 147, 180, 213, 350, 358, 388 GeneHacker, 51, 65, 69 GeneMark, 41–45, 52, 53, 55, 57, 62, 65, 69 GeneMark.fba, 53 GeneMark.fbf, 53 GeneMark-Genesis, 57 GeneMark.hmm, 52, 53, 58, 63, 64, 65 GeneMarkS, 58, 64, 69–71 generalization, 13, 46, 327, 360 generalized HMM, 50–52 generalized mass action system, 306 GeneScan, 53, 54, 372 genetic code, 75, 77–79, 81, 82, 86–96 genetic drift, 84, 140, 171, 183, 184 genetic programming, 68, 309 genetic regulatory network, 275 genome, 1–3, 5–9, 12–16, 18, 20–25, 29, 30, 32–35, 37, 39, 47, 51, 55–57, 59, 67, 69–74, 81, 82, 84, 86, 89, 99–108, 110, 113, 114, 121, 123, 125–134, 139–141, 143–150, 153–164, 167–170, 172–186, 193–195, 200–202, 210, 214, 220, 222, 223, 225, 228, 231, 233, 234, 237, 238,

240–256, 259, 260, 263, 265, 266, 269–272, 274, 275, 279, 284, 287–290, 294, 296, 298, 311–315, 320, 334, 346–350, 353, 355, 357–359, 362, 364–366, 368, 369, 371, 373, 374, 377–380, 382, 384, 387, 390, 394 genome annotation, 84, 162, 390 genome rearrangement, 7, 128, 133, 164, 170 genome reduction, 7–9, 153, 157–159, 170, 172, 178, 179, 183 genome signature, 12, 13, 15, 16, 20, 21, 123–125 genome size, 6, 9, 104, 130, 144, 153–161, 170–173, 175, 179, 180, 183, 185 genomic island, 21, 28, 110, 113, 122, 134 GISMO, 62, 69 glimmer, 43, 45, 58, 62–65, 69, 70, 72 Global Ocean Sampling (GOS), 356–359, 378, 389 glycolysis, 294, 295, 321, 322, 324, 341 GMA system/model, 329–331, 343 GPboostReg , 68 graph theory, 251 group selection, 142 half-life of a pseudogene, 172 helical period, 31 hidden Markov model, 41, 47, 52, 62, 67, 115, 124, 242, 243, 294 hierarchical union of genes from operons, 249, 254 highways of gene sharing, 150 HIP1, 30, 32 homeomorphic family, 219 homogeneous models, 16, 41, 42, 52, 62 homologous genes, 100, 110, 210–213, 215, 217, 246, 249, 254 homologs, 50, 60, 84, 100–103, 116, 119, 127, 145, 158, 209, 210, 213, 214, 229, 230 Hon-Yaku, 64, 71 horizontal gene transfer, 5, 88, 100, 113, 123, 125, 127, 138, 146, 161, 211, 246, 247 incongruence of phylogenetic trees, 138 induction, 116, 117, 236, 367 information content, 59, 268, 289 inhibition, 303, 319, 324, 327 inparalogs, 213, 216, 220, 227, 230, 231

insertion sequence (IS), 82, 113, 118, 120, 134, 155 integrase, 27, 28, 116–118, 121, 128, 129, 133, 134 integron, 27, 28, 113, 117–119, 128, 129, 132–134 intergenic, 9–12, 24, 30, 39, 50, 67, 68, 153, 155, 175, 178, 193, 195, 233, 241–243, 249, 251–253, 269, 270, 275, 372 intergenic distance, 234, 242 inter-operonic distance, 234, 241 interpolated Markov model, 43, 45, 56 inverse modeling, 305 inversion, 27, 170 inverted repeat, 25, 117, 119, 262 IS elements, 116, 119–121, 129, 130, 132, 134, 155, 159, 161, 162 KAAS, 296, 298, 311 kinetics, 207, 282, 313, 316, 318, 333 Kingman’s coalescence, 141 Lac operon, 233, 236, 237, 240 Lactococcus lactis (L. lactis), 30, 321 lagging strand, 22, 23 lateral gene transfer (LGT), 9, 21, 37, 350, 351, 355, 366, 368 leading strand, 22, 23 Lengthen-Shuffle, 54 lineage sorting, 140, 141, 150 linear discriminant function, 56 lin-log model, 330, 331, 333 long branch attraction, 140, 150 loop, 4, 66, 77, 79, 80, 199, 200, 205 Markov chain, 13, 41, 45 Markov cluster algorithm, 250 Markov model, 16–18, 30, 41–43, 45–47, 52, 58, 62 MED, 58, 64, 69, 70 MED-Start, 64 metabolic network, 178, 183, 282, 299, 301, 302, 306, 320, 325, 326, 328, 330, 333–336 metabolic pathways, 15, 76, 93, 94, 150, 176, 182, 249, 281–285, 294, 296, 297, 299, 302, 312, 336, 348, 365, 371, 384 metabolism, 20, 94, 142, 176, 177, 179, 181, 182, 196, 233, 255, 281, 285, 302, 315, 321–324, 327, 333, 335, 341, 342,

359–362, 364, 365, 367, 370, 375, 383, 389 metabolites, 177, 196, 197, 203, 207, 282, 285, 303, 304, 308, 315, 317, 318, 320, 321, 324–326, 328, 332, 333, 335, 336, 343 methods for detection of HGT, 143 Michaelis-Menten rate law, 316 microarray, 67, 240, 241, 252, 260, 265, 283–286, 290, 291, 298–301, 312, 313, 334–336, 364, 375, 376 microbial communities, 142, 143, 345, 349–351, 354–356, 360, 362, 363, 365, 368, 372, 374, 375, 377, 389 microbial ecology, 346, 352, 359, 360, 364, 367, 383, 387, 389 microbial observatories, 383, 394 microbial oceanography, 389 microbiome, 345, 349, 361, 363, 364, 366, 367, 369–371, 388 minimal gene set, 105, 175, 176, 181, 184 minimal genome, 105, 175, 176, 177, 182 model design, 320, 321–323, 333, 334, 340 modularity model, 246 modulon, 234, 245 molecular markers, 138, 139 monera, 137 monocistronic, 233 Monte-Carlo simulation, 339, 340, 342 Muller’s ratchet, 171 mutualist, 367, 368 Mycobacterium leprae, 5, 153, 158 Mycoplasma, 7, 24, 32, 77, 79, 154, 157, 191 natal model, 246 natural selection, 141, 142, 163, 171, 174, 175, 183, 184 natural taxonomy, 137 neural network, 41, 55, 67, 73, 242, 244, 309 nitrogen, 92, 114, 142, 144, 270, 272, 283, 291, 301, 335, 349, 371, 376, 384, 386 non-orthologous gene displacement, 100, 181 Northern blot, 71, 239, 240 nucleoid, 4, 186, 193, 199, 200, 202, 206 nucleoid-associated proteins, 193, 202, 206 nucleotide composition, 16–18, 21, 40, 64, 382 nucleotide triphosphates, 192

obligate symbiont, 7, 159 operator, 186, 188, 195, 200, 235, 236 operon, 24, 29, 62, 64, 116, 153, 186, 187, 202, 206, 233–256, 259–263, 265–267, 269, 274–276, 284, 288–290, 297–299, 301, 336, 371, 386, 394 operon prediction, 233–235, 240, 242, 251, 254, 255 optimization, 45, 46, 63, 64, 293, 297, 300, 301, 306–310, 320, 330, 336, 342, 343 ORF, 39, 41, 43, 45, 52, 53, 57, 59–62, 65, 73, 117, 118, 120, 129 origin of replication, 4, 16, 22, 23, 31 orphan open reading frame, 162 ORPHEUS, 59, 69 orthologous, 100, 101, 110, 127, 145, 146, 149, 161–163, 165, 181, 183, 193, 210, 212, 214, 216, 220, 224, 231, 234, 237, 245, 249, 250, 265, 266, 268, 272, 290, 296 orthologous genes, 100, 110, 146, 149, 161, 165, 181, 183, 193, 212, 214, 220, 231, 237, 245, 268, 272 orthologous replacement, 100, 145, 146 orthologs, 100, 126, 144, 145, 158, 161, 163, 164, 169, 183, 209–216, 219–225, 229–231, 248, 265–267, 269–272, 275, 311 outparalogs, 210, 212, 213, 227 overlap, 39, 52, 65, 218, 373, 380 pan-genome, 105, 106, 108, 109, 351, 374 paralogous, 110, 163, 210, 212 paralogous gene, 100, 163, 182, 220, 224 paralogs, 61, 94, 95, 100, 161, 183, 190, 205, 209–216, 220–222, 225, 227, 229–231, 247 parameter estimation, 304, 307–309, 317, 334 parasite, 1, 6, 7, 175, 176, 180, 181, 193 parasitic genes, 142 parsimony, 164, 167, 168, 183 pathogenicity island, 21, 25, 28, 105, 114, 115, 229 PathoLogic, 296–299, 311 pathway mapping, 294, 296–299, 336 PCR, 115, 239, 352, 371, 372, 375, 376, 384, 390 phage, 5, 28, 29, 35, 36, 115–117, 121, 122, 128, 131, 133–135, 173, 286, 358–361, 375

phosphonates, 262 phosphorus, 262, 291, 300, 301, 376 phosphorylation, 76, 196, 197, 199, 260, 283, 300 photosynthetic, 109, 272 phyletic patterns, 145, 146 phylogenetic footprinting, 260, 261, 265–267, 270, 272, 275 phylogenetic profile, 223, 241, 251, 287–290, 293, 300, 336 phylogenetic tree, 21, 80, 86, 95, 146, 147, 149, 150, 157, 163, 211–213, 221, 225, 231, 248, 266, 272 plasmid, 1–5, 16, 21, 25–28, 35, 113, 117, 118, 121, 122, 130, 134, 142, 160, 172, 354, 355, 373 PMAP, 297–299, 301, 311 polycistronic, 175, 186, 233, 255, 259 population genomics, 346 position weight matrix, 64, 264 post-translational modification (PTM), 285, 335 power-law approximation, 328, 331, 334 prediction of orthologous gene, 245 prediction of pathway, 238, 245, 290, 310 Prochlorococcus, 150, 153, 158, 270, 375 profile, 41, 43, 44, 68, 186, 216–219, 240, 260, 263–271, 274, 275, 283, 287, 288, 292, 300, 364, 389 prokaryotes, 4, 6–8, 25, 32, 37, 39, 51, 63, 65, 69, 72, 73, 82, 113, 116, 120–122, 130, 137, 138, 140, 153, 157, 158, 162, 170, 172, 173, 175, 179, 183, 186, 200, 204–206, 210, 219, 223, 224, 231, 233, 240, 250, 259, 261, 263, 275, 282–284, 287 prophage, 28, 29, 113, 116, 117, 119, 121, 127, 128, 131, 361 protein-DNA interaction, 279, 285, 286, 294, 298, 300, 301, 334 protein-protein interaction, 239, 245, 284–286, 288, 289, 292, 293, 300–302, 310, 335, 336 pseudogene, 6, 9, 153, 158, 159, 161–163, 167, 168, 172, 174, 183 pseudoorthologs, 210, 214 pseudoparalogs, 210, 214 pyrrolysine, 77, 82–85, 87 quartet, 147–149

rate constant, 328–330, 332 rate law, 304, 316, 327 reciprocal best BLAST hit (RBH) method, 220 reciprocal blast search, 100, 101, 110, 222 Reciprocal Smallest Distance (RSD) method, 222 recombination, 5, 7, 9, 25–28, 30, 32, 116, 117, 120, 141, 161, 171, 176–178, 181, 201, 356, 366 reconciled tree, 223 reference tree, 147–149 regulatory pathways, 250, 281–284, 300 regulon, 187, 193, 204–206, 234, 238, 239, 245, 249, 250, 259–267, 270–272, 274, 275, 279, 284, 288, 289, 297, 298, 301 REP elements, 30 replicative transposition, 26, 27 replicon, 1–5, 16, 25–27, 113 repression, 193, 199–201, 236 res site, 26, 27 RescueNet, 55 response regulator, 196, 197, 260, 283 reverse engineering, 307 ribosomal RNA, 66, 68, 138, 144, 199, 233, 352, 357 ribosome, 63, 66, 75, 77, 89, 175, 203, 204, 237, 272, 300 RNA genes, 66–68, 71, 156, 162, 176 RNA polymerase, 188, 190, 194, 203, 204, 233, 236, 237, 259, 282, 294 RNAmmer, 68 r-scan statistics, 31 RSCU, 55 RT-PCR, 239, 240 Sargasso Sea survey, 356 SCFG, 68 scoring function, 268, 270 SECIS element, 82, 84 secondary structure, 66, 67, 71, 217, 235 SEED, 297, 298, 311, 390, 394 selection coefficient, 171 selective sweep, 141, 142 selenocysteine, 77, 82–84, 87 self-organizing maps, 55 selfish genes, 142, 143 selfish operon model, 247, 256 self-learning algorithm, 57

sensitivity, 68–72, 115, 125, 126, 131, 132, 242–244, 256, 275, 337, 338, 342 sensitivity analysis, 337, 338, 342 sensor kinase, 260, 262, 283 sequence metagenomics, 384 sequencing technologies, 345, 350, 373 signal transduction, 177, 193, 196, 260, 270, 283, 315, 338 signaling pathways, 206, 281, 283 signals sensing, 195, 196, 199 similarity-neighborhood approach, 249 simple sequence repeats, 24, 33, 34 soil metagenomics, 353 speciation event, 140, 211–213 species concepts, 141 specificity, 36, 68, 71, 79, 88, 94, 95, 109, 125, 126, 132, 193, 201, 202, 242–244, 256, 275, 365 sRNAPredict, 67 S-system, 306, 310, 328–332, 337–343 stability analysis, 338 standard HMM, 50 steady state, 282, 302, 306, 319, 326, 330, 337–339 stem, 77, 80, 82, 84, 89, 194, 205, 237 stimulon, 206 stoichiometric matrix, 325, 327 stoichiometric model, 282, 325, 328 substitution rates, 144, 150, 171, 174 superintegrons, 28, 118, 129, 133 support vector machine, 62, 67, 293 symbiosis, 159–161, 178, 184, 363, 371, 388 synologs, 210, 213, 214 synteny, 167 systematic, 16, 137, 138, 146, 282, 285, 287, 290, 292, 299, 302, 312, 313 systems analysis, 317 targets of natural selection, 142 TATA binding protein, 188 taxonomy, 137–139 TESTCODE, 40 thermodynamics, 316 thermophile, 15 Thermotoga, 14, 139, 146 tiling-array, 311 time series, 292, 307–309, 333, 336, 389 transcription factor, 186, 189, 193, 259, 287, 291, 292, 300, 310, 336

transcription terminator, 31, 236 transcription units, 175, 186–188, 195, 198, 206, 235, 243, 252, 255 transcriptional regulation, 186, 195, 196, 198–202, 204, 206, 208, 233, 236, 250, 255, 260, 278, 285, 301, 335 transcriptional regulatory networks, 260, 275 transduction, 4, 116, 121, 135, 358 transformation, 119, 135 transposase, 25, 26, 116, 119, 120, 130, 133 transposon, 23–28, 113, 117–121, 129, 130, 132, 134, 181 trans-regulatory factors, 186, 189, 192, 194, 207 tree mapping, 223 tree reconciliation, 223, 224 tRNA, 18, 66, 68, 75, 78–90, 93, 94, 97, 115, 117, 138, 162, 176, 179, 204 tRNA-dependent biosynthesis, 86–89

tRNAscan-SE, 68 two-component, 64, 196, 199, 260, 283 uber-operon, 233, 234, 237, 238, 249, 254 uber-operon prediction, 254 uptake signal sequence (USS), 29, 31 viral metagenomics, 360 Viterbi algorithm, 48, 51–53 Wigglesworthia, 7, 139, 148, 154, 160, 228 wobble rules, 79 xenologous, 163, 182, 212, 213 xenologs, 161, 183, 210–214 yeast two-hybrid, 286, 292, 294, 313 ZCURVE, 56, 70