Genome Informatics 2008: Proceedings of the 8th Annual International Workshop on Bioinformatics and Systems Biology (IBSB 2008) [1 ed.] 9781848162990, 1848162995

This volume contains 25 peer-reviewed papers based on the presentations at the 8th Annual International Workshop on Bioi


English Pages 301 Year 2008


Table of contents :
CONTENTS......Page 6
Preface......Page 10
Program Committee......Page 12
1. Introduction......Page 14
2.2. Mathematical model......Page 16
2.3. Transcriptional regulation and external metabolites......Page 19
2.4. Parameter estimation......Page 21
2.5. Genetic algorithm and semi global search......Page 22
3.1. Parameter estimation......Page 23
3.2. Comparison to experimental data......Page 24
4. Discussion......Page 25
Appendix......Page 26
References......Page 27
1. Introduction......Page 28
2.1. IPaR Model......Page 30
2.2. CellModel......Page 31
2.3. Results......Page 33
3. Discussion......Page 36
References......Page 37
1. Introduction......Page 38
2. Cell System Ontology......Page 39
3. Rule-Based Reasoning for Ontology Validation......Page 41
3.2. Biologically correct models......Page 42
3.3. Systematically correct models......Page 44
5. Conclusions......Page 46
References......Page 48
1. Introduction......Page 50
2.1.1. Vector autoregressive model......Page 51
2.1.3. Bayesian information criterion......Page 52
2.2. L1 regularized spline additive model for gene regulatory network estimation......Page 53
2.3. Bayesian information criterion for nonparametric group LASSO regression......Page 54
2.4. Wald test for Granger causality......Page 56
3.1. Simulation data examples......Page 57
3.2. Application of expression data of human HeLa cell......Page 59
4. Discussion......Page 61
References......Page 62
Appendix A.......Page 63
1. Introduction......Page 65
1.1. The Idea......Page 66
2.1. Generating Alternative Models......Page 67
2.1.1. Removing Reactions and Modifiers......Page 68
2.1.2. Removing Species......Page 69
2.2. Model Discrimination......Page 70
3.1. Example......Page 72
3.2. Conclusions......Page 74
References......Page 75
1. Introduction......Page 77
2.1. Notation......Page 78
2.2. Related Work......Page 79
3. Framework......Page 80
4. Outlier Detection......Page 81
5. Probe Cleaning......Page 82
6. Statistical Methods......Page 83
7.1. Simulated Microarray Data......Page 84
7.3. Results......Page 85
8. Conclusion......Page 87
References......Page 88
1. Introduction......Page 90
2.1. Details of the model......Page 91
2.2. Experimental data and parameter estimation......Page 94
3.1. Simulation of model......Page 95
3.2. Time-varying response coefficients......Page 97
4. Conclusion......Page 101
References......Page 102
1. Introduction......Page 104
2. Theory......Page 105
3. Results......Page 108
4. Discussion......Page 112
References......Page 113
1. Introduction......Page 115
2.2. Graph Cut Indices......Page 117
3. Data and Methodology......Page 118
4. Results and Discussion......Page 121
5. Conclusions......Page 123
References......Page 124
1.1. Correlations of metabolite concentration data......Page 125
1.3. Entropy, mutual information and statistical (in)dependence......Page 126
2.1. Non-linear correlations are captured by mutual information......Page 128
2.2. Significance of the coefficients given the limited sample size......Page 129
3.1. Correlations among metabolite concentrations from Arabidopsis thaliana......Page 131
4. Conclusion......Page 133
References......Page 134
1. Introduction......Page 136
2.1. Experimental flux data......Page 137
3.1. Experimentally determined fluxes......Page 138
3.2.1. Growth maximization......Page 140
3.2.4. Alternate maximization criteria......Page 141
3.3. Correlations between experimental and predicted fluxes......Page 142
3.4. Prediction of absolute flux changes......Page 144
4. Discussion......Page 145
References......Page 146
1. Introduction......Page 148
2.1. Species-specific networks......Page 150
2.2. Biosynthetic potential of metabolites via scope......Page 151
2.5. Evaluation of parameter values......Page 152
3.1. Scope size distributions......Page 153
3.2. Cluster agglomeration......Page 155
3.3. Influence of cut-off and seed size......Page 157
4. Discussion......Page 159
References......Page 160
1. Introduction......Page 162
2.2.1. KEGG RPAIR database......Page 163
2.2.2. KEGG Atomtype......Page 164
2.2.3. RDM pattern......Page 165
2.5. Generalization of RDM patterns......Page 166
3.1. Relationship between EC sub-subclasses and RDM patterns......Page 168
3.2. Hierarchical clustering and generalization of RDM patterns......Page 169
4. Discussion......Page 170
References......Page 171
1. Introduction......Page 172
2. A Constraint-based Model of Regulation......Page 173
3.1. Toy linear metabolic pathway......Page 176
3.2. Single flux perturbations......Page 177
3.3. Single perturbation robustness......Page 178
3.4. Single flux perturbation trajectories......Page 179
4. Glycolysis......Page 180
5. Discussion......Page 181
References......Page 182
1. Introduction......Page 184
2.2. Objective inference......Page 186
3. Results......Page 188
3.1. Conserved biomass coefficients across different glucose supply rates......Page 189
3.2. Influence of single gene deletion in pentose phosphate pathway......Page 190
4. Discussion......Page 192
References......Page 195
1. Introduction......Page 196
2. Methods......Page 198
2.1.3. RNADB05 and HIRES sets......Page 199
2.2. Calculation of RNA backbone string representation......Page 200
2.3. Suffix tree and array implementation......Page 201
3. Results......Page 202
3.1. Analysis of SCOR motifs......Page 203
3.2. Similarities among tRNA......Page 204
3.3. Similarities in the representative RNA sets......Page 207
4. Discussion......Page 208
Acknowledgements......Page 209
References......Page 210
1. Introduction......Page 212
2.1. DNA sequence and functional annotation data sources......Page 213
2.2. Local DNA structure prediction and GC content analysis......Page 214
3.1. Correlation between GC content and local DNA structure......Page 215
3.2. High hydroxyl radical cleavage regions overlap with functional elements......Page 217
4. Discussion......Page 220
References......Page 222
1. Introduction......Page 225
2.1. Coexpressing gene sets......Page 226
2.3. Transcription factor binding site (TFBS) prediction......Page 227
2.4. Bootstrap method......Page 228
2.6. Association rule data mining......Page 230
References......Page 232
1. Introduction......Page 235
2. Methods......Page 236
3. Results......Page 237
References......Page 242
1. Introduction......Page 244
2.1. Data......Page 245
2.2. Methods......Page 246
3.1. Chemical properties......Page 247
3.2. Functional properties......Page 251
3.4. Case study......Page 252
4. Conclusion and Future Perspectives......Page 253
References......Page 254
1. Introduction......Page 256
2.1. Compound database......Page 257
2.2. Two-dimensional searching......Page 258
2.4. Homology modeling......Page 259
3.1. Sequence alignment and homology modeling......Page 260
3.2. In silico screening......Page 261
3.4. Experimental validation......Page 262
References......Page 263
1. Introduction......Page 265
2.2. Drug interaction network......Page 266
3.1. Interaction factors......Page 267
4. Discussion......Page 269
Acknowledgments......Page 270
References......Page 272
1. Introduction......Page 273
2.1. Preparing surface and grid representation......Page 275
3.1. Docking performance......Page 276
3.2. Sampling of a serine-protease-inhibitor complex......Page 278
Acknowledgments......Page 280
References......Page 281
1. Introduction......Page 283
2. Methods......Page 284
3. Results......Page 286
References......Page 288
1. Introduction......Page 290
2. Methods......Page 292
3. Results......Page 293
4. Discussion......Page 294
References......Page 295
Author Index......Page 298

Genome Informatics 2008

GENOME INFORMATICS SERIES (GIS)
ISSN: 0919-9454

The Genome Informatics Series publishes peer-reviewed papers presented at the International Conference on Genome Informatics (GIW) and some conferences on bioinformatics. The Genome Informatics Series is indexed in MEDLINE.

No.  Title                                    Year  ISBN
1    Genome Informatics Workshop I            1990  (in Japanese)
2    Genome Informatics Workshop II           1991  (in Japanese)
3    Genome Informatics Workshop III          1992  (in Japanese)
4    Genome Informatics Workshop IV           1993  4-946443-20-7
5    Genome Informatics Workshop 1994         1994  4-946443-24-X
6    Genome Informatics Workshop 1995         1995  4-946443-33-9
7    Genome Informatics 1996                  1996  4-946443-37-1
8    Genome Informatics 1997                  1997  4-946443-47-9
9    Genome Informatics 1998                  1998  4-946443-52-5
10   Genome Informatics 1999                  1999  4-946443-59-2
11   Genome Informatics 2000                  2000  4-946443-65-7
12   Genome Informatics 2001                  2001  4-946443-72-X
13   Genome Informatics 2002                  2002  4-946443-79-7
14   Genome Informatics 2003                  2003  4-946443-82-7
15   Genome Informatics 2004 Vol. 15, No. 1   2004  4-946443-88-6
16   Genome Informatics 2004 Vol. 15, No. 2   2004  4-946443-91-6
17   Genome Informatics 2005 Vol. 16, No. 1   2005  4-946443-93-2
18   Genome Informatics 2005 Vol. 16, No. 2   2005  4-946443-96-7
19   Genome Informatics 2006 Vol. 17, No. 1   2006  4-946443-97-5
20   Genome Informatics 2006 Vol. 17, No. 2   2006  4-946443-99-1
21   Genome Informatics 2007 Vol. 18          2007  978-1-86094-991-3
22   Genome Informatics 2007 Vol. 19          2007  978-1-86094-984-5
23   Genome Informatics 2008 Vol. 20          2008  978-1-84816-299-0

Genome Informatics Series Vol. 20

ISSN: 0919-9454

Genome Informatics 2008
Proceedings of the 8th Annual International Workshop on Bioinformatics and Systems Biology (IBSB 2008)
Zeuthen Lake, Berlin, Germany

9-11 June 2008

Ernst-Walter Knapp, Free University Berlin, Germany

Gary Benson, Boston University, USA

Hermann-Georg Holzhütter, Charité-University Medicine Berlin, Germany

Minoru Kanehisa, Kyoto University, Japan

Satoru Miyano, University of Tokyo, Japan

Imperial College Press

Published by Imperial College Press 57 Shelton Street Covent Garden London WC2H 9HE Distributed by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.

GENOME INFORMATICS 2008
Proceedings of the 8th Annual International Workshop on Bioinformatics and Systems Biology (IBSB 2008)
Copyright © 2008 by the Japanese Society for Bioinformatics (http://www.jsbi.org)
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the JSBi.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

ISBN-13 978-1-84816-299-0
ISBN-10 1-84816-299-5

Printed by Fulsland Offset Printing (S) Pte Ltd, Singapore

CONTENTS

Preface......ix

Program Committee......xi

Exploring the Effect of Variable Enzyme Concentrations in a Kinetic Model of Yeast Glycolysis......1
    J. Bruck, W. Liebermeister & E. Klipp

The Role of IP3R Clustering in Ca2+ Signaling......15
    A. Skupin & M. Falcke

Rule-Based Reasoning for System Dynamics in Cell Systems......25
    E. Jeong, M. Nagasaki & S. Miyano

Estimation of Nonlinear Gene Regulatory Networks via L1 Regularized NVAR from Time Series Gene Expression Data......37
    K. Kojima, A. Fujita, T. Shimamura, S. Imoto & S. Miyano

ModelMage: A Tool for Automatic Model Generation, Selection and Management......52
    M. Flottmann, J. Schaber, S. Hoops, E. Klipp & P. Mendes

A Framework for Determining Outlying Microarray Experiments......64
    R. Wan, A. M. Wheelock & H. Mamitsuka

Exploring the Impact of Osmoadaptation on Glycolysis Using Time-Varying Response-Coefficients......77
    C. Kuhn, E. Petelenz, B. Nordlander, J. Schaber, S. Hohmann & E. Klipp

Comparing Flux Balance Analysis to Network Expansion: Producibility, Sustainability and the Scope of Compounds......91
    K. Kruse & O. Ebenhöh

Semi-Supervised Graph Partitioning with Decision Trees......102
    T. Hancock & H. Mamitsuka

Measuring Correlations in Metabolomic Networks with Mutual Information......112
    J. Numata, O. Ebenhöh & E.-W. Knapp

Optimality Criteria for the Prediction of Metabolic Fluxes in Yeast Mutants......123
    E. S. Snitkin & D. Segre

Biosynthetic Potentials from Species-Specific Metabolic Networks......135
    G. Basler, Z. Nikoloski, O. Ebenhöh & T. Handorf

Generalized Reaction Patterns for Prediction of Unknown Enzymatic Reactions......149
    Y. Shimizu, M. Hattori, S. Goto & M. Kanehisa

Optimal Metabolic Regulation Using a Constraint-Based Model......159
    W. J. Riehl & D. Segre

Comparative Determination of Biomass Composition in Differentially Active Metabolic States......171
    H.-C. Chiu & D. Segre

Suffix Techniques as a Rapid Method for RNA Substructure Search......183
    R. A. Bauer, K. Rother, J. M. Bujnicki & R. Preissner

The Relationship between Fine Scale DNA Structure, GC Content, and Functional Elements in 1% of the Human Genome......199
    S. C. J. Parker, E. H. Margulies & T. D. Tullius

A Novel Strategy to Search Conserved Transcription Factor Binding Sites Among Coexpressing Genes in Human......212
    Y. Hatanaka, M. Nagasaki, R. Yamaguchi, T. Obayashi, K. Numata, A. Fujita, T. Shimamura, Y. Tamada, S. Imoto, K. Kinoshita, K. Nakai & S. Miyano

Modeling IL-2 Gene Expression in Human Regulatory T Cells......222
    M. Benary, H. Bendfeldt, R. Baumgrass & H. Herzel

Toxicity versus Potency: Elucidation of Toxicity Properties Discriminating between Toxins, Drugs, and Natural Compounds......231
    S. Struck, U. Schmidt, B. Gruening, I. S. Jaeger, J. Hossbach & R. Preissner

Comparative VEGF Receptor Tyrosine Kinase Modeling for the Development of Highly Specific Inhibitors of Tumor Angiogenesis......243
    U. Schmidt, J. Ahmed, E. Michalsky, M. Hoepfner & R. Preissner

Network Analysis of Adverse Drug Interactions......252
    M. Takarabe, S. Okuda, M. Itoh, T. Tokimatsu, S. Goto & M. Kanehisa

Sampling Geometries of Protein-Protein Complexes......260
    A. Guerler, S. Lorenzen, F. Krull & E.-W. Knapp

Computer Aided Optimization of Carbon Atom Labeling for Tracer Experiments......270
    B. S. Menküc, C. Gille & H.-G. Holzhütter

Web-Links as a Means to Document Annotated Sequence and 3D-Structure Alignments in Systems Biology......277
    C. Gille, A. Hoppe & H.-G. Holzhütter

Author Index......285


PREFACE

Genome Informatics Vol. 20 contains a selection of peer-reviewed papers presented at the Eighth Annual International Workshop on Bioinformatics and Systems Biology, held on 9-11 June 2008. This time the workshop was held in the Teikyo Hotel at the Zeuthen Lake near Berlin, jointly organized by the German members of the International Research Training Group (IRTG) "Genomics and Systems Biology of Molecular Networks" and supported by the German Science Foundation (DFG). These workshops were created to give doctoral students and young researchers the opportunity to present and discuss their research work in Bioinformatics and Systems Biology in the frame of an international scientific meeting. The first workshop was held in 2001 in Berlin. It was organized by Prof. Dr. Reinhart Heinrich, a co-founder of this series of workshops. Since 2001, the workshop has been held in Boston (2002), Berlin (2003), Kyoto (2004), Berlin (2005), Boston (2006) and Tokyo (2007). The present workshop was held in Zeuthen near Berlin as part of a collaborative educational program involving the leading institutions committed to the following programs, together with partner institutions in the US, Japan and Germany:

• Boston - Graduate Program in Bioinformatics, Boston University
• Berlin - The International Research Training Group (IRTG) "Genomics and Systems Biology of Molecular Networks"
• Kyoto/Tokyo - Joint Bioinformatics Education Program of Kyoto University and University of Tokyo

Partner Institutions

• Boston University
• Charité Berlin
• Free University Berlin
• Humboldt University Berlin
• Kyoto University, Bioinformatics Center, Institute for Chemical Research
• Kyoto University, Department of Bioinformatics and Chemical Genomics, Graduate School of Pharmaceutical Sciences
• Max Delbrück Centre for Molecular Medicine, Berlin
• Max Planck Institute for Molecular Plant Physiology, Potsdam
• Max-Planck Institute of Molecular Genetics, Berlin
• University of Tokyo, Human Genome Center, Institute of Medical Science

This time we decided to hold the workshop first and to collect and review the manuscripts three weeks later, so that the authors could take the discussions and criticisms at the workshop into account. However, there was also a pre-selection of the oral and poster contributions to be accepted at the workshop. The contributors were then allowed to submit manuscripts for the Genome Informatics volume. These contributions were reviewed by the members of the workshop event. We have selected 25 papers after revision. These papers will be indexed in Medline, and their electronic versions are freely available from the website of the Japanese Society for Bioinformatics as Genome Informatics Online (http://www.jsbi.org/modules/journal/index.php/index.html). Former publications are also electronically available as Genome Informatics Vol. 15, No. 1 (2004), Vol. 16, No. 1 (2005), Vol. 17, No. 1 (2006), and Vol. 18 (2007). We wish to thank all of those who submitted papers and helped with the reviewing process. We also wish to thank all those who helped in organizing this workshop for their efforts in local arrangement, especially the Local Committee Members: Martin Falcke, Alexander Skupin, Bianca Sprincenatu, Oliver Ebenhöh, Moritz Schutte, and Johannes Bausch.

Program Committee Chair
Ernst-Walter Knapp

Organizers
Gary Benson
Hermann-Georg Holzhütter
Minoru Kanehisa
Satoru Miyano

PROGRAM COMMITTEE

Ernst-Walter Knapp          Free University Berlin, PC Chair
Tatsuya Akutsu              Kyoto University
Gary Benson                 Boston University
Oliver Ebenhöh              Humboldt University Berlin
Martin Falcke               Max-Delbrück-Center for Molecular Medicine
Hermann-Georg Holzhütter    Charité-University Medicine Berlin
Minoru Kanehisa             Kyoto University
Hiroshi Mamitsuka           Kyoto University
Satoru Miyano               University of Tokyo
Robert Preissner            Charité-University Medicine Berlin
Daniel Segre                Boston University
Brandon Xia                 Boston University


EXPLORING THE EFFECT OF VARIABLE ENZYME CONCENTRATIONS IN A KINETIC MODEL OF YEAST GLYCOLYSIS

JOZSEF BRUCK (1,2), WOLFRAM LIEBERMEISTER (1), EDDA KLIPP (1)

1 Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany

2 Humboldt University Berlin, Department of Biology, Chair of Theoretical Biophysics, Invalidenstr. 42, 10115 Berlin, Germany

Metabolism is one of the best studied fields of biochemistry, but its regulation involves processes on many different levels, some of which are still not understood well enough to allow for quantitative modeling and prediction. Glycolysis in yeast is a good example: although high-quality quantitative data are available, well-established mathematical models typically cover only the direct regulation of the involved enzymes by metabolite binding. The effect of various metabolites on the enzyme kinetics is summarized in carefully developed mathematical formulae. However, this approach implicitly assumes that the enzyme concentrations themselves are constant, thus neglecting other regulatory levels, e.g. transcriptional and translational regulation, involved in the regulation of enzyme activities. It is believed, however, that different experimental conditions result in different enzyme activities regulated by the above mechanisms. Detailed modeling of all regulatory levels is still out of reach, since some of the necessary data, e.g. quantitative large-scale enzyme concentration data sets, are lacking or rare. Nevertheless, a viable approach is to include the regulation of enzyme concentrations in an established model and to investigate whether this improves its predictive capabilities. Proteome data are usually hard to obtain, but levels of mRNA transcripts may be used instead as clues for changes in enzyme concentrations. Here we investigate whether including mRNA data in an established model of yeast glycolysis allows prediction of the steady-state metabolite concentrations for different experimental conditions. To this end, we modified an established ODE model for the glycolytic pathway of yeast to include changes of enzyme concentrations. Presumed changes were inferred from mRNA transcript level measurement data. We investigate how this approach can be used to predict metabolite concentrations for steady-state yeast cultures at five different oxygen levels ranging from anaerobic to fully aerobic conditions. We were partly able to reproduce the experimental data and present a number of changes that were necessary to improve the modeling result.

Keywords: yeast; glycolysis; fermentation; respiration; kinetic modeling; metabolic regulation

1. Introduction

Cellular metabolism is one of the key components of living systems. Its most basic functions are to generate the energy and the building blocks necessary to sustain the cell's life. Elucidation of central carbon metabolism, the source of energy for all heterotrophic life, is one of the success stories of biochemistry: function and mechanism of most of its components are known in considerable detail. A large class of the regulatory mechanisms of metabolism is well understood: the catalytic function of many enzymes is influenced by metabolites present in the cell. Interactions of this kind have been successfully quantified in enzyme kinetic laws, which has led to ODE-based models of metabolic pathways with considerable predictive power, as described in [4, 2, 7] and applied, among others, in [9, 5, 11]. However, metabolism is also regulated by other functional units of the cell, most importantly the transcriptional-regulatory system. It acts by changing the concentration of various enzymes via regulated production and degradation. This kind of regulation is necessary for the cell to steer its metabolism to meet its needs under various conditions. However, changes in protein levels are usually not implemented in kinetic models: these typically adopt kinetic expressions for the included reactions with fixed maximal velocities, which amounts to the implicit assumption of constant enzyme concentrations. One of the possible reasons is that quantitative data on concentrations of single proteins in different experimental conditions are still lacking or rare. A fundamental determinant of the concentration of an enzyme's active form, and hence of its activity, is the amount of mRNA transcripts present in the cell. However, many other layers of regulation exist, e.g. at the level of translation and allosteric regulation of the final protein. It is controversial to what extent the final enzyme activity is determined by or correlated with the concentrations of its mRNA transcripts. While genome-wide comparisons between mRNA and enzyme concentrations exist [1, 3], the abundance of a given set of proteins and their corresponding transcription rates should be systematically compared in different cell states to obtain a clearer picture. To the authors' knowledge such studies are not yet available. Based on an established ODE-based model of yeast glycolysis, we present an approach for modeling how metabolism is regulated by the transcriptional-regulatory system. In the model we include the change in enzyme concentrations in various experimental conditions.
We used experimental data [12] from steady-state yeast cultures with five different oxygen levels ranging from anaerobic to fully aerobic conditions. We implemented the change in enzyme concentrations by changing the maximal rates of the enzymatic reactions. For the reasons mentioned above, we determined these changes from mRNA concentration measurements, using them as inputs for the model. The model allows for computing metabolite concentrations and fluxes, which we compared to the corresponding experimental values. We performed parameter estimation to determine a set of parameters that best fits the experimental data. The main question posed is the following: to what extent can experimental data for different cell states be explained by including expression data in the model, under the assumption that biochemical reaction rates obey rate laws known from enzyme kinetics?
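The idea of changing maximal rates according to expression data can be sketched as follows. This is a minimal illustration, assuming that the maximal rate scales linearly with the mRNA fold change relative to the reference (anaerobic) condition and that the reaction follows simple Michaelis-Menten kinetics; the `Enzyme` class, the function names and all numerical values are hypothetical and are not taken from the model described here.

```python
from dataclasses import dataclass


@dataclass
class Enzyme:
    """Reference kinetics of one enzyme (all values are hypothetical)."""
    name: str
    vmax_ref: float  # maximal rate in the reference (anaerobic) condition
    km: float        # Michaelis constant for the substrate


def scaled_vmax(enzyme: Enzyme, mrna_fold_change: float) -> float:
    """Scale Vmax by the mRNA fold change relative to the reference.

    Assumption: enzyme concentration, and hence Vmax, is proportional
    to the transcript level.
    """
    if mrna_fold_change < 0:
        raise ValueError("fold change must be non-negative")
    return enzyme.vmax_ref * mrna_fold_change


def michaelis_menten(vmax: float, km: float, substrate: float) -> float:
    """Irreversible Michaelis-Menten rate law."""
    return vmax * substrate / (km + substrate)


if __name__ == "__main__":
    hk = Enzyme("HK", vmax_ref=100.0, km=0.5)
    for fold in (1.0, 0.5, 2.0):
        rate = michaelis_menten(scaled_vmax(hk, fold), hk.km, substrate=0.5)
        print(f"fold change {fold}: rate {rate}")
```

In a full model the same scaling would be applied to every enzyme for which a fold change is available, while all other kinetic constants stay fixed.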

2. Methods

2.1. Experimental data

We used metabolite concentration and flux data from Wiebe et al. [12] obtained from cultures of Saccharomyces cerevisiae CEN.PK113-1A grown in glucose-limited chemostat cultures (dilution rate D = 0.10/h). External conditions in these cultures could be controlled to a high extent. Steady-state cultures were obtained under one anaerobic (0% oxygen) and four aerobic conditions (0.5%, 1%, 2.8%, 20.9% oxygen in the inlet gas) with all other external conditions being kept constant. Measured quantities included biomass, concentration of external metabolites^a (Glucose, Ethanol, Glycerol), of intermediate metabolites (G6P, F6P, F16P, PEP, PYR, ATP, ADP, AMP, and the sum of 3PG and 2PG concentrations), net fluxes (consumption rates of oxygen and glucose and exhaust rate of ethanol, glycerol and CO2) per unit of biomass, and relative fold changes of the mRNA concentrations compared to the anaerobic cultures for 69 genes with functions in carbon metabolism.

2.2. Mathematical model

We constructed a mathematical model of central carbon metabolism in S. cerevisiae based on the glycolytic pathway model by Teusink et al. [11]. The original model was based on measurements on steady-state cell cultures under anaerobic conditions by comparison of experimental data of concentrations and fluxes of intermediate and external metabolites. The sum of the concentrations [NAD+] and [NADH] is a conserved moiety of the model. The adenosine species [ATP], [ADP] and [AMP] are not dynamical variables of the original model; instead, they were written as analytic expressions in terms of the sum of high-energy phosphates. These were obtained under the assumptions that a) the sum of their concentrations is conserved, and b) the reaction catalyzed by adenylate kinase is fast in comparison to the other reactions, and hence in equilibrium. The metabolites GAP and DHAP are lumped into a single chemical species called "triose", reflecting the assumption that the transforming reaction between them (catalyzed by TPI) is also in equilibrium. The kinetic constants were largely obtained from experiments and fitted only to a minimal extent. The side branches of glycolysis contained in the model were

^a Abbreviations: G6P: Glucose-6-phosphate; F6P: Fructose-6-phosphate; F16P: Fructose-1,6-bisphosphate; Triose-P: sum of GAP: Glyceraldehyde-3-phosphate and DHAP: Dihydroxyacetone phosphate; BPG: 1,3-bisphosphoglycerate; 3PG and 2PG: 3- and 2-phosphoglycerate, respectively; PG: sum of 3PG and 2PG; PEP: Phosphoenolpyruvate; ACA: Acetaldehyde; AMP, ADP, ATP: Adenosine-mono-, di-, and triphosphate, respectively; NAD+, NADH: oxidation states of Nicotinamide adenine dinucleotide. Enzymes: ENO: Enolase; GAPDH: D-glyceraldehyde-3-phosphate dehydrogenase; ADH1, ADH2: Alcohol dehydrogenase 1 and 2, respectively; HK: Hexokinase; PGI: Phosphoglucoisomerase; PFK: Phosphofructokinase; ALD: Aldolase; G3PDH: Glycerol-3-phosphate-dehydrogenase; PGK: Phosphoglycerate kinase; PGM: Phosphoglycerate mutase; PYK: Pyruvate kinase; PDC: Pyruvate decarboxylase; FBP1: Fructose-1,6-bisphosphatase.
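The two assumptions on the adenosine phosphates stated above (conserved total pool, fast equilibrium of 2 ADP ⇌ ATP + AMP) determine ATP, ADP and AMP from the sum of high-energy phosphates. The following is a sketch of that calculation, assuming the definitions A = ATP + ADP + AMP, P = 2 ATP + ADP and K_eq = ATP·AMP/ADP²; these symbols, the function name and the numerical values in the usage example are illustrative assumptions, not values quoted from the text. Eliminating ATP and AMP gives the quadratic (1 - 4 K_eq) ADP² - 2 A ADP + P (2 A - P) = 0, which the code solves for the physically meaningful root.

```python
import math


def adenylate_pool(total_a: float, p: float, k_eq: float):
    """Return (ATP, ADP, AMP) for a conserved pool in kinase equilibrium.

    total_a : ATP + ADP + AMP (conserved sum, must be positive)
    p       : 2*ATP + ADP (sum of high-energy phosphates, 0 <= p <= 2*total_a)
    k_eq    : ATP*AMP/ADP**2 (equilibrium constant, assumed value)
    """
    if total_a <= 0 or not 0.0 <= p <= 2.0 * total_a:
        raise ValueError("need total_a > 0 and 0 <= p <= 2*total_a")
    a = 1.0 - 4.0 * k_eq
    if abs(a) < 1e-12:
        # Degenerate case k_eq = 1/4: the quadratic becomes linear in ADP.
        adp = p * (2.0 * total_a - p) / (2.0 * total_a)
    else:
        disc = (total_a - p) ** 2 + 4.0 * k_eq * p * (2.0 * total_a - p)
        adp = (total_a - math.sqrt(disc)) / a
    atp = (p - adp) / 2.0
    amp = total_a - atp - adp
    return atp, adp, amp


if __name__ == "__main__":
    # Illustrative numbers only (concentrations in mM).
    atp, adp, amp = adenylate_pool(total_a=4.1, p=5.0, k_eq=0.45)
    print(atp, adp, amp)
```

With this construction only P has to be integrated as a dynamical variable; the three nucleotide concentrations follow algebraically at every time step.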

The Role of IP3R Clustering in Ca2+ Signaling

... with the diffusion time $T = (k^{+}[B]_T)^{-1}$ and length $L = \sqrt{D_{\mathrm{Ca}}\,(k^{+}[B]_T)^{-1}}$, the resulting system in dimensionless units defined in Table 2 takes the form [16]

(2a) (2b) (2c)

where the first equation describes the dimensionless free Ca2+ concentration and the other two correspond to the scaled free mobile and immobile buffer concentrations, respectively. The first term in Eq. (2a) corresponds to diffusion of Ca2+, whereas the next four terms describe the reactions with buffers and the coupling with the ER by the pumps and the leak flux ($\sigma_p$ and $\sigma_l$, respectively). The last term specifies release of Ca2+ by channels, which we assume to be delta sources. Nevertheless we incorporate their spatial character by using Eq. (1) for the scaled flux $\sigma$. The two remaining equations in (2) describe the buffer dynamics. The dimensionless resting conditions are given by $c_0 = [\mathrm{Ca}^{2+}]_0/K$, $b_0 = (c_0 + 1)^{-1}$ and $b_{i,0} = (c_0\kappa + 1)^{-1}$, depending on the dissociation constant $K$ of the mobile buffer and the ratio $\kappa$ of the dissociation constants of the two buffer types. For the linear system of PDEs (2) we derived an analytical solution by means of coupled Green's functions for a spherical cell with no-flux boundary condition at the cell membrane [16]. The solution for the concentration dynamics can now be used as a natural environment for localized IP3R clusters to study the interplay of their nonlinear stochastic opening behavior and the feedback on Ca2+. Therefore we couple the global deterministic solution to the local stochastic channel behavior by a Gillespie algorithm described in [12].

Table 2. Definition of dimensionless parameters.

Symbol      Definition         Meaning
c           [Ca2+]/K           dimensionless free Ca2+ concentration
b           [B]/[B]_T          dimensionless free mobile buffer concentration
b_i         [B_i]/[B_i]_T      dimensionless free immobile buffer concentration
e           [E]/K_E            dimensionless free Ca2+ concentration within the ER
d           D_B/D_Ca           ratio of the diffusion coefficients
…           [B]_T/K            time separation of the mobile buffer
…           [B_i]_T/K          time separation of the immobile buffer
…           …                  ratio of buffer influence
σ_l, σ_p    …                  scaled fluxes (leak and pump)
σ           …                  scaled channel flux
κ           K/K_i              ratio of the dissociation constants of the mobile and immobile buffer
κ_E         K/K_E              ratio of the dissociation constants of the cytosolic and lumenal buffer
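To give a feeling for the scales involved, the diffusion time, the diffusion length and the dimensionless resting conditions can be evaluated numerically. The sketch below uses the standard values listed in Table 3 and assumes that each dissociation constant is the ratio of the dissociation rate to the on rate of the respective buffer (K = k⁻/k⁺); the constant and function names are hypothetical.

```python
import math

# Standard values as listed in Table 3 (concentrations converted to uM).
K_ON = 600.0      # on rate of the mobile buffer, 1/(uM s)
K_OFF = 100.0     # dissociation rate of the mobile buffer, 1/s
B_TOTAL = 25.0    # total mobile buffer concentration, uM
KI_ON = 600.0     # on rate of the immobile buffer, 1/(uM s)
KI_OFF = 100.0    # dissociation rate of the immobile buffer, 1/s
D_CA = 220.0      # diffusion coefficient of cytosolic Ca2+, um^2/s
CA_REST = 0.05    # cytosolic Ca2+ base level, uM (50 nM)


def scales():
    """Return (T, L): diffusion time in s and diffusion length in um."""
    time_scale = 1.0 / (K_ON * B_TOTAL)
    length_scale = math.sqrt(D_CA * time_scale)
    return time_scale, length_scale


def resting_state():
    """Return (c0, b0, bi0), the dimensionless resting conditions."""
    k_d = K_OFF / K_ON        # assumed: K = k_off / k_on (mobile buffer)
    ki_d = KI_OFF / KI_ON     # assumed: K_i = k_off / k_on (immobile buffer)
    kappa = k_d / ki_d
    c0 = CA_REST / k_d
    b0 = 1.0 / (c0 + 1.0)
    bi0 = 1.0 / (c0 * kappa + 1.0)
    return c0, b0, bi0


if __name__ == "__main__":
    print("T, L =", scales())
    print("c0, b0, bi0 =", resting_state())
```

Under these assumptions the time scale is well below a millisecond and the length scale is on the order of a tenth of a micrometer, far smaller than the cell radius.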

[Figure 3: panels A, B and C; panel labels N = 2 and N = 32; horizontal axis t (s), 0 to 1000]

Fig. 3. A: Sketch of the spatial arrangement for the clustering analysis. Clusters are put on a regular grid around the origin. B and C: Representative examples of the channel dynamics. Upper panels show the number of open channels and the lower panels the amount of inhibited subunits for a cell with 128 channels in total, which are distributed on N clusters.
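The stochastic channel dynamics shown in such traces are generated with a Gillespie-type algorithm. As a generic illustration of that technique only, the following sketch simulates a pool of identical, independent two-state (closed/open) channels with constant rates. It deliberately omits the Ca2+ feedback, the inhibition of subunits and the spatial coupling of the actual model; the rate constants and names are hypothetical.

```python
import random


def gillespie_two_state(n_channels, k_open, k_close, t_end, seed=0):
    """Simulate the number of open channels with the Gillespie algorithm.

    Each closed channel opens with rate k_open, each open channel closes
    with rate k_close (both in 1/s). Returns event times and open counts.
    """
    rng = random.Random(seed)
    t, n_open = 0.0, 0
    times, counts = [0.0], [0]
    while True:
        a_open = k_open * (n_channels - n_open)   # propensity: one opening
        a_close = k_close * n_open                # propensity: one closing
        a_total = a_open + a_close
        if a_total == 0.0:
            break
        t += rng.expovariate(a_total)             # waiting time to next event
        if t > t_end:
            break
        if rng.random() * a_total < a_open:       # choose which event fires
            n_open += 1
        else:
            n_open -= 1
        times.append(t)
        counts.append(n_open)
    return times, counts


if __name__ == "__main__":
    times, counts = gillespie_two_state(64, k_open=0.5, k_close=5.0, t_end=10.0)
    print(len(times), "events, max open:", max(counts))
```

In a model with feedback the propensities would be recomputed from the local Ca2+ concentration after every event, which is the step where the deterministic solution and the stochastic channel state are coupled.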

2.3. Results

For the following investigation we use the parameters of the DK model listed in Table 1 and the standard parameters for the RDS listed in Table 3, which reflect typical properties of eukaryotic cells. Our results do not depend qualitatively on this explicit choice, but they can differ quantitatively for different parameters. To study the influence of IP3R clustering we vary the number of clusters N in the cell, arranged on a regular grid with a grid constant d as depicted in Fig. 3A. The grid constant influences the spatial coupling between the clusters: with increasing separation d, the pumps decrease the Ca2+ signal at adjacent clusters and thus decrease the probability for a global event. Figures 3B and 3C show two representative examples of the cooperative channel behavior for a cell with 128 channels distributed equally on N clusters separated

Table 3. Standard values of parameters used for simulations.

Parameter                                   Value
cell radius (R)                             10 μm
channel radius                              8 nm
diffusion coefficient of cytosolic Ca2+     220 μm^2/s
diffusion coefficient of lumenal Ca2+       70 μm^2/s
diffusion coefficient of mobile buffer      95 μm^2/s
cytosolic Ca2+ base level                   50 nM
IP3 concentration                           90 nM
total mobile buffer concentration           25 μM
on rate of the mobile buffer                600 (μM s)^-1
dissociation rate of the mobile buffer      100 s^-1
total immobile buffer concentration         30 μM
on rate of the immobile buffer              600 (μM s)^-1
dissociation rate of the immobile buffer    100 s^-1
pump rate                                   86 s^-1
channel flux constant                       4.3 × 10^6 s^-1
leak flux constant                          ≈ 0.01 s^-1 (implicitly given by Pp and [Ca2+]0)

by d = 1 μm. The upper panels show the number of open channels Nopen and the lower panels depict the degree of inhibition Rinh, which is zero if no subunit is inhibited and one for total inhibition. For two clusters, each consisting of 64 channels, we observe relatively regular spiking caused by the self-amplifying character of CICR. If one channel of a cluster opens, it will open other channels of the cluster too, leading to an increase of the cytosolic Ca2+ concentration, which activates the second cluster. The resulting high [Ca2+] leads to an almost complete inhibition of the channels, terminating the spike. If we distribute the 128 channels on 32 clusters, i.e. each cluster has 4 channels, the amplitude and frequency decrease, since the spatial coupling is decreased. Thus we observe a higher uncoordinated background activity, i.e. opening events of very few channels, which rarely leads to global events because the puffs are too small to nucleate a global wave. To characterize such oscillations we determine in the following the mean amplitude and the mean period Tav by averaging over the ISIs, here given by the time between two successive maxima of open channels.

Cells can control the number of IP3Rs and the degree of clustering. Thus, we are interested in how cells can tune spiking with these two variables. We compare a stimulated cell with the above-mentioned high [IP3] and a cell with a lower IP3 concentration. It turned out that cells with high [IP3] and a sufficiently high number of channels exhibit a saturated behavior, as can be seen in Fig. 4. Here the squares show Tav and the number of open channels for a cell with a fixed number of channels Nch = 320, which are distributed equally on N clusters separated by d = 1 μm. Both Tav and the amplitude exhibit only small fluctuations, indicating the strong coupling between the clusters. This behavior changes if we switch to low IP3 concentrations, as can be seen from the dots in Fig. 4. Here each cluster contains 100 channels, i.e. by increasing the number of clusters we increase the number of channels. The amplitudes increase with the number of clusters, and Tav decreases from about 50 s for 2 clusters to about 20 s for 15 clusters. That is

[Figure 4: panels A and B, each plotted against the number of clusters (0 to 36).]

Fig. 4. Comparison of a cell with [IP3] = 50 nM and a fixed number of channels distributed equally on clusters (squares) with a cell with [IP3] = 10 nM, where each cluster consists of 100 channels (dots). A: Dependence of the mean period Tav on the number of clusters. B: Averaged maximal amplitude of the channel oscillations. (All error bars denote SEM.)


in the range of the mean period of the saturated cell and is due to the increased nucleation probability caused by the increased number of channels. For even more clusters, Tav increases again since inhibition obstructs the more regular behavior. That is a consequence of the increased amplitudes shown in Fig. 4B for higher numbers of clusters and channels, which lead to higher Ca2+ concentrations. We observe a steep increase of the amplitudes up to the level of the saturated cell of about 45 channels. From that point on, further expression of channels is less efficient, as the amplitude increases more slowly and exhibits larger variations. Interestingly, this crossover point of the amplitudes coincides with the fastest oscillation period in Fig. 4A.

To analyze the effect of channel distribution further we use a grid with a grid constant d = 1.5 μm and fewer channels to avoid a saturated behavior. Figure 5 shows Tav and the amplitude for two different cell setups. The dots correspond to Nch = 128 and the squares mark Nch = 256. The mean periods in Fig. 5A exhibit a pronounced change for fewer than ten channels per cluster. Another property is shown by the amplitudes. Although the squares have double the number of channels compared to the dots, the average maximal amplitude is only slightly increased because of self-inhibition. These results suggest that cells with 128 channels have a larger dynamic range for frequency coding. In addition, Tav exhibits a more pronounced change than the amplitude and could be used for a robust control mechanism.

We now return to the question about diffusively arranged channels. In a third approach to the analysis of the cluster distribution, we preserve the channel density by scaling the grid constant with the cubic root of the number of channels per cluster, i.e. d = d1 (Nch/Ncl)^(1/3), where d1 denotes the minimal grid constant for one channel per cluster. In Fig. 6 we compare two cells with the same [IP3] and Ca2+ base level concentration but with two different numbers of channels Nch and minimal grid constants d1. Both setups, the one with Nch = 128 and d1^(1) = 1 μm denoted by the squares and the setup with Nch = 256 and d1^(2) = 1.5 μm shown by the dots, exhibit a minimum in Tav, as shown in Fig. 6A. That means, cells with a more

[Figure 5: panels A and B, each plotted against the number of clusters (0 to 36).]

Fig. 5. Influence of clustering with a conserved number of channels (dots denote Nch = 128 and squares Nch = 256, i.e. each square represents twice as many channels as the corresponding dot) and a fixed grid constant d = 1.5 μm. A: Mean period Tav against the number of clusters. B: Amplitude dependence for two different total numbers of channels within the cell.

[Figure 6: panels A and B, each plotted against the number of clusters.]

Fig. 6. Influence of clustering with a conserved channel density. A: The comparison of Tav for a cell with Nch = 128 channels and d1 = 1 μm (squares) and a cell with Nch = 256 channels and d1 = 1.5 μm (dots) demonstrates that the minimal Tav is not a simple effect of the density. B: The amplitudes exhibit a constant region and show that diffusively arranged channels do not create global oscillations for physiological regions, as the period increases and the amplitude goes to zero for an increasing number of clusters.

diffusive arrangement of channels can decrease Tav and increase the amplitude by clustering of IP3Rs. That is due to the existence of an optimal coupling strength for systems with discrete excitable stochastic elements [14]. Once the minimal Tav is reached, further clustering again results in slower oscillations, since inhibition blocks the channel clusters. Further, we see that oscillations with a lower channel density (dots) are slower than those with a higher density (squares). The two minima of Tav for the two setups occur at distinct cluster numbers and Tav values, but in both minima each cluster has 16 channels. We observe for both realizations a plateau of the amplitudes over a relatively large range from about 8 to 23 clusters. In this range the cell with the larger number of channels exhibits a nearly doubled average amplitude, whereas the amplitude is only slightly higher for few clusters due to inhibition and goes to zero for a diffusive arrangement of channels at larger cluster numbers. Interestingly, the minimal periods lie in this range of constant amplitudes, which might indicate a stabilized regime.

3. Discussion

In this paper we used our recently developed method for modeling Ca2+ dynamics in three dimensions to investigate the role of IP3R clustering. We found that spike amplitudes and ISI depend on the degree of clustering, cluster configuration and number of clusters. We found optimal configurations and numbers of channels with respect to a variety of properties. Reliable fast spiking can be obtained with about 10 channels per cluster and cluster densities of about 0.01 μm^-3. That would mean numbers of channels per cell which are about one order of magnitude smaller than those estimated from IP3 binding experiments (see [9] and references therein). Remarkably, expressing more IP3R or increasing the degree of clustering does not improve


regularity or accelerate spiking. It is currently believed that Ca2+ oscillations use frequency encoding. Small channel numbers appear more suitable for that purpose than large ones. Clustering of channels consistently improved spiking with respect to regularity of ISIs and amplitudes of spikes. If we assume that the ability to spike and to use frequency coding is the purpose of the Ca2+ signaling pathway, our results indicate that it can be achieved with surprisingly small channel numbers and if channels cluster.
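The mean period Tav and its standard error used throughout the Results can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the text only states that ISIs are the times between two successive maxima of open channels, so the peak criterion, the `threshold` parameter, and the function names are assumptions made here.

```python
import math


def spike_times(times, n_open, threshold):
    """Times of local maxima of the open-channel count that reach `threshold`.

    `threshold` is a hypothetical parameter that separates global spikes
    from background openings of very few channels.
    """
    peaks = []
    for i in range(1, len(n_open) - 1):
        is_peak = n_open[i - 1] < n_open[i] >= n_open[i + 1]
        if is_peak and n_open[i] >= threshold:
            peaks.append(times[i])
    return peaks


def mean_period(peaks):
    """Return (Tav, SEM) from the interspike intervals between successive maxima."""
    isis = [b - a for a, b in zip(peaks, peaks[1:])]
    if len(isis) < 2:
        raise ValueError("need at least three spikes to estimate mean and SEM")
    mean = sum(isis) / len(isis)
    variance = sum((x - mean) ** 2 for x in isis) / (len(isis) - 1)
    return mean, math.sqrt(variance / len(isis))


if __name__ == "__main__":
    t = list(range(10))
    n = [0, 5, 0, 0, 6, 0, 0, 0, 7, 0]
    print(mean_period(spike_times(t, n, threshold=3)))
```

The SEM here is the sample standard deviation of the ISIs divided by the square root of their number, which matches the error bars described in the caption of Fig. 4 only under the assumption that they were computed over ISIs of a single run.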

References
[1] Bentele, K. and Falcke, M., Quasi-Steady Approximation for Ion Channel Currents, Biophys. J., 93:2597-2608, 2007.
[2] Berridge, M., Inositol trisphosphate and calcium signalling, Nature, 361:315-325, 1993.
[3] Berridge, M., Elementary and global aspects of calcium signalling, J. Physiol., 499:291-306, 1997.
[4] Berridge, M., Lipp, P., and Bootman, M., The versatility and universality of calcium signalling, Nature Rev. Mol. Cell Biol., 1:11-22, 2000.
[5] Bootman, M., Niggli, E., Berridge, M., and Lipp, P., Imaging the hierarchical Ca2+ signalling in HeLa cells, J. Physiol., 499:307-314, 1997.
[6] Falcke, M., On the role of stochastic channel behavior in intracellular Ca2+ dynamics, Biophys. J., 84:42-56, 2003.
[7] Falcke, M., Reading the patterns in living cells - the Physics of Ca2+ signaling, Advances in Physics, 53:255-440, 2004.
[8] Marchant, J., Callamaras, N., and Parker, I., Initiation of IP3-mediated Ca2+ waves in Xenopus oocytes, The EMBO J., 18:5285-5299, 1999.
[9] Marchant, J. and Parker, I., Role of elementary Ca2+ puffs in generating repetitive Ca2+ oscillations, The EMBO J., 20:65-76, 2001.
[10] Meinhold, L. and Schimansky-Geier, L., Analytical description of stochastic calcium periodicity, PRE, 66:050901(R), 2002.
[11] Putney, J. and Bird, G., The inositolphosphate-calcium signaling system in nonexcitable cells, Endocrine Reviews, 14:610-631, 1993.
[12] Rüdiger, S. et al., Hybrid Stochastic and Deterministic Simulations of Calcium Blips, Biophys. J., 93:1847-1857, 2007.
[13] Schuster, S., Marhl, M., and Höfer, T., Modelling of simple and complex calcium oscillations, Eur. J. Biochem., 269:1333-1355, 2001.
[14] Shuai, J. and Jung, P., Optimal ion channel clustering for intracellular calcium signaling, PNAS, 100:506-510, 2003.
[15] Skupin, A. et al., How does intracellular Ca2+ oscillate: By chance or by the Clock, Biophys. J., 94:2404-2411, 2008.
[16] Skupin, A. and Falcke, M., How to model Ca2+ dynamics in 3D, submitted, 2008.
[17] Taylor, C., Inositol trisphosphate receptors: Ca2+-modulated intracellular Ca2+ channels, Biochimica et Biophysica Acta, 1436:19-33, 1998.
[18] Tsien, R. and Tsien, R., Calcium channels, stores and oscillations, Annu. Rev. Cell Biol., 6:715-760, 1990.

RULE-BASED REASONING FOR SYSTEM DYNAMICS IN CELL SYSTEMS

EUNA JEONG eajeong@ims.u-tokyo.ac.jp
MASAO NAGASAKI masao@ims.u-tokyo.ac.jp
SATORU MIYANO miyano@ims.u-tokyo.ac.jp

Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo 108-8639, Japan

A system-dynamics-centered ontology, called the Cell System Ontology (CSO), has been developed for the representation of diverse biological pathways. Many of the pathway data based on the ontology have been created from databases via data conversion or curated by expert biologists. It is essential to validate these pathway data, which may contain unexpected issues such as semantic inconsistency and incompleteness. This paper discusses three criteria for validating the pathway data based on CSO: (1) structurally correct models in terms of Petri nets, (2) biologically correct models to capture biological meaning, and (3) systematically correct models to reflect biological behaviors. We have also investigated how logic-based rules can be used with the ontology to extend its expressiveness and to complement the ontology by reasoning, which aims at qualifying pathway knowledge. Finally, we show how the proposed approach helps in exploring dynamic modeling and simulation tasks without prior knowledge.

Keywords: Cell System Ontology; CSO; rule-based inference; pathway knowledge base; ontology validation

1. Introduction

The Cell System Ontology (CSO) [5] has been developed as a unified framework for the representation of biological pathways, based on the notion of hybrid functional Petri net with extension [8]. CSO defines classes for modeling, visualizing, and simulating biological pathways, and relationships between classes, in the Web Ontology Language (OWL) [12]. Furthermore, selected controlled vocabularies are defined in CSO to make biological pathways easy to represent. The pathway data based on the CSO classes are created by data integration and exchange efforts such as BioPAX2CSO [4] and Transpath2CSML [9], modeling and simulation tools such as Cell Illustrator [14, 15], or ontology editors such as Protege [13] and SWOOP [18]. The Cell System Markup Language (CSML) [16] is fully compatible with CSO. The static pathway models in other biological knowledge resources are reconstructed into mathematical models with improved visualization in CSO via data conversion. The CSO tools [6, 7, 14, 15] allow users to explore the possible dynamic behavior of pathway components. Unfortunately, those resources contain ambiguous and missing information [4, 9], which introduces semantic inconsistency


and incompleteness in the pathway data in CSO. As a huge volume of the CSO data is generated, it is crucial to provide a knowledge base which enables dynamic simulation and hypothesis testing of biological models. In this paper, we first propose three criteria for validating the pathway data in CSO in terms of both Petri nets and biological meaning. Modeling and validating biological pathways with Petri nets are shown in many studies [2, 3, 10, 11] because Petri nets allow graphical representation and simulation of biological pathways. However, the related studies focus on representing the dynamics of the system, such as how to set relevant logical parameters for Petri net components. In fact, the Petri net components rarely embed biological semantics, in the sense that it is not important whether a place represents a gene or a protein, or whether a transition is gene expression or protein modification. Secondly, we propose a rule-based approach to extend the expressiveness of the ontology and to complement the ontology by reasoning, which aims at qualifying pathway knowledge. In the next section, we briefly introduce how CSO describes biological pathways. In Sec. 3, we define three criteria for validating the pathway data and present how rules are used in conjunction with CSO. Finally, a small example shows how the proposed rule-based approach helps in exploring dynamic modeling and simulation tasks without prior knowledge.

2. Cell System Ontology

CSO defines a model as a set of processes. The processes have entities as participants. The processes and entities are related via directed connectors. The main classes to represent biological interactions are Process, Entity, and Connector. Each process represents a biological event such as binding, translation, and activation. CSO currently supports biological entities such as genes, proteins, RNA, small molecules, and complexes. RNA is further classified into its subclasses. A connector defines the role of an entity which is involved in a process. Depending on its role, the connector class is further classified into InputAssociationBiological, InputInhibitorBiological, InputProcessBiological, and OutputProcessBiological, which mean an activator, an inhibitor, an input, and an output, respectively. These basic elements are defined as BiologicalElement in CSO, as shown in Figure 1A. Furthermore, with the CSO schema, one can specify simulation-related parameters for mathematical models, graphical visualization of biological elements, and available literature data. CSO also provides comprehensive controlled vocabularies for biological events, cellular compartments, organism type, and cell type, to model biological pathways with different scales and modalities in cell systems. The formal schema of the complete ontology is available at [16]. Figure 1B describes asserted facts for a simple model, where simulation- and visualization-related properties are omitted for convenience. In the figure, the property values for a biological event, a cellular compartment, and a feature type refer to the controlled vocabulary terms defined in CSO, prefixed with ME_, CC_, and FT_, respectively. For example, p3 represents a biological event as translocation and has two connectors, c4 and c5, which are instances of InputProcessBiological and OutputProcessBiological, respectively. The connectors c4 and c5 are related to the entities e2 and e4, respectively. The two entities, e2 and e4, have location properties. The related facts are underlined in the figure.

A. Biological elements defined in CSO.

BiologicalElement
  Connector
    Input
      InputAssociation
        InputAssociationBiological
        InputAssociationNonBiol.
      InputInhibitor
        InputInhibitorBiological
        InputInhibitorNonBiol.
      InputProcess
        InputProcessBiological
        InputProcessNonBiol.
    Output
      OutputProcess
        OutputProcessBiological
        OutputProcessNonBiol.
  Entity
    EntityBiological
      EntityBiologicalCell
      EntityBiologicalCompartment
      EntityBiologicalEnvironment
      EntityBiologicalMolecule
        Complex
        Dna
        ObjectOther
        ObjectUnknown
        Protein
        Rna
        SmallMolecule
      EntityBiologicalOther
      EntityBiologicalUnknown
    EntityNonBiological
  Fact
  Process
    ProcessBiological
    ProcessNonBiol.

B. Asserted facts in CSO for a simple model.

ProcessBiological(p1)
  hasBiologicalEvent(p1, ME_phosphorylation)
  hasConnector(p1, c6)
  hasConnector(p1, c7)
ProcessBiological(p2)
  hasBiologicalEvent(p2, ME_binding)
  hasConnector(p2, c1)
  hasConnector(p2, c3)
  hasConnector(p2, c2)
ProcessBiological(p3)
  hasBiologicalEvent(p3, ME_translocation)
  hasConnector(p3, c4)
  hasConnector(p3, c5)
InputProcessBiological(c1)   hasEntity(c1, e1)
InputProcessBiological(c2)   hasEntity(c2, e2)
OutputProcessBiological(c3)  hasEntity(c3, e3)
InputProcessBiological(c4)   hasEntity(c4, e2)
OutputProcessBiological(c5)  hasEntity(c5, e4)
InputProcessBiological(c6)   hasEntity(c6, e1)
OutputProcessBiological(c7)  hasEntity(c7, e5)
Protein(e1)
Protein(e2)  locatedIn(e2, CC_cytoplasm)
Complex(e3)
Protein(e4)  locatedIn(e4, CC_nucleus)
Protein(e5)  hasFeature(e5, FT_phosphorylated)

C. A simple model visualized in Cell Illustrator.

Fig. 1. The biological elements defined in CSO (A), asserted facts for a simple model (B), and its visualization in Cell Illustrator (C).
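The way processes, connectors and entities hang together in the simple model of Fig. 1B can be made concrete with a small in-memory fact base. This is a sketch only: CSO data are OWL individuals, whereas the dictionary and tuple encoding and the helper names `objects` and `describe` below are assumptions chosen for illustration.

```python
# Class assertions of Fig. 1B: individual -> class.
CLASS_FACTS = {
    "p1": "ProcessBiological", "p2": "ProcessBiological", "p3": "ProcessBiological",
    "c1": "InputProcessBiological", "c2": "InputProcessBiological",
    "c3": "OutputProcessBiological", "c4": "InputProcessBiological",
    "c5": "OutputProcessBiological", "c6": "InputProcessBiological",
    "c7": "OutputProcessBiological",
    "e1": "Protein", "e2": "Protein", "e3": "Complex", "e4": "Protein", "e5": "Protein",
}

# Property assertions of Fig. 1B: (property, subject, object).
PROPERTY_FACTS = [
    ("hasBiologicalEvent", "p1", "ME_phosphorylation"),
    ("hasBiologicalEvent", "p2", "ME_binding"),
    ("hasBiologicalEvent", "p3", "ME_translocation"),
    ("hasConnector", "p1", "c6"), ("hasConnector", "p1", "c7"),
    ("hasConnector", "p2", "c1"), ("hasConnector", "p2", "c2"),
    ("hasConnector", "p2", "c3"),
    ("hasConnector", "p3", "c4"), ("hasConnector", "p3", "c5"),
    ("hasEntity", "c1", "e1"), ("hasEntity", "c2", "e2"), ("hasEntity", "c3", "e3"),
    ("hasEntity", "c4", "e2"), ("hasEntity", "c5", "e4"),
    ("hasEntity", "c6", "e1"), ("hasEntity", "c7", "e5"),
    ("locatedIn", "e2", "CC_cytoplasm"), ("locatedIn", "e4", "CC_nucleus"),
    ("hasFeature", "e5", "FT_phosphorylated"),
]


def objects(prop, subject):
    """All objects o such that prop(subject, o) is asserted."""
    return sorted(o for p, s, o in PROPERTY_FACTS if p == prop and s == subject)


def describe(process):
    """List (connector, connector class, entity) for every connector of a process."""
    rows = []
    for connector in objects("hasConnector", process):
        for entity in objects("hasEntity", connector):
            rows.append((connector, CLASS_FACTS[connector], entity))
    return rows


if __name__ == "__main__":
    print(objects("hasBiologicalEvent", "p3"), describe("p3"))
```

Calling `describe("p3")` walks from the translocation process through c4 and c5 to the entities e2 and e4, which is the chain of facts the text highlights.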


Figure 1C shows a graphical illustration of the simple model imported into Cell Illustrator. The graphical images and positions for biological elements are also stored in CSO in a machine-readable format. Because of this, visualization tools can use these data for automatic drawing of biological networks, taking into account cellular compartments [6] and the hierarchy of the CSO classes [7].

3. Rule-Based Reasoning for Ontology Validation

We define three criteria for qualifying pathway knowledge as follows:
• Structurally correct models in terms of Petri nets.
• Biologically correct models to capture biological meaning.
• Systematically correct models to reflect biological behaviors.

Although CSO defines sophisticated classes and relationships to describe the details of any given interaction unambiguously, an OWL ontology alone is sometimes not enough to provide a qualified knowledge base of biological pathways. In OWL, there is no proper way to constrain what kind of entities can participate in which types of biological processes, or what data values are valid for a particular process. For ontology validation based on the three criteria, we use a rule-based approach represented in OWL constructors and axioms [12]. The available constructors and their correspondence in Description Logic (DL) and First Order Logic (FOL) are shown in Table 1.

Table 1. OWL constructors and DL/FOL equivalence.

Constructor      DL syntax          FOL syntax
intersectionOf   C1 ⊓ ... ⊓ Cn      C1(x) ∧ ... ∧ Cn(x)
unionOf          C1 ⊔ ... ⊔ Cn      C1(x) ∨ ... ∨ Cn(x)
complementOf     ¬C                 ¬C(x)
oneOf            {a1 ... an}        x = a1 ∨ ... ∨ x = an
allValuesFrom    ∀P.C               ∀y.(P(x, y) → C(y))
someValuesFrom   ∃P.C               ∃y.(P(x, y) ∧ C(y))
minCardinality   ≥n P.C             ∃≥n y.(P(x, y) ∧ C(y))
maxCardinality   ≤n P.C             ∃≤n y.(P(x, y) ∧ C(y))

In Table 1, Ci is a class, P is a property, ai is an individual, n is a non-negative integer, and x and y are variables. In FOL, classes correspond to unary predicates, properties correspond to binary predicates, and individuals are equivalent to constants. In the following description, rules are described in FOL. A rule has the form: H [...] ∃≤1 x3.(hasConnector(x1, x3) ∧ Connector(x3) ∧ hasEntity(x3, x2) ∧ Entity(x2))

Given any pair of one entity and one process, the relationship is correct if there exists zero or one connector between them. Some biological knowledge resources allow physical entities to have multiple roles in a process, so the results of data conversion from those resources into CSO may violate this rule. There also exist gaps between the different levels of abstraction and the different structuring conventions used in biological knowledge resources. For example, in BioPAX [1], a catalyzed inactivation process is represented as two different processes: a catalysis and an inactivation. A catalysis describes that an enzyme catalyzes the inactivation process. In this case, the enzyme may participate as an activator of the catalysis process as well as an input of the inactivation process. In CSO, a catalyzed inactivation is described as one process. After conversion from BioPAX to CSO, one input entity is connected to the process with two roles: a catalyzer and a substrate. This is not allowed in CSO, which is based on Petri nets, because a catalyzer does not change its concentration during the interaction, whereas a substrate does. A query to evaluate the relationship between process x1 and entity x2 can be written as follows:

Q1: If not VALIDCONNECTION(x1, x2) then alert

In Q1, user intervention (alert) is required to select a correct relationship if there exist multiple connections between x1 and x2, because it is difficult to decide which one is correct without understanding the details of the interaction.
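The check behind Q1 amounts to counting connectors per (process, entity) pair. The sketch below illustrates that idea on plain Python lists; it is not the query used in the paper, and the function name `invalid_connections` and the pair-list encoding of hasConnector and hasEntity are assumptions.

```python
from collections import defaultdict


def invalid_connections(has_connector, has_entity):
    """Return {(process, entity): [connectors]} for pairs linked by more than one connector.

    has_connector: list of (process, connector) pairs
    has_entity:    list of (connector, entity) pairs
    A pair with zero or one connector is valid and is not reported.
    """
    links = defaultdict(list)
    for process, connector in has_connector:
        for conn, entity in has_entity:
            if conn == connector:
                links[(process, entity)].append(connector)
    return {pair: sorted(conns) for pair, conns in links.items() if len(conns) > 1}


if __name__ == "__main__":
    # Hypothetical converted model: enzyme e1 is both catalyzer (c1) and substrate (c2).
    has_connector = [("p1", "c1"), ("p1", "c2"), ("p1", "c3")]
    has_entity = [("c1", "e1"), ("c2", "e1"), ("c3", "e2")]
    for pair, conns in invalid_connections(has_connector, has_entity).items():
        print("alert:", pair, "is linked via", conns)
```

The function only reports the offending pairs; choosing which connector to keep is left to the user, as the text prescribes.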

3.2. Biologically correct models

In many cases, controlled vocabularies are used to control and limit terms to describe biological processes, whose definitions are usually given as comments for human users. In CSO, the type of a process is identified with the property of a biological


event which has cardinality 1. On the other hand, it is optional in BioPAX and has a different meaning [9] in TRANSPATH [21]. It is useful to formalize the definitions based on shared knowledge underlying biological processes. We define a biologically correct model as a model that correctly represents the biological meaning of processes in a machine-readable format. In this paper, the three processes depicted in Fig. 1C are used to illustrate rules. Translocation is a process whose biological event is ME_Translocation. Similarly, binding and phosphorylation are processes annotated as ME_Binding and ME_Phosphorylation, respectively. The following rules define the three processes.

R2: TRANSLOCATION(x1) ← Process(x1) ∧ hasBiologicalEvent(x1, ME_Translocation)
R3: BINDING(x1) ← Process(x1) ∧ hasBiologicalEvent(x1, ME_Binding)
R4: PHOSPHORYLATION(x1) ← Process(x1) ∧ hasBiologicalEvent(x1, ME_Phosphorylation)

The following queries, Q2, Q3, and Q4, evaluate whether a given process satisfies certain conditions. In the queries, HASINPUT and HASOUTPUT are defined as follows:

HASINPUT(x1, x3) ← ∃x2, x3.(hasConnector(x1, x2) ∧ Input(x2) ∧ hasEntity(x2, x3))
HASOUTPUT(x1, x3) ← ∃x2, x3.(hasConnector(x1, x2) ∧ Output(x2) ∧ hasEntity(x2, x3))

If an entity is connected to a process via an Input connector, then we say that the process has an input entity. Likewise, the process has an output entity if the entity is connected to the process via an Output connector. In the queries, DifferentFrom and SameAs are OWL axioms for the identification of individuals. They have the forms {x1} ⊑ ¬{x2} and {x1} ≡ {x2} in DL, respectively.

Q2: If TRANSLOCATION(x1) then
      If ¬(∃xi, 2≤i≤7. HASINPUT(x1, x2) ∧ Entity(x2) ∧ locatedIn(x2, x4) ∧ hasXref(x2, x6) ∧ HASOUTPUT(x1, x3) ∧ Entity(x3) ∧ locatedIn(x3, x5) ∧ hasXref(x3, x7) ∧ DifferentFrom(x4, x5) ∧ SameAs(x6, x7)) then
        alert

Q3: If BINDING(x1) then
      If ¬(∃≥2 x2, ∃≤1 x3.(HASINPUT(x1, x2) ∧ Entity(x2)) ∧ HASOUTPUT(x1, x3) ∧ Complex(x3)) then
        alert

Q4: If PHOSPHORYLATION(x1) then
      If ¬(∃xi, 2≤i≤5. HASINPUT(x1, x2) ∧ Entity(x2) ∧ hasXref(x2, x4) ∧ HASOUTPUT(x1, x3) ∧ Entity(x3) ∧ hasXref(x3, x5) ∧ hasFeature(x3, FT_phosphorylated) ∧ SameAs(x4, x5)) then
        alert

The definition of translocation in CSO is the process in which an entity located in one cellular compartment is moved to another cellular compartment. In CSO, the same molecule in different locations is recognized as two different entities. Q2 states that a translocation process has to satisfy the constraints that the input and output entities have the same external reference and different cellular locations. A binding process is an interaction of a molecule with specific sites on another molecule. In Q3, a binding process needs at least two input entities and generates one output entity of type Complex. Formally, phosphorylation is the process of introducing a phosphate group into a molecule, usually with the formation of a phosphoric ester. In Q4, the constraints state that the input and output entities have the same external reference and that the sequence of the output entity has phosphorylated features. If the constraints are not satisfied, users are prompted for intervention (alert). Users may be guided to add missing constraints to the knowledge base.
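The three constraints can be illustrated with ordinary Python predicates over dictionaries. This is a sketch under stated assumptions, not the rule engine of the paper: OWL's SameAs and DifferentFrom are approximated by Python equality and inequality, the external references `"X1"` and `"X2"` are invented placeholders (the simple model of Fig. 1B lists no hasXref facts), the feature term is spelled as in Fig. 1B, and "one output entity as Complex" is read as exactly one.

```python
def translocation_ok(inputs, outputs, located_in, xref):
    """Q2: some input/output pair has the same external reference but different locations."""
    return any(
        xref.get(i) is not None and xref.get(i) == xref.get(o)
        and located_in.get(i) is not None and located_in.get(o) is not None
        and located_in[i] != located_in[o]
        for i in inputs for o in outputs
    )


def binding_ok(inputs, outputs, entity_class):
    """Q3: at least two input entities and one output entity of class Complex."""
    complexes = [o for o in outputs if entity_class.get(o) == "Complex"]
    return len(set(inputs)) >= 2 and len(complexes) == 1


def phosphorylation_ok(inputs, outputs, xref, features):
    """Q4: some output shares an input's external reference and is phosphorylated."""
    return any(
        xref.get(i) is not None and xref.get(i) == xref.get(o)
        and "FT_phosphorylated" in features.get(o, ())
        for i in inputs for o in outputs
    )


if __name__ == "__main__":
    located_in = {"e2": "CC_cytoplasm", "e4": "CC_nucleus"}
    xref = {"e1": "X1", "e5": "X1", "e2": "X2", "e4": "X2"}  # hypothetical references
    entity_class = {"e1": "Protein", "e2": "Protein", "e3": "Complex"}
    features = {"e5": ["FT_phosphorylated"]}
    checks = {
        "p3 translocation": translocation_ok(["e2"], ["e4"], located_in, xref),
        "p2 binding": binding_ok(["e1", "e2"], ["e3"], entity_class),
        "p1 phosphorylation": phosphorylation_ok(["e1"], ["e5"], xref, features),
    }
    for name, ok in checks.items():
        print(name, "ok" if ok else "alert")
```

A process for which the predicate returns False corresponds to the alert branch of the query, for example a translocation whose input and output entities lack a shared external reference.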

3.3. Systematically correct models

CSO is an ontology to represent the dynamics of biological pathways and is intended to simulate complex molecular mechanisms at different levels of detail. Once a mathematical model of biological pathways has been generated, it is necessary to estimate any free parameters and unknown rate constants based upon experimental data. We limit our consideration to generating a simulatable model ready for evaluation. We define a systematically correct model as a model that captures generic behaviors that govern the system dynamics. In the current state of this work, we focus on protein turnover. Normally, proteins are synthesized within the cell and over time are gradually broken down into individual amino acids, and this cycle is repeated. To capture this behavior, we define three rules to recognize which entities are synthesized and degraded. R5 defines a starting entity as an entity, except for a complex, which is connected to processes via only Input connectors. This indicates that a starting entity is not a product of any process. A predicate with a superscript minus sign means the inverse of the predicate, e.g. hasEntity⁻. R6 identifies a starting entity whose type is complex. In addition, R7 is defined for biological entities, except for genes, to be degraded.

R5: STARTINGENTITY(x1) ← [...] Input(x2))
R6: STARTINGCOMPLEX(x1) ← Complex(x1) ∧ ∀x2.(hasEntity⁻(x1, x2) → Input(x2))
R7: DEGRADINGENTITY(x1) ← Protein(x1) ∨ Complex(x1) ∨ mRNA(x1)

The next three queries are generated from rules R5, R6, and R7; they complement the given models by adding new instances (add-instance) and properties (add-property). A variable in braces, e.g. {x1}, denotes a new instance ID. In Q5, if a given entity is a STARTINGENTITY whose type is not complex, then a production process ({x2}) as a pre-process of the entity, a connector ({x3}) to relate x1 and {x2}, and any necessary properties are added. This makes the starting entity a product of the production process. In Q6, if a given entity is a STARTINGCOMPLEX, then we assume that the complex is generated via a binding process whose participants are the components of the complex. Depending on the number of components of the complex, multiple connectors will be added. For degrading entities, including protein, complex, and mRNA, a degradation process is added with a connector between the entity and the degradation process in Q7. In the Petri net formalism, adding pre-processes for starting entities (complexes) lets those processes fire without any constraints when the simulation is started. All entities consume their initial concentrations at the starting point of the simulation. This complementation of the pathway data in CSO helps users to intuitively understand the given model and how it works.

Q5: If STARTINGENTITY(x1) then
      add-instance Process({x2}), OutputProcessBiological({x3})
      add-property hasBiologicalEvent({x2}, ME_UnknownProduction), hasConnector({x2}, {x3}), hasEntity({x3}, x1)

Q6: If STARTINGCOMPLEX(x1) then
      add-instance Process({x2}), OutputProcessBiological({x4})
      add-property hasBiologicalEvent({x2}, ME_Binding), hasConnector⁻({x4}, {x2})
      for all hasComponents(x1, x3) do
        add-instance InputProcessBiological({xi})
        add-property hasConnector⁻(x3, {xi})

Q7: If DEGRADINGENTITY(x1) then
      add-instance Process({x2}), InputProcessBiological({x3})
      add-property hasBiologicalEvent({x2}, ME_Degradation), hasConnector({x2}, {x3}), hasEntity({x3}, x1)
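The effect of Q5 to Q7 can be previewed with a short sketch that only lists which processes would be added, rather than creating instances in a knowledge base. Everything in it is an illustrative assumption: the function name `complement`, the tuple encoding of connectors, and the set of connector classes counted as Input. The per-component connectors of Q6 are not modeled.

```python
INPUT_CLASSES = {
    "InputProcessBiological",
    "InputAssociationBiological",
    "InputInhibitorBiological",
}
DEGRADABLE = {"Protein", "Complex", "mRNA"}  # R7


def complement(entity_class, connectors):
    """Return (event, entity) pairs for the processes that Q5-Q7 would add.

    entity_class: entity id -> class name
    connectors:   list of (process id, connector class, entity id)
    """
    added = []
    for entity, cls in sorted(entity_class.items()):
        roles = [c_cls for _, c_cls, e in connectors if e == entity]
        # Starting entities are attached to processes via Input connectors only.
        starting = all(role in INPUT_CLASSES for role in roles)
        if starting and cls == "Complex":
            added.append(("ME_Binding", entity))            # Q6
        elif starting:
            added.append(("ME_UnknownProduction", entity))  # Q5
        if cls in DEGRADABLE:
            added.append(("ME_Degradation", entity))        # Q7
    return added


if __name__ == "__main__":
    # The simple model of Fig. 1B.
    entity_class = {"e1": "Protein", "e2": "Protein", "e3": "Complex",
                    "e4": "Protein", "e5": "Protein"}
    connectors = [
        ("p2", "InputProcessBiological", "e1"), ("p2", "InputProcessBiological", "e2"),
        ("p2", "OutputProcessBiological", "e3"),
        ("p3", "InputProcessBiological", "e2"), ("p3", "OutputProcessBiological", "e4"),
        ("p1", "InputProcessBiological", "e1"), ("p1", "OutputProcessBiological", "e5"),
    ]
    for event, entity in complement(entity_class, connectors):
        print(event, entity)
```

For the simple model this proposes production processes for e1 and e2, which are never outputs, and degradation processes for all five entities.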


4. Experimental Results

To implement the rule-based system, we used AllegroGraph 2.2.5 [17] for CSO data storage and as the query engine, the SPARQL query language [20] for querying, and Java applications and Perl scripts for query manipulation and knowledge base manipulation, respectively. AllegroGraph is an RDF graph database with support for SPARQL. The Signaling by FGFR pathway from Reactome (ID=190236) [19] is selected as an example. The 22 members of the fibroblast growth factor (FGF) family of growth factors mediate their cellular responses by binding to and activating the different isoforms encoded by the four receptor tyrosine kinases (RTKs) designated FGFR1, FGFR2, FGFR3, and FGFR4. These receptors are key regulators of several developmental processes in which cell fate and differentiation to various tissue lineages are determined. This leads to stimulation of intracellular signaling pathways that control cell proliferation, cell differentiation, cell migration, cell survival and cell shape, depending on the cell type or stage of maturation [19]. The Reactome data exported in the BioPAX format are converted into the CSO format by BioPAX2CSO [4]. Figure 2 shows the result of BioPAX2CSO. In the figure, the boxes mark places to be evaluated by the queries described in Sec. 3. Figure 3 shows the result of ontology validation for the same model as in Fig. 2. Via ontology validation, seven invalid connections are corrected and six starting complexes receive pre-binding processes. In addition, 15 unknown production and 43 degradation processes are added for starting entities and degrading entities, respectively. This validation makes the given model simulatable when loaded in Cell Illustrator without any changes. The simulation results are shown as charts at the bottom of Fig. 3.

5. Conclusions

We have presented a rule-based approach to provide qualified knowledge bases for biological pathways. Three criteria were proposed for ontology validation in terms of both Petri nets and biological meaning. The experimental result shows how ontology validation can be done by using rules in conjunction with CSO. The main contributions of this work are (1) a formal representation for biological events and biological behaviors and (2) new criteria for qualifying biological pathway knowledge. Our proposed method can be used for biological pathway models generated via data conversion and manual curation. In addition, it can be used as a plugin for modeling and simulation tools such as Cell Illustrator. When users create models, they are guided to generate models that are simulatable as well as biologically correct. As a result, the proposed method helps to generate qualified pathway models, which make it easy to explore the possible dynamic behavior of pathway components. In future work, we plan to define rules for as many of the biological events defined in CSO as possible. Furthermore, we will define more rules to capture generic


E. Jeong, M. Nagasaki & S. Miyano

Fig. 2. The signaling by FGFR pathway from Reactome (ID=190236) [19] after conversion into CSO.

biological behaviors learned from modeling experts and the literature. For example, process speeds differ depending on the biological event: binding and dimerization may have different speeds; natural degradation is slower than other processes; and transcription of mRNA is faster than that of miRNA. Moreover, the time to translate a protein and the time to transcribe a gene differ between species.


Fig. 3. The results of ontology validation of the pathway described in Fig. 2 and the simulation results with default values of parameters.

References

[1] Bader, G. and Cary, M., BioPAX - biological pathways exchange language level 2, version 1.0 documentation, 2005.
[2] Genrich, H.J., Küffner, R., and Voss, K., Executable Petri net models for the analysis of metabolic pathways, International Journal on Software Tools for Technology Transfer, 3(4):394-404, 2001.
[3] Hofestädt, R. and Thelen, S., Quantitative modeling of biochemical networks, In Silico Biol., 1(1):39-53, 1998.
[4] Jeong, E., Nagasaki, M., and Miyano, S., Conversion from BioPAX to CSO for system dynamics and visualization of biological pathway, Genome Informatics, 18:225-236, 2007.
[5] Jeong, E., Nagasaki, M., Saito, A., and Miyano, S., Cell system ontology: representation for modeling, visualizing, and simulating biological pathways, In Silico Biology, 7(6):623-638, 2007.
[6] Kojima, K., Nagasaki, M., Jeong, E., Kato, M., and Miyano, S., An efficient grid layout algorithm for biological networks utilizing various biological attributes, BMC Bioinformatics, 8:76, 2007.
[7] Kojima, K., Nagasaki, M., and Miyano, S., Fast grid layout algorithm for biological networks with sweep calculation, Bioinformatics, 24(12):1433-1441, 2008.
[8] Nagasaki, M., Doi, A., Matsuno, H., and Miyano, S., A versatile Petri net based architecture for modeling and simulation of complex biological processes, Genome Informatics, 15(1):180-197, 2004.
[9] Nagasaki, M., Saito, A., Li, C., Jeong, E., and Miyano, S., Systematic reconstruction of TRANSPATH data into Cell System Markup Language, BMC Systems Biology, 2:53, 2008.
[10] Peleg, M., Yeh, I., and Altman, R.B., Modelling biological processes using workflow and Petri Net models, Bioinformatics, 18(6):825-837, 2002.
[11] Reddy, V.N., Liebman, M.N., and Mavrovouniotis, M.L., Qualitative analysis of biochemical reaction systems, Comput. Biol. Med., 26:9-24, 1996.
[12] Smith, M., Welty, C., and McGuinness, D., OWL Web Ontology Language Guide, 2004.
[13] http://protege.stanford.edu/, The Protégé ontology editor and knowledge acquisition system.
[14] http://www.cellillustrator.com/, Cell Illustrator 3.0.
[15] http://cionline.hgc.jp/, Cell Illustrator Online.
[16] http://www.csml.org/, Cell System Markup Language (CSML).
[17] http://www.franz.com/, AllegroGraph - Web 3.0 database.
[18] http://www.mindswap.org/2004/SWOOP/, SWOOP - hypermedia-based OWL ontology browser and editor.
[19] http://www.reactome.org/, Reactome - a curated knowledge base of biological pathways.
[20] http://www.w3.org/TR/rdf-sparql-query/, SPARQL query language for RDF.
[21] http://www.biobase.de/, TRANSPATH - the pathway databases.

ESTIMATION OF NONLINEAR GENE REGULATORY NETWORKS VIA L1 REGULARIZED NVAR FROM TIME SERIES GENE EXPRESSION DATA

KANAME KOJIMA

ANDRE FUJITA

TEPPEI SHIMAMURA

SEIYA IMOTO

SATORU MIYANO

Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan

Recently, a nonlinear vector autoregressive (NVAR) model based on Granger causality was proposed to infer nonlinear gene regulatory networks from time series gene expression data. Since NVAR requires a large number of parameters due to the basis expansion, time series microarray data are too short for accurate parameter estimation, and the size of the gene set must be strongly limited. To address this limitation, we employ an L1 regularization technique to estimate NVAR. Under L1 regularization, the direct parents of each gene can be selected efficiently even when the number of parameters exceeds the number of data samples. We can thus estimate larger gene regulatory networks more accurately than with existing methods. Through a simulation study, we verify the effectiveness of the proposed method by comparing its limit on the number of genes to that of the existing NVAR. The proposed method is also applied to time series microarray data of the human HeLa cell cycle.

Keywords: time series gene expression data; gene regulatory networks; vector autoregression; B-spline; group LASSO

1. Introduction

Estimating gene regulatory networks from time series microarray data is one of the essential tasks in elucidating transcriptional systems. Recently, various statistical approaches based on statistical causality have been proposed to capture gene regulations, using dynamic Bayesian networks [13, 18], vector autoregressive models [7-9], and state space models [12, 25]. In this study, we use a vector autoregressive model and capture gene regulations based on Granger causality. Linear vector autoregressive models are well established in statistics and have been applied to the estimation of gene regulatory networks in the existing literature. However, most regulations are not restricted to linear relationships [9], and we need to extend classical vector autoregressive models to nonlinear vector autoregressive models. Fujita et al. [9] introduced a nonparametric regression technique into the vector autoregressive model for estimating nonlinear and nonmonotonic regulations in gene regulatory networks. In nonparametric regression, since a basis expansion technique is applied to build the nonlinear mean function, the number of parameters increases rapidly. In addition, the number of genes that can be handled is severely limited by the fact that time series microarray data are very short.

Thus, we propose to use an L1 regularization technique to address the estimation of nonlinear and nonmonotonic gene regulatory networks. L1 regularized nonparametric regression reduces to a group LASSO problem [16, 21]. For the solution of group LASSO, we present a new efficient method based on the interior point method. Also, the estimates of group LASSO depend on the regularization parameter $\lambda$, which determines which variables are chosen. Therefore, an appropriate choice of $\lambda$ is essential for statistical modeling based on group LASSO. We investigate this problem from a Bayesian point of view and derive an information criterion to choose the value of $\lambda$.

We apply the proposed method to an artificial network of ten genes and twenty edges [9]. A comparison of the true positive rates of our proposed method and of the methods based on ordinary least squares (OLS) and L2 penalization, i.e., the ridge estimator, under false discovery rate control verifies the effectiveness of the proposed method, especially for time series data of length less than 75. Our proposed method is also applied to time series gene expression data from the human HeLa cell cycle [24], and the obtained gene regulatory network is analyzed.

This manuscript is organized as follows: Section 2.1 gives the definition of the group LASSO model and its efficient solution. The L1 regularized spline additive model and its relationship to group LASSO are described in Section 2.2. In Section 2.3, an information criterion for group LASSO is derived and the selection of the regularization parameter $\lambda$ is shown. A statistical test for Granger causality is described in Section 2.4. In Section 3, our proposed method is applied to time series data from the artificial network and to real data. Finally, we discuss our work in Section 4.

2. L1 spline additive regression

2.1. Preliminary

2.1.1. Vector autoregressive model

Given gene expression profile vectors of $p$ genes at $T$ time points, $\{x_1, \ldots, x_T\}$, the first-order vector autoregressive (VAR(1)) model at time point $t$ is given by:

\[ x_t = A x_{t-1} + \varepsilon_t, \tag{1} \]

where $A$ is a $p \times p$ autoregressive coefficient matrix, and $\varepsilon_t$ is a vector of normally distributed noise with $\varepsilon_{i,t} \sim N(0, \sigma_i^2)$ for the expression of gene $i$ at time point $t$. For simplicity of explanation, we use the following notation: $y_i = (x_{i,2}, \ldots, x_{i,T})'$, $X = (x_1, \ldots, x_{T-1})'$, $\beta_i = (a_{i,1}, \ldots, a_{i,p})'$, and $\varepsilon_i = (\varepsilon_{i,1}, \ldots, \varepsilon_{i,T-1})'$. With this notation, the autoregressive model for each gene $i$ in Equation (1) can be written as:

\[ y_i = X \beta_i + \varepsilon_i. \tag{2} \]


Granger [11] defined the concept of Granger causality, in which a cause cannot come after its effect. Thus, if a gene $x_i$ affects a gene $x_j$, the expression of gene $x_i$ should help improve the prediction of the expression of gene $x_j$. To assess whether $x_i$ has significant Granger causality to $x_j$, we test whether the autoregressive coefficient $a_{j,i}$ is 0.
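The notation of Equation (2) can be illustrated with a minimal sketch that builds the response vector $y_i$ and the lagged design matrix $X$ from a time series. The function name `build_var_regression` and the list-of-lists data layout are assumptions made for this illustration; they do not come from the original work.

```python
def build_var_regression(series, i):
    """Build y_i and X of Equation (2) from a time series.

    series: list of T expression vectors (each a list of p floats),
            series[t] holding the expression of all genes at time t+1.
    i:      0-based index of the target gene (hypothetical convention).
    """
    if len(series) < 2:
        raise ValueError("at least two time points are required")
    # y_i = (x_{i,2}, ..., x_{i,T})': the target gene from the second time point on.
    y = [series[t][i] for t in range(1, len(series))]
    # X = (x_1, ..., x_{T-1})': all genes at the preceding time points.
    X = [list(series[t]) for t in range(len(series) - 1)]
    return y, X
```

Each row of `X` is paired with the entry of `y` one time step later, which is what allows a coefficient $a_{j,i}$ to be read as the lagged influence of gene $i$ on gene $j$.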

2.1.2. Linear autoregression with grouped covariates

We consider that the $p$ covariates are partitioned into $G$ disjoint groups and rewrite the regression model in Equation (2) as:

\[ y_i = \sum_{g=1}^{G} X_g \beta_{i,g} + \varepsilon_i, \]

where $\beta_{i,g}$ is the sub-vector of $\beta_i$ corresponding to the $p_g$ covariates in the $g$th group, and $X_g$ is the $(T-1) \times p_g$ matrix of columns corresponding to the covariates in the $g$th group. Like LASSO, group LASSO [26] can impose the restriction that all coefficients in some $\beta_{i,g}$ are simultaneously and exactly zero. The estimates of group LASSO are obtained by solving the following minimization:

\[ \arg\min_{\beta_i} \Bigl\{ \Bigl(y_i - \sum_{g} X_g \beta_{i,g}\Bigr)' \Bigl(y_i - \sum_{g} X_g \beta_{i,g}\Bigr) + \lambda \sum_{g} \sqrt{\beta_{i,g}' K_{i,g} \beta_{i,g}} \Bigr\}, \tag{3} \]

where $K_{i,g}$ is a $p_g \times p_g$ positive semi-definite matrix. Since Equation (3) is a convex optimization problem that is not differentiable at $\beta_{i,g} = 0$, Park and Hastie [21] proposed to use the interior point method, introducing dummy variables.
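To make the objective in Equation (3) concrete, the following sketch evaluates it for given coefficient groups. It only computes the objective value and is not a solver; the function names and the nested-list matrix representation are assumptions chosen for illustration.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def group_lasso_objective(y, X_groups, beta_groups, K_groups, lam):
    """Value of the objective in Equation (3).

    y:           response vector of length T-1
    X_groups:    list of (T-1) x p_g matrices, one per group
    beta_groups: list of coefficient vectors, one per group
    K_groups:    list of p_g x p_g positive semi-definite matrices
    lam:         regularization parameter lambda
    """
    # Fitted values: sum over groups of X_g beta_g.
    fitted = [0.0] * len(y)
    for X_g, b_g in zip(X_groups, beta_groups):
        for t, value in enumerate(matvec(X_g, b_g)):
            fitted[t] += value
    rss = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    # Penalty: sum over groups of sqrt(beta_g' K_g beta_g).
    penalty = 0.0
    for b_g, K_g in zip(beta_groups, K_groups):
        quad = sum(b * kb for b, kb in zip(b_g, matvec(K_g, b_g)))
        penalty += math.sqrt(max(quad, 0.0))  # guard against tiny negative round-off
    return rss + lam * penalty
```

Because the penalty is a sum of unsquared norms, a group contributes nothing only when its whole coefficient vector is zero, which is what drives group-wise selection.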

2.1.3. Bayesian information criterion

Given data $D$, we may select the model $M$ of maximum posterior probability $P(M \mid D)$ among the models of interest. If the prior probability $P(M)$ of the model is assumed to be uniform, then by Bayes' theorem the posterior probability of $M$ is proportional to the marginal likelihood $P(D \mid M)$. Suppose that a model $M$ is characterized by a parametric model $f(D \mid \theta)$ and a prior distribution $\pi(\theta)$ for the parameter $\theta$. The marginal likelihood of model $M$ with respect to data $D$ is given by:

\[ P(D \mid M) = \int f(D \mid \theta)\, \pi(\theta \mid M)\, d\theta. \]

The Bayesian information criterion [1, 22] was proposed as an approximation of the posterior probability of the model, to select the optimal model based on the data:

\[ \mathrm{BIC} \approx -2 \log P(M \mid D) = -2 \log \int f(D \mid \theta)\, \pi(\theta \mid M)\, d\theta. \]


2.2. L1 regularized spline additive model for gene regulatory network estimation

In nonparametric regression, spline functions are often used for constructing regressors. Let $s_{i,j}(x_{j,t})$ be the spline function for $x_{j,t}$, the expression of gene $j$ at time point $t$. In this study, third-order B-splines are used as the basis of the spline function, and $s_{i,j}(x_{j,t})$ is represented by $\sum_{k=1}^{m_{i,j}} \gamma_{i,j,k}\, b_{i,j,k}(x_{j,t})$. The smoothing spline additive model is obtained by minimizing the loss function:

\[ \sum_{t=2}^{T} \Bigl\{ x_{i,t} - \sum_{j=1}^{p} s_{i,j}(x_{j,t-1}) \Bigr\}^2 + \lambda \sum_{j=1}^{p} \int \Bigl\{ \frac{d^2}{dx^2} s_{i,j}(x) \Bigr\}^2 dx. \]

Lin and Zhang [16] and Bach et al. [3] extended the above smoothing spline additive model to the L1 regularized spline additive model, in which L1 norms of the first and second derivatives of the spline functions are used as the penalty. In the L1 regularized spline additive model, the following loss function is optimized:

\[ \sum_{t=2}^{T} \Bigl\{ x_{i,t} - \sum_{j=1}^{p} s_{i,j}(x_{j,t-1}) \Bigr\}^2 + \lambda \sum_{j=1}^{p} \sqrt{ \int \Bigl\{ \frac{d}{dx} s_{i,j}(x) \Bigr\}^2 dx + \int \Bigl\{ \frac{d^2}{dx^2} s_{i,j}(x) \Bigr\}^2 dx }. \tag{4} \]

For B-splines, $\int \{ \frac{d}{dx} s_{i,j}(x) \}^2 dx$ and $\int \{ \frac{d^2}{dx^2} s_{i,j}(x) \}^2 dx$ can be written in the following forms [4]:

\[ \int \Bigl\{ \frac{d}{dx} s_{i,j}(x) \Bigr\}^2 dx = \gamma_{i,j}' D_{1,i,j}\, \gamma_{i,j}, \qquad \int \Bigl\{ \frac{d^2}{dx^2} s_{i,j}(x) \Bigr\}^2 dx = \gamma_{i,j}' D_{2,i,j}\, \gamma_{i,j}. \]

Therefore, we can rewrite Equation (4) as:

\[ \Bigl( y_i - \sum_{j=1}^{p} B_{i,j} \gamma_{i,j} \Bigr)' \Bigl( y_i - \sum_{j=1}^{p} B_{i,j} \gamma_{i,j} \Bigr) + \lambda \sum_{j=1}^{p} \sqrt{ \gamma_{i,j}' E_{i,j}\, \gamma_{i,j} }, \tag{5} \]

where $y_i = (x_{i,2}, \ldots, x_{i,T})'$ as in Section 2.1.1, $m_{i,j}$ is the number of basis functions for variable $x_j$, $\gamma_{i,j} = (\gamma_{i,j,1}, \ldots, \gamma_{i,j,m_{i,j}})'$, $E_{i,j} = D_{1,i,j} + D_{2,i,j}$, and $B_{i,j}$ is the $(T-1) \times m_{i,j}$ matrix:

\[ B_{i,j} = \begin{pmatrix} b_{i,j,1}(x_{j,1}) & \cdots & b_{i,j,m_{i,j}}(x_{j,1}) \\ \vdots & \ddots & \vdots \\ b_{i,j,1}(x_{j,T-1}) & \cdots & b_{i,j,m_{i,j}}(x_{j,T-1}) \end{pmatrix}. \]

In the L1 regularized spline additive model, we would like to evaluate whether all coefficients of some splines are simultaneously and exactly zero, so we can use the procedure based on group LASSO. However, the use of dummy variables increases the number of variables to be handled. In addition, unstable constraints caused by the dummy variables lead to slow convergence. Thus, we propose to convert the optimization problem in Equation (5) to:

where $B_i = (B_{i,1}, \ldots, B_{i,p})$ and $\gamma_i = (\gamma_{i,1}', \ldots, \gamma_{i,p}')'$. The optimization problem in Equation (6) can be solved by the interior point method without using dummy variables. See Appendix A for details.
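The basis matrix $B_{i,j}$ described above can be sketched with the Cox-de Boor recursion. This is an illustration under stated assumptions, not the authors' implementation: "third-order" is read here as cubic (degree 3), the knots are clamped at the minimum and maximum of the lagged series with equally spaced interior knots, and the names `bspline_basis` and `design_matrix` are hypothetical.

```python
def bspline_basis(k, degree, knots, x):
    """Value at x of the k-th B-spline basis function (Cox-de Boor recursion)."""
    if degree == 0:
        if knots[k] <= x < knots[k + 1]:
            return 1.0
        # Close the last non-empty interval on the right so that
        # x equal to the largest knot is still covered.
        if x == knots[-1] and knots[k] < knots[k + 1] and knots[k + 1] == knots[-1]:
            return 1.0
        return 0.0
    value = 0.0
    left_width = knots[k + degree] - knots[k]
    if left_width > 0:
        value += (x - knots[k]) / left_width * bspline_basis(k, degree - 1, knots, x)
    right_width = knots[k + degree + 1] - knots[k + 1]
    if right_width > 0:
        value += ((knots[k + degree + 1] - x) / right_width
                  * bspline_basis(k + 1, degree - 1, knots, x))
    return value


def design_matrix(x_j, m=4, degree=3):
    """(T-1) x m matrix with entries b_k(x_{j,t}) for t = 1..T-1.

    x_j: expression series of gene j over T time points.
    m:   number of basis functions (assumed default: four).
    """
    lagged = x_j[:-1]  # x_{j,1}, ..., x_{j,T-1}
    lo, hi = min(lagged), max(lagged)
    n_interior = m - degree - 1
    if n_interior < 0 or lo == hi:
        raise ValueError("need m >= degree + 1 and a non-constant series")
    interior = [lo + (hi - lo) * (r + 1) / (n_interior + 1) for r in range(n_interior)]
    # Clamped knot vector: boundary knots repeated degree + 1 times.
    knots = [lo] * (degree + 1) + interior + [hi] * (degree + 1)
    return [[bspline_basis(k, degree, knots, x) for k in range(m)] for x in lagged]
```

With these choices each row of the matrix sums to one, so the columns of one gene form a group whose coefficients are penalized jointly in Equation (5).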

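2.3. Bayesian information criterion for nonparametric group LASSO regression

Selection of the regularization parameter $\lambda$ is important for variable selection and coefficient estimation in group LASSO. We derive a Bayesian information criterion for the L1 spline additive model and choose the $\lambda$ that minimizes the criterion. From the viewpoint of Bayesian statistics, the probabilistic model of the L1 spline additive model can be characterized by the likelihood function $f(y_i \mid B_i, \gamma_i, \sigma_i^2)$ of linear regression together with the product of Laplacian priors $\pi_{i,j}(\gamma_{i,j} \mid \sigma_i^2, \lambda)$ for $\gamma_{i,j}$, given by:

\[ f(y_i \mid B_i, \gamma_i, \sigma_i^2) = \frac{1}{(2\pi\sigma_i^2)^{(T-1)/2}} \exp\Bigl\{ -\frac{1}{2\sigma_i^2} (y_i - B_i\gamma_i)'(y_i - B_i\gamma_i) \Bigr\}, \tag{7} \]

\[ \pi_{i,j}(\gamma_{i,j} \mid \sigma_i^2, \lambda) = L_{i,j} \exp\Bigl( -\frac{\lambda}{2\sigma_i^2} \sqrt{\gamma_{i,j}' E_{i,j}\, \gamma_{i,j}} \Bigr), \tag{8} \]

where

\[ L_{i,j} = \frac{1}{2\pi^{p_{i,j}/2}} \Bigl( \frac{\lambda}{2\sigma_i^2} \Bigr)^{p_{i,j}} \det(E_{i,j})^{1/2}\, \frac{\Gamma(p_{i,j}/2)}{\Gamma(p_{i,j})}. \]

Using Equations (7) and (8) in Equation (4), the posterior probability of the model based on group LASSO is characterized by the marginal likelihood:

\[ P(D \mid M) = \int f(y_i \mid B_i, \gamma_i, \sigma_i^2) \prod_{j} \pi_{i,j}(\gamma_{i,j} \mid \sigma_i^2, \lambda)\, d\gamma_i. \tag{9} \]

Note that the variance $\sigma_i^2$ is considered to be known. For unknown $\sigma_i^2$, we use $\hat{\sigma}_i^2 = \sum_{t=2}^{T} (y_{i,t} - \hat{y}_{i,t})^2 / (T-1)$ as the estimator of $\sigma_i^2$. Hereafter, we omit $\sigma_i^2$ and $\lambda$ in $f(y_i \mid B_i, \gamma_i, \sigma_i^2)$ and $\pi_{i,j}(\gamma_{i,j} \mid \sigma_i^2, \lambda)$ if no confusion occurs.

In the following, we explain how the integration in Equation (9) is solved. Let $A_i$ be the set of groups whose coefficient vectors $\gamma_{i,j}$ are estimated as non-zero in group LASSO. If $\gamma_{i,k}$ is not in $A_i$, i.e., it is estimated as exactly 0 in group LASSO, this implies that the Laplacian prior of $\gamma_{i,k}$ is much stronger than the likelihood function. Thus, we approximately calculate the integration for these $\gamma_{i,k}$, $k \notin A_i$, ignoring $\gamma_{i,k}$ in the likelihood function:

\[ \int f(y_i \mid B_i, \gamma_i)\, \pi_{i,k}(\gamma_{i,k})\, d\gamma_{i,k} \approx \int \frac{1}{(2\pi\sigma_i^2)^{(T-1)/2}} \exp\Bigl\{ -\frac{1}{2\sigma_i^2} \Bigl(y_i - \sum_{j \neq k} B_{i,j}\gamma_{i,j}\Bigr)' \Bigl(y_i - \sum_{j \neq k} B_{i,j}\gamma_{i,j}\Bigr) \Bigr\}\, \pi_{i,k}(\gamma_{i,k})\, d\gamma_{i,k} \]

\[ = \frac{1}{(2\pi\sigma_i^2)^{(T-1)/2}} \exp\Bigl\{ -\frac{1}{2\sigma_i^2} \Bigl(y_i - \sum_{j \neq k} B_{i,j}\gamma_{i,j}\Bigr)' \Bigl(y_i - \sum_{j \neq k} B_{i,j}\gamma_{i,j}\Bigr) \Bigr\}. \]

After integrating over all $\gamma_{i,j}$, $j \notin A_i$, we have:

\[ \mathrm{BIC}_{GL} \approx -2 \log \int f_{A_i}(y_i \mid B_{A_i}, \gamma_{A_i})\, \pi_{A_i}(\gamma_{A_i})\, d\gamma_{A_i}, \tag{10} \]

where

\[ f_{A_i}(y_i \mid B_{A_i}, \gamma_{A_i}) = \frac{1}{(2\pi\sigma_i^2)^{(T-1)/2}} \exp\Bigl\{ -\frac{1}{2\sigma_i^2} \Bigl(y_i - \sum_{j \in A_i} B_{i,j}\gamma_{i,j}\Bigr)' \Bigl(y_i - \sum_{j \in A_i} B_{i,j}\gamma_{i,j}\Bigr) \Bigr\} \]

\[ = \frac{1}{(2\pi\sigma_i^2)^{(T-1)/2}} \exp\Bigl\{ -\frac{1}{2\sigma_i^2} (y_i - B_{A_i}\gamma_{A_i})'(y_i - B_{A_i}\gamma_{A_i}) \Bigr\}, \]

\[ \pi_{A_i}(\gamma_{A_i}) = \prod_{j \in A_i} \pi_{i,j}(\gamma_{i,j}). \]

Here, $B_{A_i}$ is the sub-matrix of $B_i$ for the covariates in $A_i$, and $\gamma_{A_i}$ is the sub-vector of $\gamma_i$ for the covariates in $A_i$. For the integration with respect to $\gamma_{i,j}$, $j \in A_i$, the Laplace approximation is used. By the Laplace approximation, the integration is approximated as:

\[ \int \exp\{q(\theta)\}\, d\theta \approx \exp\{q(\hat{\theta})\}\, \frac{(2\pi)^{p/2}}{\Bigl| -\dfrac{\partial^2 q(\hat{\theta})}{\partial\theta\, \partial\theta'} \Bigr|^{1/2}}, \]

where $\hat{\theta} = \arg\max_{\theta} q(\theta)$. Applying the Laplace approximation to Equation (10), we have:

\[ \log \int f_{A_i}(y_i \mid B_{A_i}, \gamma_{A_i})\, \pi_{A_i}(\gamma_{A_i})\, d\gamma_{A_i} \approx \log f_{A_i}(y_i \mid B_{A_i}, \hat{\gamma}_{A_i})\, \pi_{A_i}(\hat{\gamma}_{A_i}) + \frac{|A_i|}{2} \log 2\pi - \frac{1}{2} \log\det J(\hat{\gamma}_{A_i}), \]

where $|A_i|$ is the length of $\gamma_{A_i}$, and $J(\hat{\gamma}_{A_i})$ is the $|A_i| \times |A_i|$ matrix given as:

\[ J(\hat{\gamma}_{A_i}) = -\frac{\partial^2}{\partial\gamma_{A_i}\, \partial\gamma_{A_i}'} \log f_{A_i}(y_i \mid B_{A_i}, \gamma_{A_i})\, \pi_{A_i}(\gamma_{A_i}) \Big|_{\gamma_{A_i} = \hat{\gamma}_{A_i}} = \frac{1}{\sigma_i^2} \Bigl( B_{A_i}' B_{A_i} + \frac{\lambda}{2}\, \mathrm{diag}\Bigl[ \frac{E_{i,j}}{\sqrt{\hat{\gamma}_{i,j}' E_{i,j} \hat{\gamma}_{i,j}}} - \frac{E_{i,j}\hat{\gamma}_{i,j} \{E_{i,j}\hat{\gamma}_{i,j}\}'}{\sqrt{\hat{\gamma}_{i,j}' E_{i,j} \hat{\gamma}_{i,j}}^{\,3}} \Bigr] \Bigr). \tag{11} \]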
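The Laplace approximation used above can be checked numerically in one dimension ($p = 1$). The sketch below is only an illustration: the function name `laplace_approx_1d`, the finite-difference step, and the requirement that the caller supply the maximizer are assumptions, not part of the original method.

```python
import math


def laplace_approx_1d(q, theta_hat, h=1e-3):
    """Laplace approximation of the integral of exp(q(theta)) over the real line.

    q:         log-integrand, assumed smooth with a single maximum
    theta_hat: location of the maximum of q (supplied by the caller)
    h:         step for the central finite-difference second derivative
    """
    second = (q(theta_hat + h) - 2.0 * q(theta_hat) + q(theta_hat - h)) / (h * h)
    curvature = -second  # one-dimensional analogue of the matrix J
    if curvature <= 0:
        raise ValueError("theta_hat is not a strict local maximum of q")
    # exp{q(theta_hat)} * (2*pi)^(1/2) / curvature^(1/2)
    return math.exp(q(theta_hat)) * math.sqrt(2.0 * math.pi / curvature)
```

For a quadratic log-integrand the approximation is exact, which gives a convenient sanity check: `laplace_approx_1d(lambda t: -0.5 * t * t, 0.0)` should be close to the Gaussian integral $\sqrt{2\pi}$.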


Thus, BICGL is given as: BICGL = -2Iog!A.(YiI B Ai ,i'A,)11"Ai (i'AJ -IA i !log21r+2IogdetJ(i'A,) = -(T - 1 + 21Ail) log 2 - (T - 1) log1r - (T - 1 + lAd) loga 2

+ 2 L (IOgr(pi,j/2) -IOgr(Pi,j) + セ@

+ 21Aillog A

log IEi,jl)

JEAi

-:2

{(Yi- BA ii'AY(Yi- BA .i'AJ+A.2: JEAi

jゥGセLェeス@

-logIJ(i'A,)I·

(12)

2.4. Wald test for Granger causality

The variables selected by group LASSO are considered as candidate variables having Granger causality to the response. In order to control the false discovery rate of those candidates, we test the coefficients of the basis functions $\gamma_{i,j}$ corresponding to each selected variable $x_j$. Usually, to test whether all the coefficients of grouped variables in linear regression are simultaneously zero, i.e., $\gamma_{i,j} = 0$, we may use the Wald test. However, the Wald test is based on the asymptotic normality of maximum likelihood estimators. Since group LASSO is not a maximum likelihood method due to the presence of the Laplacian prior, the Wald test cannot be applied directly to the estimators of group LASSO. Konishi and Kitagawa [15] considered a parameter $\theta$ represented as a functional $T(G)$ of the true distribution $G(x)$, with the estimator $\hat{\theta}$ of $\theta$ given by $T(\hat{G})$, where $\hat{G}$ is the empirical distribution of $G$. The asymptotic normality of $T(\hat{G})$ was shown:

\[ \sqrt{n}\, \bigl( T(\hat{G}) - T(G) \bigr) \to N\Bigl( 0, \int T^{(1)}(x, G) \{ T^{(1)}(x, G) \}'\, dG(x) \Bigr) \quad \text{in law}, \]

where $T^{(1)}(x, G)$ is the influence function of $T$, given by:

\[ T^{(1)}(x, G) = \lim_{\epsilon \to 0} \frac{ T((1-\epsilon)G + \epsilon\delta_x) - T(G) }{ \epsilon }. \]

Here, $\delta_x$ is the distribution function having probability 1 at point $x$. Since various estimators, including the maximum likelihood estimator and the maximum penalized likelihood estimator, can be represented as $T(\hat{G})$, we exploit this property for the Wald test. Let $T_{A_i}(G)$ be the functional for $\gamma_{A_i}$, the group LASSO coefficients in $A_i$. Due to the KKT conditions for group LASSO estimators [3, 21], the functional $T_{A_i}(G)$ for $\gamma_{A_i}$ satisfies

\[ \int \Psi_{A_i}(y, T_{A_i}(G))\, dG(y) = 0, \]

where


In addition, it is natural to assume that the set of groups $A_i$ selected by group LASSO is invariant under a small perturbation $\epsilon\delta_x$ of the distribution [23]. Thus, for small $\epsilon$, we have:

\[ \int \Psi_{A_i}\bigl(y, T_{A_i}((1-\epsilon)G + \epsilon\delta_x)\bigr)\, d\bigl((1-\epsilon)G(y) + \epsilon\delta_x(y)\bigr) = 0. \]

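By following the derivation of the influence function for the M-estimator in [15], the influence function $T^{(1)}_{A_i}(x, G)$ for $T_{A_i}(G)$ is given as:

\[ T^{(1)}_{A_i}(x, G) = -\Bigl\{ \int \frac{\partial}{\partial\gamma_{A_i}'} \Psi_{A_i}(y, \gamma_{A_i}) \Big|_{\gamma_{A_i} = T_{A_i}(G)}\, dG(y) \Bigr\}^{-1} \Psi_{A_i}(x, T_{A_i}(G)). \]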

By using empirical distribution LASSO estimator:

G for G, we have the covariance matrix of group

where I ('A) k t

2

1 ( BA, A2 BA - -Wk A A B'k A ' + -WkWA· A , ) , 1 'A Bk - -2 = --4 lWk na 2 4n t

t

t

t

and J(iAJ is given by Equation (11). Here, A ,j E Ai. a vector comprised of ケiLセゥG@ "'ti,j

l

t

= diag [Xi,t -

ゥセ「aLエャ@

t

t

and WAi is

t,j1i,j

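For the null hypothesis $H_0 : R\gamma_{A_i} = r$, we can derive the Wald statistic $W_{GL}$ for group LASSO coefficients as follows:

\[ W_{GL} = (R\hat{\gamma}_{A_i} - r)' \bigl\{ R\, \widehat{\mathrm{Cov}}(\hat{\gamma}_{A_i})\, R' \bigr\}^{-1} (R\hat{\gamma}_{A_i} - r) \to \chi^2_{\mathrm{rank}(R)} \quad \text{in law}, \]

where $\widehat{\mathrm{Cov}}(\hat{\gamma}_{A_i})$ denotes the covariance matrix of the group LASSO estimator given above. For example, suppose that $\gamma_{A_i} = (\gamma_{i,1}', \gamma_{i,2}', \gamma_{i,3}', \gamma_{i,4}')'$ and we would like to evaluate the null hypothesis $H_0 : \gamma_{i,2} = 0$. We then set $R = (O_{m_{i,2}, m_{i,1}}, I_{m_{i,2}}, O_{m_{i,2}, m_{i,3}}, O_{m_{i,2}, m_{i,4}})$ and $r = 0$, where $O_{m,n}$ is the $m \times n$ matrix whose elements are all zero and $I_m$ is the identity matrix of size $m$.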

3. Numerical examples

3.1. Simulation data examples

We use an artificial network of ten genes having twenty linear and nonlinear relationships to show the performance of our proposed method, L1 NVAR. As competitors of L1 NVAR, we employ the OLS-based nonlinear vector autoregressive (NVAR) model [9] and the nonlinear vector autoregressive model with L2 penalization (L2 NVAR). In L2 NVAR, the L1 penalization in Equation (5) is replaced by L2 penalization, and the regularization parameter is selected by the Bayesian information criterion. A Wald test derived in a similar manner to that for L1 NVAR is used to capture significant Granger causalities in L2 NVAR. The twenty edges in the artificial network are set as follows:

\[
\begin{aligned}
x_{1,t} &= 0.5\,x_{1,t-1} + \varepsilon_{1,t} \\
x_{2,t} &= 0.6\,x_{2,t-1} + \varepsilon_{2,t} \\
x_{3,t} &= 0.7\,x_{3,t-1} + \varepsilon_{3,t} \\
x_{4,t} &= 0.8\,x_{4,t-1} + \varepsilon_{4,t} \\
x_{5,t} &= 0.9\,x_{5,t-1} + \varepsilon_{5,t} \\
x_{6,t} &= \sin(x_{1,t-1}) + 0.5\,x_{2,t-1} - 0.5\,x_{9,t-1} + 2 + \varepsilon_{6,t} \\
x_{7,t} &= 2\cos(x_{2,t-1}) - 2\sin(x_{3,t-1}) + 0.6\,x_{10,t-1} + \varepsilon_{7,t} \\
x_{8,t} &= 0.8\cos(x_{3,t-1}) + 0.6\,x_{4,t-1} + \cos(x_{6,t-1}) + 1 + \varepsilon_{8,t} \\
x_{9,t} &= \sin(x_{4,t-1}) + \cos(x_{5,t-1}) - 0.8\,x_{7,t-1} + \varepsilon_{9,t} \\
x_{10,t} &= \sin(x_{1,t-1}) - 0.8\,x_{5,t-1} + \cos(x_{8,t-1}) + \varepsilon_{10,t}
\end{aligned}
\]

A graphical representation of the artificial network drawn with Cell Illustrator [19, 27] is shown in Figure 1. From the artificial network, we generate time series data of various lengths {10, 20, 30, 40, 50, 75, 100} and apply NVAR, L1 NVAR, and L2 NVAR to them. Since the time series length is not sufficient for the estimation, the number of B-splines is set to four. We repeat the experiment 100 times for each time series length. Granger causalities are estimated under a false discovery rate of 5%.
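The artificial network above can be simulated with a short script. This is a sketch under assumptions that are not stated in the text: the noise standard deviation (`noise_sd`), the all-zero initial state, the burn-in period, and the names `simulate_network` and `TRUE_EDGES` are all chosen here for illustration only.

```python
import math
import random

# True edges as (parent, child) pairs with 1-based gene indices,
# read off the ten update equations (twenty edges in total).
TRUE_EDGES = {
    (1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
    (1, 6), (2, 6), (9, 6),
    (2, 7), (3, 7), (10, 7),
    (3, 8), (4, 8), (6, 8),
    (4, 9), (5, 9), (7, 9),
    (1, 10), (5, 10), (8, 10),
}


def simulate_network(T, noise_sd=0.1, seed=0, burn_in=100):
    """Generate T time points from the ten-gene artificial network.

    noise_sd, seed and burn_in are assumptions of this sketch.
    Returns a list of T lists, each holding x_1..x_10 at one time point.
    """
    rng = random.Random(seed)
    x = [0.0] * 10  # x[0] is gene 1; assumed initial state
    series = []
    for step in range(burn_in + T):
        e = [rng.gauss(0.0, noise_sd) for _ in range(10)]
        x = [
            0.5 * x[0] + e[0],
            0.6 * x[1] + e[1],
            0.7 * x[2] + e[2],
            0.8 * x[3] + e[3],
            0.9 * x[4] + e[4],
            math.sin(x[0]) + 0.5 * x[1] - 0.5 * x[8] + 2 + e[5],
            2 * math.cos(x[1]) - 2 * math.sin(x[2]) + 0.6 * x[9] + e[6],
            0.8 * math.cos(x[2]) + 0.6 * x[3] + math.cos(x[5]) + 1 + e[7],
            math.sin(x[3]) + math.cos(x[4]) - 0.8 * x[6] + e[8],
            math.sin(x[0]) - 0.8 * x[4] + math.cos(x[7]) + e[9],
        ]
        if step >= burn_in:  # discard the transient from the assumed start
            series.append(x)
    return series
```

Calling `simulate_network(50, seed=k)` for `k = 0, ..., 99` would mimic the 100 repetitions at one series length, with the caveat that the noise level is a free choice in this sketch.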

Fig. 1. An artificial network of 10 genes and 20 edges used in a simulation study. Some of edges represent nonlinear causality.

First, in order to verify the false discovery rate control, we calculate the true false discovery rates by comparing the edges in the artificial network with the significantly estimated Granger causalities in NVAR, L1 NVAR, and L2 NVAR for each time series length. These true false discovery rates are summarized in Table 1. When the time series is short, i.e., the data are not sufficient, the false discovery rate is not controlled within 5% in L1 NVAR. In L2 NVAR, the false discovery rate far exceeds 5% for all time series lengths. This problem may be related to the convergence of the covariance matrix for the coefficients.


In the Wald tests for L1 NVAR and L2 NVAR, asymptotic normality is used for the derivation of the covariance matrix. On the other hand, the covariance matrix in NVAR coincides with its unbiased estimator, and thus the asymptotic normality of the maximum likelihood estimator is actually not used. This hypothesis is supported by the fact that the false discovery rate is controlled for all time series lengths in NVAR. For relatively long time series data, e.g., time series data of length 50, the false discovery rate is correctly controlled in L1 NVAR, while it is out of control in L2 NVAR. In L1 NVAR, some variables are dropped in estimation, and thus the convergence of the covariance matrix is faster than in the case where all the variables are considered. In L2 NVAR, however, no variable is dropped in the estimation. Thus, its false discovery rate approaches 5% as the time series length increases, but it has still not converged even for the longest time series considered.

Table 1. True false discovery rates obtained by comparing the artificial network and estimated Granger causalities in NVAR, L1 NVAR, and L2 NVAR under false discovery rate controlled at 5% (mean ± standard deviation, in %).

Time series length    NVAR           L1 NVAR          L2 NVAR
10                    -              40.54 ± 35.71    67.13 ± 20.25
20                    -              18.72 ± 16.23    65.11 ± 13.55
30                    -              11.64 ± 11.04    61.23 ± 8.20
40                    -              7.40 ± 8.95      55.59 ± 8.27
50                    2.23 ± 9.02    4.13 ± 5.88      49.84 ± 8.49
75                    3.83 ± 6.02    3.48 ± 4.42      39.67 ± 10.01
100                   4.37 ± 5.21    2.18 ± 4.03      31.91 ± 8.88
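The quantities reported in Table 1 and compared in the following paragraphs can be computed from an estimated edge set as in the sketch below. The function name `fdr_and_tpr` and the representation of edges as (parent, child) pairs are assumptions for illustration.

```python
def fdr_and_tpr(estimated_edges, true_edges):
    """Empirical false discovery rate and true positive rate of an edge set.

    estimated_edges: iterable of (parent, child) pairs declared significant
    true_edges:      iterable of (parent, child) pairs in the true network
    """
    estimated = set(estimated_edges)
    truth = set(true_edges)
    true_positives = len(estimated & truth)
    false_positives = len(estimated - truth)
    # FDR: fraction of declared edges that are wrong (0 if nothing is declared).
    fdr = false_positives / len(estimated) if estimated else 0.0
    # TPR: fraction of true edges that are recovered.
    tpr = true_positives / len(truth) if truth else 0.0
    return fdr, tpr
```

Averaging the two returned values over repeated simulations gives the kind of mean and standard deviation summarized per time series length.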

True positive rates of the estimated Granger causalities in NVAR, L1 NVAR, and L2 NVAR under a false discovery rate of 5% are compared in Figure 2(a). Since the false discovery rate in L2 NVAR is completely out of control, we also calculate the true positive rates obtained by controlling the true false discovery rate within 5%, shown in Figure 2(b). According to the results in Figure 2(a), L1 NVAR outperforms NVAR for time series data of length less than 75. On the other hand, NVAR gives slightly better performance than L1 NVAR for long time series data. However, L1 NVAR is designed for the estimation of Granger causality from insufficient time series data, so this is not a concern. L2 NVAR seems to give the best performance among the three methods in Figure 2(a), but when the true false discovery rate is controlled within 5%, L1 NVAR gives the best performance among them. Therefore, we conclude that L1 NVAR is the best option among them for estimating nonlinear and nonmonotonic gene regulatory networks from short time series data.

3.2. Application of expression data of human hela cell

We apply the proposed method to the time series gene expression data of the human HeLa cell [24]. The 48 time points for the 94 genes selected by [8] are used in our study. The number of B-splines is set to four, that is, the number of parameters is approximately eight times the length of the data. Figure 3 shows the estimated gene regulatory network under a false discovery rate of 5%. In the following, the significantly estimated Granger causalities are compared with biologically reported facts:

• Transcription factor NF-κB is known to work as the central mediator of the human immune response [20]. ICAM-1, cyclin D1, A20, and IAP are reported to be target genes of NF-κB [20]. In our estimated network, IAP is estimated as a Granger cause of NF-κB, cyclin D1, A20, and ICAM-1. PKR is known to activate NF-κB. In the estimated network, PKR and NF-κB are connected through their Granger-causal relationships with IAP.
• Bcl-2 is known to inhibit PERP-induced cell death [2]. This regulation coincides with the Granger causality Bcl-2 → PERP in the estimated network.
• E2F1 is a transcription factor known to regulate the transcription of cyclin E1 [10]. The estimated Granger causality cyclin E1 → E2F1 is opposite to the biologically known fact.
• Puma is known to bind Mcl-1 to maintain the expression level of Mcl-1 [6]. This coincides with the estimated Granger causality Puma → Mcl-1.
• In colon cancer cells, overexpression of E2F-1 is reported to down-regulate Mcl-1, up-regulate c-myc, and induce apoptosis [5]. In HeLa cells, this mechanism may be different, but interestingly, in the estimated network there is a completely opposite Granger causality path, c-myc → Mcl-1 → PUMA → E2F-1.
• Fas is a well-known target gene of P53. On the other hand, Fas is a Granger cause of P53 in the estimated network.
• P21 is estimated to have a self loop. This self loop is also detected by NVAR based on OLS and verified [9].

[Figure 2: two plots, panels (a) and (b); horizontal axis: time series length.]

Fig. 2. True positive rates for estimated Granger causalities in NVAR, L1 NVAR, and L2 NVAR. (a) True positive rates under false discovery control 5 %. (b) True positive rates obtained by controlling true false discovery rate within 5 %.


O. Our implementation makes use of LU decomposition and back substitution routines [14] instead of Gaussian elimination, since this is about three times faster and more numerically stable with respect to round-off errors [4].
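As a rough illustration of the LU decomposition and back substitution approach mentioned above, the following sketch solves a linear system in plain Python. It is not the routine from [14]; the function names, the in-place storage of both factors in one matrix, and the use of partial pivoting are choices made for this example.

```python
def lu_decompose(A):
    """LU decomposition with partial pivoting.

    Returns (LU, perm): LU holds the unit lower-triangular factor below the
    diagonal and the upper-triangular factor on and above it; perm records
    the row order after pivoting.
    """
    n = len(A)
    LU = [list(row) for row in A]  # work on a copy
    perm = list(range(n))
    for k in range(n):
        # Pick the row with the largest entry in column k to limit round-off.
        pivot = max(range(k, n), key=lambda r: abs(LU[r][k]))
        if LU[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        if pivot != k:
            LU[k], LU[pivot] = LU[pivot], LU[k]
            perm[k], perm[pivot] = perm[pivot], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, perm


def lu_solve(LU, perm, b):
    """Solve A x = b using the factors from lu_decompose."""
    n = len(LU)
    # Forward substitution with the permuted right-hand side (L y = P b).
    y = [0.0] * n
    for i in range(n):
        y[i] = b[perm[i]] - sum(LU[i][j] * y[j] for j in range(i))
    # Back substitution (U x = y).
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(LU[i][j] * x[j] for j in range(i + 1, n))) / LU[i][i]
    return x
```

One practical advantage of this split is that the factors can be computed once and reused for several right-hand sides, each requiring only the two substitution passes.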

R. Wan, A. M. Wheelock & H. Mamitsuka

[Figure 2: two grid diagrams; panel (a) Statistical methods, panel (b) Our graph-based method.]

Fig. 2. A comparison of statistically-based outlier methods against our graph-based one. Each of the two figures represents a microarray data set of replicates where each row is a probe and each column is an experiment. The black square represents the value being evaluated and the gray squares indicate the values used to make the evaluation.

6. Statistical Methods

As a baseline for microarray experiment scoring, statistical methods for one-dimensional data can be applied as usual to each probe. The difference is that no distinction is made between the experiments of R and t. These methods are applied to the combined data set R ∪ t on an expression level-by-expression level basis. Figure 2 illustrates how these statistical methods differ from our framework. In Figure 2(a), the grid represents the unified microarray data set R ∪ t, so that a row is a probe and a column is an experiment. The expression level being evaluated is shaded in black and the values with which it is compared are in gray. Statistical methods treat every experiment the same way and compare each expression level with the replicates within the same probe. In Figure 2(b), our method makes a distinction between R and t, as described earlier. Statistical methods perform a direct comparison, while our framework constructs a graph using the shaded values of R and performs the evaluation using the shaded values of t.

At least three types of statistical methods are at our disposal: (a) comparison against the inter-quartile range (IQR), (b) standardized scores (or Z-scores), and (c) the Q-test. The inter-quartile range is the range from the first to the third quartile. Values outside of this range are considered outliers. The Z-test calculates a standardized score or Z-score for each value $p_{ij}$ against the overall average and standard deviation for all replicates of $p_i$. The Z-score reports the number of standard deviations the expression level is from the mean $\mu_i$:

\[ Z(p_{ij}) = \frac{p_{ij} - \mu_i}{\sigma_i}, \tag{7} \]

where $\sigma_i$ is the standard deviation over all replicates of $p_i$. For both IQR and standardized scores, a cut-off is required to indicate either how many times the IQR or how many standard deviations from $\mu_i$ are accepted before labeling a value as an outlier. A larger cut-off yields a more conservative test. In the natural sciences, the Q-test compares each value to its nearest neighbor and the overall range of values according to some confidence interval (critical values according to a 90% confidence interval are shown in Table 1):

A Framework for Determining Outlying Microarray Experiments


\[ Q(p_{ij}) = \frac{ | p_{ij} - (\text{closest value to } p_{ij}) | }{ \text{range} } \tag{8} \]

Table 1. Critical values for the Q-test for a 90% confidence interval [16, pg. 35].

N     3     4     5     6     7     8     9     10
Qc    0.94  0.76  0.64  0.56  0.51  0.47  0.44  0.41

Table 2. Simulated data sets created using SIMAGE.

Name   Probes   Experiments   Dye-swap   Random noise
V1     11,664   100           Yes        N(0, 0.219)
V2     11,664   100           No         N(0, 0.219)
V3     11,664   10            Yes        N(0, 0.500)
V4     11,664   10            No         N(0, 0.500)
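The three statistical baselines described in this section can be sketched for the replicates of a single probe as follows. Several details are assumptions of this illustration rather than statements from the text: the sample standard deviation is used for the Z-score, the IQR cut-off is applied as a multiple of the IQR beyond the first and third quartiles, quartiles come from Python's default `statistics.quantiles` method, and all function names are hypothetical.

```python
import statistics

# Critical values of the Q-test at 90% confidence, taken from Table 1.
Q_CRIT_90 = {3: 0.94, 4: 0.76, 5: 0.64, 6: 0.56, 7: 0.51, 8: 0.47, 9: 0.44, 10: 0.41}


def z_scores(values):
    """Z-score of each replicate, as in Equation (7) (sample standard deviation)."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]


def iqr_outliers(values, cutoff=1.5):
    """Flag values lying more than cutoff * IQR outside the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v < q1 - cutoff * spread or v > q3 + cutoff * spread for v in values]


def q_statistic(values, index):
    """Q statistic of Equation (8) for the value at position index."""
    others = values[:index] + values[index + 1:]
    gap = min(abs(values[index] - other) for other in others)
    return gap / (max(values) - min(values))


def q_test_outlier(values, index):
    """True if the value at index exceeds the 90% critical value for N = len(values)."""
    return q_statistic(values, index) > Q_CRIT_90[len(values)]
```

Because the critical values in Table 1 stop at N = 10, `q_test_outlier` in this sketch only accepts between three and ten replicates.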

7. Experiment Results

Both the statistical methods in the previous section and our framework were applied to simulated microarray data sets.

7.1. Simulated Microarray Data

We employed simulated microarray data to give us better control over our experiments. Several researchers have looked into creating simulated microarray data which are still "real" since they model real microarray data sets [1, 13]. The SIMAGE system is a publicly available web server (a) which models various aspects of microarray data in a controlled way, including effects from spot pins, channels, and replication [1]. Four data sets were constructed using SIMAGE, as summarized in Table 2. SIMAGE has default parameters that were chosen through the modeling of a data set of 23 experiments [1]. These default values, which were left unchanged throughout our work, are not shown in this table. Every data set consists of 11,664 probes and either 100 or 10 experiments. Two data sets were dye-swapped (V1 and V3) and two were not (V2 and V4). As SIMAGE simulates real microarray data, the default parameters already introduce noise into the microarray data as a Gaussian distribution of N(0, 0.219). The first two data sets contained this level of noise; the remaining two have a larger standard deviation of 0.500.

Therefore, two sets of experiments are conducted. In the first set, we used simulated dye-swapped data and formed G using all of V1, and then applied the graph to the first 10 experiments of V1 and V3, where the ones in V3 are known to have more noise. In the second scenario, non-dye-swapped data is considered: V2 is used to form G, and it is applied to the first 10 experiments in V2 and V4.

(a) URL: http://bioinformatics.biol.rug.nl/websoftware/simage/

R. Wan, A. M. Wheelock & H. Mamitsuka

[Figure 3 (diagram residue): surviving labels are R, "Percentage of outlying probes (initial)" and "Percentage of outlying probes (final)".]

Fig. 3. The framework for assessing our graph-based method.

7.2. Framework of Experiments

The framework of our experiments encompasses both the statistical tests and the use of our graph-based method. For the statistical tests, we combined 9 of the experiments from R with a single experiment known to have more noise, which acts as t, since critical values for the Q-test are available for only up to 10 values (see Table 1). The aim is to determine how well statistical methods can isolate t.

For our graph-based method, we evaluate outlier detection and probe cleaning together using the framework shown in Figure 3. The repository data R is used to construct a graph G by selecting a value for dt. The graph is applied to t and the percentage of outlying probes is reported as the "initial" percentage using a fixed value for et. Afterwards, the probes are cleaned using the same graph structure. Next, the "final" percentage of outlying probes is reported using the same value for et. In addition, the first application of outlier detection is done for the first 10 experiments in R and averaged to act as a baseline.

The aim of our framework is to demonstrate the usefulness of our graph-based method in comparison to better-established statistical methods. To unify the comparison, the statistical methods also report the percentage of probes that they deemed to be outliers. The baseline for the statistical methods is the average percentage across the 9 experiments from R. This is compared to the single percentage obtained from evaluating the probes of the test set t.
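The statistical side of this framework can be sketched in a few lines of Python. The sketch below computes the Q statistic of Equation (8) for the test experiment's value of each probe and reports the percentage of probes flagged. The function names, the toy probe values and the critical value of 0.5 are all illustrative assumptions; actual critical values would come from a Q-test table such as the Table 1 referred to above.

```python
def q_statistic(values, index):
    """Q statistic of values[index] as in Eq. (8):
    gap to the closest other value, divided by the range of all values."""
    values = list(values)
    if len(values) < 3:
        raise ValueError("the Q statistic needs at least three values")
    value_range = max(values) - min(values)
    if value_range == 0:
        return 0.0  # all values identical, so nothing can stand out
    target = values[index]
    others = values[:index] + values[index + 1:]
    gap = min(abs(target - other) for other in others)
    return gap / value_range


def percent_outlying(probes, test_index, critical_value):
    """Percentage of probes whose value in experiment `test_index`
    has a Q statistic above the critical value."""
    flagged = sum(
        1 for values in probes
        if q_statistic(values, test_index) > critical_value
    )
    return 100.0 * flagged / len(probes)


# Two hypothetical probes: 9 repository experiments plus the test
# experiment t as the last value of each list.
probes = [
    [0.10, 0.12, 0.11, 0.13, 0.09, 0.10, 0.12, 0.11, 0.10, 0.90],
    [0.10, 0.12, 0.11, 0.13, 0.09, 0.10, 0.12, 0.11, 0.10, 0.11],
]
CRITICAL_Q = 0.5  # placeholder, not a tabulated critical value

print(percent_outlying(probes, test_index=9, critical_value=CRITICAL_Q))
```

Only the first probe is flagged here, because its test value lies far from the nine repository values; the second probe's test value coincides with an existing measurement and yields Q = 0.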

7.3. Results

The results from our experiments are summarized in the graphs of Figure 4 for both simulated dye-swapped and non-dye-swapped data sets. Beginning with the dye-swapped data sets, Figure 4(a) and Figure 4(b) present the results for statistical methods and methods based on our framework. In both figures, the vertical axes indicate the percentage of probes that are marked as outliers. Along the horizontal axes is the parameter relevant to the method. Beginning with the statistical methods in Figure 4(a), it would seem that the IQR test performs better than the Z-test as there is a clear separation between the two graphs for the baseline and the test set. As expected, for both methods, the number of probes identified as outliers decreases as the parameter increases for

A Framework for Determining Outlying Microarray Experiments

[Figure 4 (plot residue; caption not recoverable). Panel groups: dye-swapped data (V1 and V3) and non-dye-swapped data (V2 and V4). Panel (a) Statistical methods: horizontal axis "Parameter"; curves for the IQR test, Z-test and Q-test, each for the baseline (averaged) and the test set. Panel (b) Graph-based methods: horizontal axis "Expression threshold"; curves for the baseline, the initial test set and the final test set, each at 3% and 10%.]

[...] = 0 (data not shown). Even for a small sample size of Nsize = 43, applying both methods enriches the detection of correlations compared with using only r^PC.

[Table residue, partly recoverable: r^PC > 0.545 with large r^IKras: linear (Fig. 6); r^PC > 0.545 with r^IKras < 0.656: linear, small r^IKras (Fig. 7).]

[...] r^IKras > 0.656, these plots exhibit

[Figure 6 (plot residue): scatter plots of metabolite pairs. Recoverable panel labels: r^PC = 0.63, r^IKras = 0.70; r^PC = 0.69, r^IKras = 0.68. Recoverable axis labels: leucine, glucose 6-phosphate, galactinol, fucose.]

Figure 6: Examples where both correlation coefficients, r^PC and r^IKras, indicate significant correlation (r^PC > 0.545, r^IKras > 0.656).

[Figure 7 (plot residue): scatter plots of metabolite pairs. Recoverable panel labels: r^PC = 0.67, r^IKras = 0.58; r^PC = 0.64, r^IKras = 0.56; r^PC = 0.66, r^IKras = 0.60; r^PC = 0.59, r^IKras = 0.26. Recoverable axis labels: succinic acid, threonic acid, glyceric acid, 2,4 hydroxybutyric acid.]

Figure 7: Examples for the case where only the Pearson coefficient was significant (r^PC > 0.545) but the nonlinear coefficient was not (r^IKras < 0.656).

J. Numata, O. Ebenhöh & E.-W. Knapp

[Figure 8 (plot residue): scatter plots of metabolite pairs. Recoverable panel labels: r^PC = 0.39, r^IKras = 0.34; r^PC = 0.10, r^IKras = 0.10; r^PC = -0.06, r^IKras = -0.41; r^PC = 0.17, r^IKras = 0.25. Recoverable axis labels: glucose 6-phosphate, citric acid, cellobiose, xylitol.]

Figure 8: The above examples show likely uncorrelated pairs of metabolites, where the limited number of data points does not allow a clearer classification.

Figs. 4 and 5 present correlations which were only detectable as significant by the mutual information coefficient r^IKras, but invisible to the Pearson correlation coefficient r^PC. In Fig. 4, the reason is the presence of outliers. The examples in Fig. 5, in addition to correlation, also present differences among plant lines, which cluster in different concentration regimes. Three of the plots in Fig. 5 involve cellobiose, which in another study [16] using a larger data set was found to be the largest contributor to phenotypic variations. The metabolic data analyzed in the present study are a subset of these data. In particular, the metabolic data with cellobiose are in an experimentally trustworthy concentration regime, where correlations are likely not caused by experimental error.

Fig. 6 shows examples where both correlation coefficients, r^IKras and r^PC, adopt values that indicate significant correlation. The first plot corresponds to a large correlation found in a variety of studies [1]. The metabolites glucose 6-phosphate and fructose 6-phosphate are directly connected in the biochemical network by the enzyme EC 5.3.1.9 [17]. In the second plot, both metabolites are hydrophobic amino acids. In the third and fourth plots, however, the chemical nature of the metabolites is seemingly unrelated.

In Fig. 7, we illustrate metabolite pairs where only the Pearson correlation coefficient, r^PC, points to significant correlation. Most such cases yield an intermediate value for r^IKras, in the "gray area" that does not allow clear discrimination. The last plot in Fig. 7 is a rare example where r^IKras is particularly small.

Lastly, Fig. 8 shows either uncorrelated cases, or cases where the coefficients were not able to detect correlation reliably. The second plot shows two chemically related metabolites, which nevertheless show no correlation. The third plot shows a separate cluster for plant line Col-0, but no correlation. In the last plot some correlation seems to appear, but the correlation coefficients are too small to be significant.

4. Conclusion

There are two major advantages in using the mutual information coefficient. The first is the discovery of additional correlations invisible to the Pearson coefficient, frequently because of the presence of outliers (see Fig. 4). The second is the detection of correlation even if plant lines cluster in different concentration ranges. Although a cluster analysis would be able to detect these differences in concentration regimes, the present method allows concurrent detection of correlation. For example, cellobiose displays a consistently lower concentration range when compared to galactinol for plant line Col-0, but not for the other three plant lines. Simultaneously, the two metabolites were found to be correlated by the mutual information coefficient, but not by the Pearson coefficient (Fig. 5). In this work, the emphasis was on discovering few but highly significant correlations, with a small risk of false classifications even for a small sample size of Nsize = 43. However, it should be noted that larger sample sizes of a few hundred data points would also allow smaller correlations to be detected.

Measuring Correlations in Metabolomic Networks
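To make the comparison between the two coefficients concrete, the following Python sketch maps a mutual information value I (in nats) onto a correlation-like scale. The definition used in this paper lies outside the present section, so the form sqrt(1 - exp(-2I)) is an assumption here, borrowed from the generalized correlation coefficient of [14]; clamping negative estimates to zero is a further assumption, and both function names are hypothetical. For bivariate Gaussian data, where I = -0.5 ln(1 - r^2), the mapped value coincides with the absolute Pearson coefficient, which is what makes thresholds on the two scales comparable.

```python
import math


def gaussian_mutual_information(pearson_r):
    """Mutual information (nats) of a bivariate Gaussian with correlation r."""
    return -0.5 * math.log(1.0 - pearson_r ** 2)


def mutual_information_coefficient(mi_nats):
    """Map mutual information (nats) to [0, 1) via sqrt(1 - exp(-2 I)).

    Assumed form; negative estimates are clamped to zero in this sketch.
    """
    return math.sqrt(1.0 - math.exp(-2.0 * max(mi_nats, 0.0)))


# For Gaussian data the mapping recovers |r|, e.g. at the Pearson
# significance threshold used in the text.
r = 0.545
print(mutual_information_coefficient(gaussian_mutual_information(r)))
```

For non-Gaussian dependencies, such as clustered plant lines or nonlinear relationships, a mutual information estimate can remain large while the Pearson coefficient is small, which is the situation illustrated by Figs. 4 and 5.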

Acknowledgments

This work was supported by the International Research Training Group "Genomics and Systems Biology of Molecular Networks" (GRK1360 of the DFG). We would like to thank Dr. Matthias Steinfath and Dr. Jan Lisec for useful discussions and for sharing their experimental data [16].

References

[1] Steuer, R., On the analysis and interpretation of correlations in metabolomic data. Briefings in Bioinformatics. 7(2): 151-158, 2006.
[2] Camacho, D., A.d.l. Fuente, and P. Mendes, The origin of correlations in metabolomics data. Metabolomics. 1(1): 53-63, 2005.
[3] Müller-Linow, M., W. Weckwerth, and M.-T. Hütt, Consistency analysis of metabolic correlation networks. BMC Systems Biology. 1(44), 2007.
[4] Steuer, R., et al., Observing and interpreting correlations in metabolomic networks. Bioinformatics. 19(8): 1019-1026, 2003.
[5] Kraskov, A., H. Stögbauer, and P. Grassberger, Estimating mutual information. Phys. Rev. E. 69: 066138, 2004.
[6] Hnizdo, V., et al., Nearest neighbor estimates of entropy. American J of Math and Manag Sciences. 23: 301-321, 2003.
[7] Hnizdo, V., et al., Nearest-Neighbor Nonparametric Method for Estimating the Configurational Entropy of Complex Molecules. J Comput Chem. 28(3): 655-668, 2007.
[8] Cover, T.M. and J.A. Thomas, Elements of Information Theory. 2nd ed. Wiley Series in Telecommunications, ed. D.L. Schilling. 2006.
[9] Steinfath, M., et al., Metabolite profile analysis: from raw data to regression and classification. Physiologia Plantarum. 132: 150-161, 2008.
[10] Numata, J., M. Wan, and E.W. Knapp, Conformational Entropy of Biomolecules: Beyond the Quasi-Harmonic Approximation. Genome Informatics. 18: 192, 2007.
[11] Steuer, R., et al., The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics. 18 Suppl. 2: S231-S240, 2002.
[12] Matsuda, H., Physical nature of higher-order mutual information: Intrinsic correlations and frustration. Phys Rev E. 3: 3096-3102, 2000.
[13] Dionisio, A., R. Menezes, and D.A. Mendes, Mutual information: a measure of dependency for nonlinear time series. Physica A: Statistical Mechanics and its Applications. 344(1-2): 326-329, 2004.
[14] Lange, O.F. and H. Grubmüller, Generalized Correlation for Biomolecular Dynamics. Proteins: Structure, Function, and Bioinformatics. 62: 1053-1061, 2006.
[15] Storey, J.D., The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value. The Annals of Statistics. 31(6): 2013-2035, 2003.
[16] Lisec, J., et al., Identification of metabolic and biomass QTL in Arabidopsis thaliana in a parallel analysis of RIL and IL populations. The Plant Journal. 53: 960-972, 2008.
[17] Mueller, L.A., P. Zhang, and S.Y. Rhee, AraCyc: A Biochemical Pathway Database for Arabidopsis. Plant Physiology. 132: 453-460, 2003.

OPTIMALITY CRITERIA FOR THE PREDICTION OF METABOLIC FLUXES IN YEAST MUTANTS

EVAN S. SNITKIN 1

DANIEL SEGRÈ 1,2

1 Graduate Program in Bioinformatics, Boston University, 44 Cummington St., Boston, Massachusetts, 02215, USA
2 Departments of Biology and Biomedical Engineering, Boston University, 24 Cummington St., Boston, Massachusetts, 02215, USA

Constraint-based models of cellular metabolism, such as flux balance analysis (FBA), use convex analysis and optimization to study metabolic networks at a genome scale. The availability of reaction lists for numerous organisms, along with a variety of network analysis and optimization tools, is making these approaches increasingly popular for metabolic engineering and biomedical applications, as well as for addressing fundamental biological questions. It is therefore very important to assess the predictive capacity of these models and to understand how to interpret them in a biologically relevant manner. Typically, model assessment is limited to gauging the ability to predict phenotypes, such as viability under different environmental and genetic conditions. These types of assessments, for the most part, focus only on the growth phenotype of the cells, but ignore the underlying flux predictions. While this may be sufficient for certain types of study, the question of whether flux balance models can reliably predict intracellular and transport fluxes is crucial for more detailed analysis, and remains largely unanswered. Here we compare FBA model predictions of yeast metabolic fluxes to a previously published set of experimentally determined fluxes for 13 different single gene deletion mutants across a variety of possible objective functions. We find that the specific optimization criteria used to determine fluxes have a significant impact on the accuracy of the predicted fluxes. Interestingly, while different optimization methods provide very different levels of agreement relative to experimental fluxes, they tend to provide similar predictions with respect to the effect of the perturbation on growth. This demonstrates that assessment of models at the level of flux predictions is a critical step in assessing the biological validity of different models and optimization criteria.

Keywords: flux balance analysis; gene deletion; optimality criteria; flux measurements

1. Introduction

A century of detailed biochemical studies, in conjunction with the genomic revolution, has culminated in the release of metabolic reconstructions for a number of model organisms. These metabolic reconstructions comprise the stoichiometries of all known enzymatic reactions in a given organism. In addition to enabling the study of metabolic networks in diverse organisms [19], these reconstructions have yielded the ability to create genome-scale predictive models by using the steady state framework of flux balance analysis [12]. Flux balance models have been released for a number of bacterial organisms such as E. coli [7] and H. pylori [14], and more recently also for the eukaryotes yeast [9] and human [5]. With the ability to generate models largely from sequence data, it should be expected that the pace of model development will only increase in the coming months and years.


Along with the increase in model availability has come a widening of the spectrum of reported applications of flux balance models. Recent work has demonstrated the use of flux balance models to address cutting edge research questions ranging from understanding the dynamics of microbial communities [17] to predicting perturbations required to fulfill complex metabolic engineering objectives [2]. These various applications of flux balance models often require different levels of predictive abilities from the models. For instance, for some applications, being able to accurately capture the range of possible metabolic behaviors of an organism is sufficient [3], while for others the ability to predict the precise metabolic state resulting from specific perturbations is required [2]. Given that different model applications may require different levels of predictive proficiency, it is important to be able to evaluate the appropriateness of models for addressing different research questions. A common method for evaluating models is to quantify their ability to predict the effects of environmental and genetic perturbations on growth rate. The attractiveness of this approach for model evaluation largely stems from the availability of high-throughput growth phenotype data for many organisms, in addition to the ease with which the effects of environmental and genetic perturbations on growth can be determined using these models. While such assessments evaluate model behavior in response to diverse perturbations, the assessments are typically limited to growth phenotype. An open question is how a model's ability to predict the growth phenotypes under a variety of conditions translates into its ability to predict the fluxes underlying the growth predictions. 
Here, we utilized a compendium of experimentally determined fluxes for yeast single gene deletion mutants [1] to gain insight into the ability of yeast flux balance models to predict central carbon metabolic fluxes in response to perturbations. In addition to assessing the relationship between predictions of growth phenotypes and predictions of the underlying fluxes, we also compared the ability of different objective functions to predict the metabolic response to genetic perturbations. Through this analysis we hoped not only to assess the predictive abilities of flux balance models at the level of flux predictions, but also to understand what drives the metabolic response to genetic perturbations. Our results support previous studies which suggested that the metabolic response to genetic perturbations is best described as a minimal rerouting of fluxes around the perturbation. Despite the clear superiority of an objective function implementing minimal flux rerouting to predict mutant fluxes, all tested objective functions correctly predicted the growth phenotype for all 13 mutants considered. This suggests that correct predictions of growth phenotype do not necessarily imply an accurate prediction of the underlying fluxes.

2. Methods

2.1. Experimental flux data

All experimentally measured fluxes and uptake/secretion rates were taken from the supplementary material of the 2005 manuscript by Blank et al. [1]. Among the 38 single gene deletion mutants for which fluxes were measured, we focused on 13 for which the deleted gene did not have any duplicates. The reason for this is that gene duplicates are implemented in a trivial manner in flux balance models, unless regulation is explicitly taken into account. In a typical flux balance calculation, duplicate genes completely back one another up under all conditions.

2.2. Flux Balance Analysis

Flux balance analysis is a linear constraint-based modeling approach which has been described in detail elsewhere [6]. Briefly, flux balance analysis consists of two critical steps: (1) the imposition of linear constraints on fluxes, stemming from the assumption of steady state, and (2) an optimization step by which a particular set of fluxes fulfilling the given constraints is selected. These linear constraints limit the feasible flux solutions to those which result in no net production or consumption of any metabolite. These steady state constraints can be described by the nullspace of the m x n stoichiometric matrix S. The columns of S represent the n reactions, and its rows the m different metabolites. An entry $S_{ij}$ represents the stoichiometric coefficient of metabolite i in reaction j. In addition to the steady state constraints, additional linear constraints are imposed to set upper and lower bounds on individual fluxes ($a_j \le v_j \le b_j$). These constraints can be applied to fix maintenance requirements, restrict reversibility of reactions and set limits on nutrient uptake rates. The previously released iLL672 yeast metabolic reconstruction was used for all analyses [13]. Constraints on uptake rates were imposed to mimic the minimal glucose conditions under which the experimental fluxes were determined. Gene deletions were implemented in the model by setting the flux to zero for all reactions requiring the protein product of the deleted gene.

2.3. Objective functions to predict mutant fluxes

While the imposition of the linear constraints mentioned above restricts the space of possible metabolic behaviors, there are still potentially an infinite number of flux states which can fulfill the given constraints. To select a particular flux state, which can in turn be compared to the experimentally measured fluxes, one typically maximizes or minimizes a linear combination of fluxes, based on a biologically relevant criterion. Here we evaluated the flux predictions made using several different criteria. A summary of the different objective functions and the motivation for testing them can be found in Table 1.
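The constraint side of flux balance analysis can be illustrated with a toy network in Python. The sketch below is not the iLL672 model: the two-metabolite network, the reaction names and the bounds are invented for illustration, and no optimizer is included. It only checks whether a candidate flux vector satisfies the steady state condition S v = 0 and the flux bounds, and shows a deletion implemented by fixing the bounds of the affected reaction to zero.

```python
REACTIONS = ["uptake", "route_1", "route_2", "growth"]

# Hypothetical stoichiometric matrix: rows are metabolites (A, B),
# columns follow REACTIONS.
S = [
    [1, -1, -1, 0],   # A: made by uptake, consumed by both routes
    [0, 1, 1, -1],    # B: made by both routes, consumed by growth
]

# Hypothetical lower and upper bounds (a_j, b_j) on each flux.
BOUNDS = {
    "uptake": (0.0, 10.0),
    "route_1": (0.0, 10.0),
    "route_2": (0.0, 4.0),
    "growth": (0.0, 1000.0),
}


def is_steady_state(stoichiometry, fluxes, tol=1e-9):
    """True if no metabolite is net produced or consumed (S v = 0)."""
    return all(
        abs(sum(coeff * v for coeff, v in zip(row, fluxes))) <= tol
        for row in stoichiometry
    )


def within_bounds(fluxes, bounds):
    """True if every flux respects its lower and upper bound."""
    return all(
        bounds[name][0] <= v <= bounds[name][1]
        for name, v in zip(REACTIONS, fluxes)
    )


def knock_out(bounds, deleted_reactions):
    """Return new bounds with the deleted reactions fixed to zero flux."""
    new_bounds = dict(bounds)
    for name in deleted_reactions:
        new_bounds[name] = (0.0, 0.0)
    return new_bounds


def is_feasible(fluxes, bounds):
    return is_steady_state(S, fluxes) and within_bounds(fluxes, bounds)


wild_type = [10.0, 10.0, 0.0, 10.0]
ko_bounds = knock_out(BOUNDS, ["route_1"])
mutant = [4.0, 0.0, 4.0, 4.0]   # flux rerouted through route_2

print(is_feasible(wild_type, BOUNDS))     # wild-type state, no deletion
print(is_feasible(wild_type, ko_bounds))  # same state after the deletion
print(is_feasible(mutant, ko_bounds))     # rerouted state after the deletion
```

In a real analysis a linear or quadratic programming solver would search the feasible set defined this way for the flux vector that optimizes the chosen objective.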

3. Results

3.1. Experimentally determined fluxes

To evaluate the relative abilities of different objective functions to accurately predict the metabolic flux response to genetic perturbations, we utilized the aforementioned compendium of experimentally determined fluxes for S. cerevisiae single gene deletion mutants [1]. The mutants analyzed by Blank et al. were selected on the basis that the deleted genes encoded enzymes which catalyzed reactions that were active under minimal glucose conditions, but were not essential to growth. In other words, these genes encoded enzymes in flexible reactions, such that by observing how the metabolic network responds to their deletion, insight could be gained into the metabolic basis for the robustness to gene deletions that has been previously observed in yeast metabolism [1, 4].

Despite the fact that the set of mutants analyzed by Blank et al. targeted genes in various central carbon metabolic processes, the nature of the metabolic flux responses was largely similar. Specifically, it was observed that for most mutants, the metabolic response was a local rerouting of flux around the perturbed reaction, with the relative flux through other pathways remaining similar to the wildtype. The exceptions to this rule were mutants in reactions critical to redox metabolism, where more distant rerouting was observed. An important caveat to the observed similarity in the flux distributions of the different mutants is that the absolute flux of carbon varied greatly. This aspect of the deletion mutant response is demonstrated in Fig. 1, where the glucose uptake and biomass production for the 13 mutants analyzed in the current study are shown. It can be seen that although the efficiency with which carbon is utilized is largely similar across different mutants, the growth rates vary greatly.

[Figure 1 (plot residue): scatter plot of physiological fitness against glucose uptake rate (mmol/g/h). Recoverable point labels: WT, LSC1, MAE1, CTP1, SFC1, GLY1, GCV2, PCK1, OAC1, SDH1, FUM1, PDA1, RPE1.]

Fig. 1. Experimentally determined glucose uptake rates and fitness for strains analyzed in current study. Glucose uptake rates were plotted against the physiological fitness for the 13 mutants analyzed in the current study, along with the wildtype. Each point represents an individual strain, which is labeled with the gene which was deleted, or with WT if no gene was deleted. Physiological fitness was computed by normalizing a strain's growth rate by that of the wildtype. The wide range of glucose uptake rates indicates variation in the absolute metabolic flux carried in the different mutants. On the other hand, the strong correlation between glucose uptake rate and physiological fitness suggests that the glucose is largely being used in a similar manner across the different mutants.


3.2. Objective functions used to predict mutant fluxes

Our assessment of the ability of yeast flux balance models to predict fluxes in single gene deletion mutants included the evaluation of a set of 9 different objective functions (see Table 1). These 9 objective functions fall into four categories: growth maximization, minimization of metabolic adjustment, experimentally motivated, and alternate maximization criteria.

Table 1. Objective functions used to determine mutant fluxes.

Optimization Method | Primary Optimization Function | Additional Notes
FBA_MIN_AV | $\max v^{KO}_{growth}$ | A secondary optimization was performed to minimize the sum of the absolute values of the fluxes.
FBA_WT_MIN_DIST | $\max v^{KO}_{growth}$ | A secondary optimization was performed to minimize the distance from an experimentally constrained WT solution.
MOMA_LP | $\min \sum_{i=1}^{m} \lvert v_i^{KO} - v_i^{WT} \rvert$ | LP refers to the use of linear programming to minimize the Manhattan distance.
MOMA_QP | $\min \sum_{i=1}^{m} (v_i^{KO} - v_i^{WT})^2$ | QP refers to the use of quadratic programming to minimize Euclidean distance.
MOMA_LP_WT_CONSTR | $\min \sum_{i=1}^{m} \lvert v_i^{KO} - v_i^{WT\_EXP} \rvert$ | The experimentally constrained WT solution was computed minimizing the sum of fluxes, given the experimental constraints [13].
MOMA_QP_WT_CONSTR | $\min \sum_{i=1}^{m} (v_i^{KO} - v_i^{WT\_EXP})^2$ |
MOMA_LP_GLC_UP_NORM | $\min \sum_{i=1}^{m} \left\lvert \frac{v_i^{KO}}{v_{GLC}^{KO}} - \frac{v_i^{WT}}{v_{GLC}^{WT}} \right\rvert$ |
MOMA_LP_BM_SINK | $\min \sum_{i=1}^{m} \lvert v_i^{KO} - v_i^{WT} \rvert$ | During the optimization sink reactions were created for each biomass component.
FBA_MAX_ETOH | $\max v^{KO}_{EtOH}$ | For both primary and secondary optimizations biomass was fixed to the experimental value determined for the given mutant.

Abbreviations: WT = Wildtype, KO = Knock Out, LP = Linear Programming, QP = Quadratic Programming, BM = Biomass, GLC = Glucose, EXP = Experimental

3.2.1. Growth maximization

This set consisted of two objective functions, both of which select flux solutions that maximize biomass production. The two objective functions differ in their secondary objective functions, which are used to select among the set of alternative flux solutions that all result in optimal biomass production. The first, FBA_MIN_AV, performs a secondary optimization which finds the flux distribution that produces the optimal biomass and has the minimal sum of the absolute values of fluxes through all reactions. The hypothesis underlying this approach is that yeast will attempt to achieve maximal growth at a minimal expense in terms of enzyme usage [10, 15]. The second objective function, FBA_WT_MIN_DIST, performs a secondary optimization which finds the set of fluxes that produces the optimal biomass and has the minimal Manhattan distance from an experimentally constrained wildtype solution. The motivation for this secondary objective was the aforementioned observation that the distribution of flux in deletion mutants is overall very similar to the wildtype.

3.2.2. Minimization of metabolic adjustment

This set consisted of four objective functions, all of which minimize the distance from a wildtype flux solution, given the additional constraint of the gene deletion [16]. These objectives differ in the distance metric used and the wildtype flux solution to which the distance was minimized. The distance metrics were Manhattan (MOMA_LP and MOMA_LP_WT_CONSTR) and Euclidean (MOMA_QP and MOMA_QP_WT_CONSTR) distances, both of which have been used in previous applications of the minimization of metabolic adjustment criterion [13, 16]. The wildtype flux distributions differed in that one uses experimental flux data to constrain the solution space (MOMA_LP_WT_CONSTR and MOMA_QP_WT_CONSTR), and the other does not (MOMA_LP and MOMA_QP).

3.2.3. Experimentally motivated

Both of the experimentally motivated objective functions are derivatives of minimization of metabolic adjustment, but with additions which were motivated by some of the observations made by Blank et al. [1], and others [8], in the analysis of fluxes in genetic mutants. MOMA_GLC_NORM used an experimentally constrained wildtype solution as above, but minimized the distance between fluxes normalized by the glucose uptake rate (see Table 1). The motivation for MOMA_GLC_NORM was the observed variation in the absolute flux among the different deletion mutants. The second objective is MOMA_BM_SINK, which minimized the Manhattan distance from an experimentally constrained wildtype solution as above, but included sink reactions for all biomass components. The motivation for MOMA_BM_SINK was to alleviate constraints on maintaining wildtype growth, when minimizing distance to the wildtype flux solution.

3.2.4. Alternate maximization criteria

The only objective function in this category maximized ethanol production in the mutant, given that biomass production was fixed to the experimentally observed value. The FBA_MAX_ETOH objective was motivated by the well-known phenomenon whereby yeast preferentially ferments glucose, although it can be more efficiently broken down through oxidative phosphorylation [11]. Some have theorized that this aspect of yeast metabolism is a result of a selective advantage in maximizing ethanol production, so as to create a poor environment for potential competitors [18].
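The distance measures behind the minimization of metabolic adjustment objectives can be written down directly. The Python sketch below evaluates, for two given flux vectors, the Manhattan distance (the quantity minimized by the LP variants), the Euclidean distance (the QP variants) and a Manhattan distance between fluxes normalized by glucose uptake. The flux values and function names are invented for illustration; the actual objectives minimize these quantities over all feasible mutant flux states with a solver, which is not shown here.

```python
import math


def manhattan_distance(v_ko, v_wt):
    """Sum of absolute flux differences."""
    return sum(abs(ko - wt) for ko, wt in zip(v_ko, v_wt))


def euclidean_distance(v_ko, v_wt):
    """Square root of the summed squared flux differences."""
    return math.sqrt(sum((ko - wt) ** 2 for ko, wt in zip(v_ko, v_wt)))


def glucose_normalized_distance(v_ko, v_wt, glc_ko, glc_wt):
    """Manhattan distance between fluxes scaled by glucose uptake."""
    return sum(abs(ko / glc_ko - wt / glc_wt) for ko, wt in zip(v_ko, v_wt))


# Hypothetical fluxes; the first entry is the glucose uptake rate.
v_wt = [10.0, 6.0, 4.0, 1.0]
v_ko = [5.0, 3.0, 2.0, 0.5]   # same flux distribution at half the uptake

print(manhattan_distance(v_ko, v_wt))
print(euclidean_distance(v_ko, v_wt))
print(glucose_normalized_distance(v_ko, v_wt, v_ko[0], v_wt[0]))
```

In this example the mutant carries the same relative flux distribution as the wildtype at half the absolute flux, so the unnormalized distances are large while the glucose-normalized distance vanishes. This is the situation that motivated the glucose-normalized objective.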


3.3. Correlations between experimental and predicted fluxes

Initial evaluation of the different objective functions was done by computing the Spearman Rank correlation between predicted fluxes and 36 experimental flux measurements. These 36 fluxes, which consist of fluxes through central carbon metabolism along with uptake/secretion rates, were selected for correlation analysis because they represent a set of linearly independent variables in the genome scale yeast model used. The results of the correlation analysis are shown in Fig. 2 for four optimization methods, which were found to be representative of the nine evaluated. For all 13 mutants tested, the objective functions which computed minimal distance from an experimentally determined wildtype solution achieved the best correlations. The performance of this set of methods was largely unaffected by the choice of distance metric (Manhattan or Euclidean), the addition of sinks for biomass components or by computing distances based on fluxes normalized by glucose uptake rates. On the other hand, the nature of the wildtype reference from which the distance was minimized was found to be very important. Specifically, inferior performance was observed across all mutants when using the method which minimizes the distance from a wildtype solution predicted by assuming maximal biomass production.
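A minimal Python sketch of the Spearman rank correlation used for this comparison is given below. It ranks both flux vectors, assigning tied values their average rank, and then computes the Pearson correlation of the ranks. The function names and example values are illustrative; a statistics library routine would normally be used instead.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical measured and predicted fluxes for one mutant.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [0.5, 4.0, 9.0, 20.0]
print(spearman(measured, predicted))
```

Because only ranks enter the calculation, a prediction that preserves the ordering of the fluxes scores perfectly even when the absolute values differ considerably, as in the example above.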

• •

0.95

• •





• • • I

0.90

0::: .lI::: c: C\'l 0::: セ@

III

• • I III

III

0.85

0.80

E C\'l セ@

en

0.75

0.70

0.65

CTP1

FUM1 GCV2 GLY1

LSCl

MAE1 OACl PCK1

PDAl

Mutants Fig. 2. Spearman eorrelations of predicted fluxes with experimentally determined fluxes. Spearman rank correlation R values were computed between experimentally determined fluxes and the fluxes predicted by each of the 9 objective functions for the 13 different gene deletion mutants. Here, the R values for 4 objective functions are shown, as these 4 were found to be representative of all 9. Specifically, MOMA_LP performed the same as MOMA_QP, while MOMA_LP_WT_CONSTR performed the same as MOMA_QP_WT_CONSTR, MOMA_OLC_NORM, and MOMA_BM_SINK. For virtually all mutants the strongest correlation was achieved using an objective which minimized the distance from an experimentally

130

E. S. Snitkin & D. Segre

constrained wildtype flux solution (black circles). The reference flux solution was critical, as minimizing the distance from a wildtype solution computed with the assumption of optimal growth resulted in a decreased correlation in all mutants (gray triangles). The objective maximizing production of ethanol (gray diamonds), produced fluxes which were least correlated with the experimental measurements. Notably, despite the respirofermentative behavior of yeast in aerobic glucose conditions, maximization of ethanol did a worse job of describing the flux response than maximization of growth (black squares) for a1\ 13 mutants. ACETATE SECRETION ANAPLEROTIC REACTIONS BIOMASS CITRATE CYCLE ETC, COMPLEX II ETC. COMPLEX IV ETHANOL SECRETION GLUCOSE UPTAKE GLYCEROL SECRETION GLYCOLYSIS PENTOSE PHOSPHATE CYCLE SUCCINATE SECRETION

Fig. 3. Normalized difference of fluxes predicted by MOMA_LP_WT_CONSTR from experimental values. Differences were computed between the experimentally determined and model predicted fluxes. Before taking the difference between fluxes, all fluxes were normalized by the glucose uptake rate for the given mutant. In order to make differences comparable for fluxes of different magnitudes, flux differences were then normalized by the range of a given flux across all experimental measurements. Finally, flux differences for reactions in the same metabolic pathway were averaged together to allow for easier interpretation of incorrect flux predictions. Displaying these data in a heatmap, where black represents maximal difference and white minimal difference, reveals that the largest differences between experimental and model predicted fluxes are for the pda1, zwf1 and rpe1 mutants. This fits with the correlation analysis, as these mutants had three of the lowest Spearman R values for the MOMA_LP_WT_CONSTR objective. Looking at the heatmap to identify the processes with the largest differences for these three mutants provides insight into the cause of the low correlations. For pda1, the large difference in succinate secretion is a result of the model failing to predict that the TCA cycle is used to maintain NADH/NAD balance in the absence of the pyruvate dehydrogenase reaction. For rpe1, the model did not capture rerouting present in many pathways. Most of these reroutings stemmed from differential use of the pentose phosphate pathway resulting from the gene deletion. Finally, for zwf1, there is a large increase in the flux through malic enzyme to compensate for the inability to produce NADPH through the pentose phosphate pathway. The increased flux through malic enzyme is associated with an increase in flux through the TCA cycle and the respiratory chain, which is not predicted by the model. In general, these three gene deletions all result in reroutings to maintain redox balance, and the full scope of these reroutings is missed by the model predictions.
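The normalization steps described in the caption of Fig. 3 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dictionary layout, the `glucose_uptake` key, and the use of absolute differences are all assumptions made for the example.

```python
from collections import defaultdict

def pathway_flux_differences(exp, pred, pathway_of, uptake="glucose_uptake"):
    """Hypothetical helper. exp and pred map mutant -> {reaction: flux};
    pathway_of maps reaction -> pathway name."""
    # Step 1: normalize every flux by the glucose uptake rate of its mutant.
    exp_n = {m: {r: v / fl[uptake] for r, v in fl.items()} for m, fl in exp.items()}
    pred_n = {m: {r: v / fl[uptake] for r, v in fl.items()} for m, fl in pred.items()}
    # Step 2: range of each (normalized) flux across all experimental measurements.
    flux_range = {}
    for r in pathway_of:
        values = [exp_n[m][r] for m in exp_n]
        flux_range[r] = max(values) - min(values)
    # Steps 3 and 4: range-normalized difference per reaction, averaged per pathway.
    result = {}
    for m in exp_n:
        per_pathway = defaultdict(list)
        for r, pathway in pathway_of.items():
            if flux_range[r] == 0:
                continue  # flux is constant across experiments; skipped (assumption)
            diff = abs(exp_n[m][r] - pred_n[m][r]) / flux_range[r]
            per_pathway[pathway].append(diff)
        result[m] = {p: sum(d) / len(d) for p, d in per_pathway.items()}
    return result
```

Whether signed or absolute differences were displayed in the heatmap is not stated in the caption; absolute values are used here only to keep the sketch simple.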

Optimality Criteria for the Prediction of Metabolic Fluxes


While the objective function which minimizes distance from an experimentally constrained wildtype solution was best for all mutants, there is variability in its relative performance across mutants. To explore this variability in more detail, we examined predicted fluxes for MOMA_LP_WT_CONSTR, and assessed how well the fluxes through different metabolic pathways were predicted for different mutants. We hoped that the results of this analysis, which are displayed in a heatmap in Fig. 3, would provide insight into the sources of the decreased performance in certain mutants. The most erroneous flux predictions for most pathways are largely restricted to three mutants: rpe1, pda1, and zwf1. The pda1 and zwf1 mutants are in reactions which utilize redox cofactors, and as described by Blank et al. such mutants tend to enact more distant rerouting to maintain redox balance. Therefore, it fits with intuition that an objective function which minimizes distance from the wildtype would struggle to capture more distant flux changes. A detailed examination of the predicted fluxes for these two mutants shows that while adjustments are predicted which resolve the redox imbalances caused by the given mutation, they are not the same adjustments found experimentally. For instance, for the pda1 mutant, the NAD/NADH imbalance caused by the mutation is predicted to be resolved using the NADH-dependent acetaldehyde dehydrogenase, but it seems that yeast instead increases respiratory activity to achieve redox balance. For the zwf1 mutant, the model fails to predict the large increase in flux through the TCA cycle and malic enzyme, which occurs in yeast to counteract the deficiency in NADPH resulting from the lack of an intact pentose phosphate pathway. These examples indicate that the flux rerouting in yeast metabolism which takes place in order to maintain redox balance does not represent a minimal adjustment, or at least not minimal with respect to the distance metrics evaluated here.

3.4. Prediction of absolute flux changes

While the correlations computed above quantify how well the different objective functions predict the nature of flux reroutings in the various deletion mutants, they do not capture how well the different objectives predict the absolute flux through the system. As discussed above, while the 13 different mutants analyzed here largely have the same relative flux through different pathways as observed in the wildtype, the absolute flux varies greatly. To evaluate how well the different objective functions capture the effects of different mutations on absolute flux, we compared predicted biomass production in each mutant to the corresponding experimentally measured values. The results of this comparison are displayed in Fig. 4 for the MOMA_LP_WT_CONSTR objective function. Fig. 4 indicates that despite the strong correlation between predicted and observed fluxes for all deletion mutants, there is little success in predicting the relative effects of the same mutations on the growth rate. The same trend observed in Fig. 4 was seen for all objective functions. Specifically, across all objective functions no mutant was predicted to have less than 90% of the wildtype growth, whereas experimental measurements found that 9 of the 13 mutants in fact had less than 90% of the wildtype growth rate.

[Figure 4: scatter plot of model-predicted fitness (vertical axis, ticks from 0.95 to 1.00) against experimental fitness (horizontal axis). Labeled points include WT, PCK1, GLY1, SFC1, GCV2, MAE1, OAC1, CTP1, FUM1, SDH1, LSC1, RPE1, and PDA1.]

Fig. 4. Comparison of model predicted and experimentally determined growth rates for different strains. Experimentally determined fitness was plotted against fitness predicted using the MOMA_LP_WT_CONSTR objective function for the 13 gene deletion and wildtype strains. Fitness was defined as the ratio between the growth rate of a given strain and the growth rate of the wildtype. While the experimental fitness values have a wide range across the 13 mutants, the model predicts that no mutant has a growth rate less than 95% that of the wildtype.
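The fitness definition used in Fig. 4 (growth rate of a strain divided by the growth rate of the wildtype) can be written down in a few lines. The function names and the strain labels in the usage note are hypothetical and serve only as an illustration.

```python
def fitness(growth_rates, wildtype="WT"):
    """Fitness of each strain: its growth rate relative to the wildtype."""
    wt_rate = growth_rates[wildtype]
    return {strain: rate / wt_rate for strain, rate in growth_rates.items()}

def strains_below(fit, threshold, wildtype="WT"):
    """Names of mutant strains whose fitness is below the given threshold."""
    return sorted(s for s, v in fit.items() if s != wildtype and v < threshold)
```

Applied separately to predicted and to measured growth rates, `strains_below(fit, 0.9)` would give the kind of count reported in the text (mutants below 90% of wildtype growth).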

4. Discussion

We evaluated the proficiency with which yeast flux balance models can predict the flux response to a variety of gene deletion mutations. Specifically, we assessed the flux predictions made by nine different objective functions in response to 13 different single gene deletions. Comparison of flux predictions to complementary experimentally measured fluxes revealed that for all mutants the best performing objective functions were those which minimized the distance of mutant fluxes from an experimentally constrained wildtype solution. Importantly, while the nine objective functions showed major differences in the accuracy of their predicted fluxes, all objectives correctly predicted that the 13 mutants would be able to produce biomass. This clearly demonstrates that the ability to correctly predict growth phenotypes does not necessarily translate into the ability to correctly characterize the underlying response at the level of reaction fluxes. The fact that for all mutants the flux response was best described by objectives which implemented minimal flux rerouting supports previous analyses of the metabolic response to gene deletions.

Although the minimal rerouting objectives were consistently the best, predictions for all mutants were not equally good. Specifically, it was found that for mutants in reactions involving redox cofactors, a minimal adjustment was not sufficient to completely describe the flux response. We hypothesize that the reason for this is that there are a number of degrees of freedom in redox balancing, and the minimal rerouting criterion by itself is not sufficient to accurately predict the observed response. Likely, criteria which cannot easily be captured by flux balance models, such as enzyme affinity for redox substrates and kinetic rate constants, are crucial in determining how redox balance is achieved.

In addition to issues with redox mutants, all objective functions failed to predict the absolute flux for different mutants. Specifically, despite accurate predictions of how fluxes were rerouted in the mutants, the model predictions did not capture the reduction in the overall flux observed in the experiments. The inability of any objective function to capture this aspect of the mutant response leaves the mechanism responsible for this observation unidentified. Again, it is likely that features of the metabolic response which cannot be captured by flux balance models are important here. Specifically, the relative efficiency of alternative pathways may limit the overall flux in mutants. Alternatively, regulatory responses to imbalances resulting from the gene deletions may cause an overall reduction in metabolic activity.

Despite some of the shortcomings in the abilities of flux balance models to predict mutant flux responses, overall they largely capture the salient features of the response to the different gene deletions. Importantly, the selection of objective function proved critical to the accuracy of the predicted fluxes, despite having little effect on the prediction of mutant growth.

Acknowledgements

The authors would like to thank Bill Riehl and Hsuan-Chao Chiu for critical reading of the manuscript. The authors would also like to acknowledge support from the NASA Astrobiology Institute, the US Department of Energy, and Boston University.

References

[1] Blank, L.M., Kuepfer, L. and Sauer, U., Large-scale 13C-flux analysis reveals mechanistic principles of metabolic network robustness to null mutations in yeast, Genome Biol, 6(6):R49, 2005.
[2] Burgard, A.P., Pharkya, P. and Maranas, C.D., Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization, Biotechnol Bioeng, 84(6):647-57, 2003.
[3] Burgard, A.P., Nikolaev, E.V., Schilling, C.H., et al., Flux coupling analysis of genome-scale metabolic network reconstructions, Genome Res, 14(2):301-12, 2004.
[4] Deutscher, D., Meilijson, I., Kupiec, M., et al., Multiple knockout analysis of genetic robustness in the yeast metabolic network, Nat Genet, 38(9):993-8, 2006.
[5] Duarte, N.C., Becker, S.A., Jamshidi, N., et al., Global reconstruction of the human metabolic network based on genomic and bibliomic data, Proc Natl Acad Sci USA, 104(6):1777-82, 2007.
[6] Edwards, J.S., Ibarra, R.U. and Palsson, B.O., In silico predictions of Escherichia coli metabolic capabilities are consistent with experimental data, Nat Biotechnol, 19(2):125-30, 2001.


E. S. Snitkin & D. Segre

[7] Feist, A.M., Henry, C.S., Reed, J.L., et al., A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information, Mol Syst Biol, 3:121, 2007.
[8] Fischer, E. and Sauer, U., Large-scale in vivo flux analysis shows rigidity and suboptimal performance of Bacillus subtilis metabolism, Nat Genet, 37(6):636-40, 2005.
[9] Forster, J., Famili, I., Fu, P., et al., Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network, Genome Res, 13(2):244-53, 2003.
[10] Holzhutter, H.G., The principle of flux minimization and its application to estimate stationary fluxes in metabolic networks, Eur J Biochem, 271(14):2905-22, 2004.
[11] Johnston, M. and Kim, J.H., Glucose as a hormone: receptor-mediated glucose sensing in the yeast Saccharomyces cerevisiae, Biochem Soc Trans, 33(Pt 1):247-52, 2005.
[12] Kauffman, K.J., Prakash, P. and Edwards, J.S., Advances in flux balance analysis, Curr Opin Biotechnol, 14(5):491-6, 2003.
[13] Kuepfer, L., Sauer, U. and Blank, L.M., Metabolic functions of duplicate genes in Saccharomyces cerevisiae, Genome Res, 15(10):1421-30, 2005.
[14] Schilling, C.H., Covert, M.W., Famili, I., et al., Genome-scale metabolic model of Helicobacter pylori 26695, J Bacteriol, 184(16):4582-93, 2002.
[15] Schuetz, R., Kuepfer, L. and Sauer, U., Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli, Mol Syst Biol, 3:119, 2007.
[16] Segre, D., Vitkup, D. and Church, G.M., Analysis of optimality in natural and perturbed metabolic networks, Proc Natl Acad Sci USA, 99(23):15112-7, 2002.
[17] Stolyar, S., Van Dien, S., Hillesland, K.L., et al., Metabolic modeling of a mutualistic microbial community, Mol Syst Biol, 3:92, 2007.
[18] Thomson, J.M., Gaucher, E.A., Burgan, M.F., et al., Resurrecting ancestral alcohol dehydrogenases from yeast, Nat Genet, 37(6):630-5, 2005.
[19] Vitkup, D., Kharchenko, P. and Wagner, A., Influence of metabolic network structure and function on enzyme evolution, Genome Biol, 7(5):R39, 2006.

BIOSYNTHETIC POTENTIALS FROM SPECIES-SPECIFIC METABOLIC NETWORKS

GEORG BASLER 1,2 (basler@mpimp-golm.mpg.de)
ZORAN NIKOLOSKI 1,2 (nikoloski@mpimp-golm.mpg.de)
OLIVER EBENHÖH 1,2 (ebenhoeh@mpimp-golm.mpg.de)
THOMAS HANDORF 3 (handorf@mpimp-golm.mpg.de)

1 Institute for Biochemistry and Biology, University of Potsdam, 14476 Potsdam, Germany
2 Max Planck Institute of Molecular Plant Physiology, 14476 Potsdam, Germany
3 Theoretical Biophysics, Humboldt-University Berlin, 10115 Berlin, Germany

Studies of genome-scale metabolic networks allow for qualitative and quantitative descriptions of an organism's capability to convert nutrients into products. The set of synthesizable products strongly depends on the provided nutrients as well as on the structure of the metabolic network. Here, we apply the method of network expansion and the concept of scopes, describing the synthesizing capacities of an organism when certain nutrients are provided. We analyze the biosynthetic properties of four species: Arabidopsis thaliana, Saccharomyces cerevisiae, Buchnera aphidicola, and Escherichia coli. Matthaus et al. [12] have recently developed a method to identify clusters of scopes, reflecting specific biological functions and exhibiting a hierarchical arrangement, using the network comprising all reactions in KEGG. We extend this method by considering random sets of nutrients on well-curated networks of the investigated species from BioCyc. We identify structural properties of the networks that allow their biosynthetic capabilities to be differentiated. Furthermore, we evaluate the quality of the clustering of scopes applied to the species-specific networks. Our study provides a novel assessment of the biosynthetic properties of different species.

Keywords: biosynthetic capabilities; clustering; scope; species-specific

1. Introduction

Recently, there has been tremendous interest in comparing metabolic network structures in order to explain their organization, both quantitatively and qualitatively, and to identify possible intrinsic network design principles. While research in this field historically concentrated on kinetic modelling of small parts of metabolism, e.g., the glycolytic pathway [15], the emergence of biochemical databases such as KEGG [10], Brenda [11], and BioCyc [16] has prompted interest in analyses of large-scale metabolic networks. As kinetic data corresponding to genome-wide, species-specific metabolic networks are often difficult to obtain or to determine precisely, novel topology-based methods have been introduced in the last decade to allow a functional analysis of such networks. In particular, such networks have been investigated by graph-theoretic approaches [1, 18, 20], steady-state analysis, e.g., elementary flux modes [17] or the related concept of extreme pathways [14], flux balance methods [5, 19], or, recently, by characterizing their synthesizing capacities using the concept of scopes [7].

The concept of a scope provides an effective method for determining which products a network can synthesize when it is provided with a given set of nutrient metabolites. In [8], it was shown that the synthesizing capacities of the nutrient metabolites, i.e., their scopes, form a complex hierarchy in the species-independent network defined by the KEGG database. This hierarchy is mainly determined by the chemical composition of the metabolites: those with a larger number of chemical elements or chemical groups (and, therefore, with a larger scope) are placed on top of metabolites with a simpler composition. In a recent paper [12], this complex hierarchy was condensed into a terse hierarchy of descriptive consensus scopes resulting from a clustering of scopes originating from all nutrient metabolites, taken individually. These consensus scopes represent sets of highly similar scopes, and could be assigned to characteristic combinations of chemical elements and a few chemical groups. As it is computationally infeasible to calculate the synthesizing capacities of all nutrient combinations, the consensus scopes are useful to efficiently describe the biosynthetic potential of a given metabolic network.

Here, we investigate at which meaningful threshold values the formerly observed hierarchies and corresponding consensus scopes can also be found in species-specific networks. Our analysis comprises the metabolic networks of four model species: Arabidopsis thaliana, Saccharomyces cerevisiae, Buchnera aphidicola, and Escherichia coli, as defined in the BioCyc database.
These species have been chosen as representatives of different domains of life and contrasting living environments. In particular, Arabidopsis thaliana (abbr. Arabidopsis, taxon 3702) is a eukaryotic, multicellular, CO2-fixing plant, while Buchnera aphidicola (abbr. Buchnera, taxon 107806) is a highly specialized, intracellular endosymbiont of aphids. Escherichia coli (abbr. E. coli, taxon 83333) is a well-studied bacterium that can grow in a variety of environments, and Saccharomyces cerevisiae (abbr. Yeast, taxon 4932) is a unicellular eukaryote and fungus that has been extensively used as a model organism. Furthermore, we perform extensive analyses focused on the effect of different parameters on the outcome of the clustering approach. Finally, as the concept of scope strongly depends on the network structure, we discuss the influence on the scopes of properties characteristic of the investigated species-specific networks.

Organization and contributions: The methods employed in this study are presented in Section 2. The employed network representations and the scope algorithm are outlined in Subsections 2.1 and 2.2. In Subsections 2.3 - 2.5, we present the three main methods used in evaluating the influence of different parameters on the scope hierarchies, namely the scope size distribution, (dis)similarity indices, and the weighted modularity of a given clustering. The results from our analysis appear in Section 3, while the effect of the network properties on the investigated approach for determining a representative scope hierarchy is discussed in Section 4.

2. Methods

In this section, we describe the methods for testing the sensitivity of the approach proposed by Matthaus et al. [12] in order to investigate the biosynthetic potential of specific species. In Subsection 2.1, we detail the retrieval and representation of the networks used in this study. The main method, the calculation of the scope, is formally presented in Subsection 2.2. The size distributions of scopes on the investigated networks are discussed in Subsection 2.3, and the approach for determining the relationship between the parameters and methods for clustering is discussed in Subsections 2.4 and 2.5.

2.1. Species-specific networks

A metabolic network is typically represented by a directed bipartite graph $G = (V, E)$. The node set $V$ of $G$ can be partitioned into two subsets: $V_r$, containing reaction nodes, and $V_m$, comprised of metabolite nodes, such that $V_r \cup V_m = V$. The edges in $E$ are directed either from a node $u \in V_m$ to a node $v \in V_r$, in which case the metabolite $u$ is called a substrate of the reaction $v$, or from a node $v \in V_r$ to a node $u \in V_m$, in which case $u$ is called a product of the reaction $v$. In the following, we refer to substrates as predecessors (abbr. pred), and to products as successors (abbr. succ).

Such a representation of a metabolic network can be retrieved from a publicly available database of biochemical reactions. Here, the metabolic networks of the four investigated species were obtained from the BioCyc database [16]. Similarly to the network retrieval procedure specified in Matthaus et al. [12], the reactions were checked for consistency, and those showing erroneous stoichiometry were removed. In addition, generic reactions and metabolites integrating sets of related metabolites were removed from the network, as proposed in [6]. The curation process was applied to the BioCyc database release from December 5, 2007, and resulted in networks of the following sizes: 1329 compounds and 1404 reactions (Arabidopsis), 1158 compounds and 1256 reactions (E. coli), 620 compounds and 594 reactions (Yeast), and 356 compounds and 336 reactions (Buchnera).

The BioCyc database also provides information on the reversibility of biochemical reactions. Every enzymatic reaction (with a given direction) may, in principle, also proceed in the reverse direction. However, the direction in which a reaction actually proceeds strongly depends on the metabolite concentrations, and may therefore vary for different physiological conditions. Thus, for analyzing the structure of a metabolic network from a given species, all reactions may be considered as being operable in both directions. Here, as a result, all reactions are assumed to be reversible. Hence, the network is represented by a bipartite graph $G = (V, E)$, where the successors and predecessors of a reaction are interchangeably considered as reactants or products.
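As an illustration of the bipartite representation described above, the following minimal sketch stores each reaction with its predecessor (substrate) and successor (product) sets. The function names, the dictionary layout, and the toy reactions in the usage note are assumptions made for this example and are not taken from BioCyc.

```python
def build_network(reactions):
    """Build a bipartite network from (name, substrates, products) triples."""
    pred, succ = {}, {}
    for name, substrates, products in reactions:
        pred[name] = frozenset(substrates)  # metabolites with an edge into the reaction
        succ[name] = frozenset(products)    # metabolites with an edge out of the reaction
    metabolites = set().union(*pred.values(), *succ.values())
    return {"V_r": set(pred), "V_m": metabolites, "pred": pred, "succ": succ}

def edges(net):
    """Directed edges of G: substrate -> reaction and reaction -> product."""
    e = {(m, r) for r, ms in net["pred"].items() for m in ms}
    e |= {(r, m) for r, ms in net["succ"].items() for m in ms}
    return e
```

For instance, `build_network([("r1", ["A", "B"], ["C"]), ("r2", ["C"], ["D"])])` yields four metabolite nodes and five directed edges. Treating all reactions as reversible then amounts to allowing either `pred` or `succ` to act as the substrate side.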

2.2. Biosynthetic potential of metabolites via scope

Given a metabolic network $G$ of an investigated species, the biosynthetic potential for a given set of metabolites, acting as substrates, can be described in terms of their scope, i.e., the metabolites that can be synthesized in the network from the substrates. The scope concept is related to reachability in the metabolic network $G$: A reaction node $v \in V_r$ is reachable if all of its substrates are reachable. Given a subset $S$ of metabolite nodes, called a seed, a node $u \in V_m$ is reachable either if $u \in S$ or if $u$ is a product of a reachable reaction. With these clarifications, we can present a precise mathematical formulation for the scope of a given seed [3]:

Definition 2.1. Given a metabolic network $G = (V, E)$ and a set $S \subseteq V_m$, the scope of the seed $S$, denoted by $R(S)$, is the set of all metabolite nodes reachable from $S$.

For a given metabolic network $G = (V, E)$ and a set $S \subseteq V_m$, the scope $R(S)$ can be determined in polynomial time of the order $O(|E| \cdot |V|)$, as can be established by analyzing the following algorithm:

Algorithm 1: Scope for a set of seed metabolites $S$ in a metabolic network $G$
Input: Metabolic network $G = (V_m \cup V_r, E)$, set of seed metabolites $S \subseteq V_m$
Output: Scope $R(S)$
 1  mark all nodes in $V_r$ as unreachable and unvisited
 2  $R(S) = S$
 3  repeat
 4      if there is a reachable unvisited node $r \in V_r$ then
 5          mark $r$ as visited
 6          $R(S) = R(S) \cup pred(r) \cup succ(r)$
 7      end
 8      foreach node $r \in V_r$ do
 9          if $pred(r) \subseteq R(S)$ or $succ(r) \subseteq R(S)$ then
10              mark $r$ as reachable
11          end
12      end
13  until no reachable unvisited nodes in $V_r$
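The idea behind Algorithm 1 can be sketched in Python as a simple fixed-point iteration. This is not a line-by-line transcription of the algorithm: it fires every reachable reaction in each pass instead of one at a time, and the data layout (a dictionary mapping reaction names to a pair of metabolite sets) is an assumption made for the example.

```python
def scope(reactions, seed):
    """Return the set of metabolites reachable from the seed.

    reactions: dict mapping a reaction name to (pred, succ), two sets of
    metabolite names. All reactions are treated as reversible, so a reaction
    is reachable as soon as either side is fully contained in the scope.
    """
    reached = set(seed)
    visited = set()
    changed = True
    while changed:
        changed = False
        for name, (pred, succ) in reactions.items():
            if name in visited:
                continue
            if pred <= reached or succ <= reached:
                visited.add(name)
                reached |= pred | succ  # both sides become available
                changed = True
    return reached
```

With the toy network `{"r1": ({"A", "B"}, {"C"}), "r2": ({"C"}, {"D", "E"})}`, the seed `{"A", "B"}` reaches all five metabolites, whereas the seed `{"A"}` alone reaches only itself. Note that a reaction with an empty side would always fire in this sketch; such reactions are assumed to have been removed during curation.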


In our analysis, the seed, $S$, is chosen uniformly at random from the set of metabolite nodes in a given network $G$. Algorithm 1 is then applied to each of $f = 3000$ sets $S$ of a specified cardinality $c$. In the following, we describe how one can determine the distribution and clustering of scopes for a given cardinality, $c$, of the seed.
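The random choice of seeds described above can be sketched as follows. The function name and the optional `rng` argument are hypothetical; the default of 3000 seeds mirrors the value of f given in the text.

```python
import random

def sample_seeds(metabolites, c, f=3000, rng=None):
    """Draw f seeds, each a uniformly random subset of c distinct metabolites."""
    rng = rng or random.Random()
    pool = sorted(metabolites)  # fixed order so that a seeded rng is reproducible
    return [frozenset(rng.sample(pool, c)) for _ in range(f)]
```

Passing a seeded generator, for example `random.Random(0)`, makes the sampled seeds reproducible between runs, which is convenient when comparing different seed sizes.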

2.3. Distribution of scope sizes

Given a species $X$ with a metabolic network represented by $G_X$, let $\Sigma_X$ be the set of all scopes for $f$ randomly chosen sets $S$, such that $c = |S|$. The scope size distribution for $\Sigma_X$ gives the probability, $P_X(s)$, that a scope, randomly chosen from $\Sigma_X$, is of size $s$. The effect of the parameter $c$ on the distribution $P_X(s)$ can be investigated by plotting the curves $P_X(s)$ for different values of $c$. To investigate the (possible) difference in the scope size distribution for several species, the sizes of the scopes are normalized by the number of metabolites in the corresponding network for each species. The scope size distributions of the investigated species are analyzed in Subsection 3.1.
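A minimal sketch of the empirical scope size distribution, with sizes normalized by the number of metabolites in the network, is given below. The function name and the return format (a dictionary from normalized size to relative frequency) are assumptions made for illustration.

```python
from collections import Counter

def scope_size_distribution(scopes, n_metabolites):
    """Map each normalized scope size to the fraction of scopes of that size."""
    counts = Counter(len(s) for s in scopes)
    total = len(scopes)
    return {size / n_metabolites: k / total for size, k in sorted(counts.items())}
```

Calling this function once per seed size c and plotting the resulting dictionaries gives curves of the kind shown in Fig. 1.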

2.4. Clustering of scopes

Existing studies of biosynthetic potential [8, 12] have identified that a large number of metabolites have scopes similar in size and metabolite composition. Here, we investigate this idea by hierarchical clustering for a set of scopes $\Sigma_X$ generated from seeds with cardinality $c$ and a given metabolic network of a species $X$. Hierarchical clustering is based on a given distance (dissimilarity) matrix for the elements of $\Sigma_X$. Similar to [12], we employ the reversed Jaccard index as a distance measure for a pair of scopes, $R(S_i)$ and $R(S_j)$, $|S_i| = |S_j| = c$, $1 \le i, j \le f$. The computation is in the order of $O(f^2)$ for $f$ scopes. For completeness, we give the definition of the Jaccard distance, $J_{R(S_i)R(S_j)}$:

$$J_{R(S_i)R(S_j)} = 1 - \frac{|R(S_i) \cap R(S_j)|}{|R(S_i) \cup R(S_j)|}$$
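The Jaccard distance defined above, and the pairwise distance matrix it induces on a list of scopes, can be sketched as follows. The function names are hypothetical, and the convention that two empty scopes have distance 0 is an assumption (scopes always contain their seed, so the case should not arise in practice).

```python
def jaccard_distance(a, b):
    """Reversed Jaccard index: 1 - |a & b| / |a | b| for two sets."""
    union = a | b
    if not union:
        return 0.0  # convention for two empty sets (assumption)
    return 1.0 - len(a & b) / len(union)

def distance_matrix(scopes):
    """Symmetric matrix of pairwise Jaccard distances; O(f^2) set comparisons."""
    f = len(scopes)
    dist = [[0.0] * f for _ in range(f)]
    for i in range(f):
        for j in range(i + 1, f):
            dist[i][j] = dist[j][i] = jaccard_distance(scopes[i], scopes[j])
    return dist
```

For example, the scopes {A, B, C} and {B, C, D} share two of four metabolites in their union and therefore have distance 0.5.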

We investigate the effect of a nearest neighbor group-average clustering algorithm [9]. Nearest neighbor clustering is a bottom-up clustering method in which clusters are joined iteratively in order of increasing distance, starting with clusters composed of single elements (scopes). Group-averaging refers to the method of defining the distance between two clusters as the average over all distances between pairs of the corresponding cluster elements. The output of a hierarchical clustering algorithm is a tree, which can be cut at a given distance between the clusters to retrieve the clusters of scopes. The clusters obtained from a cut at distance $T$ contain all scopes whose mutual distance is not greater than $T$. The results of the clustering of scopes are presented in Subsection 3.2.

2.5. Evaluation of parameter values

To evaluate the influence of the size of the seed, $c$, and the distance, $T$, at which the clustering tree is cut, on the quality of the obtained clusters, we use weighted modularity [2], a generalization of the graph cluster quality measure proposed by Newman and Girvan [13]. To apply graph cluster quality measures, one first has to build a graph from a given matrix of dissimilarity indices. Here, we construct a graph $H$ from the dissimilarity matrix by creating a node for each scope and weighting the edges by the similarity between scopes: let $J$ be the dissimilarity matrix used in the hierarchical clustering. The weighted adjacency matrix $A$ of the graph $H$ is given by $1 - J_{R(S_i)R(S_j)}$, over all pairs $R(S_i)$ and $R(S_j)$ in $\Sigma_X$. The edges of graph $H$ are thus weighted by the similarity of the scopes $R(S_i)$ and $R(S_j)$. Let $C = \{C_1, \ldots, C_p\}$ be the set of scope clusters obtained by cutting the clustering tree at distance $T$. Given a graph $H$, with node set given by the $f$ scopes and weighted edges as defined above, the modularity of $C$ measures the quality of the clustering, i.e., how well separated nodes (scopes) from different clusters are from each other. It is defined as:

$$Q_{c,T} = \frac{1}{2m} \sum_{i,j=1}^{f} \left( A_{ij} - \frac{d(i)\,d(j)}{2m} \right) \delta_{ij}$$

where $m = \sum_{ij} A_{ij}$ is the weighted number of edges in $H$, $A_{ij}$ is the element of the adjacency matrix in row $i$ and column $j$, $d(i)$ is the weighted degree of scope $i$ in $H$, $d(j)$ is the weighted degree of scope $j$ in $H$, and $\delta_{ij} = 1$ if $i$ and $j$ are in the same cluster of $C$, and $0$ otherwise. With regard to this definition, the modularity measure assesses the closeness of the scopes placed in the same cluster (according to the employed clustering algorithm) and their "distance" from the scopes placed in the other clusters with respect to the weighted adjacency matrix (i.e., the similarity matrix). We investigate the behavior of the cluster quality for different sizes of the seed and different values for the parameter $T$ at which the clustering tree is cut to obtain the set of clusters $C$ (see Subsection 3.3).

3. Results

Here, we analyze and compare the scope size distributions, cluster agglomeration, and weighted modularities of scope clusters, obtained from the networks of the four investigated species. The scope size distributions and cluster agglomeration reveal characteristic features of the networks, while the weighted modularities determined for different values of cut-off and seed size allow us to assess systematically and quantitatively the relative influence of these parameters on the clustering.

3.1. Scope size distributions

Analyses of the scope concept have already identified that metabolites exhibit different biosynthetic potentials, i.e., the number of reachable metabolites strongly depends on the composition of the seed [3]. Therefore, we use the size of the scope to quantitatively characterize the biosynthetic potential of the seed metabolites in a given metabolic network. To this end, we empirically determine the size distributions of scopes resulting from the four investigated species (see Fig. 1). In order to enable comparability, the scope sizes were normalized by the size of the network, and the counts of scopes were turned into a probability distribution (see Subsection 2.3 for details).

[Figure 1: four panels, (a) Arabidopsis thaliana, (b) E. coli, (c) Saccharomyces cerevisiae, and (d) Buchnera aphidicola scope size distributions. Horizontal axis: scope size (normalized). Curves are shown for seed sizes 4, 14, and 24.]

Fig. 1. Scope size distributions of (a) Arabidopsis, (b) E. coli, (c) Yeast and (d) Buchnera, normalized by the number of metabolites in the corresponding network. The distributions are shown for seed sizes 4 (red), 14 (blue), and 24 (yellow). The highest frequencies for seed size 4 are excluded for clarity: $P_{Arabidopsis}(4) = 0.38$, $P_{E. coli}(4) = 0.35$, $P_{Yeast}(4) = 0.39$, and $P_{Buchnera}(4) = 0.38$.

We observe that with small seeds of four metabolites, the scope size distributions of all investigated networks share a high peak for very small scope sizes, indicating that a large number of seeds exhibit a very low biosynthetic potential. The remaining large isolated peaks in the networks of Arabidopsis (Fig. 1a) and E. coli (Fig. 1b) correspond to characteristic scopes reachable from a relatively large number of different seeds. These characteristic scopes correspond to large subnetworks with a high degree of mutually reachable metabolites, which we refer to as scope communities: If the seed contains metabolites from within such a scope community, then there is a high probability of reaching all the metabolites within the community. In addition, a scope community is self-contained in the sense that metabolites outside of the community can only be reached if the seed contains certain metabolites that are also outside of the community. Note that although one characteristic peak may correspond to several such scope communities with a similar scope size, this is not observed in the networks of Arabidopsis and E. coli. Instead, the subsequent clustering reveals that scopes pertaining to the same characteristic peak are agglomerated into one cluster at a merging distance not greater than 0.2. Furthermore, the relatively large sizes of the communities (approx. 35%, 46%, and 60% of the network size in Arabidopsis, see Fig. 1a, and approx. 38% and 45% in E. coli, see Fig. 1b) suggest that the smaller scope communities form subsets of the larger ones and, thus, exhibit a hierarchical arrangement, as identified by Matthaus et al. [12].

By increasing the seed size, the probability of reaching any particular metabolite increases, and, therefore, one obtains larger scopes. In particular, we observe that for all networks the fraction of small scopes decreases, while the overall scope sizes increase. For the more complex networks of Arabidopsis and E. coli, we observe that the center of the large peaks shifts towards larger scope sizes. This demonstrates that seeds containing metabolites from within a scope community now frequently contain additional metabolites from outside of the community, which account for a small increase of the scope size. Moreover, seeds containing no metabolites from within a scope community continue to have a small scope, regardless of the increased seed size. Consequently, scope communities in the more complex networks represent an outstanding feature that is robust with respect to the seed size.

In contrast to these findings, an increase of the seed size in the smaller networks of Yeast (Fig. 1c) and Buchnera (Fig. 1d) results in more evenly distributed scope sizes. This observation suggests that scope communities do not exist or are less pronounced compared to the cases of Arabidopsis and E. coli. For these two species, there are many scopes containing a distinct fraction of metabolites in the network. Finally, while the scope size distributions of Arabidopsis and E. coli are easily distinguishable by the frequency, relative scope size and number of scope communities, this is not the case for Yeast and Buchnera.

3.2. Cluster agglomeration

The dissimilarity matrix serves as the basis for the clustering described in Subsection 2.4. During the clustering process, scopes are agglomerated into clusters, starting with the most similar. At a merging distance of 0, every scope forms an individual cluster, so that the number of clusters equals the number of scopes f, i.e., 3000. The number of clusters monotonically decreases with an increasing merging distance,

Biosynthetic Potentials from Species-Specific Metabolic Networks


until, at a distance of 1, all scopes form a single cluster. The number of clusters obtained at a certain merging distance provides information on the overall mutual similarities between scopes. In the case of many highly similar scopes, a small number of clusters will be obtained for a small merging distance, while the opposite holds for the case of many dissimilar scopes. For instance, if at a distance of 0.5 the number of clusters is half the number of scopes, then more than half of the scopes have a mutual distance of at most 0.5; therefore, more than half of the scopes share at least two thirds of their metabolites with another scope (cf. Subsection 2.4).
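The relation between a merging distance of 0.5 and a shared fraction of two thirds can be checked with a small worked example. The sketch below assumes the Jaccard distance named in the caption of Fig. 2; the function name and the metabolite labels are hypothetical and only serve as an illustration.

```python
def jaccard_distance(a, b):
    """Jaccard distance 1 - |A & B| / |A | B| between two sets of metabolites."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty scopes are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two hypothetical scopes of six metabolites each that share four of them,
# i.e. two thirds of their metabolites.
scope_a = {"m1", "m2", "m3", "m4", "m5", "m6"}
scope_b = {"m3", "m4", "m5", "m6", "m7", "m8"}

# Intersection has 4 elements, union has 8, so the distance is 1 - 4/8.
distance = jaccard_distance(scope_a, scope_b)
```

For two scopes of equal size n sharing k metabolites, the distance is 1 - k/(2n - k), which equals 0.5 exactly when k = 2n/3.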

[Figure 2: four panels titled "Arabidopsis thaliana cluster agglomeration" (a), "E. coli cluster agglomeration" (b), "Saccharomyces cerevisiae cluster agglomeration" (c) and "Buchnera aphidicola cluster agglomeration" (d). Horizontal axis: merging distance (0.0 to 1.0). Legend: seed size 4, seed size 14, seed size 24.]

Fig. 2. Frequency of observed clusters over the merging distance for (a) Arabidopsis, (b) E. coli, (c) Yeast and (d) Buchnera. While steps appear in the frequencies for seed size of 4 (solid line) as a consequence of numerical effects of the Jaccard distance, the shapes appear continuous for seed sizes of 14 (dashed line) and 24 (dotted line). Furthermore, the overall mutual distances of scopes decrease when increasing the seed size, resulting in a smaller fraction of clusters at a particular merging distance.

As shown in Fig. 2, the mutual similarities of scopes exhibit significant differences when using varying seed sizes. As a trend, the number of clusters obtained at


G. Basler et al.

a certain merging distance is reduced with the increase of the seed size, demonstrating that more similar scopes result from a larger seed size. This conforms to intuition, as larger seeds result in larger scopes with a higher probability of sharing common metabolites. While the agglomeration curves from seed sizes 14 and 24 appear continuous, steps appear in the curves from seed size 4. For the latter curves, a large number of scopes is agglomerated into clusters at certain distances. For Arabidopsis (Fig. 2a) and E. coli (Fig. 2b), there are large steps of more than 160 scopes at characteristic distances of 2/3 and 3/4, and steps of more than 530 scopes at a distance of 6/7. In Yeast (Fig. 2c) and Buchnera (Fig. 2d), there are steps of more than 300 scopes at distances of 2/3 and 6/7. These are numerical effects of the Jaccard distance, which provides a discrete number of possible dissimilarity values that decreases with smaller cardinalities of the compared entities. When using a small seed size, the fraction of small scopes is very large (cf. Subsection 3.1). Consequently, for a large number of scopes there is a small number of possible distances to consider. For instance, at a distance of 2/3, all scopes of size four with two metabolites in common are merged, and all scopes of size six with three metabolites in common, and so on. With many small scopes, these characteristic distances occur more frequently, leading to the observed steps. For the clustering of Arabidopsis and E. coli with seed sizes of 14 and 24, a significant fraction of scopes is agglomerated with a merging distance of less than 0.1. This indicates that there are many scopes with a high mutual similarity. In contrast, this does not hold for Yeast and Buchnera, where the range of similarities between scopes is more uniformly distributed and, thus, results in cluster agglomerations at higher distances. Again, there are significant differences between the calculated scopes of Arabidopsis and E. coli on one hand, and Yeast and Buchnera on the other hand.
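The discreteness of the Jaccard distance for small scopes can be made explicit by enumerating every value it can take for two sets of given sizes. The following is a minimal sketch; the function name is hypothetical, and exact fractions are used so that values such as 2/3 and 6/7 are not blurred by floating-point rounding.

```python
from fractions import Fraction


def possible_distances(size_a, size_b):
    """All Jaccard distances that two sets of the given (positive) sizes can have."""
    values = set()
    for shared in range(min(size_a, size_b) + 1):
        union = size_a + size_b - shared
        values.add(1 - Fraction(shared, union))
    return sorted(values)


# Two scopes of size four admit only five distinct distances:
# 0, 2/5, 2/3, 6/7 and 1.
small = possible_distances(4, 4)

# Two scopes of size forty admit 41 distinct distances, so the
# agglomeration curve looks far smoother.
larger = possible_distances(40, 40)
```

The values 2/3 and 6/7 for scopes of size four, and 2/3 again for scopes of size six with three shared metabolites, coincide with the step positions reported for seed size 4.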

3.3. Influence of cut-off and seed size

Due to the observed large impact of the employed seed size and cut-off on the calculated scopes and the resulting clustering, we aim at evaluating the influence of these parameters on the quality of clustering. In particular, we are interested in those parameter values that yield clusters of highest weighted modularity. Moreover, a thorough investigation of the parameter space may provide insights into the presented approach of scope clustering. We determine scopes from random seeds as described in Subsection 2.2 for seed sizes 2 ≤ c ≤ 25. For each set of scopes resulting from a given network and seed size, we perform the clustering of scopes as described in Subsection 2.4. Finally, we cut the obtained cluster trees at cut-off distances 0.05 ≤ τ ≤ 1 with a step size of 0.05, and determine the weighted modularities of the resulting sets of clusters, as defined in Subsection 2.5. In Fig. 3, the resulting matrices of weighted modularities for different parameter

[Figure 3: heat maps with color keys, titled "Influence of cut-off and seed size on cluster quality" for Arabidopsis thaliana, E. coli and Buchnera aphidicola. Cut-off values range from 0.05 to 0.95.]

... ≈ 0.41, and for Buchnera (Fig. 3d) Q_{c=2,τ=0.95} ≈ 0.43. However, these maxima correspond to identical parameters of c = 2 and τ = 0.95 in Arabidopsis, Yeast and Buchnera, while the modularity obtained from the same parameters in E. coli is Q_{c=2,τ=0.95} ≈ 0.18.


The evaluation of parameters indicates that the best clustering is achieved for a small seed size of c = 2 and a very high cut-off of τ = 0.95 for all species except E. coli, for which τ = 0.7 results in the highest cluster quality. The preference for small seeds demonstrates that small sets of metabolites can be well classified into distinct groups according to their biosynthetic potential using the concept of scopes. On the other hand, our analysis suggests that scopes from more complex seed compositions are harder to classify. Furthermore, the selection of a cut-off value τ = 0.95 indicates that a small number of large clusters, containing scopes up to a very high distance, is preferred. Hence, the arrangement of scopes from small seeds into few very coarse groups results in the highest separation of clusters.
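The parameter sweep of this subsection (seed sizes 2 to 25, cut-offs 0.05 to 1 in steps of 0.05) amounts to a grid search. The sketch below only illustrates the grid and the selection of the best pair; the function names are hypothetical, and `toy_score` is a made-up stand-in for the weighted modularity of Subsection 2.5, not a reproduction of it.

```python
def parameter_grid():
    """All (seed size, cut-off) pairs evaluated in the sweep."""
    seed_sizes = list(range(2, 26))                       # c = 2, ..., 25
    cutoffs = [round(0.05 * k, 2) for k in range(1, 21)]  # 0.05, ..., 1.0
    return [(c, tau) for c in seed_sizes for tau in cutoffs]


def best_parameters(score):
    """Return the (c, tau) pair that maximizes score(c, tau) over the grid."""
    return max(parameter_grid(), key=lambda pair: score(*pair))


def toy_score(c, tau):
    """Made-up score peaking at c = 2, tau = 0.95 (illustration only)."""
    return -abs(c - 2) - abs(tau - 0.95)


grid = parameter_grid()          # 24 seed sizes x 20 cut-offs
best = best_parameters(toy_score)
```

In an actual analysis, `score` would cluster the scopes obtained for seed size c, cut the tree at tau, and return the weighted modularity of the resulting clusters.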

4. Discussion

Characterizing the biosynthetic potential by only employing the structure of metabolic networks offers a means for comparing and contrasting different species. Here, we investigated to what extent the approach proposed in [12] could be extended to determining scope clusterings and metabolite hierarchies in species-specific networks. To this end, we performed a comprehensive sensitivity analysis of the approach, which depends on the size and composition of random seeds and the cut-off distance for extracting clusters of scopes. The analysis furthermore includes the effect of the size and composition of random seeds on the scope size distributions in the four investigated species. The findings related to the scope size distributions conform to the existing results on species-specific networks [4] as well as the network comprising all reactions from KEGG [12], i.e., a large number of seeds exhibit a small biosynthetic potential. Accordingly, we observe characteristic scope sizes corresponding to scope communities for Arabidopsis and E. coli, which indicates the existence of consensus scopes and supports their hierarchical arrangement. This argument can be further strengthened by our findings regarding the scope size distributions of Arabidopsis and E. coli: With an increase of seed sizes, the overall scope sizes increase, while preserving the scope community structure. The results from the agglomerative clustering performed on the scopes of 3000 randomly chosen seeds of different sizes suggest a plateau for the fraction of clusters at a merging distance around 0.2, i.e., no significant number of scopes is agglomerated at distances close to 0.2. This is typically pronounced for seeds of larger sizes from the networks of Arabidopsis and E. coli. We point out that the plateau phenomenon was already observed elsewhere [12] and was used as a principle for choosing a threshold in the extraction of scope clusters and the resulting metabolite hierarchies.
However, our analysis warrants caution when extending these observations to the networks of Yeast and Buchnera: While Arabidopsis and E. coli are organisms with complex metabolic networks, the opposite holds for Buchnera. Although Yeast is a generalist model organism with complex metabolic functions, its scope size distribution does not exhibit characteristic peaks and, therefore, does not contain any distinct scope communities. Likewise, there is no plateau observed in the cluster agglomeration of Yeast. The observed differences in scope sizes and clustering between Arabidopsis and E. coli on one hand, and Yeast and Buchnera on the other hand, may be due to either differing qualities in the curation of the networks or a possible realistic difference in the biosynthetic potential of these species. To further assess the quality of scope clusterings, we applied a generalization of the modularity measure. While for certain values of parameters (i.e., cut-off distance and seed size) we obtained relatively high modularities for the respective scope clustering, the observed values have significantly different implications: The highest value for the modularity in the investigated species was obtained at cut-off distances of 0.95 and 0.7, corresponding to a small number of clusters comprising scopes with a wide range of similarities. Moreover, for most cut-off distances, the highest modularity is reached for small seed sizes (c = 2), suggesting that the cluster agglomeration may be highly dependent on the discretization capacity of the employed Jaccard distance. We point out that the same empirical analysis was performed and comparable results were obtained using the Manhattan distance as a (dis)similarity measure. Therefore, we can conclude that the method for extracting scope clusters and metabolite hierarchies may be most appropriate for large scope sizes, most likely resulting from large seed sizes and complex networks, for which both the plateau principle and the observed scope communities are clearly pronounced. To conclude, we identified features based on the concept of scopes, which allow for a structural comparison of different species, and indicate the existence of consensus scopes and metabolite hierarchies in Arabidopsis and E. coli.
In addition, our sensitivity analysis revealed a strong influence of the evaluated parameter values on the quality of clustering. Future research may aim at characterizing the scope communities via their metabolite compositions and hierarchical organization, and extending the analysis to additional organisms.

References

[1] Barabási, A. L. and Albert, R., Emergence of scaling in random networks. Science, 286:509-512, 1999.
[2] Brandes, U., Delling, D., Gaertler, M., Görke, R., Hoefer, M., Nikoloski, Z. and Wagner, D., On modularity clustering. IEEE Trans. Knowl. Data Eng., 20(2):172-188, 2008.
[3] Ebenhöh, O., Handorf, T., and Heinrich, R., Structural analysis of expanding metabolic networks. Genome Informatics, 15:35-45, 2004.
[4] Ebenhöh, O., Handorf, T., and Heinrich, R., A cross species comparison of metabolic network functions. Genome Informatics, 16(1):203-213, 2005.
[5] Edwards, J.S. and Palsson, B.O., The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics, and capabilities. Proc Natl Acad Sci USA, 97:5528-5533, 2000.
[6] Feist, A. M., Henry, C. S., Reed, J. L., Krummenacker, M., Joyce, A. R., Karp, P. D., Broadbelt, L. J., Hatzimanikatis, V., and Palsson, B. O., A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Mol Syst Biol., 3(121), 2007.
[7] Handorf, T., Ebenhöh, O., and Heinrich, R., Expanding metabolic networks: Scopes of compounds, robustness, and evolution. J. Mol. Evol., 61:498-512, 2005.
[8] Handorf, T., Ebenhöh, O., Kahn, D., and Heinrich, R., Hierarchy of metabolic compounds based on their synthesizing capacity. IEE Proc. Systems Biology, 153(5):359-363, 2006.
[9] Hastie, T., Tibshirani, R., and Friedman, J., The elements of statistical learning: Data mining, inference and prediction. Springer, New York, 2001.
[10] Kanehisa, M., Goto, S., Hattori, M., Kinoshita, K.F., Itoh, M., Kawashima, S., Katayama, T., Araki, M., and Hirakawa, M., From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res., 34:D354-357, 2006.
[11] Karp, P.D., Ouzounis, C.A., Moore-Kochlacs, C., Goldovsky, L., Kaipa, P., Ahren, D., Tsoka, S., Darzentas, N., Kunin, V., and Lopez-Bigas, N., Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Research, 19:6083-6089, 2005.
[12] Matthäus, F., Salazar, C., and Ebenhöh, O., Biosynthetic potentials of metabolites and their hierarchical organization. PLoS Comput Biol, 4(4):e1000049, Apr 2008.
[13] Newman, M. E. J. and Girvan, M., Finding and evaluating community structure in networks. Physical Review E, 69(026113), 2004.
[14] Price, N.D., Reed, J.L., Papin, J.A., Wiback, S.J., and Palsson, B.O., Network-based analysis of metabolic regulation in the human red blood cell. Journal of Theoretical Biology, 225:185-194, 2003.
[15] Rapoport, T. A., Heinrich, R., and Rapoport, S. M., The regulatory principles of glycolysis in erythrocytes in vivo and in vitro. A minimal comprehensive model describing steady states, quasi-steady states and time-dependent processes. Biochem J, 154(2):449-469, Feb 1976.
[16] Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., and Schomburg, D., BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Research, 32:D431-D433, 2004.
[17] Schuster, S. and Hilgetag, C., On elementary flux modes in biochemical reaction systems at steady state. J. Biol. Syst., 2:165-182, 1994.
[18] Strogatz, S. H., Exploring complex networks. Nature, 410:268-276, 2001.
[19] Varma, A. and Palsson, B.O., Metabolic flux balancing: basic concepts, scientific and practical use. Bio/Technology, 12:994-998, 1994.
[20] Wagner, A. and Fell, D. A., The small world inside large metabolic networks. Proc. R. Soc. Lond. B, 268:1803-1810, 2001.

GENERALIZED REACTION PATTERNS FOR PREDICTION OF UNKNOWN ENZYMATIC REACTIONS

YUGO SHIMIZU

MASAHIRO HATTORI

SUSUMU GOTO

MINORU KANEHISA

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan

Prediction of unknown enzymatic reactions is useful for understanding biological processes such as reactions to external substances like endocrine disrupters. To create an accurate prediction, we need to define a similarity measure between reactions. We have developed the KEGG RPAIR database, which is a collection of chemical structure transformation patterns, called RDM patterns, for substrate-product pairs of enzymatic reactions. In this study, we compared RDM patterns with EC numbers, which are the well-known hierarchical classification scheme for enzymes. Additionally, we performed hierarchical clustering of RDM patterns using the information stating whether each sub-subclass of EC has a particular RDM pattern or not. To represent the variation of RDM patterns in a cluster, we generalized RDM patterns in the same cluster using the hierarchy of KEGG Atomtypes, which are the components of RDM patterns. Using this generalized pattern, we can predict which cluster includes a given RDM pattern even if the reaction of the pattern has not been assigned any EC numbers. Thus we will be able to define the similarity between enzymatic reactions by using this cluster information.

Keywords: EC number; KEGG RPAIR; classification of enzymes; enzymatic reaction

1. Introduction

Recently, a large amount of biochemical information as well as genomic information and chemical information has become available [5, 6]. For example, in the KEGG LIGAND database, much information about biochemical small molecules, biochemical reactions, enzymes, glycans, and drugs is available [1, 12]. Enzymes are proteins that catalyze biochemical reactions; however, there are many enzymes whose functions have yet to be unveiled. This causes missing enzymes in metabolic pathways, and many unknown reactions remain to be characterized. Thus, the computational prediction of unknown enzymatic reactions may be useful for understanding biological processes such as xenobiotics biodegradation: reactions to external substances like endocrine disrupters [2, 7, 8]. To improve the accuracy of prediction, we need to better systematize the reaction mechanisms of known enzymatic activities and to define an appropriate measure of similarity among the enzymatic reactions for further analysis. To achieve these objectives, we performed comprehensive analyses using the EC classification and the KEGG RPAIR database.


Y. Shimizu et al.

The EC (Enzyme Commission) number is a well-known classification scheme for enzymes [9, 11]. In EC classification, enzymes are hierarchically classified by types of catalyzed reactions and their substrates and products. Each EC number consists of the letters "EC" followed by four numbers separated by periods (e.g. EC 1.1.1.1). The first, second, and third numbers are called class, subclass, and sub-subclass respectively. The fourth number represents the substrate specificity. The EC numbers have been utilized for many computational applications such as classification or prediction of enzymatic reactions. However, there are also some problems in EC classification. The EC numbers are classified manually, based on published experimental data, by the IUPAC-IUBMB Joint Commission on Biochemical Nomenclature. This requirement of published articles leaves many reactions unclassified. Additionally, the structural transformation between a single pair of compounds is unclear, since an EC number represents the relationships between multiple substrates and multiple products. In order to avoid these problems, we have developed the KEGG RPAIR database, which is a collection of chemical structure transformation patterns, called RDM patterns, for every substrate-product pair of enzymatic reactions [4]. In this study, we compared the RDM patterns with EC numbers and performed hierarchical clustering of the RDM patterns using the information whether each sub-subclass of EC has the RDM pattern or not. To represent the variation of the RDM patterns in a cluster, we introduced the generalized RDM patterns in the same cluster using the hierarchy of KEGG Atomtypes, which are the components of the RDM patterns.

2. Materials and Methods

2.1. KEGG LIGAND database

KEGG LIGAND is a composite database which contains various databases about biochemical compounds. In this study we have used ENZYME, REACTION, and RPAIR from the KEGG LIGAND database (as of 2008/05/13). ENZYME (4976 entries) is a database of EC numbers and contains names of enzymes, catalyzed reactions, genes, as well as other types of information. REACTION (7567 entries) is a database of all biochemical reactions that are included in ENZYME or appear on KEGG metabolic pathways. RPAIR (8706 entries) is a database of chemical structure transformation patterns, called RDM patterns, for every substrate-product pair (reactant pair) in REACTION.

2.2. RDM pattern

2.2.1. KEGG RPAIR database

Each entry in RPAIR contains the alignment of atoms between the substrate-product pairs and the structural transformation pattern called RDM pattern. In general, one enzymatic reaction contains multiple substrates and multiple products, which result in


multiple pairs. Here each pair of chemical compounds should be distinguished by its biochemical role in the reaction, and in the RPAIR database five types of such roles are available as annotated labels, "main", "cofac", "leave", "ligase" and "trans", which are exemplified in Figure 1. In this study, to reduce the noise of poorly characterized pairs, we used only the main type, which corresponds to a major component of pairs in each reaction.

[Figure 1: two reaction schemes, AB + H2O ⇔ AH + BOH (left) and AB + C ⇔ A + BC (right), with substrate-product pairs labeled "main", "leave" and "trans".]

Fig. 1. Examples of substrate-product pairs and their assigned types. In the left example, both the pair (AB, AH) and the pair (AB, BOH) are classified as the main type and the other pair (H2O, BOH) is defined as the leave type. In the right example, there are also two main types and the trans type is assigned to the last one. In all cases, the hydrogen atoms are not considered.

Table 1. Definition of KEGG Atomtypes (extracted from http://www.genome.jp/kegg/reaction/KCF.html).

Atom  Atom class  Description  Atomtype  Description
C     C1          alkane       C1a       R-CH3
                               C1b       R-CH2-R
                               C1c       R-CH(-R)-R
                               C1d       R-C(-R)2-R
                               C1x       ring-CH2-ring
                               C1y       ring-CH(-R)-ring
                               C1z       ring-C(-R)2-ring
      C2          alkene       C2a       R=CH2
                               C2b       R=CH-R
                               C2c       R=C(-R)2
                               C2x       ring-CH=ring
                               C2y       ring-C(-R)=ring or ring-C(=R)-ring
O     O1          single bond  O1a       R-OH
                               O1b       N-OH
                               O1c       P-OH
                               O1d       S-OH
                               O2a       R-O-R
                               O2b       P-O-R
                               O2c       P-O-P
                               O2x       ring-O-ring

2.2.2. KEGG Atomtype

In KEGG RPAIR, all atoms are represented by KEGG Atomtypes, which have been hierarchically defined by the physicochemical environment of atoms. Mostly, atomtypes are represented as three letter codes as shown in Table 1. The first letter indicates the atomic species, the second indicates information about the atomic bonds, and the third


indicates the information of the substituted groups. In particular, the second level of hierarchy in KEGG Atomtypes is called the atom class. For example, "C" is the carbon atom itself, the atom class "C1" represents the carbon atom observed in alkanes and the atomtype "C1a" represents the carbon atom which connects to another carbon atom and three hydrogen atoms. There are 68 atomtypes in the RPAIR database, and a portion of them is shown in Table 1.

2.2.3. RDM pattern

An RDM pattern is defined as a set of KEGG Atomtype changes at the reaction center (R), the difference region (D), and the matched region (M) for each reactant pair (Fig. 2). R atoms are boundary atoms between the matched regions and the unmatched regions. D atoms are next to the reaction center (R atoms) in the unmatched regions. M atoms are adjacent to the R atoms in the matched regions. In most cases R, D, and M atoms are all single pairs and the RDM pattern is represented as "R1-R2:D1-D2:M1-M2" (Fig. 2). Multiple pairs in D or M atoms can be considered and are represented by concatenating all atomtypes using "+", and multiple pairs in R are represented by multiple RDM patterns in which R atoms are a single pair. The asterisk "*" in the RDM patterns indicates that there is no atom or that it is only a hydrogen atom. The structural transformation between a single pair of compounds is now clear, since each entry of the RPAIR database is a binary pair. Also, the RDM pattern represents the transformational pattern around the reaction center. Hence it can be assumed that the RDM patterns basically reflect the reaction mechanism at the site where each enzyme catalyzes. RDM patterns are first generated computationally by the chemical structure comparison program SIMCOMP, followed by manual curation [3]. There were 2401 distinct RDM patterns in RPAIR.
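The string layout described above can be read back into its R, D and M components with a few lines of code. The following is a minimal sketch, assuming patterns follow the "R1-R2:D1-D2:M1-M2" layout with "+" joining multiple atoms; the class and function names are hypothetical, and `atomtype_levels` assumes the common three-letter codes (species, atom class, atomtype).

```python
from typing import NamedTuple, Tuple


class AtomtypePair(NamedTuple):
    first: Tuple[str, ...]   # atomtypes on the first compound of the pair
    second: Tuple[str, ...]  # atomtypes on the second compound of the pair


class RDMPattern(NamedTuple):
    r: AtomtypePair  # reaction center
    d: AtomtypePair  # difference region
    m: AtomtypePair  # matched region


def parse_rdm(pattern: str) -> RDMPattern:
    """Split an RDM pattern string into its R, D and M atomtype pairs."""
    fields = pattern.split(":")
    if len(fields) != 3:
        raise ValueError("expected three ':'-separated fields: " + pattern)
    pairs = []
    for field in fields:
        sides = field.split("-")
        if len(sides) != 2:
            raise ValueError("expected two '-'-separated sides: " + field)
        # Multiple atoms on one side are concatenated with '+'.
        pairs.append(AtomtypePair(*(tuple(side.split("+")) for side in sides)))
    return RDMPattern(*pairs)


def atomtype_levels(atomtype: str) -> Tuple[str, str, str]:
    """Return (atom species, atom class, atomtype) for a three-letter code."""
    return atomtype[0], atomtype[:2], atomtype


example = parse_rdm("N1a-N1b:*-C5a:C1c-C1c")  # pattern shown in Fig. 2
```

Here `example.r` holds the reaction-center pair (N1a, N1b), and the "*" on the first side of the D field marks the absence of an atom other than hydrogen.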

[Figure 2: structure diagram of a substrate-product pair labeled with KEGG Atomtypes, together with its RDM pattern N1a-N1b:*-C5a:C1c-C1c.]

Fig. 2. Example of a substrate-product pair and its RDM pattern. The red colored atoms (N1a and N1b at the boundary of the dashed line) are R, the blue and the yellow atoms are D (C5a connected to N1b) and M (C1b connected to R atoms), respectively. The rest of the matched region is depicted in green.


2.3. EC-RDM dot matrix

All EC numbers and corresponding RDM patterns were extracted from the databases. Then, the EC-RDM dot matrix was created to give an overview of the relationship between the EC classification and the RDM patterns. The rows of the matrix correspond to sub-subclasses of EC numbers, and the columns of the matrix correspond to the RDM patterns. The characteristic relationship between EC sub-subclasses and RDM patterns in the matrix is shown in the Results section.
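A dot matrix of this kind can be built directly from a list of (EC number, RDM pattern) associations. The sketch below is an illustration under assumptions: the function names are hypothetical, EC numbers are assumed to have four purely numeric fields, and the records use placeholder pattern labels rather than real database content.

```python
def sub_subclass(ec_number):
    """Map 'EC 1.1.1.1' (or '1.1.1.1') to its sub-subclass '1.1.1'."""
    fields = ec_number.replace("EC", "").strip().split(".")
    if len(fields) != 4:
        raise ValueError("expected four '.'-separated fields: " + ec_number)
    return ".".join(fields[:3])


def dot_matrix(records):
    """Build the binary matrix from (ec_number, rdm_pattern) pairs.

    Returns (rows, columns, matrix), where matrix[i][j] is 1 if the EC
    sub-subclass rows[i] has the RDM pattern columns[j], and 0 otherwise.
    """
    present = {(sub_subclass(ec), rdm) for ec, rdm in records}
    rows = sorted({r for r, _ in present},
                  key=lambda s: tuple(int(p) for p in s.split(".")))
    columns = sorted({c for _, c in present})
    matrix = [[1 if (r, c) in present else 0 for c in columns] for r in rows]
    return rows, columns, matrix


# Placeholder records: the pattern labels are not real RDM patterns.
records = [
    ("EC 1.1.1.1", "patternA"),
    ("EC 1.1.1.2", "patternA"),
    ("EC 2.7.1.1", "patternA"),
    ("EC 2.7.1.1", "patternB"),
]
rows, columns, matrix = dot_matrix(records)
```

Each column of `matrix` is then the bit vector of one RDM pattern over the EC sub-subclasses, which is the input needed for a clustering of patterns.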

2.4. Hierarchical clustering

After obtaining the EC-RDM dot matrix, we performed a hierarchical clustering of the RDM patterns, using the information whether each EC sub-subclass has a particular RDM pattern or not. The distance (D) between two RDM patterns (RDM1 and RDM2) can be formulated as follows:

\[ D(RDM_1, RDM_2) = 1 - T_c\bigl(V(RDM_1), V(RDM_2)\bigr) \qquad (1) \]

where V(RDM1) and V(RDM2) are the respective bit vectors of the RDM patterns RDM1 and RDM2, and each element of a vector corresponds to the existence (1) or nonexistence (0) of each sub-subclass of EC. Tc indicates the Tanimoto coefficient, which is defined as follows:

\[ T_c(x, y) = \frac{\text{number of bits where } x_i = 1 \text{ and } y_i = 1}{\text{number of bits where } x_i = 1 \text{ or } y_i = 1} \qquad (2) \]

where {x_i} and {y_i} are bit vectors [10]. We used the average linkage method for the hierarchical clustering.
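The clustering step can be sketched in plain Python. The code below assumes that the distance is one minus the Tanimoto coefficient of Equation (2) and implements a naive average-linkage agglomeration; the function names are hypothetical, the treatment of two all-zero vectors is an assumed convention, and the quadratic search is meant for illustration rather than for thousands of patterns.

```python
def tanimoto(x, y):
    """Tanimoto coefficient of two equal-length bit vectors."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumed convention: two all-zero vectors have coefficient 0.
    return both / either if either else 0.0


def distance(x, y):
    """Distance between two bit vectors, assumed to be 1 - Tc."""
    return 1.0 - tanimoto(x, y)


def average_linkage(vectors):
    """Agglomerate all vectors; return the merges as (left, right, distance).

    left and right are lists of indices into `vectors`; the distance between
    two clusters is the mean pairwise distance of their members.
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(distance(vectors[a], vectors[b])
                            for a in clusters[i] for b in clusters[j])
                mean = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or mean < best[0]:
                    best = (mean, i, j)
        mean, i, j = best
        merges.append((clusters[i], clusters[j], mean))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Three toy bit vectors over four hypothetical EC sub-subclasses.
vectors = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
merges = average_linkage(vectors)
```

In this toy input the first two vectors share two of three set bits, so they merge first at a distance of 1/3, and the third vector joins last at a distance of 1.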

2.5. Generalization of RDM patterns

Using the cluster information obtained in the above section, we constructed the generalized patterns of the RDM patterns to represent the variation of the RDM patterns in each cluster. We implemented an algorithm that compares character strings of the RDM patterns in the same cluster to generate their generalized pattern. In this generalization process, the hierarchy of KEGG Atomtypes (atom species, atom class, and atomtype) is used. The detailed procedure of generalizing two RDM patterns, RDM1 and RDM2, is as follows:

Step 1: All possible representations of RDM1 are generated and stored in {RDM1}.
Step 2: The following procedure (Step 2-2) is performed for each RDM1i of the set {RDM1}.


Step 2-2: RDM1i is separated into R1i, D1i, and M1i. RDM2 is also separated into R2, D2, and M2. Then, R1i and R2, D1i and D2, and M1i and M2 are compared respectively. When multiple atoms are incorporated into each D or M representation, they are compared at the corresponding position of atoms. That is, when comparing $D_{1i}$ (= $D^{1}_{1i} + D^{2}_{1i}$) with $D_{2}$ (= $D^{1}_{2} + D^{2}_{2}$), the comparison is done between $D^{1}_{1i}$ and $D^{1}_{2}$ and between $D^{2}_{1i}$ and $D^{2}_{2}$.
Step 3: The most matched case is selected and the generalized pattern is generated.

The priority of matching the atom representations when comparing KEGG Atomtypes in Step 2-2 is shown in Table 2. Generalized patterns are made under the following conditions. An example of generalization is also shown in Table 2 and Fig. 3.

i) The parts which have a complete match in Step 2-2 are output directly.
ii) The parts which match at the atom class level or atom species level in Step 2-2 are substituted by the atom class or atom species respectively.
iii) The parts which have no match in Step 2-2 are substituted by both components, separated by a comma and enclosed in parentheses.

Table 2. Definition of the priority in the atomtype comparison and examples of generalization between atomtypes.

Priority  Description                         Original atomtypes  Generalized pattern
1         Complete match                      P1b and P1b         P1b
2         Matching at the atom class level    O2c and O2b         O2
3         Matching at the atom species level  O1c and O3b         O
4         No match in comparison              P1b and C1b         (P1b,C1b)

[Figure 3: two RDM patterns and the generalized pattern derived from them.]

Fig. 3. An example of generalization.
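The priority rules of Table 2 translate into a short comparison function. The sketch below is a simplification under assumptions: it compares two patterns position by position only, so it omits Step 1 (enumerating alternative representations) and Step 3 (choosing the best-matching one), and the function names and the second example pattern are hypothetical.

```python
def generalize_atomtype(a, b):
    """Generalize two atomtype codes following the priority order of Table 2."""
    if a == b:
        return a                    # priority 1: complete match
    if a[:2] == b[:2]:
        return a[:2]                # priority 2: same atom class
    if a[0] == b[0]:
        return a[0]                 # priority 3: same atom species
    return "(%s,%s)" % (a, b)       # priority 4: no match


def generalize_rdm(pattern1, pattern2):
    """Position-wise generalization of two RDM patterns of identical layout."""
    fields = []
    for field1, field2 in zip(pattern1.split(":"), pattern2.split(":")):
        sides = []
        for side1, side2 in zip(field1.split("-"), field2.split("-")):
            atoms1, atoms2 = side1.split("+"), side2.split("+")
            if len(atoms1) != len(atoms2):
                raise ValueError("patterns differ in the number of atoms")
            sides.append("+".join(generalize_atomtype(x, y)
                                  for x, y in zip(atoms1, atoms2)))
        fields.append("-".join(sides))
    return ":".join(fields)


# The second pattern is made up to exercise priorities 1, 2 and 4.
generalized = generalize_rdm("O1c-O2c:*-P1b:P1b-P1b",
                             "O1c-O2b:*-C1b:P1b-P1b")
```

With these two inputs the R field keeps O1c and reduces O2c/O2b to the atom class O2, the D field falls back to the parenthesized form, and the M field is unchanged.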


3. Results

3.1. Relationship between EC sub-subclasses and RDM patterns

There were 3116 EC numbers (195 EC sub-subclasses) which correspond to at least one RDM pattern (1571 main types). Fig. 4 shows the EC-RDM dot matrix. Some RDM patterns correspond to many sub-subclasses of EC numbers. For example, the RDM pattern "O1c-O2c:*-P1b:P1b-P1b" in box A in Fig. 4 corresponds to 25 sub-subclasses of EC numbers and "S1a-S2a:*-C5a:C1b-C1b" in box B corresponds to 14 sub-subclasses of EC numbers. These patterns are found in reactions such as the hydrolysis of ATP and the formation of a thioester bond respectively. These reactions are highly significant and can be observed extensively in biochemical reactions, since they are frequently used as the energy source of other reactions.

[Figure 4: EC-RDM dot matrix. Vertical axis: EC sub-subclasses, from 1.1.1 to 6.5.1; horizontal axis: RDM patterns; boxes A and B mark the two patterns discussed above.]

[Figure 4, panels A to C: trajectory plots with horizontal axis "v perturbed".]

Figure 4: Single perturbation trajectory. Λ matrices were calculated to restore steady state after perturbations to the flux v3 in the network in Figure 1. Plots A and B use a Λ calculated to return a perturbation of 2 to 1, while plot C uses a Λ that returns a perturbation of 1/3 to 1. In each plot, the dotted line is the diagonal, the dashed line is the parabola described in Figure 3, and the solid line is the trajectory after several iterations of Equation (14). (A) Convergent regulation. A perturbed flux value of 0.1 will return to steady state after several regulatory steps. (B) Divergent regulation. A perturbed flux value of 3.5 will approach −∞. (C) Chaotic regulation. For some values of Λ and initial perturbations (here, the initial perturbation is 0.4), any regulation performed may behave chaotically, never converging on a steady state or diverging toward infinity.
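The three regimes listed in the caption (convergent, divergent, chaotic) have the same character as those of the standard logistic map x_{n+1} = r x_n (1 - x_n). The sketch below iterates that textbook map only; it is not Equation (14) of this paper, and the parameter values r and the starting points are illustrative choices.

```python
def logistic_trajectory(r, x0, steps):
    """Iterate x -> r * x * (1 - x) and return the visited values."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(r * x * (1.0 - x))
    return xs


# Convergent: for r = 2.5 the trajectory settles at the fixed point 1 - 1/r = 0.6.
convergent = logistic_trajectory(2.5, 0.1, 200)

# Divergent: a start outside [0, 1] runs off towards minus infinity.
divergent = logistic_trajectory(2.5, 1.5, 8)

# Chaotic regime: for r = 4 the trajectory stays within [0, 1]
# but does not settle on a fixed point.
chaotic = logistic_trajectory(4.0, 0.4, 200)
```

As in the caption, which regime occurs depends both on the map parameter and on the size of the initial perturbation.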

This dynamical regulation process behaves similarly to a logistic map [13], displaying regimes of convergence, divergence or apparent chaotic trajectories, depending on the values of the parameters Λ and v. With regard to metabolic regulation, this finding potentially implies that chaotic or divergent behavior might be easily encountered by regulatory networks, unless specific ranges of parameters are avoided. This may pose constraints on possible regulatory networks optimized through evolutionary adaptation.

4. Glycolysis

An obvious question is whether our method can be used to predict the topology and dynamics of regulation in real-world networks. As a simple example, we chose a simplified (condensed) version of the glycolytic pathway, previously used for similar testing of computational approaches (Figure 5) [15]. Similarly to what was done for the simple linear pathways (Figure 2), we approach this network by perturbing each flux individually and predicting the optimal network to restore homeostasis.


Figure 5: Perturbations in a simplified model of glycolysis. Solid lines represent metabolic reactions, and dashed lines represent predicted optimal metabolic regulation. Reactions represented as bold lines are the ones being perturbed. G = glucose, F = fructose-6-phosphate, B = fructose-1,6-bisphosphate, P = phosphoenolpyruvate, Y = pyruvate, L = lactate, T = adenosine triphosphate, D = adenosine diphosphate.


In all cases, only one regulatory metabolite was necessary for optimal regulation that restores a given steady state. For each of the reactions involving an energy carrier, ADP was predicted to act as the main regulatory molecule (Figures 5B, 5C, and data not shown). Lactate also acted as a negative feedback regulator on its own production (Figure 5D), and glucose acted as a negative regulator on the influx of glucose (Figure 5A).

5. Discussion

In this work we developed new algorithms and methods for predicting optimal metabolic regulation based on the topology and stoichiometry of a metabolic network. Thus far, we have applied these algorithms to small pathways that are linear in nature in order to understand how accurate and robust the predictions are. Initially we found that while a single regulatory scheme can be robust for some perturbed values (Figures 3 and 4), it quickly becomes clear that a single regulatory approach predicted by this method is incapable of effectively regulating all perturbations. For example, a regulatory scheme focused on regulating perturbations to a single flux will have little or no effect on other fluxes. We also observed that multiple applications of a single regulatory system can produce unexpected, apparently chaotic results (Figure 4C). While some of these results may be unrealistic consequences of the mathematical approximations used, they may also capture some fundamental properties of biological regulation systems evolved to respond to multiple perturbations. Recent work has shown, for example, that some metabolic states are more stable than others, and that perturbations occurring on top of unstable states can lead to cell death [9]. It is worth emphasizing that each of these predicted optimized regulatory mechanisms represents just that: the optimal amount of regulation necessary to respond to a given perturbation. In all cases explored (perturbations to a single flux in the network), the optimal controlling metabolite turns out to be either a reactant or product in the perturbed reaction. However, it remains a point of interest that for many perturbations in glycolysis, the controlling metabolite predicted most often was ADP. This is interesting because both ADP and ATP are known to be strong regulators (either activators or inhibitors) of glycolysis. 
This may point to the utility of this method as both a quantitative (degree of regulation necessary) and a qualitative (type of metabolite functioning as a regulator) prediction generator. The current model involves simplifying hypotheses and approximations, some of which may be unjustified from the biochemical point of view. These include the assumption that the regulatory response is based on concentration changes, rather than absolute concentration values; the fact that we do not include flux relaxation induced by plain kinetic effects; the use of arbitrary values for flux perturbations; the implementation of a dynamical process based on discrete time points; and the limitation to noncompetitive inhibition as the only form of feedback. In ongoing work, we are addressing each of these assumptions to determine their impact on our results, and possible strategies for more realistic implementations.

We plan to expand on this work and use it to explore more complex systems. At first, we will use this method to understand how it predicts regulation of different and multiple perturbations to a system. We expect that when two or more fluxes are perturbed, the regulatory network will quickly become complex and intricate. Next, we plan to explore the regulation of networks with complex topologies that include branching and cyclical pathways. Eventually we intend to apply this predictive method to whole-genome models of flux balance, such as the Escherichia coli model produced by Feist et al. [6] or the Saccharomyces cerevisiae model produced by Blank et al. [1].

Acknowledgements

The authors wish to thank Hsuan-Chao Chiu, Niels Klitgord, and Evan Snitkin for meaningful discussion and critical reading of the manuscript. Linear Programming calculations were performed using the software Xpress, kindly provided by Dash Optimization under free academic license. This work was partially supported by the NASA Astrobiology Institute, the US Department of Energy and the US National Institutes of Health (NIGMS).

References

[1] Blank, L.M., Kuepfer, L. and Sauer, U., Large-scale 13C-flux analysis reveals mechanistic principles of metabolic network robustness to null mutations in yeast, Genome Biol, 6(6):R49, 2005.
[2] Covert, M.W., Schilling, C.H. and Palsson, B., Regulation of gene expression in flux balance models of metabolism, J Theor Biol, 213(1):73-88, 2001.
[3] Covert, M.W. and Palsson, B.O., Constraints-based models: regulation of gene expression reduces the steady-state solution space, J Theor Biol, 221(3):309-25, 2003.
[4] Ebenhöh, O. and Heinrich, R., Stoichiometric design of metabolic networks: multifunctionality, clusters, optimization, weak and strong robustness, Bull Math Biol, 65(2):323-57, 2003.
[5] Edwards, J.S. and Palsson, B.O., Metabolic flux balance analysis and the in silico analysis of Escherichia coli K-12 gene deletions, BMC Bioinformatics, 1:1, 2000.
[6] Feist, A.M., Henry, C.S., Reed, J.L., Krummenacker, M., Joyce, A.R., Karp, P.D., Broadbelt, L.J., Hatzimanikatis, V. and Palsson, B.O., A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information, Mol Syst Biol, 3:121, 2007.
[7] Fell, D., Understanding the Control of Metabolism, Portland Press Ltd., 1997.
[8] Goyal, S. and Wingreen, N.S., Growth-induced instability in metabolic networks, Phys Rev Lett, 98(13):138105, 2007.
[9] Grimbs, S., Selbig, J., Bulik, S., Holzhütter, H.G. and Steuer, R., The stability and robustness of metabolic states: identifying stabilizing sites in metabolic networks, Mol Syst Biol, 3:146, 2007.
[10] Hatzimanikatis, V., Floudas, C.A. and Bailey, J.E., Optimization of regulatory architectures in metabolic reaction networks, Biotechnology and Bioengineering, 52(4):485-500, 1996.
[11] Heinrich, R. and Rapoport, T.A., A linear steady-state treatment of enzymatic chains. General properties, control and effector strength, Eur J Biochem, 42(1):89-95, 1974.
[12] Kauffman, K.J., Prakash, P. and Edwards, J.S., Advances in flux balance analysis, Curr Opin Biotechnol, 14(5):491-6, 2003.
[13] May, R.M., Simple mathematical models with very complicated dynamics, Nature, 261(5560):459-67, 1976.
[14] Shlomi, T., Eisenberg, Y., Sharan, R. and Ruppin, E., A genome-scale computational study of the interplay between transcriptional regulation and metabolism, Mol Syst Biol, 3:101, 2007.
[15] Vance, W., Arkin, A. and Ross, J., Determination of causal connectivities of species in reaction networks, Proc Natl Acad Sci USA, 99(9):5816-21, 2002.

COMPARATIVE DETERMINATION OF BIOMASS COMPOSITION IN DIFFERENTIALLY ACTIVE METABOLIC STATES

HSUAN-CHAO CHIU¹
[email protected]

DANIEL SEGRE¹,²
[email protected]

¹ Graduate Program in Bioinformatics, Boston University, Boston, MA, 02215, USA
² Departments of Biology and Biomedical Engineering, Boston University, Boston, MA, 02215, USA

Flux Balance Analysis (FBA) has been successfully applied to facilitate the understanding of cellular metabolism in model organisms. Standard formulations of FBA can be applied to large systems, but the accuracy of predictions may vary significantly depending on environmental conditions, genetic perturbations, or complex unknown regulatory constraints. Here we present an FBA-based approach to infer the biomass compositions that best describe multiple physiological states of a cell. Specifically, we seek to use experimental data (such as flux measurements, or mRNA expression levels) to infer best matching stoichiometrically balanced fluxes and metabolite sinks. Our algorithm is designed to provide predictions based on the comparative analysis of two metabolic states (e.g. wild-type and knockout, or two different time points), so as to be independent of possible arbitrary scaling factors. We test our algorithm using experimental data for metabolic fluxes in wild type and gene deletion strains of E. coli. In addition to demonstrating the capacity of our approach to correctly identify known exchange fluxes and biomass compositions, we analyze E. coli central carbon metabolism to show the changes of metabolic objectives and potential compensation for reducing power due to single enzyme gene deletion in the pentose phosphate pathway.

Keywords: flux balance analysis; systems biology; data integration; metabolic objectives

1. Introduction

An important goal of systems biology is to reconstruct and simulate biological networks to facilitate the understanding of complex cellular metabolism. Constraint-based approaches have been applied to characterize the cellular flux distribution and predict metabolic phenotypes for cells grown in different conditions. One of the most prominent constraint-based approaches, Flux Balance Analysis (FBA), relies on a steady state approximation and optimization algorithms to predict metabolic fluxes at the cellular level [15]. The steady state approximation translates into a set of constraints on the fluxes, namely that the net sum of all fluxes producing or consuming each metabolite has to be zero. FBA determines these steady state fluxes by searching the space of feasible solutions, a polyhedral space defined by multiple constraints, for a choice of fluxes that minimizes/maximizes an objective function associated with a biological task. For instance, for a unicellular organism, one may ask what is the solution that maximizes an appropriately defined growth (or biomass production) flux, reflecting selection for fast growth during evolution [15]. In addition to maximizing growth, van Gulik and Heijnen suggested maximization of ATP yield, based on the assumption that evolution drives maximal energy efficiency [14]. Bonarius et al. suggested minimization of overall intracellular flux, reflecting the hypothesis that organisms have evolved to maximize enzymatic efficiency [1].

Several works have proposed methods to identify objective functions from experimental data. Knorr et al. proposed a Bayesian-based probability ranking method to evaluate multiple objective functions [7]. Schuetz et al. have measured fluxes and evaluated different objectives with a Euclidean metric approach [11]. Among all the objectives studied by Schuetz et al., nonlinear maximization of the ATP yield best described unlimited growth on glucose in oxygen or nitrate respiring batch cultures, while linear maximization of the overall ATP or biomass yields achieved the highest accuracy under nutrient limited continuous cultures [11].

Although FBA optimal growth seems to work well in several cases, it has been shown to be sometimes insufficient for predicting perturbed metabolic states, such as the ones found in gene deletion knockout strains. A better way to determine mutant fluxes is to use Minimization Of Metabolic Adjustment (MOMA) [12], which assumes that the mutants stay as close to the wild type flux distribution as possible. One lesson learned from MOMA is that metabolic networks perturbed from a simple average behavior may be better described by objective functions different from standard growth rate maximization. One can imagine, in general, that a living system may switch its objective when facing a physiological change. For example, the diauxic shift in yeast, which is the switching from anaerobic growth to aerobic respiration upon depletion of glucose, is known to be correlated with widespread changes in the expression of genes involved in carbon metabolism, protein synthesis, and carbohydrate storage [3, 6]. Understanding the physiology of such a natural process is still an open challenge.
Lacking knowledge of objectives for perturbed cells and changes of objectives under different metabolic states limits the capacity to correctly describe metabolic networks using FBA methods. An alternative way to study metabolism is to infer metabolic flux objectives from available data. Comparative analyses of biomass compositions in different physiological states, either between wild type and mutants or throughout naturally occurring physiological transitions, could provide insight helpful towards understanding the design of metabolic networks. Previously Burgard and colleagues proposed ObjFind and BOSS to identify putative objective functions from flux measurements. Specifically, these methods identify the coefficients of importance responsible for flux distributions in E. coli and yeast [2, 4]. Uygun et al. proposed a multilayer optimization framework to discover the major fluxes of metabolic objective that account for the flux distribution in a mammalian cell [13]. However, these methods rely on flux measurements and cannot take advantage of other high throughput data. Here we present an FBA-based approach to infer the biomass compositions that best describe multiple physiological states of a cell. Our method is designed to incorporate high throughput data for comparatively determining metabolic objectives in two physiological states. As a first step, we analyze here flux data from E. coli central carbon metabolism pathways [5] to demonstrate our method for predicting metabolic objectives.

2. Method

2.1. Flux Balance Analysis

FBA describes the cellular level reaction rates (fluxes) under a steady state approximation, thereby imposing linear mass balance constraints. All the nutrients taken up from the extracellular environment are consumed to produce biomass or other byproducts, which are removed from the system without intracellular metabolite accumulation. The steady state equation responsible for mass balance can be written as follows:

\[ \frac{dx}{dt} = S v = 0 \tag{1} \]

where x is the vector of metabolites, v is the vector of reaction fluxes and S is the stoichiometric matrix of the network. S is an m by n matrix where m is the number of metabolites and n is the number of reactions. The value $S_{ij}$ in S is the stoichiometric coefficient for metabolite i in reaction j. Additional constraints such as lower and upper bounds for specific enzymatic reactions or nutrient uptake rates may also be imposed as $LB_j \le v_j \le UB_j$ for reaction $v_j$. FBA determines a specific flux prediction by maximizing/minimizing a linear objective function associated with a biological task. A typical FBA objective used in microbial systems is the maximization of biomass production [15], based on the assumption that unicellular organisms have been selected to reach maximum growth performance during evolution. Biomass production is approximated by a growth flux $v_{growth}$, which is defined as follows:

\[ c_1 X_1 + c_2 X_2 + \cdots + c_m X_m \xrightarrow{v_{growth}} \text{Biomass} \tag{2} \]

where c is the vector of biomass coefficients, whose component $c_i$ indicates the proportion of metabolite $X_i$ required for the formation of a unit of biomass. The linear programming statement for maximizing growth in FBA can be formulated as:

\[ \max \; v_{growth} \quad \text{s.t.} \quad S v = 0, \quad LB_j \le v_j \le UB_j \tag{3} \]
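To make Eqs. (1)-(3) concrete, the following sketch applies them to a hypothetical four-reaction network (uptake of A, A to B, B to biomass, A to byproduct) that is not part of the E. coli model used here. The network, the bounds, and the brute-force search over integer flux values are all assumptions made for illustration; a real FBA calculation would use a linear programming solver.

```python
from itertools import product

# Hypothetical toy network (rows: metabolites A, B; columns: reactions).
#   R0: -> A        (nutrient uptake)
#   R1: A -> B
#   R2: B ->        (growth / biomass sink)
#   R3: A ->        (byproduct secretion)
S = [
    [1, -1, 0, -1],  # metabolite A
    [0, 1, -1, 0],   # metabolite B
]
LOWER = [0, 0, 0, 0]
UPPER = [10, 10, 10, 10]
GROWTH = 2  # column index of the growth flux


def mass_balance(S, v):
    """Return S.v, the net production rate of each metabolite."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def is_steady_state(S, v):
    """Eq. (1): every metabolite is produced as fast as it is consumed."""
    return all(rate == 0 for rate in mass_balance(S, v))


def brute_force_fba(S, lower, upper, growth_index):
    """Maximize the growth flux over integer flux vectors within bounds."""
    grids = [range(lo, hi + 1) for lo, hi in zip(lower, upper)]
    feasible = (v for v in product(*grids) if is_steady_state(S, v))
    return max(feasible, key=lambda v: v[growth_index])


best = brute_force_fba(S, LOWER, UPPER, GROWTH)
print(best)
```

In this toy case the growth flux is limited by the uptake bound, and any flux diverted to the byproduct lowers growth, so the optimum routes all uptake to biomass.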

2.2. Objective inference

We extend the conventional FBA formulation to concurrently infer metabolic objectives in two different metabolic states of a system. Here we limit our search to maximization of biomass production as an objective function, but we allow the biomass composition to assume in principle any vector of coefficients. For instance, the two states could be the wild type and a given mutant. The goal is then to infer the corresponding $c^1$ and $c^2$ vectors of biomass coefficients best representing the metabolic objectives for the two corresponding physiological states.


To reverse engineer the objectives, we implement a linear optimization procedure to identify the FBA objectives maximally compatible with given vectors $E^1$ and $E^2$ encoding reference experimental data:

\[
\begin{aligned}
\min \; & \sum_{j:\, E_j \neq 0} \left| v_j^1 E_j^2 - E_j^1 v_j^2 \right| \\
\text{s.t.} \; & S' \cdot v' = 0, \qquad LB_j \le v_j \le UB_j, \qquad \sum_j |v_j| \ge v_{min}, \\
\text{where} \; & S' = \begin{bmatrix} S & 0 \\ 0 & S \end{bmatrix}, \qquad v' = \begin{bmatrix} v^1 \\ v^2 \end{bmatrix}
\end{aligned}
\tag{4}
\]

where v' is the vector of fluxes to be determined and 0 is a zero matrix with the same dimensions as S. In this optimization problem the overall flux activity (the sum of the absolute values of all fluxes) is required to be above a threshold $v_{min}$ (e.g. 25% of the flux activity obtained with regular FBA). Biomass production reactions for the first and second state are removed from the stoichiometric matrix and a sink reaction for each biomass component is added. Each single biomass component originally flowing into biomass is exported separately, and the inferred fluxes correspond to the biomass coefficients for the corresponding metabolic state. Our optimization method optimizes biomass coefficients simultaneously for two metabolic states, hence allowing us to take advantage of the fact that certain data could provide only relative changes between reaction activities in the two states. Here we limit the optimization to intracellular fluxes.

To test our objective function inference approach and demonstrate its performance, we apply our method to experimental flux measurements in E. coli central carbon metabolism pathways, taken from the paper published by Ishii et al. [5]. In their flux measurements, the wild type strain of E. coli K-12 and 24 single gene deletion mutants of glycolysis and the pentose phosphate pathway were grown in glucose-limited chemostat cultures. The mutant cells were grown at a fixed dilution rate of 0.2 hours⁻¹, and wild-type cells cultured at the same specific growth rate were used as a reference sample. They also cultured wild type cells at different dilution rates (0.1, 0.4, 0.5, and 0.7 hours⁻¹) for comparison. In this work, we apply an E. coli central carbon metabolism FBA model [9] to study these data.
The biomass production reaction in this model is a sink for the linear combination of several metabolites that are precursors of amino acids, nucleotides or lipids:

0.205 g6p + 0.361 e4p + 1.496 3pg + 1.787 oaa + 1.079 akg + 2.833 pyr + 0.898 r5p + 0.519 pep + 0.129 g3p + 0.071 f6p + 18.225 nadph + 3.748 accoa + 3.547 nad + 55.703 atp + 55.703 h2o → 18.225 nadp + 3.748 coa + 3.547 nadh + 55.703 adp + 55.703 pi + 41.025 h   (5)
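The two-state construction of Section 2.2 can be sketched as follows. The block-diagonal matrix S' follows the description in the text. The mismatch function assumes a scale-invariant form, the sum over j of |v1_j E2_j − E1_j v2_j|, which is one reading of the objective in Eq. (4) and should be treated as an assumption. The toy matrix, data, and function names are hypothetical.

```python
def block_diagonal(S):
    """Build S' = [[S, 0], [0, S]] for two metabolic states."""
    n_cols = len(S[0])
    zeros = [0] * n_cols
    top = [list(row) + zeros for row in S]
    bottom = [zeros + list(row) for row in S]
    return top + bottom


def relative_mismatch(v1, v2, e1, e2):
    """Assumed objective: sum_j |v1_j * e2_j - e1_j * v2_j|.

    It vanishes whenever v1_j / v2_j equals e1_j / e2_j, so only the
    ratio between the two states matters, not the scale of the data.
    """
    return sum(abs(a * d - c * b) for a, b, c, d in zip(v1, v2, e1, e2))


def total_activity(v):
    """Sum of absolute fluxes, to be kept above the threshold v_min."""
    return sum(abs(x) for x in v)


# Hypothetical 1-metabolite, 2-reaction network: -> A ->
S = [[1, -1]]
S_prime = block_diagonal(S)

# Hypothetical data: state 1 is twice as active as state 2.
E1, E2 = [4.0, 4.0], [2.0, 2.0]
v1, v2 = [10.0, 10.0], [5.0, 5.0]  # same 2:1 ratio, different scale

print(S_prime)
print(relative_mismatch(v1, v2, E1, E2))  # fluxes match the data ratio
print(relative_mismatch(v1, v1, E1, E2))  # fluxes ignore the data ratio
print(total_activity(v1 + v2))
```

Under this assumed form, an all-zero flux vector would also give zero mismatch, which is consistent with the need for the activity threshold described in the text.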


Because we are dealing with a small model that contains many sink reactions for the metabolites, the optimization in Eq. (4) has many alternative optima. Therefore we further use Minimization Of Metabolic Adjustment (MOMA) [12] to find the most probable steady state solution for $v_i$ by exploring the solution space obtained from optimizing Eq. (4). Coefficients for the biomass precursors listed above are inferred after this primary and secondary optimization process.
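The secondary step can be pictured as choosing, among flux vectors that are equally optimal for Eq. (4), the one closest to a reference flux distribution. The sketch below does this over an explicit, hypothetical list of candidate optima using Euclidean distance, in the spirit of MOMA. The reference vector, the candidates, and the function names are assumptions; the actual method optimizes over the whole solution space rather than a finite list.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two flux vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_to_reference(candidates, reference):
    """Pick the candidate flux vector nearest to the reference."""
    return min(candidates, key=lambda v: euclidean_distance(v, reference))


# Hypothetical reference flux distribution and alternative optima.
reference = [10.0, 6.0, 4.0]
alternative_optima = [
    [10.0, 10.0, 0.0],
    [10.0, 5.0, 5.0],
    [10.0, 0.0, 10.0],
]

chosen = closest_to_reference(alternative_optima, reference)
print(chosen)
```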

3. Results

Performing gene deletions is a commonly used approach to study how an organism responds to perturbations. FBA and MOMA have been used for generating predictions of these metabolic responses. MOMA, in particular, has been shown to be more accurate for predicting mutant fluxes than FBA. However, there are cases in which neither FBA nor MOMA objectives seem to capture the true metabolic state well enough (Fig. 1). Hence this is a good test case for our algorithm, in search of biomass composition coefficients that would be compatible with experimental data.

[Fig. 1 plots: panels (a) and (b), predicted flux versus v_i (experiment).]

Fig. 1. Intracellular flux determination for mutant strain Δzwf. Units for both axes are millimoles per gram dry weight per hour (mmol/gDW/h). (a) FBA flux predictions for the mutant do not correlate well with experimental measurements. (b) MOMA predictions are expected to better correlate with experimental fluxes. However, in this case, even MOMA predicted mutant fluxes are not satisfactory enough for inferring biomass coefficients.

We applied our method to flux measurements in E. coli central carbon metabolism pathways [5] to infer the metabolic objectives in wild type and mutant strains. The reference (Ref) strain we use is the average of the four replicates available experimentally. Fig. 2 shows the correlation of predicted and experimental exchange rates (which were not part of the input of the above inference algorithm). Our predictions for glucose uptake rates agree with all wild type and mutant measurements studied here. In general, predicted oxygen uptake rates match well with experiments except for those of the GR03 wild type strain (see Table 1). The less accurate predictions for oxygen uptake or CO2 production rates may be caused by reactions consuming or producing these compounds that are not in central carbon metabolism pathways. For instance, Ubiquinone-8 biosynthesis requires oxygen. Therefore, under-predicted oxygen uptake rates will result in a corresponding under-prediction of Ubiquinone-8 related reactions, such as NADH dehydrogenase or succinate dehydrogenase. In addition, predictions may also be affected by inaccurate flux measurements. For example, three CO2-associated reactions have large standard deviations (larger than 0.5*mean, see Table S5B in [5]) in the wild type replicates, possibly due to experimental difficulties or resulting from the fitting procedure for flux corrections to achieve isotopomeric steady state.

[Fig. 2 plots: panels (a) and (b), predicted versus experimental exchange rates; horizontal axis: v_j (experiment); point labels include glc, O2, CO2, ethanol, GR03, GR04, and Δzwf.]

Fig. 2. Predicted uptake and secretion rates in wild type and three mutant strains Δzwf, Δpgl and Δgnd (mutants for pentose phosphate pathway single gene deletion). Unit for flux is millimoles per gram dry weight per hour (mmol/gDW/h). Negative values refer to uptake rates. (a) Glucose uptake rates in all cultures are predicted quite well. Some oxygen uptake rates are under-predicted in wild type strains under high dilution rates. The wild type strain with the largest dilution rate (GR04) has a large deviation for CO2 predictions. (b) All significant ethanol production rates are correctly predicted.

3.1. Conserved biomass coefficients across different glucose supply rates

For the wild type strains grown at different dilution rates, we implement our algorithm relative to the reference strain (Ref) mentioned above. In Fig. 3, the predicted production rates of the ten biomass precursors defined in the FBA model are plotted against the corresponding biomass coefficients. A linear correlation is observed across all dilution rates, ranging from an almost glucose-starved state to a nearly unlimited glucose supply. In addition, the slope of the line determined by the aligned data points roughly reflects the growth rate of each wild type strain. For instance, the slope observed in Fig. 3d is roughly 1.8 times the one in Fig. 3b, which matches the fold change of growth rates (0.7 vs. 0.4 h⁻¹) between these two experiments. These results suggest that E. coli grown in glucose supply cultures applies robust metabolic objectives for biomass precursors, in agreement with the FBA optimal growth assumption for central carbon metabolism, regardless of glucose supply rate.
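The slope comparison can be reproduced with a least-squares line through the origin. In the sketch below, the x values are the precursor coefficients of Eq. (5), while the y values are synthetic production rates generated as growth rate times coefficient. They are placeholders used only to show the calculation, not the fluxes predicted in this study.

```python
# Biomass precursor coefficients from Eq. (5).
COEFFICIENTS = {
    "g6p": 0.205, "e4p": 0.361, "3pg": 1.496, "oaa": 1.787, "akg": 1.079,
    "pyr": 2.833, "r5p": 0.898, "pep": 0.519, "g3p": 0.129, "f6p": 0.071,
}


def slope_through_origin(xs, ys):
    """Least-squares slope of y = k * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def synthetic_rates(growth_rate):
    """Hypothetical production rates: growth rate times coefficient."""
    return [growth_rate * c for c in COEFFICIENTS.values()]


xs = list(COEFFICIENTS.values())
slope_gr02 = slope_through_origin(xs, synthetic_rates(0.4))  # 0.4 per hour
slope_gr04 = slope_through_origin(xs, synthetic_rates(0.7))  # 0.7 per hour

# The ratio of slopes equals the ratio of growth rates, 0.7 / 0.4.
print(round(slope_gr04 / slope_gr02, 2))
```

With ideal data the ratio is 0.7 / 0.4 = 1.75, close to the factor of roughly 1.8 read from Fig. 3.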

3.2. Influence of single gene deletion in pentose phosphate pathway

The pentose phosphate pathway is responsible for generating NADPH and nucleotides. A perturbation in the pentose phosphate pathway could change the levels of NADPH and nucleotides and may result in less efficient growth. To study the changes of metabolic objectives caused by the deletion of pentose phosphate pathway genes, we applied our algorithm to three single gene deletion mutants, Δzwf (glucose 6-phosphate-1-dehydrogenase), Δpgl (6-phosphogluconolactonase) and Δgnd (6-phosphogluconate dehydrogenase), relative to the Ref state. Our goal was to see whether considerable changes of biomass coefficients or flux rerouting could be detected.

[Fig. 3 plots: panels (a) GR01, (b) GR02, (c) GR03, (d) GR04; horizontal axis: expected biomass coefficient; PYR labels the pyruvate data point.]

Fig. 3. Predicted production rates for biomass precursors in wild type strain under different dilution rates. Units for both axes are millimoles per gram dry weight per hour (mmol/gDW/h). The y coordinate for each data point represents the predicted flux production rate for the corresponding biomass component, and the x coordinate is the biomass coefficient taken from the FBA model (see Eq. (5)). E. coli is cultured at a dilution rate of 0.1 h⁻¹ (a), 0.4 h⁻¹ (b), 0.5 h⁻¹ (c), and 0.7 h⁻¹ (d), respectively [5]. The slope of the data line roughly reflects the growth rate for each experiment. Pyruvate coefficients in GR02 and in GR03 are erroneously predicted to be zero. This might be related to the less accurate prediction of CO2 production rate, since several reactions consuming or producing pyruvate generate CO2.


[Fig. 4 diagram: metabolite labels g6p, 6pgl, 6pgc, r5p, x5p; reaction labels zwf and pgl; cofactor NADPH (shown twice); regions labelled Pentose Phosphate Pathway and Glycolysis.]

Fig. 4. Map of the initial reactions in the pentose phosphate pathway. zwf and gnd are responsible for NADPH production to generate reducing power for growth.

Fig. 4 illustrates the reactions being knocked out from the pentose phosphate pathway in our computational study. Detailed predictions for biomass components and several key fluxes are shown in Table 1. Note that all mutants were grown in chemostat cultures at the same dilution rate (0.2 h⁻¹) as the wild type strain (column Ref in Table 1). Hence they can all be considered to grow at the same rate. The predicted production rates for the different biomass precursors can therefore be directly compared between different mutants, and relative to the corresponding biomass composition coefficients used in FBA calculations, appropriately normalized. Our results show that most biomass coefficients for biomass precursors change proportionally to the coefficients themselves across different strains. However, individual deviations from this trend can be seen. Fig. 5 shows the predicted production rates for biomass precursors in wild type and mutants. Amino acid and nucleotide precursors (e4p bar to pep bar) tend to be over-produced in Δpgl and under-produced in Δgnd, compared to the Ref strain. Meanwhile, the measured dry weights for Δpgl and Δgnd show the same trend as the production rates for these biomass precursors. One explanation for the deviations is that these mutants may not grow at exactly the same rate due to possible experimental error, since the dry weight matches the under/over production trend. Another interpretation would be that these mutants reprogram their fluxes differently in response to gene deletion. However, more investigation would be required to draw a clear conclusion.



[Fig. 5 plots: left panel, bars for g6p, e4p, 3pg, oaa, akg, pyr, r5p, pep, g3p, f6p; right panel, dry weight; legend: FBA model, Ref (0.2 h⁻¹), Δzwf (0.2 h⁻¹), Δpgl (0.2 h⁻¹), Δgnd (0.2 h⁻¹).]

Fig. 5. Left panel is the predicted production rates for biomass precursors (mmol/gDW/h) in wild type and mutants under the same dilution rate (0.2 h⁻¹). Right panel is the measured dry weight (g/L) for these strains.


NADPH serves as the electron donor in reductive biosynthesis. Gene deletions in the pentose phosphate pathway perturb NADPH levels and further cause oxidative damage to the mutants [8]. One possible response for these mutants is to reroute their fluxes and generate NADPH from NADP in other pathways. Our results suggest that these mutants may use another strategy for replenishing the NADPH level. As shown in Table 1, all three mutants are predicted to have higher PntAB transhydrogenase activity, suggesting that mutants may replenish the NADPH level by converting NADH into NADPH. This prediction supports the previous suggestion that PntAB transhydrogenase plays an important role in generating NADPH in E. coli [10]. The predicted PntAB flux ratio for Δzwf/Ref (1.55) agrees with the previously reported Δzwf/wild-type mRNA ratio (about 1.7) [10].

In contrast to the predicted robust metabolic requirements for biomass precursors, the cofactor requirements vary. NAD and NADPH requirements show a considerable increase per unit of biomass production in Δzwf and Δpgl. It is not clear how to biologically interpret the increase of redox requirements in these mutants. One possible explanation is that the mutations result in redox imbalance and the regulatory networks react accordingly, directing the network operation for central carbon metabolism in unusual ways.

In all cases analyzed here, ATP coefficients are predicted to be zero. This is because ATP synthase is present in the FBA model, and we have no information about the proton and phosphate uptake rates. Hence, the reverse ATP synthase flux is indistinguishable from the sink flux of ATP in biomass. Therefore the flux through the ATP synthase reaction actually contains the ATP biomass production (but in the opposite direction). When we block (set to zero) ATP synthase in the model, the ATP biomass coefficient equals the absolute value of the ATP synthase flux listed in Table 1. However, this is not enough to draw a conclusion at this point on the actual ATP biomass coefficient for each strain. This issue could be examined in detail in the future, with more experimental information.

4. Discussion

We proposed an FBA-based approach to infer the biomass compositions that best describe multiple physiological states of a cell. Our results show that E. coli maintains robust biomass coefficients for biomass precursors in central carbon metabolism pathways under glucose supply medium, ranging from an almost glucose-starved state to a nearly unlimited glucose supply. This result suggests that E. coli operates its central carbon metabolism pathways with the same biomass objective, in agreement with optimal growth criteria under glucose supply medium. One should keep in mind that this result might be partially biased by the fact that experimental inference of fluxes requires fitting to a stoichiometric model that usually involves a biomass production flux as well. Our predictions for mutants indicate that there is an increased usage of the PntAB transhydrogenase flux, suggesting another potential strategy for the mutants to compensate for the less efficient NADPH production caused by single gene deletion in the pentose phosphate pathway.

Some of our flux predictions cannot be fully understood, partly due to the use of an incomplete model (as opposed to a genome-scale one) and partly due to potential experimental errors in the flux measurements. For instance, if we had information about Ubiquinone-8 associated fluxes, we could correct the missing information in the current model and improve the accuracy of oxygen uptake rate prediction. On the other hand, it would be difficult to apply a genome-scale E. coli FBA model in our study, since the experimental data is limited to central carbon metabolism pathways. At present, large scale flux measurements are still unavailable, due to experimental difficulties. One way to overcome this limitation would be to take advantage of other types of high throughput data. Our method is designed to incorporate not only flux measurements, but also other high throughput data as the reference vector E in Eq. (4), such as mRNA expression or protein levels for two distinct physiological states.

In ongoing work, we are applying our method to time series data such as gene expression along the cell cycle, to provide insights into the physiology of cellular growth. This will allow us to learn more about how living organisms organize their biomass requirements and manage energy or redox balance during their life cycle. The method should provide insights into how a cell allocates its metabolic resources in a time-dependent and condition-specific manner, and can be extended to integrate multiple data sources with FBA models, to shed new light on the system-level organization of metabolic networks.

Biomass     FBA       Biomass coefficients
component   model*    Ref       Δzwf      Δgnd      Δpgl      GR01*     GR02*     GR03*     GR04*
                      0.2 h^-1  0.2 h^-1  0.2 h^-1  0.2 h^-1  0.1 h^-1  0.4 h^-1  0.5 h^-1  0.7 h^-1

Biomass precursors
g6p         0.043     0.043     0.044     0.050     0.038     0.040     0.043     0.042     0.044
e4p         0.076     0.089     0.089     0.101     0.076     0.080     0.086     0.085     0.088
3pg         0.314     0.299     0.296     0.343     0.257     0.272     0.288     0.292     0.297
oaa         0.375     0.398$    0.394     0.454     0.345     0.364     0.384     0.387     0.396
akg         0.226     0.243     0.242     0.279     0.210     0.222     0.235     0.236     0.241
pyr         0.594     0.615     0.610     0.702     0.000     0.562     0.000     0.000     0.613
r5p         0.188     0.198     0.197     0.225     0.169     0.182     0.190     0.191     0.196
pep         0.109     0.109$    0.108     0.124     0.094     0.098     0.104     0.106     0.107
g3p         0.027     0.033     0.032     0.037     0.029     0.030     0.033     0.032     0.033
f6p         0.015     0.021     0.022     0.024     0.018     0.018     0.020     0.021     0.021

Cofactors
atp#        11.684    0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000
nad         0.744     23.631    40.600    63.507    25.496    0.000     23.516    0.000     0.000
nadph       3.823     31.252    45.218    61.827    34.733    12.992    28.807    12.322    21.787
accoa       0.786     0.825     0.817     0.941     0.000     0.752     0.000     0.000     0.000

Specific reactions        Ref       Δzwf      Δgnd      Δpgl      GR01*     GR02*     GR03*     GR04*

NADPH->NADH               0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000
NADH->NADPH£              27.713    42.818£   57.991£   32.026£   9.538     25.244    9.618     15.141
NADH->NAD©                9.279     14.506    21.057    9.053     2.526     8.043     1.146©    [illegible]
ADP->ATP (ATP synthase)#  -5.906    -7.114    -6.839    -8.288    -5.030    -5.695    -6.455    -9.590
EX_ac                     0.000     0.000     0.000     1.245     0.000     1.389     1.597     1.371
EX_akg                    0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000
EX_co2                    8.251     9.747     9.482     9.456     7.512     6.128     6.311     12.380
EX_etoh                   0.000     0.013     0.000     0.000     0.000     0.000     0.058     0.044
EX_for                    0.000     0.000     0.000     0.532     0.000     0.594     0.600     0.000
EX_fum                    0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000
EX_glc                    -2.934    -3.178    -3.361    -2.922    -2.676    -2.525    -2.653    -3.813
EX_h2o                    4.252     8.711     15.300    2.103     -2.172    2.939     -3.947    -6.669
EX_h                      9.924     6.903     0.955     12.474    15.098    8.902     16.164    25.450
EX_lac_D                  0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000
EX_o2                     -5.792    -8.765    -11.867   -6.006    -2.250    -4.785    -1.408©   -2.786©
EX_pi                     -0.792    -0.788    -0.904    -0.681    -0.722    -0.763    -0.769    -0.787
EX_succ                   0.000     0.000     0.000     5.2E-6    0.000     0.000     0.000     0.000

Table 1. Predicted biomass production and important fluxes (mmol/gDW/h). Negative values refer to uptake fluxes. *Normalized to the same scale as the Ref column for comparison. £PntAB transhydrogenase activities increase in all three mutants. #The ATP biomass coefficient would be the absolute value of the ATP synthase flux if we blocked the ATP synthase reaction in the model. $One flux pair (Ref and Δzwf) fails to predict the correct value for oaa and pep (results not shown). This is due to an erroneous prediction for a single reaction, ppc (phosphoenolpyruvate carboxylase), which converts pep and co2 into oaa. The deviation of ppc fluxes in the two predictions of Ref biomass (Ref vs. Δzwf and Ref vs. Δgnd) matches the deviation of co2 production rates. In addition, the flux measurement for ppc has a large standard deviation [5]. ©The NADH dehydrogenase flux seems to be under-predicted in GR03 and GR04 due to the imprecise oxygen uptake prediction.


Acknowledgements

The authors would like to thank Evan Snitkin, Niels Klitgord and William Riehl for discussion and critical reading of the manuscript. Linear programming calculations were performed using the software Xpress, kindly provided by Dash Optimization under a free academic license. This work is supported by research grants from the US National Institutes of Health (5012846-00) and the US Department of Energy (DE-FG02-07ER64388 and DE-FG02-07ER64483).

References

[1] Bonarius, H.P.J., Hatzimanikatis, V., Meesters, K.P.H., et al., Metabolic flux analysis of hybridoma cells in different culture media using mass balances, Biotechnol Bioeng, 50(3):299-318, 1996.
[2] Burgard, A.P. and Maranas, C.D., Optimization-based framework for inferring and testing hypothesized metabolic objective functions, Biotechnol Bioeng, 82(6):670-7, 2003.
[3] DeRisi, J.L., Iyer, V.R. and Brown, P.O., Exploring the metabolic and genetic control of gene expression on a genomic scale, Science, 278(5338):680-6, 1997.
[4] Gianchandani, E.P., Oberhardt, M.A., Burgard, A.P., et al., Predicting biological system objectives de novo from internal state measurements, BMC Bioinformatics, 9:43, 2008.
[5] Ishii, N., Nakahigashi, K., Baba, T., et al., Multiple high-throughput analyses monitor the response of E. coli to perturbations, Science, 316(5824):593-7, 2007.
[6] Johnston, M. and Carlson, M., The Molecular Biology of the Yeast Saccharomyces: Gene Expression, 1992.
[7] Knorr, A.L., Jain, R. and Srivastava, R., Bayesian-based selection of metabolic objective functions, Bioinformatics, 23(3):351-7, 2007.
[8] Minard, K.I. and McAlister-Henn, L., Antioxidant function of cytosolic sources of NADPH in yeast, Free Radic Biol Med, 31(6):832-43, 2001.
[9] Palsson, B.O., Systems Biology: Properties of Reconstructed Networks, Cambridge University Press, 2006.
[10] Sauer, U., Canonaco, F., Heri, S., et al., The soluble and membrane-bound transhydrogenases UdhA and PntAB have divergent functions in NADPH metabolism of Escherichia coli, J Biol Chem, 279(8):6613-9, 2004.
[11] Schuetz, R., Kuepfer, L. and Sauer, U., Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli, Mol Syst Biol, 3:119, 2007.
[12] Segre, D., Vitkup, D. and Church, G.M., Analysis of optimality in natural and perturbed metabolic networks, Proc Natl Acad Sci USA, 99(23):15112-7, 2002.
[13] Uygun, K., Matthew, H.W. and Huang, Y., Investigation of metabolic objectives in cultured hepatocytes, Biotechnol Bioeng, 97(3):622-37, 2007.
[14] van Gulik, W.M. and Heijnen, J.J., A metabolic network stoichiometry analysis of microbial growth and product formation, Biotechnol Bioeng, 48(6):681-698, 1995.
[15] Varma, A., Boesch, B.W. and Palsson, B.O., Stoichiometric interpretation of Escherichia coli glucose catabolism under various oxygenation rates, Appl Environ Microbiol, 59(8):2465-73, 1993.

SUFFIX TECHNIQUES AS A RAPID METHOD FOR RNA SUBSTRUCTURE SEARCH

RAPHAEL A. BAUER 1,2,*
KRISTIAN ROTHER 3,4,*
JANUSZ M. BUJNICKI 3,4
ROBERT PREISSNER 1

1 Institute of Molecular Biology and Bioinformatics, Structural Bioinformatics Group, Charité Universitätsmedizin (Medical University), Arnimallee 22, 14195 Berlin, Germany
2 Graduate School: Genomics and Systems Biology of Molecular Networks, Monbijoustr. 2, 10117 Berlin, Germany
3 International Institute of Molecular and Cell Biology in Warsaw, ul. Ks. Trojdena 4, 02-109 Warsaw, Poland
4 Laboratory of Bioinformatics, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, ul. Umultowska 89, 61-614 Poznan, Poland

The RNA Ontology Consortium recently proposed a two-letter representation of the RNA backbone conformation. In this study, we compare the suite notation to a custom string representation that utilizes η-θ pseudotorsion angles. Both representations were used to assess similarity and self-similarity in several RNA structure datasets. To detect similarities between two RNA structures, we use suffix techniques that allow the detection of substructure similarity within some degree of inexactness. The suite representation as well as the pseudotorsion representation was tested on four diverse RNA datasets. Detecting structural similarities in these datasets allowed us to recover many homologous structural elements that have implications for further understanding of the RNA apparatus in Systems Biology. The software as well as the utilized datasets are freely available from http://suiterna.sourceforge.net.

Keywords: RNA; structural search; suffix array; suite encoding

1. Introduction

String-based approaches to RNA structure analysis are widely used as far as secondary structures are concerned, but there have been few attempts to express 3D features in a string notation. Recently, the RNA Ontology Consortium [11] proposed a string representation for the conformation of RNA backbones. This in turn allows the use of classical string matching methodology to compare structural features. In this manuscript, we explore how suffix techniques can be used to find similar regions in RNA backbone strings.

*Both authors contributed equally to the paper.


RNA secondary structures are most commonly expressed in the dot-bracket grammar, which contains all nested Watson-Crick and wobble base pairs. This string notation is easy to handle, and has therefore been widely used to describe local motifs [10], for computational approaches comparing RNA sequences by tree grammars [16], and for aligning two or more sequences [4]. This notation is not sufficient to distinguish subtle structural motifs such as the sarcin-ricin motif, RNAse P, pseudoknots, and tertiary interactions. These features depend on specific base pairing and stacking interactions, and on a specific arrangement of the RNA backbone.

The RNA Ontology Consortium has bundled efforts to describe RNA structures. It provides a platform where structural bioinformaticians can exchange ideas and discuss formal nomenclature. Systematic approaches to describing RNA tertiary structure have been started from many sides. A typology of base pairs as the basic unit of which RNA is built was defined [19]. This made it possible to identify interchangeable pairs of base-base interactions (known as the isostericity principle) [12]. Stacking is conceived as a major stabilizing force, and two complementary typologies have been introduced [13]. To describe larger local structural units, circular topologies (residues interconnected by backbone, base-pair or stacking interactions) have been introduced. Assembly of these building blocks has been used successfully in constructing tertiary structures, given that the topology is known or well predicted [15].

Jane Richardson et al. created a string representation of the RNA backbone [17], in which the backbone conformation of ribose-to-ribose 'suite' units is represented by two letters. The most significant features for analyzing the RNA backbone are its torsion angles. For each base there are six of them, one for each bond from one phosphodiester unit to the next. These torsion angles show a characteristic distribution. More distinct clusters of the torsions can be found if RNA 'suites' (units from one ribose to the next) are considered instead of the traditional phosphate-to-phosphate units [14]. Each suite consists of seven torsion angles, including both C4'-C3' bonds. The torsion angles were clustered, each cluster being defined as a hyperellipsoid in the 7D space formed by the seven torsions of one suite. In total, 46 distinct conformations of the backbone were identified, and a two-character code was assigned to each cluster. The first character corresponds to the first three torsion angles, and the second to the other four. Thus, it is possible to write an entire RNA 3D structure as a 1D string representing the backbone.

The main disadvantage of the suite representation is that its scope is limited to well-defined backbones. For a high-quality dataset, it covers 90-95% of the residues in RNA structures. The other residues are disregarded either because one of the backbone torsions is outside well-defined boundaries, or because the suite is not close enough to any of the hyperellipsoids in 7D space. Most of the unassigned residues are in flexible regions with a high temperature factor, or they belong to conformations that are too sparsely populated to form a separate cluster.

An alternative description of the RNA backbone is based on pseudotorsion angles. For this, the RNA structure is reduced to the C4' and P atoms, similar to the Cα trace of proteins. Between these atoms, two pseudotorsions η and θ are defined.


Even though they are more coarse-grained, the η-θ angles encode important features such as the sugar pucker to a satisfying degree. The Amigos program can be used to calculate pseudotorsions [6]. The P and C4' atoms are frequently used to construct an initial backbone trace in X-ray crystallography. Recently, it was reported that using P-C1' pseudotorsions improves the assignment of the backbone and ribose to electron density maps (K. Keating, personal communication), but it was not explored how these pseudotorsions map to other structural features.

It is very tempting to utilize these backbone representations to compare local structures of RNA to each other. There are only a few tools available to compare RNA structures. Most of them are based on secondary structures and use the dot-bracket grammar. Among them, RNAforester [16], Vienna [9] and ARTS [5] are the most common. Recently a web server (SARSA) was released [3] that uses a custom vector quantization to cluster the RNA bases into 23 distinct conformers that are translated into a string representation. SARSA subsequently applies traditional string alignments to find similar motifs. SARSA is especially useful when applied to multiple alignments of RNA structures; however, a search against a database of RNA structures is not supported. The RNAFRABASE web site (http://rnafrabase.ibch.poznan.pl/) contains a large number of loop fragments from RNA structures, but it is very limited both in the kind of fragments contained and in the available search methodology. To our knowledge, there exists no method that allows fast queries for similar RNA substructures against a database.

Therefore, we decided to use string representations of the RNA backbone in order to take advantage of existing algorithmic solutions for efficient string search. Alternatively, we calculate a pseudotorsion representation of the η-θ angles. To cope with the thousands of motifs and thousands of RNA structures available, we use a suffix technique [7] that holds all information in an index that can be traversed in almost linear time. In brief, the main objectives of this work are:

(1) Verification of the applicability of the RNA Ontology Consortium suite code, by examining the suites of differently structured RNA.
(2) Presentation of a suffix method to compare RNAs to each other, giving an overview of which structures and substructures are similar.
(3) Discussion of possible alternatives (regarding the structure-to-string coding and the search algorithms used) and applications.

2. Methods

We constructed suffix arrays from strings consisting of the RNA Ontology Consortium suite codes for four different datasets: motifs from the SCOR database, all tRNA structures, a high-resolution dataset, and the representative RNADB05 set. Each suffix array was then queried for matching substrings to detect structural similarities. As an alternative approach, strings representing the η-θ angles of the RNA backbone were constructed and processed in the same way.

2.1. Datasets used

2.1.1. SCOR dataset

First, we wanted to know whether known RNA motifs annotated in SCOR can be recovered by the suite representation. SCOR is a database containing 15,945 structural, functional and tertiary interaction motifs that have been annotated manually [18]. A hierarchical classification inspired by the SCOP database [1] has been established, but the database has not been updated since 2004. Therefore, a reliable automatic recognition of motifs could be useful. Currently, no such procedure is available, with the circular motif library of the MC-Sym program probably coming closest [15]. For this analysis, all 4,501 structural and 100 tertiary interaction motifs from SCOR (version 2.0.4) were used. The corresponding fragments of PDB structures had lengths of 2-11 suites for structural motifs and 4-60 suites for tertiary interaction motifs. This set was termed "SCOR". Functional motifs annotate entire RNAs; they were excluded from this set and are considered in the later datasets.

2.1.2. tRNA dataset

Second, as a positive control, we were interested in proving that a set of structurally highly conserved RNAs can be recognized by the suite representation. For this, tRNA was chosen as one of the most conserved molecules in life. Although tRNA sequences started diverging even before the genetic code itself was fixed, and their structures are highly modified by post-transcriptional additions, all of them need a highly conserved tertiary structure in order to work in the translation machinery. Thus, it is not surprising that all example tRNAs from the PDB look the same from afar, and we expected them to have very similar backbone conformations when represented as suites. To examine whether this hypothesis holds, all tRNA structures from the NDB database [2] were retrieved. The resulting set consists of 102 tRNA structures from all kingdoms of life and is termed "TRNA".

2.1.3. RNADB05 and HIRES sets

Third, we wanted to check for similarities among RNAs of different origin. This was done for two sets of RNA structures. One was the dataset used by Richardson et al. (termed RNADB05) [17]. The RNADB05 set is a manually refined representative set of 173 RNA structures from both X-ray and NMR experiments. The second set (HIRES) consists of 74 high-resolution X-ray structures. They were filtered from the PDB by applying the constraints resolution ≤ 2.5 Å and R-value ≤ 0.25. Structures with identical sequences, and sequences with fewer than four bases, were discarded.


2.2. Calculation of RNA backbone string representation

For each structure in each of these datasets, one string using the suite representation and another based on the pseudotorsions were calculated. The same calculation is applied to structures that are queried against one of these datasets. The method to calculate suites from a structure was re-implemented according to the description in [17]. The seven torsion angles were calculated according to Figure 1 in 5' to 3' direction. They were then assigned to one or none of the 46 suite clusters. First, suites are grouped according to their δ, δ-1, and γ angles to limit the number of clusters to be considered. Second, the 7D distances to the hyperellipsoid of each cluster were calculated. If the suite was inside a hyperellipsoid, the name of that cluster was assigned to the suite. The extents of these hyperellipsoids vary depending on the cluster. In particular, some of the clusters partially overlap; in these cases the closest hyperellipsoid center was used.
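The cluster assignment described above can be sketched as follows. This is a minimal illustration and not the published procedure: it assumes axis-aligned hyperellipsoids with a plain Euclidean scaled distance, it omits the pre-grouping by δ, δ-1 and γ, and the cluster names, centers and half-widths in `TOY_CLUSTERS` are hypothetical placeholders rather than the parameters from [17].

```python
import math


def angle_diff(a, b):
    """Smallest signed difference a - b between two angles, in degrees."""
    return (a - b + 180.0) % 360.0 - 180.0


def scaled_distance(torsions, center, half_widths):
    """Distance to a cluster center with each axis scaled by its half-width.

    A value <= 1.0 means the point lies inside the (assumed axis-aligned)
    hyperellipsoid of that cluster.
    """
    return math.sqrt(sum((angle_diff(t, c) / w) ** 2
                         for t, c, w in zip(torsions, center, half_widths)))


def assign_suite(torsions, clusters, outlier_code="oo"):
    """Return the name of the cluster containing the suite, or the outlier code.

    If several hyperellipsoids contain the point, the cluster whose center is
    closest (unscaled angular distance) wins, as described in the text.
    """
    best_name, best_dist = None, float("inf")
    for name, (center, half_widths) in clusters.items():
        if scaled_distance(torsions, center, half_widths) <= 1.0:
            dist = math.sqrt(sum(angle_diff(t, c) ** 2
                                 for t, c in zip(torsions, center)))
            if dist < best_dist:
                best_name, best_dist = name, dist
    return best_name if best_name is not None else outlier_code


# Hypothetical toy clusters (NOT the published suite parameters):
# name -> (seven torsion angles of the center, seven half-widths), in degrees.
TOY_CLUSTERS = {
    "c1": ((80.0, -150.0, -70.0, -65.0, 175.0, 55.0, 80.0), (30.0,) * 7),
    "c2": ((85.0, -140.0, -60.0, -60.0, 170.0, 50.0, 85.0), (40.0,) * 7),
}
```

The two toy clusters overlap on purpose, so a point at the center of "c1" lies inside both hyperellipsoids and is resolved by the closest-center rule.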

Fig. 1. Definition of RNA suites. A suite stretches from one ribose unit to the next, involving seven dihedral angles along the RNA backbone. Note that the δ angle is used by two adjacent suites. In the suite encoding, the first three dihedral angles are represented by a number, the next four by a letter. The example is taken from the tRNA structure with PDB code 1b23.

Even though Richardson et al. recommend not calculating suites for residues with a high B-factor or with clashes, we decided to include them anyway. This was done for two reasons. First, we wanted a continuous string representation for all RNA structures. This is particularly important considering that 5-15% of the residues cannot be assigned to suites, so that on average only short fragments of structure would remain for calculation. Second, we wanted to assess the number of errors that occur in a real-life dataset. There were four kinds of errors: missing atoms in the residue (resulting in a '--' suite code), a single torsion angle outside the boundaries defined in [17] (a so-called triaged residue, resulting in a 'tt' suite code), an outlier suite that is not close to any cluster (resulting in an 'oo' suite code), and a close outlier that is inside a 4D hyperellipsoid but outside it in 7D space (resulting in a '!!' suite code).

The second way to translate the 3D structure of an RNA into a sequence of characters is to calculate the η-θ pseudotorsion angles from the backbone atoms of the same residues as the suites. For η, this is the C4'(i)-P(i+1)-C4'(i+1)-P(i+2) dihedral angle, and for θ the P(i)-C4'(i)-P(i+1)-C4'(i+1) dihedral angle. Each of these angles was divided into 36 ten-degree bins, and an alphanumeric character was assigned to each bin. Thus, a single η-θ tuple, conceptually corresponding to the RNA suite, was represented by two characters as well. Only when one of the atoms defining a dihedral was missing was a '--' code assigned in place of the η-θ tuple.
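The pseudotorsion encoding can be illustrated with a short sketch. The dihedral formula is the standard atan2 form; the bin alphabet (digits followed by lowercase letters) and the choice to start the first bin at 0 degrees are assumptions made here for illustration, since the text states only that each ten-degree bin receives an alphanumeric character.

```python
import math
import string

# Hypothetical 36-symbol alphabet, one symbol per ten-degree bin.
ALPHABET = string.digits + string.ascii_lowercase  # 10 + 26 = 36 symbols


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    norm_b2 = math.sqrt(_dot(b2, b2))
    y = _dot(b2, _cross(n1, n2)) / norm_b2
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def encode_angle(angle_deg):
    """Map an angle to the character of its ten-degree bin (bin 0 = [0, 10))."""
    return ALPHABET[int((angle_deg % 360.0) // 10) % 36]


def eta_theta_code(eta, theta):
    """Two-character code for one eta-theta tuple; '--' if an angle is missing."""
    if eta is None or theta is None:
        return "--"
    return encode_angle(eta) + encode_angle(theta)
```

For one residue, `eta` would be computed with `dihedral` from the C4'(i), P(i+1), C4'(i+1), P(i+2) coordinates and `theta` from P(i), C4'(i), P(i+1), C4'(i+1); concatenating the codes of all residues gives the string that is indexed.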

2.3. Suffix tree and array implementation

Our studies were performed using a suffix array. Even simple implementations of suffix trees can search for a given substring in O(m), with m being the length of the search string, but we used the slightly slower suffix array because of its smaller memory footprint. An algorithmic introduction to suffix trees and suffix arrays is given in [8]. The suffix array implementation we used can search in O(m log n), with m being the length of the search string and n the number of strings (suffixes) in the index. This performance is fast enough considering the absolute number of structures to index, even for all RNA structures in the PDB (currently 1500).

A suffix array works in principle as follows. To index a string s, each suffix of s, that is, each substring that starts at one of the positions of s and runs to its end, is put into an array. This array is then sorted alphabetically. Once the sorted array is established, a substring of s can be retrieved by binary search over the index, which fulfills the O(m log n) property.

A conceptual disadvantage of suffix techniques is that a substring search can only be performed in an exact manner. To overcome this disadvantage, we use n-grams to perform an inexact search and to score one input structure against a whole database. This similarity score (SCORE) is generated by searching all consecutive substrings of length n (n-grams) of the input string against the database:

\[ \mathrm{SCORE} = \frac{\text{number of matches found}}{\text{number of matches expected}} \tag{1} \]

This allows us to rank the best matching entries in the database, and also to generate an all-against-all ranking of the entries in one database. One drawback of this scoring scheme is that ubiquitous repeating substrings (like '1a1a1a1a') are found in nearly every entry in the database and therefore add a large bias to the calculation. To avoid this, substrings consisting of repeating elements are excluded from the search.
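The indexing and scoring scheme of this section can be sketched in a few lines. This is a naive illustration under stated assumptions, not the implementation used in the study: the suffix array is built by plain sorting, "matches expected" is taken to be the number of n-grams actually searched, an n-gram counts as repeating when all of its two-character units are identical, and n-grams are taken in steps of one two-character unit.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions alphabetically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text, sa, pattern):
    """Binary search over the sorted suffixes for one that starts with pattern."""
    lo, hi = 0, len(sa)
    k = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + k] == pattern


def ngrams(code, n_chars, step=2):
    """Consecutive substrings of n_chars characters, one unit (2 chars) apart."""
    return [code[i:i + n_chars]
            for i in range(0, len(code) - n_chars + 1, step)]


def is_repeat(gram):
    """True if every two-character unit of the n-gram is the same."""
    return len({gram[i:i + 2] for i in range(0, len(gram), 2)}) == 1


def score(query, db_text, sa, n_chars=8):
    """SCORE of Eq. (1): matches found divided by n-grams searched."""
    grams = [g for g in ngrams(query, n_chars) if not is_repeat(g)]
    if not grams:
        return 0.0
    found = sum(contains(db_text, sa, g) for g in grams)
    return found / len(grams)


# Hypothetical database entry and queries, written as suite strings.
db_text = "1a1a1c2g4d1a"
sa = build_suffix_array(db_text)
full_match = score("1a1c2g4d", db_text, sa, n_chars=4)
partial_match = score("1a1c7r9a", db_text, sa, n_chars=4)
```

To rank a query against a whole database, the same score would be computed per entry (one suffix array each, or one generalized index) and the entries sorted by score.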

Apart from the theoretical runtimes given in O notation, the practical runtime of the n-gram search with the current suffix array implementation is below 5 seconds for an all-against-all search of the RNADB05 set (257 entries) on a commodity PC (dual core 2.2 GHz, 3 GB RAM).

3. Results

In this analysis, we systematically looked for similar backbone conformations, and then checked whether they occur in RNAs that are annotated in a similar way. We calculated the suite strings and η-θ binning strings for 4,950 structures in all datasets. Table 1 shows the distribution of suite codes.

Table 1. Ratio of suite codes as they occur in the four datasets examined here. Each entry is the number of suites of a particular kind divided by the total number of suites (including outliers) in the corresponding dataset.

Suite  TRNA    SCOR    RNADB05 HIRES          Suite  TRNA    SCOR    RNADB05 HIRES
!!     0.0221  0.0167  0.0100  0.0094         &a     0.0188  0.0252  0.0170  0.0119
--     0.0005  0.0015  0.0094  0.0343         0a     0.0007  0.0047  0.0041  0.0034
0b     0.0007  0.0012  0.0020  0.0011         1L     0.0422  0.0252  0.0269  0.0201
1[     0.0077  0.0065  0.0078  0.0057         1a     0.4504  0.5760  0.5943  0.6015
1b     0.0110  0.0165  0.0202  0.0244         1c     0.0769  0.0426  0.0477  0.0471
1e     0.0045  0.0063  0.0049  0.0068         1f     0.0098  0.0058  0.0044  0.0023
1g     0.0226  0.0217  0.0127  [illegible]    1m     0.0314  0.0177  0.0111  0.0071
1t     0.0046  0.0019  0.0025  0.0031         1z     0.0001  0.0029  0.0023  0.0011
2[     0.0011  0.0057  0.0048  0.0026         2a     0.0117  0.0110  0.0109  0.0122
2g     0.0056  0.0007  0.0015  0.0017         2h     0.0017  0.0010  0.0019  0.0011
2o     0.0003  0.0007  0.0009  0.0020         2u     0.0016  0.0005  0.0009  0.0020
3a     0.0045  0.0084  0.0038  0.0020         3b     0.0004  0.0022  0.0022  0.0014
3d     0.0026  0.0056  0.0027  0.0011         4a     0.0003  0.0026  0.0020  0.0020
4b     0.0023  0.0028  0.0045  0.0048         4d     0.0042  0.0012  0.0017  0.0017
4n     0.0003  0.0013  0.0019  0.0017         4p     0.0017  0.0021  0.0019  0.0014
5d     0.0009  0.0029  0.0019  0.0017         5j     0.0007  0.0020  0.0016  0.0014
5n     0.0012  0.0023  0.0010  0.0009         5p     0.0007  0.0008  0.0011  0.0009
5q     0.0004  0.0006  0.0005  0.0003         6d     0.0057  0.0020  0.0030  0.0045
6g     0.0001  0.0032  0.0033  0.0028         6j     0.0003  0.0008  0.0008  0.0006
6p     0.0003  0.0052  0.0044  0.0043         7a     0.0078  0.0076  0.0043  0.0014
7d     0.0042  0.0046  0.0027  0.0011         7p     0.0020  0.0029  0.0028  0.0034
7r     0.0012  0.0023  0.0017  0.0000         8d     0.0005  0.0026  0.0020  0.0000
9a     0.0019  0.0052  0.0042  0.0051         oo     0.1179  0.0766  0.0723  0.0590
tt     0.1120  0.0352  0.0543  0.0709

As expected, the helical stem suite variants (1a, 1m, 1L, &a) are predominant. In the two representative datasets, the 1a suites account for up to 60% of all suites, and its three satellite clusters together contain another 5%. In SCOR these numbers are very similar, indicating that the 1a backbone conformation is apt to form many of the motifs annotated there (verified by visual inspection of the primary suite strings). In TRNA the fraction of 1a is lower (45%). This is a common feature of the tRNA fold, as the observation is the same for all tRNA suite strings. In turn, some of the other suites are more highly represented. In particular, 1L, 1c, 1m, 2g, 4d, 6d, and 1t seem to play an important structural role in tRNA.

The total fraction of all four kinds of invalid suites ('tt', 'oo', '!!', and '--') is 25.25% in the tRNA set, 12.00% in SCOR, and 14.60% and 17.36% in the RNADB05 and HIRES datasets, respectively. At first, the latter seems surprising, because one would expect fewer errors in high-resolution structures. The percentage is mainly caused by 3.4% of residues with missing atoms. The remaining 13.9% are caused by 'triaged' dihedral angles and by outlier suites for which no suitable cluster could be found. One interpretation is that these are unusual backbone conformations which are only visible at a better resolution; in low-resolution structures they probably get smoothed out by the refinement process. In SCOR, the number of invalid suites is much lower. It is clearly biased by the manual selection of motifs, which by definition must occur in well-defined regions.

In the tRNA set, the high error rate was examined in more detail. It appears that the three loop regions contain many conformations that do not fit in any cluster (resulting in several 'oo' or 'tt' suites in a row for some structures). This can be a result of strong constraints on the structure during refinement, or of interaction with other molecules. In the high-resolution tRNA entry with PDB id 1ehz, the rate of triaged and outlier suites is lower than in the RNADB05 and HIRES sets, and the clusters of outliers do not occur. It is unclear whether modified bases contribute to the problem, but in the examined high-resolution structures they caused no problem either. This observation indicates that the lower-resolution RNA structures are to be treated with caution.
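As a check on the percentages quoted above, the four invalid-code fractions from Table 1 can be summed per dataset. This worked arithmetic assumes that the '--' fraction is the second value listed for each dataset in Table 1 (0.0343 for HIRES, which matches the 3.4% of residues with missing atoms quoted in the text):

```latex
\begin{align*}
\text{HIRES:}   &\quad 0.0094 + 0.0343 + 0.0709 + 0.0590 = 0.1736 \;(17.36\%)\\
\text{RNADB05:} &\quad 0.0100 + 0.0094 + 0.0543 + 0.0723 = 0.1460 \;(14.60\%)\\
\text{TRNA:}    &\quad 0.0221 + 0.0005 + 0.1120 + 0.1179 = 0.2525 \;(25.25\%)
\end{align*}
```

The terms are, in order, the '!!', '--', 'tt' and 'oo' fractions. For HIRES, removing the '--' term leaves 0.1736 - 0.0343 = 0.1393, which corresponds to the remaining 13.9%.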

3.1. Analysis of SCOR motifs

The 4,601 motifs from SCOR were divided into a 20% training set and an 80% test set. The training motifs were stored in the suffix tree, and the test motifs were searched in it by all their subsequences of 12 characters. One would assume that, for example, loops of a given type have similar backbone conformations. Therefore we wanted to know which motifs can be identified this way, and whether they are distinct from other motifs. We counted how many motifs from the test set could be correctly identified based on matches of their suite strings. Figure 2 gives the sensitivity and specificity of this analysis for each motif class separately. It turns out that the predictability of the SCOR motifs is low. While the specificity is above 0.6 for almost all classes examined, and at 1.0 for many of them, the sensitivity covers almost the entire range from zero to one. The reason is a high number of false negatives in each class.

To find out where these come from, the suite strings of several classes were inspected in more detail. The '180 degree turn' class consists of 24 motifs. 17 of them are just two suites (three residues) long, all having the suite string '4b6p'. The remaining 7 contain five suites, which are small variations of '1a3a1g9a1a'. These two groups fully correspond to two homologous positions in different structures of the 23S rRNA (1874-1876 for the first, and 1789-1794 for the second). A similar effect can be observed for many other motifs like '3 non-WC base pair', 'About 90 Degree Turn With All Bases Simply Stacked', and 'Multiple Twist'. In other cases, like the 'Ustk stack swap' motif, even more variations can be found.

On the positive side, it has to be noted that the homologous motifs can be recognized well from as few as 2-4 suites, and that their structures are conserved. As stated above, the manual selection of motifs probably facilitates this. No examples were found where two non-homologous motifs belonging to the same class could be identified on the basis of their suites alone. One of the reasons for this observation is that the rules by which SCOR motifs have been annotated are based on individual decisions made by experts. It appears that the base pairing and secondary structure scheme that is specific for a particular motif class does not impose a constraint on the backbone that is strong enough to allow a prediction. On the other hand, this implies that an independent set of frequently occurring conformations that has not yet been described could exist in the RNA backbone.

Fig. 2. SCOR motifs recognized by substring matching. The entire set of SCOR motifs was divided into a 20% training set and an 80% test set. The number of correctly matched substrings of length 12 (or the entire motif, if it was shorter), the number of matches from different SCOR motifs, and the total number of motif pairs compared were used to calculate the sensitivity and specificity of the search.
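The two quantities plotted per motif class follow the usual definitions, sketched below. The counts in the example are hypothetical and serve only to show how a class that splits into two backbone groups can lose sensitivity while keeping high specificity.

```python
def sensitivity(tp, fn):
    """Fraction of true class members that were matched: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of non-members that were correctly not matched: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# Hypothetical counts for one motif class: 17 of 24 members matched,
# and 10 of 100 motifs from other classes matched by mistake.
example_sensitivity = sensitivity(tp=17, fn=7)
example_specificity = specificity(tn=90, fp=10)
```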

3.2. Similarities among tRNA

Next, a set of 102 tRNA structures with well-defined backbone structures was examined. Because all tRNAs have a highly conserved tertiary structure, one would expect this to be reflected in the suite strings as well. In the TRNA dataset, several suites are over-represented compared to the RNADB05 and HIRES sets (particularly '6d', '2g', '7d', '1f', '1c' and '11'). These


R. A. Bauer et al.

can be found in corresponding positions of most tRNAs. We have locally aligned a couple of D-loops from tRNA structures with the corresponding suite strings in Figure 3. While each backbone follows the loop along the same path, there are several small differences in the suite codes. These include local variants, often replacing one suite by one that is close in the 7D dihedral space (e.g. the '1a'-'1L' and '1m'-'1[' exchanges). The structures are also occasionally interrupted by outlier suites. These outliers are visible, but hardly distinguishable in the visualization. They do not alter the direction of the backbone and by no means disrupt the loop structure. Rather, it seems that many of them result from improper refinement or low structure quality, as high-resolution structures such as PDB codes 1ehz and 1b23 are less affected. One important conclusion is that the suite codes are a very detailed description of the tRNA backbone structure. They are apparently not suited to describing a well-defined structure such as the D-loop in a general and unambiguous way: for the same loop trace, many combinations of suites are possible.

[Figure 3 image: superimposed D-loop backbones with residue labels 48, 49, 50, and 51.]

Fig. 3. The backbone of the dihydrouridine loops from the tRNA structures with PDB codes 1b23, 1efwC, 1gts, 1qf6, and 1qrs, superimposed by their backbone atoms. The labels indicate the residue numbers. The suite codes of the dihydrouridine loops are given in the table on the right. Outlier suites are underlined; valid but singleton suite codes at a given position are highlighted in bold.

Another observation is that up to half of the D-loop suites are of the '1a' type, which was described in [17] as the conformer forming 'A-form helices'. The D-loop contains a noncanonical base pair between residues 54 and 58, and two adjacent GC base pairs (53-61 and 52-62). Apart from that, many of the bases are involved in tertiary stacking (57, 58) and base pairing (59, 60) interactions. In total, the D-loop stem is more than a simple helix, showing that the abundant 1a suite can accommodate different structural roles. Although we did not attempt to align all structures explicitly, this seems feasible given these observations, and can be expected to result in a consensus alignment


of suites. A more detailed analysis could be used to identify individual conformations of tRNA at a high level of detail. An all-against-all search of subsequences of all tRNA suite strings was performed using the suffix array and the n-gram algorithm, as described in section 2.3. In Table 2, the numbers of hits found for different word lengths are given.

Table 2. Results of the all-against-all search in the TRNA, RNADB05, and HIRES datasets using the n-gram approach. The column "number of hits" indicates how many exactly matching n-grams were found for the given word length. "score" gives the average score for these hits. The score is calculated as the sum of the inverse frequencies from Table 1 for the matching n-gram.

n-gram    TRNA                RNADB05             HIRES
length    number    score     number    score     number    score
          of hits             of hits             of hits
4         6824      5.4       27978     6.1       10543     3.7
6         6732      13.0      22543     10.6      10111     7.1
8         6386      19.1      17674     14.9      8917      10.9
10        5381      24.5      13657     16.6      6497      16.5
12        3812      38.1      10436     20.4      4823      20.8
14        2817      60.9      6504      30.7      3321      34.3
16        1990      96.0      3554      45.2      2683      46.8
18        1542      140.3     2376      62.0      2001      59.2
20        1306      175.4     1443      86.7      1283      86.2
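To make the scoring behind Table 2 concrete, the following sketch runs an all-against-all n-gram comparison on toy suite strings. It rests on several assumptions that are not taken from the paper: a dictionary index replaces the suffix array, the n-gram length is counted in suites rather than characters, the inverse frequency of a suite is taken as total suite count divided by its own count (Table 1 is not reproduced here), and each shared n-gram is counted once per pair of structures. All names are hypothetical.

```python
from collections import Counter, defaultdict


def suites(suite_string):
    """Split a suite string into its two-character suite codes."""
    return [suite_string[i:i + 2] for i in range(0, len(suite_string), 2)]


def inverse_frequencies(structures):
    """Assumed definition: total number of suites / count of each suite."""
    counts = Counter(code for s in structures.values() for code in suites(s))
    total = sum(counts.values())
    return {code: total / n for code, n in counts.items()}


def ngram_index(structures, n):
    """Map every n-gram (tuple of n suites) to the structures containing it."""
    index = defaultdict(set)
    for name, s in structures.items():
        codes = suites(s)
        for i in range(len(codes) - n + 1):
            index[tuple(codes[i:i + n])].add(name)
    return index


def all_against_all(structures, n):
    """Scores of shared n-grams for every ordered pair of structures."""
    inv = inverse_frequencies(structures)
    hits = defaultdict(list)
    for gram, names in ngram_index(structures, n).items():
        score = sum(inv[code] for code in gram)  # rare suites score higher
        for query in names:
            for target in names:
                if query != target:
                    hits[(query, target)].append(score)
    return hits


# Toy example with hypothetical structure names.
structures = {"A": "1a1a6d2g", "B": "1a6d2g1a", "C": "1a1a1a1a"}
hits = all_against_all(structures, n=2)
for pair in sorted(hits):
    print(pair, sorted(hits[pair]))
```

In this toy example the pair A/B shares two n-grams containing the rarer suites '6d' and '2g' and therefore scores higher than the pair A/C, which shares only the '1a1a' repeat. This is the effect the scoring function is meant to have on A-form helical stems.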

The tRNA dataset is heterogeneous enough that, on average, only 69 other structures contain a sufficient number of matching n-grams. For the structures that are found, however, the number of words within one hit is high. With increasing word length, the number of hit structures decreases continuously. This is expected, because it becomes increasingly difficult to find a longer word in the set of suite strings: each of the occasional variations disrupts the search for a local match. The number of words found within a structure drops correspondingly at first, but starts to rise again at a word length of 16 (data not shown). This observation can be explained by the fact that these hits occur only in a few, highly similar tRNA structures with little or no variation. We therefore conclude that a word size of 12 or 14 is optimal for finding similarities within the set with as little background noise as possible, while not restricting the search to almost-identical structures. The outcome of the all-against-all search is visualized in Figure 4 (TRNA on the left), where the normalized number of word hits for a given pair of structures is plotted. This indicates that an overall level of similarity exists between most pairs of tRNAs. The bright spots result from a small group of highly similar tRNA structures (the ones still remaining at word size 20). The dark regions (the lines at 31, and several between 56 and 68) are structures with very low similarity. The structures in these regions (among others, PDB codes 1y14, 2ow8, 2v46, and 3tra) were examined more closely. It turned out that these contain a much higher proportion


(up to 40%) of outlier and erroneous suite codes. Three of the examples here are structures of tRNAs bound to ribosomes, with resolutions of 3.7 Å or worse. The fourth (PDB code 3tra) is not bound to a ribosome, but it has also been determined at an inferior resolution. This clearly shows that the suite nomenclature is of very limited use for structures that are not of high resolution.

Fig. 4. Scores of the all-against-all search in the a) TRNA (left), b) RNADB05 (middle), and c) HIRES (right) datasets. On each axis, the structures are sorted according to their PDB code. The color indicates the score found for a particular structure-structure pair. The scaling was chosen such that dark areas correspond to repeating '1a' matches. The higher the score, the more uncommon suites a particular hit contains. The results shown here are for n-grams of length 12.

3.3. Similarities in the representative RNA sets

To assess whether these observations are meaningful, we applied the same comparison to both the 107 high-resolution structures (HIRES) and the 254 structures from the RNADB05 set. The number of hits found is given in Table 2, and the corresponding similarity maps are depicted in Figure 4. First, some of the suite strings in the datasets were too short to match anything (empty rows/columns and an interrupted diagonal in the heat map). Also, both the HIRES and RNADB05 datasets contained a number of sequences with trivial structures, consisting of little more than '1a' repeats. The scoring also depends on the length of the query string, and therefore the matrices are not necessarily symmetric. In Figure 4, it is clearly visible that the overall number of structures in RNADB05 and HIRES with detected similarities drops more sharply than in the TRNA set. The total number of hits changes in the same way. Even though the RNADB05 set is larger, only a few hit structures remain there at word size 20 (see also Table 2). One reason is that the average structure size in both reference datasets is smaller, as they contain many hairpin loops and other short RNAs. In both reference sets, the number of A-form helical stems (repeating regions consisting of '1a' suites) is higher, and they are practically excluded from the evaluation by the scoring function. This leaves only a fraction of hits in the reference sets compared to the tRNA set. In tRNA, not only does a higher number of hits exist, but the hits are also less random because they consist of less frequently occurring suites. This shows that the similarity among tRNAs is non-random, which can be taken as a proof of concept for the method. One structure in the RNADB05 set (rr0082H09, the 23S subunit of the ribosome) was matched by almost every other structure in this database. The structural variety in this single structure easily matches that of the remaining dataset taken together, and any motif found elsewhere is probably found there as well (see the white vertical line in the RNADB05 panel of Figure 4). Interestingly, when searching for a set of local RNA structures other than helical stems with either of the methods, we find non-homologous hits. This works for: a) an internal loop of the SRP and the ribosomal SSU, b) a biotin-binding pseudoknot and the tRNA, and c) a tRNA and the E-loop from 5S RNA.

4. Discussion

Geometrically, the suite representation does not cover variations that could occur in the bond lengths and flat angles of the RNA backbone. While bond lengths have a very narrow distribution throughout all structure files, bond angles show significant variation. This leaves a degree of freedom that makes it impossible to rebuild RNA structures from a string, even if the suite nomenclature determined the dihedrals with perfect precision. There are two obvious ways to resolve this: (1) encode the flat angles in a similar way as the suites; (2) encode base-base interactions in the string in order to constrain the structure, and subsequently use a 3D modeling procedure. We believe that the second method is more promising, because it would include those interactions that shape the function of RNA instead of restricting the structure of RNA to the backbone alone. Such a reconstruction of structures from a descriptive grammar (not string-based) has already been demonstrated in [15]. Another implication of this approach is that an RNA region without further constraints may be structurally flexible. The second approach would therefore indirectly encode flexibility. A rapid method for string-based motif recognition has a number of potential applications. First, it could be used to systematically find frequently occurring backbone motifs in RNA structures, as demonstrated here. Further, it can be used to sample large numbers of backbone conformations in order to generate native-like RNA backbones that could subsequently be modeled. Finally, it allows on-the-fly evaluation of RNA models generated during manual structure modeling or automatic refinement. The combination of this technique with more


elaborate string representations would bring further improvement. We therefore think it is possible to accurately re-model the structure of RNA from a string representation by including additional structural features like base pairs, base stacking, or even tertiary interactions, with energy minimization instead of extensive probing of the local conformational space. The η-θ binning approach was shown to produce too many different local conformations for effective substring matching. One could argue that the matching could be improved by decreasing the number of bins. However, it has been shown earlier that the pseudotorsion angles contain specific regions that are characteristic of some structural motifs [6]. Decreasing the number of bins would ignore these and therefore be hopelessly inaccurate. Therefore, either explicit clusters in the pseudotorsion space would have to be defined, or string matching techniques allowing for more inexact matches than the current suffix array would be necessary. We emphasize that a fuzzier search method could improve the usefulness of the suite codes as well. In particular, it could eliminate the adverse effects of the occasionally occurring erroneous or undefined suites. In practice, this could be implemented as a classical similarity matrix between the suite codes; initially, its values could simply be based on a normalized 7D distance between the 46 suite clusters. Given the performance of the suffix array, the analysis presented here could easily be extended to the entire NDB [2]. Identifying structures that should be expected to be similar (e.g. based on their function) is more challenging if one does not want to rely on sequence similarity alone.

5. Conclusions

This work presented the first approach that uses an indexing technique to scan the structural space of RNA. The indexing was implemented using suite codes and an η-θ binning approach and tested on four distinct datasets. We showed that this approach can be used to rapidly identify similar substructures. This has applications not only for querying the RNA structure space but also for the modeling of RNAs, by rapidly predicting possible conformations and, in turn, evaluating proposed RNA models on the fly with regard to structural and functional similarities. All datasets as well as the source code are freely available from http://suiterna.sourceforge.net. We hope this will be useful for the community and look forward to receiving feedback.

Acknowledgements

This effort is supported by DFG SFB-449, Deutsche Krebshilfe, the DFG (Deutsche Forschungsgemeinschaft) International Research Training Group (IRTG) on "Genomics and Systems Biology of Molecular Networks" (GRK1360), and the 6th Marie Curie EU Research Training Network "DNA Enzymes", grant no. MRTN-CT-2005-019566. Without the use of free and/or open source software this effort would not


have been possible.

References

[1] Andreeva, A., Howorth, D., Chandonia, J. M., Brenner, S. E., Hubbard, T. J., Chothia, C., and Murzin, A. G., Data growth and its impact on the SCOP database: new developments. Nucleic Acids Research, 36(Database issue):419-425, January 2008.
[2] Berman, H. M., Westbrook, J., Feng, Z., Iype, L., Schneider, B., and Zardecki, C., The Nucleic Acid Database. Acta Crystallographica Section D, 58(6 Part 1):889-898, June 2002.
[3] Chang, Y. F., Huang, Y. L., and Lu, C. L., SARSA: a web tool for structural alignment of RNA using a structural alphabet. Nucleic Acids Research, 36(Web Server issue):19-24, May 2008.
[4] Dowell, R. D. and Eddy, S. R., Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 7:400, September 2006.
[5] Dror, O., Nussinov, R., and Wolfson, H. J., The ARTS web server for aligning RNA tertiary structures. Nucleic Acids Research, 34(Web Server issue), July 2006.
[6] Duarte, C. M. and Pyle, A. M., Stepping through an RNA structure: a novel approach to conformational analysis. Journal of Molecular Biology, 284(5):1465-1478, December 1998.
[7] Giegerich, R. and Kurtz, S., From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction. Algorithmica, 19(3):331-353, November 1997.
[8] Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, January 1997.
[9] Hofacker, I. L., Vienna RNA secondary structure server. Nucleic Acids Research, 31(13):3429-3431, July 2003.
[10] Hofacker, I. L., Bernhart, S. H., and Stadler, P. F., Alignment of RNA base pairing probability matrices. Bioinformatics, 20(14):2222-2227, September 2004.
[11] Leontis, N. B., Altman, R. B., Berman, H. M., Brenner, S. E., Brown, J. W., Engelke, D. R., Harvey, S. C., Holbrook, S. R., Jossinet, F., Lewis, S. E., Major, F., Mathews, D. H., Richardson, J. S., Williamson, J. R., and Westhof, E., The RNA Ontology Consortium: an open invitation to the RNA community. RNA, 12(4):533-541, April 2006.
[12] Lescoute, A., Leontis, N. B., Massire, C., and Westhof, E., Recurrent structural RNA motifs, Isostericity Matrices and sequence alignments. Nucleic Acids Research, 33(8):2395-2409, 2005.
[13] Lescoute, A. and Westhof, E., The interaction networks of structured RNAs. Nucleic Acids Research, 34(22):6587-6604, December 2006.
[14] Murray, L. J. W., Richardson, J. S., Arendall III, W. B., and Richardson, D. C., RNA backbone rotamers: finding your way in seven dimensions. Biochemical Society Transactions, pages 485-487, 2005.
[15] Parisien, M. and Major, F., The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature, 452(7183):51-55, 2008.
[16] Reeder, J., Höchsmann, M., Rehmsmeier, M., Voss, B., and Giegerich, R., Beyond Mfold: Recent advances in RNA bioinformatics. J Biotechnol, March 2006.
[17] Richardson, J. S., Schneider, B., Murray, L. W., Kapral, G. J., Immormino, R. M., Headd, J. J., Richardson, D. C., Ham, D., Hershkovits, E., Williams, L. D., Keating, K. S., Pyle, A. M., Micallef, D., Westbrook, J., and Berman, H. M., RNA backbone: consensus all-angle conformers and modular string nomenclature (an RNA Ontology Consortium contribution). RNA, 14(3):465-481, March 2008.
[18] Tamura, M., Hendrix, D. K., Klosterman, P. S., Schimmelman, N. R., Brenner, S. E., and Holbrook, S. R., SCOR: Structural Classification of RNA, version 2.0. Nucleic Acids Res, 32(Database issue), January 2004.
[19] Yang, H., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H., and Westhof, E., Tools for the automatic identification and classification of RNA base pairs. Nucl. Acids Res., 31(13):3450-3460, July 2003.

THE RELATIONSHIP BETWEEN FINE SCALE DNA STRUCTURE, GC CONTENT, AND FUNCTIONAL ELEMENTS IN 1% OF THE HUMAN GENOME

STEPHEN C. J. PARKER 1

ELLIOTT H. MARGULIES 2

THOMAS D. TULLIUS 1, 3

1 Graduate Program in Bioinformatics, Boston University, Boston MA 02215, U.S.A.
2 National Human Genome Research Institute, National Institutes of Health, Bethesda MD 20892, U.S.A.
3 Department of Chemistry, Boston University, Boston MA 02215, U.S.A.

GC content has been shown to be an important aspect of human genomic function. Extending beyond the scope of GC content alone, there is a class of regions in the genome, called CpG islands, that have especially high GC content and are enriched for the CG dinucleotide. CpG islands have been linked to biologically functional genomic elements. DNA structure also contributes to biological function. Recent studies found that some DNA structural properties are correlated with CpG island functionality [5, 14]. Here, we use hydroxyl radical cleavage patterns as a measure of DNA structure to explore the relationship between GC content and fine-scale DNA structure. We show that there is a positive correlation between GC content and the solvent-accessible structural properties of a DNA sequence, and that the strength of this correlation decreases as genomic resolution increases. We demonstrate that regions of the genome that have highly solvent-accessible DNA structure tend to overlap functional genomic elements. Our results suggest that fine-scale DNA structural properties that are encoded in the genome are important for biological function, and that the highly solvent-accessible nature of high GC content regions and some CpG islands may account for some of their functional properties.

Keywords: DNA structure; GC content; CpG islands; hydroxyl radical cleavage; functional element; human genome

1. Introduction

GC content, the fraction of G or C nucleotides within a given window, is variable across the human genome [17, 36]. This observed heterogeneity in sequence composition has been implicated as a marker for some functional genomic regions. One example is CpG islands, which are regions of the genome characterized by high GC content and enrichment of the CG dinucleotide [11]. CpG islands have been linked to many regulatory processes [7, 18, 24, 33, 37-39]. Beyond the primary order of nucleotides in a genome that is used to define GC content and CpG islands, the local structural profile of DNA has been implicated in a number of biological processes. Recent studies suggest that DNA structure is important for some of the same processes as CpG islands: namely DNA-protein interactions [20], promoter function [1, 29], epigenetically controlled gene regulation [4, 23, 32, 34, 40],


and DNase I hypersensitivity [14]. However, the precise relationship between GC content, fine-scale DNA structure, and genome function remains unclear. A critical first step in assessing this relationship is the ability to predict the local DNA structural profile for genomic sequences. Hydroxyl radical cleavage patterns of DNA have been used to study structural properties for a wide variety of sequences [13, 19, 30]. The cleavage pattern of naked DNA is a reflection of an important structural parameter, the solvent-accessible surface area of the DNA backbone [2]. The cleavage pattern thus provides a high-resolution quantitative measure of the shape of the DNA backbone and how it varies with respect to its sequence. We have recently shown that, using a database of experimentally determined hydroxyl radical cleavage patterns, the cleavage pattern of any DNA sequence can be predicted with a high degree of accuracy [13]. Although GC content has recently been implicated in defining hydroxyl radical cleavage patterns of DNA [35], this analysis was conducted at a relatively low genomic resolution of 333 base pairs. Single-nucleotide, genome-scale DNA structure predictions are feasible [13], which makes it possible to explore the relationship between GC content and fine-scale DNA structure. Since different DNA sequences can have similar local structural properties [10, 13], directly correlating GC content with DNA structure is an important experiment. Results from the ENCODE Pilot Project provide a rich resource for functional annotations in 1% of the human genome [3]. These developments facilitate the investigation of the relationship between GC content, DNA structure, and functional elements in this 1% of the human genome. Here, we compare GC content to DNA structure (measured as hydroxyl radical cleavage patterns) at various genomic resolutions, with an emphasis on fine-scale DNA structure. We then measure the occurrence of significantly over-represented DNA structural motifs with known functional annotations. Our results show that GC content only weakly influences fine-scale DNA structure, and that local structural properties may be important in conferring biological functionality to genomic regions like CpG islands.

2. Materials and Methods

2.1. DNA sequence and functional annotation data sources

The DNA sequence for the NCBI build 36 (March 2006), hg18 version of the ENCODE regions within the human genome was downloaded from the UCSC genome browser (http://genome.ucsc.edu/ENCODE/) [21, 22]. We used the following functional annotations for comparisons with DNA sequence and structural features. All the annotations are available through the UCSC genome browser (see above), unless otherwise noted. For all analyses, the hg18 version of each annotation track was used.




DNase I hypersensitive sites (DHSs) represent regions of open chromatin architecture where protein-DNA interactions occur. We used a union set of DHSs derived from the human GM06990 cell line, as described in [3, 14].

Formaldehyde Assisted Isolation of Regulatory Elements (FAIRE) is an alternative method used to locate regions of open chromatin. FAIRE sites are enriched for regulatory elements [12].

Promoters were defined as the region 2.5 kilobases upstream from gene start sites. We used the GENCODE [16] gene track to define genes.

Ancestral Repeats (ARs) are mobile elements that inserted before the common ancestor of most mammals. They are thought to be neutrally evolving and are therefore typically used to represent nonfunctional regions of the human genome [9, 15, 28, 31, 41]. We used the AR regions defined in [3].

CpG islands are regions of the human genome with high GC content and higher-than-expected CG dinucleotide density. We used the CpG islands track from the UCSC genome browser, which was constructed using the CpG island definition described in [11].

Evolutionarily constrained regions are areas of the human genome that are under purifying selection against nucleotide changes. For this analysis we used the 'moderate track', a summary of regions identified by multiple sequence alignment and constraint detection algorithms, described in [3, 25].

Transcription start sites used here are described in [3, 8].

As a control, we constructed a 'random annotation' by randomly selecting 500 base pair intervals within the ENCODE regions. We repeated this process 1000 times to create the random annotation track used here. Since this annotation set was derived randomly, there should be no association with any given set of functional elements.
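The construction of the random control can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: regions are given as hypothetical (chromosome, start, end) tuples with half-open coordinates, every admissible start position is equally likely, intervals are allowed to overlap one another, and the function name is invented for this example.

```python
import random


def random_annotation(regions, n_intervals=1000, length=500, seed=None):
    """Draw fixed-length intervals uniformly from within the given regions.

    regions: list of (chrom, start, end) with half-open coordinates.
    Regions shorter than `length` cannot hold an interval and are skipped.
    """
    rng = random.Random(seed)
    eligible = [(c, s, e) for c, s, e in regions if e - s >= length]
    if not eligible:
        raise ValueError("no region is long enough for the requested length")
    # Number of admissible start positions per region, used as sampling weight
    # so that every start position in the whole set is equally likely.
    weights = [e - s - length + 1 for _, s, e in eligible]
    intervals = []
    for _ in range(n_intervals):
        chrom, start, end = rng.choices(eligible, weights=weights, k=1)[0]
        begin = rng.randint(start, end - length)  # inclusive bounds
        intervals.append((chrom, begin, begin + length))
    return intervals


# Toy regions; the second one is too short to hold a 500 bp interval.
regions = [("chr1", 0, 10000), ("chr2", 500, 900)]
annotation = random_annotation(regions, seed=1)
print(len(annotation), annotation[:3])
```

Weighting regions by their number of admissible start positions avoids over-sampling short regions; whether the original control was built this way is not stated in the text.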

2.2. Local DNA structure prediction and GC content analysis

We used predicted hydroxyl radical cleavage patterns as a measure of local DNA structure. Hydroxyl radical cleavage patterns were predicted using the Sliding Tetramer Window algorithm described in [13] for all the ENCODE regions. After the cleavage intensity at each base was predicted, we averaged the cleavage values within a window for all possible windows within the ENCODE regions. For GC content analysis we calculated the fraction of G or C bases within all possible windows of various sizes within the ENCODE regions. To calculate CpG density we counted the observed number of CG dinucleotides within the same windows.
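The windowed quantities described above can be computed with prefix sums, as in the sketch below. The function names are hypothetical, the cleavage values are assumed to be supplied by an external predictor (the Sliding Tetramer Window algorithm of [13] is not reimplemented here), and a CG dinucleotide is counted for a window only when both of its bases lie inside that window.

```python
from itertools import accumulate


def window_sums(values, size):
    """Sum of `values` over every window of `size` consecutive positions."""
    prefix = [0] + list(accumulate(values))
    return [prefix[i + size] - prefix[i] for i in range(len(values) - size + 1)]


def gc_content(seq, size):
    """Fraction of G or C bases in every window of `size` bases."""
    flags = [1 if base in "GC" else 0 for base in seq.upper()]
    return [total / size for total in window_sums(flags, size)]


def cpg_counts(seq, size):
    """Number of CG dinucleotides fully contained in every window."""
    if size < 2:
        raise ValueError("a window must span at least two bases")
    seq = seq.upper()
    # flags[i] is 1 when a CG dinucleotide starts at position i.
    flags = [1 if seq[i:i + 2] == "CG" else 0 for i in range(len(seq) - 1)]
    # A window of `size` bases contains size - 1 dinucleotide start positions.
    return window_sums(flags, size - 1)


def mean_cleavage(cleavage, size):
    """Mean predicted cleavage intensity in every window of `size` bases."""
    return [total / size for total in window_sums(cleavage, size)]


seq = "ACGCGT"
print(gc_content(seq, 4))   # windows ACGC, CGCG, GCGT
print(cpg_counts(seq, 4))
print(mean_cleavage([1.0, 2.0, 3.0, 4.0], 2))
```

All three functions return one value per window start position, so their outputs for the same window size can be paired directly when correlating GC content or CpG density with mean cleavage.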


2.3. Annotation proximity and overlap statistics

To calculate the proximity of various windows to functional annotations, we computed the distance, in base pairs, from the closest base in a given window to the closest base of the nearest element in the specified annotation. To calculate the observed overlap statistics between different annotations (for example, comparing the regions in annotation X to the regions in annotation Y), we first computed the fraction of regions in annotation X that overlap any region from annotation Y. We then constructed a null distribution of the fraction of expected overlaps by using the block bootstrap method described in [3]. We calculated the mean and standard deviation of the null distribution to assess the statistical significance of the observed overlap. This allowed us to determine whether the regions in annotation X overlap the regions in annotation Y significantly more or less than expected at random.

3. Results

3.1. Correlation between GC content and local DNA structure

Given the data reported in [35], which show a high correlation between GC content and mean hydroxyl radical cleavage patterns at a window size of 333 base pairs, we first sought to reproduce and supplement these results. We computed the Pearson correlation between GC content and mean hydroxyl radical cleavage for windows of size N, where N = {2, 3, 4, 5, 10, 20, 50, 100, 333, 500, 1000, 10000}, in the ENCODE regions. We observe a positive relationship between the size of a window and the strength of the correlation between GC content and hydroxyl radical cleavage (Figure 1A). That is, while large windows have a high correlation between GC content and mean hydroxyl radical cleavage, small windows, which reflect the fine-scale structure of DNA, do not. To determine whether this result is unique to the DNA in the ENCODE regions, we randomized all of the ENCODE sequences. We used a first-order Markov model trained on the real ENCODE sequences to preserve all dinucleotide frequencies. The random sequences follow the same correlation trend as the real ENCODE sequences (data not shown), which suggests that the observed correlations are an inherent property of DNA and not an artifact of the ENCODE sequences chosen for this analysis. We next focused on the relationship between CpG density and mean hydroxyl radical cleavage over windows of size N (Figure 1B). For equivalent values of N, the strength of the correlation between DNA structure and CpG density is less than for GC content (compare Figure 1B to Figure 1A).
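The randomization step can be sketched with a first-order Markov chain over nucleotides. The code below is a minimal illustration and not the authors' implementation: the function names are hypothetical, the first base is drawn from the overall base composition, and a base that never has a successor in the training data falls back to that composition. A chain sampled this way reproduces the training dinucleotide frequencies in expectation rather than exactly.

```python
import random
from collections import Counter, defaultdict


def train_first_order(sequences):
    """Count single bases and base-to-next-base transitions."""
    base_counts = Counter()
    transitions = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        base_counts.update(seq)
        for current, following in zip(seq, seq[1:]):
            transitions[current][following] += 1
    return base_counts, transitions


def sample_sequence(base_counts, transitions, length, rng):
    """Generate one random sequence from the trained chain."""
    if length <= 0:
        return ""

    def draw(counter):
        symbols = list(counter)
        weights = [counter[s] for s in symbols]
        return rng.choices(symbols, weights=weights, k=1)[0]

    out = [draw(base_counts)]
    while len(out) < length:
        followers = transitions.get(out[-1])
        # Fall back to base composition if this base was never followed
        # by anything in the training sequences.
        out.append(draw(followers if followers else base_counts))
    return "".join(out)


# Toy training data: strictly alternating A and C.
base_counts, transitions = train_first_order(["ACACACAC"])
shuffled = sample_sequence(base_counts, transitions, 20, random.Random(0))
print(shuffled)
```

Because the toy training sequence contains only the dinucleotides AC and CA, every generated sequence alternates between A and C. Real sequences would be randomized the same way and then passed through the same windowed correlation analysis as the original data.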

[Figure 1, panels A and B. x-axis: Window size (bases); y-axis ticks from 0 to 1.]

Acceptors [nJ

Fig. 4. Distribution of the amounts of hydrogen bond acceptors of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = slightly toxic, 6-9 = medium toxic, >9 = highly toxic).

Toxicity versus Potency

[Figure 5 plot: H-Bond Donors. Series: TC 3-6, TC 6-9, TC >9, NC, Drugs. x-axis: Donors [n], from 0 to >20. The inset x-axis runs from 1 to 11.]

Fig. 5. Distribution of the amounts of hydrogen bond donors of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = slightly toxic, 6-9 = medium toxic, >9 = highly toxic). The small diagram shows a detailed distribution of the amounts of hydrogen bond donors for the group of medium toxicity (-log (LC50): 6-9).

To analyze this supposition, the numbers of hydrogen bond donors and acceptors were compared between toxic compounds, natural compounds, and drugs (Figures 4 and 5). The natural compounds, the slightly and medium toxic compounds, and the drugs have very similar numbers of hydrogen bond acceptors and donors, ranging between three and six hydrogen bond acceptors and between zero and two hydrogen bond donors. The lowest number of hydrogen bond acceptors was found in drugs, as they are chemically designed to fulfill Lipinski's rule of five [15]. According to this rule, they should comprise no more than 10 hydrogen bond acceptors in order to have adequate ADME properties [16]. In contrast, the group of highly toxic compounds shows both more hydrogen bond donors and more acceptors. Across the groups of slightly, medium, and highly toxic compounds, the number of hydrogen bond acceptors and donors rises. This was confirmed by a more detailed investigation of the medium toxic compounds, which show the same trend for the hydrogen bond acceptors (data not shown) and donors (Figure 5, small graph). For the hydrogen bond acceptors, the same ordering of compound groups is found as for the molecular weight: the drugs feature the fewest hydrogen bond acceptors, followed by the slightly toxic compounds, natural compounds, and medium toxic compounds, with the highly toxic compounds having the most. The same order occurs for the hydrogen bond donors, except that the natural compounds show the fewest hydrogen bond donors and the drugs follow the slightly toxic compounds. Thus, the assumption was confirmed that the more toxic a compound is, the more hydrogen bond donors and acceptors can be found in its structure.

S. Struck et al.

3.2. Functional properties

The distribution of functional groups in toxic compounds, drugs, and natural compounds was analyzed and is shown for selected groups in Figure 6. The occurrence of functional groups rises with increasing toxicity, whereas the natural compounds and the drugs exhibit frequencies within the range of those of the toxic compounds. The highly toxic compounds differ significantly from the other compounds in the numbers of alcohol and sugar groups. The more hydroxyl groups a molecule contains, the more hydrogen bond donors are available and the higher the reactivity. Sugar molecules have many chiral centers and are therefore characterized by high stereoselectivity. Given the huge number of different sugar molecules, there is a vast number of possible combinations, resulting in high specificity in the binding affinity to their targets. Alcohols, and phenol as an aromatic alcohol, are characterized by their reactivity and corrosiveness, resulting in high toxicity. These properties are explained by the denaturing effect of phenol on membrane proteins, which forms pores that may lead to cell death. Acetal includes a hydroxyl group which, as mentioned above, makes molecules more reactive. Acetals are stable with respect to hydrolysis by bases. This is an important property for toxic compounds, since the better protected they are from hydrolysis, the better they can exert their effects. In summary, an order can be defined, starting with the slightly toxic compounds with the lowest frequencies of the depicted functional groups, followed by the natural compounds, the drugs, and the medium toxic compounds, and ending with the highly toxic compounds, which possess the highest frequencies of the mentioned functional groups.

[Figure 6 plot: Functional properties. Categories: Alcohol, Acetal/Acetal-like, Phenol, Sugar. Series: TC 3-6, TC 6-9, TC >9, NC, Drugs. x-axis: compounds [%], from 0 to 100.]

Fig. 6. Distribution of the occurrences of functional groups of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = slightly toxic, 6-9 = medium toxic, >9 = highly toxic).

Toxicity versus Potency

3.3. Structural properties

Structural properties were also investigated as toxicity indicators. The most distinct ones are shown in Figure 7. The analyses of the structural characteristics in the three toxicity groups give results analogous to those for the functional properties: the more toxic a compound, the more pronounced the property. Since chiral centers occur in large numbers in sugar molecules, their distribution correlates with that of the sugar group and has the same origin: the high specificity and selectivity they provide ensure a very efficient and specific mode of action of toxic compounds. Conjugated double bonds contribute to the stability of a molecule, so that a high number hampers degradation and enables the toxin to exert its effects. Earlier studies revealed that the centers of aromatic rings act as hydrogen bond acceptors [17], which is expected to play a significant role in molecular associations. This ensures a very specific and selective mode of action, which explains the increasing number of ring systems with increasing toxicity.

[Figure 7: bar chart "Structural properties". Properties: ring system, conjugated double bond, chiral center. Series: TC 3-6, TC 6-9, TC >9, NC, drugs. Axis: compounds [%], 0 to 100.]

Fig. 7. Distribution of the occurrences of structural properties of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = low, 6-9 = medium, >9 = high).

3.4. Case study

Amatoxins are cyclic non-ribosomal oligopeptides found in several members of the Amanita genus of mushrooms, one being the death cap (Amanita phalloides). The most deadly of all the amatoxins is α-amanitin, with an oral LD50 of approximately 0.1 mg/kg. It is an inhibitor of RNA polymerase II and thereby blocks the transcription of DNA into RNA [18]. This leads to a total failure of protein synthesis, causing severe effects on liver and kidney [19]. Death usually occurs around a week after ingestion [20]. A map of the purine and pyrimidine pathway, which can be found in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [21], is shown in Figure 8. It displays in detail the function of RNA polymerase II and the effects that its inhibition by α-amanitin would cause.

Fig. 8. Excerpt of the purine pathway extracted from KEGG. The enzyme colored in red with the number "2.7.7.6" depicts the RNA polymerase II.

With a molecular weight of 918.97 g/mol, 13 hydrogen bond donors, and 15 hydrogen bond acceptors, the chemical "toxicity properties" of α-amanitin are consistent with our findings for the highly toxic compounds. Its many ring systems, conjugated double bonds, and chiral centers also fit our results on the structural "toxicity properties" of the highly toxic compounds.

4. Conclusion and Future Perspectives

In this work we were able to elucidate a continuous trend in structural, chemical, and functional properties across the different groups of toxic compounds. The analysis of hydrogen bond donors and acceptors, as well as of certain functional groups and structural features, revealed a positive correlation between occurrence and toxicity, whereas drugs and natural compounds show values similar to those of the slightly toxic compounds. Toxic compounds function in a variety of ways, and subgroups such as the highly toxic ones react with their targets in a completely different manner than drugs. While drugs are usually small compounds that are able to enter the cell and affect targets within it, many toxic compounds function by forming pores in membranes (e.g. alpha toxin from Staphylococcus aureus), by permanently activating, for example, sodium channels (aconitine), or by interacting with neurotransmitter receptors (strychnine). With the help of such mechanisms these toxic compounds are able to affect critical pathways which often cannot be circumvented. Therefore, these molecules are very effective. The data presented here provide valuable insight into the phenomenon of toxicity by elucidating "toxicity properties", i.e. characteristics of toxic compounds. Thus, the properties analyzed here can serve as additional criteria to predict toxicities with the help of QSAR. Additional toxicity-relevant properties, as presented here, will be helpful to improve such analyses. Further efforts will be made in the prediction of potential targets of unknown compounds.

Acknowledgements

This work was supported by the International Research Training Group Boston-Kyoto-Berlin, funded by the German Research Foundation (DFG).

References

[1] Watson, P. and Spooner, R.A., Toxin entry and trafficking in mammalian cells, Adv Drug Deliv Rev, 58: 1581-1596, 2006.
[2] Hong, H., Xie, Q., Ge, W., Qian, F., Fang, H., Shi, L., Su, Z., Perkins, R. and Tong, W., Mold(2), Molecular Descriptors from 2D Structures for Chemoinformatics and Toxicoinformatics, J Chem Inf Model, 2008.
[3] Hughes, L.D., Palmer, D.S., Nigsch, F. and Mitchell, J.B., Why are some properties more difficult to predict than others? A study of QSPR models of solubility, melting point, and Log P, J Chem Inf Model, 48: 220-232, 2008.
[4] Dunkel, M., Fullbeck, M., Neumann, S. and Preissner, R., SuperNatural: a searchable database of available natural compounds, Nucleic Acids Res, 34: D678-683, 2006.
[5] Goede, A., Dunkel, M., Mester, N., Frommel, C. and Preissner, R., SuperDrug: a conformational drug database, Bioinformatics, 21: 1751-1753, 2005.
[6] http://dtp.nci.nih.gov/
[7] http://chem.sis.nlm.nih.gov/chemidplus
[8] http://pubchem.ncbi.nlm.nih.gov/
[9] Teuscher, E. and Lindequist, U., Biogene Gifte, Gustav Fischer Verlag, Germany, 1994.
[10] Gunther, S., Kuhn, M., Dunkel, M., Campillos, M., Senger, C., Petsalaki, E., Ahmed, J., Urdiales, E.G., Gewiess, A., Jensen, L.J. et al., SuperTarget and Matador: resources for exploring drug-target relationships, Nucleic Acids Res, 36: D919-922, 2008.
[11] Guha, R., Howard, M.T., Hutchison, G.R., Murray-Rust, P., Rzepa, H., Steinbeck, C., Wegner, J. and Willighagen, E.L., The Blue Obelisk-interoperability in chemical informatics, J Chem Inf Model, 46: 991-998, 2006.
[12] http://openbabel.sourceforge.net/
[13] http://mychem.sourceforge.net/
[14] http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
[15] Lipinski, C.A., Lombardo, F., Dominy, B.W. and Feeney, P.J., Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings, Adv Drug Deliv Rev, 46: 3-26, 2001.
[16] van de Waterbeemd, H. and Gifford, E., ADMET in silico modelling: towards prediction paradise?, Nat Rev Drug Discov, 2: 192-204, 2003.
[17] Levitt, M. and Perutz, M.F., Aromatic rings act as hydrogen bond acceptors, J Mol Biol, 201: 751-754, 1988.
[18] Lindell, T.J. et al., Specific Inhibition of Nuclear RNA Polymerase II by α-Amanitin, Science, 170: 447-449, 1970.
[19] Wieland, T., Poisonous Principles of Mushrooms of the Genus Amanita: Four-carbon amines acting on the central nervous system and cell-destroying cyclic peptides are produced, Science, 159: 946-952, 1968.
[20] Mas, A., Mushrooms, amatoxins and the liver, Journal of Hepatology, 42: 166-169, 2005.
[21] http://www.genome.jp/kegg/

COMPARATIVE VEGF RECEPTOR TYROSINE KINASE MODELING FOR THE DEVELOPMENT OF HIGHLY SPECIFIC INHIBITORS OF TUMOR ANGIOGENESIS

ULRIKE SCHMIDT (1)

JESSICA AHMED (1), MICHAEL HOEPFNER (2)

ELKE MICHALSKY (1)

ROBERT PREISSNER (1)

(1) Structural Bioinformatics Group, Institute for Molecular Biology and Bioinformatics, Charite (CBF), Arnimallee 22, 14195 Berlin, Germany, http://bioinformatics.charite.de
(2) Molecular Tumor Therapy and Tumor Angiogenesis Group, Institute of Physiology, Charite (CBF), Arnimallee 22, 14195 Berlin, Germany

The vascular endothelial growth factor receptors (VEGF-Rs) play a significant role in tumor development and tumor angiogenesis and are therefore interesting targets in cancer therapy. Targeting the VEGF-R is of special importance because the supply of the tumor has to be reduced. In general, this can be achieved by inhibiting the tyrosine kinase function of the VEGF-R. However, known kinase inhibitors have specificity problems: they bind to the ATP-binding site and inhibit a number of kinases, and even the most specific inhibitors so far act on at least the three major types of VEGF-Rs: Flt-1, Flk-1/KDR, and Flt-4. The goal is a selective VEGF-R-2 (Flk-1/KDR) inhibitor, because this receptor triggers rather unspecific signals from VEGF-A, -C, -D and -E. Here, we describe a protocol that starts from an established inhibitor (Vatalanib) and combines 2D-/3D-searching, property filtering of the in silico screening hits, and the "negative docking approach". With this approach we were able to identify a compound that shows a fourfold higher reduction of the proliferation rate of endothelial cells than the lead structure.

Keywords: VEGF; cancer; tumor angiogenesis; homology modeling; in silico screening; docking

1. Introduction

Angiogenesis, the formation of new blood vessels, normally occurs moderately in adults, e.g. during wound healing and during the menstrual cycle. The process of angiogenesis is regulated by activators and inhibitors [1]. Tumor angiogenesis is the formation of networks of blood vessels supplying the tumor with oxygen and nutrients. Tumor cells induce this process by releasing signaling proteins to the surrounding normal tissue. The most important signaling proteins, which are also released by most cancer cells, are the vascular endothelial growth factors (VEGFs). The VEGF family consists of the following secreted glycoproteins: VEGF-A, VEGF-B, VEGF-C, VEGF-D, VEGF-E and the placental growth factors (PlGF-1 and -2) [2-4]. The VEGFs bind to VEGF receptor (VEGF-R) proteins on the endothelial cell surface with different binding affinities for each of the VEGF-Rs. Expression of VEGF-Rs varies in specific endothelial cell layers. VEGF-R-2 is located on almost all endothelial cells, whereas VEGF-R-1 and -3 are located on endothelial cells in distinct vascular layers [5]. Since angiogenesis was found to be necessary for tumor growth [6], the inhibition of pathological angiogenesis is a main goal in cancer therapy. In particular, the VEGF/VEGF-R pathway plays a significant role in the development of angiogenesis and therefore represents a point of interference for therapy in oncology [5]. Different strategies to inhibit tumor angiogenesis exist: it is possible to interfere with angiogenesis from the extracellular as well as from the intracellular side. In the extracellular region, for example, antibodies and soluble receptors can prevent binding of VEGF to the binding site of the receptor [6]. Moreover, VEGF antagonists block the ligand binding site of the VEGF-R on the extracellular side. Another way is the inhibition of the VEGF-R in the intracellular region by blocking the ATP-binding site of the tyrosine kinase [7]. However, known tyrosine kinase inhibitors have specificity problems: they bind to the ATP-binding site and inhibit a number of kinases. So far, the most specific inhibitors act on the VEGF-Rs. The goal would be to find a selective inhibitor for VEGF-R-2 (KDR), because it is expressed on almost all endothelial cells and the majority of the effects in angiogenesis, including cell proliferation, micro-vascular permeability [8], invasion, migration, and survival [9, 10], are mediated by VEGF-R-2. To find new compounds by using structure-based drug design, structural information about the target is needed. However, no complete crystal structures of the VEGF-Rs are available today.

Here, we describe a protocol to find novel potential VEGF-R inhibitors starting from an established inhibitor (Vatalanib, see Figure 1) [11]. This known inhibitor was used as lead structure for an in silico two- and three-dimensional search in an "Inhouse" database to identify novel potential VEGF-R tyrosine kinase inhibitors. Moreover, the structures of the ATP-binding sites of three VEGF-Rs were modeled, starting from an incomplete crystal structure of VEGF-R-2. These homology models were then used for comparative docking as a qualitative evaluation of the in silico screening results.

2. Methods

The in silico searching protocol consists of several steps, which are described in this section. In Figure 1 the procedure is schematically depicted.

2.1. Compound database

To search for new potential VEGF-R inhibitors we used our Inhouse database, which contains about four million compounds and more than 140 million conformers, which were pre-calculated by using the MedChemExplorer of Accelrys [12, 13]. Around 95% of the compounds stored in the Inhouse database are commercially available for experimental validation.

Fig. 1. Scheme of the in silico and in vitro screening protocol.

2.2. Two-dimensional searching

To search for similar structures in our Inhouse database we performed a 2D search. The screening is based on the chemical similarity between two molecules according to the similar property principle of Johnson and Maggiora [14]. A structural fingerprint [15], a binary string encoding the chemical characteristics of a compound, was calculated for the lead structure as well as for the database compounds. To screen the database, the fingerprint of the lead structure was compared to the fingerprints of the database entries by using the Tanimoto coefficient [16]. The Tanimoto coefficient is defined as:

\[ T(a, b) = \frac{N_{ab}}{N_a + N_b - N_{ab}} \]

$N_a$ is the number of bits set to 1 in the fingerprint of compound a, $N_b$ is the number of bits set to 1 for compound b, and $N_{ab}$ is the number of bits that compound a and compound b have set to 1 in common. A molecule with a similarity greater than 85% (≥ 0.85) to an active compound is assumed to be biologically active itself [17]. Therefore, only compounds with a similarity greater than 85% to the lead structure were considered.

2.3. Three-dimensional searching

A 3D-similarity search was applied to identify potential scaffold hoppers. For this purpose, the lead structure was compared to the conformers of drug-like compounds stored in our database. A plane representing the moment of inertia was fitted to each structure. For a comparison of two structures, the long and short sides of the planes were superimposed, which resulted in four different superimposition possibilities. The superimpositions were evaluated by using a scoring function, which includes the number of superimposed atoms and the Root Mean Square Deviation (RMSD). This scoring function is defined as:

\[ \text{score} = (\text{percentage of superimposed atoms}) \cdot e^{-\mathrm{RMSD}} \]

2.4. Homology modeling

Homology modeling of the three VEGF-Rs required several steps, which were performed with the aid of the Swiss-PdbViewer [18]. A crystal structure of VEGF-R-2 (PDB code: 1YWN) was obtained from the Protein Data Bank (PDB). This structure is not complete; two gaps are located in and near the ATP-binding site. The ATP-binding pocket was completed by using the SuperLooper web server [19]. Loops were extracted from the LIP database [20] and inserted into the structure via the web service. The completed model of VEGF-R-2 was then used as template structure for VEGF-R-1 and VEGF-R-3. Finally, the models were subjected to an energy minimization using the respective function of the Swiss-PdbViewer.


2.5. Property filtering

To estimate the drug-likeness of the 2D/3D-searching results, the compounds were filtered according to their molecular properties by using the "Lipinski rule of five". It comprises four empirical rules, which state that an orally available drug has:
• not more than 5 hydrogen bond donors
• not more than 10 hydrogen bond acceptors
• a molecular weight below 500 g/mol
• a logP (water/n-octanol partition) < 5.

A compound that breaks more than one rule is unlikely to become a drug [21]. Therefore, only compounds with no or at most one violation of the Lipinski rules were considered. The properties were calculated with the Accord for Excel Add-On [22].
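The filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Properties` class, the function names, and the three candidate compounds are hypothetical, and the property values are made up; in the study itself the properties were calculated with the Accord for Excel Add-On.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Molecular properties used by the rule-of-five filter (hypothetical container)."""
    h_donors: int       # number of hydrogen bond donors
    h_acceptors: int    # number of hydrogen bond acceptors
    mol_weight: float   # molecular weight in g/mol
    logp: float         # water/n-octanol partition coefficient (logP)


def lipinski_violations(p: Properties) -> int:
    """Count how many of the four rules listed above are broken."""
    return sum([
        p.h_donors > 5,          # rule: not more than 5 donors
        p.h_acceptors > 10,      # rule: not more than 10 acceptors
        p.mol_weight >= 500.0,   # rule: molecular weight below 500 g/mol
        p.logp >= 5.0,           # rule: logP < 5
    ])


def is_drug_like(p: Properties, max_violations: int = 1) -> bool:
    """Keep compounds with no or at most one violation, as in the text."""
    return lipinski_violations(p) <= max_violations


# Hypothetical candidates with illustrative values only.
candidates = {
    "cand_A": Properties(2, 6, 346.8, 4.1),
    "cand_B": Properties(1, 7, 512.3, 4.6),
    "cand_C": Properties(7, 12, 640.0, 5.4),
}
drug_like = [name for name, p in candidates.items() if is_drug_like(p)]
```

Counting violations rather than returning a plain pass/fail flag makes the tolerance of one broken rule explicit and easy to change.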

2.6. Docking

To evaluate the remaining drug-like candidates, they were docked into the ATP-binding site of the modeled VEGF-Rs by using the docking program Glide from Schrödinger [23]. The Glide scoring function (Glide SP score) was used to rank the docking results. The docking scores and the visual inspection of the docked ligand-protein complexes were used as a qualitative evaluation of the candidates and resulted in a ranking of those compounds. The best molecules were used for further in vitro screening.

2.7. In vitro screening

A kinase assay was used to test the drug candidates for their inhibitory effect on VEGF-Rs. The inhibitory potential is expressed by the IC50 value (the concentration at which kinase activity is reduced to 50%). Cytotoxicity was measured using an LDH assay. The ability to inhibit cell proliferation was tested on different cell lines (endothelial cell line EA-HY 926) for each of the potential angiogenesis inhibitors.
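As an illustration of the IC50 definition given above, the following sketch estimates the concentration at which activity drops to 50% by linear interpolation on a log-concentration scale. Both the interpolation scheme and the function name `ic50_log_interp` are assumptions made for this example; the text does not state how IC50 values were derived from the assay readouts, and the data points are hypothetical.

```python
import math


def ic50_log_interp(concentrations, activities):
    """Estimate IC50 from a dose-response series.

    concentrations: ascending inhibitor concentrations (same unit, > 0)
    activities:     remaining kinase activity in % of the untreated control
    Returns the interpolated concentration at 50% activity, or None if the
    series never crosses 50%.
    """
    points = list(zip(concentrations, activities))
    for (c_lo, a_lo), (c_hi, a_hi) in zip(points, points[1:]):
        if a_lo >= 50.0 >= a_hi and a_lo != a_hi:
            # Fraction of the way from the lower to the higher concentration.
            frac = (a_lo - 50.0) / (a_lo - a_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None


# Hypothetical measurements: activity falls from 75% at 1 uM to 25% at 100 uM.
estimate = ic50_log_interp([1.0, 100.0], [75.0, 25.0])
```

In practice a sigmoidal curve fit over all data points would usually be preferred; the interpolation is shown only because it makes the definition concrete.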

3. Results and Discussion

3.1. Sequence alignment and homology modeling

The sequence alignment of the VEGF-Rs, as shown in Figure 2, is the basis of homology modeling. In a second step the non-identical amino acids of the template structure were exchanged according to the VEGF-R sequences. Only gaps in the ATP-binding pocket were filled in.

Fig. 2. Sequence alignment of the three VEGF-Rs (VEGFR-3, VEGFR-2, VEGFR-1) after the homology modeling. Amino acid differences in the ATP-binding site are highlighted in black; other differences in grey.

Figure 3 shows a superimposition of the ATP-binding sites of all three homology modeled VEGF-Rs. Different amino acid residues in the ATP-binding site are shown in stick representation.

Fig. 3. Superimposition of the homology models of the VEGF-R-1 (light grey), VEGF-R-2 (dark grey) and VEGF-R-3 (black). Different amino acid residues are shown in stick representation.

3.2. In silico screening

The 2D-/3D-similarity screening of the Inhouse database for chemically and structurally similar compounds resulted in about 60 compounds which resemble the lead structure (with a Tanimoto coefficient ≥ 0.85). The number of potential candidates could be reduced to 21 drug-like compounds by applying the Lipinski rule of five as molecular property filter.
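The fingerprint comparison behind this screening step can be sketched as follows. The Tanimoto coefficient is computed as the number of bits set in both fingerprints divided by the number of bits set in either one, i.e. N_ab / (N_a + N_b - N_ab). Representing a fingerprint as a set of on-bit indices, the function names, and the toy fingerprints are assumptions for illustration; the study used MACCS-key fingerprints [15].

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    n_ab = len(fp_a & fp_b)                      # bits set in both fingerprints
    denominator = len(fp_a) + len(fp_b) - n_ab   # bits set in at least one
    return n_ab / denominator if denominator else 0.0


def screen(lead_fp, database, threshold=0.85):
    """Return the database entries whose similarity to the lead reaches the threshold."""
    hits = {}
    for name, fp in database.items():
        similarity = tanimoto(lead_fp, fp)
        if similarity >= threshold:
            hits[name] = similarity
    return hits


# Toy fingerprints (hypothetical): the lead has 20 bits set.
lead = frozenset(range(20))
database = {
    "close_analog": frozenset(range(18)) | {25},   # shares 18 bits, 19 bits set
    "distant": frozenset(range(10)),               # shares 10 bits, 10 bits set
}
hits = screen(lead, database)
```

Here `close_analog` scores 18/21 and passes the 0.85 cutoff, while `distant` scores 10/20 and is discarded.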


3.3. Docking

The remaining 21 structures were docked into the ATP-binding site of the VEGF-Rs. The docking scores and the visual inspection of the docked ligand-receptor complexes were combined as a qualitative evaluation of the in silico screening results. The docked structures of the lead compound Vatalanib and of compound 10 in VEGF-R-1, -2 and -3 are shown as examples in Figure 4a-c and Figure 4d-f, respectively.

Fig. 4. Ligand docked into the ATP-binding site (surface representation of the VEGF-Rs). Lead structure (Vatalanib): a) in VEGF-R-1, b) in VEGF-R-2 and c) in VEGF-R-3. Compound 10: d) in VEGF-R-1, e) in VEGF-R-2 and f) in VEGF-R-3.

In Table 1 the docking scores for Vatalanib and compound 10 are listed. The evaluation of the docking results reveals better scores for compound 10 than for the lead structure. This suggests that compound 10 should have similar or even better biological activity. Therefore, compound 10 was one of the 21 substances selected for experimental validation.

Table 1. Docking scores (Glide Score SP).

Receptor    Lead (Vatalanib)    Compound 10
VEGF-R-1    -4.51               -5.01
VEGF-R-2    -4.27               -4.86
VEGF-R-3    -4.92               -5.15
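The comparison in Table 1 can be made explicit with a few lines of Python. The values are copied from the table; the dictionary layout and variable names are illustrative. Glide scores are more favorable the more negative they are, so a negative difference means that compound 10 scores better than the lead.

```python
# Glide SP scores from Table 1 as (lead, compound 10) per receptor.
scores = {
    "VEGF-R-1": (-4.51, -5.01),
    "VEGF-R-2": (-4.27, -4.86),
    "VEGF-R-3": (-4.92, -5.15),
}

# Score difference (compound 10 minus lead), rounded to the table's precision.
difference = {
    receptor: round(compound_10 - lead, 2)
    for receptor, (lead, compound_10) in scores.items()
}

# Receptors for which compound 10 has the more favorable (more negative) score.
better_for = [receptor for receptor, delta in difference.items() if delta < 0]
```

With these numbers the difference is negative for all three receptors.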

3.4. Experimental validation

The twelve compounds were tested in vitro for VEGF-R kinase activity inhibition, cell proliferation, migration inhibition and cytotoxicity. In Figure 5 the result of a cell proliferation assay on the endothelial cell line EA-HY 926 for compound 10 compared to the lead structure Vatalanib is shown as an example.


It can be concluded that compound 10, at a concentration of 10 µM, reduces cell proliferation by about 40% (light grey), whereas cell proliferation decreases by about 8% when the cells are treated with the lead compound. The results shown here confirm the in silico screening results.

[Figure 5: plot "Cell proliferation (EA-HY 926)" versus concentration [µM]; Vatalanib in dark grey, compound 10 in light grey.]

Fig. 5. Cell proliferation assay (endothelial cell line EA-HY 926).

4. Conclusion and Future Work

Using this approach, we were able to identify new potential VEGF-R tyrosine kinase inhibitors. One of the hits was found to have a better effect on the inhibition of cell proliferation than the lead structure. Therefore, we reason that this compound is a specific inhibitor of tumor angiogenesis. This compound will undergo further in vitro and in vivo experiments and will be the starting point for further refinement cycles.

Acknowledgements

This work was supported by the International Research Training Group Boston-Kyoto-Berlin, funded by the DFG.

References

[1] Nishida, N., et al., Angiogenesis in cancer. Vasc Health Risk Manag, 2(3): 213-219, 2006.
[2] Ferrara, N., H.P. Gerber, and J. LeCouter, The biology of VEGF and its receptors. Nat Med, 9(6): 669-676, 2003.
[3] Tischer, E., et al., The human gene for vascular endothelial growth factor. Multiple protein forms are encoded through alternative exon splicing. J Biol Chem, 266(18): 11947-11954, 1991.
[4] Houck, K.A., et al., The vascular endothelial growth factor family: identification of a fourth molecular species and characterization of alternative splicing of RNA. Mol Endocrinol, 5(12): 1806-1814, 1991.
[5] Hicklin, D.J. and L.M. Ellis, Role of the vascular endothelial growth factor pathway in tumor growth and angiogenesis. J Clin Oncol, 23(5): 1011-1027, 2005.
[6] Los, M., J.M. Roodhart, and E.E. Voest, Target practice: lessons from phase III trials with bevacizumab and vatalanib in the treatment of advanced colorectal cancer. Oncologist, 12(4): 443-450, 2007.
[7] Underiner, T.L., B. Ruggeri, and D.E. Gingrich, Development of vascular endothelial growth factor receptor (VEGFR) kinase inhibitors as anti-angiogenic agents in cancer therapy. Curr Med Chem, 11(6): 731-745, 2004.
[8] Dvorak, H.F., Vascular permeability factor/vascular endothelial growth factor: a critical cytokine in tumor angiogenesis and a potential target for diagnosis and therapy. J Clin Oncol, 20(21): 4368-4380, 2002.
[9] Zeng, H., H.F. Dvorak, and D. Mukhopadhyay, Vascular permeability factor (VPF)/vascular endothelial growth factor (VEGF) receptor-1 down-modulates VPF/VEGF receptor-2-mediated endothelial cell proliferation, but not migration, through phosphatidylinositol 3-kinase-dependent pathways. J Biol Chem, 276(29): 26969-26979, 2001.
[10] Millauer, B., et al., High affinity VEGF binding and developmental expression suggest Flk-1 as a major regulator of vasculogenesis and angiogenesis. Cell, 72(6): 835-846, 1993.
[11] Drevs, J., PTK/ZK (Novartis). IDrugs, 6(8): 787-794, 2003.
[12] Smellie, A., et al., Conformational analysis by intersection: CONAN. J Comput Chem, 24(1): 10-20, 2003.
[13] MedChemExplorer, Accelrys Inc., http://www.accelrys.com/dstudio/ds_medchem.
[14] Johnson, M. and G. Maggiora, Concepts and Applications of Molecular Similarity. Wiley, NY, 1998.
[15] 960 bit MDL (Molecular Design LTD.) MACCS keys
[16] Delaney, J.S., Assessing the ability of chemical similarity measures to discriminate between active and inactive compounds. Mol Divers, 1(4): 217-222, 1996.
[17] Martin, Y.C., J.L. Kofron, and L.M. Traphagen, Do structurally similar molecules have similar biological activity? J Med Chem, 45(19): 4350-4358, 2002.
[18] Guex, N. and M.C. Peitsch, SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis, 18(15): 2714-2723, 1997.
[19] SuperLooper, http://bioinformatics.charite.de/superlooper. 2007.
[20] Michalsky, E., A. Goede, and R. Preissner, Loops in Proteins (LIP) - a comprehensive loop database for homology modelling. Protein Eng, 16: 979, 2003.
[21] Lipinski, C.A., et al., Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev, 46(1-3): 3-26, 2001.
[22] Accelrys Inc., http://accelrys.com/
[23] Schrödinger, Glide, version 4.5, Schrödinger, LLC, New York, NY. 2007.

NETWORK ANALYSIS OF ADVERSE DRUG INTERACTIONS

MASATAKA TAKARABE (1), TOSIHAKI TOKIMATSU (1)

SHUJIRO OKUDA (1), SUSUMU GOTO (1)

MASUMI ITOH (1), MINORU KANEHISA (1,2)

(1) Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
(2) Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan

Harmful effects associated with the use of drugs are caused by their side effects and by the combined use of different drugs. These drug interactions result in increased or decreased drug effects, or produce other new unwanted effects, and are serious problems for medical institutions and pharmaceutical companies. In this study, we created a drug-drug interaction network from drug package inserts and characterized drug interactions. The known information about the potential risk of drug interactions is described in drug package inserts. Japanese drug package inserts are stored in the JAPIC (Japan Pharmaceutical Information Center) database, and GenomeNet provides the GenomeNet pharmaceutical products database, which integrates the JAPIC and KEGG databases. We extracted drug interaction data from GenomeNet, where interactions are classified according to risks, contraindications or cautions for coadministration, and some entries include information about enzymes metabolizing the drugs. We defined drug targets and drug-metabolizing enzymes as interaction factors using information on them in KEGG DRUG, and classified drugs into pharmacological/chemical subgroups. In the resulting drug-drug interaction network, the drugs that are associated with the same interaction factors are closely interconnected. Mechanisms of these interactions were then identified by each interaction factor. To characterize other interactions without interaction factors, we used the ATC classification system and found an association between interaction mechanisms and pharmacological/chemical subgroups.

Keywords: drug interaction; network; KEGG

1. Introduction

Adverse drug events caused by drug interactions are significant problems in medication and in the development of new drugs. These drug interactions lead to an increase or decrease of drug effects or to other serious reactions. For example, cyclosporin, which is widely used as an immunosuppressant drug, is known to interact with many other drugs such as ketoconazole and erythromycin [1, 2]. Cyclosporin is metabolized by CYP3A4, which is a member of the cytochrome P450 family and catalyzes the oxidation of a number of substrates, whereas ketoconazole and erythromycin inhibit CYP3A4 enzyme activity. Thus, the combined use of these drugs results in delayed clearance and an elevated blood level of cyclosporin, which increases or prolongs both its therapeutic and adverse effects. Assessing and managing such drug interactions are significant problems for clinical practice and drug development. In this study, we focused on adverse drug interactions and created drug-drug interaction networks to characterize and investigate the drug interactions. To create the drug-drug interaction networks, we extracted drug interaction data from Japanese drug package inserts, which contain known information about the potential risk of drug interactions. The Japanese drug package inserts are stored in the JAPIC (Japan Pharmaceutical Information Center) database [12]. We have integrated the JAPIC and KEGG databases [3] and provide the result as the GenomeNet pharmaceutical products database [13]. Additionally, we defined interaction factors and merged drugs into pharmacological/chemical subgroups to characterize the drug interactions. In the resulting drug-drug interaction networks, drugs that are associated with the same interaction factors are closely interconnected, and mechanisms of the drug interactions were identified by the interaction factors (CYP enzyme family or monoamine receptors, for example). Some other drug interactions without interaction factors were characterized by using information from pharmacological/chemical subgroups.

2. Method

2.1. Datasets

The GenomeNet pharmaceutical products database provides Japanese drug package insert data linked to the KEGG DRUG database. Each entry contains information on the brand/generic name, physicochemical/pharmacokinetic properties, drug interactions, etc. The drug interaction section lists the drugs or the classes of drugs that cause adverse interactions with the product, and these interactions are classified according to risks, contraindications or cautions for coadministration. Additionally, some drugs contain additional sections which include information on enzymes metabolizing the products, such as the cytochrome P450 family. Most entries are assigned KEGG DRUG IDs (D numbers), which correspond to the active ingredient of the products. The KEGG DRUG database is a chemical structure-based database in which each entry includes information on chemical structure, efficacy, drug target, pathway, ATC code, etc.

2.2. Drug interaction network

We used the data from the GenomeNet pharmaceutical products database as of March 26, 2008. The database stored 13,973 pharmaceutical product entries, of which 7,562 contained drug interaction information. We extracted drug names from the drug interaction section of each entry and listed the JAPIC IDs that correspond to the drug names to create drug interaction data between JAPIC IDs. Next, JAPIC IDs were merged according to the D numbers assigned to them, because we considered that products with the same medicinal properties have the same potential risk of drug interactions. Consequently, we obtained drug interaction data between D numbers and used the data to create drug interaction networks.
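The merging step can be sketched as a simple mapping from product-level pairs to ingredient-level pairs. This is a minimal illustration, not the authors' pipeline: the function name, the identifiers, and the choices to treat interactions as undirected, to skip products without a D number, and to drop pairs that collapse onto a single D number are all assumptions.

```python
def merge_to_d_numbers(japic_pairs, japic_to_d):
    """Collapse interactions between JAPIC IDs into interactions between D numbers.

    japic_pairs: iterable of (japic_id, japic_id) interaction pairs
    japic_to_d:  dict mapping a JAPIC ID to its D number (may be incomplete)
    Returns a set of frozensets, one per unordered D-number pair.
    """
    merged = set()
    for product_a, product_b in japic_pairs:
        d_a = japic_to_d.get(product_a)
        d_b = japic_to_d.get(product_b)
        # Assumption: skip products without a D number and self-pairs.
        if d_a is None or d_b is None or d_a == d_b:
            continue
        merged.add(frozenset((d_a, d_b)))
    return merged


# Hypothetical identifiers for illustration only.
japic_to_d = {"J001": "D_A", "J002": "D_B", "J003": "D_A"}
japic_pairs = [
    ("J001", "J002"),
    ("J003", "J002"),   # same ingredients as the first pair
    ("J002", "J001"),   # reverse direction of the first pair
    ("J004", "J001"),   # J004 has no D number
]
d_interactions = merge_to_d_numbers(japic_pairs, japic_to_d)
```

Because many products share one active ingredient, the merged set is much smaller than the product-level list, which is the kind of reduction reported in the Results.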


To characterize the drug interactions, we defined drug targets and drug-metabolizing enzymes as interaction factors for each D number and searched for drug interactions associated with the same interaction factors. Information on the interaction factors was collected from the package insert data and the KEGG DRUG database. Drug target gene data stored in the KEGG DRUG database were merged by functional type of protein according to KEGG BRITE, which is a collection of hierarchical classifications [3].

2.3. Pharmacological/chemical subgroups

We used the Anatomical Therapeutic Chemical (ATC) classification system, developed by the WHO Collaborating Centre for Drug Statistics Methodology [14], to group D numbers. The ATC classification system divides drugs at 5 different levels according to their sites of action and their therapeutic and chemical characteristics. Each level is assigned a code consisting of 1 letter or 2 digits that corresponds to a pharmacological/chemical subgroup at that level. Drugs assigned the same ATC code therefore belong to the same pharmacological/chemical subgroup. Thus, D numbers were grouped into chemical substance subgroups in terms of the pharmacological/chemical categories of the ATC classification system.

3. Results

The numbers of extracted interactions between JAPIC IDs were 29,663 and 1,196,494 for contraindications and cautions for coadministration, respectively. After merging JAPIC IDs into D numbers, we obtained 1,513 and 36,040 interactions between D numbers, respectively (Table 1).

Table 1. Number of drug interactions and entries involved in the interactions.

            Contraindications         Cautions
            Interaction   Entry       Interaction   Entry
JAPIC ID    29,663        3,043       1,196,494     9,432
D number    1,513         517         36,040        1,431

3.1. Interaction factors

We created network graphs from the resulting data on drug interactions and interaction factors. Figure 1 shows the network obtained for contraindications for coadministration. In the network, nodes represent the D numbers that correspond to the drugs, and edges represent interactions. Node sizes are proportional to the number of edges they have. Bold edges indicate interactions between drugs associated with the same interaction factor and are colored according to that interaction factor.
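The node-size rule can be sketched as follows. The adjacency representation and the scale parameter are illustrative assumptions; the paper does not specify how the graphs were drawn.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def node_sizes(adjacency, scale=1.0):
    """Node size proportional to the number of edges (degree) of the node."""
    return {node: scale * len(neighbors) for node, neighbors in adjacency.items()}


# Hypothetical edges; the duplicate ("B", "A") does not add a second edge.
example_adjacency = build_adjacency([("A", "B"), ("A", "C"), ("B", "A")])
example_sizes = node_sizes(example_adjacency, scale=2.0)
```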

Network Analysis of Adverse Drug Interactions


[Figure 1: network graph. Edge legend: CYP family; Monoamine receptor; Other interaction factors.]

Fig. 1. Drug interaction network of contraindications for coadministration. Interaction factors were merged into the CYP enzyme family, monoamine receptor, and others. Bold edges were colored according to these interaction factor groups.

We obtained 12 and 38 interaction factors for contraindications and cautions for coadministration, respectively. Table 2 shows the top 5 interaction factors that are associated with both drugs in an interaction. CYP families and monoamine (adrenaline, serotonin, dopamine, histamine, etc.) receptors are the most frequently observed interaction factors shared by both drugs. The interactions between drugs associated with the same interaction factor are closely interconnected.

Table 2. Number of interactions and drugs with interaction factor.

Contraindications
Interaction factor     # of interactions   # of drugs
CYP3A                  181                 77
Adrenaline receptor    33                  17
Serotonin receptor     28                  8
CYP2D                  17                  14
CYP1A                  16                  16

Cautions
Interaction factor     # of interactions   # of drugs
CYP3A                  1,916               147
Adrenaline receptor    200                 52
CYP2C                  200                 50
Dopamine receptor      182                 42
CYP1A                  113                 31

Information on the action mechanisms of these interactions is provided in the package inserts. For instance, drug interactions involving CYP families are caused by inhibition/induction of the enzymes and result in a decrease/increase in the effects of the drugs. In the case of drug interactions involving monoamine receptors, both drugs affect the same receptors, which results in an additive effect on the receptors.

Next, we investigated the remaining interactions without interaction factors by using information from pharmacological/chemical subgroups. In the network of contraindications for coadministration, 398 D numbers were assigned ATC codes and merged into 331 pharmacological/chemical subgroups. In the network of cautions for coadministration, 1,042 D numbers were merged into 941 subgroups. To explore the association between interaction mechanisms and pharmacological/chemical subgroups, we searched for hub nodes and the common pharmacological/chemical categories of their neighboring nodes. Figure 2 shows the example of D00951 (Medroxyprogesterone acetate) and its neighboring nodes, with pharmacological/chemical subgroup information, in the network of contraindications for coadministration. D00951 interacts with 97 different drugs, of which 43 are included in the most common category, "Corticosteroids, plain", which corresponds to the third-level ATC code "D07A". Interactions between D00951 and the "Corticosteroids, plain" subgroup increase the risk of side effects of both drugs, such as cardiovascular disease [4, 5, 6].

4. Discussion

We created drug interaction networks from Japanese drug package insert information to explore adverse drug interactions. In the resulting networks, many drugs are associated with the same interaction factors and are closely connected with each other. Consequently, many drugs interact almost exclusively with drugs associated with the same interaction factors. For example, D02211 (Dihydroergotamine mesilate) interacts with 37 different drugs, of which 30 are associated with CYP3A, and D00560 (Pimozide) interacts with 23 different drugs, of which 21 are associated with CYP3A. Dihydroergotamine mesilate and pimozide are reported to be metabolized by CYP3A [7, 8], and coadministration of either drug with CYP3A inhibitors or with drugs metabolized by CYP3A causes serious side effects such as QT prolongation or ventricular arrhythmia. These interaction factors enabled us to characterize drug interactions and identify their mechanisms, because the interaction mechanisms and clinical symptoms depend on the interaction factors.

The drug interaction networks obtained include many nodes and edges. In the network of cautions for coadministration in particular, it is difficult to explore drug interactions from the network graph. For efficient analysis,


elimination of drugs and interactions associated with the same interaction factors may be an effective way to reduce the number of nodes and edges in the drug networks.

Next, we used the ATC classification system to investigate interactions between drugs that have no interaction factor information or that are assigned different interaction factors. We applied the pharmacological/chemical subgroup information to the neighboring nodes of each node and searched for their common pharmacological/chemical categories at the third or fourth level of the ATC code. For some drugs, common pharmacological/chemical categories were found among the neighboring nodes, and characteristic interaction mechanisms or clinical symptoms are related to these categories. Figure 2 is one example of the association between interaction mechanisms and pharmacological/chemical subgroups, and Figure 3A shows another. D00386 (Triamterene) interacts with 8 different drugs, of which 6 are classified in the "Acetic acid derivatives" subgroup, and these interactions cause acute renal failure [9, 10]. Figure 3B illustrates the case of D00089 (Oxytocin); these interactions enhance the effects of both drugs and lead to serious events [11]. The results indicate that this method based on pharmacological/chemical subgroups is effective for investigating drug interactions that lack interaction factor information. However, some drug interactions remain uncharacterized. Further research needs more exhaustive data, including drug interactions, targets and other new pharmacological/chemical properties, to determine the uncharacterized drug interactions.

Acknowledgments

We thank J.B. Brown for critical reading of our manuscript. This work was supported by grants from the Ministry of Education, Culture, Sports, Science and Technology of Japan and the Japan Science and Technology Agency. The computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University.


Fig. 2. D00951 and its neighboring nodes in the network of contraindications for coadministration. Red nodes represent nodes that are included in the "Corticosteroids, plain" subgroup ("D07A").

Fig. 3. Associations between interaction mechanisms and pharmacological/chemical subgroups in the network of contraindications for coadministration. Red nodes represent nodes that are included in the same pharmacological/chemical subgroup. (A) D00386 (Triamterene) interacts with 6 drugs classified in the "Acetic acid derivatives" subgroup. (B) D00089 (Oxytocin) interacts with 5 drugs classified in the "Prostaglandins" subgroup.
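The neighbor-category search illustrated in Figs. 2 and 3 can be sketched as follows. The sketch relies on the standard ATC code layout (1, 3, 4, 5 and 7 characters for levels 1 to 5); the function names, the data layout and the example identifiers are assumptions made for illustration, and each neighbor is counted once per subgroup even if it carries several ATC codes.

```python
from collections import Counter

# Number of leading characters that identify each ATC level.
ATC_LEVEL_LENGTH = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_prefix(code, level):
    """Return the ATC subgroup code at the given level, or None if too short."""
    length = ATC_LEVEL_LENGTH[level]
    return code[:length] if len(code) >= length else None


def most_common_subgroup(node, adjacency, atc_codes, level=3):
    """Return (subgroup, neighbor count) for the most frequent subgroup
    among the neighbors of `node`, or None if no neighbor has an ATC code."""
    counts = Counter()
    for neighbor in adjacency.get(node, ()):
        prefixes = {atc_prefix(c, level) for c in atc_codes.get(neighbor, ())}
        prefixes.discard(None)
        counts.update(prefixes)
    return counts.most_common(1)[0] if counts else None


# Hypothetical hub and neighbors with illustrative ATC codes.
example_adjacency = {"hub": {"n1", "n2", "n3"}}
example_atc = {
    "n1": ["D07AA02"],
    "n2": ["D07AB01", "H02AB04"],
    "n3": ["C03DB02"],
}
example_result = most_common_subgroup("hub", example_adjacency, example_atc)
```

Calling the function with `level=4` gives the corresponding fourth-level search.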


References

[1] Wadhwa, N.K., Schroeder, T.J., Pesce, A.J., Myre, S.A., Clardy, C.W., First, M.R., Cyclosporine drug interactions: a review, Ther. Drug Monit., 9(4):399-406, 1987.
[2] Pichard, L., Fabre, I., Fabre, G., Domergue, J., Saint Aubert, B., Mourad, G., Maurel, P., Cyclosporin A drug interactions. Screening for inducers and inhibitors of cytochrome P-450 (cyclosporin A oxidase) in primary cultures of human hepatocytes and in liver microsomes, Drug Metab. Dispos., 18(5):595-606, 1990.
[3] Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T., Yamanishi, Y., KEGG for linking genomes to life and the environment, Nucleic Acids Res., 36:D480-D484, 2008.
[4] Falkeborn, M., Persson, I., Adami, H.O., Bergstrom, R., Eaker, E., Lithell, H., Mohsen, R., Naessen, T., The risk of acute myocardial infarction after oestrogen and oestrogen-progestogen replacement, Br. J. Obstet. Gynaecol., 99(10):821-828, 1992.
[5] Lacroix, K.A., Bean, C., Reilly, R., Curran-Celentano, J., The effects of hormone replacement therapy on antithrombin III and protein C levels in menopausal women, Clin. Lab. Sci., 10(3):145-148, 1997.
[6] Al-Farra, H.M., Al-Fahoum, S.K., Tabbaa, M.A., First, M.R., Effect of hormone replacement therapy on hemostatic variables in post-menopausal women, Saudi Med. J., 26(12):1930-1935, 2005.
[7] Moubarak, A.S., Rosenkrans, C.F. Jr., Johnson, Z.B., Modulation of cytochrome P450 metabolism by ergonovine and dihydroergotamine, Vet. Hum. Toxicol., 45(1):6-9, 2003.
[8] Desta, Z., Kerbusch, T., Soukhova, N., Richard, E., Ko, J.W., Flockhart, D.A., Identification and characterization of human cytochrome P450 isoforms interacting with pimozide, J. Pharmacol. Exp. Ther., 285(2):428-437, 1998.
[9] Favre, L., Glasson, P., Vallotton, M.B., Reversible acute renal failure from combined triamterene and indomethacin: a study in healthy subjects, Ann. Intern. Med., 96(3):317-320, 1982.
[10] Favre, L., Vallotton, M.B., Relationship of renal prostaglandins to three diuretics, Prostaglandins Leukot. Med., 14(3):313-319, 1984.
[11] Tomialowicz, M., Florjanski, J., Zimmer, M., The use of oxytocin and prostaglandin in pregnancies after cesarean delivery or uterine surgery, Ginekol. Pol., 71(4):242-246, 2000.
[12] http://database.japic.or.jp/nw/index
[13] http://www.genome.jp/kusuri/
[14] http://www.whocc.no/atcddd/

SAMPLING GEOMETRIES OF PROTEIN-PROTEIN COMPLEXES

AYSAM GUERLER [email protected]
STEPHAN LORENZEN [email protected]
FLORIAN KRULL [email protected]
ERNST-WALTER KNAPP [email protected]

Freie Universität Berlin, Department of Chemistry and Biochemistry, Fabeckstr. 36a, 14195 Berlin-Dahlem, Germany

Protein-protein docking is a major task in structural biology. In general, the geometries of protein pairs are sampled by generating docked conformations, analyzing them with scoring functions and selecting appropriate geometries for further refinement. Here, we present an algorithm in real space to sample geometries of protein pairs. To this end, we first determine uniformly distributed points on the surfaces of the two protein structures to be docked and additionally define a set of uniformly distributed rotations. The sampling method then generates structures of protein pairs as follows: (i) we rotate one protein of the pair according to a selected rotation and (ii) translate it along a line connecting two surface points belonging to different proteins such that these surface points coincide. The resulting protein pair geometries are then analyzed and selected using a scoring function that considers residues and atom pairs. We applied this approach to a set of 22 enzyme-inhibitor complexes and demonstrate that a discretisation of the rigid-body search in real space provides an efficient and robust sampling scheme. Our method generates decoy sets with a considerable fraction of near-native geometries for all considered enzyme-inhibitor complexes.

Keywords: protein-protein docking; rigid-body geometry search; interface analysis

1. Introduction

Proteins are important regulators of biochemical processes in biological cells. They are used, for instance, to catalyze chemical reactions, to transport substrates through membranes and to stabilize cellular structures. Interactions with other molecules can affect a protein's macromolecular structure and functionality. For proteins whose function is to form specific complexes with other proteins, the shape of the contact surface and the residue pair interactions at the contact surface are especially relevant [1]. This protein-protein interaction obeys the key-lock principle and is driven by free energy contributions, resulting in high binding affinities. Binding can influence the function of proteins in diverse ways, from total inhibition to enhancement or induction. Although genome-wide proteomics studies indicate that many proteins interact with each other, the number of complexes in the Protein Data Bank (PDB) increases very slowly. Possibly, this is related to the instability of transient protein-protein interactions, which makes a crystallographic analysis difficult. Therefore, theoretical approaches for the identification and prediction of protein-protein interactions can be of great importance. Many efforts have been made to find a computational solution to this problem. Unlike the prediction of the binding modes for small molecules (e.g. FlexX [2],


ICM [3] and Fado [4]), most protein-protein docking approaches consider the structures of the individual proteins in the complex to be rigid. Initially, a wide variety of docked conformations are generated and simultaneously evaluated by scoring functions. In general, these methods perform well when applied to individual protein conformations that are taken directly from the corresponding co-crystallized structures. However, predicting protein complex geometries using protein structures obtained from separate crystallization assays remains difficult, often leading to many false positives. The binding process often involves conformational changes. Although these are generally subtle, they make it more difficult to find the proper complex geometry. Therefore, a further refinement of the proposed complex geometries by other methods, e.g. Monte Carlo approaches, is often necessary.

Currently, most established methods for rigid-body analysis of protein-protein interactions are based on the convolution technique in Fourier space as initially utilized by Katchalski-Katzir et al. in 1992 [5]. These approaches include ZDOCK [6], MolFit [7], 3D-Dock [8], DOT [9], GRAMM [10] and others. These methods use a scoring function defined on a discrete grid for each of the two proteins. Instead of evaluating the scoring function in real space, which is computationally expensive, the values of the scoring function are obtained by multiplying the corresponding Fourier transformed grids. This is done by assigning the atomic interaction parameters for each protein on separate grids, which are subsequently transformed by the fast Fourier transform (FFT) algorithm. In Fourier space the Fourier coefficients are multiplied and the results are transformed back to real space. This is done for a large set of protein orientations [5].

Besides the FFT-based approaches, a variety of other procedures have also been applied to the protein-protein docking problem. Nussinov et al. proposed an algorithm based on geometric matching of knobs on the interacting surfaces [11]. Others, such as Baker [12] and Abagyan [13], have developed highly accurate methods using Monte Carlo simulations. The protein complex geometries are clustered [14] and their stability is analyzed by perturbation studies using different scoring functions [15]. The development of proper scoring functions is a non-trivial problem in protein-protein docking. A large variety of scoring functions attempt to capture the biophysically relevant properties for protein complex formation, such as interactions based on physical principles, on residue pair distributions or on geometric fit [16-20].

In this work, we describe a real space rigid-body protein-protein docking approach. Instead of assigning atom-specific interaction parameters to each grid point, as necessary for FFT methods, we can take into consideration all interactions of atom pairs within a certain cutoff distance from the protein surfaces. In order to reduce the computational costs in real space, an efficient sampling strategy of the search space is used, which in turn allows additional parameters to be considered in the scoring function. Two proteins are translated and rotated by a discrete set of transformations. To obtain the corresponding parameters for the transformations, the protein surfaces are uniformly covered by surface points. In addition, a set Q of uniformly distributed quaternions is generated from which the rotations are obtained. The translational vector is defined by the line connecting the


A. Guerler et al.

pair of surface points selected from each of the two proteins. The residues interacting in the resulting geometry are evaluated by a statistical scoring function, which comprises geometrical and physicochemical components by considering residue pairs and atom pairs. The parameters of the scoring function were determined by Heuser et al. for enzyme-inhibitor complexes [20, 21].

2. Methods

2.1. Preparing surface and grid representation

From now on, we call the smaller of the two proteins the ligand (L) and the larger the receptor (R). We embed both proteins in a grid with a grid constant of 1.0 Å. Points of the receptor grid GR that lie within the van der Waals (vdW) sphere of a receptor atom (radius of 1.8 Å for all atoms) are inside the receptor and are marked as receptor points. Receptor grid points outside the vdW volume of the protein store a neighbor list of the protein atoms within a distance cutoff of rcut(neighbor) = 7 Å. This neighbor list provides an efficient way to find atomic interaction partners between the two proteins in the complex structure.
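A minimal brute-force sketch of this grid construction is shown below. The constants are taken from the text; the bounding-box margin, the inclusive distance comparisons, the dictionary layout and the function name are assumptions, and a production implementation would use a cell list rather than testing every atom for every grid point.

```python
import math
from itertools import product

VDW_RADIUS = 1.8       # Å, same radius for all atoms (from the text)
NEIGHBOR_CUTOFF = 7.0  # Å, rcut(neighbor) (from the text)
GRID_SPACING = 1.0     # Å, grid constant (from the text)


def build_grid(atoms, spacing=GRID_SPACING, margin=NEIGHBOR_CUTOFF):
    """Map each grid point to None (inside the protein) or to the list of
    indices of atoms within the neighbor cutoff.

    The margin around the atoms' bounding box is an assumption.
    """
    lo = [math.floor(min(a[d] for a in atoms) - margin) for d in range(3)]
    hi = [math.ceil(max(a[d] for a in atoms) + margin) for d in range(3)]
    axes = [
        [lo[d] + i * spacing for i in range(int(round((hi[d] - lo[d]) / spacing)) + 1)]
        for d in range(3)
    ]
    grid = {}
    for point in product(*axes):
        distances = [math.dist(point, atom) for atom in atoms]
        if min(distances) <= VDW_RADIUS:
            grid[point] = None  # inside a vdW sphere: protein point
        else:
            grid[point] = [i for i, d in enumerate(distances) if d <= NEIGHBOR_CUTOFF]
    return grid


# Toy "protein" with a single atom at the origin, for illustration only.
example_grid = build_grid([(0.0, 0.0, 0.0)])
```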

Fig. 1. Generation of neighbor list and surface points. Small spheres denote the protein atoms. a) The atom neighbor list of a reference grid point (center of large sphere) contains the numbers of the atoms within the cutoff distance (largest sphere). b) Initial surface points (thicker red points of the grid) are all grid points that are within a specified minimal and maximal distance (medium-size blue spheres denoted by dashed lines) of the nearest protein atoms. c) The initial surface points are translated towards the center of the nearest protein atom until the vdW surface of the atom is reached (blue points on the surface of the gray spheres).

For both proteins (ligand and receptor) the grids are also used to determine surface points and surface normal vectors (see Fig. 1 for more details). In a first approximation, the protein surface points are those grid points whose distances to the nearest protein atoms are between 4.0 and 6.0 Å. These points are then projected onto the vdW surface of the nearest atom sphere. For each such surface point, we calculate a surface normal vector connecting the assigned atom center with the surface point. Then, we compute for


all atoms of a residue the average of the surface normal vectors. Next, we reduce the number of surface points. To obtain an even distribution of surface points, we randomly select a single surface point and delete all other surface points within a distance of rcut(surface) = 7 Å. We then select the nearest remaining surface point and repeat the procedure until all surface points have been selected or deleted. We denote the resulting sets of surface points by SR and SL and the corresponding sets of normal vectors by VR and VL for the receptor and ligand, respectively. For the rotations, a set Q of 8000 uniformly distributed quaternions is calculated with the approach described by Kuffner [22].
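The thinning of surface points can be sketched as follows. The text leaves two details open, which are treated here as assumptions: points at exactly the cutoff distance are deleted, and the "nearest remaining" point is taken relative to the point selected last. The function name and the seed parameter are hypothetical.

```python
import math
import random


def thin_surface_points(points, r_cut=7.0, seed=0):
    """Greedily select surface points so that no two selected points lie
    within r_cut of each other (assumption: distance <= r_cut is deleted)."""
    if not points:
        return []
    rng = random.Random(seed)
    remaining = list(points)
    current = remaining.pop(rng.randrange(len(remaining)))  # random start
    selected = [current]
    while True:
        # Delete every remaining point within the cutoff of the current one.
        remaining = [p for p in remaining if math.dist(p, current) > r_cut]
        if not remaining:
            break
        # Continue with the remaining point nearest to the current one.
        current = min(remaining, key=lambda p: math.dist(p, current))
        remaining.remove(current)
        selected.append(current)
    return selected


# Toy input: 30 points on a line with 1 Å spacing, for illustration only.
example_points = [(float(x), 0.0, 0.0) for x in range(30)]
example_selected = thin_surface_points(example_points)
```

By construction, every selected pair is more than `r_cut` apart and every input point lies within `r_cut` of some selected point, which gives the even coverage described above.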

2.2. Sampling strategy

During the generation of the protein-protein geometries (called decoys), the receptor stays fixed, while the ligand is moved, i.e. translated and rotated. A decoy is defined by the triplet [q(k), sR(i), sL(j)] of a quaternion q(k) ∈ Q and surface points sR(i) and sL(j) of receptor and ligand, respectively. For each pair [sR(i), sL(j)] of surface points we compute the angle