Multiscale Approaches to Protein Modeling: Structure Prediction, Dynamics, Thermodynamics and Macromolecular Assemblies [1 ed.] 1441968881, 9781441968883, 144196889X, 9781441968890

Multiscale Approaches to Protein Modeling is a comprehensive review of the most advanced multiscale methods for protein


English Pages 355 [368] Year 2011


Table of contents :
Front Matter....Pages i-xii
Lattice Polymers and Protein Models....Pages 1-20
Multiscale Protein and Peptide Docking....Pages 21-33
Coarse-Grained Models of Proteins: Theory and Applications....Pages 35-83
Conformational Sampling in Structure Prediction and Refinement with Atomistic and Coarse-Grained Models....Pages 85-109
Effective All-Atom Potentials for Proteins....Pages 111-126
Statistical Contact Potentials in Protein Coarse-Grained Modeling: From Pair to Multi-body Potentials....Pages 127-157
Bridging the Atomic and Coarse-Grained Descriptions of Collective Motions in Proteins....Pages 159-178
Structure-Based Models of Biomolecules: Stretching of Proteins, Dynamics of Knots, Hydrodynamic Effects, and Indentation of Virus Capsids....Pages 179-208
Sampling Protein Energy Landscapes – The Quest for Efficient Algorithms....Pages 209-230
Protein Structure Prediction: From Recognition of Matches with Known Structures to Recombination of Fragments....Pages 231-254
Genome-Wide Protein Structure Prediction....Pages 255-279
Multiscale Approach to Protein Folding Dynamics....Pages 281-293
Error Estimation of Template-Based Protein Structure Models....Pages 295-314
Evaluation of Protein Structure Prediction Methods: Issues and Strategies....Pages 315-339
Back Matter....Pages 341-355

Multiscale Approaches to Protein Modeling

Andrzej Kolinski Editor


Editor
Andrzej Kolinski
Department of Chemistry
University of Warsaw
ul. Pasteura 1
02-093 Warszawa
Poland
[email protected]

ISBN 978-1-4419-6888-3 e-ISBN 978-1-4419-6889-0 DOI 10.1007/978-1-4419-6889-0 Springer New York Dordrecht Heidelberg London Library of Congress Control Number: 2010934732 © Springer Science+Business Media, LLC 2011 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Thanks to enormous progress in the sequencing of genomic data, we now know millions of protein sequences. At the same time, the number of experimentally solved protein structures is much smaller, ca. 60,000, because structure determination is costly. Thus, the theoretical in silico prediction of protein structures and dynamics is essential for understanding the molecular basis of drug action and of metabolic and signaling pathways in living cells, and for designing new technologies in the life and material sciences. Unfortunately, a “brute force” approach remains impractical. Folding of a typical protein (in vivo or in vitro) takes milliseconds to minutes, while state-of-the-art all-atom molecular mechanics simulations of protein systems can cover only a time period of nanoseconds to microseconds. This is the reason for the enormous progress in the development of various multiscale modeling techniques applied to protein structure prediction, modeling of protein dynamics and folding pathways, in silico protein engineering, model-aided interpretation of experimental data, modeling of macromolecular assemblies, and theoretical studies of protein thermodynamics. Coarse-graining of the proteins’ conformational space is a common feature of all these approaches, although the details and the underlying physical models span a very broad spectrum. This book contains comprehensive reviews of the most advanced multiscale modeling methods in protein structure prediction, computational studies of protein dynamics, folding mechanisms, and macromolecular interactions. The presented approaches span a wide range of levels of coarse-grained representation, various sampling techniques, and a variety of applications to biomedical and biophysical problems.
It was our intention to provide a collection of comprehensive reviews that could serve as a reference book for those who are just beginning their adventure with biomacromolecular modeling, and also as a valuable source of more detailed information for those who are already experts in this field and in related areas of computational biology or biophysics.

Proteins are linear copolymers composed of amino acids. Important ideas of polymer physics inspired the field of protein modeling. Chapter 1 explains some basic concepts of polymer conformational statistics and dynamics of chain molecules in the context of simple lattice models. This chapter demonstrates how these ideas could be employed in protein modeling. Chapter 2 describes the application of a lattice-based protein model to the very challenging problem of protein docking. Chapter 3 provides a comprehensive overview of various coarse-grained protein-like and protein models. This chapter describes (among other approaches) probably the most rigorous system of physics-based reduced modeling of proteins. Coarse-grained, multiscale protein modeling requires specific designs of interaction schemes. Chapters 4–6 provide in-depth overviews of force fields at various levels for the reduced representations of protein conformational space, including knowledge-based statistical potentials. Chapters 7 and 8 (but also, in part, Chapters 3–5 and 12) describe a variety of applications of reduced models in the study of protein dynamics, folding pathways, molecular mechanisms of mechanical unfolding, and protein interactions. Chapter 9 gives an overview of the most effective sampling strategies in a reduced, although unrestricted, conformational space. Chapters 10 and 11 present a very efficient philosophy of conformational search, where the target structures are assembled from fragments excised from already known protein structures. These strategies have proven to be very effective in large-scale, automated in silico structure prediction. Chapter 12 describes a multiscale method, based on a high-resolution lattice model, for modeling protein folding pathways. Chapters 13 and 14 discuss the most important ideas and techniques of comparative modeling, the most effective and the most popular method for theoretical prediction of protein structures. These chapters also provide reviews of the model-quality assessment methods.

The contributing authors are recognized experts worldwide. Some of them (Bujnicki and Zhang) are leaders in the field of protein structure prediction, as assessed by the recent (CASP6–CASP8) community-wide experiments in blind structure prediction.
Others also developed very successful methods for protein structure prediction (Scheraga, Liwo, Feig, and Kihara). Several of the authors of this book developed very efficient coarse-grained interaction schemes for protein models based on either an evolutionary knowledge approach (Jernigan and Scheraga have built theoretical foundations of this class of approaches, but others also contributed significantly: Feig and Micheletti) or a physics-based approach (Scheraga, Liwo, Feig, and Irbäck). Among the authors are also the world's top leaders in comparative modeling (Bujnicki, Zhang, Tramontano, and Kihara) and automated structure prediction (Zhang and Bujnicki); the structure prediction server created by Zhang is the best to date. The book also presents state-of-the-art methods for evaluating the quality of theoretical protein models (Tramontano and Kihara). Recently, significant progress has been achieved in multiscale modeling of protein dynamics and folding mechanisms. The authors of the chapters dealing with this class of problems are also world-class leaders (Scheraga, Liwo, Irbäck, Feig, Cieplak, Jernigan, and Micheletti). Conformational search strategies are crucial in protein modeling. Developers of the most efficient computational techniques and strategies are also among the authors (Hansmann, Scheraga, and others).

Warsaw, Poland

Andrzej Kolinski

Contents

1 Lattice Polymers and Protein Models
  Andrzej Kolinski

2 Multiscale Protein and Peptide Docking
  Mateusz Kurcinski, Michał Jamroz, and Andrzej Kolinski

3 Coarse-Grained Models of Proteins: Theory and Applications
  Cezary Czaplewski, Adam Liwo, Mariusz Makowski, Stanisław Ołdziej, and Harold A. Scheraga

4 Conformational Sampling in Structure Prediction and Refinement with Atomistic and Coarse-Grained Models
  Michael Feig, Srinivasa M. Gopal, Kanagasabai Vadivel, and Andrew Stumpff-Kane

5 Effective All-Atom Potentials for Proteins
  Anders Irbäck and Sandipan Mohanty

6 Statistical Contact Potentials in Protein Coarse-Grained Modeling: From Pair to Multi-body Potentials
  Sumudu P. Leelananda, Yaping Feng, Pawel Gniewek, Andrzej Kloczkowski, and Robert L. Jernigan

7 Bridging the Atomic and Coarse-Grained Descriptions of Collective Motions in Proteins
  Vincenzo Carnevale, Cristian Micheletti, Francesco Pontiggia, and Raffaello Potestio

8 Structure-Based Models of Biomolecules: Stretching of Proteins, Dynamics of Knots, Hydrodynamic Effects, and Indentation of Virus Capsids
  Marek Cieplak and Joanna I. Sułkowska

9 Sampling Protein Energy Landscapes – The Quest for Efficient Algorithms
  Ulrich H. E. Hansmann

10 Protein Structure Prediction: From Recognition of Matches with Known Structures to Recombination of Fragments
  Michal J. Gajda, Marcin Pawlowski, and Janusz M. Bujnicki

11 Genome-Wide Protein Structure Prediction
  Srayanta Mukherjee, Andras Szilagyi, Ambrish Roy, and Yang Zhang

12 Multiscale Approach to Protein Folding Dynamics
  Sebastian Kmiecik, Michał Jamroz, and Andrzej Kolinski

13 Error Estimation of Template-Based Protein Structure Models
  Daisuke Kihara, Yifeng David Yang, and Hao Chen

14 Evaluation of Protein Structure Prediction Methods: Issues and Strategies
  Anna Tramontano and Domenico Cozzetto

Index

Contributors

Janusz M. Bujnicki, Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Warsaw, Poland; Laboratory of Bioinformatics, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland, [email protected]

Vincenzo Carnevale, Institute for Computational Molecular Science, Temple University, Philadelphia, PA, USA, [email protected]

Hao Chen, Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN, USA, [email protected]

Marek Cieplak, Institute of Physics, Polish Academy of Sciences, Warsaw, Poland, [email protected]

Domenico Cozzetto, Department of Biochemical Sciences, “Sapienza” University of Rome, Rome, Italy, [email protected]

Cezary Czaplewski, Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland; Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA, [email protected]

Michael Feig, Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA; Department of Chemistry, Michigan State University, East Lansing, MI, USA, [email protected]

Yaping Feng, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA, [email protected]

Michal J. Gajda, European Molecular Biology Laboratories, Hamburg Outstation, Hamburg, Germany, [email protected]

Pawel Gniewek, L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA; Laboratory of Theory of Biopolymers, Faculty of Chemistry, University of Warsaw, Warsaw, Poland, [email protected]

Srinivasa M. Gopal, Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA, [email protected]

Ulrich H. E. Hansmann, Department of Physics, Michigan Technological University, Houghton, MI, USA, [email protected]

Anders Irbäck, Computational Biology & Biological Physics, Department of Theoretical Physics, Lund University, Lund, Sweden, [email protected]

Michał Jamroz, Faculty of Chemistry, University of Warsaw, Warsaw, Poland, [email protected]

Robert L. Jernigan, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA, [email protected]

Daisuke Kihara, Department of Biological Sciences, College of Science; Department of Computer Science, College of Science; Markey Center for Structural Biology, Purdue University, West Lafayette, IN, USA, [email protected]

Andrzej Kloczkowski, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA, [email protected]

Sebastian Kmiecik, Faculty of Chemistry, University of Warsaw, Warsaw, Poland, [email protected]

Andrzej Kolinski, Faculty of Chemistry, University of Warsaw, Warsaw, Poland, [email protected]

Mateusz Kurcinski, Faculty of Chemistry, University of Warsaw, Warsaw, Poland, [email protected]

Sumudu P. Leelananda, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA, [email protected]

Adam Liwo, Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland; Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA, [email protected]

Mariusz Makowski, Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland; Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA, [email protected]

Cristian Micheletti, Scuola Internazionale Superiore di Studi Avanzati, Trieste, Italy; Democritos CNR-IOM and Italian Institute of Technology (SISSA Unit), Trieste, Italy, [email protected]

Sandipan Mohanty, Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, Jülich, Germany, [email protected]

Srayanta Mukherjee, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Center for Bioinformatics, University of Kansas, Lawrence, KS, USA, [email protected]

Stanisław Ołdziej, Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA; Laboratory of Biopolymer Structure, Intercollegiate Faculty of Biotechnology, University of Gdańsk and Medical University of Gdańsk, Gdańsk, Poland, [email protected]

Marcin Pawlowski, Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Warsaw, Poland, [email protected]

Francesco Pontiggia, Department of Biochemistry, Brandeis University, Waltham, MA, USA, [email protected]

Raffaello Potestio, Scuola Internazionale Superiore di Studi Avanzati, Trieste, Italy, [email protected]

Ambrish Roy, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Center for Bioinformatics, University of Kansas, Lawrence, KS, USA, [email protected]

Harold A. Scheraga, Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA, [email protected]

Andrew Stumpff-Kane, Department of Biochemistry and Molecular Biology, Michigan State University, Michigan, USA, [email protected]

Joanna I. Sułkowska, Institute of Physics, Polish Academy of Sciences, Warsaw, Poland; CTBP, University of California, Gilman Drive 9500, La Jolla, San Diego, CA, USA, [email protected]

Andras Szilagyi, Center for Bioinformatics, University of Kansas, Lawrence, KS, USA; Institute of Enzymology, BRC, Hungarian Academy of Sciences, Budapest, Hungary, [email protected]

Anna Tramontano, Department of Biochemical Sciences, “Sapienza” University of Rome, Rome, Italy; Istituto Pasteur – Fondazione Cenci Bolognetti, “Sapienza” University of Rome, Rome, Italy, [email protected]

Kanagasabai Vadivel, Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA, [email protected]

Yifeng David Yang, Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN, USA, [email protected]

Yang Zhang, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Center for Bioinformatics, University of Kansas, Lawrence, KS, USA, [email protected]

Chapter 1

Lattice Polymers and Protein Models Andrzej Kolinski

Abstract The size of the conformational space of chain polymers is enormous. Much has been learned about polymer structure, thermodynamics, and dynamics by theoretical considerations and numerical study of simple lattice models. Self-avoiding random walks on a lattice provide a good approximation for the excluded volume effect and the nature of the coil–globule transition. Semiflexible polymers on a lattice exhibit a two-state collapse transition that captures some essential features of the all-or-none folding transition of small globular proteins. More complex lattice polymers, decorated with some structural details, provide a very powerful means for the study of protein dynamics and thermodynamics and for protein structure prediction.

1.1 Reduced Models of Chain Molecules

The torsional rotations around the main-chain backbone bonds alone make the conformational space of chain molecules enormous in size (Flory 1969). For a chain containing N single bonds, the number of conformations is on the order of $q^N$, where q is approximately equal to the number of distinct low-energy regions of the rotational potential. For a polyethylene chain, q would be 3. Obviously, when N is in the hundreds or many thousands, a detailed conformational analysis becomes impractical. Detailed all-atom computer simulations are also impractical, unless only very local conformational changes require examination. Thus, in order to make the problem tractable, simplified models have often been designed and studied (Milik et al. 1990; Kolinski and Skolnick 1996), by statistical analyses and/or by computer simulations. As will become apparent later, the statistical analysis itself is of rather limited utility and in typical cases requires quite drastic simplifications. Usually, it is difficult to estimate a priori the effect of such simplifications on the final results.
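To give a feeling for how quickly this count grows, the estimate can be evaluated on a logarithmic scale. The following is only a minimal sketch; the function name `log10_conformations` and the chosen chain lengths are illustrative assumptions and do not come from the chapter.

```python
import math

def log10_conformations(q: int, n_bonds: int) -> float:
    """Return log10(q**n_bonds), the rough number of chain conformations."""
    return n_bonds * math.log10(q)

# Polyethylene-like chain: q = 3 low-energy rotational states per bond.
for n_bonds in (10, 100, 1000):
    exponent = log10_conformations(3, n_bonds)
    print(f"N = {n_bonds:5d}: about 10^{exponent:.1f} conformations")
```

Already for N = 100 the count exceeds 10^47, which is why exhaustive enumeration is out of the question for chains of realistic length.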

A. Kolinski, Faculty of Chemistry, University of Warsaw, Warsaw, Poland; e-mail: [email protected]
A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_1, © Springer Science+Business Media, LLC 2011


Let us consider two extremely simple models of polymers, one for idealized conformational statistics and the second for the first level of approximation for chain dynamics. These models can be solved rigorously by simple analytical considerations (Flory 1969). The first is the freely jointed chain (sometimes also called “the random flight model”). The freely jointed chain (see Fig. 1.1) consists of n segments of equal length l. Mutual orientations of the segments are completely uncorrelated. It is well known that for a sufficiently large number of segments, the mean-square end-to-end distance of such a chain scales with the number of segments as $\langle R^2 \rangle = l^2 n$. This result closely resembles the central formula of the theory of a Brownian particle, where the mean-square displacement is proportional to time. It is also easy to show that the mean-square radius of gyration $\langle S^2 \rangle$ (a quantity that is easier to measure experimentally than $\langle R^2 \rangle$) is related to $\langle R^2 \rangle$ as $\langle S^2 \rangle = \langle R^2 \rangle / 6$. The distributions of the end-to-end distance and of the segment density are Gaussian. Such an ideal polymer random coil is frequently called the Gaussian chain, although the freely jointed chain is not uniquely Gaussian, since other types of chains can also follow Gaussian statistics.

Fig. 1.1 An example of the freely jointed chain
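The two scaling relations for the freely jointed chain can be checked numerically. The sketch below is an illustration under stated assumptions rather than anything taken from the chapter: the function names, the chain length of 50 segments, the number of sampled chains, and the random seed are all arbitrary choices.

```python
import math
import random

def random_unit_vector(rng):
    """Draw a direction uniformly distributed on the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r_xy = math.sqrt(1.0 - z * z)
    return (r_xy * math.cos(phi), r_xy * math.sin(phi), z)

def chain_r2_s2(n, l, rng):
    """Build one freely jointed chain of n segments; return (R^2, S^2)."""
    pos = [(0.0, 0.0, 0.0)]
    for _ in range(n):
        ux, uy, uz = random_unit_vector(rng)
        x, y, z = pos[-1]
        pos.append((x + l * ux, y + l * uy, z + l * uz))
    ex, ey, ez = pos[-1]
    r2 = ex * ex + ey * ey + ez * ez          # squared end-to-end distance
    m = len(pos)
    cx = sum(p[0] for p in pos) / m           # center of mass of the beads
    cy = sum(p[1] for p in pos) / m
    cz = sum(p[2] for p in pos) / m
    s2 = sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
             for p in pos) / m                # squared radius of gyration
    return r2, s2

def mean_dimensions(n, l=1.0, n_chains=2000, seed=0):
    """Average R^2 and S^2 over many independent chains."""
    rng = random.Random(seed)
    total_r2 = total_s2 = 0.0
    for _ in range(n_chains):
        r2, s2 = chain_r2_s2(n, l, rng)
        total_r2 += r2
        total_s2 += s2
    return total_r2 / n_chains, total_s2 / n_chains

mean_r2, mean_s2 = mean_dimensions(n=50)
print("mean R^2:", mean_r2, "expected l^2 n:", 50.0)
print("mean S^2 / mean R^2:", mean_s2 / mean_r2, "expected about 1/6")
```

With a finite number of chains the averages fluctuate around the ideal values, and for short chains the ratio deviates slightly from 1/6, which is reached only in the limit of large n.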

The simplifications of the physical properties of real polymers assumed in the freely jointed chain model are essentially of two types. First, the correlations between the chain segments, especially between those that are close to one another along the chain contour, are an important property of polymers and strongly depend on their chemical structure. As long as these correlations extend only to a distance small in comparison with the chain length, it is relatively straightforward to generalize the model by introducing various approximations of the local chain stiffness related to sometimes complex profiles of the rotational potential energy. Such short-range (short distance along the chain contour) correlations do not change the general picture. For all such ideal models $\langle R^2 \rangle = C l^2 n$, and the value of the prefactor C depends on the shape of the rotational potential and the temperature. Approximations of the second type are much more significant and much more difficult to deal with. Namely, all ideal chains neglect the effective interactions between the chain segments that are far away from one another along the chain but close to each other in space. On the most trivial level, the fact that two segments cannot occupy the same element of space must be taken into account. A rigorous analytical treatment of such “real” chains is not possible, although approximate theories exist (de Gennes 1979). Probably the most famous is Flory’s mean-field theory (Flory 1953). The theory assumes that a balance between intramolecular interactions and those with the solvent defines the average coil size. A quasi-chemical approximation is employed and an average Gaussian density of segments is assumed. The resulting formula describes the chain dimension as a function of temperature:

$$\alpha^5 - \alpha^3 = \mathrm{const} \cdot (1 - \Theta/T)\, n^{1/2} \qquad (1.1)$$

where α is the so-called expansion factor and is defined as

$$\alpha^2 = \langle R^2 \rangle / \langle R^2 \rangle_0 \qquad (1.2)$$

with $\langle R^2 \rangle_0$ denoting the ideal chain dimensions. Note that for T = Θ the chain dimensions become identical with the dimensions of the ideal chain. Thus, the idea behind Flory’s “theta” (Θ) temperature closely resembles that of the Boyle temperature for real gases. At temperatures below Θ, the chain undergoes a transition to a dense globular state, and this transition is somewhat similar to the gas–liquid transition of small-molecule systems. However, the transition for flexible polymers is continuous and has most of the features of a second-order phase transition (Kolinski et al. 1987b). At high temperatures (see Eq. (1.1)) $\langle R^2 \rangle \sim n^{6/5}$, and the average chain dimensions are much larger than for an equivalent ideal chain. Interestingly, despite a rather poor estimation of chain entropy and internal energy, Flory’s theory gives quite an accurate estimation of the free energy and conformational properties of chain molecules. Such a cancellation of errors is quite typical of mean-field-type theories.

Ideal chain statistics provides a zero-order picture of the protein denatured state, while Flory’s theory is a zero-order approximation for the folding (or collapse) transition. The approximation is quite crude for several reasons. First, protein chains are relatively stiff polymers and the limit of infinitely long chains is hardly satisfied even for large proteins (Creighton 1993). Second, proteins are heteropolymers with highly specific patterns of intramolecular interactions (Branden and Tooze 1991). Even in the random coil state, there is a significant extent of residual structure. Thus, the mean-field theory is hardly applicable. We will address these issues later in more detail.

Somewhat analogous to ideal chain statistics, models for ideal chain dynamics were designed. Probably the best known of these is the Rouse model (Rouse 1953), shown in a schematic fashion in Fig. 1.2.
It assumes that a flexible polymer chain can be represented as a chain of points joined by harmonic springs of equal strength. This model is analytically solvable. The results are quite interesting. For short times, when the average displacements of chain segments must be small in comparison with the coil size, a single segment moves according to

$(\Delta r)^2 \sim t^{1/2}$ for $l^2 < (\Delta r)^2$ [...]

Fig. 1.2 Schematic drawing of beads-and-springs Rouse chain

[...]

$$\langle O \rangle = \frac{1}{M} \sum_{k=1}^{M} O_k \qquad (9.10)$$

where $O_k$ is the value measured for the quantity O in the configuration k, and M the number of measurements. This average approximates the ensemble average

$$\langle O \rangle = \frac{\int dx_i\, dv_i\, O(x_i)\, e^{-E(x_i, v_i)/k_B T}}{\int dx_i\, dv_i\, e^{-E(x_i, v_i)/k_B T}}. \qquad (9.11)$$

As $E = E_{pot}(x_i) + E_{kin}(v_i)$ and $E_{kin} = \frac{1}{2} \sum_i m_i v_i^2$, it is possible to integrate out the velocities, and

$$\langle O \rangle = \frac{\int dx_i\, O(x_i)\, e^{-E_{pot}(x_i)/k_B T}}{\int dx_i\, e^{-E_{pot}(x_i)/k_B T}}. \qquad (9.12)$$

As a consequence, for the generation of configurations by way of the Metropolis algorithm (Eq. (9.9)) one needs to calculate only the difference of the potential energies $\Delta E_{pot}$. For this reason, we will mostly write simply E when only the potential energy $E_{pot}$ is relevant. Note also that Monte Carlo does not require the calculation of derivatives, which reduces the numerical workload. As the configurations are drawn randomly in Monte Carlo, it is not possible to follow the trajectory of a protein, and therefore Monte Carlo, unlike molecular dynamics, is not suitable for probing the kinetics of folding. On the other hand, Monte Carlo allows one to sample the configurational space much faster by utilizing artificial but fast move sets. These are often necessary because in the canonical ensemble the crossing of an energy barrier of height ΔE is suppressed by a factor $\propto \exp(-\Delta E / k_B T)$. This is the reason for the multiple minima problem and the resulting slowing down of protein simulations discussed in the introduction.
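The role of the potential-energy difference in the Metropolis acceptance step, and the estimation of an average as a simple mean over the sampled configurations, can be sketched as follows. This is a toy illustration only: the one-dimensional double-well potential, the step size, the temperature, and all function names are assumptions made for the example and are not part of the chapter.

```python
import math
import random

def potential(x):
    """Hypothetical 1D double-well potential (illustration only)."""
    return (x * x - 1.0) ** 2

def metropolis(n_steps, kT, step=0.5, seed=0):
    """Sample configurations x with weight exp(-E_pot(x)/kT)."""
    rng = random.Random(seed)
    x = 1.0
    e = potential(x)
    samples = []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)      # symmetric trial move
        delta_e = potential(x_new) - e            # only Delta E_pot is needed
        # Accept downhill moves always, uphill moves with exp(-Delta E / kT).
        if delta_e <= 0.0 or rng.random() < math.exp(-delta_e / kT):
            x, e = x_new, e + delta_e
        samples.append(x)                         # rejected moves recount x
    return samples

def average(observable, samples):
    """Estimate <O> as (1/M) * sum of O_k over the M sampled configurations."""
    return sum(observable(x) for x in samples) / len(samples)

samples = metropolis(n_steps=20000, kT=0.5)
print("estimated <x^2>:", average(lambda x: x * x, samples))
print("estimated <E_pot>:", average(potential, samples))
```

No velocities and no derivatives of the potential appear anywhere in the loop, which is the practical consequence of integrating out the kinetic energy.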


9.2.3 Optimization Techniques

Most proteins are thermodynamically stable at room temperature (Anfinsen 1973). This implies that the biologically active configuration is the global minimum in free energy at T ≈ 270–300 K. For many proteins, this state is unique up to oscillations around a fixed structure. For this reason, one can identify the global minimum in free energy with that in potential energy, reducing the prediction of protein structures to a global optimization problem. While deterministic methods (for instance, the αBB algorithm (Androulakis et al. 1997)) have many conceptual advantages, stochastic algorithms are often faster and easier to implement. Take as an example simulated annealing (Kirkpatrick et al. 1983), which is inspired by the crystal growth process and realized by gradually decreasing the temperature in a Monte Carlo or molecular dynamics program. While only a logarithmic annealing schedule will ensure that the simulation finds the global minimum (Geman and Geman 1984), limitations in available computer resources require faster annealing schedules where success is no longer guaranteed. Still, because of its simplicity, simulated annealing is often the first choice in protein optimization problems. Genetic algorithms (Holland 1975) and Monte Carlo minimization (Li and Scheraga 1987) are two other commonly used stochastic optimization techniques. Like simulated annealing, they try to avoid entrapment in local minima and continue to search for further solutions. This is a general characteristic of successful optimization techniques. For instance, in tabu search (Cvijovic and Klinowski 1995) the system is guided away from previously explored areas. This can result in slow convergence, as the method does not distinguish between important and unimportant regions of the landscape. A somewhat opposite approach (Besold et al. 1999; Wenzel and Hamacher 1999) aims at transforming the original energy landscape into a funnel landscape, where convergence toward the global minimum is fast. However, many landscape-deformation methods are hampered either by the required fine tuning or a priori information, or by difficulties with connecting back to the original landscape. Often, minima on the deformed surface are displaced or merged. The latter problem is avoided in energy landscape paving (ELP) (Hansmann and Wille 2002), which merges ideas from tabu search with energy landscape deformation. In ELP, low-temperature Monte Carlo simulations utilize an effective energy:

$$w(\tilde{E}) = e^{-\tilde{E}/k_B T} \quad \text{with} \quad \tilde{E} = E + f(H(q, t)). \qquad (9.13)$$

Here, T is a (low) temperature and f (H(q, t)) a function of the histogram H(q, t) in a pre-chosen “order parameter” or “reaction coordinate” q. The weight of a local minimum state decreases with the time the system stays in that state, i.e., ELP deforms the energy landscape locally until the local minimum is no longer favored and the system explores higher energies. It will then either fall into a new local minimum or walk through this high-energy region until the corresponding histogram entries all have similar frequencies and the system again has a bias toward low energies. Since the weight factor is time dependent, it follows that ELP violates detailed balance. Hence, the method cannot be used to calculate thermodynamic averages. Note, however, that for f (H(q, t)) = f (H(q)) detailed balance is fulfilled, and ELP reduces to the generalized-ensemble methods (Hansmann and Okamoto 1998) discussed in the following section. We have evaluated the efficiency of ELP in simulations of the 20-residue trp-cage protein, whose structure we could “predict” within a root-mean-square deviation (rmsd) of 1 Å (Schug et al. 2005).

Energy landscape paving also allows the possibility of zero-temperature simulations (Schug et al. 2005). For T → 0 only moves with $\Delta \tilde{E} \le 0$ will be accepted. If one chooses $\tilde{E} = E + cH(E, t)$, the acceptance criterion is given by:

$$\Delta E + c\,\Delta H(q, t) \le 0 \;\leftrightarrow\; c\,\Delta H(q, t) \le -\Delta E \qquad (9.14)$$

where E is the “physical” energy. Hence, energy landscape paving can overcome any energy barrier even at T = 0. The waiting time for such a move is proportional to the height of the barrier that needs to be crossed. The factor c sets the timescale, and in this sense the T = 0 form of ELP is parameter-free.
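The paving idea can be illustrated on a toy landscape. The sketch below is not the published ELP implementation: it assumes a hypothetical one-dimensional asymmetric double well, takes the order parameter q to be the binned coordinate x, and uses the linear form f(H) = cH; the parameter values and names are arbitrary choices made for the example.

```python
import math
import random
from collections import defaultdict

def energy(x):
    """Hypothetical asymmetric double well: local minimum near x = +1,
    global minimum near x = -1 (illustration only)."""
    return x ** 4 - 2.0 * x ** 2 + 0.3 * x

def elp_search(n_steps=5000, kT=0.05, c=0.1, step=0.3, bin_width=0.2, seed=0):
    """Low-temperature Monte Carlo on the effective energy E + c * H(q, t)."""
    rng = random.Random(seed)
    hist = defaultdict(int)            # H(q, t); q is the binned x (assumed)
    x = 1.0                            # start in the higher, local minimum
    best_x, best_e = x, energy(x)
    for _ in range(n_steps):
        hist[math.floor(x / bin_width)] += 1      # pave the current bin
        x_new = x + rng.uniform(-step, step)
        e_eff_old = energy(x) + c * hist[math.floor(x / bin_width)]
        e_eff_new = energy(x_new) + c * hist[math.floor(x_new / bin_width)]
        delta = e_eff_new - e_eff_old
        if delta <= 0.0 or rng.random() < math.exp(-delta / kT):
            x = x_new
            if energy(x) < best_e:                # track the physical energy
                best_x, best_e = x, energy(x)
    return best_x, best_e

best_x, best_e = elp_search()
print("lowest physical energy found:", best_e, "at x =", best_x)
```

Because the histogram keeps growing wherever the walker lingers, the starting well is filled up after a few visits and the walker is pushed over the barrier, even though the temperature is far too low for a plain Metropolis run to cross it on this timescale. As stated in the text, the time-dependent weight means the visited configurations are not a canonical sample.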

9.3 Advanced Simulation Techniques

Determining the structure of proteins through global optimization assumes the existence of a cost function whose global minimum describes the native structure. In most cases this is an energy that describes the physical interactions within a protein and between the protein and the surrounding environment, usually water. Since neither the available force fields nor the inclusion of solvation effects are perfect, it is not certain that the folded structure (as determined by X-ray or NMR experiments) corresponds to the global minimum conformation. Hence, the accuracy of the force fields sets a limit on any global optimization approach to structure prediction of proteins. Global optimization techniques are also not suitable for investigations of the folding mechanism, the change in shape when interacting with other molecules, or the appearance of misfolded structures. As with structure prediction, it is necessary to go beyond global optimization techniques and to measure thermodynamic quantities, i.e., to sample a set of configurations from a canonical ensemble and take an average of the chosen quantity over this ensemble. In principle, this is possible with molecular dynamics and Monte Carlo simulations, but as argued earlier in this review, it requires strategies that lead to a faster sampling of low-energy configurations.

9.3.1 Unfolding Simulations

The poor sampling of protein configurations at physiologically relevant temperatures results from their rough energy landscape where barriers of height E are
suppressed by $e^{-E/k_B T}$. Hence, by increasing the temperature T it becomes easier for a protein to cross energy barriers. This can be used to induce the thermal unfolding of a protein. Such unfolding simulations at high temperature are sometimes interpreted as reversed-in-time folding (Daggett and Fersht 2003; Daggett 2002). This approach has been used in the past with some success (Daggett and Fersht 2003; Daggett 2002), but it is not clear whether it is justified in general in protein simulations. We have recently demonstrated that the C-fragment of Top7, which we call CFr, folds by a non-trivial pathway that involves caching of an N-terminal segment in an adjunct helix. Only when all other parts of the protein are folded and in place does the N-terminal segment unfold and re-fold to a strand that completes the final structure in a three-stranded sheet. We found that this folding mechanism cannot be inferred from unfolding simulations at high temperatures. In fact, the interpretation of unfolding data in Mohanty and Hansmann (2008) as folding in reversed time would miss the caching mechanism that governs folding of this protein. Likely, such an interpretation is restricted to simple two-state folders and associated with a nucleation mechanism, as observed, for instance, for CI2 (Daggett and Fersht 2003; Daggett 2002).

9.3.2 Advanced Updates

A possible strategy to increase sampling of relevant protein configurations is the use of improved updates. Within the context of molecular dynamics these are techniques that guide the simulation and/or allow for larger time steps in the integrator. In the context of Monte Carlo these are usually collective moves that lead to a larger change in configurations. Examples are the re-bridging scheme (Gō and Scheraga 1970; Wu and Deem 1999) and the biased Gaussian step method (Favrin et al. 2001). In hybrid Monte Carlo (Duane et al. 1987; Brass et al. 1993) a short molecular dynamics run is used as a collective move to provide a trial configuration, which is then accepted or rejected according to the Metropolis criterion. This allows one to follow a trajectory over a long time with a large step size, because the Metropolis step corrects for the discretization errors in the molecular dynamics run. A general problem with all improved updates is that they depend strongly on the chosen model and are often not known a priori. A collective move that avoids this pitfall has been recently proposed by Berg (Berg 2003) under the name Rugged Metropolis (RM). The idea is to bias a Monte Carlo simulation by using information from a simulation at a higher temperature. Assume a range of temperatures

$$T_1 > T_2 > \ldots > T_r > \ldots > T_{f-1} > T_f. \tag{9.15}$$

The simulation at the highest temperature, $T_1$, is performed with the usual Metropolis algorithm and the results are used to construct an estimator of the probability density function $\rho(x_1, \ldots, x_n; T_1)$ that biases the simulation at $T_2$. In turn, this simulation provides a bias for the one at $T_3$, and the procedure is continued iteratively down to $T_f$. For this purpose, Berg assumes the approximation

$$\bar{\rho}(x_1, \ldots, x_n; T_r) = \prod_{i=1}^{n} \bar{\rho}^{\,1}_i(x_i; T_r), \tag{9.16}$$

where $\bar{\rho}^{\,1}_i(x_i; T_r)$ are estimators of reduced one-variable probability densities

$$\rho^{1}_i(x_i; T) = \int \prod_{j \neq i} dx_j \, \rho(x_1, \ldots, x_n; T). \tag{9.17}$$

Recursively, the estimated probability density function $\bar{\rho}(x_1, \ldots, x_n; T_{r-1})$ serves as an approximation of $\rho(x_1, \ldots, x_n; T_r)$. The acceptance step in the (biased) Metropolis procedure at temperature $T_r$ is now given by

$$P_{RM} = \min\left[1, \frac{\exp(-\beta E')\,\bar{\rho}(x_1, \ldots, x_n; T_{r-1})}{\exp(-\beta E)\,\bar{\rho}(x'_1, \ldots, x'_n; T_{r-1})}\right], \tag{9.18}$$

where primed quantities refer to the trial configuration.

Rugged Metropolis has been tested successfully for simulations of small peptides. However, as with other improved updates, the gain in efficiency is by itself not enough to make folding simulations of protein domains (usually consisting of 50–200 residues) feasible. On the other hand, improved updates are very useful when combined with the other techniques that we describe in the following subsections.
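The Rugged Metropolis recursion for two temperatures can be sketched in Python as follows. This is an illustration under stated assumptions, not Berg's implementation: the energy function `toy_energy`, the histogram resolution `N_BINS`, the two inverse temperatures, and all function names are hypothetical. The sketch also assumes that trial values are drawn from the estimated one-variable densities, so that Eq. (9.18) acts as the matching acceptance step. Because only one variable changes per move, the product in Eq. (9.16) reduces to a single factor in the ratio.

```python
import math
import random

N_BINS = 20  # histogram resolution (illustrative choice)


def toy_energy(x):
    """Hypothetical rugged energy for variables x_i in [0, 1)."""
    return sum(math.cos(6.0 * math.pi * xi) + 2.0 * (xi - 0.3) ** 2 for xi in x)


def bin_index(value):
    return min(int(value * N_BINS), N_BINS - 1)


def metropolis(energy, beta, x0, n_sweeps, rng):
    """Plain Metropolis at the higher temperature, uniform trial values."""
    x, e, samples = list(x0), energy(x0), []
    for _ in range(n_sweeps):
        for i in range(len(x)):
            old = x[i]
            x[i] = rng.random()
            e_new = energy(x)
            if rng.random() < math.exp(min(0.0, -beta * (e_new - e))):
                e = e_new
            else:
                x[i] = old
        samples.append(list(x))
    return samples


def one_variable_densities(samples, n_vars):
    """Histogram estimators of the reduced densities of Eq. (9.17)."""
    densities = []
    for i in range(n_vars):
        counts = [1.0] * N_BINS  # floor of one count keeps every bin reachable
        for s in samples:
            counts[bin_index(s[i])] += 1.0
        total = sum(counts)
        densities.append([c * N_BINS / total for c in counts])
    return densities


def rm_acceptance(beta, e_old, e_new, rho_old, rho_new):
    """Acceptance probability of Eq. (9.18) for a single-variable move."""
    log_ratio = -beta * (e_new - e_old) + math.log(rho_old) - math.log(rho_new)
    return math.exp(min(0.0, log_ratio))


def rugged_metropolis(energy, beta, densities, x0, n_sweeps, rng):
    """Biased Metropolis at the lower temperature."""
    x, e, samples, accepted = list(x0), energy(x0), [], 0
    for _ in range(n_sweeps):
        for i in range(len(x)):
            old, b_old = x[i], bin_index(x[i])
            # Assumption: the trial value is drawn from the estimated density.
            b_new = rng.choices(range(N_BINS), weights=densities[i])[0]
            x[i] = (b_new + rng.random()) / N_BINS
            e_new = energy(x)
            p = rm_acceptance(beta, e, e_new, densities[i][b_old], densities[i][b_new])
            if rng.random() < p:
                e, accepted = e_new, accepted + 1
            else:
                x[i] = old
        samples.append(list(x))
    return samples, accepted / (n_sweeps * len(x))


def run_demo(seed=1, n_vars=3, n_sweeps=2000):
    rng = random.Random(seed)
    hot = metropolis(toy_energy, 1.0, [0.5] * n_vars, n_sweeps, rng)
    densities = one_variable_densities(hot, n_vars)
    cold, rate = rugged_metropolis(toy_energy, 3.0, densities, hot[-1], n_sweeps, rng)
    return densities, cold, rate
```

`run_demo()` returns the estimated densities, the low-temperature samples, and the acceptance rate. Extending the sketch to a full temperature ladder would mean rebuilding the densities from each run and passing them to the next lower temperature.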

9.3.3 Generalized-Ensemble Techniques

A very successful approach for improving the sampling of low-energy protein configurations is the generalized-ensemble approach. Its underlying idea is not to sample directly the canonical ensemble but an artificial ensemble tailored to enable efficient search for local minima while at the same time avoiding entrapment. These generalized ensembles are defined in such a way that re-weighting techniques allow one to connect back to the canonical (i.e., physical) ensemble and to calculate thermodynamic averages at temperatures of interest (Hansmann 2003). A great number of such ensembles have been proposed, and while not all of them can be discussed in this review, we can classify them in principle according to whether they are generated by a random walk through order parameter space (for instance, energy),
control parameter space (temperature), or through model space (i.e., different energy functions).

9.3.3.1 Random Walks in Order Parameter Space

In generalized ensembles that are defined by random walks in order parameter space, one requires that a Monte Carlo or molecular dynamics simulation leads to a broad distribution of a pre-chosen physical quantity. This allows one to sample both low- and high-energy states with sufficient probability. For simplicity we will consider only ensembles that lead to flat distributions in one variable. Extensions to higher dimensional generalized ensembles are straightforward (Kumar et al. 1996). Probably the earliest realization of this idea is umbrella sampling (Torrie and Valleau 1977), but multicanonical sampling (Berg and Neuhaus 1991) is now more common. The first application of these techniques to protein simulations can be found in Hansmann and Okamoto (1993), where a Monte Carlo technique was used. Later, the approach was also adapted to molecular dynamics (Hansmann et al. 1996). The idea is to assign configurations with energy E a weight $w_{mu}(E)$ such that the distribution of energies is flat,

$$P_{mu}(E) \propto n(E)\,w_{mu}(E) = \text{const}, \tag{9.19}$$

where n(E) is the spectral density. Since all energies appear with equal probability, a free random walk in energy space is enforced: the simulation can overcome any energy barrier and will not get trapped in one of the many local minima. For a wide range of temperatures it is now possible to obtain a canonical distribution by re-weighting techniques (Ferrenberg and Swendsen 1988):

$$P_B(T, E) \propto P_{mu}(E)\,w^{-1}_{mu}(E)\,e^{-\beta E}, \tag{9.20}$$

since a large range of energies is sampled. This allows one to calculate the expectation value of any physical quantity O at temperature T by

$$\langle O \rangle_T = \frac{\int dE\, O(E)\,P_B(T, E)}{\int dE\, P_B(T, E)}. \tag{9.21}$$
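The re-weighting in Eqs. (9.20) and (9.21) can be illustrated with a small Python sketch. It assumes a hypothetical toy system of independent two-level "spins" whose energy is the number of excited spins, so that the density of states n(E) is a binomial coefficient and the multicanonical weights are known exactly. For a protein the weights would have to be estimated first. All function names are illustrative.

```python
import math
import random


def multicanonical_histogram(n_spins, n_steps, seed=0):
    """Sample the energy with weights w_mu(E) = 1/n(E) (flat in E)."""
    rng = random.Random(seed)
    dos = [math.comb(n_spins, k) for k in range(n_spins + 1)]  # exact n(E)
    weight = [1.0 / g for g in dos]
    spins = [0] * n_spins
    e = 0
    hist = [0] * (n_spins + 1)
    for _ in range(n_steps):
        i = rng.randrange(n_spins)
        e_new = e + (1 - 2 * spins[i])  # flipping spin i changes E by +1 or -1
        if rng.random() < weight[e_new] / weight[e]:
            spins[i] ^= 1
            e = e_new
        hist[e] += 1
    return hist, weight


def reweighted_average(obs, hist, weight, beta):
    """<O>_T from Eqs. (9.20) and (9.21), with the integral replaced by a sum."""
    p_b = [h / w * math.exp(-beta * e)
           for e, (h, w) in enumerate(zip(hist, weight))]
    return sum(obs(e) * p for e, p in enumerate(p_b)) / sum(p_b)


def exact_mean_energy(n_spins, beta):
    """Canonical <E> for independent two-level systems with levels 0 and 1."""
    return n_spins * math.exp(-beta) / (1.0 + math.exp(-beta))
```

A single multicanonical run, for example `hist, weight = multicanonical_histogram(10, 100000)`, can then be re-weighted to any temperature in the sampled range with `reweighted_average(lambda e: e, hist, weight, beta)` and compared with `exact_mean_energy(10, beta)`.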

The price for the resulting improved sampling is that (unlike in the canonical ensemble) the weights $w_{mu}(E) \propto n^{-1}(E)$ are not known a priori (in fact, knowledge of the exact weights is equivalent to obtaining the density of states n(E), i.e., solving the system), so estimates of them are needed for a numerical simulation. Calculation of the weights is usually done by an iterative procedure (Berg 2004; Hansmann and Okamoto 1993, 1994). Another efficient recursion is the so-called Wang–Landau sampling (Wang and Landau 2001), where one performs updates with estimators n(E) of the density of states according to

$$p(E_1 \to E_2) = \min\left[\frac{n(E_1)}{n(E_2)}, 1\right]. \tag{9.22}$$
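A minimal Python sketch of this recursion is given here for a hypothetical toy system of independent two-level "spins", where the energy is the number of excited spins. It combines the acceptance rule of Eq. (9.22) with the update of the estimator and the refinement of the factor f that the text describes next (Eqs. 9.23 and 9.24). The flatness test (every level visited at least 80% as often as the average), the check interval, the stopping threshold, and the use of logarithms to avoid overflow are assumptions of this sketch.

```python
import math
import random


def wang_landau(n_spins, ln_f_final=1e-3, flatness=0.8, check_every=1000,
                max_steps=5_000_000, seed=0):
    """Estimate ln n(E) for n_spins two-level systems with levels 0 and 1."""
    rng = random.Random(seed)
    n_levels = n_spins + 1
    ln_g = [0.0] * n_levels  # ln n(E); n(E) = 1 initially
    hist = [0] * n_levels
    ln_f = 1.0               # f_0 = e
    spins = [0] * n_spins
    e = 0
    for step in range(1, max_steps + 1):
        i = rng.randrange(n_spins)
        e_new = e + (1 - 2 * spins[i])
        # Eq. (9.22) in log form: accept with probability min(n(E1)/n(E2), 1).
        if math.log(1.0 - rng.random()) <= ln_g[e] - ln_g[e_new]:
            spins[i] ^= 1
            e = e_new
        ln_g[e] += ln_f      # Eq. (9.23): n(E) -> n(E) * f
        hist[e] += 1
        if step % check_every == 0 and min(hist) >= flatness * sum(hist) / n_levels:
            ln_f *= 0.5      # Eq. (9.24): f -> sqrt(f)
            hist = [0] * n_levels
            if ln_f < ln_f_final:
                break
    return ln_g, ln_f
```

For this toy system the result can be checked against the exact density of states: `ln_g[k] - ln_g[0]` should approach `math.log(math.comb(n_spins, k))`, since only ratios of n(E) are determined by the recursion.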

Each time an energy level is visited, the estimator is updated according to

$$n(E) \to n(E)\,f, \tag{9.23}$$

where, initially, n(E) = 1 and $f = f_0 = e^1$. Once the desired energy range is covered, the factor f is refined,

$$f_1 = \sqrt{f_0}, \quad f_{n+1} = \sqrt{f_n}, \tag{9.24}$$

until f is sufficiently close to 1. In multicanonical simulations the computational effort increases with the number of residues N like $\approx N^4$ (when measured in Metropolis updates) (Hansmann and Okamoto 1999b). In general, the computational effort in simulations increases with $\approx X^2$, where X is the variable in which one wants a flat distribution. This is because generalized-ensemble simulations realize, by construction of the ensemble, a 1D random walk in the chosen quantity X, and the time a random walk needs to traverse a range grows with the square of that range. In the multicanonical algorithm the reaction coordinate X is the potential energy, X = E. Since $E \propto N^2$, the above scaling relation for the computational effort, $\approx N^4$, is recovered. Hence, multicanonical sampling is not always the optimal generalized-ensemble algorithm in protein simulations. A better scaling of the computer time with the size of the molecule may be obtained by choosing a more appropriate reaction coordinate for our ensemble than the energy. This is the motivation behind the various other realizations of the generalized-ensemble approach that exist. All aim at sampling a broad range of energies. In this way the simulation will overcome energy barriers and allow escape from local minima. For instance, in Hansmann and Okamoto (1999a) it was proposed that configurations are updated according to a special choice of the Tsallis generalized mechanics formalism (Curado and Tsallis 1994) (the Tsallis parameter q is chosen as $q = 1 + 1/n_F$):

$$w(E) = \left(1 + \frac{\beta(E - E_0)}{n_F}\right)^{-n_F}. \tag{9.25}$$

Here E0 is an estimator for the ground-state energy and nF is the number of degrees of freedom of the system. The weight reduces in the low-energy region to the canonical Boltzmann weight exp(−βE). This is because E − E0 → 0 for β → 0 leading to β(E − E0 )/nF