English Pages XII, 358 [359] Year 2020
Methods in Molecular Biology 2165
Daisuke Kihara Editor
Protein Structure Prediction Fourth Edition
METHODS IN MOLECULAR BIOLOGY
Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.
Protein Structure Prediction Fourth Edition Edited by
Daisuke Kihara Department of Computer Science, Purdue University, West Lafayette, IN, USA; Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-0707-7 ISBN 978-1-0716-0708-4 (eBook) https://doi.org/10.1007/978-1-0716-0708-4 © Springer Science+Business Media, LLC, part of Springer Nature 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface
Welcome to the fourth edition of Protein Structure Prediction. This edition continues the editorial policy of the previous edition, where the chapters describe web servers and software for protein structure prediction and modeling that are freely available to the academic community. Therefore, the book is intended to be practical and immediately useful for biology researchers who wish to model protein structures. Authors were selected from the top researchers in the field who actively develop software in structural bioinformatics and biophysics. Almost all the authors’ groups were ranked among the top in recent community-wide protein structure prediction and modeling assessments, including the Critical Assessment of techniques for protein Structure Prediction (CASP), the Critical Assessment of Predicted Interactions (CAPRI), RNA-Puzzles, and the Cryo-EM Model Challenge. All the methods in this book have been described in peer-reviewed papers in scientific journals. Interested readers are referred to those papers for details of the algorithms and rigorous benchmark results of the methods. Since the third edition was released in 2014, the computational protein structure prediction/modeling field has seen notable advances. In protein structure prediction, residue-contact prediction has improved significantly, mainly through the application of deep learning. Moreover, in protein docking, automatic servers have become more mature, and various types of docking models, including protein-peptide docking and disordered protein docking, have been developed. Last but not least, cryo-electron microscopy (cryo-EM) has emerged as another powerful experimental approach in structural biology due to the drastic resolution improvement in structure determination. These advancements are reflected in the chapters of this edition. The first two chapters are on methods for conventional protein structure prediction.
Tsuchiya and Tomii describe their FORTE sequence-template alignment method in Chapter 1, while in Chapter 2, the Cheng group provides a tutorial of their server, MULTICOM, which is now equipped with contact prediction to guide the structure building process. Chapter 3 introduces Genome3D by UK-based researchers, which integrates various resources of protein structure prediction, protein structure classification databases, and function annotation. In Chapter 4, McGuffin and his team explain ModFOLD7, their quality assessment method for protein structure models. Chapter 5 is about QUARTER, a server developed by the Kurgan group for disordered region prediction, which is now equipped with quality assessment of predictions. Chapter 6 shifts gears to RNA structure prediction, introducing SimRNA, software developed by Boniecki and his colleagues. Chapters 7–17 cover methods and resources for protein docking. The first chapter in this category is contributed by the Seok group, on GalaxyHomomer, a web-based tool for predicting structures of homo-complexes. Chapter 8 is on a web server for template-based complex structure prediction, PPI3D, by Dapkūnas and Venclovas. In recent years, template-based modeling has become more practical in protein docking prediction due to the increase in experimentally determined complex structures that are available as templates in PDB. PPI3D represents such a trend. Chapters 9–12 describe ClusPro (Kozakov), pyDock (Fernández-Recio), SwarmDock (Bates), and HDOCK (Huang), respectively (PI names shown in parentheses), which perform ab initio or template-based protein
docking prediction. Some of them can also take constraints, e.g., known interacting residues, in modeling. Chapter 13 is on IDP-LZerD, a method that docks a disordered protein to a structured protein, which was developed in my group. The next chapter is written by Pagès and Grudinin on their method, AnAnaS, which builds symmetric protein complexes. Chapters 15 and 16 are on tools for protein-peptide docking, MDockPeP by the Zou group and CABS-dock by Kmiecik and his colleagues. The last chapter of the docking category, by the Vakser group, covers DOCKGROUND, a web-based resource of protein complexes and their models that is useful for benchmarking docking software. The next two chapters cover protein structure modeling for cryo-EM maps. Chapter 18 by Singharoy and his colleagues describes MDFF, a molecular dynamics-based flexible fitting method for cryo-EM. Chapter 19 is from my group, presenting MAINMAST, a de novo protein structure modeling method that is designed for medium to low resolution (~4 Å) maps. The final chapter (Chapter 20) presents CABS-flex and SURPASS, methods for simulating protein flexibility, which were developed by Jamroz, Kolinski, and Kmiecik. The diversity of the methods in the chapters reflects the active development and progress of computational methods for protein structure prediction/modeling. I hope this book will be a valuable guide for using structure prediction/modeling tools as well as a vivid historical snapshot of the rapidly developing computational structural bioinformatics and biophysics field. In closing, I would like to thank all of the authors of chapters in this book, who are truly leading experts in the field. I am also thankful to the series editor, Dr. John M. Walker, for his patience and guidance.
West Lafayette, IN, USA
Daisuke Kihara
Contents
Preface
Contributors
1 Structural Modeling and Ligand-Binding Prediction for Analysis of Structure-Unknown and Function-Unknown Proteins Using FORTE Alignment and PoSSuM Pocket Search
   Yuko Tsuchiya and Kentaro Tomii
2 The MULTICOM Protein Structure Prediction Server Empowered by Deep Learning and Contact Distance Prediction
   Jie Hou, Tianqi Wu, Zhiye Guo, Farhan Quadir, and Jianlin Cheng
3 The Genome3D Consortium for Structural Annotations of Selected Model Organisms
   Vaishali P. Waman, Tom L. Blundell, Daniel W. A. Buchan, Julian Gough, David Jones, Lawrence Kelley, Alexey Murzin, Arun Prasad Pandurangan, Ian Sillitoe, Michael Sternberg, Pedro Torres, and Christine Orengo
4 Estimating the Quality of 3D Protein Models Using the ModFOLD7 Server
   Ali H. A. Maghrabi and Liam J. McGuffin
5 Prediction of Intrinsic Disorder with Quality Assessment Using QUARTER
   Zhonghua Wu, Gang Hu, Christopher J. Oldfield, and Lukasz Kurgan
6 Modeling of Three-Dimensional RNA Structures Using SimRNA
   Tomasz K. Wirecki, Chandran Nithin, Sunandan Mukherjee, Janusz M. Bujnicki, and Michał J. Boniecki
7 Modeling Protein Homo-Oligomer Structures with GalaxyHomomer Web Server
   Minkyung Baek, Taeyong Park, Lim Heo, and Chaok Seok
8 Template-Based Modeling of Protein Complexes Using the PPI3D Web Server
   Justas Dapkūnas and Česlovas Venclovas
9 Protein–Protein and Protein–Peptide Docking with ClusPro Server
   Andrey Alekseenko, Mikhail Ignatov, George Jones, Maria Sabitova, and Dima Kozakov
10 Modeling of Protein Complexes and Molecular Assemblies with pyDock
   Mireia Rosell, Luis Angel Rodríguez-Lumbreras, and Juan Fernández-Recio
11 A Guide for Protein–Protein Docking Using SwarmDock
   Iain H. Moal, Raphael A. G. Chaleil, Mieczyslaw Torchala, and Paul A. Bates
12 Modeling Protein–Protein or Protein–DNA/RNA Complexes Using the HDOCK Webserver
   Yumeng Yan and Sheng-You Huang
13 IDP-LZerD: Software for Modeling Disordered Protein Interactions
   Charles Christoffer and Daisuke Kihara
14 AnAnaS: Software for Analytical Analysis of Symmetries in Protein Structures
   Guillaume Pagès and Sergei Grudinin
15 MDockPeP: A Web Server for Blind Prediction of Protein–Peptide Complex Structures
   Xianjin Xu and Xiaoqin Zou
16 Protocols for All-Atom Reconstruction and High-Resolution Refinement of Protein–Peptide Complex Structures
   Aleksandra E. Badaczewska-Dawid, Alisa Khramushin, Andrzej Kolinski, Ora Schueler-Furman, and Sebastian Kmiecik
17 Dockground Tool for Development and Benchmarking of Protein Docking Procedures
   Petras J. Kundrotas, Ian Kotthoff, Sherman W. Choi, Matthew M. Copeland, and Ilya A. Vakser
18 Molecular Dynamics Flexible Fitting: All You Want to Know About Resolution Exchange
   John W. Vant, Daipayan Sarkar, Chitrak Gupta, Mrinal S. Shekhar, Sumit Mittal, and Abhishek Singharoy
19 Protein Structure Modeling from Cryo-EM Map Using MAINMAST and MAINMAST-GUI Plugin
   Genki Terashi, Yuhong Zha, and Daisuke Kihara
20 Protocols for Fast Simulations of Protein Structure Flexibility Using CABS-Flex and SURPASS
   Aleksandra E. Badaczewska-Dawid, Andrzej Kolinski, and Sebastian Kmiecik
Index
Contributors ANDREY ALEKSEENKO • Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY, USA; Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA; Institute of Computer Aided Design of the Russian Academy of Sciences, Moscow, Russia ALEKSANDRA E. BADACZEWSKA-DAWID • Faculty of Chemistry, Biological and Chemical Research Center, University of Warsaw, Warsaw, Poland; Department of Chemistry, Iowa State University, Ames, IA, USA MINKYUNG BAEK • Department of Chemistry, Seoul National University, Seoul, Republic of Korea; Department of Biochemistry, University of Washington, Seattle, WA, USA PAUL A. BATES • Biomolecular Modelling Laboratory, The Francis Crick Institute, London, UK TOM L. BLUNDELL • Department of Biochemistry, University of Cambridge, Cambridge, UK MICHAŁ J. BONIECKI • Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland DANIEL W. A. BUCHAN • Department of Computer Science, University College London, London, UK JANUSZ M. BUJNICKI • Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland; Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland RAPHAEL A. G. CHALEIL • Biomolecular Modelling Laboratory, The Francis Crick Institute, London, UK JIANLIN CHENG • Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA SHERMAN W. CHOI • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA CHARLES CHRISTOFFER • Department of Computer Science, Purdue University, West Lafayette, IN, USA MATTHEW M. 
COPELAND • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA JUSTAS DAPKŪNAS • Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania JUAN FERNÁNDEZ-RECIO • Barcelona Supercomputing Center (BSC), Barcelona, Spain; Instituto de Ciencias de la Vid y del Vino (ICVV), Consejo Superior de Investigaciones Científicas (CSIC) - Universidad de La Rioja - Gobierno de La Rioja, Logroño, Spain; Institut de Biologia Molecular de Barcelona (IBMB), Consejo Superior de Investigaciones Científicas (CSIC), Barcelona, Spain JULIAN GOUGH • MRC Laboratory of Molecular Biology, Cambridge, UK SERGEI GRUDININ • Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, Grenoble, France ZHIYE GUO • Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
CHITRAK GUPTA • The School of Molecular Sciences, Arizona State University, Tempe, AZ, USA LIM HEO • Department of Chemistry, Seoul National University, Seoul, Republic of Korea; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA JIE HOU • Department of Computer Science, Saint Louis University, St. Louis, MO, USA GANG HU • School of Statistics and Data Science, Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin, Nankai University, Tianjin, People’s Republic of China SHENG-YOU HUANG • School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, China MIKHAIL IGNATOV • Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY, USA; Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA; Institute of Computer Aided Design of the Russian Academy of Sciences, Moscow, Russia; Institute for Advanced Computational Sciences, Stony Brook University, Stony Brook, NY, USA DAVID JONES • Department of Computer Science, University College London, London, UK GEORGE JONES • Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY, USA; Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA LAWRENCE KELLEY • Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London, UK ALISA KHRAMUSHIN • Department of Microbiology and Molecular Genetics, Institute for Medical Research Israel-Canada, Faculty of Medicine, The Hebrew University, Jerusalem, Israel DAISUKE KIHARA • Department of Computer Science, Purdue University, West Lafayette, IN, USA; Department of Biological Sciences, Purdue University, West Lafayette, IN, USA SEBASTIAN KMIECIK • Faculty of Chemistry, Biological and Chemical Research Center, University of Warsaw, Warsaw, Poland ANDRZEJ KOLINSKI • Faculty of Chemistry, Biological 
and Chemical Research Center, University of Warsaw, Warsaw, Poland IAN KOTTHOFF • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA DIMA KOZAKOV • Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY, USA; Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA; Institute for Advanced Computational Sciences, Stony Brook University, Stony Brook, NY, USA PETRAS J. KUNDROTAS • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA LUKASZ KURGAN • Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA ALI H. A. MAGHRABI • School of Biological Sciences, University of Reading, Berkshire, UK LIAM J. MCGUFFIN • School of Biological Sciences, University of Reading, Berkshire, UK SUMIT MITTAL • Department of Chemistry, VIT Bhopal University, Bhopal, India IAIN H. MOAL • European Bioinformatics Institute, Hinxton, UK SUNANDAN MUKHERJEE • Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland ALEXEY MURZIN • MRC Laboratory of Molecular Biology, Cambridge, UK
CHANDRAN NITHIN • Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland CHRISTOPHER J. OLDFIELD • Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA CHRISTINE ORENGO • Institute of Structural and Molecular Biology, University College London, London, UK GUILLAUME PAGÈS • Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, Grenoble, France ARUN PRASAD PANDURANGAN • MRC Laboratory of Molecular Biology, Cambridge, UK TAEYONG PARK • Department of Chemistry, Seoul National University, Seoul, Republic of Korea FARHAN QUADIR • Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA LUIS ANGEL RODRÍGUEZ-LUMBRERAS • Barcelona Supercomputing Center (BSC), Barcelona, Spain; Instituto de Ciencias de la Vid y del Vino (ICVV), Consejo Superior de Investigaciones Científicas (CSIC) - Universidad de La Rioja - Gobierno de La Rioja, Logroño, Spain MIREIA ROSELL • Barcelona Supercomputing Center (BSC), Barcelona, Spain; Instituto de Ciencias de la Vid y del Vino (ICVV), Consejo Superior de Investigaciones Científicas (CSIC) - Universidad de La Rioja - Gobierno de La Rioja, Logroño, Spain MARIA SABITOVA • Department of Mathematics, Queens College and CUNY Graduate Center, Flushing, NY, USA DAIPAYAN SARKAR • The School of Molecular Sciences, Arizona State University, Tempe, AZ, USA ORA SCHUELER-FURMAN • Department of Microbiology and Molecular Genetics, Institute for Medical Research Israel-Canada, Faculty of Medicine, The Hebrew University, Jerusalem, Israel CHAOK SEOK • Department of Chemistry, Seoul National University, Seoul, Republic of Korea MRINAL S.
SHEKHAR • Oncology, IMED Biotech Unit, AstraZeneca R&D Boston, Waltham, MA, USA IAN SILLITOE • Institute of Structural and Molecular Biology, University College London, London, UK ABHISHEK SINGHAROY • The School of Molecular Sciences, Arizona State University, Tempe, AZ, USA MICHAEL STERNBERG • Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London, UK GENKI TERASHI • Department of Biological Sciences, Purdue University, West Lafayette, IN, USA KENTARO TOMII • Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan MIECZYSLAW TORCHALA • Biomolecular Modelling Laboratory, The Francis Crick Institute, London, UK PEDRO TORRES • Department of Biochemistry, University of Cambridge, Cambridge, UK YUKO TSUCHIYA • Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan ILYA A. VAKSER • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA
JOHN W. VANT • The School of Molecular Sciences, Arizona State University, Tempe, AZ, USA ČESLOVAS VENCLOVAS • Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania VAISHALI P. WAMAN • Institute of Structural and Molecular Biology, University College London, London, UK TOMASZ K. WIRECKI • Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland TIANQI WU • Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA ZHONGHUA WU • School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People’s Republic of China XIANJIN XU • Dalton Cardiovascular Research Center, University of Missouri, Columbia, MO, USA; Department of Physics and Astronomy, University of Missouri, Columbia, MO, USA; Department of Biochemistry, University of Missouri, Columbia, MO, USA; Institute for Data Science and Informatics, University of Missouri, Columbia, MO, USA YUMENG YAN • School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, China YUHONG ZHA • School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA XIAOQIN ZOU • Dalton Cardiovascular Research Center, University of Missouri, Columbia, MO, USA; Department of Physics and Astronomy, University of Missouri, Columbia, MO, USA; Department of Biochemistry, University of Missouri, Columbia, MO, USA; Institute for Data Science and Informatics, University of Missouri, Columbia, MO, USA
Chapter 1
Structural Modeling and Ligand-Binding Prediction for Analysis of Structure-Unknown and Function-Unknown Proteins Using FORTE Alignment and PoSSuM Pocket Search
Yuko Tsuchiya and Kentaro Tomii
Abstract
Structural data of biomolecules, such as those of proteins and nucleic acids, provide much information for estimation of their functions. For structure-unknown proteins, structure information is obtainable by modeling their structures based on sequence similarity of proteins. Moreover, information related to ligands or ligand-binding sites is necessary to elucidate protein functions because the binding of ligands can engender not only the activation and inactivation of the proteins but also the modification of protein functions. This chapter presents methods using our profile–profile alignment server FORTE and the PoSSuM ligand-binding site database for prediction of the structure and potential ligand-binding sites of structure-unknown and function-unknown proteins, aimed at protein function prediction.
Key words Function prediction, Homology/comparative modeling, Pocket detection, Potential ligand-binding site prediction, Profile–profile alignment
1 Introduction
Structural data of biomolecules, which support a precise understanding of protein functions and interactions, have been accumulated in the Protein Data Bank (PDB) [1]. The number of structures is increasing because of technological advances in structure determination, not only in X-ray crystallography and nuclear magnetic resonance but also in electron microscopy [2]. In particular, electron microscopy has a high probability of revealing high-resolution structures of macromolecules and structures of proteins for which crystallization is difficult [3]. However, many structure-unknown proteins still exist. In silico structural modeling is, therefore, necessary to provide structural insight into such proteins. In fact, it had been estimated even in 2009 that >70% of proteins can be partially modeled [4]. Generally, the
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_1, © Springer Science+Business Media, LLC, part of Springer Nature 2020
structure of a structure-unknown (query) protein is modeled based on comparison of the amino acid sequence of the query protein to those of structure-known (template) proteins. A sensitive comparison of sequences between query and template proteins, that is, an accurate sequence alignment, is fundamentally important for the accurate prediction of the structure of a query protein [5, 6]. FOld Recognition TEchnique (FORTE), together with DELTA-FORTE, is a profile–profile comparison tool for protein fold recognition (http://forteprtl.cbrc.jp/forte/) that aligns protein sequences with high sensitivity and accuracy [7]. It achieves high accuracy by using Pearson’s correlation coefficient to measure the similarity between two profile columns, where the position-specific score matrices (PSSMs) of both the query and the templates, obtained by PSI-BLAST iterations [8, 9], are used as profiles.
Ligand-binding site information also provides clues for biological function prediction and drug discovery [10]. Normally, ligands bind to “pockets” of proteins, such as concave regions on protein surfaces and holes in proteins. A search for such pocket regions, that is, putative ligand-binding sites, enables the identification of binding sites not only of natural ligands but also of compounds derived from drug discovery programs. Some of the latter can modulate protein function, as in the case of G-protein-coupled receptors (GPCRs), where the function of a GPCR triggered by agonist or antagonist binding can be modulated by the binding of an allosteric modulator [11]. Moreover, proteins without overall sequence or structural similarity can possess common substructures (a set of several residues) that have the potential to bind ligands [12–14].
Consequently, an exhaustive search of putative ligand-binding sites of a protein against all known ligand-binding proteins is necessary to identify potential ligand-binding sites that might lead to function predictions for the query protein. Pocket Similarity Search using Multiple sketches (PoSSuM), a fast alignment-free method for comprehensive comparisons of ligand-binding sites on proteins (http://possum.cbrc.jp/PoSSuM/) [15, 16], is useful for identifying potential ligand-binding sites. In PoSSuM, ligand-binding sites are represented as bit strings, and a sorting algorithm, multiple sorting, enables ultrafast all-pair similarity searches over these strings with strictly controlled error rates. This chapter presents methods for modeling three-dimensional structures from the amino acid sequence of a structure-unknown (query) protein and for predicting potential ligand-binding sites on the structure models of the query protein using our servers FORTE/DELTA-FORTE and PoSSuM, with an example of an uncharacterized protein of unknown structure (UniProtKB accession number: A0A3N5YMW6, https://www.uniprot.org/uniprot/A0A3N5YMW6). Homology/
comparative modeling of a query protein and the prediction of potential ligand-binding sites on the structure model using methods such as FORTE/DELTA-FORTE and PoSSuM facilitate complete understanding of the biological functions and interactions of structure-unknown and function-unknown proteins.
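The Introduction notes that FORTE scores the similarity of two profile columns with Pearson's correlation coefficient. The following is a minimal sketch of that idea only, not the FORTE implementation: the function names, the toy four-letter alphabet, and the choice to return 0.0 for a constant column are all assumptions made for illustration. Real PSSM columns would hold 20 scores each.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column carries no signal, so score it as 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def column_similarity_matrix(query_profile, template_profile):
    """Score every query column against every template column."""
    return [[pearson(q, t) for t in template_profile] for q in query_profile]


# Toy profiles over a hypothetical 4-letter alphabet (illustrative values only).
query = [[5, -1, -2, -3], [-2, 4, 0, -1]]
template = [[4, 0, -2, -3], [-3, -1, -2, 5]]
matrix = column_similarity_matrix(query, template)
```

A matrix of this kind is the sort of input a dynamic-programming aligner would consume; the alignment step itself is not shown here.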
2 Materials
Data of the amino acid sequence of the query uncharacterized protein A0A3N5YMW6 (https://www.uniprot.org/uniprot/A0A3N5YMW6.fasta) were obtained from UniProtKB (https://www.uniprot.org/) [17]. The alignment of the query protein against all the (template) proteins with structures that have been determined was performed using FORTE and DELTA-FORTE (http://forteprtl.cbrc.jp/forte/) [7]. The query protein structure was modeled using MODELLER [18], which is based on the alignment with template proteins by FORTE or DELTA-FORTE and structural data of the template proteins obtained from PDB (https://www.rcsb.org/) [19]. Then, potential ligand-binding sites on the query structure model were searched using Search P on PoSSuM (http://possum.cbrc.jp/PoSSuM/search_p.html) [15, 16].
3 Methods
3.1 Acquiring Amino Acid Sequence Data
We selected a structure-unknown uncharacterized protein (organism: Alphaproteobacteria bacterium, UniProtKB ID: A0A3N5YMW6) as a target (query) example of our analyses in this chapter. Sequence data of the query protein in FASTA format were obtained from the UniProtKB database (https://www.uniprot.org/uniprot/A0A3N5YMW6.fasta).
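For readers who prefer to script this step, the download and a basic sanity check of the FASTA record might be sketched as follows. The URL is the one given above (UniProt may redirect it to a newer address); the function names are hypothetical, and the demonstration record at the end is a made-up toy sequence, not the real A0A3N5YMW6 entry.

```python
from urllib.request import urlopen

# URL quoted in the text; the server may redirect it to a newer endpoint.
FASTA_URL = "https://www.uniprot.org/uniprot/A0A3N5YMW6.fasta"


def parse_fasta(text):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header = None
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the start of a second record
            header = line[1:]
        elif header is not None:
            residues.append(line)
    if header is None:
        raise ValueError("no FASTA header line starting with '>' found")
    return header, "".join(residues)


def fetch_fasta(url=FASTA_URL, timeout=30):
    """Download FASTA text over HTTP (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Offline demonstration on a made-up record (not the real query sequence).
example = ">toy|EXAMPLE|demo protein\nMKTAYIAK\nQRQISFVK\n"
header, sequence = parse_fasta(example)
```

With network access, `parse_fasta(fetch_fasta())` would return the header and sequence of the query protein.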
3.2 Profile–Profile Alignment by FORTE and DELTA-FORTE
On the FORTE portal site (http://forteprtl.cbrc.jp/forte/) [7], we clicked “Submission page (HTML)” under FORTE or DELTA-FORTE (Fig. 1, left panel). Then, the top page of FORTE or DELTA-FORTE (Fig. 1, right panel) appeared. The amino acid sequence of the query protein in FASTA format (or a pure amino acid sequence) was uploaded as a file or pasted into the “Query” box. It is noteworthy that the program requires sequence data including the header line starting with “>.” We also entered a short description of the sequence in the “Subject” box and, optionally, an e-mail address. Clicking the “Submit” button starts the calculation. Then, similarities between the query sequence and all the protein sequences (templates) with structures stored in PDB were evaluated as described in Subheading 5 (see Note 1).
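Because the text above notes that the header line starting with ">" is needed, a small pre-submission check can be convenient when a sequence arrives as bare residues. The sketch below is one way to do it under stated assumptions: the accepted alphabet (20 standard residues plus X), the default header text, and the 60-character line width are illustrative choices, not documented requirements of the FORTE server.

```python
# Assumption: 20 standard amino acids plus X for unknown residues.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def prepare_query(text, default_header="query"):
    """Return FASTA text with a '>' header, adding one if it is missing."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty query")
    if lines[0].startswith(">"):
        header, body = lines[0], lines[1:]
    else:
        header, body = ">" + default_header, lines
    sequence = "".join(body).upper()
    unknown = sorted(set(sequence) - VALID_RESIDUES)
    if not sequence or unknown:
        raise ValueError(f"missing sequence or unexpected characters: {unknown}")
    # Wrap the sequence at 60 characters per line (a common FASTA convention).
    wrapped = [sequence[i:i + 60] for i in range(0, len(sequence), 60)]
    return "\n".join([header] + wrapped) + "\n"


# Toy input without a header line (illustrative residues only).
ready = prepare_query("mktayiak")
```

The returned text can then be pasted into the "Query" box or saved to a file for upload.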
Yuko Tsuchiya and Kentaro Tomii
Fig. 1 FORTE web server. Screen images of the FORTE portal site (http://forteprtl.cbrc.jp/forte/) and the FORTE top page (left and right panels, respectively) are shown
After several minutes to several tens of minutes, we received an e-mail containing the job ID and the URL of the result page. The result page (Fig. 2) presented information about the top 20 alignments, such as the scores and the library and PDB IDs of the template proteins, along with the sequence alignment between the query and the selected template protein. The alignments with the top 20 templates were downloaded by clicking the “TEXT download” or “PDF download” button. The left panel of Fig. 3 presents the alignment with the highest score, in which the “Query” and “Library” lines show the aligned sequences of the query and template proteins, respectively (template PDB ID: 3LLI [20]; chain ID: A; domain ID: “_”, meaning that this protein consists of one domain).
3.3 Structural Modeling by MODELLER
The query structure was modeled using MODELLER [18], which requires the alignment between the query and template sequences and the coordinate file of the template structure in PDB format. We first modified the alignment obtained from FORTE so that the query sequence and the template sequence were separated (Fig. 3, right panel). We also prepared a Python script that tells the program the file names of the alignment and the template structure, as well as the names of the query (“sequence”) and the template (“knowns”), as shown in section “4. Model building” on the “Basic example” page of the MODELLER tutorial (https://salilab.org/modeller/tutorial/basic.html). We then ran the Python script to obtain the model structure.
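The two preparation steps described above can be sketched as follows. This is a hedged illustration rather than the script used in this chapter: the helper names (`pir_record`, `template_header`, `query_header`, `build_model`) and any file names are hypothetical, the `FIRST`/`LAST` residue keywords should be checked against the MODELLER manual or replaced with explicit residue numbers, and `build_model` assumes a licensed MODELLER installation with the class names `Environ` and `AutoModel` (MODELLER 10 naming).

```python
def template_header(pdb_code, chain, first="FIRST", last="LAST"):
    """Ten PIR header fields for a template structure (residue range is an assumption)."""
    return ["structureX", pdb_code, first, chain, last, chain, "", "", "", ""]


def query_header(code):
    """Ten PIR header fields for the sequence to be modeled."""
    return ["sequence", code, "", "", "", "", "", "", "", ""]


def pir_record(code, header_fields, aligned_seq, width=75):
    """Format one PIR record: '>P1;code', the header line, and the sequence ending in '*'."""
    if len(header_fields) != 10:
        raise ValueError("PIR header needs exactly 10 colon-separated fields")
    body = aligned_seq + "*"
    chunks = [body[i:i + width] for i in range(0, len(body), width)]
    return ">P1;{}\n{}\n{}\n".format(code, ":".join(header_fields), "\n".join(chunks))


def build_model(alnfile, template_code, query_code, n_models=1):
    """Run comparative modeling; imports MODELLER lazily so the helpers above work without it."""
    from modeller import Environ
    from modeller.automodel import AutoModel

    env = Environ()
    env.io.atom_files_directory = ["."]  # directory holding the template PDB file
    model = AutoModel(env, alnfile=alnfile, knowns=template_code, sequence=query_code)
    model.starting_model = 1
    model.ending_model = n_models
    model.make()
    return model
```

The aligned query and template sequences (with gaps written as “-”) from the FORTE result would be passed to `pir_record` one at a time and written to a single alignment file; the record codes must match the `knowns` and `sequence` arguments given to `build_model`.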
Protein Structure and Function Prediction Using FORTE and PoSSuM
Fig. 2 Results obtained using FORTE and DELTA-FORTE. FORTE result page (left panel) and DELTA-FORTE result page (right panel)
Fig. 3 FORTE alignment and MODELLER input. Highest Z-score alignment by FORTE (left panel) and modified sequence data for the input to MODELLER (right panel)
3.4 Searching Potential Ligand-Binding Sites by PoSSuM
The coordinate file of the structure model of the query protein, constructed using MODELLER based on the FORTE alignment, was submitted to PoSSuM through the form at the bottom of the Search P page (http://possum.cbrc.jp/PoSSuM/search_p.html) [15, 16].
Fig. 4 Results of PoSSuM search. Results of the PoSSuM search (left panel) and information of potential ligand-binding sites on the query protein in the top hit (right panel), as predicted by superimposing the template
PoSSuM then searched for potential ligand-binding sites on the query structure, as described in detail in Subheading 5 (see Note 2). After several minutes to several tens of minutes, the potential ligands and their binding sites were shown on the web page, which lists the template PDB IDs, the HET IDs that identify ligand molecules, the chain IDs in each PDB entry, several similarity measures (cosine similarity, p-value, aligned length, and RMSD), protein names, and annotations such as UniProt, UniRef50, EC, CATH, SCOPe, and GO terms (Fig. 4, left panel). Clicking the “View” button in the Super Position column shows a potential ligand-binding site on the query protein, onto which the ligand-binding site of the template protein is superimposed, along with information about the amino acids forming the potential binding site and the potential ligand (Fig. 4, right panel).
4 Case Studies
Here, we present the results of structural modeling of the query protein, an uncharacterized protein of unknown structure (UniProtKB ID: A0A3N5YMW6), based on profile–profile alignment by FORTE and DELTA-FORTE [7], and the results of the search for potential ligand-binding sites on the structure model of the query by PoSSuM [15, 16]. Structural modeling and ligand search were performed according to the procedures described in Subheading 3.
To construct a structure model of the query protein, we performed homology modeling as follows. First, we acquired the sequence of the query protein in FASTA format from the UniProtKB website (https://www.uniprot.org/uniprot/A0A3N5YMW6.fasta). The sequence was then submitted to FORTE and DELTA-FORTE (http://forteprtl.cbrc.jp/forte/) [7]. Among the respective top 20 alignments by FORTE and DELTA-FORTE, we selected the FORTE alignment with the highest Z-score (= 27.05). The template in this alignment is chain A of PDB ID 3LLI. MODELLER [18] was used to model the query structure based on this alignment. The coordinate file of the resulting structure model was then submitted to PoSSuM from the bottom of the Search P page (http://possum.cbrc.jp/PoSSuM/search_p.html) [15, 16] to predict potential ligand-binding sites on the query structure model.
The PoSSuM search result (Fig. 4, left panel) suggests that flavin adenine dinucleotide (FAD) is a potential ligand of the query protein, because many of the top hits of the PoSSuM search had FAD as the ligand. In addition, in these hits, the cosine similarity between the feature vectors of the query and template proteins, which measures the similarity between the two ligand-binding sites (see Note 2), was extremely high, the p-value was sufficiently low, and the aligned length was very long.
Figure 5 presents a structural model of the query protein based on the FORTE alignment with the highest Z-score, along with the potential ligand, FAD, placed on the binding site predicted by superposition of the query on the template proteins (Fig. 4, right panel). The InterPro (https://www.ebi.ac.uk/interpro/protein/A0A3N5YMW6) [21] and PROSITE (https://prosite.expasy.org/cgi-bin/prosite/ScanView.cgi?scanfile=6551093141214.scan.gz) [22] databases define the query protein as a member of the Erv1/Alr sulfhydryl oxidase family, which is involved in the biogenesis of Fe/S clusters [23]. Both yeast Erv1 and human Alr are known to be FAD-dependent sulfhydryl oxidases, which means that they bind FAD [24]. FORTE aligned the structure of human sulfhydryl oxidase 1 (QSOX1) in complex with FAD as the template with the highest Z-score (PDB ID: 3LLI). DELTA-FORTE also aligned an FAD-dependent human sulfhydryl oxidase (PDB ID: 1OQC). From the above, we speculate that the query protein functions as an FAD-dependent sulfhydryl oxidase and therefore binds FAD, as indicated by the result of the PoSSuM ligand search. The ligand search depends strongly on the quality of the structure model, as described in Note 2. We therefore infer that accurate structural modeling of the query protein based on the precise FORTE alignment led to the detection of the potential ligand and its binding site on the query structure model.
Fig. 5 Structural model of the query protein in complex with the putative ligand, FAD. Structural model of the query protein (ribbon model) based on the highest Z-score alignment by FORTE, and the potential ligand, FAD (spacefill model), predicted by PoSSuM
5 Notes
1. Profile–profile alignment in FORTE and DELTA-FORTE
FORTE and DELTA-FORTE [7] require PSSMs for both the query proteins of unknown structure and the template proteins of known structure. In our current system, PSSMs are obtained using PSI-BLAST [8, 9] iterations with the NCBI nr database [25] for FORTE, and DELTA-BLAST searches with the NCBI CDD database [26] for DELTA-FORTE. Pearson’s correlation coefficient (Cpq) is calculated according to Eq. 1 as a similarity score for each pair of positions p and q in the query and template profiles, respectively:
$$
C_{pq} = \frac{\sum_{i=1}^{20} \left(x_{pi} - \bar{x}_p\right)\left(y_{qi} - \bar{y}_q\right)}{\left[\sum_{i=1}^{20} \left(x_{pi} - \bar{x}_p\right)^2 \sum_{i=1}^{20} \left(y_{qi} - \bar{y}_q\right)^2\right]^{1/2}}, \qquad (1)
$$
where Cpq represents the similarity score between profile columns xp (at position p in profile X) and yq (at position q in profile Y); xpi and yqi denote the elements for the 20 amino acids in profile columns xp and yq, respectively; and x̄p and ȳq are the average values of the 20 elements in profile columns xp and yq, respectively. FORTE then aligns the query and template profiles using a standard dynamic programming algorithm, which requires the matrix containing the similarity scores Cpq for all pairs of positions. To select the top 20 alignments to be returned to the user, the statistical significance of each alignment score is estimated by calculating Z-scores with a simple log-length correction. The alignments with the top 20 Z-scores are shown on the result page (Fig. 2).
2. Method for identifying potential ligand-binding sites using PoSSuM
To identify potential ligand-binding sites on the query protein, the PoSSuM server implements an exhaustive search that tries to find all known ligand-binding sites similar to putative ligand-binding sites on the query protein. A feature vector is calculated for each ligand-binding site, and the similarity between two ligand-binding sites is measured as the cosine of the angle between their feature vectors. First, a pocket finder, GHECOM [27], detects pocket regions, that is, putative ligand-binding regions, on the protein surface of the query structure. The search program in PoSSuM [15, 16] then decomposes a putative ligand-binding site into all possible triangles, each of which comprises three amino acids. Each triangle is labeled with vertex labels, based on the physicochemical properties of the component amino acids (such as positively or negatively charged, hydrophobic, aromatic, and hydrogen bond donor or acceptor), and with edge labels, based on geometrical properties determined from the Cα–Cα distance between two amino acids.
From all possible triangles, defective triangles that include at least one unlabeled vertex or edge are removed. Each known or putative ligand-binding site is then represented as a high-dimensional feature vector, each element of which is the frequency of occurrence of one predefined triangle type, that is, one combination of vertex and edge labels determined from the physicochemical and geometrical properties.
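The representation described in Note 2 can be illustrated with a small Python sketch. It is a toy version under stated assumptions: the residue classes in `RESIDUE_CLASS`, the distance bins in `DISTANCE_BINS`, and all function names are hypothetical and are not the labels used by PoSSuM. Each residue is given as a one-letter code with its Cα coordinates.

```python
import math
from collections import Counter
from itertools import combinations, permutations

# Hypothetical vertex classes and edge bins (in angstroms), for illustration only.
RESIDUE_CLASS = {"K": "pos", "R": "pos", "D": "neg", "E": "neg",
                 "F": "aro", "W": "aro", "Y": "aro",
                 "L": "hyd", "I": "hyd", "V": "hyd",
                 "S": "hb", "T": "hb", "N": "hb", "Q": "hb"}
DISTANCE_BINS = [(0.0, 6.0, "short"), (6.0, 10.0, "medium"), (10.0, 14.0, "long")]


def edge_label(p, q):
    """Label an edge by its distance bin, or None if no bin applies."""
    d = math.dist(p, q)
    for low, high, name in DISTANCE_BINS:
        if low <= d < high:
            return name
    return None


def triangle_type(residues):
    """Order-independent type of a three-residue triangle, or None if it is defective."""
    best = None
    for a, b, c in permutations(residues):
        verts = tuple(RESIDUE_CLASS.get(r[0]) for r in (a, b, c))
        edges = (edge_label(a[1], b[1]), edge_label(b[1], c[1]), edge_label(c[1], a[1]))
        if None in verts or None in edges:
            return None  # unlabeled vertex or edge: discard the triangle
        key = verts + edges
        if best is None or key < best:
            best = key  # smallest tuple over all orderings is the canonical form
    return best


def site_vector(site):
    """Count triangle types over all residue triples of a binding site."""
    counts = Counter()
    for triple in combinations(site, 3):
        key = triangle_type(triple)
        if key is not None:
            counts[key] += 1
    return counts


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

Because the features depend only on residue classes and pairwise distances, two sites that differ by a rigid-body movement give the same vector and a cosine similarity of 1.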
Acknowledgments
This research was partially supported by Platform Project for Supporting Drug Discovery and Life Science Research (Basis for Supporting Innovative Drug Discovery and Life Science Research (BINDS)) from AMED under Grant Number JP19AM0101110.
References
1. Berman HM, Henrick K, Nakamura H (2003) Announcing the worldwide Protein Data Bank. Nat Struct Biol 10:980. https://doi.org/10.1038/nsb1203-980
2. wwPDB consortium (2019) Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res 47:D520–D528. https://doi.org/10.1093/nar/gky949
3. Mitra AK (2019) Visualization of biological macromolecules at near-atomic resolution: cryo-electron microscopy comes of age. Acta Crystallogr F Struct Biol Commun 75:3–11. https://doi.org/10.1107/S2053230X18015133
4. Levitt M (2009) Nature of the protein universe. Proc Natl Acad Sci U S A 106:11079–11084
5. Venclovas C, Zemla A, Fidelis K et al (2001) Comparison of performance in successive CASP experiments. Proteins Suppl 5:163–170
6. Fischer D, Elofsson A, Rychlewski L et al (2001) CAFASP2: the second critical assessment of fully automated structure prediction methods. Proteins Suppl 5:171–183
7. Tomii K, Akiyama Y (2004) FORTE: a profile–profile comparison tool for protein fold recognition. Bioinformatics 20:594–595
8. Altschul SF, Madden TL, Schäffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
9. Schäffer AA, Aravind L, Madden TL et al (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29:2994–3005
10. Montgomery AP, Xiao K, Wang X et al (2017) Computational glycobiology: mechanistic studies of carbohydrate-active enzymes and implication for inhibitor design. Adv Protein Chem Struct Biol 109:25–76
11. Congreve M, Oswald C, Marshall FH (2017) Applying structure-based drug design approaches to allosteric modulators of GPCRs. Trends Pharmacol Sci 38:837–847. https://doi.org/10.1016/j.tips.2017.05.010
12. Brady L, Brzozowski AM, Derewenda ZS et al (1990) A serine protease triad forms the catalytic centre of a triacylglycerol lipase. Nature 343:767–770
13. Wallace AC, Laskowski RA, Thornton JM (1996) Derivation of 3D coordinate templates for searching structural databases: application to Ser-His-Asp catalytic triads in the serine proteinases and lipases. Protein Sci 5:1001–1013
14. Via A, Ferrè F, Brannetti B et al (2000) Three-dimensional view of the surface motif associated with the P-loop structure: cis and trans cases of convergent evolution. J Mol Biol 303:455–465
15. Ito J, Tabei Y, Shimizu K et al (2012) PoSSuM: a database of similar protein-ligand binding and putative pockets. Nucleic Acids Res 40:D541–D548. https://doi.org/10.1093/nar/gkr1130
16. Ito J, Ikeda K, Yamada K et al (2015) PoSSuM v.2.0: data update and a new function for investigating ligand analogs and target proteins of small-molecule drugs. Nucleic Acids Res 43:D392–D398. https://doi.org/10.1093/nar/gku1144
17. The UniProt Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506–D515. https://doi.org/10.1093/nar/gky1049
18. Webb B, Sali A (2016) Comparative protein structure modeling using MODELLER. Curr Protoc Protein Sci 86:2.9.1–2.9.37. https://doi.org/10.1002/cpps.20
19. Berman HM, Westbrook J, Feng Z et al (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242. https://doi.org/10.1093/nar/28.1.235
20. Alon A, Heckler EJ, Thorpe C et al (2010) QSOX contains a pseudo-dimer of functional and degenerate sulfhydryl oxidase domains. FEBS Lett 584:1521–1525
21. Mitchell AL, Attwood TK, Babbitt PC et al (2019) InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res 47:D351–D360. https://doi.org/10.1093/nar/gky1100
22. Sigrist CJA, de Castro E, Cerutti L et al (2012) New and continuing developments at PROSITE. Nucleic Acids Res 41:D344–D347. https://doi.org/10.1093/nar/gks1067
23. Lange H, Lisowsky T, Gerber J et al (2001) An essential function of the mitochondrial sulfhydryl oxidase Erv1p/ALR in the maturation of cytosolic Fe/S proteins. EMBO Rep 2:715–720. https://doi.org/10.1093/embo-reports/kve161
24. Levitan A, Danon A, Lisowsky T (2004) Unique features of plant mitochondrial sulfhydryl oxidase. J Biol Chem 279:20002–20008. https://doi.org/10.1074/jbc.M312877200
25. NCBI Resource Coordinators (2018) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 46(Database issue):D8–D13
26. Marchler-Bauer A, Derbyshire MK, Gonzales NR et al (2015) CDD: NCBI’s conserved domain database. Nucleic Acids Res 43(Database issue):D222–D226
27. Kawabata T (2010) Detection of multiscale pockets on protein surfaces using mathematical morphology. Proteins 78:1195–1211. https://doi.org/10.1002/prot.22639
Chapter 2
The MULTICOM Protein Structure Prediction Server Empowered by Deep Learning and Contact Distance Prediction
Jie Hou, Tianqi Wu, Zhiye Guo, Farhan Quadir, and Jianlin Cheng
Abstract
Prediction of the three-dimensional (3D) structure of a protein from its sequence is important for studying its biological function. With the advancement in deep learning contact distance prediction and residue–residue coevolutionary analysis, significant progress has been made in both template-based and template-free protein structure prediction in the last several years. Here, we provide a practical guide for our latest MULTICOM protein structure prediction system built on top of the latest advances, which was rigorously tested in the 2018 CASP13 experiment. Its specific functionalities include: (1) prediction of 1D structural features (secondary structure, solvent accessibility, disordered regions) and 2D interresidue contacts; (2) domain boundary prediction; (3) template-based (or homology) 3D structure modeling; (4) contact distance-driven ab initio 3D structure modeling; and (5) large-scale protein quality assessment enhanced by deep learning and predicted contacts. The MULTICOM web server (http://sysbio.rnet.missouri.edu/multicom_cluster/) presents all the 1D, 2D, and 3D prediction results and quality assessment to users via user-friendly web interfaces and e-mails. The source code of the MULTICOM package is also available at https://github.com/multicom-toolbox/multicom.
Key words Protein structure prediction, Protein contact prediction, Protein distance prediction, Protein quality assessment, Deep learning, Fold recognition, Protein domain
1 Introduction
Three-dimensional (3D) structural information on proteins is vital for studying their functions in cellular processes. The uniquely folded 3D conformation (tertiary structure) of a protein is primarily determined by its amino acid sequence. Over the past decade, advances in high-throughput DNA sequencing technology have drastically reduced the cost and time of genome sequencing and produced tens of millions of protein sequences [1]. However, determining 3D protein structures through experimental techniques (e.g., X-ray crystallography or NMR spectroscopy) is still time-consuming,
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_2, © Springer Science+Business Media, LLC, part of Springer Nature 2020
Jie Hou et al.
labor-intensive, and rather expensive, leaving most proteins without solved structures. The gap between the number of protein sequences and the number of experimentally determined structures is growing exponentially [2]. Therefore, developing effective and accurate computational tools that can predict a protein structure from its amino acid sequence is one of the most important tasks in bioinformatics and computational biology. Computational methods for protein structure prediction can be classified into template-based and template-free (ab initio) modeling. Template-based modeling (TBM) methods attempt to build the tertiary structure of a target protein by using the known structures of its homologous proteins as templates [3–5]. This approach is also known as homology modeling or comparative modeling. These methods can generate accurate 3D structures if homologous proteins with known structures can be accurately detected and well aligned with the target protein; otherwise, they cannot predict the correct structure. Ab initio protein structure prediction predicts the 3D structure from the protein sequence without using known structures as templates. Fragment-assembly-based modeling is one of the representative ab initio methods for structure prediction [6]. Even though it can predict correct structures for some small proteins, it often fails to build the structures of medium-to-large proteins with complicated topologies. Ab initio protein structure prediction has achieved major breakthroughs in recent years owing to the drastic improvement in the accuracy of residue–residue contact distance prediction based on coevolutionary analysis and deep learning [7–10]. Distance-geometry-based ab initio modeling that uses predicted contact distances as restraints is able to build correct structures of large proteins with complicated topologies on various benchmarks and in the recent Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments [4, 10, 11].
In addition to model construction by template-based modeling or ab initio modeling, model quality assessment and model refinement are also two integral parts of a protein structure prediction system [12, 13]. Our MULTICOM protein structure prediction system aims to leverage both mature and latest technologies to accurately predict protein structures at 1D, 2D, and 3D levels and provide a reliable quality assessment of the predicted 3D structural models that facilitate their usage in real-world applications [4, 14]. Figure 1 is an overview of our MULTICOM prediction system. Given a target protein sequence, MULTICOM first generates multiple sequence alignments (MSA) by searching the sequence against the nonredundant sequence database to build sequence profiles (i.e., position-specific scoring matrix (PSSM) and hidden Markov model (HMM)) for protein templates identification [15] and multiple sequence alignments for coevolutionary analysis and 2D residue–residue contact predictions at multiple distance thresholds
The MULTICOM Protein Tertiary Structure Prediction Server
Fig. 1 The MULTICOM protein tertiary structure prediction system
(i.e., 6, 7.5, 8, 8.5, and 10 Å) [8]. The sequence profiles are also used to predict several important 1D protein features, including secondary structure, solvent accessibility, and disordered regions [16, 17]. The sequence alignments between the target and the identified templates are also used to predict domain boundaries. The regions of the target not aligned with any significant template are modeled by template-free (ab initio) methods with contacts (i.e., CONFOLD2, ROSETTA, UniCON3D, and FUSION) [6, 10, 18, 19], and the regions covered by templates are modeled by the multitemplate combination modeling approach [3, 20]. Both the fragment-assembly and distance-geometry-based ab initio modeling methods are used with predicted contacts to make 3D structure predictions when the target sequence does not have significant templates. A number of structures (generally more than 100) are generated from various target-template alignments produced by a variety of sequence alignment algorithms or their combinations [14]. MULTICOM uses a deep learning-based quality assessment method to select the presumably most accurate structural models from all these predicted models. The structure of the selected model is then refined using model refinement techniques [4].
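To make the notion of contacts at multiple distance thresholds concrete, the following Python sketch derives one contact set per threshold from a list of residue coordinates. It is an illustration under assumptions: the function name `contact_maps` is hypothetical, one representative atom per residue is used (which atom MULTICOM uses is not stated here), and a pair is counted as a contact when its distance is below the threshold.

```python
import math

# Distance thresholds in angstroms, as listed in the text.
THRESHOLDS = (6.0, 7.5, 8.0, 8.5, 10.0)


def contact_maps(coords, thresholds=THRESHOLDS, min_separation=1):
    """Return {threshold: set of (i, j)} with i < j and 0-based residue indices.

    coords holds one (x, y, z) tuple per residue (one representative atom each).
    """
    maps = {t: set() for t in thresholds}
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            d = math.dist(coords[i], coords[j])
            for t in thresholds:
                if d < t:
                    maps[t].add((i, j))
    return maps
```

Contact sets obtained in this way from a 3D model are the kind of information that is compared with the sequence-based contact predictions later in the workflow.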
The MULTICOM server was blindly tested in the 2018 CASP13 experiment and was ranked among the top ten servers. Compared with existing servers such as I-TASSER [21] and ROSETTA [6], MULTICOM generates a more comprehensive set of predictions, ranging from 1D features (secondary structures, solvent accessibility, disordered regions, and domain boundaries), 2D interresidue contact features, and 3D structures and templates to state-of-the-art quality assessment. Predictions such as 2D contact maps and 3D models are visualized in a user-friendly format. The cross-validation between 2D predicted contact maps and 3D models is unique. The ab initio modeling driven by contact distance prediction also differs from the fragment assembly approach used in the I-TASSER and ROSETTA servers. Therefore, the MULTICOM server provides a unique, versatile tool for the community to predict protein structures.
2 Materials
2.1 Input
Three types of information are required by the MULTICOM web server for protein structure prediction: (1) a target name, (2) the user’s e-mail address, and (3) one protein sequence in single-letter code. The target name identifies the job being submitted. The prediction results are sent to the user’s e-mail address once the task is finished. The protein sequence should be composed of the 20 standard amino acids. Figure 2 shows an input example (CASP13 target “T0951”). Users should verify all data in the input fields, including the e-mail address, target name, and protein sequence, before clicking the “Predict” button.
Fig. 2 The input web page of MULTICOM web server
2.2 Output
After the job is completed, the user receives two types of results through e-mail: (1) the top five predicted protein structures with detailed atomic coordinates and (2) a unique web link to detailed results with visualization.
1. The structure file attached to the e-mail is in the standard Protein Data Bank (PDB) text file format, containing the atomic coordinates (i.e., x, y, z) of each atom in the protein (http://predictioncenter.org/casp13/index.cgi?page=format). The PDB file can be visualized using any viewer tool, such as Chimera [22], PyMOL [23], Rasmol [24], or Jmol (an open-source Java viewer for chemical structures in 3D; http://www.jmol.org/).
2. The user also receives one unique web link associated with the job identifier that the user provided. JavaScript must be enabled in the web browser to view the 3D structures in the web page. The recommended browsers are Google Chrome, Firefox, Safari, and Internet Explorer. Several predicted protein features are presented, including predicted secondary structure, solvent accessibility, disordered regions, and domain boundaries. The top five predicted structures and their agreement with the predicted contacts in terms of the top L/5, top L/2, top L, and top 2L long-range contacts (see Note 1) are also visualized. Figure 3 shows an example of the detailed results for target “T0951.” More details are described in Subheading 3.
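For readers who want to process the returned structure files programmatically rather than only view them, the following Python sketch reads the atomic coordinates from PDB-format text. It is a minimal illustration: the function names are hypothetical, only `ATOM` records of the first model are read, and the column positions follow the fixed-width layout of the standard PDB format.

```python
def parse_atoms(pdb_text):
    """Extract ATOM records (first model only) from PDB-format text."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # stop after the first model of a multi-model file
        if line.startswith("ATOM"):
            atoms.append({
                "name": line[12:16].strip(),      # atom name, columns 13-16
                "res_name": line[17:20].strip(),  # residue name, columns 18-20
                "chain": line[21],                # chain identifier, column 22
                "res_seq": int(line[22:26]),      # residue number, columns 23-26
                "xyz": (float(line[30:38]),       # x, y, z in columns 31-54
                        float(line[38:46]),
                        float(line[46:54])),
            })
    return atoms


def ca_trace(atoms):
    """Return the coordinates of the C-alpha atoms in file order."""
    return [a["xyz"] for a in atoms if a["name"] == "CA"]
```

The text of a downloaded model file can be passed to `parse_atoms` after reading it with the usual file functions; the resulting coordinates can then be used, for example, to compute inter-residue distances.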
2.3 Availability
The MULTICOM web server is freely available at http://sysbio.rnet.missouri.edu/multicom_cluster/. The source code and tool packages are available at https://github.com/multicom-toolbox/multicom. Prediction time (see Note 2) depends on several factors, including the server load, the length of the input sequence, and the difficulty of the query sequence (i.e., whether good templates can be found).
3 Methods
This section provides a step-by-step tutorial on how to use the MULTICOM server for protein structure prediction and how to interpret the predicted results.
3.1 Submit the Sequence
1. Open a web browser such as Google Chrome and go to http://sysbio.rnet.missouri.edu/multicom_cluster/. The homepage shown in Fig. 2 appears.
Fig. 3 The MULTICOM web server’s prediction for CASP13 target “T0951.” The orange boxes denote the annotations of ten different kinds of contents
2. In the “E-mail address” section, enter the e-mail address to which the results will be sent.
3. In the “Target name” section, enter a name for the protein sequence. A duplicate name is accepted in case the user wants to reproduce the predictions. We recommend a short target name.
4. In the “Protein sequence” section, enter a protein sequence by copying the query sequence into the text box. Nonstandard amino acid letters (i.e., J, O, B, U), special characters (e.g., $, *), and white space characters are removed from the sequence automatically. Both upper- and lower-case letters are accepted, and lower-case letters are converted to upper case automatically.
5. Press the “Predict” button to submit the job. Once the job is received, the user receives a confirmation e-mail with the subject “Job submission to MULTICOM.” The e-mail includes the result link that the user can use to check the prediction results. The home page is also redirected to the waiting status page, and the result is shown once the job is completed. The user is also notified through e-mail when the job is completed. It may take hours or even longer for the results to be ready.
3.2 Acquire the Predictions
Once the server completes the prediction, the result link is sent to the corresponding e-mail address. The user can click the link to view or download the predicted results for the input sequence, as shown in Fig. 3. The results are summarized as follows:
1. The entire set of predicted results can be downloaded as a file from the link shown in Box 1 in Fig. 3.
2. The predicted secondary structure, solvent accessibility, and disordered regions for the input sequence are provided in Boxes 2, 3, 4, and 5. The secondary structure and solvent accessibility are predicted by SSpro/ACCpro [16]; the secondary structure prediction assigns each residue in the protein sequence to one of three states: alpha-helix (H), beta-strand (E), or coil (C). The disordered regions are predicted by PreDisorder [17]. Disordered residues are marked as T, and ordered residues are marked as N.
3. The predicted domain boundaries in the protein sequence are visualized in Box 6. The domain boundaries are parsed from the target-template sequence alignments. If the protein is identified as a multidomain protein, the user can select a specific domain for detailed results, which are shown in Box 7 in Fig. 3; otherwise, the link to the full-length results is shown in Box 7 in Fig. 3. The predictions for each individual domain are reported, including the template information, predicted contact maps, and predicted domain structures, as shown in Fig. 3.
4. In the 3D prediction section, the tertiary structures predicted by MULTICOM are visualized in the JSmol viewer, as shown in Box 8. The predicted structure can be rotated in 3D by moving the mouse pointer over the JSmol screen and holding down the left mouse button. More options, including downloading the structure file and changing the visualization settings, are available by right-clicking. More detailed information on using JSmol can be found at http://wiki.jmol.org/index.php/Main_Page. The structural quality predicted by our quality assessment method, DeepRank [4], is provided along with the tertiary structure.
5. The predicted tertiary structure is cross-validated against the predicted contacts using ConEva [25]. The match between the predicted tertiary structures and the contacts predicted by deep learning is visualized in Box 9. The user can slide the window to view the comparison for the top L/5, top L/2, top L, and top 2L predicted long-range contacts (i.e., sequence distance ≥ 24) one by one (see Note 1). In the contact map, the blue points are the residue contacts derived from the predicted structure, and the red points are the contacts predicted from the sequence by deep learning. If the red points and blue points match very well (i.e., high precision), the quality of both the tertiary structure prediction and the contact prediction is expected to be good. Generally, a larger number of effective sequences in the sequence alignment indicates that the contacts are more likely to be predicted accurately. The contact matching accuracy and the sequence alignment information are also provided for reference.
6. If homologous templates are identified and used for structure modeling, the alignments between the target protein and the templates are reported in Box 10. The image shows the coverage of the templates aligned with the target protein. The detailed alignments can be viewed by clicking the “View multiple sequence alignment” button.
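The contact-based cross-validation in item 5 can be approximated with a few lines of Python. This sketch is not the ConEva implementation: the function name and argument layout are hypothetical, long-range contacts are assumed to be those with a sequence separation of at least 24 residues, and the number of contacts considered is the sequence length L multiplied by a fraction (0.2 for top L/5, 0.5 for top L/2, 1 for top L, and 2 for top 2L).

```python
def top_contact_precision(predicted, model_contacts, seq_len,
                          fraction=0.2, min_separation=24):
    """Precision of the top (fraction * L) long-range predicted contacts.

    predicted: iterable of (i, j, probability) from the contact predictor.
    model_contacts: set of (i, j) pairs with i < j derived from the 3D model.
    """
    long_range = [(min(i, j), max(i, j), p)
                  for i, j, p in predicted
                  if abs(i - j) >= min_separation]
    long_range.sort(key=lambda c: c[2], reverse=True)  # most confident first
    k = max(1, int(seq_len * fraction))
    top = long_range[:k]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (i, j) in model_contacts)
    return hits / len(top)
```

A value close to 1 means that nearly all of the most confident predicted contacts are present in the model, which corresponds to the well-matched red and blue points described above.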
4 Case Studies
In this section, we use two cases to illustrate the results that the MULTICOM server provides. The two examples cover four categories of protein structure modeling: single-domain modeling (i.e., T0951), multidomain modeling (i.e., T1022s1), template-based modeling (i.e., T0951, T1022s1: Domain 1), and template-free domain modeling (i.e., T1022s1: Domain 0).
4.1 Single-Domain Protein (T0951)
The first example is the CASP13 target T0951 (http://pre dictioncenter.org/casp13/target.cgi?id¼25&view¼all). According to the official domain definitions of CASP13 (http:// predictioncenter.org/casp13/domains_summary.cgi), T0951 was classified as a single-domain template-based target (see Note 3), and the PDB ID for this protein is 5z82. To predict the structure of T0951, its protein sequence consisting of 276 residues in a single line was copied and supplied as input to the MULTICOM web server as shown in Fig. 2. After providing the target name (i.e., T0951) and e-mail address (i.e., [email protected]), the job was submitted. MULTICOM server accepted the job and started to predict the 3D structure for the target. Once the task was completed, an e-mail was sent to the e-mail address and the results were visualized in the web page (Fig. 3). Based on the results that MULTICOM server provided, the prediction of secondary structure and solvent accessibility is provided in Boxes 3 and 4, and the protein contains disordered regions at the N-terminal and C-terminal (see Box 5). MULTICOM identified multiple significant templates (see Note 4) for this protein (see Box 10) that covered the full-length target sequence, suggesting that the protein
The MULTICOM Protein Tertiary Structure Prediction Server
Fig. 4 The native structure (shown in green color) and MULTICOM-predicted structure (shown in blue color) superimposed using Chimera for target “T0951.” (PDB code: 5z82)
was a single-domain protein (or a single-modeling unit covered by at least one complete template) (see Box 6). The predicted 3D structure was visualized in Box 8. Additionally, the predicted structure was evaluated against the contacts predicted by the deep learning method using ConEva (see Box 9). Since the number of effective sequences for the target protein is very high (i.e., 8941), the predicted contacts can generally be considered accurate. Because the contacts in the model match the predicted contacts well (i.e., the accuracy of the long-range top L/5 contacts is 100.0%) (see Note 5), the predicted structure can also be considered largely correct. MULTICOM also provides the results of the top five predicted structures. Compared with the native structure of the target (see Note 6), the TM-score and RMSD of the top-ranked predicted structure are 0.977 and 0.967, respectively, indicating that the prediction is accurate. The predicted structure and the native structure are superimposed and visualized in Chimera (Fig. 4).
4.2 Multidomain Protein (T1022s1)
The second example is the CASP13 target T1022s1 (http://predictioncenter.org/casp13/target.cgi?id=185&view=all). According to the official domain definitions of CASP13 (http://predictioncenter.org/casp13/domains_summary.cgi), the target T1022s1 was classified as a two-domain protein, where the first domain is a free-modeling (FM) domain (position: 1–157) and the second domain is a template-based modeling (TBM) domain (position: 158–224) (see Note 3). To predict the structure of the protein target T1022s1, its protein sequence of 229 residues was copied as a single line and supplied as input to the MULTICOM
Jie Hou et al.
Fig. 5 The MULTICOM web server’s prediction for CASP13 target T1022s1. The orange boxes annotate different prediction results. The target was predicted to have two domains
web server. After providing the target name (i.e., T1022s1) and e-mail address (i.e., [email protected]), the job was submitted to the MULTICOM server. Once the task was completed, the results link was sent to the user’s e-mail and the results were visualized in the web page as shown in Fig. 5. In the results that the MULTICOM server provided, the protein was predicted to be a two-domain protein (see Box 6), where the first domain spans positions 1–167 and the second domain spans positions 168–229, which is largely correct. MULTICOM predicted the structures of the two domains individually. The detailed results for each domain can be viewed through Box 7. For instance, the predicted structures of the first domain are shown in Fig. 6. MULTICOM treated this domain as a “hard” domain since no significant templates were identified. The structure was predicted using contact distance-based ab initio modeling methods (i.e., CONFOLD2). As for the full-length predictions, the predicted structure of the domain was evaluated against the predicted contacts using ConEva. For the second domain, as shown in Fig. 7, MULTICOM predicted the structure using template-based modeling approaches because a significant template for this domain was
Fig. 6 The MULTICOM web server’s prediction for the first domain of T1022s1
Fig. 7 The MULTICOM web server’s prediction for the second domain of T1022s1
found. The good match between the contacts derived from the predicted structure and the contacts predicted by deep learning also suggests that the prediction is reasonable, because the predicted contacts were expected to be accurate given the large number of effective sequences (i.e., 1808). Finally, the structures of the two domains were combined into a full-length structure, which is shown in Box 8 of Fig. 5.
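To make the statement that the predicted domain boundaries are "largely correct" concrete, the following minimal sketch compares the two domain definitions quoted above residue by residue. It is an illustration only, not part of the MULTICOM server; the function names and the agreement measure (fraction of officially assigned residues placed in the same domain index) are assumptions made for this example.

```python
def residue_labels(domains):
    """Map each residue position to a domain index, given inclusive (start, end) ranges."""
    labels = {}
    for index, (start, end) in enumerate(domains):
        for position in range(start, end + 1):
            labels[position] = index
    return labels


def boundary_agreement(predicted, reference):
    """Fraction of reference residues placed in the same domain index by both definitions."""
    pred = residue_labels(predicted)
    ref = residue_labels(reference)
    same = sum(1 for position, index in ref.items() if pred.get(position) == index)
    return same / len(ref)


# Domain ranges taken from the text; the agreement measure itself is hypothetical.
predicted_domains = [(1, 167), (168, 229)]  # MULTICOM prediction for T1022s1
casp_domains = [(1, 157), (158, 224)]       # official CASP13 definition

agreement = boundary_agreement(predicted_domains, casp_domains)
print(f"{agreement:.3f}")
```

Under this measure, only the ten residues 158–167 are assigned to different domains, so 214 of the 224 officially assigned residues agree.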
5 Notes
1. A pair of residues is defined to be in contact when the distance between their Cβ atoms (Cα in the case of GLY) in the three-dimensional structure is less than 8.0 Å. Contacts with a separation of at least 24 residues along the sequence are defined as “long-range” contacts. The top L/5, top L/2, top L, and top 2L contacts are obtained by ranking the contact pairs by predicted probability from high to low (L is the length of the protein).
2. The MULTICOM server usually takes 1–2 days to finish a prediction. The execution time depends not only on the protein size but also on the computational resources. Currently, our server processes up to two sequences at the same time, and extra tasks wait in the queue. The MULTICOM standalone package is also available for local installation, which is recommended if users want to predict structures for a large number of sequences.
3. Template-based modeling is a procedure for building 3D protein models based on structural templates that share sequence similarity with the target protein sequence. Template-free modeling (or free modeling) predicts 3D protein models from the given target protein sequence when no explicit structural templates are available.
4. The significance of a template against the target sequence is defined by the e-value, which is generated by using an alignment tool such as HHsearch [26] to search the query against the template library. Usually, a low e-value means that the template sequence has high similarity to the target sequence.
5. The accuracy of contacts is defined as the percentage of correctly predicted contacts among the selected contacts. Specifically, the accuracy is calculated by the equation $\frac{TP}{TP + FP}$, where the true positives (TP) are the predicted contacts that are correct and the false positives (FP) are the incorrectly predicted contacts.
6. TM-score, RMSD (root mean square deviation between the corresponding Cα atoms), and GDT-TS score are commonly used metrics to compare and evaluate protein structure predictions [27]. The online version of the TM-score tool, which can compare structures of the same protein, is available at http://zhanglab.ccmb.med.umich.edu/TM-score/. The TM-score tool can also be downloaded for local use.
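The definitions in Notes 1 and 5 can be sketched in a few lines of code. The example below is a minimal illustration under stated assumptions, not the ConEVA implementation: the function names, the toy hairpin coordinates, and the predicted probabilities are all hypothetical, and each residue is represented by a single point standing in for its Cβ (or Cα for glycine) atom.

```python
import math

CONTACT_CUTOFF = 8.0     # Å, Cβ-Cβ distance threshold (Note 1)
LONG_RANGE_MIN_SEP = 24  # minimum sequence separation for long-range contacts (Note 1)


def long_range_contacts(coords, min_sep=LONG_RANGE_MIN_SEP, cutoff=CONTACT_CUTOFF):
    """Return residue pairs (i, j), i < j, closer than `cutoff` and at least `min_sep` apart."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_sep, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts


def top_contact_precision(predicted, contacts, length, divisor=5, min_sep=LONG_RANGE_MIN_SEP):
    """Precision TP / (TP + FP) of the top L/divisor long-range predictions (Note 5).

    `predicted` is a list of (i, j, probability) tuples with i < j.
    """
    long_range = [p for p in predicted if p[1] - p[0] >= min_sep]
    ranked = sorted(long_range, key=lambda p: p[2], reverse=True)
    selected = ranked[: max(1, length // divisor)]
    if not selected:
        return 0.0
    true_positives = sum(1 for i, j, _ in selected if (i, j) in contacts)
    return true_positives / len(selected)


# Hypothetical 30-residue hairpin: two antiparallel strands 5 Å apart, 3.8 Å per residue.
coords = [(3.8 * i, 0.0, 0.0) if i < 15 else (3.8 * (29 - i), 5.0, 0.0) for i in range(30)]
contacts = long_range_contacts(coords)

# Hypothetical predictions; (5, 10) is ignored because it is not long-range.
predicted = [(0, 29, 0.95), (1, 28, 0.90), (2, 27, 0.85), (0, 27, 0.80),
             (3, 27, 0.75), (4, 29, 0.70), (5, 10, 0.99), (2, 26, 0.40)]
precision = top_contact_precision(predicted, contacts, length=len(coords))
print(len(contacts), round(precision, 3))
```

In this toy case, four of the six selected top L/5 predictions are true contacts, so the precision is 4/6.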
Acknowledgments
The work was supported by an NIH grant (R01GM093123) and NSF grants (IIS1763246 and DBI1759934) to J.C.
References
1. The UniProt Consortium (2018) UniProt: the universal protein knowledgebase. Nucleic Acids Res 46(5):2699
2. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28(1):235–242
3. Wang Z, Eickholt J, Cheng J (2010) MULTICOM: a multi-level combination approach to protein structure prediction and its assessments in CASP8. Bioinformatics 26(7):882–888
4. Hou J, Wu T, Cao R, Cheng J (2019) Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. Proteins 87(12):1165–1178
5. Eswar N, Webb B, Marti-Renom MA, Madhusudhan M, Eramian D, Shen MY, Pieper U, Sali A (2006) Comparative protein structure modeling using Modeller. Curr Protoc Bioinformatics 15(1):5.6.1–5.6.30
6. Rohl CA, Strauss CE, Misura KM, Baker D (2004) Protein structure prediction using Rosetta, Methods in enzymology, vol 383. Elsevier, Amsterdam, pp 66–93
7. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, Zecchina R, Onuchic JN, Hwa T, Weigt M (2011) Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A 108(49):E1293–E1301
8. Adhikari B, Hou J, Cheng J (2017) DNCON2: Improved protein contact prediction using two-level deep convolutional neural networks. Bioinformatics 34(9):1466–1472
9. Wang S, Sun S, Li Z, Zhang R, Xu J (2017) Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol 13(1):e1005324
10. Adhikari B, Cheng J (2018) CONFOLD2: improved contact-driven ab initio protein structure modeling. BMC Bioinformatics 19(1):22
11. Abriata LA, Tamò GE, Monastyrskyy B, Kryshtafovych A, Dal Peraro M (2018) Assessment of hard target modeling in CASP12 reveals an emerging role of alignment-based contact prediction methods. Proteins 86:97–112
12. Cao R, Bhattacharya D, Adhikari B, Li J, Cheng J (2016) Massive integration of diverse protein quality assessment methods to improve template based modeling in CASP11. Proteins 84:247–259
13. Cao R, Bhattacharya D, Adhikari B, Li J, Cheng J (2015) Large-scale model quality assessment for improving protein tertiary structure prediction. Bioinformatics 31(12):i116–i123
14. Li J, Deng X, Eickholt J, Cheng J (2013) Designing and benchmarking the MULTICOM protein structure prediction system. BMC Struct Biol 13(1):2
15. Remmert M, Biegert A, Hauser A, Söding J (2012) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9(2):173
16. Magnan CN, Baldi PJB (2014) SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics 30(18):2592–2597
17. Deng X, Eickholt J, Cheng J (2009) PreDisorder: ab initio sequence-based prediction of protein disordered regions. BMC Bioinformatics 10(1):436
18. Bhattacharya D, Cao R, Cheng J (2016) UniCon3D: de novo protein structure prediction using united-residue conformational search via stepwise, probabilistic sampling. Bioinformatics 32(18):2791–2799
19. Bhattacharya D, Cheng J (2015) De novo protein conformational sampling using a probabilistic graphical model. Sci Rep 5:16332
20. Cheng J (2008) A multi-template combination algorithm for protein comparative modeling. BMC Struct Biol 8(1):18
21. Zhang Y (2008) I-TASSER server for protein 3D structure prediction. BMC Bioinformatics 9(1):40
22. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE (2004) UCSF Chimera—a visualization system for exploratory research and analysis. J Comput Chem 25(13):1605–1612
23. Schrodinger L (2010) The PyMOL molecular graphics system. Version 1.3r1
24. Sayle R (1992) RasMol v2.5
25. Adhikari B, Nowotny J, Bhattacharya D, Hou J, Cheng J (2016) ConEVA: a toolbox for comprehensive assessment of protein contacts. BMC Bioinformatics 17(1):517
26. Söding JJB (2004) Protein homology detection by HMM–HMM comparison. Bioinformatics 21(7):951–960
27. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57(4):702–710
Chapter 3
The Genome3D Consortium for Structural Annotations of Selected Model Organisms
Vaishali P. Waman, Tom L. Blundell, Daniel W. A. Buchan, Julian Gough, David Jones, Lawrence Kelley, Alexey Murzin, Arun Prasad Pandurangan, Ian Sillitoe, Michael Sternberg, Pedro Torres, and Christine Orengo
Abstract
The Genome3D consortium is a collaborative project involving protein structure prediction and annotation resources developed by six world-leading structural bioinformatics groups based in the United Kingdom (namely Blundell, Murzin, Gough, Sternberg, Orengo, and Jones). The main objective of Genome3D is to serve as a common portal that provides both predicted models and annotations of proteins in model organisms, using several resources developed by these labs such as CATH-Gene3D, DOMSERF, pDomTHREADER, PHYRE, SUPERFAMILY, FUGUE/TOCATTA, and VIVACE. These resources primarily use SCOP- and/or CATH-based protein domain assignments. Another objective of Genome3D is to compare the structural classifications of protein domains in the CATH and SCOP databases and to provide a consensus mapping of CATH and SCOP protein superfamilies. CATH/SCOP mapping analyses led to the identification of a total of 1429 consensus superfamilies. Currently, Genome3D provides structural annotations for ten model organisms: Homo sapiens, Arabidopsis thaliana, Mus musculus, Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Plasmodium falciparum, Staphylococcus aureus, and Schizosaccharomyces pombe. Genome3D thus serves as a common gateway to each structure prediction/annotation resource and allows users to perform a comparative assessment of the predictions. It thereby helps researchers broaden their perspective on structure/function predictions for their query protein of interest in selected model organisms.
Key words SCOP, CATH, Protein structure prediction, Annotation, Function prediction, Protein superfamily, Protein domain, Protein family, Hidden Markov model, Superfamily mapping, Fold recognition, Homology modeling
1 Introduction
Genome3D is a UK-based collaborative resource (funded by the BBSRC since 2011), which integrates widely used protein structure-prediction and classification methods, developed by six UK-based structural bioinformatics research groups (namely Blundell, Murzin, Gough, Sternberg, Orengo, and Jones) [1, 2].
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_3, © Springer Science+Business Media, LLC, part of Springer Nature 2020
Vaishali P. Waman et al.
The predictions generated by these labs are integrated in Genome3D to provide consensus structural annotations and predicted three-dimensional (3D) structures (models) for proteins in selected model organisms [1, 2]. The resource is available online at http://genome3d.eu/.
1.1 Annotations at Genome3D
The primary focus of Genome3D is to annotate genomes of organisms using predicted 3D structures of proteins based on domain classification systems from the SCOP (Structural Classification of Proteins) [3] and CATH (Class Architecture Topology/fold and Homology) databases [4]. Table 1 provides a summary of protein structure prediction and annotation methods that are integrated in Genome3D. The detailed accounts of each of these methods are provided in the next section (Subheading 2). The release cycle of Genome3D involves the following key steps: (1) identification of the dataset of protein sequences that all resources (listed in Table 1) should annotate; (2) retrieval of annotation data from these resources; and (3) release of the database, with the following two types of structural annotations:
1. Predicted domain: location of a match to a similar structural domain (in CATH or SCOP).
2. Predicted 3D structure: fully modeled 3D coordinates of the protein domain.
The following steps outline the process of generating the original version of Genome3D v1.0 [1, 2]:
1. Identification of the Dataset of Genome3D Sequences:
(a) Identify Model Genomes: Genome3D currently provides annotation data for the following ten model organisms: Homo sapiens (human), Arabidopsis thaliana (mouse-ear cress), Mus musculus (mouse), Escherichia coli, Saccharomyces cerevisiae (baker’s yeast), Caenorhabditis elegans (nematode), Drosophila melanogaster (fruit fly), Plasmodium falciparum (malaria parasite), Staphylococcus aureus, and Schizosaccharomyces pombe.
• Use the UniProtKB API to search for sequences that match the above taxa with the “reference proteome” keyword.
(b) Identify sequences to represent Pfam families:
• One example sequence from each family containing just a single domain.
• One example sequence from each unique multidomain architecture.
The upcoming release of Genome3D (v2.1) has updated the sequences for the proteomes of these model organisms and has also added four new organisms (Mycobacterium tuberculosis, cow, pig, wheat).
Table 1 Protein structure prediction and annotation methods that are part of Genome3D
Name of resource | Type | Research group | Website URL and reference
Protein domain classification resources
CATH | Structure classification | Christine Orengo | http://www.cathdb.info/ [4]
SCOP | Structure classification | Murzin | http://scop.mrc-lmb.cam.ac.uk/scop/ [3]
Resources based on CATH domain annotations
GENE-3D | Structure annotation | Christine Orengo | http://gene3d.biochem.ucl.ac.uk/Gene3D/ [5]
DOMSERF | Protein structure prediction | David Jones | http://bioinf.cs.ucl.ac.uk/psipred/ [6]
pDomTHREADER | Structure prediction and annotation | David Jones | http://bioinf.cs.ucl.ac.uk/psipred/ [7]
Resources based on SCOP domain annotations
PHYRE | Structure prediction and annotation | Sternberg/Kelley | http://www.sbg.bio.ic.ac.uk/phyre2/ [8, 9]
SUPERFAMILY | Structure prediction and annotation | Julian Gough | http://supfam.org/ [10]
Resources based on both CATH/SCOP mapping
FUGUE/TOCATTA | Structure annotation | Tom Blundell | http://structure.bioc.cam.ac.uk/toccata/fugue
VIVACE | Protein structure prediction | Tom Blundell | http://structure.bioc.cam.ac.uk/vivace/ [11]
Additional key resources
Tools at PSIPRED workbench server | Structure/function prediction and annotation | David Jones | http://bioinf.cs.ucl.ac.uk/psipred/ [12]
SWISSMODEL | Protein structure prediction | Torsten Schwede | https://swissmodel.expasy.org [13]
2. Annotating Genome3D sequences: Each resource “pushes” annotations directly to the database via an API. This API is based on OpenAPI standards, and the documentation is provided at http://www.genome3d.eu/api/; the command-line tool to upload these annotations can be found on GitHub (https://github.com/UCLOrengoGroup/genome3d-openapi-client). The numbers of UniProtKB sequences in Genome3D v1.0 with at least one 3D model prediction and with at least one domain prediction are given in Tables 2 and 3, respectively.
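Step 1(b) above, the choice of sequences to represent Pfam families, can be illustrated with a short sketch. This is not the Genome3D pipeline code: the function name, the accessions, and the family identifiers are hypothetical, and picking the alphabetically first accession is an assumption made only so that the result is reproducible.

```python
def select_representatives(architectures):
    """Pick one example sequence per unique Pfam domain architecture.

    `architectures` maps a sequence accession to the ordered tuple of Pfam
    families found on that sequence. A single-domain sequence represents its
    family; a multidomain sequence represents its unique architecture.
    """
    representatives = {}
    for accession in sorted(architectures):  # sorted so the choice is reproducible
        architecture = tuple(architectures[accession])
        if architecture and architecture not in representatives:
            representatives[architecture] = accession
    return representatives


# Hypothetical accessions and family identifiers, for illustration only.
architectures = {
    "seq1": ("PF_A",),
    "seq2": ("PF_A",),
    "seq3": ("PF_A", "PF_B"),
    "seq4": ("PF_B", "PF_A"),
    "seq5": ("PF_A", "PF_B"),
    "seq6": (),  # no Pfam match, so it represents nothing
}

representatives = select_representatives(architectures)
for architecture, accession in representatives.items():
    print(accession, "-".join(architecture))
```

Here the domain order is treated as part of the architecture, so PF_A followed by PF_B and PF_B followed by PF_A each receive their own representative.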
Table 2 UniProtKB sequences with at least one 3D model prediction (Genome3D v1.0)
Sr. No. | Organism | Number of UniProtKB sequences (with at least one 3D model prediction)
1 | Arabidopsis thaliana | 44,668
2 | Saccharomyces cerevisiae | 14,342
3 | Human | 104,606
4 | Caenorhabditis elegans | 37,689
5 | Plasmodium falciparum | 10,348
6 | Schizosaccharomyces pombe | 12,672
7 | Staphylococcus aureus | 4920
8 | Drosophila melanogaster | 48,981
9 | Mus musculus | 53,867
10 | Escherichia coli | 12,448
Table 3 UniProtKB sequences with at least one domain prediction (Genome3D v1.0)
Sr. No. | Organism | Number of UniProtKB sequences (with at least one domain prediction)
1 | Arabidopsis thaliana | 114,904
2 | Saccharomyces cerevisiae | 42,019
3 | Human | 194,177
4 | Caenorhabditis elegans | 96,876
5 | Plasmodium falciparum | 22,619
6 | Schizosaccharomyces pombe | 24,251
7 | Staphylococcus aureus | 11,591
8 | Drosophila melanogaster | 100,128
9 | Mus musculus | 110,447
10 | Escherichia coli | 28,605
1.2 Genome3D Search Utilities and Genome3D Output
Genome3D allows users to search for the query protein in the model organisms using either the “Keyword/accession” or the “Sequence” search utility (Fig. 1). The Keyword search option allows users to search either by the name/keyword of the query
Fig. 1 The Genome3D website. The home page of the Genome3D website. The website is available at http://genome3d.eu
protein or UniProt accession ID. The “Sequence” search utility employs a fast JackHMMER [14] search of the input protein sequence against a library of sequences that are annotated by Genome3D [2]. Both search types provide result pages in a similar format; however, the sequence-based search generates an additional column of E-values, which are used to rank the significant hits. In case of multiple hits from multiple organisms, users can further filter their results based on species of interest. For a particular query protein, the Genome3D result page provides the following key results: (a) Overview of the query protein, that is, associated UniProt information, Enzyme annotation (if any), and sequence in FASTA format. Subheading 2 provides the main annotations compiled at Genome3D from various resources, as follows:
• Predicted domains: This section provides domains predicted by structure annotation resources, such as FUGUE [11], GENE3D [5], PHYRE2 [8], SUPERFAMILY [10], and pDomTHREADER [7].
• Predicted structures: This section provides structures predicted by resources such as Domserf [6], Phyre2 [8], Vivace, and SUPERFAMILY [10].
• External annotations by Pfam database [15].
• Associated tools: This section provides a link to the PSIPRED workbench server [12].
2 Methods/Resources Available at Genome3D
For every structural bioinformatics method/resource integrated in Genome3D (listed in Table 1), this chapter outlines the following: the principle, salient features/annotations, and a tutorial demonstrating key annotations provided by the resource.
2.1 Domain-Based Structural Classification Databases at Genome3D
A protein domain is a region of a protein that may evolve, fold, and function independently of the rest of the protein. SCOP (Structural Classification of Proteins) and CATH (Class, Architecture, Topology, Homologous superfamily) are the major structural classification resources, both developed in the 1990s [3, 4]. Both resources classify protein domains, obtained from known protein 3D structures available in the Protein Data Bank (PDB) [16], into distinct evolutionarily related homologous superfamilies.
2.1.1 SCOP (Structural Classification of Proteins) Database
The SCOP (Structural Classification of Proteins) database provides a comprehensive description of the structural and evolutionary relationships between all protein domains that are structurally characterized in the PDB. The SCOP database was originally developed in 1995 [3]. A new prototype version of SCOP, SCOP2, was released in 2014 (http://scop2.mrc-lmb.cam.ac.uk/) [17].
Principles of SCOP
Both the SCOP and SCOP2 (the successor of SCOP) databases aim to organize structurally characterized proteins in the PDB, according to their structural and evolutionary relationships. The SCOP database is organized as a tree-like classification scheme [3]: SCOP classifies proteins into the following hierarchical levels: Protein species, Protein, Family, Superfamily, Common fold, and Class. A protein species is a distinct protein sequence and its variants (naturally occurring or artificial). A “Protein” level indicates a group of similar sequences that have essentially the same functions and represent different isoforms within the same organism or originate from distinct biological species [3]. Protein domains are clustered together into families based on: (1) their significant sequence similarities and (2) similarity in structure and function (e.g., globins are classified into one family, as they are structurally very similar). Superfamily is a group of families,
whose proteins have low sequence identities, but whose structural/functional features indicate a common evolutionary origin (e.g., the variable and constant domains of immunoglobulins). Superfamilies and families are further clustered into a common fold if their members possess the same arrangement and topological connectivity of their secondary structure units. Finally, different folds are grouped into distinct classes. There are five structural classes in SCOP: All-α (domains whose structure is exclusively formed by α-helices), All-β (domains whose structure is exclusively formed by β-sheets), α + β (those in which α-helices and β-strands are largely segregated), α/β (those in which α-helices and β-strands are largely interspersed), and multidomain (those with domains of different class and for which no homologues are known at present) [18].
Salient Features of SCOP
SCOP2: an advancement of SCOP. SCOP2 (http://scop2.mrc-lmb.cam.ac.uk) is a new version of SCOP that differs fundamentally from its predecessor in how it defines the protein classification system. Instead of the simple tree-like hierarchy of SCOP, SCOP2 classifies proteins in terms of a directed acyclic graph, in which each node defines a relationship of a particular type [17, 19]. The relationships in SCOP2 fall into four major categories: Protein types, Evolutionary events, Structural classes, and Protein relationships. The first two categories, that is, Protein types and Evolutionary events, have no counterpart in the SCOP classification. The details of each of these categories are given at http://scop2.mrc-lmb.cam.ac.uk/about.html and in ref. [17].
2.1.2 CATH (Class, Architecture, Topology, Homologous Superfamily) Database
The CATH database, first developed in the mid-1990s [4], aims to generate a hierarchical domain classification for the 3D structures of proteins deposited in the PDB. The resource is freely available online at http://www.cathdb.info/.
Principle of CATH
CATH is a semiautomated procedure for the classification of protein domains into four hierarchical levels: Class (C-level), Architecture (A-level), Topology or fold group (T-level), and Homologous Superfamily (H-level) [20]. Briefly, the CATH hierarchy is built as follows: CATH processes only well-resolved 3D structures in the PDB, based on SIFT criteria such as resolution (4 Å or better), experimental method (X-ray crystallography/NMR), length of the protein sequence (at least 40 amino acid residues), and completeness (at least 70% of the residues should have C-alpha atom coordinates in the PDB file). Each protein 3D structure is split into one or more chains, which are further subdivided into one or more domains. Class (or C-level) is the simplest level of the CATH hierarchy, which indicates the secondary structure composition of each protein domain. There are four classes in CATH:
mainly α-helical (Class 1), mainly β-sheet (Class 2), alpha beta, which contains a significant percentage of both alpha and beta (Class 3), and Class 4, which contains few secondary structures [21]. The domains within each class are subclassified based on their architecture (A-level), that is, similarities in the orientations of secondary structure elements in three-dimensional space (such as beta barrels and sandwiches). Each architecture is further split into one or more topologies (T-level) or fold groups, where the sequential connectivity between the secondary structure units is taken into account [4, 20, 21]. The protein domains sharing the same topology are then categorized into their respective Homologous superfamilies (H-level) based on similarities in their sequence, structure, and/or function. The current version of CATH (v.4.2) contains 434,857 protein domains that are classified into 41 architectures, 1391 topologies, and 6119 homologous superfamilies (http://www.cathdb.info/).
Salient Features of CATH
• CATH FunFams (Functional Families): The homologous superfamilies in CATH are classified into functional families (FunFams) using an agglomerative clustering algorithm and a protocol for separating families based on differences in putative function-determining residues [22]. The members of a FunFam are likely to have highly similar structures and functions. Users can search the FunFam data for a query protein of interest using the CATH FunFHMMER web server: http://www.cathdb.info/search/by_funfhmmer [22]. The FunFams are linked to GO (Gene Ontology) terms probabilistically, to provide functional annotations for uncharacterized protein sequences. The FunFam data have been utilized successfully in predicting functionally important sites pertaining to drug resistance in beta-lactamases [23], in detecting driver mutations in cancer genes [24], and for understanding structural and functional aspects of polypharmacology [25].
• Superfamily Superposition: For every superfamily, CATH provides structural superpositions of all representative protein domains (selected from sequence clusters at 35% identity). Related structures that superpose within an RMSD of 5 and 9 Å are clustered. These superpositions are created by an in-house structure and sequence alignment program (SSAP, followed by superposition by Protein least-squares fitting program, developed by Andrew Martin’s group, personal communication) [26]. On every superfamily web page, the structural superposition is shown as a static image and as a downloadable PyMOL script [20]. A SSAP score of 100 indicates identical proteins, while a SSAP score of >80 indicates homologous proteins. More distantly related folds (at T-level) have a SSAP score of >70. The SSAP web server is available at http://cath-tools.cathdb.info/structure/pairwise.
• FunTree: This feature allows users to explore the multidomain architecture of a particular CATH superfamily. Users can explore FunTree at http://cpmb.lshtm.ac.uk/FunTree/. For each structural cluster within a superfamily, a phylogenetic tree is constructed based on a filtered alignment. The representatives in these filtered alignments are annotated to provide detailed functional and taxonomic annotation. FunTree is rendered as an interactive force-directed graph in which, at each node, domains with associated structural data (together with the link to PDBe) and EC numbers (if any) are annotated.
• CATHEDRAL (CATHs Existing Domain Recognition Algorithm) server: CATH provides a utility, the CATHEDRAL server [27], to search CATH domains using user-defined query protein structures. This utility is available at http://www.cathdb.info/search/by_structure.
• Access to HMM Libraries: CATH allows users to run their sequence searches locally against the CATH HMM libraries, which are available at http://www.cathdb.info/download.
2.2 Overview of CATH/SCOP Mapping
One of the major objectives of Genome3D is to provide a mapping of superfamilies in CATH and SCOP [1, 2], in order to identify consensus superfamily pairs. A consensus superfamily pair is defined as a pair of superfamilies (one each from SCOP and CATH) that are more similar to each other than to any other ([1, 2] http://genome3d.eu/cathscop). Identifying consensus superfamilies therefore makes it possible to analyze the consistency of predictions made by the various resources at Genome3D (listed in Table 1), some of which are based on SCOP and others on CATH. CATH/SCOP mapping is done in two stages: (1) Domain mapping: SCOP domains are compared with CATH domains. (2) Superfamily mapping: This stage involves aggregating the results of domain mapping across pairs of SCOP and CATH superfamilies. Briefly, the CATH/SCOP mapping involves the following steps:
1. Domain Mapping:
(a) For all PDB structures, determine whether each residue is assigned to a CATH domain, a SCOP domain, both, or neither.
(b) For each overlapping domain between CATH and SCOP, calculate the overlap as the number of overlapping residues divided by the length of the (1) CATH domain and (2) SCOP domain.
2. Mapping Superfamilies Between CATH and SCOP [1, 2]:
(a) Filter out poorly overlapping CATH/SCOP domains by applying a minimum overlap cutoff.
(b) Aggregate the remaining overlapping domains according to their CATH/SCOP superfamily.
(c) For each pair of CATH/SCOP superfamilies that contains overlapping domains, calculate the overall overlap by dividing the number of overlapping domains by the total number of (1) CATH domains and (2) SCOP domains.
(d) Categorize the CATH/SCOP superfamily pairs into “bronze,” “silver,” and “gold.”
The consensus superfamily pairs are subdivided into gold, silver, and bronze categories, based on the level of similarity. “Bronze Pairs” are superfamily pairs (one each from SCOP and CATH) that are more similar to each other than to any other superfamily [1, 2]. “Silver Pairs” meet the Bronze Pair criteria; in addition, at least 80% of each superfamily’s domains map to the other (ignoring differences in unclassified domains), and the domains in each superfamily map to domains in the other over an average of at least 80% of their residues [1, 2]. “Gold Pairs” meet the Silver criteria and indicate a pair of SCOP and CATH superfamilies that display consistently highly similar approaches to homology detection and domain boundary assignment ([1, 2]; http://genome3d.eu/cathscop). Currently, there are 893, 210, and 373 consensus superfamily pairs belonging to the gold, silver, and bronze categories, respectively, based on CATH v3.75 and SCOP v1.75 mapping (http://genome3d.eu/cathscop). This classification system is used to color the corresponding superfamilies on the Genome3D output page, which helps users visualize the consistency of SCOP- and CATH-based predictions.
2.3 Methods/Resources Based on Both CATH/SCOP Mapping
The following key resources for protein structure annotation (namely FUGUE) and modeling of proteomes (VIVACE), developed by the Blundell group, utilize domain classifications generated by consensus CATH/SCOP mapping.
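The domain-overlap and superfamily-aggregation steps of the CATH/SCOP mapping that these resources rely on can be illustrated with a minimal Python sketch. It is not the Genome3D implementation: the function names, the input layout (domain ID mapped to a set of residues), the default cutoff of 0.8, and the use of the smaller of the two overlap fractions as a similarity score are all assumptions made for illustration. The text above does not state the value of the minimum overlap cutoff.

```python
from collections import defaultdict


def domain_overlaps(cath_domains, scop_domains):
    """Return (cath_id, scop_id, frac_of_cath, frac_of_scop) per overlapping pair.

    Both inputs map a domain ID to the set of (chain, residue_number) it covers.
    """
    results = []
    for cath_id, cath_res in cath_domains.items():
        for scop_id, scop_res in scop_domains.items():
            shared = len(cath_res & scop_res)
            if shared:
                results.append((cath_id, scop_id,
                                shared / len(cath_res), shared / len(scop_res)))
    return results


def superfamily_overlaps(overlaps, cath_sf, scop_sf, min_overlap=0.8):
    """Aggregate domain overlaps per superfamily pair.

    cath_sf / scop_sf map a domain ID to its superfamily ID.
    min_overlap is a hypothetical cutoff; the real value is not given here.
    """
    # Total number of domains in every superfamily (the denominators).
    cath_total, scop_total = defaultdict(set), defaultdict(set)
    for dom, sf in cath_sf.items():
        cath_total[sf].add(dom)
    for dom, sf in scop_sf.items():
        scop_total[sf].add(dom)

    # Domains of each superfamily pair that survive the overlap cutoff.
    cath_hit, scop_hit = defaultdict(set), defaultdict(set)
    for cath_id, scop_id, frac_cath, frac_scop in overlaps:
        if min(frac_cath, frac_scop) < min_overlap:
            continue
        pair = (cath_sf[cath_id], scop_sf[scop_id])
        cath_hit[pair].add(cath_id)
        scop_hit[pair].add(scop_id)

    return {pair: (len(cath_hit[pair]) / len(cath_total[pair[0]]),
                   len(scop_hit[pair]) / len(scop_total[pair[1]]))
            for pair in cath_hit}


def mutual_best_pairs(sf_overlaps):
    """Return pairs that are each other's best partner (a bronze-like test).

    The score used (smaller of the two fractions) is an assumption.
    """
    score = {pair: min(fracs) for pair, fracs in sf_overlaps.items()}
    best_for_cath, best_for_scop = {}, {}
    for (cath, scop), value in score.items():
        if value > score.get((cath, best_for_cath.get(cath)), -1.0):
            best_for_cath[cath] = scop
        if value > score.get((best_for_scop.get(scop), scop), -1.0):
            best_for_scop[scop] = cath
    return [(c, s) for c, s in best_for_cath.items() if best_for_scop.get(s) == c]


# Toy example with made-up domain IDs on chain A of one structure.
cath = {"cathDom1": {("A", i) for i in range(1, 101)}}
scop = {"scopDom1": {("A", i) for i in range(11, 101)}}
pairs = domain_overlaps(cath, scop)
sf = superfamily_overlaps(pairs, {"cathDom1": "3.30.70.141"}, {"scopDom1": "d.58.6"})
print(pairs, sf, mutual_best_pairs(sf))
```

A silver-like test could be layered on top by additionally requiring both superfamily fractions and the average residue overlap to be at least 0.8, following the criteria described above.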
2.3.1 FUGUE
FUGUE is a sequence-structure homology recognition program developed by the Blundell group (University of Cambridge, UK) in 2001 [11]. The program is freely available at http://mizuguchilab.org/fugue/. The purpose of FUGUE is to
The Genome3D Consortium for Structural Annotations of Selected Model Organisms
recognize distant (or remote) homologues by sequence-structure comparison.
Principle of FUGUE
FUGUE implements a global-local algorithm to align a sequence-structure pair when the two differ greatly in length; in other cases, FUGUE uses the global algorithm [11]. The characteristic feature of the FUGUE method is that it accounts for the structural environments of residues in protein 3D structures while performing a sequence-structure alignment. The structural environment of amino acid residues (such as solvent exposure and secondary structure) is known to affect both the conservation of protein residues in evolution and the probability of insertions/deletions. To take this into account, FUGUE uses BLOSUM-like substitution tables and methodology. It also uses structure-dependent gap penalties and environment-specific substitution tables (ESSTs), where scores for amino acid matching and insertions/deletions are evaluated based on the local environment of each amino acid residue in a known structure [11]. The gap penalty at each position of the protein 3D structure is determined based on (1) its position relative to the secondary structure elements (SSEs), (2) the conservation of the SSEs, and (3) solvent accessibility. The FUGUE algorithm is described in detail in [11].
Salient Features of the FUGUE Method
1. Automated selection of the sequence-structure alignment algorithm, with information on structure-dependent gap penalties at every position of the alignment.
2. Improved environment-specific substitution tables (ESSTs).
3. FUGUE combines data from multiple sequences and multiple structures, which improves the quality of both the alignment and homology recognition.
4. TOCCATA database and FUGUE. For a particular query sequence (or an input sequence alignment), FUGUE performs a search against a database of structural profiles (a structural profile is a scoring matrix derived from prealigned homologous sequences and calculated using environment-specific substitution tables). In the original FUGUE implementation, this role was played by the HOMSTRAD database.
More recently, the TOCCATA database, also developed by the Blundell lab (University of Cambridge, UK), is queried using FUGUE. A snapshot of the curated data is available at http://structure.bioc.cam.ac.uk/toccata/. TOCCATA is a relational database of structure-based alignments of homologous protein families that tries to map the correspondences between CATH and SCOP annotations and to generate a consensus profile, which is its basic unit of organization. Each consensus profile consists of an alignment of representative PDB
protein structures grouped by consensus SCOP/CATH classification. TOCCATA generates both single-domain and multidomain profiles. For each PDB entry, TOCCATA provides annotations such as (a) conformational states: binding status to biologically relevant ligands (i.e., ligand-binding state) and other chains (i.e., oligomeric state), and (b) experimental quality: based on the Q-score (a quality score that takes into account the resolution, the R-factor, and the extent of missing residues). The TOCCATA website allows users to search a query protein sequence against the TOCCATA database of profiles. Thus, for a particular query sequence, FUGUE is employed to generate a target-template alignment using TOCCATA profiles. Subsequently, FUGUE calculates the sequence-structure compatibility scores and generates an output comprising a list of potential homologues and alignments, based on Z-scores. The profiles with the highest Z-scores are used in the downstream protein modeling pipeline, VIVACE, as described below.
2.3.2 VIVACE
VIVACE is an automated pipeline for modeling of the structural proteome of organisms (developed at Blundell’s lab).
Principle
TOCCATA and FUGUE are the key resources of the VIVACE pipeline. The pipeline processes the FUGUE output to build multitemplate models using the MODELLER program [28]. The models are annotated using the XSuLT program [29]. Briefly, the VIVACE pipeline works as follows: the FUGUE output is used to select a set of up to five of the best templates from TOCCATA, based on criteria such as coverage of the query sequence, similarity of the template(s) to the query and to each other, conformational compatibility, and crystallographic quality. The selected templates are first aligned using the BATON program, which is an updated version of the COMPARER program [30]. The FUGUE program is then used to incorporate the query into this template alignment. The resultant query-template alignment is used as input to the MODELLER program to build a model. The quality/reliability of this model is then evaluated using a range of quality assessment programs.
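The template-selection step described above can be pictured with a short Python sketch. This is only an illustration under stated assumptions, not the VIVACE code: the `Hit` record, the field names, the thresholds (`min_z`, `min_coverage`), and the ranking by Z-score followed by a quality score are all hypothetical. The only detail taken from the text is that at most five templates are retained and that coverage, conformational state, and crystallographic quality are among the criteria.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    template: str    # template identifier (hypothetical)
    z_score: float   # FUGUE Z-score of the hit
    coverage: float  # fraction of the query covered by the template
    state: str       # conformational state label, e.g. "apo"
    quality: float   # crystallographic quality, higher is better (assumption)


def select_templates(hits, state, max_templates=5, min_z=6.0, min_coverage=0.5):
    """Pick up to max_templates hits in one conformational state.

    All thresholds are illustrative placeholders.
    """
    candidates = [h for h in hits
                  if h.state == state
                  and h.z_score >= min_z
                  and h.coverage >= min_coverage]
    # Best Z-score first; quality breaks ties.
    candidates.sort(key=lambda h: (h.z_score, h.quality), reverse=True)
    return candidates[:max_templates]


hits = [
    Hit("tmplA", 12.0, 0.90, "apo", 0.8),
    Hit("tmplB", 10.0, 0.40, "apo", 0.9),           # rejected: low coverage
    Hit("tmplC", 9.0, 0.80, "ligand-bound", 0.7),   # rejected: other state
    Hit("tmplD", 8.0, 0.70, "apo", 0.6),
]
print([h.template for h in select_templates(hits, "apo")])
```

Running the selection once per conformational state would mirror the idea of building separate models for different functional states, although how VIVACE actually organizes this is not specified here.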
Salient Features of the VIVACE Pipeline
The unique feature of the pipeline is that it generates models in the native state as well as in all possible functional states, that is, ligand-bound and oligomeric states. The pipeline generates models (where appropriate templates/profiles are available) for the entire structural proteome in a single batch run. VIVACE has been successfully used to build models of the proteome of Mycobacterium tuberculosis [31]. It has also been used for Mycobacterium abscessus (www.mabellinidb.science), Mycobacterium leprae (HANSEN; website under development), and Pseudomonas aeruginosa (website under development).
Tutorial on FUGUE and VIVACE at Genome3D
This section provides a brief overview of the outputs generated by TOCCATA, FUGUE, and VIVACE on the Genome3D website, using a query protein. A similar tutorial is also available on the Genome3D website (http://genome3d.eu/tutorials/page/Public/Page/Tutorial/FUGUE_VIVACE).
Step 1: Search the Genome3D Home Page: Genome3D allows users to perform both keyword-based and sequence-based searches. Perform a keyword search for the query protein, for example, using Uniprot ID: Q8N427 (Thioredoxin domain-containing protein 3). This generates a result page, as shown in Fig. 2.
Step 2: Explore the FUGUE Output: Click on one of the FUGUE links, that is, FUGUE-SCOP or FUGUE-CATH, under the section “Predicted domains.” Please note that both of these links lead to the same result page (since TOCCATA has an internal mapping between the two annotation sources), which
Fig. 2 The main result page at Genome3D (for query: Uniprot ID: Q8N427). This example is shown as a part of FUGUE/VIVACE tutorial
provides a detailed account of the FUGUE-based search against the TOCCATA profiles. The first column lists TOCCATA profile hits, whose names are composed of SCOP and CATH labels. The second column gives the length of the profile. The third column gives the Z-score, which indicates the significance of a profile for the given region of the protein sequence (indicated in the last column). A Z-score > 6 is color-coded blue, indicating high confidence, while a Z-score > 4 is color-coded green, indicating that the hit is likely to be significant. The last column indicates the range of the protein that is covered by each profile.
Step 3: Explore the TOCCATA Profile/Output: Click on a TOCCATA profile, for example, “d.58.6.0-3.30.70.141.” The result page of a TOCCATA profile consists of the following three sections:
– An Alignment of Representative Sequences in the JOY Format: The JOY program [32] uses typography and formatting to incorporate protein structural information into an alignment of protein sequences. Thus, the JOY alignment format facilitates comparison and inspection of protein 3D structures, often without the need to visualize them. The basic JOY format includes information such as solvent exposure, secondary structure, and main-chain hydrogen bonding, as shown in Fig. 3.
Fig. 3 JOY alignment of representative sequences in the TOCCATA database (profile: d.58.6.0-3.30.70.141). The figure also indicates related profiles
– List of Related Profiles: For the profile entry, that is, profile d.58.6.0-3.30.70.141, a few related profiles are listed. For example, 3.30.70.141 is shown, as this superfamily is also a part of the d.58.6.0-3.30.70.141 profile, as shown in Fig. 3. TOCCATA has generated these two distinct profiles because the protein structures in 3.30.70.141 are not characterized in SCOP. (Please note that TOCCATA also lists related profiles if one of the elements of a particular profile is a part of multidomain patterns.)
– A Tabulated List of All Chains and Domains of a Profile: This lists all PDB entries in this profile, clustered at various identity thresholds (such as 50%, 70%, 90%, and 95%). Chains belonging to the same cluster (at the given threshold) are indicated with the same cluster color in the table. The table also provides annotations on experimental data and conformational status. At the bottom of the page, TOCCATA provides an option (a button) to align a (user-defined) sequence to a particular profile of interest. It generates a result page with the Z-score of the alignment and an option to customize the selection of templates from the current or related profiles.
Step 4: Explore the VIVACE Prediction (Fig. 4): Click on the “Model and alignments” tab in the main FUGUE output page. It provides an interactive protein 3D model for each predicted domain, together with an XSuLT representation of the alignment used to generate the model (Fig. 4). XSuLT is a program that annotates protein sequence alignments [29]. In addition to the features provided at the TOCCATA website, the XSuLT program provides annotations on secondary structure and a disorder prediction for the sequence, which is represented as a colored line on top of the modeled sequence. The XSuLT representation is as follows: alpha helix (red), beta strand (blue), 3₁₀ helix (maroon), disulfide bond (cedilla); positive phi torsion angle (italic); hydrogen bond to main-chain amide (bold); hydrogen bond to main-chain carbonyl (underline); solvent accessible residues (lower case); and solvent inaccessible residues (upper case).
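When FUGUE hits are processed outside the web page, the Z-score color coding described in Step 2 can be reproduced with a few lines of Python. The thresholds (Z-score > 6 and Z-score > 4) come from the tutorial text; the function name and the label used for hits below both thresholds are assumptions, since the text does not say how such hits are displayed.

```python
def fugue_confidence(z_score):
    """Map a FUGUE Z-score to the confidence class used on the result page."""
    if z_score > 6:
        return "blue: high confidence"
    if z_score > 4:
        return "green: likely significant"
    # The tutorial does not describe hits at or below 4; this label is a placeholder.
    return "unclassified"


for z in (7.3, 5.1, 2.0):
    print(z, "->", fugue_confidence(z))
```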
2.4 Methods/Resources That Utilize SCOP Data
PHYRE2 [8, 9] and SUPERFAMILY [10] utilize SCOP domains for protein structure prediction and annotations, as detailed below.
2.4.1 PHYRE2
PHYRE2 (Protein Homology/analogY Recognition Engine) is a web server that provides a suite of tools for prediction and analysis of protein structure, function/mutations [8]. The PHYRE method
Fig. 4 The VIVACE-based predicted model for the query protein (Q8N427). The alignment of the query protein with templates is indicated in the XSuLT representation as follows: alpha helix (red), beta strand (blue), 3₁₀ helix (maroon), disulfide bond (cedilla); positive phi torsion angle (italic); hydrogen bond to main-chain amide (bold); hydrogen bond to main-chain carbonyl (underline); solvent accessible residues (lower case); solvent inaccessible residues (upper case)
was originally developed in 2009 by the Sternberg group at Imperial College London, UK [9]. PHYRE2, an updated version of PHYRE, uses advanced remote homology detection methods for protein structure prediction [8]. For a user-defined query protein sequence, the web server also provides various tools to analyze the effects of missense mutations on phenotype and to predict ligand-binding sites. The PHYRE2 server is available at http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index.
Principle of PHYRE2
PHYRE2 predicts the 3D structure of a protein sequence using the alignment of hidden Markov models (HMMs) [8]. For a query protein’s sequence, it first detects known homologues by using the HMM-HMM alignment technique implemented in HHblits (HMM-HMM–based lightning-fast iterative sequence search) [33]. In particular, HHblits is used to search the query sequence against a sequence database (wherein no pair of sequences shares >20% identity), to generate a sequence profile for the query
sequence. This sequence profile, together with the PSI-PRED-based secondary structure prediction [34], is used to construct the HMM of the query sequence. The HMM of the query protein sequence is then scanned against a fold library (i.e., a database of precompiled HMMs of known 3D structures) via the HHsearch alignment algorithm [35]. This fold library is updated weekly and is based on PDB and SCOP data. HHsearch-based fold library scanning thus generates a list of query-template alignments that are ranked based on their posterior probabilities. The top 20 ranked alignments are used to generate backbone models, which are subsequently processed for loop modeling. PHYRE2 also allows modeling of multidomain proteins using the “intensive” mode option. Multiple template modeling is achieved by using an ab initio protein-folding simulator program named Poing [36]. Poing also allows ab initio modeling, that is, modeling regions of a protein with no detectable homology to templates. Once a set of models has been built for the individual domains of a query protein, pairwise Cα-Cα distances are treated as linear inelastic springs in Poing. Protein regions that are not covered by templates are then modeled using the ab initio strategy implemented in the Poing algorithm. Subsequently, the new complete protein model is synthesized from a virtual ribosome. Finally, side-chain fitting is done using the R3 protocol [37]. The details of each of these stages of protein structure prediction are described in [8].
Salient Features of PHYRE2
PHYRE2 provides a set of tools and advanced utilities for protein structure prediction and function/mutation analyses, as listed below:
Multiple Template Modeling: PHYRE2 provides an “intensive mode” option to perform modeling of multidomain proteins. Multiple template modeling is achieved by using Poing, that is, an ab initio protein-folding simulator. Poing also allows ab initio modeling, that is, modeling regions of proteins with no detectable homology to known protein structures.
BackPhyre: For a user-defined “single-chain” protein 3D structure (in PDB format), BackPhyre allows users to check whether homologous sequences exist in genomes of interest [8]. Currently, BackPhyre allows searches against the genomes of 30 distinct species, including model organisms (the list of organisms is available at http://www.sbg.bio.ic.ac.uk/phyre2/html/help.cgi?id=help/backphyre). The BackPhyre result page provides a ranked list of hits from such genomes and also provides links to the respective alignments.
Phyre Alarm: PHYRE2 updates its fold library on a weekly basis, adding ~100 newly deposited structures per week. The Phyre Alarm feature is designed to identify possible confident matches
where no match has previously been found for query proteins. Thus, the Phyre Alarm service allows users to submit a query protein sequence to be automatically scanned against such newly deposited entries in the fold library. If a confident match is found, users are then notified by email together with the results of modeling.
One-to-One Threading: This feature allows users to upload both the query sequence and a user-defined template to model their query protein. In such a case, PHYRE2 provides an output comprising an alignment and the model, together with the confidence of the match.
Phyre Investigator: This allows users to interactively examine various features of the query sequence as well as the modeled protein. These features include the following: model quality checks based on ProQ2 and MolProbity (for clashes, rotamers, and Ramachandran analyses), prediction of the effects of mutations using the SuSPect program, disorder prediction (by DISOPRED), catalytic site (by CSA) and pocket detection (by fpocket2), confidence of alignment (based on HHsearch), conservation analysis (based on Jensen-Shannon divergence), and interface detection (using PI site and protindb databases). Phyre Investigator also allows detection of features using the Conserved Domain Database [8].
Batch Processing: This feature allows users to submit more than one query sequence to the Phyre2 modeling pipeline. Users can upload a file containing query protein sequences (in FASTA format). By default, users are allowed to upload a maximum of 100 sequences in a batch; however, this limit can be raised upon request.
PhyreRisk: This is a new dynamic web resource that presents PDB and Phyre-predicted structures for the human proteome. There is a dynamic sequence/structure link. Central to PhyreRisk is the facility to map genetic variants from either genomic or proteomic locations onto the structures.
Phyre2 Tutorial (Using Phyre2 Link at Genome3D)
This tutorial provides a brief overview of the output generated by Phyre2 on the Genome3D website, using a query protein such as the human ATP-sensitive inward rectifier potassium channel 11 (Gene Name: KCNJ11; Uniprot ID: Q14654). A mutation (Arg-221-His) in this protein is known to be associated with neonatal disease. No experimental structure is available for this human protein, and hence, a predicted structure would help to understand the structural implications of this mutation. A similar tutorial is also available on the Genome3D website (http://genome3d.eu/tutorials/page/Public/Page/Tutorial/PHYRE2).
Fig. 5 Example output at Genome3D (Uniprot ID: Q14654). This example is shown as a part of the Phyre2 tutorial
Step 1: Search at the Genome3D Home Page: Genome3D allows users to perform both the keyword-based and sequence-based searches for the particular query protein. Perform keyword search for the query protein, for example, using Uniprot ID: Q14654. It generates a result page which indicates: “Displaying 1 matching UniProt entry.” You can find the main Genome3D output at either of the following links: “IRK11_HUMAN” link (under the gene column) or the link at number 8 (under the Structural Predictions column). The main result page at Genome3D is shown in Fig. 5.
Step 2: Explore the Phyre2 Output: Click on the Phyre2 link, which is shown under the “Predicted domains” or “Predicted 3D Structures” section. It links to the precomputed Phyre2-based results for the query (Uniprot ID: Q14654), the details of which are described below. Phyre2 outputs a comprehensive summary of the protein structure prediction as well as associated annotations. The Phyre2 result page is subdivided into the following subsections:
– Summary: This subsection provides information on the highest confidence single model, as shown in Fig. 6. It includes details of the template (PDB ID, chain, etc.), confidence in homology, % sequence coverage of the query sequence that is modeled, and an image of the model together with an option to
Fig. 6 The Phyre2 result (subsection: summary). The result is obtained for the query protein: Uniprot ID: Q14654
visualize the model using JMol (supported browsers: all except Internet Explorer). The summary results for the query protein (Uniprot ID: Q14654) are shown in Fig. 6.
– Sequence Analysis: This provides a link to view the PSI-BLAST pseudo-multiple sequence alignment. It provides the results of scanning the input protein sequence against the nonredundant protein sequence library by PSI-BLAST.
– Secondary Structure and Disorder Prediction: This subsection provides sequence-based predictions obtained by the PSI-PRED [34] and DISOPRED [38] programs, as shown in Fig. 7. It provides predictions such as the locations of alpha helices, beta strands, and disordered regions (indicated by a question mark symbol). The confidence key indicates the confidence of the prediction of each state for every position in the sequence. The confidence key is shown using a rainbow color code, where red indicates high confidence and blue indicates low confidence.
– Domain Analysis: This subsection provides a snapshot of which regions of the query protein have been modeled. It thus allows users to visualize the approximate domain structure of their query protein. Phyre2 provides results for the top 20 hits. For every hit, users can visualize which regions of the query have been matched to templates, along with the confidence scores (color coded) of the match.
Fig. 7 The Phyre2 result (subsection: Secondary structure and disorder prediction). The result is obtained for the query protein: Uniprot ID: Q14654
– Detailed Template Information: This subsection provides a table with the ranked list of templates. For every template, the table provides information such as the query-template alignment, alignment coverage (i.e., the length of the matched region between the query and template), an image of the model, the confidence score, and % identity. Users can also download the PDB file of the modeled protein by clicking on the image of the respective model. At the bottom of the table, Phyre2 also provides an option to superimpose the selected models.
– Phyre Investigator: Please note that Genome3D is a database that stores precomputed modeling and annotation results. Users can re-run the main Phyre2 server (http://www.sbg.bio.ic.ac.uk/phyre2/) in order to explore the most up-to-date results for this query. The current version of Phyre2 provides certain additional features (in addition to those mentioned above) such as Phyre Investigator, prediction of the effects of mutations using the SuSPect program [39], and transmembrane site prediction.
Fig. 8 Example output obtained by Phyre Investigator utility at Phyre2 server
The new version of Phyre2 provides a link to “run Phyre Investigator.” This option is available for every model listed in the table “Detailed template information.” For the chosen model, users can run this option, which provides the following useful information: predictions pertaining to model quality assessment, pocket detection and interface prediction, conservation analysis, and predicted effects of mutations using the SuSPect server (http://www.sbg.bio.ic.ac.uk/~suspect/). As shown in Fig. 8, for a particular model entry, Phyre Investigator provides links to check quality (ProQ2 quality assessment, clashes, rotamers, Ramachandran analyses, alignment confidence, and disorder) and to assess function (conservation, pocket detection, and mutational sensitivity prediction based on the SuSPect server).
2.4.2 SUPERFAMILY
The SUPERFAMILY database contains structural and functional annotation for proteins from completely sequenced genomes and major sequence collection resources [40, 41]. The structural annotations are provided using hidden Markov models (HMMs) based on the Structural Classification of Proteins (SCOP) domain definitions at the superfamily level. The superfamily level groups together the most distantly related proteins that have a common evolutionary ancestor and is therefore useful for remote homology detection. Currently, the database contains annotations for 63,244 and 102,151 complete genomes taken from UniProtKB [42] and the NCBI reference collection [43], respectively. The database contains about 51 and 45 million distinct protein sequences obtained from UniProtKB and NCBI,
respectively. Recently, the SUPERFAMILY HMM library 2.0 [41] was built by expanding the HMM library 1.75 to include domain sequences taken from the structural domain databases SCOPe [44], CATH [45], and ECOD [46], and full-length PDB sequences [47]. The database and its services are available through the new SUPERFAMILY 2.0 website, http://supfam.org.
Principle
SUPERFAMILY uses HMMs for the structural annotation of protein sequences, as they have been shown to offer better selectivity than pairwise search and profile-based sequence comparison methods [48]. SUPERFAMILY adopts the strategy of building multiple HMMs from different single seed SCOP superfamily sequences and their homologues, as opposed to using one model built from an expert alignment of selected diverse sequences [49, 50], and this approach has been shown to produce better results in terms of homology detection [40]. The seed sequences (called SUPERFAMILY sequences) used to build models are based on the sequences found for each SCOP superfamily in the ASTRAL [51] database filtered at 95% sequence identity. All models in the SUPERFAMILY 1.75 library were built using the SAM T99 iterative procedure described elsewhere [40, 52]. Given the model library, the SUPERFAMILY assignment procedure is used to provide structural domain annotations at the superfamily level for all complete genomes and sequence sets in the database [40]. In addition, a hybrid family-level assignment procedure is used to identify the specific family to which the domain belongs [53]. The closest structure derived from the family assignment procedure is used as a potential template for comparative model building using the MODELLER package [28]. The 3D models built in this way, and their corresponding domain annotations obtained using the SUPERFAMILY 1.75 library, were submitted to the Genome3D resource.
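To make the idea of turning HMM hits into domain annotations more concrete, the Python sketch below resolves a list of hits into non-overlapping domain assignments. It is a simplified stand-in and not the published SUPERFAMILY assignment procedure [40]: the greedy lowest-E-value-first rule, the `max_evalue` threshold, and all E-values in the example are assumptions. Only the region 241-334 and the superfamily identifier 47095 correspond to the O15405 example used later in this chapter.

```python
def assign_domains(hits, max_evalue=1e-4):
    """Greedily keep the most significant, mutually non-overlapping hits.

    hits: iterable of (start, end, superfamily_id, evalue), 1-based inclusive.
    max_evalue is a hypothetical significance threshold.
    """
    chosen = []
    for start, end, superfamily, evalue in sorted(hits, key=lambda h: h[3]):
        if evalue > max_evalue:
            break  # hits are sorted, so all remaining ones are weaker
        # Accept the hit only if it does not overlap an already accepted one.
        if all(end < s or start > e for s, e, _, _ in chosen):
            chosen.append((start, end, superfamily, evalue))
    return sorted(chosen)  # report assignments from N- to C-terminus


example_hits = [
    (241, 334, "47095", 1e-20),   # HMG-box region; the E-value is made up
    (250, 320, "sf_other", 1e-6), # hypothetical weaker, overlapping hit
    (10, 80, "sf_weak", 1e-3),    # hypothetical hit above the threshold
]
print(assign_domains(example_hits))
```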
Salient Features
The SUPERFAMILY database and its tools were originally developed to answer evolutionary problems involving the most distant homologies. It has been widely used for genome annotation and for the analysis and interpretation of individual protein sequences. The new SUPERFAMILY 2.0 website [41] provides various features originally derived from the legacy website (http://supfam.org/SUPERFAMILY/), including keyword search based on sequence IDs (e.g., UniProt and NCBI sequences), domain assignment of user-submitted sequences, analysis of sequences with a selected domain architecture in genomes, and visualization of the phylogenetic distribution of domain architectures across the tree of life [54]. More sophisticated features are available from the legacy website, including the alignment details for all protein assignments, searchable domain combinations in groups of genomes, detection of over- or under-represented superfamilies in a given genome by
comparison with other genomes [55], and the visualization of the distribution of domains across the major taxonomic kingdoms [56]. In addition, it also provides ontology-based annotations for SUPERFAMILY domains and architectures [57].
Tutorial on SUPERFAMILY at Genome3D
This short tutorial highlights some of the features of SUPERFAMILY accessed through the Genome3D portal, using a query human protein called TOX high mobility group box family member 3 (Gene name: TOX3; UniProt ID: O15405; sequence length: 576). The TOX3 gene is responsible for protecting against cell death by functioning as a transcriptional coactivator of the p300/CBP-mediated transcription complex. Over 100 diseases, including cancer, have been associated with TOX3 in the Open Targets platform (https://www.targetvalidation.org/target/ENSG00000103460/associations). So far, no experimental structure has been reported for TOX3. The steps below explore this query protein using Genome3D and its links to the SUPERFAMILY resource. A similar tutorial is also available on the Genome3D website: http://genome3d.eu/tutorials/page/Public/Page/Tutorial/SUPERFAMILY.
Step 1: Search at the Genome3D Home Page
Open your favorite web browser and navigate to the Genome3D website (http://genome3d.eu/). Using the search bar at the top of the page, search for “O15405” and click on the first gene in the results, titled TOX3_HUMAN (http://genome3d.eu/uniprot/id/O15405/annotations). The results page shows an overview of the predicted domains from the various Genome3D partners (Fig. 9). It contains links to six predicted domain annotations and three predicted 3D structures for TOX3 obtained from various Genome3D partners. The right of the page shows each of the predicted superfamilies and the structural classification resource (SCOP or CATH) from which it comes. Note the gold, silver, and bronze ratings, which indicate the degree of agreement between the SCOP and CATH classifications. The numbers along the bottom of the “Predicted Domains” section mark the protein sequence position, and the colored bars show each Genome3D partner’s domain predictions. From the figure, it is evident that all SCOP- and CATH-based partners predict an HMG-box domain between residues 200 and 400. The predicted 3D structure based on the domain annotations is available for viewing and download under the “Predicted 3D Structures” section of the main results page.
Step 2: Explore the SUPERFAMILY Output
The domain assignments can be further explored through the SUPERFAMILY online resource by clicking on the “SUPERFAMILY” link to the right of the domain predictions (http://supfam.org/genome/up/sequence/O15405). This
Fig. 9 An example output of the Genome3D annotation page for the protein TOX high mobility group box family member 3 (Uniprot ID: O15405). This example is shown as a part of SUPERFAMILY tutorial
page lists domain assignments for O15405 found in the UniProt sequence collection (http://supfam.org/genome/up) in the SUPERFAMILY database. The page shows assignment details, including the SCOP HMG-box superfamily and family assignments and their corresponding E-values, along with protein sequence details (Fig. 10). It also suggests the closest structural homologue, 1cg7 (http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?ver=1.75&key=16428&search_type=scop), for the predicted superfamily domain. The closest structure is used as a template to build a 3D comparative model, and the generated 3D model is submitted to Genome3D. The “Alignment” link (https://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/cgi-bin/align.cgi?model=0044636;sf=47095;cgi_up_O15405_241-334=1;seed=1;local=Local) takes the user to the SUPERFAMILY legacy alignment page showing the alignment of the annotated UniProt sequence and the seed sequence with the HMG-box superfamily HMM
Fig. 10 SUPERFAMILY domain assignment for the UniProt sequence ID O15405. The whole length of the sequence is shown as a gray line, and the colored bar on the line shows the predicted SUPERFAMILY domain for that region in the sequence. The pictorial representation is complemented by the “Domain Assignments Details” table that shows the assignment details including the start and the end region of the assignment, predicted SCOP superfamily, and family domains along with their respective E-values. It also shows the link to the closest structural domain in the SUPERFAMILY model library as well as the link to the alignment page showing the alignment between the annotated UniProt sequence and the HMG-box superfamily model
0044636 (http://supfam.org/SUPERFAMILY/cgi-bin/model.cgi?model=0044636) (Fig. 11). It is worth noting that each model is created from a seed sequence which is aligned to many superfamily homologues. The alignment page has sophisticated options to refine alignments, add alignments from genomes, and add sequences to the alignment. The domain assignment page also has a link to view all protein sequences in the UniProt sequence collection that contain HMG-box superfamily (http://supfam.org/genome/up/sf/47095) and family (http://supfam.org/genome/up/fa/47096) domain annotations. The user can browse other genomes in the SUPERFAMILY database for proteins containing the identical domain architecture
The Genome3D Consortium for Structural Annotations of Selected Model Organisms
Fig. 11 Superfamily alignment showing the alignment of annotated UniProt sequence (ID: O15405, aligned region: 241–334) and seed sequence (d1v64a_ used to build the model) with the HMG-box superfamily hidden Markov model number 0044636. The alignment page also provides advanced options to refine alignments, add alignments from genomes, and add sequences to the alignment
(i.e., the ordered list of predicted superfamily domains from the N- to C-terminus of the protein sequence) by clicking the link labeled “Other proteins with this domain architecture” (http://supfam.org/allcombs/_gap_,47095,_gap_) (Fig. 12). The link “See the phylogenetic distribution of this domain architecture” (https://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/cgi-bin/createtree.cgi?tophl=1;highlight=arc__gap_,47095,_gap_) allows the user to view the phylogenetic tree of other species that have genes with this identical architecture. Green branches in the tree indicate assignment to the architecture, and blue branches indicate no assignment.
2.5 Methods That Use CATH Domains for Structure Annotations/Predictions
Gene3D [5, 58], pDomTHREADER [7], and Domserf [6] are resources that use CATH domain sequences for the development of fold and protein 3D structure prediction/annotation resources, as described below.
2.5.1 Gene3D
Gene3D was developed in 2002 by the Orengo group [5]. The primary focus of Gene3D is to provide comprehensive structural domain assignments and functional annotation for sequences of proteins, available from major protein sequence databases such as
Fig. 12 List of genomes that have protein sequences with a specific domain architecture containing the HMG-box superfamily domain. For clarity, the genomes are organized into various headings including eukaryotes, prokaryotes, strains, metagenome, pseudo gene, and other sequence collections including UniProt and NCBI reference sequence database
UniProt, RefSeq, Integr8 [59], and Ensembl [60]. The Gene3D resource is available at http://gene3d.biochem.ucl.ac.uk/.
Principle
Gene3D uses CATH domain family information from PDB structures to assign a structural family to the millions of protein sequences for which no experimental PDB structure is available. Gene3D generates a library of profile hidden Markov models (profile HMMs) from CATH domain sequences using jackhmmer and scans them against various protein sequence databases (such as UniProt and RefSeq). Briefly, the Gene3D annotation pipeline is as follows:
1. Generation of a library of profile HMMs using representatives from each CATH Superfamily: The members of each Superfamily are subclustered at a 35% sequence identity cutoff, and one or more structural representative (S-rep) sequences are chosen from each cluster. On average, each Superfamily contains four S-reps, although many superfamilies contain only one. Each S-rep is provided as a seed to an iterative homology search algorithm, which searches a nonredundant database and thereby generates a profile of homologous sequences. After every iteration, newly identified homologous sequences are aligned with the previously identified homologues using the MAFFT program [61] and are used to generate a new multisequence profile model. At the end of the iterative search, the final alignment of homologues (multisequence profile) is converted to an HMM (using HMMER3). Thus, a library of corresponding profile HMMs is generated.
2. Query protein sequences are then searched against this library of profile HMMs, which creates a set of potential domain assignments. Potentially overlapping domain matches are processed further using an in-house program, DomainFinderv3 [62–64], to provide confident single-domain boundaries (based on E-values).
The detailed protocol of Gene3D is published in [5, 58].
Salient Feature of Gene3D
Gene3D maps structural annotations from known PDB structures in CATH Superfamilies to millions of protein sequences for which experimental structures are not resolved. It thus gives valuable information about the ensemble of protein sequences that share a homologous relationship with a particular CATH domain family. Gene3D provides accurate structural domain assignments for ~100 million protein sequences and for more than 1000 genomes [65].
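The second step of the pipeline, in which overlapping HMM matches are reduced to a single set of domain boundaries on the basis of E-values, can be illustrated with a simple greedy strategy: accept the hit with the lowest E-value and discard any later hit that overlaps an accepted one. This is only a sketch of the general idea and is not DomainFinder itself, whose algorithm is described in [62–64]. The `Hit` record, the `resolve_overlaps` function, the `max_overlap` tolerance, and the example hits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One HMM match on the query (hypothetical record layout)."""
    superfamily: str  # CATH superfamily ID, e.g. "3.40.710.10"
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    evalue: float


def overlap(a: Hit, b: Hit) -> int:
    """Number of residues shared by two hits (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def resolve_overlaps(hits, max_overlap=0):
    """Greedy resolution: visit hits from lowest to highest E-value and keep
    a hit only if it overlaps every already-kept hit by at most max_overlap
    residues. Returns the kept hits ordered from N- to C-terminus."""
    accepted = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if all(overlap(hit, kept) <= max_overlap for kept in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h.start)


# Illustrative hits on a 377-residue query; all values are invented.
hits = [
    Hit("3.40.710.10", 25, 377, 1e-80),
    Hit("3.40.710.10", 30, 200, 1e-20),  # overlaps the stronger hit above
    Hit("1.10.10.10", 1, 20, 1e-3),      # does not overlap anything
]
architecture = resolve_overlaps(hits)
for hit in architecture:
    print(hit.superfamily, hit.start, hit.end, hit.evalue)
```

A greedy pass is easy to follow but does not guarantee the best overall coverage of the query; a production tool may optimize over all compatible sets of hits instead.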
Tutorial on Gene3D/CATH at Genome3D
This tutorial will enable users to search with a user-defined protein query in Genome3D and explore the Gene3D/CATH output. This tutorial focuses on the beta-lactamase protein (UniProt ID: P00811; sequence length: 377 amino acid residues), which is known to confer penicillin resistance in a pathogenic bacterium.
Step 1: Search at the Genome3D Home Page: Perform a keyword search at the Genome3D home page using the UniProt accession ID P00811. This returns the Genome3D output shown in Fig. 13. The CATH Superfamily information is shown, and a corresponding link to the CATH database is provided at the right-hand corner.
Step 2: Explore the Gene3D Output: Under the “predicted domains” section, users can find the Gene3D link (http://gene3d.biochem.ucl.ac.uk/protein?smd5=0eccaf784fe93dfc9c1af8dd4a22bc94). Gene3D provides annotations pertaining to CATH domain assignments, functional annotations, information about associated drugs, and related sequences of the query protein from major protein sequence databases such as UniProt and Ensembl. These data can be accessed directly from the column at the left side of the Gene3D output, as shown in Fig. 14.
Fig. 13 Genome3D result page for UniProt ID: P00811 (part of the Gene3D tutorial). The beta-lactamase protein (UniProt ID: P00811) belongs to CATH Superfamily 3.40.710.10 (the beta-lactamases/DD-peptidases Superfamily)
Fig. 14 Gene3D output for the query beta-lactamase protein (UniProt ID: P00811). The query protein belongs to CATH Superfamily 3.40.710.10
Fig. 15 CATH Functional family (ID: 22208) for the query protein (UniProt ID: P00811)
– Explore Drug Information: Users can search for associated drug information, which indicates that there are two approved drugs (Cefalotin and Cloxacillin) and a total of 36 putative drugs for which experimental evidence is known.
– Explore the Corresponding CATH Functional Family: Gene3D provides the link to the associated Functional Family (FunFam). This can be obtained by clicking on “CATH domain family predictions” (under the Domain View section), which provides a link to the corresponding CATH domain, that is, 2ffyA00 (http://www.cathdb.info/version/latest/domain/2ffyA00). This page provides a link to the corresponding functional family, that is, Class C beta-lactamase CMY-10, which has FunFam ID: 22208 (Fig. 15).
Step 3: Explore the CATH Output: The Genome3D output page (Fig. 13) provides a link to CATH Superfamily 3.40.710.10, which opens the main summary page for this Superfamily in the CATH database (Fig. 16). The different levels of the hierarchy for the Superfamily can be found by using the Superfamily link at the left-hand side (Fig. 17). The CATH summary page (Fig. 16) provides a comprehensive annotation summary pertaining to members of this Superfamily (3.40.710.10). The annotation statistics include unique Gene Ontology annotations, EC annotations, species annotations, and Functional Families. The Superfamily summary also provides a detailed account of the number of domains in the Superfamily, unique PDBs, and the number of structural clusters (at 5 and 9 Å).
Fig. 16 CATH result summary for Superfamily 3.40.710.10
Fig. 17 The CATH classification of Superfamily 3.40.710.10
As can be seen from the species diversity plot (users can mouse over the plot), the members of 3.40.710.10 largely belong to bacteria (90.6%). Likewise, users can explore the unique EC annotations (beta-lactamase: 36.3%, serine-type D-Ala-D-Ala carboxypeptidase: 27.7%, etc.).
2.5.2 pDomTHREADER
pDomTHREADER (parametric-DomTHREADER) is a fold recognition algorithm available via the PSIPRED Workbench web server ([12]; http://bioinf.cs.ucl.ac.uk/psipred/).
Principle
pDomTHREADER uses both sequence and structural data to detect relationships to CATH structural domain superfamilies. For an input query protein sequence, the algorithm first generates a profile (position-specific scoring matrix, PSSM) using PSI-BLAST. Using an exhaustive search strategy, this profile is then aligned against a template library of profiles representing domains in the CATH database. The algorithm generates profile–profile alignments that make use of additional features such as secondary-structure-specific gap penalties and classic pair and solvation potentials, the optimal combination of which was trained by linear Support Vector Machine regression [7]. pDomTHREADER was benchmarked using a total of 4008 full-chain queries obtained from the CATH 3.1 S35 representative set [7].
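To make the phrase “optimal combination ... trained by linear Support Vector Machine regression” concrete: once trained, a linear model reduces to a weighted sum of the alignment features plus a bias term, and candidate templates can be ranked by that sum. The sketch below shows only this final scoring step. The feature names, weights, and feature values are invented for illustration and are not the trained pDomTHREADER parameters, which are described in [7].

```python
def linear_score(features, weights, bias=0.0):
    """Weighted sum of alignment features: score = w . x + b."""
    return sum(weights[name] * value for name, value in features.items()) + bias


# Hypothetical weights; a real model would learn these by regression.
weights = {
    "profile_score": 0.5,
    "pair_potential": -0.25,       # lower (more negative) energy is better
    "solvation_potential": -0.25,  # lower (more negative) energy is better
}

# Hypothetical feature values for two candidate template alignments.
candidates = {
    "templateA": {"profile_score": 60.0, "pair_potential": -40.0,
                  "solvation_potential": -8.0},
    "templateB": {"profile_score": 30.0, "pair_potential": -4.0,
                  "solvation_potential": -2.0},
}

scores = {name: linear_score(f, weights) for name, f in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The negative weights on the two potentials reflect the assumption made here that these terms are energies, so that a more favorable (lower) value raises the combined score.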
Salient Features of pDomTHREADER
pDomTHREADER is a reliable and sensitive method for detecting homologous superfamily matches. pDomTHREADER has been shown to outperform other methods such as PSI-BLAST [66], HHPRED [67], and pGenTHREADER [7] at very low error rates.
2.5.3 Domserf Server
Domserf is a fully automated protein-domain homology modeling pipeline, developed by the UCL Bioinformatics Group [6] and available via the PSIPRED Workbench server: http://bioinf.cs.ucl.ac.uk/psipred/.
Principle
Domserf is a homology modeling pipeline which uses a variety of programs for protein domain recognition (including pDomTHREADER [7], PSI-BLAST [66], and DomainFinder) as well as homology modeling using MODELLER [28]. Query protein sequences are first searched against a library of CATH domain superfamilies, using both the PSI-BLAST and pDomTHREADER programs. Significant homologous hits are identified using conservative cutoffs. In the case of PSI-BLAST, a hit is considered significant if the sequence identity is >40% and the E-value passes the 5 × 10⁻⁵ cutoff. In the case of pDomTHREADER, hits are considered significant homologous matches when the sequence identity is >40% and their scores are classed in the “Certain” and “High” prediction categories. For each query sequence, all the significant homologous hits are pooled and analyzed using DomainFinder to determine their multidomain architecture. The default settings of the DomainFinder program are used to resolve the highest scoring set of homologous domains, with minimal overlap and maximum coverage of the input query sequence. This provides the multidomain architecture for every protein analyzed, which corresponds to the highest scoring set of homologous domains. A homology model is then built for each domain using the MODELLER program (default settings), based on the alignments generated by PSI-BLAST or pDomTHREADER in the initial steps.
Salient Feature of Domserf Server
Domserf is an automated modeling pipeline, and the code is available at https://github.com/psipred/bioserf.
3 Additional Structural Annotation Resources
In addition to pDomTHREADER and the Domserf server, the PSIPRED Workbench web server (developed by the Jones Lab) provides an additional set of machine-learning-based tools for protein structure and protein function prediction. Hence, for every entry, Genome3D provides a link to the PSIPRED Workbench at the bottom of each Genome3D result page. The predictive methods available at PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/; [12]) are listed in Table 4. A sample tutorial on the use of some of the available tools, such as DISOPRED, FFPred, and MEMPACK, is provided by Genome3D (link: http://genome3d.eu/tutorials/page/Public/Page/Tutorial/DISOPRED_FFPred_MEMPACK). Analyses using such tools may be useful for a given query protein of interest
Table 4 List of tools for protein structure and function prediction made available at the PSIPRED Workbench web server
Sr. No. | Name of the tool | Type of prediction/annotation (and reference)
1. | PSIPRED (v4.0) | Secondary structure [34]
2. | MetaPSICOV (v2.0) | Structural contacts [68]
3. | MEMSAT-SVM | Transmembrane helix [69]
4. | MEMPACK | Transmembrane helix packing [70]
5. | GenTHREADER | Fold recognition [7]
6. | pGenTHREADER | Fold recognition [7]
7. | pDomTHREADER | Folding domain recognition [7]
8. | DomPred and DOMSSEA | Domain boundary prediction [71]
9. | DISOPRED (v3) | Intrinsic disorder [72]
10. | FFPred (v3) | Eukaryotic GO (Gene Ontology) term [73]
11. | Domserf (v2.1) | Automated homology modeling [6]
12. | BioSerf (v2.0) | Automated homology modeling [6]
when there is limited knowledge available from the predicted model or functional annotation. For example, tools like DISOPRED are useful to analyze the extent of the intrinsically disordered regions in query proteins, as detailed below.
DISOPRED Tutorial: Search Genome3D with the human nuclear protein SkiP (UniProt ID: Q13573) as the query. The “Predicted 3D structures” section shows predicted models generated by VIVACE and Phyre2. However, VIVACE generated a model for only 80 residues, while the query protein is 536 amino acid residues in length. Likewise, Phyre2 predicted models with high confidence over only 4% of the sequence. Phyre2 makes use of DISOPRED to predict intrinsically disordered regions of the protein, and the Phyre2 results indicate that 59% of the protein is intrinsically disordered. Users are also encouraged to perform a prediction using the DISOPRED3 tool at PSIPRED. The DISOPRED prediction results indicate that the majority of the residues are predicted to be intrinsically disordered (marked with a blue box in Fig. 18), and some of these are also predicted to be disordered protein-binding regions (marked with a green box in Fig. 18). The results of the DISOPRED prediction are in agreement with prior NMR-based experimental studies on nuclear protein SkiP [74]. Thus, an additional analysis using DISOPRED may provide information about the disordered structure of the protein and the location of putative disordered protein-binding sites for this query protein.
In addition to methods available via the PSIPRED Workbench, Genome3D users are encouraged to utilize additional useful resources for structure/function annotations, developed by each of the Genome3D partners. These resources are outlined in Table 5.
Fig. 18 The DISOPRED output for nuclear protein SkiP (UniProt ID: Q13573). DISOPRED prediction results indicate that the majority of the residues are intrinsically disordered (marked in blue box), while very few residues are also predicted to be protein binding (marked in green box)
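The coverage figures quoted in the DISOPRED tutorial can be put on a common footing with simple arithmetic, which makes it easier to see how little of SkiP is covered by a confident model compared with how much is predicted to be disordered. The sketch below uses only the numbers stated in the tutorial (536 residues, an 80-residue VIVACE model, 4% Phyre2 coverage, 59% disorder); the function and variable names are hypothetical, and the residue counts derived from percentages are approximate because the percentages are themselves rounded.

```python
def coverage_percent(covered_residues, sequence_length):
    """Percentage of the query sequence covered by a model or annotation."""
    return 100.0 * covered_residues / sequence_length


SKIP_LENGTH = 536  # residues in Q13573, as stated in the tutorial

# VIVACE built a model for 80 residues.
vivace_coverage = coverage_percent(80, SKIP_LENGTH)

# Convert the reported percentages back to approximate residue counts.
phyre2_residues = round(0.04 * SKIP_LENGTH)      # 4% modeled with high confidence
disordered_residues = round(0.59 * SKIP_LENGTH)  # 59% predicted disordered

print(f"VIVACE coverage: {vivace_coverage:.1f}% of the sequence")
print(f"Phyre2 high-confidence region: about {phyre2_residues} residues")
print(f"Predicted disordered: about {disordered_residues} residues")
```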
Table 5 List of additional structural bioinformatics resources developed by Genome3D partner groups
Name of resource | Description | Website URL and reference
Additional resources developed by Blundell group
CHOPIN | A database for predicted structures of Mycobacterium tuberculosis proteome (built using VIVACE pipeline) | http://mordred.bioc.cam.ac.uk/chopin/ [31]
Mabellini | A database for predicted structures of Mycobacterium abscessus proteome (built using VIVACE pipeline) | http://mabellinidb.science
CREDO | A structural interactomics database which provides all pairwise-atomic interactions between all molecules (nucleic acids, proteins, carbohydrates, and small molecules) within macromolecular complexes deposited in PDB | http://marid.bioc.cam.ac.uk/credo [75]
DUET (SDM and mCSM) | DUET webserver combines two prediction methods (namely SDM and mCSM) to assess the effect of missense mutations on protein stability | http://structure.bioc.cam.ac.uk/duet/ [76]
Additional resources developed by Gough group
Proteome Quality Index | A database to provide a measure of assessing the protein quality of proteomes (using 11 metrics) | http://pqi-list.org/
DcGO (Domain centric gene ontology) | A gene ontology database for protein domains at superfamily and family levels (of SCOP) | http://supfam.org/SUPERFAMILY/dcGO/ [57]
D2P2 (Database of disordered protein prediction) | A resource for precomputed results of disorder predictions for proteins from completely sequenced genomes | http://d2p2.pro/ [77]
Additional resources developed by Sternberg group
SuSPect | A server for prediction of phenotypic effects of nonsynonymous point mutations | http://www.sbg.bio.ic.ac.uk/~suspect/ [39]
CombFunc | A gene ontology-based protein function prediction web server | http://www.sbg.bio.ic.ac.uk/~mwass/combfunc/ [78]
3DLigandSite | An automated method for prediction of ligand-binding sites | http://www.sbg.bio.ic.ac.uk/3dligandsite/ [79]
WINARNS | A web server for global alignment of protein–protein interaction networks (PPINs) | [80]
Additional resources developed by Orengo group
FUN-L (FunctionalLists) | A tool for selection/prioritization of target genes for experiments | [81]
PAIN-Networks | A web resource for exploring pain expression experiments and visualization of pain-related gene networks | http://www.painnetworks.org/ [82]
4 Future Perspectives
Genome3D is a unique collaborative structural bioinformatics platform which facilitates comparative assessment of predicted models from the various domain-based protein structure prediction methods that it integrates. Genome3D uniquely provides mappings of CATH and SCOP superfamilies into distinct categories based on their levels of similarity. It thus helps researchers broaden their perspective on structure predictions/annotations and encourages them to analyze the accuracy of the models using various widely used structural bioinformatics resources. Genome3D is essentially distinct from a widely used complementary resource, InterPro [83], which provides sequence-based annotations.
Since the publication of the first version of Genome3D in 2013, several new features have been added, as follows:
1. Provision to automatically submit the source sequence data to a particular group’s resource.
2. A fast “sequence-based” search utility.
3. Expansion of the genome coverage from three model organisms to ten model organisms.
4. Comprehensive tutorials for Genome3D web searches as well as for specific resources integrated at Genome3D.
5. Improvement in CATH/SCOP mapping criteria.
6. Expansion of coverage of protein families in Pfam, by providing annotations for at least one representative from every Pfam family.
7. Improved structural superposition algorithm, which facilitates the assessment of similarities/differences between protein models.
Future Areas for Improvement: The upcoming version of Genome3D (2.1) aims to further improve the website in terms of the following aspects:
1. Expansion of CATH/SCOP mapping using recent versions of CATH and SCOP.
2. Expansion of genome annotation space for additional model organisms such as Mycobacterium tuberculosis, cow, pig, and wheat.
3. Expansion of annotations for Pfam database entries.
4. Annotations from additional resources such as SWISS-MODEL: The upcoming version of Genome3D will integrate SWISS-MODEL, which is a widely used automated protein structure modeling resource developed by the Schwede group.
References
1. Lewis TE, Sillitoe I, Andreeva A, Blundell TL, Buchan DW, Chothia C et al (2013) Genome3D: a UK collaborative project to annotate genomic sequences with predicted 3D structures based on SCOP and CATH domains. Nucleic Acids Res 41(D1):D499–D507
2. Lewis TE, Sillitoe I, Andreeva A, Blundell TL, Buchan DW, Chothia C et al (2015) Genome3D: exploiting structure to help users understand their sequences. Nucleic Acids Res 43(D1):D382–D386
3. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247(4):536–540
4. Orengo CA, Michie A, Jones S, Jones DT, Swindells M, Thornton JM (1997) CATH – a hierarchic classification of protein domain structures. Structure 5(8):1093–1109
5. Buchan DW, Shepherd AJ, Lee D, Pearl FM, Rison SC, Thornton JM et al (2002) Gene3D: structural assignment for whole genes and genomes using the CATH domain structure database. Genome Res 12(3):503–514
6. Buchan DW, Minneci F, Nugent TC, Bryson K, Jones DT (2013) Scalable web services for the PSIPRED Protein Analysis Workbench. Nucleic Acids Res 41(W1):W349–W357
7. Lobley A, Sadowski MI, Jones DT (2009) pGenTHREADER and pDomTHREADER: new methods for improved protein fold recognition and superfamily discrimination. Bioinformatics 25(14):1761–1767
8. Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJ (2015) The Phyre2 web portal for protein modeling, prediction and analysis. Nat Protoc 10(6):845
9. Kelley LA, Sternberg MJ (2009) Protein structure prediction on the Web: a case study using the Phyre server. Nat Protoc 4(3):363
10. Gough J (2002) The SUPERFAMILY database in structural genomics. Acta Crystallogr D Biol Crystallogr 58(11):1897–1900
11. Shi J, Blundell TL, Mizuguchi K (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 310(1):243–257
12. Buchan DW, Jones DT (2019) The PSIPRED Protein Analysis Workbench: 20 years on. Nucleic Acids Res 47(W1):W402–W407
13. Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R et al (2018) SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res 46(W1):W296–W303
14. Eddy SR (2011) Accelerated profile HMM searches. PLoS Comput Biol 7(10):e1002195
15. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC et al (2018) The Pfam protein families database in 2019. Nucleic Acids Res 47(D1):D427–D432
16. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H et al (2000) The protein data bank. Nucleic Acids Res 28(1):235–242
17. Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG (2014) SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res 42(D1):D310–D314
18. Hubbard TJ, Murzin AG, Brenner SE, Chothia C (1997) SCOP: a structural classification of proteins database. Nucleic Acids Res 25(1):236–239
19. Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG (2015) Investigating protein structure and evolution with SCOP2. Curr Protoc Bioinformatics 49(1):1.26.1–1.26.21
20. Sillitoe I, Lewis TE, Cuff A, Das S, Ashford P, Dawson NL et al (2015) CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res 43(D1):D376–D381
21. Cuff A, Redfern OC, Greene L, Sillitoe I, Lewis T, Dibley M et al (2009) The CATH hierarchy revisited – structural divergence in domain superfamilies and the continuity of fold space. Structure 17(8):1051–1062
22. Das S, Sillitoe I, Lee D, Lees JG, Dawson NL, Ward J et al (2015) CATH FunFHMMer web server: protein functional annotations using functional family assignments. Nucleic Acids Res 43(W1):W148–W153
23. Lee D, Das S, Dawson NL, Dobrijevic D, Ward J, Orengo C (2016) Novel computational protocols for functionally classifying and characterising serine beta-lactamases. PLoS Comput Biol 12(6):e1004926
24. Ashford P, Pang CS, Moya-García AA, Adeyelu T, Orengo CA (2019) A CATH domain functional family based approach to identify putative cancer driver genes and driver mutations. Sci Rep 9(1):263
25. Moya-García A, Adeyelu T, Kruger FA, Dawson NL, Lees JG, Overington JP et al (2017) Structural and functional view of polypharmacology. Sci Rep 7(1):10102
26. Orengo CA, Taylor WR (1996) [36] SSAP: sequential structure alignment program for protein structure comparison, Methods in enzymology, vol 266. Elsevier, Amsterdam, pp 617–635
27. Redfern OC, Harrison A, Dallman T, Pearl FM, Orengo CA (2007) CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLoS Comput Biol 3(11):e232
28. Šali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234(3):779–815
29. Ochoa-Montaño B, Blundell TL (2017) XSuLT: a web server for structural annotation and representation of sequence-structure alignments. Nucleic Acids Res 45(W1):W381–W387
30. Sali A, Blundell TL (1990) Definition of general topological equivalence in protein structures: a procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J Mol Biol 212(2):403–428
31. Ochoa-Montaño B, Mohan N, Blundell TL (2015) CHOPIN: a web resource for the structural and functional proteome of Mycobacterium tuberculosis. Database 2015. https://doi.org/10.1093/database/bav026
32. Mizuguchi K, Deane CM, Blundell TL, Johnson MS, Overington JP (1998) JOY: protein sequence-structure representation and analysis. Bioinformatics (Oxford, England) 14(7):617–623
33. Remmert M, Biegert A, Hauser A, Söding J (2012) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9(2):173
34. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292(2):195–202
35. Söding J (2004) Protein homology detection by HMM–HMM comparison. Bioinformatics 21(7):951–960
36. Ofoegbu TC, David A, Kelley LA, Mezulis S, Islam SA, Mersmann SF et al (2019) PhyreRisk: a dynamic web application to bridge genomics, proteomics and 3D structural data to guide interpretation of human genetic variants. J Mol Biol 431(13):2460–2466
37. Xie W, Sahinidis NV (2005) Residue-rotamer-reduction algorithm for the protein side-chain conformation problem. Bioinformatics 22(2):188–194
38. Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT (2004) The DISOPRED server for the prediction of protein disorder. Bioinformatics 20(13):2138–2139
39. Yates CM, Filippis I, Kelley LA, Sternberg MJ (2014) SuSPect: enhanced prediction of single amino acid variant (SAV) phenotype using network features. J Mol Biol 426(14):2692–2701
40. Gough J, Karplus K, Hughey R, Chothia C (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 313(4):903–919
41. Pandurangan AP, Stahlhacke J, Oates ME, Smithers B, Gough J (2019) The SUPERFAMILY 2.0 database: a significant proteome update and a new webserver. Nucleic Acids Res 47(D1):D490–D494
42. The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res 45(D1):D158–D169. PubMed PMID: 27899622
43. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R et al (2016) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44(D1):D733–D745
44. Fox NK, Brenner SE, Chandonia JM (2014) SCOPe: structural classification of proteins – extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res 42(Database issue):D304–D309
45. Dawson NL, Lewis TE, Das S, Lees JG, Lee D, Ashford P et al (2017) CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res 45(D1):D289–D295
46. Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S et al (2014) ECOD: an evolutionary classification of protein domains. PLoS Comput Biol 10(12):e1003926. PubMed PMID: 25474468
47. Mir S, Alhroub Y, Anyango S, Armstrong DR, Berrisford JM, Clark AR et al (2018) PDBe: towards reusable data delivery infrastructure at protein data bank in Europe. Nucleic Acids Res 46(D1):D486–D492
48. Madera M, Gough J (2002) A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res 30(19):4321–4328
49. Teichmann SA, Chothia C (2000) Immunoglobulin superfamily proteins in Caenorhabditis elegans. J Mol Biol 296(5):1367–1383
50. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S et al (2004) The Pfam protein families database. Nucleic Acids Res 32(Database issue):D138–D141
51. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M et al (2004) The ASTRAL Compendium in 2004. Nucleic Acids Res 32(Database issue):D189–D192. PubMed PMID: 14681391
52. Karplus K, Barrett C, Hughey R (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics 14(10):846–856
53. Gough J (2006) Genomic scale sub-family assignment of protein domains. Nucleic Acids Res 34(13):3625–3633
54. Fang H, Oates ME, Pethica RB, Greenwood JM, Sardar AJ, Rackham OJ et al (2013) A daily-updated tree of (sequenced) life as a reference for genome research. Sci Rep 3:2015
55. Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res 32(Database issue):D235–D239. PubMed PMID: 14681402
56. Wilson D, Pethica R, Zhou Y, Talbot C, Vogel C, Madera M et al (2009) SUPERFAMILY – sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res 37(Database issue):D380–D386
57. Fang H, Gough J (2012) DcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more. Nucleic Acids Res 41(D1):D536–D544
58. Lam SD, Dawson NL, Das S, Sillitoe I, Ashford P, Lee D et al (2016) Gene3D: expanding the utility of domain assignments. Nucleic Acids Res 44(D1):D404–D409
59. Pruess M, Kersey P, Apweiler R (2004) Integrating genomic and proteomic data: the Integr8 Project. J Integr Bioinform 1(1):108–115
60. Cunningham F, Achuthan P, Akanni W, Allen J, Amode MR, Armean IM et al (2018) Ensembl 2019. Nucleic Acids Res 47(D1):D745–D751
61. Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30(14):3059–3066
62. Pearl FM, Martin N, Bray JE, Buchan DW, Harrison AP, Lee D et al (2001) A rapid classification protocol for the CATH Domain Database to support structural genomics. Nucleic Acids Res 29(1):223–227
63. Pearl FM, Lee D, Bray JE, Buchan DW, Shepherd AJ, Orengo CA (2002) The CATH extended protein-family database: providing structural annotations for genome sequences. Protein Sci 11(2):233–244
64. Pearl FM, Bennett C, Bray JE, Harrison AP, Martin N, Shepherd A et al (2003) The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res 31(1):452–455
65. Lees J, Yeats C, Redfern O, Clegg A, Orengo C (2010) Gene3D: merging structure and function for a Thousand genomes. Nucleic Acids Res 38(Suppl_1):D296–D300
66. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402
67. Hildebrand A, Remmert M, Biegert A, Söding J (2009) Fast and accurate automatic structure prediction with HHpred. Proteins 77(S9):128–132
68. Jones DT, Singh T, Kosciolek T, Tetchner S (2014) MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31(7):999–1006
69. Nugent T, Jones DT (2010) Predicting transmembrane helix packing arrangements using residue contacts and a force-directed algorithm. PLoS Comput Biol 6(3):e1000714
70. Nugent T, Ward S, Jones DT (2011) The MEMPACK alpha-helical transmembrane protein structure prediction server. Bioinformatics 27(10):1438–1439
71. Bryson K, Cozzetto D, Jones DT (2007) Computer-assisted protein domain boundary prediction using the Dom-Pred server. Curr Protein Pept Sci 8(2):181–188
72. Jones DT, Cozzetto D (2014) DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 31(6):857–863
73. Cozzetto D, Minneci F, Currant H, Jones DT (2016) FFPred 3: feature-based function prediction for all Gene Ontology domains. Sci Rep 6:31865
74. Wang X, Zhang S, Zhang J, Huang X, Xu C, Wang W et al (2010) A large intrinsically disordered region in SKIP and its disorder-order transition induced by PPIL1 binding revealed by NMR. J Biol Chem 285(7):4951–4963
75. Schreyer AM, Blundell TL (2013) CREDO: a structural interactomics database for drug discovery. Database 2013:bat049
76. Pires DE, Ascher DB, Blundell TL (2014) DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic Acids Res 42(W1):W314–W319
77. Oates ME, Romero P, Ishida T, Ghalwash M, Mizianty MJ, Xue B et al (2012) D2P2: database of disordered protein predictions. Nucleic Acids Res 41(D1):D508–D516
78. Wass MN, Barton G, Sternberg MJ (2012) CombFunc: predicting protein function using heterogeneous data sources. Nucleic Acids Res 40(W1):W466–W470
79. Wass MN, Kelley LA, Sternberg MJ (2010) 3DLigandSite: predicting ligand-binding sites using similar structures. Nucleic Acids Res 38(Suppl_2):W469–W473
80. Phan HT, Sternberg MJ, Gelenbe E (eds) (2012) Aligning protein-protein interaction networks using random neural networks. 2012 IEEE International conference on bioinformatics and biomedicine. IEEE
81. Hériché J-K, Lees JG, Morilla I, Walter T, Petrova B, Roberti MJ et al (2014) Integration of biological data by kernels on graph nodes allows prediction of new genes involved in mitotic chromosome condensation. Mol Biol Cell 25(16):2522–2536
82. Perkins JR, Lees J, Antunes-Martins A, Diboun I, McMahon SB, Bennett DL et al (2013) PainNetworks: a web-based resource for the visualisation of pain-related genes in the context of their network associations. Pain 154(12):2586.e1–2586.e12
83. Mitchell AL, Attwood TK, Babbitt PC, Blum M, Bork P, Bridge A et al (2019) InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res 47(D1):D351–D360
Chapter 4
Estimating the Quality of 3D Protein Models Using the ModFOLD7 Server
Ali H. A. Maghrabi and Liam J. McGuffin

Abstract
Assessing the accuracy of 3D models has become a keystone in the protein structure prediction field. ModFOLD7 is our leading resource for Estimates of Model Accuracy (EMA), which has been upgraded by integrating a number of the pioneering pure-single- and quasi-single-model approaches. Such an integration has given our latest version the strengths to accurately score and rank predicted models, with higher consistency compared to older EMA methods. Additionally, the server provides three options for producing global score estimates, depending on the requirements of the user: (1) ModFOLD7_rank, which is optimized for ranking/selection, (2) ModFOLD7_cor, which is optimized for correlations of predicted and observed scores, and (3) ModFOLD7 global for balanced performance. ModFOLD7 has been ranked among the top few EMA methods according to independent blind testing by the CASP13 assessors. Another evaluation resource for ModFOLD7 is the CAMEO project, where the method is continuously automatically evaluated, showing a significant improvement compared to our previous versions. The ModFOLD7 server is freely available at http://www.reading.ac.uk/bioinf/ModFOLD/.

Key words Estimates of model accuracy (EMA), Model quality assessment (MQA), Protein structure prediction, Protein modeling, Tertiary structure prediction, Critical assessment of techniques for protein structure prediction (CASP), Continuously evaluate the accuracy and reliability of predictions (CAMEO)
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_4, © Springer Science+Business Media, LLC, part of Springer Nature 2020

1 Introduction
Since researchers from different fields of the biological sciences started relying on three-dimensional structural models of proteins, prediction programs have been improving rapidly. One of the major components of structure prediction pipelines is the evaluation or assessment of predicted model accuracy. It is possible to generate many hundreds of alternative 3D models for any given protein target using many different algorithms. The best modeling method overall is not always the most accurate for a given target, so it is difficult to rank and select the models that are most likely to be closest to the native structure. Furthermore, local regions of models may differ in quality, and so it may help a
biologist to know whether their specific regions of interest are accurately modeled, for example, predicted interface/interacting residues. Such problems have been recognized by the field of structural bioinformatics, and many developers have focused on improving methods for Model Quality Assessment (QA) that support their prediction pipelines. Such tools and servers are now also referred to as Estimates of Model Accuracy (EMA) methods.

EMA (a.k.a. QA) methods and servers are evaluated as a category by two major worldwide initiatives in the protein structure prediction field. The first conducts independent blind testing through the Critical Assessment of Techniques for Protein Structure Prediction (CASP) [1] experiments, which are held every other year. The second is the Continuous Automated Model EvaluatiOn (CAMEO) project [2]. Both have highlighted the importance of EMA development for the improvement of protein structure prediction and have helped to encourage progress in the field.

Modern EMA methods can be classified into three broad categories. (1) Pure-single-model methods score a model using only the information in that individual model; they are characterized by rapid processing and strong performance at model ranking and selection, but they often produce less consistent global scores. (2) Clustering/consensus approaches use multiple alternative models built for the same protein target; these methods have the opposite characteristics to the single-model methods: they have been far more accurate, but they are more computationally intensive and do not work when very few similar models are available. (3) Quasi-single-model methods score an individual model against a pool of alternative reference models that are generated from the same target sequence. Quasi-single-model methods attempt to provide accuracy comparable to clustering methods while addressing the real-life needs of researchers who have only one or a few models.

ModFOLD [3] is our EMA protocol, and successive versions have competed with the leading model quality assessment programs throughout the past 10 years. ModFOLD was initially built as two separate methods. The original single-model method was called ModFOLD. Additionally, we developed a clustering-based method called ModFOLDclust [4]. Over the years, both methods have been merged, together with a number of other methods, into a new ModFOLD program, which was a pioneer of the quasi-single-model approach. The quasi-single-model approach was first implemented in the third version of ModFOLD [5]. With this approach, ModFOLD3 was able to use a method similar to that of ModFOLDclust2 [4], by first generating reference sets of models from the target sequence using the IntFOLD-TS [6] method, which were subsequently used for comparison with the submitted model. ModFOLD has since undergone a number of updates through versions 4 [7], 5 [8], and 6 [9], which have maintained the quasi-single-model approach. Each successive version has been ranked among the top-performing EMA methods in recent CASP experiments. The quasi-single-model approach has helped our ModFOLD pipeline remain competitive by retaining the predictive power offered by clustering-based methods while being capable of making predictions for a single model at a time.

While we have made significant progress in performance over the years with our ModFOLD methods, there is still room for improvement in many aspects of EMA. Here, we describe major updates to the ModFOLD server. The server has been popular with modelers around the world, having completed hundreds of thousands of EMA jobs for thousands of unique users over the past decade.
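To make the distinction between clustering/consensus scoring and quasi-single-model scoring concrete, the following minimal Python sketch scores models either by their mean similarity to all other submitted models or by their mean similarity to a separate reference pool. It is an illustration only and not the ModFOLD implementation: all function names are hypothetical, models are assumed to be equal-length lists of already superposed CA coordinates, and the per-residue similarity 1/(1 + (d/3.5)^2) is assumed here as a simple stand-in for the structural comparisons used by the real methods.

```python
import math
from statistics import mean

D0 = 3.5  # assumed distance scale in Angstroms


def similarity(model_a, model_b):
    """Mean per-residue similarity between two models.

    Each model is a list of (x, y, z) CA coordinates. The two lists are
    assumed to have equal length and to be superposed already.
    """
    per_residue = []
    for a, b in zip(model_a, model_b):
        d = math.dist(a, b)
        per_residue.append(1.0 / (1.0 + (d / D0) ** 2))
    return mean(per_residue)


def consensus_scores(models):
    """Clustering/consensus: score each model against all the other models.

    Needs at least two models, which mirrors the limitation that consensus
    methods cannot be applied to a single submitted model.
    """
    scores = []
    for i, model in enumerate(models):
        others = [m for j, m in enumerate(models) if j != i]
        scores.append(mean([similarity(model, other) for other in others]))
    return scores


def quasi_single_score(model, reference_models):
    """Quasi-single: score one model against an independent reference pool."""
    return mean([similarity(model, ref) for ref in reference_models])
```

With a reference pool generated from the target sequence, `quasi_single_score` can be applied to a single submitted model, whereas `consensus_scores` depends entirely on the set of models that happens to be submitted.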
2 Methods
The latest version of our server, ModFOLD7, uses a new quality assessment technique that combines the strengths of multiple pure-single- and quasi-single-model methods to improve prediction accuracy. The server comprises a single-model approach which combines ten scoring methods.

Six of the methods are pure-single-model input methods: (1) Contact Distance Agreement (CDA), which uses MetaPSICOV [10] to measure the agreement between the predicted residue contacts and the contacts in the model; (2) Secondary Structure Agreement (SSA), which uses PSIPRED [11] to measure the agreement between the predicted secondary structure of each residue and the secondary structure state of that residue in the model according to the Dictionary of Protein Secondary Structure (DSSP); (3) ProQ2 [12]; (4) ProQ2D [13]; (5) ProQ3D [13]; and (6) VoroMQA [14].

The remaining four methods are quasi-single-model input methods: (1) ModFOLDclust_single (MFcs), which compares the input model against the 130 IntFOLD5 reference models; (2) Disorder "B-factor" Agreement (DBA), which compares DISOPRED [15] scores against the MFcs score; (3) ModFOLDclustQ_single (MFcQs) [4], which compares the input model against the IntFOLD5 reference models; and (4) ResQ [16], which estimates the residue-specific quality and B-factor and compares the input model against LOMETS [17] models.

The component per-residue/local quality scores from each of the ten methods are combined using Neural Networks (NNs), resulting in a final consensus per-residue quality score for each model. A flowchart of the data and processes used in the ModFOLD7 server is shown in Fig. 1.
Fig. 1 Flow of data illustrating the local and global estimates of model accuracy in ModFOLD7. The method pipeline starts with two inputs, the target sequence and a single model. The target sequence is evaluated with five preprocessing methods. The resulting data from the preprocessing methods with the input single model then are evaluated with ten scoring methods resulting in local score input data. Next, the local scores are processed using two neural networks (NN) trained to two target functions, the S-score and the lDDT score, resulting in the final local score outputs. Lastly, the mean local scores from each method are used to form 12 global scores, which are then optimally combined in the different ways indicated to form the three variants of ModFOLD7

2.1 The ModFOLD7 Component Per-Residue/Local Quality Scoring Methods
The ModFOLD7 NNs were trained using two separate target functions for each residue in a model: the residue contact-based lDDT score and the superposition-based S-score which has been used in previous versions of ModFOLD. The RSNNS package for R was used to construct the NNs, which were trained using data derived from the evaluation of CASP11 and 12 server models versus native structures. The per-residue similarity scores were calculated using a simple multilayer perceptron (MLP). For the method trained using the lDDT score (ModFOLD7_res_lddt), the MLP input consisted of a sliding window (size = 5) of per-residue scores from all ten of the methods described above, and the output was a single quality score for each residue in the model (50 inputs, 25 hidden, 1 output). For the method trained using the S-score (ModFOLD7_res), this time only seven of the ten methods were used as inputs (all apart from the ProQ2, CDA, and SSA scores) with a sliding window (size = 5), therefore 35 inputs, 18 hidden, 1 output. For both of the per-residue scoring methods, the similarity scores, s, for each residue were converted back to distances, d, with $d = 3.5\sqrt{(1/s) - 1}$.
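As an illustration of the input encoding and the score-to-distance conversion described above, the following Python sketch builds one sliding-window input vector per residue and inverts a similarity score to a distance. It is a sketch under stated assumptions rather than the server code: the function names are hypothetical, and zero padding at the chain termini is our assumption, because the text does not state how windows that overrun the ends of the sequence are handled.

```python
import math


def window_features(per_residue_scores, window=5):
    """Build one flat MLP input vector per residue.

    per_residue_scores holds one row per residue, each row containing one
    score per component method. With 10 methods and window=5 every vector
    has 50 values; with 7 methods it has 35. Positions outside the chain
    are zero-padded (an assumption, not stated in the source).
    """
    n_residues = len(per_residue_scores)
    n_methods = len(per_residue_scores[0])
    half = window // 2
    padding = [0.0] * n_methods
    features = []
    for i in range(n_residues):
        vector = []
        for j in range(i - half, i + half + 1):
            inside = 0 <= j < n_residues
            vector.extend(per_residue_scores[j] if inside else padding)
        features.append(vector)
    return features


def similarity_to_distance(s, d0=3.5):
    """Convert a similarity score s in (0, 1] to a distance in Angstroms.

    Implements d = d0 * sqrt(1/s - 1), the inverse of s = 1/(1 + (d/d0)^2).
    """
    if not 0.0 < s <= 1.0:
        raise ValueError("similarity score must be in the interval (0, 1]")
    return d0 * math.sqrt(1.0 / s - 1.0)
```

A similarity of 1 maps to a distance of 0 Å and a similarity of 0.5 maps to 3.5 Å, so lower similarity scores correspond to larger predicted per-residue errors.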
2.2 The ModFOLD7 Global Scoring Methods
Global scores were calculated by taking the mean per-residue scores (the sum of the per-residue similarity scores divided by the sequence length) for each of the ten individual component methods described above, plus the NN outputs from ModFOLD7_res and ModFOLD7_res_lddt. Furthermore, three additional quasi-single global model quality scores were generated for each model based on the original ModFOLDclust, ModFOLDclustQ, and ModFOLDclust2 global scoring methods (in a similar vein to the ModFOLD4_single and ModFOLD5_single global scores, tested in CASP10 and CASP11, respectively). Thus, we ended up with 15 alternative global QA scores, which could be combined in various ways in order to optimize for the different facets of the quality estimation problem.

For the CASP13 experiment, we registered three ModFOLD7 global scoring variants:
(1) The ModFOLD7 global score, which used the mean per-residue NN output score from ModFOLD7_res; this score considered alone was found to have a good balance of performance both for correlations of predicted versus observed scores and for rankings of the top models.
(2) The ModFOLD7_cor global score variant ((MFcQs + DBA + ProQ3D + ResQ + ModFOLD7_res)/5), which was found to be an optimal combination for producing good correlations with the observed scores, that is, the predicted global quality scores should be close to linearly correlated with the observed global quality scores.
(3) The ModFOLD7_rank global score variant ((CDA + SSA + VoroMQA + ModFOLD7_res + ModFOLD7_res_lddt)/5), which was found to be an optimal combination for ranking, that is, the top-ranked models (top 1) should be closer to the highest accuracy, but the relationship between predicted and observed scores may not be linear.

The local scores of the ModFOLD7 and ModFOLD7_rank variants used the output from the ModFOLD7_res NN, whereas the ModFOLD7_cor variant used the local scores from the ModFOLD7_res_lddt NN.
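The arithmetic behind the global scores and the three registered variants can be written out as a short Python sketch. The combinations follow the formulas given above; the function names and the dictionary keys are hypothetical labels chosen for illustration, and any numerical values passed in would be the user's own.

```python
from statistics import mean


def global_score(per_residue_scores, sequence_length):
    """Sum of per-residue similarity scores divided by the sequence length.

    If the model covers fewer residues than the target sequence, the
    missing residues effectively contribute zero to the global score.
    """
    return sum(per_residue_scores) / sequence_length


def modfold7_variants(g):
    """Combine global scores (a dict keyed by method name) into the
    three global scoring variants described in the text."""
    return {
        "ModFOLD7": g["ModFOLD7_res"],
        "ModFOLD7_cor": mean(
            [g["MFcQs"], g["DBA"], g["ProQ3D"], g["ResQ"], g["ModFOLD7_res"]]
        ),
        "ModFOLD7_rank": mean(
            [g["CDA"], g["SSA"], g["VoroMQA"], g["ModFOLD7_res"],
             g["ModFOLD7_res_lddt"]]
        ),
    }
```

Keeping the component global scores in a dictionary makes it easy to test alternative combinations of the 15 scores without changing the per-residue pipeline.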
2.3 Server Inputs and Outputs
Like the previous versions, the ModFOLD7 server requires only the amino acid sequence of the protein target and a single 3D model (in PDB format) for evaluation. However, users can upload more than one PDB file in a compressed archive. Optionally, users can also give their target a name and provide their e-mail address, so that they can receive a notification of the result (see Notes 1–6).

The results are provided in a clean and simple user interface so that they can be interpreted easily by nonexperts at a glance. Once the prediction process is complete, a results page is generated containing a single table summarizing the quality assessment scores for each submitted model. Each assessed model is represented in the table graphically, with thumbnail images of the local error plots and annotated 3D models. Images in the table are clickable for detailed 3D visualization using the JSmol/HTML5 framework. Conveniently, interactive 3D results can also be viewed on mobile devices without any plugin requirement. The results table shows a global score for each model, a p-value indicating the likelihood that the model is incorrectly folded, and a plot of the local errors in the model in Ångströms. Users can also download the models annotated with the ModFOLD7 predicted local quality scores, which have been inserted into the B-factor column of the ATOM records for each submitted model. The raw machine-readable data files for each set of predictions, which comply with the CASP data standards, are also provided for developers and more advanced users. An overview of the ModFOLD7 interface is shown in Fig. 2 (see Notes 7–12).

2.4 Independent Benchmarking and Cross-Validation
The three alternative optimized scoring methods of the ModFOLD7 server have been benchmarked against their respective previous versions from the ModFOLD6 server (Fig. 3). For the cumulative GDT_TS of the top-ranked models, the highest score achieved by the ModFOLD6_rank method was below 44.5, whereas ModFOLD7_rank exceeded 45. For the Pearson correlation between the predicted and observed scores (GDT_TS), ModFOLD6_cor achieved a correlation of 0.9250, whereas ModFOLD7_cor achieved a correlation of over 0.9300. For the evaluation of local model quality prediction accuracy using the area under the ROC curve (AUC) (where residues with lDDT scores < 0.6 = 0), ModFOLD6 could not reach an AUC score of 0.93, whereas ModFOLD7 was closer to 0.95. These results indicate that our latest version, ModFOLD7, has improved on ModFOLD6, and according to many measures, the improvements are significant.

ModFOLD7 is also one of the EMA servers that are continuously and independently benchmarked for local EMA performance by CAMEO. For the last year, the CAMEO public EMA data (https://www.cameo3d.org/) show that ModFOLD7 is one of the leading public EMA methods for producing local (per-residue) quality scores. The results from CAMEO also show that ModFOLD7 is performing significantly better than its previous versions, ModFOLD6 and ModFOLD4 [7, 9] (Table 1).

Fig. 2 ModFOLD7 server inputs and outputs pages. Inputs page: containing a text box to paste the amino acid sequence of the protein target in single-letter code, a push button to upload model/models (either a single PDB file or a tarred and gzipped directory of PDB files) of the protein target, three options to select the global accuracy score optimization preference, and two optional text boxes to input the user e-mail address and to give a short name for the protein target. Outputs page: showing the result page for models submitted to CASP13 generated for target T0959. The main output page is shown with summary tables of the results for each model. Results can also be visualized in more detail by clicking on the thumbnail images in the main table

Fig. 3 Histograms showing a comparison between the three variants of ModFOLD6 and the respective variants of ModFOLD7 using three evaluation methods: the cumulative GDT_TS of top-ranked models, the Pearson correlations between predicted and observed scores, and the local accuracy as measured by the AUC score (lDDT < 0.6 = 0). Evaluation is based on cross-validated CASP11 data

Table 1 Top EMA methods in CAMEO

                 Structural models           ROC            ROC normalized  PR             PR normalized
Server           Submitted  Received  %      AUC    AUC     AUC    AUC      AUC    AUC     AUC    AUC
                                             0,1    0,0.2   0,1    0,0.2    0,1    0.8,1   0,1    0.8,1
QMEANDisCo       9816       9041      92.1   0.93   0.77    0.86   0.71     0.9    0.66    0.83   0.61
ModFOLD7_lDDT    9816       8283      84.4   0.91   0.71    0.77   0.6      0.87   0.61    0.74   0.51
ModFOLD6         9816       6709      68.3   0.89   0.65    0.61   0.44     0.84   0.58    0.57   0.4
QMEAN            9816       9054      92.2   0.87   0.61    0.8    0.56     0.81   0.53    0.74   0.49
ProQ2            9816       9464      96.4   0.86   0.58    0.82   0.56     0.79   0.5     0.76   0.48
ModFOLD4         9816       7191      73.3   0.85   0.57    0.62   0.42     0.78   0.49    0.57   0.36

One year of data [2018-03-30 to 2019-03-23] downloaded from http://www.cameo3d.org/ ("All" dataset). The table is sorted by the ROC AUC score. ROC receiver operating characteristic, AUC area under the ROC curve, PR precision and recall

3 Case Study
In 2018, the ModFOLD7 servers participated in the latest worldwide Critical Assessment of Techniques for Protein Structure Prediction competition (CASP13). The goal of this competition is to help advance the methods that identify protein structure from sequence by testing them objectively via the process of blind prediction. The competition includes many subcategories, one of which is Estimates of Model Accuracy (EMA), where our ModFOLD7 methods are independently evaluated. The CASP assessors provide sequences of proteins whose structures have never been observed before. Participants use their prediction servers to generate 3D models of the target structures. Once server models have been generated for a given target, they are used for the EMA category: participants use their model quality assessment methods to estimate the accuracy of the predicted models for each target.
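The global evaluation measures referred to in this chapter can be sketched in a few lines of Python: the cumulative observed score of the top-ranked models, the Pearson correlation between predicted and observed scores, and the absolute difference between the top-selected model and the best available model, which is the quantity reported in Table 2. The function names and the dictionary-based data layout are hypothetical, and the official CASP and CAMEO evaluations may differ in detail.

```python
import math


def top1_loss(predicted, observed):
    """Absolute difference in observed score between the model ranked
    first by the EMA method and the best model actually available.

    predicted and observed map model identifiers to scores for one target.
    """
    selected = max(predicted, key=predicted.get)
    best = max(observed.values())
    return abs(best - observed[selected])


def cumulative_top1(targets):
    """Sum, over targets, of the observed score of each top-ranked model.

    targets is a list of (predicted, observed) dictionary pairs.
    """
    total = 0.0
    for predicted, observed in targets:
        selected = max(predicted, key=predicted.get)
        total += observed[selected]
    return total


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A loss of zero means that the method selected the best available model for that target, while a high correlation means that the predicted scores track the observed scores across all models, which is why ranking and correlation can be optimized separately.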
In CASP13, the assessors provide predictors with anonymous protein sequences (targets), which are submitted by different biological research teams around the world who have a vested interest in determining their structures. An example of one of these protein targets is Endolysin KPP12 (CASP13 target T0962); KPP12 is a bacteriophage found to have a therapeutic effect in Pseudomonas aeruginosa keratitis [18]. The study shows that morphological and DNA sequence analyses of KPP12 have identified its family and its similarities with other viruses, and researchers are therefore testing whether the protein is the same as in its family members. Using KPP12 as a treatment can suppress neutrophil infiltration, and it can also greatly enhance bacterial clearance in the infected cornea. The only available data for KPP12 was the sequence. Participants from different organizations and companies predicted the structure of the protein using their own methods. After structure prediction, the resulting models were assessed in terms of their quality and how close they were to the native structure. The results showed that ModFOLD7 gave the joint-best EMA scores among all methods on all measures, for example 0.660 for lDDT and 1.990 for CAD (Table 2).

Table 2 The top ten EMA methods for Target T0962 (KPP12) in CASP13 in terms of absolute differences in score between the top selected model and the best model according to observed structure (smaller scores indicate higher performing methods)

Rank  Gr. Name             GDT_TS   LDDT    CAD(AA)  SG
1     SBROD-plus           0.000    0.660   1.990    0.000
2     ModFOLD7             0.000    0.660   1.990    0.000
3     ModFOLD7_cor         0.000    0.660   1.990    0.000
4     MASS2                10.170   2.110   3.991    8.475
5     Bhattacharya-Server  10.170   2.110   3.991    8.475
6     Pcons                6.215    2.660   3.121    10.452
7     VoroMQA-B            4.802    2.850   2.033    5.933
8     Kiharalab            4.802    2.850   2.033    5.933
9     ProQ4                4.802    2.850   2.033    5.933
10    MASS1                4.802    2.850   2.033    5.933

EMA methods are evaluated for target T0962 in CASP13. The evaluation was performed using the GDT_TS, lDDT, CAD, and SG scores. Only the top ten methods are shown, and the table is sorted using lDDT scores. The scores are calculated over all models for all targets (QA stage 2–best 150). The data are downloaded from http://predictioncenter.org/casp13/qa_diff2best.cgi

Such information about model quality is invaluable in identifying: firstly, the very best 3D models of a protein, which are the closest to the native structure; secondly, the likelihood that models are of good or poor quality overall; and finally, the magnitude of errors in specific local regions of the protein and the regions that are likely to have the fewest errors.
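Because the server writes the predicted per-residue errors into the B-factor column of the ATOM records (Subheading 2.3), the local errors in a downloaded model can be inspected with a few lines of code. The following Python sketch reads the value for each CA atom using the fixed column positions of the PDB format; the function names and the example cutoff are hypothetical, and the sketch assumes one model per file, as required by the server.

```python
def read_local_scores(pdb_text):
    """Return (chain_id, residue_number, score) for every CA atom.

    Uses the fixed-width PDB ATOM layout: atom name in columns 13-16,
    chain identifier in column 22, residue number in columns 23-26, and
    B-factor in columns 61-66. In an annotated model the B-factor holds
    the predicted per-residue error in Angstroms.
    """
    scores = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        if line[12:16].strip() != "CA":
            continue
        chain_id = line[21]
        residue_number = int(line[22:26])
        score = float(line[60:66])
        scores.append((chain_id, residue_number, score))
    return scores


def residues_above(scores, cutoff):
    """List (chain_id, residue_number) pairs whose predicted error
    exceeds a user-chosen cutoff in Angstroms."""
    return [(chain, number) for chain, number, score in scores if score > cutoff]
```

For example, `residues_above(read_local_scores(open("model.pdb").read()), 5.0)` would list the residues predicted to deviate by more than 5 Å, where both the file name and the cutoff are placeholders chosen by the user.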
4 Notes
1. The ModFOLD server version 7.0 requires the amino acid sequence of your target protein and either a single 3D model file in PDB format or a tarball containing a directory of multiple separate files in PDB format. To produce a tarball file for your own 3D models:
For Linux/OSX/other Unix users: (a) tar up the directory containing your PDB files, for example, type the following at the command line: tar cvf my_models.tar my_models/; (b) gzip the tar file, for example, gzip my_models.tar; and (c) upload the gzipped tar file (e.g., my_models.tar.gz) to the ModFOLD server.
For Windows users: (a) download a file archiver application such as 7-zip; (b) select the directory (folder) of model files to add to the .tar file, click "Add," select the "tar" option as the "Archive format:", and save the file as something memorable, for example, my_models.tar; (c) select the tar file, click "Add," and then select the "GZip" option as the "Archive format:" (the file should then be saved as my_models.tar.gz); and (d) upload the gzipped tar file (e.g., my_models.tar.gz) to the ModFOLD server.
2. Providing an e-mail address allows the server to send a link to the graphical and machine-readable results as soon as the predictions are completed. However, if the user does not provide an e-mail address, then she/he must bookmark the results page in order to view and refer to it when it becomes available.
3. In the text box labeled "Input sequence of protein target," users should carefully paste in the full amino acid sequence of the target protein of interest in single-letter format. An example sequence (CASP13 target T0949) is: MAAKKGMTTVLVSAVICAGVIIGALQWEKAVALPNPSGQVINGVHHYTIDEFNYYYKPDRMTWHVGEKVELTIDNRSQSAPPIAHQFSIGRTLVSRDNGFPKSQAIAVGWKDNFFDGVPITSGGQTGPVPAFSVSLNGGQKYTFSFVVPNKPGKWEYGCFLQTGQHFMNGMHGILDILPAQGS.
4. It is important that the user provides the full sequence that corresponds to the sequence of residue coordinates in the model file. If the model does not contain numbering which corresponds directly to the order of residues in the sequence file, then the server will attempt to renumber the residues in the model files accordingly. However, if a submitted model file contains residues that are not in the provided sequence, the prediction for that model will not be completed.
5. Users must ensure that each PDB file contains the coordinates for one model only. Please do not upload a single PDB file containing the coordinates for multiple alternative NMR models. The coordinates for multiple models should always be uploaded as a tarred and gzipped directory of separate files.
6. Assigning a short memorable name to each prediction job is useful for identifying and distinguishing jobs, because ModFOLD will not necessarily return the results in the order in which the user submitted them.
7. The results table is ranked according to decreasing global model quality score. The global model quality scores range between 0 and 1. In general, scores less than 0.2 indicate that there may be incorrectly modeled domains, and scores greater than 0.4 generally indicate more complete and confident models, which are highly similar to the native structure. If the global model quality scores are low, then the per-residue scores can give you an idea of specific domains or regions in your protein that might be correctly modeled.
8. From the global scores, a p-value representing the probability that each model is incorrect can be calculated. In other words, for a given predicted model quality score, the p-value is the proportion of models with that score that do not share any similarity with the native structure (TM-score < 0.2). Each model is also assigned a color-coded confidence level depending on the p-value:
p < 0.001 = blue = CERT = less than a 1/1000 chance that the model is incorrect
p < 0.01 = green = HIGH = less than a 1/100 chance that the model is incorrect
p < 0.05 = yellow = MEDIUM = less than a 1/20 chance that the model is incorrect
p < 0.1 = orange = LOW = less than a 1/10 chance that the model is incorrect
p > 0.1 = red = POOR = likely to be a poor model with little or no similarity to the native structure
9. The per-residue scores indicate the predicted distance (in Ångströms) between the CA atom of the residue in the model and the CA atom of the equivalent residue in the native structure. Thumbnail images of plots depicting the per-residue error versus residue number are included in each row in the results table. Each of the thumbnails links to a page that displays a larger view of the plot and contains a further link to download a PostScript version. Each row in the table also displays a thumbnail of the 3D cartoon view of the model, which is color coded with the residue error according to the RasMol temperature coloring scheme. Each small image also links to a
page that shows a larger image of the 3D view and contains a link to download a PDB file of the model with residue accuracy predictions (Ångströms) in the B-factor column. The model is also loaded into JSmol for convenient interactive viewing of per-residue errors within the browser.
10. The time taken for a prediction will depend on the length of the sequence, the number of models submitted, and the load on the server. For a new run on a single model, the user should typically receive his/her results back within 24 h, once the job is running. Large batches of models (several hundred) for a single target may take several days to process. If the user has already submitted a model for the same target sequence within the same week, then the reference model library for that sequence will already be available to the server (the results will be cached) and so she/he will receive the results back much more quickly (within a few hours).
11. Under the fair usage policy, users are allowed to have one job running at a time for each IP address, so please wait until your previous job completes before submitting further data. If you already have a job running, then you will be notified, and your uploaded data will be deleted. Once your job has completed, your IP address will be unlocked and you will be able to submit new data.
12. Users should check the header of the machine-readable results file (provided as a link at the top of the result page) for any errors that may have occurred following file submission. Please e-mail us for help if you encounter a persistent error.

References
1. Moult J, Fidelis K, Kryshtafovych A, Schwede T, Tramontano A (2014) Critical assessment of methods of protein structure prediction (CASP)—round x. Proteins 82(Suppl 2):1–6. https://doi.org/10.1002/prot.24452
2. Haas J, Barbato A, Behringer D, Studer G, Roth S, Bertoni M, Mostaguir K, Gumienny R, Schwede T (2018) Continuous Automated Model EvaluatiOn (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins 86:387–398. https://doi.org/10.1002/prot.25431
3. McGuffin LJ (2007) Benchmarking consensus model quality assessment for protein fold recognition. BMC Bioinformatics 8:345. https://doi.org/10.1186/1471-2105-8-345
4. McGuffin LJ, Roche DB (2010) Rapid model quality assessment for protein structure predictions using the comparison of multiple models without structural alignments. Bioinformatics 26:182–188. https://doi.org/10.1093/bioinformatics/btp629
5. Roche DB, Buenavista MT, McGuffin LJ (2014) Assessing the quality of modelled 3D protein structures using the ModFOLD server. In: Kihara D (ed) Protein structure prediction. Springer, New York, pp 83–103
6. McGuffin LJ, Roche DB (2011) Automated tertiary structure prediction with accurate local model quality assessment using the IntFOLD-TS method. Proteins 79:137–146. https://doi.org/10.1002/prot.23120
7. McGuffin LJ, Buenavista MT, Roche DB (2013) The ModFOLD4 server for the quality assessment of 3D protein models. Nucleic Acids Res 41:W368–W372. https://doi.org/10.1093/nar/gkt294
8. McGuffin LJ, Atkins JD, Salehe BR, Shuid AN, Roche DB (2015) IntFOLD: an integrated server for modelling protein structures and functions from amino acid sequences. Nucleic Acids Res 43:W169–W173. https://doi.org/10.1093/nar/gkv236
9. Maghrabi AHA, McGuffin LJ (2017) ModFOLD6: an accurate web server for the global and local quality estimation of 3D protein models. Nucleic Acids Res 45:W416–W421. https://doi.org/10.1093/nar/gkx332
10. Jones DT, Singh T, Kosciolek T, Tetchner S (2015) MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31:999–1006. https://doi.org/10.1093/bioinformatics/btu791
11. Buchan DWA, Minneci F, Nugent TCO, Bryson K, Jones DT (2013) Scalable web services for the PSIPRED protein analysis workbench. Nucleic Acids Res 41:W349–W357. https://doi.org/10.1093/nar/gkt381
12. Uziela K, Wallner B (2016) ProQ2: estimation of model accuracy implemented in Rosetta. Bioinformatics 32:1411–1413. https://doi.org/10.1093/bioinformatics/btv767
13. Uziela K, Hurtado DM, Wallner B, Elofsson A (2016) ProQ3D: improved model quality assessments using Deep Learning. arXiv:1610.05189 [q-bio]
14. Olechnovič K, Venclovas Č (2017) VoroMQA: assessment of protein structure quality using interatomic contact areas. Proteins 85:1131–1145. https://doi.org/10.1002/prot.25278
15. Jones DT, Cozzetto D (2015) DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 31:857–863. https://doi.org/10.1093/bioinformatics/btu744
16. Yang J, Wang Y, Zhang Y (2016) ResQ: an approach to unified estimation of B-factor and residue-specific error in protein structure prediction. J Mol Biol 428:693–701. https://doi.org/10.1016/j.jmb.2015.09.024
17. Wu S, Zhang Y (2007) LOMETS: a local meta-threading-server for protein structure prediction. Nucleic Acids Res 35:3375–3382. https://doi.org/10.1093/nar/gkm251
18. Fukuda K, Ishida W, Uchiyama J, Rashel M, Kato S, Morita T, Muraoka A, Sumi T, Matsuzaki S, Daibata M, Fukushima A (2012) Pseudomonas aeruginosa keratitis in mice: effects of topical bacteriophage KPP12 administration. PLoS One 7:e47742. https://doi.org/10.1371/journal.pone.0047742
Chapter 5
Prediction of Intrinsic Disorder with Quality Assessment Using QUARTER
Zhonghua Wu, Gang Hu, Christopher J. Oldfield, and Lukasz Kurgan
Abstract
Intrinsically disordered regions (IDRs) are estimated to be highly abundant in nature. While only several thousand proteins are annotated with experimentally derived IDRs, computational methods can be used to predict IDRs for the millions of currently uncharacterized protein chains. Several dozen disorder predictors were developed over the last few decades. While some of these methods provide accurate predictions, unavoidably they also make some mistakes. Consequently, one of the challenges facing users of these methods is how to decide which predictions can be trusted and which are likely incorrect. This practical problem can be solved using quality assessment (QA) scores that predict correctness of the underlying (disorder) predictions at a residue level. We motivate and describe a first-of-its-kind toolbox of QA methods, QUARTER (QUality Assessment for pRotein inTrinsic disordEr pRedictions), which provides the scores for a diverse set of ten disorder predictors. QUARTER is available to the end users as a free and convenient webserver at http://biomine.cs.vcu.edu/servers/QUARTER/. We briefly describe the predictive architecture of QUARTER and provide detailed instructions on how to use the webserver. We also explain how to interpret results produced by QUARTER with the help of a case study.
Key words Intrinsic disorder, Intrinsically disordered regions, Prediction, Quality assessment, QUARTER
1 Introduction
Intrinsically disordered regions (IDRs) in protein sequences lack stable tertiary structure under physiological conditions and instead they form dynamic conformational ensembles [1–4]. Several large-scale computational estimates reveal that proteins with IDRs are highly abundant in nature [5–10] and that these regions are functionally important [11–31]. The disordered nature of these regions is encoded in their sequences; IDRs often feature high net charge and low hydrophobicity, when compared to structured protein regions. IDRs are typically depleted in aromatic residues, large hydrophobic amino acids, and valine [32]. These marked differences between the sequences of IDRs and structured regions have
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_5, © Springer Science+Business Media, LLC, part of Springer Nature 2020
motivated the development of accurate computational tools for the prediction of disorder. Over 50 disorder predictors have already been developed. A near-complete list of these methods can be assembled from several recent surveys and comparative studies [2, 33–39]. Arguably the most popular methods (listed with the number of Google Scholar citations as of Feb 21, 2019) include DisEMBL [40] (1115 citations), IUPred [41] (1561 citations), DISOPRED [42] (617 citations), VSL2 [43] (588 citations), PrDOS [44] (402 citations), ESpritz [45] (195 citations), MFDp [46] (131 citations), and SPINE-D [47] (117 citations). Several recent comparative analyses show that some disorder predictors offer high predictive quality. For instance, the disorder assessment in the CASP10 experiment (the latest CASP that has assessed these predictions) reveals that the top three methods achieve areas under the ROC curves (AUCs) equal to 0.907 (PrDOS), 0.897 (DISOPRED3), and 0.890 (MFDp) [48].
Disorder predictors generate two types of outputs for each amino acid in the input protein sequence: a real-value propensity for disordered conformation and/or a binary category (disordered vs. ordered). High values of propensity suggest that the corresponding residues are likely disordered, while low values suggest that they are ordered. The binary category is usually generated using a predictor-specific threshold, where residues with propensities greater than the threshold are predicted as disordered, while the remaining residues are predicted as ordered. Figure 1 shows an example prediction from the VSL2B method [43] for the PsaF protein from spinach (UniProt ID: P12355). This relatively obscure example is used to illustrate prediction for a protein with disorder status that is most likely unknown to the reader. The black line in the top panel of Fig. 1 represents the real-value
Fig. 1 Prediction of intrinsic disorder and the associated quality assessment (QA) scores for the PsaF protein (UniProt ID: P12355). The top panel of the figure shows the putative propensities for disorder (black line) and the binary predictions track, i.e., disordered (in red) vs. ordered (in blue) residues, which were generated by the VSL2B method. The bottom panel of the figure shows the corresponding QA scores (black line) produced with the QUARTER method. Green shading denotes regions where the predictions are assumed to be correct according to the putative propensities for disorder (at the top) and according to the QA scores (at the bottom)
Quality Assessment for Disorder Prediction
propensities, the dotted horizontal line denotes the threshold, and the color-coded horizontal track right below shows binary predictions (red for disordered residues vs. blue for ordered) generated by VSL2B. After computing the predictions, it is left to the user to decide whether and which of these predictions can be trusted. Assuming all predictions are correct would be unreasonable since none of the predictors is or should be expected to be 100% accurate. The VSL2B predictor was previously benchmarked to have 81.6% accuracy [43], which means that on average 18.4% of its lowest quality predictions should be discarded. The residues with presumably the lowest predictive quality are those with propensities closest to the binary prediction threshold, i.e., residues with higher propensities (lower propensities) are more likely to be correctly predicted as disordered (ordered). The green shading of the VSL2B predictions in Fig. 1 shows the location of the corresponding set of 81.6% of residues that are assumed to be correctly predicted. This correct/green part of the prediction suggests that the PsaF protein has IDRs at both termini, a long IDR in the middle of the chain (positions 83 to 117), and a short IDR at positions 187 to 194. The remaining short putative IDR at positions 124 to 130 is likely inaccurately predicted.
There are at least two problems with this interpretation of predictions. First, the 81.6% accuracy was measured on a benchmark data set, and it does not imply that this level of accuracy applies to each individual prediction. In fact, results for individual proteins vary widely from highly accurate predictions to cases where the majority of predictions are wrong. Second, the propensity scores are not guaranteed to be accurate indicators of the quality of predictions, as shown in a recent investigation that considered a collection of ten representative disorder predictors [49]. 
This study has revealed that for nine out of ten of these methods (the exception being VSL2B), the propensity scores provide useful information to select accurate predictions only for the natively ordered residues, while being virtually unusable for the natively disordered residues. Altogether, we argue that selection of the subset of accurate predictions is a challenging task, which is influenced by the choice of a particular predictor and a particular protein sequence. One solution to this problem is to generate quality assessment (QA) scores together with the disorder predictions. QA scores quantify correctness (confidence) of the disorder predictions at a residue level to reveal which predictions are more likely to be correct. Correctly predicted native-disordered and structured residues should have high values of the QA scores, while residues that are incorrectly predicted should have low QA scores. The QA score predictions must be optimized for specific predictors of disorder since these methods use different types of disorder annotations, were designed using different training data sets, and use different predictive architectures [2, 35–37]. While prediction of QA scores
has been pursued for over a decade for predictions of protein structure [50–54], so far only one method, called QUARTER (QUality Assessment for pRotein inTrinsic disordEr pRedictions), was developed to generate these scores for predictions of intrinsic disorder [55]. The bottom panel of Fig. 1 shows QUARTER’s QA scores for VSL2B’s predictions. The black line shows the QA scores and the green shading corresponds to the regions where VSL2B’s predictions are likely correct. They reveal that the N-terminus is intrinsically disordered while the region between positions 153 and 224 is ordered. The QA scores also suggest that the central part of this protein where the QA scores are low (positions 50 to 130) is likely incorrectly predicted. QUARTER was empirically shown to provide accurate and individually optimized QA scores for ten popular disorder predictors [55]. Empirical tests on a large test data set show that QUARTER’s QA scores are accurate and significantly better than the propensity scores generated by the disorder predictors, particularly for the native disordered residues [55]. Consequently, these QA scores avoid the two pitfalls of disorder propensities: (1) they are tailored to individual proteins and (2) they work equally well for both native disordered and native ordered residues. We note that at the end of this chapter, QA scores and disorder propensities are compared in the context of native annotations of disorder for the PsaF protein from Fig. 1. Overall, QUARTER’s QA scores provide a useful context for the underlying disorder predictions, guiding the user toward a subset of high-quality predictions that are calibrated for specific protein sequences and predictors. This chapter describes the predictive architecture of the QUARTER tool, explains how to use QUARTER’s webserver, and clarifies how to interpret results produced by this webserver. It concludes with a case study that focuses on the PsaF protein.
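The propensity-based selection discussed for Fig. 1 can be illustrated with a short sketch. It is written in Python under simple assumptions: the binary threshold of 0.5 is the value quoted for VSL2B in the case study, the kept fraction of 0.816 is the benchmarked accuracy cited above, and the function names are hypothetical and are not part of VSL2B or QUARTER. The sketch only demonstrates the naive approach that the QA scores are meant to replace.

```python
def binary_prediction(propensities, threshold=0.5):
    """Residues with propensity above the threshold are predicted as disordered."""
    return [p > threshold for p in propensities]


def confident_by_propensity(propensities, threshold=0.5, keep_fraction=0.816):
    """Mark the residues whose propensities lie farthest from the threshold.

    keep_fraction is the share of residues assumed to be predicted correctly
    (here the benchmarked accuracy of the predictor, an assumption of the
    naive approach rather than a property of any individual protein).
    """
    n_keep = int(len(propensities) * keep_fraction)
    # Rank residue indices by distance from the threshold, farthest first.
    ranked = sorted(
        range(len(propensities)),
        key=lambda i: abs(propensities[i] - threshold),
        reverse=True,
    )
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(propensities))]


# Toy example with four residues; half of them are kept.
toy = [0.95, 0.52, 0.10, 0.45]
print(binary_prediction(toy))
print(confident_by_propensity(toy, keep_fraction=0.5))
```

For a 231-residue chain such as PsaF, int(231 * 0.816) gives 188 residues, which is the number of propensity-based confident residues reported in the case study.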
2 Materials
2.1 Disorder Predictors Supported by QUARTER
QUARTER supports ten disorder predictors: three versions of the ESpritz method that were designed to predict disorder annotated using X-ray crystallography (EspritzX-ray), NMR (EspritzNMR), and the DisProt database (EspritzDisProt) [45]; two versions of IUPred that predict short (IUPredshort) and long (IUPredlong) disordered regions [56]; two versions of the DisEMBL method that predict disordered regions defined as hot loops (DisEMBLHotLoops) and based on remark 465 from the Protein Data Bank (DisEMBLremark465) [57]; GlobPlot [58]; RONN [59]; and VSL2B [43]. The selection of the ten predictors was motivated by several factors: (1) they are sufficiently computationally efficient to perform genome-scale predictions, i.e., their runtime is under 1 min for an average size protein sequence; (2) they are
Table 1 Availability for the ten disorder predictors that are supported by the QUARTER method
Disorder predictor                             URL
DisEMBLHotLoops, DisEMBLremark465, GlobPlot    http://dis.embl.de/
EspritzDisProt, EspritzNMR, EspritzX-ray       http://protein.bio.unipd.it/espritz/
IUPredlong, IUPredshort                        https://iupred2a.elte.hu/
RONN                                           https://www.strubi.ox.ac.uk/RONN, https://www.bioinformatics.nl/berndb/ronn.html
VSL2B                                          http://www.dabi.temple.edu/disprot/predictorVSL2.php
Fig. 2 Architecture of the QUARTER method
incorporated into the two popular databases of predicted disorder: MobiDB [60–62] and D2P2 [63]; (3) they provide relatively accurate disorder predictions [34, 64]; and (4) they are freely accessible online (see Table 1).
2.2 Architecture of the QUARTER Method
QUARTER uses a three-step procedure to generate QA scores for each of the ten considered disorder predictors. The three steps correspond to the following three architectural layers (Fig. 2):
1. Layer 1: Sequence profile. The first layer uses the input protein sequence to produce a sequence profile that includes sequence-derived information useful for the QA prediction. The profile integrates the disorder predictions, amino acid type (AA type) encoded in binary, and several selected physicochemical properties of residues including putative solvent accessibility, hydrophobicity, flexibility, net charge, propensity for disorder, and sequence complexity (see Note 1).
2. Layer 2: Feature encoding. The second layer converts the sequence profile into a fixed number of custom-designed numerical features that combine information across multiple elements of the profile and across multiple residues. Three types of sliding windows are used to encode profile information across residues (Fig. 2): window of three adjacent residues (in blue); window of 13 neighboring residues (in red); and window of background residues (in green). QUARTER is optimized for specific disorder predictors by empirically selecting a set of features that maximizes predictive performance for the resulting putative QA scores on a training data set (see Note 2).
3. Layer 3: Predictive model. The third layer inputs the disorder predictor-specific features into a logistic regression model that outputs the QA scores. The use of this model type is motivated by several factors. First, logistic regression generates real values in the [0, 1] interval that intuitively correspond to the QA scores. Second, this model is computationally efficient, which consequently speeds up generation of the QA scores. Third, the logistic regression-based models have been used to make numerous related types of predictions including predictions of disorder [10, 65], disordered protein and nucleic acids binding [66], disordered linkers [67], protease cleavage sites [68], and phosphorylation sites [69].
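The flow of information through Layers 2 and 3 can be sketched in a few lines of Python. This is a simplified illustration and not the QUARTER implementation: it uses a single profile track, plain window averages as features, a whole-chain average as the "background" window (an assumption, since the exact definition is not given here), and placeholder weights. The real method uses many custom-designed features and predictor-specific models selected on training data [55].

```python
import math


def window_mean(values, center, size):
    """Mean of values in a window of `size` residues centered at `center`.

    The window is clipped at the chain termini.
    """
    half = size // 2
    lo = max(0, center - half)
    hi = min(len(values), center + half + 1)
    return sum(values[lo:hi]) / (hi - lo)


def encode_features(profile_track, i):
    """Layer 2 (simplified): summarize one profile track at three scales."""
    return [
        window_mean(profile_track, i, 3),         # three adjacent residues
        window_mean(profile_track, i, 13),        # 13 neighboring residues
        sum(profile_track) / len(profile_track),  # background (assumed: whole chain)
    ]


def qa_score(features, weights, bias):
    """Layer 3: logistic regression maps features to a score in [0, 1]."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical profile track (e.g., disorder propensities) and placeholder weights.
track = [0.9, 0.8, 0.7, 0.2, 0.1, 0.1, 0.3]
weights, bias = [1.0, 0.5, -0.5], 0.0
scores = [qa_score(encode_features(track, i), weights, bias) for i in range(len(track))]
print([round(s, 3) for s in scores])
```

The logistic function guarantees that every output falls between 0 and 1, which is the property that motivates the choice of this model type in Layer 3.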
3 Methods
3.1 Running the QUARTER Webserver
The QUARTER predictor is available to the end users as a convenient and free webserver at http://biomine.cs.vcu.edu/servers/QUARTER/. The webserver calculates the QA scores for the ten popular disorder predictors listed in Table 1. The computations are done on the server side, and the user only needs a modern web browser (Internet Explorer, Firefox, Opera, or Chrome) and internet connection to make predictions. After arriving at the QUARTER webserver page, four easy steps are required to request the prediction of the disorder QA scores (Fig. 3):
1. Select a disorder predictor using a preset drop-down box. The QA scores will be computed for the selected predictor (see Note 3).
2. Insert the FASTA-formatted protein sequence(s) and disorder propensities for the selected disorder predictor into the white text box. If your input has more than one protein, then these proteins should be placed in consecutive lines (see Notes 4 and 5). Figure 3 shows an example input for the PsaF protein (see Note 6).
Fig. 3 Submission page for the QUARTER webserver. The webpage is set up to make predictions of the QA scores for the VSL2B disorder predictor and the PsaF protein
3. Enter your email address. This is the address where a unique web link to the results will be sent (see Note 7).
4. Click “Run QUARTER.” This submits the sequences to the webserver for the prediction of the QA scores.
Once the job is submitted, the browser is redirected to the QUARTER-processing page that provides information about the current position of this submission in the biomine server queue (see Notes 8 and 9). This page is automatically updated to indicate when the prediction reaches the top of the queue and when it is
Fig. 4 Webpage that summarizes results obtained from the QUARTER webserver
being processed (see Note 10). The webpage automatically redirects to the page with the results when the prediction is completed. This page includes a direct link to the result, which is the same link that is sent to the user-provided email address. The email with the link to the results is sent even if the processing or results page is closed or the web browser is shut down in the middle of the prediction process.
3.2 Results Generated by the QUARTER Webserver
Figure 4 shows the webpage that summarizes the results generated by the QUARTER webserver. The QA score results can be downloaded by clicking the “Download CSV file with the results” link. The same link is sent to the user’s email address. The link leads to a text file that provides the predicted QA scores for each submitted sequence using four lines (see Note 11):
1. Protein name that corresponds to the annotation header from the FASTA-formatted input.
2. The input sequence where the residues predicted as correct predictions are capitalized.
3. Comma-delimited disorder prediction scores.
4. Comma-delimited predicted disorder QA scores.
The residues identified in the second line of the output text file as correctly predicted are annotated by processing the predicted QA scores using a threshold. The residues with high values of the QA scores that are above the threshold are assumed to be correctly
Fig. 5 Relation between false positive rate (fpr) and coverage (fraction of residues predicted as correct predictions) for the quality assessment (QA) scores generated with the QUARTER method. These data were computed using the benchmark data set from [55]. Each color-coded curve corresponds to the QA scores generated for a different disorder predictor
predicted. The value of the threshold is set to balance the fraction of the residues that are predicted to be correct disorder predictions and the corresponding false positive rate, i.e., fraction of residues that are incorrectly predicted by a given disorder predictor but identified by QUARTER as correct predictions. Figure 5 shows this relation for the ten predictors that are included in the QUARTER webserver. As expected, the false positive rate increases as the coverage by the correct predictions goes up. The best coverage is secured for the EspritzX-ray predictor, while the worst is for the GlobPlot predictor. The threshold values for the ten disorder predictors are selected to result in a low, 10%, false positive rate. Figure 5 demonstrates that this rate corresponds to the coverage that ranges between 22% (for GlobPlot) and 58% (for EspritzX-ray). Precise, numerical values of thresholds for several selected false positive rates are shown in Table 2. These values are useful for the users who would like to process the predicted QA scores to annotate the correct predictions at different levels of false positive rates and the corresponding coverage values. For instance, a user who would like to annotate correct predictions at the 1% false positive rate for the VSL2B predictor should use threshold = 0.960.
The notification that is sent to the end-user-provided email address is shown in Fig. 6. It provides direct links to the webpage from Fig. 4 and to the text file with the results. This email can be used to access results at a later time (see Note 12).
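Users who wish to re-annotate the results at a different false positive rate can post-process the downloaded file. The following Python sketch is one possible way to do this; it assumes that each of the four fields of a record occupies exactly one line of the file, that the header starts with ">" as in the sample in Note 11, and that "above the threshold" means strictly greater. The function names are hypothetical and are not provided by the QUARTER webserver.

```python
def parse_quarter_results(text):
    """Parse four-line records: name, sequence, disorder scores, QA scores."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    for k in range(0, len(lines), 4):
        # Raises ValueError if the file does not hold complete four-line records.
        name, seq, disorder, qa = lines[k:k + 4]
        records.append({
            "name": name.lstrip(">"),
            "sequence": seq,
            "disorder": [float(v) for v in disorder.split(",")],
            "qa": [float(v) for v in qa.split(",")],
        })
    return records


def reannotate(sequence, qa_scores, threshold):
    """Capitalize residues with QA score above the threshold; lowercase the rest."""
    return "".join(
        aa.upper() if q > threshold else aa.lower()
        for aa, q in zip(sequence, qa_scores)
    )


# Toy record built from the first four residues of the sample in Note 11.
sample = ">toy\nMSfT\n0.838,0.827,0.811,0.783\n0.888,0.895,0.851,0.872\n"
record = parse_quarter_results(sample)[0]
print(reannotate(record["sequence"], record["qa"], 0.865))  # default VSL2B threshold (10% rate)
print(reannotate(record["sequence"], record["qa"], 0.960))  # stricter VSL2B threshold (1% rate)
```

The thresholds 0.865 and 0.960 are the VSL2B values from Table 2; thresholds for the other predictors should be taken from the corresponding rows of that table.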
Table 2 Threshold values that should be used to attain specific false positive rates for the prediction of QA scores for the ten disorder predictors covered by the QUARTER webserver
                   False positive rates
Predictors         0.01   0.02   0.03   0.04   0.05   0.1    0.15   0.2    0.25   0.3
VSL2B              0.960  0.946  0.933  0.921  0.911  0.865  0.829  0.797  0.77   0.745
RONN               0.887  0.863  0.845  0.83   0.817  0.771  0.739  0.715  0.696  0.679
IUPredshort        0.929  0.92   0.913  0.907  0.901  0.875  0.85   0.824  0.797  0.771
IUPredlong         0.89   0.879  0.871  0.865  0.859  0.831  0.806  0.782  0.759  0.738
GlobPlot           0.785  0.766  0.753  0.744  0.736  0.708  0.689  0.674  0.662  0.651
EspritzX-ray       0.838  0.826  0.819  0.813  0.808  0.783  0.753  0.725  0.701  0.682
EspritzNMR         0.872  0.86   0.853  0.846  0.841  0.819  0.797  0.774  0.749  0.724
EspritzDisProt     0.691  0.659  0.644  0.633  0.624  0.593  0.568  0.544  0.519  0.494
DisEMBLHotLoops    0.868  0.832  0.806  0.786  0.769  0.713  0.679  0.656  0.639  0.625
DisEMBLremark465   0.934  0.924  0.915  0.908  0.901  0.873  0.848  0.824  0.8    0.777
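The relation between a QA score threshold, the false positive rate, and the coverage can be made concrete with a small sketch. It follows the definitions given in Subheading 3.2: the false positive rate is the fraction of incorrectly predicted residues that are flagged as correct, and the coverage is the fraction of all residues that are flagged as correct. The Python code below is an illustration on hypothetical toy data and is not the procedure that was used to produce Table 2.

```python
def fpr_and_coverage(qa_scores, is_correct, threshold):
    """Return (false positive rate, coverage) for a given QA score threshold."""
    flagged = [q > threshold for q in qa_scores]
    n_wrong = sum(1 for c in is_correct if not c)
    false_pos = sum(1 for f, c in zip(flagged, is_correct) if f and not c)
    fpr = false_pos / n_wrong if n_wrong else 0.0
    coverage = sum(flagged) / len(qa_scores)
    return fpr, coverage


def threshold_for_fpr(qa_scores, is_correct, target_fpr):
    """Smallest observed score usable as a threshold without exceeding target_fpr.

    The false positive rate can only decrease as the threshold grows, so the
    first qualifying threshold in ascending order gives the highest coverage.
    """
    for t in sorted(set(qa_scores)):
        fpr, _ = fpr_and_coverage(qa_scores, is_correct, t)
        if fpr <= target_fpr:
            return t
    raise ValueError("scores must be non-empty and target_fpr non-negative")


# Toy benchmark: QA scores and whether the underlying disorder prediction was correct.
qa = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
correct = [True, True, False, True, False, False]
t = threshold_for_fpr(qa, correct, target_fpr=0.34)
print(t, fpr_and_coverage(qa, correct, t))
```

In this toy case the selected threshold is 0.5, at which one of the three incorrect predictions is still flagged as correct and four of the six residues are covered.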
Fig. 6 Notification email generated by the QUARTER webserver
4 Case Study
The PsaF protein is a component of the Photosystem I (PSI) complex. It facilitates electron transfer to the PSI from the electron donors plastocyanin and cytochrome c6 [70], and the absence of PsaF from PSI drastically reduces electron transfer [71]. PsaF’s
Fig. 7 Quality assessment (QA) of the disorder predictions for the PsaF protein (UniProt ID: P12355). The top of the figure shows the putative propensities for disorder (black line) and the binary predictions track, i.e., disordered (in red) vs. ordered (in blue) residues, which were produced by the VSL2B method. The “native” annotation track visualizes the native annotation of intrinsic disorder collected from DisProt (DisProt ID: DP00990), where red and blue denote disordered and structured regions. The green regions in the “correct predictions” line indicate regions of agreement between native and predicted disorder. The bottom of the figure shows the QA scores (black line) computed by the QUARTER method. Green shading denotes regions where the predictions are assumed to be correct according to the putative propensities for disorder (at the top) and according to the QA scores (at the bottom)
mechanism of action is through direct interaction with electron donors through an N-terminal region, which is helically amphipathic [72]. This interaction region falls within a larger IDR at the N-terminus (Fig. 7, middle panel, “native” annotation track), as determined from the structure of the PSI complex [73]. Molecular recognition is a common function for IDRs [4]. Frequently, as is the case for PsaF, short recognition regions are located within longer IDRs; these short regions have been called molecular recognition features [74–76] and are predicted to be common in nature [77, 78]. The N-terminal interaction region of PsaF recruits electron donors and/or activates them for electron transfer. Figure 7 illustrates the QA predictions from the QUARTER webserver and compares them to the putative disorder propensities for the PsaF protein in the context of the native annotation of the disorder. Intrinsic disorder predictions by VSL2B [43] for PsaF identify several potential IDRs throughout the protein (Fig. 7, middle panel, “predicted” track). These predicted IDRs correspond to prediction scores (Fig. 7, top panel, black line) greater than 0.5, and prediction scores less than 0.5 correspond to the predictions of order. As discussed in the introduction, 81.6% of residues in PsaF are assumed to be predicted correctly by VSL2B, where the most likely correct predictions correspond to the most extreme putative disorder propensities generated by VSL2B. In the case of the 231-residue-long PsaF protein, this approach identifies 188 confidently predicted residues (Fig. 7, top panel, green shaded regions). However, comparing predicted ordered and disordered
regions with known ordered and disordered regions shows that only 152 residues are predicted correctly (Fig. 7, middle panel, “correct” predictions track). Confident VSL2B predictions overlap with these correct predictions, but incorrectly predicted disordered regions in the middle and C-terminus of PsaF are spuriously indicated to be confident predictions. The QA scores that QUARTER predicts for the VSL2B predictions of the PsaF protein (Fig. 7, bottom panel, black line) contrast with the putative disorder propensity-based assessment. Applying the QA score threshold of 0.865, which corresponds to the false positive rate of 10% (Table 2), gives 74 residues that are predicted to be classified correctly by VSL2B for PsaF (Fig. 7, bottom panel, green shaded regions). These residues include the known disordered region at the N-terminus, as well as the C-terminal portion of the ordered region. All of these 74 residues are in fact correctly predicted by VSL2B. Conversely, incorrectly predicted IDRs, including a large IDR in the center of the sequence and two smaller regions at the C-terminus, are not predicted to be correctly identified by VSL2B. Overall, this example demonstrates effectiveness of QUARTER in finding high-quality predictions that are calibrated for a specific protein and a specific disorder predictor.
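When native annotations are available, as they are for PsaF, the comparison made in this case study can be reproduced with a short script. The Python sketch below uses a hypothetical helper and toy data; it counts how many predictions agree with the native annotation overall and within a selected subset (for instance, the residues above the QA score threshold). It does not contain the PsaF annotations, which are available from DisProt (DisProt ID: DP00990).

```python
def selection_report(predicted, native, selected):
    """Summarize agreement between predicted and native disorder annotations.

    predicted, native: per-residue labels, e.g. "D" (disordered) or "O" (ordered).
    selected: per-residue booleans marking predictions assumed to be correct.
    """
    correct = [p == n for p, n in zip(predicted, native)]
    n_selected = sum(selected)
    n_selected_correct = sum(1 for c, s in zip(correct, selected) if c and s)
    return {
        "accuracy": sum(correct) / len(correct),
        "selected": n_selected,
        "selected_correct": n_selected_correct,
        "selected_precision": n_selected_correct / n_selected if n_selected else None,
    }


# Toy eight-residue example (not the PsaF data).
predicted = "DDDOOODD"
native = "DDOOOOOO"
selected = [True, True, False, False, True, True, False, False]
print(selection_report(predicted, native, selected))
```

For the numbers quoted above, 152 correct predictions out of 231 residues correspond to a per-protein accuracy of about 65.8%, which is below the 81.6% benchmark value, while all 74 residues selected with the QA scores are correct.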
5 Notes
1. The putative solvent accessibility is predicted with the ASAquick method [79]. The sequence complexity is computed using the SEG algorithm [80]. The hydrophobicity is estimated using the Kyte and Doolittle index [81]. The flexibility, which is expressed with the B-factors, is computed by the method described in [82]. Lastly, propensity for intrinsic disorder is estimated using the TOPIDP scale [32].
2. The training and test data sets that were used to optimize and benchmark the QUARTER method, respectively, are available in the “Materials” section of the webpage of the QUARTER webserver at http://biomine.cs.vcu.edu/servers/QUARTER/.
3. Links to the websites of the ten disorder predictors that are covered by QUARTER are given in the “Help” section of the QUARTER webserver page. These links are useful to collect the predictions that must be entered in the second step of the prediction process.
4. The webserver accepts up to 1000 protein sequences for a single run. Information for each input protein must be placed in three consecutive lines: (line 1) protein identifier and/or name; (line 2) protein sequence using one-letter amino acid encoding; and (line 3) comma-separated disorder predictions.
5. The inputs must satisfy several requirements. First, the protein sequence must not contain any spaces or incorrect characters, i.e., only letters that denote amino acids are allowed. Second, the sequences must be longer than 20 residues. This is required by the ASAquick method [79] that is run in the background to collect putative solvent accessibility. Third, the number of putative disorder propensities must be equal to the number of residues in the input protein chain. The webserver returns a descriptive error message if any of these requirements are not met.
6. The three buttons underneath the input box provide two correctly formatted sample inputs, one for the Espritz-DisProt predictor and another for the VSL2B predictor, and the ability to clear the input box.
7. The users are required to provide an email address where notification of completed prediction and a private URL to the results are sent. The email is required since this is the most reliable way to inform the user how to locate the results. While the results also appear in the browser window, closing this window or shutting down the web browser would effectively prevent the users from having access to the results.
8. The biomine server services several other predictors including (in alphabetical order) CONNECTOR [83], CRYSTALP2 [84], Cypred [85], DFLpred [67], DisCon [86], disCoP [87], DisoRDPbind [66, 88], DMRpred [89], DRNApred [90], fDETECT [91, 92], fMoRFpred [78], hybridNAP [93], ILbind [94], MFDp [46], MFDp2 [95, 96], MoRFpred [75, 76], NsitePred [97], PPCpred [98], RAPID [99], SSCon [100], and SLIDER [10].
9. The biomine webserver utilizes a first-come-first-served queue. However, the number of concurrent submissions across all predictors listed in Note 8 that are coming from the same source is limited to three. Users that submit too many times receive a message that informs them to resubmit after one of their pending submissions is completed. 
This limit is intended to equalize access to the webserver across different users.
10. Prediction of a single protein by the QUARTER webserver takes less than 1 s.
11. A sample text file with the results produced by the QUARTER webserver for a user’s query that includes one protein follows:
>P12355
MSfTiPtnlykPLATKPKHLSSsSfaprskivcqqendqqqpkklelakvganaaaalalssvllsswsvapdaamadiagltpckeskqfakrekqalkklqaslklyaddsapalaikatmektkkrfdnygkygllcgsdglphlivsgDQRHWGEFITPGILFLYIAGWIGWVGRSYLIAirdekkptqkeiiIDVPLASSLLFRGFSWPVAAYRELLnGelvdnnf
0.838,0.827,0.811,0.783,0.759,0.729,0.711,0.679,0.695, 0.722,0.752,0.781,0.799,0.813,0.817,0.825,0.829,0.830, 0.831,0.829,0.831,0.832,0.827,0.815,0.797,0.786,0.774, 0.753,0.737,0.704,0.702,0.729,0.698,0.738,0.757,0.784, 0.790,0.789,0.795,0.795,0.792,0.783,0.773,0.757,0.737, 0.717,0.692,0.665,0.618,0.610,0.594,0.551,0.511,0.497, 0.459,0.482,0.475,0.462,0.468,0.466,0.431,0.436,0.456, 0.457,0.454,0.465,0.472,0.472,0.466,0.468,0.475,0.472, 0.466,0.456,0.454,0.444,0.448,0.456,0.417,0.414,0.456, 0.486,0.508,0.558,0.593,0.621,0.647,0.675,0.695,0.703, 0.720,0.735,0.743,0.758,0.771,0.775,0.778,0.784,0.793, 0.784,0.786,0.769,0.733,0.707,0.705,0.670,0.657,0.640, 0.630,0.637,0.628,0.625,0.616,0.591,0.561,0.552,0.512, 0.495,0.478,0.474,0.480,0.493,0.502,0.506,0.514,0.539, 0.598,0.577,0.556,0.510,0.442,0.381,0.324,0.268,0.232, 0.225,0.210,0.206,0.227,0.237,0.239,0.260,0.265,0.256, 0.259,0.282,0.299,0.299,0.311,0.333,0.342,0.314,0.301, 0.277,0.254,0.231,0.212,0.195,0.169,0.131,0.090,0.063, 0.049,0.036,0.024,0.019,0.016,0.010,0.009,0.010,0.011, 0.012,0.016,0.018,0.022,0.027,0.032,0.034,0.041,0.048, 0.055,0.077,0.124,0.193,0.291,0.402,0.499,0.581,0.632, 0.652,0.646,0.626,0.566,0.479,0.390,0.312,0.231,0.180, 0.155,0.135,0.114,0.104,0.114,0.117,0.120,0.123,0.124, 0.117,0.114,0.110,0.102,0.096,0.082,0.082,0.083,0.090, 0.104,0.115,0.136,0.161,0.202,0.224,0.277,0.354,0.450, 0.542,0.664,0.765,0.810,0.851,0.872 0.888,0.895,0.851,0.872,0.856,0.872,0.857,0.852,0.848, 0.862,0.844,0.872,0.888,0.889,0.895,0.903,0.901,0.891, 0.894,0.899,0.877,0.866,0.851,0.869,0.859,0.857,0.850, 0.834,0.832,0.830,0.832,0.834,0.814,0.815,0.808,0.802, 0.792,0.793,0.788,0.786,0.793,0.802,0.809,0.801,0.767, 0.772,0.754,0.743,0.740,0.712,0.664,0.631,0.618,0.599, 0.562,0.546,0.560,0.552,0.520,0.526,0.519,0.551,0.539, 0.523,0.547,0.522,0.541,0.559,0.590,0.584,0.618,0.645, 0.666,0.672,0.632,0.639,0.618,0.596,0.608,0.617,0.601, 0.585,0.626,0.598,0.605,0.614,0.633,0.649,0.676,0.714, 
0.733,0.774,0.779,0.795,0.811,0.797,0.793,0.812,0.813, 0.794,0.801,0.795,0.787,0.798,0.777,0.788,0.756,0.713, 0.696,0.703,0.652,0.652,0.658,0.659,0.653,0.605,0.570, 0.523,0.523,0.546,0.546,0.598,0.554,0.547,0.554,0.542, 0.565,0.549,0.608,0.668,0.708,0.737,0.762,0.771,0.763, 0.784,0.810,0.818,0.823,0.823,0.853,0.839,0.829,0.821, 0.807,0.807,0.807,0.808,0.806,0.805,0.842,0.854,0.873, 0.870,0.892,0.897,0.896,0.910,0.931,0.941,0.953,0.960, 0.968,0.969,0.973,0.975,0.979,0.981,0.980,0.984,0.984, 0.981,0.981,0.980,0.982,0.979,0.975,0.970,0.970,0.957, 0.942,0.926,0.908,0.889,0.857,0.839,0.806,0.742,0.717, 0.696,0.683,0.709,0.759,0.797,0.818,0.828,0.844,0.876,
0.888,0.907,0.914,0.934,0.937,0.932,0.929,0.925,0.935, 0.936,0.942,0.943,0.950,0.955,0.957,0.955,0.950,0.949, 0.946,0.933,0.919,0.886,0.878,0.870,0.859,0.876,0.848, 0.814,0.784,0.856,0.828,0.826,0.828
12. The predictions are kept on the webserver for a period of at least 3 months. They can be accessed via the direct link sent in the return email.
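The input requirements listed in Notes 4 and 5 can be checked locally before submission. The Python sketch below is a hypothetical helper, not a tool distributed with QUARTER. It assumes that the header line starts with ">", that only the 20 standard one-letter amino acid codes are accepted, and that propensities are written with three decimal places; these details should be confirmed against the sample inputs provided on the webserver page (see Note 6).

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: 20 standard one-letter codes


def format_quarter_input(name, sequence, propensities):
    """Return the three-line input record for one protein, or raise ValueError."""
    if not sequence or set(sequence.upper()) - AMINO_ACIDS:
        raise ValueError("sequence must contain only amino acid letters, without spaces")
    if len(sequence) <= 20:
        raise ValueError("sequence must be longer than 20 residues")
    if len(propensities) != len(sequence):
        raise ValueError("one disorder propensity is required for each residue")
    header = name if name.startswith(">") else ">" + name
    scores = ",".join(f"{p:.3f}" for p in propensities)
    return "\n".join([header, sequence, scores])


def format_quarter_batch(proteins, max_proteins=1000):
    """Join records for several proteins; the webserver accepts up to 1000 per run."""
    if len(proteins) > max_proteins:
        raise ValueError("too many proteins for a single run")
    return "\n".join(format_quarter_input(*p) for p in proteins)


# Toy 21-residue protein with a constant, made-up propensity.
toy_record = format_quarter_input("toy", "ACDEFGHIKLMNPQRSTVWYA", [0.5] * 21)
print(toy_record)
```

Records produced in this way are placed in consecutive lines, three lines per protein, as described in Note 4.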
Acknowledgments
This research was supported in part by the National Science Foundation grant 1617369 and the Robert J. Mattauch Endowment funds to L.K.
References
1. Habchi J, Tompa P, Longhi S, Uversky VN (2014) Introducing protein intrinsic disorder. Chem Rev 114(13):6561–6588 2. Lieutaud P, Ferron F, Uversky AV, Kurgan L, Uversky VN, Longhi S (2016) How disordered is my protein and what is its disorder for? A guide through the “dark side” of the protein universe. Intrinsically Disord Proteins 4(1):e1259708 3. Keith Dunker A, Barbar E, Blackledge M, Bondos SE, Dosztányi Z, Jane Dyson H, Forman-Kay J, Fuxreiter M, Gsponer J, Han K-H, Jones DT, Longhi S, Metallo SJ, Nishikawa K, Nussinov R, Obradovic Z, Pappu RV, Rost B, Selenko P, Subramaniam V, Sussman JL, Tompa P, Uversky VN (2013) What’s in a name? Why these proteins are intrinsically disordered. Intrinsically Disord Proteins 1(1):e24157 4. van der Lee R, Buljan M, Lang B, Weatheritt RJ, Daughdrill GW, Dunker AK, Fuxreiter M, Gough J, Gsponer J, Jones DT, Kim PM, Kriwacki RW, Oldfield CJ, Pappu RV, Tompa P, Uversky VN, Wright PE, Babu MM (2014) Classification of intrinsically disordered regions and proteins. Chem Rev 114(13):6589–6631 5. Peng Z, Yan J, Fan X, Mizianty MJ, Xue B, Wang K, Hu G, Uversky VN, Kurgan L (2015) Exceptionally abundant exceptions: comprehensive characterization of intrinsic disorder in all domains of life. Cell Mol Life Sci 72(1):137–151 6. Xue B, Dunker AK, Uversky VN (2012) Orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes
from viruses and the three domains of life. J Biomol Struct Dyn 30(2):137–149 7. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337 (3):635–645 8. Hu G, Wang K, Song J, Uversky VN, Kurgan L (2018) Taxonomic landscape of the dark proteomes: whole-proteome scale interplay between structural darkness, intrinsic disorder, and crystallization propensity. Proteomics 18(21–22):e1800243 9. Dunker AK, Obradovic Z, Romero P, Garner EC, Brown CJ (2000) Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform 11:161–171 10. Peng Z, Mizianty MJ, Kurgan L (2014) Genome-scale prediction of proteins with long intrinsically disordered regions. Proteins 82(1):145–158 11. Wang C, Uversky VN, Kurgan L (2016) Disordered nucleiome: abundance of intrinsic disorder in the DNA- and RNA-binding proteins in 1121 species from Eukaryota, Bacteria and Archaea. Proteomics 16(10):1486–1498 12. Romero PR, Zaidi S, Fang YY, Uversky VN, Radivojac P, Oldfield CJ, Cortese MS, Sickmeier M, LeGall T, Obradovic Z, Dunker AK (2006) Alternative splicing in concert with protein intrinsic disorder enables increased functional diversity in multicellular organisms. Proc Natl Acad Sci U S A 103 (22):8390–8395 13. Hu G, Wu Z, Uversky VN, Kurgan L (2017) Functional analysis of human hub proteins
and their interactors involved in the intrinsic disorder-enriched interactions. Int J Mol Sci 18(12) 14. Na I, Meng F, Kurgan L, Uversky VN (2016) Autophagy-related intrinsically disordered proteins in intra-nuclear compartments. Mol BioSyst 12(9):2798–2817 15. Meng F, Na I, Kurgan L, Uversky VN (2016) Compartmentalization and functionality of nuclear disorder: intrinsic disorder and protein-protein interactions in intra-nuclear compartments. Int J Mol Sci 17(1) 16. Xue B, Blocquel D, Habchi J, Uversky AV, Kurgan L, Uversky VN, Longhi S (2014) Structural disorder in viral proteins. Chem Rev 114(13):6880–6911 17. Peng Z, Oldfield CJ, Xue B, Mizianty MJ, Dunker AK, Kurgan L, Uversky VN (2014) A creature with a hundred waggly tails: intrinsically disordered proteins in the ribosome. Cell Mol Life Sci 71(8):1477–1504 18. Fuxreiter M, Toth-Petroczy A, Kraut DA, Matouschek A, Lim RY, Xue B, Kurgan L, Uversky VN (2014) Disordered proteinaceous machines. Chem Rev 114 (13):6806–6843 19. Fan X, Xue B, Dolan PT, LaCount DJ, Kurgan L, Uversky VN (2014) The intrinsic disorder status of the human hepatitis C virus proteome. Mol BioSyst 10(6):1345–1363 20. Peng Z, Xue B, Kurgan L, Uversky VN (2013) Resilience of death: intrinsic disorder in proteins involved in the programmed cell death. Cell Death Differ 20(9):1257–1267 21. Xue B, Mizianty MJ, Kurgan L, Uversky VN (2012) Protein intrinsic disorder as a flexible armor and a weapon of HIV-1. Cell Mol Life Sci 69(8):1211–1259 22. Peng Z, Mizianty MJ, Xue B, Kurgan L, Uversky VN (2012) More than just tails: intrinsic disorder in histone proteins. Mol BioSyst 8(7):1886–1901 23. Buljan M, Chalancon G, Dunker AK, Bateman A, Balaji S, Fuxreiter M, Babu MM (2013) Alternative splicing of intrinsically disordered regions and rewiring of protein interactions. Curr Opin Struct Biol 23 (3):443–450 24. Korneta I, Bujnicki JM (2012) Intrinsic disorder in the human spliceosomal proteome. PLoS Comput Biol 8(8):e1002641 25. 
Dyson HJ (2012) Roles of intrinsic disorder in protein-nucleic acid interactions. Mol BioSyst 8(1):97–104 26. Dunker AK, Silman I, Uversky VN, Sussman JL (2008) Function and structure of
inherently disordered proteins. Curr Opin Struct Biol 18(6):756–764 27. Tompa P, Fuxreiter M, Oldfield CJ, Simon I, Dunker AK, Uversky VN (2009) Close encounters of the third kind: disordered domains and the interactions of proteins. BioEssays 31(3):328–335 28. Varadi M, Zsolyomi F, Guharoy M, Tompa P (2015) Functional advantages of conserved intrinsic disorder in RNA-binding proteins. PLoS One 10(10):e0139731 29. Dosztanyi Z, Chen J, Dunker AK, Simon I, Tompa P (2006) Disorder and sequence repeats in hub proteins and their implications for network evolution. J Proteome Res 5 (11):2985–2995 30. Pancsa R, Tompa P (2016) Coding regions of intrinsic disorder accommodate parallel functions. Trends Biochem Sci 41(11):898–906 31. Tantos A, Kalmar L, Tompa P (2015) The role of structural disorder in cell cycle regulation, related clinical proteomics, disease development and drug targeting. Expert Rev Proteomics 12(3):221–233 32. Campen A, Williams RM, Brown CJ, Meng J, Uversky VN, Dunker AK (2008) TOP-IDPscale: a new amino acid scale measuring propensity for intrinsic disorder. Protein Pept Lett 15(9):956–963 33. Peng ZL, Kurgan L (2012) Comprehensive comparative assessment of in-silico predictors of disordered regions. Curr Protein Pept Sci 13(1):6–18 34. Walsh I, Giollo M, Di Domenico T, Ferrari C, Zimmermann O, Tosatto SCE (2015) Comprehensive large-scale assessment of intrinsic protein disorder. Bioinformatics 31 (2):201–208 35. Meng F, Uversky VN, Kurgan L (2017) Comprehensive review of methods for prediction of intrinsic disorder and its molecular functions. Cell Mol Life Sci 74 (17):3069–3090 36. Meng F, Uversky V, Kurgan L (2017) Computational prediction of intrinsic disorder in proteins. Curr Protoc Protein Sci 88:12.16.11–12.16.14 37. He B, Wang K, Liu Y, Xue B, Uversky VN, Dunker AK (2009) Predicting intrinsic disorder in proteins: an overview. Cell Res 19 (8):929–949 38. 
Uversky VN, Radivojac P, Iakoucheva LM, Obradovic Z, Dunker AK (2007) Prediction of intrinsic disorder and its use in functional proteomics. Methods Mol Biol 408:69–92 39. Necci M, Piovesan D, Dosztanyi Z, Tompa P, Tosatto SCE (2017) A comprehensive
assessment of long intrinsic protein disorder from the DisProt database. Bioinformatics 34(3):445–452 40. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB (2003) Protein disorder prediction: implications for structural proteomics. Structure (London, England: 1993) 11(11):1453–1459 41. Dosztányi Z, Csizmok V, Tompa P, Simon I (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21(16):3433–3434 42. Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT (2004) The DISOPRED server for the prediction of protein disorder. Bioinformatics 20(13):2138–2139 43. Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z (2006) Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics 7(1):208 44. Ishida T, Kinoshita K (2007) PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res 35(Suppl 2):W460–W464 45. Walsh I, Martin AJM, Di Domenico T, Tosatto SCE (2012) ESpritz: accurate and fast prediction of protein disorder. Bioinformatics 28(4):503–509 46. Mizianty MJ, Stach W, Chen K, Kedarisetti KD, Disfani FM, Kurgan L (2010) Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics 26(18):i489–i496 47. Zhang T, Faraggi E, Xue B, Dunker AK, Uversky VN, Zhou Y (2012) SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method. J Biomol Struct Dyn 29(4):799–813 48. Monastyrskyy B, Kryshtafovych A, Moult J, Tramontano A, Fidelis K (2014) Assessment of protein disorder region predictions in CASP10. Proteins 82(Suppl 2):127–137 49. Wu Z, Hu G, Wang K, Kurgan L (2017) Exploratory analysis of quality assessment of putative intrinsic disorder in proteins. 6th International conference on artificial intelligence and soft computing, vol LNAI 10245. Zakopane, Poland 50.
Kihara D, Chen H, Yang YD (2009) Quality assessment of protein structure models. Curr Protein Pept Sci 10(3):216–228 51. Skwark MJ, Elofsson A (2013) PconsD: ultra rapid, accurate model quality assessment for protein structure prediction. Bioinformatics 29(14):1817–1818
52. McGuffin LJ, Buenavista MT, Roche DB (2013) The ModFOLD4 server for the quality assessment of 3D protein models. Nucleic Acids Res 41(Web Server issue):W368–W372 53. Cao R, Bhattacharya D, Adhikari B, Li J, Cheng J (2016) Massive integration of diverse protein quality assessment methods to improve template based modeling in CASP11. Proteins 84(Suppl 1):247–259 54. Cao R, Adhikari B, Bhattacharya D, Sun M, Hou J, Cheng J (2017) QAcon: single model quality assessment using protein structural and contact information with machine learning techniques. Bioinformatics 33 (4):586–588 55. Hu G, Wu Z, Oldfield C, Wang C, Kurgan L (2018) Quality assessment for the putative intrinsic disorder in proteins. Bioinformatics 35(10). https://doi.org/10.1093/bioinfor matics/bty881 56. Dosztanyi Z, Csizmok V, Tompa P, Simon I (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21(16):3433–3434 57. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB (2003) Protein disorder prediction: implications for structural proteomics. Structure 11(11):1453–1459 58. Linding R, Russell RB, Neduva V, Gibson TJ (2003) GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res 31(13):3701–3708 59. Yang ZR, Thomson R, McNeil P, Esnouf RM (2005) RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics 21(16):3369–3376 60. Piovesan D, Tabaro F, Paladin L, Necci M, Micetic I, Camilloni C, Davey N, Dosztanyi Z, Meszaros B, Monzon AM, Parisi G, Schad E, Sormanni P, Tompa P, Vendruscolo M, Vranken WF, Tosatto SCE (2018) MobiDB 3.0: more annotations for intrinsic disorder, conformational diversity and interactions in proteins. Nucleic Acids Res 46(D1):D471–D476 61. Di Domenico T, Walsh I, Martin AJM, Tosatto SCE (2012) MobiDB: a comprehensive database of intrinsic protein disorder annotations. 
Bioinformatics 28 (15):2080–2081 62. Potenza E, Di Domenico T, Walsh I, Tosatto SC (2015) MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins. Nucleic Acids Res 43(Database issue):D315–D320
63. Oates ME, Romero P, Ishida T, Ghalwash M, Mizianty MJ, Xue B, Dosztanyi Z, Uversky VN, Obradovic Z, Kurgan L, Dunker AK, Gough J (2013) D(2)P(2): database of disordered protein predictions. Nucleic Acids Res 41(Database issue):D508–D516 64. Necci M, Piovesan D, Dosztanyi Z, Tosatto SCE (2017) MobiDB-lite: fast and highly specific consensus prediction of intrinsic disorder in proteins. Bioinformatics 33(9):1402–1404 65. Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker AK (2005) Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins 61(Suppl 7):176–182 66. Peng Z, Wang C, Uversky VN, Kurgan L (2017) Prediction of disordered RNA, DNA, and protein binding regions using DisoRDPbind. Methods Mol Biol 1484:187–203 67. Meng F, Kurgan L (2016) DFLpred: highthroughput prediction of disordered flexible linker regions in protein sequences. Bioinformatics 32(12):i341–i350 68. Song J, Li F, Leier A, Marquez-Lago TT, Akutsu T, Haffari G, Chou KC, Webb GI, Pike RN, Hancock J (2018) PROSPERous: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy. Bioinformatics 34(4):684–687 69. Li F, Li C, Marquez-Lago TT, Leier A, Akutsu T, Purcell AW, Smith AI, Lithgow T, Daly RJ, Song J, Chou KC (2018) Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics 34:4223 70. Hippler M, Drepper F, Farah J, Rochaix JD (1997) Fast electron transfer from cytochrome c6 and plastocyanin to photosystem I of Chlamydomonas reinhardtii requires PsaF. Biochemistry 36(21):6343–6349 71. Farah J, Rappaport F, Choquet Y, Joliot P, Rochaix JD (1995) Isolation of a psaFdeficient mutant of Chlamydomonas reinhardtii: efficient interaction of plastocyanin with the photosystem I reaction center is mediated by the PsaF subunit. EMBO J 14 (20):4976–4984 72. 
Hippler M, Drepper F, Haehnel W, Rochaix JD (1998) The N-terminal domain of PsaF: precise recognition site for binding and fast electron transfer from cytochrome c6 and plastocyanin to photosystem I of Chlamydomonas reinhardtii. Proc Natl Acad Sci U S A 95(13):7339–7344 73. Amunts A, Toporik H, Borovikova A, Nelson N (2010) Structure determination and
improved model of plant photosystem I. J Biol Chem 285(5):3478–3486 74. Oldfield CJ, Cheng Y, Cortese MS, Romero P, Uversky VN, Dunker AK (2005) Coupled folding and binding with alphahelix-forming molecular recognition elements. Biochemistry 44(37):12454–12470 75. Disfani FM, Hsu W-L, Mizianty MJ, Oldfield CJ, Xue B, Dunker AK, Uversky VN, Kurgan L (2012) MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics 28(12):i75–i83 76. Oldfield CJ, Uversky VN, Kurgan L (2018) Predicting functions of disordered proteins with MoRFpred. Methods Mol Biol 1851:337–352 77. Mohan A, Oldfield CJ, Radivojac P, Vacic V, Cortese MS, Dunker AK, Uversky VN (2006) Analysis of molecular recognition features (MoRFs). J Mol Biol 362(5):1043–1059 78. Yan J, Dunker AK, Uversky VN, Kurgan L (2016) Molecular recognition features (MoRFs) in three domains of life. Mol BioSyst 12(3):697–710 79. Faraggi E, Zhou YQ, Kloczkowski A (2014) Accurate single-sequence prediction of solvent accessible surface area using local and global features. Proteins 82(11):3170–3176 80. Wootton JC, Federhen S (1993) Statistics of local complexity in amino-acid-sequences and sequence databases. Comput Chem 17 (2):149–163 81. Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157(1):105–132 82. Vihinen M, Torkkila E, Riikonen P (1994) Accuracy of protein flexibility predictions. Proteins 19(2):141–149 83. Wang C, Kurgan L (2019) Review and comparative assessment of similarity-based methods for prediction of drug-protein interactions in the druggable human proteome. Brief Bioinform 20(6):2066–2087 84. Kurgan L, Razib AA, Aghakhani S, Dick S, Mizianty M, Jahandideh S (2009) CRYSTALP2: sequence-based protein crystallization propensity prediction. BMC Struct Biol 9:50 85. 
Kedarisetti P, Mizianty MJ, Kaas Q, Craik DJ, Kurgan L (2014) Prediction and characterization of cyclic proteins from sequences in three domains of life. Biochim Biophys Acta 1844 (1 Pt B):181–190 86. Mizianty MJ, Zhang T, Xue B, Zhou Y, Dunker AK, Uversky VN, Kurgan L (2011)
In-silico prediction of disorder content using hybrid sequence representation. BMC Bioinformatics 12:245 87. Fan X, Kurgan L (2014) Accurate prediction of disorder in protein chains with a comprehensive and empirically designed consensus. J Biomol Struct Dyn 32(3):448–464 88. Peng Z, Kurgan L (2015) High-throughput prediction of RNA, DNA and protein binding regions mediated by intrinsic disorder. Nucleic Acids Res 43(18):e121 89. Meng F, Kurgan L (2018) High-throughput prediction of disordered moonlighting regions in protein sequences. Proteins 86:1097 90. Yan J, Kurgan L (2017) DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues. Nucleic Acids Res 45(10):e84 91. Meng F, Wang C, Kurgan L (2018) fDETECT webserver: fast predictor of propensity for protein production, purification, and crystallization. BMC Bioinformatics 18(1):580 92. Mizianty MJ, Fan X, Yan J, Chalmers E, Woloschuk C, Joachimiak A, Kurgan L (2014) Covering complete proteomes with X-ray structures: a current snapshot. Acta Crystallogr D Biol Crystallogr 70(Pt 11):2781–2793 93. Zhang J, Ma Z, Kurgan L (2019) Comprehensive review and empirical analysis of hallmarks of DNA-, RNA- and protein-binding
residues in protein chains. Brief Bioinform 20 (4):1250–1268 94. Hu G, Gao J, Wang K, Mizianty MJ, Ruan J, Kurgan L (2012) Finding protein targets for small biologically relevant ligands across fold space using inverse ligand binding predictions. Structure 20(11):1815–1822 95. Mizianty MJ, Uversky V, Kurgan L (2014) Prediction of intrinsic disorder in proteins using MFDp2. Methods Mol Biol 1137:147–162 96. Mizianty MJ, Peng ZL, Kurgan L (2013) MFDp2: accurate predictor of disorder in proteins by fusion of disorder probabilities, content and profiles. Intrinsically Disord Proteins 1(1):e24428 97. Chen K, Mizianty MJ, Kurgan L (2012) Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors. Bioinformatics 28 (3):331–341 98. Mizianty MJ, Kurgan L (2011) Sequencebased prediction of protein crystallization, purification and production propensity. Bioinformatics 27(13):i24–i33 99. Yan J, Mizianty MJ, Filipow PL, Uversky VN, Kurgan L (2013) RAPID: fast and accurate sequence-based prediction of intrinsic disorder content on proteomic scale. Biochim Biophys Acta 1834(8):1671–1680 100. Yan J, Marcus M, Kurgan L (2014) Comprehensively designed consensus of standalone secondary structure predictors improves Q3 by over 3%. J Biomol Struct Dyn 32(1):36–51
Chapter 6
Modeling of Three-Dimensional RNA Structures Using SimRNA
Tomasz K. Wirecki, Chandran Nithin, Sunandan Mukherjee, Janusz M. Bujnicki, and Michał J. Boniecki
Abstract
Ribonucleic acid (RNA) molecules perform a variety of vital roles in all living cells. Their biological function depends on their structure and dynamics, both of which are difficult to determine experimentally but can be inferred theoretically from the RNA sequence. SimRNA is one of the computational methods for molecular simulations of RNA 3D structure formation. The method is based on a simplified (coarse-grained) representation of nucleotide chains, a statistically derived model of interactions (statistical potential), and the Monte Carlo method as a conformational sampling scheme. The current version of SimRNA (3.22) is able to predict basic topologies of RNA molecules with sizes up to about 50–70 nucleotides based on their sequences only, and of larger molecules if supplied with appropriate distance restraints. The user can specify various types of restraints, including secondary structure, pairwise atom–atom distances, and positions of atoms. SimRNA can also be used for studying systems composed of several RNA chains. Because SimRNA is a folding simulation method, it also allows for examining folding pathways and obtaining an approximate view of the energy landscape.
Key words RNA structure, RNA folding simulation, De novo modeling, Restraints supported modeling, Coarse-grained models, Statistical potentials, Monte Carlo simulations, Replica Exchange simulations
1 Introduction
SimRNA is a computational method that performs a Monte Carlo simulation of a system composed of one or several chains of ribonucleic acid (RNA) [1]. It relies on a simplified (coarse-grained) representation of nucleotides and a statistically derived energy function. We were inspired by similar methods that performed well in the field of protein structure prediction [2]. We transferred
Tomasz K. Wirecki, Chandran Nithin, and Sunandan Mukherjee have contributed equally and should be considered joint first authors. Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_6, © Springer Science+Business Media, LLC, part of Springer Nature 2020
Fig. 1 SimRNA coarse-grained representation of an RNA molecule (example of PDB ID: 1f87) (a). The backbone of the RNA chain is represented by two atoms per nucleotide, P and C4′ (b, c). The nucleotide bases are represented by three atoms each: N9, C2, C6 for purines (b), and N1, C2, C4 for pyrimidines (c). The three atoms of the nucleobase are used to position a 3D grid containing the information about the interactions of the entire base moiety (d)
those ideas into the RNA world, modifying them to capture the main characteristics of RNA molecules. In SimRNA, the backbone of the RNA chain is represented by two atoms per nucleotide, whereas each nucleotide base is represented by three atoms (Fig. 1). These three atoms are used to calculate a local coordinate system that positions a 3D grid, which is the actual representation of the base. The 3D grid contains information about the interactions of the entire base moiety (not only the three atoms explicitly included in the SimRNA representation).
SimRNA can run single simulations at a constant temperature (isothermal simulation), as well as simulated annealing (gradual cooling) or gradual melting (heating) of a system composed of one or more RNA chains. It can also run simulations using the Replica Exchange Monte Carlo (REMC) method, in which several copies of a system are simulated in parallel, each at its own constant temperature, with the temperatures uniformly distributed over a given range. After a predefined number of simulation steps, the algorithm attempts to swap replicas between neighboring temperature levels; each swap is accepted or rejected according to the Metropolis criterion, and the simulation continues. Because swapped replicas are effectively cooled or heated, this approach helps the system overcome energy barriers more efficiently and enhances the search of conformational space.
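The replica swap step described above can be illustrated with a minimal Python sketch. It uses the textbook replica-exchange acceptance rule, min(1, exp[(1/T_i − 1/T_j)(E_i − E_j)]), with the Boltzmann constant set to 1 because SimRNA temperatures are in arbitrary units. This is an assumption made for illustration, not a description of SimRNA's internal implementation, and the function names (temperature_ladder, swap_probability, attempt_swaps) are hypothetical.

```python
import math
import random


def temperature_ladder(t_low, t_high, n_replicas):
    """Return n_replicas (>= 2) temperatures spaced uniformly from t_low to t_high."""
    step = (t_high - t_low) / (n_replicas - 1)
    return [t_low + k * step for k in range(n_replicas)]


def swap_probability(energy_i, temp_i, energy_j, temp_j):
    """Textbook replica-exchange acceptance probability (k_B = 1, assumed)."""
    delta = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    if delta >= 0.0:
        return 1.0  # favorable swap: always accepted
    return math.exp(delta)


def attempt_swaps(energies, temps, rng=random):
    """Try one swap for each pair of neighboring temperature levels.

    energies[k] is the energy of the conformation currently held at temps[k].
    Returns the energies after the accepted swaps.
    """
    energies = list(energies)
    for k in range(len(temps) - 1):
        p = swap_probability(energies[k], temps[k], energies[k + 1], temps[k + 1])
        if rng.random() < p:
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
    return energies
```

In this sketch a swap that moves the lower-energy conformation to the colder level is always accepted, while the opposite swap is accepted only with a Boltzmann-like probability, which is how occasional heating lets a trapped replica cross an energy barrier.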
All SimRNA running modes can be supported by additional restraints and constraints, in particular secondary structure restraints, pairwise atom–atom distance restraints, and fixing the positions of selected atoms or restricting their mobility within a given range. The applied restraints modify the energy function, so the conformational space to be searched is limited (reshaped). SimRNA, along with additional tools, allows not only for de novo 3D structure prediction of RNA but also for analyzing folding/unfolding pathways and conformational landscapes, identifying potential alternative structures, and partially remodeling (often with restraints) existing structures obtained from other modeling protocols. Additional restraints can be obtained from other analyses, such as computational secondary structure prediction (for single sequences with tools such as RNAstructure [3], RNAfold [4], or IPknot [5]; for sequence alignments with tools such as PETfold [6] or Direct Coupling Analysis [7–9]), experimental probing of local structure and dynamics (e.g., with SHAPE [10, 11] or in-line probing [12]), or other experiments that provide more global information, including chemical cross-linking. Parts of the sequence that are already known or predicted to adopt a particular 3D structure (e.g., from experimental structure determination by X-ray crystallography or NMR studies [13] or from other modeling analyses) can be frozen to keep the starting conformation or restrained to fold into a particular shape. Since SimRNA is a coarse-grained method, it is equipped with a built-in all-atom reconstruction method based on fragment matching, coupled with a secondary structure (SS) classifier; together they provide all-atom PDB files accompanied by SS classification in dot-bracket notation. The SimRNA suite also contains a tool for viewing the resulting trajectories, which allows for convenient visual inspection of the output.
SimRNA can be used as a stand-alone program, run either on a desktop computer or on a computational cluster, or through its web server version (SimRNAweb, https://genesilico.pl/SimRNAweb/). The stand-alone version (see Note 1) gives access to all implemented options and all types of restraints, and the user does not depend on external computational resources. Running SimRNA on a desktop PC is possible (see Notes 2 and 3), but for more complex problems (larger RNAs, REMC runs with a higher number of replicas), a computational cluster is recommended to reduce the actual time of the requested computation. SimRNAweb is the most user-friendly option and lets the user run simulations in a simpler and more automated manner [14]; however, it is not as robust as the stand-alone version. The different state-of-the-art methods available for RNA structure simulations, using different approaches, have achieved substantial accuracy for template-based and homology-based predictions, though they have their own limitations and advantages
[15–17]. Fragment assembly of RNA (FARNA) and fragment assembly of RNA with full-atom refinement (FARFAR) are based on random fragment assembly using a classic Metropolis Monte Carlo criterion [18, 19]. The MC-Fold|MC-Sym pipeline uses nucleotide cyclic motifs (NCMs) to establish relationships between nucleotides and to infer sufficient base-pairing context to predict secondary and tertiary structures [20]. The Dokholyan group developed a discrete molecular dynamics (DMD) method for ab initio structure prediction and characterization of folding dynamics using a coarse-grained representation of RNA [21]. RNAComposer provides a fully automated pipeline that predicts the 3D structure of RNA by searching a database of RNA fragments (FRABASE [22]) based on the secondary structure provided by the user [23]. 3dRNA builds the tertiary structure of RNA starting from the smallest secondary elements (SSEs), assembling SSE conformations taken from a database of 3D conformations of SSEs from nonredundant sequences [24]. VFold3D uses a similar template assembly approach based on the predicted secondary structure; however, it relies on motif-based assembly rather than small-fragment-based assembly [25].
2 Materials
2.1 Config File
The configuration file sets a variety of parameters that influence the course of the simulation. The user can control the following parameters: the length of the simulation, the frequency of output writes, the temperature trend during the simulation, the weights of short-range energy terms, the weights of restraints, the frequencies of applying various types of Monte Carlo moves, and the limiting radius of the simulation sphere (multichain simulations). The configuration file is a text file with one parameter specified per line. An example configuration file is shown below:
NUMBER_OF_ITERATIONS 16000000
TRA_WRITE_IN_EVERY_N_ITERATIONS 16000
INIT_TEMP 1.35
FINAL_TEMP 0.90
BONDS_WEIGHT 1.0
ANGLES_WEIGHT 1.0
TORS_ANGLES_WEIGHT 0.0
ETA_THETA_WEIGHT 0.40
All available options are listed in Table 1, along with a short explanation.
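As a sanity check before launching a long run, a configuration file of this form can be read with a few lines of Python. The sketch below assumes only what is stated above (one whitespace-separated NAME VALUE pair per line); the function name parse_simrna_config is hypothetical and is not part of the SimRNA distribution.

```python
EXAMPLE_CONFIG = """\
NUMBER_OF_ITERATIONS 16000000
TRA_WRITE_IN_EVERY_N_ITERATIONS 16000
INIT_TEMP 1.35
FINAL_TEMP 0.90
BONDS_WEIGHT 1.0
ANGLES_WEIGHT 1.0
TORS_ANGLES_WEIGHT 0.0
ETA_THETA_WEIGHT 0.40
"""


def parse_simrna_config(text):
    """Parse 'NAME VALUE' lines into a dict, keeping integers as int and the rest as float."""
    params = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected 'NAME VALUE', got {raw!r}")
        name, value = parts
        try:
            params[name] = int(value)
        except ValueError:
            params[name] = float(value)
    return params


config = parse_simrna_config(EXAMPLE_CONFIG)
# Number of conformations appended to the trajectory file during the run.
frames = config["NUMBER_OF_ITERATIONS"] // config["TRA_WRITE_IN_EVERY_N_ITERATIONS"]
```

With the example values, one conformation is written every 16,000 iterations, so the trajectory holds 1000 frames.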
Table 1 Available options in the SimRNA configuration file. The options in bold are the mandatory parameters to be set in the configuration file

| Option | Type | Description |
| --- | --- | --- |
| NUMBER_OF_ITERATIONS | int | The number of iterations of a simulation |
| TRA_WRITE_IN_EVERY_N_ITERATIONS | int | Specifies how many iterations should occur before the current conformation is appended to the trajectory file |
| INIT_TEMP | float | The starting temperature of the simulation (for simulated annealing) or the minimal replica temperature (for REMC runs). The temperature units are arbitrary and are not in physical scale |
| FINAL_TEMP | float | The final temperature of the simulation (for simulated annealing) or the maximal replica temperature (for REMC runs). The temperature units are arbitrary and are not in physical scale |
| BONDS_WEIGHT, ANGLES_WEIGHT, TORS_ANGLES_WEIGHT, ETA_THETA_WEIGHT | float | Weights of the individual terms of the SimRNA energy function |
| SECOND_STRC_RESTRAINTS_WEIGHT | float | The weight of the energy penalty for violating secondary structure restraints |
| FRACTION_OF_NITROGEN_ATOM_MOVES, FRACTION_OF_ONE_ATOM_MOVES, FRACTION_OF_TWO_ATOMS_MOVES, FRACTION_OF_FRAGMENT_MOVES | float | Frequencies of subsequent Monte Carlo moves of various parts of the molecule. The default parameters were decided based on the exhaustive tests of the program. The user is advised to use discretion in changing them |
| LIMITING_SPHERE_RADIUS, LIMITING_SPHERE_WEIGHT | float | The radius and energy penalty for violating of a limiting sphere for multichain simulations. Chains contained within the sphere are not permitted to drift away from each other |

2.2 Input Files
2.2.1 Sequence Input
A SimRNA simulation always starts from a structure. The initial coarse-grained structure is either generated from the provided PDB input or, if the input is a sequence, generated in a circular form. The RNA sequence is provided on a single line in a plain text file, in either upper or lower case, as follows (from PDB ID: 2tpk):
GCUGACCAGCUAUGAGGUCAUACAUCGUCAUAGCAC
or
gcugaccagcuaugaggucauacaucgucauagcac
Note that the sequence file should contain only the desired sequence. To provide more than one RNA chain as input, the user must separate the chains by whitespace. For example (from PDB ID: 255d):
GGACUUCGGUCC GGACUUCGGUCC
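A short pre-flight check of the sequence file can catch stray characters before a simulation is started. The sketch below is an illustration only: the function name read_chains is hypothetical, and rejecting every letter other than A, C, G, and U is an assumption based on the four nucleotide types that SimRNA is stated to allow.

```python
ALLOWED_NUCLEOTIDES = set("ACGU")


def read_chains(sequence_text):
    """Split sequence-file content into chains (whitespace-separated) and validate them.

    Returns the chains in upper case; raises ValueError on empty input
    or on any character outside A, C, G, U (assumed restriction).
    """
    chains = [chain.upper() for chain in sequence_text.split()]
    if not chains:
        raise ValueError("the sequence file is empty")
    for number, chain in enumerate(chains, start=1):
        unexpected = set(chain) - ALLOWED_NUCLEOTIDES
        if unexpected:
            raise ValueError(
                f"chain {number} contains unsupported symbols: {sorted(unexpected)}"
            )
    return chains


single = read_chains("gcugaccagcuaugaggucauacaucgucauagcac")
duplex = read_chains("GGACUUCGGUCC GGACUUCGGUCC")
```

Here `single` holds one 36-nucleotide chain and `duplex` holds two 12-nucleotide chains, matching the two examples above.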
2.2.2 Structure in PDB Format as an Input
SimRNA accepts the atomic coordinates of an RNA structure in PDB format as input. However, the PDB file, especially the coordinate section, should contain only RNA chains and should strictly follow the PDB file format (see Notes 4 and 5). SimRNA reads only the atoms required for the reduced RNA representation, so a simulation can also be started from the reduced representation itself. SimRNA retains the starting details from the PDB file, such as chain IDs and residue numbers. SimRNA allows four types of nucleotides: A, C, G, and U. The atoms required for each type are listed below as they appear in the PDB file(s) (Fig. 1). Only these atoms are read by SimRNA, regardless of whether the input file is in coarse-grained or all-atom representation. A residue may contain additional atoms (they will be ignored); however, incomplete residues are not allowed, nor are ligands, water molecules, and so on.
A:
ATOM    134  P     A A   7      19.812 -20.320   4.773  1.00 70.24
ATOM    139  C4'   A A   7      16.595 -18.446   3.539  1.00 67.90
ATOM    146  N9    A A   7      18.332 -17.176   0.737  1.00 69.05
ATOM    153  C2    A A   7      17.922 -14.910  -1.952  1.00 70.82
ATOM    150  C6    A A   7      20.168 -15.413  -1.694  1.00 70.70
C:
ATOM    221  P     C A  11       5.733 -19.882   6.017  1.00 71.77
ATOM    226  C4'   C A  11       5.509 -19.413   9.888  1.00 72.83
ATOM    233  N1    C A  11       6.217 -22.708  10.272  1.00 72.05
ATOM    234  C2    C A  11       6.876 -23.731  10.969  1.00 72.05
ATOM    237  C4    C A  11       6.984 -24.969   8.991  1.00 72.30
G:
ATOM    303  P     G A  15      23.421 -27.715  10.078  1.00 84.53
ATOM    308  C4'   G A  15      23.014 -31.065   8.116  1.00 90.24
ATOM    315  N9    G A  15      20.796 -29.799   5.904  1.00 86.07
ATOM    322  C2    G A  15      19.039 -30.252   2.857  1.00 85.58
ATOM    319  C6    G A  15      19.007 -27.888   3.530  1.00 82.65
U:
ATOM    678  P     U A  32     -11.054 -47.794  19.269  1.00 98.50
ATOM    683  C4'   U A  32      -9.174 -46.476  22.471  1.00 99.46
ATOM    690  N1    U A  32      -9.556 -43.252  21.484  1.00 95.19
ATOM    691  C2    U A  32      -9.813 -41.932  21.822  1.00 94.50
ATOM    694  C4    U A  32     -10.881 -41.648  19.611  1.00 93.06
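The atom lists above translate directly into a filter for all-atom input. The following is a minimal sketch, not SimRNA's own reader: REQUIRED_ATOMS mirrors the records listed above, while atom_line and reduce_residue are hypothetical helpers, and the OP1 record uses placeholder coordinates purely to show that extra atoms are skipped.

```python
# Atoms read per nucleotide type, as listed in the records above.
REQUIRED_ATOMS = {
    "A": ("P", "C4'", "N9", "C2", "C6"),
    "G": ("P", "C4'", "N9", "C2", "C6"),
    "C": ("P", "C4'", "N1", "C2", "C4"),
    "U": ("P", "C4'", "N1", "C2", "C4"),
}


def atom_line(serial, name, res, chain, resseq, x, y, z, occ=1.0, bfac=0.0):
    """Format one fixed-column PDB ATOM record (hypothetical helper)."""
    # Atom names shorter than four characters start in column 14.
    padded = f" {name:<3}" if len(name) < 4 else name
    return (f"ATOM  {serial:>5} {padded} {res:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{bfac:6.2f}")


def reduce_residue(lines):
    """Return the ATOM records of one residue used in the reduced model."""
    kept = {}
    res_name = None
    for line in lines:
        if not line.startswith("ATOM"):
            continue  # only ATOM records are considered
        res_name = line[17:20].strip()
        if res_name not in REQUIRED_ATOMS:
            raise ValueError(f"unsupported residue type: {res_name!r}")
        atom_name = line[12:16].strip()
        if atom_name in REQUIRED_ATOMS[res_name]:
            kept[atom_name] = line  # any other atom is skipped
    if res_name is None:
        raise ValueError("no ATOM records found")
    missing = [a for a in REQUIRED_ATOMS[res_name] if a not in kept]
    if missing:
        raise ValueError(f"incomplete residue, missing: {', '.join(missing)}")
    return [kept[a] for a in REQUIRED_ATOMS[res_name]]


residue_a7 = [
    atom_line(134, "P", "A", "A", 7, 19.812, -20.320, 4.773, 1.00, 70.24),
    atom_line(135, "OP1", "A", "A", 7, 0.0, 0.0, 0.0),  # placeholder extra atom
    atom_line(139, "C4'", "A", "A", 7, 16.595, -18.446, 3.539, 1.00, 67.90),
    atom_line(146, "N9", "A", "A", 7, 18.332, -17.176, 0.737, 1.00, 69.05),
    atom_line(153, "C2", "A", "A", 7, 17.922, -14.910, -1.952, 1.00, 70.82),
    atom_line(150, "C6", "A", "A", 7, 20.168, -15.413, -1.694, 1.00, 70.70),
]
for record in reduce_residue(residue_a7):
    print(record)
```

Raising an error for an incomplete residue follows the rule stated above that incomplete residues are not allowed; how SimRNA itself reports such input is not described here.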
SimRNA reads only ATOM records. Within the ATOM record, columns up to the B-factor field are read, and everything after that is neglected.
2.2.3 Secondary Structure
The secondary structure restraints file should be prepared in dot-bracket format. The input file can also include the sequence on a separate line; however, the program will ignore it (partial base-pairing from PDB ID: 2tpk):
........(((((((............)))))))..
Note that SimRNA enforces base-pairing for the corresponding brackets but does not prevent base-pairing for the dots. A nucleotide marked with a dot in the restraints file may still form a base pair, depending on the overall architecture of the RNA structure (Fig. 2). Pseudoknots can be defined by additional lines in the secondary structure input file (full base-pairing from PDB ID: 2tpk):
........(((((((............)))))))..
..(((((........)))))................
Secondary structure restraints can also be applied to multichain structures; as in the sequence file, the chains must be separated by whitespace (from PDB ID: 255d):
(((((..((((( )))))..)))))
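To see which base pairs a restraints file of this kind requests, the dot-bracket lines can be expanded into residue pairs. The sketch below assumes that each line is parsed independently (so that a second line can encode a pseudoknot) and that residues are numbered consecutively from 1 across all chains; both function names are hypothetical and this is not SimRNA's parser.

```python
def parse_dot_bracket(lines):
    """Return sorted 1-based base pairs from one or more dot-bracket lines."""
    pairs = []
    length = None
    for line in lines:
        flat = "".join(line.split())  # chains are separated by whitespace
        if length is None:
            length = len(flat)
        elif len(flat) != length:
            raise ValueError("all lines must describe the same number of residues")
        stack = []
        for position, symbol in enumerate(flat, start=1):
            if symbol == "(":
                stack.append(position)
            elif symbol == ")":
                if not stack:
                    raise ValueError("unbalanced closing bracket")
                pairs.append((stack.pop(), position))
            elif symbol != ".":
                raise ValueError(f"unexpected symbol: {symbol!r}")
        if stack:
            raise ValueError("unbalanced opening bracket")
    return sorted(pairs)


def has_crossing_pairs(pairs):
    """True if any two pairs (i, j) and (k, l) interleave as i < k < j < l."""
    return any(i < k < j < l for (i, j) in pairs for (k, l) in pairs)


stem_only = parse_dot_bracket(["........(((((((............))))))).."])
full = parse_dot_bracket([
    "........(((((((............)))))))..",
    "..(((((........)))))................",
])
duplex = parse_dot_bracket(["(((((..((((( )))))..)))))"])
print(len(stem_only), has_crossing_pairs(stem_only))
print(len(full), has_crossing_pairs(full))
print(duplex[0], duplex[-1])
```

The crossing test is a simple way to confirm that the two-line 2tpk input describes a pseudoknot while the single-line input does not.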
2.2.4 Atomic Restraints
The SimRNA users can implement two types of atomic restraints: single-atom positional restraints (immobilization or flexible pinning) and pairwise atom–atom distances (flexible tethering). Upon imposing positional restraints, the user can restrict the movement of selected atoms. The restriction can range from the absolute immobilization (freezing) to flexible pinning that keeps the atom within a certain distance to its starting position. In case of pairwise atom–atom restraints, the distance between a pair of selected atoms is treated as a flexible tether, which forces atoms to be within a specified distance from each other (Fig. 2).
Fig. 2 Schematic representation of available restraints in SimRNA that can be used to guide a simulation
Single-Atom Positional Restraints
The immobilization or pinning of atoms is achieved by editing the input PDB file, which should then be loaded with the -P flag. To immobilize (freeze) specific atoms in space, the occupancy field of each such atom in the PDB file should be set to 0.00; its position will then remain unchanged during the entire simulation:
ATOM    305  P     C A  27      10.381  27.547  -6.444  0.00 49.85           P
Pinning is achieved by setting the occupancy field of an atom to a value between 0.00 and 1.00 and specifying the radius of unrestricted movement (Å) in the B-factor field. For example, to pin the P atom of residue 27 and let it move within a range of 3 Å:
ATOM    305  P     C A  27      10.381  27.547  -6.444  0.50  3.00           P
If the occupancy is set to 1.00, the atom movement is unrestricted during the simulation, while other atoms can still be restricted.
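Editing the occupancy and B-factor columns by hand is error-prone because PDB records are fixed-width. The sketch below rewrites columns 55-60 (occupancy) and 61-66 (B-factor) of an ATOM record following the rules above. The function name is hypothetical, and the starting occupancy of 1.00 in the example record is an assumption, since only the edited records are shown in the text.

```python
def set_positional_restraint(line, occupancy, radius=None):
    """Rewrite the occupancy (and optionally B-factor) of one ATOM record.

    occupancy 0.00 freezes the atom, 1.00 leaves it unrestricted, and a
    value in between pins it within `radius` (in angstroms).
    """
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    if not 0.0 <= occupancy <= 1.0:
        raise ValueError("occupancy must lie between 0.00 and 1.00")
    line = line.rstrip("\n").ljust(66)
    occupancy_field = f"{occupancy:6.2f}"
    # Keep the existing B-factor unless a pinning radius is given.
    b_field = line[60:66] if radius is None else f"{radius:6.2f}"
    return line[:54] + occupancy_field + b_field + line[66:]


# Record for the P atom of residue 27; the occupancy of 1.00 is assumed.
original = (f"ATOM  {305:>5}  P   {'C':>3} A{27:>4}    "
            f"{10.381:8.3f}{27.547:8.3f}{-6.444:8.3f}{1.00:6.2f}{49.85:6.2f}"
            f"{'':10}{'P':>2}")

frozen = set_positional_restraint(original, 0.00)
pinned = set_positional_restraint(original, 0.50, radius=3.0)
print(frozen)
print(pinned)
```

The edited file is then loaded with the -P flag, as described above.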
Pairwise Atom–Atom Restraints
The atom–atom distance restraints are supplied in an additional file, in which the user specifies the required distance ranges between pairs of atoms. If a distance deviates from its range, an energy penalty is applied. A restraint can also provide a reward when the desired distance is achieved. Penalties and rewards are positive and negative contributions, respectively, to the total energy of the simulated system. Pairwise restraints can be applied to any atoms in the SimRNA representation, as well as to an additional pseudo-atom that corresponds to the middle of the base and is denoted MB in the restraints file. There are two types of pairwise distance restraints: WELL and SLOPE. In a SLOPE-type restraint, the two atoms are tethered toward the desired range by a linear penalty proportional to the violation of that range. In a WELL-type restraint, a constant reward or penalty is applied when the atom–atom distance is within the desired range. Example of a restraints file:
SLOPE A/23/C4' C/45/P 5.5 8.5 1.0
WELL A/23/C4' C/45/P 6.5 7.5 1.0
where A/23/C4' means atom C4' of nucleotide 23 in chain A, and C/45/P means atom P of nucleotide 45 in chain C. The restraint weight is set to 1.0 for both the SLOPE and the WELL restraint. For SLOPE, this means that the energy penalty increases linearly with slope 1.0 for distances less than 5.5 Å or greater than 8.5 Å. For WELL, the weight is interpreted as a negative increment to the total energy; hence, a reward of 1.0 is applied if the atom–atom distance is between 6.5 Å and 7.5 Å.
2.3 Script for Generating Pairwise Atom-Atom Restraints
The script pdb_2_SimRNA_dist_restrs.py is used to generate the pairwise atom–atom distance restraints from an existing PDB structure. The script accepts a standard PDB file as an input. The pairwise distance restraints are generated for all ATOM records, which have occupancy values other than 1.0. The input PDB file needs to be edited by the user accordingly to adjust the occupancy values. If the user wants to restrain several different regions of the structure independently, the restraints should be generated separately using multiple input PDB files with occupancy altered for the corresponding regions.
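As a rough illustration of what a generated restraints file encodes, the sketch below writes one SLOPE line from a pair of coordinates and evaluates the SLOPE and WELL terms as they are described for the example restraints file. It is not the logic of pdb_2_SimRNA_dist_restrs.py, whose internals are not given here: the tolerance parameter, the function names, and the zero WELL contribution outside the range are all assumptions.

```python
import math


def restraint_energy(kind, distance, d_min, d_max, weight):
    """Energy contribution of one pairwise restraint (sketch)."""
    if kind == "SLOPE":
        if distance < d_min:
            return weight * (d_min - distance)  # linear penalty below the range
        if distance > d_max:
            return weight * (distance - d_max)  # linear penalty above the range
        return 0.0
    if kind == "WELL":
        # Constant negative increment inside the range; zero outside is assumed.
        return -weight if d_min <= distance <= d_max else 0.0
    raise ValueError(f"unknown restraint type: {kind!r}")


def restraint_line(kind, atom_a, atom_b, d_min, d_max, weight):
    """Format one line in the layout of the example restraints file."""
    return f"{kind} {atom_a} {atom_b} {d_min:.1f} {d_max:.1f} {weight:.1f}"


def slope_from_coordinates(id_a, xyz_a, id_b, xyz_b, tolerance=1.0, weight=1.0):
    """Build a SLOPE line centred on the observed distance (hypothetical)."""
    distance = math.dist(xyz_a, xyz_b)
    lower = max(distance - tolerance, 0.0)
    return restraint_line("SLOPE", id_a, id_b, lower, distance + tolerance, weight)


# P and C4' of residue 7, chain A, taken from the ATOM records in Subheading 2.2.2.
line = slope_from_coordinates("A/7/P", (19.812, -20.320, 4.773),
                              "A/7/C4'", (16.595, -18.446, 3.539))
print(line)
print(restraint_energy("SLOPE", 9.0, 5.5, 8.5, 1.0))
print(restraint_energy("WELL", 7.0, 6.5, 7.5, 1.0))
```

A symmetric tolerance around the observed distance is only one possible choice; the range written by the actual script may be derived differently.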
2.4 Tools for Processing SimRNA Output
Apart from the program itself, the SimRNA package also contains several tools for processing the output data from simulations.
2.4.1 SimRNA_trafl2pdbs – Trajectory Converter
SimRNA_trafl2pdbs converts a SimRNA trajectory (the simulation output) into one or more PDB-formatted files. Optionally, it also performs an all-atom reconstruction. In addition, the tool generates an .ss_detected file for each converted frame, containing the secondary structure in dot-bracket format. The SimRNA trajectory contains only the coordinates of the atoms (no information about the sequence, atom numbers, chain IDs, etc.); thus, the converter requires a reference PDB structure. The inputs for the converter are the trajectory data (extension .trafl), the reference PDB, and a list of the trajectory frame(s) to be converted:
$ SimRNA_trafl2pdbs structure.pdb trajectory.trafl {list} AA
where structure.pdb is the reference PDB, trajectory.trafl is a trajectory file, {list} is the list of frames to convert (e.g., 1; 12–62), and AA is a flag initiating all-atom reconstruction for the indicated frames. It is recommended to use as the reference a PDB file generated by the SimRNA simulation or the file used to initiate the simulation. Using a mismatched PDB file will result in incorrect conversion and reconstruction.
2.4.2 trafl_extract_lowestE_frame.py
trafl_extract_lowestE_frame.py is a Python script that reads a trajectory file, finds the lowest energy frame (see Note 6), and outputs a single-frame .trafl file. Usage:
$ trafl_extract_lowestE_frame.py trajectory.trafl
2.4.3 Clustering
The clustering tool processes a trajectory file by grouping frames with similar conformations into clusters. By clustering the lowest energy structures, the user can obtain representative models (the medoids of the clusters). The input for clustering is a trajectory.trafl file, which can be the concatenation of several trajectory files. The result of clustering is a set of .trafl files, each containing the conformations of one cluster. The clusters are ordered by size, from the largest to the smallest. Example usage:
$ clustering trajectory.trafl 0.01 3.5 >& mytest_clust.log
where trajectory.trafl is a trajectory file, 0.01 is the fraction of lowest energy structures from the trajectory that will be clustered, and 3.5 is the RMSD threshold (Å) for clustering (the maximal distance of a cluster member to the medoid of the cluster).
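Two ideas in the description above, selecting a fraction of the lowest energy frames and picking a medoid, can be sketched independently of the .trafl format. The code below is an illustration only and not the clustering tool itself: the rounding of the frame count, the function names, and the synthetic energies and distances are assumptions made for the example.

```python
def lowest_energy_fraction(energies, fraction):
    """Indices of the lowest-energy frames, ordered from lowest energy up."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, int(len(energies) * fraction))  # rounding rule is assumed
    order = sorted(range(len(energies)), key=lambda index: energies[index])
    return order[:count]


def medoid(members, distance):
    """Member with the smallest summed distance to all other members."""
    return min(members, key=lambda a: sum(distance[a][b] for b in members))


# Synthetic energies: a permutation of 0..199, for illustration only.
energies = [float((index * 37) % 200) for index in range(200)]
selected = lowest_energy_fraction(energies, 0.01)
print(selected)

# Synthetic pairwise distance matrix for three frames, for illustration only.
distances = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
print(medoid([0, 1, 2], distances))
```

With a fraction of 0.01, a trajectory of 200 frames contributes two frames to the clustering in this sketch.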
Fig. 3 The Energy vs RMSD plot for the simulation of a viral RNA pseudoknot (PDB ID: 1l2x) without restraints. In the energy landscape, we can identify distinct regions representing the unfolded conformations (U), partially folded pool of structures (I), and one cluster of fully folded structures (1). The crystal structure (C) was used as a reference to calculate the RMSD
2.4.4 calc_rmsd_to_1st_frame
This tool calculates the RMSD of a set of conformations relative to the first conformation in the input file. The input is a trajectory (.trafl) file, and the output is a two-column file containing RMSD and Energy values, respectively. Because the RMSD is calculated relative to the first frame, the RMSD in the first line of the output is always 0.0. The tool is useful for generating Energy vs. RMSD plots when the native structure is known (Figs. 3 and 4), which show a projection of the conformational space rooted in the native conformation. To obtain the reference conformation, the best practice is to generate a single-frame trajectory from the native structure by executing a SimRNA run with zero iterations and to concatenate it with the simulation trajectory file, placing it first. Usage:
$ calc_rmsd_to_1st_frame sum_trajectory_with_first_reference.trafl output_name.rmsd_e
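The two-column output can be inspected before plotting with a short script. The sketch below assumes whitespace-separated columns (RMSD first, energy second) and treats the first row as the reference frame; the function names and the numbers in the sample text are invented for illustration and are not simulation results.

```python
def read_rmsd_energy(text):
    """Parse two-column text (RMSD, energy) into a list of float pairs."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rows.append((float(fields[0]), float(fields[1])))
    return rows


def lowest_energy_point(rows):
    """Lowest-energy (rmsd, energy) pair among the sampled frames.

    The first row is the reference frame (RMSD 0.0) and is excluded.
    """
    sampled = rows[1:]
    if not sampled:
        raise ValueError("no sampled frames after the reference")
    return min(sampled, key=lambda row: row[1])


# Made-up values in the two-column layout described above.
sample = """0.000 -512.3
14.2 -310.5
6.8 -455.0
3.1 -498.7
"""
rows = read_rmsd_energy(sample)
print(len(rows))
print(lowest_energy_point(rows))
```

The same list of pairs can be passed to any plotting library to draw an Energy vs. RMSD scatter plot.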
2.5 traflView
traflView is a tool provided as part of the SimRNA package for visualizing the trajectories generated by SimRNA. The inputs for traflView are .trafl and .bonds files, which can be loaded with the following command:
$ traflView example_output.trafl example_output.bonds
Fig. 4 The Energy vs RMSD plot for the simulation of a pseudoknot of bacteriophage T2 (PDB ID: 2tpk) without restraints. In the energy landscape, we can identify distinct regions representing the unfolded conformations (U), partially folded pools of structures (I1, I2), and several clusters of folded structures (1–4). The crystal structure (C) was used as a reference to calculate the RMSD
Since the .trafl file contains only the coarse-grained representation of RNA, traflView renders the conformations in the reduced representation. The options of the traflView user interface are as follows (keys are case sensitive):
- c (on/off): colors the structures in rainbow style, where the 5′ end is blue and the 3′ end is red
- e (on/off): shows the energy plot as a green line at the bottom of the window
- Esc: quits the viewer
Selecting different frames (moving forward and backward) is accomplished with the digit keys (see Note 7):
- 1 and 3: one frame backward or forward, respectively
- 7 and 9: ten frames backward or forward, respectively
- 4 and 6: a hundred frames backward or forward, respectively
Rotating, zooming, and moving the current conformation is accomplished with the mouse:
- Left button + mouse movement (left/right and up/down): rotation of the structure
- Middle button + mouse movement (left/right only): zooming in/out
- Right button + mouse movement (left/right only): moving the structure
The current display can be captured and saved in a .bmp file by pressing “Shift + s” or “S” (i.e., “s” with Caps Lock on). traflView also displays the RMSD to the reference frame in the terminal from which it was launched. By default, the reference frame is the first frame of the .trafl file; however, pressing “r” sets the currently displayed frame as the new reference.
2.6 Additional Scripts
The basic way of running SimRNA simulations is to use the SimRNA binary itself (refer to Subheading 3.1). However, complex calculations require multiple SimRNA runs, initiated with the same input files but different random seeds, to sample the conformational landscape more efficiently. To do this in a semiautomated way, we have developed two scripts that set up the simulations and process their output.
2.6.1 run_SimRNA_prediction.py
The script run_SimRNA_prediction.py is dedicated to running simulations on a computational cluster. This Python script requires a job_id (supplied by the user to identify the job), followed by the input files as arguments. The job_id must be a valid string, as it is also used as an identifier in the queuing system. The input files are supplied as in SimRNA itself, using specific flags (refer to Box 1; the available flags can also be listed by typing $ run_SimRNA_prediction.py -h). Each job launched with this script is performed in its own subdirectory, named after the job_id, under the directory WORKING_SPACE. The script can launch multiple instances (runs) of the simulation, each initialized with a different random seed. The number of runs is specified with the flag -n. This allows the simulations to explore a larger conformational space than an individual simulation can. The independent runs are placed in the subdirectories run_01, run_02, run_03, etc. The script allows the user to submit jobs to both the Son of Grid Engine (SGE) and SLURM queuing systems.
Box 1 Description of Available Command Line Options that Can Be Used for SimRNA Simulations
-s input_file_sequence
The sequence file.
-p input_file_PDB
The starting structure file in PDB format.
-P input_file_PDB
The starting structure file in PDB format, with positional restraints specified (B-factor and occupancy fields modified).
-c simulation_config_file
The config file contains a variety of important settings and is the recommended way to start a simulation.
-o output_files_basename
If the output name is not specified, the program uses the PDB or sequence file name as the base name for output.
-S secondary_strc_restraints_file
The secondary structure restraints file allows the user to specify secondary structure restraints in dot-bracket format.
-r restraints_file
The file specifying pairwise atom–atom distance restraints.
-E int_number_of_Replicas
The number of replicas to use in the Replica Exchange Monte Carlo method.
-R int_number
Seed number for initiating simulations; if not specified by the user, the number is generated automatically using the system clock.
-l
Option enabling the generation of PDB files along with the trajectory output (not recommended because of the large number of PDB files generated; use the SimRNA_trafl2pdbs converter instead).
-n number_of_iterations
The number of iterations (steps) to be performed in the simulation.
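When several runs are prepared by hand, it can help to assemble the command line from the options in Box 1 rather than typing it repeatedly. The sketch below only builds argument lists and does not execute anything. The function name, the requirement of exactly one starting input, the seed values, and the output base names in the loop are assumptions made for illustration.

```python
import shlex


def simrna_command(config, sequence=None, pdb=None, pinned_pdb=None,
                   secondary=None, restraints=None, output=None,
                   replicas=None, seed=None):
    """Assemble a SimRNA argument list from the options listed in Box 1."""
    starts = [("-s", sequence), ("-p", pdb), ("-P", pinned_pdb)]
    given = [(flag, path) for flag, path in starts if path is not None]
    if len(given) != 1:
        # Simplification: exactly one starting input is required here.
        raise ValueError("give exactly one of sequence, pdb, or pinned_pdb")
    command = ["SimRNA", "-c", config, given[0][0], given[0][1]]
    optional = (("-S", secondary), ("-r", restraints), ("-o", output),
                ("-E", replicas), ("-R", seed))
    for flag, value in optional:
        if value is not None:
            command += [flag, str(value)]
    return command


single_run = simrna_command("config.dat", sequence="5SWE.seq",
                            output="5SWE_norest", replicas=10)
print(shlex.join(single_run))

# Three independent runs that differ only in seed and output base name.
for run in range(1, 4):
    command = simrna_command("config.dat", sequence="5SWE.seq",
                             output=f"5SWE_run{run:02d}", replicas=10, seed=run)
    print(shlex.join(command))
```

The resulting lists can be passed to subprocess.run or written into a job script for a queuing system.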
2.6.2 process_SimRNA_results.py
The simulations performed with run_SimRNA_prediction.py can be processed using process_SimRNA_results.py. To identify the job to be processed, the script requires the job_id that was provided when running run_SimRNA_prediction.py. The script detects the directory WORKING_SPACE/<job_id> and the subdirectories containing the output of the individual SimRNA runs. The .trafl files are combined into a single file and processed in the directory WORKING_SPACE// processing_results. The trajectory is clustered using one percent of the lowest energy frames with an RMSD threshold (in Å) equal to 10% of the RNA length (in nucleotides). The script also generates both coarse-grained and all-atom PDB structures of the representatives of the three largest clusters. The processing script copies the representative structures of the top three clusters to the directory WORKING_SPACE/ /output_PDBS. Usage:
$ process_SimRNA_results job_id_name
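The threshold rule used by the processing script, 10% of the RNA length, is easy to reproduce when the clustering tool is called manually. In the sketch below both function names are hypothetical, and the 71-nucleotide length is simply a value that yields the 7.1 Å threshold appearing in the example commands of Subheading 3.2.

```python
def clustering_threshold(n_nucleotides, fraction_of_length=0.10):
    """RMSD threshold in angstroms as a fraction of the RNA length."""
    if n_nucleotides <= 0:
        raise ValueError("the RNA length must be positive")
    return round(n_nucleotides * fraction_of_length, 2)


def clustering_command(trafl, n_nucleotides, lowest_fraction=0.01):
    """Argument list for the clustering tool with the derived threshold."""
    threshold = clustering_threshold(n_nucleotides)
    return ["clustering", trafl, str(lowest_fraction), f"{threshold:g}"]


print(clustering_threshold(71))
print(" ".join(clustering_command("5SWE_norest_ALL.trafl", 71)))
```

For the 36-nucleotide 2tpk sequence the same rule gives a threshold of 3.6 Å.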
2.7 SimRNAweb
SimRNA is also available as a web service (https://genesilico.pl/SimRNAweb/), which carries out input processing, simulation, and processing of the results, and provides the computational resources needed to perform a prediction [14]. Moreover, SimRNAweb has additional features, such as displaying the progress of the simulation and visualizing the structure while the simulation is still in progress. The output webpage displays three models from the top clusters using JSmol and provides options to download them. SimRNAweb also allows the user to download all the log files and intermediate data from the simulations.
3 Methods
3.1 Running the SimRNA Simulation
The SimRNA simulation is launched from the command line. The obligatory input required to start a simulation is either a sequence file or a structure file. The input files are loaded using the -s or -p (or -P) flags, with the following commands:
$ SimRNA -c -s
or
$ SimRNA -c -p
A structure file with modified B-factor and occupancy columns can be provided with the -P flag to impose positional restraints on specific atoms, using the following command:
$ SimRNA -c -P
All the arguments supported by SimRNA are listed in Box 1 (also refer to Notes 8 and 9).
Fig. 5 The crystal structure of an adenine riboswitch aptamer domain with a three-way junction and the P1 switch helix (PDB ID: 5swe) (a). SimRNA REMC simulations without any restraints model the different regions of the RNA correctly but fail to predict the pseudoknot (b). SimRNA REMC simulations with secondary structure restraints are able to model a structure that closely resembles the crystal structure, except for the ligand-binding regions (c). Addition of the pairwise atom–atom distance restraints allows SimRNA REMC simulations to generate structures that closely resemble the crystal structure (d)
3.2 An Example Case of Running SimRNA Simulations
In this section, we describe an example of running SimRNA simulations to predict the 3D structure of the adenine riboswitch aptamer domain (PDB ID: 5swe). The crystal structure of this RNA exhibits a three-way junction and a switch helix (Fig. 5a) [26]. In the initial run, we performed REMC simulations with ten replicas (for isothermal simulation, see Note 10) and without any restraints. This required only one input file, 5SWE.seq, containing the nucleotide sequence of the RNA to be modeled in ASCII one-letter codes. A trajectory (.trafl) file was generated for each replica during the simulation, and the trajectory files were combined into a single file. One percent of the lowest energy frames were clustered at a predefined RMSD threshold. The clustering program generated an individual .trafl file for each cluster. The medoids of the top three clusters were converted into all-atom PDB files using SimRNA_trafl2pdbs. The commands used for running and processing the simulations are as follows:
$ SimRNA -c config.dat -s 5SWE.seq -o 5SWE_norest -E 10
$ cat 5SWE_norest_??.trafl > 5SWE_norest_ALL.trafl
$ clustering 5SWE_norest_ALL.trafl 0.01 7.1 >& 5SWE_norest_clustering.log
$ SimRNA_trafl2pdbs 5SWE_norest_01-000001.pdb 5SWE_norest_ALL_thrs7.10A_clust01.trafl 1 AA
$ SimRNA_trafl2pdbs 5SWE_norest_01-000001.pdb 5SWE_norest_ALL_thrs7.10A_clust02.trafl 1 AA
$ SimRNA_trafl2pdbs 5SWE_norest_01-000001.pdb 5SWE_norest_ALL_thrs7.10A_clust03.trafl 1 AA
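The three conversion commands differ only in the cluster number, so they can be generated in a loop. The sketch below infers the cluster file naming pattern (_thrs7.10A_clust01.trafl) from the example commands above rather than from any documented rule, and both function names are hypothetical.

```python
def cluster_trafl_name(combined_trafl, threshold, rank):
    """Cluster file name following the pattern seen in the example commands."""
    suffix = ".trafl"
    if not combined_trafl.endswith(suffix):
        raise ValueError("expected a .trafl file name")
    stem = combined_trafl[:-len(suffix)]
    return f"{stem}_thrs{threshold:.2f}A_clust{rank:02d}{suffix}"


def conversion_commands(reference_pdb, combined_trafl, threshold, n_clusters=3):
    """SimRNA_trafl2pdbs argument lists for frame 1 of the top clusters."""
    return [
        ["SimRNA_trafl2pdbs", reference_pdb,
         cluster_trafl_name(combined_trafl, threshold, rank), "1", "AA"]
        for rank in range(1, n_clusters + 1)
    ]


for command in conversion_commands("5SWE_norest_01-000001.pdb",
                                   "5SWE_norest_ALL.trafl", 7.1):
    print(" ".join(command))
```

Each list mirrors one of the commands above and can be run with subprocess.run once the clustering step has produced the corresponding files.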
The representative structures generated by the REMC simulation without restraints exhibit different regions of the RNA modeled correctly (Fig. 5b). However, this setup failed to reproduce the pseudoknot observed in the crystal structure (see Note 11). To enforce the correct base-pairing and the pseudoknot, we ran another SimRNA REMC simulation with ten replicas, using the sequence and secondary structure restraints. To predict the secondary structure of the RNA, one can use IPknot [5], which is one of the best-performing methods for the prediction of RNA secondary structure with pseudoknots according to the CompRNA benchmark [27]. Since the crystal structure of the target RNA is available, the secondary structure can instead be extracted from the PDB file, e.g., using our in-house program ClaRNA [28]. The commands used for running and processing the simulations are as follows:
$ SimRNA -c config.dat -s 5SWE.seq -S 5SWE_PDB.ss -o 5SWE_ss -E 10
$ cat 5SWE_ss_??.trafl > 5SWE_ss_ALL.trafl
$ clustering 5SWE_ss_ALL.trafl 0.01 7.1 >& 5SWE_ss_clustering.log
$ SimRNA_trafl2pdbs 5SWE_ss_01-000001.pdb 5SWE_ss_ALL_thrs7.10A_clust01.trafl 1 AA
$ SimRNA_trafl2pdbs 5SWE_ss_01-000001.pdb 5SWE_ss_ALL_thrs7.10A_clust02.trafl 1 AA
$ SimRNA_trafl2pdbs 5SWE_ss_01-000001.pdb 5SWE_ss_ALL_thrs7.10A_clust03.trafl 1 AA
The representative structures from this simulation closely resemble the crystal structure except for the ligand-binding region (Fig. 5c). The binding region contains a turn, which was modeled incorrectly. To enforce the turn, pairwise atom–atom distance restraints were used. In this simulation, we restrained residues 21–23 using distances derived from the crystal structure. The ATOM records for residues 21–23 were extracted from the crystal structure and converted into the coarse-grained representation by running SimRNA with a config file in which NUMBER_OF_ITERATIONS was set to 0. The distance restraints were then generated using the script described in Subheading 2.3. The commands used for running and processing the simulations are as follows:
$ SimRNA -c config_n0.dat -p region-to-restrain.pdb
$ python pdb_2_SimRNA_dist_restrs.py region-to-restrain.pdb000001.pdb > restraints.dat
$ SimRNA -c config.dat -s 5SWE.seq -S 5SWE_PDB.ss -r restraints.dat -o 5SWE_ss_3dr -E 10
$ cat 5SWE_ss_3dr_??.trafl > 5SWE_ss_3dr_ALL.trafl
$ clustering 5SWE_ss_3dr_ALL.trafl 0.01 7.1 >& 5SWE_ss_3dr_clustering.log
$ SimRNA_trafl2pdbs 5SWE_ss_3dr_01-000001.pdb 5SWE_ss_3dr_ALL_thrs7.10A_clust01.trafl 1 AA
$ SimRNA_trafl2pdbs 5SWE_ss_3dr_01-000001.pdb 5SWE_ss_3dr_ALL_thrs7.10A_clust02.trafl 1 AA
$ SimRNA_trafl2pdbs 5SWE_ss_3dr_01-000001.pdb 5SWE_ss_3dr_ALL_thrs7.10A_clust03.trafl 1 AA
The representative structures generated by these SimRNA REMC simulations are closer to the crystal structure compared to the previous two simulations (Fig. 5d). The files from the simulations described in this section are available at the following URL: https://github.com/fryzjergda/SimRNA_example_files.git
4 Case Studies
SimRNA is generally used in combination with other methods as part of a whole modeling pipeline [29]. For instance, when modeling a long RNA for which partial structural data are available (for comparative modeling), the initial models of such regions can be generated using ModeRNA [30]. The remaining regions can be modeled de novo using SimRNA and combined into a single model of the whole molecule. SimRNA can also be used to sample conformations in order to identify the relative orientation of regions modeled with different programs. Subsequently, the models can be subjected to high-resolution refinement using tools such as QRNAS [31] (see Note 12).
One of many previously published examples of applying SimRNA to an RNA whose structure was unknown at the time of modeling is the L-glutamine riboswitch (GlnA), presented as Puzzle 14 in the RNA-Puzzles experiment in two slightly different sequence variants, corresponding to the ligand-bound and unbound forms. The structure of this RNA (for both sequence variants) was modeled using a hybrid approach that combined template-based modeling of fragments corresponding to motifs with known structures with global folding under restraints on both secondary and tertiary structure. Firstly, we obtained the secondary structure predicted for the entire GlnA family from the Rfam database [32, 33] and identified that it encompasses a three-way junction. Additional restraints on individual base pairs were added based on our interpretation of the additional information made available by the organizers of RNA-Puzzles for the free and bound forms. Secondly, we used the CARTAJ method [34] to predict the architecture of the
three-way junction and generated distance restraints to enforce coaxial stacking of helical regions. Thirdly, the sequence of the target RNA was compared with the sequences of well-known motifs with experimentally determined structures. As a result, three candidate motifs were identified with very high sequence similarity to the E-loop motif, the sarcin-ricin loop motif, and the U1A protein-binding motif, and models were generated for these sequence regions by template-based modeling with ModeRNA [30, 35], using the 1JJ2, 1Q9A, and 1URN structures from the PDB as the templates, respectively. These partial models were used as a source of pairwise distance restraints for SimRNA, derived using an in-house tool (refer to Subheading 2.3). Fourthly, for a subset of predictions, additional information from the mutate-and-map data and MOHCA data (provided by the Das lab in cooperation with the organizers of RNA-Puzzles) was used in the form of distance restraints (generated with an in-house tool developed by Dr. Wayne Dawson). The structure of the L-glutamine riboswitch was determined by X-ray crystallography in both the free and bound states, revealing a large conformational difference between the ligand-bound and unbound forms and enabling the validation of the computationally predicted structures. On the one hand, we were quite successful in predicting the bound form, which was aided by the use of the restraint on the G23-C60 base pair. On the other hand, we mispredicted a G22-U61 base pair (which turned out not to be present in the unbound form), and consequently, we failed to predict the large conformational change between the bound and the free form, resulting in globally incorrect models of the free form. This exercise shows that while additional restraints can be helpful, even one seriously wrong restraint that defines the mutual orientation of large parts of the molecule may drive the folding into a globally wrong structure.
In addition to structure prediction, SimRNA can be used to analyze the conformational space and to gain insight into possible alternative structures and intermediate states of the studied system that occur during folding. These pools of alternative structures can be used to identify the functionally important states of the molecule supported by additional experimental evidence. To illustrate the applicability of SimRNA in exploring such alternative pools, we show two simple examples of pseudoknotted RNAs, from the western yellow virus (WYV, PDB ID: 1l2x) and from bacteriophage T2 (PDB ID: 2tpk). For both cases, we performed 24 independent SimRNA REMC simulations with ten replicas each, starting from the sequence alone. The resulting trajectories were concatenated, and 20% of the frames were randomly selected. The native structures were converted into the reduced representation and added as the first frames of the concatenated trajectory files. For both cases, we ran calc_rmsd_to_1st_frame and plotted the output data (Figs. 3 and 4). We also clustered the results to identify the most populated clusters. From the energy landscapes, we can identify one well-defined cluster of folded structures in the simulations of the WYV pseudoknot, whereas for the T2 pseudoknot we see two highly populated clusters (1 and 2) and two smaller clusters (3 and 4) of folded molecules (Fig. 4). It should be noted that the largest cluster does not represent the correct fold. Additionally, the decoys generated in the simulations of the bacteriophage T2 pseudoknot exhibit a more diverse pool of intermediate structures.
5 Notes
1. The SimRNA package can be downloaded from the following URL: http://genesilico.pl/software/stand-alone/simrna
2. SimRNA can be run on GNU/Linux, Mac OS X, and WSL/Windows 10 operating systems.
3. Running SimRNA requires the data directory, or a symbolic link to it, in the current working directory. The link can be created using the following command:
$ ln -s /data data
or
$ ln -s data
4. When starting a SimRNA simulation from a PDB file, it is important to make sure that the phosphate atom at the 5′-end is present. A missing 5′-phosphate can be added using any third-party software; alternatively, as a quick and dirty workaround, one can trick SimRNA by renaming the 5′-terminal O5′ atom to P. Also, the PDB format has not been applied consistently over the years, and SimRNA may be unable to read a specific PDB file. To fix this, one can either examine the file and modify the regions that cause the problem or simply use one of our in-house servers, ModeRNA [30, 35].
5. It is recommended that the input PDB file contain unique atom numbers (i.e., no two atoms should have the same number). The output PDB file in reduced representation uses CONECT records, which rely on atom numbers. If two atoms have the same number, the visualization of bonds will be incorrect.
6. The temperature and energy values used in SimRNA do not correspond to physical units and are derived from the parametrization of the knowledge-based potential.
7. traflView reads only the digit keys. Please make sure that NumLock is ON when using the numeric keypad.
8. The command line arguments have higher precedence and override the options in the config file.
9. For more details on how to use the SimRNA program, please refer to the SimRNA User Manual, which is part of the SimRNA package.
10. SimRNA can be used for running isothermal simulations by specifying the same temperature in the INIT_TEMP and FINAL_TEMP options in the config file, for instance:
INIT_TEMP 1.15
FINAL_TEMP 1.15
11. The SimRNA package is under continuous development, and there is still room for improvement, which the user should take into account when interpreting the results of a simulation.
12. It is highly recommended to refine the final model(s) generated by SimRNA using available refinement software (e.g., QRNAS [31] or the RNAfitme server [36]) to remove clashes and bad conformations from the models.
Acknowledgments
This work was supported by the Polish National Science Center (NCN) (grant 2016/23/B/ST6/03433 to M.J.B.). T.K.W. was supported by NCN (grant 2017/25/B/NZ2/01294 to J.M.B.). C.N. and J.M.B. were additionally supported by the NCN (grant 2017/26/A/NZ1/01083 to J.M.B.) and by the IIMCB statutory funds. S.M. was supported by the IIMCB statutory funds. Simulations were performed using the computational resources of IIMCB, the Poznań Supercomputing and Networking Center at the Institute of Bioorganic Chemistry, Polish Academy of Sciences (grant 312), and the Interdisciplinary Centre for Mathematical and Computational Modelling at the University of Warsaw (grant G66-9). We thank the current and former members of the Bujnicki group (in particular developers of methods and participants of the RNA-Puzzles experiment) for their intellectual contributions.
Chapter 7
Modeling Protein Homo-Oligomer Structures with GalaxyHomomer Web Server
Minkyung Baek, Taeyong Park, Lim Heo, and Chaok Seok

Abstract
Cellular processes, such as metabolism, signal transduction, or immunity, often depend on the homo-oligomerization of proteins. Detailed structural knowledge of the homo-oligomer structure is therefore crucial for molecular-level understanding of protein functions and their regulation. In this chapter, we introduce the GalaxyHomomer server, which provides an easy-to-use web interface for general users. It is freely accessible at http://galaxy.seoklab.org/homomer. GalaxyHomomer carries out template-based modeling, ab initio docking, or both, depending on the availability of proper oligomer templates. It also incorporates recently developed model refinement methods that can consistently improve model quality by performing symmetric loop modeling and overall structure refinement. Moreover, the server provides additional options that can be chosen by the user depending on the availability of information on the monomer structure, oligomeric state, and locations of unreliable/flexible loops or termini.

Key words Homo-oligomer structure prediction, Template-based modeling, Ab initio docking, Symmetric loop modeling, Model structure refinement, GalaxyHomomer
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_7, © Springer Science+Business Media, LLC, part of Springer Nature 2020

1 Introduction
The majority of cellular proteins exist as symmetric oligomers with distinct biochemical and biophysical properties [1–3]. In many proteins, ligand binding sites and enzyme catalytic sites are located at homo-oligomer interfaces [4–6]. Therefore, a precise description of protein homo-oligomer structures is essential for a detailed molecular understanding of how these proteins function and how their functions can be regulated. Unfortunately, the homo-oligomer structures resolved by experimental methods so far cover only a small portion of the possible structural space. To fill this gap, several computational methods for modeling protein homo-oligomer structures have been developed, which differ in the type and amount of structural information required as input [7–16].

One computational approach is ab initio docking, which requires a monomer structure as input. The M-ZDOCK server [7] and the GRAMM-X server [8] are two examples; both predict homo-oligomer structures by ab initio docking based on the fast Fourier transform (FFT) algorithm. However, these tools cannot be used if the monomer structure required as input is not available. They also perform low-resolution docking, which does not consider the structural flexibility of proteins.

Another approach is template-based modeling, which starts from either a monomer structure or a monomer sequence. The GalaxyGemini server predicts homo-oligomer structures by superposing monomer structures onto oligomer structures of homologous proteins [10]. Only a few methods for predicting protein homo-oligomer structures from amino acid sequences by template-based modeling have been reported so far. Baker and co-workers have extended their tertiary structure prediction web server ROBETTA to predict quaternary structures of symmetric homo-oligomeric proteins [9, 14]. SWISS-MODEL also predicts homo-oligomer structures from amino acid sequences [13, 15]. However, both servers have the limitation that users cannot specify the oligomeric state, even when it is known from experiments.

In this chapter, we introduce the GalaxyHomomer web server [12], which predicts homo-oligomer structures of target proteins from their amino acid sequences or monomer structures. It performs template-based oligomer modeling, ab initio docking, or both, depending on the existence of proper oligomer templates. Predicted oligomer structures may contain errors due to (1) sequence differences between the target and the template proteins used for modeling or (2) the inherent flexibility of monomer structures. In the CASP11 experiment held in 2014 in collaboration with CAPRI, we showed that errors in predicted oligomer structures could be reduced by model refinement methods including loop/terminus modeling and overall structure refinement [17]. GalaxyHomomer applies such additional model refinement techniques to the homo-oligomer models built by template-based oligomer modeling and ab initio docking.

According to the assessment of the CASP13 blind protein structure prediction experiment held in 2018, the GalaxyHomomer server, which participated as “Seok-assembly”, ranked first among the servers that participated in the assembly category. When we tested GalaxyHomomer on 136 targets from the PISA benchmark set, 20 targets from the CASP11 experiment, and 89 targets from the CAMEO protein structure prediction category, its performance was comparable to or better than that of other available homo-oligomer structure prediction methods, as shown in the original paper [12].
2 Materials
1. A personal computer or other device with a web browser and internet access is required to reach the GalaxyWEB server (http://galaxy.seoklab.org). A JavaScript-enabled web browser is highly recommended for viewing the results: server compatibility was tested on Google Chrome, Firefox, Safari, and Internet Explorer.
2. To run GalaxyHomomer, a sequence in FASTA format or a structure file in standard PDB format for the protein of interest is required. The input sequence or structure file must contain only the 20 standard amino acids. The input should be a single-chain protein, and the number of amino acids should be greater than 30 and less than 1000. An example input sequence (Fig. 1, Label 1) and structure file (Fig. 1, Label 2) can be obtained from the GalaxyHomomer web page.
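The input requirements in item 2 can be checked locally before submission. The following Python sketch is only an illustration under stated assumptions: the function names, the toy sequence, and the treatment of both length limits as exclusive (following the wording "greater than 30 and less than 1000") are ours, and the server's own validation may behave differently.

```python
# Minimal pre-submission check of a FASTA query (illustrative sketch, not the server's code).
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # one-letter codes of the 20 standard residues


def parse_single_fasta(text):
    """Return the sequence of a single FASTA record as an upper-case string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) > 1:
        # The server expects a single-chain protein, so several records are rejected here.
        raise ValueError("expected a single-chain input, found several FASTA records")
    return "".join(line for line in lines if not line.startswith(">")).upper()


def check_sequence(sequence, min_exclusive=30, max_exclusive=1000):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    unknown = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if unknown:
        problems.append("non-standard residue codes: " + ", ".join(unknown))
    # Assumption: both limits are exclusive, as the Materials text suggests.
    if not (min_exclusive < len(sequence) < max_exclusive):
        problems.append("length %d is outside the accepted range" % len(sequence))
    return problems


# Toy query invented for this example only.
example = ">toy_query\n" + "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ" * 2
print(check_sequence(parse_single_fasta(example)))
```

Returning a list of problems rather than raising on the first one mirrors the way the submission page reports all input errors that need to be corrected.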
3 Methods
1. Go to GalaxyWEB, http://galaxy.seoklab.org. Click “Homomer” in the “Services” tab at the top of the page. The GalaxyHomomer web server can also be reached directly at http://galaxy.seoklab.org/homomer.
2. In the “User Information” section, enter a job name (defaults to “None”). The user can provide an e-mail address so that the server sends progress reports on the submitted job automatically. Otherwise, the user should bookmark the report page after submitting the job.
3. In the “Query Protein Information” section, provide a FASTA-formatted monomer sequence or a standard PDB-formatted monomer structure file. If only the sequence of the query protein is known, the user may provide a FASTA-formatted protein sequence by copying the sequence and pasting it into the text box (Fig. 1, Label 3). If the structure of the query protein has already been determined or predicted, the user may simply upload the protein structure file in PDB format (Fig. 1, Label 4, see Notes 1 and 2).
4. In the “Oligomeric state” section, provide the oligomeric state if it is already known (Fig. 1, Label 5). It should be given as an integer, for example, 2 for a dimer. If the oligomeric state is not provided, the server predicts it automatically based on template search results (see Note 3).
5. In the “Loops or Termini to be Refined” section, provide residue numbers indicating less reliable loop or terminus regions if the input structure has such regions (Fig. 1, Label 6). Note that this option can be used only when a monomer structure is provided. The indicated regions will be modeled by GalaxyLoop with symmetry constraints during homo-oligomer prediction. Please refer to the original GalaxyHomomer paper [12] for more details.

Fig. 1 GalaxyHomomer input submission page

6. Press the submit button to queue the job. If any errors occur with the provided input (see Note 4), the user will get a notice about the errors that need to be corrected. If the submission is successful, the user will be directed to the summary page of the submission information, which has a link to the report page (Fig. 2a). The number of jobs in the “WAIT” or “RUN” status allowed per user is limited to 3.
7. Click “LINK” on the submission information page to access the report page. The user can track the status of the submitted job on the report page, which is refreshed every 30 s (Fig. 2b). When the job is completed, the predicted results are presented automatically. The average run time of GalaxyHomomer is 6–12 h for sequence input. For structure input, homo-oligomer structure prediction finishes within 2 h in most cases.

Fig. 2 (a) Example of GalaxyHomomer submission information summary page. (b) Example of tracking GalaxyHomomer job status on report page

8. Predicted models: Five predicted homo-oligomer structures of the query protein are visualized on the report page using PV (http://biasmv.github.io/pv/), a JavaScript protein viewer, if the web browser supports JavaScript (Fig. 3). Users can zoom in and out by scrolling the mouse wheel and change the center of focus by double-clicking. Detected URLs (loops/termini whose structures are predicted to be less reliable) are colored in gray. Different homo-oligomer models can be viewed by clicking the model number in the “View in PV” line (Fig. 3, Label 1). Predicted homo-oligomer structures can also be downloaded as PDB-formatted files for further analysis (Fig. 3, Label 2).

Fig. 3 Example of predicted models section on GalaxyHomomer report page

9. Model information: Additional information on predicted models is provided in a table (Fig. 4). Oligomer templates, the number of subunits, and the interface area are shown in the table. Sequence identity or structure similarity is shown for template-based models, while the docking score is shown for models predicted by ab initio docking with Cn-symmetry (see Note 5). For each model, the detected less reliable regions are listed, and the user can run additional refinement, including loop modeling and overall structure relaxation, by clicking the “Submit” button in the table.

Fig. 4 Example of model information section on GalaxyHomomer report page. Detailed information is shown by prediction methods used to generate homo-oligomer models

10. Detailed explanations of the GalaxyHomomer web server are also provided on the help page (see Note 6); click the “Help” tab at the top of the page, and then click “GalaxyHomomer” on the right of the help page. The user can also reach the help page by clicking “Information” on the right side of the submission and report pages. The prediction method used by the GalaxyHomomer program is described in the original paper [12].
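Step 9 mentions ab initio docking with Cn-symmetry. To make the geometric meaning of Cn symmetry concrete, the following Python sketch generates the n subunits of a cyclic assembly by rotating one set of monomer coordinates about a single axis in steps of 360°/n. It illustrates the symmetry only, not the server's docking procedure; the choice of the z-axis as the symmetry axis, the function names, and the toy coordinate are assumptions made for this example.

```python
import math


def rotate_about_z(point, angle_rad):
    """Rotate a 3D point (x, y, z) about the z-axis by angle_rad radians."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y, z)


def build_cn_assembly(monomer_coords, n_subunits):
    """Return n_subunits copies of the monomer related by Cn symmetry about the z-axis.

    Copy k is the monomer rotated by k * 360/n degrees, so copy 0 is the input itself.
    """
    if n_subunits < 2:
        raise ValueError("a homo-oligomer needs at least two subunits")
    step = 2.0 * math.pi / n_subunits
    return [[rotate_about_z(p, k * step) for p in monomer_coords]
            for k in range(n_subunits)]


# Toy monomer consisting of a single point placed 1 unit away from the symmetry axis.
tetramer = build_cn_assembly([(1.0, 0.0, 0.0)], 4)
for subunit in tetramer:
    print(subunit[0])
```

Because every subunit is obtained from the same monomer by a rotation about one axis, all subunits are equivalent. Dn symmetry, which Note 5 refers to a separate server for, additionally involves twofold axes perpendicular to the main axis and is not covered by this sketch.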
4 Case Studies
The GalaxyHomomer web server has been used by independent groups to address biologically relevant questions [18–21]. One example is a study that revealed the role of archaeal MutS5 proteins in the DNA recombination mechanism [18]. In eukaryotes, MutSγ recognizes Holliday junctions to promote homologous recombination. However, a homologous protein had not been identified in archaea, and the overall molecular mechanism of archaeal homologous recombination had not been revealed, even though experimental evidence indicated that the archaeal DNA recombination mechanism is similar to that in eukaryotes. In the study, GalaxyHomomer was used to predict the homo-dimer structure of a functionally uncharacterized MutS homolog, MutS5, from the hyperthermophilic archaeon Pyrococcus horikoshii. The resulting homo-dimer model, in combination with other experimental data, supports a homologous relationship between archaeal MutS5 and eukaryotic MutSγ proteins.

Other blind prediction examples of GalaxyHomomer can be found in the CASP13 experiment held in 2018. According to the assessment, the GalaxyHomomer server, which participated as “Seok-assembly”, ranked first among the servers that participated in the assembly category. Here, we describe the GalaxyHomomer prediction results on the CASP13 target T1016, a homodimer structure of alpha-ribazole-5′-P phosphatase (PDB ID: 6E4B). The server selected a template (PDB ID: 4IJ5) of high structural similarity to the input protein structure (TM-score = 0.900 when measured by MM-align [22]). The initial model generated from the template had a ligand RMSD and an interface RMSD from the crystal structure of 5.0 and 5.1 Å, respectively, and a fraction of native contacts of 0.65. The initial model was further optimized in the refinement stage, and the quality of the model improved to 4.1 Å, 2.6 Å, and 0.80 in ligand RMSD, interface RMSD, and fraction of native contacts, respectively, as shown in Fig. 5.

Fig. 5 Predicted homo-dimer structure of CASP13 target T1016. (a) The initial structure generated based on oligomer template (chain A in sky blue and chain B in dark blue) and (b) the structure optimized by symmetric loop modeling and overall complex refinement (chain A in pink and chain B in dark magenta) are superimposed on the experimental structure (both chain A and B in yellow)
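The fraction of native contacts quoted above is the share of inter-chain residue contacts of the experimental structure that are reproduced in the model. The Python sketch below shows this calculation on invented toy contact sets and then summarizes the changes reported for T1016. The residue pairs are hypothetical and serve only to exercise the function; how contacts are detected (for example, which distance cutoff is used) is not specified in this chapter and is outside the scope of the sketch.

```python
def fraction_native_contacts(native_contacts, model_contacts):
    """Fraction of native inter-chain residue pairs that are also present in the model."""
    native = set(native_contacts)
    if not native:
        raise ValueError("the native structure has no interface contacts")
    return len(native & set(model_contacts)) / len(native)


# Hypothetical contacts, written as (residue number in chain A, residue number in chain B).
toy_native = {(10, 55), (11, 55), (12, 60), (15, 62)}
toy_model = {(10, 55), (12, 60), (15, 62), (20, 70)}
print(fraction_native_contacts(toy_native, toy_model))

# Values reported in the text for T1016: (initial model, refined model).
reported = {
    "ligand RMSD (A)": (5.0, 4.1),
    "interface RMSD (A)": (5.1, 2.6),
    "fraction of native contacts": (0.65, 0.80),
}
changes = {name: round(after - before, 2) for name, (before, after) in reported.items()}
print(changes)
```

For the two RMSD values a negative change is an improvement, whereas for the fraction of native contacts a positive change is an improvement.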
5 Notes
1. If no homologous structure is available for the query protein, it may be better to provide a monomer structure predicted by a method that supports de novo protein structure prediction, such as Robetta, QUARK, CABS-fold, or Phyre2, because GalaxyHomomer, when it has to build a monomer structure, does so only from templates.
2. If the input PDB file has missing side-chain atoms, it is recommended to use SCWRL or a similar side-chain placement program to fill in and optimize all side-chain atoms for more accurate docking.
3. Because of the memory limitations of the computer resources, GalaxyHomomer generates homo-oligomer models only from dimers to dodecamers. Also, if the total number of residues in the predicted oligomer structure exceeds 3000, refinement is not performed.
4. The most common problems with the inputs that cause submission failures are as follows:
(a) The number of residues in the input exceeds 1000. For computational efficiency, the upper limit on the number of residues is set to 1000. The user may judiciously delete irrelevant domains or termini before job submission to meet this requirement and/or to save computational cost if possible.
(b) The oligomeric state or the loops/termini to be refined are not given as integers. Any non-integer character in the text boxes for these options will cause a submission error.
(c) Loop/terminus regions shorter than 5 residues or longer than 20 residues are provided. Note also that the loops/termini to be refined option is available only when a monomer structure is provided as input. Moreover, the user should check that the input monomer structure contains the specified loop/terminus regions. If those regions are missing in the input structure, please fill them in first using GalaxyLoop (http://galaxy.seoklab.org/loop) before submission.
(d) The input PDB file contains multiple models, more than one chain, or an invalid chain identifier such as “_”. If a user has multiple model structures for a query protein, such as NMR models, it is recommended to generate single-model PDB files and to run GalaxyHomomer on each separately.
(e) The input contains nonstandard amino acids. GalaxyHomomer accepts only the 20 standard amino acids.
5. Only Cn-symmetry is considered during the ab initio docking stage in GalaxyHomomer. To generate models having Dn-symmetry by ab initio docking, please use GalaxyTongDock_D (http://galaxy.seoklab.org/tongdock), a free web server for homo-oligomer structure prediction with Dn-symmetry by FFT-based docking.
6. If the user experiences problems, he or she may write to us at: [email protected], especially if an e-mail has not been received within 1 week after job submission.
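Several of the problems listed in Note 4 can be detected in a PDB file before it is uploaded. The Python sketch below reads the fixed-column ATOM and MODEL records of the PDB format and reports multiple models, multiple chains, the "_" chain identifier, nonstandard residue names, and an excessive residue count; a second function checks the 5 to 20 residue limit for loop/terminus regions. It is a rough local pre-check under our own assumptions, not the server's validation code: all function names are hypothetical, only ATOM records are inspected (so modified residues stored as HETATM records would be missed), and `atom_line` is a toy helper that writes just enough columns for the demonstration.

```python
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def atom_line(serial, res_name, chain, res_seq):
    """Build a minimal fixed-column ATOM record (toy helper for this example only)."""
    return "ATOM  %5d  CA  %3s %s%4d    " % (serial, res_name, chain, res_seq)


def check_pdb_text(pdb_text, max_residues=1000):
    """Return a list of problems corresponding to Note 4 (a), (d), and (e)."""
    model_records = 0
    chains, residues, nonstandard = set(), set(), set()
    for line in pdb_text.splitlines():
        if line.startswith("MODEL"):
            model_records += 1
        elif line.startswith("ATOM") and len(line) >= 27:
            res_name = line[17:20].strip()      # columns 18-20: residue name
            chain = line[21]                    # column 22: chain identifier
            chains.add(chain)
            residues.add((chain, line[22:27]))  # columns 23-27: residue number + insertion code
            if res_name not in STANDARD_RESIDUES:
                nonstandard.add(res_name)
    problems = []
    if model_records > 1:
        problems.append("file contains %d models" % model_records)
    if len(chains) > 1:
        problems.append("file contains more than one chain: " + ", ".join(sorted(chains)))
    if "_" in chains:
        problems.append("invalid chain identifier '_'")
    if nonstandard:
        problems.append("nonstandard residues: " + ", ".join(sorted(nonstandard)))
    if len(residues) > max_residues:
        problems.append("%d residues exceed the limit of %d" % (len(residues), max_residues))
    return problems


def loop_length_ok(first_residue, last_residue, shortest=5, longest=20):
    """Check that a loop/terminus region spans between 5 and 20 residues (Note 4c)."""
    return shortest <= last_residue - first_residue + 1 <= longest


toy_monomer = "\n".join(atom_line(i, "ALA", "A", i) for i in range(1, 41))
print(check_pdb_text(toy_monomer))
print(loop_length_ok(10, 14), loop_length_ok(10, 12))
```

For a multi-model NMR file, the check would flag the file, and the user could then split it into single-model files as recommended in Note 4(d).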
Acknowledgments The work was supported by National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) [No. 2016M3C4A7952630].
References
1. Andre I, Strauss CE, Kaplan DB, Bradley P, Baker D (2008) Emergence of symmetry in homooligomeric biological assemblies. Proc Natl Acad Sci U S A 105(42):16148–16152. https://doi.org/10.1073/pnas.0807576105
2. Goodsell DS, Olson AJ (2000) Structural symmetry and protein function. Annu Rev Biophys Biomol Struct 29:105–153. https://doi.org/10.1146/annurev.biophys.29.1.105
3. Poupon A, Janin J (2010) Analysis and prediction of protein quaternary structure. Methods Mol Biol 609:349–364. https://doi.org/10.1007/978-1-60327-241-4_20
4. Snijder HJ, Ubarretxena-Belandia I, Blaauw M, Kalk KH, Verheij HM, Egmond MR, Dekker N, Dijkstra BW (1999) Structural evidence for dimerization-regulated activation of an integral membrane phospholipase. Nature 401(6754):717–721. https://doi.org/10.1038/44890
5. Ali A, Bandaranayake RM, Cai Y, King NM, Kolli M, Mittal S, Murzycki JF, Nalam MN, Nalivaika EA, Ozen A, Prabu-Jeyabalan MM, Thayer K, Schiffer CA (2010) Molecular basis for drug resistance in HIV-1 protease. Viruses 2(11):2509–2535. https://doi.org/10.3390/v2112509
6. Pidugu LSM, Mbimba JCE, Ahmad M, Pozharski E, Sausville EA, Emadi A, Toth EA (2016) A direct interaction between NQO1 and a chemotherapeutic dimeric naphthoquinone. BMC Struct Biol 16:ARTN 1. https://doi.org/10.1186/s12900-016-0052-x
7. Pierce B, Tong W, Weng Z (2005) M-ZDOCK: a grid-based approach for Cn symmetric multimer docking. Bioinformatics 21(8):1472–1478. https://doi.org/10.1093/bioinformatics/bti229
8. Tovchigrechko A, Vakser IA (2006) GRAMM-X public web server for protein-protein docking. Nucleic Acids Res 34(Web Server issue):W310–W314. https://doi.org/10.1093/nar/gkl206
9. DiMaio F, Leaver-Fay A, Bradley P, Baker D, Andre I (2011) Modeling symmetric macromolecular structures in Rosetta3. PLoS One 6(6):e20450. https://doi.org/10.1371/journal.pone.0020450
10. Lee H, Park H, Ko J, Seok C (2013) GalaxyGemini: a web server for protein homo-oligomer structure prediction based on similarity. Bioinformatics 29(8):1078–1080. https://doi.org/10.1093/bioinformatics/btt079
11. Ritchie DW, Grudinin S (2016) Spherical polar Fourier assembly of protein complexes with arbitrary point group symmetry. J Appl Crystallogr 49(1):158–167
12. Baek M, Park T, Heo L, Park C, Seok C (2017) GalaxyHomomer: a web server for protein homo-oligomer structure prediction from a monomer sequence or structure. Nucleic Acids Res 45(W1):W320–W324. https://doi.org/10.1093/nar/gkx246
13. Bertoni M, Kiefer F, Biasini M, Bordoli L, Schwede T (2017) Modeling protein quaternary structure of homo- and hetero-oligomers beyond binary interactions by homology. Sci Rep 7(1):10480. https://doi.org/10.1038/s41598-017-09654-8
14. Park H, Kim DE, Ovchinnikov S, Baker D, DiMaio F (2018) Automatic structure prediction of oligomeric assemblies using Robetta in CASP12. Proteins 86(Suppl 1):283–291. https://doi.org/10.1002/prot.25387
15. Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, Heer FT, de Beer TAP, Rempfer C, Bordoli L, Lepore R, Schwede T (2018) SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res 46(W1):W296–W303. https://doi.org/10.1093/nar/gky427
16. Yan Y, Tao H, Huang SY (2018) HSYMDOCK: a docking web server for predicting the structure of protein homo-oligomers with Cn or Dn symmetry. Nucleic Acids Res 46(W1):W423–W431. https://doi.org/10.1093/nar/gky398
17. Lee H, Baek M, Lee GR, Park S, Seok C (2016) Template-based modeling and ab initio refinement of protein oligomer structures using GALAXY in CAPRI round 30. Proteins. https://doi.org/10.1002/prot.25192
18. Ohshita K, Fukui K, Sato M, Morisawa T, Hakumai Y, Morono Y, Inagaki F, Yano T, Ashiuchi M, Wakamatsu T (2017) Archaeal MutS5 tightly binds to Holliday junction similarly to eukaryotic MutSgamma. FEBS J 284(20):3470–3483. https://doi.org/10.1111/febs.14204
19. Luo Y, Ahmad E, Liu ST (2018) MAD1: kinetochore receptors and catalytic mechanisms. Front Cell Dev Biol 6:51. https://doi.org/10.3389/fcell.2018.00051
20. Sajib AA, Islam T, Paul N, Yeasmin S (2018) Interaction of rs316019 variants of SLC22A2 with metformin and other drugs - an in silico analysis. J Genet Eng Biotechnol 16(2):769–775. https://doi.org/10.1016/j.jgeb.2018.01.003
21. Saju JM, Hossain MS, Liew WC, Pradhan A, Thevasagayam NM, Tan LSE, Anand A, Olsson PE, Orban L (2018) Heat shock factor 5 is essential for spermatogenesis in zebrafish. Cell Rep 25(12):3252–3261.e4. https://doi.org/10.1016/j.celrep.2018.11.090
22. Mukherjee S, Zhang Y (2009) MM-align: a quick algorithm for aligning multiple-chain protein complex structures using iterative dynamic programming. Nucleic Acids Res 37(11):e83. https://doi.org/10.1093/nar/gkp318
Chapter 8
Template-Based Modeling of Protein Complexes Using the PPI3D Web Server
Justas Dapkūnas and Česlovas Venclovas

Abstract
There is a large gap between the numbers of known protein–protein interactions and the corresponding experimentally solved structures of protein complexes. Fortunately, this gap can be in part bridged by computational structure modeling methods. Currently, template-based modeling is the most accurate means to predict both individual protein structures and protein complexes. One of the major issues in template-based modeling is to identify homologous structures that could be utilized as templates. To simplify this task, we have developed the PPI3D web server. The server is not only able to search for homologous protein complexes, but also provides means to analyze identified interactions and to model protein complexes. In recent CASP and CAPRI experiments, PPI3D proved to be a useful tool for homology modeling of multimeric proteins. In this chapter, we provide a brief description of the PPI3D web server capabilities and how to use the server for modeling of protein complexes.

Key words Protein structure, Protein–protein interactions, Protein–peptide interactions, Protein complex, Structure prediction, Template-based modeling, Homology modeling, Clustering, Docking
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_8, © Springer Science+Business Media, LLC, part of Springer Nature 2020

1 Introduction
Advances in DNA sequencing technologies have resulted in an avalanche of genome sequence data and in very fast growth of the number of genome-encoded protein sequences [1]. To carry out their functions, most proteins interact with each other, forming either dimeric or multimeric complexes that can be either transient or very stable. The number of known interactions is also growing rapidly because many high-throughput methods are available to elucidate them [2]. However, to comprehensively understand protein interactions, it is essential to know the three-dimensional (3D) structures of the corresponding protein complexes. Their 3D structures can be determined by X-ray crystallography, nuclear magnetic resonance, or cryo-electron microscopy. Despite recent technological advances, these structure determination methods remain slow and expensive and have low throughput, especially in the case of large protein complexes. Consequently, there is a huge gap between the numbers of known protein sequences, discovered protein–protein interactions, and the corresponding experimentally solved 3D structures [3, 4].

Fortunately, the structures of proteins are conserved in the course of evolution. This means that if the structure of a protein is known, it can usually serve as a template to model the structures of its homologs. This approach, called template-based modeling and also known as comparative or homology modeling, is the most reliable for protein structure prediction. There are many software tools for homology modeling of individual proteins, which are widely used by experimental molecular biologists [5, 6]. As interactions between proteins are to a large degree conserved, comparative modeling can also be applied to modeling protein–protein interactions and predicting the structures of protein complexes [7, 8]. In recent years, interest in modeling protein multimers has been steadily increasing. As a result, an increasing number of tools are becoming available for template-based modeling of protein complexes [9–12].

To facilitate analysis and modeling of protein–protein interactions, we have developed the PPI3D web server [13]. Provided only with the sequences of interacting proteins, PPI3D can find structural data on protein–protein interactions available for these sequences or their homologs in the Protein Data Bank (PDB) [14]. The identified interactions can be further analyzed in detail. Furthermore, they can be used as templates for comparative structure modeling of protein complexes.

The key feature of the PPI3D web server that distinguishes it from other related applications is the clustering of protein interaction data. The data in PDB are highly redundant, because protein structures are often solved multiple times. In the case of individual proteins, this redundancy can be reduced simply by clustering the proteins according to the similarity of their sequences. However, this is not possible for protein complexes, because proteins frequently interact by forming alternative interaction interfaces [15]. For this reason, the protein–protein interaction interfaces in PPI3D are clustered according to both sequence and structure similarity. This allows reduction of data redundancy while preserving the alternative protein interaction modes. At the same time, such clustering facilitates the analysis and selection of available templates for structure modeling.

Progress in protein structure modeling is monitored in the CASP (Critical Assessment of protein Structure Prediction) and CAPRI (Critical Assessment of PRediction of Interactions) experiments [16, 17]. In the recent rounds of these experiments, both our laboratory [18] and an independent research group [19] have demonstrated the usefulness of the PPI3D web server for template-based modeling of protein complexes. In this chapter, we provide a practical guide on how to use PPI3D for modeling structures of protein complexes.
2 Materials
PPI3D web server is available at http://bioinformatics.lt/ppi3d. Even though the structure modeling part in PPI3D is not fully automated and requires manual input in several steps, the software is easy to use, designed for both computational and experimental biologists who are interested in protein interactions. The workflow of data analysis and structure modeling utilizing the PPI3D web server is given in Fig. 1, and is described in more detail in further sections.

2.1 PPI3D Database
The database underlying the PPI3D server is based on the structural data available from PDB [14]. The PDB biological assemblies are downloaded, and all protein interactions are identified in non-NMR structures having resolution better than 4 Å. The interaction interfaces are defined and analyzed using Voronoi tessellation [20], and then clustered according to sequence and structure similarity [21, 22]. This allows reduction of the PDB data redundancy without losing alternative protein–protein interaction modes. The PPI3D database is updated weekly in order to be synchronized with the PDB. As a result, the users are always provided with the newest available structural data on protein interactions.
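The two ideas in this paragraph, the selection criterion (non-NMR structures with resolution better than 4 Å) and clustering that keeps alternative interaction modes apart, can be illustrated with a deliberately simplified Python sketch. Everything in it is an assumption made for illustration: the record fields, the entry names, and above all the use of precomputed labels (`family` for "similar sequences", `mode` for "similar interface structure") in place of the actual sequence and structure similarity measures used by PPI3D [21, 22].

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InterfaceRecord:
    entry: str                   # hypothetical entry name, not a real PDB ID
    method: str                  # experimental method, e.g. "X-RAY", "EM", "NMR"
    resolution: Optional[float]  # in angstroms; None when not applicable
    family: str                  # stand-in label for a group of similar sequences
    mode: str                    # stand-in label for a group of similar interface structures


def passes_filter(record, max_resolution=4.0):
    """Keep non-NMR structures with resolution better (numerically lower) than 4 A."""
    return (record.method != "NMR"
            and record.resolution is not None
            and record.resolution < max_resolution)


def cluster_interfaces(records):
    """Group interfaces that agree in BOTH the sequence label and the interface label."""
    clusters = {}
    for record in records:
        clusters.setdefault((record.family, record.mode), []).append(record.entry)
    return clusters


toy_records = [
    InterfaceRecord("entry1", "X-RAY", 2.1, "kinase", "back-to-back"),
    InterfaceRecord("entry2", "X-RAY", 2.8, "kinase", "back-to-back"),
    InterfaceRecord("entry3", "EM", 3.5, "kinase", "head-to-tail"),
    InterfaceRecord("entry4", "NMR", None, "kinase", "back-to-back"),
    InterfaceRecord("entry5", "X-RAY", 4.6, "kinase", "head-to-tail"),
]
kept = [r for r in toy_records if passes_filter(r)]
print(cluster_interfaces(kept))
```

In this toy data set all proteins belong to the same sequence family, so clustering by sequence alone would merge them into one group. Requiring agreement in the interface label as well keeps the two interaction modes in separate clusters, which is the point made in the text.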
2.2
Data Input
Input into the server is one or more protein sequences that can be pasted into the text form in FASTA format (Fig. 2). Additionally, UniProt accession numbers (ACs) are accepted for automated downloading of protein sequences from UniProt [1]. Throughout the text, input sequences are referred to as query sequences or, in the jargon of protein modelers, target sequences. Two major modes for data input are available in the PPI3D server: a single-sequence query and a two-sequences query. A single-sequence query finds all protein–protein and protein–peptide interactions that have structural data for the target sequence or its homologs. A two-sequences query identifies all binary protein–protein interactions that involve homologs of both target sequences.
Fig. 1 Workflow of the structure modeling for protein complexes using the PPI3D web server
Fig. 2 PPI3D two-sequences query input window
2.3 Sequence Search in PPI3D Web Server
After entering the target sequences, the user can choose one of two sequence search methods: (1) BLAST for finding close homologs of the protein or (2) PSI-BLAST for remote homology search [23]. Both searches are executed using the BLAST+ software [24], and custom search settings can be entered before submitting the query. In the case of BLAST, the server directly searches the PPI3D database of protein sequences associated with structural data on protein interactions. PSI-BLAST first performs iterative searches against the clustered NCBI non-redundant sequence database to build a profile for the submitted sequence. This profile is then used to search the PPI3D sequence database.
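For readers who wish to reproduce a comparable two-stage search locally, the sketch below assembles BLAST+ command lines of the kind described above. It only builds argument lists and does not run anything. The database names, the E-value cutoff, and the number of iterations are placeholders chosen for illustration; they are not the settings used by the PPI3D server.

```python
def blast_command(query_fasta, interaction_db, evalue=0.001):
    """Direct blastp search against a sequence database linked to interaction data."""
    return ["blastp", "-query", query_fasta, "-db", interaction_db,
            "-evalue", str(evalue), "-outfmt", "6"]


def psiblast_commands(query_fasta, profile_db, interaction_db,
                      iterations=3, pssm_file="query.pssm"):
    """Two-stage PSI-BLAST: build a profile first, then search with that profile."""
    build_profile = ["psiblast", "-query", query_fasta, "-db", profile_db,
                     "-num_iterations", str(iterations), "-out_pssm", pssm_file]
    # The saved profile replaces the query sequence in the second search.
    search = ["psiblast", "-in_pssm", pssm_file, "-db", interaction_db,
              "-outfmt", "6"]
    return [build_profile, search]


# "nr_clustered" and "interaction_seqs" are hypothetical local database names.
commands = psiblast_commands("target.fasta", "nr_clustered", "interaction_seqs")
```

Each list could be passed to `subprocess.run` on a machine where BLAST+ and the corresponding databases are installed.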
2.4 Output of the PPI3D Web Server
After the sequence search is finished, the results are displayed in a hierarchical manner (Fig. 1).
2.4.1 Summary of Results
At the top layer, the summary of results is shown. In the case of a single-sequence query, the summary table shows the number of available binding sites for four interaction types: protein–protein, protein–peptide, domain–domain, and domain–peptide. In the case of a two-sequences query, the number of binary interaction interfaces is displayed for protein–protein and domain–domain interactions. Proteins correspond to full PDB chains, and domains are defined based on the SCOPe database [25]. Peptides are proteins or domains that have 20 or fewer residues.
2.4.2 Clustering of Protein Interaction Data
The PPI3D search results are clustered by default. Three clustering levels are implemented:
• Sequence similarity >95%, similarity of interface residue contacts (areas) >50%: identical or nearly identical interfaces (binding sites) clustered;
• Sequence similarity >40%, similarity of interface residue contacts (areas) >50%: highly similar interfaces (binding sites) clustered;
• Sequence similarity >40%, similarity of interface (binding site) areas >50%: similar interfaces (binding sites) clustered.
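The three levels can be read as threshold tests on a pair of interfaces. The sketch below is only an illustration of that reading: the function and variable names are hypothetical, and treating "interface residue contacts" and "binding site areas" as two separately computed similarity values is our assumption. The actual similarity measures are those of refs. [21, 22].

```python
# (label, minimum sequence similarity in %, interface measure, minimum interface similarity in %)
LEVELS = [
    ("identical or nearly identical", 95.0, "contacts", 50.0),
    ("highly similar", 40.0, "contacts", 50.0),
    ("similar", 40.0, "sites", 50.0),
]


def matching_levels(seq_sim, contact_sim, site_sim):
    """Return the labels of all clustering levels whose thresholds a pair exceeds."""
    measures = {"contacts": contact_sim, "sites": site_sim}
    return [label for label, min_seq, measure, min_iface in LEVELS
            if seq_sim > min_seq and measures[measure] > min_iface]
```

For instance, `matching_levels(55, 30, 65)` returns only `["similar"]`: the sequences are related, the binding sites overlap, but the residue contacts differ too much for the two stricter levels.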
Precalculated clustering is saved in the database; therefore, after selecting one of the predefined clustering schemes, the changes are applied immediately.
2.4.3 List of Identified Protein Binding Sites or Interaction Interfaces
To analyze the desired type of interactions in more detail, the user can click on the corresponding number in the results summary table. The list of identified interactions is then displayed in a table. PDB annotation, interface type, interface size, and some additional data are displayed for each result, as illustrated in Table 1. This table in PPI3D can be sorted and filtered by any column, and the displayed columns can be turned on and off using the show/hide columns menu. Starting from the table with clustered results, the user can access all data in a particular cluster, align and visually compare the interfaces from different clusters, or go to the details page for each of the identified protein interactions.
2.4.4 Visual Summary and Alignment of Interactions
The PPI3D server lists clustered protein–protein interactions, thus reducing the redundancy of the PDB data. The PPI3D clustering procedure does not attempt to group interacting protein pairs if their sequence identity is below 40% or the sequences are of different length. However, even at lower levels of sequence identity, the protein interaction interfaces or binding sites are often conserved.
Table 1 Part of the PPI3D clustered results table showing the homologs of putative archaeal alcohol dehydrogenase AKJ65_00115 (CASP target T0917, CAPRI target T119)
Link to details | PDB ID | Structure title | Protein 1 source organism | Buried surface area, Å² | No. of members in cluster
1 | 3bfj | Crystal structure analysis of 1,3-propanediol oxidoreductase | Klebsiella pneumoniae | 517.34 | 20
2 | 3bfj | Crystal structure analysis of 1,3-propanediol oxidoreductase | Klebsiella pneumoniae | 1695.28 | 15
3 | 1rrm | Crystal structure of lactaldehyde reductase | Escherichia coli | 1127.9 | 3
4a | 5br4 | E. coli lactaldehyde reductase (FucO) M185C mutant | Escherichia coli | 604.2 | 1
5 | 1vhd | Crystal structure of an iron containing alcohol dehydrogenase | Thermotoga maritima | 1213.38 | 1
6 | 3zdr | Structure of the Alcohol dehydrogenase (ADH) domain of a bifunctional ADHE dehydrogenase from Geobacillus thermoglucosidasius NCIMB 11955 | Parageobacillus thermoglucosidasius | 1791.02 | 1
7 | 1vlj | Crystal structure of NADH-dependent butanol dehydrogenase A (TM0820) from Thermotoga maritima at 1.78 A resolution | Thermotoga maritima | 1530.43 | 1
8 | 1oj7 | structural genomics, unknown function crystal structure of E. coli K-12 YQHD | Escherichia coli | 1740.47 | 3
9 | 3jzd | Crystal structure of Putative alcohol dehedrogenase (YP_298327.1) from RALSTONIA EUTROPHA JMP134 at 2.10 A resolution | Cupriavidus pinatubonensis | 1091.19 | 3
10 | 3iv7 | Crystal structure of Iron-containing alcohol dehydrogenase (NP_602249.1) from Corynebacterium glutamicum ATCC 13032 KITASATO at 2.07 A resolution | Corynebacterium glutamicum | 1162.27 | 1
11 | 3rf7 | Crystal structure of an Iron-containing alcohol dehydrogenase (Sden_2133) from Shewanella denitrificans OS-217 at 2.12 A resolution | Shewanella denitrificans | 1780.79 | 1
12 | 3hl0 | Crystal structure of Maleylacetate reductase from Agrobacterium tumefaciens | Agrobacterium fabrum | 832.87 | 1
a Incorrectly annotated interaction interface in the crystal structure
In other words, distinct clusters related by low sequence similarity might still retain high structural similarity. To give the user the ability to explore potential similarities between clusters, visual comparison and analysis tools were implemented. The user can select the interfaces (binding sites) in the clustered results table and press the button “Summarize selected interactions” below the table. The sequences of the selected proteins are then shown aligned to the target sequence, with interface residues highlighted in red. This enables the user to find out whether the identified protein–protein binding sites are still conserved below 40% sequence identity. In addition to the sequence alignments, the structures of the selected proteins may be aligned according to one subunit for single-sequence queries or both subunits for two-sequences queries. This allows identification of similar protein–protein interaction modes that were not identified by clustering, and distinguishing interactions that involve alternative interfaces.
2.4.5 Interaction Details Page
The interaction details page can be accessed by clicking on the row number in the clustered results table or on the aligned interactions page. It displays the most detailed data on the interaction interface, including annotations of the proteins, interface properties, 3D structures, interface residues, and contacts between them (Fig. 3). At the bottom of the interaction details page, the sequence alignments are displayed. They relate the query sequence to the experimental structure of a homologous protein. The interface residues are highlighted in the alignments, allowing inference of the binding site in the target protein from the experimentally determined interaction interface. In addition, a displayed sequence alignment can be used as the target-template alignment for modeling of protein complexes.
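The inference of a binding site from an alignment amounts to carrying the template interface positions over to the aligned target residues. The following is a minimal sketch of that bookkeeping, assuming a pairwise alignment given as two gapped strings; the function name and the toy sequences are hypothetical and are not taken from PPI3D.

```python
def map_interface(target_aln, template_aln, template_interface):
    """Project template interface positions onto the target sequence.

    target_aln, template_aln: aligned strings of equal length, '-' marks a gap.
    template_interface: 1-based positions in the ungapped template sequence.
    Returns 1-based target positions aligned to template interface residues.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    wanted = set(template_interface)
    target_pos = template_pos = 0
    mapped = []
    for target_res, template_res in zip(target_aln, template_aln):
        if target_res != "-":
            target_pos += 1
        if template_res != "-":
            template_pos += 1
        # Only columns where both sequences have a residue can be transferred.
        if target_res != "-" and template_res != "-" and template_pos in wanted:
            mapped.append(target_pos)
    return mapped


# Toy alignment: template interface residue 4 faces a gap in the target.
binding_site = map_interface("MKT-LVAG", "MRTQL-AG", [2, 4, 6])
```

Here two of the three template interface residues are aligned, so `len(binding_site)` gives the count of interface residues present in the alignment, one of the template quality indicators shown on the page.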
2.5 Structure Modeling for Protein Complexes
Modeling of the structures is available on the interaction details page of the PPI3D web server (Fig. 3). The models are generated using MODELLER [26], and several modeling methods are available for different types of data:
• In the case of a single-sequence query, the target-template alignment is available for only one of the two protein chains. Choosing the interface structure modeling generates a model for the target protein and aligns it to the template structure. The interactor chain is simply appended to the model. The resulting structure allows investigation of the binding site in the target protein.
• In the case of a two-sequences query, the target-template alignment is available for both proteins. As a result, the model for the binary protein–protein interaction is generated using the multichain modeling feature of MODELLER.
• Additionally, if the selected interaction interface is part of a homo-oligomer, an oligomeric model having more than two protein chains can be generated for either single-sequence or two-sequences queries.
Fig. 3 Interaction details page of PPI3D web server; the structure modeling part is shown as a close up view
After modeling finishes, the resulting structural model is available for downloading. All intermediate modeling files can also be downloaded, allowing the user to improve models using the command line version of MODELLER.
3 Methods
3.1 Modeling Heterodimers
1. To model a heterodimeric structure, choose the two-sequences query in the PPI3D web server (Fig. 2). Enter the sequence of each subunit into the corresponding field. Alternatively, you may enter UniProt ACs, and PPI3D will retrieve the sequences directly from UniProt. Select the sequence search method (BLAST or PSI-BLAST). If needed, customize the advanced sequence search settings. Submit the query by clicking the “Submit” button.
2. After the search is finished, the results summary is displayed. The identified interfaces are divided into two types: protein–protein interfaces and domain–domain interfaces. The latter correspond to interactions between SCOPe domains. Choose the desired type of interactions by clicking on the numbers in the displayed summary table.
3. A list of homologous interaction interfaces is displayed, including the protein annotation from the PDB, BLAST or PSI-BLAST E-values, interface areas, and the number of inter-chain contacts across the interface (Table 1). Hetero- and homo-interfaces are indicated. Note that some heterodimers can be modeled using a homodimeric structure as a template. You can choose the templates for further analysis and structure modeling based on the data in the table or using the visual summary feature.
4. To see the details of a desired protein–protein interaction interface, click on the row number to open the Interaction details page. Here you can analyze the properties of the interface, including the inter-chain contacts and detailed lists of interface residues. At the bottom of the page, sequence alignments are displayed, highlighting the interacting residues. Along with the sequence search E-values, other template quality indicators are displayed: sequence identity, sequence similarity, and the number of gaps in the alignment. These criteria are shown for the whole sequence as well as for the interaction interface residues only. The number of interface residues that are present in the alignment is one of the best primary indicators of the template suitability for modeling, because if the binding site is not aligned to the submitted target sequence, no protein interaction model can be generated.
5. After analyzing several possible templates and choosing the best one for modeling, start the Interface structure modeling on the Interaction details page, just below the experimental structure view (Fig. 3). Modeling of a heterodimer takes up to several minutes, and the website refreshes automatically once the modeling is finished. The generated structural model and all the intermediate modeling files are available for downloading.
3.2 Modeling Homo-Oligomers
1. To model a homo-oligomer, use the single-sequence query in the web server. Enter your sequence or the corresponding UniProt AC, select a search method, and run the sequence search by clicking the “Submit” button.
2. After the search is finished, the summary of results is displayed. The numbers in the table show how many protein binding sites are found for each type of interaction. Choose protein–protein binding sites by clicking on the corresponding column of the table.
3. The list of clustered protein–protein binding sites is displayed. The table contains details regarding the PDB structure (annotations, resolution), BLAST or PSI-BLAST E-values, the size and type of the interface, and other properties facilitating template selection. Choose a template for modeling your protein oligomer from the available homo-interfaces based on the same criteria as described above for heterodimer modeling.
4. By clicking on the row number, you will access the page with interaction details. If the binary interaction is part of a homo-oligomer, you can start oligomer structure modeling by opening the “Oligomer structure modeling” section.
5. Depending on the size of the protein complex, modeling may take up to several minutes. After the modeling process is finished, you can download the modeled structure file and all intermediate modeling files.
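Template selection in PPI3D is manual, but the criteria mentioned in Subheadings 3.1 and 3.2 can be combined into a rough shortlist when many candidates have to be compared. The sketch below is one possible ordering, not a PPI3D feature: the `Template` record, its field names, the example values, and the priority given to interface coverage over identity and E-value are all our assumptions.

```python
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical summary of one candidate template interface."""
    pdb_id: str
    evalue: float               # sequence search E-value
    interface_residues: int     # interface residues in the template
    aligned_interface: int      # of those, residues aligned to the target
    interface_identity: float   # % identity over the aligned interface residues


def rank_templates(templates):
    """Drop templates with no aligned interface residues, then sort the rest."""
    usable = [t for t in templates
              if t.interface_residues > 0 and t.aligned_interface > 0]
    return sorted(
        usable,
        key=lambda t: (-(t.aligned_interface / t.interface_residues),
                       -t.interface_identity,
                       t.evalue),
    )


# Made-up candidates, used only to exercise the ordering.
candidates = [
    Template("tplB", 1e-80, 28, 14, 70.0),
    Template("tplC", 1e-20, 25, 0, 0.0),
    Template("tplA", 1e-50, 30, 30, 60.0),
]
shortlist = [t.pdb_id for t in rank_templates(candidates)]
```

A template without any aligned interface residues is removed first because no interaction model can be generated from it; the remaining order should still be checked by eye.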
3.3 Modeling Protein–Peptide Interactions
The PPI3D web server does not directly model protein–peptide interactions; however, their modeling is possible with a few manual processing steps and the command-line version of MODELLER installed on your computer [27]. If necessary, refer to the online manual of MODELLER (https://salilab.org/modeller/manual/) to accomplish the following steps.
1. Use a single-sequence query in the PPI3D web server, and after the query finishes, choose protein–peptide interactions in the results summary window. If no protein–peptide templates are available, modeling is also possible using protein–protein or domain–domain interfaces.
2. Choose a template for structure modeling from the given list of protein–peptide binding sites.
3. Go to the Interaction details page and generate a structural model. The PPI3D server generates a model only for the target protein structure; the interactor peptide or protein is simply copied from the template structure.
4. Download the modeling data and the structure file of the binary protein–peptide interaction, which will be used as the template structure.
5. Generate a sequence alignment of the target peptide with the partner peptide or protein in the template (chain B in the downloaded structure file). This cannot be done using PPI3D; external tools should be applied.
6. Unzip the modeling data file and edit the target-template alignment, the template structure, and the MODELLER script. First, introduce the peptide sequence for multimeric structure modeling into the alignment (target.pir).
7. Remove all residues that are not in the alignment from the template PDB file downloaded from PPI3D. Usually these are the residues at the protein N- or C-terminus.
8. Edit the MODELLER script (target.py) according to the modifications in the alignment file, and enter the name of the template file into it. Run the modified MODELLER script to generate a structural model for the protein–peptide interaction.
3.4 Modeling Higher-Order Heteromers
At present, modeling protein heteromers having more than two chains is not implemented in the PPI3D web server, but it is possible to do this using the PPI3D data and command-line MODELLER.
1. Use the two-sequences query, and enter all the sequences making up the complex to be modeled into both the subunit 1 and subunit 2 fields. Run the sequence search.
2. After the search is finished, the results summary is a list of possible binary interactions between query proteins. Analyze these results to identify possible templates for multimer modeling. The easiest way to model the protein complex would be to identify a structure which has homologs of all target sequences. Using separate PDB entries as templates for different interfaces requires a lot of manual processing; therefore, it will not be described in this chapter.
3. Build interaction models for all binary interaction interfaces in the target complex using the chosen template.
4. Download the modeling data from the PPI3D web server to your computer.
5. Combine the target-template sequence alignments and the template structures from all interaction interfaces into one alignment file and one structure file. Note that the order of chains should be the same in the alignment and in the structure.
6. Edit the MODELLER script (model_structure.py) correspondingly and run the structure modeling.
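Step 5 above (and step 6 of Subheading 3.3) involves writing one multi-chain alignment by hand. The sketch below shows how such a file could be assembled, under stated assumptions: the header field layout follows our reading of the PIR format in the MODELLER manual and should be checked against it, and the entry names, file name, chain IDs, and sequences are invented for illustration. In this format, `/` marks a chain break and `*` ends an entry.

```python
def pir_entry(name, kind, chains, source="", first_chain="", last_chain=""):
    """Format one PIR entry with ten colon-separated header fields."""
    if kind == "sequence":
        span = ("", "", "", "")
    else:
        span = ("FIRST", first_chain, "LAST", last_chain)
    header = ":".join([kind, source or name, *span, "", "", "", ""])
    return f">P1;{name}\n{header}\n{'/'.join(chains)}*\n"


def combine_alignments(pairs, template_file, chain_ids):
    """Merge per-chain (target, template) aligned pairs into one PIR text.

    pairs must follow the chain order of the template structure file.
    """
    if len(pairs) != len(chain_ids):
        raise ValueError("one chain ID is needed per aligned pair")
    for target, template in pairs:
        if len(target) != len(template):
            raise ValueError("aligned sequences of a chain must have equal length")
    targets = [target for target, _ in pairs]
    templates = [template for _, template in pairs]
    return (pir_entry("template", "structureX", templates, template_file,
                      chain_ids[0], chain_ids[-1])
            + pir_entry("target", "sequence", targets))


# Toy three-chain example; the last chain could equally be a short peptide.
pairs = [("MKVALA", "MKV-LA"), ("GS-T", "GSTT"), ("PEPK", "PEPQ")]
alignment = combine_alignments(pairs, "template.pdb", ["A", "B", "C"])
```

Keeping the chains in one ordered list for both entries is a simple way to guarantee that the chain order of the alignment matches that of the structure file.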
4 Case Studies
The PPI3D web server was tested extensively in recent CASP and CAPRI experiments [18], and it proved to be highly useful for detection of templates for modeling of protein complexes. This is one of the crucial aspects in comparative modeling. Yet, in the current version of PPI3D, the selection of templates is not automated; they have to be picked manually before starting the structure modeling. On the other hand, the server has multiple options for the analysis of protein–protein interaction interfaces that may help in choosing the templates for modeling protein complexes.
Oftentimes, only one binding mode is observed among the homologs identified by PPI3D. This makes the template selection straightforward. In such cases, all the identified interaction interfaces or protein binding sites fall into the same clusters. Only one result is displayed for a homodimer, and for larger complexes several results are listed that correspond to different interfaces in higher order multimers. Sometimes the existing similarity may be below the PPI3D clustering sensitivity, resulting in multiple clusters of detected interactions. However, the similarity between different clusters may still be identified using visual analysis of sequence alignments or superposition of the structures. Such consistency of the binding modes among several PDB structures, especially if they originate from different crystal forms, suggests that the interaction interface is biologically relevant [28].
Nonetheless, sometimes a more in-depth analysis of the search results is necessary. PPI3D uses the PDB biological assemblies as the underlying data source. Unfortunately, it is not uncommon in PDB to have some of the protein interactions that are the result of crystal packing annotated as biologically relevant. In such cases, nonbiological interfaces appear among the PPI3D search results.
For example, in the CASP12-CAPRI experiment, an incorrectly annotated dimer was found when searching for templates for T0917 (CAPRI T119), a member of the iron-containing alcohol dehydrogenase family (Pfam: PF00465, PDB: 5YVM) [18, 29]. With the exception of one decameric structure (PDB: 3BFJ), most of the homologous interfaces came from homodimers and had an interface area larger than 1000 Å² (Table 1). Yet one dimer found by PPI3D had a smaller monomer–monomer interface that was hard to explain (row no. 4 in Table 1, PDB: 5BR4, interface area 604 Å²). This structure was analyzed using PDBePISA [30], and in the same crystal, a larger interface was identified, corresponding to the interactions observed in other proteins and suggesting an error in the assignment of the biologically relevant interface. This simple analysis allowed us to exclude unreliable templates and to select the correct dimeric templates for modeling.
Template selection is harder in the cases where alternative binding modes are observed among the PPI3D results. Modeling of such targets requires manual selection by the user, either based on the existing biological data or using model quality assessment procedures. An example of this kind was also observed during CASP12-CAPRI. It was a heterodimer of CASP T0921 (cohesin) and T0922 (dockerin) (PDB: 5M2O; CAPRI T120) [31]. Cohesins and dockerins are known to have dual binding modes: despite the fact that the protein sequences are almost identical, in certain cases dockerin switches its orientation by 180° [32]. Two different interaction interfaces were clearly observed using the structural alignments of possible templates in the PPI3D web server (Fig. 4). Interestingly, if cohesins are aligned, the binding mode at first seems the same (Fig. 4, top). However, when dockerins are aligned, the difference is clearly visible (Fig. 4, bottom). For this protein complex, we therefore constructed models using templates having both interaction modes and then used model quality assessment to select the model [18]. When alternative protein interaction modes are encountered, additional restraints, such as the results of mutagenesis or electron microscopy data, might be extremely helpful in selecting the correct binding mode for the target protein complex.
Fig. 4 Alternative interaction modes between the cohesin and dockerin proteins (PDB entries 4UYP and 4UYQ)
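The T0917 example suggests a simple sanity check: compare the buried surface area of each candidate interface with the areas of its homologs and look more closely at the outliers. The sketch below applies this idea to the values of Table 1, leaving out the two interfaces from the decameric 3bfj structure. The 0.6 × median cutoff is an arbitrary assumption made for illustration; it is not a rule used by PPI3D or PDBePISA.

```python
from statistics import median

# Buried surface areas (in square angstroms) from Table 1,
# excluding the two interfaces of the decameric 3bfj structure.
areas = {
    "1rrm": 1127.9, "5br4": 604.2, "1vhd": 1213.38, "3zdr": 1791.02,
    "1vlj": 1530.43, "1oj7": 1740.47, "3jzd": 1091.19, "3iv7": 1162.27,
    "3rf7": 1780.79, "3hl0": 832.87,
}


def flag_small_interfaces(interface_areas, fraction=0.6):
    """Return IDs whose interface area is below fraction * median area."""
    cutoff = fraction * median(interface_areas.values())
    return sorted(pdb_id for pdb_id, area in interface_areas.items()
                  if area < cutoff)


suspicious = flag_small_interfaces(areas)
```

With these values only 5br4 falls below the cutoff. A flagged entry is merely a candidate for further inspection, for example with PDBePISA [30], and not proof of a misannotated assembly.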
5 Notes
PPI3D is not a fully automated protein structure modeling server, and the user has to choose templates for modeling manually. There are several useful features that facilitate the analysis of the sequence search results and the selection of templates for structure modeling.
The PPI3D results are clustered by default, and only the representative interaction interface or binding site is shown. Sometimes the user may want to select higher resolution templates, or even a specific PDB structure, which is not displayed directly in the table of clustered results. If there is more than one structure in the cluster, it is possible to access all structures by clicking on the number of members in a cluster (the last column of the table).
The table of clustered results contains more columns than are displayed. For example, columns that contain protein IDs (PDB or SCOPe IDs), length of the identified sequences, and the alignment coverage are hidden by default. The hidden columns can be turned on by selecting them in the show/hide columns menu just above the table. To make the table more concise, the user may hide some of the columns in the same way.
One of the important features of the PPI3D server is the possibility to download all the intermediate modeling files. This allows remodeling of the protein complex using various methods. For example, the user may want to change the target-template sequence alignments. While PPI3D uses only BLAST and PSI-BLAST, methods based on profile–profile alignments may result in homology models of higher quality [33]. Introducing custom sequence alignments to refine PPI3D models requires editing of the alignment files and remodeling of the structure using command-line MODELLER. Additionally, it is possible to use high quality monomer structures (either modeled or experimental) coupled with the PPI3D alignments for template-based docking. This may improve the quality of models in certain cases.
Homology models of protein complexes produced by PPI3D may be structurally refined using structure refinement servers. Unfortunately, most of them work only with monomeric proteins. Currently, to our knowledge, only two refinement servers accept multi-subunit structures: the fragment-guided molecular dynamics server (FG-MD) [34] and GalaxyRefineComplex [35].
It is highly recommended to evaluate the quality of the models produced by PPI3D and (if applied) further refinement procedures. Sometimes a model of a protein complex may contain accurately modeled individual subunits and still be incorrect if the interfaces between these subunits are not biologically relevant. On the other hand, it is highly unlikely that the interaction interfaces will be modeled correctly if the individual subunits are inaccurate. Therefore, in our experience, both the whole structure and the interaction interface of models for protein complexes should be assessed [18]. A number of model quality assessment methods are available for assessing the global quality of protein structure [6] and also for scoring protein–protein interfaces [36, 37]. Unfortunately, most of such methods do not provide a simple way to use global scores together with local interface evaluation. One of the tools capable of doing that is VoroMQA, a model quality assessment method developed in our lab [38].
What to do when no templates are found using PPI3D? If a BLAST search produces no results, you can try the iterative PSI-BLAST procedure to identify more distant homologs. If there are still no results, first examine the PPI3D search logs, which indicate whether homologous proteins and protein interactions can be found for each of the query sequences in the current PDB. If the structures of subunits can be modeled by homology, but there are no templates for the protein interaction, free docking methods could be tried [39]. It is important to remember that in such a case the success rate is considerably lower than in template-based modeling [8]. Notably, although in the case of free docking the inter-chain contacts are generally modeled incorrectly, the interface patches (binding sites) can occasionally be well predicted. One of the ways to increase the chances of generating an accurate docking model might be to use constraints from known protein interactions, which in turn can be based on the PPI3D results.
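The advice for the case when no templates are found can be summarized as a short decision procedure. The sketch below merely restates the text above in code; the function name, its arguments, and the wording of the returned suggestions are hypothetical.

```python
from typing import Optional


def next_step(blast_hits: int, psiblast_hits: Optional[int],
              subunits_modelable: bool) -> str:
    """Suggest the next action.

    psiblast_hits is None when PSI-BLAST has not been run yet.
    subunits_modelable tells whether the individual subunits have templates.
    """
    if blast_hits > 0 or (psiblast_hits or 0) > 0:
        return "select templates and model the complex"
    if psiblast_hits is None:
        return "rerun the search with PSI-BLAST"
    if subunits_modelable:
        return "check the search logs, then try free docking of subunit models"
    return "check the search logs; no structural homologs are available"
```

For example, `next_step(0, None, True)` suggests rerunning the search with PSI-BLAST, whereas `next_step(0, 0, True)` points to free docking, keeping in mind its lower success rate.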
Acknowledgments
This work was supported by the Research Council of Lithuania [S-MIP-17-60].
References
1. UniProt Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506–D515. https://doi.org/10.1093/nar/gky1049
2. Orchard S, Kerrien S, Abbani S et al (2012) Protein interaction data curation: the International Molecular Exchange (IMEx) consortium. Nat Methods 9:345–350. https://doi.org/10.1038/nmeth.1931
3. Schwede T (2013) Protein modeling: what happened to the “protein structure gap”? Structure 21:1531–1540. https://doi.org/10.1016/j.str.2013.08.007
4. Mosca R, Céol A, Aloy P (2013) Interactome3D: adding structural details to protein networks. Nat Methods 10:47–53. https://doi.org/10.1038/nmeth.2289
5. Kryshtafovych A, Monastyrskyy B, Fidelis K et al (2018) Evaluation of the template-based modeling in CASP12. Proteins 86(Suppl 1):321–334. https://doi.org/10.1002/prot.25425
6. Lam SD, Das S, Sillitoe I, Orengo C (2017) An overview of comparative modelling and resources dedicated to large-scale modelling of genome sequences. Acta Crystallogr D Struct Biol 73:628–640. https://doi.org/10.1107/S2059798317008920
7. Szilagyi A, Zhang Y (2014) Template-based structure modeling of protein–protein interactions. Curr Opin Struct Biol 24:10–23. https://doi.org/10.1016/j.sbi.2013.11.005
8. Lafita A, Bliven S, Kryshtafovych A et al (2018) Assessment of protein assembly prediction in CASP12. Proteins 86(Suppl 1):247–256. https://doi.org/10.1002/prot.25408
9. Kawabata T (2016) HOMCOS: an updated server to search and model complex 3D structures. J Struct Funct Genom 17:83–99. https://doi.org/10.1007/s10969-016-9208-y
10. Baek M, Park T, Heo L et al (2017) GalaxyHomomer: a web server for protein homo-oligomer structure prediction from a monomer sequence or structure. Nucleic Acids Res 45:W320–W324. https://doi.org/10.1093/nar/gkx246
11. Park H, Kim DE, Ovchinnikov S et al (2018) Automatic structure prediction of oligomeric assemblies using Robetta in CASP12. Proteins 86(Suppl 1):283–291. https://doi.org/10.1002/prot.25387
12. Waterhouse A, Bertoni M, Bienert S et al (2018) SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res 46:W296–W303. https://doi.org/10.1093/nar/gky427
13. Dapkūnas J, Timinskas A, Olechnovič K et al (2017) The PPI3D web server for searching, analyzing and modeling protein-protein interactions in the context of 3D structures. Bioinformatics 33:935–937. https://doi.org/10.1093/bioinformatics/btw756
14. Berman HM, Westbrook J, Feng Z et al (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242. https://doi.org/10.1093/nar/28.1.235
15. Hamp T, Rost B (2012) Alternative protein-protein interfaces are frequent exceptions. PLoS Comput Biol 8:e1002623. https://doi.org/10.1371/journal.pcbi.1002623
16. Moult J, Fidelis K, Kryshtafovych A et al (2018) Critical assessment of methods of protein structure prediction (CASP)-round XII. Proteins 86(Suppl 1):7–15. https://doi.org/10.1002/prot.25415
17. Lensink MF, Velankar S, Wodak SJ (2017) Modeling protein-protein and protein-peptide complexes: CAPRI 6th edition. Proteins 85:359–377. https://doi.org/10.1002/prot.25215
18. Dapkūnas J, Olechnovič K, Venclovas Č (2018) Modeling of protein complexes in CAPRI round 37 using template-based approach combined with model selection. Proteins 86(Suppl 1):292–301. https://doi.org/10.1002/prot.25378
19. Yu J, Andreani J, Ochsenbein F, Guerois R (2017) Lessons from (co-)evolution in the docking of proteins and peptides for CAPRI rounds 28–35. Proteins 85:378–390. https://doi.org/10.1002/prot.25180
20. Olechnovič K, Venclovas Č (2014) Voronota: a fast and reliable tool for computing the vertices of the Voronoi diagram of atomic balls. J Comput Chem 35:672–681. https://doi.org/10.1002/jcc.23538
21. Li W, Godzik A (2006) CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659. https://doi.org/10.1093/bioinformatics/btl158
22. Olechnovič K, Kulberkytė E, Venclovas Č (2013) CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins 81:149–162. https://doi.org/10.1002/prot.24172
23. Altschul SF, Madden TL, Schäffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402. https://doi.org/10.1093/nar/25.17.3389
24. Camacho C, Coulouris G, Avagyan V et al (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421. https://doi.org/10.1186/1471-2105-10-421
25. Fox NK, Brenner SE, Chandonia J-M (2014) SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res 42:D304–D309. https://doi.org/10.1093/nar/gkt1240
26. Šali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234:779–815. https://doi.org/10.1006/jmbi.1993.1626
27. Webb B, Sali A (2016) Comparative protein structure modeling using MODELLER. Curr Protoc Bioinformatics 54:5.6.1–5.6.37. https://doi.org/10.1002/cpbi.3
28. Xu Q, Canutescu AA, Wang G et al (2008) Statistical analysis of interface similarity in crystals of homologous proteins. J Mol Biol 381:487–507. https://doi.org/10.1016/j.jmb.2008.06.002
29. Grötzinger SW, Karan R, Strillinger E et al (2018) Identification and experimental characterization of an extremophilic brine pool alcohol dehydrogenase from single amplified genomes. ACS Chem Biol 13:161–170. https://doi.org/10.1021/acschembio.7b00792
30. Krissinel E, Henrick K (2007) Inference of macromolecular assemblies from crystalline state. J Mol Biol 372:774–797. https://doi.org/10.1016/j.jmb.2007.05.022
31. Bule P, Alves VD, Israeli-Ruimy V et al (2017) Assembly of Ruminococcus flavefaciens cellulosome revealed by structures of two cohesin-dockerin complexes. Sci Rep 7:759. https://doi.org/10.1038/s41598-017-00919-w
32. Nash MA, Smith SP, Fontes CM, Bayer EA (2016) Single versus dual-binding conformations in cellulosomal cohesin-dockerin complexes. Curr Opin Struct Biol 40:89–96. https://doi.org/10.1016/j.sbi.2016.08.002
33. Yan R, Xu D, Yang J et al (2013) A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Sci Rep 3:2619. https://doi.org/10.1038/srep02619
34. Zhang J, Liang Y, Zhang Y (2011) Atomic-level protein structure refinement using fragment-guided molecular dynamics conformation sampling. Structure 19:1784–1795. https://doi.org/10.1016/j.str.2011.09.022
35. Heo L, Lee H, Seok C (2016) GalaxyRefineComplex: refinement of protein-protein complex model structures driven by interface repacking. Sci Rep 6:32153. https://doi.org/10.1038/srep32153
36. Moal IH, Torchala M, Bates PA, Fernández-Recio J (2013) The scoring of poses in protein-protein docking: current capabilities and future directions. BMC Bioinformatics 14:286. https://doi.org/10.1186/1471-2105-14-286
37. Barradas-Bautista D, Moal IH, Fernández-Recio J (2017) A systematic analysis of scoring functions in rigid-body protein docking: the delicate balance between the predictive rate improvement and the risk of overtraining. Proteins 85:1287–1297. https://doi.org/10.1002/prot.25289
38. Olechnovič K, Venclovas Č (2017) VoroMQA: assessment of protein structure quality using interatomic contact areas. Proteins 85:1131–1145. https://doi.org/10.1002/prot.25278
39. Porter KA, Desta I, Kozakov D, Vajda S (2019) What method to use for protein-protein docking? Curr Opin Struct Biol 55:1–7. https://doi.org/10.1016/j.sbi.2018.12.010
Chapter 9
Protein–Protein and Protein–Peptide Docking with ClusPro Server
Andrey Alekseenko, Mikhail Ignatov, George Jones, Maria Sabitova, and Dima Kozakov
Abstract
The process of creating a model of the structure formed by a pair of interacting molecules is commonly referred to as docking. Protein docking is one of the most studied topics in computational and structural biology with applications to drug design and beyond. In this chapter, we describe ClusPro, a web server for protein–protein and protein–peptide docking. As an input, the server requires two Protein Data Bank (PDB) files (protein–protein mode) or a PDB file for the protein and a sequence for the ligand (protein–peptide mode). Its output consists of ten models of the resulting structure formed by the two objects upon interaction. The server typically produces results in less than 4 h. The server also provides tools (via “Advanced Options” list) for a user to fine-tune the results using any additional knowledge about the interaction process, e.g., small-angle X-ray scattering (SAXS) profile or distance restraints.
Key words Protein docking, Software, Energy-based scoring functions, Fast Fourier Transform, Clustering
1 Introduction
Proteins form a major part of the machinery of biology. They can serve as messengers, scaffolds, modifiers, and gateways, and can perform many other fundamental tasks. Their interactions with other proteins, peptides, steroids, DNA, RNA, and other entities are at the core of most biological systems we know today. Studying these complex and sometimes mysterious interactions is crucial to our understanding of life.
Previously, the only way to find out whether two proteins can interact was through high-throughput screening and other tests done in a wet lab. These are costly and time-consuming processes, usually limited to identification of binary complexes only, whereas some proteins bind in a triplet form or only interact with other proteins when properly activated. While experimental methods remain important, new computational methods have been
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_9, © Springer Science+Business Media, LLC, part of Springer Nature 2020
developed. They are quickly becoming viable alternatives or complements to the experimental methods, leading to the emergence of a new branch of biology: protein docking [1]. Within it, a growing number of computer programs have been developed, utilizing accumulated experimental knowledge and exploiting the ever-increasing performance of computers. Here, we discuss one such tool, ClusPro. It is implemented as a user-friendly web server, which requires neither significant computational resources nor any specialized software besides a relatively modern web browser.
The algorithm behind ClusPro consists of three main steps [2]: (1) rigid-body free docking, (2) clustering of the 1000 complexes with best scores, and (3) refinement using energy minimization. In the first step, the program goes through billions of possible conformations of the complex and retains the 1000 structures with the best scores. It uses a mathematical tool called Fast Fourier Transform (FFT) to make the search fast and exhaustive. The scoring is done via a specialized energy-based scoring function, which typically consists of both physics- and knowledge-based terms. The second step of the ClusPro algorithm consists of locating the ten largest groups among the structures produced by step 1 and choosing a representative for each of these groups. Local energy minimization is then performed on the selected structures, and the resulting models are presented to the user.
Although the server only requires the knowledge of the structures of interacting molecules, it can employ additional experimental information about the interaction process to increase the accuracy.
This has been implemented via “Advanced Options,” including “Structure Modification” (removal of unstructured protein regions), “Attraction and Repulsion” (application of attraction or repulsion on selected residues), “Restraints” (taking into account pairwise distance restraints), “Antibody Mode” (special scoring function designed for docking antibody-antigen pairs), “Multimer Docking” (construction of homodimers and homotrimers), “SAXS Profile” (consideration of small-angle X-ray scattering data), and “Heparin Ligand” (location of heparin-binding sites). According to several community-wide assessments, ClusPro is one of the top-performing docking methods. Namely, it was placed first in blind protein docking competition CAPRI of 2009, 2012, 2016, and 2019 [3–5, 20]. In the last 5 years, ClusPro performed over 300,000 docking jobs from approximately 15,000 registered users and multiple unregistered ones.
2 Algorithm
2.1 Overview
All molecular docking programs are designed with one sole purpose: to accurately predict a three-dimensional structure of a biological complex. ClusPro is not an exception. However, what makes it exceptional is a very efficient algorithm with a strong mathematical foundation rigorously developed for the purpose of protein docking. The main part of the algorithm relies on a program called PIPER [6]. It calculates the energy-based interaction score for all possible mutual orientations of two proteins and outputs a set of models having the best score. This is done by keeping one of the proteins in the pair fixed (called “receptor”) and by changing the position and orientation of the second protein (called “ligand”). As a result, each mutual orientation can be described by six coordinates: three rotational and three translational coordinates of the rigid body transformation applied to the ligand. To find a near-native model with acceptable accuracy, one needs to sample over billions of rigid body transformations applied to the ligand, calculating the score for each of them. Doing this by brute force is virtually impossible given that the score is not trivial to compute even for a single orientation: its complexity scales as a square of the number of atoms in the two interacting proteins. PIPER belongs to the class of docking programs that use Fast Fourier Transform (FFT) to overcome this obstacle. Below, we discuss all the above steps in detail, as well as some additional features of ClusPro, such as data-assisted docking and special docking modes.
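The way an FFT replaces a brute-force scan over translations can be illustrated with a toy example. The sketch below is not PIPER code: it assumes a single score term, small periodic grids filled with arbitrary numbers, and hypothetical function names. It only shows that the FFT-based correlation gives the same scores as shifting the grids one translation at a time.

```python
import numpy as np

def score_all_translations_fft(receptor_grid, ligand_grid):
    """Score every cyclic translation t at once.

    score[t] = sum over x of receptor_grid[x + t] * ligand_grid[x],
    with indices taken modulo the grid size (periodic toy grids).
    """
    rec_hat = np.fft.fftn(receptor_grid)
    lig_hat = np.fft.fftn(ligand_grid)
    # Correlation theorem: a correlation in real space is a product
    # (with one complex conjugate) in Fourier space.
    return np.real(np.fft.ifftn(rec_hat * np.conj(lig_hat)))

def score_all_translations_direct(receptor_grid, ligand_grid):
    """Same scores computed one translation at a time (slow reference)."""
    scores = np.zeros(receptor_grid.shape)
    for t in np.ndindex(*receptor_grid.shape):
        # Rolling by -t places receptor_grid[x + t] at position x.
        shifted = np.roll(receptor_grid, shift=tuple(-s for s in t), axis=(0, 1, 2))
        scores[t] = np.sum(shifted * ligand_grid)
    return scores
```

If lower scores are better, `np.unravel_index(np.argmin(scores), scores.shape)` would give the best translation for the current ligand rotation; repeating this for every rotation yields one candidate model per rotation.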
2.2 Details
The main task of the algorithm is to identify viable mutual orientations of two molecules so that the resulting complex is as close to the real assembly as possible. The input is the three-dimensional structures of two molecules further referred to as receptor (usually, the bigger protein) and ligand (usually, the smaller protein or a peptide). The algorithm uses a scoring function (or score, for short) written as a sum of terms representing types of interactions between the receptor and ligand (e.g., electrostatic, van der Waals).
2.2.1 FFT Docking
Directly computing the score for each possible mutual orientation of the receptor–ligand pair is prohibitively expensive. Luckily, a property known as the convolution theorem makes an exhaustive evaluation feasible. Consider, for example, the electrostatic energy of the complex. It can be written as a sum over all ligand atoms of the products of their charges and the receptor electrostatic potential at their locations. Alternatively, we can represent it as a convolution of two three-dimensional (3D) grids: one containing values of the receptor’s potential, the other containing the sum of ligand atom charges within each cell. Computing the grid convolution directly is even more computationally expensive than summing over all ligand atoms. But, thanks to the convolution theorem, by applying the Fast Fourier Transform (FFT) method, we can quickly find the values of the grid convolution for all possible relative displacements of the two grids. The same approach can be used for other terms of the scoring function used in PIPER. This way, we can afford to evaluate scores for all translations and rotations of the ligand.
During the calculation, the receptor grids are fixed and all the spatial transformations are applied to the ligand grids. Each transformation results in a certain mutual orientation of two proteins, for which the score is computed. A single transformation can be expressed as a rotation of the ligand around its geometric center followed by a translation. The rotation is computed as a product of a rotation matrix taken from a precomputed list of 70,000 rotation matrices and the matrix consisting of coordinates of the ligand’s grid points. Overall, the entire rotational space is uniformly covered with a step of approximately 5°, and the ligand is translated in increments of 1 Å. Given a fixed rotation, the score is evaluated for all possible translations of the ligand. This is where the FFT algorithm comes into play. It allows scoring all ligand translations in $O(N \log N)$ time instead of $O(N^2)$ (N is the grid size, i.e., the total number of grid cells), a tremendous speed-up, which makes the problem solvable in a reasonable amount of time. Applying FFT this way, we identify the translation vector that yields the best score for each ligand rotation. As a result, we obtain a single model for each of the 70,000 rotations of the ligand.
2.2.2 Clustering
The previous step (“FFT docking”) results in 70,000 models. Since it is desirable to have as few answers as possible, the models are sorted by their score, and the best 1000 (in the default mode of the algorithm) are clustered by interface backbone RMSD (IRMSD) with 9 Å cluster radius. For the purposes of calculating IRMSD, we only consider ligand residues lying within 10 Å of the receptor. The clusters are sorted by size as opposed to sorting by score. This is a noteworthy detail of the clustering procedure; cluster sizes reflect the behavior of the protein complex in real life, since, according to the Boltzmann distribution, the population of the low-energy (near-native) state reflects the probability of its occurrence [7, 8].
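Ranking by cluster population can be sketched with a simple greedy procedure. The chapter does not spell out the exact clustering algorithm, so the code below is an assumption for illustration only: the model with the most neighbors within the radius becomes a cluster center, its neighbors are removed, and the process repeats. The function name and the input format (a precomputed pairwise IRMSD matrix) are hypothetical.

```python
def greedy_cluster(dist, radius=9.0, max_clusters=10):
    """Greedy clustering on a symmetric matrix of pairwise IRMSD values (in angstroms).

    Returns a list of (center_index, member_indices), largest cluster first.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining and len(clusters) < max_clusters:
        # Neighbors of each remaining model; a model counts as its own neighbor.
        neighbors = {
            i: [j for j in remaining if dist[i][j] <= radius] for i in remaining
        }
        # The model with the most neighbors becomes the next cluster center
        # (ties are broken by the lowest index).
        center = max(sorted(remaining), key=lambda i: len(neighbors[i]))
        members = sorted(neighbors[center])
        clusters.append((center, members))
        remaining -= set(members)
    return clusters
```

Because each new cluster is drawn from a shrinking pool of models, the clusters come out in non-increasing order of size, which matches the idea of ranking models by cluster population rather than by score.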
2.2.3 Refinement
Out of all the identified clusters, only the ten top-ranking clusters are retained, and a representative of each cluster is passed to the energy minimization program. This last step ensures that there are no atomic clashes in the interfaces of the models since rigid body FFT docking does not perform any side-chain optimization or other flexible moves. The energy is minimized using the L-BFGS algorithm [9, 10] with a CHARMM19-like force field [11] for energy calculation. The resulting models are ranked according to the corresponding cluster population and made accessible to the user as the final models.
2.3 Modifications
The described docking procedure is the baseline for all docking jobs submitted by users. In addition, there are a number of alterations, which can be introduced into the algorithm depending on the user’s needs and on the information available for the complex of interest.
2.3.1 Energy-Based Scoring Function
It was observed that a single scoring function is not suitable for all types of protein complexes and, therefore, has to be modified in order to accommodate a particular type of interaction between subunits. The scoring function used in PIPER consists of several components: $E = w_1 E_{\mathrm{rep}} + w_2 E_{\mathrm{attr}} + w_3 E_{\mathrm{elec}} + w_4 E_{\mathrm{DARS}}$, where $E_{\mathrm{rep}}$ and $E_{\mathrm{attr}}$ describe the repulsive and attractive contributions to the van der Waals interaction energy, respectively, and $E_{\mathrm{elec}}$ is an electrostatic energy term. The term $E_{\mathrm{DARS}}$ is a pairwise structure-based potential constructed by the “decoys as the reference state” (DARS) approach [6, 12]. It primarily represents desolvation contributions, i.e., the free energy change due to the removal of water molecules from the interface. With these energy terms, four different weighting schemes were designed to favor a particular type of protein interaction, namely, (1) Balanced, (2) Electrostatic-favored, (3) Hydrophobic-favored, and (4) van der Waals + electrostatics (VdW + Elec). Some complexes do not fall under any of these categories. For them, a special “Others mode” was created. In this mode, docking is done using three different weighting schemes, and 500 best models from each run are retained and forwarded altogether (1500 models in total) to the clustering part.
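The effect of the weighting schemes can be seen in a small numerical sketch of the weighted sum above. All numbers here are made up for illustration: the energy terms of the two poses and both weight sets are placeholders, not the coefficients used by PIPER or ClusPro, and the names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EnergyTerms:
    rep: float   # repulsive van der Waals contribution
    attr: float  # attractive van der Waals contribution
    elec: float  # electrostatic term
    dars: float  # DARS (desolvation-like) term

def weighted_score(terms, weights):
    """E = w1*E_rep + w2*E_attr + w3*E_elec + w4*E_DARS."""
    w1, w2, w3, w4 = weights
    return w1 * terms.rep + w2 * terms.attr + w3 * terms.elec + w4 * terms.dars

# Two hypothetical poses with made-up energy terms.
pose_a = EnergyTerms(rep=2.0, attr=-10.0, elec=-1.0, dars=-3.0)
pose_b = EnergyTerms(rep=1.0, attr=-6.0, elec=-8.0, dars=-1.0)

# Placeholder weight sets (NOT the server's coefficients).
equal_weights = (1.0, 1.0, 1.0, 1.0)
dars_heavy_weights = (1.0, 1.0, 0.5, 3.0)
```

With the equal weights, pose B scores lower (better) than pose A; with the DARS-heavy weights the order flips. This is why the same set of poses can yield different top models under different coefficient sets.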
2.3.2 Advanced Features
Driven by the needs of the community, some additional options were introduced. They can be turned on and off and combined.
Attraction and Repulsion
If there is evidence suggesting that certain residues are involved in binding, whereas other residues remain solvent-accessible upon complex formation, this information can be used to influence the docking procedure by specifying those residues in the “Attraction and Repulsion” section in “Advanced Options.”
Antibody Mode
Binding of antibodies manifests certain features specific to this type of protein association. For example, it is known that phenylalanine, tryptophan, and tyrosine residues are abundant in the binding interface on the antibody side, but not on the epitope side. Therefore, a special asymmetric potential was developed for this type of binding. It can be activated by choosing “Use Antibody Mode” in “Advanced Options.”
Multimer Docking
This special docking mode was introduced for protein complexes consisting of two or three identical subunits. It takes into account different types of symmetries for homodimers and homotrimers.
Experimental Constraints
Constraining the output of the docking algorithm to conformations that fit additional experimental data can tremendously improve the performance of docking. There are several types of experimental data that ClusPro can take into account via its special docking modes. One of them is “Restraints” mode [13]. Here, the user can provide pairwise residue distances obtained from NMR or cross-linking experiments, and the server will filter the models produced by PIPER to (at least partially) satisfy those. Another data-assisted docking mode allows prioritizing the PIPER models by their fit to an experimental Small-Angle X-ray Scattering (SAXS) profile [14, 15]. This is done by computing theoretical SAXS curves for each of the 70,000 models produced by FFT docking and retaining only 2000 that fit the experiment best. Theoretical SAXS curves can be computed directly for each model. Alternatively, by selecting the “Use fast SAXS” option, one can use a recently developed Fast Manifold Fourier Transform (FMFT) algorithm [15], which can estimate and optimize theoretical SAXS curves for thousands of models in a matter of minutes.
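Filtering models by their fit to a SAXS profile can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: the goodness of fit is taken to be the commonly used reduced chi-square with a single fitted scale factor, the theoretical curves are assumed to be already computed at the experimental q values, and all function names are hypothetical.

```python
def saxs_chi2(i_exp, sigma, i_calc):
    """Chi-square per data point between experimental and theoretical intensities.

    A single scale factor is fitted by weighted least squares before comparison.
    Returns (chi2, scale).
    """
    num = sum(e * c / s**2 for e, c, s in zip(i_exp, i_calc, sigma))
    den = sum(c * c / s**2 for c, s in zip(i_calc, sigma))
    scale = num / den
    chi2 = sum(
        ((e - scale * c) / s) ** 2 for e, c, s in zip(i_exp, i_calc, sigma)
    ) / len(i_exp)
    return chi2, scale

def keep_best_fitting(curves, i_exp, sigma, keep):
    """curves: dict mapping model name -> theoretical curve.

    Returns the names of the `keep` models with the lowest chi-square.
    """
    ranked = sorted(curves, key=lambda name: saxs_chi2(i_exp, sigma, curves[name])[0])
    return ranked[:keep]
```

In this picture, the docking models would first be reduced to those whose theoretical curves fit the experiment best, and only the survivors would proceed to clustering.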
3 Web Server
To start using ClusPro, go to https://cluspro.org/. The page provides two access options: to use the server without an account (in which case the results of your session will automatically be made public) or to create one. Only users with an academic email address can create an account. When using your account, the results of your sessions are made available only to you and the ClusPro developers. Upon choosing an access option, you will be redirected to the main site for docking.
3.1 Protein–Protein Docking
To dock proteins, you should open the “Dock” tab and select the proteins you wish to dock. These can be selected through their PDB (Protein Data Bank) ID’s or by uploading your own files in the PDB format. If you select PDB ID’s, the server automatically downloads the corresponding structures from the RCSB database and uses them for the docking procedure. The server also gives an option of selecting chains to use for docking. If no chains are selected, then all of the chains in the PDB file will be submitted for the docking process. In addition, ClusPro removes all non-protein and non-nucleic-acid structures from the submitted PDB files. The process consists of the following steps:
1. Enter a name for the job submission in the “Job Name” field. If this field is left empty, an arbitrary number will be assigned as a name.
2. Fill in the “Receptor” and “Ligand” fields (see Note 1). There are two options for specifying their structures. First, the PDB ID can be entered into the “PDB ID” field. Second, a PDB file can be uploaded by selecting the “Upload PDB” option (see Note 2).
3. If only some chains of the proteins should be used for docking, their ID’s can be entered in the “Chains” field. These should be entered as single letters separated by spaces. For example, if “A B” is entered, then only chains A and B from the input structure are used for docking, and all the other chains are ignored. If no chains are specified, then all the chains in the submitted structure will be used (see Notes 3 and 4).
4. Click the “Dock” button at the bottom of the page. Once the job is running, its status will be shown on the “Queue” tab. When the job finishes running (see Note 5), the results or errors can be found on the “Results” tab (see Note 6).
5. ClusPro produces four sets of results (models), with different coefficients in the scoring function (see Subheading 2.3.1 above). They are “Balanced,” “Electrostatic-favored,” “Hydrophobic-favored,” and “VdW + Elec” (van der Waals and Electrostatic) (see Note 7). In each case, the resulting models and their ranks are different (see Note 8). For the “Antibody” mode, only a single, specialized coefficient set is used. Select “View Model Scores” to see the corresponding ranking.
6. The resulting models can be downloaded as PDB files individually or as an archive containing all the models. The model files are named “model.XXX.YY.pdb,” where “XXX” denotes the scoring function flavor (“Balanced” and “Antibody”: 000, “Electrostatic-favored”: 002, “Hydrophobic-favored”: 004, “VdW + Elec”: 006), and “YY” denotes the structure’s rank (“00” is the highest rank).
3.2 Protein–Peptide Docking
To dock peptides to proteins using ClusPro, first select the protein. As in the protein–protein docking mode, it can be specified either by its PDB ID or by uploading a PDB file. There are two ways to specify the peptide. One way is to enter its sequence (as a string of one-letter amino-acid codes) together with a motif. A motif specifies which amino acids are considered important and which can be considered wildcards (denoted by “X”) in the complex of interest. The successful choice of motif is crucial in the case of larger peptides since it helps improve conformational sampling. The second method for peptide submission is to submit a ZIP archive
containing PDB files with putative peptide structures. However, it is only recommended for advanced users and is not covered in this tutorial. The process consists of the following steps:
1. Enter a name for the job submission in the “Job Name” field.
2. Select a protein by either specifying its PDB ID or uploading a PDB file (see Note 2).
3. If only some chains of the protein should be used for docking, their ID’s can be entered in the “Chains” field. These should be entered as single letters separated by spaces. For example, if “A B” is entered, then only chains A and B from the input structure are used for docking, and all the other chains are ignored. If no chains are specified, then all the chains in the submitted structure will be used.
4. To specify a peptide, enter the peptide sequence (as single-letter amino-acid codes) in the “Peptide” field. The server works best with peptides of length from five to eight residues.
5. Enter a motif associated with the peptide sequence in the “Motif” field. The choice of motif is dictated by the biological nature of the complex of interest as well as the availability of structural information. As a starting point, the full peptide sequence could be used as a motif. If some residues are known to play little role in binding, they can be replaced with “X” (see Note 9).
6. You may use the “Exclude” field to list PDB IDs that contain the peptide of interest and should not be used for conformational sampling.
7. Click the “Build Motif” button. After several seconds, the server will display the number of hits found in the PDB as well as a tuned version of the motif based on the findings.
8. Peptide docking works best if the number of hits found in the PDB database is between 100 and 1000. If this is the case, proceed by clicking “Dock.”
9. Otherwise, i.e., if the number of hits is not within the range from 100 to 1000, modifications to the motif must be made. There are two methods of modification: extending the motif by adding residues or introducing variability by replacing a residue code with “X.” The first method reduces the number of hits, while the second method increases the number of hits. When extending the motif, it is suggested to add amino-acid residues that are polar (Arg, His, Lys, Asp, Glu, Asn, and Gln) or aromatic (Phe, Tyr, and Trp). Adding small amino-acid residues (Ala, Gly, Ser, and Thr) should be avoided. When introducing variables (“X”) into the sequence, it is recommended to start with the smallest residues in the chain and refrain from changing any terminal positions. If, after applying necessary modifications, the “Build Motif” step returns a number in the range from 100 to 1000, click “Dock.”
10. After starting the docking step, the progress of the job can be followed on the “Queue” tab. Once finished (see Note 5), the results or errors will be displayed on the “Results” tab (see Note 6).
11. Once the job is finished without errors, the top models are displayed. They can be downloaded either individually or as a whole. There are two sets of models, obtained with different flavors of the scoring function: “Peptide Balanced” and “Peptide VdW + Elec” (see Note 7). The model scores can be viewed by selecting “View Model Scores” (see Note 8). The models are ranked by cluster size (column “Members”).
12. The model files are named model.XXX.YY.pdb, where “XXX” denotes the scoring function flavor (“Peptide Balanced”: 000, “Peptide VdW + Elec”: 001), and “YY” denotes the model’s rank (“00” is the highest rank).
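When many models are downloaded, the file naming scheme described in the steps above can be decoded programmatically. The helper below is a hypothetical convenience function, not part of ClusPro; the code tables simply restate the flavor codes listed in this section.

```python
import re

# Flavor codes as listed in the protein-protein and protein-peptide protocols.
PROTEIN_FLAVORS = {
    "000": "Balanced or Antibody",
    "002": "Electrostatic-favored",
    "004": "Hydrophobic-favored",
    "006": "VdW + Elec",
}
PEPTIDE_FLAVORS = {
    "000": "Peptide Balanced",
    "001": "Peptide VdW + Elec",
}

def parse_model_name(filename, peptide=False):
    """Return (flavor, rank) for a name like 'model.000.02.pdb'.

    The rank is 1-based, so the file suffix '00' corresponds to rank 1.
    """
    match = re.fullmatch(r"model\.(\d{3})\.(\d{2})\.pdb", filename)
    if match is None:
        raise ValueError(f"unexpected model file name: {filename!r}")
    code, index = match.groups()
    table = PEPTIDE_FLAVORS if peptide else PROTEIN_FLAVORS
    return table.get(code, "unknown"), int(index) + 1
```

For example, `parse_model_name("model.000.02.pdb")` identifies the third-ranked model of the “Balanced” (or “Antibody”) result set.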
4 Case Studies
Case studies provide a great insight into the intended use and workings of docking software. Here, we present five case studies, with emphasis on the use, either manual or automatic, of additional experimental data for model selection. The first presents a classical protein–protein docking problem, with emphasis on result analysis and interpretation. The second showcases the antibody mode of ClusPro. The third demonstrates protein–peptide docking and an approach to motif adjustment. The fourth demonstrates the use of mutational analysis data for generating docking restraints. The fifth shows the use of an experimental SAXS curve as an additional scoring criterion.
4.1 Case Study 1: Enzyme/Inhibitor
Serine proteases are well-documented proteolytic enzymes. In humans, serine proteases are associated with an assortment of roles, most prominently immune modulation, but also fertilization, development, and digestion. Serine protease inhibitors (SPIs) attenuate the effects of the serine proteases to maintain homeostasis. It turns out that some SPIs also have a sinister role. Recent studies have shown that SPIs are secreted by ticks when they bite to prevent clotting and a proper immune response [16]. This helps tick-borne diseases gain easier access to the human body. It has been theorized that blocking SPIs released by a tick into a human body will reduce the chance of infection. One step in this direction is eliminating molecules that can interfere with the natural function of serine proteases. This will require the characterization of binding sites of SPIs. The Kazal family of protease inhibitors, specifically ovomucoid turkey egg white trypsin inhibitor (OMTKY), provides an example of tight-binding inhibitors, which can be used to determine key characteristics of the binding area.
The PDB structure of the serine protease subtilisin Carlsberg can be accessed using the PDB ID 1SCN (chain E), while the structure of the protease inhibitor OMTKY can be accessed using the PDB ID 2GKR (chain I). We apply protein–protein docking according to the protocol outlined in Subheading 3 to these two structures using the default settings (Fig. 1). After the docking has finished, we choose the “Balanced” option, since it typically works best for enzyme–inhibitor complexes. The top five models of the ligand correspond to the files model.000.00.pdb through model.000.04.pdb (see Fig. 2, shown in red, blue, teal, green, and cyan, respectively). In this case study, the native binding pose is resolved and can be accessed by PDB ID 1R0R (chains E and I, shown in magenta in Fig. 2). The receptor is shown in orange. The models ranked 1 and 3 (Fig. 2, red and teal) can be discarded since their binding sites are away from the active site of the protease. By clicking “View Model Scores” on the job’s “Results” page, we can see that the first five models correspond to clusters of sizes 195, 110, 95, 61, and 45, respectively. Since models 1 and 3 were deemed incorrect, this makes the second model (Fig. 2, blue) with cluster size of 110 the most likely candidate. Besides, this model has the ligand’s loop tightly fitting into the active site of the protease, giving further credibility to it. Out of the ten models computed by the server with the “Balanced” scoring function, it is indeed the closest model to the native state, with heavy-atom ligand RMSD 2.1 Å.
4.2 Case Study 2: Antibody/Antigen
Antibody–antigen interactions are the main interactions within the immune system. They also serve as the means to conduct protein precipitation tests. ClusPro provides a specific docking mode tuned for interactions of antibodies with protein antigens. In this case study, we investigate the complex of Fab E8 and Cytochrome C. The PDB ID of Fab E8 is 1QBL (chains H and L) and the PDB ID of Cytochrome C is 1HRC (chain A). The PDB ID’s and chains are entered into the corresponding fields under “Receptor” (Fab E8) and “Ligand” (Cytochrome C). We check the box “Use Antibody Mode” in the “Advanced Options” menu. It is highly recommended to enable the “Automatically Mask non-CDR regions” checkbox to limit docking to the antibody head region. After these options are selected, the job is submitted by clicking “Dock.” Once the job is finished, its results can be downloaded and investigated. Note that only a single scoring function flavor is used in the antibody mode.
Fig. 1 ClusPro “Dock” page with starting data for Case Study 1
If we visualize the docking results, we can see that in all the models the docked ligand structure is located at the CDR region of the antibody. Looking at the model scores, we see that the top four models have similar cluster sizes (105 to 86), while the fifth one has only 58 members. The most near-native structure has rank 3 and has heavy-atom ligand RMSD 4.7 Å to the crystallographic pose. In the downloaded archive, the corresponding file is model.000.02.pdb.
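The ligand RMSD values quoted in these case studies compare the ligand position in a model with its position in the native complex. A minimal sketch of such a calculation is given below, under two assumptions that are not stated in the chapter: the model and native complexes have already been superimposed on their receptors, and the ligand heavy atoms are listed in the same order in both structures. The function name is hypothetical.

```python
import math

def ligand_rmsd(model_xyz, native_xyz):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z).

    No superposition is performed here: the coordinates are assumed to be in a
    common frame already (e.g., after aligning the receptors).
    """
    if not model_xyz or len(model_xyz) != len(native_xyz):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for atom_m, atom_n in zip(model_xyz, native_xyz)
        for a, b in zip(atom_m, atom_n)
    )
    return math.sqrt(squared / len(model_xyz))
```

Because the receptor frame is kept fixed, this measure penalizes both a wrong binding site and a wrong ligand orientation within the correct site.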
Fig. 2 Docking results for Case Study 1, downloaded from ClusPro and visualized in PyMOL. The receptor is shown in orange. The top five ligand models for the “Balanced” result set are shown in red, blue, teal, green, and cyan, respectively. The native binding pose (PDB ID 1R0R) is shown in magenta
4.3 Case Study 3: Protein/Peptide
The TRAF2 protein is a signal transducer for the proteins of the Tumor Necrosis Factor (TNF) receptor superfamily. TRAF2 has been found to interact with a variety of proteins involved in the TNF pathway, which modulates the cell’s life and death cycle. This makes TRAF2 a promising target for cancer therapy. In this case study, we choose a TRAF2 monomer (PDB ID 1CA4, chain A) as a receptor. As a ligand, we choose an oncogenic hLMP-1 peptide with sequence PQQATDD, derived from the LMP protein. Navigating to the “Peptide Docking” page of ClusPro, we fill the “Receptor” field with chosen PDB ID and chain and enter “PQQATDD” in the “Sequence” field. If we enter the full peptide sequence (PQQATDD) in the “Motif” field and click “Build Motif,” only two hits are found, while the recommended number is between 100 and 1000. Even replacing the two smallest residues with “X” (PQQXXDD) does not produce enough hits. The binding site of LMP to TRAF2 is well studied, and a motif has been identified as “PXQ” [17]. We enter “PXQ” in the “Motif” field and click “Build Motif.” After several seconds, the motif is expanded to PXQXXDD, and 224 hits are found in the PDB database. The number of hits is sufficient (i.e., within 100 and 1000), and we run the docking by clicking “Dock.” Thus, as often is the case in docking, the use of additional information (in this case, the variability of the first Gln) proves necessary for obtaining good results.
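The effect of wildcards on the number of hits can be reproduced on a toy scale with plain string matching. The sketch below is only an analogy: the server searches structures in the PDB, whereas this code matches motifs against a small list of made-up sequences. The function names and the test sequences are hypothetical.

```python
import re

def motif_to_regex(motif):
    """Translate a motif such as 'PXQXXDD' into a regular expression.

    'X' matches any single residue; every other letter must match exactly.
    """
    parts = ("[A-Z]" if c == "X" else re.escape(c) for c in motif.upper())
    return re.compile("".join(parts))

def count_hits(motif, sequences):
    """Number of sequences that contain at least one match of the motif."""
    pattern = motif_to_regex(motif)
    return sum(1 for seq in sequences if pattern.search(seq))

# Made-up sequences, used only to show how wildcards widen the search.
toy_sequences = ["MKPQQATDDSG", "AAPEQLLDDAA", "PAQ", "GGGG"]
```

On this toy list, the full sequence PQQATDD matches one entry, PXQXXDD matches two, and the short motif PXQ matches three, mirroring how replacing residues with “X” increases the number of hits while extending the motif reduces it.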
Fig. 3 Docking results for Case Study 3, downloaded from ClusPro and visualized in PyMOL. The receptor used for docking is shown in orange. Two other TRAF2 monomers from the trimer, not present during docking, are shown in beige. The top five ligand models for the “Balanced” result set are shown as red, blue, teal, green, and cyan sticks, respectively. The native binding pose (PDB ID 1CZY) is shown in magenta
Once the docking has finished, the resulting models can be downloaded and investigated. The protein–peptide docking protocol produces two sets of models: using a balanced scoring function (000) and an electrostatically favored one (001). We look at the top five “Peptide Balanced” models (files model.000.00.pdb through model.000.04.pdb). For comparison, the native docked model can be accessed using PDB ID 1CZY (chains A and D). In Fig. 3, we display the native receptor–ligand structure in magenta, the receptor in orange, and the ligands in the top five models in red, blue, teal, green, and cyan, respectively. For docking, we used a single TRAF2 monomer as a receptor (namely, chain A from PDB ID 1CA4), though it is shown to form homotrimers. We also show the two other subunits from PDB ID 1CA4 (chains B and C) in Fig. 3 in beige. We see that the ligands from models ranked 2 and 4 (blue and green) clash with chains B and C, and thus those models can be discarded. Furthermore, knowing that the TRAF2 interacts with the whole LMP protein, we can deem models ranked 1 and 3 (red and teal) unlikely. Indeed, the ligands in those models are situated in pockets suited for a free hLMP-1 peptide, but not when the peptide is located on the surface of the whole protein LMP interacting with TRAF2. This leaves model 5 (cyan), which is indeed the closest to the true native state out of all the models produced by the server, with a heavy-atom ligand RMSD of 2.1 Å.
4.4 Case Study 4: Use of Distance Restraints
In previous case studies, we used known properties of a complex structure to curate the results manually. In some cases, ClusPro can consider such data during the docking stage. The most versatile way to do so is via the distance restraints functionality. Data from NMR, crosslinking experiments, or from co-evolutionary studies can be taken into account by enforcing certain distances between specified residues. A (single) restraint is defined as a pair of amino acids (one on the receptor and one on the ligand) with an acceptable distance range. A restraint is considered satisfied if the distance between the amino acids in a given model is within this range. In ClusPro, restraints are specified using a JSON file (although the AIR format is also supported) containing a list of groups (see Note 10). Each group consists of a number of restraints. Within each group, a certain number of restraints (not necessarily all of them) must be satisfied in order for the model to be scored (see Note 11).
Consider the association of the Bmi1/Ring1b-UbcH5c complex (PDB ID 3RPG, all chains) and the nucleosome core particle (PDB ID 3LZ0, all chains). Existing data suggest that the Lys119 residue on H2A of the nucleosome needs to be close to the Cys85 residue on UbcH5c in order for ubiquitination to occur [18]. Mutational data also suggest that Lys97 on Ring1b is involved in binding to the surface of the core histones of the nucleosome [18]. Therefore, we create two groups of restraints. The first group restrains Lys119 and Cys85 to ensure that they are close enough for interaction (within 8 Å from each other). The second group has multiple distance restraints between the surface of the histone and Lys97 on Ring1b. The second group only requires one of the restraints to be satisfied, as the exact interaction between Lys97 and the histone surface is unknown.
The restraints took the following format:
{
  "groups": [
    {"required": 1, "restraints": [
      { "rec_resid": "118", "dmax": 8.0, "lig_resid": "85", "lig_chain": "A", "dmin": 0, "rec_chain": "G", "type": "residue" }
    ]},
    # Only one restraint in this group shown
    {"required": 1, "restraints": [
      { "rec_resid": "73", "dmax": 5.0, "lig_resid": "73", "lig_chain": "C", "dmin": 0, "rec_chain": "E", "type": "residue" }
    ]}
  ]
}
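The group logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ClusPro's implementation: the helper names (restraint_satisfied, model_passes) and the distances lookup are hypothetical, and we assume that every group must reach its own "required" count. Note also that the "#" annotation in the listing above is not valid JSON, so the dictionary below omits it.

```python
import json
import math

# Restraint groups mirroring the JSON layout shown above.
restraints = {
    "groups": [
        {"required": 1, "restraints": [
            {"type": "residue", "rec_chain": "G", "rec_resid": "118",
             "lig_chain": "A", "lig_resid": "85", "dmin": 0, "dmax": 8.0},
        ]},
        {"required": 1, "restraints": [
            {"type": "residue", "rec_chain": "E", "rec_resid": "73",
             "lig_chain": "C", "lig_resid": "73", "dmin": 0, "dmax": 5.0},
        ]},
    ]
}


def restraint_satisfied(restraint, distance):
    """A restraint holds when the distance lies within [dmin, dmax]."""
    return restraint["dmin"] <= distance <= restraint["dmax"]


def model_passes(groups, distances):
    """Return True if each group has at least `required` satisfied restraints.

    `distances` is a hypothetical lookup measured on one docked model:
    (rec_chain, rec_resid, lig_chain, lig_resid) -> distance in angstroms.
    Assumption: every group must reach its own `required` count.
    """
    for group in groups:
        satisfied = 0
        for r in group["restraints"]:
            key = (r["rec_chain"], r["rec_resid"], r["lig_chain"], r["lig_resid"])
            # A pair with no measured distance counts as unsatisfied.
            if restraint_satisfied(r, distances.get(key, math.inf)):
                satisfied += 1
        if satisfied < group["required"]:
            return False
    return True


# Text that could be saved as the restraints file (valid JSON, no comments).
restraints_json = json.dumps(restraints, indent=2)
```

Building the file with a JSON library rather than by hand avoids unbalanced brackets, which are easy to introduce in nested group lists.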
To run this job, enter 3LZ0 into the “Receptor” field and 3RPG into the “Ligand” field, leaving both “Chains” fields empty. Then, open “Advanced Options,” “Restraints,” and upload the prepared JSON file. Start the job by clicking “Dock.” When the results are ready, we can download all models and compare them to the known crystallographic complex (PDB ID 4R8P). When using “Restraints,” the second-ranked model for the “Balanced” coefficient set is near-native, with a ligand RMSD of 4.9 Å. However, if we run the job without “Restraints,” the top ten ligand models for all coefficient sets stick to the DNA of the receptor, violating known constraints. Thus, the use of restraints helps overcome the bias introduced by highly charged DNA molecules.
4.5 Case Study 5: SAXS
Small-angle X-ray scattering (SAXS) experiments show how the intensity of X-rays scattered by a protein complex in solution depends on the scattering angle. As a result, they help predict a rough shape of the protein complex. During the docking run, ClusPro can compute theoretical SAXS profiles of multiple receptor–ligand orientations produced by PIPER, test them against the experimental SAXS data, and keep only the best-scoring models [15]. It has been shown that this strategy greatly improves the docking results. In this case study, we explore such an example, namely, the homodimeric complex of Yersinia outer protein M (YopM) [19]. YopM is an immunosuppressive agent of Yersinia pestis (the bacterium causing bubonic plague). It forms homodimers in solution. Its crystal structure is known (PDB ID 4OW2). The asymmetric unit of the crystal consists of two dimers glued together. Chain pairs A–C and B–D are native dimers, whereas all the other interfaces exist due to crystal packing and are not present when YopM is in solution. Follow the steps below to see how SAXS data improve the docking results for this protein complex.
1. The experimental SAXS data are deposited in the SASBDB database (https://www.sasbdb.org) with accession code SASDAU8. Download the curve by navigating to https://www.sasbdb.org/data/SASDAU8/, clicking “Download files,” and choosing “curve (dat).” The downloaded file should contain a header, footer, and a data section with three columns: scattering angle, intensity, and measurement error.
2. Modify the file, because ClusPro only accepts files containing three numeric columns (you can try submitting the file as is, and ClusPro will output an error). First, remove the header (first three lines) and the footer (last four lines), leaving only lines with three numbers in each.
3. The scattering angle in the downloaded SAXS curve is measured in nm⁻¹. However, ClusPro accepts only Å⁻¹ (see Note 12). To change units to Å⁻¹, divide the numbers in the first column by 10. This way, “8.174430e-02” in the first line would become “8.174430e-03,” etc.
You can do Steps 2 and 3 using any data processing software you prefer. For example, on most Unix-like operating systems, the following one-liner can be run in a command prompt to produce a new file, SASDAU8.angstroms.txt, with the header and footer removed and the units of the scattering angle converted:
awk '{if ($1 ~ /^[[:digit:]]/) print $1/10" "$2" "$3}' SASDAU8.dat > SASDAU8.angstroms.txt
4. Now, the SAXS file is ready for submission. On the main docking page, enter “4OW2” in the PDB ID fields for both “Receptor” and “Ligand” and choose chain “A” for “Receptor” and chain “C” for “Ligand.” Open the “Advanced Options” tab and go to “Saxs Profile.” Upload the SAXS file and check “Use fast SAXS” (see Note 13). Then, click “Dock.”
For comparison, we dock two monomers (chains A and C) of YopM without using the “Saxs Profile” option. On the main docking page, enter “4OW2” in the PDB ID fields for both “Receptor” and “Ligand,” and choose chain “A” for “Receptor” and chain “C” for “Ligand.” Submit the job by clicking “Dock.”
When both jobs are completed, we compare the generated models. Without SAXS data, there are no near-native models (that is, with IRMSD below 10 Å) in the top 10. However, with the help of the SAXS profile, ClusPro produces two near-native models among the top ten, which are ranked 7 and 9 with ligand RMSD 6.20 Å and 6.14 Å, respectively.
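For readers who prefer Python to awk, Steps 2 and 3 of the SAXS preparation can be sketched as follows. This is a minimal illustration, not part of ClusPro: the function name convert_saxs_nm_to_angstrom is hypothetical, and, unlike the awk filter, it keeps only lines made of exactly three numeric columns.

```python
import re

# One numeric token: optional sign, digits with optional fraction, optional exponent.
_NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
# A data line consists of exactly three numeric columns.
_DATA_LINE = re.compile(rf"^\s*({_NUMBER})\s+({_NUMBER})\s+({_NUMBER})\s*$")


def convert_saxs_nm_to_angstrom(text):
    """Drop non-numeric lines and convert the first column from nm^-1 to A^-1."""
    converted = []
    for line in text.splitlines():
        match = _DATA_LINE.match(line)
        if match is None:
            continue  # header, footer, or blank line
        q, intensity, error = match.groups()
        # 1 nm^-1 equals 0.1 A^-1, so the first column is divided by 10.
        converted.append(f"{float(q) / 10:.6e} {intensity} {error}")
    return "\n".join(converted) + "\n"
```

In practice, one would read SASDAU8.dat, pass its contents to the function, and write the result to SASDAU8.angstroms.txt; intensity and error columns are copied through unchanged.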
5 Notes
1. When docking two proteins, it is recommended to use the larger one as a receptor, and the smaller one as a ligand. In this case, performance and accuracy are better due to the way energy-based grids are generated.
2. Before the docking, ClusPro removes all non-protein and non-nucleic-acid entities (including but not limited to HETATM records) from the submitted PDB files.
3. When building homology models of a protein, many tools create long straight tails for regions without proper templates. Such tails can clash with ligands and also increase the linear size of the protein, leading to jobs either crashing or taking more time. You can remove such regions from the PDB file manually, or use “Advanced Options -> Structure Modifications -> Remove Unstructured Terminal Residues” to do it automatically.
4. If the proteins have high structural uncertainty and weak shape and electrostatic complementarity, we suggest using “Others” mode (found in “Advanced Options”).
5. Typically, ClusPro jobs are completed in less than 4 h. However, when the server receives many submissions, it can take longer.
6. Jobs submitted by unregistered users are publicly available (including input data). If you wish to keep your job data private, please create an account.
7. In the absence of any insights into the properties of the complex, “Balanced” mode is recommended.
8. While the choice of a top-ranked model may be appealing, it is not always the best one. The choice of a particular model from the docking output should be based on the additional information about the system of interest.
9. When choosing a peptide motif, if one wishes to allow only a limited set of residues in a specific position, one can list them inside square brackets. For example, the “PXQ” motif matches peptides with any residue in the second position, while “P[FQ]Q” matches only “PFQ” and “PQQ” peptides.
10. Jobs with restraints (“Advanced Options -> Restraints”) can run slowly, sometimes taking several days to complete. Counterintuitively, a high number of restraints has no direct negative impact on ClusPro performance. Typically, the more restraints a job has, the faster it is completed.
11. When specifying distance restraints, we recommend adding some slack (around 10%) to the maximal allowed distance. Rigid docking used in ClusPro does not account for possible local structure changes, like side-chain reorientations. Therefore, even in the correct orientation, the restraint might not be satisfiable with unbound models.
12. Different programs output SAXS curves using either Å⁻¹ or nm⁻¹ for angles. When submitting SAXS data to ClusPro, please ensure that it is converted to Å⁻¹.
13. SAXS jobs that use the normal SAXS algorithm (not selecting “Use fast SAXS” option) can take several times longer than usual.
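The motif syntax described in Note 9 can be illustrated with a short Python sketch that translates a motif into a regular expression. This only demonstrates the notation; it is not how ClusPro itself processes motifs, and the function name motif_to_regex and the restriction of “X” to the 20 standard residues are our assumptions.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def motif_to_regex(motif):
    """Translate a motif such as 'PXQ' or 'P[FQ]Q' into a compiled regex.

    'X' stands for any residue; a bracketed set lists the allowed residues.
    """
    parts = []
    i = 0
    while i < len(motif):
        ch = motif[i]
        if ch == "[":
            # Copy the bracketed set of allowed residues as a character class.
            end = motif.index("]", i)
            parts.append("[" + re.escape(motif[i + 1:end]) + "]")
            i = end + 1
        elif ch == "X":
            parts.append("[" + AMINO_ACIDS + "]")
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))
```

For example, motif_to_regex("P[FQ]Q").fullmatch("PFQ") returns a match object, whereas the same pattern applied to "PAQ" returns None.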
Acknowledgments
This work was supported by the LIBH REACH award, the Russian Science Foundation grant 19-74-00090, and the PSC-CUNY award 62124-00 50.
References
1. Shoichet BK, Kuntz ID (1991) Protein docking and complementarity. J Mol Biol 221:327–346
2. Kozakov D, Hall DR, Xia B et al (2017) The ClusPro web server for protein-protein docking. Nat Protoc 12:255–278
3. Lensink MF, Wodak SJ (2010) Docking and scoring protein interactions: CAPRI 2009. Proteins 78:3073–3084
4. Lensink MF, Wodak SJ (2013) Docking, scoring, and affinity prediction in CAPRI. Proteins 81:2082–2095
5. Lensink MF, Velankar S, Wodak SJ (2017) Modeling protein-protein and protein-peptide complexes: CAPRI 6th edition. Proteins 85:359–377
6. Kozakov D, Brenke R, Comeau SR et al (2006) PIPER: an FFT-based protein docking program with pairwise potentials. Proteins 65:392–406
7. Kozakov D, Clodfelter KH, Vajda S et al (2005) Optimal clustering for detecting near-native conformations in protein docking. Biophys J 89:867–875
8. Lorenzen S, Zhang Y (2007) Identification of near-native structures by clustering protein docking conformations. Proteins 68:187–194
9. Nocedal J (1980) Updating quasi-Newton matrices with limited storage. Math Comput 35:773–773
10. Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Math Program 45:503–528
11. Neria E, Fischer S, Karplus M (1996) Simulation of activation free energies in molecular systems. J Chem Phys 105:1902
12. Chuang G-Y, Kozakov D, Brenke R et al (2008) DARS (Decoys As the Reference State) potentials for protein-protein docking. Biophys J 95:4217–4227
13. Xia B, Vajda S, Kozakov D (2016) Accounting for pairwise distance restraints in FFT-based protein-protein docking. Bioinformatics 32:3342–3344
14. Xia B, Mamonov A, Leysen S et al (2015) Accounting for observed small angle X-ray scattering profile in the protein-protein docking server ClusPro. J Comput Chem 36:1568–1572
15. Ignatov M, Kazennov A, Kozakov D (2018) ClusPro FMFT-SAXS: ultra-fast filtering using small-angle x-ray scattering data in protein docking. J Mol Biol 430:2249–2255
16. Blisnick AA, Foulon T, Bonnet SI (2017) Serine protease inhibitors in ticks: an overview of their role in tick biology and tick-borne pathogen transmission. Front Cell Infect Microbiol 7:199
17. Devergne O, Hatzivassiliou E, Izumi KM et al (1996) Association of TRAF1, TRAF2, and TRAF3 with an Epstein-Barr virus LMP1 domain important for B-lymphocyte transformation: role in NF-kappaB activation. Mol Cell Biol 16:7098–7108
18. Bentley ML, Corn JE, Dong KC et al (2011) Recognition of UbcH5c and the nucleosome by the Bmi1/Ring1b ubiquitin ligase complex. EMBO J 30:3285–3297
19. Berneking L, Schnapp M, Rumm A et al (2016) Immunosuppressive yersinia effector YopM binds DEAD box helicase DDX3 to control ribosomal S6 kinase in the nucleus of host cells. PLoS Pathog 12:e1005660
20. Padhorny D, Porter KA, Ignatov M et al (2020) ClusPro in rounds 38 to 45 of CAPRI: Toward combining template-based methods with free docking. Proteins (in press)
Chapter 10
Modeling of Protein Complexes and Molecular Assemblies with pyDock
Mireia Rosell, Luis Angel Rodríguez-Lumbreras, and Juan Fernández-Recio
Abstract
The study of the 3D structural details of protein interactions is essential to understand biomolecular functions at the molecular level. In this context, the limited availability of experimental structures of protein–protein complexes at atomic resolution is propelling the development of computational docking methods that aim to complement the current structural coverage of protein interactions. One of these docking approaches is pyDock, which uses van der Waals, electrostatics, and desolvation energy to score docking poses generated by a variety of sampling methods, typically FTDock or ZDOCK. The method has shown a consistently good prediction performance in community-wide assessment experiments like CAPRI or CASP, and has provided biological insights and insightful interpretation of experiments by modeling many biomolecular interactions of biomedical and biotechnological interest. Here, we describe in detail how to perform structural modeling of protein assemblies with pyDock, and the application of its modules to different biomolecular recognition phenomena, such as modeling of binding mode, interface, and hot-spot prediction, use of restraints based on experimental data, inclusion of low-resolution structural data, binding affinity estimation, or modeling of homo- and hetero-oligomeric assemblies.
Key words Protein–protein interactions, Protein structure prediction, Computational docking, Template-based modeling
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_10, © Springer Science+Business Media, LLC, part of Springer Nature 2020
1 Introduction
Understanding the structure and energetics of protein interactions has potential applications in diverse fields such as biomedicine, biotechnology, or agricultural sciences. However, the experimental determination of structural information on the interactome lags well behind the amount of proteomics and genomics data that are being produced at a growing pace. In this context, computational structural prediction is a valuable complement to experimental data and an essential help to interpret genomics information. Indeed, a variety of protein–protein docking tools have been reported to model the atomic details of the interaction between two given proteins [1, 2]. One such docking approach is pyDock [3], which uses an energy-based function to score docking poses generated by a variety of sampling methods. The distributed version for local running is optimized to be automatically used with the sets of rigid-body docking poses obtained by FTDock 2.0 [4] or ZDOCK 2.1 [5], but it has been applied to score either flexible or rigid docking models from other programs, such as RotBus [6], SwarmDock [7], ZDOCK 3.0 [8], SDOCK [8], or LightDock [9]. The method is also available as a web server, pyDockWEB, using pyDock version 3 and including a custom parallel FTDock version based on the Message Passing Interface (MPI) libraries, with grid size optimization for efficient use of the Fastest Fourier Transform in the West (FFTW) libraries and multiprocessor running [10].
The key aspect of pyDock is a unique combination of electrostatics, desolvation, and van der Waals terms, with weighting factors originally optimized for a small set of cases to avoid overfitting. Indeed, the method has been shown to be robust throughout the years, with regard to its application to new benchmark cases and to different types of interactions. The method has been successfully tested in CAPRI (http://www.ebi.ac.uk/msd-srv/capri/capri.html) and CASP (http://predictioncenter.org/) community-wide assessment experiments. Indeed, in the most recent CASP13 edition, the performance of our group in the multimeric targets, based on pyDock models, was ranked within the top three groups from a total of over 40 participants, showing the potential of pyDock to model multimolecular assemblies, including oligomers and multidomain proteins.
In addition to docking prediction, pyDock provides a variety of additional modules to analyze fundamental problems in biomolecular recognition.
The module pyDockNIP, which analyzes the frequency of interface residues in low-energy docking models from pyDock [11], has been reportedly applied to identify interface hot-spot residues [12], that is, residues that contribute the most to the binding affinity, which can be relevant for drug discovery targeting protein–protein interactions with small molecules. The module pyDockSAXS is the first systematically tested approach in using protein docking models to complement low-resolution structural data from Small Angle X-ray Scattering (SAXS) [13–15]. On the other hand, interface residue data from bioinformatics predictions, mutational experiments, NMR, cross-linking, etc., can be included as distance restraints with the pyDockRST module [16]. Although originally aimed at protein–protein docking, pyDock can also be applied to model protein interactions with other biomolecules, such as protein-RNA [17].
On a more practical side, the pyDock methodology has provided biological insights and helped to interpret experiments in different cases of biomedical and biotechnological interest. One remarkable example is the structural study of host–pathogen complexes, in which pyDock docking together with energetic analysis helped to interpret molecular mechanisms [18]. Another case of interest is the application of pyDock within a broad structural analysis of members of the family of Hetero Amino Acid Transporters (HATs), such as the integrative modeling of the assembly of transmembrane LAT2 and its ancillary protein 4F2hc, using a combination of modeling, docking, electron microscopy, and cross-linking experiments [19].
In this chapter, we will use a protein–protein case of a known 3D structure to illustrate the use of the different modules in pyDock for the structural modeling of protein complexes and the characterization of molecular assemblies.
2 Materials
2.1 Input
The pyDock software needs the coordinates of the two interacting proteins, usually as PDB files, but it can also take AMBER coordinate and topology files. When using PDB files, hydrogens are not needed, and, if present, they will be removed and rebuilt again by pyDock. In addition, all HETATM coordinates will be removed in the docking calculations. The pyDock method can use AMBER coordinate files (with extensions such as .inpcrd, .restrt, .rs7, .crd) and topology files (with extensions such as .prmtop, .parm7, .top) created by the PARM, LEaP, SANDER, or GIBBS programs from AMBER [20]. In this case, the cofactors and other compounds will be included in pyDock calculations.
2.2 Programs
2.2.1 pyDock
The pyDock 3.0 package is available at https://life.bsc.es/pid/pydock/ (to get pyDock, you need to apply for a license by filling in your data; for academic use, you will receive a link to the pyDock distribution file by e-mail; for commercial use, you will be contacted by the authors). Uncompress and untar the pyDock distribution file to extract the pyDock3 directory. Next, we need to change permissions of the pyDock3/data directory:
chmod go+rx data
The pyDock3 directory can be moved to any location of your choice. For instance, if it is moved to the /usr/local/software/ directory, you can call pyDock by:
/usr/local/software/pyDock3/pyDock3
Moreover, you can define the PYDOCK variable in your .bashrc file, as follows:
export PYDOCK=/usr/local/software/pyDock3/
so that the pyDock executable can be called in a more convenient way:
$PYDOCK/pyDock3
The pyDock binary has been compiled for Linux 32-bit (see more installation details in Note 1).
2.2.2 FTDock
We need some external programs to generate a set of rigid-body docking poses. In this regard, pyDock is ready to process the output of FTDock 2.0, and we will show here how to install it. The first step is to install the FFTW libraries (see Note 2). Once they are properly installed, we can download the FTDock 2.0 installation file gnu_licensed_3D_Dock.tar.gz from http://www.sbg.bio.ic.ac.uk/docking/download.html (by clicking the appropriate link). Uncompressing this file will extract its contents to a new directory called 3D_Dock (which contains, among others, the folder progs with all the needed binaries). Now, within the new progs directory, open the Makefile file and edit the following lines:
1. FFTW_DIR line: define the full path of the fftw-2.1.5 directory (see Note 2) (e.g., FFTW_DIR= //fftw2.1.5).
2. CC_FLAGS line: remove the -malign-double argument.
3. CC_FLAGS line: define -mcpu=k8 instead of the default -mcpu=pentiumpro.
Now, within the progs directory, compile the program with make (ignore WARNING messages).
2.2.3 SCWRL
In the case of incomplete side-chains in the input PDB files, we can use SCWRL (http://dunbrack.fccc.edu/) to rebuild them. To use it automatically from pyDock (see Note 3), we need to install SCWRL 3.0. This version is outdated and cannot be directly downloaded from the above website, so you need to obtain the installation file scwrl3_lin.tar.gz from its authors. Uncompressing this file will extract its contents to a new directory called scwrl3_lin. Within this directory, run:
./setup
This will create a SCWRL 3.0 binary (scwrl3) in that directory.
2.2.4 Other External Programs
We can also use ZDOCK (http://zdock.umassmed.edu/software/) for the generation of rigid-body docking poses. The pyDock pipeline is ready to process the output of ZDOCK 2.1, so it is advisable to download and install that version. This and the other programs above can be run either manually following each program’s instructions, or automatically within pyDock according to the instructions herein (see Note 3). For some functions useful for the analysis of the docking results, as explained later, we will use the ICM-Browser (www.molsoft.com).
2.3 Server
The pyDock method is also available as a web service at https://life.bsc.es/pid/pydockweb. The web front-end acts as a proxy to the user, removing any complexity arising from a local installation of the software. Via a user-friendly interface, the user can upload molecular structural information in PDB format. After the job finishes, the user will receive the results in a web page, together with all the files generated during the docking project. These files might also be useful for local execution of pyDock, in order to start from a given intermediate step.
3 Methods
pyDock has a highly modular architecture, with a series of modules performing the different functionalities of the program (Fig. 1). The general syntax for running pyDock is:
$PYDOCK/pyDock3 DOCKNAME modulename
Thus, the pyDock3 executable usually needs two arguments: (1) DOCKNAME, which is the name of the pyDock project and the base for all the files that will be created during the docking pipeline, and (2) modulename, which calls the specific module. The details of the different pyDock modules are described in the running instructions below.
3.1 Parameter File
Fig. 1 Scheme of the pyDock pipeline. The pyDock pipeline with the different pyDock modules is shown
First, you need to create a text file called DOCKNAME.ini, in which DOCKNAME will be the name of the pyDock project. This file will contain all the needed information about the interacting proteins (PDB files, chain IDs, etc.). We will illustrate this with an example, in which we will model the structure of the complex formed between the proteins TolB and Pal (PDB 2HQS) from E. coli, involved in maintaining the outer membrane stability of the bacteria [21]. For this example, we also have the structures of the unbound TolB and Pal proteins: PDB 1c5k (chain ID A) and 1oap (chain ID A), respectively. We will download the coordinate files of these proteins (1c5k.pdb and 1oap.pdb) from www.pdb.org. We will then create a text file called Dock1.ini (we will use “Dock1” as base name for the docking files) with the following information:
[receptor]
pdb = 1c5k.pdb
mol = A
newmol = A
[ligand]
pdb = 1oap.pdb
mol = A
newmol = B
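When many docking projects have to be prepared, the parameter file can also be generated programmatically. The sketch below uses Python's standard configparser module to produce text with the same layout as Dock1.ini; the helper build_ini is hypothetical, and we assume (but have not verified) that pyDock tolerates the blank line that configparser writes after each section.

```python
import configparser
import io


def build_ini(receptor_pdb, receptor_chain, ligand_pdb, ligand_chain):
    """Return the text of a pyDock-style parameter file for one receptor/ligand pair."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep key names exactly as written
    # newmol values follow the example above: receptor -> A, ligand -> B.
    config["receptor"] = {"pdb": receptor_pdb, "mol": receptor_chain, "newmol": "A"}
    config["ligand"] = {"pdb": ligand_pdb, "mol": ligand_chain, "newmol": "B"}
    buffer = io.StringIO()
    config.write(buffer)  # writes "key = value" lines under each [section]
    return buffer.getvalue()


ini_text = build_ini("1c5k.pdb", "A", "1oap.pdb", "A")
print(ini_text)
```

Saving ini_text to a file named Dock1.ini reproduces the example of this section.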
The mol field is the original chain ID (one character) of the protein chain/s that will be used for docking, whereas newmol will be the chain ID of that protein in the docking output files (see Note 4). By convention, one of the proteins (usually the largest one) will be the receptor (static position), and the other the ligand (mobile position).
The program can also take a protein structure in AMBER format. To illustrate this with the above example, suppose we generate coordinate files (1c5k.inpcrd, 1oap.inpcrd) and topology files (1c5k.prmtop, 1oap.prmtop) for our receptor and ligand molecules with the LEaP program from AMBER. Then, we should use the following initial file to define the input structures:
[receptor]
pdb = 1c5k.inpcrd,1c5k.prmtop
mol =
newmol = A
[ligand]
pdb = 1oap.inpcrd,1oap.prmtop
mol =
newmol = B
The order of the AMBER files in the pdb field is important: the coordinate file should come first, then the topology file (with any valid extension as mentioned in Subheading 2.1). We should also note that AMBER coordinate and topology files do not have chain IDs. In a case like this, we can define “mol = -” or “mol =”, and it will take all molecules with no chain ID in the corresponding file.
3.2 Set Up the Receptor and Ligand Coordinate Files for Docking
Before any docking calculation, we need to generate the coordinate files correctly parsed for pyDock, from the receptor and ligand PDB files indicated in the DOCKNAME.ini parameter file. For that, run the pyDock setup, writing the following line in your console:
$PYDOCK/pyDock3 Dock1 setup
This command will create the new PDB files for the receptor and the ligand, Dock1_rec.pdb and Dock1_lig.pdb, respectively, which are suitable as input for pyDock.
Some PDB files may have incomplete side-chains, that is, there are missing atoms in the structures. These side-chains can be rebuilt with external programs. One of them is SCWRL, which can be run automatically from pyDock if its location is indicated in the pydock.conf file (see Note 3).
3.3 Generating Rigid-Body Docking Poses
pyDock can be applied to score rigid-body docking orientations generated by a variety of methods, but it is ready to automatically process the output from ZDOCK 2.1 or FTDock 2.0 docking programs. These programs can be run independently, but we will describe here how to call them automatically from pyDock, for which they should be previously installed (see Subheading 2.2) and their location indicated in the pydock.conf file (see Note 3).
3.3.1 Running FTDock Within pyDock
Before running FTDock within pyDock, the program expects the FTDock preprocessed files for the receptor and the ligand (in our example: Dock1_rec.parsed and Dock1_lig.parsed). These files should be created by using the FTDock utility preprocesspdb.perl, but since we already have suitable files from the pyDock setup step (described in Subheading 3.2), we can just create the parsed files by copying the pyDock receptor and ligand files:
cp Dock1_rec.pdb Dock1_rec.parsed
cp Dock1_lig.pdb Dock1_lig.parsed
Then, we can use the following command to run FTDock with pyDock:
$PYDOCK/pyDock3 Dock1 ftdock
This will produce the file Dock1.ftdock, where all the docking poses will be stored. FTDock parameters can be changed for higher precision (see Note 3). In case of multicore computer architectures, it could be convenient to execute an FTDock MPI version (see Note 5).
3.3.2 Running ZDOCK Within pyDock
In our example, we can use the following command to run ZDOCK (if previously installed) with pyDock:
$PYDOCK/pyDock3 Dock1 zdock
This will produce the file Dock1.zdock, where all the resulting docking poses will be stored.
3.4 Converting the Rigid-Body Docking Poses to pyDock Format
3.4.1 Converting the FTDock Output to pyDock Format
Now, we need to transform the output data from FTDock (Dock1.ftdock in our example, in which each solution is represented by the position of the ligand in Cartesian coordinates, and its rotation based on Euler angles) to the rotation and translation matrix that transforms the original ligand coordinates into the different orientations generated by FTDock. This is done by using the following command:
$PYDOCK/pyDock3 Dock1 rotftdock
This calculation is quite fast and will create a file (Dock1.rot in our example) containing the abovementioned transformation matrices for the ligand in all docking poses.
3.4.2 Converting the ZDOCK Output to pyDock Format
We can also transform the output data from ZDOCK (Dock1.zdock in our example) to a file suitable for further pyDock scoring, containing the rotation and translation matrix for each docking solution:
$PYDOCK/pyDock3 Dock1 rotzdock
This calculation is quite fast and will create a file (Dock1.rot in our example) containing the abovementioned transformation matrices for all docking poses. Note that the name of the file (Dock1.rot in our example) is the same as that from FTDock output, so to avoid overwriting files, it is advisable to rename the original files to keep them with different names (e.g., Dock1.rot.ftdock, Dock1.rot.zdock). The docking sets obtained from different docking programs (e.g., FTDock, ZDOCK) can be scored independently (see Subheading 3.5) by keeping their DOCKNAME.rot files separated, but they can also be merged into a single docking set for further processing (see Note 6).
3.5 Scoring the Rigid-Body Docking Poses with pyDock
The next step is to use the pyDock energy function to score and rank all positions by running the dockser module with the following command:
$PYDOCK/pyDock3 Dock1 dockser > dockser.log &
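Since every pyDock step follows the same “executable, project name, module” syntax, the FTDock-based steps described so far (setup, ftdock, rotftdock, dockser) can be assembled by a small helper. The sketch below only builds and prints the command lines; the function name pydock_commands and the fallback installation path are assumptions taken from the examples in this chapter.

```python
import os
import shlex

# Module names as used in the running instructions of this chapter.
PIPELINE = ("setup", "ftdock", "rotftdock", "dockser")


def pydock_commands(dockname, modules=PIPELINE, pydock_dir=None):
    """Return one argument list per module: <pydock_dir>/pyDock3 DOCKNAME module."""
    if pydock_dir is None:
        # Fall back to the PYDOCK variable, then to the example location.
        pydock_dir = os.environ.get("PYDOCK", "/usr/local/software/pyDock3/")
    executable = pydock_dir.rstrip("/") + "/pyDock3"
    return [[executable, dockname, module] for module in modules]


for command in pydock_commands("Dock1", pydock_dir="/usr/local/software/pyDock3/"):
    print(shlex.join(command))
```

To actually run a step, each list can be passed to subprocess.run(command, check=True). Note that the copy of the .pdb files to .parsed files (Subheading 3.3.1) still has to be done between setup and ftdock.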
In the case of multicore computer architectures, it could be convenient to execute this scoring step in parallel (see Note 7). The main output of the pyDock scoring step is a table (Dock1.ene in our example) with the detail of the different energy terms for each docking pose. See below the results obtained for this example when using FTDock with the default parameters in pydock.conf (calculate_grid=1.2 and no electrostatics):
Conf(1)    Ele(2)    Desolv(3)    VDW(4)    Total(5)    RANK(6)
-------------------------------------------------------------
3740      -15.199     -8.536      -1.041    -23.839     1
3496      -15.801     -8.436       6.326    -23.604     2
3847      -14.181     -9.301      21.455    -21.336     3
7260       -8.081    -12.723      -1.873    -20.992     4
(...)
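The Total column of this table is the sum Ele + Desolv + 0.1 × VDW, and the ranking follows the total energy. The following Python sketch re-derives both from the four rows printed above. It is an illustration only: the parser assumes a whitespace-separated, six-column layout like the excerpt shown here, which may differ from a real Dock1.ene file (for instance, if an RMSD column is present).

```python
# The four rows of the example energy table, copied from the excerpt above.
ENE_TABLE = """\
Conf    Ele       Desolv    VDW      Total     RANK
3740   -15.199    -8.536   -1.041   -23.839    1
3496   -15.801    -8.436    6.326   -23.604    2
3847   -14.181    -9.301   21.455   -21.336    3
7260    -8.081   -12.723   -1.873   -20.992    4
"""


def parse_ene(text):
    """Parse six-column rows (Conf, Ele, Desolv, VDW, Total, RANK) into dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or not fields[0].isdigit():
            continue  # skip the header, separators, and "(...)" lines
        ele, desolv, vdw, total = (float(value) for value in fields[1:5])
        rows.append({"conf": int(fields[0]), "ele": ele, "desolv": desolv,
                     "vdw": vdw, "total": total, "rank": int(fields[5])})
    return rows


def total_energy(ele, desolv, vdw, vdw_weight=0.1):
    """Total binding energy: Ele + Desolv + 0.1 * VDW."""
    return ele + desolv + vdw_weight * vdw


for row in parse_ene(ENE_TABLE):
    recomputed = total_energy(row["ele"], row["desolv"], row["vdw"])
    print(row["conf"], row["total"], round(recomputed, 3))
```

The recomputed totals agree with the tabulated ones to within rounding of the printed terms, which shows why pose 3847 still ranks third despite its unfavorable van der Waals energy: that term is down-weighted by a factor of ten.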
1. Conformation number of the docking pose (same as that in the .rot file, last column).
2. Electrostatic energy term.
3. Desolvation energy term.
4. Van der Waals energy term.
5. Total binding energy (Ele + Desolv + 0.1 × VDW).
6. Rank of the docking pose according to its total binding energy.
3.6 Analysis of the Docking Results
3.6.1 Generating the Best-Scoring Docking Models
We can generate the PDB files of a selection of resulting docking poses with the makePDB pyDock module, by indicating a range of models as ranked in the docking energy file. In our example, we will build the PDB files for the docking poses ranked 1–3 in the Dock1.ene file, as follows:
$PYDOCK/pyDock3 Dock1 makePDB 1 3
This will create three files named Dock1_3740.pdb, Dock1_3496.pdb, and Dock1_3847.pdb, whose names indicate the conformation numbers (Conf column in the Dock1.ene file) of these top 3 ranked docking models.
3.6.2 Comparing the Results with a Reference Complex Structure
In some cases, we would like to compare the docking models with a reference complex structure (for test purposes when the complex structure is known; in case of available structure of a complex involving homologous proteins, etc.). While there are different quality measures, one of the most popular is ligand RMSD, that is, the RMSD between the positions of the ligand in the reference and modeled complexes, after superimposing the receptor chains of both complexes. This value can be automatically computed by the pyDock dockser module, if a suitable reference complex is indicated and properly set up in the DOCKNAME.ini file. In our example, we can use as reference the file 2hqs.pdb (PDB entry 2HQS downloaded from www.pdb.org), and define the following Dock1.ini file:
[receptor]
pdb = 1c5k.pdb
mol = A
newmol = A
[ligand]
pdb = 1oap.pdb
mol = A
newmol = B
[reference]
pdb = 2hqs.pdb
recmol = A
ligmol = H
newrecmol = A
newligmol = B
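To make the ligand RMSD measure mentioned above concrete, the following Python sketch computes it for two lists of matched ligand atom coordinates. It is a simplified illustration, not the pyDock implementation: the function name ligand_rmsd is hypothetical, the receptor chains are assumed to have been superimposed beforehand, and the choice of atoms is left to the caller.

```python
import math


def ligand_rmsd(reference_coords, model_coords):
    """RMSD (in the units of the input) between matched ligand atoms.

    Both arguments are sequences of (x, y, z) tuples in the same atom order.
    Assumption: the receptors of the two complexes are already superimposed,
    so no further fitting is applied to the ligand coordinates.
    """
    if not reference_coords or len(reference_coords) != len(model_coords):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(reference_coords, model_coords):
        squared_sum += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(squared_sum / len(reference_coords))


# Toy example: a two-atom "ligand" displaced rigidly by the vector (3, 4, 0).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0)]
print(ligand_rmsd(reference, model))
```

Because no fitting is done on the ligand itself, a rigid displacement of the whole ligand shows up directly in the value, which is what makes this measure sensitive to the docking orientation.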
The newrecmol field must indicate the same chain ID as newmol for the receptor, and the newligmol must be the same as newmol for the ligand. In this way, when running the setup module, the reference file Dock1_ref.pdb will be created. Also, when running the dockser module, the ligand RMSD values for all docking poses in the Dock1.rot file will be calculated and shown in the Dock1.ene file, in the RMSD column. If we have an energy table without RMSD values from a previous execution, and we do not want to execute dockser again because of computational cost, we can run the pyDock rmsd module to calculate and save RMSD values in a separate file Dock1.RMSD (which has to be manually added to Dock1.ene). For testing purposes, we usually consider a docking pose as acceptable (a.k.a. near-native) when ligand RMSD dockrst.log &
The resulting file Dock1.rst contains the scoring of each docking model based on the distance restraints. This file will be automatically combined with the existing Dock1.ene file to produce the Dock1.eneRST file, in which the docking models will be ranked according to the total scoring (Total column), obtained from a combination of pyDock energy (Total column in Dock1.ene) and restraint-based scoring (relRST column). In our example, the near-native docking pose ranked 61 with pyDock was ranked sixth after including distance restraints in the scoring, and there is another docking pose of an even better quality that was ranked 3275 by pyDock alone and became rank 35 after including distance restraints (Fig. 2).
3.7.2 Improving Docking with SAXS Data
The small-angle X-ray scattering (SAXS) technique can provide low-resolution structural information to help characterize biomolecules and macromolecular assemblies. If SAXS data are available for a given protein–protein complex, they can be used in combination with pyDock scoring to improve the identification of the correct docking models. This approach, called pyDockSAXS [13], has been implemented as a web server (https://life.bsc.es/pid/pydocksaxs) [14, 15]. As input, we need the PDB files with the atomic coordinates of the interacting proteins (either structures or models), and a file containing SAXS experimental data compatible with CRYSOL software version 2.8 [24].
3.7.3 Computing the Docking Energy Score for a Single Complex Structure
Using the pyDock bindEy module, we can compute the pyDock docking energy for a given complex structure (either experimentally determined or modeled). In our example, if we want to compute the pyDock energy between the receptor and the ligand in the reference complex structure (PDB 2HQS), we will define a new Dock2.ini file, indicating the corresponding chains for the receptor and the ligand in that complex structure:
    [receptor]
    pdb = 2hqs.pdb
    mol = A
    newmol = A
    [ligand]
    pdb = 2hqs.pdb
    mol = H
    newmol = B
Now, we just need to run the bindEy module as follows:
    $PYDOCK/pyDock3 Dock2 bindEy
This will create a Dock2.ene table, with the energy terms (individual and total) for only one row, corresponding to the complex structure.
3.7.4 Structural Modeling of Multidomain Proteins by Docking
One of the pyDock applications is to model multidomain proteins by applying a distance restraint between amino acids. This can be done with the pyDockTET method [25], which can be called by the docktet module. To illustrate this, we can use our example, given that the TolB protein is composed of two clear structural domains as defined by CATH:
    http://www.cathdb.info/version/latest/domain/1c5kA02
Thus, we can split PDB 1C5K into two files, representing the two domains: one called 1c5k_D1.pdb with the first domain (D1) that can be defined by residues 1–158, and the other called 1c5k_D2.pdb with the second domain (D2) defined by residues 166–389. Now we will try to rebuild the entire two-domain protein by docking the individual domains with pyDockTET, using distance restraints derived from the length of the linker between residues 158–166 (7 residues). The parameter file Dock3.ini will be defined as follows:
    [receptor]
    pdb = 1c5k_D1.pdb
    mol = A
    newmol = A
    [ligand]
    pdb = 1c5k_D2.pdb
    mol = A
    newmol = B
    [tether]
    receptor = A.Thr.158
    ligand = B.Thr.166
    length = 7
Then, run the following steps to complete the pyDockTET procedure, i.e. docking with linker-based distance restraints:
    $PYDOCK/pyDock3 Dock3 setup
    $PYDOCK/pyDock3 Dock3 ftdock
    $PYDOCK/pyDock3 Dock3 rotftdock
    $PYDOCK/pyDock3 Dock3 dockser
    $PYDOCK/pyDock3 Dock3 docktet
The output will be a Dock3.eneTET table, with the combined pyDock and linker-based restraint energies for each docking model.
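The domain split used at the start of this procedure can be sketched in Python as follows. This is a minimal illustration and not a pyDock tool: the function names are hypothetical, only ATOM records are considered, and residue numbers are read from the fixed PDB columns 23–26. The residue ranges and the linker length are those given in the text; the three toy records stand in for a real 1c5k.pdb file.

```python
def split_by_residue_range(pdb_lines, ranges):
    """Group ATOM records by residue-number range.

    ranges maps an output label to an inclusive (first, last) residue range.
    """
    selected = {label: [] for label in ranges}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        resnum = int(line[22:26])  # PDB columns 23-26: residue sequence number
        for label, (first, last) in ranges.items():
            if first <= resnum <= last:
                selected[label].append(line)
    return selected


def linker_length(last_res_domain1, first_res_domain2):
    """Number of residues strictly between the two domain boundaries."""
    return first_res_domain2 - last_res_domain1 - 1


def toy_atom(serial, resname, resnum):
    """Build a fixed-column ATOM record for a CA atom (toy data only)."""
    return (f"ATOM  {serial:5d}  CA  {resname:>3s} A{resnum:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


# Toy input: last residue of D1, one linker residue, first residue of D2.
toy_lines = [toy_atom(1, "THR", 158), toy_atom(2, "GLY", 160), toy_atom(3, "THR", 166)]
domains = split_by_residue_range(
    toy_lines, {"1c5k_D1.pdb": (1, 158), "1c5k_D2.pdb": (166, 389)}
)
print({name: len(records) for name, records in domains.items()})
print(linker_length(158, 166))
```

With a real structure, each list in `domains` would be written to the file named by its key, and linker residues (159–165 here) would be left out of both files.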
4 Case Studies
4.1 Integrative Modeling with Docking, Cross-Linking and EM Data
The pyDock method has been applied to model many complexes of biological interest, which usually requires a multidisciplinary effort and the integration of a variety of additional computational and experimental data. One interesting case was the first reported model for the assembly of the two subunits of a heteromeric amino acid transporter (HAT) [19]. In this case, an integrative modeling approach was applied to model the interaction between LAT2 and its ancillary protein 4F2hc. First, the transmembrane LAT2 protein was computationally modeled with Modeller 8v1 [26], based on a template from the same HAT family (AdiC, PDB 3OB6). Modeling was challenging due to the low sequence identity (SI) with the template (around 20%), but multiple sequence alignments suggested that the transmembrane domains were highly conserved among the members of the family. Indeed, the models generated for LAT2 indicated high conservation in the topology of the transmembrane helices, but large flexibility in the extracellular loops. The 20 models with the best DOPE score from Modeller were docked with the X-ray structure of the 4F2hc extracellular domain (PDB 2DH2) by using FTDock 2.0 and ZDOCK 2.1. The resulting 240,000 docking poses were filtered based on distance restraints from a known disulfide bond between subunits, by using pyDockTET (with restraint distance 14 Å). Then, docking poses that would clash with the expected position of the membrane were also removed, which left 3145 docking models. Finally, the docking model with the best pyDock energy (here, we did not include the van der Waals term due to the uncertainty in the transmembrane protein models) was found to be in line with the overall shape provided by transmission electron microscopy (TEM) data, and fully consistent with available cross-linking experiments. This model (Fig. 3) was further confirmed with additional cross-linking experiments.
Finally, by applying the pyDock patch module, the most important LAT2 residues for the interaction with 4F2hc were identified. Overall, this integrative model generated by a variety of experimental and computational methods provided the first structural insights on the interactions between the two subunits in HAT proteins, helping to understand how 4F2hc stabilizes the light subunit, and its implications for other transporters.
Fig. 3 Docking model of the 4F2hc and LAT2 complex. Structural model obtained by docking of homology-based models of LAT2 (gray) and 4F2hc X-ray structure (red). The model shown is the best-energy docking pose, after filtering by disulfide bond restraints, and was further confirmed by EM and cross-linking experiments. Residues showing positive cross-linking in wet lab experiments are shown as blue spheres, while residues not showing cross-linking are shown as gray spheres. The distances between these residues in the model are fully consistent with the cross-linking data. The membrane is shown only for visualization purposes
4.2 Docking a Homodimer with Rotational Symmetry
The pyDock methodology can also be applied to model protein homo-oligomers, in which two important aspects should be considered. First, when modeling homo-oligomers, the monomer usually needs to be modeled (either template-based or ab initio), since its unbound structure is rarely available. Second, for practical purposes, we should assume symmetric oligomerization (e.g. rotational symmetry C2, C3, ...) in order to filter the resulting docking models. Based on these considerations, pyDock has been successfully applied in blind conditions to the modeling of a variety of homo-oligomers in the 13th community-wide experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP), as part of the common CASP13-CAPRI Assembly prediction challenge (http://predictioncenter.org/casp13/zscores_multimer.cgi). One interesting case study in this CASP13 edition is target T1009 (CAPRI code T154), consisting of the homodimeric assembly of α-xylosidase A (from Aspergillus niger), based on its monomer sequence formed by 718 residues. We obtained the monomer coordinates from the models already available at the CASP-hosted servers. More specifically, we took the rank #1 predictions from ZHANG-SERVER, BAKER-ROSETTA, and QUARK CASP-hosted servers, which can be extracted from the following file:
    http://predictioncenter.org/download_area/CASP13/server_predictions/T1009.3D.srv.tar.gz
We ran three different docking executions with FTDock 2.0 (with each one of the three monomer models against itself), and another three with ZDOCK 2.1, to obtain a total of 36,000 docking poses. Each docking pose was checked for possible C2 symmetry, by applying the following ICM commands:
    icm> Rmsd(a_a a_b)
This command produces the variable R_out, which is the rotation matrix that should be applied to move one of the molecules to the position of the other one. Now, we can run:
    icm> Axis(R_out)
After executing this command, we obtain two useful variables: r_out, representing the rotation angle (°) between the two molecules around the rotation axis; and r_2out, representing the translation (Å) along the rotation axis (values significantly different from 0 might indicate screw axis symmetry). Thus, in a general case, a homodimeric docking pose with r_out ≈ 360°/n and r_2out ≈ 0 will be compatible with n-mer homo-oligomerization following Cn rotational symmetry. Such an n-mer homo-oligomer can be easily built from the dimeric docking model following symmetry rules. In particular, we can identify docking poses with C2 symmetry as those that satisfy 175° < |r_out| < 180° and r_2out < 5 Å. In our CASP13 participation, we only used the r_out value to select symmetric poses due to technical problems. These poses were further selected and evaluated by pyDock scoring, and the best eight models were submitted to CASP13-CAPRI (two additional models based on an available template were included in the set, for the sake of diversity). The official assessment of results showed that pyDock docking yielded two acceptable models (see Fig. 4), which were actually the only successful predictions for this target among all participants.
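The symmetry criterion above can be sketched as a small Python filter applied to the rotation angle and axial translation reported by ICM for each pose. This is only an illustration under stated assumptions: the function name and the example values are hypothetical, and the 5° angular tolerance for a general Cn is an extrapolation of the C2 window given in the text, not a published setting.

```python
def compatible_with_cn(r_out, r_2out, n, angle_tol=5.0, shift_tol=5.0):
    """Check whether a dimeric pose is compatible with Cn rotational symmetry.

    r_out  : rotation angle between the two molecules, in degrees
    r_2out : translation along the rotation axis, in angstroms
    The tolerances are assumptions generalizing the C2 window of the text.
    """
    target_angle = 360.0 / n
    angle_ok = abs(abs(r_out) - target_angle) < angle_tol
    shift_ok = abs(r_2out) < shift_tol
    return angle_ok and shift_ok


# Hypothetical (conformation number, r_out, r_2out) triplets.
poses = [(12, 178.2, 0.4), (87, -179.0, 1.0), (301, 120.5, 0.2), (455, 178.0, 12.0)]
c2_poses = [conf for conf, angle, shift in poses if compatible_with_cn(angle, shift, 2)]
print(c2_poses)
```

Pose 301 would instead pass a C3 test (`n=3`), and pose 455 is rejected because its large axial shift suggests a screw axis rather than pure rotation.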
4.3 Template-Based Docking
Another interesting application of pyDock is the structural modeling of a protein–protein complex based on available templates. If a suitable template is found, the complex can be directly modeled based on the template, or alternatively, the structures (or models) of the unbound proteins can be directly superimposed onto the available template. In case of closely homologous templates, both approaches can provide reasonable docking models. However, when no clearly homologous templates are found, or when the unbound proteins cannot be easily modeled, the challenge is to build suitable models among the possible docking orientations that can be derived from a variety of remote template structures. The use of pyDock scoring can help to identify the correct models.
Fig. 4 Models submitted by pyDock for CASP13-CAPRI target T154. The two acceptable models submitted to CASP13 are shown after superimposing their receptors (white ribbon). The ligand in model #7 is shown in red (8.0 Å ligand RMSD from complex structure) and that of model #9 in orange (12.7 Å ligand RMSD). For comparison purposes, the complex structure (PDB 6DRU) is shown after superimposing its receptor onto that of the models, with the ligand in a green ribbon
This strategy has been successfully applied in blind conditions to model target H0974 (CAPRI code T142) in the CASP13-CAPRI experiment. This target consisted of the heterodimeric assembly of two proteins that are part of the lysogeny switch of Lactococcus phage TP901-1: the repressor CI (72 residues) and the antirepressor MOR (95 residues). The coordinates of the individual proteins were taken from the models available in the CASP-hosted servers. More specifically, we took the rank #1 predictions from ZHANG-SERVER, BAKER-ROSETTA, and QUARK CASP-hosted servers, which can be extracted from the following files:
    http://predictioncenter.org/download_area/CASP13/server_predictions/T0974s1.3D.srv.tar.gz
    http://predictioncenter.org/download_area/CASP13/server_predictions/T0974s2.3D.srv.tar.gz
To build the heterodimer from these monomers, we found a total of nine possible templates from the five CASP-hosted servers used (ZHANG, ROSETTA, QUARK, MULTICOM-CONSTRUCT, and RAPTOR Deep Modeller). These templates were actually homodimers, but were found to be suitable to build the heterodimer, given the structural similarity between the two interacting proteins. Among them, only the three most structurally conserved templates were used for modeling (PDB codes 1Y7Y, 1UTX, and 2B5A). All modeled monomers (three for each protein) were structurally superimposed on the three different templates, in all possible combinations, thus building a total of 27 heterodimer models. Models with more than 300 interatomic clashes (i.e. pairs of non-hydrogen atoms from both proteins within 3 Å distance) were removed. The remaining ones were further minimized with AMBER, and then scored with pyDock. The best two template-based models according to pyDock scoring were submitted to CASP13-CAPRI (the remaining eight models were built by pyDock; the proportion of template-based and docking models was based on the low reliability of the identified templates). One of these two template-based submitted models, actually model #1, showed medium accuracy (see Fig. 5). This was a challenging target, in which only 12 out of the 29 CAPRI participants were successful (incidentally, in this target, the submitted models built by pyDock alone were not successful).
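The combinatorial model building and the clash filter described above can be sketched in Python as follows. This is a hedged illustration rather than the code used for CASP13: the model labels, the atom tuples, and the function names are hypothetical, and the brute-force pair loop is chosen for clarity, not speed. Only the template codes, the 3 Å cutoff, and the 300-clash limit come from the text.

```python
import itertools
import math


def count_clashes(atoms_a, atoms_b, cutoff=3.0):
    """Count pairs of non-hydrogen atoms (one from each protein) within cutoff.

    Each atom is a tuple (element, x, y, z), with coordinates in angstroms.
    """
    heavy_a = [atom[1:] for atom in atoms_a if atom[0] != "H"]
    heavy_b = [atom[1:] for atom in atoms_b if atom[0] != "H"]
    return sum(
        1 for p, q in itertools.product(heavy_a, heavy_b) if math.dist(p, q) <= cutoff
    )


def keep_model(n_clashes, max_clashes=300):
    """Models with more than max_clashes clashes are discarded."""
    return n_clashes <= max_clashes


# Three monomer models per protein (hypothetical labels) on three templates.
ci_models = ["CI_zhang", "CI_rosetta", "CI_quark"]
mor_models = ["MOR_zhang", "MOR_rosetta", "MOR_quark"]
templates = ["1Y7Y", "1UTX", "2B5A"]
combinations = list(itertools.product(ci_models, mor_models, templates))
print(len(combinations))

# Toy atoms: a single heavy-atom pair lies within 3 A; the hydrogen is ignored.
protein_a = [("C", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 1.0), ("N", 5.0, 0.0, 0.0)]
protein_b = [("O", 0.0, 0.0, 2.5), ("C", 10.0, 0.0, 0.0)]
print(count_clashes(protein_a, protein_b))
```

For proteins of realistic size, a spatial grid or a k-d tree would normally replace the all-pairs loop.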
Fig. 5 Successful model submitted by pyDock for CASP13-CAPRI target T142. The receptor is shown in white and the ligand in red. According to CASP evaluation, the model shows 3.12 Å LocalRMSD, 3.12 Å GlobalRMSD, and 3.20 Å InterfaceRMSD (complex structure not yet released)
5 Notes
1. In order to run pyDock on 64-bit Linux systems, we need to update the system for compatibility with 32-bit software. For this, in recent Debian-like distributions, use the following commands:
    sudo dpkg --add-architecture i386
    sudo apt-get update
    sudo apt-get install zlib1g:i386
2. The FFTW libraries installation file fftw-2.1.5.tar.gz can be downloaded from http://www.fftw.org/download.html (by clicking the appropriate link). Uncompressing this file will extract its contents to a new directory called fftw-2.1.5. Within this new directory, compile the libraries by:
    ./configure --enable-float
    make
The argument --enable-float makes the libraries use single-precision floats, which implies faster execution. For double precision (slower), remove this argument. This fftw-2.1.5 directory can be moved to any other location, as long as its location is indicated when installing FTDock (Subheading 2.2.2).
3. For automatic use of FTDock, ZDOCK, and SCWRL programs within pyDock, after installing them locally (see the Materials section), indicate the full path of the FTDock and ZDOCK directories, and that of the SCWRL binary, by modifying the corresponding lines in the $PYDOCK/pyDock3/etc/pydock.conf file, as follows:
    (...)
    ZDOCK=/<fullpath>/zdock2.1_linux_64bit/
    FTDOCK=/<fullpath>/3D_Dock/
    SCWRL=/<fullpath>/scwrl3_lin/scwrl3
    (...)
In the pydock.conf file, there are also configuration parameters for FTDock. Some default parameters might need to be changed to yield optimal results (for best pyDock results, it is advisable to use elec = 1 and calculate_grid = 0.7).
4. The PDB names in the DOCKNAME.ini file must correspond to the exact names of the PDB files you are using (1C5K.pdb, 1c5k.pdb, pdb1c5k.ent.Z, etc.). If the chain ID in a PDB file is empty, use “-” or “ ” in the mol field or leave it blank to select that chain. If a PDB file contains several copies of the same protein, select only the desired chain by indicating its ID in the mol field. If a protein to dock contains several chains (for example L and H chains for antibodies) that are relevant for docking, you may indicate their chain IDs in the mol field, separated by commas. The newmol field can be used to rename a protein chain in the docking output file, assigning it a name different from mol, but it can also be left unchanged. The newmol chain IDs must be different for the receptor and the ligand molecules (so that their chains can be distinguished in the docking models).
5. When using a computer with multicore CPUs or a cluster, it is advisable to run FTDock in parallel, especially for large proteins. For this, we first need to compile the FFTW libraries with MPI enabled. Before anything else, install the MPI compilers if needed:
    sudo apt-get install mpi-default-dev
Then, download the FFTW libraries installation file fftw-2.1.5.tar.gz from http://www.fftw.org/download.html (see Note 2). Uncompressing this file will extract its contents to a new directory called fftw-2.1.5. Within this new directory, compile the libraries by:
    ./configure --enable-type-prefix --enable-mpi --prefix=/<fullpath>/fftw-2.1.5
    make install
Now, you can download the ftdock-mpi-master.zip file with the optimized FTDock distribution for parallel running [10] from the GitHub repository (https://github.com/brianjimenez/ftdock-mpi). Unpack this zipped file, and within the new ftdock-mpi-master directory, edit the Makefile file to set the FFTW_DIR variable to the full path of the above described fftw-2.1.5 directory. Now, within the ftdock-mpi-master directory, type:
    ./make
This will create the program binaries, such as ftdock. You can find more information in the README.md file. To execute FTDock in parallel within a given pyDock project, you can download the parallel-master.zip file with useful scripts from the GitHub repository (https://github.com/pyDock/parallel).
Unpacking this zipped file within the $PYDOCK directory will create the parallel-master folder (alternatively, you can unzip the file from any location and copy the parallel-master folder to the $PYDOCK directory). In our example, assuming that we are using a 4-core/8-thread computer, we can launch FTDock in parallel as follows (keeping one free core for the OS to prevent system instability):
    $PYDOCK/parallel-master/run_parallel_ftdock.sh Dock1 6
The run_parallel_ftdock.sh script has three arguments: the name of the pyDock project, the number of CPU threads (in case of a multicore processor, it is advisable to define a number slightly smaller than the total number of cores), and an optional noelec parameter to deactivate the electrostatic evaluation during the model generation.
6. Let us suppose we have two different .rot files from FTDock and ZDOCK (e.g. Dock1.rot_ftdock and Dock1.rot_zdock). We can join both files into one and renumber the conformation numbers as follows:
    cat Dock1.rot_ftdock Dock1.rot_zdock > tmp.rot
    awk '{$13=NR;print $0}' tmp.rot | column -t > Dock1.rot
This new Dock1.rot file can be scored by pyDock, effectively including docking poses from FTDock and ZDOCK in a single set.
7. When using multicore computer architectures, the scoring module of pyDock can be run in parallel for faster execution times (especially with large proteins). For this, as described in Note 5, you will need to download the parallel-master.zip file with useful scripts from the GitHub repository (https://github.com/pyDock/parallel). Unpacking this zipped file within the $PYDOCK directory will create the parallel-master folder. In our example, assuming that we are using a 4-core/8-thread computer, we can launch the pyDock dockser module in parallel as follows:
    $PYDOCK/parallel-master/run_dockser_parallel.sh Dock1 6
8. The NIP (Normalized Interface Propensity) value obtained from the docking represents the frequency of a given residue to be located at the interface among the 100 lowest energy solutions of docking. If NIP = 0, the corresponding residue appears at the interface within the top 100 docking poses as expected by a random distribution. If NIP < 0, the corresponding residue appears at the interface within the top 100 docking poses less than expected by random. If NIP > 0.2, the corresponding residue is predicted to be at the interface as it appears significantly more often than expected by random.
9. To define a restraint residue in the restr field of DOCKNAME.ini, we need to indicate its chain ID, its three-letter amino acid code (first letter in uppercase), and its number, as found in the molecule file used in docking. Note that some residues might be named by their AMBER code in the file used in docking (e.g. Hid, Hie, Hip, Cyx), in which case that notation should be used. When more than one restraint residue is used, they must be separated by commas with no space. A distance restraint defined from a potential interface residue is considered fulfilled when the center of coordinates of its side-chain lies within a distance of 6 Å from any non-hydrogen atom of the partner molecule. For each docking solution, the percentage of satisfied restraints is converted to pseudo-energy (just by multiplying by −1.0) and added to the final scoring function in the DOCKNAME.eneRST file.
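The restraint evaluation described in Note 9 can be sketched in Python as follows. This is a minimal illustration, not pyDock's implementation: the function names and toy coordinates are hypothetical, and the negative weight is an assumption made here so that satisfied restraints lower (improve) the combined score.

```python
import math


def centroid(coords):
    """Center of coordinates of a list of (x, y, z) points."""
    n = len(coords)
    return tuple(sum(point[i] for point in coords) / n for i in range(3))


def restraint_satisfied(side_chain_coords, partner_heavy_atoms, cutoff=6.0):
    """True if the side-chain centroid lies within cutoff of any partner heavy atom."""
    center = centroid(side_chain_coords)
    return any(math.dist(center, atom) <= cutoff for atom in partner_heavy_atoms)


def restraint_pseudo_energy(satisfied_flags, weight=-1.0):
    """Convert the percentage of satisfied restraints into a pseudo-energy.

    The sign of the weight is an assumption (lower scores are better).
    """
    percent = 100.0 * sum(satisfied_flags) / len(satisfied_flags)
    return weight * percent


# Toy pose: two restraint residues on one protein, two heavy atoms on the partner.
partner_atoms = [(1.0, 0.0, 5.0), (20.0, 0.0, 0.0)]
residue_1 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]   # centroid 5 A from first partner atom
residue_2 = [(0.0, 0.0, -7.0)]                   # more than 6 A from every partner atom
flags = [restraint_satisfied(res, partner_atoms) for res in (residue_1, residue_2)]
print(flags, restraint_pseudo_energy(flags))
```

In this toy pose one of the two restraints is satisfied, so the pseudo-energy term corresponds to 50% satisfaction.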
Acknowledgments
This work was supported by the Spanish Ministry of Science (grant BIO2016-79930-R).
References
1. Huang S-Y (2014) Search strategies and evaluation in protein-protein docking: principles, advances and challenges. Drug Discov Today 19(8):1081–1096. https://doi.org/10.1016/j.drudis.2014.02.005
2. Ritchie DW (2008) Recent progress and future directions in protein-protein docking. Curr Protein Pept Sci 9(1):1–15. https://doi.org/10.2174/138920308783565741
3. Cheng TM-K, Blundell TL, Fernández-Recio J (2007) pyDock: electrostatics and desolvation for effective scoring of rigid-body protein–protein docking. Proteins 68(2):503–515. https://doi.org/10.1002/prot.21419
4. Gabb HA, Jackson RM, Sternberg MJE (1997) Modelling protein docking using shape complementarity, electrostatics and biochemical information. J Mol Biol 272(1):106–120. https://doi.org/10.1006/jmbi.1997.1203
5. Chen R, Weng Z (2003) A novel shape complementarity scoring function for protein-protein docking. Proteins 51(3):397–408. https://doi.org/10.1002/prot.10334
6. Solernou A, Fernández-Recio J (2010) Protein docking by Rotation-Based Uniform Sampling (RotBUS) with fast computing of intermolecular contact distance and residue desolvation. BMC Bioinformatics 11:352. https://doi.org/10.1186/1471-2105-11-352
7. Moal IH, Torchala M, Bates PA, Fernández-Recio J (2013) The scoring of poses in protein-protein docking: current capabilities and future directions. BMC Bioinformatics 14(1):286. https://doi.org/10.1186/1471-2105-14-286
8. Barradas-Bautista D, Moal IH, Fernández-Recio J (2017) A systematic analysis of scoring functions in rigid-body protein docking: the delicate balance between the predictive rate improvement and the risk of overtraining. Proteins 85(7):1287–1297. https://doi.org/10.1002/prot.25289
9. Jiménez-García B, Roel-Touris J, Romero-Durana M, Vidal M, Jiménez-González D, Fernández-Recio J (2017) LightDock: a new multi-scale approach to protein–protein docking. Bioinformatics 34(1):49–55. https://doi.org/10.1093/bioinformatics/btx555
10. Jiménez-García B, Pons C, Fernández-Recio J (2013) pyDockWEB: a web server for rigid-body protein–protein docking using electrostatics and desolvation scoring. Bioinformatics 29(13):1698–1699. https://doi.org/10.1093/bioinformatics/btt262
11. Fernández-Recio J, Totrov M, Abagyan R (2004) Identification of protein–protein interaction sites from docking energy landscapes. J Mol Biol 335(3):843–865. https://doi.org/10.1016/j.jmb.2003.10.069
12. Grosdidier S, Fernández-Recio J (2008) Identification of hot-spot residues in protein-protein interactions by computational docking. BMC Bioinformatics 9:447. https://doi.org/10.1186/1471-2105-9-447
13. Pons C, D'Abramo M, Svergun DI, Orozco M, Bernadó P, Fernández-Recio J (2010) Structural characterization of protein–protein complexes by integrating computational docking with small-angle scattering data. J Mol Biol 403(2):217–230. https://doi.org/10.1016/j.jmb.2010.08.029
14. Jiménez-García B, Fernández-Recio J, Pons C, Svergun DI, Bernadó P (2015) pyDockSAXS: protein–protein complex structure by SAXS and computational docking. Nucleic Acids Res 43(W1):W356–W361. https://doi.org/10.1093/nar/gkv368
15. Jiménez-García B, Bernadó P, Fernández-Recio J (2020) Structural characterization of protein-protein interactions with pyDockSAXS. Methods Mol Biol 2112:131–144
16. Chelliah V, Blundell TL, Fernández-Recio J (2006) Efficient restraints for protein–protein docking by comparison of observed amino acid substitution patterns with those predicted from local environment. J Mol Biol 357(5):1669–1682. https://doi.org/10.1016/j.jmb.2006.01.001
17. Pérez-Cano L, Romero-Durana M, Fernández-Recio J (2017) Structural and energy determinants in protein-RNA docking. Methods 118-119:163–170. https://doi.org/10.1016/j.ymeth.2016.11.001
18. Lucas M, Gaspar AH, Pallara C, Rojas AL, Fernández-Recio J, Machner MP, Hierro A (2014) Structural basis for the recruitment and activation of the Legionella phospholipase VipD by the host GTPase Rab5. Proc Natl Acad Sci 111(34):E3514. https://doi.org/10.1073/pnas.1405391111
19. Rosell A, Meury M, Álvarez-Marimon E, Costa M, Pérez-Cano L, Zorzano A, Fernández-Recio J, Palacín M, Fotiadis D (2014) Structural bases for the interaction and stabilization of the human amino acid transporter LAT2 with its ancillary protein 4F2hc. Proc Natl Acad Sci U S A 111(8):2966–2971. https://doi.org/10.1073/pnas.1323779111
20. Case DA, Cheatham TE III, Darden T, Gohlke H, Luo R, Merz KM Jr, Onufriev A, Simmerling C, Wang B, Woods RJ (2005) The Amber biomolecular simulation programs. J Comput Chem 26(16):1668–1688. https://doi.org/10.1002/jcc.20290
21. Bonsor DA, Grishkovskaya I, Dodson EJ, Kleanthous C (2007) Molecular mimicry enables competitive recruitment by a natively disordered protein. J Am Chem Soc 129(15):4800–4807. https://doi.org/10.1021/ja070153n
22. Méndez R, Leplae R, De Maria L, Wodak SJ (2003) Assessment of blind predictions of protein–protein interactions: current status of docking methods. Proteins 52(1):51–67. https://doi.org/10.1002/prot.10393
23. Ray MC, Germon P, Vianney A, Portalier R, Lazzaroni JC (2000) Identification by genetic suppression of Escherichia coli TolB residues important for TolB-Pal interaction. J Bacteriol 182(3):821–824. https://doi.org/10.1128/JB.182.3.821-824.2000
24. Svergun D, Barberato C, Koch MHJ (1995) CRYSOL – a program to evaluate X-ray solution scattering of biological macromolecules from atomic coordinates. J Appl Crystallogr 28:768–773
25. Cheng TMK, Blundell TL, Fernández-Recio J (2008) Structural assembly of two-domain proteins by rigid-body docking. BMC Bioinformatics 9(1):441. https://doi.org/10.1186/1471-2105-9-441
26. Eswar N, Eramian D, Webb B, Shen MY, Sali A (2008) Protein structure modeling with MODELLER. Methods Mol Biol 426:145–159. https://doi.org/10.1007/978-1-60327-058-8_8
Chapter 11
A Guide for Protein–Protein Docking Using SwarmDock
Iain H. Moal, Raphael A. G. Chaleil, Mieczyslaw Torchala, and Paul A. Bates
Abstract
Many of the biological functions of the cell are driven by protein–protein interactions. However, determining which proteins interact, and exactly how they do so to enable their functions, remains a major research question. Functional interactions are dependent on a number of complicated factors; therefore, modeling the three-dimensional structure of protein–protein complexes is still considered a complex endeavor. Nevertheless, the rewards for modeling protein interactions at atomic-level detail are substantial, and there are numerous examples of how models can provide useful information for drug design, protein engineering, systems biology, and understanding of the immune system. Here, we provide practical guidelines for docking proteins using the web server SwarmDock, a flexible protein–protein docking method. Moreover, we provide an overview of the factors that need to be considered when deciding whether docking is likely to be successful.
Key words SwarmDock, Protein–protein complexes, Protein–protein interactions, Protein docking, Protein structure prediction
1 Introduction
The functions of protein complexes are products of the specific geometrical arrangement of the subunits from which they are formed. The position, orientation, and conformation of each subunit are established by the formation of specific intermolecular contacts which anchor the subunit in place and lower the free energy of the bound state. Information about the structure can aid in tasks such as the identification of energetic hotspots and potential sites for the design of molecules which can mimic or prevent a natural interaction. It can also help in elucidating the mechanisms through which pathological mutations alter functions, aid library design for engineering high binding affinity, and allow the identification of overlapping binding sites. While in many cases the structure of an interaction can be resolved using nuclear magnetic resonance (NMR), X-ray crystallography, or high-resolution electron microscopy, these experiments are not guaranteed to succeed and can be expensive and time-consuming. However, if the structures of the unbound constituents of the interaction have been resolved, or a high-quality model can be generated by homology modeling, then it may be possible to generate a structure of the interaction computationally using protein–protein docking. Multiple servers and stand-alone programs for docking are currently available [1–14], and while the focus of this chapter is on using SwarmDock [15, 16], many of the principles also apply to other approaches. Most of the tasks required to perform the docking are handled automatically in the SwarmDock server [17, 18]. Below is a brief overview of the algorithm, with the following sections outlining the steps involved in docking, finishing with a case study.
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_11, © Springer Science+Business Media, LLC, part of Springer Nature 2020
2
Materials The SwarmDock Server (SDS) may be accessed at https://bmm. crick.ac.uk/~svc-bmm-swarmdock/. The purpose of protein–protein docking is to produce structural models of interacting proteins ranked such that the top-ranked models are most likely to be close to the native structure. This typically proceeds as a sequence of steps: an initial conformational search, filtering of models using an efficient scoring function and/or clustering, refinement of the resultant structures, and finally ranking of the refined structures. In practice, not all docking pipelines employ all these steps. While it is possible to mix and match different search, filtering, refinement, and scoring protocols, the success of the later steps is usually greatest when applied to structures generated in the same way in which the method was developed [19–21]. SwarmDock is a flexible docking method that optimizes the conformation and the relative position and orientations of the subunits (Fig. 1a). The set of accessible states is specified by the orientation and position of the smaller of the two proteins with respect to the larger and the conformation of both binding partners. The conformations are modeled by a linear combination of normal coordinates, which in many cases encapsulates the conformational changes observed when proteins bind to one another [22]. In this framework, each potential docked pose is typically characterized by the three Cartesian coordinates for relative position of the binding partners, four quaternion terms for their relative orientation, and five coefficients for each binding partner that specify their conformation. Any given set of values for these 17 parameters corresponds to a structure. In SwarmDock, these parameters are optimized to find the set of values that minimize the interaction energy of the two binding partners (Fig. 1b, i–iv). The interaction energy is calculated using the DComplex potential
A Guide for Protein–Protein Docking Using SwarmDock
Fig. 1 Overview of the SwarmDock algorithm. (a) The parameters optimized during the docking process. (b) A summary of the steps taken for docking with SwarmDock (i–v) and ranking with IRaPPA (vi–viii)
function [23]. The optimization is performed by a population-based memetic algorithm, in which a swarm of 350 parameter combinations is initially sampled. This is done by combining a modified particle swarm optimization global search [24] with a local search [25]. On each iteration of the algorithm, the energies and past histories of the swarm members are used to determine the subsequent positions to be sampled, combining a global search for identifying the broad low-energy regions of parameter space within
Iain H. Moal et al.
which the correctly bound structure is more likely to be found, with a local search for refining the structures into the minima of the energy landscape. The optimization is performed around 240 times, with the initial sampling of each run focusing on different overlapping regions of parameter space corresponding to evenly spaced areas surrounding the surface of the larger of the two binding partners. Once the search is complete, the resultant structures are clustered and ranked with a pairwise potential function [26], with the local energy structure of the binding region [27], or with the IRaPPA method (Fig. 1b, vi–viii) [28].
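As a rough illustration of the search space described above, the 17 parameters of a pose can be collected in a small container. This is only a sketch: the class and field names are hypothetical and are not taken from the SwarmDock code, and the normalization of the quaternion to unit length is an assumption about how an orientation would be kept valid.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List

N_MODES = 5  # normal-mode coefficients per binding partner, as stated in the text


@dataclass
class Pose:
    """One candidate docked pose (hypothetical names, not SwarmDock's own)."""
    translation: List[float]     # 3 Cartesian coordinates (relative position)
    quaternion: List[float]      # 4 quaternion terms (relative orientation)
    receptor_modes: List[float]  # 5 coefficients for the larger protein
    ligand_modes: List[float]    # 5 coefficients for the smaller protein

    def to_vector(self) -> List[float]:
        """Flatten the pose into a 17-dimensional search vector."""
        norm = sqrt(sum(q * q for q in self.quaternion))
        if norm == 0.0:
            raise ValueError("quaternion must be non-zero")
        # Assumption: orientations are stored as unit quaternions.
        unit_q = [q / norm for q in self.quaternion]
        return self.translation + unit_q + self.receptor_modes + self.ligand_modes


pose = Pose([1.0, -2.0, 0.5], [2.0, 0.0, 0.0, 0.0], [0.0] * N_MODES, [0.1] * N_MODES)
vector = pose.to_vector()
print(len(vector))  # 17
```

An optimizer such as the particle swarm search mentioned above would then operate on vectors of this length, with an energy function evaluated for each one.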
3
Methods
3.1 Factors Influencing the Success Rate of Protein–Protein Docking
The success rate for docking depends on various factors that can be taken into consideration when choosing whether or not docking is a viable option and when interpreting the results of a docking calculation (Fig. 2). 1. Structural resolution and homology models: During docking, prior modeling errors can accumulate with docking errors, so using the highest-quality available starting structures is advised, ideally a high-resolution ( 1, and “cn” for the cyclic of order n > 1 groups. To test the previous example for only the tetrahedral symmetry, please type in the terminal “AnAnaS 5x47.pdb t”.
3.4 Symmetry-Based Reconstruction of Missing Subunits
The AnAnaS software is a powerful method for finding symmetry axes in cyclic assemblies with one or more missing subunits. Note that AnAnaS supports missing subunits only for cyclic symmetries. If there are any missing subunits, the result will give, for each permutation of the chains in the input
Analysis of Protein Symmetries
Fig. 2 2gza, a C6 assembly with three missing chains
structure, the best rotation with the expected symmetry-constrained angle. However, the different transformations will not form a group. This is why no average RMSD is provided: each RMSD is potentially obtained with a different axis. The treatment of assemblies with missing subunits is not as automated as the analysis of complete assemblies. The user has to explicitly provide the symmetry group to be tested. Figure 2 presents an example of an incomplete assembly.
3.5 Visualization of the Results with PyMOL
We provide a PyMOL script to visualize the predicted axes. Use the “-y” option to output the PyMOL commands and then copy and paste them into your PyMOL console. The full output can be pasted, as the nonrelevant part will be ignored by the Python interpreter. Note that the “cgo_arrow.py” script has to be loaded in advance with “run cgo_arrow.py” in the PyMOL console. The script can be found in the examples folder of the AnAnaS distribution or at https://raw.githubusercontent.com/Pymol-Scripts/Pymol-script-repo/master/cgo_arrow.py. Please type in the terminal “AnAnaS 5x47.pdb1 t -y”. This produces the following output:
Symmetry group : t
RMSD RMSD_R RMSD_T RMSD_Z RADGYR ORDER AXIS X AXIS Y AXIS Z CENTER X CENTER Y CENTER Z
0.769 0.356 0.515 0.447 29.397 3 0.536 -0.000 0.844 -26.543 -0.053 34.814
cgo_arrow [0.753636,-0.0641983,77.8158], [-53.8396,-0.0414718,-8.18717]
0.967 0.451 0.660 0.545 29.427 2 0.098 -0.777 0.621 -26.543 -0.053 34.814
cgo_arrow [-20.6024,-47.2003,72.5106], [-32.4836,47.0946,-2.88193]
Guillaume Pagès and Sergei Grudinin
0.873 0.403 0.556 0.539 29.482 3 0.423 0.897 0.127 -26.543 -0.053 34.814
cgo_arrow [-5.99587,43.5509,40.9687], [-47.0901,-43.6566,28.66]
0.945 0.587 0.537 0.511 29.467 3 -0.600 -0.198 0.775 -26.543 -0.053 34.814
cgo_arrow [-56.1212,-9.8402,73.0655], [3.03524,9.73453,-3.43679]
0.833 0.427 0.457 0.551 29.469 2 0.153 -0.605 -0.781 -26.543 -0.053 34.814
cgo_arrow [-17.2952,-36.6277,-12.3879], [-35.7908,36.522,82.0165]
0.875 0.382 0.601 0.509 29.483 2 0.983 0.172 0.060 -26.543 -0.053 34.814
cgo_arrow [32.8768,10.3173,38.4204], [-85.9628,-10.423,31.2082]
0.928 0.534 0.542 0.531 29.492 3 -0.713 0.699 0.058 -26.543 -0.053 34.814
cgo_arrow [-62.7094,35.4285,37.7446], [9.62341,-35.5341,31.8841]
Average RMSD : 0.884844
In the output above, each axis is followed by a command to display it in PyMOL. In PyMOL, type the following commands to visualize 5x47 with all its symmetry axes:
fetch 5x47
run https://raw.githubusercontent.com/Pymol-Scripts/Pymol-script-repo/master/cgo_arrow.py
cgo_arrow [0.753636,-0.0641983,77.8158], [-53.8396,-0.0414718,-8.18717]
cgo_arrow [-20.6024,-47.2003,72.5106], [-32.4836,47.0946,-2.88193]
cgo_arrow [-5.99587,43.5509,40.9687], [-47.0901,-43.6566,28.66]
cgo_arrow [-56.1212,-9.8402,73.0655], [3.03524,9.73453,-3.43679]
cgo_arrow [-17.2952,-36.6277,-12.3879], [-35.7908,36.522,82.0165]
cgo_arrow [32.8768,10.3173,38.4204], [-85.9628,-10.423,31.2082]
cgo_arrow [-62.7094,35.4285,37.7446], [9.62341,-35.5341,31.8841]
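For automated use, the axis commands can also be pulled out of the “-y” output with a few lines of Python rather than by copy and paste. The sketch below assumes only the “cgo_arrow [x,y,z], [x,y,z]” layout shown in the output above; the function names are hypothetical and are not part of AnAnaS or PyMOL.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float, float]

# Matches "cgo_arrow [x,y,z], [x,y,z]" anywhere in the text.
ARROW_RE = re.compile(r"cgo_arrow\s*\[([^\]]+)\]\s*,\s*\[([^\]]+)\]")


def parse_point(text: str) -> Point:
    """Convert 'x,y,z' into a tuple of three floats."""
    values = tuple(float(v) for v in text.split(","))
    if len(values) != 3:
        raise ValueError(f"expected 3 coordinates, got {len(values)}")
    return values


def parse_arrows(output: str) -> List[Tuple[Point, Point]]:
    """Return the (start, end) points of every axis arrow in the output."""
    return [(parse_point(a), parse_point(b)) for a, b in ARROW_RE.findall(output)]


def to_pymol_script(arrows: List[Tuple[Point, Point]]) -> str:
    """Rebuild the cgo_arrow commands, one per line."""
    lines = []
    for start, end in arrows:
        s = ",".join(repr(v) for v in start)
        e = ",".join(repr(v) for v in end)
        lines.append(f"cgo_arrow [{s}], [{e}]")
    return "\n".join(lines)


sample = """Symmetry group : t
0.769 0.356 0.515 0.447 29.397 3 0.536 -0.000 0.844 -26.543 -0.053 34.814
cgo_arrow [0.753636,-0.0641983,77.8158], [-53.8396,-0.0414718,-8.18717]
0.967 0.451 0.660 0.545 29.427 2 0.098 -0.777 0.621 -26.543 -0.053 34.814
cgo_arrow [-20.6024,-47.2003,72.5106], [-32.4836,47.0946,-2.88193]
"""
arrows = parse_arrows(sample)
print(to_pymol_script(arrows))
```

The resulting text can be saved to a file and run in PyMOL after “cgo_arrow.py” has been loaded.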
3.6 Saving Results in the JSON Format
To facilitate the use of AnAnaS in automated pipelines, we provide an option to output the result in the JSON format. To obtain this output, use the option “--json” followed by the name of the output file. AnAnaS will write a JSON array that follows the specification described in the jsonschema.json file. Please type in the terminal “AnAnaS 5x47.pdb t --json out.json”. This example creates an output file “out.json” containing the detailed result. This file is easy for humans to read and also easy to parse. Libraries to parse JSON files are available for most programming languages.
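A minimal sketch of reading such a file with the Python standard library is given below. Because the field names are defined in jsonschema.json and are not reproduced here, the code assumes only that the top level is a JSON array and simply reports which keys each entry contains; the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, List


def load_results(path: str) -> List[Any]:
    """Load an AnAnaS JSON result file and check that it is an array."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array at the top level")
    return data


def summarize(data: List[Any]) -> List[str]:
    """Describe each entry without assuming any particular field names."""
    lines = []
    for index, entry in enumerate(data):
        if isinstance(entry, dict):
            keys = ", ".join(sorted(entry))
            lines.append(f"entry {index}: keys = {keys}")
        else:
            lines.append(f"entry {index}: {type(entry).__name__}")
    return lines


# Runs only if the file from the example above is present.
if Path("out.json").exists():
    for line in summarize(load_results("out.json")):
        print(line)
```

Once the keys have been inspected against jsonschema.json, the same loader can feed the values of interest into a larger pipeline.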
3.7 A GUI Interface
A SAMSON module is available for the AnAnaS method at samsonconnect.net. It provides a convenient interactive graphical user interface, as shown in Fig. 3. To use this module, select the structure on which you want to run the symmetry analysis. Then choose the symmetry group you want to test, or keep the “Automatic” setting, and click on “Compute Symmetry.” The list of results appears for each of the tested symmetry groups. To visualize the axes, just click on an element in the list. One can also highlight specific axes by selecting them in the list, or move the camera along
Fig. 3 A SAMSON module that provides an interactive graphical user interface for the AnAnaS method. The structure in the example is 4B2H. All the point group symmetries have been automatically detected and are listed with their RMSD measures. When a point group is selected, its axes are displayed in the viewport, together with the order and RMSD of each axis. Clicking on an axis in the list highlights it in the viewport
the chosen axis direction by double-clicking on the axis information in the list.
3.8 Symmetrize a Protein Assembly
4
A common task in structural bioinformatics is to modify an approximately symmetric structure to make it perfectly symmetric. AnAnaS provides this feature with the option “--symmetrize” followed by the name of the output file. When used with this option, AnAnaS will replicate and rotate the first subunit to create a perfectly symmetrical assembly. A symmetry group must be provided explicitly when using this option. This option may be particularly useful to generate full assemblies from ones with missing subunits. Please type in the terminal “AnAnaS 2gza.pdb c6 --symmetrize 2gza_c6.pdb” and then “AnAnaS 2gza.pdb c7 --symmetrize 2gza_c7.pdb” to obtain two full assemblies with different symmetry orders from the partial assembly 2gza.
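The geometric idea behind replicating and rotating one subunit can be sketched in a few lines. This is not the AnAnaS implementation: the sketch assumes that the axis direction and a point on the axis are already known, uses the standard Rodrigues rotation formula, and represents a subunit as a plain list of (x, y, z) coordinates. All function names are hypothetical.

```python
from math import cos, pi, sin, sqrt
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def rotate_about_axis(point: Sequence[float], axis: Sequence[float],
                      center: Sequence[float], angle: float) -> Point:
    """Rotate a point by 'angle' (radians) about an axis passing through 'center'."""
    norm = sqrt(sum(a * a for a in axis))
    kx, ky, kz = (a / norm for a in axis)
    vx, vy, vz = (p - o for p, o in zip(point, center))
    cos_a, sin_a = cos(angle), sin(angle)
    dot = kx * vx + ky * vy + kz * vz
    cross = (ky * vz - kz * vy, kz * vx - kx * vz, kx * vy - ky * vx)
    # Rodrigues formula: v cos(a) + (k x v) sin(a) + k (k . v)(1 - cos(a))
    rotated = (
        vx * cos_a + cross[0] * sin_a + kx * dot * (1.0 - cos_a),
        vy * cos_a + cross[1] * sin_a + ky * dot * (1.0 - cos_a),
        vz * cos_a + cross[2] * sin_a + kz * dot * (1.0 - cos_a),
    )
    return tuple(r + o for r, o in zip(rotated, center))


def build_cyclic(subunit: List[Point], axis: Sequence[float],
                 center: Sequence[float], order: int) -> List[List[Point]]:
    """Return 'order' copies of the subunit, rotated by multiples of 2*pi/order."""
    return [
        [rotate_about_axis(atom, axis, center, 2.0 * pi * k / order) for atom in subunit]
        for k in range(order)
    ]


subunit = [(10.0, 0.0, 0.0), (11.0, 1.0, 2.0)]
assembly = build_cyclic(subunit, (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), 6)
print(len(assembly))  # 6
```

Changing the last argument from 6 to 7 gives the analogue of the c7 command above: the same subunit is placed seven times around the same axis.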
Case Studies
4.1 How Good Are PDB Symmetry Annotations?
We use AnAnaS to assess the quality of symmetry annotations in the PDB [4]. In 98.1% of the cases, the annotation from the PDB and the AnAnaS results were identical (Table 1). AnAnaS was also able to find a higher-order symmetry group in 1.6% of the cases. Figure 4
Table 1 Summary of the symmetry groups annotated in the PDB (rows) against the ones discovered by AnAnaS (columns). For example, there are 54 structures annotated as C2 in the PDB for which AnAnaS did not find any symmetry (column C1). ∗ compatible groups, ∗∗ incompatible groups

PDB annotation    Same group detected by AnAnaS    Total
C2                33,091                           33,883
C3                4188                             4269
C4                1046                             1060
C5                561                              568
C6                411                              416
C7                104                              110
C8                34                               37
D2                6571                             6609
D3                1939                             1952
D4                654                              661
D5                236                              237
D6                106                              106
D7                99                               101
D8                34                               34
T                 359                              364
O                 329                              329
I                 617                              625
Fig. 4 Relation between the RMSD symmetry measure and the radius of gyration of all structures annotated as symmetric in the PDB. Surprisingly, there is no positive correlation between the two. Axes: RMSD (Å) and radius of gyration (Å); color scale: number of structures
shows the number of structures with different symmetry groups, according to AnAnaS or the PDB.
4.2 Are the Big Assemblies More Symmetrical Than the Small Ones?
5
A simple geometric intuition would suggest that, if the angular uncertainty in the packing of subunits stays constant with the size of the assembly, the imperfection of its symmetry becomes more pronounced as the assembly grows larger. Therefore, one could expect a linear correlation between the RMSD symmetry measure and the radius of gyration of the assemblies. We used AnAnaS to compare these values for all the assemblies in the PDB [4]. Surprisingly, we found the opposite: the bigger assemblies seem to be better organized and have smaller imperfections than the small ones. Figure 4 provides more details of these experiments.
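The intuition behind the expected linear relation can be written as a small-angle estimate. This is only an illustrative sketch: the symbols $\delta\theta$ (a fixed angular packing error), $r_i$ (the distance of atom $i$ from the symmetry axis), and $N$ (the number of reference atoms) are introduced here and do not appear in the original analysis.

```latex
% Displacement of atom i for a small angular error \delta\theta about the axis:
\Delta_i \approx r_i \,\delta\theta
% For a fixed \delta\theta, the RMSD then grows in proportion to the assembly size:
\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\Delta_i^{2}}
\;\approx\; \delta\theta \,\sqrt{\frac{1}{N}\sum_{i=1}^{N} r_i^{2}}
\;\propto\; R_g
```

Under this assumption the RMSD would scale with the radius of gyration $R_g$, which is the trend that the data in Fig. 4 do not show.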
FAQ
Why Do I Obtain Different RMSD for the Same Axis in Different Symmetry Groups?
To perform the computation, AnAnaS uses sequentially aligned alpha-carbons as the reference points. Different symmetry groups require different sets of subunits to be sequentially aligned and thus change the reference points used.
When Should I Change the Default Symmetry Threshold Value?
The default threshold of 7 Å is usually a good choice to consider an assembly as symmetric when the radius of gyration of this assembly is between 15 and 100 Å. For very small assemblies, however, this cutoff might be too large, and one may reduce it to keep fewer results. Conversely, if one wants to work with large assemblies with large deformation amplitudes, one may increase the threshold.
The Symmetry of My Assembly Is Not Detected, What Can I Do?
The first step is to check the number of chains in your assembly. AnAnaS will try to find a symmetry involving every single chain of the assembly. If some chains are missing or additional chains are present, the recognition of the symmetry group will fail. Make sure you have removed chains not involved in the symmetry, and provide the symmetry group as input if your assembly has missing subunits. Second, you may try to increase the RMSD threshold: if your assembly presents large-amplitude deformations, its RMSD may exceed the predefined threshold. Last, AnAnaS may be unable to properly determine the correspondences between the different chains of the assembly. In this case, you may provide the correspondences manually.
How Can I Provide Manually the Correspondence Between the Chains?
The correspondence may be provided using the P option. For cyclic assemblies, the correspondence between the chains must be provided as shown in Fig. 5. The chains are labeled from zero in the order of appearance in the input file. For the dihedral and cubic assemblies, two correspondences should be given, one corresponding to a rotation of order n (or 3 for cubic) and one corresponding to a rotation of order 2. This option is rather difficult to use; thus, we recommend that the user first consult the detected permutations by using the “-p” flag.
Fig. 5 Example of how to provide the correspondence between the chains. Here, the provided correspondence should be (3,4,0,1,2) as after a fifth of a turn, the subunit 0 goes to the place of the subunit 3 (0 → 3), 1 → 4, 2 → 0, 3 → 1, 4 → 2
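Before passing a correspondence to the program, it can be useful to check it by hand. The sketch below reads the tuple as in Fig. 5 (entry i is the place that chain i moves to) and tests whether it is a single cycle through all chains, which is what a rotation by one step of a cyclic assembly should produce. The function name is hypothetical and the check is not part of AnAnaS.

```python
from typing import Sequence


def is_single_cycle(correspondence: Sequence[int]) -> bool:
    """True if the correspondence is a permutation forming one cycle of length n."""
    n = len(correspondence)
    if sorted(correspondence) != list(range(n)):
        return False  # not a permutation of the chain labels 0..n-1
    steps, current = 0, 0
    # Follow chain 0 until it returns to its starting place.
    while True:
        current = correspondence[current]
        steps += 1
        if current == 0 or steps > n:
            break
    return steps == n


fig5 = (3, 4, 0, 1, 2)
for chain, target in enumerate(fig5):
    print(f"{chain} -> {target}")
print(is_single_cycle(fig5))          # True: 0 -> 3 -> 1 -> 4 -> 2 -> 0
print(is_single_cycle((1, 0, 3, 2)))  # False: two separate swaps
```

For dihedral and cubic groups, the second correspondence describes a rotation of order 2 and is therefore not expected to pass this particular check.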
Acknowledgments
The authors thank Nikolay Mayorov for his 2D trust-region algorithm developed during Google Summer of Code 2015 and Sergei Khashin from Ivanovo State University for his fourth-order polynomial solver available at http://math.ivanovo.ac.ru/dalgebra/Khashin/poly/index.html. The authors also thank Elvira Kinzina and Andrei Kazennov from MIPT Moscow for their support at the initial stage of the project. This work has been supported by L’Agence Nationale de la Recherche (grant number ANR-15-CE11-0029-03).
References
1. Petitjean M (1999) On the root mean square quantitative chirality and quantitative symmetry measures. J Math Phys 40:4587–4595
2. Pinsky M et al (2008) Analytical methods for calculating continuous symmetry measures and the chirality measure. J Comput Chem 29:2712–2721
3. Pagès G, Kinzina E, Grudinin S (2018) Analytical symmetry detection in protein assemblies. I. Cyclic symmetries. J Struct Biol 203:142–148
4. Pagès G, Grudinin S (2018) Analytical symmetry detection in protein assemblies. II. Dihedral and cubic symmetries. J Struct Biol 203:185–194
5. Popov P, Grudinin S (2014) Rapid determination of RMSDs corresponding to macromolecular rigid body motions. J Comput Chem 35(12):950–956
Chapter 15
MDockPeP: A Web Server for Blind Prediction of Protein–Peptide Complex Structures
Xianjin Xu and Xiaoqin Zou
Abstract
Protein–peptide interactions mediate a wide range of important cellular tasks. In silico prediction of protein–peptide complex structures is highly desirable for mechanistic investigation of these processes and for therapeutic design. Recently, we developed a docking-based method for predicting protein–peptide complex structures, which starts with the peptide sequence and globally docks the all-atom, flexible peptide onto the protein structure. The produced modes are then evaluated with a statistical potential-based scoring function. The method has been implemented in an online server, the MDockPeP server, which is freely available at http://zougrouptoolkit.missouri.edu/mdockpep. The server can be used for protein–peptide complex structure prediction. It can also be used for initial-stage sampling of protein–peptide binding modes for computationally demanding simulation or docking methods.
Key words Protein–peptide docking, Molecular modeling, Molecular docking, Structure prediction, Web server, Peptide-based drug design, Binding mode sampling, Peptide therapeutics, Computer-aided drug design, In silico drug design, Ab initio docking, Template-free, Knowledge-based scoring function, MDockPeP, ITScorePeP, AutoDock Vina
1
Introduction
Protein–peptide interactions play crucial roles in a variety of cellular processes such as transcription regulation, signal transduction, and immune response [1]. Peptide-based therapeutics have attracted much attention in recent years, and an increasing number of peptide-based drugs have been designed and approved for different types of diseases [2]. The structure of the protein–peptide complex is key to understanding the underlying mechanism of the interaction between the protein and the peptide and is therefore critical for peptide therapeutic development. However, it is difficult and expensive to solve a protein–peptide complex structure using experimental methods such as X-ray crystallography and NMR. Recently, a number of in silico methods have been developed to predict protein–peptide complex structures. Current methods can
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_15, © Springer Science+Business Media, LLC, part of Springer Nature 2020
be grouped into three classes: template-based modeling, local docking, and global docking. The template-based methods are computationally efficient but suffer from the limited number of available protein–peptide templates [3, 4]. An example of a template-based method is GalaxyPepDock [4]. For local docking, a user-defined binding site is required for a prediction. Rosetta FlexPepDock [5] and HADDOCK [6] peptide docking are two well-developed local docking methods. High-accuracy models can be achieved if a correct binding location is provided. However, binding site information is often unknown in practice. Therefore, a few global docking methods were developed for predicting protein–peptide complex structures without any knowledge about the peptide binding sites. Recently released global docking programs/Web servers include pepATTRACT [7], CABS-dock [8], MDockPeP [9], ClusPro PeptiDock [10], PIPER-FlexPepDock [11], and HPEPDOCK [12]. A thorough summary of current protein–peptide complex structure prediction methods can be found in a recent review [13].
Our global blind docking method, MDockPeP, requires only the peptide sequence and the 3D structure of the protein as inputs for a prediction. MDockPeP was systematically validated and achieved good performance on the peptiDB [14] benchmarking database. The method was recently implemented in the online MDockPeP server (http://zougrouptoolkit.missouri.edu/mdockpep) [15], which is free and open to all users without registration. In this chapter, a brief introduction is given to the programs and methods used in the MDockPeP server, followed by a detailed description of how to use the server.
2 Materials
2.1 Inputs
Only two inputs, a peptide sequence and a protein structure, are required to run a prediction in the MDockPeP server.
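Before submitting a job, the peptide input can be checked locally against the limits given in Subheading 3.1 (the 20 standard amino acids and a length of 4–20 residues). The following is a minimal sketch with hypothetical function names; it is not the server's own validation code, and it assumes a single-record FASTA file or a bare one-letter sequence.

```python
from typing import List

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
MIN_LEN, MAX_LEN = 4, 20                   # peptide length limits from Subheading 3.1


def read_peptide(text: str) -> str:
    """Return the sequence from a bare string or a single-record FASTA text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    sequence_lines = [line for line in lines if not line.startswith(">")]
    return "".join(sequence_lines).upper()


def check_peptide(sequence: str) -> List[str]:
    """Return a list of problems; an empty list means the peptide looks acceptable."""
    problems = []
    if not MIN_LEN <= len(sequence) <= MAX_LEN:
        problems.append(f"length {len(sequence)} is outside {MIN_LEN}-{MAX_LEN}")
    unknown = sorted(set(sequence) - STANDARD_AA)
    if unknown:
        problems.append("non-standard residues: " + ", ".join(unknown))
    return problems


peptide = read_peptide(">1AWQ:B\nHAGPIA\n")
print(peptide, check_peptide(peptide))  # HAGPIA []
```

The example record is the FASTA file shown in Subheading 3.1; a sequence such as “HAX” would be reported as both too short and containing a non-standard residue.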
2.2 Programs Used in MDockPeP
MDockPeP consists of three primary stages: (1) The peptide conformers are modeled based on the given peptide sequence. (2) Putative, flexible peptide binding modes are sampled on the whole protein surface. (3) The sampled binding modes are scored and ranked. The programs and methods used in each stage are described briefly as follows.
2.2.1 Construction of the Initial Peptide Conformers
For a given peptide sequence, MDockPeP models up to three nonredundant conformers, based on the similar sequence fragments from a nonredundant protein database provided by MODELLER [16] (pdb95.pir.gz, updated on October 20, 2016). To ensure fragment hits are from proteins rather than from peptides, proteins with sequence lengths 50 are removed from the database. An exhaustive searching strategy is used to find fragments
MDockPeP: Protein-Peptide Docking Webserve
with sequences similar to the query peptide. Specifically, the query peptide sequence works as a sliding window and slides along each protein sequence in the protein database. The sliding process starts from the N-terminus of the protein sequence and moves one residue at each step. The sequence similarity and the sequence identity between the query peptide and the corresponding fragment in the protein are calculated based on the BLOSUM62 matrix [17]. Fragments with sequence similarity 5.5 Å), the rigid global sampling process will be repeated for the initial peptide conformer, followed by flexible sampling. The procedure stops when the maximum step number for ILS, N, is reached. N is dependent on both the number of torsional angles and the number of movable atoms. The exhaustiveness value in Vina is set to 100 for the MDockPeP server, which means 100 independent runs are performed for each docking. Finally, up to 2 × 10^4 binding modes are generated for each initial peptide conformer.
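The sliding-window fragment search described for the construction of the initial peptide conformers can be sketched as follows. To keep the example self-contained, it scores each window by sequence identity only and does not compute the BLOSUM62-based similarity used by MDockPeP; the identity threshold of 0.8 and the function name are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple


def sliding_window_hits(peptide: str, protein: str,
                        min_identity: float) -> List[Tuple[int, str, float]]:
    """Slide the peptide along the protein one residue at a time.

    Returns (start index, fragment, identity) for every window whose
    fraction of identical residues is at least min_identity.
    """
    window = len(peptide)
    hits = []
    for start in range(len(protein) - window + 1):
        fragment = protein[start:start + window]
        matches = sum(1 for a, b in zip(peptide, fragment) if a == b)
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, fragment, identity))
    return hits


# Toy protein sequence containing one exact and one near match to the peptide.
hits = sliding_window_hits("HAGPIA", "MKHAGPIAQQHAGPLA", min_identity=0.8)
for start, fragment, identity in hits:
    print(start, fragment, round(identity, 2))
```

In the toy sequence, the windows starting at positions 2 and 10 are retained; in practice the structures of such fragments would then serve as templates for the peptide conformers.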
2.2.3 Rescoring of the Sampled Peptide Binding Modes
Binding modes generated from different initial peptide conformers are combined and ranked according to their energy scores calculated by an in-house scoring function, ITScorePeP [9]. ITScorePeP is a statistical potential-based scoring function developed for protein–peptide docking. Contributions from both interactions between the protein and the peptide (interscore) and interactions among non-neighboring residues within the peptide (intrascore) are considered in the scoring function. For any two modes with a ligand RMSD (Lrmsd) less than a cutoff, only the one with the lower score is kept. Lrmsd is calculated based on the backbone atoms of the peptide between the predicted binding mode and the native binding mode after the optimal superimposition of the protein structures. The cutoff is set to 4.0 Å for the prediction of the top ten models. For the enrichment of high-quality models (Lrmsd < 3.0 Å) in the top 500 models that are provided to the user as the sampling results, the cutoff is set to 2.0 Å.
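The redundancy filter described above can be illustrated with a simple greedy procedure: modes are visited from the lowest score upward, and a mode is kept only if it is at least the cutoff away from every mode already kept. This is a sketch under stated assumptions, not the MDockPeP code. It assumes that all modes share the same receptor frame (so no superimposition is needed), that each mode is a list of peptide backbone coordinates in the same atom order, and that lower scores are better.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Coords = Sequence[Tuple[float, float, float]]
Mode = Tuple[float, Coords]  # (score, peptide backbone coordinates)


def rmsd(a: Coords, b: Coords) -> float:
    """RMSD between two coordinate sets without any superimposition."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and of equal size")
    total = sum((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
                for (x1, y1, z1), (x2, y2, z2) in zip(a, b))
    return sqrt(total / len(a))


def filter_modes(modes: List[Mode], cutoff: float) -> List[Mode]:
    """Keep the lower-scoring member of any pair closer than the cutoff."""
    kept: List[Mode] = []
    for score, coords in sorted(modes, key=lambda mode: mode[0]):
        if all(rmsd(coords, other) >= cutoff for _, other in kept):
            kept.append((score, coords))
    return kept


base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
near = [(x + 1.0, y, z) for x, y, z in base]   # 1 A away from base
far = [(x + 10.0, y, z) for x, y, z in base]   # 10 A away from base
modes = [(-8.0, near), (-10.0, base), (-9.0, far)]
top = filter_modes(modes, cutoff=4.0)
print([score for score, _ in top])  # [-10.0, -9.0]
```

With the 4.0 Å cutoff, the mode that lies 1 Å from the best-scoring one is discarded; the same function with a 2.0 Å cutoff corresponds to the finer filtering used for the top 500 models.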
2.3 MDockPeP Web Server
The prediction methods have been implemented in the MDockPeP Web server, which is publicly accessible at http://zougrouptoolkit.missouri.edu/mdockpep. The user needs only to provide a peptide sequence and a target protein structure, and the MDockPeP server will then automatically perform the prediction. The top ten binding modes are reported and are viewable online. In addition, the server provides the top 500 models, which can be used as enriched initial-stage sampling for other docking or simulation methods.
3 Methods
3.1 Job Submission
MDockPeP is freely accessible at http://zougrouptoolkit.missouri.edu/mdockpep. The Web server has been tested on three popular Web browsers: Google Chrome, Firefox, and Internet Explorer. To submit a job, two inputs are required: a peptide sequence and a protein structure. Other inputs are optional. A snapshot of the job submission page is shown in Fig. 1.
1. The peptide sequence can be provided by either entering single-letter amino acid codes or uploading a FASTA file. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. An example of a FASTA file is as follows:
>1AWQ:B
HAGPIA
Only the 20 standard amino acids are accepted by the MDockPeP server. The sequence length of the peptide should be within 4–20 residues (see Note 1).
Fig. 1 A snapshot of the job submission page
Fig. 2 An example of preparing a protein receptor with the UCSF Chimera program
2. A protein structure should be uploaded in the PDB format. Thirty-one to one thousand amino acids are allowed for the protein. Although the uploaded protein file will be cleaned (e.g., deletion of the solvates and ligands) after submission, uploading a pre-checked structure is recommended. Figure 2 shows an example of preparing a protein receptor for structure prediction. Here, we use a publicly available program, UCSF Chimera [19]. The experimental structure of the protein is downloaded from Protein Data Bank (PDB). Co-bound
Fig. 3 Checking the input files for a submission
partners (peptide or protein) and nonstandard amino acids (including ligands, ions, and solvents) are selected and then deleted from the original PDB file. The cleaned structure is saved in the PDB format for docking.
3. The “Job name” is optional. The user can search for a job using the job name in the “Queue page.”
4. The “Email address” is optional but recommended. If an email address is provided, the user will receive an email notification after the job is completed.
5. The server also allows a submitted job to be private by checking the option “Do not show my job on the queue page.”
6. Click on the “Submit” button once the input files are ready. The server will check the submitted files (see Fig. 3). Once all the inputs pass the filters, the job will be assigned a unique job ID. The user can monitor the job status on the “Queue page” using this job ID (see Fig. 4). If the job is private, a random number will be generated for the job. In this case, we recommend that the user provide an email address to receive the prediction results or save the link to the result page (see Fig. 3).
3.2 Advanced Options
The MDockPeP server provides several advanced options for the user to improve the prediction results. A snapshot of the “Advanced options” is shown in Fig. 5. The usage of each option is described in detail as follows:
Fig. 4 A snapshot of the “Queue” page
Fig. 5 Advanced options for a job submission
1. “The number of initial peptide models to be used for docking” is set to three by default. This number can be set to any integer between one and five. Note that increasing the number of initial peptide conformers will dramatically increase the computational cost of a prediction because each peptide conformer is independently docked to the target protein (see Subheading 2.2.2 for details).
2. The user is allowed to upload one initial peptide 3D structure in the PDB format. If “The number of initial peptide models to be used for docking” is set to one, only the user-provided peptide structure will be used as the initial peptide conformation.
Fig. 6 An example of the coordinate section in a PDB file (PDB ID: 1iak)
3. The user is also allowed to control the degree of restriction of the peptide conformations in the sampling process by changing the cutoff value of the backbone RMSD (bRMSD). The default value is 5.5 Å (see Note 2).
4. Increasing the exhaustiveness value allows a larger conformational space to be sampled at the cost of increased computational time. The default exhaustiveness value is 100; that is, each docking calculation (docking one initial peptide conformer onto the protein) contains 100 independent runs (see Note 3).
5. The user is allowed to define a binding location by providing the coordinates of a cubic searching box (see Note 4). The XYZ coordinates of the center can be estimated from the coordinates of residues near the binding site, which are available in the PDB file (in the coordinate section; see Fig. 6 for an example). The size of the searching box can be estimated using the Chimera BILD tool. Installation and usage of the BILD tool are described at https://github.com/josan82/Chimera_BILD. The following command is used to draw a box in Chimera: box x1 y1 z1 x2 y2 z2 color [transparency]. Figure 7 shows an example based on the PDB ID 1iak. The X, Y, and Z coordinates of the center of the box are set to 10 Å, 76 Å, and 51 Å, respectively. The box size is set to 40 Å. Therefore, x1, y1, z1, x2, y2, and z2 are calculated as follows:
x1 = 10 - 40/2 = -10 Å; x2 = 10 + 40/2 = 30 Å;
y1 = 76 - 40/2 = 56 Å; y2 = 76 + 40/2 = 96 Å;
z1 = 51 - 40/2 = 31 Å; z2 = 51 + 40/2 = 71 Å.
In addition to the Chimera BILD tool, AutoDockTools (http://autodock.scripps.edu/resources/adt) can also be used to define and view the searching box.
Fig. 7 A searching box drawn by the Chimera BILD tool for PDB ID 1iak
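The corner coordinates in the 1iak example follow directly from the box center and side length, and the same arithmetic is easy to script. The sketch below also includes the default side length that the server applies when no size is given, (3.8 × peptide length + 40) Å. The function names and the color “red” are hypothetical; only the “box x1 y1 z1 x2 y2 z2 color” layout is taken from the text.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def box_corners(center: Sequence[float], side: float) -> Tuple[Point, Point]:
    """Return the lower and upper corners of a cubic box."""
    half = side / 2.0
    lower = tuple(c - half for c in center)
    upper = tuple(c + half for c in center)
    return lower, upper


def default_side(peptide_length: int) -> float:
    """Default box side in angstroms when the user gives no box size."""
    return 3.8 * peptide_length + 40.0


def bild_command(lower: Point, upper: Point, color: str = "red") -> str:
    """Format the corners in the 'box x1 y1 z1 x2 y2 z2 color' layout."""
    numbers = " ".join(f"{value:g}" for value in lower + upper)
    return f"box {numbers} {color}"


lower, upper = box_corners((10.0, 76.0, 51.0), 40.0)
print(lower, upper)                # (-10.0, 56.0, 31.0) (30.0, 96.0, 71.0)
print(bild_command(lower, upper))  # box -10 56 31 30 96 71 red
```

For a six-residue peptide such as HAGPIA, default_side(6) gives a box side of about 62.8 Å.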
If the user does not provide the box size, the size will be automatically determined according to the peptide length. Specifically, the side of the cubic box equals (3.8 × peptide_sequence_length + 40) Å. The value 3.8 is the distance (in Å) between two CA atoms in adjacent residues.
3.3 Analysis of the Results
Once a job is completed, the result page will be updated with the prediction results. Figure 8 shows an example of a result page.
1. Basic information about the job is provided, including the job ID, job name, running time, peptide sequence, and protein structure.
2. The top ten predicted protein–peptide complex structures are displayed via 3Dmol.js [20]. Proteins are represented by the ribbon model, and peptides are represented by both the stick model and the ribbon model.
3. Individual models can be viewed by clicking.
4. Binding scores calculated by ITScorePeP are provided for the predicted protein–peptide complex structures.
Fig. 8 A snapshot of a result page
5. Individual predicted protein–peptide complex structures can be downloaded from the links.
6. All the results can be downloaded in a compressed file (JobID.tar.gz), including a cleaned protein structure in the PDB format and the top 10 and top 500 predicted binding modes (see Notes 5 and 6). The file can be decompressed using the command “tar -zvxf JobID.tar.gz” on a Linux platform.
Although the server displays the predicted models online, the user is advised to use a local program to analyze the details of the predicted interactions. Here, the “ViewDock” tool in UCSF Chimera is used as an example. First, open the protein structure (“rec_clean.pdb”) in Chimera. Then, load the predicted binding modes “Top10_pep_models.pdb” or “Top500_pep_models.pdb” using the “ViewDock” tool. The user can easily view each peptide binding mode on the protein surface by clicking on the model list in the “ViewDock” window. The usage of the Chimera ViewDock tool is shown in Fig. 9.
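On platforms without the tar command, the same archive can be unpacked with the Python standard library. The sketch below extracts the archive and checks for the three file names mentioned above; the function names are hypothetical, and the internal folder layout of the archive is not assumed (only base names are compared).

```python
import tarfile
from pathlib import Path
from typing import List

EXPECTED = ["rec_clean.pdb", "Top10_pep_models.pdb", "Top500_pep_models.pdb"]


def unpack_results(archive: str, out_dir: str) -> List[str]:
    """Extract a JobID.tar.gz archive and return the names of its members."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(target)
    return sorted(names)


def missing_files(names: List[str]) -> List[str]:
    """List the expected result files that are absent from the archive."""
    present = {Path(name).name for name in names}
    return [name for name in EXPECTED if name not in present]


# Runs only if an archive with this placeholder name is present.
if Path("JobID.tar.gz").exists():
    members = unpack_results("JobID.tar.gz", "results")
    print(members, missing_files(members))
```

Replace “JobID” with the actual job ID; as with any downloaded archive, it is prudent to extract only files obtained from a trusted source.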
Fig. 9 The use of the “ViewDock” tool in UCSF Chimera to display the predicted protein–peptide binding modes
4
Case Studies
The MDockPeP server was systematically tested using the peptiDB benchmarking database [14]. The database consists of 100 protein–peptide complex structures, for which 64 unbound protein structures are available. The peptide sequence length ranges from 5 to 15. The results and analysis are available in Refs. 9 and 15. On the website of the MDockPeP server, we provide a demo using the PDB ID 1awr. The peptide sequence and a cleaned structure of the receptor are available for download. Prediction results are also available on the website under the job ID S00689 (http://zougrouptoolkit.missouri.edu/mdockpep/Results/S00689/). A near-native binding mode was successfully ranked at No. 1. As shown in Fig. 10, the Lrmsd between the predicted binding mode (No. 1, cyan) and its experimental structure (tan) is 3.1 Å. It is interesting that a crystal water was found to play an important role in the protein–peptide interaction by forming hydrogen bonds between the two partners. This could be the reason why MDockPeP did not yield a higher-accuracy model in this case.
Fig. 10 The top predicted binding mode (cyan) and its experimental structure (tan) for PDB id 1awr. The structures are displayed with UCSF Chimera
5
Notes
1. MDockPeP has been systematically assessed using a protein–peptide benchmark in which the peptide length ranges from 5 to 15. The MDockPeP server allows the user to submit a peptide with a length of up to 20 amino acids. The performance of MDockPeP on peptides longer than 15 residues has not been assessed.
2. For long peptides, the user may need to loosen the restriction of the peptide conformation during the sampling process. The default restriction is bRMSD = 5.5 Å.
3. The user could improve the prediction by increasing the number of initial peptide conformers and/or by increasing the exhaustiveness value.
4. If the binding site information is available, local docking with the use of a searching box is highly recommended.
5. The current scoring function has limited accuracy. The user may use a more accurate scoring function (usually more time-consuming) to re-rank the sampled binding modes (top 500 models) from the MDockPeP server.
6. The protein structure is treated as a rigid body during the prediction. The user may refine the complex structures generated by the MDockPeP server using methods such as molecular dynamics simulation.
Acknowledgments
This work was supported by NIH R01GM109980 (PI: XZ), NIH R01HL126774, and NIH R01HL142301 (PI: Cui) to XZ. The computations were performed on the high-performance computing infrastructure supported by NSF CNS-1429294 (PI: Chi-Ren Shyu) and the HPC resources supported by the University of Missouri Bioinformatics Consortium (UMBC).
References
1. Petsalaki E, Russell RB (2008) Peptide-mediated interactions in biological systems: new discoveries and applications. Curr Opin Biotechnol 19:344–350
2. Fosgerau K, Hoffmann T (2015) Peptide therapeutics: current status and future directions. Drug Discov Today 20:122–128
3. Verschueren E, Vanhee P, Rousseau F, Schymkowitz J, Serrano L (2013) Protein-peptide complex prediction through fragment interaction patterns. Structure 21:789–797
4. Lee H, Heo L, Lee MS, Seok C (2015) GalaxyPepDock: a protein–peptide docking tool based on interaction similarity and energy optimization. Nucleic Acids Res 43:W431–W435
5. London N, Raveh B, Cohen E, Fathi G, Schueler-Furman O (2011) Rosetta FlexPepDock web server-high resolution modeling of peptide–protein interactions. Nucleic Acids Res 39:W249–W253
6. Trellet M, Melquiond AS, Bonvin AM (2013) A unified conformational selection and induced fit approach to protein-peptide docking. PLoS One 8:e58769
7. Schindler CE, de Vries SJ, Zacharias M (2015) Fully blind peptide-protein docking with pepATTRACT. Structure 23:1507–1515
8. Kurcinski M, Jamroz M, Blaszczyk M, Kolinski A, Kmiecik S (2015) CABS-dock web server for the flexible docking of peptides to proteins without prior knowledge of the binding site. Nucleic Acids Res 43:W419–W424
9. Yan C, Xu X, Zou X (2016) Fully blind docking at the atomic level for protein-peptide complex structure prediction. Structure 24:1842–1853
10. Porter KA et al (2017) ClusPro PeptiDock: efficient global docking of peptide recognition motifs using FFT. Bioinformatics 33:3299–3301
11. Alam N et al (2017) High-resolution global peptide-protein docking using fragments-based PIPER-FlexPepDock. PLoS Comput Biol 13:e1005905
12. Zhou P, Jin B, Li H, Huang SY (2018) HPEPDOCK: a web server for blind peptide-protein docking based on a hierarchical algorithm. Nucleic Acids Res 46:W443–W450
13. Ciemny M, Kurcinski M, Kamel K, Kolinski A, Alam N, Schueler-Furman O, Kmiecik S (2018) Protein-peptide docking: opportunities and challenges. Drug Discov Today 23:1530–1537
14. London N, Movshovitz-Attias D, Schueler-Furman O (2010) The structural basis of peptide-protein binding strategies. Structure 18:188–199
15. Xu X, Yan C, Zou X (2018) MDockPeP: an ab initio protein–peptide docking server. J Comput Chem 39:2409–2413
16. Webb B, Sali A (2014) Comparative protein structure modeling using Modeller. Curr Protoc Bioinformatics 47:5.6.1–5.6.32
17. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci 89:10915–10919
18. Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 31:455–461
19. Pettersen EF et al (2004) UCSF chimera-a visualization system for exploratory research and analysis. J Comput Chem 25:1605–1612
20. Rego N, Koes D (2014) 3Dmol.js: molecular visualization with WebGL. Bioinformatics 31:1322–1324
Chapter 16
Protocols for All-Atom Reconstruction and High-Resolution Refinement of Protein–Peptide Complex Structures
Aleksandra E. Badaczewska-Dawid, Alisa Khramushin, Andrzej Kolinski, Ora Schueler-Furman, and Sebastian Kmiecik
Abstract
Structural characterizations of protein–peptide complexes may require further improvements. These may include reconstruction of missing atoms and/or structure optimization leading to higher accuracy models. In this work, we describe a workflow that generates accurate structural models of peptide–protein complexes starting from protein–peptide models in C-alpha representation generated using CABS-dock molecular docking. First, protein–peptide models are reconstructed from their C-alpha traces to all-atom representation using MODELLER. Next, they are refined using Rosetta FlexPepDock. The described workflow allows for reliable all-atom reconstruction of CABS-dock models and their further improvement to high-resolution models.
Key words Protein reconstruction, Coarse-grained model, Protein structure, Protein refinement, Protein–peptide interaction
1 Introduction
In recent years, peptides have found many successful applications as therapeutic agents and lead molecules in drug design. Consequently, there is a large interest in structural characterization of protein–peptide interactions. Unfortunately, the experimental characterization can be difficult or impossible. Thus, a variety of computational approaches have been developed for structure prediction of protein–peptide complexes [1]. Computational, or even experimental, structure characterizations often require reconstruction of missing atoms and/or model refinement to a higher resolution. This is also the case for the CABS-dock method [2, 3], a well-established tool for protein–peptide docking that has been recently reviewed [4].
In this chapter, we present the protocols for all-atom reconstruction of protein–peptide complexes (CABS-dock predictions) from C-alpha trace and their further refinement to higher accuracy models. The software tools used for generating protein–peptide models (CABS-dock), all-atom reconstruction (MODELLER [5]), and refinement (Rosetta FlexPepDock [6]) are briefly presented in Subheading 2. In Subheading 3, we describe step-by-step how to use the protocols and discuss their features. Finally, we describe protocol performance on a few example cases in Subheading 4. The described reconstruction workflow can be applied using input models from CABS-dock, other docking protocols, or sparse experimental data.
The advantage of the presented protocols over other protein reconstruction or refinement tools is that they are specifically tailored for protein–peptide complexes. For example, there are many tools that effectively rebuild atomic details from Cα-trace protein chains (a detailed overview of the available methods can be found in our most recent review [7]). However, most of them cannot easily handle more than one protein chain in the reconstructed structure, nor can they use structural template(s) of the protein backbone to enhance the reconstruction accuracy or to maintain the appropriate protein–peptide interaction interface. With regard to structure refinement of protein–peptide complexes, the robustness and accuracy of the Rosetta FlexPepDock tool have been demonstrated in many studies [8–12].
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_16, © Springer Science+Business Media, LLC, part of Springer Nature 2020
2 Materials
2.1 CABS-Dock
CABS-dock [2, 3] uses an efficient multiscale approach for fast simulation of peptide folding and binding that merges modeling at coarse-grained resolution with all-atom resolution. The main feature of the CABS-dock method is global flexible docking without a priori knowledge of the binding site and the peptide conformation. However, as an option, it is possible to use some knowledge about the interaction interface or peptide conformation in the form of weak distance restraints. CABS-dock is available as a web server [2] and as a stand-alone application [3]. CABS-dock has been applied in numerous docking studies, including docking associated with large-scale conformational transitions [13, 14]. It has also been extended to modeling protocols that allow for residue-residue contact map analysis of the binding dynamics [15], prediction of protein–protein interaction interfaces [16], and docking using fragmentary information about protein–peptide residue-residue contacts [17].
2.2 MODELLER
MODELLER is a program for comparative modeling of protein structures by satisfaction of spatial restraints [5]. The restraints are given as a set of geometrical criteria, which are used to calculate the locations of all non-hydrogen atoms in the protein structure.
Reconstruction and Refinement of Protein-Peptide Models
Smart integration of various MODELLER modules and tailored definition of additional restraints facilitate many other tasks, such as de novo modeling of loops, structure refinement, energy optimization, simple all-atom reconstruction from coarse-grained models, and addition of hydrogen atoms [18, 19]. The required input is an alignment of the query amino acid sequence with known template structures and the corresponding structure data in PDB format. The optional input is a starting structure, for example, one obtained as a result of coarse-grained modeling. The initial structure and templates must contain at least the alpha carbons; compared with all-heavy-atom templates, however, a Cα-only input loses homology-derived information, such as backbone atom distances and side chain dihedrals. The positions of a selected type of atoms (e.g., Cα) or of a chain fragment can be frozen during modeling if necessary.
The MODELLER software is available free of charge for academic use, but a license key valid for a term of 5 years is required. The license key and the software package can be obtained from the website https://salilab.org/modeller/. MODELLER is a command-line only tool and runs on most Unix/Linux systems, Windows, and MacOS. The MODELLER commands are usually provided in a Python script. The user should have basic Python scripting skills, but many useful examples of task-dedicated scripts can be found in the examples directory.
2.3 Rosetta FlexPepDock
FlexPepDock is a high-resolution refinement protocol for modeling of protein–peptide complexes developed within the Rosetta framework [20]. FlexPepDock usually requires an approximate predefined localization of the binding site. It uses the Monte Carlo-with-minimization (MCM) approach to optimize the rigid body orientation of the peptide, including full flexibility of the peptide backbone as well as of all receptor side chains. FlexPepDock is highly effective if the initial peptide conformation is located within 5 Å backbone RMSD from the target conformation [6]. The required input is an initial structure of the protein–peptide complex. As optional input, a reference structure of the protein–peptide complex, such as the native bound structure, can be provided for RMSD calculations and evaluation of convergence. The initial structure of the complex needs to provide all heavy backbone atoms, including Cβ. The software is available free of charge for academic use, after signing a licensing agreement to download the Rosetta package (https://www.rosettacommons.org/software/license-and-download). Rosetta FlexPepDock is also available via a web server at http://flexpepdock.furmanlab.cs.huji.ac.il/.
3 Methods
Figure 1 presents the reconstruction and refinement scheme described in this work. The scheme uses the ca2all.py script, based on MODELLER [20, 21], and the FlexPepDock [12] protocol.
3.1 Reconstruction of All-Atom Representations from C-Alpha Traces Using MODELLER
The CABS-dock stand-alone package [3] offers automatic reconstruction of top-scored models from Cα-trace to atomic resolution using the ca2all.py script (available from our repository: https://bitbucket.org/lcbio/ca2all/) based on the MODELLER [21] software. Note that CABS-dock releases 0.9.15 and earlier use previous versions of the ca2all.py script, which may lack some of the functionalities described below.
Fig. 1 Pipeline of all-atom reconstruction (a) and refinement (b) of protein–peptide complexes. The coarse-grained CABS-dock docking procedure provides n (default n = 10) Cα-traces of top-scoring complex structures. The atomic details for each of these models can be reconstructed using the ca2all.py script based on the MODELLER software. Optionally, the receptor can be replaced by, for example, the free receptor structure (see Subheading 4). After removal of clashes in the receptor (using the FlexPepDock prepack protocol), final refinement (using the FlexPepDock refine protocol) is performed. This should include at least 200 independent refinement trajectories (N = 200 optimization runs), of which the top K are extracted and further analyzed (usually K = 10)
In order to rebuild top-scored models, use the --aa-rebuild (or -A) command-line CABS-dock flag, as in the example below.
Usage:
$ CABSdock -i protein-pdb-code -p peptide-sequence:peptide-secondary-structure -A
$ CABSdock -i 2FVJ:A -p HKLVQLLTTT:CHHHHHHHCC -A
Output | Is default? | Description
model.pdb | Yes | Output all-atom protein–peptide complex structure
Note that ca2all.py requires MODELLER to be installed. The program offers comprehensive reconstruction and refinement options, as well as some additional user-friendly (fancy) options:
Inputs | Is required? | Description
model_CA.pdb | Yes | CA-trace of protein-peptide complex; CABS-dock output models in this work
receptor.pdb | No | All-atom template structure of protein receptor (see description in the text)
Python script | Location
ca2all.py | https://bitbucket.org/lcbio/ca2all/
Usage:
$ python ca2all.py -i model_CA.pdb -ii 1 -t receptor.pdb -o output.pdb -n 1 -v -k
Options | Is required? | Description
-i NAME | Yes | --input-PDB, input PDB file with CA-trace (initial) structure
-ii NUM | No | --index-of-model, index of model to be rebuilt; default NUM = 1 means the first model, NUM = 0 means all models
-t NAME | No | --template-PDB, input PDB file of all-atom receptor structure
-o NAME | Yes | --output-PDB, output PDB file with rebuilt complex structure
-m NUM | No | --modeller-iterations, number of models generated by MODELLER; default NUM = 1
-v | No | --verbose, print MODELLER output to stderr; default false
-k | No | --add-hydrogens, add hydrogen atoms to final all-atom structure; default false
Output | Is default? | Description
output.pdb | No | Output PDB file with rebuilt complex structure (PDB)
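When many models have to be rebuilt, it can be convenient to assemble the ca2all.py command line programmatically. The following is a minimal sketch, not part of the CABS-dock distribution: the helper name build_ca2all_command and its argument names are hypothetical, and it uses only the flags listed in the options table above (in particular, -m for the number of MODELLER iterations is taken from that table).

```python
import shlex

def build_ca2all_command(input_pdb, output_pdb, template_pdb=None,
                         model_index=1, iterations=1,
                         verbose=False, add_hydrogens=False,
                         script="ca2all.py"):
    """Assemble a ca2all.py command line (hypothetical helper).

    Only flags from the options table are used: -i, -ii, -t, -o, -m, -v, -k.
    """
    cmd = ["python", script, "-i", input_pdb, "-ii", str(model_index)]
    if template_pdb is not None:
        cmd += ["-t", template_pdb]          # optional all-atom receptor template
    cmd += ["-o", output_pdb, "-m", str(iterations)]
    if verbose:
        cmd.append("-v")
    if add_hydrogens:
        cmd.append("-k")
    return cmd

cmd = build_ca2all_command("model_CA.pdb", "output.pdb",
                           template_pdb="receptor.pdb", add_hydrogens=True)
print(shlex.join(cmd))
# To actually run the reconstruction (requires MODELLER and ca2all.py):
#   import subprocess; subprocess.run(cmd, check=True)
```

Returning the command as a list rather than a string avoids shell quoting problems when file names contain spaces.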
The procedure uses an automodel() class to build one or more (--modeller-iterations, -m option) comparative models based on the final Cα-trace provided by CABS-dock. Note that protein conformations generated by coarse-grained modeling may exhibit some unphysical distortions [22]. This is also the case for the CABS-dock C-alpha trace models obtained from the CABS coarse-grained protein model [22]. Those distortions may not be well tolerated by a reconstruction algorithm, since it is not always trivial to keep the desired global topology while maintaining a sound local atomic geometry. We address this issue using appropriate template structures. Based on these, MODELLER creates a set of restraints (mainly distances and angles using heavy atoms and topology). The more detailed the template structure is, the more restraints are provided. For this reason, the reconstruction procedure can use as templates not only the final Cα-chain (from CABS-dock, referred to as initial) but also the known all-atom structure of the receptor (referred to as template). In this way, the local geometry of the main chain and the orientation of the side chains of the receptor can be significantly improved by more accurate restraints from the template (unpublished results).
The final structure (including alpha carbon atoms) is optimized with the variable target function method with conjugate gradients and refined using molecular dynamics with simulated annealing. Alpha carbon positions are frozen in the first optimization step and freed in the second. By default, the structure returned by ca2all.py contains heavy atoms only, but the --add-hydrogens (or -k) option adds hydrogen atoms according to geometric criteria (with heavy atoms unchanged). Receptor structures containing chain breaks are not an obstacle to CABS-dock modeling. If a complete sequence of the protein is provided, it is possible to rebuild the missing part of the structure using the script rebuild_loops.py, which is publicly available in our repository (https://bitbucket.org/lcbio/ca2all/).
Using the ca2all.py reconstruction protocol, it is also possible to reconstruct selected models from the CABS-dock trajectory. By default, this is always the first model in the file, but the --index-of-model (or -ii NUM) option enables user selection of other models by specifying NUM, the model index (NUM = 0 means that all models from the trajectory will be processed). If the trajectory is large and you would like to speed up calculations, use our script divide_tra (available from our repository: https://bitbucket.org/lcbio/ca2all/), which divides the trajectory file into smaller ones and runs them in parallel on multiple cores.
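The idea of dividing a trajectory into smaller pieces can be illustrated with a short sketch. This is not the divide_tra script itself; it only assumes that the trajectory is a multi-model PDB file in which each frame is delimited by MODEL and ENDMDL records, and the function names split_models and chunk are hypothetical.

```python
def split_models(pdb_text):
    """Split a multi-model PDB string into one string per MODEL ... ENDMDL block."""
    models, current = [], None
    for line in pdb_text.splitlines():
        if line.startswith("MODEL"):
            current = [line]                      # start a new frame
        elif line.startswith("ENDMDL") and current is not None:
            current.append(line)
            models.append("\n".join(current) + "\n")
            current = None                        # frame closed
        elif current is not None:
            current.append(line)                  # record inside a frame
    return models

def chunk(models, n_chunks):
    """Distribute models over at most n_chunks lists in round-robin order."""
    chunks = [[] for _ in range(n_chunks)]
    for i, model in enumerate(models):
        chunks[i % n_chunks].append(model)
    return [c for c in chunks if c]               # drop empty chunks

# Usage sketch: write each chunk to its own file and process the files in
# separate ca2all.py runs, for example one run per CPU core.
```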
3.2 Refinement of CABS-Dock Models Using Rosetta FlexPepDock
CABS-dock models can be improved by additional structure refinement using the FlexPepDock protocol. For the refinement of an individual protein-peptide complex, one can use the online server (available at http://flexpepdock.furmanlab.cs.huji.ac.il/). Here, we present a variant using the stand-alone version, i.e., a locally installed Rosetta package. Note that Rosetta is written in C++ and needs to be compiled. A detailed description of compilation on different systems can be found at https://www.rosettacommons.org/docs/latest/build_documentation/Build-Documentation. After successful installation, some useful scripts for input preparation or results analysis can be found in the directory ~/Rosetta/tools/protein_tools/scripts/. Those required in this pipeline, such as clean_pdb.py, are described below. The executables of the applications for Rosetta protocols, here FlexPepDocking and extract_pdbs, can be found in the ~/Rosetta/main/source/bin/ directory, and the database is located in ~/Rosetta/main/database/. In this section, we describe how to prepare input files, refine the structure using FlexPepDock, and evaluate the results.
3.2.1 Prepare an Initial Complex Structure
The FlexPepDock protocol requires, as input, a structure of the protein-peptide complex that includes at least all heavy atoms of the backbone and the beta carbon atoms. Coarse-grained CABS-dock models reconstructed to all-atom representation using MODELLER satisfy this criterion. If CABS-dock models reconstructed with MODELLER contain local structure distortions, replacing the receptor with the experimental structure of the unbound receptor conformation may be helpful (see discussion in Subheading 4). It is recommended to keep only the “ATOM” section, although other molecules can be kept if needed (see the Rosetta tutorial “prepare ligand tutorial”). The Rosetta refinement protocol only requires the initial structure of the complex, but additional information can also be used if available. The unbound receptor structure is an additional source of rotamers for side chain packing, and the native protein–peptide structure can be used as a reference for the evaluation of refinement results.
Inputs | Is required? | Description
initial_complex.pdb | Yes | Top-scoring CABS-dock model in all-atom representation
Rosetta script | Location
clean_pdb.py | ~/Rosetta/tools/protein_tools/scripts/clean_pdb.py
Usage:
$ ~/Rosetta/tools/protein_tools/scripts/clean_pdb.py model.pdb receptor_chain_id peptide_chain_id
Output | Is default? | Description
model_AB.pdb | Yes | Initial complex structure
3.2.2 Prepack the Initial Complex Structure
Note that the final Rosetta score (total_score) takes into account not only the interactions of the protein–peptide interface but also the energy of internal interactions in each monomer. The initial structure of the protein–peptide complex may contain local distortions, especially side chain clashes or backbone distortions in the receptor structure, that result in high Rosetta energy penalties and prevent reliable identification of the best refined models. In order to avoid the effects of a nonuniform conformational background in non-interface regions, Rosetta repacking with the -flexpep_prepack flag is recommended. If several models are generated from a specific starting model, it is suggested to use the same prepacked structure as the starting point. In Subheading 4, we describe a modeling option in which we replace protein receptor coordinates in CABS-dock models with experimentally determined unbound receptor structures (see Note 1 and Subheading 4). This can be helpful if distortions at the backbone level occur that are not removed by side chain repacking.
Inputs | Is required? | Description
model_AB.pdb | Yes | Initial protein-peptide complex structure
Rosetta protocol | Location
FlexPepDocking | ~/Rosetta/main/source/bin/FlexPepDocking
Usage:
$ ~/Rosetta/main/source/bin/FlexPepDocking.linuxgccrelease @prepack.flagfile
Flagfile (prepack.flagfile) | Is required? | Description
-database ~/Rosetta/main/database | Yes | Rosetta database location
-s model_AB.pdb | Yes | Initial protein-peptide complex structure
-receptor_chain A | No | Receptor chain, default: first chain (multichain receptor supported: must appear first)
-peptide_chain B | No | Peptide chain, default: second chain
-ex1 | Yes | Extra rotamers for chi1 angles (side chain packing)
-ex2aro | Yes | Extra rotamers for aromatic chi2 angles
-use_input_sc | Yes | Extra rotamers from initial structure of complexes
-unboundrot receptor.pdb | No | Add the position-specific rotamers of the specified structure to the rotamer library (this is helpful when they are not represented well by the rotamer library)
-nstruct 1 | Yes | Create one prepacked output structure
-scorefile prepack_score.sc | Yes | Name of the output scorefile
-flexpep_prepack | Yes | Flag for running FlexPepDock prepack
-flexpep_score_only | Yes | Adds FlexPepDock specific scores to scorefile
Output | Is default? | Description
prepack_score.sc | No | Scorefile containing an energy description (scorefile)
model_prepacked.pdb | Yes | Prepacked protein-peptide complex structure (PDB)
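The flagfile referenced on the command line is a plain text file with one option per line. As a minimal sketch, the prepack flagfile from the table above could be generated as follows; the helper render_flagfile and the constant PREPACK_FLAGS are hypothetical names, the optional -unboundrot flag is left out, and the home directory in the database path is expanded explicitly on the assumption that Rosetta does not expand "~" itself.

```python
import os

# Flags and values copied from the prepack flagfile table; None marks a bare switch.
PREPACK_FLAGS = [
    ("-database", os.path.expanduser("~/Rosetta/main/database")),
    ("-s", "model_AB.pdb"),
    ("-receptor_chain", "A"),
    ("-peptide_chain", "B"),
    ("-ex1", None),
    ("-ex2aro", None),
    ("-use_input_sc", None),
    ("-nstruct", "1"),
    ("-scorefile", "prepack_score.sc"),
    ("-flexpep_prepack", None),
    ("-flexpep_score_only", None),
]

def render_flagfile(flags):
    """Render (flag, value) pairs as flagfile text, one option per line."""
    lines = [flag if value is None else f"{flag} {value}" for flag, value in flags]
    return "\n".join(lines) + "\n"

flagfile_text = render_flagfile(PREPACK_FLAGS)
print(flagfile_text)
# To write the file used in the usage line above:
#   with open("prepack.flagfile", "w") as handle:
#       handle.write(flagfile_text)
```

The same helper can render the refinement flagfile by passing a different list of pairs.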
3.2.3 Refine the Prepacked Complex Structure
The FlexPepDock refinement protocol uses an iterative Monte Carlo with energy minimization scheme to optimize the position of the peptide backbone relative to the receptor and the orientation of the side chains in the interaction interface. It is recommended to generate at least 200 refined models (-nstruct flag), each of which results from an independent run. Additional improvement can be achieved through initial pre-optimization at the coarse-grained stage (Rosetta centroid representation) by using the -lowres_preoptimize flag.
Inputs | Is required? | Description
model_prepacked.pdb | Yes | Prepacked protein-peptide complex structure
native_complex.pdb | No | Known structure of the protein-peptide complex
ub_receptor.pdb | No | Known structure of free (unbound) receptor
Rosetta protocol | Location
FlexPepDocking | ~/Rosetta/main/source/bin/FlexPepDocking
Usage:
$ ~/Rosetta/main/source/bin/FlexPepDocking.linuxgccrelease @refine.flagfile
Flagfile (refine.flagfile) | Is required? | Description
-database ~/Rosetta/main/database | Yes | Rosetta database location
-s model_prepacked.pdb | Yes | Initial protein-peptide complex structure
-native native_complex.pdb | No | Reference structure for RMSD calculations
-ex1 | Yes | Extra rotamers for chi1 angles (side chain packing)
-ex2aro | Yes | Extra rotamers for aromatic chi2 angles
-use_input_sc | Yes | Extra rotamers from initial structure of complexes
-unboundrot ub_receptor.pdb | No | Extra rotamers from unbound receptor
-nstruct 250 | Yes | Create 250 output decoys
-scorefile refine_score.sc | Yes | Name of the output scorefile
-out:file:silent refined.silent | Yes | Name of the output silent file
-out:file:silent_struct_type binary | Yes | Type of silent file
-pep_refine | Yes | Flag for running FlexPepDock refinement
-lowres_preoptimize | No | Coarse-grained optimization before all-atom refinement
-flexpep_score_only | Yes | Adds FlexPepDock specific scores to scorefile
Output | Is default? | Description
refine_score.sc | Yes | Scorefile containing an energy description (scorefile)
refined.silent | Yes | Refined protein-peptide complex structure (silentfile)
3.2.4 Analysis of Refinement Results
The refined output models are saved to a silent file (-out:file:silent flag). The silent file is a Rosetta output format that is used to store compressed ensembles of structures. Each frame in a silent file has a unique identifier, which is called the decoy-tag. The decoy-tag is located at the end of each line that belongs to the respective frame and makes it possible to identify and extract selected frames. The second output file is a scorefile (-scorefile flag) that contains the energy description of each of the output models and various root mean square deviation (RMSD) parameters in relation to the reference complex structure (if provided) or to the initial complex structure. This file allows one to analyze the energy landscape sampled during the refinement procedure (by plotting score vs. RMSD) or to select the top-scoring models (sorted according to the Rosetta energy criterion). To evaluate FlexPepDock refinement results, it is recommended to use interface_score (I_sc) and reweighted_score (reweighted_sc, the weighted sum of total score, I_sc, and pep_sc) rather than the total score, together with rmsBB_if (peptide backbone interface RMSD) as the RMSD measure. Selected top-scoring models can be extracted into PDB format by using the Rosetta extract_pdbs protocol.
Example of sorting refined decoys by reweighted_sc and selecting ten top-scoring models.
Inputs | Is required? | Description
refine_score.sc | Yes | Scorefile
Output | Is default? | Description
top10-tags | No | Ten top-scoring models
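One possible way to carry out this sorting step is sketched below. It assumes the usual Rosetta scorefile layout, in which every score line starts with "SCORE:", the first such line holds the column names, and the last column (description) holds the decoy-tag; the function names read_scorefile and top_tags are hypothetical, and the numbers in the example text are made up purely for illustration.

```python
def read_scorefile(text):
    """Parse Rosetta scorefile text into a list of dicts keyed by column name."""
    header, rows = None, []
    for line in text.splitlines():
        if not line.startswith("SCORE:"):
            continue                          # skip SEQUENCE: and other lines
        fields = line.split()[1:]
        if header is None:
            header = fields                   # first SCORE: line names the columns
        elif fields != header and len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows

def top_tags(rows, column="reweighted_sc", k=10):
    """Return decoy-tags of the k lowest-scoring rows (lower energy is better)."""
    ranked = sorted(rows, key=lambda row: float(row[column]))
    return [row["description"] for row in ranked[:k]]

# Illustrative scorefile fragment with invented values.
example = """SEQUENCE:
SCORE: total_score I_sc reweighted_sc rmsBB_if description
SCORE: -250.1 -20.5 -310.2 1.2 model_prepacked_0001
SCORE: -260.3 -25.1 -330.9 0.8 model_prepacked_0002
SCORE: -240.7 -18.0 -300.4 3.5 model_prepacked_0003
"""
tags = top_tags(read_scorefile(example), k=2)
print("\n".join(tags))
# For real data, read refine_score.sc and write the tags, one per line,
# to the top10-tags file used in the extraction step.
```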
Example of extracting PDB coordinates for selected decoy-tags.
Inputs | Is required? | Description
top10-tags | Yes | Selected decoy-tags for ten top-scoring models
Rosetta protocol | Location
extract_pdbs | ~/Rosetta/main/source/bin/extract_pdbs
Usage:
$ ~/Rosetta/main/source/bin/extract_pdbs.linuxgccrelease -in:file:silent refined.silent -in:file:tagfile top10-tags
Output | Is default? | Description
model_refined_0010.pdb | Yes | Separate PDB file for each of ten models
4 Case Studies
We detail here the implementation and results of the above-described protocol for atomic reconstruction and refinement. The starting points are CABS-dock predictions of four different protein–peptide complexes from our previous study by Blaszczyk et al. [17]. For each protein–peptide complex, we reconstructed the ten top-scored CABS-dock models to all-atom structure using the ca2all.py script. Next, in each reconstructed model, we replaced the receptor structure with the unbound receptor structure (see Note 1). Note that this is an optional step in the reconstruction and refinement procedure described in this work. Introduction of conformational flexibility in the receptor can be important to identify binding sites in cases where the receptor changes considerably upon binding. However, introduction of conformational flexibility might also lose features of the receptor that are crucial to allow the high-resolution step to identify the native conformation among other false-positive minima in the energy landscape. Thus, by replacing
Fig. 2 Comparison of CABS-dock models before and after FlexPepDock refinement to experimental structures. The figure presents four example cases of protein-peptide complexes (1N7E, 2O9S, 2AM9, and 2QHN). For each complex, only peptide conformations are shown (after superimposition of the receptor structures). Reference experimental structures are shown in pink and CABS-dock models before refinement in blue and after in green. The numbers in the corresponding colors indicate i-RMSD values
the receptor with the solved structure of the unbound receptor conformation, we allow both the detection of cryptic binding sites in the first step and the refinement starting from an informative crystal structure in the second step. The peptide coordinates were taken directly from CABS-dock modeling. In order to remove all possible internal clashes and to create a uniform energy background, the unbound receptor structure was prepacked using the Rosetta flexpep_prepack protocol. Prepacking was performed once, and the same receptor structure was used for all simulations. Finally, each resulting model was further refined with the FlexPepDock refinement protocol, and the ten top-scored models were selected for quality assessment. Figure 2 shows a comparison of the top-scoring refined model with CABS-dock models and experimental reference structures. As presented in Fig. 2, FlexPepDock refinement can significantly improve the modeling accuracy to the sub-Ångstrom level, as measured by i-RMSD (interface root mean square deviation from the reference experimental structure). Motivated by these promising results, we are now applying this protocol to a large and representative benchmark set.
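For readers who want to reproduce this kind of comparison, the core of an RMSD calculation can be written in a few lines. The sketch below is a generic illustration, not the exact i-RMSD implementation used for Fig. 2: it assumes that the two structures have already been superimposed on the receptor, that the same interface atoms have been selected from both in the same order, and that coordinates are given in Å; the function name rmsd is hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered lists of (x, y, z) tuples, without refitting."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared distance between corresponding atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy example: every atom of the model is shifted by 2 Å along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
print(rmsd(reference, model))
```

In formula form, this computes the square root of the mean squared distance over the N selected atom pairs.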
5 Notes
1. In order to replace the structure of the receptor, you should first align the CABS-dock model of the complex and the unbound receptor using, for example, Theseus [23] (or PyMOL):
Inputs | Is required? | Description
model_AB.pdb | Yes | CABS-dock (all-atom) protein-peptide complex structure
unbound.pdb | Yes | Known structure of free receptor
Theseus | Download
Theseus | https://theobald.brandeis.edu/theseus/
Usage:
$ theseus -s0-120 -o model_AB.pdb unbound.pdb # -s option is the receptor residues selection
Output | Is default? | Description
theseus_sup.pdb | Yes | Unbound receptor superposed on CABS-dock receptor
In the next step, you need to cut the peptide coordinates from the model_AB.pdb file, which is possible using the Rosetta clean_pdb.py script introduced above, and then add the peptide to the free receptor file theseus_sup.pdb.
Usage:
$ ~/Rosetta/tools/protein_tools/scripts/clean_pdb.py model_AB.pdb B # cut peptide (chain B)
$ cat theseus_sup.pdb model_AB_B.pdb > model.pdb # paste unbound receptor and peptide
Output | Is default? | Description
model.pdb | Yes | Free receptor and CABS-dock modeled peptide
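The cut-and-paste step can also be done without external tools. The following sketch is a plain Python alternative to the clean_pdb.py and cat commands, not a replacement recommended by the protocol: it relies only on the fixed-column PDB format, in which the chain identifier occupies column 22 of an ATOM record, and the function names chain_atoms and merge_receptor_and_peptide are hypothetical.

```python
def chain_atoms(pdb_text, chain_id):
    """Return the ATOM lines of one chain (chain ID is column 22 of a PDB record)."""
    return [line for line in pdb_text.splitlines()
            if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id]

def merge_receptor_and_peptide(receptor_text, complex_text, peptide_chain="B"):
    """Concatenate receptor ATOM records with the peptide chain of the docked model.

    receptor_text: superposed unbound receptor (e.g., contents of theseus_sup.pdb)
    complex_text:  CABS-dock all-atom complex (e.g., contents of model_AB.pdb)
    """
    receptor = [line for line in receptor_text.splitlines()
                if line.startswith("ATOM")]
    peptide = chain_atoms(complex_text, peptide_chain)
    # TER records separate the two chains; END closes the file.
    return "\n".join(receptor + ["TER"] + peptide + ["TER", "END"]) + "\n"

# Usage sketch (file names as in the note above):
#   with open("theseus_sup.pdb") as r, open("model_AB.pdb") as c:
#       merged = merge_receptor_and_peptide(r.read(), c.read(), "B")
#   with open("model.pdb", "w") as out:
#       out.write(merged)
```

Note that the receptor must already be superposed on the CABS-dock receptor before merging; otherwise the peptide will not sit in the binding site.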
Acknowledgments
A.E.B-D, A.Ko., and S.K. received funding from NCN Poland, Grant MAESTRO2014/14/A/ST6/00088. O.S-F. and A.Kh. received funding from the ISF, Grant 717/17.
References
1. Ciemny M, Kurcinski M, Kamel K, Kolinski A, Alam N, Schueler-Furman O, Kmiecik S (2018) Protein–peptide docking: opportunities and challenges. Drug Discov Today 23:1530–1537. https://doi.org/10.1016/j.drudis.2018.05.006
2. Kurcinski M, Jamroz M, Blaszczyk M, Kolinski A, Kmiecik S (2015) CABS-dock web server for the flexible docking of peptides to proteins without prior knowledge of the binding site. Nucleic Acids Res 43(W1):W419–W424. https://doi.org/10.1093/nar/gkv456
3. Kurcinski M, Ciemny MP, Oleniecki T, Kuriata A, Badaczewska-Dawid AE, Kolinski A, Kmiecik S (2019) CABS-dock standalone: a toolbox for flexible protein–peptide docking. Bioinformatics 35(20):4170–4172. https://doi.org/10.1093/bioinformatics/btz185
4. Kurcinski M, Badaczewska-Dawid AE, Kolinski M, Kolinski A, Kmiecik S (2020) Flexible docking of peptides to proteins using CABS-dock. Protein Science 29(1):211–222. https://doi.org/10.1002/pro.3771
5. Webb B, Sali A (2016) Comparative protein structure modeling using MODELLER. Curr Protoc Bioinforma 54:5.6.1–5.6.37. https://doi.org/10.1002/cpbi.3
6. Raveh B, London N, Schueler-Furman O (2010) Sub-angstrom modeling of complexes between flexible peptides and globular proteins. Proteins Struct Funct Bioinforma 78:2029–2040. https://doi.org/10.1002/prot.22716
7. Badaczewska-Dawid AE, Kolinski A, Kmiecik S (2020) Computational reconstruction of atomistic protein structures from coarse-grained models. Comput Struct Biotechnol J 18:162–176. https://doi.org/10.1016/j.csbj.2019.12.007
8. Alam N, Goldstein O, Xia B, Porter KA, Kozakov D, Schueler-Furman O (2017) High-resolution global peptide-protein docking using fragments-based PIPER-FlexPepDock. PLoS Comput Biol 13:e1005905. https://doi.org/10.1371/journal.pcbi.1005905
9. Alam N, Schueler-Furman O (2017) Modeling peptide-protein structure and binding using Monte Carlo sampling approaches: Rosetta flexpepdock and flexpepbind. Methods Mol Biol 1561:139–169. https://doi.org/10.1007/978-1-4939-6798-8_9
10. Marcu O, Dodson EJ, Alam N, Sperber M, Kozakov D, Lensink MF, Schueler-Furman O (2017) FlexPepDock lessons from CAPRI peptide–protein rounds and suggested new criteria for assessment of model quality and utility. Proteins Struct Funct Bioinforma 85:445–462. https://doi.org/10.1002/prot.25230
11. Liu T, Pan X, Chao L, Tan W, Qu S, Yang L, Wang B, Mei H (2014) Subangstrom accuracy in pHLA-I modeling by rosetta FlexPepDock refinement protocol. J Chem Inf Model 54:2233–2242. https://doi.org/10.1021/ci500393h
12. London N, Raveh B, Cohen E, Fathi G, Schueler-Furman O (2011) Rosetta FlexPepDock web server—high resolution modeling of peptide-protein interactions. Nucleic Acids Res 39:W249–W253. https://doi.org/10.1093/nar/gkr431
13. Blaszczyk M, Kurcinski M, Kouza M, Wieteska L, Debinski A, Kolinski A, Kmiecik S (2016) Modeling of protein-peptide interactions using the CABS-dock web server for binding site search and flexible docking. Methods 93:72–83. https://doi.org/10.1016/j.ymeth.2015.07.004
14. Ciemny MP, Debinski A, Paczkowska M, Kolinski A, Kurcinski M, Kmiecik S (2016) Protein-peptide molecular docking with large-scale conformational changes: the p53-MDM2 interaction. Sci Rep 6:37532. https://doi.org/10.1038/srep37532
15. Ciemny MP, Kurcinski M, Kozak K, Kolinski A, Kmiecik S (2017) Highly flexible protein-peptide docking using cabs-dock. Methods Mol Biol 1561:69–94. https://doi.org/10.1007/978-1-4939-6798-8_6
16. Ciemny MP, Kurcinski M, Blaszczyk M, Kolinski A, Kmiecik S (2017) Modeling EphB4-EphrinB2 protein-protein interaction using flexible docking of a short linear motif. Biomed Eng Online 16:71. https://doi.org/10.1186/s12938-017-0362-7
17. Blaszczyk M, Ciemny MP, Kolinski A, Kurcinski M, Kmiecik S (2019) Protein–peptide docking using CABS-dock and contact information. Brief Bioinform 20(6):2299–2305. https://doi.org/10.1093/bib/bby080
18. Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Sali A (2000) Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325. https://doi.org/10.1146/annurev.biophys.29.1.291
19. Fiser A, Do RKG, Šali A (2000) Modeling of loops in protein structures. Protein Sci 9:1753–1773. https://doi.org/10.1110/ps.9.9.1753
20. Leaver-Fay A, Tyka M, Lewis SM, Lange OF, Thompson J, Jacak R, Kaufman K, Renfrew PD, Smith CA, Sheffler W, Davis IW, Cooper S, Treuille A, Mandell DJ, Richter F, Ban YEA, Fleishman SJ, Corn JE, Kim DE, Lyskov S, Berrondo M, Mentzer S, Popović Z, Havranek JJ, Karanicolas J, Das R, Meiler J, Kortemme T, Gray JJ, Kuhlman B, Baker D, Bradley P (2011) ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487:545–574. https://doi.org/10.1016/B978-0-12-381270-4.00019-6
21. Fiser A, Šali A (2003) MODELLER: generation and refinement of homology-based protein structure models. Methods Enzymol 374:461–491. https://doi.org/10.1016/S0076-6879(03)74020-8
22. Kmiecik S, Gront D, Kolinski M, Wieteska L, Dawid AE, Kolinski A (2016) Coarse-grained protein models and their applications. Chem Rev 116:7898–7936. https://doi.org/10.1021/acs.chemrev.6b00163
23. Theobald DL, Wuttke DS (2006) THESEUS: maximum likelihood superpositioning and analysis of macromolecular structures. Bioinformatics 22:2171–2172. https://doi.org/10.1093/bioinformatics/btl332
Chapter 17
Dockground Tool for Development and Benchmarking of Protein Docking Procedures
Petras J. Kundrotas, Ian Kotthoff, Sherman W. Choi, Matthew M. Copeland, and Ilya A. Vakser
Abstract
Databases of protein–protein complexes are essential for the development of protein modeling/docking techniques. Such databases provide a knowledge base for docking algorithms, intermolecular potentials, search procedures, scoring functions, and refinement protocols. Development of docking techniques requires systematic validation of the modeling protocols on carefully curated benchmark sets of complexes. We present a description and a guide to the DOCKGROUND resource (http://dockground.compbio.ku.edu) for structural modeling of protein interactions. The resource integrates various datasets of protein complexes and other data for the development and testing of protein docking techniques. The sets include bound complexes, experimentally determined unbound, simulated unbound, model–model complexes, and docking decoys. The datasets are available to the user community through a Web interface.
Key words Protein recognition, Protein–protein interactions, Structure prediction, Benchmark sets, Docking methodology
1 Introduction
Structural modeling of protein–protein complexes (protein docking) determines mutual arrangement of the proteins in a complex, given the structure of the interacting proteins [1]. The two general types of the docking approaches are free docking, which is not based on prior knowledge of similar experimentally determined protein–protein complexes, and comparative or template-based docking, in which the similar complexes (templates) determine the docking model. The docking protocols consist of the global search for an approximate structure of the complex, followed by scoring/refinement of the putative matches. The scoring can be based on the physical principles or on knowledge/statistics [1]. A number of databases of protein complexes have been compiled and used to study protein–protein interfaces [2–8]. Docking techniques are evaluated by community-wide blind assessment
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_17, © Springer Science+Business Media, LLC, part of Springer Nature 2020
CAPRI [9] and by validation on pre-compiled protein–protein benchmark sets. Such sets are based on bound and unbound experimentally determined structures [8, 10–13] and on protein models [8, 14–17]. The predictive docking protocols have to distinguish the near-native matches from the false-positive ones. Thus, an important part in developing scoring functions for docking is scoring benchmarks (docking decoys), where the correct docking predictions are mixed with the incorrect ones (decoys) [8, 12, 18–20].
Comparative docking is based on similarity of the target protein pair to experimentally determined protein–protein complexes, including similarity of the structure [21–24]. Surface patches may yield similar binding modes for otherwise dissimilar protein structures [25, 26]. Thus, docking can also be performed by the alignment of proteins with the co-crystallized interfaces. Selecting all pairwise protein–protein complexes from PDB would produce the complete set of currently known structures. However, such a brute-force approach greatly increases computation time due to the redundancy of the PDB. It would also include erroneous, low-quality, and biologically irrelevant assemblies [27]. Thus, comparative docking typically is based on template libraries obtained by filtering PDB for redundancy and irrelevant assemblies [28–33].
The DOCKGROUND resource (http://dockground.compbio.ku.edu) distinguishes itself by offering various datasets of experimentally determined and modeled structures, designed to address most aspects of protein docking. It combines five integrated datasets of complexes: (1) bound, (2) unbound, (3) models, (4) docking decoys, and (5) docking templates. The Web interface provides easy download of the current and earlier versions of the pre-compiled datasets, as well as generation of custom sets. DOCKGROUND is a tool for the development of tools. Thus, its intended user base is primarily the developers of macromolecular modeling techniques in general and docking methodologies in particular.
2 Materials
2.1 Basic Dataset of Protein–Protein Complexes
The dataset of experimentally determined bound protein–protein complexes is the source of other protein–protein sets in the DOCKGROUND. A fully automated protocol is being implemented to regularly update the content of the dataset. The protocol is written in C++ and runs in the background each week following the weekly update of the PDB. The initial data from the PDB mmcif files are filtered to purge irrelevant structures, followed by extraction of pertinent information (Fig. 1). The chain domains are obtained from the local copy of the ECOD database [34], which currently is the most comprehensive and regularly updated source of information on protein domains. The data related to the single chain and to PDB structure as a whole are inserted into DOCKGROUND
Fig. 1 Workflow of the automated update of the bound set
PostgreSQL tables. The list of the PDB codes is further passed to the program module which reads PDB biounit files, extracts all possible pairwise combinations of chains, and determines which chains are interacting. This is done by calculating the difference in solvent accessible surface area (SASA) between a pair of chains together and alone. If this difference is >250 Å², the chains are deemed to be interacting, and the list of the interface residues (defined as those with 10% change in their SASA upon complexation) is generated. The SASA is calculated by freeSASA [35] in the low-resolution mode. The interacting chains are further analyzed for several special cases. The presence of small molecules (ligands) at the interface is detected if the ligand is found interacting with both chains. Similarly, a single nucleotide interacting with both chains indicates the presence of DNA/RNA at the interface. The ligand and the nucleotide are considered interacting with a protein chain if their SASA changes 10% between the stand-alone and the bound versions. The detection of interaction by SASA change is computationally more efficient than detection by a distance threshold, especially for the increasing fraction of PDB files that contain a large number of protein chains. A pairwise combination of chains is inspected for disulfide bonds across the interface and the presence of the membrane. This is accomplished by searching for specific keywords in the mmCIF file. The program also checks if a part of one chain is buried in the other chain (tangled chains). Such cases are detected when the difference in SASA for any five consecutive residues in an isolated chain and in association with the other chain is >70%. All such special cases are annotated by setting corresponding flag values to true. Finally, the multimeric state and its type (hetero or homo) of the entire PDB structure, from which the pairwise combination was extracted, are recorded.
2.2 Datasets of Unbound Protein–Protein Complexes
An accurate prediction of protein complexes from the unbound components is a key problem in docking due to the conformational changes upon protein association. Thus, the datasets of unbound proteins corresponding to the co-crystallized complexes are important for the development and validation of docking approaches. The DOCKGROUND unbound docking benchmark is an integral part of the resource and the basis for docking decoy sets. It also allows a direct comparison of docking performance for unbound experimentally determined and modeled proteins. Along with three legacy datasets of unbound structures [11], the most recent docking benchmark set 4 contains 396 complexes, of which 223 structures have single-chain monomers only, and the rest have one or both interacting subunits consisting of two or more chains [8]. Since the number of proteins in PDB determined in both bound and unbound conformation is relatively small, a radical expansion of such sets is achieved by computational simulation of the unbound proteins. The resource contains a large (3205 single chains from 1918 complexes) set of such simulated structures. Protein complexes were selected from the bound part of the DOCKGROUND, separated from the interacting partner and subjected to 1 ns Langevin dynamics simulation in CHARMM (CHARMM22 force field [36]), with electrostatics by Generalized Born approximation. The simulated unbound structures were selected according to criteria from the systematic comparison of experimentally determined bound and unbound proteins [37].
2.3 Datasets of Model–Model Complexes
Docking techniques have been tested largely on the datasets of the X-ray structures. However, most protein structures in the interactome would be computational models. DOCKGROUND provides carefully curated sets of representative modeled structures of proteins with arrays of structural accuracy. The first set is derived from the DOCKGROUND unbound benchmark 3 and consists of arrays of six models with 1, 2, ... 6 Å model-to-native Cα RMSD for each of the proteins from 63 binary complexes. The models were generated by simple single-template modeling or by the Nudged Elastic Band protocol [15]. The second set was constructed from 165 protein–protein complexes generated by the built-in engine of the DOCKGROUND bound part. Similar to the first set, for each protein, it contains six models with the 1, 2, ... 6 Å model-to-native Cα RMSD. However, all models were selected from the modeling trajectories generated by I-TASSER [38] (details in Ref. 16), which better reflects a real-case modeling scenario.
2.4 Docking Decoys
Development and assessment of the scoring functions for protein docking require a set of docking poses, some of which have to be close to the native structure (i.e., within the binding funnel [39]), while the rest would be false-positive ones (decoys). To avoid bias in testing of the scoring functions due to clustering and distribution of energy values, the decoys should not be clustered and should have energies/scores similar to those of the near-native matches. DOCKGROUND provides two such sets of docking decoys (scoring benchmarks). The first set consists of 99 non-native and one near-native match (ligand RMSD to the native structure [...]
2.5 Structural Templates for Docking
[...] 250 Å² per chain, and the interface consists of 10 residues per chain. Complexes with one or both chains containing [...]

[...]
\[
U_{\mathrm{EM}}(\mathbf{r}) =
\begin{cases}
\zeta \left( 1 - \dfrac{\Phi(\mathbf{r}) - \Phi_{\mathrm{thr}}}{\Phi_{\max} - \Phi_{\mathrm{thr}}} \right) & \text{if } \Phi(\mathbf{r}) \ge \Phi_{\mathrm{thr}}, \\
\zeta & \text{if } \Phi(\mathbf{r}) < \Phi_{\mathrm{thr}},
\end{cases}
\tag{1}
\]
where Φ(r) is the experimentally derived electron density at a point r, ζ (referred to as gscale) is the scaling factor that regulates the coupling strength between the map and model, Φthr is a threshold for disregarding background density due to bulk water, and Φmax = max(Φ(r)). In MDFF, an initial starting model is refined employing MD, where the traditional potential energy surface is modified by UEM. The density-weighted MD potential directs the model to the EM map, while simultaneously following constraints from the traditional force fields. This real space refinement performed by MDFF allows for fitting the density with atomically detailed features.
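As an illustration of Eq. 1, the following minimal Python sketch converts a density grid into the corresponding map potential. It is not the MDFF implementation; the function name `mdff_potential`, its argument names, and the use of NumPy are assumptions made only for this example.

```python
import numpy as np

def mdff_potential(phi, zeta=0.3, phi_thr=0.0):
    """Hypothetical helper: evaluate Eq. 1 on a density grid.

    phi     -- array of density values Phi(r)
    zeta    -- scaling factor (gscale)
    phi_thr -- threshold below which density is treated as background
    """
    phi = np.asarray(phi, dtype=float)
    phi_max = phi.max()
    if phi_max <= phi_thr:
        raise ValueError("threshold must be below the maximum density")
    # Fraction of the way from the threshold to the maximum density
    scaled = (phi - phi_thr) / (phi_max - phi_thr)
    # High density -> low potential; background -> constant zeta
    return np.where(phi >= phi_thr, zeta * (1.0 - scaled), zeta)
```

With this form, the potential is zero at the density maximum and equal to ζ everywhere at or below the threshold, so atoms are driven toward high-density regions.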
Flexible Fitting with MDFF
3.2 Resolution Exchange MDFF (ReMDFF)
While traditional MDFF works well with low-resolution density maps (> 7 Å), recent high-resolution EM maps have proven to be more challenging. This is because high-resolution maps run the risk of trapping the search model in a local minimum of the density features. To overcome this unphysical entrapment, ReMDFF employs a series of MD simulations. Starting with i = 1, the ith map in the series is obtained by applying a Gaussian blur of width σ_i to the original density map. Each successive map in the sequence i = 1, 2, ..., L has a lower σ_i (higher resolution), where L is the total number of maps in the series (σ_L = 0 Å blur). The fitting protocol assumes a replica-exchange approach described in detail and illustrated in Fig. 1. At regular simulation intervals, replicas i and j, of coordinates x_i and x_j and fitting maps of blur widths σ_i and σ_j, are compared energetically and exchanged with Metropolis acceptance probability
\[
p(x_i, \sigma_i, x_j, \sigma_j) = \min\left( 1, \exp\left[ \frac{-U(x_i, \sigma_j) - U(x_j, \sigma_i) + U(x_i, \sigma_i) + U(x_j, \sigma_j)}{k_B T} \right] \right)
\tag{2}
\]
where k_B is the Boltzmann constant and U(x, σ) is the instantaneous total energy of the conformation x within a fitting potential map of blur width σ. Thus, ReMDFF fits the search model to an initially large and ergodic conformational space that is shrinking over the course of the simulation towards the highly corrugated space described by the original MDFF potential map. Details of ReMDFF implementation are provided in [6, 19].

Fig. 1 A simple example of ReMDFF. Three replicas are included in this schematic. Each replica consists of a molecular structure and a cryo-EM map-based grid potential. Different green boxes represent grid potentials of different resolutions. The structural models as refined at different resolutions are shown in red, orange, and yellow, with different hue levels representing changes in the conformation. The arrows indicate the transfer of a grid potential from one replica to another. The output structure is selected from the trajectory visited by the grid potential of the original resolution (dark green). Reproduced with permission from [6]

John W. Vant et al.

3.3 ReMDFF Refinement Protocol
This section provides an overview of the steps involved in the ReMDFF refinement given a starting search model and the corresponding cryo-EM density map. The details of force field and other simulation parameters used in the case studies (Subheading 4) are presented in Subheading 5. The ReMDFF work flow discussed here is illustrated in Fig. 2.
Fig. 2 Resolution exchange MDFF (ReMDFF) workflow. (a) Low resolution cryo-EM structure of the Mg2+ channel CorA in the Mg2+-free, asymmetric open state II (PDB: 3JCH, resolution 7.06 Å) and high resolution cryo-EM density map of the Mg2+ channel CorA in the closed symmetric Mg2+-bound state (PDB: 3JCF, resolution: 3.8 Å). (b) Initial docking of low resolution cryo-EM structure into high resolution cryo-EM density map using UCSF Chimera v1.13.1. (c) Generate protein structure file (PSF) using the AutoPSF plugin in VMD 1.9.3 or higher, followed by generating smooth maps by applying a series of Gaussian blurs with increasing half-widths, σ = 1 Å, to the high resolution cryo-EM density map. (d) Benchmark the system to submit ReMDFF job and validate ReMDFF results by calculating the RMSD between the ReMDFF trajectory with respect to high resolution structure (PDB: 3JCF) and cross-correlation coefficient (CCC) between the ReMDFF trajectory and high resolution cryo-EM density map (resolution: 3.8 Å). After validation and convergence of ReMDFF trajectory, further analysis to determine the overall quality of the structure after fitting is performed using MolProbity and EMRinger. All scripts to generate input files for ReMDFF, sorting replicas and analyzing the ReMDFF trajectories are available from the ReMDFF GitHub repository [12]
1. Docking initial structure into EM density map: As shown in Fig. 2, the first step is to dock the initial structure into the high resolution cryo-EM density map. This can be performed by using either UCSF Chimera or Situs software, mentioned above in Subheading 2.2.
2. Generate initial system: The first step to run ReMDFF or any MD simulation using the NAMD program requires adding hydrogen to the heavy atoms in the PDB file, obtained after docking the initial search model into the high resolution cryo-EM density map. The next step is to create a "structure" file (PSF), corresponding to the PDB file. The PSF file contains the following information about a molecule: atoms, partial charges of each atom, masses, bonds, angles, dihedrals, improper angles, and van der Waals terms. This step can be accomplished with VMD's AutoPSF plugin (Extensions → Modeling → Automatic PSF Builder) or with the Generate_System.pgn script in the ReMDFF GitHub repository [12]. By default the AutoPSF plugin in VMD 1.9.3 or higher uses the CHARMM36 force field parameters [20] for protein, nucleic acids, carbohydrates, lipid, ions, and water. However, other popular force fields such as AMBER can be implemented as well. In principle, one can perform ReMDFF in vacuum or in implicit or explicit solvent. To generate a system with explicit solvent, we use the solvate plugin in VMD (Extensions → Modeling → Add Solvation Box). Additional care needs to be taken when performing ReMDFF in explicit solvent to neutralize the system by adding counter-ions to the simulation system. The neutralizing ions can be added using VMD's autoionize plugin (Extensions → Modeling → Add Ions). After each step, i.e., the solvation of the protein and neutralization of the system, a PDB and PSF file will be automatically generated by VMD in the working directory.
3. Generate Gaussian blur maps: The number of grid potential maps needed is equal to the number of replicas that will run in parallel. The high resolution experimental density map is smoothed by applying a series of Gaussian blurs with increasing half-widths, σ. This can be implemented using the volutil package via VMD's Tk Console extension (Extensions → Tk Console), by typing the following command in the console window:
   volutil -smooth <sigma> <map.situs> -o <smooth-grid-$n.dx>
   Subsequently, NAMD's gridForces feature is used to define the corresponding biasing potential on a 3-D grid, which can be implemented with the following command:
   mdff griddx -i <smooth-grid-$n.dx> -o <initialmaps/$n.dx>
   Repeat this procedure of generating the smooth grid and corresponding map potential to generate different Gaussian blur maps and corresponding potential. This can be implemented using the script available from [12]. The typical number of replicas/blurred-maps is usually a multiple of 2.
4. Prepare ReMDFF input script: The next step is to specify the parameters needed for the simulation to be performed using the NAMD program. These parameters include but are not limited to the number of replicas, number of steps between resolution exchanges, total number of steps, scaling factor ζ (see Eq. 1 above), and the distribution of the simulation over the specified number of processors. Details of these parameters and their default values used in the case studies below are mentioned in Subheading 5 and in the README.md file in [12].
5. Job submission: ReMDFF will run on a wide array of computing architectures. Each replica is able to run in parallel until an exchange happens, after which the replicas continue to run in parallel. Therefore, the number of time-steps between exchanges needs to be specified in the ReMDFF input script. The user also needs to specify the total number of time-steps, gscale and the number of replicas in the input script. To learn how to implement this in practice, please see the example submission scripts in the GitHub repository [12]. Finally, the job submission script is prepared, depending on the computing architecture used to run the simulation. The base command uses charmrun and will look as follows:
   charmrun +p<num_threads> namd2 +replicas <num_replicas> remdff.namd +stdout output/%d/job0.%d.log > Master.log
   where <num_replicas> specifies the number of replicas that will run in parallel and <num_threads> specifies the number of working threads. The number of working threads should be a multiple of the number of replicas.
6. Analysis: To analyze the results after ReMDFF, we first need to sort the trajectory. This is done using the sortreplicas command (see usage below), found in the NAMD binary directory, which un-shuffles the replica trajectories so that frames of the same resolution are placed in the same file. The sorted trajectory corresponds to the high resolution cryo-EM density.
   sortreplicas <job_output_root> <num_replicas> <runs_per_frame> [final_step]
   Now, to determine the quality of fit, it is recommended to calculate the global cross-correlation coefficient (CCC) of the
sorted trajectory with respect to the high resolution cryo-EM density. This result can also be used to determine the convergence of the simulation. To further analyze simulation results, it is customary to determine the MolProbity and EMRinger scores to check for stereochemical correctness of the backbone and sidechains. These web-based tools provide a good measure to determine the overall quality of the ensemble of structures obtained from the sorted ReMDFF trajectory.
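The exchanges mentioned in step 5 are accepted or rejected with the Metropolis probability of Eq. 2. The sketch below is only an illustration of that criterion under stated assumptions: the function name `exchange_probability`, its argument names, and the choice of kcal/mol energy units are hypothetical and are not taken from the ReMDFF scripts.

```python
import math

# Boltzmann constant in kcal/(mol K); assumes energies are given in kcal/mol
K_B = 0.0019872041

def exchange_probability(u_ii, u_jj, u_ij, u_ji, temperature=300.0):
    """Hypothetical helper: acceptance probability of Eq. 2.

    u_ii -- U(x_i, sigma_i), replica i in its current map
    u_jj -- U(x_j, sigma_j), replica j in its current map
    u_ij -- U(x_i, sigma_j), replica i evaluated in the map of j
    u_ji -- U(x_j, sigma_i), replica j evaluated in the map of i
    """
    # Energy change caused by swapping the two maps
    delta = (u_ij + u_ji) - (u_ii + u_jj)
    if delta <= 0.0:
        return 1.0  # swaps that lower the total energy are always accepted
    return math.exp(-delta / (K_B * temperature))
```

In a replica-exchange loop, the swap would be carried out when a uniform random number in [0, 1) falls below the returned probability.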
4 Case Study
We present two examples of structure refinement with ReMDFF, the first with the carbon monoxide dehydrogenase and the second with Mg2+-channel CorA. In these case studies, flexible fitting is showcased as a general approach for resolving the all-atom details of both monomeric proteins and multimeric complexes.
4.1 Carbon Monoxide Dehydrogenase
The first case study focuses on structural refinement of carbon monoxide dehydrogenase from synthetic electron density data at 3 Å resolution. A closed conformation of the monomeric protein (chain C from PDB ID: 1OAO) was used as the search model, while the open conformation (chain D from 1OAO) was the target. The high-resolution density map from chain D was smoothed to generate six potential maps of varying resolution by applying a series of Gaussian blurs with increasing half-widths of 1.0 Å. The six replicas, corresponding to these six potential maps, were used to perform the ReMDFF simulation of the chain C, with a scaling factor ζ of 0.3. Root mean square deviation (RMSD) of the fitted chain C-model with respect to the target chain D structure converged within 10 ps of ReMDFF simulation (Fig. 3d, pink line). For comparison, we also report RMSD from traditional MDFF and cMDFF simulations (black and blue lines, respectively). While RMSD converged within 10 ps for ReMDFF, it converged only within 500 ps for cMDFF simulations (note that X-axis is in log-scale for the inset). In case of direct MDFF, the search model deviated minimally from the original conformation within 2000 ps of the simulation, clearly depicting the computational advantage of employing ReMDFF for real-space refinements.
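For reference, the RMSD used to monitor convergence can be computed as in the minimal sketch below. It assumes that the two coordinate sets contain the same atoms in the same order and have already been superimposed; the function name `rmsd` and the use of NumPy are choices made for this example and do not reproduce the analysis scripts of the chapter.

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """Hypothetical helper: RMSD between two (N, 3) coordinate arrays.

    No superposition is performed; the structures are assumed to be
    already aligned and to list atoms in identical order.
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("both inputs must have shape (N, 3)")
    # Squared displacement per atom, averaged over atoms
    sq_dist = np.sum((a - b) ** 2, axis=1)
    return float(np.sqrt(sq_dist.mean()))
```

Evaluating this quantity for every frame of the sorted trajectory against the target coordinates gives a curve of the kind described for Fig. 3d.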
4.2 Magnesium Channel CorA
The second case study we report is Mg2+ channel CorA. A low-resolution 7.06 Å structure of the channel (PDB ID: 3JCH) is a search model for resolving a high-resolution target at 3.8 Å (PDB: 3JCF) [10]. CorA is a tetrameric system, the refinement of which required 16 potential maps, each obtained by progressively Gaussian blurring the original density every 1.0 Å. Employing a gscale of 0.3, RMSD of the system converged in 787 ps (Fig. 4b–d) to a value of > 1.0 Å relative to 3JCF. The cross correlation
Fig. 3 Case study with carbon monoxide dehydrogenase. (a) Overlay of the search model (red, 1OAO: chain C) with the target structure (blue, 1OAO: chain D). (b) The structure of carbon monoxide dehydrogenase after ReMDFF refinement, superimposed on its known target structure (blue). ReMDFF significantly improves the fit. (c) 3.0 Å-resolution electron density map of carbon monoxide dehydrogenase used in ReMDFF refinement synthesized using the reported diffraction data on phenix.maps. (d) Evolution of the RMSD during the ReMDFF trajectory (pink line, inset), compared to the RMSD of the same structure through the traditional MDFF (black) and cMDFF (blue, inset). While the RMSD converges within 10 ps for ReMDFF simulation, it converges in 500 ps for cMDFF simulation. In case of the traditional MDFF, the structure does not deviate from the original map within 2000 ps of the simulation
improved from 50% to 70%. In the 787 ps refinement simulation, only the protein backbone was coupled to the map for the first 500 ps. EMRinger scores for this part are shown in black (Fig. 4d). Low EMRinger scores are a result of sidechains not being placed correctly. In the next 287 ps, sidechains were also coupled to the map, resulting in higher EMRinger scores (red plot, Fig. 4d). Following from the output of the sidechain-coupled computation, a traditional MDFF was performed for 1000 ps at 80 K, coupling both backbone and sidechains. This final step resulted in even higher EMRinger scores (blue plot, Fig. 4d), going up to 2.0. The MolProbity score of 1.21 was obtained, with 94.54% Ramachandran-favored backbones, 92% favored rotamers and no
Fig. 4 Case study with Mg2+ channel CorA and validation of ReMDFF for ζ = 0.3. (a) Overlay of the original structure (red) with the high resolution reference structure (blue) at different timepoints of the ReMDFF simulation. The region outside the density is pulled into the cryo-EM density within the first 200 ps. (b) Validation of ReMDFF protocol by illustrating the decrease in RMSD between the ReMDFF trajectory and high resolution structure (PDB: 3JCF). (c) Cross-correlation coefficient (CCC) between the ReMDFF trajectory and the high-resolution cryo-EM density indicates a gradual increase until convergence is established. (d) Histogram of EMRinger scores of the ReMDFF trajectory for different atom selections in protein (black: backbone, red: sidechain orientation at 300 K, blue: sidechain orientation at 80 K). EMRinger score increased consistently indicating the improvement in the overall quality of the structure. (e) Overlay of the result after ReMDFF and high resolution cryo-EM density
steric overlaps, a testament to the high global and local quality of the models derived in our ReMDFF case study.
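The cross-correlation values quoted in this case study compare a map simulated from the model with the experimental map. As a hedged illustration of the final comparison step only, the sketch below computes a Pearson-type correlation between two density grids that are assumed to be sampled on the same lattice; simulating a map from atomic coordinates is not shown, and the function name `cross_correlation` is hypothetical.

```python
import numpy as np

def cross_correlation(map_a, map_b):
    """Hypothetical helper: global correlation of two equally sized grids."""
    a = np.asarray(map_a, dtype=float).ravel()
    b = np.asarray(map_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("maps must have the same number of voxels")
    # Remove the mean so that a constant offset does not affect the score
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    if denom == 0.0:
        raise ValueError("at least one map is constant")
    return float(np.sum(a * b) / denom)
```

A value near 1 indicates that the two grids vary together, and tracking it over the sorted trajectory gives a convergence curve of the kind described for Fig. 4c.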
5 What Are in My Input Files?
This section provides details of the input parameters used in ReMDFF and can also be used as a preface to the ReMDFF GitHub repository [12].

• Force field parameter files: By default for all ReMDFF simulations, the latest version of VMD 1.9.4a31 uses the all-atom additive CHARMM36 force field parameters [20] for different biomolecules listed below,

  Filename                 Biomolecule
  par_all36_prot.prm       Proteins
  par_all36_lipid.prm      Lipids
  par_all36_na.prm         Nucleic acids
  par_all36_carb.prm       Carbohydrates
  par_all36_cgenff.prm     Drug-like molecules
• System Environment: The system environment specifies the ReMDFF simulation environment to the NAMD program.

  Name                           Description
  Vacuum                         To perform simulation in vacuum
  Implicit solvent               To perform simulation using generalized Born implicit solvent (GBIS) with default parameters
  Periodic boundary condition    To perform simulation in explicit solvent with counter-ions
• Temperature: In molecular dynamics, after the minimization step is performed on the system, the system is equilibrated either using a canonical or isobaric-isothermal ensemble. Often, both schemes are implemented in sequence which represents the equilibrated ensemble of the system. In ReMDFF, a similar strategy is adopted by initially minimizing the system followed by simulating the system at 300 K. Once ReMDFF is complete and replicas are sorted, the sorted trajectory corresponds to the highest resolution cryo-EM density. Subsequently, it is recommended to run a traditional MDFF on the sorted trajectory by gradually lowering the temperature of the system to 80 K, a temperature value experienced during cryo-EM experiments. The values for initial and final temperatures of the system can be easily modified in the input script by simply changing the values of the following variables.

  Name     Description                          Default value    Prescribed value
  ITEMP    Initial temperature of the system    300 K            300 K
  FTEMP    Final temperature of the system      300 K            80 K
• Gscale: The scaling factor applied to the potential energy function from the cryo-EM density map. It is represented by the variable ζ in Equation 1. Typically, we recommend to use values in the range 0.3–0.7 for ζ. The default value is 0.3.

• Gridpdb: This selection identifies the atom groups over which the cryo-EM map potential is applied after the minimization step of your simulation. A separate PDB file is generated indicating the atom selection. Following are the commonly used options,

  Atom selection         Description
  Protein and name CA    Grid forces applied only to the Cα atoms in the PDB file of the protein
  Backbone               Grid forces applied only to the backbone atoms in the PDB file of the protein
  noh                    Grid forces applied to the entire protein except for the hydrogen atoms
Restraints: We apply restraints during the ReMDFF simulation to enforce the secondary structure of our protein. NAMD’s extraBonds feature allows for additional bonds, angles, dihedral angles, and impropers to be defined. The VMD package ssrestraints automates the generation of extraBonds input files that define secondary structure restraints. It is also important to note that relatively large forces are being applied to certain atoms, which could in turn lead to certain structural artifacts such as chiral centers with wrong handedness or generation of cis peptide bonds. This is a limitation of any modeling technique based on commonly used molecular dynamics force fields that do not define explicit terms to prevent such structures. VMD provides two plugins to detect, fix, and prevent generation of cis peptide bonds and chirality errors using packages cispeptide and chirality, respectively.
Name         Description
Extrabonds   Define restraints for dihedral angles (ϕ and ψ) for amino acid residues in helices or sheets, as well as restraints for hydrogen bonds involving backbone atoms from the same residues
Cispeptide   Restrain cis-peptide bonds to their current cis/trans configuration
Chirality    Restrain chiral centers to their current handedness
John W. Vant et al.
Three separate text files are generated, one for each of the three restraints discussed here. In the NAMD script, all of these files must be passed as input arguments to the extrabonds command.
Map potentials: The potential files generated by adding Gaussian blur to the high-resolution cryo-EM map are included here (see Subheading 3.3). The typical difference of half-widths between subsequent maps is one σ.
Number of replicas: The choice for the number of replicas is dictated by the desire to achieve a fast convergence to a refined structure and the number of available parallel processors in your computing architecture. For most biological systems and computing architectures the ideal number of replicas is 8.
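To make the relation between the number of replicas and the blurred map potentials concrete, the half-widths can be listed as a simple ladder. This is a hedged sketch: the function name blur_half_widths is hypothetical, and the choice that the first replica uses the unblurred map (half-width 0) is an assumption; only the spacing of one σ between subsequent maps and the value of 8 replicas come from the text above.

```python
def blur_half_widths(n_replicas=8, step_sigma=1.0):
    """Gaussian blur half-width assigned to each replica.

    Assumption: replica 0 uses the original (unblurred) map, and each
    subsequent replica is blurred by one additional step of step_sigma.
    """
    if n_replicas < 1:
        raise ValueError("at least one replica is required")
    return [i * step_sigma for i in range(n_replicas)]


print(blur_half_widths(8))
```

With fewer available processors, the same function can be called with a smaller n_replicas, at the cost of a shorter resolution ladder.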
A README.md file is provided in each GitHub repository [12] that requires user interaction. Reference the README files as you are following steps for ReMDFF preparation, submission, and analysis.
Acknowledgements
The authors acknowledge start-up funds from the School of Molecular Sciences and Center for Applied Structure Discovery at Arizona State University, and the resources of the OLCF at the Oak Ridge National Laboratory, which is supported by the Office of Science at DOE under Contract No. DE-AC05-00OR22725, made available via the INCITE program. We also acknowledge NAMD and VMD developments supported by NIH (P41GM104601) and R01GM098243-02 for supporting our study of membrane proteins.
References
1. Li X, Mooney P, Zheng S et al (2013) Electron counting and beam-induced motion correction enable near-atomic-resolution single-particle cryo-EM. Nat Methods 10:584
2. Milazzo AC, Cheng A, Moeller A et al (2011) Initial evaluation of a direct detection device detector for single particle cryo-electron microscopy. J Struct Biol 176(3):404–408
3. Trabuco LG, Villa E, Mitra K et al (2008) Flexible fitting of atomic structures into electron microscopy maps using molecular dynamics. Structure 16(5):673–683
4. Trabuco LG, Villa E, Schreiner E et al (2009) Molecular dynamics flexible fitting: a practical guide to combine cryo-electron microscopy and X-ray crystallography. Methods 49(2):174–80
5. McGreevy R, Teo I, Singharoy A et al (2016) Advances in the molecular dynamics flexible fitting method for cryo-EM modeling. Methods 100:50–60
6. Singharoy A, Teo I, McGreevy R et al (2016) Molecular dynamics-based model refinement and validation for sub-5 angstrom cryoelectron microscopy maps. Elife 5. https://doi.org/10.7554/eLife.16105.001
7. Schweitzer A, Aufderheide A, Rudack T et al (2016) Structure of the human 26S proteasome at a resolution of 3.9 Å. Proc Natl Acad Sci 113(28):7816
8. Sun C, Benlekbir S, Venkatakrishnan P et al (2018) Structure of the alternative complex III in a supercomplex with cytochrome oxidase. Nature 557(7703):123–126
9. Chen S, Zhao Y, Wang Y et al (2017) Activation and desensitization mechanism of AMPA receptor-TARP complex by Cryo-EM. Cell 170(6):1234–1246.e14
10. Matthies D, Dalmas O, Borgnia MJ et al (2016) Cryo-EM structures of the magnesium channel CorA reveal symmetry break upon gating. Cell 164(4):747–756
11. Domnik L, Merrouch M, Goetzl S et al (2017) CODH-IV: a high-efficiency CO-scavenging CO dehydrogenase with resistance to O2. Angew Chem Int Ed Engl 56(48):15466–15469
12. Vant JW (2019) Resolution exchange molecular dynamics flexible fitting (ReMDFF) all you want to know about flexible fitting. https://github.com/jvant/ReMDFF_Singharoy_Group.git
13. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14(1):33–38
14. Phillips JC, Braun R, Wang W et al (2005) Scalable molecular dynamics with NAMD. J Comput Chem 26(16):1781–1802
15. Pettersen EF, Goddard TD, Huang CC et al (2004) UCSF chimera-a visualization system for exploratory research and analysis. J Comput Chem 25(13):1605–1612
16. Wriggers W (2012) Conventions and workflows for using Situs. Acta Crystallogr D Biol Crystallogr 68(Pt 4):344–351
17. Chen VB, Arendall W, Bryan R, Headd JJ et al (2010) MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr 66(Pt 1):12–21
18. Barad BA, Echols N, Wang RY-R et al (2015) EMRinger: side chain-directed model and map validation for 3D cryo-electron microscopy. Nat Methods 12(10):943–946
19. Wang Y, Shekhar M, Thifault D et al (2018) Constructing atomic structural models into cryo-EM densities using molecular dynamics – Pros and cons. J Struct Biol 204(2):319–328
20. Best RB, Zhu X, Shim J et al (2012) Optimization of the additive CHARMM all-atom protein force field targeting improved sampling of the backbone phi, psi and side-chain chi(1) and chi(2) dihedral angles. J Chem Theory Comput 8(9):3257–3273
Chapter 19
Protein Structure Modeling from Cryo-EM Map Using MAINMAST and MAINMAST-GUI Plugin
Genki Terashi, Yuhong Zha, and Daisuke Kihara
Abstract
Protein structure modeling is a fundamental step for the structural interpretation of a 3D electron microscopy (EM) density map. Recently, because of the significant progress of the cryo-EM technique, protein structure modeling tools are needed for EM maps determined at around 4 Å resolution. At this near-atomic resolution, finding the main-chain structure and assigning the amino acid sequence into the EM map are still challenging problems. We have developed a de novo modeling tool named MAINMAST for EM maps at near-atomic resolution (~4.5 Å). MAINMAST can trace the backbone structure of a protein directly from an EM density map. We also developed a Graphical User Interface (GUI) plugin of MAINMAST for UCSF Chimera so that users can monitor structures at each step of a modeling procedure. In this chapter, we demonstrate two examples of the use of the MAINMAST software and MAINMAST-GUI to build a protein structure model from an EM density map. The MAINMAST software and MAINMAST-GUI plugin are freely available for academic users at http://kiharalab.org/mainmast/index.html.
Key words Protein structure modeling, Cryo-EM, Graph theory, Minimum spanning tree, De novo modeling, MAINMAST
1 Introduction
Recent technical improvements in cryo-electron microscopy (cryo-EM) have led to a rapid increase of cryo-EM data [1] at a near-atomic resolution (e.g., 4 Å or better). When an EM map is obtained, structure modeling of biomolecules in the map is a critical step for interpreting the map density. Consequently, these advances in cryo-EM have created a need for tools that can interpret EM density maps. There are several tools for building a protein structure model in an EM density map: identifying secondary structure elements (Emap2Sec [2], Helix hunter [3]), refining structure models (MDFF [4], Rosetta [5]), and de novo modeling (Phenix [6], Pathwalking [7, 8], Rosetta [5, 9], MAINMAST [10, 11]).
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_19, © Springer Science+Business Media, LLC, part of Springer Nature 2020
Genki Terashi et al.
Recently, we have developed a de novo modeling tool named MAINMAST (MAINchain Model trAcing from Spanning Tree) for EM maps [10, 11]. MAINMAST builds main-chain traces of a protein structure in an EM density map. To build the main-chain models, MAINMAST automatically constructs a tree graph structure by connecting points with a high density in the map without referring to known protein structures or fragments. Then, MAINMAST generates Cα models by matching the local densities along the main-chain trace with the expected density of amino acids in the target protein sequence. To enhance the usability of the MAINMAST protocol, we also developed a MAINMAST-GUI plugin for UCSF Chimera. The major functionalities of the plugin include generating and displaying tree structures built from local dense points in the map, main-chain traces, and reconstructed all-atom models. Through the interface, users can easily control the parameters of MAINMAST and save and restore sessions.
2 Materials
MAINMAST and the MAINMAST-GUI plugin are freely available at our MAINMAST Web site (http://kiharalab.org/mainmast/Downloads.html) and on GitHub (https://github.com/kiharalab/MAINMASTplugin). The archive file MAINMAST.tgz contains the executable files, examples, and source code of the MAINMAST programs. The MAINMAST Web site describes the examples in detail.
2.1 Input Files
MAINMAST requires a SITUS [12] file and an SPD3 file, which is the result file of a secondary structure prediction. The SITUS file is a readable EM density map, and it can be converted from an MRC format [13] file (see Subheading 3.2.1). The SPD3 file is calculated from a target sequence file by SPIDER2 [14] (see Subheading 3.2.2).
2.2 Required Programs
MAINMAST requires the following external programs:
UCSF Chimera [15]: This program is a visualization program for molecular structures and EM density maps. The MAINMAST-GUI plugin runs as a plugin in this program. Chimera is available at https://www.cgl.ucsf.edu/chimera/. If a user wants to use MAINMAST from the command line only, Chimera is not required.
SPIDER2 [14]: This program calculates a position-specific scoring matrix with PSI-BLAST [16] and then predicts the secondary structures of the protein sequence. SPIDER2 is available at GitHub (https://github.com/sysu-yanglab/SPIDER2). MAINMAST requires the output file from SPIDER2.
Cryo-EM Structure Modeling with MAINMAST
PULCHRA [17]: This program constructs a full-atom protein model from a C-alpha atom model. PULCHRA is available at http://cssb.biology.gatech.edu/skolnick/files/PULCHRA/index.html.
map2map (in SITUS package): MAINMAST can read a SITUS format file. map2map converts MRC format to SITUS format. The SITUS package can be downloaded from https://situs.biomachina.org.
pymol: This program is not necessary but is a useful tool to visualize EM density maps and protein structure models. By using our script programs, pymol can visualize predicted main-chain paths and the minimum spanning trees from the output files of MAINMAST on the command line. Pymol can be downloaded from https://pymol.org/2/.
PHENIX [6]: This software package is not necessary for modeling, but the user can perform further full-atom refinement with phenix.real_space_refine (see Subheading 3.6). The PHENIX package can be downloaded from https://www.phenix-online.org.
MDFF (molecular dynamics flexible fitting [4]): This program is not necessary for modeling. MDFF refines the protein structure model. Chapter 18 shows the details of MDFF.
2.3 Installation and Launching of MAINMAST
MAINMAST programs are available for Linux and MacOS. On both operating systems, the MAINMAST programs require gfortran (GNU Fortran).
2.3.1 Installing gfortran on MacOS Machine
In order to compile and run MAINMAST programs on MacOS, gfortran should be installed on the computer. In this section, we show how to install gfortran via Xcode and Homebrew on MacOS. Xcode is a development tool for MacOS. The user can install Xcode from the Mac App Store. Once Xcode is installed, the user can install the Xcode Command Line Tools with the following command in the terminal window:
$ xcode-select --install
Then, the user can install Homebrew with the following command:
$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Finally, gfortran can be installed via Homebrew with the following command:
$ brew install gcc
To confirm that gfortran was installed correctly, type the following command in the terminal:
$ gfortran -v
If the terminal shows the version of gcc, gfortran is available on the computer.
2.3.2 Installation and Launching of MAINMAST Programs in Command Line
The MAINMAST programs and examples are archived in the file MAINMAST.tgz that can be downloaded from http://kiharalab.org/mainmast/Downloads.html. To decompress the archived files, type the command:
$ tar -zxf MAINMAST.tgz
This tar command creates a directory named MAINMAST/. This directory contains the following files and directories:
20AA.param: Parameter file for the ThreadCA program.
bondmk.pl: Perl script that creates a pymol script from an output file of the MAINMAST program. This script file is used for visualization of main-chain paths (see Subheading 3.4.1).
bondtree.pl: Perl script that creates a pymol script from an output file of the MAINMAST program. This script file is used for visualization of minimum spanning tree models (see Subheading 3.3.1).
Example1/: This directory contains all input and output files for the simulated EM density map (1yfq.situs). The simulated EM map was generated at a resolution of 5.0 Å with a grid spacing of 1.0 Å/voxel using the e2pdb2mrc.py program in the EMAN2 package [18].
Example2/: This directory contains all input and output files for the EM density map (emd-6374).
MAINMAST.f: The source code of the MAINMAST program.
ThreadCA.f: The source code of the ThreadCA program.
To compile MAINMAST and ThreadCA, the user can type the following commands:
$ cd MAINMAST
$ gfortran MAINMAST.f -O3 -fbounds-check -o MAINMAST -mcmodel=medium
$ gfortran ThreadCA.f -O3 -fbounds-check -o ThreadCA -mcmodel=medium
After compiling the codes, there should be two executable files (MAINMAST and ThreadCA).
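A quick way to confirm that compilation produced both programs is to check for the two executables in the MAINMAST/ directory. The snippet below is a minimal sketch; the helper name missing_executables is hypothetical and is not part of the MAINMAST distribution.

```python
import os


def missing_executables(directory, names=("MAINMAST", "ThreadCA")):
    """Return the program names that are absent or not executable in directory."""
    missing = []
    for name in names:
        path = os.path.join(directory, name)
        # A usable program must exist as a regular file and carry the execute bit.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "MAINMAST" is the directory created by the tar command above.
    print(missing_executables("MAINMAST"))
```

An empty list means both programs are in place; any name in the list points to a compilation step that should be repeated.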
2.3.3 Installation and Launching of MAINMAST-GUI Plugin in Chimera
The MAINMAST-GUI plugin can be downloaded from https://github.com/kiharalab/MAINMASTplugin. After downloading the files and directories, the user can place them in the directory where the MAINMAST-GUI plugin is to be installed. In this section, this directory is denoted as [Installed Directory]. The path of the [Installed Directory] will be used later for setting up the configuration. For example, if the directory MAINMASTplugin-master is stored at /user/smith/Desktop/MAINMASTplugin-master/, the [Installed Directory] is /user/smith/Desktop/.
The following files and directories should be present:
/user/smith/Desktop/MAINMASTplugin-master/imgs/
/user/smith/Desktop/MAINMASTplugin-master/MAINMAST/
/user/smith/Desktop/MAINMASTplugin-master/MainMastUI/
/user/smith/Desktop/MAINMASTplugin-master/README.log
/user/smith/Desktop/MAINMASTplugin-master/README.md
/user/smith/Desktop/MAINMASTplugin-master/urls.txt
Once all files and directories are stored at [Installed Directory], the user should set up plugin configurations. There are three steps:
1. Setting up work-path configuration
WorkPath.py should be modified according to the [Installed Directory]. WorkPath.py is located at [Installed Directory]/MAINMASTplugin-master/MainMastUI/WorkPath.py. This python script specifies the location of the MAINMAST-GUI plugin and the working path. Example of WorkPath.py:
def showWorkingPath():
    return "[Installed Directory]/MAINMASTplugin-master/MainMastUI/WorkPath.py"
For example, when the [Installed Directory] is /user/smith/Desktop/, WorkPath.py should be:
def showWorkingPath():
    return "/user/smith/Desktop/MAINMASTplugin-master/MainMastUI/WorkPath.py"
2. Build executable files
As with the command-line programs, the source codes need to be compiled in [Installed Directory]/MAINMASTplugin-master/MAINMAST/. Type:
$ cd [Installed Directory]/MAINMASTplugin-master/MAINMAST
$ gfortran MAINMAST.f -O3 -fbounds-check -o MAINMAST -mcmodel=medium
$ gfortran ThreadCA.f -O3 -fbounds-check -o ThreadCA -mcmodel=medium
3. Add the MAINMAST-GUI plugin to Chimera
First, open Chimera to open the preference window. Once Chimera is open, the MAINMASTplugin-master extension needs to be added to the Chimera program. In the menu bar, go to [Favorites] > [Preferences...] > [Tools] (Fig. 1-(1) and (2)), then click [Add...] (Fig. 1-(3)), then select the MAINMASTplugin-master folder ([Installed Directory]/MAINMASTplugin-master), and click [Close]. To apply the change (Fig. 1-(4)), click [Save]. Note that the MAINMASTplugin-master folder should be selected only once. Do not select any of the child folders. For example, if the folder is located at “/Users/terashi/Desktop/MAINMASTplugin-master,” the third-party plugin locations should show “/Users/terashi/Desktop” (Fig. 1-(5)).
4. Launch MAINMAST-GUI plugin on Chimera
After adding the MAINMAST plugin to Chimera, the plugin can be launched from the IDLE terminal window. The IDLE terminal is located in the menu bar at [Tools] > [General Controls] > [IDLE] (Fig. 1-(6)). A Python Shell window will pop up; type the following command in it (Fig. 1-(7)):
>>> import MainMastUI.gui
If a button with the MAINMAST logo appears in the top left corner (Fig. 1-(8)), the MAINMAST-GUI plugin is ready to use on Chimera.
Fig. 1 Installation of MAINMAST-GUI plugin on Chimera
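The work-path setup in step 1 above amounts to writing the installation path into WorkPath.py. As an illustration only, the text of that file could be generated from the [Installed Directory] as sketched below; the helper name workpath_source is hypothetical and is not part of the plugin, and the returned path simply follows the example given in step 1.

```python
import posixpath


def workpath_source(installed_directory):
    """Build the text of WorkPath.py for a given [Installed Directory].

    The path layout follows the example in step 1; posixpath is used so
    that the result always has forward slashes, as on Linux and MacOS.
    """
    target = posixpath.join(installed_directory, "MAINMASTplugin-master",
                            "MainMastUI", "WorkPath.py")
    return "def showWorkingPath():\n    return {!r}\n".format(target)


print(workpath_source("/user/smith/Desktop/"))
```

The printed text can be pasted into WorkPath.py; editing the file by hand, as described in step 1, achieves the same result.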
3 Methods
In this section, we describe the details of de novo modeling with the MAINMAST protocol. We demonstrate the use of both the MAINMAST commands and the MAINMAST-GUI plugin. All files used for the demonstration are placed in the directory Example1/. Example1/ is located at MAINMAST/Example1/ or [Installed Directory]/MAINMASTplugin-master/MAINMAST/Example1/.
3.1 Overview of MAINMAST Protocol
The MAINMAST protocol consists of five main steps (Fig. 2):
1. MAINMAST identifies local dense points (LDPs) in a density map using the mean shift algorithm. The mean shift algorithm is a nonparametric clustering algorithm originally developed for image processing. The assumption is that a density observed in a map is the sum of density functions that originate from atoms in the map. The primary assumption of the mean shift algorithm is that each point in the map represents a Gaussian density function, and the local maxima of dense regions correspond to the chain positions of proteins. This process simplifies the EM map into a set of clustered LDPs.
2. A minimum spanning tree (MST) is constructed by connecting all LDPs. An MST is a graph structure that connects all vertices with the minimal total weight of edges without forming cycles. It was found that the main-chain structure of the protein is well covered by the MST.
3. The obtained tree structure is refined iteratively by a tabu-search algorithm [19]. Usually, the longest path in the MST captures a large fraction of the correct main-chain trace, but there are several erroneous connections. To refine the tree structure, MAINMAST uses the tabu-search method. A tabu search attempts to explore a large search space by keeping a list of moves that were visited recently and are thus forbidden (tabu list).
4. Build Cα models. The longest path is identified in a refined tree. Then, local densities along the path are matched with the expected density of amino acids in the target protein sequence. In this step, the ThreadCA program assigns Cα positions of the target protein on the longest path.
5. Finally, the generated Cα models are subjected to full-atom building using the PULCHRA program [17]. PULCHRA is a program that builds a full-atom protein structure model from a reduced protein representation, which was originally developed for protein structure prediction. The full-atom models can be subjected to refinement tools. In the original MAINMAST paper, we used molecular dynamics flexible fitting (MDFF). MDFF is available on VMD as a plugin (see http://www.ks.uiuc.edu/Training/Tutorials/science/mdff/tutorial_mdff-html/). In this chapter, we use a simple refinement tool, phenix.real_space_refine, in the PHENIX package.
Fig. 2 MAINMAST protocol [10]
In the MAINMAST directory uncompressed from MAINMAST.tgz (see Subheading 2.3.2) or in [Installed Directory]/MAINMASTplugin-master/MAINMAST/ (see Subheading 2.3.3), there are two programs, MAINMAST and ThreadCA. MAINMAST performs steps 1–3, and ThreadCA performs the sequence threading part of step 4.
3.2 Preparation of Input Files
MAINMAST protocol requires an EM density map file and the result of secondary structure prediction computed by SPIDER2. Here, we explain the preparation of input files.
3.2.1 EM Density Map
The MAINMAST program can read a SITUS format file as an EM density map. To use an MRC format file for modeling, the user needs to convert the file format with the map2map program in the SITUS package. The SITUS package is available at https://situs.biomachina.org. For example, the emd-1234.mrc file is converted to emd-1234.situs by the command:
$ map2map emd-1234.mrc emd-1234.situs
Then, the map2map program may ask for a selection:
map2map> Input map in MRC or CCP4 binary format detected. Desired output format seems to be Situs based on file extension.
map2map> Choose one of the following options (or 0 for classic menu):
map2map>
map2map> 1: Convert to classic Situs (auto)*
map2map> 2: Convert to classic Situs (manual)**
map2map>
map2map> *: automatic fill of header fields
map2map> **: manual assignment of header fields
map2map>
map2map> Enter selection:
Type “1” to select the automatic conversion. To convert to the SITUS format without the interactive prompt, type:
$ echo 1 | map2map emd-1234.mrc emd-1234.situs
This command will generate emd-1234.situs in SITUS format.
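When many maps have to be converted, the same non-interactive call can be scripted. The following is a minimal sketch that mirrors the echo 1 | map2map command above using Python's standard subprocess module; the function names map2map_invocation and convert_mrc_to_situs are hypothetical, and the sketch assumes that map2map is on the PATH and accepts the selection "1" on standard input, as in the command shown above.

```python
import subprocess


def map2map_invocation(mrc_path, situs_path):
    """Return the argument list and the standard-input text for map2map."""
    argv = ["map2map", mrc_path, situs_path]
    stdin_text = "1\n"  # option 1: convert to classic Situs (auto)
    return argv, stdin_text


def convert_mrc_to_situs(mrc_path, situs_path):
    """Run map2map non-interactively; raises CalledProcessError on failure."""
    argv, stdin_text = map2map_invocation(mrc_path, situs_path)
    return subprocess.run(argv, input=stdin_text, text=True, check=True)


argv, stdin_text = map2map_invocation("emd-1234.mrc", "emd-1234.situs")
print(" ".join(argv))
```

Calling convert_mrc_to_situs("emd-1234.mrc", "emd-1234.situs") would then perform the conversion itself.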
326
Genki Terashi et al.
3.2.2 Secondary Structure Prediction
ThreadCA requires an output file (*.spd3) from SPIDER2. SPIDER2 can be downloaded from GitHub (https://github.com/sysu-yanglab/SPIDER2). To run SPIDER2, simply run run_local.sh in the SPIDER2 package. For example, there is a target sequence file 1yfq.seq in Example1/. To run the secondary structure prediction, type:
$ run_local.sh 1yfq.seq
Then, the program will generate the 1yfq.spd3 file.
3.3 Building Graph and Minimum Spanning Tree Structure
As we showed in Subheading 3.1, LDPs are computed from the EM density map and connected by a tree graph model. In this section, we demonstrate the building of the initial tree graph (MST) and the graph model which shows all possible connections. Before predicting the main-chain paths, this analysis is helpful to find the optimal density threshold values. For example, if the density threshold value is too low, the number of LDPs will be large and may take very long computational time. If the density threshold is too high, the longest path cannot cover the whole region in the EM density map. By inspecting the MST and graph models, user can see the coverage and connections of LDPs in the EM map.
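The idea of connecting LDPs into an MST can be illustrated with a small, self-contained example. This is a generic sketch of Kruskal's algorithm on 3D points, not the MAINMAST implementation: the use of Euclidean distance as the edge weight, the parameter max_edge, and the function name minimum_spanning_tree are assumptions made for illustration.

```python
import math
from itertools import combinations


def minimum_spanning_tree(points, max_edge=float("inf")):
    """Kruskal's algorithm on 3D points; returns a list of (i, j, length) edges.

    Candidate edges longer than max_edge are ignored, so the result is a
    spanning forest if the points fall into disconnected groups.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        if d <= max_edge:
            edges.append((d, i, j))
    edges.sort()  # shortest candidate edges first

    tree = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so no cycle is formed
            parent[ri] = rj
            tree.append((i, j, d))
    return tree


ldps = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0)]
print(minimum_spanning_tree(ldps))
```

In this picture, a lower density threshold corresponds to more points entering the graph, which increases the number of candidate edges roughly quadratically in this naive all-pairs version.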
3.3.1 Command Line
To build a graph and an MST model, we can use two programs, MAINMAST and bondtree.pl. To compute a graph model, type:
$ MAINMAST -m 1yfq.situs -t 9 -filter 0.3 -Graph > graph.pdb
To compute an MST model, type:
$ MAINMAST -m 1yfq.situs -t 9 -filter 0.3 -Tree > tree.pdb
The graph.pdb and tree.pdb files are PDB format files with BOND records. bondtree.pl makes a script file to visualize graph.pdb and tree.pdb in the molecular visualization program pymol. To generate the script file, type:
$ bondtree.pl graph.pdb GRAPH > script.txt
$ bondtree.pl tree.pdb TREE >> script.txt
Then, open script.txt with the pymol command and the -u option:
$ pymol -u ./script.txt
or, on MacOS:
$ /Applications/PyMOL.app/Contents/MacOS/PyMOL -u ./script.txt
pymol shows two model panels, “TREE” and “GRAPH.” To show the models as nodes and edges, click [S] > [as] > [sticks].
Fig. 3 Graph structure analysis on MAINMAST-GUI plugin. (a) Modeling part, (b) visualization part, (c) MST model, and (d) graph model
3.3.2 MAINMAST-GUI Plugin
In the MAINMAST-GUI plugin, the graph structure and the MST can be computed by the following steps:
1. Open the tab [Create MAINMAST files].
2. Select an EM density map file with the MAP file: [select file] icon.
3. Select an SPD3 file with the spd3 file: [select file] icon.
4. Enter the density threshold value.
5. Click the [Create MST & All Edges] icon.
6. If this area shows “MST & all Edges: ready to display,” all files are ready to display.
7. Open the tab [Display MAINMAST graphs].
8. Click the [Minimum Spanning Tree] icon. Chimera shows the MST model. Yellow spheres represent LDPs. Black lines represent the MST (see Fig. 3c).
9. Click the [All edges] icon. Chimera shows the graph model. Cyan spheres represent LDPs. Black lines represent all edges (see Fig. 3d).
3.4 Building Protein Structure Models
As we discussed in the overview (Subheading 3.1), LDPs are identified in a density map, and then MST is built and refined iteratively. From the refined tree structures, the longest paths (main-chain paths) are computed. These steps are performed by MAINMAST program. For the predicted main-chain paths, ThreadCA program aligns the protein sequence. In the MAINMAST-GUI plugin, these processes are performed automatically. All required files are placed in the directory Example1/.
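Extracting the longest path from a tree can be done with two traversals: the farthest node from any starting node is one end of the longest path, and the farthest node from that end is the other. The sketch below illustrates this standard technique on a small weighted tree; it is not the MAINMAST code, and the function name longest_path and the (i, j, length) edge format are assumptions made for illustration.

```python
from collections import defaultdict


def longest_path(tree_edges):
    """Return (length, node list) of the longest path in a weighted tree.

    tree_edges is a non-empty list of (i, j, length) tuples that form a
    tree with non-negative edge lengths.
    """
    adj = defaultdict(list)
    for i, j, w in tree_edges:
        adj[i].append((j, w))
        adj[j].append((i, w))

    def farthest(start):
        # Depth-first traversal recording the distance and predecessor of each node.
        dist = {start: 0.0}
        prev = {start: None}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    prev[nxt] = node
                    stack.append(nxt)
        end = max(dist, key=dist.get)
        return end, dist[end], prev

    first = next(iter(adj))
    u, _, _ = farthest(first)        # one end of the longest path
    v, length, prev = farthest(u)    # the other end, with predecessors from u
    path = []
    node = v
    while node is not None:
        path.append(node)
        node = prev[node]
    return length, path[::-1]        # ordered from u to v


edges = [(0, 1, 1.0), (1, 2, 0.5), (1, 3, 2.5), (3, 4, 1.0)]
print(longest_path(edges))
```

In the MAINMAST protocol, the path found this way would then be passed on to the threading step, where the sequence is aligned in both directions.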
3.4.1 Command Line
First, the MAINMAST program builds main-chain paths from the EM density map. In Example1/, we use 1yfq.situs as the EM density map. To predict main-chain paths, type:
$ MAINMAST -m 1yfq.situs -t 9 -filter 0.3 -Dkeep 1.0 -Ntb 10 -Rlocal 5 -Nround 50 > path.pdb
path.pdb contains ten main-chain paths in PDB format. After computing the main-chain paths, ThreadCA aligns the protein sequence (1yfq.spd3) to the main-chain paths as follows:
$ ThreadCA -i path.pdb -a ./20AA.param -spd 1yfq.spd3 > CA.pdb
$ ThreadCA -i path.pdb -a ./20AA.param -spd 1yfq.spd3 -r > CA_r.pdb
The ThreadCA program aligns the target sequence to the main-chain paths in two sequence directions (controlled by the -r option). CA_r.pdb has the opposite sequence direction to path.pdb. To make a pymol script for visualization, type the following commands:
$ bondmk.pl path.pdb PATH > script.txt
$ bondmk.pl CA.pdb CA >> script.txt
$ bondmk.pl CA_r.pdb CA_r >> script.txt
Then, open script.txt with pymol and the -u option:
$ pymol -u ./script.txt
or, on MacOS:
$ /Applications/PyMOL.app/Contents/MacOS/PyMOL -u ./script.txt
pymol will show three model panels: path.pdb as “PATH,” CA.pdb as “CA,” and CA_r.pdb as “CA_r” (Fig. 4b).
Fig. 4 MAINMAST models on PyMol. (a) MST on PyMol. Graph structure (green) and MST (cyan), (b) main-chain path (green), and Cα models (magenta and cyan)
3.4.2 MAINMAST-GUI Plugin
In the MAINMAST-GUI plugin, main-chain paths and Cα models can be computed by the following steps:
1. Open the tab [Create MAINMAST files].
2. Select an EM density map file with the MAP file: [select file] icon.
3. Select an SPD3 file with the spd3 file: [select file] icon.
4. Enter the density threshold value. In this example, we use 9.
5. Click the [Create all files] icon.
6. If this area shows “All files: ready to display,” all files are ready to display.
7. Open the tab [Display MAINMAST graphs].
8. Select the number of main-chain path models to show. In this example, we select ten.
9. Select the number of Cα models to show. In this example, we select ten.
10. Click the [Main-chain Path] icon. Chimera shows ten main-chain paths (see Fig. 5c).
11. Click the [Predicted CA model] icon. Chimera shows ten Cα models (see Fig. 5d).
12. Click the [Pulchra Rebuild] icon. Chimera shows a full-atom model generated by PULCHRA (see Fig. 5e).
3.5 Save and Restore Session in MAINMAST-GUI Plugin
In the MAINMAST-GUI plugin menu, the user can save computed models in the “Save and Restore files” tab. This tab has three icons:
[save session & files]: The current files and session are saved to the directory you selected. It will also store the zoom in/out percentage and the orientation of the current display.
[restore session & file]: The files are restored from the directory you selected. It will restore the last displayed graph, the zoom in/out percentage, and the orientation.
[compare session & file]: The user can compare up to three previously stored sessions.
Fig. 5 Protein structure modeling on MAINMAST-GUI plugin. (a) Modeling part, (b) visualization part, (c) main-chain paths, (d) Cα models, and (e) full-atom model
3.6 Full-Atom Model Refinement
In this section, we explain the procedure to reconstruct full-atom models from the Cα models that were constructed by MAINMAST and to perform full-atom refinement.
3.6.1 Command Line
PULCHRA builds a full-atom model from a Cα model. To run PULCHRA, type:
$ pulchra CA.pdb
PULCHRA reconstructs a full-atom model as CA.rebuilt.pdb. The new file CA.rebuilt.pdb can be subjected to the full-atom refinement tool phenix.real_space_refine in the PHENIX package. Type:
$ phenix.real_space_refine CA.rebuilt.pdb current.mrc resolution=5.0
phenix.real_space_refine will create a new refined full-atom model, CA.rebuilt_real_space_refined.pdb.
3.6.2 MAINMAST-GUI Plugin
At “Run Phenix refinement:” (see Fig. 5-(13)), the user can enter the resolution of the EM map and select the MRC format file. Then click [Refinement command] (see Fig. 5-(14)), and the plugin will show three commands like:
$ cd [Installed Directory]/MAINMASTplugin-master/MAINMAST/MAINMASTfile
$ module load phenix
$ phenix.real_space_refine CA.rebuilt.pdb current.mrc resolution=5.0
Copy those commands line by line to the terminal. After they finish successfully, go back to the plugin and click show result graph. It will show the phenix refined graph. It is also possible to apply a refinement program to the Cα models that were predicted by the MAINMAST-GUI plugin. In the working directory ([Installed Directory]/MAINMASTplugin-master/MAINMAST/MAINMASTfile), ten Cα models are saved as CA0.pdb, CA1.pdb, ..., CA9.pdb. The user can perform full-atom reconstruction and refinement on these Cα models.
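Rebuilding and refining all ten Cα models by hand is repetitive, so the command lines can be generated in a loop. The following sketch only assembles the commands shown in Subheading 3.6.1 for each of CA0.pdb to CA9.pdb; the function name refinement_commands is hypothetical, and the output name CAn.rebuilt.pdb is an assumption extrapolated from the CA.pdb to CA.rebuilt.pdb behavior described above.

```python
def refinement_commands(n_models=10, map_file="current.mrc", resolution=5.0):
    """Build the pulchra and phenix.real_space_refine command lines for CA0..CA(n-1)."""
    commands = []
    for i in range(n_models):
        ca_model = "CA{}.pdb".format(i)
        # Assumed output name, by analogy with CA.pdb -> CA.rebuilt.pdb.
        rebuilt = "CA{}.rebuilt.pdb".format(i)
        commands.append("pulchra {}".format(ca_model))
        commands.append(
            "phenix.real_space_refine {} {} resolution={}".format(
                rebuilt, map_file, resolution))
    return commands


for line in refinement_commands():
    print(line)
```

The printed lines can be saved to a shell script and run in the working directory after the PHENIX environment has been loaded.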
4 Case Study: De Novo Modeling of Cytoplasmic Polyhedrosis Virus (emd-6374)
In this section, we show an example of de novo modeling with the MAINMAST protocol on emd-6374 (http://www.ebi.ac.uk/pdbe/entry/emdb/EMD-6374). We use two input files, 6374.situs and 6374.spd3, in Example2/. Example2/ is located at MAINMAST/Example2/ or [Installed Directory]/MAINMASTplugin-master/MAINMAST/Example2/. We describe the details of the command lines and of the procedure on the MAINMAST-GUI plugin.
4.1
Command Line
In order to build accurate protein structure models, the density threshold value is one of the important parameters. Before computing protein models, we can confirm the coverage of the initial tree structure (MST) with different density threshold values. In this case, we use 7.0 and 1.0 as the density threshold values. For computing of MST models with the density threshold value ¼ 7.0, the following commands were executed:
$ MAINMAST -t 7.00 -filter 0.3 -Rlocal 10 -m 6374.situs -Tree > mst.pdb $ bondtree.pl graph.pdb MST > script.txt $ pymol -u ./script.txt
Figure 6a shows the resulting MST. The MST (red) misses many regions of the EM density map (blue). To use the lower density threshold of 1.0, the following commands were executed:
$ MAINMAST -t 1.00 -filter 0.3 -Rlocal 10 -m 6374.situs -Tree > mst.pdb
$ bondtree.pl mst.pdb MST > script.txt
$ pymol -u ./script.txt
Genki Terashi et al.
Fig. 6 MST and MAINMAST models for example2. (a and b) MST models with different density threshold values (7.0 and 1.0). (c) 3jb0-D.pdb (cartoon model) and refined full-atom model (stick model)
The MST computed with a density threshold of 1.0 covers the whole region (Fig. 6b), so we set the density threshold to 1.0. To generate Cα models, we executed the following commands:
$ MAINMAST -t 1.00 -filter 0.3 -Rlocal 10 -m 6374.situs -Nround 50 > path.pdb
$ ../ThreadCA -i path.pdb -a ../20AA.param -spd 6374.spd3 -fw 1.4 -Ab 3.4 -Wb 0.9 > CA.pdb
$ ../ThreadCA -i path.pdb -a ../20AA.param -spd 6374.spd3 -fw 1.4 -Ab 3.4 -Wb 0.9 -r > CA_r.pdb
The above commands compute main-chain paths (path.pdb) and two Cα models (CA.pdb and CA_r.pdb). From the two Cα models, we constructed full-atom models with pulchra and then refined them with phenix.real_space_refine:
$ pulchra CA.pdb
$ pulchra CA_r.pdb
$ phenix.real_space_refine CA.rebuilt.pdb 6374.mrc resolution=2.9
$ phenix.real_space_refine CA_r.rebuilt.pdb 6374.mrc resolution=2.9
The above commands generate full-atom models (CA.rebuilt.pdb and CA_r.rebuilt.pdb) and refined full-atom models (CA.rebuilt_real_space_refined.pdb and CA_r.rebuilt_real_space_refined.pdb). In the directory Example2/, there is the native protein structure file (3jb0-D.pdb). To compare the native structure and the predicted model, type:
$ pymol CA.rebuilt_real_space_refined.pdb 3jb0-D.pdb
The RMSD between 3jb0-D.pdb and CA.rebuilt_real_space_refined.pdb is 1.8 Å (Fig. 6c).
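For readers who want to check such a value outside PyMOL, a Cα RMSD can be computed with a few lines of Python. The sketch below is not part of the MAINMAST distribution and its function names are hypothetical. It assumes that both files use standard fixed-column PDB ATOM records, that each holds a single chain with matching residue numbering, and that the two structures already share the coordinate frame of the EM map, so no superposition is performed.

```python
import math


def read_ca_coordinates(pdb_text):
    """Map residue number -> (x, y, z) for C-alpha atoms in PDB-format text.

    Assumes fixed-column ATOM records and a single chain.
    """
    coords = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue_number = int(line[22:26])
            coords[residue_number] = (
                float(line[30:38]),
                float(line[38:46]),
                float(line[46:54]),
            )
    return coords


def ca_rmsd(model, native):
    """RMSD over C-alpha atoms present in both structures (no superposition)."""
    shared = sorted(set(model) & set(native))
    if not shared:
        raise ValueError("the two structures share no residue numbers")
    total = 0.0
    for residue_number in shared:
        p, q = model[residue_number], native[residue_number]
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / len(shared))
```

A typical use would read the text of CA.rebuilt_real_space_refined.pdb and 3jb0-D.pdb, pass each through read_ca_coordinates, and call ca_rmsd on the two results. If the structures are not in the same frame, they must be superimposed first.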
4.2 MAINMAST-GUI Plugin
In the MAINMAST-GUI plugin, main-chain paths and Cα models can be computed by the following steps:
1. Open the tab [Create MAINMAST files].
2. Select 6374.situs in the directory Example2/ with the MAP file: [select file] icon.
3. Select 6374.spd3 with the spd3 file: [select file] icon.
4. Enter the threshold of density values. In this example, we use 1.0.
5. Click the [Create all files] icon.
6. When this area shows “All files: ready to display,” all files are ready to display.
7. Open the tab [Display MAINMAST graphs].
8. Select the number of main-chain path models to show. In this example, we select ten.
9. Select the number of Cα models to show. In this example, we select ten.
10. Click the [main-chain Path] icon. Chimera shows ten main-chain path models (Fig. 7c).
11. Click the [Pulchra Rebuild] icon. Chimera shows a full-atom model generated by PULCHRA (Fig. 7d).
Fig. 7 Modeling on MAINMAST-GUI plugin. (a) MAINMAST plugin menu. (b) visualization part. (c) Ten mainchain models. (d) Full-atom model (green) and native structure (orange 3jb0-D.pdb)
12. To use phenix.real_space_refine, enter the resolution value 2.9.
13. Select 6374.mrc in Example2/ with the mrc file for refinement file: [select file] icon.
14. Click [Refinement command], and then execute the three command lines.
5 Notes
In the MAINMAST protocol, a user can run MAINMAST with different parameters. As we demonstrated in Subheading 4.1, the density threshold is the most important parameter. To generate accurate models, an appropriate density threshold should be selected by monitoring the coverage of the MST model. Tables 1 and 2 show all available options of the MAINMAST and ThreadCA commands, respectively. Table 3 shows all parameters of the MAINMAST-GUI plugin.

Table 1 Options and default parameters in MAINMAST
| Option | Value | Description |
|---|---|---|
| -m | SITUS file | MAP file. *required |
| -tree | None | Show minimum spanning tree |
| -graph | None | Show LDPs and all connections |
| -gw | Float | Bandwidth of the Gaussian filter. Default value is 2.0 |
| -Dkeep | Float | Keep edges whose length is < [float] Å. Default value is 0.5 Å |
| -t | Float | Threshold of density values. Default value is 0.0 |
| -allow | Float | Maximum shift distance in the mean shifting step. Default value is 10.0 Å |
| -filter | Float | Denoising filter of LDPs. Default value is 0.1 |
| -merge | Float | Distance cutoff for the merging step. Default value is 0.5 Å |
| -Nround | Int | Number of iterations in the tabu-search step. Default value is 5000 |
| -Nnb | Int | Number of neighbors in the tabu-search step. Default value is 30 |
| -Ntb | Int | Size of the tabu list. Default value is 100 |
| -Rlocal | Float | Radius of the local MST. Default value is 10 Å |
| -Const | Float | Constraint on the total length of edges in tree graphs. Default value is 1.01 |

Table 2 Options and default parameters in ThreadCA
| Option | Value | Description |
|---|---|---|
| -i | File name | Output file of the MAINMAST program. *required |
| -a | 20AA.param | Parameter file. *required |
| -spd | SPD3 file | Result file of SPIDER2. *required |
| -fw | Float | Bandwidth of the Gaussian filter. Default value is 1.0 |
| -Ab | Float | Average length (Å) of the Cα-Cα bond. Default value is 3.5 Å |
| -Wb | Float | Weight of the bond score. Default value is 0.9 |
| -r | None | Reverse mode. Uses the opposite direction of the main-chain path |

Table 3 Parameters in MAINMAST-GUI plugin
| Option | Value | Description |
|---|---|---|
| MAP file (situs format) | File name | EM density map file. *required |
| spd3 file | File name | Result file of SPIDER2. *required |
| Threshold of density values | Float | Density cutoff value |
| Filter of representative points | Float | A higher value removes more noise points |
| Maximum edge distance | Float | Maximum length of an edge that is not deleted |
| Size of tabu-list | Int | Parameter for tabu-search |
| Radius of local MST | Float | Size of the local sphere that defines the local MST |
| Number of iterations | Int | Parameter for tabu-search |
Acknowledgments
The authors acknowledge C. Christoffer for his help in finalizing the manuscript. This work was partly supported by the National Institutes of Health (R01GM123055), the National Science Foundation (DMS1614777 and CMMI1825941), and the Purdue Institute of Drug Discovery.
References
1. Frank J (2017) Advances in the field of single-particle cryo-electron microscopy over the last decade. Nat Protoc 12:209
2. Subramaniya SRMV, Terashi G, Kihara D (2019) Protein secondary structure detection in intermediate-resolution cryo-EM maps using deep learning. Nat Methods 16:911–917
3. Jiang W, Baker ML, Ludtke SJ, Chiu W (2001) Bridging the information gap: computational tools for intermediate resolution structure interpretation. J Mol Biol 308:1033–1044
4. McGreevy R, Teo I, Singharoy A, Schulten K (2016) Advances in the molecular dynamics flexible fitting method for cryo-EM modeling. Methods 100:50–60
5. DiMaio F, Song Y, Li X et al (2015) Atomic-accuracy models from 4.5-Å cryo-electron microscopy data with density-guided iterative local refinement. Nat Methods 12:361
6. Terwilliger TC, Grosse-Kunstleve RW, Afonine PV et al (2008) Iterative model building, structure refinement and density modification with the PHENIX AutoBuild wizard. Acta Crystallogr D Biol Crystallogr 64:61–69
7. Baker MR, Rees I, Ludtke SJ et al (2012) Constructing and validating initial Cα models from subnanometer resolution density maps with pathwalking. Structure 20:450–463
8. Chen M, Baldwin PR, Ludtke SJ, Baker ML (2016) De novo modeling in cryo-EM density maps with Pathwalking. J Struct Biol 196:289–298
9. Wang RY-R, Kudryashev M, Li X et al (2015) De novo protein structure determination from near-atomic-resolution cryo-EM maps. Nat Methods 12:335
10. Terashi G, Kihara D (2018) De novo main-chain modeling with MAINMAST in 2015/2016 EM model challenge. J Struct Biol 204:351–359
11. Terashi G, Kihara D (2018) De novo main-chain modeling for EM maps using MAINMAST. Nat Commun 9:1618
12. Wriggers W (2012) Conventions and workflows for using Situs. Acta Crystallogr D Biol Crystallogr 68:344–351
13. Cheng A, Henderson R, Mastronarde D et al (2015) MRC2014: extensions to the MRC format header for electron cryo-microscopy and tomography. J Struct Biol 192:146–150
14. Heffernan R, Dehzangi A, Lyons J et al (2015) Highly accurate sequence-based prediction of half-sphere exposures of amino acid residues in proteins. Bioinformatics 32:843–849
15. Pettersen EF, Goddard TD, Huang CC et al (2004) UCSF chimera - a visualization system for exploratory research and analysis. J Comput Chem 25:1605–1612
16. Altschul SF, Madden TL, Schäffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
17. Rotkiewicz P, Skolnick J (2008) Fast procedure for reconstruction of full-atom protein models from reduced representations. J Comput Chem 29:1460–1465
18. Tang G, Peng L, Baldwin PR et al (2007) EMAN2: an extensible image processing suite for electron microscopy. J Struct Biol 157:38–46
19. Glover F (1986) Future paths for integer programming and links to artificial intelligence. Comput Oper Res 13:533–549
Chapter 20
Protocols for Fast Simulations of Protein Structure Flexibility Using CABS-Flex and SURPASS
Aleksandra E. Badaczewska-Dawid, Andrzej Kolinski, and Sebastian Kmiecik
Abstract
Conformational flexibility of protein structures can play an important role in protein function. The flexibility is often studied using computational methods since experimental characterization can be difficult. Depending on protein system size, computational tools may require large computational resources or significant simplifications in the modeled systems to speed up calculations. In this work, we present the protocols for efficient simulations of flexibility of folded protein structures that use coarse-grained simulation tools of different resolutions: medium, represented by CABS-flex, and low, represented by SURPASS. We test the protocols using a set of 140 globular proteins and compare the results with structure fluctuations observed in MD simulations, ENM modeling, and NMR ensembles. As demonstrated, CABS-flex predictions show high correlation to experimental and MD simulation data, while SURPASS is less accurate but promising in terms of future developments.
Key words Protein modeling, Coarse-grained protein models, Molecular modeling, Molecular Dynamics, Multiscale modeling
1 Introduction
Molecular Dynamics (MD) and Elastic Network Models (ENM) are perhaps the most popular computational approaches in the studies of structural flexibility of biomolecules [1]. Both approaches are very effective in the study of local dynamics around well-defined, usually experimentally determined, structures that are used as the input. For MD studies, the size and complexity of the biomolecule may be a major limitation, while the ENM approach may not give satisfactory results for structurally ambiguous regions [1]. The simulation alternatives are coarse-grained (CG) protein models [2], which enable efficient modeling of much larger systems and/or longer processes than classical MD and use more sophisticated interaction schemes than ENM techniques.
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4_20, © Springer Science+Business Media, LLC, part of Springer Nature 2020
Aleksandra E. Badaczewska-Dawid et al.
In this work, we present the protocols for prediction of protein structure fluctuations using CG protein models of medium (CABS-flex) and low resolution (SURPASS). Subheading 2 contains a short description of the CABS-flex [3, 4] and SURPASS [5, 6] simulation tools and their access links. Subheading 3 presents the protocols for fast simulations of protein structure flexibility. In order to test the protocols, we use a set of 140 globular proteins and compare the predictions of protein fluctuations with the data from all-atom MD, ENM (using the DynOmics tool [7]), and NMR ensembles. The test results are presented in Subheading 4. In general, the CABS-flex method showed high correlation to experimental and MD simulation data. The low-resolution SURPASS was less accurate, particularly for proteins with a low content of regular secondary structure or a weak hydrophobic core, which is not surprising in the context of the SURPASS design. Nevertheless, despite using a very low-resolution representation of protein structure, SURPASS showed accuracy on the level of the other methods for a significant portion of the proteins from the test set. This is a very promising result in view of the potential use of SURPASS in studies of large protein systems. The advantageous feature of the tested protocols is their very low calculation cost, which is reduced to minutes in the case of CABS-flex and seconds using the SURPASS model (for proteins having several hundred residues and using a single CPU of average power).
2 Materials
2.1 CABS-Flex Method
The CABS-flex method is dedicated to fast simulations of protein flexibility [3, 4]. CABS-flex uses the well-established CABS coarse-grained model (reviewed elsewhere [2]; see Fig. 1) as a simulation engine and merges it with tools for all-atom reconstruction and simulation analysis. The protein flexibility profiles produced by CABS Monte Carlo dynamics simulations were shown to be consistent with the protein flexibility of folded globular proteins seen in MD simulations
Fig. 1 Example tripeptide in all-atom and coarse-grained (CG) representations. The CABS CG model (used in the CABS-flex package) and the SURPASS CG model are presented. A single pseudo-atom of SURPASS replaces a short (four-residue-long) secondary structure fragment. Despite the deep simplification of representation, the model reproduces basic structural properties of globular proteins
Fast Modeling of Protein Structure Flexibility
[8, 9], with NMR ensembles [10], and also with various kinds of experimental data on protein folding mechanisms [11–14]. Moreover, the CABS-flex method is successfully used in the Aggrescan3D method [15–18] to predict the influence of protein flexibility on protein aggregation properties and in the CABS-dock method for simulations of protein flexibility during peptide molecular docking [19–22]. CABS-flex is presently available as the CABS-flex 2.0 web server [3] (http://biocomp.chem.uw.edu.pl/CABSflex2) and as a stand-alone package [4]. The package repository (https://bitbucket.org/lcbio/cabsflex/) contains online documentation, descriptions of the options, installation instructions, and examples of usage. Note that the CABS-flex stand-alone version runs on most Unix/Linux, Windows, and MacOS systems and is also available as a Docker image. The following programs are also necessary:
1. GFortran, a freely redistributable Fortran compiler, required by the CABS model (the CABS-flex simulation engine).
2. Python 2.7, since CABS-flex is a Python 2.7 package.
3. The Modeller package [23], which is required by CABS-flex for all-atom reconstruction from the C-alpha trace of CABS models.
A thorough installation guide and the CABS-flex issue tracker, which allows users to report any issues, can be found at the CABS-flex repository.
2.2 SURPASS Software
SURPASS [5, 6] is a low-resolution coarse-grained model for efficient modeling of the structure and dynamics of larger biomolecular systems. This model employs a highly simplified representation of the protein structure and statistical potentials (see Fig. 1). The concept of the SURPASS representation is very simple and assumes averaging of short secondary structure fragments. The specific interaction model describes local structural regularities characteristic of most globular proteins. Despite its high simplification, the SURPASS model reproduces the basic structural properties of proteins reasonably well and overcomes some limitations of coarse-grained models of moderate resolution [5, 6]. Reconstruction from SURPASS pseudo-atoms to a Cα-trace is possible using the SUReLib algorithm (http://biocomp.chem.uw.edu.pl/tools/surpass). The SURPASS software is available free of charge for academic use as a stand-alone program from the Laboratory website (http://biocomp.chem.uw.edu.pl/tools). The online SURPASS repository and documentation, including installation instructions, descriptions of the options, and examples of use, can also be found at https://bitbucket.org/lcbio/surpass/. Note that SURPASS is implemented in C++11 as a part of the Bioshell 3.0 package and needs to be compiled (we recommend using the g++ ver. 4.9 compiler). After successful
package compilation, you will find the executable program surpass in the bin directory. Calling the program with the -h option displays all currently available simulation settings. SURPASS software runs on most Unix/Linux, Windows, and MacOS systems.
3 Methods
3.1 CABS-Flex Protocol for Fast Simulations of Protein Flexibility
The CABS-flex simulations of protein flexibility can be run using the CABS-flex web server [3] or the stand-alone package [4]. Here, we describe how to use the stand-alone package (brief information on using the web server is provided in Note 1). The only required CABS-flex input is a protein structure (in all-atom representation or as a C-alpha trace only). It may be provided as a file in PDB format or just as a PDB ID, which CABS-flex will use to download the appropriate file from the PDB database. To simulate only selected protein chains, write the appropriate chain symbols, for example, “AC,” after a colon. For example, to run CABS-flex for the protein with PDB ID 4w2o, use one of the following commands:
$ CABSflex -i 4w2o    # PDB ID variant for a known protein structure
$ CABSflex -i PATH/structure_file.pdb    # PATH is the location of structure_file.pdb on your local system
$ CABSflex -i 4w2o:AC    # select chain IDs if you do not want to use all of them
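The chain-selection syntax used in the last command can be illustrated with a short Python sketch. It is not taken from the CABS-flex source code: the function name split_input_spec is hypothetical, and the rule applied here (split at the last colon only when it is followed by letters or digits) is an assumption made for illustration.

```python
def split_input_spec(spec):
    """Split an input such as '4w2o:AC' into ('4w2o', ['A', 'C']).

    Assumption: chain symbols are single alphanumeric characters written
    after the last colon. Inputs without such a suffix select all chains,
    which is represented here by an empty list.
    """
    head, separator, tail = spec.rpartition(":")
    if separator and head and tail.isalnum():
        return head, list(tail)
    return spec, []
```

With this convention, "4w2o:AC" yields the structure "4w2o" with chains A and C, while "4w2o" and "PATH/structure_file.pdb" are returned unchanged with an empty chain list.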
Running the CABS-flex protocol on the user’s local machine with the default simulation settings is as simple as in the examples above. The default settings control the simulation parameters and distance restraints. They were derived by Jamroz et al. [8] and provide a consensus picture of protein fluctuations with all-atom Molecular Dynamics in aqueous solution for globular proteins. Table 1 contains the exact setup of the default simulation settings. A detailed description of the CABS-flex options is provided in the CABS-flex repository at https://bitbucket.org/lcbio/cabsflex/. Below, we comment only on selected options whose modification may have practical effects on the simulation outcome. The simulation length and the number of models in the output trajectory are controlled by the values assigned to the set of --mc options (-y, -s, -a); a detailed description of the sampling procedure controlled by these options has recently been provided in the review by Ciemny et al. [24]. With the -t option, the user sets the CABS simulation temperatures at the beginning (TINIT) and at the end (TFINAL) of the simulation. For example, the default setting “-t 1.4 1.4” introduces isothermal conditions at a temperature of 1.4. This parameter may be used to increase or decrease the amplitude of protein fluctuations.

Table 1 Default CABS-flex simulation settings
| Option | Short option | Parameters | Default value |
|---|---|---|---|
| --protein-restraints | -g | MODE GAP MIN MAX | ss2 3 3.8 8.0 |
| --temperature | -t | TINIT TFINAL | 1.4 1.4 |
| --replicas | -r | NUM | 1 |
| --mc-cycles | -y | NUM | 50 |
| --mc-steps | -s | NUM | 50 |
| --mc-annealing | -a | NUM | 20 |
| --protein-flexibility | -f | NUM or FILE | Not applicable |

The --protein-restraints option generates a set of binary distance restraints between Cα atoms. For example, the default setting of the --protein-restraints option, “ss2 3 3.8 8.0,” makes secondary structure elements more stable than in a simulation without any restraints (see the work by Jamroz et al. [8] for details). The option is controlled by four parameters:
• MODE (default: ss2) selects the subset of residues for which distance restraints are generated: all residues [all], only residues belonging to secondary structure elements [ss2], or residue pairs in which at least one residue belongs to a secondary structure element [ss1].
• GAP (default: 3) specifies the gap along the main chain (difference of indices) for two residues to be restrained.
• MIN (default: 3.8) and MAX (default: 8.0) define the minimum and maximum restraint length in Angstroms between two residues to be restrained.
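How the four parameters could combine to select residue pairs is sketched below in Python. This is an illustration only and not the CABS-flex implementation: the function name generate_restraints is hypothetical, GAP is interpreted here as the minimum difference of residue indices, a pair is kept when its Cα-Cα distance in the input structure lies between MIN and MAX, and the secondary structure is assumed to be given as one letter per residue (H or E for regular elements, any other letter for coil).

```python
import math


def generate_restraints(ca_coords, sec_struct, mode="ss2", gap=3,
                        min_d=3.8, max_d=8.0):
    """Return (i, j, distance) triples for residue pairs to be restrained.

    ca_coords  : sequence of (x, y, z) C-alpha coordinates, one per residue
    sec_struct : string with one letter per residue ('H'/'E' = regular
                 secondary structure, anything else = coil)
    """
    # Number of pair members that must lie in a secondary structure element
    required = {"all": 0, "ss1": 1, "ss2": 2}
    if mode not in required:
        raise ValueError("mode must be 'all', 'ss1', or 'ss2'")
    in_ss = [letter in ("H", "E") for letter in sec_struct]
    restraints = []
    n_residues = len(ca_coords)
    for i in range(n_residues):
        for j in range(i + gap, n_residues):
            if in_ss[i] + in_ss[j] < required[mode]:
                continue
            distance = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(ca_coords[i], ca_coords[j]))
            )
            if min_d <= distance <= max_d:
                restraints.append((i, j, distance))
    return restraints
```

Under these assumptions, switching from ss2 to ss1 or all can only add pairs, which is consistent with the default ss2 mode restraining secondary structure elements while leaving loops free to move.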
Additionally, protein flexibility (or rigidity, as defined in the CABS-flex web server [3]) can be modified with the -f option for selected protein residues (for the details, see Note 2). By default, the simulation results are written to three directories created in the current working directory:
• output_pdbs - Cα-trace of the initial structure (start.pdb), Cα-trace of the trajectory (replica.pdb), the ten top models (all-atom) in separate PDB files, and all models (Cα-trace) present in the ten densest clusters.
• output_data - RMSD of each trajectory frame to the reference structure (all_rmsds.txt); if no reference is given, the input structure is used as the reference.
• plots - data (.csv) and graphics (.svg) of the Energy vs. RMSD plot and the RMSF profile.
3.2 SURPASS Protocol for Fast Simulations of Protein Flexibility
The required SURPASS input is the protein structure of interest (all-atom, C-alpha trace only, or SURPASS representation) provided in PDB file format and a secondary structure assignment or prediction provided in ss2 file format. We recommend dssp for assigning secondary structure to a known protein structure or psipred for predicting secondary structure (both approaches are presented in Note 3). To run SURPASS using the protein with PDB ID 4w2o, type the following command:
$ ./surpass -in:pdb=4w2o.pdb -in:ss2=4w2o.ss2 -sample:t_start=0.2
Running the SURPASS program on the user’s local machine is quite simple, although it requires working from the command line. Table 2 contains the simulation settings with recommended parameter values. The length of the simulation and the number of models in the output trajectory are controlled by the values assigned to the set of -sample:mc_ options. With the -sample:t_start option alone, the user chooses the isothermal simulation scheme, which we recommend for studying the local dynamics of folded protein structures. For other applications that require enhanced sampling techniques, a simulated annealing scheme (additional options -sample:t_end and -sample:t_steps) or Replica Exchange Monte Carlo sampling (options -sample:exchanges and -sample:replicas) can be used. In contrast to CABS-flex, SURPASS does not automatically generate distance restraints based on the initial structure, but the user can load a two-column file with the indexes of the interacting residues. In this case, the information about the file with restraints should be provided in the configuration file of the SURPASS force field (surpass.wghts; for the details, see Note 4) in the section concerning the SurpassPromotedContact energy component. The user can modify the strength of the given contacts in relation to new contacts created during the simulation. Changing the default settings allows the user to adjust the level of flexibility of the structure. Moreover, SURPASS provides many additional options that, for example, load a reference structure for RMSD calculations, set the seed of the random number generator, or modify the sampling scheme (detailed descriptions are provided at https://bitbucket.org/lcbio/surpass/).

Table 2 Default SURPASS simulation settings
| Option | Parameters | Recommended value | Description |
|---|---|---|---|
| -in:pdb | PATH/FILE | Required, user provided | Target structure |
| -in:ss2 | PATH/FILE | Required, user provided | Target secondary structure |
| -in_pdb:native | PATH/FILE | User provided | Reference structure |
| -in:database | PATH | Working dir | Software database |
| -in:scfx | PATH/FILE | Working dir | Force field configuration |
| -sample:t_start | NUM (float) | 0.2 | Isothermal temperature |
| -sample:mc_outer_cycles | NUM (int) | 100 | Outer MC cycles |
| -sample:mc_inner_cycles | NUM (int) | 100 | Inner MC cycles |

By default, the simulation results are written to the working directory. There are four categories of output files:
1. tra.pdb - Simulation trajectory in PDB file format (frames in SURPASS representation).
2. energy.dat - Scoring file with energy components.
3. observers.dat - For each frame, contains RgSquare, Elapsed_time, and RMSD to the reference (or initial) structure.
4. topology.dat - For proteins containing beta-type secondary structure elements, assigns a topological pattern to each frame.
4 Case Studies
For the set of 140 globular proteins from the work of Jamroz et al. [10], we compared the predictions from CABS-flex, SURPASS, all-atom MD (deposited in the MoDEL database [25]), and ENM (using the DynOmics tool [7]) with NMR ensemble data. The input and reference structure for each protein was the first model from the NMR ensemble. CABS-flex, ENM, and SURPASS simulations were run with default settings. For each protein from the test set, we obtained and analyzed the following data:
• Ten models from CABS-flex (by default, CABS-flex outputs ten models obtained by structural clustering of a 10,000-model trajectory).
• Forty models from the DynOmics server [7] (two all-atom structures corresponding to each of 20 modes; for the details, see Note 5).
• One hundred models from the SURPASS isothermal simulation (by default, all models from the trajectory).
• Ten thousand models from the MD simulation (data were taken from the MoDEL library [25] as a Cα-only trajectory).
• At least ten models (depending on the protein) from NMR ensembles taken from the PDB database.
For all simulation methods and NMR ensembles, the flexibility data were analyzed using root mean square fluctuation (RMSF) profiles and compared to the structural variability in NMR
ensembles using Spearman’s rank correlation coefficient (rsNMR). In order to calculate these parameters, the Theseus tool [26] was used to superimpose the models on the reference structure (the first model of the NMR ensemble; for the details, see Note 6). Only Cα positions were used for structural alignment. In the case of SURPASS, the pseudo-atoms were superimposed on the reference structure converted into the simplified representation. The residue fluctuation profiles were calculated according to the formula:

\[ \mathrm{RMSF} = \sqrt{\frac{1}{N} \sum_{j}^{N} \left( x_{i(j)} - \langle x_i \rangle \right)^2} \]

where x_{i(j)} denotes the position (coordinates) of the ith Cα atom in the structure of the jth model, ⟨x_i⟩ denotes the averaged position of the ith Cα atom in all models obtained by a given method, and N is the number of models. Spearman’s rank correlation coefficient between each method and NMR data was calculated as follows:

\[ r_s^{\mathrm{NMR}} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n \left( n^2 - 1 \right)} \]

where d_i denotes the difference between the ranks of the variables for the position of the ith Cα atom, and n is the number of Cα atoms in the system. Tied ranks are taken into account as the arithmetic average of the ranks belonging to the same observations. Table 3 shows RMSF and rsNMR values (averaged over the entire set of 140 proteins). For each of the parameters in Table 3, the minimum (min), maximum (max), and average (ave) values are given.

Table 3 Average, minimum, and maximum values of RMSF and rsNMR for the benchmark set of 140 proteins
| Method | RMSF [Å] min | RMSF [Å] max | RMSF [Å] ave | rsNMR min | rsNMR max | rsNMR ave |
|---|---|---|---|---|---|---|
| NMR | 0.27 | 16.96 | 2.29 | 1.00 | 1.00 | 1.00 |
| MD | 1.12 | 9.61 | 3.15 | 0.18 | 0.91 | 0.62 |
| CABS-flex | 0.81 | 11.24 | 3.59 | 0.04 | 0.96 | 0.63 |
| DynOmics | 0.47 | 34.54 | 1.58 | 0.05 | 0.90 | 0.67 |
| SURPASS | 1.87 | 12.81 | 4.90 | 0.39 | 0.72 | 0.37 |
We compare protein models obtained using the MD, CABS-flex, DynOmics, and SURPASS computational methods and from NMR experimental data. RMSF is the averaged value of the fluctuation per residue. rsNMR is the Spearman’s rank correlation coefficient calculated between the fluctuation profile for each method and the reference NMR data. Minimum (min), maximum (max), and mean (ave) values are given for both parameters

The lowest average RMSF fluctuations were observed for the DynOmics server and the highest for the SURPASS model (twice as
high as for NMR). It should be noted that the SURPASS simulations did not include any distance restraints, so the structure was fully flexible. Moreover, the model currently has no component that depends directly on the amino acid sequence, which makes it less structurally rigid. Spearman’s rank correlation coefficient (rsNMR) is a measure of the statistical relationship between the predicted residue fluctuation profiles and the NMR ensemble data. For methods using medium- or high-resolution models, we observed a high (rsNMR > 0.5) or very high (rsNMR > 0.7, and in some cases > 0.9) correlation with NMR data (the highest rsNMR of 0.96 was obtained using CABS-flex for the 1P9C protein). For the low-resolution SURPASS model, we noted an average level of correlation (rsNMR > 0.3), although in some cases the level was as high as for the other methods (e.g., for the proteins 1W9R or 2IQ3). Table 4 contains a more detailed statistical analysis of rsNMR, dividing the results into two subsets: (1) average or weak correlation and (2) high or very high correlation. For each method, the percentage of proteins falling in a given range (counts) and the average value of the coefficient in that range are given. Figure 2 shows the correlation between the RMSF profiles of the different methods and the NMR ensembles for the 140 proteins from the test set. As demonstrated in Table 4 and Fig. 2, the largest shares of results that correlate significantly with NMR ensembles were provided by the DynOmics server (using ENM) and CABS-flex (more than 80% of proteins, with an average correlation coefficient of at least 0.7). Figure 3 presents a comparison of the residue fluctuation profiles for selected proteins. 1IQ3 and 1W9R are examples of proteins for which all methods showed a very high correlation to NMR data. 1SSG is an example of a small alpha protein for which all methods failed (see Note 7).
The expected structural flexibility for this and several other cases (1DS9, 1EO0, 1K8B, 1PAW, 2RGF) is much higher than the reference NMR data, which may be caused by underestimation of structure fluctuations in NMR ensembles [27]. Interestingly, 1K5K, 1KGG, 1WAZ, and 1RGF are also
Table 4 The mean of Spearman’s rank correlation coefficient (rsNMR) calculated between the MD, CABS-flex, DynOmics, and SURPASS methods and NMR data in two ranges of rsNMR values: less than 0.5 (average or weak correlation) and greater than or equal to 0.5 (high or very high correlation)
| Method | Counts (rsNMR < 0.5) | Mean rsNMR (rsNMR < 0.5) | Counts (rsNMR ≥ 0.5) | Mean rsNMR (rsNMR ≥ 0.5) |
|---|---|---|---|---|
| MD | 21% | 0.28 | 79% | 0.71 |
| CABS-flex | 19% | 0.34 | 81% | 0.70 |
| DynOmics | 14% | 0.37 | 86% | 0.72 |
| SURPASS | 71% | 0.30 | 29% | 0.59 |
Fig. 2 Spearman’s rank correlation coefficients of computational predictions using tested methods with NMR ensemble data (rsNMR) for 140 proteins. The graphs also show the mean value (blue dot) and standard deviation bars
Fig. 3 Residue fluctuation profiles for selected proteins (PDB code: 1W9R, 1IQ3) obtained using tested computational tools and from NMR ensembles. The curves on the plots were marked with colors: NMR, red; MD, blue; CABS-flex 2.0, pink; DynOmics (ENM), cyan; and SURPASS, green. The Spearman’s rank correlation coefficient (rsNMR) is given in brackets. The right panel shows sets of protein models obtained by NMR, CABS, and SURPASS methods
proteins for which the agreement of the predicted fluctuations with the experimental data was not high, but for all of them, the best result was given by the deeply simplified SURPASS model. The unrestrained mobility of SURPASS structures probably plays a crucial role here, and this feature can be important for studies of highly flexible molecules. For the studied set of 140 proteins, the CABS-flex method showed high correlation to experimental data (a Spearman’s rank correlation coefficient of about 0.7 for 81% of proteins). In most cases, MD provided similar results, but for a few proteins, MD predictions were much worse (1I35, 1OG7, 2HQI, 3CRD). In comparison to ENM-based tools (represented by the DynOmics tool in our tests), CABS-flex may be better suited for predicting dynamic behavior that is not obvious, for example, structure fluctuations within well-defined secondary structure elements or other non-collective motions (see the discussion in the review [1]). The low-resolution coarse-grained SURPASS model is less accurate than the other methods, which is probably related to the deep simplification of the interaction model, although the fluctuation profiles are quite realistic. This method showed a high or very high correlation to experimental NMR data for nearly 30% of proteins and an average correlation for the rest. Much higher correlation to experimental fluctuation profiles was obtained for proteins with a high content of secondary structure elements (1W9R, 2IQ7). In a few cases, the SURPASS model proved to be even better than the other methods (1KKG, 1WAZ). These results are very promising in terms of future SURPASS developments, which may focus on the use of experimental data in the modeling process (e.g., in the form of distance restraints, as in the CABS-flex method). Finally, a particular advantage of the tested simulation protocols is their low computational cost (in the range of minutes for CABS-flex or seconds for SURPASS using a standard CPU).
Both tested methods can be used as simulation engines in multiscale modeling protocols that merge fast conformational sampling at coarse-grained resolution with protein reconstruction methods [28] and more accurate modeling tools of higher resolution.
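The comparisons above are reported as Spearman's rank correlation coefficients between predicted and experimental per-residue fluctuation profiles. As a minimal sketch of that measure (not the scripts used in this work), the Python code below ranks two profiles, averaging the ranks of tied values, and computes the Pearson correlation of the ranks. The function names and the six-residue profiles are hypothetical and serve only as an illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("a constant profile has no defined rank correlation")
    return cov / sqrt(vx * vy)


# Hypothetical per-residue fluctuations (in Å) for a six-residue fragment.
nmr_profile = [0.8, 0.5, 0.4, 0.6, 1.2, 2.0]
predicted_profile = [1.0, 0.6, 0.5, 1.1, 0.9, 2.5]

rho = spearman(nmr_profile, predicted_profile)
print(round(rho, 3))  # 0.771
```

Because only the ordering of residues matters, the coefficient is insensitive to differences in the absolute scale of fluctuations between methods, which is convenient when comparing coarse-grained and all-atom predictions.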
5 Notes
1. The CABS-flex web server [3] supports computations similar to those described in this work (the stand-alone package can be more useful for users interested in advanced options or in large-scale computations for many systems). Using the web server, the user can provide an input structure by entering the ID of the protein from the Protein Data Bank (PDB) in the “PDB code” text box or by uploading a file in PDB format from a local hard drive in the “PDB file” box. The user can
Aleksandra E. Badaczewska-Dawid et al.
optionally provide protein chain identifier(s) and a project name, as well as an email address. Before the CABS-flex simulation begins, the server prepares a set of default distance restraints based on the input conformation. The user can also create additional restraints or modify the initial ones using the tabs on the right, marked A and B in Fig. 4. The third tab, marked C, allows the user to set a few advanced simulation options. A preview of all additional options is given in Fig. 4. More details on using the server can be found in the web server publication [3] or in the online server documentation at http://biocomp.chem.uw.edu.pl/CABSflex2/.
Fig. 4 Interface of the CABS-flex 2.0 web server [3]. The interface allows the user to modify or introduce simulation parameters and distance restraints
2. Besides introducing or changing distance restraints, CABS-flex makes it possible to modify the structural flexibility of selected protein residues with the -f or --protein-flexibility option. To set up such a simulation, the user should prepare and load a configuration file telling CABS which fragment to modify and how much flexibility is needed. The configuration file, for example 4w2o.inp, contains one or more lines of the form:
45:A - 51:A 0 #start to end of the fragment and flexibility value
The flexibility value of selected protein residues can be set as follows:
(a) 0 - Fully flexible backbone
(b) 1 - Almost stiff backbone (default value, given an appropriate number of protein restraints)
(c) >1 - Increased stiffness
(d) - All protein residues will be assigned flexibility equal to this number.
(e) bf - Flexibility for each residue is read from the beta factor column of the Cα atom in the PDB input file.
(f) bfi - Each residue is assigned its flexibility based on the inverted beta factors stored in the input PDB file.
(g) - Flexibility is read from a file in the format of single residue entries, e.g., 12:A 0.75, or residue ranges, e.g., 12:A–15:A 0.75.
More details are provided in the CABS-flex repository available at https://bitbucket.org/lcbio/cabsflex/.
3. The SURPASS model is based on a specific averaging of secondary structure fragments; therefore, an assignment or prediction of the secondary structure must be provided in the ss2 format. If the spatial structure of the protein is known, we recommend using dssp to assign the secondary structure. The program is free and available to download at https://swift.cmbi.umcn.nl/gv/dssp/DSSP_3.html. The program is executed from the command line with a simple command:
$ mkdssp -i 4w2o.pdb -o 4w2o.dssp
Then, the output .dssp file can be converted into the required .ss2 format using a ready-made program from the Bioshell package (ap_dssp_to_ss2), which can be found in the bin directory:
$ ./ap_dssp_to_ss2 4w2o.dssp > 4w2o.ss2
If the protein structure is not known, we recommend using a secondary structure predictor such as psipred, which is free and available to download from the GitHub repository https://github.com/psipred/psipred. On a local machine, the program can be executed from the command line:
$ ./runpsipred.local 4w2o.fasta > 4w2o.horiz
One of the default output files will be the predicted secondary structure in .ss2 format. Note that only a protein sequence in FASTA format is required. Higher prediction accuracy can be achieved by using a consensus of various methods.
4. The knowledge-based SURPASS force field consists of several components that make up the total energy. Depending on the application, some potentials may be deactivated, or the user may change their scaling factors and parameter values or add distance restraints. The default configuration file, surpass.wghts, is located in ~/data/forcefield/ in a local copy of the Bioshell package. The file can be copied to a working directory and modified as needed. Below is a preview of the configuration file with a description of the required parameters.
./surpass.wghts
# R12 is a harmonic energy for pseudo-bonds.
SurpassR12 1.0 forcefield/local/R12_surpass.dat 0.001
# R13 is a term that controls distance between ith and (i+2) atoms
SurpassR13 1.0 forcefield/local/R13_surpass.dat 0.001
# R14 is a term that controls distance between ith and (i+3) atoms
SurpassR14 1.0 forcefield/local/R14_surpass.dat 0.001
# R15 is a term that controls distance between ith and (i+4) atoms
SurpassR15 5.0 forcefield/local/R15_surpass.dat 0.001
# A13 is a term that controls planar angle between ith and (i+2) atoms
SurpassA13 0.0 forcefield/local/A13_surpass.dat 0.001
# SurpassHelixStifnessEnergy is a term that controls helix stiffness
SurpassHelixStifnessEnergy 5.0
# SurpassCentrosymetricEnergy is forcing the presence of 50% of the residues at a predetermined distance
SurpassCentrosymetricEnergy 1.0
# SurpassLocalRepulsionEnergy is forcing the presence of at most 6 (E), 4 (C), and 2 (H) residues in the local repulsion sphere
SurpassLocalRepulsionEnergy 1.0
# SurpassHydrogenBond calculates hydrogen bond energy only between atoms in B-sheets
SurpassHydrogenBond 10.0
# SurpassContactEnergy keeps excluded volume (repulsion) and contacts (attraction); parameters: weight, high_energy, low_energy, and contact_shift
SurpassContactEnergy 1.0 100.0 -0.5 0.01
# SurpassPromotedContact promotes listed contacts; parameters: weight, high_energy, low_energy, contact_shift, restraints file, and promote_weight
SurpassPromotedContact 1.0 100.0 -0.5 0.01 PATH/surpass.contacts 5.0
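Editing the weights file by hand is straightforward, but scanning several scaling factors is easier to script. The Python sketch below is an illustration only: it assumes the layout visible in the preview above (comment lines starting with #, then one line per energy term with the term name, its weight, and any further parameters) and is not the parser used by Bioshell. The helper names parse_weights, set_weight, and format_weights are hypothetical, and the assumption that a weight of 0.0 switches a term off is inferred from the default SurpassA13 entry.

```python
def parse_weights(text):
    """Parse weights-file text into a list of (term, weight, extra_params)."""
    terms = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"expected a term name and a weight: {raw!r}")
        terms.append((fields[0], float(fields[1]), fields[2:]))
    return terms


def set_weight(terms, name, new_weight):
    """Return a copy of the term list with the weight of one term replaced."""
    if name not in [term for term, _, _ in terms]:
        raise KeyError(name)
    return [(term, new_weight if term == name else weight, params)
            for term, weight, params in terms]


def format_weights(terms):
    """Write the terms back as text (comments are not preserved)."""
    return "\n".join(" ".join([term, str(weight)] + params)
                     for term, weight, params in terms)


# A shortened excerpt of the preview shown in this note.
EXAMPLE = """\
# R12 is a harmonic energy for pseudo-bonds.
SurpassR12 1.0 forcefield/local/R12_surpass.dat 0.001
# SurpassHelixStifnessEnergy is a term that controls helix stiffness
SurpassHelixStifnessEnergy 5.0
SurpassContactEnergy 1.0 100.0 -0.5 0.01
"""

terms = parse_weights(EXAMPLE)
# Assumption: a zero weight removes the term's contribution to the total energy.
relaxed = set_weight(terms, "SurpassHelixStifnessEnergy", 0.0)
print(format_weights(relaxed))
```

Working on a copy of the term list keeps the default values available for comparison; the modified text would be written to a file in the working directory rather than over the default file in the Bioshell tree.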
5. The DynOmics server (http://enm.pitt.edu/) can generate two all-atom structures along each of 20 modes at a given RMSD. The returned structures correspond to the extremes of amplitude at the given distance (in Å). For this purpose, use the following option on the Main result tab: Molecular Motions → Full Atomic Structures for ANM-Driven Conformers → Motion along mode (1–20) with RMSD: (2 Å).
6. Theseus (https://theobald.brandeis.edu/theseus/) is an efficient program for superpositioning multiple macromolecular structures using the method of maximum likelihood. The program is executed from the command line with a simple command:
$ ./theseus reference.pdb target.pdb
Among the output files, there are two in PDB format:
(a) theseus_ave.pdb - artificially averaged structure (single structure)
(b) theseus_sup.pdb - original structures superimposed on the reference (multiple structures)
Note that in our studies, the reference structure for both the superposition and all calculations was the first model from the NMR ensemble, not the “artificially” averaged structure generated by Theseus.
7. Example residue fluctuation profiles are presented in Fig. 5.
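The flexibility file entries described in Note 2 (a residue or a residue range, followed by a flexibility value) can be checked before a run with a short script. The Python sketch below expands such entries into per-residue values. It is an illustration of the entry format quoted in Note 2, not the parser used by CABS-flex; the function name parse_flexibility is hypothetical, and the sketch assumes integer residue numbers, single-character chain identifiers, and ranges that stay within one chain.

```python
import re

# One entry: "45:A - 51:A 0", "12:A–15:A 0.75" or "12:A 0.75".
ENTRY = re.compile(
    r"^(\d+):(\w)"                  # first residue and chain, e.g. 45:A
    r"(?:\s*[-–]\s*(\d+):(\w))?"    # optional end of range, e.g. - 51:A
    r"\s+(\S+)$"                    # flexibility value
)


def parse_flexibility(text):
    """Expand flexibility entries into {(chain, residue_number): value}."""
    flexibility = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        match = ENTRY.match(line)
        if match is None:
            raise ValueError(f"unrecognised entry: {raw!r}")
        first, chain, last, last_chain, value = match.groups()
        if last is None:  # single-residue entry
            last, last_chain = first, chain
        if last_chain != chain or int(last) < int(first):
            raise ValueError(f"bad residue range: {raw!r}")
        for number in range(int(first), int(last) + 1):
            flexibility[(chain, number)] = float(value)
    return flexibility


# Entries in the forms quoted in Note 2 (the chain B line is hypothetical).
EXAMPLE = """\
45:A - 51:A 0 #start to end of the fragment and flexibility value
12:A 0.75
20:B–22:B 2
"""

flexibility = parse_flexibility(EXAMPLE)
print(len(flexibility))  # 11
```

Expanding ranges to individual residues makes overlapping entries easy to spot, since a later entry simply overwrites the value assigned by an earlier one in this sketch.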
Acknowledgments
A.E.B-D, A.K., and S.K. received funding from NCN Poland, Grant MAESTRO 2014/14/A/ST6/00088.
Fig. 5 Example residue fluctuation profiles computed using tested tools (MD, blue; CABS-flex, pink; DynOmics, cyan; SURPASS, green) and from NMR ensembles (red). The numbers on the chart show the Spearman’s rank correlation coefficient between the computational method (see corresponding color) and the NMR data
References
1. Kmiecik S, Kouza M, Badaczewska-Dawid AE, Kloczkowski A, Kolinski A (2018) Modeling of protein structural flexibility and large-scale dynamics: coarse-grained simulations and elastic network models. Int J Mol Sci 19(11):3496. https://doi.org/10.3390/ijms19113496
2. Kmiecik S, Gront D, Kolinski M, Wieteska L, Dawid AE, Kolinski A (2016) Coarse-grained protein models and their applications. Chem Rev 116:7898–7936. https://doi.org/10.1021/acs.chemrev.6b00163
3. Kuriata A, Gierut AM, Oleniecki T, Ciemny MP, Kolinski A, Kurcinski M, Kmiecik S (2018) CABS-flex 2.0: a web server for fast simulations of flexibility of protein structures. Nucleic Acids Res 46(W1):W338–W343. https://doi.org/10.1093/nar/gky356
4. Kurcinski M, Oleniecki T, Ciemny MP, Kuriata A, Kolinski A, Kmiecik S (2019) CABS-flex standalone: a simulation environment for fast modeling of protein flexibility. Bioinformatics 35:694–695. https://doi.org/10.1093/bioinformatics/bty685
5. Dawid AE, Gront D, Kolinski A (2017) SURPASS low-resolution coarse-grained protein modeling. J Chem Theory Comput 13:5766–5779. https://doi.org/10.1021/acs.jctc.7b00642
6. Dawid AE, Gront D, Kolinski A (2018) Coarse-grained modeling of the interplay between secondary structure propensities and protein fold assembly. J Chem Theory Comput 14:2277–2287. https://doi.org/10.1021/acs.jctc.7b01242
7. Li H, Chang YY, Lee JY, Bahar I, Yang LW (2017) DynOmics: dynamics of structural proteome and beyond. Nucleic Acids Res 45:W374–W380. https://doi.org/10.1093/nar/gkx385
8. Jamroz M, Orozco M, Kolinski A, Kmiecik S (2013) Consistent view of protein fluctuations from all-atom molecular dynamics and coarse-grained dynamics with knowledge-based force-field. J Chem Theory Comput 9(1):119–125. https://doi.org/10.1021/ct300854w
9. Jamroz M, Kolinski A, Kmiecik S (2013) CABS-flex: server for fast simulation of protein structure fluctuations. Nucleic Acids Res 41:W427–W431. https://doi.org/10.1093/nar/gkt332
10. Jamroz M, Kolinski A, Kmiecik S (2014) CABS-flex predictions of protein flexibility compared with NMR ensembles. Bioinformatics 30(15):2150–2154. https://doi.org/10.1093/bioinformatics/btu184
11. Kurcinski M, Kolinski A, Kmiecik S (2014) Mechanism of folding and binding of an intrinsically disordered protein as revealed by ab initio simulations. J Chem Theory Comput 10(6):2224–2231. https://doi.org/10.1021/ct500287c
12. Kmiecik S, Kolinski A (2007) Characterization of protein-folding pathways by reduced-space modeling. Proc Natl Acad Sci 104:12330–12335. https://doi.org/10.1073/pnas.0702265104
13. Kmiecik S, Kolinski A (2008) Folding pathway of the B1 domain of protein G explored by multiscale modeling. Biophys J 94(3):726–736. https://doi.org/10.1529/biophysj.107.116095
14. Kmiecik S, Gront D, Kouza M, Kolinski A (2012) From coarse-grained to atomic-level characterization of protein dynamics: transition state for the folding of B domain of protein A. J Phys Chem B 116:7026–7032. https://doi.org/10.1021/jp301720w
15. Zambrano R, Jamroz M, Szczasiuk A, Pujols J, Kmiecik S, Ventura S (2015) AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures. Nucleic Acids Res 43(W1):W306–W313. https://doi.org/10.1093/nar/gkv359
16. Kuriata A, Iglesias V, Pujols J, Kurcinski M, Kmiecik S, Ventura S (2019) Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility. Nucleic Acids Res 47(W1):W300–W307. https://doi.org/10.1093/nar/gkz321
17. Gil-Garcia M, Bañó-Polo M, Varejão N, Jamroz M, Kuriata A, Díaz-Caballero M, Lascorz J, Morel B, Navarro S, Reverter D, Kmiecik S, Ventura S (2018) Combining structural aggregation propensity and stability predictions to redesign protein solubility. Mol Pharm 15:3846–3859. https://doi.org/10.1021/acs.molpharmaceut.8b00341
18. Kuriata A, Iglesias V, Kurcinski M, Ventura S, Kmiecik S (2019) Aggrescan3D standalone package for structure-based prediction of protein aggregation properties. Bioinformatics 35(19):3834–3835. https://doi.org/10.1093/bioinformatics/btz143
19. Kurcinski M, Jamroz M, Blaszczyk M, Kolinski A, Kmiecik S (2015) CABS-dock web server for the flexible docking of peptides to proteins without prior knowledge of the binding site. Nucleic Acids Res 43(W1):W419–W424. https://doi.org/10.1093/nar/gkv456
20. Blaszczyk M, Kurcinski M, Kouza M, Wieteska L, Debinski A, Kolinski A, Kmiecik S (2016) Modeling of protein-peptide interactions using the CABS-dock web server for binding site search and flexible docking. Methods 93:72–83. https://doi.org/10.1016/j.ymeth.2015.07.004
21. Ciemny MP, Kurcinski M, Kozak K, Kolinski A, Kmiecik S (2017) Highly flexible protein-peptide docking using CABS-dock. Methods Mol Biol 1561:69–94. https://doi.org/10.1007/978-1-4939-6798-8_6
22. Kurcinski M, Ciemny MP, Oleniecki T, Kuriata A, Badaczewska-Dawid AE, Kolinski A, Kmiecik S (2019) CABS-dock standalone: a toolbox for flexible protein-peptide docking. Bioinformatics 35(20):4170–4172. https://doi.org/10.1093/bioinformatics/btz185
23. Webb B, Sali A (2016) Comparative protein structure modeling using MODELLER. Curr Protoc Bioinforma 54:5.6.1–5.6.37. https://doi.org/10.1002/cpbi.3
24. Ciemny MP, Badaczewska-Dawid AE, Pikuzinska M, Kolinski A, Kmiecik S (2019) Modeling of disordered protein structures using Monte Carlo simulations and knowledge-based statistical force fields. Int J Mol Sci 20(3):606. https://doi.org/10.3390/ijms20030606
25. Meyer T, D’Abramo M, Hospital A, Rueda M, Ferrer-Costa C, Pérez A, Carrillo O, Camps J, Fenollosa C, Repchevsky D, Gelpí JL, Orozco M (2010) MoDEL (molecular dynamics extended library): a database of atomistic molecular dynamics trajectories. Structure 18:1399–1409. https://doi.org/10.1016/j.str.2010.07.013
26. Theobald DL, Wuttke DS (2006) THESEUS: maximum likelihood superpositioning and analysis of macromolecular structures. Bioinformatics 22:2171–2172. https://doi.org/10.1093/bioinformatics/btl332
27. Spronk CAEM, Nabuurs SB, Bonvin AMJJ, Krieger E, Vuister GW, Vriend G (2003) The precision of NMR structure ensembles revisited. J Biomol NMR 25:225–234. https://doi.org/10.1023/A:1022819716110
28. Badaczewska-Dawid AE, Kolinski A, Kmiecik S (2020) Computational reconstruction of atomistic protein structures from coarse-grained models. Comput Struct Biotechnol J 18:162–176. https://doi.org/10.1016/j.csbj.2019.12.007
Daisuke Kihara (ed.), Protein Structure Prediction, Methods in Molecular Biology, vol. 2165, https://doi.org/10.1007/978-1-0716-0708-4, © Springer Science+Business Media, LLC, part of Springer Nature 2020