Homology Modeling: Methods and Protocols 1071629735, 9781071629734

This detailed volume provides state-of-the-art methodologies and reviews of important topics in the field of homology mo

English Pages 378 [379] Year 2023

Table of contents :
Preface
Contents
Contributors
Chapter 1: Homology Modeling in the Twilight Zone: Improved Accuracy by Sequence Space Analysis
1 Introduction
2 Materials
2.1 Databases
2.2 Sequence Analysis
2.3 Template Mining
2.4 Secondary Structure Prediction
2.5 Automated Structure Prediction
2.6 Customized Structure Prediction
2.7 Model Validation
2.8 Graphical Analysis
3 Methods
3.1 Our Case Study
3.2 Template Search by PDB Mining
3.3 Template Search with LOMETS
3.4 Sequence Space Investigation
3.5 Template Analysis
3.6 Sequence Alignment
3.7 Loop Modeling
3.8 Model Building
3.9 Concluding Remarks
4 Notes
References
Chapter 2: Illuminating the “Twilight Zone”: Advances in Difficult Protein Modeling
1 Introduction
1.1 Protein Structure Determination
1.2 Prediction from First Principles
2 Template-Based Prediction
3 Advances in Low-Homology Modeling
3.1 Applications of Machine Learning
3.2 Inter-Residue Distance Predictions
3.3 Recent Improvements
4 Scoring Functions
4.1 Categories of Scoring Functions
4.2 Overview of Available Tools
4.3 The Most Recent Advances
4.4 Limitations
5 Conclusions
References
Chapter 3: Contact-Assisted Threading in Low-Homology Protein Modeling
1 Introduction
2 Materials
2.1 Template Library
2.2 Query and Template Feature Set
2.2.1 Sequence Profiles
2.2.2 Secondary Structures
2.2.3 Solvent Accessibility
2.2.4 Backbone Dihedral Angles
2.2.5 Additional Features
2.3 Threading Performance Measure
3 Methods
3.1 Overview of Protein Threading
3.1.1 Threading Scoring Function
3.1.2 Template Selection
3.1.3 Optimal Query-Template Alignment
3.2 Contact-Assisted Protein Threading
3.2.1 Residue-Residue Contact Map
3.2.2 Contact Map Alignment
3.3 Overview of Existing Contact-Assisted Threading Methods
3.3.1 Threading Methods That Implicitly Use Contact Information via Pairwise Contact Potential
3.3.2 Threading Methods That Explicitly Use Contact Information via Predicted Residue-Residue Contacts
3.4 Significance of Contact Maps Quality in Threading
3.5 Growth of Protein Sequence Databases and Its Implication in Threading
3.6 Discussion
4 Notes
References
Chapter 4: Omics and Remote Homology Integration to Decipher Protein Functionality
1 Evolution of Protein Structure Modeling
2 Comparative Genomics Among Other “Omics”
3 Natural Selection
4 Scientific Advances Boosted by Comparative Genomics: Protein Structures Integration
4.1 Insights in Cancer, Longevity, and Immunity Field
4.2 Insights in Mitochondrial Function and Related Diseases
4.3 Insights in Venom Proteins and Drug Design
5 Revolutionary Use of Homology Prediction in Biotechnology: Biosensors and Recombinant Proteins
6 Conclusion
References
Chapter 5: Easy Not Easy: Comparative Modeling with High-Sequence Identity Templates
1 Introduction
2 Conformational Diversity
3 High-Sequence Identity Does Not Guarantee an Accurate Model
3.1 A Single Amino Acid Change (Minimal Sequence Changes) May Produce a Huge Conformational Change
3.2 Candidate Templates with Conformational Diversity
3.3 Structural Divergence Within a Protein Family
4 Does My Query Protein Have Conformational Diversity?
4.1 Modeling Different Conformations of a Protein
4.2 When Do I Need to Model Multiple Conformations?
5 Conclusions
6 Notes
References
Chapter 6: Quality Estimates for 3D Protein Models
1 Introduction
2 Estimates of Model Accuracy (EMA) Are Essential for Template-Based Modeling (TBM) and Template-Free Modeling (FM)
3 Methods for Estimates of Model Accuracy
4 Observed Model Accuracy Scoring
5 EMA Classification
6 ModFOLD: A Leading EMA Web Server
6.1 ModFOLD History
6.1.1 The Initial Construction of ModFOLD
6.1.2 ModFOLDclustQ for Speed, Accuracy, and Consistency
6.1.3 The Quasi-Single-Model Approach
6.2 Latest Versions of ModFOLD
7 EMA in Community-Wide Experiments
8 Recent Advances in EMA Methods
References
Chapter 7: Using Local Protein Model Quality Estimates to Guide a Molecular Dynamics-Based Refinement Strategy
1 Introduction
1.1 The Local Quality Estimation of 3D Models
1.2 The Refinement of the Predicted 3D Models
1.3 The ReFOLD Server
2 Materials
3 Methods
3.1 The Performance of the Local Quality Assessment Guided MD-Based Protocol
3.1.1 ModFOLD6 in the Refinement Pipeline
4 Notes
5 Addendum
References
Chapter 8: Specificities of Modeling of Membrane Proteins Using Multi-Template Homology Modeling
1 Introduction
2 Materials
3 Methods
3.1 Template Search and Identification
3.2 Sequence and Structural Alignments
3.3 Multi-Template Homology Modeling with RosettaCM
3.4 High-Resolution Refinement
3.5 Preparation of the Ligand
3.6 Ligand Docking Using RosettaLigand
4 Notes
References
Chapter 9: Homology Modeling of the G Protein-Coupled Receptors
1 Introduction
1.1 G Protein-Coupled Receptors
1.2 Homology Modeling
1.3 Sequence Alignment
2 Methods. Where to Start?
2.1 Template Coverage
2.2 Where to Get the Template from?
2.3 Model Refinement
3 Conclusions
4 Notes
References
Chapter 10: Modeling of Olfactory Receptors
1 Introduction
2 Materials
2.1 Sequence Comparison and Alignment
2.2 3D Structure Building
2.3 Ligand Docking
2.4 Membrane Embedding
2.5 Molecular Dynamics
3 Methods
3.1 Collection of the Templates
3.2 Structure-Based Sequences Alignment
3.3 Construction of the Model
3.4 3D Model Analysis and Validation
3.5 Building the Complexes
3.6 Membrane Embedding
3.7 Molecular Dynamics Simulation
4 Notes
References
Chapter 11: Analyses of Mutation Displacements from Homology Models
1 Introduction
1.1 Which Mutations Are We Observing?
2 Effects of Mutations
2.1 Effect on the Function
2.2 Global Structural Effect
2.3 Side Chain and Stability Effect
2.4 Backbone Effect
3 Methods
3.1 The Three Mutants
3.2 Homology Model of the Wild Type
3.3 FoldX and Missense3D Models and Analyses
3.4 Dynamut and Rosetta Backrub Analyses
4 Conclusion
References
Chapter 12: Persistent Homology for RNA Data Analysis
1 Introduction
2 Method
2.1 Topological Representations for RNA
2.1.1 Simplicial Complex
2.1.2 Hypergraph
2.2 Persistent Homology for RNA Data Analysis
2.2.1 Persistent Homology
Homology
Vietoris-Rips Complex and Filtration
Persistent Homology
2.2.2 Persistent Homology Based Models and Functions
Persistent Betti Number
Persistent Entropy
Persistent Similarity
2.2.3 Weighted Persistent Homology for RNA Representation
Physics-Aware WPH Models
Element-Specific WPH
2.3 Persistent Spectral Theory for RNA Data Analysis
2.3.1 Spectral Graph
2.3.2 Spectral Simplicial Complex
2.3.3 RNA-Based Persistent Spectral Models
2.4 Persistent Models Based Machine Learning Models for RNA Data Analysis
3 Notes
References
Chapter 13: Computational Methods to Predict Intrinsically Disordered Regions and Functional Regions in Them
1 Introduction
2 IDR Prediction Methods
2.1 Scoring Function Methods
2.1.1 Uversky Plot and FoldIndex
2.1.2 IUpred
2.2 Machine Learning Methods
2.2.1 DISOPRED
2.2.2 SPOT-Disorder
2.2.3 PONDR
2.3 Consensus Methods
2.3.1 MobiDB-lite
2.3.2 MetaDisorder
3 Case Study
3.1 NeProc
3.2 Prediction by NeProc
References
Chapter 14: Homology Modeling of Transporter Proteins
1 Introduction
2 Materials
3 Methods
3.1 Template Identification and Selection
3.2 Target-Template Alignments
3.3 Model Building and Refinements
3.4 Model Validation
4 Notes
References
Chapter 15: Modeling of SARS-CoV-2 Virus Proteins: Implications on Its Proteome
Abbreviations
1 Introduction
2 Protein Modeling
2.1 Template-Based Structure Prediction
2.2 Ab Initio Structure Prediction
3 The SARS-CoV-2 Proteome
3.1 Nonstructural Proteins (Nsp)
3.2 Structural Proteins
3.3 Accessory Proteins
4 Models of Important SARS-CoV-2 Viral Proteins
4.1 Homology Modeling of Proteins Where High-Resolution Experimental Structures Are Available: Comparison of Homology Models w...
4.1.1 Nsp1
4.1.2 3CL-pro (Nsp5)
4.1.3 S Protein
4.1.4 N Protein
4.1.5 RdRp (Nsp12)
4.2 Homology Modeling of Envelope (E) Protein Where High-Resolution Structure Is Unavailable: Comparison Between Different Hom...
4.3 Ab Initio Protein Modeling of Membrane Protein (M) Where No Experimental Structure Is Available
5 Functional Implications of Protein Modeling
5.1 Protein-Protein Interactions
5.2 Understanding the Functionality of Proteins
5.3 Binding Site Predictions
5.4 Molecular Docking
5.4.1 ATP Binding Sites on 3CL-pro
5.4.2 Nsp7-nsp8 Primase Complex
5.5 Insights into Viral Replication Machinery
5.5.1 Nsp7-Nsp8-Nsp12 Replication Machinery
6 Summary of Methods
6.1 Steps of Homology Modeling
6.2 ATPbind Steps
6.3 Molecular Docking of ATP and 3CL-pro
References
Chapter 16: Homology Modeling of Antibody Variable Regions: Methods and Applications
1 Introduction
1.1 Antibody Structure
1.2 Antibody Variable Region
1.3 Homology Modeling of Antibody Variable Region
2 Materials and Methods
2.1 Identification of CDRs and FRs in the Input Sequences
2.2 Identification and Selection of Template Structures for CDRs and FRs
2.3 Optimization of the Initial VL and VH Orientation
2.4 Grafting of CDR Templates, Building CDR H3, and Assembling Initial Model
2.5 Side-Chain Optimization and Final Refinement
3 Applications and Further Developments
4 Notes
References
Chapter 17: 3D-BMPP: 3D Beta-Barrel Membrane Protein Predictor
1 Introduction
2 Materials
2.1 Additional Equipment
2.2 Equipment Setup
3 Methods
4 Conclusion
5 Notes
References
Chapter 18: Protein Homology Modeling for Effective Drug Design
1 Introduction
2 Materials and Methods
2.1 The Template Selection
2.2 The Sequence Alignment and Sources of Inaccuracies in the Models
2.3 Model Building and Quality Checking
2.4 Case of Modeling the Bacterial Enzyme OatA with Shallow Binding Site for Drug Design
3 Notes
References
Chapter 19: Specificities of Protein Homology Modeling for Allosteric Drug Design
1 Introduction
1.1 Allosteric Effects
1.2 Classification of Allosteric Ligands
2 Materials and Methods
2.1 Modeling of N-Terminal Part of CB1 Receptor
2.2 Ligand-Guided Modeling of the Allosteric Site in GABA Receptor
2.3 Homology Modeling of Potassium Channels
3 Notes
References
Chapter 20: Modeling of Protein Complexes
1 Introduction
2 Materials
2.1 Resources for Identification of Structural Homologs and Domain Boundaries
2.2 Resources for Generation of Homology Models
2.3 Resources for Evaluation of Protein Contact Interfaces
2.4 Resources to Compute Docking Models of Protein-Protein Interactions
2.5 Resources to Integrate Multiple Experimental and Computational Data into Molecular Models
2.6 Resources for Model Building, Visualization, and Optimization
2.7 Resources for Model Validation
3 Methods
3.1 Generate Homology Models of Individual Proteins
3.2 Find Reliable Data About Interaction Surfaces
3.3 Combine Homology Models and Contact Interface Information to Generate the Complex Model
3.4 Postprocessing and Validation: Model Adjustment and Final Evaluation
3.5 Final Considerations and Take-Home Messages
4 Notes
References
Index
Methods in Molecular Biology 2627

Sławomir Filipek  Editor

Homology Modeling Methods and Protocols

METHODS IN MOLECULAR BIOLOGY

Series Editor: John M. Walker, School of Life and Medical Sciences, University of Hertfordshire, Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in a readily reproducible, step-by-step fashion, opening with an introductory overview and a list of the materials and reagents needed to complete the experiment, followed by a detailed procedure that is supported by a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

Edited by

Sławomir Filipek Faculty of Chemistry, Biological and Chemical Research Centre, University of Warsaw, Warsaw, Poland

ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-2973-4 ISBN 978-1-0716-2974-1 (eBook) https://doi.org/10.1007/978-1-0716-2974-1 © Springer Science+Business Media, LLC, part of Springer Nature 2023 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Cover Illustration Caption: Image created by Jakub Jakowiecki. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

Preface

The method of modeling by homology has been used successfully since the 1990s, relying on the relationship between the evolution of sequence and the conservation of structure. Homology modeling is employed in a variety of ways; however, it is very important to know where the limit of confidence lies. Therefore, the purpose of this book is not only to present state-of-the-art methodologies but also to discuss where they are most useful and where the “twilight zone” is, wherein other methods such as “threading” can help. Nowadays, homology modeling has become an almost automatic method for everyday use, even for inexperienced users, thanks to multiple Web servers. Therefore, there is a danger that black box treatment can easily produce results that are not particularly useful and sometimes even wrong. In such cases, the advice, suggestions, recommendations, and warnings included in this book could help to obtain reliable results.

In recent years, the deep learning methodology in AlphaFold has produced structural models for almost all proteins from gene sequences from all currently known genomes. This is a great achievement in modeling the structure of proteins, but it is certainly not the end of modeling. There are hard cases even for AlphaFold: (i) many proteins exist in more than one conformation, especially receptors and transporters, (ii) there are disordered parts of protein structure which take shape only in specific conditions, and (iii) there are protein complexes, the structures of which are very difficult to determine and which, additionally, can be dynamic.

The current book contains both detailed procedures and reviews of important topics in the field of homology modeling. The first chapter, by Boubaker et al., deals with homology modeling in the twilight zone and improving accuracy through sequence space analysis.
The challenge in identifying the correct templates in the twilight zone is that they should not only have similar sequences but also preserve the query structure; such templates can be found only by profile-profile mining methods followed by multidimensional scaling. Additionally, this chapter compares automated modeling via Web servers to a custom model to show how modeling can be improved with user-added information.

The next chapter, by Bartuzi et al., also illuminates the twilight zone and analyzes the advances in protein modeling in difficult cases. The homology modeling limit was long assumed to be around 20–30% sequence identity; however, the recent CASP and CAMEO community-wide modeling assessments have shown that much more information can be deduced from protein sequences when proper methodology is used. Particular attention in this chapter is devoted to the improvement of machine learning methods and the evaluation of the obtained models.

The following chapter, by Bhattacharya et al., also deals with low-homology protein modeling and describes the contact-assisted threading method. The progress in this area is based on the improved accuracy of deep learning-based inter-residue contact map predictors. This method employs contact maps that encode information about the interatomic interactions. Such maps can be used for highly accurate prediction of protein structures even for proteins that are difficult cases for classical homology modeling. The current limitations and future prospects of these methods are also discussed.

The chapter by Silva and Antunes deals with “omics” technologies (not only genomics but also transcriptomics, proteomics, metabolomics, and epigenomics), which rely on bioinformatics for data integration and analysis. In the chapter, the authors assess various approaches for integration of different omics fields and remote homology modeling to predict protein structure and function. Such an approach could enable massive discovery of novel recombinant proteins with multiple biotechnological and other applications.

Zea et al.’s chapter nicely illustrates the idea of conformational diversity of proteins and how to incorporate this feature in template-based modeling. Homology modeling using templates with high sequence identity may seem easy at first glance, but such proteins have a complexity that is not evident at the sequence level, and their structures can be very difficult to predict.

One of the main problems in the field of protein structure prediction has been estimating the accuracy of the predicted 3D models, both locally and globally, in the absence of known structures. To address this problem, Maghrabi et al. present their new model quality assessment (MQA) method called ModFOLD. This method has been ranked as one of the most accurate MQA tools in independent blind evaluations. This chapter discusses the quality assessment of models in the field of protein modeling, showing both its strengths and limitations, and introduces some of the best, newest methods.

The refinement of predicted 3D models based on molecular dynamics (MD) simulations using different types of restraints provides good results but suffers from the absence of a reliable guidance mechanism to reach consistent improvement. Adiyaman and McGuffin propose to utilize the local quality assessment score produced by ModFOLD to guide the MD-based refinement approach to further increase the accuracy of the predicted protein models. By using the local quality score to guide the refinement process, the procedure is able to prevent the refined models from undesired structural deviations.

Leman and Bonneau in their chapter discuss the specificities of modeling membrane proteins using multi-template homology modeling.
There are fewer methods for modeling membrane proteins and they are of lower quality than those for modeling soluble proteins. As the number of available templates increases, several templates often overlap query sequence segments, so multi-template modeling can be applied to local segments, which are then combined into a single model. In this chapter, the authors provide a detailed protocol for membrane protein modeling from multiple templates in the Rosetta software suite, using the creatine transporter CT1 as an example.

In the next chapter, Mordalski and Kościółek also model membrane proteins but focus on modeling G protein-coupled receptors (GPCRs). This is a very important family of signaling proteins that are involved in almost all physiological and pathological processes in humans. The authors describe the most modern methods of modeling structures of GPCRs, starting from template selection through inferring the coordinates of conserved regions from the template, free modeling of non-aligned regions, and fine-tuning the model.

Olfactory receptors constitute the largest subfamily within G protein-coupled receptors; however, their structures are still unknown. Wang et al. provide a detailed procedure for structure construction and refinement, using the mouse eugenol olfactory receptor Olfr73 as an example. This procedure, which focuses on obtaining receptor structure for drug design, includes candidate template collection, structure-based sequence alignment, 3D structure construction, ligand docking, embedding in the membrane, and molecular dynamics simulation.

Carpentier and Chomilier in their chapter present the latest advances in the assessment of structural distortions introduced by a single amino acid mutation in protein structures. The methods allow the distortion to be split between the actual substitution effect and the contribution of local flexibility at the position where the mutation occurs. As proof of concept, three mutations in human lysozyme are analyzed: two of these mutations result in the formation of amyloid fibrils and the last one is neutral.

Xia et al. present persistent homology methods for RNA data analysis. In their chapter, they introduce persistent homology and persistent spectral models. The persistent attributes for RNAs can be obtained from these persistent models and further combined with machine learning to analyze the structure, flexibility, dynamics, and function of RNA.

Anbo et al. describe the computational methods to predict intrinsically disordered regions (IDRs) of proteins. Such regions should be excluded from the target parts of homology modeling as these regions do not have ordered three-dimensional structures. This chapter provides an overview of the IDR prediction methods and the functional regions within them.

Sylte et al. demonstrate homology modeling of membrane transporters. Various pitfalls and clues are discussed, from template selection, through multiple sequence alignments, 3D structure generation and optimization, and model validation, to the use of transporter homology models in structure-based virtual screening of ligands.

Sarkar and Saha describe a timely approach to modeling of SARS-CoV-2 virus proteins with implications for its proteome, which contains structural proteins, nonstructural proteins (which manipulate host cellular mechanisms), and accessory proteins (responsible for replication and virulence). Due to the importance of research on SARS-CoV-2, almost all of its proteins have been modeled.

For vaccine design, Bansia and Ramakumar present methods for homology modeling of antibody variable regions, illustrating them with the SARS-CoV-2 spike protein. Antibody variable regions are the most important part of an antibody since they are capable of recognizing a virtually unlimited number of antigens.
This chapter summarizes current practices in successful homology modeling of antibody variable regions and the potential applications of the generated homology models.

β-barrel membrane proteins (βMPs) play an important role in membrane anchoring, pore formation, and enzyme activities. Tian et al. demonstrate their method 3D-BMPP for accurate construction of transmembrane domains of βMPs by predicting their strand registers, from which full 3D structures are derived. They can further model the extended beta barrels and loops in non-TM regions. This method can be broadly applied to genome-wide βMP structure prediction.

Gniado et al. present an approach to model protein structures focusing on the ligand binding site, which is important for drug design. In some difficult cases, for example, with a very shallow binding site and with limited structural information, the combination of homology modeling, molecular dynamics simulations, and fragment screening can produce satisfactory results.

For drug design that focuses on allosteric binding sites, Jakowiecki et al. describe specificities of protein homology modeling for such variable regions, which are of interest for the design of more selective drugs. This chapter provides examples of modeling the N-terminus of the cannabinoid CB1 receptor, ligand-guided modeling of the structure of an allosteric site, and homology modeling of an allosteric site in potassium channels.

In the final chapter of the book, Scietti and Forneris provide an overview of approaches to construct multi-protein complex models. Such structures remain a challenge for non-experts due to the use of specific procedures that depend on the system under investigation and the need for experimental validation approaches to strengthen the resulting models. In this chapter, the authors provide examples to help the reader generate homomeric and heteromeric models of proteins. For computationally generated models, several repositories are available, but the lack of standardization of data availability procedures makes the quality of the models uncertain. Users approaching homology modeling should always remember that the model remains only theoretical in the absence of comprehensive experimental validation.

I hope that the current book on recent homology modeling procedures, assumptions made, and model quality assessment will illuminate the black box of homology modeling for novice readers and broaden the knowledge of this methodology for professionals.

Warsaw, Poland

Sławomir Filipek

Contents

Preface
Contributors

1 Homology Modeling in the Twilight Zone: Improved Accuracy by Sequence Space Analysis
   Rym Ben Boubaker, Asma Tiss, Daniel Henrion, and Marie Chabbert
2 Illuminating the “Twilight Zone”: Advances in Difficult Protein Modeling
   Damian Bartuzi, Agnieszka A. Kaczor, and Dariusz Matosiuk
3 Contact-Assisted Threading in Low-Homology Protein Modeling
   Sutanu Bhattacharya, Rahmatullah Roche, Md Hossain Shuvo, Bernard Moussad, and Debswapna Bhattacharya
4 Omics and Remote Homology Integration to Decipher Protein Functionality
   Liliana Silva and Agostinho Antunes
5 Easy Not Easy: Comparative Modeling with High-Sequence Identity Templates
   Diego Javier Zea, Elin Teppa, and Cristina Marino-Buslje
6 Quality Estimates for 3D Protein Models
   Ali H. A. Maghrabi, Fahd M. F. Aldowsari, and Liam J. McGuffin
7 Using Local Protein Model Quality Estimates to Guide a Molecular Dynamics-Based Refinement Strategy
   Recep Adiyaman and Liam J. McGuffin
8 Specificities of Modeling of Membrane Proteins Using Multi-Template Homology Modeling
   Julia Koehler Leman and Richard Bonneau
9 Homology Modeling of the G Protein-Coupled Receptors
   Stefan Mordalski and Tomasz Kościółek
10 Modeling of Olfactory Receptors
   Xueying Wang, H. C. Stephen Chan, and Shuguang Yuan
11 Analyses of Mutation Displacements from Homology Models
   Mathilde Carpentier and Jacques Chomilier
12 Persistent Homology for RNA Data Analysis
   Kelin Xia, Xiang Liu, and JunJie Wee
13 Computational Methods to Predict Intrinsically Disordered Regions and Functional Regions in Them
   Hiroto Anbo, Motonori Ota, and Satoshi Fukuchi
14 Homology Modeling of Transporter Proteins
   Ingebrigt Sylte, Mari Gabrielsen, and Kurt Kristiansen
15 Modeling of SARS-CoV-2 Virus Proteins: Implications on Its Proteome
   Manish Sarkar and Soham Saha
16 Homology Modeling of Antibody Variable Regions: Methods and Applications
   Harsh Bansia and Suryanarayanarao Ramakumar
17 3D-BMPP: 3D Beta-Barrel Membrane Protein Predictor
   Wei Tian, Meishan Lin, Ke Tang, Manisha Barse, Hammad Naveed, and Jie Liang
18 Protein Homology Modeling for Effective Drug Design
   Natalia Gniado, Agata Krawczyk-Balska, Pakhuri Mehta, Przemysław Miszta, and Sławomir Filipek
19 Specificities of Protein Homology Modeling for Allosteric Drug Design
   Jakub Jakowiecki, Urszula Orzeł, Aleksandra Gliździńska, Mariusz Możajew, and Sławomir Filipek
20 Modeling of Protein Complexes
   Luigi Scietti and Federico Forneris

Index

Contributors

RECEP ADIYAMAN • School of Biological Sciences, University of Reading, Reading, UK
FAHD M. F. ALDOWSARI • School of Biological Sciences, University of Reading, Reading, UK
HIROTO ANBO • Faculty of Engineering, Maebashi Institute of Technology, Maebashi, Japan
AGOSTINHO ANTUNES • CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Porto, Portugal; Department of Biology, Faculty of Sciences, University of Porto, Porto, Portugal
HARSH BANSIA • Department of Physics, Indian Institute of Science, Bengaluru, India; Advanced Science Research Center at The Graduate Center of the City University of New York, New York, NY, USA
MANISHA BARSE • Center for Bioinformatics and Quantitative Biology and Richard and Loan Hill Department of Biomedical Engineering, University of Illinois at Chicago, Chicago, IL, USA
DAMIAN BARTUZI • Department of Synthesis and Chemical Technology of Pharmaceutical Substances with Computer Modelling Laboratory, Medical University of Lublin, Lublin, Poland
RYM BEN BOUBAKER • UMR CNRS 6015 – INSERM 1083, Laboratoire MITOVASC, Université d’Angers, Angers, France
DEBSWAPNA BHATTACHARYA • Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
SUTANU BHATTACHARYA • Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
RICHARD BONNEAU • Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA; Department of Biology, New York University, New York, NY, USA; Department of Computer Science, New York University, New York, NY, USA; Center for Data Science, New York University, New York, NY, USA; Prescient Design, a Genentech Accelerator, New York, NY, USA
MATHILDE CARPENTIER • Institut Systématique Evolution Biodiversité (ISYEB), Sorbonne Université, MNHN, CNRS, EPHE, Paris, France
MARIE CHABBERT • UMR CNRS 6015 – INSERM 1083, Laboratoire MITOVASC, Université d’Angers, Angers, France
JACQUES CHOMILIER • Sorbonne Université, BiBiP, IMPMC, UMR 7590, CNRS, MNHN, Paris, France
SŁAWOMIR FILIPEK • Faculty of Chemistry, Biological and Chemical Research Centre, University of Warsaw, Warsaw, Poland
FEDERICO FORNERIS • Department of Biology and Biotechnology, The Armenise-Harvard Laboratory of Structural Biology, University of Pavia, Pavia, Italy
SATOSHI FUKUCHI • Faculty of Engineering, Maebashi Institute of Technology, Maebashi, Japan
MARI GABRIELSEN • Molecular Pharmacology and Toxicology, Department of Medical Biology, Faculty of Health Sciences, UiT The Arctic University of Norway, Tromsø, Norway
ALEKSANDRA GLIŹDZINSKA • Faculty of Chemistry, Biological and Chemical Research Centre, University of Warsaw, Warsaw, Poland


NATALIA GNIADO • Faculty of Chemistry, Biological and Chemical Research Centre, University of Warsaw, Warsaw, Poland; Department of Molecular Microbiology, Biological and Chemical Research Centre, Faculty of Biology, University of Warsaw, Warsaw, Poland
DANIEL HENRION • UMR CNRS 6015 – INSERM 1083, Laboratoire MITOVASC, Université d’Angers, Angers, France
JAKUB JAKOWIECKI • Faculty of Chemistry, Biological and Chemical Research Centre, University of Warsaw, Warsaw, Poland
AGNIESZKA A. KACZOR • Department of Synthesis and Chemical Technology of Pharmaceutical Substances with Computer Modelling Laboratory, Medical University of Lublin, Lublin, Poland; University of Eastern Finland, School of Pharmacy, Kuopio, Finland
JULIA KOEHLER LEMAN • Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
TOMASZ KOŚCIÓŁEK • Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
AGATA KRAWCZYK-BALSKA • Department of Molecular Microbiology, Biological and Chemical Research Centre, Faculty of Biology, University of Warsaw, Warsaw, Poland
KURT KRISTIANSEN • Molecular Pharmacology and Toxicology, Department of Medical Biology, Faculty of Health Sciences, UiT The Arctic University of Norway, Tromsø, Norway
JIE LIANG • Center for Bioinformatics and Quantitative Biology and Richard and Loan Hill Department of Biomedical Engineering, University of Illinois at Chicago, Chicago, IL, USA
MEISHAN LIN • Center for Bioinformatics and Quantitative Biology and Richard and Loan Hill Department of Biomedical Engineering, University of Illinois at Chicago, Chicago, IL, USA
XIANG LIU • Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore; Chern Institute of Mathematics and LPMC, Nankai University, Tianjin, China
ALI H. A. MAGHRABI • College of Applied Sciences, Umm Al Qura University, Mecca, Saudi Arabia
CRISTINA MARINO-BUSLJE • Fundacion Instituto Leloir, Buenos Aires, Argentina
DARIUSZ MATOSIUK • Department of Synthesis and Chemical Technology of Pharmaceutical Substances with Computer Modelling Laboratory, Medical University of Lublin, Lublin, Poland
LIAM J. MCGUFFIN • School of Biological Sciences, University of Reading, Reading, UK
PAKHURI MEHTA • Faculty of Chemistry, Biological and Chemical Research Centre, University of Warsaw, Warsaw, Poland
PRZEMYSŁAW MISZTA • Faculty of Chemistry, Biological and Chemical Research Centre, University of Warsaw, Warsaw, Poland
STEFAN MORDALSKI • Department of Medicinal Chemistry, Maj Institute of Pharmacology Polish Academy of Sciences, Krakow, Poland
BERNARD MOUSSAD • Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
MARIUSZ MOŻAJEW • Faculty of Chemistry, Biological and Chemical Research Centre, University of Warsaw, Warsaw, Poland
HAMMAD NAVEED • Computational Biology Research Lab and Department of Computing, National University of Computer and Emerging Sciences, Islamabad, Pakistan
URSZULA ORZEŁ • Faculty of Chemistry, Biological and Chemical Research Centre, University of Warsaw, Warsaw, Poland


MOTONORI OTA • Graduate School of Information Sciences, Nagoya University, Nagoya, Japan
SURYANARAYANARAO RAMAKUMAR • Department of Physics, Indian Institute of Science, Bengaluru, India
RAHMATULLAH ROCHE • Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
SOHAM SAHA • MedInsights, Veuilly la Poterie, France; MedInsights SAS, Paris, France
MANISH SARKAR • Hochschule für Technik und Wirtschaft (HTW) Berlin, Berlin, Germany; MedInsights SAS, Paris, France
LUIGI SCIETTI • Department of Biology and Biotechnology, The Armenise-Harvard Laboratory of Structural Biology, University of Pavia, Pavia, Italy
MD HOSSAIN SHUVO • Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
LILIANA SILVA • CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Porto, Portugal; Department of Biology, Faculty of Sciences, University of Porto, Porto, Portugal
H. C. STEPHEN CHAN • Shenzhen Institutes of Advanced Technology, Chinese Academy of Science, Shenzhen, China
INGEBRIGT SYLTE • Molecular Pharmacology and Toxicology, Department of Medical Biology, Faculty of Health Sciences, UiT The Arctic University of Norway, Tromsø, Norway
KE TANG • Center for Bioinformatics and Quantitative Biology and Richard and Loan Hill Department of Biomedical Engineering, University of Illinois at Chicago, Chicago, IL, USA
ELIN TEPPA • Toulouse Biotechnology Institute, TBI, Université de Toulouse, CNRS, INRA, INSA, Toulouse, France
WEI TIAN • Center for Bioinformatics and Quantitative Biology and Richard and Loan Hill Department of Biomedical Engineering, University of Illinois at Chicago, Chicago, IL, USA
ASMA TISS • UMR CNRS 6015 – INSERM 1083, Laboratoire MITOVASC, Université d’Angers, Angers, France
XUEYING WANG • Shenzhen Institutes of Advanced Technology, Chinese Academy of Science, Shenzhen, China
JUNJIE WEE • Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
KELIN XIA • Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
SHUGUANG YUAN • Shenzhen Institutes of Advanced Technology, Chinese Academy of Science, Shenzhen, China
DIEGO JAVIER ZEA • Laboratory of Computational and Quantitative Biology, LCQB, UMR 7238 CNRS, IBPS, Sorbonne Université, Paris, France

Chapter 1

Homology Modeling in the Twilight Zone: Improved Accuracy by Sequence Space Analysis

Rym Ben Boubaker, Asma Tiss, Daniel Henrion, and Marie Chabbert

Abstract

The analysis of the relationship between sequence and structure similarities during the evolution of a protein family has revealed a limit of sequence divergence for which structural conservation can be confidently assumed and homology modeling is reliable. Below this limit, the twilight zone corresponds to sequence divergence for which homology modeling becomes increasingly difficult and requires specific methods. Whether with conventional threading methods or with recent deep learning methods, such as AlphaFold, the challenge lies in identifying a template that shares not only a common ancestor (homology) but also a conserved structure with the query. As both homology and structural conservation are transitive properties, mining of sequence databases followed by multidimensional scaling (MDS) of the query sequence space can reveal intermediary sequences to infer homology and structural conservation between the query and the template. Here, as a case study, we examined the plethodontid receptivity factor isoform 1 (PRF1) from Plethodon jordani, a member of a pheromone protein family present only in lungless salamanders and weakly related to cytokines of the IL6 family. A variety of conventional threading methods led to the cytokine CNTF as a template. Sequence mining, followed by phylogenetic and MDS analysis, provided missing links between PRF1 and CNTF and allowed reliable homology modeling. In addition, we compared automated models obtained from web servers to a customized model to show how modeling can be improved by expert information.

Key words Molecular modeling, Threading, Twilight zone, Profile-profile mining, Cytokine, Plethodontid receptivity factor

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-1-0716-2974-1_1.

Sławomir Filipek (ed.), Homology Modeling: Methods and Protocols, Methods in Molecular Biology, vol. 2627, https://doi.org/10.1007/978-1-0716-2974-1_1, © Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Since the resolution of the myoglobin structure in 1958 [1], the number of protein structures deposited in the Protein Data Bank [2] has increased exponentially to reach more than 160,000 structures in 2020. These structures led to a better understanding of protein functions and mechanisms of action. They have paved the way to computational approaches for rational drug design, search of targetable allosteric sites, better understanding of structure-function relationships, and so on. However, in spite of the huge advances in the field, the sequence space grows much faster than the structural space, and computational approaches toward many proteins still rely on molecular modeling.

Presently, based on available structural information and deposited structures, proteins (or protein regions) can be classified into four categories: (1) proteins with resolved structures, (2) proteins with closely related structurally resolved homologs that can be straightforwardly modeled by homology, (3) proteins within the twilight zone, for which structural information can be reached in spite of the absence of close structurally resolved homologs, and (4) proteins within the dark zone that lack similarity to any known structure and are inaccessible to homology modeling [3]. A recent analysis of the human proteome, carried out with the deep-learning-based AlphaFold program [4], revealed that, on a residue basis, 58% of total residues have a resolved structure or are modeled with high confidence, whereas about 20% are dark and the remaining ones are in the twilight zone [5]. Thus, efforts still need to be made to improve accuracy in the twilight zone.

The initial studies from the 1990s on the relationship between sequence and structural evolution remain valid today. In a hallmark paper, Sander and Schneider [6] determined a curve describing the limit of confidence between evolution of the sequence and conservation of the structure. This study was extended by Rost [7], who corroborated the main result: a length-dependent cutoff line separates close homologs with structural conservation from remote homologs with unknown structural similarity. The length of the aligned sequence is a key factor for the confidence level in structural conservation. A cutoff of about 20% sequence identity for an aligned sequence of 200 amino acids or longer separates the safe zone for homology modeling from the twilight zone (Fig. 1). This cutoff does not mean that homology modeling is impossible below it, but it points to the additional difficulties that arise in the twilight zone when building reliable models.

Homology modeling is based on evolution. Proteins do not arise from scratch and can be classified into families. Homologous proteins within a family evolved from a common ancestor and share sequence, structural, and, to a lesser extent, functional similarities [8]. The structural conservation within a family is the keystone of molecular modeling. Using known structures of homologous proteins (the templates) and a multiple sequence alignment between the query and the templates, homology modeling programs such as MODELLER [9] build restraints and optimize the query structure from the template structures. In the safe zone, the sequence identity between the query and putative templates is high enough to infer that these proteins belong to the same family and share a common structure. By contrast, in the twilight zone, the similarities may arise from chance, convergence, or common ancestry, which raises several issues as follows:


Fig. 1 Schematic representation of the dark, twilight, and safe zones for molecular modeling of a protein as a function of the sequence identity and aligned length between the query and the template. Above the yellow line (drawn from [6]), the light gray zone indicates the safe zone of homology modeling. Below the yellow line, the dark gray zone indicates the twilight zone, for which molecular modeling becomes increasingly difficult because sequence identity does not imply structural conservation. In the twilight zone, templates cannot be found with BLAST and threading methods must be used. When they fail, the dark zone is reached. Note that any two proteins have at least 5% sequence identity (dashed blue line)

1. Finding templates by straightforward sequence-sequence comparison methods is not possible. This point has led to the development of “threading” methods based either on compatibility of sequence and fold or on profile-based search (see below).
2. Additional information may be necessary to find evidence of common ancestry (homology) between template and query and to infer the conservation of the structure. Indeed, very low identity rates may correspond to divergent or convergent evolution. In addition, random alignment leads to 5% sequence identity, whereas, in highly divergent families, sequence identities can be as low as 8% (see an example in Fig. 2).
3. Alignment between query and template(s) may be difficult. Indeed, the cutoff also separates easy from tricky alignments, which may alter the quality of the homology modeling. Alignments are greatly improved by the use of multiple sequence alignment methods [10, 11], but their accuracy remains challenging at low sequence identities.

Thus, the challenge of molecular modeling in the twilight zone lies in the recognition of correct templates and the generation of accurate sequence-template alignments [7]. In this chapter, for clarity, we will use, as a case study, the plethodontid receptivity factor isoform 1 (PRF1) from Plethodon jordani. This protein, for which structural and evolutionary data are missing, was discovered in 1999. It is a member of a pheromone protein family present only in lungless salamanders, with weak similarity to cytokines of the IL6 family [12]. We will show how sequence analysis methods, in particular multidimensional scaling (MDS), support the homology and structural conservation between PRF1 and cytokines of the IL6 family, by revealing intermediary sequences. We will also show how modeling of PRF1 can be improved by a variety of techniques.

Fig. 2 Cytokines of the IL6 family. (a) General up-up-down-down topology of the four-helix bundle fold of this cytokine family. The helices are labeled from A to D. The positions of the three conserved sites of interaction with the cognate receptors are indicated. Site II interacts with gp130, site III interacts with either gp130, LIFR, or OSMR, while site I can interact with a third, specific, “α” receptor. (b) Sequence identities between PRF1, its closest homologs, and the cytokines of the IL6 family. The color code indicates the reliability of the sequence identity (green, safe zone; yellow, transition zone; red, twilight zone)
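The zone assignment described above can be illustrated with a minimal Python sketch. It is not part of the protocol: the function names are hypothetical, sequence identity is computed here over the columns where both sequences carry a residue (other denominators are in use), and the length-dependent curve of Fig. 1 is replaced by the single simplification quoted in the text, namely a cutoff of about 20% identity for alignments of 200 residues or more.

```python
def percent_identity(query_aln, template_aln):
    """Return (percent identity, aligned length) for two gapped rows of equal length.

    Only columns in which both sequences carry a residue are counted.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned rows must have the same length")
    pairs = [(q, t) for q, t in zip(query_aln.upper(), template_aln.upper())
             if q != "-" and t != "-"]
    if not pairs:
        return 0.0, 0
    matches = sum(q == t for q, t in pairs)
    return 100.0 * matches / len(pairs), len(pairs)


def zone_for_long_alignment(identity, aligned_length, cutoff=20.0, min_length=200):
    """Classify a long alignment with the flat cutoff quoted in the text.

    Assumption: the cutoff is only applied when aligned_length >= min_length;
    shorter alignments need the full length-dependent curve, which this
    sketch does not reproduce.
    """
    if aligned_length < min_length:
        return "undetermined"
    return "safe" if identity >= cutoff else "twilight"


if __name__ == "__main__":
    identity, length = percent_identity("ACD-EF", "ACDGE-")
    print(identity, length)
    print(zone_for_long_alignment(16.0, 200))
```

With these assumptions, an alignment of about 200 residues at 16% identity, the order of magnitude reported for PRF1 and CNTF, is classified as twilight zone.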

2 Materials

2.1 Databases

1. UniProt (https://www.uniprot.org/) is a comprehensive resource of protein sequences and functional information [13]. It is composed of the manually curated Swiss-Prot and of the automatically annotated TrEMBL repositories. It contains not only protein sequences but also additional information including related 3D structures or models and identifiers of the protein family in different family databases such as Pfam [14] and InterPro [15].
2. The Protein Data Bank (PDB, accessible at https://www.rcsb.org/) is the repository of biological macromolecular structures [2, 16].
3. SCOP (structural classification of proteins) [17] is a repository of protein folds, based on an initial classification into five structural classes: all-alpha, all-beta, alpha/beta, alpha+beta, and small proteins.


2.2 Sequence Analysis


1. NRDB90.pl [18] is a Perl script aimed at clustering sequences based on sequence identity to build nonredundant sets. It can be downloaded from ftp://biodisk.org.
2. Different programs such as CLUSTAL [19], MUSCLE [20, 21], and T-COFFEE [22] can be used to perform multiple sequence alignments (MSA).
3. The EXPRESSO program from the T-COFFEE suite [23] provides a multiple sequence alignment based on structural alignments, which may be a useful initial step for aligning proteins with low identities (http://tcoffee.crg.cat/apps/tcoffee/do:expresso).
4. MSA can be manually edited using GeneDoc [24], a program aimed at editing and analyzing MSA through a graphical interface and available at https://genedoc.software.informer.com/. Subsequent phylogenetic analysis can be performed by the user-friendly MEGA software (Molecular Evolutionary Genetics Analysis) [25] available at https://www.megasoftware.net/.
5. The R package bios2mds [26] is aimed at analyzing MSAs by multidimensional scaling and provides tools for user-friendly visualization of the sequence space (see Note 1).
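To make the idea of multidimensional scaling of an MSA concrete, the following Python sketch embeds aligned sequences in a low-dimensional space. It is an illustration under stated assumptions and not the bios2mds implementation: the distance between two sequences is taken as 1 minus their fractional identity over ungapped columns, classical (Torgerson) MDS is used, and the eigenvectors are obtained by power iteration with deflation, which presumes that the leading eigenvalues are positive and well separated. All function names are hypothetical.

```python
import math
import random


def identity_distance_matrix(msa):
    """Pairwise distance = 1 - fractional identity over columns without gaps."""
    n = len(msa)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            pairs = [(a, b) for a, b in zip(msa[i], msa[j]) if a != "-" and b != "-"]
            same = sum(a == b for a, b in pairs)
            d = 1.0 - same / len(pairs) if pairs else 1.0
            dist[i][j] = dist[j][i] = d
    return dist


def double_center(dist):
    """Return B = -1/2 * J * D^2 * J, the Gram matrix used by classical MDS."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def classical_mds(dist, n_components=2, iterations=1000, seed=0):
    """Embed items in n_components dimensions from a symmetric distance matrix."""
    rng = random.Random(seed)
    b = double_center(dist)
    n = len(b)
    coords = [[0.0] * n_components for _ in range(n)]
    for k in range(n_components):
        v = [rng.random() for _ in range(n)]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract from the matrix
                break
            v = [x / norm for x in w]
        v_norm = math.sqrt(sum(x * x for x in v))
        v = [x / v_norm for x in v]
        eigenvalue = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflation: remove the component just found before extracting the next one.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


if __name__ == "__main__":
    toy_msa = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDQFGHLKV", "WCDQ-GTLKV"]
    for row in classical_mds(identity_distance_matrix(toy_msa)):
        print([round(x, 3) for x in row])
```

In such a plot, sequences that bridge the query and a candidate template appear as intermediate points between the two clusters. For real analyses, a dedicated package should be preferred over this sketch.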

2.3 Template Mining

A variety of programs can be used to mine homologs in sequence databases. The choice of the program depends on the putative sequence identities between the query and the hits. Sequence-sequence comparison programs are adequate in the safe zone, whereas sophisticated profile-profile searches are adapted to the twilight zone. Initially, “threading” referred to the search of a template by analyzing the compatibility of a sequence with a protein fold. Presently, “threading” refers to any method searching a template by sequence-profile or profile-profile comparison. Here is a non-exhaustive list of sequence database mining programs:

1. BLAST (Basic Local Alignment Search Tool) [27], based on local sequence similarity, allows fast sequence-sequence comparison.
2. PSI-BLAST (position-specific iterated BLAST) [28] is based on sequence-profile comparison. It derives a position-specific scoring matrix (PSSM) from the multiple sequence alignment of sequences detected above a given score threshold using protein-protein BLAST.
3. HMMER [29] is a sequence-profile comparison method based on profile hidden Markov models (HMMs) (https://toolkit.tuebingen.mpg.de/tools/hmmer).
4. Phyre2 (Protein Homology/analogY Recognition Engine V 2.0) [30] performs its searches by mining a database of profile HMMs, one for each known 3D structure (http://www.sbg.bio.ic.ac.uk/~phyre2/).
5. HHpred [31–33] performs HMM-HMM profile searches in sequence databases to find homologs (https://toolkit.tuebingen.mpg.de/tools/hhpred).
6. LOMETS (local meta-threading-server) [34] performs template searches using 11 different threading methods (see Note 2). Starting from a query sequence, LOMETS works in three steps: (1) building a sensitive (or deep) MSA, (2) threading the deep MSA by individual programs, and (3) ranking templates with a specific scoring function which takes into account normalized Z-scores and sequence identities. For each method, a normalized Z-score threshold of 1 differentiates good from bad templates (https://zhanglab.ccmb.med.umich.edu/LOMETS/).
7. SUPERFAMILY finds a protein fold based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level [35, 36]. SUPERFAMILY is a “true” threading program, aimed at finding a protein fold (https://supfam.org/SUPERFAMILY/).

2.4 Secondary Structure Prediction

When gathering information on a protein without close structurally resolved homologs, prediction of secondary structure (SS) may yield useful information. The best performances for SS prediction are obtained with programs based on multiple sequence alignment profiles and neural networks, such as PSIPRED [37, 38] (http://bioinf.cs.ucl.ac.uk/psipred/) and SPIDER3 [39] (https://sparkslab.org/server/spider3/).

2.5 Automated Structure Prediction

Several web servers perform automated structural prediction from a query sequence. They are based on threading methods to find templates, and then they use different methods for the subsequent modeling steps. Here, we list three automated 3D prediction servers compared in this chapter:

1. Phyre2 [30]: After template detection by mining a HMM database, the subsequent molecular modeling step is carried out with MODELLER based on the resulting sequence alignment between query and templates (http://www.sbg.bio.ic.ac.uk/~phyre2/).
2. I-TASSER (Iterative Threading ASSEmbly Refinement) [40–42] is a hierarchical approach to protein structure prediction. It first identifies structural templates by comparing the best hits from the “best” 10 out of 14 threading methods with LOMETS, and then it builds full-length atomic models by iterative template-based fragment assembly simulations (https://zhanglab.ccmb.med.umich.edu/I-TASSER/).
3. ROBETTA [43] is a protein structure prediction service. In the option for Rosetta comparative modeling (RosettaCM) [43], four independent methods (see Note 3) are used to detect templates and generate sequence alignments, and then models are built from template hybridization (https://robetta.bakerlab.org).

2.6 Customized Structure Prediction

For users who wish to build their own models, the MODELLER program [9] builds molecular models of the query from the template structure(s) by minimizing violations of structural, stereochemical, and user-defined restraints. The structural restraints are based on the structure of the template(s) and the alignment between query and template(s). To customize models in order to match structural and functional requirements, expert information can be introduced by (1) adding user-defined restraints, such as the distance between two residues or secondary structure elements, to the modeling procedure and (2) combining user-selected templates and template fragments.

2.7 Model Validation

With threading methods, the metrics to compare models and templates must be more sensitive to the global fold similarity than to local structural variation. This is not the case of the traditional root-mean-square deviation (RMSD). The TM-score (see Note 4) has been specifically designed to solve this problem [44]. A threshold of 0.5 differentiates proteins with similar fold from proteins with different fold [45]. TM-scores can be calculated after structural alignment with TM-align [46] at the Zhang lab server (https://zhanglab.ccmb.med.umich.edu/TM-align/).
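As an illustration of why the TM-score is less sensitive than RMSD to local deviations, the sketch below evaluates the commonly published form of the score, TM = (1/L) Σ 1/(1 + (d_i/d0)²) with d0 = 1.24 (L − 15)^(1/3) − 1.8, for a fixed superposition. Several points are assumptions of this sketch rather than statements of the chapter: the function names are hypothetical, d0 is floored at 0.5 Å for short chains, and the search over superpositions that TM-align performs to maximize the score is not reproduced.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in Å), floored at 0.5 Å (assumption)."""
    if target_length <= 15:
        return 0.5
    return max(1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances, target_length):
    """TM-score for one given superposition.

    distances: Cα-Cα distances (Å) of the aligned residue pairs.
    target_length: length of the reference protein used for normalization.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


if __name__ == "__main__":
    # 150 well-superposed residues plus a badly placed 40-residue loop:
    # the loop lowers the score only modestly, whereas it would dominate an RMSD.
    distances = [1.0] * 150 + [15.0] * 40
    print(round(tm_score(distances, 190), 3))
```

Each aligned pair contributes at most 1/L, so distant residues saturate toward zero instead of inflating the measure.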

2.8 Graphical Analysis

Graphical analysis of templates and models can be carried out by a variety of molecular visualization programs, such as Chimera [47] or PyMOL (https://pymol.org/). Note that the low identity rates between template and query sequences in the twilight zone prevent the use of sequence-based structural superposition functions, such as the align function in PyMOL.

3 Methods

3.1 Our Case Study

As a case study, we chose a small protein, the plethodontid receptivity factor isoform 1 (PRF1) from Plethodon jordani (UniProt entry: Q9PUJ2_PLEJO) [12]. This 215 amino acid protein (including a 23 amino acid signal peptide) is a courtship pheromone produced by males to increase female receptivity (see Note 5). When PRF1 was discovered in 1999, it was acknowledged that its sequence was weakly related (around 16% sequence identity) to the ciliary neurotrophic factor (CNTF) and cardiotrophin-1 (CTF1), two cytokines of the interleukin-6 (IL6) family [12]. Since then, two additional cytokines of the IL6 family have been discovered: cardiotrophin-2 (CTF2, absent in humans) [48] and the cardiotrophin-like cytokine factor 1 (CLCF1) [49]. CTF2 and CLCF1 have, respectively, 26% and 20% sequence identity with PRF1. In mammals, the IL6 family of cytokines includes IL6, interleukin-11 (IL11), leukemia inhibitory factor (LIF), oncostatin M (ONCM), CNTF, CTF1, CTF2, and CLCF1. Although the sequence identities can be as low as 8%, these cytokines share a common four-helix bundle fold with an up-up-down-down topology (Fig. 2). In addition, they signal through the gp130 receptor subunit and share similar binding sites with cognate receptors (see Note 6) [50–53]. Crystal structures have been resolved for five cytokines from the IL6 family: IL6 (1ALU [54], 5FUC [55]), IL11 (4MHL [56]), CNTF (1CNT [57]), LIF (1EMR, 1LKI [58], 2Q7N [59]), and ONCM (1EVS [60]). No crystal or NMR structure has been reported to date for PRF1 or cardiotrophin-like cytokines.

3.2 Template Search by PDB Mining

In InterPro, PRF1 is described as belonging to the 4_helix_cytokine-like_core superfamily (IPR009079) and to the PRF/cardiotrophin-like family (IPR010681). Additional information from the PRF1 sequence was searched for using the SUPERFAMILY assignment server [35]. SUPERFAMILY predicts that PRF1 is in the class of all-alpha proteins and belongs to the fold/superfamily of 4-helical cytokines with an E-value of 3 × 10⁻⁵⁵ (see Note 7). It also suggests an “uncertain” classification at the family level as long-chain cytokine, with an E-value of only 0.005.

The next step to find a template was the mining of the PDB in search of homologs. Using the mature sequence of PRF1 (residues 24-215) as a query, we performed different searches (Table 1):

1. Straightforward mining of the PDB with BLAST: this search led to no hit.
2. Sequence-profile search: a PSI-BLAST search seeded with the PRF1 sequence, followed by selection of hits covering most of the query sequence (>60%), led to CNTF as a hit with an unreliable E-value of 5.9.
3. Profile HMM search: profile search using the HMMER program [29] was carried out on the HHpred server. The search led to two hits, CNTF and LIF, as putative templates with very significant E-values of 10⁻¹⁰ or lower.
4. Profile HMM-profile HMM search: the HHpred algorithm led to six hits with E-values lower than 0.1: CNTF, LIF, ONCM, IL11, G-CSF (granulocyte colony-stimulating factor), and IL6. Among them, G-CSF (E-value of 8 × 10⁻¹³) is a four-helical cytokine that does not belong to the IL6 family but shares the same up-up-down-down four-helix bundle fold.


Table 1 PDB mining using the PRF1 sequence as a query

Search method                   Program     Hits(a)   E-value
Sequence based                  BLAST       No hit
Sequence profile based          PSI-BLAST   CNTF      5.9
Profile HMM based               HMMER       CNTF      7 × 10⁻²⁰
                                            LIF       4 × 10⁻¹⁰
Profile HMM-profile HMM based   HHpred      CNTF      7 × 10⁻³²
                                            LIF       3 × 10⁻³¹
                                            IL11      5 × 10⁻²⁸
                                            ONCM      8 × 10⁻²⁶
                                            G-CSF     8 × 10⁻¹³
                                            IL6       2 × 10⁻⁷

(a) For clarity, only the proteins (and not the PDB numbers) are indicated. Italic fonts in the original table indicate a growth factor (G-CSF) with the same four-helix bundle fold as the IL6 family

3.3 Template Search with LOMETS

Finally, a comparison of 11 threading methods was carried out with the LOMETS server [34] (Table 2). All the methods but one ranked CNTF or LIF as the best hit. Only CEthreader, which is a contact-based method, favored prolactin (PDB 1RW5). This growth factor shares the four-helix bundle fold of the IL6 cytokines. Most additional hits include cytokines of the IL6 family (ONCM, IL11, IL6) or cytokines/growth factors with the same four-helix bundle fold (prolactin, lactogen, IL23). However, several methods also found IL1Ra, the interleukin-1 receptor antagonist (PDB 1ILR) [61], which has a beta barrel fold (see Note 8). This finding serves as a reminder of how cautious users need to be when analyzing threading results.
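The consensus reading of Table 2 can be reproduced with a short tally. The Python sketch below transcribes the top hit and the additional hits of each LOMETS program from Table 2; the data structure and the helper names are hypothetical and serve only to count how often each protein is ranked first and which programs report a given protein.

```python
from collections import Counter

# (top hit, additional hits with Zn > 1) per program, transcribed from Table 2.
LOMETS_HITS = {
    "HHpred": ("CNTF", ["LIF", "IL11", "ONCM"]),
    "CEthreader": ("Prolactin", ["IL11", "CNTF", "lactogen", "LIF", "IL1Ra"]),
    "Sparks-X": ("CNTF", ["LIF", "IL11", "ONCM", "IL1Ra", "IL6", "prolactin"]),
    "FFAS3D": ("CNTF", ["LIF", "ONCM", "IL11", "G-CSF", "IL6", "IL1Ra",
                        "prolactin", "lactogen"]),
    "MUSTER": ("LIF", ["CNTF", "ONCM", "IL11"]),
    "Neff-MUSTER": ("LIF", ["CNTF"]),
    "HHsearch": ("LIF", ["CNTF", "ONCM", "IL11", "G-CSF", "IL6", "IL1Ra"]),
    "SP3": ("LIF", ["CNTF", "ONCM", "IL11", "IL6", "G-CSF", "IL1Ra", "prolactin"]),
    "PPAS": ("CNTF", ["LIF", "ONCM", "IL11", "G-CSF", "IL6", "IL1Ra", "IL23"]),
    "PROSPECTOR2": ("CNTF", ["LIF", "ONCM", "IL11"]),
    "PRC": ("LIF", ["CNTF", "ONCM", "IL11"]),
}


def top_hit_counts(hits):
    """Count how many programs rank each protein first."""
    return Counter(top for top, _ in hits.values())


def programs_reporting(hits, protein):
    """List the programs that report `protein` as top or additional hit."""
    wanted = protein.lower()
    return sorted(
        program for program, (top, extra) in hits.items()
        if wanted == top.lower() or wanted in (name.lower() for name in extra)
    )


if __name__ == "__main__":
    print(top_hit_counts(LOMETS_HITS).most_common())
    print(programs_reporting(LOMETS_HITS, "IL1Ra"))
```

The tally shows CNTF and LIF sharing the first rank across ten of the eleven programs, while the beta barrel protein IL1Ra still appears among the hits of several programs, which is the cautionary point made above.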

3.4 Sequence Space Investigation

As exemplified by IL1Ra in the LOMETS results (Table 2), finding a template by threading does not prove that there is homology, i.e., a common ancestor, between the template and the query, nor that the structure is conserved. The identity of 16% between the PRF1 query and the CNTF or LIF templates is positioned in the twilight zone (Fig. 1). However, homology and structural conservation are transitive properties. Investigation of the query sequence space may reveal sequences that are homologous to both query and template(s) in the safe zone and consequently may validate the template. Indeed, finding intermediates is an efficient strategy to reduce false positives [7, 31]. To investigate the query sequence space, several steps must be carried out as follows:

1. BLAST search of the query homologs in sequence databases. Here, using PRF1 as a query in UniProt vertebrate sequences,


Table 2 Comparison of the LOMETS threading programs using the PRF1 sequence as a query

Program(a)     Method(b)   Top hit(c)   Additional hits with Zn > 1(c,d)
HHpred         HMM         CNTF         LIF, IL11, ONCM
CEthreader     Contact     Prolactin    IL11, CNTF, lactogen, LIF, IL1Ra
Sparks-X       Profile     CNTF         LIF, IL11, ONCM, IL1Ra, IL6, prolactin
FFAS3D         Profile     CNTF         LIF, ONCM, IL11, G-CSF, IL6, IL1Ra, prolactin, lactogen
MUSTER         Profile     LIF          CNTF, ONCM, IL11
Neff-MUSTER    Profile     LIF          CNTF
HHsearch       HMM         LIF          CNTF, ONCM, IL11, G-CSF, IL6, IL1Ra
SP3            Profile     LIF          CNTF, ONCM, IL11, IL6, G-CSF, IL1Ra, prolactin
PPAS           Profile     CNTF         LIF, ONCM, IL11, G-CSF, IL6, IL1Ra, IL23
PROSPECTOR2    Profile     CNTF         LIF, ONCM, IL11
PRC            HMM         LIF          CNTF, ONCM, IL11

(a) The programs are ranked as determined by LOMETS. See Note 2 for references
(b) HMM corresponds to profile HMM-profile HMM-based searches; profile corresponds to sequence profile-sequence profile-based searches
(c) For clarity, only the proteins (and not the PDB numbers) are indicated
(d) Normalized Z-scores (Zn) indicate the quality of the hits. They are considered “good” above the threshold of 1. The hits are sorted by decreasing Zn. When several hits correspond to the same protein (different origins, conditions, or methods), only the first hit is indicated. In the original table, italic fonts indicate four-helix bundle cytokines/growth factors that do not belong to the IL6 family but share the same fold, and bold fonts for IL1Ra highlight a hit with a beta barrel fold

we obtained 602 hits with E-value lower than 10. Among them, 190 sequences corresponded to salamander receptivity factors and shared sequence identities larger than 60% with PRF1. Among very significant hits (E-value [...]

[...] 0.5 typically indicates the correct overall fold [81].

3 Methods

3.1 Overview of Protein Threading

The goal of protein threading is to optimally align a query sequence to a known structural template [82]. This requires identifying the correct or best-fit template from a library of templates and the optimal query-template alignment from the space of all possible query-template alignments. The query-template alignment represents a correspondence between each query residue and the spatial positioning of the aligned template residues. Overall, protein threading can be considered to involve three main components: (1) a threading scoring function that evaluates the fitness of query-template alignments, (2) identification of the best-fit structural template from the library of templates, and (3) an optimal alignment of the query sequence to the template. In the following, we discuss each component in more detail.

3.1.1 Threading Scoring Function

The scoring function plays an important role in quantitatively assessing the fitness of query-template alignments [14]. The scoring function normally consists of the profile similarity score, the structural consistency score, and the gap penalty. The profile similarity score can be calculated by comparing the query and template profiles. It quantifies how closely the query is evolutionarily related to the template. The structural consistency score contains two components: consistency of local structures, such as secondary structure and solvent accessibility compatibility, and consistency of global structures or pairwise interatomic interactions. Weights can be used in the scoring function to control the relative importance of different scoring terms.

Contact-Assisted Protein Threading

3.1.2 Template Selection

Identifying the best-fit template requires the alignment scores of the query-template alignments. The raw query-template alignment score cannot be used directly to rank templates because of the bias introduced by protein length [14]. Both machine learning-based methods and Z-scores are used to mitigate this bias. Several protein threading methods [40, 46, 83–85] use machine learning models such as neural networks for template ranking by formulating template selection as a classification problem, although the majority of threading methods [18, 63, 64] rely on the Z-score. The Z-score of a query-template pair is computed from the mean and standard deviation of the scores of the query sequence against all templates in the template library. However, the Z-score cannot cancel out all of the bias introduced by protein length: large proteins tend to receive high Z-scores. The Z-score is also difficult to interpret, particularly when the scoring function is a weighted sum of different scoring terms [14].
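The Z-score normalization described above can be sketched in a few lines. This is a minimal example, assuming that the raw scores of one query against every template are already available in a dictionary; the function name and the use of the population standard deviation are choices made for this illustration.

```python
import statistics


def z_scores(raw_scores):
    """Convert the raw alignment scores of one query against all templates
    into Z-scores using the mean and standard deviation over the library."""
    values = list(raw_scores.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumed)
    if sd == 0:
        # All templates score identically, so no template stands out.
        return {name: 0.0 for name in raw_scores}
    return {name: (score - mean) / sd for name, score in raw_scores.items()}


# Rank hypothetical templates by decreasing Z-score.
scores = {"t1": 10.0, "t2": 20.0, "t3": 30.0}
ranked = sorted(z_scores(scores).items(), key=lambda kv: kv[1], reverse=True)
```

Because the mean and standard deviation are taken over the whole library for a given query, the normalization by itself does not remove a length dependence that affects individual templates, which is the limitation noted in the text.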

3.1.3 Optimal Query-Template Alignment

The optimal query-template alignment is the one that pairs each residue in the query sequence with its homologous residue in the template. It is often the case that a threading scoring function is effective in selecting a homologous template but yields a weak query-template alignment [25, 86]. A suboptimal alignment may in turn lead to a less accurate template-based model; that is, the sensitivity of the query-template alignment directly affects the overall performance of template-based modeling.
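For position-wise scoring terms, an optimal alignment can be found by dynamic programming. The sketch below is a generic global (Needleman-Wunsch style) alignment and is not the algorithm of any specific threading program. It assumes a precomputed matrix `score[i][j]` giving the fitness of placing query residue i on template position j, and a simple linear gap penalty; both are simplifications introduced here.

```python
def global_align(score, gap=-1.0):
    """Global alignment over a query-by-template score matrix.

    Returns the best total score and the aligned (query, template) index pairs.
    """
    n = len(score)
    m = len(score[0]) if n else 0
    # dp[i][j] = best score aligning the first i query and first j template residues
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                           dp[i - 1][j] + gap,                      # gap in template
                           dp[i][j - 1] + gap)                      # gap in query
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs
```

Pairwise terms such as contacts depend on two aligned positions at once and therefore do not fit this simple recursion directly, which is one reason the contact-assisted methods discussed later rely on approximations or iterative schemes.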

3.2 Contact-Assisted Protein Threading

3.2.1 Residue-Residue Contact Map

A contact map of a protein is a binary, square, symmetric matrix whose rows and columns correspond to the residues of the protein; a contact (an edge, when the map is viewed as a graph) indicates that the distance between a residue pair is within a given threshold. Typically, this threshold is set to 8 Å between the Cα or Cβ atoms of the residue pair [16, 20]. Here, the contact between a residue pair (i, j) is defined as:

$$C(i, j) = \begin{cases} 1 & \text{if } d_{ij} \le 8\ \text{Å} \\ 0 & \text{otherwise} \end{cases}$$

where $d_{ij}$ is the distance between the residue pair (i, j). Figure 1 shows a representative protein 3D structure and its corresponding 2D residue-residue contact map.

3.2.2 Contact Map Alignment
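As a concrete reference point for the alignment methods in this section, the contact map definition of Sect. 3.2.1 can be sketched as follows. The example assumes that one representative atom per residue (for instance Cα) is supplied as a list of (x, y, z) coordinates in Å; the function name and input format are hypothetical.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact map: entry (i, j) is 1 when the distance between the
    representative atoms of residues i and j is at most `threshold` Å."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) <= threshold else 0
             for j in range(n)]
            for i in range(n)]


# Three hypothetical residues on a line: the first two are 3.8 Å apart,
# the third is far from both.
cmap = contact_map([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)])
```

Applied literally, the definition marks every residue as being in contact with itself (the diagonal is 1, since the distance is 0). Analyses often ignore the diagonal and near-diagonal entries, but that filtering is a separate choice not made here.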

Contact map alignment is a way of measuring the similarity between two contact maps. The maximum contact map overlap problem evaluates the similarity of two proteins by calculating the maximum overlap between their contact maps while preserving the residue ordering of both sequences, leading to a pairwise sequence alignment as illustrated in Fig. 2. Since direct contact map alignment is computationally expensive [63], several approximation algorithms [62, 87–92] have been developed to address the contact map alignment problem, including eigendecomposition-based strategies, a graphlet degree-based approach, and iterative double dynamic programming-based approaches. Eigendecomposition decomposes a contact map into eigenvectors and corresponding eigenvalues. This approach compares two proteins by comparing their contact map eigenvectors, which can be performed in polynomial time. For example, EIGAs [87], SABERTOOTH [89], and Al-Eigen [90] approximate contact maps using the top eigenvectors and use the global alignment of key eigenvectors to find the similarity between two contact maps. GR-Align [92] is a fast contact map alignment approach based on graphlet degree distribution. Moreover, C-Align [93] is a contact map alignment algorithm based on Cα atoms that uses dynamic programming. Recent methods such as map_align [62] employ iterative double dynamic programming to calculate the contact map alignment, with the goal of maximizing the number of overlapping contacts while minimizing the number of gaps.

Sutanu Bhattacharya et al.

Fig. 1 A representative protein 3D structure and its corresponding 2D binary contact map. (a) 3D structure of a representative protein (PDB ID 1cc8A), (b) the corresponding 2D residue-residue contact map, considering Cα atoms and a distance threshold of 8 Å

3.3 Overview of Existing Contact-Assisted Threading Methods

Table 1 shows several publicly available contact-assisted threading methods. These approaches can be broadly subdivided into two classes: (1) methods that implicitly use contact information via pairwise contact potential such as PROSPECT [46], PROSPECTOR [75, 76], and RAPTOR [14]; and (2) methods that explicitly use contact information via predicted residue-residue contacts including the current state-of-the-art contact-assisted threading methods such as EigenTHREADER [20], map_align [62], CEthreader [63], CATHER [64], ThreaderAI [65], and our in-house threading method [16]. We briefly discuss them below.
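Whichever way contact information enters the score, the quantity of interest is how many contacts an alignment brings into correspondence. As a minimal sketch of the contact map overlap of Sect. 3.2.2, the function below counts, for a fixed alignment, the residue pairs that are in contact in both maps. It assumes that the alignment is given as an order-preserving list of (query index, template index) pairs and that the maps are nested lists of 0/1 values; it does not search for the best alignment, which is the hard part addressed by the methods discussed in this section.

```python
def contact_overlap(map_a, map_b, aligned_pairs):
    """Count contacts of protein A that are matched to contacts of protein B
    under a fixed, order-preserving alignment."""
    overlap = 0
    for x in range(len(aligned_pairs)):
        for y in range(x + 1, len(aligned_pairs)):
            i, k = aligned_pairs[x]   # residue i of A is aligned to residue k of B
            j, l = aligned_pairs[y]   # residue j of A is aligned to residue l of B
            if map_a[i][j] and map_b[k][l]:
                overlap += 1
    return overlap


# Two small hypothetical maps and two candidate alignments.
map_a = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]
map_b = [[1, 1, 0, 0],
         [1, 1, 0, 1],
         [0, 0, 1, 0],
         [0, 1, 0, 1]]
better = contact_overlap(map_a, map_b, [(0, 0), (1, 1), (2, 3)])
worse = contact_overlap(map_a, map_b, [(0, 0), (1, 1), (2, 2)])
```

In this toy case the first alignment matches more contacts than the second, so an overlap-based score would prefer it.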


Fig. 2 Contact map alignment. (a) Contact map of a representative protein (PDB ID 1cc8A), (b) contact map of another representative protein (PDB ID 1wvnA), (c) sequence alignment of 1cc8A and 1wvnA using Al-Eigen. In both cases, Cα atoms and the distance threshold of 8 Å are considered. (d) 1wvnA (in rainbow) is structurally superimposed on 1cc8A (in gray)

3.3.1 Threading Methods That Implicitly Use Contact Information via Pairwise Contact Potential

PROSPECT (PROtein Structure Prediction and Evaluation Computer Toolkit) [46] is one of the earliest protein threading methods. It makes use of a pairwise contact potential by introducing a contact term into its scoring function. Pairwise contact potentials are measured only between core secondary structures, with the contact cutoff set at 7 Å between the Cβ atoms. Additionally, the method uses a divide-and-conquer algorithm for the alignment search. Another method, PROSPECTOR (PROtein Structure Predictor Employing Combined Threading to Optimize Results) [75, 76], uses a “partly thawed” technique to assess the contact potential based on the previous alignment iterations. RAPTOR (RApid Protein Threading by Operation Research technique) [14] is another protein threading method that introduces a contact capacity score. It considers only contacts between two core residues where the spatial distance between Cα atoms is within 7 Å and the sequence separation is at least 4. It formulates threading as a large-scale integer programming problem, relaxes it to a linear programming problem, and uses a branch-and-bound approach to solve the integer program. However, the performance contribution of pairwise contact potentials in the above methods is not significant compared to that of sequence profiles, particularly for distantly related proteins. The underlying reason may be that noisy contacts carry little extra signal, yielding only modest improvement.

Table 1 Selected publicly accessible threading methods that implicitly or explicitly use contact information

| Name (reference) | Method | Availability |
| --- | --- | --- |
| PROSPECT (Xu and coworkers [46]) | Divide-and-conquer algorithm | http://compbio.ornl.gov/structure/prospect/ |
| PROSPECTOR (Skolnick and coworkers [75, 76]) | Hierarchical approach | http://bioinformatics.danforthceneter.org/services/threading.html |
| RAPTOR (Xu and coworkers [14]) | Linear programming | http://www.cs.uwaterloo.ca/~j3xu/RAPTOR_form.htm |
| EigenTHREADER (Jones and coworkers [20]) | Dynamic programming and eigendecomposition | http://bioinfadmin.cs.ucl.ac.uk/downloads/eigenTHREADER/ |
| map_align (Baker and coworkers [62]) | Iterative double dynamic programming | https://github.com/sokrypton/map_align |
| CEthreader (Zhang and coworkers [63]) | Dynamic programming and eigendecomposition | https://zhanglab.ccmb.med.umich.edu/CEthreader/ |
| CATHER (Yang and coworkers [64]) | Iterative double dynamic programming | https://yanglab.nankai.edu.cn/CATHER/ |
| ThreaderAI (Shen and coworkers [65]) | Deep residual neural network and dynamic programming | https://github.com/ShenLab/ThreaderAI |

3.3.2 Threading Methods That Explicitly Use Contact Information via Predicted Residue-Residue Contacts

Recent successful applications of deep learning have resulted in significantly improved inter-residue contact prediction methods [53, 56, 60, 94]. As such, the newest contact-assisted threading methods explicitly integrate predicted residue-residue contact information to improve threading performance.

EigenTHREADER [20], developed in 2017, extends Al-Eigen [90] to enable threading by predicting a protein’s contact map using the classical neural network-based predictor MetaPSICOV [53] and then searching a library of template contact maps. Despite the superior performance of EigenTHREADER over profile-based threading methods for low-homology threading, it can be further improved by integrating linear features such as sequence profiles along with inter-residue contact maps.

map_align [62], developed in 2017, proposes an iterative double dynamic programming algorithm [95] that aligns contact maps predicted by the purely coevolution-based predictor GREMLIN [96], in combination with metagenomic sequences of microbial DNA [97]. The elevated performance of map_align can be attributed to the contribution of contact maps in low-homology threading. However, because the outcome relies on the initial estimate of the similarity matrix, which is not always optimal, this approach does not guarantee optimal solutions.

CEthreader [63] (Contact Eigenvector-based threader), developed in 2019, uses contact maps predicted by the deep residual neural network-based predictor ResPRE [94]. Similar to Al-Eigen, it uses eigendecomposition to approximate contact maps by the cross product of single-body eigenvectors. CEthreader introduces a dot-product scoring function that incorporates contact information along with secondary structures and sequence profiles to align contact eigenvectors, and it uses dynamic programming to generate the query-template alignments. However, the method can be further strengthened by considering negative eigenvalues in addition to positive ones, since both are needed to restore the contact map.

Another contact-assisted threading algorithm, CATHER [64] (contact-assisted THreadER), developed in 2020, uses both conventional sequential profiles and contact maps predicted by the deep learning-based method MapPred [98]. A very recent method, ThreaderAI [65], integrates deep learning-based contact information with traditional sequential and structural features by formulating threading as a classical computer vision classification problem. It introduces a deep residual neural network to predict query-template alignments. Based on the reported results, these contact-assisted threading methods outperform profile-based threading methods by a large margin, particularly for low-homology targets.

Our in-house threading method [16], developed in 2019, integrates the standard threading technique with inter-residue contact information predicted by the state-of-the-art ultra-deep learning-based method RaptorX [56]. First, the method applies the standard threading technique to select the top templates based on the Z-score; it then combines the contact map overlap score computed using Al-Eigen with the Z-score to calculate the final score for selecting the best-fit template. Based on large-scale benchmarking results, this method outperforms the profile-based threading method MUSTER as well as the contact-assisted threading methods EigenTHREADER and map_align.

3.4 Significance of Contact Maps Quality in Threading

While incorporating contact information into threading is highly effective, our recent study [99] shows that the quality of contact maps affects contact-assisted threading performance: integrating high-quality contacts with a Matthews correlation coefficient (MCC) ≥0.5 improves threading performance for ~30% of the cases, while low-quality contacts having MCC 1, usually referred to as positive selection. Positive selection is the fixation of advantageous mutations and is a crucial process behind adaptive changes that confer evolutionary innovations and species differentiation [46]. The majority of genes, mainly those with crucial organismal functions, have been subjected to strong negative selection. However, some genes can present sites with ω > 1 that are expected to be important for protein functionality and therefore demonstrate their role in the development of phenotypic differentiation and fitness improvement [47, 48]. Uncovering the molecular mechanism by which this phenotypic differentiation arose has long been a crucial goal of evolutionary biology [49]. Recently, it was hypothesized that, among bacterial proteins, those that show evolutionary constraint and are subject to selection are better potential drug targets, since mutations in such proteins could cause an important loss of function, which makes the bacteria less likely to develop resistance at random [50]. In fact, the dN/dS of genes coding for known drug targets is significantly lower than the genome average, which suggests that the currently known drug targets have evolved slowly and that this approach could be a valuable predictor in the design of new drugs [50].

Liliana Silva and Agostinho Antunes

4 Scientific Advances Boosted by Comparative Genomics: Protein Structures Integration

Three-dimensional structures, whether obtained by X-ray crystallography, homology modeling, or even remote homology, have an important role in understanding the molecular function of proteins and in pinpointing highly conserved or variable sites from a comparative evolutionary genomics perspective, providing the basis for multiple recent advances in areas such as medicine and biotechnology. The following paragraphs give examples of how integrative inferences are drawn from evolutionary genomics and how protein structures enable major scientific advances.

4.1 Insights in Cancer, Longevity, and Immunity Field

Cancer, longevity, and immunity are three closely connected fields of high research interest because of their huge impact on human health. Understanding the information flow from genes to proteins for several cancer hallmarks and immune proteins sheds light on the longevity of species. An interesting example concerns the white shark (Carcharodon carcharias) and the whale shark (Rhincodon typus), whose genomes reveal a high degree of stability that may reflect the combined selective pressure of their large genome size (>3 Gbp), high repeat content, high representation of long-interspersed element retrotransposons, large body size, and long life spans (white sharks and whale sharks can reach life spans of 73 and 140 years, respectively) [46]. It is known that the majority of cancers present genomic instability [51] through the accumulation of a high frequency of genomic mutations. Moreover, over time the genome of an organism is exposed to exogenous, endogenous, and cellular processes that can inflict DNA damage, compromise DNA integrity, and predispose cells to malignant transformation, neurological disease, and premature aging. In turn, defense mechanisms, some of them conserved across ancient species, are evolving to safeguard the genetic information.

Protein Functionality Retrieved by Omics and Remote Homology

Sharks present a remarkable wound healing capability [52] that is regulated by a highly complex process with multiple genes involved in key phases, and interrelated steps of wound healing are under positive selection (such as some elements involved in vascular endothelial growth, heparin sulfate formation, fibrinogen, and keratin maintenance). The theoretical risk of developing cancer increases with life span [53], but this hypothesis does not seem to apply to elasmobranchs. The answer may be related to oncological regulation. The protein p53, commonly referred to as the “guardian of the genome” owing to its capacity to block the proliferation of cells with a damaged genome (such as oncological cells) [54], is regulated by MDM2 and MDM4. Studies performed on these Elasmobranchii genomes reveal that the MDM4 gene is under positive selection, and mapping these sites onto a three-dimensional structure of MDM4 allows the precise location of the selected sites, as well as inference of its interactions with other p53 regulatory elements [46]. Additionally, proteomic analysis of another elasmobranch species (Isurus oxyrinchus) reveals that low cancer risk is likely connected with overexpression of immunity and tumor suppressor genes [55], some of them positively selected [56]. Understanding the function of these positively selected sites may be key to comprehending oncological deregulation and longevity. Moreover, these regulatory systems under positive selection can provide not only relevant information for understanding the basic biology of sharks but also information potentially valuable for biomedical research.

Within immunological research, multiple gene families are involved in finely tuned recognition processes that guarantee protection against nonself structures such as pathogens. The toll-like receptors (TLRs) are one such supergene family; they are expressed in the cell membrane and intracellular vesicles and are involved in a first line of immune defense against viral and nonviral pathogens. TLRs have been present since early invertebrate lineages, where they have an additional developmental function, whereas in vertebrates TLRs are involved solely in the immune response [57]. A recent comparative evolutionary genomic study of almost 80 species distributed across all vertebrate classes revealed the dynamic gain and loss of 26 TLRs since early vertebrate evolution. The lineage-specific gain or loss of TLR genes, integrated with the analysis of positively selected sites and their location in the three-dimensional structure, supports the rapid evolution of TLRs. Here, pinpointing positively selected sites on the predicted protein structures (obtained by homology modeling) elucidates their potential functional and structural significance (Fig. 5). The majority of the positively selected sites of the TLR proteins are restricted to the variable extracellular domain involved in ligand recognition and dimerization. With the integration of all these steps, it was shown in birds that alterations in both viral and nonviral TLRs are related to the host-pathogen arms race and the coevolution of ligands and receptors, which reflects the importance of this class of vertebrates as significant vectors of zoonotic pathogens and reservoirs for viruses [58].

Fig. 5 Advances in immunity field. The screening of TLRs in a massive avian dataset reveals multiple lines of evidence of positively selected sites in the LRR domains of TLRs, for example in TLR5, which is expressed in immune and nonimmune cells and has a crucial role in flagellin recognition. These results highlight the avian host-pathogen arms race and the coevolution of immune ligands/receptors. (Adapted from Khan et al. [58])

4.2 Insights in Mitochondrial Function and Related Diseases

The mitochondrion is a very peculiar cellular organelle that has its own genome (the mitogenome). Mitochondria replicate by binary fission, and their main function is to supply around 95% of the cellular energy. The genomic size of the mitogenome is very similar to that of bacterial genomes [59]; it has its own genetic code (with variations in some eukaryotic lineages [60]) and is exclusively maternally inherited, without recombination. The mitogenome typically encodes 13 peptide subunits involved in the complexes of oxidative phosphorylation (OXPHOS). Amino acid changes in these elements of the respiratory complexes have large functional effects, and some mutations have been linked to human diseases [61]. Since several mutations in the protein-coding genes of mitochondrial DNA (mtDNA) have been associated with variations in thermal or metabolic needs across species that experience different environmental conditions or use distinct locomotive strategies, several studies have focused on the evolution of mitochondrial function in species that live in challenging habitats and are distributed across geographic clines or ecological gradients.

Turtles are an interesting group in which to explore mitochondrial genome evolution, since different species face distinct respiratory challenges and use multiple strategies to ensure efficient respiration. An exploration of the mitochondrial genomes of 57 turtle species reported several positively selected sites, prevalent in OXPHOS complex I proteins, that were mapped onto the three-dimensional crystal structures [62]. The interpretation of the location of positively selected sites in three-dimensional structures, together with recent structural and biomedical studies, reveals that many of the identified sites are functionally relevant and may be involved in proton translocation pathways and coupling functions, which may help in the coordination of conformational changes among subunits. Additionally, some positively selected sites are located near accessory subunits that interact between complexes, suggesting a role of selection in shaping high-order interactions among subunits and complexes of the respiratory chain. The highly prevalent signatures of positive selection among mtDNA protein-coding genes of turtles support the hypothesis that alterations in habitat (e.g., highly aquatic lifestyles in soft-shell turtles) are associated with changes in mitochondrial functions [62]. Variations in elements of OXPHOS may be related to multiple diseases, including optic neuropathies, neurodegenerative diseases, and diabetes, and the location of a mutation reflects the potential severity of disease (e.g., alterations in critical catalytic residues that abolish activity may be less likely to occur, whereas mutations that slightly compromise activity may be tolerated more easily). The identification of natural selection acting on each residue of these complexes, interlinked with its location on the protein structures, will therefore be an interesting tool for future medical research [61].

Understanding the molecular function of a toxic compound, interlinked with comparative genomics and protein homology, is crucial to unravel natural detoxification mechanisms and allow the creation of antidotes. These approaches can also be applied to extinct organisms in order to understand the past and shed light on the future of currently threatened species. A striking example is the recent report of the whole genome of the extinct Carolina parakeet (Conuropsis carolinensis), whose last surviving specimen died in 1918. The Carolina parakeet was known for its predilection for cockleburs (Xanthium strumarium), which contain high levels of carboxyatractyloside diterpenoid glucoside (CAT) [63]. CAT is a lethal toxin that inhibits the production of mitochondrial energy by inhibiting ATP transporters (SLC25A4-6 and SLC25A31) [64]. After the release of the Carolina parakeet genome, its ATP transporters were assessed and compared with their orthologs in other available avian genomes, with striking results. Carolina parakeet SLC25A4 and SLC25A5 carry nonsynonymous amino acid changes that were associated with a helix of the protein and flanking pocket sites, with a potential effect on the functionality of both proteins. Once again, the three-dimensional modeled structures of SLC25A4 and SLC25A5 have a great impact on the structural and functional interpretation of positively selected sites (Fig. 6). It was also hypothesized that these mutations conferred an exclusive adaptive mechanism to deal with the toxic CAT present in the diet. Parrots, which are closely related to the Carolina parakeet, are also known to ingest fruits and seeds that are toxic to other vertebrates [65]. It has been proposed that some species could neutralize the toxic compounds by consuming clay from rivers as a toxin-absorbing strategy. However, other physiological detoxification mechanisms should be considered, such as the avian variations in SLC25A4 and SLC25A5 [66]. Moreover, the great adaptation of the Carolina parakeet to the CAT toxin and the lack of evidence for a long-term decline or widespread inbreeding suggest that the abrupt disappearance of this species was directly attributable to human pressures [66].

Fig. 6 Advances in mitochondrial field. SLC25A4 is an ATP transporter involved in the production of mitochondrial energy, and its function is blocked by the lethal toxin CAT. The sequencing of the whole genome of the extinct Carolina parakeet reveals a mutation in SLC25A4 possibly related to the adaptation of the Carolina parakeet to the toxicity of the high CAT levels present in its preferred food (cockleburs). (Adapted from Gelabert et al. [66])

4.3 Insights in Venom Proteins and Drug Design

The production of venoms is a strategy used by several species to improve their predation mechanisms and/or to act as defense machinery. It is present in multiple metazoan lineages, from invertebrates to more apical vertebrates. Venoms are complex cocktails of toxins with wide-ranging molecular activities, such as necrotoxicity, neurotoxicity, myotoxicity, and cytotoxicity, which make these mixtures a rich source of molecules with wide potential for pharmacological applications [67, 68]. Moreover, the use of toxins as weapons also drives coevolution: predators evolve to release the most efficient venom at the lowest molecular cost, and prey evolve venom resistance. This chemical arms race is reflected at the molecular level by a huge number of venom-coding genes that are under positive selection and undergoing rapid alterations. For example, although the common ancestor of all venomous reptiles had a core set of toxin-encoding genes, additional toxins were recruited during the subsequent evolution of lineages, giving rise to the complex venoms present in extant venomous lizards and snakes [69]. Often, the recruited toxin-encoding genes are also present in other species without a toxic effect. CRISPs are glycoproteins with an important role in mammalian reproduction but are also present in reptile venoms (Fig. 7). Whereas mammalian CRISPs are under strong negative selection, reptilian CRISPs are evolving under positive selection, and several positively selected sites were detected that, after protein structure prediction, were mapped to the molecular surface and the functional cysteine-rich domain. Moreover, the predatory mechanism used by several snake families is strictly related to the venom content and the selective pressure acting on venom protein-coding genes [70]. Another example of high variation and evolution of venoms was reported in snakes of the genus Crotalus, in which toxins vary not only among species but also among populations of a single species, which show distinct venom profiles [71]. This discovery highlights the tremendous biochemical diversity of the venom arsenal, driven by evolutionary pressures of variable strength.

Fig. 7 Advances in venoms field. CRISPs are an example of toxin-encoding genes recruited in reptile species while maintaining a nonvenomous role in mammals. The venom CRISPs show marks of positive selection, whereas the orthologous CRISPs involved in mammalian reproduction have ω < 1. (Adapted from Sunagar et al. [70])

The availability of three-dimensional structures of toxin compounds is extremely relevant not only to understand the shape and conformation of venom molecules but also to highlight the binding sites and clarify the interaction mechanisms of prey-predator molecules. Vampire bats are members of the subfamily Desmodontinae, known for their hematophagous lifestyle. To improve feeding, these species produce venom with strong anticoagulant and proteolytic properties [72]. One of the toxins in their venom cocktail is the plasminogen activator desmokinase, which dissolves fibrin clots and allows a continuous flow of blood [72]. Because of these properties, desmokinase has been intensively investigated for use as a thrombolytic drug for strokes [68]. Additionally, bat desmokinase is under strong positive selection, which allows the rapid accumulation of mutations. Evaluation of the three-dimensional desmokinase structure reveals that the majority of alterations occur on the molecular surface and do not affect molecular integrity or toxin function. These frequent mutations on the molecular surface of the toxin thus prevent the development of immunological resistance in prey animals and improve the fitness of vampire bats [72].

5 Revolutionary Use of Homology Prediction in Biotechnology: Biosensors and Recombinant Proteins The integration of concepts that comes from comparative genomics/“omics” and protein homology can be applied to a panoply of distinct fields, mainly biotechnology area that applies technological knowledge to biomolecule manipulation. Two promising examples with high impact are biosensors and recombinant proteins, both fields boosted by recent advances in protein homology prediction. Bitter tastings are usually representatives of toxicity signals [73] and fast and reliable detection of these molecules has commercial interest to guarantee the safety of foods. Several former strategies tried to create bitter molecule biosensors based on gustatory tissue cells and natural bitter receptors [74–76]. However, the use of biological tissues is complex and the reliability and reproducibility are dependent on multiple environmental factors. The major challenge was finding a small sensitive protein, easy to access and modify, stable, with good cost-effectiveness, and with ability to enhance the practicability [77]. The solution was odorant-binding proteins (OBPs) [78]. OBPs are small proteins present in nasal mucosa of vertebrates and in antennae of insects. They mediate the presentation of airborne odor cues to chemo-sensorial receptors by interaction between odorant molecules and OBP hydrophobic binding pocket [77, 79]. In a recent study [78], the protein-ligand interaction between OBP76 of Drosophila melanogaster (also known as LUSH) and an array of bitter molecules was accessed using the three-dimensional structure of LUSH protein and docking assays (Fig. 8). The protein structure of LUSH is wellknown and with high homology structure with bitter taste receptors which could be explored in bitter molecules detection field. Due to success of interactions, OBPs were immobilized in electrodes and electrochemical impedance spectroscopy analysis showed

Protein Functionality Retrieved by Omics and Remote Homology

75

Fig. 8 Advances in biotechnological field. The available protein structure of odorant-binding proteins allows the knowledge of active binding sites of these small sensitive proteins. The knowledge of their function, associated with their ease to access and modification, stability, and good cost-effectiveness, permitted the development of multiple biosensors directed to chemical detection and repellent creation

significant binding properties to denatonium, quinine, and berberine bitter molecules. Therefore, OBP biosensor promises a powerful analytic technique to obtain fast and simple gustatory information which has great value in bitter taste evaluation [78]. Some years before, OBP biosensors had been developed to detect volatile organic compounds indicative of Salmonella contamination [80] and also biosensors with ability to discriminate the octenol molecules from the carvone and thus assess food contamination by molds [81]. Recently, a novel biosensor using OBP1 of Anopheles was developed to detect metabolites related with Escherichia coli, such as indole in order to allow a water safety test with high specificity and sensitivity [82]. However, the manipulation of OBP proteins and its incorporation in biosensors is a drop in the ocean. Since crystal structure of Anopheles gambiae OBP1 was available, OBPs started to be used in binding assays (by testing battery of molecules present in essential oils extracted from aromatic plants with repellent potential [83]) and docking studies (by testing large quantity of molecules with ability to bind the long hydrophobic tunnel of invertebrates OBPs [84]) which increase the identification of mosquito repellents and open doors to new environmental friendly strategies to insect population control (Fig. 8). Recent binding assays with known OBPs present in the giant panda, Ailuropoda melanoleuca, were conducted to elucidate the molecular mechanisms responsible for a unique and anomalous


Liliana Silva and Agostinho Antunes

herbivorous diet of this carnivoran species [85]. Additionally, binding assays are a noninvasive approach, which is crucial given the vulnerable status of giant pandas [86]. The study demonstrated that OBP3, which is abundant in giant panda nasal mucus, has high affinity for plant volatiles such as cedrol, which is best represented in bamboo leaves collected in the spring [85]. Another important use of three-dimensional protein structures is in the recombinant protein field. Recombinant proteins are obtained by laboratory manipulation of DNA from two distinct sources. The most common example is the introduction of a protein-coding DNA fragment into a plasmid vector in bacteria or yeast, followed by cloning, expression, and purification of the targets [87, 88]. The diversity and applications of available recombinant proteins are extensive and have revolutionized medicine, providing new proteins with important pharmacological applications (such as growth hormone, insulin, erythropoietin, and interferon beta-1a) as well as new vaccines. Recently, a new protocol was developed to produce recombinant proteins with alterations in the N-terminal domain, which led to a promising low-cost synthetic lung surfactant [89]. Industry also benefits from recombinant proteins, since it is becoming easier and cheaper to produce and manipulate several enzymes and monomers [90]. At the same time, it is crucial to monitor the process and ensure the quality of the recombinant proteins obtained. While multiple approaches are available, such as chromatographic and spectroscopic tools, assessing the three-dimensional structure of recombinant proteins is also a valuable step [91] that helps in understanding the functionality and interactions of these artificial proteins.
As examples, we have the structural evaluation of recombinant adenine phosphoribosyltransferase (APRT), a potential target for antiparasitic drugs [92], and of recombinant human granulocyte macrophage-colony stimulation factor (rhGM-CSF), which promotes white blood cell proliferation and maturation [93]. Thus, the emergence of new methods for three-dimensional protein structure prediction will allow the optimization of recombinant protein production protocols as well as of their quality control.

6 Conclusion

Tightly orchestrated regulation of proteins and protein-coding genes is crucial across all kingdoms of life. In recent years, several omics approaches have allowed the detailed exploration of multiple biomolecules, from DNA and RNA to proteins and metabolites. However, it is highly rewarding (while challenging) to integrate the results from these subareas of biology with the three-dimensional structures of the related proteins.


The almost 170,000 protein structures available in public databases (like PDB) are just a drop in the ocean, and many more protein structures are yet to be solved. Thus, recent homology and remote homology strategies provide a fast and highly accurate approach to resolving three-dimensional protein structures. The resulting structures allow the interpretation of protein interactions, which is crucial for understanding biological pathways and phenomena such as cancer, immunity, or mitochondrial diseases. At the same time, integrating omics results with protein structures is a vital step in highlighting potential sites for drug design; it allows the creation of biotechnological devices (like biosensors) and enables the large-scale design of recombinant proteins with multiple biological and industrial applications. Some decades ago, a fast ab initio protocol for protein structure prediction with low computational cost and high-precision results was a utopia; nowadays this new era of bioinformatics is providing more accurate inferences and will change the way we look at cellular organization and function.

Acknowledgments

L.S. was supported by a PhD grant from “Fundação para a Ciência e a Tecnologia” (FCT) (SFRH/BD/134565/2017; COVID/BD/151995/2021). A.A. was partially supported by the Strategic Funding UIDB/04423/2020 and UIDP/04423/2020 through national funds provided by FCT and the European Regional Development Fund (ERDF) in the framework of the program PT2020, by the European Structural and Investment Funds (ESIF) through the Competitiveness and Internationalization Operational Program – COMPETE 2020 and by National Funds through the FCT under the project PTDC/CTA-AMB/31774/2017 (POCI-01-0145-FEDER/031774/2017).

References

1. Sanger F (1949) The terminal peptides of insulin. Biochem J 45:563–574
2. Muirhead H, Perutz MF (1963) Structure of hemoglobin. A three-dimensional Fourier synthesis of reduced human hemoglobin at 5.5 Å resolution. Nature 199:633–638
3. Kendrew JC, Bodo G, Dintzis HM et al (1958) A three-dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature 181:662–666
4. Matthews BW, Sigler PB, Henderson R et al (1967) Three-dimensional structure of Tosyl-α-chymotrypsin. Nature 214:652–656
5. Freer ST, Kraut J, Robertus JD et al (1970) Chymotrypsinogen: 2.5 Å crystal structure, comparison with α-chymotrypsin, and implications for zymogen activation. Biochemistry 9:1997–2009
6. Poljak RJ, Amzel LM, Avey HP et al (1973) Three-dimensional structure of the fab’ fragment of a human immunoglobulin at 2.8 Å resolution. Proc Natl Acad Sci U S A 70:3305–3310
7. Berman HM, Westbrook J, Feng Z et al (2000) The protein data bank (www.rcsb.org). Nucleic Acids Res 28:235–242

8. Callaway E (2020) ‘It opens up a whole new universe’: revolutionary microscopy technique sees individual atoms for first time. Nature 582:156–157
9. Martí-Renom MA, Stuart AC, Fiser A et al (2000) Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325
10. Rost B, Schneider R, Sander C (1997) Protein fold recognition by prediction-based threading. J Mol Biol 270:471–480
11. Agüero-Chapin G, Galpert D, Molina-Ruiz R et al (2020) Graph theory-based sequence descriptors as remote homology predictors. Biomol Ther 10:26
12. Agüero-Chapin G, Pérez-Machado G, Molina-Ruiz R et al (2011) TI2BioP: topological indices to BioPolymers. Its practical use to unravel cryptic bacteriocin-like domains. Amino Acids 40:431–442
13. Marrero-Ponce Y, Castillo-Garit JA, Olazabal E et al (2004) TOMOCOMD-CARDD, a novel approach for computer-aided ‘rational’ drug design: I. Theoretical and experimental assessment of a promising method for computational screening and in silico design of new anthelmintic compounds. J Comput Aided Mol Des 19:615–634
14. Díaz HG, Olazabal E, Castañedo N et al (2002) Markovian chemicals “in silico” design (MARCH-INSIDE), a promising approach for computer aided molecular design II: experimental and theoretical assessment of a novel method for virtual screening of fasciolicides. J Mol Model 8:237–245
15. Senior AW, Evans R, Jumper J et al (2020) Improved protein structure prediction using potentials from deep learning. Nature 577:706–710
16. Heo L, Feig M (2020) High-accuracy protein structures by combining machine-learning with physics-based refinement. Proteins 88:637–642
17. Heo L, Feig M (2020) Modeling of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) proteins by machine learning and physics-based refinement. bioRxiv 2020.03.25.008904
18. Fleischmann RD, Adams MD, White O et al (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–498
19. Fraser CM, Gocayne JD, White O et al (1995) The minimal gene complement of Mycoplasma genitalium. Science 270:397–404

20. Goffeau A, Barrell BG, Bussey H et al (1996) Life with 6000 genes. Science 274:562–567
21. Consortium TCeS (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282:2012–2018
22. Adams MD, Celniker SE, Holt RA et al (2000) The genome sequence of Drosophila melanogaster. Science 287:2185–2195
23. Rubin GM, Yandell MD, Wortman JR et al (2000) Comparative genomics of the eukaryotes. Science 287:2204–2215
24. Consortium IHGS (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
25. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463–5467
26. Shendure J, Balasubramanian S, Church GM et al (2017) DNA sequencing at 40: past, present and future. Nature 550:345–353
27. Scientists GKCo (2009) Genome 10K: a proposal to obtain whole-genome sequence for 10 000 vertebrate species. J Hered 100:659–674
28. Koepfli KP, Paten B, Scientists GKCo et al (2015) The genome 10K project: a way forward. Annu Rev Anim Biosci 3:57–111
29. Lewis HA, Robison GE, Kress WJ et al (2018) Earth BioGenome Project: sequencing life for the future of life. Proc Natl Acad Sci U S A 115:4325–4333
30. Rogers J, Gibbs RA (2014) Comparative primate genomics: emerging patterns of genome content and dynamics. Nat Rev Genet 15:347–359
31. Cazzanelli G, Pereira F, Alves S et al (2018) The yeast Saccharomyces cerevisiae as a model for understanding RAS proteins and their role in human tumorigenesis. Cell 7:14
32. Bigham AW, Mao X, Mei R et al (2009) Identifying positive selection candidate loci for high-altitude adaptation in Andean populations. Hum Genomics 4:79–90
33. Itan Y, Gerbault P, Pines G (2015) Evolutionary genomics. Evol Bioinforma 11:53–55
34. Lau SKP, Woo PCY, Lai KKY et al (2011) Complete genome analysis of three novel picornaviruses from diverse bat species. J Virol 85:8819–8828
35. Barros DR, Alfenas-Zerbini P, Beserra JEA Jr et al (2011) Comparative analysis of the genomes of two isolates of cowpea aphid-borne mosaic virus (CABMV) obtained from different hosts. Arch Virol 156:1085–1091

36. Jungreis I, Sealfon R, Kellis M (2021) SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nat Commun 12:2642
37. Finnigan GC, Hanson-Smith V, Stevens TH et al (2012) Evolution of increased complexity in a molecular machine. Nature 481:360–364
38. Duan M, Zhao W, Zhou L et al (2020) Omics research in vascular calcification. Clin Chim Acta 511:198–207
39. Melé M, Ferreira PG, Reverter F et al (2015) Human genomics. The human transcriptome across tissues and individuals. Science 348:660–665
40. Sandberg R (2014) Entering the era of single-cell transcriptomics in biology and medicine. Nat Methods 11:22–24
41. Sharma PV, Thaiss CA (2020) Host-microbiome interactions in the era of single-cell biology. Front Cell Infect Microbiol 10:569070
42. Das T, Andrieux G, Ahmed M et al (2020) Integration of online omics-data resources for cancer research. Front Genet 11:1320
43. Daviss B (2005) Growing pains for metabolomics. Scientist 19:25–28
44. Hollywood K, Brison DR, Goodacre R (2006) Metabolomics: current technologies and future trends. Proteomics 6:4716–4723
45. McCafferty CL, Verbeke EJ, Marcotte EM et al (2020) Structural biology in the multi-omics era. J Chem Inf Model 60:2424–2429
46. Marra NJ, Stanhope MJ, Jue NK et al (2019) White shark genome reveals ancient elasmobranch adaptations associated with wound healing and the maintenance of genome stability. Proc Natl Acad Sci U S A 116:4446–4455
47. Machado JP, Philip S, Maldonado E et al (2016) Positive selection linked with generation of novel mammalian dentition patterns. Genome Biol Evol 8:2748–2759
48. Machado JP, Johnson WE, Gilbert MTP et al (2016) Bone-associated gene evolution and the origin of flight in birds. BMC Genomics 17:371
49. Parsons KJ, Albertson RC (2013) Unifying and generalizing the two strands of evo-devo. Trends Ecol Evol 28:584–591
50. Gladki A, Kaczanowski S, Szczesny P et al (2013) The evolutionary rate of antibacterial drug targets. BMC Bioinform 14
51. Negrini S, Gorgoulis VG, Halazonetis TD (2010) Genomic instability: an evolving hallmark of cancer. Nat Rev Mol Cell Biol 11:220–228

52. Chin A, Mourier J, Rummer JL (2015) Blacktip reef sharks (Carcharhinus melanopterus) show high capacity for wound healing and recovery following injury. Conserv Physiol 3:cov062
53. Albanes D (1998) Height, early energy intake, and cancer. Evidence mounts for the relation of energy intake to adult malignancies. BMJ 317:1331–1332
54. Toufektchan E, Toledo F (2018) The guardian of the genome revisited: p53 downregulates genes required for telomere maintenance, DNA repair, and centromere structure. Cancers 10:135
55. Domingues RR, Mastrochirico-Filho VA, Mendes NJ et al (2020) Comparative eye and liver differentially expressed genes reveal monochromatic vision and cancer resistance in the shortfin mako shark (Isurus oxyrinchus). Genomics 112:4817–4826
56. Marra NJ, Richards VP, Early A et al (2017) Comparative transcriptomics of elasmobranchs and teleosts highlight important processes in adaptive immunity and regional endothermy. BMC Genomics 18:87
57. Imler JL, Hoffmann JA (2002) Toll receptors in Drosophila: a family of molecules regulating development and immunity. Curr Top Microbiol Immunol 270:63–79
58. Khan I, Maldonado E, Silva L et al (2019) The vertebrate TLR supergene family evolved dynamically by gene gain/loss and positive selection revealing a host–pathogen arms race in birds. Diversity 11:131
59. Andersson SG, Karlberg O, Canbäck B et al (2003) On the origin of mitochondria: a genomics perspective. Philos Trans R Soc B 358:165–177
60. Quek ZBR, Chang JJM, Ip YCA et al (2020) Mitogenomes reveal alternative initiation codons and lineage-specific gene order conservation in echinoderms. Mol Biol Evol 38:981–985
61. Bridges HR, Birrel JA, Hirst J (2011) The mitochondrial-encoded subunits of respiratory complex I (NADH:ubiquinone oxidoreductase): identifying residues important in mechanism and disease. Biochem Soc Trans 39:799–806
62. Escalona T, Weadick CJ, Antunes A (2017) Adaptive patterns of mitogenome evolution are associated with the loss of shell scutes in turtles. Mol Biol Evol 34:2522–2536
63. Stuart BP, Cole RJ, Gosser HS (1981) Cocklebur (Xanthium strumarium, L. var. strumarium) intoxication in swine: review and redefinition of the toxic principle. Vet Pathol 18:368–383
64. Pebay-Peyroula E, Dahout-Gonzalez C, Kahn R et al (2003) Structure of mitochondrial ADP/ATP carrier in complex with carboxyatractyloside. Nature 426:39–44
65. Gilardi JD, Toft CA (2012) Parrots eat nutritious foods despite toxins. PLoS One 7:e38293
66. Gelabert P, Sandoval-Velasco M, Serres A et al (2020) Evolutionary history, genomic adaptation to toxic diet, and extinction of the Carolina Parakeet. Curr Biol 30:108–114
67. Mohamed Abd El-Aziz T, Garcia Soares A, Stockand JD (2019) Snake venoms in drug discovery: valuable therapeutic tools for life saving. Toxins 11:564
68. Bordon KCF, Cologna CT, Fornari-Baldo EC et al (2020) From animal poisons and venoms to medicines: achievements, challenges and perspectives in drug discovery. Front Pharmacol 11:1132
69. Fry BG, Udheim EAB, Ali SA et al (2013) Squeezers and leaf-cutters: differential diversification and degeneration of the venom system in toxicoferan reptiles. Proteomics 12:1881–1889
70. Sunagar K, Johnson WE, O’Brien SJ et al (2012) Evolution of CRISPs associated with toxicoferan-reptilian venom and mammalian reproduction. Mol Biol Evol 29:1807–1822
71. Sunagar K, Undheim EA, Scheib H et al (2014) Intraspecific venom variation in the medically significant Southern Pacific Rattlesnake (Crotalus oreganus helleri): biodiscovery, clinical and evolutionary implications. J Proteome 99:68–83
72. Low DH, Sunagar K, Undheim EA et al (2013) Dracula’s children: molecular evolution of vampire bat venom. J Proteome 89:95–111
73. Nissim I, Dagan-Wiener A, Niv MY (2017) The taste of toxicity: a quantitative analysis of bitter and toxic molecules. IUBMB Life 69:938–946
74. Li Y, Liu Q, Xu Y et al (2005) The development of taste transduction and taste chip technology. Chin Sci Bull 50:1415–1423
75. Lu L, Hu X, Zhu Z (2017) Biomimetic sensors and biosensors for qualitative and quantitative analyses of five basic tastes. TrAC Trends Analyt Chem 87:58–70
76. von Molitor E, Riedel K, Hafner M et al (2020) Sensing senses: optical biosensors to study gustation. Sensors 20:1811
77. Brito NF, Oliveira DS, Santos TS et al (2020) Current and potential biotechnological applications of odorant-binding proteins. Appl Microbiol Biotechnol 104:8631–8648

78. Chen Z, Zhang Q, Shan J et al (2020) Detection of bitter taste molecules based on odorant-binding protein-modified screen-printed electrodes. ACS Omega 5:27536–27545
79. Muthukumar S, Rajesh D, Selvam RM et al (2018) Buffalo nasal odorant-binding protein (bunOBP) and its structural evaluation with putative pheromones. Sci Rep 8:9323
80. Sankaran S, Panigrahi S, Mallik S (2011) Odorant binding protein based biomimetic sensors for detection of alcohols associated with Salmonella contamination in packaged beef. Biosens Bioelectron 26:3103–3109
81. Di Pietrantonio F, Benetti M, Cannatà D et al (2015) A surface acoustic wave bio-electronic nose for detection of volatile odorant molecules. Biosens Bioelectron 67:516–523
82. Dimitratos SD, Hommel AS, Konrad KD et al (2019) Biosensors to monitor water quality utilizing insect odorant-binding proteins as detector elements. Biosensors 9:62
83. Kröber T, Koussis K, Bourquin M et al (2018) Odorant-binding protein-based identification of natural spatial repellents for the African malaria mosquito Anopheles gambiae. Insect Biochem Mol Biol 96:36–50
84. Murphy EJ, Booth JC, Davrazou F et al (2013) Interactions of Anopheles gambiae odorant-binding proteins with a human-derived repellent: implications for the mode of action of N,N-diethyl-3-methylbenzamide (DEET). J Biol Chem 288:4475–4485
85. Zhu J, Arena S, Spinelli S et al (2017) Reverse chemical ecology: olfactory proteins from the giant panda and their interactions with putative pheromones and bamboo volatiles. Proc Natl Acad Sci U S A 114:E9802–E9810
86. Swaisgood R, Wang D, Wei F (2016) Ailuropoda melanoleuca (errata version published in 2017). The IUCN Red List of Threatened Species 2016: e.T712A121745669. https://doi.org/10.2305/IUCN.UK.2016-2.RLTS.T712A45033386.en
87. Umstead A, Soliman AS, Lamp J et al (2020) Validation of recombinant human protein purified from bacteria: an important step to increase scientific rigor. Anal Biochem 611:113999
88. Seetaraman Amritha TM, Mahajan S, Subramaniam K et al (2020) Cloning, expression and purification of recombinant dermatopontin in Escherichia coli. PLoS One 15:e0242798
89. Kronqvist N, Sarr M, Lindqvist A et al (2017) Efficient protein production inspired by how spiders make silk. Nat Commun 8:15504
90. Sutherland TD, Huson MG, Rapson TD (2018) Rational design of new materials using recombinant structural proteins: current state and future challenges. J Struct Biol 201:76–83
91. Mizutani K, Toyoda M, Otake Y et al (2012) Structural and functional characterization of recombinant medaka fish alpha-amylase expressed in yeast Pichia pastoris. Biochim Biophys Acta 1824:954–962
92. Esipov RS, Timofeev VI, Sinitsyna EV et al (2018) Three-dimensional structure of recombinant adenine phosphoribosyltransferase from thermophilic bacterial strain Thermus thermophilus HB27. Rus J Bioorg Chem 44:504–510
93. Aubin Y, Gingras G, Sauvé S (2008) Assessment of the three-dimensional structure of recombinant protein therapeutics by NMR fingerprinting: demonstration on recombinant human granulocyte macrophage-colony stimulation factor. Anal Chem 80:2623–2627

Chapter 5

Easy Not Easy: Comparative Modeling with High-Sequence Identity Templates

Diego Javier Zea, Elin Teppa, and Cristina Marino-Buslje

Abstract

Homology modeling is the most common technique to build structural models of a target protein based on the structure of proteins with high-sequence identity and available high-resolution structures. This technique is based on the idea that protein structure shows fewer changes than sequence through evolution. While in this scenario single mutations would minimally perturb the structure, experimental evidence shows otherwise: proteins with high conformational diversity impose a limit on the paradigm of comparative modeling, as the same protein sequence can adopt dissimilar three-dimensional structures. These cases present challenges for modeling; at first glance, they may seem to be easy cases, but they have a complexity that is not evident at the sequence level. In this chapter, we address the following questions: Why should we care about conformational diversity? How can conformational diversity be considered in a practical way when doing template-based modeling?

Key words Comparative modeling, Homology modeling, Conformational diversity, Structural divergence, Conformational ensemble, Protein structure, Native state, Protein dynamics

1 Introduction

In the 1990s, the concept of modeling protein structures by homology began to be explored, and since then many methods have been developed and improved [1]. These methods assume that proteins with similar sequences have similar structures; thus, a protein with a known structure can serve as a template to build the structure of another protein with a similar sequence. Because the compared proteins are frequently homologues, the technique is commonly called “homology modeling.” However, this term implies a phylogenetic relationship between the two proteins, which might not be true in all cases, so “comparative” or “template-based” modeling are more comprehensive names than “homology modeling.”

Diego Javier Zea and Elin Teppa contributed equally with all other contributors. Sławomir Filipek (ed.), Homology Modeling: Methods and Protocols, Methods in Molecular Biology, vol. 2627, https://doi.org/10.1007/978-1-0716-2974-1_5, © Springer Science+Business Media, LLC, part of Springer Nature 2023


Diego Javier Zea et al.

Today there are around 180,000,000 protein sequences stored at UniProt, the repository of protein sequences and annotations [2], and only around 164,000 protein structures, corresponding to 91,000 different proteins, deposited in the Protein Data Bank (PDB), the repository of protein structures [3]. This is because solving three-dimensional structures is a difficult, laborious, costly, and time-consuming task. While the structure of most proteins remains unknown, we know most existing protein folds, as evidenced by the lack of new protein folds contributed to the PDB since 2011 despite the increasing number of deposited structures. Because the number of folds that protein domains can adopt is as small as ~1200 (according to the hierarchical domain classification systems CATH [4] and SCOP [5]), it is highly probable that a template can be found to model a protein structure (see Note 1), which points to the usefulness of comparative modeling. A problem that arises from the small number of available structures is that our knowledge is “dynamically incomplete” [6]. The high-resolution structure of a particular protein provides a snapshot of the protein in a given conformation. But proteins in their biological environment are not rigid objects, and several conformers coexist in a dynamic equilibrium [7, 8]. The conformational ensemble populated by proteins in their native state is what allows them to interact with different partners, bind their ligands, undergo allosteric transitions, etc. [9]. Consequently, any change that perturbs the ability of a protein to properly navigate the conformational landscape of the native state may result in pathogenesis.
In humans, a wide variety of diseases, including cancer, cardiovascular disease, diabetes, Parkinson’s disease, antitrypsin deficiency, Alzheimer’s disease, and spongiform encephalopathies, are caused, in part, by errors in the folding of the set of structures or imbalances of conformers in the native state, and they are considered “conformational diseases” [10–13]. One approach to studying the set of protein conformations in the native state is to analyze different structures of the same protein. Some protein structures were solved at least twice, sometimes under different experimental conditions, or by NMR (several models for the same condition), providing information about the conformational diversity of the protein’s native state. Nowadays, the number of known structural conformations per protein is still very scarce, and further efforts will be needed to amend the incompleteness of the PDB in terms of dynamical information [6]. As a protein exists as an ensemble of conformers, what information is missed when a single structure is modeled? Does the modeled structure represent the most biologically relevant structure? Is one model enough to understand the biological process under study?

Comparative Modeling, Easy Cases


This chapter reviews the modeling of protein structures for which there is a recognizable homologous protein to use as a template, that is, the “easy cases.” Tress and colleagues [14] defined these cases as “when it is possible to find a structural template using a simple BLAST search, without needing a profile approach as PSI-BLAST.” Currently, the technical definition of easy cases is more complex and based on several scores [15], but the previous definition is still useful for modeling purposes. As a rule of thumb, above 30% sequence identity between target and template the proteins will have the same fold (backbone chain); above 50% is a safe zone (the model will be right); and above 70% identity is an easy case. In the following sections we discuss how easy cases can end up being not so easy.
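
The sequence identity thresholds quoted above can be sketched in a few lines of Python. This is a minimal illustration, not part of any published protocol: the function names `percent_identity` and `modeling_zone` are hypothetical, identity is computed here over aligned (gap-free) columns only, which is just one of several conventions in use, and the cutoffs are simply the 30%, 50%, and 70% values given in the text.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over the gap-free columns of a pairwise alignment.

    Both strings must come from the same alignment (equal length, '-' for gaps).
    Assumption: columns containing a gap in either sequence are ignored.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        raise ValueError("the alignment has no gap-free columns")
    return 100.0 * identical / aligned


def modeling_zone(identity: float) -> str:
    """Map a percent identity onto the rule-of-thumb zones from the text."""
    if identity > 70.0:
        return "easy case"
    if identity > 50.0:
        return "safe zone"
    if identity > 30.0:
        return "same fold expected"
    return "below the rule-of-thumb range"


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    identity = percent_identity("ACDE-F", "ACDQGF")
    print(f"{identity:.1f}% identity: {modeling_zone(identity)}")
```

As the rest of the chapter argues, a label such as "easy case" describes only the sequence relationship; it says nothing about which conformer of the template was crystallized.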

2 Conformational Diversity

As mentioned, proteins are flexible molecules that exist in multiple conformations in their native state, and the ensemble of all these conformations constitutes their conformational diversity. For some proteins, the structural conformations that populate the native state are quite different because of secondary structure changes (for example, α-helix/loop transitions [16]), relative movements between domains, and order/disorder transitions, among other structural changes. Such proteins have high conformational diversity. Other proteins have low conformational diversity, meaning that the conformers populating their native state are very similar. Small movements can be biologically relevant even for proteins showing low conformational diversity. For example, rotations in the side chains of residues lining a tunnel are enough to switch between open and closed conformations (Fig. 1a, b). Several works show that small structural changes have large catalytic consequences in enzymes [17, 18]. The importance of conformational diversity does not depend on the degree of the structural changes but on how relevant the movement is for the biological function [19]. The extent of conformational diversity is observed by comparing different experimentally solved structures of the same protein. Structures solved by X-ray crystallography under different conditions might stabilize different sets of conformers. In the case of structures solved by NMR spectroscopy, each model represents a possible conformation of the protein under the same conditions. The most popular estimators of structural similarity between proteins are the root-mean-square deviation (RMSD) and the TM-score; the first one is the root-mean-square distance between corresponding atoms (usually Cα atoms), while the second one is


Fig. 1 Small and large conformational changes of identical sequences. (a, b) There are almost no structural changes between the two conformers of the cellulase Cel48F protein (RMSD = 0.17 Å); however, the tunnel (shown in red) is present in one conformer (PDB: 1F9D, panel a) whereas it is absent in the other (PDB: 1F9O, panel b). Tunnel detection and image generation were carried out using CAVER Analyst software [20]. (c) Large conformational change of calmodulin (RMSD = 12.15 Å); conformers are shown in yellow and green (PDBs: 1G4Y and 2BCX, respectively). The image was generated using ChimeraX [21]

scaled by a protein length-dependent distance parameter [22] (see Note 2). Monzon et al. showed that proteins with low conformational diversity have on average 0.83 Å RMSD between their conformers, while proteins with large conformational diversity have on average 1.3 Å [19], using MAMMOTH as the superposition method [23]. Different structural comparison methods might give different RMSD values, making them difficult to compare between articles. For example, Burra et al. show proteins requiring conformational changes as large as 23.7 Å RMSD using a rigid superimposition with the Kabsch algorithm [24, 25] guided by sequence alignment [26] (Fig. 1c). A comparative study of structures determined by both crystallography and NMR spectroscopy (on a set of 109 proteins) shows that NMR ensembles have larger RMSD values between


models than the conformers obtained by X-ray [9, 27]. This is partially explained by the different environments of the protein: solid state for X-ray crystallography and aqueous solution for NMR. When analyzing the structural differences between any two structures, it is useful to look at their experimental conditions. Kosloff and Kolodny [28] analyzed the causes that lead to structural differences in a set of 278 PDB chain pairs sharing ≥99% sequence identity and an RMSD ≥6 Å. The causes of structural differences, ordered by frequency, are different quaternary protein-protein interactions, presence/absence of ligands, different crystallization conditions (e.g., different pH), alternative crystallographic conformations of the same protein (e.g., asymmetric homomers), intra-chain differences (including point mutations, oxidized versus reduced intra-chain disulfide bonds, and presence/absence of a part of the protein chain), and proteins with/without DNA-RNA interaction.
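
As a concrete illustration of the structural comparisons discussed in this section, the sketch below superposes two sets of equivalent Cα coordinates with the Kabsch algorithm and reports the RMSD, together with a TM-score-style value computed on that fixed superposition. It assumes that NumPy is available and that residue equivalences are already known (for example, from a sequence alignment). The function names are hypothetical, the d0 expression is the usual length-dependent scale of the TM-score, and the 0.5 Å floor for short chains is an assumption of this sketch. TM-score programs search over superpositions, so the value computed here should be read as a lower bound rather than as the reported score.

```python
import numpy as np


def kabsch_superpose(mobile, reference):
    """Rigidly superpose `mobile` onto `reference` (both N x 3, same atom order)."""
    mobile = np.asarray(mobile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if mobile.shape != reference.shape or mobile.ndim != 2 or mobile.shape[1] != 3:
        raise ValueError("both inputs must be N x 3 arrays of the same shape")
    mobile_center = mobile.mean(axis=0)
    reference_center = reference.mean(axis=0)
    p = mobile - mobile_center
    q = reference - reference_center
    # Covariance matrix and its singular value decomposition.
    u, _, vt = np.linalg.svd(p.T @ q)
    # Correct for a possible reflection so that the result is a proper rotation.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return p @ rotation.T + reference_center


def rmsd(coords_a, coords_b):
    """Root-mean-square distance between paired atoms, without refitting."""
    diff = np.asarray(coords_a, dtype=float) - np.asarray(coords_b, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


def tm_score_fixed(superposed, reference, target_length=None):
    """TM-score-style sum evaluated on an existing superposition (no search)."""
    superposed = np.asarray(superposed, dtype=float)
    reference = np.asarray(reference, dtype=float)
    length = len(reference) if target_length is None else target_length
    # Length-dependent scale; the 0.5 floor is an assumption for short chains.
    d0 = 0.5 if length <= 15 else max(0.5, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)
    distances = np.linalg.norm(superposed - reference, axis=1)
    return float(np.sum(1.0 / (1.0 + (distances / d0) ** 2)) / length)


if __name__ == "__main__":
    # Synthetic coordinates, invented for illustration only.
    rng = np.random.default_rng(0)
    conformer_a = rng.normal(size=(30, 3)) * 10.0
    conformer_b = conformer_a + rng.normal(scale=0.5, size=(30, 3))
    fitted = kabsch_superpose(conformer_b, conformer_a)
    print("RMSD after superposition:", round(rmsd(fitted, conformer_a), 2))
    print("TM-score (fixed superposition):", round(tm_score_fixed(fitted, conformer_a), 2))
```

Because the superposition here is driven by all paired atoms, a single mobile domain can inflate the RMSD of an otherwise unchanged structure, which is one reason values from different superposition methods are hard to compare.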

3 High-Sequence Identity Does Not Guarantee an Accurate Model

The success of comparative modeling depends in part on the ability to detect a homologous protein structure to use as a template. Several works point out that pairwise sequence identity between the target and template is not an effective parameter to describe the expected quality of a model [29, 30]. Rataj et al. found that, when a model is evaluated by its ligand-binding properties and its performance in virtual screening, the best template was not the phylogenetically closest one, contrary to what the paradigm states. These results indicate that for some purposes, different characteristics of the template (not only sequence identity) might render a better model. Also, taking into account the conformational diversity and the different states of some proteins (open/closed, apo/holo), a more distant homologue might give a more accurate model depending on the purpose of the model (see Note 3). Finally, the resolution of the crystal structure is a parameter to take into account to obtain a more accurate model, sometimes at the expense of higher sequence identity (see Notes 4 and 5).
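
One way to make these trade-offs explicit is to rank candidate templates on more than sequence identity. The following is a minimal sketch under stated assumptions, not a published protocol: the `Template` record, the `rank_templates` function, the 10-point identity window, and the ordering of criteria (matching functional state first, then resolution, then identity) are all hypothetical choices made for illustration, and the candidate entries are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Template:
    name: str          # hypothetical label, not a real PDB code
    identity: float    # percent sequence identity to the target
    resolution: float  # crystal resolution in angstroms (lower is better)
    state: str         # annotated state, e.g. "apo" or "holo"


def rank_templates(
    candidates: List[Template],
    wanted_state: Optional[str] = None,
    identity_window: float = 10.0,
) -> List[Template]:
    """Rank templates whose identity lies within `identity_window` of the best.

    Assumed priority: matching state, then better resolution, then identity.
    """
    if not candidates:
        return []
    best_identity = max(t.identity for t in candidates)
    shortlist = [t for t in candidates if best_identity - t.identity <= identity_window]

    def sort_key(t: Template):
        state_mismatch = 0 if wanted_state is None or t.state == wanted_state else 1
        return (state_mismatch, t.resolution, -t.identity)

    return sorted(shortlist, key=sort_key)


if __name__ == "__main__":
    candidates = [
        Template("A", identity=95.0, resolution=3.2, state="apo"),
        Template("B", identity=90.0, resolution=1.8, state="holo"),
        Template("C", identity=88.0, resolution=1.5, state="apo"),
        Template("D", identity=60.0, resolution=1.0, state="holo"),
    ]
    for template in rank_templates(candidates, wanted_state="holo"):
        print(template)
```

In this toy example the template with the highest identity is ranked last when a ligand-bound model is wanted, which mirrors the point that the closest homologue is not always the best template for a given purpose.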

3.1 A Single Amino Acid Change (Minimal Sequence Changes) May Produce a Huge Conformational Change

The structural changes induced by single mutations are larger than the conformational diversity due to ligand binding, pH, or temperature [31], clearly indicating that even a single amino acid substitution can lead to large structural differences. Not all substitutions affect the protein structure, but the impact is hard to predict. The first known protein-misfolding disease was sickle cell anemia. In this disorder, a point mutation of a glutamic acid to valine in β-globin [32, 33] results in a conformational change,


Diego Javier Zea et al.

Fig. 2 Conformational change of a tyrosine kinase: the epidermal growth factor receptor tyrosine kinase domain. Green, closed (inactive) conformation (PDB: 1M14); yellow, open (active) conformation (PDB: 3W32) [21]

exposing a hydrophobic patch that leads to the polymerization of the protein. Another example is the set of single mutations in a particular region of tyrosine kinases that release the “molecular brake” of the kinase activity (an autoinhibitory mechanism mediated by a network of hydrogen bonds). The majority of kinase-activating mutations act by releasing the brake, favoring the active-state conformation of the kinase and triggering (together with other factors) numerous cancers [34] (Fig. 2). An astonishing example is human apolipoprotein E (UniProt code P02649), which plays a central role in the metabolism of cholesterol and triglycerides. Human ApoE is naturally expressed in three isoforms that differ from each other in only two residues: ApoE4 constitutes the most important genetic risk factor for Alzheimer’s disease, ApoE3 is neutral, and ApoE2 is protective [35]. The single substitution of a cysteine by an arginine at position 112 (in a protein of 299 residues) between ApoE3 and ApoE4 renders the protein less structured, less stable, less compact, and structurally more heterogeneous, with multiple conformers. Alzheimer’s disease is the most common form of dementia, affecting approximately 46 million people worldwide; it involves a rapid decline in memory and cognitive faculties and impairments in activities of daily living.

3.2 Candidate Templates with Conformational Diversity

The occurrence of sequence-similar structurally dissimilar proteins in the PDB poses a challenge to comparative modeling, particularly for automatic modeling servers. A given template may have different conformers sampling a large conformational space [36]. For

Comparative Modeling, Easy Cases


Fig. 3 Different conformations of TonB from E. coli. PDB 1IHR on the left and PDB 1U07 on the right. The four chains are 100% identical, as they belong to two structures of the same protein crystallized in two different homodimeric forms [21]

example, for the C-terminal domain of TonB from Enterobacter aerogenes (residues 171–240 of UniProt code P02929), the closest homologue is TonB from E. coli, which was crystallized in two alternative homodimeric forms. The automatic server SWISS-MODEL [37] gives several possible templates with different structures and the same or similar percent identity to the target protein. The server's first option will be the template that maximizes the coverage of the target and the quality of the model. Figure 3 shows the two conformations of TonB. The monomers of the two dimeric forms cannot be superimposed, showing the extent of the structural difference between them. Even the secondary structure of the residues is not identical in the two conformers, which challenges secondary structure prediction software. If the different conformations are as evident as in Fig. 3, you will easily notice them, but most of the time, differences are not that evident. Frequent cases are those of proteins having more than one conformation, such as calmodulin and tyrosine kinases, depicted in Figs. 1c and 2. The putative templates might include PDB entries with and without ligands, metals, or inhibitors, or in complex, among other conditions. In such cases, it is necessary to make models with the different conformers of the templates (see Note 6).

3.3 Structural Divergence Within a Protein Family

The term “structural divergence” refers to the structural differences that arise through the evolutionary process between members of a protein family. Two proteins of at least 80 residues, showing a sequence identity as low as 25%, tend to have the same protein fold [38], but the fold might show large structural divergence. Zea et al. (2018) carried out a study of structural divergence within families of homologous proteins using Pfam, a database of protein families [39]. To ensure sufficient sequence divergence within each protein family, the dataset comprised 817 families that meet the following requirements: at least 4 clusters of sequences at 62% identity and at least 1 known structure in each cluster [40]. Within each family, they superimposed the structures (all against all) guided by the Pfam alignment and


Fig. 4 Structural variations within a Pfam family. (a) The comparison of all against all structures shows that the mean RMSD per family is 2.48 Å on average and the maximum RMSD between structures within the family is on average 5.65 Å. (b) Solvent accessibility and inter-residue contacts undergo large variations between structures of the same family

calculated the RMSD between them (using Cα atoms). Two characteristics emerged from these comparisons. First, the average RMSD per family was 2.48 Å (calculated by pairwise comparison of all the structures within a family). Second, the RMSD between the two most different structures within each family was on average 5.65 Å. While these values highlight the differences at the backbone level in a family, greater differences are observed for more subtle structural variables. Specifically, on average 50.57% of the positions change their status between buried and solvent-exposed; in other words, a position that is buried in one protein is exposed at the equivalent position in another protein of the family, even though the overall protein structure is conserved. Also, on average, 53.32% of the inter-residue contact pairs that are observed in one structure are not present in another (Fig. 4). Other works show that the degree of structural divergence of a protein family is related to the degree of conformational diversity of its members [41, 42]. The maximum RMSD observed between two proteins of the family correlates with the maximum RMSD between two conformers of the same protein (Pearson’s correlation coefficient of 0.75). This observation is important for comparative modeling as families with high structural divergence and conformational


diversity will be among the most difficult cases, even when templates with high sequence similarity are found. In those cases, it will be necessary to make several models to obtain the main conformers that populate the native state of the protein and to gain insight into its functional mechanism.
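The all-against-all Cα RMSD comparison described in this section can be sketched as follows. This is a minimal illustration, not the pipeline used in the cited study: it assumes that equivalent Cα atoms have already been paired (for example, by an alignment) and stored as NumPy arrays of shape (N, 3), and it uses the Kabsch superposition [24, 25] computed with a singular value decomposition. The function names `kabsch_rmsd` and `pairwise_summary` are hypothetical.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")
    # Remove translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = (R @ Pc.T).T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))


def pairwise_summary(structures):
    """Mean and maximum RMSD over all pairs of equally sized structures."""
    n = len(structures)
    values = [kabsch_rmsd(structures[i], structures[j])
              for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(values)), float(np.max(values))
```

Applied to the aligned Cα coordinates of the structures in a family, `pairwise_summary` returns the two quantities discussed above: the mean pairwise RMSD and the RMSD between the two most different structures.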

4 Does My Query Protein Have Conformational Diversity?

To choose the modeling strategy, it is very helpful to know if the protein has conformational diversity, that is, if it is worth modeling different conformations. While there is no proven method to predict this, we can find some hints by looking at the protein and its family. The first step is to look for homologous proteins, for example, with a BLAST search against the PDB. This gives a first impression of the structural similarity of the templates and the difficulty of the modeling process. If more than one protein is suitable to use as a template, their superposition will show whether they are structurally different, anticipating the need to make several models that include the more phylogenetically distant templates (see Note 7). As the degree of structural divergence of a family correlates with the conformational diversity of its member proteins [42], a structurally diverse set of templates could indicate the conformational diversity of a protein and, in consequence, the need to model different conformers. When modeling with an automatic server, be aware of the suggested templates, compare their structures, and decide whether it is better to choose a template other than the suggested one, or more than one if the server allows it; for example, SWISS-MODEL offered the possibility of using 14 templates in the example in Fig. 3 (see Note 8). The presence of protein disorder or low complexity regions could be another indicator of high conformational diversity. A study comparing multiple crystallographic structures of the same protein [19] shows that almost 60% of the analyzed proteins are rigid. These rigid proteins have no disordered regions and a small conformational diversity (mean RMSD of 0.83 Å). However, they still need to perform some movements to function.
In particular, the study shows that rigid proteins vary the tunnel length (when present) between conformers more than flexible proteins do, and that this is achieved by movements of the side chains without affecting the backbone. This study also classifies proteins with disordered regions into malleable and partially disordered. As the authors worked with crystallographic structures, there are no fully disordered proteins in their study. Malleable proteins have fewer conformers with disordered regions than partially disordered proteins. However, malleable proteins reach higher RMSD values (1.3 Å on average) than partially disordered ones (1.1 Å on average), when the ordered


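[Fig. 5 plot residue. Panel a: RMSD (axis from 0 to 4) versus percent identity (axis from 20 to 100) for Pfam families PF00273, PF00106, and PF17209. Panels b–d: superimposed structures]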

Fig. 5 Example of sequence and structure divergence in three protein families. (a) Structural divergence in terms of RMSD vs percent identity for three protein families with different characteristics: malleable (green), partially disordered (purple), and rigid (orange). The Pfam family PF00273 (serum albumin, green) shows the greatest values of structural divergence (higher RMSD values at low percent identity) and conformational diversity (high RMSD values at 100% identity). The structure of human serum albumin (UniProt code: P02768, PDB: 1AO6) is highlighted for this family. The Pfam family PF00106, a short-chain dehydrogenase (purple) that contains the partially disordered protein estradiol 17-beta-dehydrogenase 1 (UniProt code: P14061, PDB: 1A27), has high structural divergence but low conformational diversity. The Pfam family PF17209, RNA-binding protein Hfq (orange), has small structural divergence and the lowest conformational diversity. The protein with UniProt code P0A6X3 (PDB: 4JRI) is highlighted. (b–d) Superimposed structures for these families, following the same color code. The analysis was performed using the MIToS suite [43]; superimposed structures were rendered with ChimeraX [21]

regions of the conformers are compared. All of the above indicates that the presence of intrinsically disordered regions in the templates can be an indicator of high flexibility in the protein family and, as a consequence, could indicate that the target protein has high conformational diversity (see Notes 9 and 10). Figure 5 shows the conformational diversity and structural divergence in three protein families. A protein family allows us to observe both the structural divergence accumulated throughout the evolution of the different protein members and the different conformations of a single protein. It is interesting to observe that the distance between two conformers of the same protein can be as large as the distance between two proteins at any sequence identity (even as low as 20%).
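As a rough complement to the dedicated predictors mentioned in Notes 9 and 10, compositionally biased stretches can be flagged with a sliding-window Shannon entropy. The sketch below is only an illustration of that idea and is not the algorithm of any tool cited in this chapter; the function names, the window length of 12, the 2.2-bit threshold, and the example sequence are assumptions chosen for the example.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_positions(sequence, window=12, threshold=2.2):
    """Return sorted 0-based positions covered by any low-entropy window."""
    flagged = set()
    for start in range(len(sequence) - window + 1):
        if window_entropy(sequence[start:start + window]) < threshold:
            flagged.update(range(start, start + window))
    return sorted(flagged)


# Hypothetical sequence with a poly-Q stretch at positions 10 to 24.
example = "MKTAYIAKQR" + "Q" * 15 + "LEDFGHIKLM"
flagged = low_complexity_positions(example)
print(flagged[0], flagged[-1])
```

Positions flagged this way are only candidates: as Note 9 points out, low complexity and disorder overlap but are not equivalent, so a disorder predictor should still be run separately.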


4.1 Modeling Different Conformations of a Protein


Modeling different conformations of a protein is a challenging task. Researchers have developed different computational methods to address this problem, each with its strengths and weaknesses. After the pioneering work of Elber and Karplus, who modeled multiple structural conformations of myoglobin through molecular dynamics simulations [44], many authors used that methodology to obtain conformational ensembles. Elber and Karplus noticed that the molecular dynamics simulation of myoglobin showed structural variations similar to those observed in the whole globin family. Molecular dynamics simulations allow the exploration of protein movements on the nanosecond scale but cannot routinely explore the longer timescales where some functionally important movements occur (this is technically possible but computationally expensive). To enhance the exploration of the conformational space, different groups developed improved sampling techniques, such as umbrella sampling and accelerated molecular dynamics; see Narayanan et al. [45] and references therein. Another way to address the conformational exploration problem is to use coarse-grained methods, such as normal mode analysis; see Saldaño et al. [46] and references therein. However, those methods can sometimes produce unrealistic conformations. The correlation between structural divergence and conformational diversity [42] supports the new SWISS-MODEL approach of using more than one template when the templates show significant differences. The modeling pipeline then incorporates structural superimposition and clustering of template structures to model alternative conformations [37, 47]. Narunsky et al. [48] proposed a different approach for modeling alternative conformations: to use a structural search instead of a homology (sequence-based) search to find templates.
This work shows that proteins sharing one conformer are likely to share others; therefore, the authors look for structures similar to the query structure and use the different conformations of the retrieved proteins as templates in a comparative modeling pipeline. That procedure is more likely to find alternative conformers for the query protein when the number of gathered structures is high and the sequence identity to the query is low. The former condition matters because highly abundant conformations mask the less common ones. Palopoli et al. [49] also show that template-based modeling methods sample the conformational space of a protein more comprehensively when they use evolutionarily distant rather than close homologous templates. In this article, the authors used CASP results to assess how many times a poorly scored model was a good model but with an alternative conformation.
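The step of superimposing templates and clustering them into alternative conformations can be sketched as below. This is a generic single-linkage clustering on a precomputed pairwise RMSD matrix, offered only as an illustration: the source does not describe how SWISS-MODEL clusters templates, and the function names, the 2.0 Å cutoff, and the RMSD values are hypothetical.

```python
import numpy as np


def cluster_by_rmsd(rmsd, cutoff):
    """Single-linkage clustering of templates from a pairwise RMSD matrix.

    Two templates end up in the same cluster if they are connected by a
    chain of pairs whose RMSD is at most `cutoff`.
    """
    rmsd = np.asarray(rmsd, dtype=float)
    n = rmsd.shape[0]
    labels = [-1] * n
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue  # already assigned to a cluster
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and rmsd[i, j] <= cutoff:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def representatives(rmsd, labels):
    """Pick the medoid of each cluster (smallest summed RMSD to its members)."""
    rmsd = np.asarray(rmsd, dtype=float)
    reps = {}
    for label in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == label]
        sub = rmsd[np.ix_(members, members)]
        reps[label] = members[int(np.argmin(sub.sum(axis=1)))]
    return reps


# Hypothetical RMSD values (in angstroms) for five candidate templates.
rmsd = [
    [0.0, 0.8, 1.0, 6.0, 6.2],
    [0.8, 0.0, 0.6, 5.9, 6.1],
    [1.0, 0.6, 0.0, 6.3, 6.0],
    [6.0, 5.9, 6.3, 0.0, 0.7],
    [6.2, 6.1, 6.0, 0.7, 0.0],
]
labels = cluster_by_rmsd(rmsd, cutoff=2.0)
reps = representatives(rmsd, labels)
print(labels, reps)
```

Each cluster would then correspond to one candidate conformation, and one model could be built per representative template. The cutoff is a judgment call that depends on the protein and on the purpose of the model.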


4.2 When Do I Need to Model Multiple Conformations?

A protein model can have multiple uses, for example, to learn a protein's fold, investigate protein function, plan mutation experiments, explain experimental results, elucidate an enzymatic mechanism, predict interactions, and perform ligand docking, among hundreds of other purposes. For some of these purposes, for example, to learn a protein's fold, it is not necessary to model different conformers. Whether open or closed, a kinase domain is a kinase domain. Modeling conformers should therefore be done when there is a need, for example, when there is evidence that it is important for understanding the function. Conformational diversity is expected if the family has structural divergence, if there are disordered or low complexity regions, and if there are biological data that indicate possible states, for example, apo/holo enzymes and evidence of allosterism, among others. Figure 6 shows a decision scheme to evaluate whether protein conformers are expected and whether modeling several conformers is justified.

Fig. 6 When are multiple conformations expected? The decision flow gives clues to decide whether to make multiple models or not. The use of the model is not included in the decision flow, but it is the first criterion to be considered
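The criteria listed in this section can be written down as a small checklist function. This is a simplified sketch and not a transcription of Fig. 6: the function name and arguments are hypothetical, and treating any single indicator as sufficient is an assumption made here for illustration.

```python
def expect_multiple_conformers(purpose_needs_conformers,
                               structurally_divergent_family,
                               disordered_or_low_complexity,
                               evidence_of_distinct_states):
    """Return (decision, reasons) for modeling more than one conformer.

    The intended use of the model is checked first; the remaining flags
    are indicators of expected conformational diversity.
    """
    reasons = []
    if structurally_divergent_family:
        reasons.append("family shows structural divergence")
    if disordered_or_low_complexity:
        reasons.append("disordered or low complexity regions present")
    if evidence_of_distinct_states:
        reasons.append("biological evidence of distinct states")
    decision = purpose_needs_conformers and bool(reasons)
    return decision, reasons


# Example: docking study on an enzyme with known apo and holo forms.
decision, reasons = expect_multiple_conformers(True, False, False, True)
print(decision, reasons)
```

If the purpose is only to learn the fold, the function returns a negative decision regardless of the indicators, mirroring the point that the use of the model is the first criterion.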


5 Conclusions

Since the early 1990s, comparative protein modeling has narrowed the gap between the number of known structures and known sequences. It has been of major help in planning molecular biology experiments (such as mutations) and in explaining experimental results. In the early stages, when conformational diversity was not part of the discussion, sequence identity was the only condition for accepting a protein as a good template to model another one similar in sequence. Only in the last decades has conformational diversity begun to be considered an important aspect of understanding protein function. Today much knowledge has been gained, and it is easier for the user to anticipate whether a protein will have conformational diversity and whether this is important for its biological function. This chapter gives clues to guide structural modeling when either a template or the target protein has conformational diversity. It focuses on modeling with high-sequence identity templates, but it is applicable to the whole range of template identities. It is not necessary to take the issue of different conformations into account in all cases, so reading the literature is important to accompany the modeling process. We must ask ourselves what we expect to answer with a model of our protein of interest.

6 Notes

1. There are two very informative graphs for understanding the growth of known folds and of new structures over time (the number of folds has not grown since 2011): http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=fold-cath and https://www.rcsb.org/stats/growth/growth-released-structures

2. RMSD is the most popular measure to evaluate structure similarity between proteins. However, it depends not only on the structure similarity but also on the crystallographic resolution and the protein length. It is expressed in angstroms (Å); it is 0 for identical structures and has no fixed upper bound.

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\delta_i^{2}}$$

where $\delta_i$ is the distance between atom $i$ and either a reference structure or the mean position of the $N$ equivalent atoms. Another popular measure to evaluate structure similarity is the TM-score, a protein size-independent score [50]:

$$\text{TM-score} = \max\left[\frac{1}{L}\sum_{i=1}^{L_{ali}}\frac{1}{1 + d_i^{2}/d_0^{2}}\right]$$

where $L$ is the length of the native structure, $L_{ali}$ is the number of residues aligned to the template structure, $d_i$ is the distance between the $i$th pair of aligned residues, and $d_0$ is a scale to normalize the match difference. “Max” denotes the maximum value after optimal spatial superposition. The value of the TM-score lies in (0, 1], with a higher value indicating a stronger similarity. TM-score calculation server: https://zhanglab.ccmb.med.umich.edu/TM-score/.

3. If you know in advance that a protein has more than one conformation in the native state, take into account that when modeling one of the states, you know only half of the story. Depending on the purpose of the modeling, you might need another conformation or both. It might be necessary to find another structure, for example, a more distant homologue, to model all the states.

4. A practical problem is that the same protein might be solved at different resolutions; for example, hepatitis C virus NS5B RNA-dependent RNA polymerase was solved at 1.9 Å (PDB code: 1C2P) and at 2.8 Å (PDB code: 1CSJ) resolution. The first one has 97% of the residues in the most favored region of the Ramachandran plot and the second one has 90%. Also, the clash score between atoms is higher in the second structure (7 vs 23). A model built with 1C2P will be more accurate than one made with 1CSJ.

5. In all cases, the models have to be validated to test their accuracy: stereochemically allowed regions for backbone torsion angles in a Ramachandran plot [51–53], clashes between atoms, and compatibility of an atomic model (3D) with its own amino acid sequence (1D) [54], among many others [55]. Currently there are many different methods to assess model quality. The Protein Data Bank incorporated several validation tools in the assignment of structure quality, and these can be taken as a reference since they are considered the state-of-the-art methods to validate template-based models [55]. A table of validation methods and their links is here: https://www.rcsb.org/pdb/static.do?p=software/software_links/analysis_and_verification.html. Also, any biological information is of utmost importance to validate the model. As an example, if there is experimental evidence of disulfide bonds, the model should have them; residues known to be antigenic or to form protein binding sites should be on the surface; etc.


6. Superpose all the conformers to see the amplitude of the conformational diversity and make models with several conformers of the template. Structure superposition can be done with structure visualization software such as PyMOL [56, 57] and UCSF ChimeraX [21], among others; there are also many superposition servers, for example, mulPBA [58] or SuperPose [59].

7. Large families of homologous proteins tend to have high structural diversity among the family members. When structural diversity is suspected, it is necessary to generate more than one model to cover the structural space of the protein family and, in consequence (due to the correlation), the conformational space of the modeled protein. It is necessary to superpose all the templates in the same way as described in Note 6.

8. It is fine to use an automatic server with default parameters, but it is useful to know in advance, by superposition, whether the template proteins have different conformations in order to decide if it is worth modeling with more than one template.

9. Low complexity regions (LCRs) are another aspect to take care of. An LCR is a compositionally biased region of the protein, enriched in a few amino acid types compared to an average composition. Typical examples are poly-Q regions, histidine-rich and proline-rich areas, etc. Repetitive, biased stretches of the protein are also low complexity regions. There is a certain overlap between disordered and low complexity regions, but it is not a direct association. Disordered regions might not have low complexity, and low complexity regions might not be disordered. It is useful to run a low complexity region predictor as well as a disorder predictor. A platform of tools and a meta-server for the calculation and visualization of LCRs is here: http://platoloco.aei.polsl.pl/#!/query [60].

10. The MobiDB2 database [61] (http://mobidb.bio.unipd.it/) has eight different disorder predictors and useful protein annotations.
You can also run a disorder predictor, for example, IUPred [62], to know the disorder content of your query protein. This is a good indicator of possible high conformational diversity of the protein and high structural divergence of the family.

References

1. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234:779–815
2. The UniProt Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506–D515
3. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242
4. Dawson NL, Lewis TE, Das S, Lees JG, Lee D, Ashford P, Orengo CA, Sillitoe I (2017) CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res 45:D289–D295
5. Andreeva A, Kulesha E, Gough J, Murzin AG (2020) The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res 48:D376–D382
6. Marino-Buslje C, Monzon AM, Zea DJ, Fornasari MS, Parisi G (2017) On the dynamical incompleteness of the Protein Data Bank. Brief Bioinform 20:356–359
7. Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181:223–230
8. Boehr DD, Nussinov R, Wright PE (2009) The role of dynamic conformational ensembles in biomolecular recognition. Nat Chem Biol 5:789–796
9. Monzon AM, Fornasari MS, Zea DJ, Parisi G (2019) Exploring protein conformational diversity. Methods Mol Biol 1851:353–365
10. Salahuddin P (2015) Protein folding, misfolding, aggregation and amyloid formation: mechanisms of Aβ oligomer mediated toxicities. J Biochem Mol Biol Res 1:36–45
11. Lin J-C, Liu H-L (2006) Protein conformational diseases: from mechanisms to drug designs. Curr Drug Discov Technol 3:145–153
12. Ellisdon AM, Bottomley SP (2004) The role of protein misfolding in the pathogenesis of human diseases. IUBMB Life 56:119–123
13. Sweeney P, Park H, Baumann M et al (2017) Protein misfolding in neurodegenerative diseases: implications and strategies. Transl Neurodegener 6:6
14. Tress M, Tai C-H, Wang G, Ezkurdia I, López G, Valencia A, Lee B, Dunbrack RL Jr (2005) Domain definition and target classification for CASP6. Proteins 61(Suppl 7):8–18
15. Kinch LN, Kryshtafovych A, Monastyrskyy B, Grishin NV (2019) CASP13 target classification into tertiary structure prediction categories. Proteins 87:1021–1036
16. Yassine W, Taib N, Federman S et al (2009) Reversible transition between alpha-helix and beta-sheet conformation of a transmembrane domain. Biochim Biophys Acta 1788:1722. https://doi.org/10.1016/j.bbamem.2009.05.014
17. Koshland DE (1998) Conformational changes: how small is big enough? Nat Med 4:1112–1114
18. Mesecar AD, Stoddard BL, Koshland DE Jr (1997) Orbital steering in the catalytic power of enzymes: small structural changes with large catalytic consequences. Science 277:202–206
19. Monzon AM, Zea DJ, Fornasari MS, Saldaño TE, Fernandez-Alberti S, Tosatto SCE, Parisi G (2017) Conformational diversity analysis reveals three functional mechanisms in proteins. PLoS Comput Biol 13:e1005398
20. Jurcik A, Bednar D, Byska J et al (2018) CAVER Analyst 2.0: analysis and visualization of channels and tunnels in protein structures and molecular dynamics trajectories. Bioinformatics 34:3586–3588
21. Goddard TD, Huang CC, Meng EC, Pettersen EF, Couch GS, Morris JH, Ferrin TE (2018) UCSF ChimeraX: meeting modern challenges in visualization and analysis. Protein Sci 27:14–25
22. Olechnovič K, Monastyrskyy B, Kryshtafovych A, Venclovas Č (2019) Comparative analysis of methods for evaluation of protein models against native structures. Bioinformatics 35:937–944
23. Lupyan D, Leo-Macias A, Ortiz AR (2005) A new progressive-iterative algorithm for multiple structure alignment. Bioinformatics 21:3255–3263
24. Kabsch W (1976) A solution for the best rotation to relate two sets of vectors. Acta Crystallogr A 32:922–923
25. Kabsch W (1978) A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr A 34:827–828
26. Burra PV, Zhang Y, Godzik A, Stec B (2009) Global distribution of conformational states derived from redundant models in the PDB points to non-uniqueness of the protein structure. Proc Natl Acad Sci U S A 106:10505–10510
27. Sikic K, Tomic S, Carugo O (2010) Systematic comparison of crystal and NMR protein structures deposited in the protein data bank. Open Biochem J 4:83–95
28. Kosloff M, Kolodny R (2008) Sequence-similar, structure-dissimilar protein pairs in the PDB. Proteins Struct Funct Bioinf 71:891–902
29. Tramontano A, Morea V (2004) Assessment of homology-based predictions in CASP5. Proteins Struct Funct Bioinf 55:782–782
30. Rataj K, Witek J, Mordalski S, Kosciolek T, Bojarski AJ (2014) Impact of template choice on homology model efficiency in virtual screening. J Chem Inf Model 54:1661–1668
31. Parisi G, Zea DJ, Monzon AM, Marino-Buslje C (2015) Conformational diversity and the emergence of sequence signatures during evolution. Curr Opin Struct Biol 32:58–65
32. Ingram VM (1957) Gene mutations in human haemoglobin: the chemical difference between normal and sickle cell haemoglobin. Nature 180:326–328
33. Hunt JA, Ingram VM (1959) A terminal peptide sequence of human haemoglobin? Nature 184(Suppl 9):640–641
34. Molina-Vila MA, Nabau-Moretó N, Tornador C, Sabnis AJ, Rosell R, Estivill X, Bivona TG, Marino-Buslje C (2014) Activating mutations cluster in the “molecular brake” regions of protein kinases and do not associate with conserved or catalytic residues. Hum Mutat 35:318–328
35. Huang Y-WA, Zhou B, Wernig M, Südhof TC (2017) ApoE2, ApoE3, and ApoE4 differentially stimulate APP transcription and Aβ secretion. Cell 168:427–441.e21
36. Illergård K, Ardell DH, Elofsson A (2009) Structure is three to ten times more conserved than sequence—a study of structural response in protein cores. Proteins Struct Funct Bioinf 77:499–508
37. Waterhouse A, Bertoni M, Bienert S et al (2018) SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res 46:W296–W303
38. Sander C, Schneider R (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9:56–68
39. El-Gebali S, Mistry J, Bateman A et al (2019) The Pfam protein families database in 2019. Nucleic Acids Res 47:D427–D432
40. Zea DJ, Monzon AM, Parisi G, Marino-Buslje C (2018) How is structural divergence related to evolutionary information? Mol Phylogenet Evol 127:859–866
41. Vetrivel I, de Brevern AG, Cadet F, Srinivasan N, Offmann B (2019) Structural variations within proteins can be as large as variations observed across their homologues. Biochimie 167:162–170
42. Monzon AM, Zea DJ, Marino-Buslje C, Parisi G (2017) Homology modeling in a dynamical world. Protein Sci 26:2195–2206
43. Zea DJ, Anfossi D, Nielsen M, Marino-Buslje C (2017) MIToS.jl: mutual information tools for protein sequence analysis in the Julia language. Bioinformatics 33:564–565
44. Elber R, Karplus M (1987) Multiple conformational states of proteins: a molecular dynamics analysis of myoglobin. Science 235:318–321
45. Narayanan C, Bernard DN, Doucet N (2016) Role of conformational motions in enzyme function: selected methodologies and case studies. Catalysts. https://doi.org/10.3390/catal6060081
46. Saldaño TE, Freixas VM, Tosatto SCE, Parisi G, Fernandez-Alberti S (2020) Exploring conformational space with thermal fluctuations obtained by normal-mode analysis. J Chem Inf Model 60:3068. https://doi.org/10.1021/acs.jcim.9b01136
47. Bienert S, Waterhouse A, de Beer TAP, Tauriello G, Studer G, Bordoli L, Schwede T (2017) The SWISS-MODEL repository—new features and functionality. Nucleic Acids Res 45:D313–D319
48. Narunsky A, Nepomnyachiy S, Ashkenazy H, Kolodny R, Ben-Tal N (2015) ConTemplate suggests possible alternative conformations for a query protein of known structure. Structure 23:2162–2170
49. Palopoli N, Monzon AM, Parisi G, Fornasari MS (2016) Addressing the role of conformational diversity in protein structure prediction. PLoS One 11:e0154923
50. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57:702–710
51. Ramachandran GN, Sasisekharan V (1968) Conformation of polypeptides and proteins. In: Anfinsen CB, Anson ML, Edsall JT, Richards FM (eds) Advances in protein chemistry. Academic Press, pp 283–437
52. Laskowski RA, MacArthur MW, Moss DS, Thornton JM (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Crystallogr 26:283. https://doi.org/10.1107/S0021889892009944
53. Zhou AQ, O’Hern CS, Regan L (2011) Revisiting the Ramachandran plot from a new angle. Protein Sci 20:1166–1171
54. Eisenberg D, Lüthy R, Bowie JU (1997) VERIFY3D: assessment of protein models with three-dimensional profiles. Methods Enzymol 277:396–404
55. Gore S, Sanz García E, Hendrickx PMS et al (2017) Validation of structures in the Protein Data Bank. Structure 25:1916–1927
56. Schrodinger LLC (2010) The PyMOL molecular graphics system Version 1:0
57. Pettersen EF, Goddard TD, Huang CC (2004) UCSF Chimera—a visualization system for exploratory research and analysis. J Comput Chem 25:1605–1612
58. Léonard S, Joseph AP, Srinivasan N, Gelly J-C, de Brevern AG (2014) mulPBA: an efficient multiple protein structure alignment method based on a structural alphabet. J Biomol Struct Dyn 32:661–668
59. Maiti R, Van Domselaar GH, Zhang H, Wishart DS (2004) SuperPose: a simple server for sophisticated structural superposition. Nucleic Acids Res 32:W590–W594
60. Jarnot P, Ziemska-Legiecka J, Dobson L et al (2020) PlaToLoCo: the first web meta-server for visualization and annotation of low complexity regions in proteins. Nucleic Acids Res 48:W77. https://doi.org/10.1093/nar/gkaa339
61. Potenza E, Di Domenico T, Walsh I, Tosatto SCE (2015) MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins. Nucleic Acids Res 43:D315–D320
62. Mészáros B, Erdos G, Dosztányi Z (2018) IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res 46:W329–W337

Chapter 6

Quality Estimates for 3D Protein Models

Ali H. A. Maghrabi, Fahd M. F. Aldowsari, and Liam J. McGuffin

Abstract

Protein structure modeling is one of the most advanced and complex processes in computational biology. One of the major problems for the protein structure prediction field has been how to estimate the accuracy of predicted 3D models, on both a local and a global level, in the absence of known structures. We must be able to accurately measure the confidence that we have in the quality of predicted 3D models of proteins for them to become widely adopted by the general bioscience community. To address this major issue, it was necessary to develop new model quality assessment (MQA) methods and integrate them into our pipelines for building 3D protein models. Our MQA method, called ModFOLD, has been ranked as one of the most accurate MQA tools in independent blind evaluations. This chapter discusses model quality assessment in the protein modeling field, demonstrating both its strengths and limitations. We also present some of the best methods according to independent benchmarking data gathered in recent years.

Key words Protein structure prediction, Model quality assessment, Estimates of model accuracy, Accuracy self-estimates, Template-based modeling, Free modeling

Sławomir Filipek (ed.), Homology Modeling: Methods and Protocols, Methods in Molecular Biology, vol. 2627, https://doi.org/10.1007/978-1-0716-2974-1_6, © Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Understanding protein function is one of the keys to understanding life at the molecular level. Each protein molecule has its own unique sequence, which consists of a linear chain of amino acids. These amino acid chains fold to form tertiary structures, which confer the protein's function. In other words, characterizing protein structures leads to a better understanding of their functions. Experimental methods such as X-ray crystallography and nuclear magnetic resonance have been considered the methods of choice for 3D structure determination. However, such methods are costly and time-consuming, and some proteins are problematic or impossible to characterize with them. Consequently, protein structure data grow slowly in comparison with the speed of sequencing genomes and their encoded proteins, which has kept increasing, especially after breakthroughs in genetic sequencing technology. As a result, a gap has grown between known protein sequences and their resolved structures, and it has been necessary to find another solution.

Computational methods, which predict the structure of proteins directly from their sequences, have become fast and effective alternatives to experimental methods. Over the past 20 years, different types of protein structure prediction methods have emerged. The most accurate type is comparative modeling, which consists of a number of steps including template recognition, alignment, quality assessment, and finally refinement. Each of these steps plays an essential part in a successful modeling pipeline, but perhaps the most critical step for the wider acceptance of 3D models of proteins has been model quality assessment. In this step, the predicted models are evaluated in terms of their likely accuracy without the need for an experimental structure. Numerous challenges have been identified, and many approaches to the quality estimation problem have been developed over the years, including the use of statistical potentials, stereochemistry checks, and machine learning techniques. Such methods have traditionally been referred to as model quality assessment (MQA or QA) methods, and they have been evaluated in successive critical assessment of structure prediction (CASP) experiments under the estimates of model accuracy (EMA) and the accuracy self-estimates (ASE) prediction categories.

2 Estimates of Model Accuracy (EMA) Are Essential for Template-Based Modeling (TBM) and Template-Free Modeling (FM)

The fact that evolutionarily related proteins have similar structures has encouraged researchers to develop methods for predicting the structure of proteins from their sequences [1]. One way of modeling a protein structure is to align the sequence to those of experimentally observed protein structures and then use those structures as templates in order to map the 3D coordinates of each aligned residue. This procedure has been termed homology modeling or comparative modeling [2]. However, structurally homologous proteins can sometimes have a very low sequence identity, and in these cases homology modeling methods fail to identify suitable template structures or produce poor alignments. This issue led to another way of determining protein structure called threading or fold recognition [3]. This modeling method does not use homologous proteins with known structures but rather uses statistical knowledge of the relationship between the structures deposited in the PDB and the target sequence. Both approaches have been improved over the years along with the integration of EMA programs, and systematic differences were noticed.


In recent years, fold recognition and homology modeling techniques have somewhat merged, with the ability to detect ever more distant evolutionary relationships using profile-profile searching methods and HMM-HMM methods, such as the popular HHpred method [4]. The general concept of modeling based on existing structures is now classified as template-based modeling (TBM), and the success of such methods relies on the availability and accurate detection of suitable templates. As the amount of detectable similarity between the target protein and template structures decreases, the accuracy of template-based techniques becomes insufficient and such methods become unreliable. In this case, another structure prediction technique, traditionally called de novo or ab initio protein structure prediction, is the only remaining option. The technique predicts the structure of proteins without the need for a template and is therefore known as template-free modeling or FM [5]. FM methods are not nearly as accurate as TBM methods when templates are available [6]. However, the concept behind such techniques is relatively simple, comprising only two elements: firstly, an algorithm that searches the space of possible protein configurations to minimize a cost function; secondly, the various restraints that make up the cost function itself, derived either from physical laws or from structural features predicted by machine learning or other types of statistical systems [7]. FM techniques have been incrementally improving and can provide us with valuable information on how novel domains may fold [8].

Regardless of whether TBM or FM approaches are used to model a protein target, a researcher will often end up with dozens, or even hundreds, of alternative models for the same protein target. The first problem they will then face is how to select the best model from among the alternatives. Once a model is selected, they will need to know how confident they can be in its overall accuracy and, more specifically, which local regions of the model can be trusted. EMA methods are critical for answering all of these questions.
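The three questions above (which model, how good overall, which regions) can be illustrated with a minimal sketch. It assumes that an EMA method has already produced a per-residue score between 0 and 1 for every model, and that the global score is simply the mean of the local scores. The function names, the example scores, and the 0.7 cutoff are hypothetical and are not taken from any particular EMA program.

```python
from statistics import mean


def global_score(per_residue_scores):
    """Global quality taken as the mean of per-residue scores in [0, 1]."""
    return mean(per_residue_scores)


def select_best_model(models):
    """Return (name, global score) of the model with the highest global score."""
    best_name = max(models, key=lambda name: global_score(models[name]))
    return best_name, global_score(models[best_name])


def low_confidence_residues(per_residue_scores, cutoff=0.5):
    """1-based positions of residues whose predicted score is below the cutoff."""
    return [i for i, s in enumerate(per_residue_scores, start=1) if s < cutoff]


# Hypothetical predicted per-residue scores for three alternative models
# of the same five-residue target (illustrative values only).
models = {
    "model_1": [0.9, 0.8, 0.7, 0.3, 0.2],
    "model_2": [0.9, 0.9, 0.8, 0.8, 0.6],
    "model_3": [0.4, 0.5, 0.3, 0.2, 0.1],
}

best, score = select_best_model(models)
suspect = low_confidence_residues(models[best], cutoff=0.7)
print(best, round(score, 2), suspect)
```

In this toy case the second model is selected, and only its last residue falls below the chosen cutoff, so that region would be treated with caution.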

3 Methods for Estimates of Model Accuracy

Traditionally, protein structure modeling has been far less trusted in terms of accuracy than deriving protein structures from experiments. Models are typically left unannotated with quality estimates and can span a broad range of the accuracy spectrum, whereas the accuracy of observed protein structures can be estimated from experiments and falls within a narrow range [2]. Therefore, a number of quality evaluation methods have been developed by modelers using techniques such as statistical potentials, molecular mechanics energy-based functions, stereochemistry checks, and machine learning in order to analyze the correctness of protein structures and models [9].

Examples of the early, simple quality checking tools include WHAT-CHECK [10], PROCHECK [11], and, more recently, MolProbity [12]. These tools use basic stereochemical checks, and they are very useful in identifying unusual geometric features in a model. However, such quality checking tools are not able to produce a single score for ranking alternative models. Other examples of early quality assessment tools that use a variety of different methods include ProSA [13] and DFIRE [14], which have been used along with VERIFY3D [15] in order to provide single scores that relate to the global quality of protein models. Machine learning-based quality assessment programs have also been used to achieve higher prediction accuracy. ProQ [16], ModFOLD [17], and QMEAN [18] are examples of early machine learning-based QA methods, which allowed developers to use various combinations of structural features and individual energy potentials in order to predict global model quality.

4 Observed Model Accuracy Scoring

In the early years of structure prediction, predicted model quality scores were evaluated by comparing the predicted models with the superposed observed structures simply by using the root-mean-square deviation (RMSD). To overcome some of the limitations of RMSD, EMA developers started to use improved similarity measures such as GDT-HA and GDT [19], MaxSub [20], and TM-score [21] (which are superposition based), and the local Distance Difference Test (lDDT) (which does not require superposition of the model and observed structures) [22]. These scores were used to measure the quality of each individual predicted model by comparing it to the observed native (experimentally solved) structure.

The term GDT stands for “global distance test” in both the GDT and GDT-HA scores. These two scores measure the similarity between two protein structures that have identical amino acid sequences but may have different tertiary structures, i.e., a predicted model and the observed crystal structure [21]. GDT-HA is the “high accuracy” variant and uses smaller cutoff distances, which makes it more stringent than GDT [23]. MaxSub is a measure that identifies in a model the largest subset of Cα atoms that superimpose over the experimental structure, producing a single normalized score that represents the quality of that model. The TM-score stands for “template modeling” score. Likewise, this measure calculates the similarity between two structures with the same sequence but different tertiary structures. The TM-score is arguably more accurate than GDT and GDT-HA in comparing the similarity of structures with full-length protein chains rather than domains [21]. Each of these measures quantifies the agreement between two protein structures (predicted versus observed) by providing a score between 0 and 1, where 1 is a perfect match between the two compared structures (i.e., identical relative atom coordinates) and 0 is a nonmatched structure [20]. The predicted and observed scores are compared using pairwise correlation; an example of this type of correlation can be seen in Fig. 1.

Fig. 1 Predicted model quality scores versus observed model quality scores. The plot compares the predicted scores for one of the top-performing individual EMA methods, ModFOLDclust2 (Mc2s), against TM-score observed scores. (The data set was collected from CASP12)

Such superposition-based scoring measures have some limitations, as they are affected by differences in the relative orientation of domains following global superposition in structures with more than one domain. This can lead to, for example, poor scores being given to correct small domains because the largest domain dominates the global rigid-body superposition. The local Distance Difference Test (lDDT) score is independent of superposition, so it does not have the same issues when scoring multiple domains with different relative orientations. A variety of observed model accuracy scores have been used over the years as target functions to train and benchmark EMA methods. In practice, the GDT score and the lDDT score have been used more recently due to their adoption as the gold standards for the CASP and CAMEO experiments, respectively [24].
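The scores described above can be made concrete with a simplified sketch. It assumes that the per-residue Cα distances between model and observed structure are already available from a single fixed superposition, whereas the real GDT and TM-score programs search over many superpositions to maximize the score. The cutoffs (1, 2, 4, 8 Å for GDT_TS; 0.5, 1, 2, 4 Å for GDT_HA) and the TM-score distance scale d0 are the commonly quoted definitions; the lower bound of 0.5 Å on d0 for short chains, and the Cα-only lDDT with a 15 Å inclusion radius, are simplifying assumptions made here. All function names are hypothetical.

```python
import numpy as np


def gdt(distances, cutoffs):
    """Mean fraction of residues whose distance (in Å) is within each cutoff."""
    d = np.asarray(distances, dtype=float)
    return float(np.mean([(d <= c).mean() for c in cutoffs]))


def gdt_ts(distances):
    return gdt(distances, (1.0, 2.0, 4.0, 8.0))


def gdt_ha(distances):
    # "High accuracy" variant: the same test with halved cutoffs.
    return gdt(distances, (0.5, 1.0, 2.0, 4.0))


def tm_score(distances, target_length):
    """TM-score for one fixed superposition (no search over superpositions)."""
    d = np.asarray(distances, dtype=float)
    # Length-dependent distance scale; clamped for short chains (assumption).
    d0 = max(1.24 * np.cbrt(max(target_length - 15, 0)) - 1.8, 0.5)
    return float(np.sum(1.0 / (1.0 + (d / d0) ** 2)) / target_length)


def lddt_ca(model_xyz, ref_xyz, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified Cα-only lDDT: fraction of reference distances preserved."""
    model = np.asarray(model_xyz, dtype=float)
    ref = np.asarray(ref_xyz, dtype=float)
    ref_d = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)
    mod_d = np.linalg.norm(model[:, None] - model[None, :], axis=-1)
    i, j = np.triu_indices(len(ref), k=1)      # each residue pair once
    keep = ref_d[i, j] < radius                # only local reference contacts
    diff = np.abs(ref_d[i, j] - mod_d[i, j])[keep]
    if diff.size == 0:
        return 0.0
    return float(np.mean([(diff < t).mean() for t in thresholds]))


# Hypothetical Cα-Cα distances (Å) after superposition, for an 8-residue model.
dist = [0.4, 0.9, 1.5, 3.0, 6.0, 9.0, 0.2, 0.7]
print(gdt_ts(dist), gdt_ha(dist), round(tm_score(dist, len(dist)), 3))

# A rigidly shifted copy is 10 Å away residue by residue if it is not
# re-superposed, yet its internal distances are unchanged.
ref = np.array([[3.8 * k, 0.0, 0.0] for k in range(6)])
shifted = ref + np.array([10.0, 0.0, 0.0])
print(lddt_ca(shifted, ref))
```

The last two lines illustrate the point about superposition: because the simplified lDDT compares only internal distances, a rigidly displaced but otherwise identical fragment is still scored as correct.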

5 EMA Classification

The field of computational protein structure prediction is evolving constantly, following the increase in computational power of machines and the development of intelligent algorithms. Despite the rapid development of methods and the fusion of applied approaches, a broad classification and categorization of these methods can be made. Numerous methods have been developed over the years in an attempt to provide users with scores that will give them confidence in their 3D models and allow them to identify any potentially suspect regions. As previously mentioned, the model quality assessment field has its roots in early structure validation tools [10, 11, 25]. While such tools can be used to perform basic stereochemical checks and identify unusual geometric features in a model, they are not able to produce a single global score that can be used for ranking alternative models, nor can they be relied upon for discriminating good models from bad (often bad models will still have good stereochemistry).

Modern methods for EMA can be classified into three broad categories in terms of input: pure-single-model methods [14, 15, 17, 18, 25–28], which consider only information within an individual model; clustering/consensus approaches [29–33], which can only be used if you have multiple alternative models built for the same protein target; and quasi-single-model methods [34, 35], which can score an individual model against a pool of alternative models generated from the target sequence. Each approach has its advantages and disadvantages. Clustering methods have been far more accurate than pure-single-model methods, but they are more computationally intensive and do not work when very few similar models are available, which is often the case in real-life research scenarios. Pure-single-model methods are less accurate overall, but they are more rapid, they produce consistent scores for single or few models at a time, and they often perform better at model ranking and selection. Quasi-single-model methods attempt to provide comparable accuracy to clustering methods while addressing the real-life needs of researchers with few or single models.

Moreover, there are several other factors by which EMA methods can be categorized, such as the predicted property, target function, machine learning method, and other features. Table 1 contains a list of some of the most popular programs and servers for EMA, which have been independently evaluated in the CASP [36] and CAMEO [37] experiments.
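The difference between the clustering and quasi-single-model categories can be sketched in a few lines. The example assumes that all models are Cα traces of equal length already placed in a common frame, and it uses a hypothetical similarity function with an arbitrary 3 Å distance scale; real methods rely on structural alignment scores instead. None of the names below belong to an existing program.

```python
import numpy as np


def pairwise_similarity(a, b, d0=3.0):
    """Hypothetical similarity in (0, 1] between two pre-superposed Cα traces."""
    d = np.linalg.norm(np.asarray(a, float) - np.asarray(b, float), axis=1)
    return float(np.mean(1.0 / (1.0 + (d / d0) ** 2)))


def consensus_scores(models):
    """Clustering-style score: mean similarity of each model to all the others."""
    names = list(models)
    if len(names) < 2:
        raise ValueError("consensus scoring needs at least two models")
    return {
        n: float(np.mean([pairwise_similarity(models[n], models[m])
                          for m in names if m != n]))
        for n in names
    }


def quasi_single_score(model, reference_models):
    """Quasi-single-model style: score one model against a reference pool."""
    return float(np.mean([pairwise_similarity(model, r) for r in reference_models]))


# Toy pool: two similar models and one outlier (illustrative coordinates only).
base = np.array([[3.8 * k, 0.0, 0.0] for k in range(5)])
pool = {
    "m1": base,
    "m2": base + np.array([0.0, 0.5, 0.0]),
    "m3": base + np.array([0.0, 6.0, 0.0]),
}
scores = consensus_scores(pool)
print({name: round(value, 3) for name, value in scores.items()})
print(round(quasi_single_score(pool["m3"], [pool["m1"], pool["m2"]]), 3))
```

The consensus function refuses to score a single model, which mirrors the limitation noted above. The quasi-single-model function avoids it by comparing the submitted model with a separately generated reference pool.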


Table 1 Examples of different EMA methods used in CASP13

| Method | Local/global | Inputs | Structure features | Predicted features | Target function | Machine learning method |
|---|---|---|---|---|---|---|
| FaeNNz [38] | Local (global is avg. local) | Model and full-length target sequence | Statistical potentials of mean force + distance constraints from templates + solvent acc. | Sec. str and surface area | LDDT (local) | Multilayer perceptron |
| ModFOLD7 [39] | Local (global is sum of local) | Model and full-length target sequence | Pairwise comparisons of generated reference models, residue contacts | Contacts, sec. str and disorder | S-score (local) | Multilayer perceptron |
| ProQ3 [26] | Local (global is sum of local) | Profile + model + predictions + energies | ProQ2 + energy terms | Sec. str and surface area | S-score (local) | Linear SVM |
| VoroMQA-A [40] | Local and global | Model | Voronoi tessellation-based contact areas | Not used | Not used | Statistical potential |
| MULTICOM-CLUSTER [41] | Global | Model and full-length sequence | Secondary structure, solvent accessibility, residue contacts | Contacts, sec. str, surface area, and structural scores | GDT_TS (global) | Deep network + ensemble |

The methods have been chosen randomly taking into consideration the differences between them with regard to their measuring method (local/global), inputs, structure features, predicted features, target function, and machine learning method

6 ModFOLD: A Leading EMA Web Server

One of the leading EMA methods is ModFOLD, which has been developed by Prof. Liam McGuffin and colleagues [17]. Since its inception, ModFOLD has been continuously improved, going through many upgrades until its latest version, ModFOLD8 [42].

6.1 ModFOLD History

In the 2 years following CASP7, the performance of protein structure QA servers was observed to increase considerably. Model quality assessment programs, or MQAPs, have become the cornerstone of many protein structure modeling methods. More than a dozen papers were published in the area of QA between CASP7 and CASP8, and 45 methods were submitted for evaluation to CASP8 in that category.

6.1.1 The Initial Construction of ModFOLD

ModFOLD is a machine learning-based QA program which was developed at the University of Reading by the McGuffin group. The original ModFOLD method was based on the nFOLD protocol [43], which was a combination of the new GenTHREADER protocol [44] and a number of extra inputs into the underlying neural network, including the SSEA score [45], a new functional site detection score (MetSite) [46], and a simple model quality checking algorithm, MODCHECK [44]. Initially, ModFOLD was developed in two editions: ModFOLD, designed to be fast and used for the global assessment of either single or multiple models, and ModFOLDclust, a more intensive method that carries out clustering of multiple models and provides a per-residue local quality assessment. ModFOLDclust was shown to significantly outperform all of its clustering/multiple-model MQAP competitors, while ModFOLD competed well against some of the best “true” single-model MQAP methods [17]. Since CASP ranking relies on prediction accuracy regardless of the method used, clustering- or consensus-based MQAPs were ranked as the most accurate methods for predicting 3D model quality, outperforming the single-model methods.

6.1.2 ModFOLDclustQ for Speed, Accuracy, and Consistency

Despite their accuracy, the clustering methods were found to lack a number of advantages of the single-model-based methods. One missing feature was speed. Like Pcons and other consensus-based approaches, ModFOLDclust carries out pairwise comparisons of numerous models by using multiple structural alignments, which often makes it CPU intensive [28]. Another difficulty found in QA programs including ModFOLDclust was the requirement for a large pool of diverse models; smaller numbers of models can reduce the accuracy. To overcome such problems, McGuffin and Roche designed an upgraded version of the same method, called ModFOLDclustQ [33]. The “Q” in the name refers to the Q-score used by the method and also stands for “Quick.” The Q-score is derived from the Q measure that was developed by the Wolynes group [47]. The Q-score efficiently estimates the structural relationship between two proteins based on their residue distances. It was suggested by the CASP8 assessors as an alternative to other scoring methods such as GDT-TS [48]. By incorporating the Q-score, ModFOLDclustQ was shown to compete with the leading consensus MQAPs. Furthermore, taking the mean of the ModFOLDclustQ score and the older ModFOLDclust score gave a significant increase in prediction accuracy with little computational overhead. That led McGuffin and Roche to combine both scoring methods to form a new method named ModFOLDclust2 [35].

A number of other MQAPs also used the Q-score to assess the per-residue accuracy of each individual residue in a model. A successful per-residue consensus-based method was Pcons, which was then superseded by one of the leading single-model per-residue programs, known as ProQ [49]. That method was then upgraded with new structural and predicted features to give ProQ2 [50], which became the second top-ranking MQAP.

Beyond the improvement in quality assessment speed and accuracy obtained by upgrading ModFOLDclust to ModFOLDclustQ and combining their scores, McGuffin also noticed the potential of using ModFOLDclust2 to guide 3D modeling using multiple templates. In the process of modeling, using more than one fold template is helpful in building more accurate models. However, it was noticed that such a technique is not preferable in many cases, as it may result in poorer model quality. Besides the speed and the accuracy of an MQAP, there has to be consistency as well. To solve this problem, McGuffin and colleagues started to investigate the use of the local as well as the global model quality prediction scores that are produced by ModFOLDclust2. This led to improvements in the selection of target-template alignments for the construction of multiple-template models. The investigation found that the most accurate and consistent way of improving models is to use accurate local model quality scores to guide alignment selection while using accurate global model quality scores to re-rank alignments before selection. Applying this technique made significant performance improvements to the IntFOLD tertiary structure prediction server [51].

6.1.3 The Quasi-Single-Model Approach

Another important feature that was missing in the clustering-based approaches was addressing the real-life needs of protein researchers, who often have only a single model or a few models available for each protein target. In such cases, clustering methods perform poorly. McGuffin’s research group was aware of this problem and found a way to address it. Instead of directly clustering the submitted model(s), a tertiary structure prediction method [52] was used as the first stage of the quality assessment procedure in order to generate an initial reference set of template-based models. The user-submitted model(s) are then pooled with the generated models and clustered using ModFOLDclust2 as the second stage of the process. With this algorithm, if the server receives multiple models, the procedure follows the full clustering approach, whereas if only a single model or a few models are submitted, the pipeline is diverted to the so-called quasi-single-model approach, which operates with comparable accuracy. This method was implemented initially in ModFOLD v3.0, a server developed using ModFOLDclust2 integrated with the IntFOLD-QA tertiary structure prediction pipeline [33]. The algorithm has since been independently tested and published as the fourth version of the ModFOLD server [34].

CASP assessments of QA methods were more concerned with the quality scoring results than with other practical considerations, such as accessibility to researchers, until the assessment was updated following the eighth and ninth rounds of the experiment [53] (details about CASP in Subheading 7). In CASP10, the criteria were modified to rebalance the quality assessment. This modification was implemented by using smaller bespoke data sets rather than allowing large sets of models, which some participants argued had unfairly favored clustering approaches in previous CASPs. Despite this change of criteria in CASP10, ModFOLD4 was ranked among the top-performing methods in the quality assessment category. ModFOLD4 also provided a free service for accurate prediction of global and local quality of 3D protein models. The server had a comparable performance to clustering-based methods but retained the capability of making predictions for a single model at a time [34].

6.2 Latest Versions of ModFOLD

In 2015, the fifth version of ModFOLD was released. This version was integrated with the upgraded IntFOLD3-TS tertiary structure prediction pipeline, which gave ModFOLD5 the ability to generate a greater number and variety of reference models [54]. In 2017, ModFOLD was upgraded to its sixth version with a new neural network-based quasi-single-model method that took as its input a sliding window of per-residue scores from six different pure-single and quasi-single scoring methods and produced a single quality score for each residue in the model [55]. ModFOLD6 was independently evaluated during the CASP12 experiment, and it is freely available at https://www.reading.ac.uk/bioinf/ModFOLD/ModFOLD6_form.html (Fig. 2). During the past 2 years, ModFOLD was further improved and upgraded to the seventh and eighth versions, which were tested in CASP13 and CASP14, respectively. More details about the ModFOLD server interface and its inputs and outputs can be found in Maghrabi and McGuffin 2017 [55].
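The sliding-window input described for ModFOLD6 can be pictured with a short sketch. It shows one way that per-residue scores from several methods could be gathered into a fixed-width feature vector for each residue. The window width of 5, the zero padding at the chain ends, and the function name are assumptions made for illustration; they are not the published ModFOLD6 settings.

```python
import numpy as np


def sliding_window_features(scores, window=5):
    """Turn (n_residues, n_methods) scores into (n_residues, window * n_methods).

    Each output row holds the scores of the residue itself and of its
    neighbours on both sides; positions beyond the chain ends are zero padded.
    """
    scores = np.asarray(scores, dtype=float)
    if window % 2 == 0:
        raise ValueError("window must be odd so that it is centred on a residue")
    half = window // 2
    padded = np.pad(scores, ((half, half), (0, 0)), mode="constant",
                    constant_values=0.0)
    return np.stack([padded[i:i + window].ravel()
                     for i in range(scores.shape[0])])


# Hypothetical scores: 4 residues scored by 6 different methods.
scores = np.arange(24, dtype=float).reshape(4, 6)
features = sliding_window_features(scores, window=5)
print(features.shape)
```

Each row of the resulting matrix would be presented to the network, which returns one quality score for the central residue of the window.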

Fig. 2 ModFOLD6 server results for models submitted to CASP12 generated for target T0859 (PDB ID: 5jzr). (a) An example of the graphical output from the server showing the main results page with a summary of the results from each method (truncated here to fit page). Clicking on the thumbnail images in the main table allows results to be visualized in more detail. (b) A histogram of the local or per-residue errors for the top-ranked model, with the residue number on the x-axis and the predicted residue error (distance of the Cα atom from the native structure in Å) on the y-axis, which may be downloaded. (c) Interactive views of models, which can be manipulated in 3D using the JSmol/HTML5 framework and/or downloaded for local viewing. (Adapted from Maghrabi and McGuffin [55])

7 EMA in Community-Wide Experiments

EMA and several other modeling techniques have been developed and used over recent decades in order to address the protein sequence-structure gap. The methods and servers are evaluated as a category by two major worldwide organizations that specialize in the protein structure prediction field. The first organization conducts independent blind testing with the Critical Assessment of Techniques for Protein Structure Prediction (CASP) [36] experiments, which are held every other year. The second is the Continuous Automated Model Evaluation project, called CAMEO [37]. Both organizations have highlighted the importance of EMA development for the improvement of protein structure prediction and have helped to encourage progress in the field.

The importance and far-reaching implications of being able to predict protein structures from their amino acid sequences are demonstrated by the ongoing biennial CASP experiment. CASP is a global community-wide experiment that has taken place every other year since 1994 [56]. Protein structure modelers in more than a hundred research centers around the world dedicate their late spring and summer to preparing their methods to be independently tested in this experiment. CASP is designed as a blind prediction experiment (Fig. 3). A set of protein sequences is selected by the assessors in order to test the performance of the methods in predicting their structures, which have been determined experimentally but are withheld by the organizers, in an attempt to help advance these protein prediction methods. In the first CASP, the experiment was quite basic, consisting of just three parts: collecting protein targets (which will subsequently be solved experimentally), collecting tertiary structure predictions, and assessing and discussing the results [56].

Fig. 3 EMA ranking section in the CASP community web page. Results from CASP13 showing the top-ranking EMA methods based on stage 1, which consists of 20 models; the scores were ranked against the observed scores from GDT_TS. https://www.predictioncenter.org/

The CASP experiment has since become popular, and its participants and prediction categories have been growing over the years. CASP takes the form of an international competition, which can be thought of as the world championships for protein structure prediction. Fourteen CASP experiments have been performed during the last 25 years, with the last one completed in late 2020. The competition has evolved over the years and is now divided into slightly more complicated subcategories, including the following: tertiary structure prediction; disorder prediction; contact prediction; model quality assessment or QA, which is also called estimates of model accuracy (EMA); binding site prediction; protein-protein interactions; oligomerization state; and protein model refinement [57]. Each category represents an important part of the structure prediction process that needs further improvement in terms of the predictive power of the underlying algorithms. An aim of CASP is to drive new developments, which will lead to higher levels of accuracy and consistency in producing models that are closer in quality to the experimentally derived protein structures.


Fig. 4 CAMEO continuous benchmarking for EMA servers. A 6-month result of the top EMA methods being benchmarked continuously in CAMEO servers. https://www.cameo3d.org/

Another evaluation resource for EMA methods is the CAMEO project, where the methods are continuously and automatically evaluated each week, with tables and plots produced that show whether there are any significant differences between competitors (Fig. 4). Every week, CAMEO publishes benchmarking results based on models collected during a 4-day prediction window, assessing an average of a hundred targets over time frames of 1 week, 1 month, 3 months, 6 months, and 1 year. The server benchmarks the most popular and top-ranked protein prediction methods and, separately, EMA methods. The benchmarking data are generated consistently for all participants at the same time, enabling them to benchmark and cross-validate the performance of their methods. CAMEO sends weekly emails with submission statistics and low performance warnings in order to facilitate server development and promote shorter release cycles. This server has become a complement to CASP for many participants and has helped them when preparing their methods for upcoming community experiments [58, 59].

8 Recent Advances in EMA Methods

Most recent breakthroughs have arisen with the onset of deep learning. New approaches built using artificial intelligence (AI) have greatly accelerated progress in the structure prediction field. A method called AlphaFold [60], developed by the AI company DeepMind, has shown significant progress in generating 3D models of proteins in the worldwide protein prediction competition, CASP. The method was ranked first among the teams that entered the protein modeling competition. The reason behind this success lies in the integration of deep neural networks (DNNs), systems of neural layers trained to accurately predict the distances between residues, which in turn allow highly accurate structures to be generated. This success has drawn the attention of structural biologists, encouraging them to study the field in depth [60].

Unlike standard neural networks (NNs), DNNs have multiple layers, which give them the ability to process more complicated problems [61]. Taking visual pattern recognition as an example, the neurons in the first layer of a DNN could recognize edges, and the neurons in the second layer would then learn to recognize shapes such as triangles or rectangles, which are built up from the edges already learnt in the first layer. The third layer could then recognize more complex static shapes, the fourth could learn animated shapes, and so on. This is reminiscent of how children start to learn the basic shapes around them: their brains contain multiple layers of neurons, which give them a compelling advantage in starting to learn complex patterns. We can expect that having more hidden layers would make our networks more powerful. However, changing from a single layer to a multiple-layered neural network also leads to more complex intermediate layers, which can hold multiple levels of abstraction [62]. DNNs can tackle more advanced problems using several techniques and architectures; among these, the multilayer perceptron (MLP) has been the feedforward neural network class of choice, as it maps a set of inputs through hidden layers and sends the calculated data to an output unit [63].
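The feedforward mapping described above can be illustrated with a minimal sketch. It assumes sigmoid activations, a single hidden layer, and hand-picked weights, none of which come from the chapter; the function names and the input values are hypothetical and are not taken from any EMA method.

```python
import math


def sigmoid(value):
    """Logistic activation squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def mlp_forward(inputs, layers):
    """Feed one input vector through successive (weights, biases) layers.

    Each layer is a pair: a list of weight rows (one row per neuron) and a
    list of biases (one per neuron). The output of one layer becomes the
    input of the next, which is what makes the network feedforward.
    """
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + bias)
            for row, bias in zip(weights, biases)
        ]
    return activations


# Hypothetical, hand-picked weights: 3 inputs -> 2 hidden neurons -> 1 output.
hidden_layer = ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1])
output_layer = ([[1.0, -1.0]], [0.0])

example_inputs = [0.9, 0.4, 0.7]  # made-up feature values for illustration
predicted = mlp_forward(example_inputs, [hidden_layer, output_layer])
print(predicted)
```

In a trained network the weights and biases would be fitted to data rather than chosen by hand; the sketch only shows how inputs pass through a hidden layer to a single output unit.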
MLP networks are considered a powerful technique in a large number of applications across different fields of research. Their benefits come from their suitability for most problems involving function approximation, pattern classification, process control, and time series forecasting [64]. MLPs have been used in many successful EMA methods (Table 1), and they have grown in complexity to accommodate the growth in input data. Recent studies have shown that, up until CASP14, there has been a small but significant improvement in EMA methods. It was noted that many of the improved methods have used deep learning, but in various ways. However, the indications for such an implementation remain vague and are still under evaluation. The best way to use machine learning for EMA has not yet been established, and there is plenty of room for developers to make improvements in this area. We notice that on average the best EMA methods select models that are better than those provided by the best individual TBM- or FM-based server. However, further significant improvements could still be achieved if it were possible to always select the best model for each target. Finally, we do notice systematic differences when using different model evaluation methods. Single-model methods perform relatively better when using local evaluation methods and appear better at ranking higher-quality models [65].

The assessment of the quality of predicted protein structures has been improving continuously over the last decades. Various types of methods were developed for different tasks in the estimates of model accuracy sector, and the most successful ones were the pure-single, quasi-single, and clustering methods, which have shown significant results in controlling the prediction quality in CASP and CAMEO. Recently, around 50 EMA methods participated in CASP13, an increase in number compared with the previous season. One recent concern on which EMA development has focused is the greater number of FM targets for which high-quality models were generated by the TS servers. Another concern is that the consensus among high-quality models was higher on average than ever before [66]. There are also some challenges that need to be overcome, such as improving the way EMA methods are trained and integrating deep learning tools to check predictions more accurately.

References

1. Kaczanowski S, Zielenkiewicz P (2010) Why similar protein sequences encode similar three-dimensional structures? Theor Chem Accounts 125:643–650. https://doi.org/10.1007/s00214-009-0656-3
2. Martí-Renom MA, Stuart AC, Fiser A et al (2000) Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325. https://doi.org/10.1146/annurev.biophys.29.1.291
3. Rost B, Schneider R, Sander C (1997) Protein fold recognition by prediction-based threading. J Mol Biol 270:471–480. https://doi.org/10.1006/jmbi.1997.1101
4. Hildebrand A, Remmert M, Biegert A, Söding J (2009) Fast and accurate automatic structure prediction with HHpred. Proteins 77(Suppl 9):128–132. https://doi.org/10.1002/prot.22499
5. Jothi A (2012) Principles, challenges and advances in ab initio protein structure prediction. Protein Pept Lett 19:1194–1204
6. Moult J, Fidelis K, Rost B et al (2005) Critical assessment of methods of protein structure prediction (CASP)–round 6. Proteins 61(Suppl 7):3–7. https://doi.org/10.1002/prot.20716
7. Larrañaga P, Calvo B, Santana R et al (2006) Machine learning in bioinformatics. Brief Bioinform 7:86–112
8. Dhingra S, Sowdhamini R, Cadet F, Offmann B (2020) A glance into the evolution of template-free protein structure prediction methodologies. Biochimie 175:85–92. https://doi.org/10.1016/j.biochi.2020.04.026
9. Kryshtafovych A, Fidelis K (2009) Protein structure prediction and model quality assessment. Drug Discov Today 14:386–393. https://doi.org/10.1016/j.drudis.2008.11.010
10. Hooft RWW, Vriend G, Sander C, Abola EE (1996) Errors in protein structures. Nature 381:272–272. https://doi.org/10.1038/381272a0
11. Laskowski RA, Rullmann JAC, MacArthur MW et al (1996) AQUA and PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR. J Biomol NMR 8:477–486. https://doi.org/10.1007/BF00228148
12. Lovell SC, Davis IW, Arendall WB et al (2003) Structure validation by Calpha geometry: phi, psi and Cbeta deviation. Proteins 50:437–450. https://doi.org/10.1002/prot.10286
13. Sippl MJ (1993) Recognition of errors in three-dimensional structures of proteins. Proteins Struct Funct Bioinform 17:355–362. https://doi.org/10.1002/prot.340170404
14. Zhou H, Zhou Y (2002) Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci Publ Protein Soc 11:2714–2726. https://doi.org/10.1110/ps.0217002
15. Eisenberg D, Lüthy R, Bowie JU (1997) VERIFY3D: assessment of protein models with three-dimensional profiles. Methods Enzymol 277:396–404
16. Wallner B, Elofsson A (2003) Can correct protein models be identified? Protein Sci Publ Protein Soc 12:1073–1086. https://doi.org/10.1110/ps.0236803
17. McGuffin LJ (2007) Benchmarking consensus model quality assessment for protein fold recognition. BMC Bioinform 8:345. https://doi.org/10.1186/1471-2105-8-345
18. Benkert P, Tosatto SCE, Schomburg D (2008) QMEAN: a comprehensive scoring function for model quality assessment. Proteins Struct Funct Bioinform 71:261–277. https://doi.org/10.1002/prot.21715
19. Zemla A, Venclovas Č, Moult J, Fidelis K (1999) Processing and analysis of CASP3 protein structure predictions. Proteins Struct Funct Bioinform 37:22–29. https://doi.org/10.1002/(SICI)1097-0134(1999)37:3 +3.0.CO;2-W
20. Siew N, Elofsson A, Rychlewski L, Fischer D (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics (Oxford, England) 16:776–785. https://doi.org/10.1093/bioinformatics/16.9.776
21. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins Struct Funct Bioinform 57:702–710. https://doi.org/10.1002/prot.20264
22. Mariani V, Biasini M, Barbato A, Schwede T (2013) lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics (Oxford, England) 29:2722–2728. https://doi.org/10.1093/bioinformatics/btt473

23. Read RJ, Chavali G (2007) Assessment of CASP7 predictions in the high accuracy template-based modeling category. Proteins Struct Funct Bioinform 69:27–37. https://doi.org/10.1002/prot.21662
24. Huang YJ, Mao B, Aramini JM, Montelione GT (2014) Assessment of template-based protein structure predictions in CASP10. Proteins Struct Funct Bioinform 82:43–56. https://doi.org/10.1002/prot.24488
25. Wiederstein M, Sippl MJ (2007) ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Res 35:W407–W410. https://doi.org/10.1093/nar/gkm290
26. Uziela K, Wallner B (2016) ProQ2: estimation of model accuracy implemented in Rosetta. Bioinformatics (Oxford, England) 32:1411–1413. https://doi.org/10.1093/bioinformatics/btv767
27. Uziela K, Menéndez Hurtado D, Shu N et al (2017) ProQ3D: improved model quality assessments using deep learning. Bioinformatics (Oxford, England) 33:1578–1580. https://doi.org/10.1093/bioinformatics/btw819
28. McGuffin LJ (2008) The ModFOLD server for the quality assessment of protein structural models. Bioinformatics (Oxford, England) 24:586–587. https://doi.org/10.1093/bioinformatics/btn014
29. McGuffin LJ (2009) Prediction of global and local model quality in CASP8 using the ModFOLD server. Proteins 77(Suppl 9):185–190. https://doi.org/10.1002/prot.22491
30. Larsson P, Skwark MJ, Wallner B, Elofsson A (2009) Assessment of global and local model quality in CASP8 using Pcons and ProQ. Proteins 77(Suppl 9):167–172. https://doi.org/10.1002/prot.22476
31. Benkert P, Tosatto SCE, Schwede T (2009) Global and local model quality estimation at CASP8 using the scoring functions QMEAN and QMEANclust. Proteins 77(Suppl 9):173–180. https://doi.org/10.1002/prot.22532
32. Cheng J, Wang Z, Tegge AN, Eickholt J (2009) Prediction of global and local quality of CASP8 models by MULTICOM series. Proteins 77(Suppl 9):181–184. https://doi.org/10.1002/prot.22487
33. McGuffin LJ, Roche DB (2010) Rapid model quality assessment for protein structure predictions using the comparison of multiple models without structural alignments. Bioinformatics (Oxford, England) 26:182–188. https://doi.org/10.1093/bioinformatics/btp629

34. McGuffin LJ, Buenavista MT, Roche DB (2013) The ModFOLD4 server for the quality assessment of 3D protein models. Nucleic Acids Res 41:W368–W372. https://doi.org/10.1093/nar/gkt294
35. Roche DB, Buenavista MT, McGuffin LJ (2014) Assessing the quality of modelled 3D protein structures using the ModFOLD server. Methods Mol Biol (Clifton, NJ) 1137:83–103. https://doi.org/10.1007/978-1-4939-0366-5_7
36. Kryshtafovych A, Schwede T, Topf M et al (2019) Critical assessment of methods of protein structure prediction (CASP)—Round XIII. Proteins Struct Funct Bioinform 87:1011–1020. https://doi.org/10.1002/prot.25823
37. Haas J, Barbato A, Behringer D et al (2018) Continuous Automated Model EvaluatiOn (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins 86(Suppl 1):387–398. https://doi.org/10.1002/prot.25431
38. Studer G, Rempfer C, Waterhouse AM et al (2020) QMEANDisCo—distance constraints applied on model quality estimation. Bioinformatics 36:1765–1771. https://doi.org/10.1093/bioinformatics/btz828
39. Maghrabi AHA, McGuffin LJ (2020) Estimating the quality of 3D protein models using the ModFOLD7 server. In: Kihara D (ed) Protein structure prediction. Springer US, New York, pp 69–81
40. Olechnovič K, Venclovas Č (2019) VoroMQA web server for assessing three-dimensional structures of proteins and protein complexes. Nucleic Acids Res 47:W437–W442. https://doi.org/10.1093/nar/gkz367
41. Wang Z, Eickholt J, Cheng J (2010) MULTICOM: a multi-level combination approach to protein structure prediction and its assessments in CASP8. Bioinformatics (Oxford, England) 26:882–888. https://doi.org/10.1093/bioinformatics/btq058
42. McGuffin LJ, Aldowsari FMF, Alharbi SMA, Adiyaman R (2021) ModFOLD8: accurate global and local quality estimates for 3D protein models. Nucleic Acids Res 49:W425–W430. https://doi.org/10.1093/nar/gkab321
43. Jones DT, Bryson K, Coleman A et al (2005) Prediction of novel and analogous folds using fragment assembly and fold recognition. Proteins 61(Suppl 7):143–151. https://doi.org/10.1002/prot.20731
44. Jones DT, McGuffin LJ (2003) Assembling novel protein folds from super-secondary structural fragments. Proteins Struct Funct Bioinform 53:480–485. https://doi.org/10.1002/prot.10542
45. McGuffin LJ, Jones DT (2003) Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics 19:874–881. https://doi.org/10.1093/bioinformatics/btg097
46. Sodhi JS, Bryson K, McGuffin LJ et al (2004) Predicting metal-binding site residues in low-resolution structural models. J Mol Biol 342:307–320. https://doi.org/10.1016/j.jmb.2004.07.019
47. Eastwood MP, Hardin C, Luthey-Schulten Z, Wolynes PG (2001) Evaluating protein structure-prediction schemes using energy landscape theory. IBM J Res Dev 45:475–497. https://doi.org/10.1147/rd.453.0475
48. Ben-David M, Noivirt-Brik O, Paz A et al (2009) Assessment of CASP8 structure predictions for template-free targets. Proteins 77(Suppl 9):50–65. https://doi.org/10.1002/prot.22591
49. Wallner B, Elofsson A (2003) Can correct protein models be identified? Protein Sci 12:1073–1086. https://doi.org/10.1110/ps.0236803
50. Wallner B, Elofsson A (2007) Prediction of global and local model quality in CASP7 using Pcons and ProQ. Proteins 69(Suppl 8):184–193. https://doi.org/10.1002/prot.21774
51. Buenavista MT, Roche DB, McGuffin LJ (2012) Improvement of 3D protein models using multiple templates guided by single-template model quality assessment. Bioinformatics (Oxford, England) 28:1851–1857. https://doi.org/10.1093/bioinformatics/bts292
52. Roche DB, Buenavista MT, Tetchner SJ, McGuffin LJ (2011) The IntFOLD server: an integrated web resource for protein fold recognition, 3D model quality assessment, intrinsic disorder prediction, domain prediction and ligand binding site prediction. Nucleic Acids Res 39:W171–W176. https://doi.org/10.1093/nar/gkr184
53. Kryshtafovych A, Fidelis K, Tramontano A (2011) Evaluation of model quality predictions in CASP9. Proteins 79(Suppl 10):91–106. https://doi.org/10.1002/prot.23180
54. McGuffin LJ, Atkins JD, Salehe BR et al (2015) IntFOLD: an integrated server for modelling protein structures and functions from amino acid sequences. Nucleic Acids Res 43:W169–W173. https://doi.org/10.1093/nar/gkv236

55. Maghrabi AHA, McGuffin LJ (2017) ModFOLD6: an accurate web server for the global and local quality estimation of 3D protein models. Nucleic Acids Res 45:W416–W421. https://doi.org/10.1093/nar/gkx332
56. Moult J, Pedersen JT, Judson R, Fidelis K (1995) A large-scale experiment to assess protein structure prediction methods. Proteins 23:ii–v. https://doi.org/10.1002/prot.340230303
57. Roche DB, McGuffin LJ (2016) In silico identification and characterization of protein-ligand binding sites. In: Computational design of ligand binding proteins. Springer, pp 1–21
58. Yang J, Anishchenko I, Park H et al (2020) Improved protein structure prediction using predicted inter-residue orientations. Proc Natl Acad Sci 117:1496–1503. https://doi.org/10.1073/pnas.1914677117
59. McGuffin LJ, Adiyaman R, Maghrabi AHA et al (2019) IntFOLD: an integrated web resource for high performance protein structure and function prediction. Nucleic Acids Res 47:W408–W413. https://doi.org/10.1093/nar/gkz322
60. Senior AW, Evans R, Jumper J et al (2020) Improved protein structure prediction using potentials from deep learning. Nature 577:706–710. https://doi.org/10.1038/s41586-019-1923-7
61. Toth G, Lent CS, Tougaw PD et al (1996) Quantum cellular neural networks. Superlattice Microst 20:473–478. https://doi.org/10.1006/spmi.1996.0104
62. Ba J, Caruana R (2014) Do Deep Nets Really Need to be Deep? In: Ghahramani Z, Welling M, Cortes C et al (eds) Advances in neural information processing systems 27. Curran Associates, Inc, pp 2654–2662
63. Orbach J (1962) Principles of neurodynamics. Perceptrons and the theory of brain mechanisms. Arch Gen Psychiatry 7:218–219. https://doi.org/10.1001/archpsyc.1962.01720030064010
64. Efendigil T, Önüt S, Kahraman C (2009) A decision support system for demand forecasting with artificial neural networks and neuro-fuzzy models: a comparative analysis. Expert Syst Appl 36:6697–6707. https://doi.org/10.1016/j.eswa.2008.08.058
65. Cheng J, Choe M-H, Elofsson A et al (2019) Estimation of model accuracy in CASP13. Proteins Struct Funct Bioinform 87:1361
66. Won J, Baek M, Monastyrskyy B et al (2019) Assessment of protein model structure accuracy estimation in CASP13: challenges in the era of deep learning. Proteins Struct Funct Bioinform 87:1351–1360. https://doi.org/10.1002/prot.25804

Chapter 7

Using Local Protein Model Quality Estimates to Guide a Molecular Dynamics-Based Refinement Strategy

Recep Adiyaman and Liam J. McGuffin

Abstract

The refinement of predicted 3D models aims to bring them closer to the native structure by fixing errors including unusual bonds and torsion angles and irregular hydrogen bonding patterns. Refinement approaches based on molecular dynamics (MD) simulations using different types of restraints have performed well since CASP10. ReFOLD, developed by the McGuffin group, was one of the many MD-based refinement approaches, which were tested in CASP12. When the performance of the ReFOLD method in CASP12 was evaluated, it was observed that ReFOLD suffered from the absence of a reliable guidance mechanism to reach consistent improvement for the quality of predicted 3D models, particularly in the case of template-based modelling (TBM) targets. Therefore, here we propose to utilize the local quality assessment score produced by ModFOLD6 to guide the MD-based refinement approach to further increase the accuracy of the predicted 3D models. The relative performance of the new local quality assessment guided MD-based refinement protocol and the original MD-based protocol ReFOLD are compared utilizing many different official scoring methods. By using the per-residue accuracy (or local quality) score to guide the refinement process, we are able to prevent the refined models from undesired structural deviations, thereby leading to more consistent improvements. This chapter will include a detailed analysis of the performance of the local quality assessment guided MD-based protocol versus that deployed in the original ReFOLD method.

Key words Protein model refinement, Tertiary structure prediction, Molecular dynamics, Model quality assessment (MQA), Protein structure prediction, Protein modelling, Critical Assessment of Techniques for Protein Structure Prediction (CASP)

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-1-0716-2974-1_7.

Sławomir Filipek (ed.), Homology Modeling: Methods and Protocols, Methods in Molecular Biology, vol. 2627, https://doi.org/10.1007/978-1-0716-2974-1_7, © Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Determining protein structures in atomic detail is vital for a more complete understanding of protein function [1, 2]. X-ray crystallography [3–5], nuclear magnetic resonance (NMR) [6, 7], and cryo-electron microscopy [7, 8] are the main methods used to determine protein 3D structures experimentally. However, such experimental methods are comparatively slow, labor-intensive, and costly, and they are not yet efficient enough to bridge the huge gap between known sequences and available structures, which is widening due to next-generation sequencing [9–11]. Therefore, in silico approaches should be taken into consideration, as protein 3D modelling is relatively fast and cheap and can produce structures of near-experimental quality [1, 12–14].

Most successful in silico protein modelling pipelines have included the prediction of 3D models followed by their quality assessment and/or their refinement [15, 16]. The prediction of 3D models is now possible at high accuracy using both template-free (FM) and template-based modelling (TBM) methods, following the recent application of deep learning methods [17]. Predicting tertiary structures with TBM and/or FM methods may result in the generation of many dozens of 3D models, or decoys, in various alternative conformations [16, 18, 19]. The predicted 3D structures can be assessed using model quality assessment programs (MQAPs) in terms of their global and local quality in order to find the most native-like 3D model [20–22].

1.1 The Local Quality Estimation of 3D Models

The assessment of the global and local quality of the predicted 3D models may include the evaluation of abnormal bonds and angles, residue solvent accessibility/burial, secondary structures, or similarity with the mean predicted template-based model [23, 24]. The local quality assessment score provides residue-level information about which parts of the predicted structure are accurately modelled and which are not. Most of the state-of-the-art MQAPs produce both global and local quality assessment scores [22, 25, 26]. The ModFOLD server developed by the McGuffin group was first tested in the seventh critical assessment of structure prediction experiment (CASP7) [1, 24], and it is now routinely used to evaluate the global and local quality of predicted 3D models [23, 24, 27]. Since its inception in CASP7, the ModFOLD server has undergone major development over successive versions, with each version providing incremental improvements in the accuracy of both global and local quality estimates [1, 20, 24, 27].

ModFOLD6 combines different local quality assessment measurements to produce the per-residue accuracy score using a neural network [28]. The per-residue accuracy score relates to the predicted C-alpha distance of each residue in a model from its equivalent residue in the native structure [28]. The first local quality assessment component of ModFOLD6 is the contact distance agreement (CDA) between the contacts predicted by MetaPSICOV [29] and the contacts in the predicted 3D models [28, 29]. The second component is the secondary structure agreement (SSA), a new pure single-model local quality assessment method based on the agreement between the secondary structure of each residue predicted by PSIPRED [30] and the secondary structure state of each residue in the model according to the Dictionary of Secondary Structures of Proteins (DSSP) [28, 30, 31]. The ProQ2 [32] local model quality assessment score is also included in the calculation of the local quality assessment score [28, 32]. The IntFOLD4 reference set of 130 models is used to produce the local quality assessment score in quasi-single mode [22, 28]. The ModFOLDclustQ single local quality assessment measurement [28] is also added to the combination. The Disorder B-factor Agreement (DBA) between the disordered residues predicted by DISOPRED3 [33] and the accuracy score produced by ModFOLD5_single is used as the last component of the local model quality assessment [28, 33]. The six local quality assessment measurements for each residue are used as inputs to a simple multilayer NN in order to generate the final single per-residue accuracy score [28]. The global quality scores are also derived from the combination of various mean local quality assessment scores [28].

1.2 The Refinement of the Predicted 3D Models

The accuracy of protein structures is an important factor for further biological applications, such as drug discovery and the prediction of protein interactions and function [34, 35]. Strategies for the refinement of predicted 3D models have emerged in recent years, in order to increase the local and global accuracy of the predicted structures and move them toward experimental accuracy [21, 22, 34]. However, obtaining consistent and significant increases in the accuracy of the predicted 3D models using refinement has remained elusive. The progress of the different refinement approaches has been well documented since CASP8, when the refinement category was introduced [36–39]. Although the folds of protein targets are often well predicted using TBM and FM, the 3D models may still contain significant errors, including irregular bonds, geometrical clashes, and unrealistic angles [34]. Refinement approaches have been developed with the goal of fixing these errors to improve the overall accuracy of the predicted models [34, 40]. However, the refinement of predicted models may also lead to deterioration in accuracy, especially for structures predicted by TBM, as there is often less room for improvement [34, 40]. It has also been challenging to specifically select the improved models from among a pool of alternatives [34, 38, 41, 42].

The refinement of predicted 3D models can be divided into two main stages: sampling and scoring [34, 43–45]. In the sampling stage, different sampling approaches are used to generate 3D models that improve on the initial models [34, 45]. Subsequently, the generated models must be scored to identify those which have benefited most from the refinement process [34, 38, 39, 45]. There are two major categories of sampling strategies: (1) the fully automated server-based programs and (2) the non-server-based highly CPU intensive programs [34, 46–48]. Knowledge-based approaches [49–53], Monte Carlo simulations [50, 52, 54–59], and physics-based MD simulations [34, 52, 59–66] have all been combined in various sampling approaches. The automated server-based methods have relied principally on side-chain optimization and energy minimization of the structures. The automated server-based approaches are more practical and scalable, as they require fewer computational resources than the non-server-based sampling strategies [34, 38, 41, 42, 48]. The rapid automated approaches have also been found to be relatively conservative and risk-averse. As a result, server methods have been less successful in CASP experiments in cases where there is plenty of room for the improvement of models (i.e., poorly predicted initial structures), compared with the more intensive sampling strategies, which utilize MD-based approaches [34, 38, 39, 42, 45, 46, 65, 67].

Since CASP10, GPU-accelerated computing and different restraint strategies have been found to improve the efficiency and success of MD-based sampling strategies [36, 45–47, 68]. The Feig group developed an MD-based protocol using C-alpha restraints under explicit solvent conditions, and this protocol has been a milestone toward more consistent refinement [39, 69–71]. Nevertheless, the sampling protocol still requires impractically large amounts of computational resources (75,000 core hours, or 12 days on 256 cores) to refine a single refinement target model [72], so it is not a scalable approach for use by fully automated prediction pipelines.

The force fields used in the MD-based approaches play a vital role in the determination of the molecular geometries and clashes. The Chemistry at HARvard Macromolecular Mechanics (CHARMM) c22/CMAP [73, 74] and c36 [74] versions and AMBER ff14SB [75] are among the popular force fields used in the MD-based approaches. However, the flaws in the force fields may lead to structural deviations from the native basin [52, 62, 71, 72].
Therefore, further improvement of force fields is important for our ability to move 3D models toward the native structure [52, 62, 71, 72]. Advances in high-throughput computing (HTC), cloud resources, and improved force fields may also provide more opportunities in the future for groups with limited resources to refine multiple targets. The majority of high-performing groups in the recent CASP experiments have utilized MD-based sampling strategies as a part of their refinement pipelines [52, 62, 71, 72]. Nevertheless, to avoid structural deviations resulting from the use of imperfect force fields, many prediction groups in CASP experiments have applied different restraint strategies, based on the knowledge of prior structures [45, 63, 76–79], on all C-alphas [48, 71, 72, 80], and on flat-bottom potentials with widths of 2–4 Å [45, 63, 76, 81].
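To make the idea of a flat-bottom restraint concrete, the following is a minimal sketch of one common form of such a potential. It assumes that the width is the displacement an atom may move from its reference position without penalty and that the penalty beyond the width is harmonic (force constant times the squared excess); the function name, the default force constant, and this exact functional form are illustrative assumptions, not the settings of any CASP group.

```python
def flat_bottom_energy(displacement, width, force_constant=1.0):
    """Restraint energy that is zero inside the well and harmonic outside it.

    displacement   -- distance (Å) of an atom from its reference position
    width          -- distance (Å) the atom may move without any penalty
    force_constant -- assumed harmonic constant applied beyond the width
    """
    if displacement < 0 or width < 0:
        raise ValueError("displacement and width must be non-negative")
    excess = max(0.0, displacement - width)
    return force_constant * excess ** 2


# The same 3.5 Å displacement under wells of 2, 3, and 4 Å.
for width in (2.0, 3.0, 4.0):
    print(width, flat_bottom_energy(3.5, width))
```

With a wider well the same displacement is penalized less or not at all, which is why the chosen width controls how freely a restrained atom can move during sampling.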

1.3 The ReFOLD Server

The ReFOLD server was developed by our group to refine 3D models, and it incorporates MD in a way that requires far less computational effort than other MD-based protocols [48]. The refinement of 3D models using ReFOLD consists of three main stages to improve the local and global quality of the structure. In the first stage, the 3D model is refined using i3Drefine [26] with 20 iterative cycles. The i3Drefine method is based on the optimization of hydrogen bonds and energy minimization [21, 26]. In the next stage, the iterative refinement of the 3D models is followed by an MD-based protocol inspired by that of Feig and Mirjalili [80]. The MD-based protocol applies restraints to all C-alphas and utilizes the NAMD simulation package [82]. In the final stage of the refinement pipeline, the 3D models generated by the iterative and MD-based protocols are scored and ranked using ModFOLD6 in terms of their local and global quality [48]. This unique refinement approach showed promising performance in CASP12 [39, 48]. The refinement of large targets using our MD-based protocol has also been conducted using more modest computational resources, such as normal desktop and laptop computers, compared with the large supercomputers required by the other MD-based refinement groups in recent CASP experiments [48, 82]. Although the original MD-based protocol of ReFOLD is neither highly computationally intensive nor time-consuming (e.g., 909 min for a 257-residue model) compared with other MD-based approaches [48], structural deviations from the native basin were observed in CASP12 due to the flaws in the force fields. Different restraint strategies have been applied to prevent the 3D models from deviating structurally and to minimize the effects of the flaws in the force fields [34, 48, 72, 77, 80]. However, one of the main issues is determining which regions should be restrained during the MD simulations.

Local quality assessment measurements can be used to provide more information about the accuracy of specific regions within the starting 3D models of the refinement targets [28]. The ModFOLD method is a leading MQAP, which is able to produce per-residue accuracy scores that measure the predicted distance of each residue (in ångströms, Å) from the native structure [22, 24, 28]. In this study, the per-residue accuracy score produced by ModFOLD6 is used to guide the original MD-based protocol of ReFOLD in an attempt to increase the accuracy of the 3D models toward the native structure. The restraint strategy is based on the philosophy “If it isn’t broken, then don’t fix it”; in other words, the specific regions in the starting model that are identified by ModFOLD6 as accurately predicted will be restrained (and therefore not fixed), and conversely the poorly predicted regions identified as needing further improvement will be focused on during the MD simulations [28].
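The restraint philosophy described above can be sketched as a simple selection over per-residue scores. The sketch assumes the scores are predicted distances in Å with lower values meaning better-predicted residues; the function name, the cutoff value, and the example scores are hypothetical and chosen only for illustration.

```python
def restrained_segments(scores, cutoff):
    """Return inclusive (start, end) index ranges of residues to restrain.

    A residue is restrained when its predicted distance from the native
    structure (Å) is below the cutoff; consecutive restrained residues are
    merged into one segment, and all other residues are left free to move.
    """
    segments = []
    start = None
    for index, score in enumerate(scores):
        if score < cutoff:
            if start is None:
                start = index  # a new well-predicted stretch begins here
        elif start is not None:
            segments.append((start, index - 1))
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))
    return segments


# Made-up per-residue scores (Å) for a seven-residue model.
example_scores = [1.2, 2.5, 2.9, 6.0, 9.4, 2.0, 1.1]
print(restrained_segments(example_scores, cutoff=3.0))
```

Grouping residues into contiguous segments is only one convenient representation; a per-residue list of flags would carry the same information.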

Recep Adiyaman and Liam J. McGuffin

2 Materials

The performance of the MD-based protocols was tested using 56 CASP12 targets: 22 regular and 34 refinement targets. Of the 56 targets, 29 were designated as TBM targets, 14 as FM targets, and 13 as TBM/FM targets. The models for the CASP12 targets were downloaded from the CASP website (http://predictioncenter.org/download_area//CASP12). We also used the 3D models generated by the original MD-based protocol of ReFOLD during CASP12, which were obtained from our ReFOLD server result pages for the related targets. The MD simulations were conducted using NAMD version 2.10 [82] in GPU-accelerated mode on a desktop with an Intel® Core™ i7 processor and an NVIDIA GeForce GTX 1070 GPU card. ModFOLD6 was used to generate the per-residue accuracy scores and to rank the refined 3D models. The performance of ModFOLD6 in selecting the optimal models from among the 3D models was also analyzed using standard official metrics, such as the GDT-HA [83] and MolProbity [84] scores.

3 Methods

The refinement of the 3D models was guided by the local quality assessment scores produced by ModFOLD6 [28] for each target, starting with the identification of the poorly and well-predicted regions. The starting models obtained from the CASP website were submitted to ModFOLD6 to obtain the local quality assessment scores [28]. ModFOLD6 also writes the per-residue accuracy score into the B-factor (temperature factor) column of the PDB structures [28]. After the identification of the well-predicted regions in the 3D model, a threshold on the per-residue accuracy score was set, considering the distribution of the scores, to determine which regions should be restrained during the MD simulations. It was proposed that if the per-residue accuracy score produced by ModFOLD6 was less than 3 Å, then the residues should not be refined further, as this may lead to deterioration in the overall quality of the 3D model [28]. If it was more than 8 Å, then the residues required substantial refinement in order to increase the accuracy of the 3D model. In addition, three different thresholds (3, 5, and 8 Å) were applied during the MD simulations to observe the effects of varying the threshold. Once the threshold had been determined, restraints were applied to the residues that fell below it, prior to each MD simulation step.
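The threshold step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the per-residue accuracy score is stored in the B-factor column (columns 61-66) of standard fixed-width PDB ATOM records, and the helper names (`atom_line`, `per_residue_error`, `split_by_threshold`) and the toy coordinates and scores are hypothetical.

```python
def atom_line(serial, name, resname, chain, resseq, xyz, occupancy, bfactor):
    """Format one fixed-width PDB ATOM record (helper for the toy example)."""
    x, y, z = xyz
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{bfactor:6.2f}")


def per_residue_error(pdb_lines):
    """Return {(chain, residue number): predicted error in Å} from the B-factor column."""
    errors = {}
    for line in pdb_lines:
        if line.startswith("ATOM"):
            key = (line[21], int(line[22:26]))
            # Columns 61-66 hold the temperature factor; keep the first value per residue.
            errors.setdefault(key, float(line[60:66]))
    return errors


def split_by_threshold(errors, threshold):
    """Residues below the threshold are restrained; the rest are left free to move."""
    restrained = [key for key, err in errors.items() if err < threshold]
    free = [key for key, err in errors.items() if err >= threshold]
    return restrained, free


# Hypothetical three-residue model with made-up per-residue scores (Å).
toy_model = [
    atom_line(1, "CA", "ALA", "A", 1, (0.0, 0.0, 0.0), 1.0, 1.20),
    atom_line(2, "CA", "GLY", "A", 2, (3.8, 0.0, 0.0), 1.0, 4.50),
    atom_line(3, "CA", "SER", "A", 3, (7.6, 0.0, 0.0), 1.0, 9.75),
]

for cutoff in (3.0, 5.0, 8.0):
    restrained, free = split_by_threshold(per_residue_error(toy_model), cutoff)
    print(f"threshold {cutoff} Å: restrain {restrained}, refine {free}")
```

Raising the threshold from 3 to 8 Å restrains progressively more of the model, which is the trade-off the three thresholds are meant to probe.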

Protein Model Refinement Guided by Local Quality


Fig. 1 The refinement of an FM/TBM CASP12 target by the local quality assessment guided MD-based protocol. The per-residue accuracy score is produced by ModFOLD6, and the restraint strategy based on this score is then applied to determine the regions that are restrained during the MD simulation: (a) the initial structure provided by CASP (the CASP12 refinement target TR896); (b) the initial structure colored using the occupancy column, where blue indicates restrained regions and red indicates unrestrained regions during the MD simulation; (c) the superposition of the initial structure (cyan), the best model generated by the local quality assessment guided MD-based protocol (magenta), and the native structure (green). Comparing the initial structure with the best model, the GDT-HA score improves from 0.468 to 0.4826

For all atoms below the determined accuracy threshold, a weak harmonic positional restraint with a force constant of 0.05 kcal/mol/Å² was applied to prevent them from deviating from the native basin [48, 80]. The atoms with per-residue errors above the determined threshold, on the other hand, are refined further toward the native structure. Unlike in other applications, our restraint strategy was applied to all atoms below the determined threshold, including the C-alphas (Fig. 1a, b).
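As a rough illustration of this restraint step, the sketch below writes a per-atom force constant into the occupancy column of a PDB file and evaluates the corresponding penalty. It rests on several assumptions that the text does not confirm: that every atom carries its residue's accuracy score in the B-factor column, that the force constants are stored in the occupancy column (suggested only by the coloring in Fig. 1b), and that the penalty has the simple form U = k·d². The names `ATOM`, `tag_restraints`, and `harmonic_penalty` are hypothetical.

```python
# Fixed-width template for a C-alpha ATOM record (toy data only).
ATOM = "ATOM  {:5d}  CA  {:>3s} A{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"


def tag_restraints(pdb_lines, threshold, k=0.05):
    """Set the occupancy column to k for atoms whose predicted error is below threshold."""
    tagged = []
    for line in pdb_lines:
        if line.startswith("ATOM"):
            error = float(line[60:66])              # per-residue score in the B-factor column
            force_constant = k if error < threshold else 0.0
            line = f"{line[:54]}{force_constant:6.2f}{line[60:]}"  # columns 55-60
        tagged.append(line)
    return tagged


def harmonic_penalty(displacement, k=0.05):
    """Energy (kcal/mol) for moving a restrained atom by `displacement` Å, assuming U = k*d^2."""
    return k * displacement ** 2


# Hypothetical two-residue model: one well-predicted residue, one poorly predicted.
model = [
    ATOM.format(1, "ALA", 1, 0.0, 0.0, 0.0, 1.0, 1.20),
    ATOM.format(2, "GLY", 2, 3.8, 0.0, 0.0, 1.0, 6.40),
]

for line in tag_restraints(model, threshold=3.0):
    print(line)
print("penalty for a 2 Å shift:", harmonic_penalty(2.0), "kcal/mol")
```

With a constant this small, a restrained atom can still drift by an ångström or two at modest energetic cost, which is why the restraint is described as weak. In NAMD, harmonic restraints of this kind are usually set up through the `constraints`, `consref`, `conskfile`, and `conskcol` options; the exact configuration used for this protocol is not given in the text.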


The distinctive part of the new MD-based protocol is our use of the per-residue accuracy score produced by ModFOLD6 to guide the MD-based protocol. The MD simulation parameters optimized for the original MD-based protocol of ReFOLD, inspired by Feig and Mirjalili [48, 80], were reused so that the effect of the restraint strategy based on the per-residue accuracy score could be assessed [28, 48, 80]. The MD simulations were maintained at 298 K and a pressure of 1 bar in explicit solvent, to approximate normal cellular conditions [73, 82]. The default CHARMM22/27 force field and TIP3P water model parameters were used to conduct the simulations [48, 73, 80, 85]. Other parameters, which were optimized for the original MD-based protocol of ReFOLD, were also used [48]: Na+ or Cl− ions were inserted to balance the charge of the system for the ion-sensitive proteins, particle mesh Ewald (PME) was utilized, and the temperature (298 K) was controlled by Langevin dynamics [48, 86, 87]. Bonds involving hydrogen atoms were kept rigid using the rigidBonds option with a 2 fs time step. For each system, four parallel simulations were run for 2 ns each (8 ns in total), and 164 3D models were generated per target after the completion of the MD simulations [48]. The local quality assessment guided MD-based process is outlined in Fig. 1.

3.1 The Performance of the Local Quality Assessment Guided MD-Based Protocol

The results were analyzed by comparing the original MD-based protocol of ReFOLD with the local quality assessment guided MD-based protocol, taking into account the different prediction categories of the initial structures. In the final section, the performance of ModFOLD6 was also investigated in terms of the selection of the optimal model. The aim of this research is to analyze the effectiveness of using the local quality assessment score to guide the original MD-based protocol of ReFOLD. The original MD-based protocol of ReFOLD tested in CASP12 did not always perform as well at improving the quality of the TBM models as it did for the FM models, due to detrimental structural deviations from the native structure [28, 48]. The per-residue accuracy score produced by ModFOLD6 has been shown to provide accurate, detailed information about the local quality of 3D models in both the CAMEO and CASP experiments [28, 39, 88]. Using the per-residue accuracy score to guide the MD simulation process may help to avoid unnecessary deviations away from the native basin (see Note 1). To investigate the performance of the restraint strategy based on the per-residue accuracy score, the refined 3D models generated by the local quality assessment guided MD-based protocol were compared with the 3D models generated by the original MD-based protocol of ReFOLD. We also observed that the prediction category of the initial target model is an important factor affecting both the success of refinement and the local quality assessment score that is used to guide the MD-based protocol. In other words, the per-residue accuracy scores for TBM targets may provide more noticeable guidance compared with those of FM targets, as there is less room for improvement with TBM targets and more opportunities for introducing errors (see Note 1) [28, 48].

Although the local quality assessment guided MD-based protocol has a slightly higher cumulative mean GDT-HA score (∑GDT-HAmean = 3.7697908) and a higher cumulative minimum GDT-HA score (∑GDT-HAmin = 3.5713) than the original ReFOLD protocol (∑GDT-HAmean = 3.7301776, ∑GDT-HAmin = 3.3081), the original MD-based protocol of ReFOLD showed better performance according to the cumulative maximum GDT-HA score for the FM targets (∑GDT-HAmax of 4.218 vs. 3.9974) (see examples in Fig. 2 and Table 1). Using the stricter restraint strategy, based on the local quality assessment on all atoms below the determined threshold, prevented the generation of 3D models with the larger structural deviations observed in the original protocol, which led to a population of models with GDT-HA scores gathered closer to that of the starting model (Fig. 2). Conversely, the 3D models sampled by the original protocol of ReFOLD, which applies weak restraints on the C-alphas, have a wider range of GDT-HA scores than those sampled by the local quality assessment guided protocol (Fig. 2). Both protocols were able to improve the quality of the FM targets, as there is plenty of room for improvement with these targets; however, the local quality assessment guided MD-based protocol is more risk-averse than the original MD-based protocol of ReFOLD (Fig. 2 and Table 1).

Fig. 2 Comparison of the performance of the original MD-based protocol of ReFOLD and the local quality assessment guided MD-based protocol (the applied threshold 3 Å) on TR870 (an FM CASP12 target, GDT-HA score of the initial structure 0.25). The blue line and green bars represent the local quality assessment guided MD-based protocol, and the red line and yellow bars represent the original MD-based protocol of ReFOLD (higher GDT-HA scores are better)

Table 1 Comparison of the performance of the local quality assessment guided MD-based protocol and the original MD-based protocol of ReFOLD on CASP12 FM targets according to GDT-HA score (higher GDT-HA scores are better). Targets are listed by CASP target ID by domain. DiffMin, DiffMean, and DiffMax are the differences between the minimum, mean, and maximum scores and the score of the starting model

The original MD-based protocol of ReFOLD

Target      Category    Starting model  Minimum score  DiffMin  Mean score  DiffMean    Maximum score  DiffMax
T0859       Regular     0.1681          0.1571         -0.011   0.18277     0.01467     0.2146         0.0465
T0862       Regular     0.4032          0.3575         -0.0457  0.390687    -0.012513   0.4328         0.0296
T0866       Regular     0.2391          0.187          -0.0521  0.222985    -0.016115   0.2543         0.0152
T0880       Regular     0.0596          0.0544         -0.0052  0.0597755   0.0001755   0.0674         0.0078
T0886       Regular     0.1681          0.1638         -0.0043  0.179839    0.011739    0.1987         0.0306
T0897       Regular     0.0582          0.0534         -0.0048  0.0595611   0.0013611   0.0687         0.0105
T0904       Regular     0.2267          0.1752         -0.0515  0.213548    -0.013152   0.2508         0.0241
T0915       Regular     0.2808          0.2451         -0.0357  0.268702    -0.012098   0.3068         0.026
TR594       Refinement  0.3427          0.323          -0.0197  0.352589    0.009889    0.3876         0.0449
TR862       Refinement  0.4032          0.3495         -0.0537  0.397913    -0.005287   0.457          0.0538
TR866       Refinement  0.6082          0.5433         -0.0649  0.597188    -0.011012   0.6587         0.0505
TR869       Refinement  0.2885          0.2476         -0.0409  0.272706    -0.015794   0.2933         0.0048
TR870       Refinement  0.25            0.2064         -0.0436  0.249615    -0.000385   0.305          0.055
TR905       Refinement  0.3244          0.2448         -0.0796  0.282299    -0.042101   0.3223         -0.0021
The cumulative scores   3.8208          3.3081         -0.5127  3.7301776   -0.0906224  4.218          0.3972

The local quality assessment guided MD-based protocol (threshold is 3 Å)

Target      Minimum score  DiffMin  Mean score  DiffMean    Maximum score  DiffMax  Wilcoxon test p-value
T0859       0.1615         -0.0066  0.17192     0.00382     0.1814         0.0133   2.20E-16
T0862       0.371          -0.0322  0.396678    -0.006522   0.414          0.0108   0.01662
T0866       0.2304         -0.0087  0.241027    0.001927    0.2565         0.0174   2.20E-16
T0880       0.0544         -0.0052  0.0576659   -0.0019341  0.0622         0.0026   1.60E-15
T0886       0.1627         -0.0054  0.174943    0.006843    0.1921         0.024    2.93E-06
T0897       0.0563         -0.0019  0.0587799   0.0005799   0.062          0.0038   0.01064
T0904       0.2122         -0.0145  0.221043    -0.005657   0.2323         0.0056   8.90E-07
T0915       0.2614         -0.0194  0.273628    -0.007172   0.289          0.0082   2.67E-08
TR594       0.3287         -0.014   0.344157    0.001457    0.3652         0.0225   0.000228
TR862       0.3656         -0.0376  0.394813    -0.008387   0.422          0.0188   0.01722
TR866       0.5769         -0.0313  0.602386    -0.005814   0.6346         0.0264   9.27E-07
TR869       0.262          -0.0265  0.276501    -0.011999   0.2909         0.0024   5.43E-06
TR870       0.2431         -0.0069  0.257492    0.007492    0.2729         0.0229   3.06E-11
TR905       0.2851         -0.0393  0.298757    -0.025643   0.3223         -0.0021  2.40E-16
Cumulative  3.5713         -0.2495  3.7697908   -0.0510092  3.9974         0.1766

One-tailed unpaired Wilcoxon tests were also used to compare the MD-based protocols (mean p-value 3.19E-03)

The local quality assessment guided MD-based protocol showed a similar trend for the FM/TBM targets, with the 3D models sampled by the original ReFOLD method having a wider range of quality (Fig. 3 and Table 2). The 3D models sampled by the local quality assessment guided MD-based protocol have higher cumulative mean GDT-HA scores (∑GDT-HAmean of 4.059576 vs. 3.947198) and higher cumulative minimum GDT-HA scores (∑GDT-HAmin of 3.8526 vs. 3.5384), but they have lower cumulative maximum GDT-HA scores than the original MD-based protocol (∑GDT-HAmax of 4.3066 vs. 4.4411). The local quality assessment guided MD-based protocol has therefore outperformed the original MD-based protocol of ReFOLD in terms of the mean and minimum GDT-HA scores (Table 2).

Fig. 3 Comparison of the performance of the original MD-based protocol of ReFOLD and the local quality assessment guided MD-based protocol (the applied threshold 8 Å) on T0890 (an FM/TBM CASP12 target, GDT-HA score of the initial structure 0.1742). The blue line and green bars represent the local quality assessment guided MD-based protocol, and the red line and yellow bars represent the original MD-based protocol of ReFOLD (higher GDT-HA scores are better)

Table 2 Comparison of the performance of the local quality assessment guided MD-based protocol and the original MD-based protocol of ReFOLD on CASP12 FM/TBM targets according to GDT-HA score (higher GDT-HA scores are better). Targets are listed by CASP target ID by domain. DiffMin, DiffMean, and DiffMax are the differences between the minimum, mean, and maximum scores and the score of the starting model

The original MD-based protocol of ReFOLD

Target      Category    Starting model  Minimum score  DiffMin  Mean score  DiffMean   Maximum score  DiffMax
T0890       Regular     0.1742          0.1516         -0.0226  0.162086    -0.012114  0.1769         0.0027
T0898       Regular     0.1599          0.1242         -0.0357  0.140154    -0.019746  0.163          0.0031
T0899       Regular     0.1628          0.1347         -0.0281  0.152224    -0.010576  0.1707         0.0079
T0909       Regular     0.2703          0.268          -0.0023  0.303593    0.033293   0.3341         0.0638
T0945       Regular     0.3493          0.3167         -0.0326  0.355063    0.005763   0.378          0.0287
TR694       Refinement  0.2376          0.2129         -0.0247  0.233538    -0.004062  0.2538         0.0162
TR868       Refinement  0.6143          0.5095         -0.1048  0.568945    -0.045355  0.6738         0.0595
TR890       Refinement  0.3245          0.2434         -0.0811  0.281735    -0.042765  0.3285         0.004
TR896       Refinement  0.468           0.3837         -0.0843  0.421912    -0.046088  0.468          0
TR898       Refinement  0.2524          0.2123         -0.0401  0.245933    -0.006467  0.2901         0.0377
TR901       Refinement  0.3061          0.2388         -0.0673  0.273182    -0.032918  0.315          0.0089
TR909       Refinement  0.4257          0.3566         -0.0691  0.393724    -0.031976  0.4399         0.0142
TR945       Refinement  0.412           0.386          -0.026   0.415109    0.003109   0.4493         0.0373
The cumulative scores   4.1571          3.5384         -0.6187  3.947198    -0.209902  4.4411         0.284

The local quality assessment guided MD-based protocol (threshold is 3 Å)

Target      Minimum score  DiffMin  Mean score  DiffMean   Maximum score  DiffMax  Wilcoxon test p-value
T0890       0.1609         -0.0133  0.168949    -0.005251  0.1769         0.0027   2.20E-16
T0898       0.1475         -0.0124  0.158659    -0.001241  0.1693         0.0094   2.20E-16
T0899       0.1448         -0.018   0.153423    -0.009377  0.1664         0.0036   2.20E-16
T0909       0.268          -0.0023  0.278384    0.008084   0.2943         0.024    2.20E-16
T0945       0.3327         -0.0166  0.34364     -0.00566   0.3587         0.0094   0.00347
TR694       0.212          -0.0256  0.22704     -0.01056   0.2414         0.0038   4.26E-13
TR868       0.5738         -0.0405  0.604776    -0.009524  0.6548         0.0405   2.20E-16
TR890       0.2899         -0.0346  0.305804    -0.018696  0.3271         0.0026   2.20E-16
TR896       0.407          -0.061   0.443089    -0.024911  0.4826         0.0146   2.20E-16
TR898       0.2358         -0.0166  0.24816     -0.00424   0.2618         0.0094   2.20E-16
TR901       0.2836         -0.0225  0.30088     -0.00522   0.3161         0.01     2.20E-16
TR909       0.4159         -0.0098  0.430218    0.004518   0.4452         0.0195   2.20E-16
TR945       0.3807         -0.0313  0.396554    -0.015446  0.412          0        0.00358
Cumulative  3.8526         -0.3045  4.059576    -0.097524  4.3066         0.1495

One-tailed unpaired Wilcoxon tests were also used to compare the MD-based protocols (mean p-value 5.42E-04)

It is worth noting that the initial 3D models for the TBM targets have more accurate structures than those for the FM targets. Refining 3D models that are already highly accurate is much more challenging, as there is much less room for improvement. Moreover, the refinement of the 3D models predicted by TBM may also lead to deterioration in the quality of the initial structures. The original MD-based protocol of ReFOLD suffered from the lack of reliable guidance to improve the quality of the initial structures of TBM targets during CASP12. In the new protocol, the regions identified as well predicted by the ModFOLD6 server in the initial structures were restrained to prevent them from deviating from the native basin, while regions identified as poorly predicted were left unrestrained to allow them the opportunity to improve. The population of 3D models sampled by the original MD-based protocol of ReFOLD contained more errors overall than the population of models generated by the local quality assessment guided MD-based protocol (∑GDT-HAmean of 14.453249 vs. 15.157452; ∑GDT-HAmin of 12.7742 vs. 14.3863; Table 3 and Fig. 4). This shows that the restraint strategy based on the per-residue accuracy score has managed to prevent the generated 3D models from deviating from the native basin in comparison with those produced using the original MD-based protocol of ReFOLD. Although the local quality assessment guided MD-based protocol showed a similar trend in TBM as in FM and FM/TBM, the original MD-based protocol of ReFOLD did not perform as well due to the lack of reliable guidance (Fig. 4). Nevertheless, the 3D models generated by the original MD-based protocol of ReFOLD have a higher cumulative maximum GDT-HA score, as seen for the FM and FM/TBM targets (∑GDT-HAmax of 16.2173 vs. 15.9957). However, the selection of the very best model is not usually possible prior to the experimental solution of the target structure, so the cumulative maximum GDT-HA score is perhaps less important than the cumulative mean GDT-HA score. That is to say, there is a greater chance of selecting an improved model from a population of models with a higher mean quality and a lower range in quality.
Therefore, in this respect, the new local quality assessment guided MD-based protocol has performed much better than the original MD-based protocol of ReFOLD, which supports the reliability of the targeted restraint strategy (see Notes 2 and 3). Although the higher cumulative mean score of the models generated by the local quality assessment guided MD-based protocol demonstrates that on average the quality of the 3D models is improved, the protocol did not manage to improve upon all targets (Table 3). To gauge the effects of applying different thresholds for our local quality guided protocol, we applied restraints to residues with per-residue accuracy scores below 3, 5, and 8 Å [...] residues with scores above 8 Å may be in need of more refinement (Supplementary Tables S1, S2, and S3).
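The argument about cumulative scores can be made concrete with a small sketch. The code below computes the minimum, mean, and maximum GDT-HA of a model population, their differences from the starting model (the Diff columns of the tables), the fraction of models that improve on the start, and the cumulative sums over targets. All scores here are invented for illustration, and the function names `summarise_population` and `cumulative` are hypothetical.

```python
from statistics import mean


def summarise_population(start, scores):
    """Min/mean/max of a model population and their differences from the starting model."""
    stats = {"min": min(scores), "mean": mean(scores), "max": max(scores)}
    diffs = {f"diff_{name}": value - start for name, value in stats.items()}
    improved = sum(score > start for score in scores) / len(scores)
    return {**stats, **diffs, "fraction_improved": improved}


def cumulative(summaries, key):
    """Sum one statistic over all targets (the cumulative scores in the tables)."""
    return sum(summary[key] for summary in summaries)


# Hypothetical GDT-HA scores: (starting model, sampled models) for two targets.
# "wide" mimics a protocol with a broad spread, "narrow" a more restrained one.
wide = [(0.40, [0.30, 0.36, 0.41, 0.47]), (0.60, [0.48, 0.55, 0.61, 0.66])]
narrow = [(0.40, [0.39, 0.41, 0.42, 0.43]), (0.60, [0.58, 0.61, 0.62, 0.63])]

for label, protocol in (("wide", wide), ("narrow", narrow)):
    summaries = [summarise_population(start, scores) for start, scores in protocol]
    print(label,
          "sum of means:", round(cumulative(summaries, "mean"), 4),
          "sum of maxima:", round(cumulative(summaries, "max"), 4),
          "fraction improved:", [s["fraction_improved"] for s in summaries])
```

In this toy case the broad population contains the single best model, but the narrow population has the higher cumulative mean and a larger fraction of models that beat the start, so a model picked from it without knowledge of the native structure is more likely to be an improvement.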

Fig. 4 Comparison of the performance of the original MD-based protocol of ReFOLD and the local quality assessment guided MD-based protocol (the applied threshold 3 Å) on TR882 (a TBM CASP12 target, GDT-HA score of the initial structure 0.6899). The blue line and green bars represent the local quality assessment guided MD-based protocol, and the red line and yellow bars represent the original MD-based protocol of ReFOLD (higher GDT-HA scores are better)

Table 3 Comparison of the performance of the local quality assessment guided MD-based protocol and the original MD-based protocol of ReFOLD on CASP12 TBM targets according to GDT-HA score (higher GDT-HA scores are better). Targets are listed by CASP target ID by domain. DiffMin, DiffMean, and DiffMax are the differences between the minimum, mean, and maximum scores and the score of the starting model

The original MD-based protocol of ReFOLD

Target      Category    Starting model  Minimum score  DiffMin  Mean score  DiffMean   Maximum score  DiffMax
T0872       Regular     0.4886          0.3977         -0.0909  0.437362    -0.051238  0.4943         0.0057
T0882       Regular     0.5791          0.5538         -0.0253  0.621886    0.042786   0.6994         0.1203
T0895       Regular     0.5167          0.4229         -0.0938  0.473702    -0.042998  0.525          0.0083
T0911       Regular     0.2972          0.1961         -0.1011  0.226078    -0.071122  0.28           -0.0172
T0913       Regular     0.4149          0.3232         -0.0917  0.353898    -0.061002  0.412          -0.0029
T0944       Regular     0.5504          0.4545         -0.0959  0.512469    -0.037931  0.5514         0.001
T0946       Regular     0.2962          0.2166         -0.0796  0.240111    -0.056089  0.2945         -0.0017
T0947       Regular     0.4814          0.4143         -0.0671  0.453246    -0.028154  0.5086         0.0272
T0948       Regular     0.5235          0.4144         -0.1091  0.477226    -0.046274  0.5352         0.0117
TR520       Refinement  0.581           0.4315         -0.1495  0.523768    -0.057232  0.5818         0.0008
TR872       Refinement  0.5682          0.4716         -0.0966  0.552178    -0.016022  0.6307         0.0625
TR877       Refinement  0.4894          0.4155         -0.0739  0.453761    -0.035639  0.4842         -0.0052
TR879       Refinement  0.633           0.4341         -0.1989  0.497477    -0.135523  0.6261         -0.0069
TR881       Refinement  0.479           0.3205         -0.1585  0.410781    -0.068219  0.4715         -0.0075
TR882       Refinement  0.6899          0.6013         -0.0886  0.668846    -0.021054  0.7437         0.0538
TR885       Refinement  0.7837          0.6418         -0.1419  0.706848    -0.076852  0.7885         0.0048
TR891       Refinement  0.7567          0.6071         -0.1496  0.693176    -0.063524  0.7701         0.0134
TR893       Refinement  0.6908          0.605          -0.0858  0.661648    -0.029152  0.7219         0.0311
TR895       Refinement  0.5146          0.3896         -0.125   0.450184    -0.064416  0.4917         -0.0229
TR913       Refinement  0.4534          0.3587         -0.0947  0.401011    -0.052389  0.4482         -0.0052
TR917       Refinement  0.6535          0.5799         -0.0736  0.628904    -0.024596  0.6829         0.0294
TR920       Refinement  0.6039          0.484          -0.1199  0.548031    -0.055869  0.6279         0.024
TR921       Refinement  0.4801          0.4203         -0.0598  0.477502    -0.002598  0.5236         0.0435
T0922       Refinement  0.7581          0.5766         -0.1815  0.709603    -0.048497  0.8024         0.0443
TR928       Refinement  0.4274          0.3087         -0.1187  0.356789    -0.070611  0.4267         -0.0007
TR942       Refinement  0.3333          0.2752         -0.0581  0.304669    -0.028631  0.3424         0.0091
TR944       Refinement  0.5603          0.4684         -0.0919  0.51819     -0.04211   0.5642         0.0039
TR947       Refinement  0.5157          0.4557         -0.06    0.502683    -0.013017  0.5357         0.02
TR948       Refinement  0.5956          0.5352         -0.0604  0.591222    -0.004378  0.6527         0.0571
The cumulative scores   15.7156         12.7742        -2.9414  14.453249   -1.262351  16.2173        0.5017

The local quality assessment guided MD-based protocol (threshold is 3 Å)

Target      Minimum score  DiffMin  Mean score  DiffMean   Maximum score  DiffMax  Wilcoxon test p-value
T0872       0.4432         -0.0454  0.472564    -0.016036  0.5028         0.0142   2.20E-16
T0882       0.557          -0.0221  0.588816    0.009716   0.6297         0.0506   2.20E-16
T0895       0.4771         -0.0396  0.501923    -0.014777  0.5292         0.0125   2.20E-16
T0911       0.2794         -0.0178  0.291071    -0.006129  0.3027         0.0055   2.20E-16
T0913       0.3802         -0.0347  0.398787    -0.016113  0.4142         -0.0007  2.20E-16
T0944       0.5            -0.0504  0.533065    -0.017335  0.5534         0.003    2.20E-16
T0946       0.2551         -0.0411  0.27321     -0.02299   0.2954         -0.0008  2.20E-16
T0947       0.4329         -0.0485  0.459048    -0.022352  0.4771         -0.0043  3.10E-13
T0948       0.4698         -0.0537  0.495054    -0.028446  0.5268         0.0033   2.20E-16
TR520       0.5156         -0.0654  0.543085    -0.037915  0.581          0        3.03E-07
TR872       0.5199         -0.0483  0.541712    -0.026488  0.5739         0.0057   5.45E-05
TR877       0.4472         -0.0422  0.466093    -0.023307  0.4877         -0.0017  2.20E-16
TR879       0.5614         -0.0716  0.583705    -0.049295  0.6261         -0.0069  2.20E-16
TR881       0.4257         -0.0533  0.452469    -0.026531  0.4802         0.0012   1.11E-14
TR882       0.6677         -0.0222  0.71137     0.02147    0.75           0.0601   2.20E-16
TR885       0.6995         -0.0842  0.747322    -0.036378  0.7909         0.0072   2.20E-16
TR891       0.683          -0.0737  0.721636    -0.035064  0.7589         0.0022   2.20E-16
TR893       0.6272         -0.0636  0.655179    -0.035621  0.6997         0.0089   0.002565
TR895       0.4688         -0.0458  0.492889    -0.021711  0.5188         0.0042   2.20E-16
TR913       0.4164         -0.037   0.436608    -0.016792  0.4586         0.0052   2.20E-16
TR917       0.61           -0.0435  0.637371    -0.016129  0.6618         0.0083   5.08E-08
TR920       0.5434         -0.0605  0.568527    -0.035373  0.6016         -0.0023  2.20E-16
TR921       0.4402         -0.0399  0.465543    -0.014557  0.4928         0.0127   2.48E-08
T0922       0.7177         -0.0404  0.766796    0.008696   0.8105         0.0524   0.000546
TR928       0.3783         -0.0491  0.396442    -0.030958  0.4245         -0.0029  2.20E-16
TR942       0.3081         -0.0252  0.320707    -0.012593  0.332          -0.0013  2.20E-16
TR944       0.501          -0.0593  0.529974    -0.030326  0.5583         -0.002   2.20E-16
TR947       0.49           -0.0257  0.511695    -0.004005  0.5329         0.0172   3.10E-13
TR948       0.5705         -0.0251  0.594791    -0.000809  0.6242         0.0286   2.20E-16
Cumulative  14.3863        -1.3293  15.157452   -0.558148  15.9957        0.2801

One-tailed unpaired Wilcoxon tests were also used to compare the MD-based protocols (mean p-value 1.09E-04)
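The tables report one-tailed unpaired Wilcoxon tests between the two populations of models for each target. The chapter does not say which software was used, so the following is only a self-contained sketch of such a rank-sum (Mann-Whitney) test in Python. It uses a normal approximation with a continuity correction and no tie correction in the variance, so its p-values are approximate, especially for small samples; the function name `rank_sum_test` and the example scores are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """One-tailed unpaired rank-sum test: do values in x tend to exceed those in y?

    Returns (U statistic for x, approximate p-value from a normal approximation).
    """
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j (ties share it)
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu - 0.5) / sigma            # continuity correction for the upper tail
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return u, p_value


# Hypothetical GDT-HA scores for models from two protocols on one target.
guided = [0.41, 0.42, 0.43, 0.44, 0.45]
original = [0.36, 0.38, 0.40, 0.415, 0.46]
u, p = rank_sum_test(guided, original)
print(f"U = {u}, one-tailed p = {p:.4f}")
```

Because the test is unpaired and rank-based, it compares where the two score distributions sit relative to each other rather than model-by-model differences, which suits populations of independently sampled MD snapshots.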

Nevertheless, 3 Å seems more applicable for the majority of the targets (see Notes 2 and 3). The GDT-HA score is based on the C-alpha superposition of the 3D model with the experimentally determined structure. In contrast, the MolProbity score considers all atoms, including the C-alphas, when calculating the overall quality score. The MolProbity score was therefore used as a measure of the quality of the 3D models generated by both MD-based protocols that does not depend on the native structure. Both protocols performed well in terms of improving the quality of the initial structures according to MolProbity scores. However, the local quality assessment guided MD-based protocol showed much better performance than the original MD-based protocol of ReFOLD according to both the cumulative mean and minimum MolProbity scores (lower MolProbity scores indicate better models; Supplementary Table S4). This also supports the strategy of restraining all atoms below the determined threshold based on the per-residue accuracy, as applying the restraints to all atoms, including the C-alphas, gave a greater improvement in the quality of the 3D models than applying the restraints to the C-alphas alone. It is also evident that the C-alpha superposition-based scores (GDT-HA and GDT-TS) may not be sufficient for comparing refinement protocols, and all atoms should perhaps be considered (see Note 4).

3.1.1 ModFOLD6 in the Refinement Pipeline

ModFOLD6 was also used to gauge our ability to select the best models from among the 3D models sampled by the local quality assessment guided MD-based protocol in the absence of the native structures. Following selection by ModFOLD6, the top selected models for each target were scored using the native structures. In 17 cases, ModFOLD6 selected models that improved on the initial structures according to the GDT-HA score; for most targets, however, it did not (Supplementary Table S5). Nevertheless, it should be noted that the cumulative GDT-HA score of the 3D models selected by ModFOLD6 is still higher than the cumulative mean GDT-HA score. Therefore, selecting refinement models using ModFOLD6 performed better than just relying on random selection. It must also be stressed that MQAPs such as ModFOLD6 have not been optimized specifically for the selection of refinement models, where the quality of models is often already high and the range of quality in the population of models is comparatively low, as all generated models are very similar (see Note 5).

4 Notes

1. ReFOLD, developed by the McGuffin group, managed to refine the 3D models using less computational effort than other MD-based protocols tested in CASP12 [39, 48]. However, significant structural deviations from the native basin have been observed for the refinement of 3D models predicted by FM and particularly TBM, due to the lack of reliable guidance during the MD simulations. To generate 3D models without detrimental deviations from the native structure, the per-residue accuracy scores produced by ModFOLD6, which give the predicted C-alpha distances from the native structure, were used to guide the MD-based protocol.
2. A threshold based on the per-residue accuracy score was applied as a part of the new restraint strategy to restrict the refinement only to those specific regions which needed improvement. The restraint strategy based on the per-residue accuracy score has managed to prevent the 3D models generated by the MD-based protocol from undergoing detrimental structural deviations. Using the local quality assessment to guide the MD-based protocol has been a major step toward more consistent refinement [34, 39, 48].
3. It is clear that the local quality assessment guided MD-based protocol has managed to generate more 3D models which are closer to the native structure in comparison with the original MD-based protocol of ReFOLD, according to the GDT-HA scores. The 3D models generated by the local quality assessment guided MD-based protocol have much higher cumulative mean GDT-HA scores, which improves the chances of selecting an improved model.
4. The local quality assessment guided MD-based protocol also outperforms the original MD-based protocol of ReFOLD according to the MolProbity score. This also demonstrates that the local quality assessment guided MD-based protocol performs much better when all atoms are considered, not just the C-alphas.
5. Selecting a model that improves on the initial structure also remains problematic, as the models generated by the different sampling approaches, including the MD-based protocols, are very similar to each other. ModFOLD6 was also used to select the best models [28, 48]. Despite the fact that ModFOLD6 has been among the top performing MQAPs for the prediction pipeline in recent CAMEO and CASP experiments [28, 39, 48, 88], it has not shown optimal performance for the selection of refinement models. In the future, specialized versions of the ModFOLD server could be optimized for use in refinement pipelines.
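The model selection check described in Section 3.1.1 and Note 5 can be expressed as a short sketch. For one target, it picks the model with the highest predicted global quality and then asks, using the true GDT-HA scores, whether that pick improves on the starting model and whether it beats the population mean (the expected outcome of a random pick). The data and the function name `assess_selection` are hypothetical, and the predicted scores merely stand in for an MQAP output such as that of ModFOLD6.

```python
from statistics import mean


def assess_selection(start, models):
    """models: list of (predicted_global_score, true_gdt_ha) pairs for one target."""
    _, selected = max(models, key=lambda model: model[0])  # top-ranked by predicted score
    true_scores = [true for _, true in models]
    population_mean = mean(true_scores)
    return {
        "selected": selected,
        "population_mean": population_mean,
        "best_available": max(true_scores),
        "improves_on_start": selected > start,
        "beats_random_pick": selected > population_mean,
    }


# Hypothetical refined models for a target whose starting model scores 0.468.
models = [(0.71, 0.46), (0.74, 0.47), (0.69, 0.49), (0.73, 0.45)]
result = assess_selection(0.468, models)
print(result)
```

Here the top-ranked model is better than both the start and a random pick, yet it is not the best model available, which mirrors the situation described above: when all models are very similar, small errors in the predicted scores are enough to miss the best one.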

5 Addendum

We have recently used the knowledge gained from the work presented in this chapter to develop two new versions of the ReFOLD and ModFOLD servers [89, 90]. These new servers were independently validated in the CASP14 experiment, and the latest version of ModFOLD is also continuously evaluated by the CAMEO project. For further information, please refer to our recent papers describing these new servers [89, 90].

References

1. McGuffin LJ (2008) Protein fold recognition and threading. In: Computational structural biology: methods and applications. World Scientific, pp 37–60
2. McGuffin LJ (2008) Aligning sequences to structures. In: Protein structure prediction. Humana Press, Totowa, pp 61–90
3. Kendrew JC, Bodo G, Dintzis HM et al (1958) A three-dimensional model of the myoglobin molecule obtained by X-ray analysis. Nature 181:662–666. https://doi.org/10.1038/181662a0
4. Perutz MF, Rossmann MG, Cullis AF et al (1960) Structure of Hæmoglobin: a three-dimensional Fourier synthesis at 5.5-Å. Resolution, obtained by X-Ray analysis. Nature 185:416–422. https://doi.org/10.1038/185416a0
5. Drenth J (1999) Principles of protein X-ray crystallography. Springer
6. Heinemann U, Frevert J, Hofman KP et al (2002) Linking structural biology with genome research. In: Genomics and proteomics. Springer, Boston, pp 179–189. https://doi.org/10.1007/0-306-46823-9_15
7. Murata K, Wolf M (2018) Cryo-electron microscopy for structural analysis of dynamic biological macromolecules. Biochim Biophys Acta – Gen Subj 1862:324–334. https://doi.org/10.1016/J.BBAGEN.2017.07.020
8. Jonic S, Vénien-Bryan C (2009) Protein structure determination by electron cryomicroscopy. Curr Opin Pharmacol 9:636–642. https://doi.org/10.1016/J.COPH.2009.04.006
9. Brocchieri L, Karlin S (2005) Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res 33:3390–3400. https://doi.org/10.1093/nar/gki615
10. Rangwala H, Karypis G (2010) Introduction to protein structure prediction: methods and algorithms. Wiley
11. Roche D, Buenavista M, McGuffin L (2013) Predicting protein structures and structural annotation of proteomes. In: Roberts GCK (ed) Encyclopedia of biophysics. Springer, pp 2061–2068. https://doi.org/10.1007/978-3-642-16712-6_418
12. Moult J, Fidelis K, Zemla A, Hubbard T (2003) Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins Struct Funct Genet 53:334–339. https://doi.org/10.1002/prot.10556
13. Zhang Y, Skolnick J (2005) The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci U S A 102:1029–1034. https://doi.org/10.1073/pnas.0407152101
14. Jumper J, Evans R, Pritzel A et al (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596:1–11. https://doi.org/10.1038/s41586-021-03819-2
15. Lee J, Wu S, Zhang Y (2009) Ab initio protein structure prediction. In: Rigden DJ (ed) From protein structure to function with bioinformatics. Springer, Dordrecht, pp 3–25
16. Pavlopoulou A, Michalopoulos I (2011) State-of-the-art bioinformatics protein structure prediction tools (review). Int J Mol Med 28:295–310. https://doi.org/10.3892/ijmm.2011.705
17. Senior AW, Evans R, Jumper J et al (2020) Improved protein structure prediction using potentials from deep learning. Nature 577:706–710. https://doi.org/10.1038/s41586-019-1923-7
18. Roche BMT, Tetchner SJ, McGuffin LJ (2011) The IntFOLD server: an integrated web resource for protein fold recognition, 3D model quality assessment, intrinsic disorder prediction, domain prediction and ligand binding site prediction. Nucleic Acids Res 39:171–176. https://doi.org/10.1093/nar/gkr184
19. McGuffin RDB (2011) Automated tertiary structure prediction with accurate local model quality assessment using the intfold-ts method. Proteins 79:137–146. https://doi.org/10.1002/prot.23120
20. McGuffin LJ (2010) Model quality prediction. In: Introduction to protein structure prediction. John Wiley & Sons, Inc., Hoboken, pp 323–342
21. Bhattacharya D, Cheng J (2013) 3Drefine: consistent protein structure refinement by optimizing hydrogen bonding network and atomic-level energy minimization. Proteins 81:119–131. https://doi.org/10.1002/prot.24167
22. McGuffin LJ, Buenavista MT, Roche DB (2013) The ModFOLD4 server for the quality assessment of 3D protein models. Nucleic Acids Res 41:W368–W372. https://doi.org/10.1093/nar/gkt294
23. McGuffin LJ, Roche DB (2010) Rapid model quality assessment for protein structure predictions using the comparison of multiple models without structural alignments. Bioinformatics 26:182–188. https://doi.org/10.1093/bioinformatics/btp629
24. McGuffin LJ (2008) The ModFOLD server for the quality assessment of protein structural models. Bioinformatics 24:586–587. https://doi.org/10.1093/bioinformatics/btn014
25. Roche DB, Tetchner SJ, McGuffin LJ (2010) The binding site distance test score: a robust method for the assessment of predicted protein binding sites. Bioinformatics 26:2920–2921. https://doi.org/10.1093/bioinformatics/btq543
26. Bhattacharya D, Nowotny J, Cao R, Cheng J (2016) 3Drefine: an interactive web server for efficient protein structure refinement. Nucleic Acids Res 44:W406–W409. https://doi.org/10.1093/nar/gkw336
27. McGuffin LJ (2009) Prediction of global and local model quality in CASP8 using the ModFOLD server. Proteins 77:185–190. https://doi.org/10.1002/prot.22491
28. Maghrabi AHA, McGuffin LJ (2017) ModFOLD6: an accurate web server for the global and local quality estimation of 3D protein models. Nucleic Acids Res 45:W416–W421. https://doi.org/10.1093/nar/gkx332
29.
Jones DT, Singh T, Kosciolek T, Tetchner S (2015) MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31:999–1006. https://doi. org/10.1093/bioinformatics/btu791 30. Buchan DWA, Minneci F, Nugent TCO et al (2013) Scalable web services for the PSIPRED protein analysis workbench. Nucleic Acids Res

138

Recep Adiyaman and Liam J. McGuffin

41:W349–W357. https://doi.org/10.1093/ nar/gkt381 31. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637. https://doi.org/ 10.1002/bip.360221211 32. Uziela K, Wallner B (2016) ProQ2: estimation of model accuracy implemented in Rosetta. Bioinformatics 32:1411–1413. https://doi. org/10.1093/bioinformatics/btv767 33. Jones DT, Cozzetto D (2015) DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 31:857–863. https://doi.org/10. 1093/bioinformatics/btu744 34. Adiyaman R, McGuffin LJ (2019) Methods for the refinement of protein structure 3D models. Int J Mol Sci 20:2301. https://doi.org/10. 3390/ijms20092301 35. Bonneau R, Tsai J, Ruczinski I, Baker D (2001) Functional inferences from blind ab initio protein structure predictions. J Struct Biol 134: 186–190. https://doi.org/10.1006/JSBI. 2000.4370 36. Heo L, Feig M (2018) What makes it difficult to refine protein models further via molecular dynamics simulations? Proteins 86:177–188. https://doi.org/10.1002/prot.25393 37. Moult J, Fidelis K, Kryshtafovych A et al (2016) Critical assessment of methods of protein structure prediction: progress and new directions in round XI. Proteins 84:4–14. https://doi.org/10.1002/prot.25064 38. MacCallum JL, Hua L, Schnieders MJ et al (2009) Assessment of the protein-structure refinement category in CASP8. Proteins 77: 66–80. https://doi.org/10.1002/prot.22538 39. Hovan L, Oleinikovas V, Yalinca H et al (2018) Assessment of the model refinement category in CASP12. Proteins 86:152–167. https://doi. org/10.1002/prot.25409 40. Bhattacharya D, Cheng J (2013) i3Drefine software for protein 3D structure refinement and its assessment in CASP10. PLoS One 8: e69648. https://doi.org/10.1371/journal. pone.0069648 41. 
Khoury GA, Smadbeck J, Kieslich CA et al (2017) Princeton_TIGRESS 2.0: high refinement consistency and net gains through support vector machines and molecular dynamics in double-blind predictions during the CASP11 experiment. Proteins 85:1078–1098. https://doi.org/10.1002/prot.25274 42. MacCallum JL, Pe´rez A, Schnieders MJ et al (2011) Assessment of protein structure

refinement in CASP9. Proteins 79:74–90. https://doi.org/10.1002/prot.23131 43. Meiler J, Baker D (2003) Rapid protein fold determination using unassigned NMR data. Proc Natl Acad Sci U S A 100:15404–15409. https://doi.org/10.1073/pnas.2434121100 44. Sliwoski G, Kothiwale S, Meiler J, Lowe EW (2014) Computational methods in drug discovery. Pharmacol Rev 66:334–395. https:// doi.org/10.1124/pr.112.007336 45. Feig M (2017) Computational protein structure refinement: almost there, yet still so far to go. Wiley Interdiscip Rev Comput Mol Sci 7: e1307. https://doi.org/10.1002/wcms.1307 46. Nugent T, Cozzetto D, Jones DT (2014) Evaluation of predictions in the CASP10 model refinement category. Proteins 82:98–111. https://doi.org/10.1002/prot.24377 47. Modi V, Dunbrack RL (2016) Assessment of refinement of template-based models in CASP11. Proteins 260–281:260. https://doi. org/10.1002/prot.25048 48. Shuid AN, Kempster R, McGuffin LJ (2017) ReFOLD: a server for the refinement of 3D protein models guided by accurate quality estimates. Nucleic Acids Res 45:422–428. https:// doi.org/10.1093/nar/gkx249 49. Lu H, Skolnick J (2003) Application of statistical potentials to protein structure refinement from low resolutionab initio models. Biopolymers 70:575–584. https://doi.org/10.1002/ bip.10537 50. Misura KMSS, Baker D (2005) Progress and challenges in high-resolution refinement of protein structure models. Proteins Struct Funct Genet 59:15–29. https://doi.org/10. 1002/prot.20376 51. Arnautova YA, Jagielska A, Scheraga HA (2006) A new force field (ECEPP-05) for peptides, proteins, and organic molecules. J Phys Chem B 110:5025–5044. https://doi.org/10. 1021/jp054994x 52. Jagielska A, Wroblewska L, Skolnick J (2008) Protein model refinement using an optimized physics-based all-atom force field. Proc Natl Acad Sci U S A 105:8268–8273. https://doi. org/10.1073/pnas.0800054105 53. Zhang Y (2009) Protein structure prediction: when is it useful? Curr Opin Struct Biol 19: 145–155 54. 
Han R, Leo-Macias A, Zerbino D et al (2008) An efficient conformational sampling method for homology modeling. Proteins 71:175–188. https://doi.org/10.1002/prot.21672 55. Kim DE, Blum B, Bradley P, Baker D (2009) Sampling bottlenecks in De novo protein structure prediction. J Mol Biol 393:249–

Protein Model Refinement Guided by Local Quality 260. https://doi.org/10.1016/J.JMB.2009. 07.063 56. Leaver-Fay A, Tyka M, Lewis SM et al (2011) Rosetta3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487:545–574. https://doi. org/10.1016/B978-0-12-381270-4. 00019-6 57. Song Y, DiMaio F, Wang RY-R et al (2013) High-resolution comparative modeling with RosettaCM. Structure 21:1735–1742. https://doi.org/10.1016/j.str.2013.08.005 58. Ovchinnikov S, Park H, Kim DE et al (2018) Protein structure prediction using Rosetta in CASP12. Proteins 86:113–121. https://doi. org/10.1002/prot.25390 59. Lin MS, Head-Gordon T (2011) Reliable protein structure refinement using a physical energy function. J Comput Chem 32:709– 717. https://doi.org/10.1002/jcc.21664 60. Fan H, Mark AE (2004) Refinement of homology-based protein structures by molecular dynamics simulation techniques. Protein Sci 13:211–220. https://doi.org/10.1110/ps. 03381404 61. Chen B (2007) Can molecular dynamics simulations provide high-resolution refinement of protein structure? Proteins 67:922–930. https://doi.org/10.1002/prot.21345 62. Summa CM, Levitt M (2007) Near-native structure refinement using in vacuo energy minimization. Proc Natl Acad Sci U S A 104: 3177–3182. https://doi.org/10.1073/pnas. 0611593104 63. Ishitani R, Terada T, Shimizu K (2008) Refinement of comparative models of protein structure by using multicanonical molecular dynamics simulations. Mol Simul 34:327– 3 3 6 . h t t p s : // d o i . o r g / 1 0 . 1 0 8 0 / 08927020801930539 64. Kannan S, Zacharias M (2010) Application of biasing-potential replica-exchange simulations for loop modeling and refinement of proteins in explicit solvent. Proteins 78:2809–2819. https://doi.org/10.1002/prot.22796 65. Gront D, Kmiecik S, Blaszczyk M et al (2012) Optimization of protein models. Wiley Interdiscip Rev Comput Mol Sci 2:479–493. https://doi.org/10.1002/wcms.1090 66. 
Lee MR, Tsai J, Baker D, Kollman PA (2001) Molecular dynamics in the endgame of protein structure prediction. J Mol Biol 313:417–430. https://doi.org/10.1006/JMBI.2001.5032 67. Jones DT, Buchan DWA, Cozzetto D, Pontil M (2012) PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments.

139

Bioinformatics 28:184–190. https://doi.org/ 10.1093/bioinformatics/btr638 68. Heo L, Feig M (2018) Experimental accuracy in protein structure refinement via molecular dynamics simulations. Proc Natl Acad Sci U S A 115:13276–13281. https://doi.org/10. 1073/pnas.1811364115 69. Best RB, Buchete N-V, Hummer G (2008) Are current molecular dynamics force fields too helical? Biophys J 95:L07–L09. https://doi. org/10.1529/biophysj.108.132696 70. Shaw DE, Maragakis P, Lindorff-Larsen K et al (2010) Atomic-level characterization of the structural dynamics of proteins. Science 330: 341–346. https://doi.org/10.1126/science. 1187409 71. Mirjalili V, Feig M (2013) Protein structure refinement through structure selection and averaging from molecular dynamics ensembles. J Chem Theory Comput 9:1294–1303. https://doi.org/10.1021/ct300962x 72. Mirjalili V, Noyes K, Feig M (2014) Physicsbased protein structure refinement through multiple molecular dynamics trajectories and structure averaging. Proteins 82:196–207. https://doi.org/10.1002/prot.24336 73. MacKerell AD, Banavali N, Foloppe N (2001) Development and current status of the CHARMM force field for nucleic acids. Biopolymers 56:257–265 74. Best RB, Zhu X, Shim J et al (2012) Optimization of the additive CHARMM all-atom protein force field targeting improved sampling of the backbone ϕ, ψ and side-chain χ 1 and χ 2 dihedral angles. J Chem Theory Comput 8: 3257–3273. https://doi.org/10.1021/ ct300400x 75. Maier JA, Martinez C, Kasavajhala K et al (2015) ff14SB: improving the accuracy of protein side chain and backbone parameters from ff99SB. J Chem Theory Comput 11:3696– 3713. https://doi.org/10.1021/acs.jctc. 5b00255 76. Cao W, Terada T, Nakamura S, Shimizu K (2003) Refinement of comparative-modeling structures by multicanonical molecular dynamics. Genome Inform 14:484–485. https://doi. org/10.11234/gi1990.14.484 77. Park H, Seok C (2012) Refinement of unreliable local regions in template-based protein models. 
Proteins 80:1974–1986. https://doi. org/10.1002/prot.24086 78. Park IH, Gangupomu V, Wagner J et al (2012) Structure refinement of protein low resolution models using the GNEIMO constrained dynamics method. J Phys Chem B 116:2365– 2375. https://doi.org/10.1021/jp209657n

140

Recep Adiyaman and Liam J. McGuffin

79. Lee GR, Heo L, Seok C (2016) Effective protein model structure refinement by loop modeling and overall relaxation. Proteins 84:293– 301. https://doi.org/10.1002/prot.24858 80. Feig M, Mirjalili V (2016) Protein structure refinement via molecular-dynamics simulations: what works and what does not? Proteins 84(Suppl 1):282–292. https://doi.org/10. 1002/prot.24871 81. Zhang J, Liang Y, Zhang Y (2011) Atomiclevel protein structure refinement using fragment-guided molecular dynamics conformation sampling. Structure 19:1784–1795. https://doi.org/10.1016/J.STR.2011. 09.022 82. Phillips JC, Braun R, Wang W et al (2005) Scalable molecular dynamics with NAMD. J Comput Chem 26:1781–1802. https://doi. org/10.1002/jcc.20289 83. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57:702–710. https://doi.org/10.1002/prot.20264 84. Davis IW, Murray LW, Richardson JS, Richardson DC (2004) MOLPROBITY: structure validation and all-atom contact analysis for nucleic acids and their complexes. Nucleic Acids Res 32:W615–W619. https://doi.org/10.1093/ nar/gkh398 85. Jorgensen WL, Chandrasekhar J, Madura JD et al (1983) Comparison of simple potential functions for simulating liquid water. J Chem

Phys 79:926. https://doi.org/10.1063/1. 445869 86. Go¨tz AW, Williamson MJ, Xu D et al (2012) Routine microsecond molecular dynamics simulations with AMBER on GPUs. 1. Generalized born. J Chem Theory Comput 8:1542–1555. https://doi.org/10.1021/ ct200909j 87. Loncharich RJ, Brooks BR, Pastor RW (1992) Langevin dynamics of peptides: the frictional dependence of isomerization rates of N-acetylalanyl-N′-methylamide. Biopolymers 32:523– 5 3 5 . h t t p s : // d o i . o r g / 1 0 . 1 0 0 2 / b i p . 360320508 88. Haas J, Barbato A, Behringer D et al (2018) Continuous Automated Model EvaluatiOn (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins 86:387–398. https://doi.org/10.1002/ prot.25431 89. Adiyaman R, McGuffin LJ (2021) ReFOLD3: refinement of 3D protein models with gradual restraints based on predicted local quality and residue contacts. Nucleic Acids Res 49:W589– W596. https://doi.org/10.1093/NAR/ GKAB300 90. McGuffin LJ, Aldowsari FMF, Alharbi SMA, Adiyaman R (2021) ModFOLD8: accurate global and local quality estimates for 3D protein models. Nucleic Acids Res 49:W425– W430. https://doi.org/10.1093/NAR/ GKAB321

Chapter 8

Specificities of Modeling of Membrane Proteins Using Multi-Template Homology Modeling

Julia Koehler Leman and Richard Bonneau

Abstract

Structures of membrane proteins are challenging to determine experimentally and currently represent only about 2% of the structures in the Protein Data Bank. Because of this disparity, methods for modeling membrane proteins are fewer and of lower quality than those for modeling soluble proteins. However, better expression, crystallization, and cryo-EM techniques have prompted a recent increase in experimental structures of membrane proteins, which can act as templates to predict the structure of closely related proteins through homology modeling. Because homology modeling relies on a structural template, it is easier and more accurate than fold recognition methods or de novo modeling, which are used when the sequence similarity between the query sequence and the sequence of related proteins in structural databases is below 25%. In homology modeling, a query sequence is mapped onto the coordinates of a single template and refined. With the increase in available templates, several templates often cover overlapping segments of the query sequence. Multi-template modeling can be used to identify the best template for local segments and join them into a single model. Here we provide a protocol for modeling membrane proteins from multiple templates in the Rosetta software suite. This approach takes advantage of several integrated frameworks, namely, RosettaScripts, RosettaCM, and RosettaMP with the membrane scoring function.

Key words Homology modeling, Multiple templates, Membrane proteins, Rosetta, Comparative modeling, Protein structure prediction

1 Introduction

Computational modeling has aided experimental structure determination of proteins over the past several decades. Modeling methods have improved substantially during this time, to a point where computation can replace experiments in well-defined cases. The number of computational methods has increased drastically and can now address a large variety of scientific questions. These tools rely mostly on different ways of sampling the conformational space depending on the scientific question to be answered, under the influence of specific scoring functions that can be physics based, statistically derived, or a combination of both.

Sławomir Filipek (ed.), Homology Modeling: Methods and Protocols, Methods in Molecular Biology, vol. 2627, https://doi.org/10.1007/978-1-0716-2974-1_8, © Springer Science+Business Media, LLC, part of Springer Nature 2023


Protein structure prediction can be carried out by one of three methods: homology modeling, fold recognition, and ab initio structure prediction. The choice depends on the sequence similarity between the protein in question (query) and the most closely related protein available in structural databases (template). For high sequence similarities between query and template (25–100%), homology modeling yields the most accurate models. In homology modeling, the query sequence is mapped onto the coordinates of the template protein based on a pairwise sequence alignment, followed by loop modeling to fill in gaps, and high-resolution refinement. For sequence similarities between 10% and 25%, pairwise sequence alignments become too error-prone to yield high-quality models, and fold recognition (also called threading) is typically used. Here, the query sequence is mapped onto a large number of different protein folds, and a carefully tuned scoring function determines which fold is the best match for the given sequence. If the sequence similarity between query and template is below 10%, the scoring functions used for fold recognition become too inaccurate to identify reliable matches between the query sequence and available folds. These cases are handled with ab initio protein structure prediction.

The increase in the number of proteins in structural databases has made it possible to model an increasing number of proteins through homology modeling, therefore improving our understanding of their role in health and disease. Homology modeling relies on the assumption that similar sequences adopt similar structures, which generally holds true but can be complicated by large-scale conformational changes and divergent (same sequence, different structures) and convergent (different sequence, same structure) evolution in the fold space.
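The similarity ranges above amount to a simple decision rule, which can be sketched in Python. This is illustrative only: the function name is hypothetical, and the 25% and 10% cutoffs are the approximate values quoted in the text rather than sharp boundaries.

```python
def choose_modeling_method(percent_identity: float) -> str:
    """Suggest a modeling approach from query-template sequence identity (%).

    The 25% and 10% cutoffs are the approximate values from the text;
    in practice, coverage and template quality also matter.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must be between 0 and 100")
    if percent_identity >= 25.0:
        return "homology modeling"
    if percent_identity >= 10.0:
        return "fold recognition"
    return "ab initio"


# Example value: the identity of a dopamine transporter template to CT1.
print(choose_modeling_method(45.36))
```

Templates that fall just below a cutoff can still be useful when other evidence, such as structural alignments, supports the sequence alignment, so the rule should be read as guidance rather than a filter.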
Because homology modeling models the query sequence close to the template structure, it samples a relatively narrow conformational space. The most popular methods for homology modeling, like MODELLER [1] and SWISS-MODEL [2], are available via fully automated pipelines through a web interface, in addition to downloadable applications. For fold recognition, I-TASSER [3], which is available through a web interface, is widely used. While I-TASSER is available for GPCRs [4], to our knowledge MODELLER and SWISS-MODEL lack specific applications for modeling membrane proteins. Since homology modeling maintains the coordinates of the query structure close to those of the template structure, membrane protein-specific scoring functions may not have a large effect on the accuracy of the final model when using a single template. This means that the fold of the protein will be similar to that of the template, while structural details might be less accurate than if a membrane protein scoring function is used. If an automated tool for homology modeling of membrane proteins is needed, MEDELLER is the method of choice [5]. The challenges and variety of tools for modeling membrane proteins are outside the scope of this chapter and have been extensively reviewed elsewhere [6].

An advance in homology modeling pipelines is joining multiple templates into a single model. This concept is clear when the query protein consists of multiple domains, each of which has a similar structure that was previously determined separately. Each domain can then be modeled using homology modeling, and the gap in between can be closed via loop modeling. With the growth of the Protein Data Bank, it is now more likely that multiple templates will overlap in the multiple sequence alignment (MSA). With multiple templates available, it is necessary to determine the best template if sequence similarity, coverage of the query sequence, and structural quality differ. The solution goes beyond a single template and consists of combining the "best pieces" of all the templates into a single model. In multi-template homology modeling, switching between templates to increase local sequence similarity within protein segments can improve model accuracy [7]. Unlike in single-template modeling, the effect of the scoring function may be significant. Therefore, when modeling membrane proteins via multi-template modeling, having a membrane protein-specific scoring function is critical, particularly to ensure that the geometries at the joints between the templates are scored more realistically in a membrane environment.

Here we describe in detail the specific steps for multi-template homology modeling of membrane proteins (Fig. 1); we use the creatine transporter CT1 as an example [8, 9] and carry out the main computations in the Rosetta software suite [10–12]. CT1 has 12 transmembrane (TM) spans that mediate the uptake of creatine into the cell by going through a transport cycle that consists of several conformational states; the correct function of this cycle is relevant to brain function [9]. Recent advances for membrane protein modeling in Rosetta are the newly created framework RosettaMP, which facilitates combining the membrane scoring function with various modeling protocols [13], and improvements to the membrane scoring function itself [14]. The protocol starts with identifying suitable templates and aligning them in an MSA that is subsequently manually optimized. The MSA is then used to thread the query sequence onto each of the template structures, and the resulting models are hybridized into a single model with RosettaCM [7]. This is accomplished for all three conformational states separately. The homology models are later refined under the influence of the full-atom scoring function. Finally, ligand docking is carried out with the natural ligand creatine on each of the conformational states.

Fig. 1 Overview of the steps of the modeling protocol on the CT1 creatine transporter

2 Materials

1. Installation of PDB-BLAST [15]: https://blast.ncbi.nlm.nih.gov/Blast.cgi
2. Installation of Rosetta [10] (see Notes 1 and 2): https://www.rosettacommons.org/software
3. Installation of the BCL (BioChemical Library) [16] (see Note 3): http://meilerlab.org/index.php/bclcommons/show/b_apps_id/1
4. Installation of PyMOL [17] or other protein structure visualization program: https://pymol.org/2/
5. Installation of MUSTANG [18] structural alignment: http://lcb.infotech.monash.edu.au/mustang/
6. Installation of Jalview [19] or other sequence alignment visualization and editing tool: http://www.jalview.org/getdown/release/
7. Ideally access to a high-performance computing cluster.
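Before starting, it can help to confirm that the command-line tools are reachable from the shell. The following Python sketch checks the search path. The executable names are assumptions (they depend on how each package was installed), and Rosetta binaries are not covered because they are usually called by full path.

```python
import shutil


def find_tools(executables):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in executables}


if __name__ == "__main__":
    # Assumed executable names; adjust them to the local installation.
    for name, path in find_tools(["psiblast", "mustang", "pymol", "jalview"]).items():
        print(f"{name:10s} {path or 'NOT FOUND'}")
```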

3 Methods

We demonstrate the protocol to build a multi-template homology model using the creatine transporter CT1 as an example. When working through the protocol, or for computational work in general, we recommend following specific rules for directory structure and file naming to facilitate bookkeeping (see Note 4).

3.1 Template Search and Identification

The first step is identifying suitable templates for modeling. The outcome of this step determines whether templates are available for homology modeling and how many, the sequence similarities between the templates and the query, and whether other methods like fold recognition or ab initio modeling are required. In our example, the amino acid sequence for the slc6a8 gene encoding the creatine transporter CT1 is obtained from UniProt [20] (https://www.uniprot.org/) with the UniProt ID P48029. The sequence is provided in FASTA format and saved as slc6a8.fasta. Next, PDB-BLAST [15] is used to identify similar proteins for which structures are available in the Protein Data Bank [21]. The number of threads depends on the local run environment and can be adjusted.

/path/to/ncbi/blast/bin/psiblast \
    -num_threads 22 \
    -outfmt 7 \
    -num_iterations 2 \
    -evalue 1 \
    -db db/pdbaa/pdbaa \
    -comp_based_stats 1 \
    -inclusion_ethresh 0.001 \
    -pseudocount 2 \
    -export_search_strategy slc6a8.ss \
    -query slc6a8.fasta \
    -out_ascii_pssm slc6a8.pssm \
    -num_alignments 300 \
    -out_pssm slc6a8.cp \
    -out slc6a8.pb
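The tabular report written by the psiblast command (slc6a8.pb) can be summarized with a short script. The Python sketch below is a minimal example under stated assumptions: it assumes the default column order of the tabular format, in which the second field is the subject identifier and the third is the percent identity. The function name and the two example hits are made up for illustration.

```python
def parse_blast_tabular(text):
    """Return the best percent identity per subject from BLAST '-outfmt 7' text.

    Assumes the default tabular columns (second field: subject ID, third
    field: percent identity). Comment lines ('#'), blank lines, and lines
    that are not tab-separated records are skipped. With several PSI-BLAST
    iterations a subject can appear more than once; the highest identity
    is kept.
    """
    best = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # e.g., a status message rather than a hit record
        subject, identity = fields[1], float(fields[2])
        if identity > best.get(subject, -1.0):
            best[subject] = identity
    # Sort hits from highest to lowest identity.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Made-up two-hit example in the default column layout (not real output).
example = (
    "# PSIBLAST\n"
    "query\thitA\t45.36\t540\t280\t5\t60\t590\t30\t560\t1e-150\t450\n"
    "query\thitB\t25.16\t510\t360\t9\t55\t580\t10\t500\t2e-40\t160\n"
)
for subject, identity in parse_blast_tabular(example):
    print(subject, identity)
```

For a real run, the contents of slc6a8.pb would be read from disk and passed to the function in place of the example string.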

In the psiblast command above, the backslash at the end of a line tells the shell that the command continues on the next line, so that all options belong to a single command. Whitespace after a backslash can lead to error messages and should be removed. The backslashes can be removed entirely when the command is written on a single line; however, we choose to keep each option on a separate line to make debugging easier.

The output files contain information about the templates, such as PDBIDs, sequence identities to the query, sequence coverages, and E-values. Homology modeling can be carried out with templates that have as low as 25–30% sequence identity to the query sequence. If the sequence identity is between 10% and 25%, the quality of the sequence alignment and therefore of the resulting model would be low. In these cases, fold recognition or threading is typically used for model building. For even lower sequence identities, the structure will have to be modeled without any structural knowledge (ab initio modeling). For CT1, 34 templates are found with sequence similarities around 45% and around 25% (Table 1).

Table 1 Templates for CT1

PDBID  Chain  Conformation    Organism    Protein                Seq identity (%)
5I6Z   A      Outward-facing  Human       Serotonin transporter  44.75
5I6X   A      Outward-facing  Human       Serotonin transporter  44.57
4XP4   A      Outward-facing  Drosophila  Dopamine transporter   45.36
4XP9   C      Outward-facing  Drosophila  Dopamine transporter   45.36
4XNX   A      Outward-facing  Drosophila  Dopamine transporter   45.36
4XP1   A      Outward-facing  Drosophila  Dopamine transporter   45.36
4XP5   A      Outward-facing  Drosophila  Dopamine transporter   45.36
4XPH   A      Outward-facing  Drosophila  Dopamine transporter   45.36
4XPB   A      Outward-facing  Drosophila  Dopamine transporter   45.17
4XPT   A      Outward-facing  Drosophila  Dopamine transporter   45.17
4M48   A      Outward-facing  Drosophila  Dopamine transporter   44.99
4XPG   A      Outward-facing  Drosophila  Dopamine transporter   44.99
4XNU   A      Outward-facing  Drosophila  Dopamine transporter   44.99
3TT1   A      Outward-facing  Aquifex     Leucine transporter    25.16
3QS4   A      Outward-facing  Aquifex     Leucine transporter    25.37
3F3A   A      Outward-facing  Aquifex     Leucine transporter    25.16
4HOD   A      Occluded        Aquifex     Leucine transporter    25.58
2A65   A      Occluded        Aquifex     Leucine transporter    25.37
3GJD   A      Occluded        Aquifex     Leucine transporter    25.37
2QJU   A      Occluded        Aquifex     Leucine transporter    25.37
3MPN   A      Occluded        Aquifex     Leucine transporter    25.37
4FXZ   A      Occluded        Aquifex     Leucine transporter    25.16
3TU0   A      Occluded        Aquifex     Leucine transporter    25.16
3F3D   A      Occluded        Aquifex     Leucine transporter    25.16
3TT3   A      Inward-facing   Aquifex     Leucine transporter    24.95

The templates belong to various transporters: the human serotonin transporter, leucine transporter, and dopamine transporter in three different conformations: outward-facing, occluded, and inward-facing. Some of the templates have different ligands bound, for instance, leucine, tryptophan, or dopamine. All templates are downloaded from the PDB [21] (https://www.rcsb.org/), simultaneously loaded into PyMOL, and superimposed (see Note 5). Each template is thoroughly inspected for (1) missing coordinates, which are exposed by gaps when tracing the backbone from the N- to the C-terminus; (2) the type of protein, e.g., leucine transporter; (3) the conformation, e.g., outward-facing; (4) the type and binding site of relevant ligands bound, if any, e.g., leucine; and (5) the type and binding sites of metal ions or other ligands, e.g., lipids or cofactors. Thorough inspection of the templates can be time-consuming, but identifying high-quality templates and excluding mediocre ones are crucial for the quality of the final model. Since a multitude of templates is available for CT1, we choose to exclude templates with missing (i.e., unresolved) residues (PDBIDs 4MM4, 4MMF, 4MMB, 3GJC, 3QS5, 3QS6, 5JAG, 3MPQ, 4US3, 3M3G). In cases where only one or very few templates are available, it is advisable to keep templates with missing residues and rebuild the unresolved coordinates via loop modeling. For CT1, the final templates include 16 proteins in the outward-facing conformation (PDBIDs 3F3A, 3QS4, 3TT1, 4M48, 4XNU, 4XNX, 4XP1, 4XP4, 4XP5, 4XP9, 4XPB, 4XPG, 4XPH, 4XPT, 5I6X, 5I6Z), 8 templates in the occluded conformation (PDBIDs 2A65, 2QJU, 3F3D, 3GJD, 3MPN, 3TU0, 4FXZ, 4HOD), and 1 template in the inward-facing conformation (PDBID 3TT3).

3.2 Sequence and Structural Alignments

Since CT1 is a membrane protein, all template structures are downloaded from the OPM database (https://opm.phar.umich.edu/), which embeds membrane proteins into the membrane by transforming them into a unified coordinate frame, assembles the biological complex by deleting or copying and transforming chains, and adds dummy atoms for visualizing the membrane bilayer. Alternatively, template structures can be downloaded from the PDBTM database (http://pdbtm.enzim.hu/), which does the same except that it does not add dummy atoms. By visualizing all templates and their membrane embedding in a single PyMOL session, one master template is chosen based on the quality of the membrane embedding. We choose the leucine transporter LeuT template with PDBID 2A65. The templates are then cleaned of dummy atoms, ligands, and other hetero atoms and renumbered consecutively using the command:

~/Rosetta/tools/protein_tools/scripts/clean_pdb.py 2A65.pdb A

where the last letter is the chain in question. The cleaned templates are visualized in PyMOL, superimposed onto the master template, LeuT, and individually saved as new PDB files (we term them supPDB files). The coordinates of the supPDB files differ slightly from those in the OPM structure files because the structures have been superimposed on the LeuT master template. Since Rosetta is used to model CT1, span files are required that specify which residues lie in the membrane bilayer, allowing the models to be scored correctly. Span files are created from the newly saved PDB files with the mp_span_from_pdb application in RosettaMP [13, 22] with the following command (the suffix of the executable depends on the operating system and compiler):

~/Rosetta/main/source/bin/mp_span_from_pdb.macosclangrelease \
    -database ~/Rosetta/main/database \
    -in:file:s 2A65_A.pdb \
    -ignore_unrecognized_res true
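Because the cleaning and span-file steps are repeated for every template, the command lines can be generated rather than typed. The Python sketch below only builds and prints them. The function name, the default installation path, and the executable suffix are assumptions, and the file name pattern (for example, 2A65.pdb becoming 2A65_A.pdb) is inferred from the file names used in the text.

```python
def preparation_commands(pdb_id, chain, rosetta="~/Rosetta",
                         suffix="macosclangrelease"):
    """Return argument lists for cleaning a template and creating its span file.

    Paths and options mirror the two commands shown in the text; 'rosetta'
    and 'suffix' are placeholders for the local installation.
    """
    clean = [
        f"{rosetta}/tools/protein_tools/scripts/clean_pdb.py",
        f"{pdb_id}.pdb",
        chain,
    ]
    span = [
        f"{rosetta}/main/source/bin/mp_span_from_pdb.{suffix}",
        "-database", f"{rosetta}/main/database",
        "-in:file:s", f"{pdb_id}_{chain}.pdb",
        "-ignore_unrecognized_res", "true",
    ]
    return clean, span


# A few of the templates from Table 1; note that 4XP9 uses chain C.
for pdb_id, chain in [("2A65", "A"), ("3TT3", "A"), ("4XP9", "C")]:
    for command in preparation_commands(pdb_id, chain):
        print(" ".join(command))
```

The printed lines can be pasted into a shell, where the leading ~ is expanded. If the lists are instead passed to a process launcher directly, the home directory would first need to be expanded, for example with os.path.expanduser.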


The quality of the final homology model primarily depends on both the quality of the sequence alignments between query and template sequences and the quality of loop modeling the gaps. Since the quality of the sequence alignment deteriorates with lower sequence similarity, we use information from the structural alignment to improve the sequence alignment. For this we run MUSTANG [18] on the supPDB files; it uses the structural alignment of the templates to create a sequence alignment of their sequences. The command we use is as follows:

mustang -i \
    2A65_A_tr_sup.pdb 2QJU_A_tr_sup.pdb 3F3A_A_tr_sup.pdb \
    3F3D_A_tr_sup.pdb 3GJD_A_tr_sup.pdb 3MPN_A_tr_sup.pdb \
    3QS4_A_tr_sup.pdb 3TT1_A_tr_sup.pdb 3TT3_A_tr_sup.pdb \
    3TU0_A_tr_sup.pdb 4FXZ_A_tr_sup.pdb 4HOD_A_tr_sup.pdb \
    4M48_A_tr_sup.pdb 4XNU_A_tr_sup.pdb 4XNX_A_tr_sup.pdb \
    4XP1_A_tr_sup.pdb 4XP4_A_tr_sup.pdb 4XP5_A_tr_sup.pdb \
    4XP9_C_tr_sup.pdb 4XPB_A_tr_sup.pdb 4XPG_A_tr_sup.pdb \
    4XPH_A_tr_sup.pdb 4XPT_A_tr_sup.pdb 5I6X_A_tr_sup.pdb \
    5I6Z_A_tr_sup.pdb \
    -o mustang -r ON -F fasta
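An alignment written in FASTA format can be sanity-checked before manual editing, because every row of a valid alignment must have the same length. The following Python sketch parses aligned FASTA text and checks this. The function names and the toy alignment are illustrative and are not part of MUSTANG.

```python
def read_aligned_fasta(text):
    """Parse FASTA-formatted alignment text into a {name: aligned_sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def check_alignment(records):
    """Return the alignment length; raise ValueError if rows differ in length."""
    if not records:
        raise ValueError("no sequences found")
    lengths = {len(sequence) for sequence in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"rows have different lengths: {sorted(lengths)}")
    return lengths.pop()


# Toy alignment for illustration; '-' marks a gap.
toy = ">template1\nMK-LV\nAG\n>template2\nMKALV\n-G\n"
alignment = read_aligned_fasta(toy)
print(check_alignment(alignment))
```

The same check is worth repeating after each round of manual adjustment, since hand edits can easily add or drop a gap character in a single row.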

The only sequence missing in the MSA is the one from the query, CT1. We align the CT1 sequence to the MSA using the MAFFT [23] online tool (https://mafft.cbrc.jp/alignment/server/add_sequences.html). Even though the templates cover three structural conformations (outward-facing, occluded, and inward-facing), we create and adjust a single MSA covering all conformations, as sequence-structure relationships in one conformation function as restraints for the others. Next, the MSA is carefully adjusted by simultaneously examining the superimposed structures in PyMOL with sequence view turned on, the span files mapped onto the templates in another PyMOL window (using the check_spanfile_from_pdb.pl script as described in [22]), and the MSA in Jalview. This is accomplished from the N-terminus to the C-terminus, ensuring the best possible structural and sequence alignments of the TM spans, and then of the loops in between. Sometimes cysteine residues forming disulfide bonds or binding sites to ligands, metal ions, or cofactors can aid in that step. While the MSA is modified, the span files for the templates are adjusted accordingly. Adjusting the MSA is likely the most time-consuming step in the homology modeling procedure


and, depending on the number of templates available, can take several days to weeks to do properly. Further, the more that is known about the query protein or template protein class, the more the alignment can be improved. Obtaining a high-quality MSA is crucial for generating a high-quality homology model. Being one residue off in the sequence alignment can lead to artifacts like bulges or gaps in the model that even excellent scoring functions are unable to resolve. Once a satisfactory MSA and corresponding span files are created, flexible regions at the termini are removed to prevent them from influencing scoring during model building. For this, 51 residues are trimmed from the N-terminus and 37 from the C-terminus. The CT1 sequence is as follows (the number at the end of each row is the position of its last residue):

>slc6a8
MAKKSAENGIYSVSGDEKKGPLIAPGPDGAPAKGDGPVGLGTPGGRLAVP   50
PRETWTRQMDFIMSCVGFAVGLGNVWRFPYLCYKNGGGVFLIPYVLIALV   100
GGIPIFFLEISLGQFMKAGSINVWNICPLFKGLGYASMVIVFYCNTYYIM   150
VLAWGFYYLVKSFTTTLPWATCGHTWNTPDCVEIFRHEDCANASLANLTC   200
DQLADRRSPVIEFWENKVLRLSGGLEVPGALNWEVTLCLLACWVLVYFCV   250
WKGVKSTGKIVYFTATFPYVVLVVLLVRGVLLPGALDGIIYYLKPDWSKL   300
GSPQVWIDAGTQIFFSYAIGLGALTALGSYNRFNNNCYKDAIILALINSG   350
TSFFAGFVVFSILGFMAAEQGVHISKVAESGPGLAFIAYPRAVTLMPVAP   400
LWAALFFFMLLLLGLDSQFVGVEGFITGLLDLLPASYYFRFQREISVALC   450
CALCFVIDLSMVTDGGMYVFQLFDYYSASGTTLLWQAFWECVVVAWVYGA   500
DRFMDDIACMIGYRPCPWMKWCWSFFTPLVCMGIFIFNVVYYEPLVYNNT   550
YVYPWWGEAMGWAFALSSMLCVPLHLLGCLLRAKGTMAERWQHLTQPIWG   600
LHHLEYRAQDADVRGLTTLTPVSESSKVVVVESVM
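The trimming arithmetic can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, the first 200 residues are copied from the listing above, the full length of 635 residues is counted from that listing, and residues 121, 130, 139, and 149 are the disulfide partners named later in the protocol (trimmed numbering).

```python
# First 200 residues of the CT1 (slc6a8) sequence, copied from the listing above.
CT1_FIRST_200 = (
    "MAKKSAENGIYSVSGDEKKGPLIAPGPDGAPAKGDGPVGLGTPGGRLAVP"
    "PRETWTRQMDFIMSCVGFAVGLGNVWRFPYLCYKNGGGVFLIPYVLIALV"
    "GGIPIFFLEISLGQFMKAGSINVWNICPLFKGLGYASMVIVFYCNTYYIM"
    "VLAWGFYYLVKSFTTTLPWATCGHTWNTPDCVEIFRHEDCANASLANLTC"
)

N_TRIM = 51  # residues removed from the N-terminus (from the text)
C_TRIM = 37  # residues removed from the C-terminus (from the text)


def trim_termini(sequence: str, n_trim: int = N_TRIM, c_trim: int = C_TRIM) -> str:
    """Return the sequence without the first n_trim and last c_trim residues."""
    if n_trim < 0 or c_trim < 0 or n_trim + c_trim >= len(sequence):
        raise ValueError("trim lengths do not fit the sequence")
    return sequence[n_trim:len(sequence) - c_trim]


def trimmed_to_full(position: int, n_trim: int = N_TRIM) -> int:
    """Convert a 1-based residue number of the trimmed model to full-length numbering."""
    return position + n_trim


# Length check with a placeholder string of the full length (635 residues).
trimmed_length = len(trim_termini("A" * 635))

# Disulfide partners in trimmed numbering, mapped back to the full sequence.
disulfide_residues = [121, 130, 139, 149]
full_numbers = [trimmed_to_full(p) for p in disulfide_residues]
residue_types = [CT1_FIRST_200[n - 1] for n in full_numbers]

print(trimmed_length, full_numbers, residue_types)
```

A check like this is cheap insurance against off-by-one errors when moving between UniProt numbering and the numbering of the trimmed model.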

The residues in blue (the first 51 and the last 37) are trimmed for modeling. We also assumed disulfide bonds for the cysteines in red and orange (red-red and orange-orange) to restrict the conformations of the very long loop between TM3 and TM4 (see Note 6). The span files for the templates are adjusted accordingly. Based on the TM spans in the MSA, a span file is manually created for the query sequence in the following format:

TM region prediction for slc6a8_t.fasta manually from MSA
12 547
antiparallel
n2c
8 29
37 59
78 112
182 201
205 227
252 277
287 307
350 378
393 412
418 445
466 489
506 527
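A hand-edited span file is easy to get wrong, so a small consistency check can help. The sketch below assumes the layout shown in the listing (a comment line, a line with the number of spans and residues, two topology lines, then one start/end pair per span); the class and function names are hypothetical and are not part of Rosetta.

```python
from dataclasses import dataclass
from typing import List, Tuple

SPAN_TEXT = """\
TM region prediction for slc6a8_t.fasta manually from MSA
12 547
antiparallel
n2c
8 29
37 59
78 112
182 201
205 227
252 277
287 307
350 378
393 412
418 445
466 489
506 527
"""


@dataclass
class SpanFile:
    comment: str
    n_spans: int
    n_residues: int
    topology: str
    order: str
    spans: List[Tuple[int, int]]


def parse_span_file(text: str) -> SpanFile:
    """Parse and sanity-check a span file given as a string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 5:
        raise ValueError("expected four header lines and at least one span")
    n_spans, n_residues = (int(x) for x in lines[1].split())
    spans: List[Tuple[int, int]] = []
    for line in lines[4:]:
        start, end = (int(x) for x in line.split())
        spans.append((start, end))
    if len(spans) != n_spans:
        raise ValueError(f"header announces {n_spans} spans, found {len(spans)}")
    previous_end = 0
    for start, end in spans:
        # Spans must be ordered, non-overlapping, and inside the sequence.
        if not (previous_end < start < end <= n_residues):
            raise ValueError(f"span {start}-{end} is out of order or out of range")
        previous_end = end
    return SpanFile(lines[0], n_spans, n_residues, lines[2], lines[3], spans)


span_file = parse_span_file(SPAN_TEXT)
# Length of the loop between TM3 and TM4 (residues strictly between the spans).
loop_tm3_tm4 = span_file.spans[3][0] - span_file.spans[2][1] - 1
print(span_file.n_spans, span_file.spans[0], loop_tm3_tm4)
```

The loop length computed from the spans (69 residues) agrees with the value quoted in Note 6.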

The first line is ignored for modeling. The second line contains the number of TM spans and the number of residues in the protein (after clipping the termini). The third and fourth lines indicate how the TM spans are modeled, antiparallel from the N- to the C-terminus (Rosetta currently does not have other options available). The remaining lines denote the residue numbers for each TM span from beginning to end. For instance, the first TM span ranges from residue 8 to residue 29. The final MSA is shown in Fig. 2.

3.3 Multi-Template Homology Modeling with RosettaCM

Now that we have a single MSA that covers all conformations, i.e., constrains all conformations both structurally and sequence-wise, we can use this as a prerequisite to build models for each conformation separately. This means that homology modeling, refinement, and ligand docking need to be carried out for each conformation (outward-facing, occluded, inward-facing) independently; ligand preparation only has to be done once. The FASTA file of the query CT1 is used to create fragments for modeling. This is accomplished with the Robetta server [24, 25]; homologues are included. The FASTA files of the templates, which contain the sequence of the resolved residues (from PDB ATOM lines, see Note 5), are converted into the Grishin alignment format [7] using the following command:

~/Rosetta/tools/protein_tools/scripts/fasta2grishin.py 2A65_A.fasta

Grishin alignment format is a Rosetta-specific format for a sequence alignment that is described here: https://www.rosettacommons.org/docs/latest/rosetta_basics/file_types/Grishan-format-alignment. The Grishin files are used to thread the query sequence onto each template separately with the following command:

~/Rosetta/main/source/bin/partial_thread.macosclangrelease \
    -database ~/Rosetta/main/database \
    -in:file:fasta slc6a8_t.fasta \
    -in:file:alignment slc6a8_t_5I6Z_A.grishin \
    -in:file:template_pdb 5I6Z_A_tr_sup.pdb \
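For orientation, a pairwise alignment can be written out in Grishin layout with a few lines of Python. This is a sketch under assumptions: the record layout follows our reading of the Rosetta documentation linked above and should be checked against it, the function name is hypothetical, and the two aligned strings are made-up toy sequences rather than CT1 or template data.

```python
def format_grishin(query_name: str, template_name: str,
                   query_aln: str, template_aln: str) -> str:
    """Return one Grishin-style alignment record as text.

    Assumed layout: a '##' line naming query and template, a '#' comment
    line, a scores line, and the two aligned sequences, each preceded by
    its starting offset (0 here), terminated by '--'.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    for q, t in zip(query_aln, template_aln):
        if q == "-" and t == "-":
            raise ValueError("alignment column contains only gaps")
    lines = [
        f"## {query_name} {template_name}",
        "#",
        "scores_from_program: 0",
        f"0 {query_aln}",
        f"0 {template_aln}",
        "--",
    ]
    return "\n".join(lines) + "\n"


# Toy alignment (hypothetical sequences, gaps written as '-').
record = format_grishin("slc6a8_t", "5I6Z_A_tr_sup", "MKT-AYIAKQR", "MKTLAY--KQR")
print(record)
```

In practice the fasta2grishin.py script shown above produces these files; the sketch is mainly useful for inspecting or regenerating a record after the MSA has been edited by hand.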


Fig. 2 Optimized multiple sequence alignment between query sequence and all templates. The query sequence is at the top; gray shaded regions are the TM spans. PDBIDs of the templates are on the left

For each of the three conformations, a RosettaScripts [26] XML file (here: rosetta_cm.xml) needs to be created that contains the protocol for the RosettaCM [7, 27] homology modeling step, the scoring functions, and the templates. The XML file for the outward-facing conformation (with 16 templates) looks like this:































For the occluded and inward-facing conformations, only the template PDB section needs to be edited; the remainder of the script is identical. After creating a directory named decoys to write the models to, RosettaCM [7, 27] is run with the XML scripts to generate 1000 models for each of the conformations separately. The command is as follows:

~/Rosetta/main/source/bin/rosetta_scripts.linuxclangrelease \
    -database ~/Rosetta/main/database \
    -in:file:fasta slc6a8_t.fasta \
    -parser:protocol rosetta_cm.xml \
    -nstruct 1000 \
    -relax:minimize_bond_angles \
    -relax:minimize_bond_lengths \
    -relax:jump_move true \
    -default_max_cycles 200 \
    -relax:min_type lbfgs_armijo_nonmonotone \
    -score:weights stage3_rlx_membrane.wts \
    -use_bicubic_interpolation \
    -hybridize:stage1_probability 1.0 \
    -chemical:exclude_patches LowerDNA UpperDNA Cterm_amidation SpecialRotamer \
        VirtualBB ShoveBB VirtualDNAPhosphate VirtualNTerm CTermConnect sc_orbitals \
        pro_hydroxylated_case1 pro_hydroxylated_case2 ser_phosphorylated \
        thr_phosphorylated tyr_phosphorylated tyr_sulfated lys_dimethylated \
        lys_monomethylated lys_trimethylated lys_acetylated glu_carboxylated \
        cys_acetylated tyr_diiodinated N_acetylated C_methylamidated \
        MethylatedProteinCterm \
    -membrane \
    -in:file:spanfile slc6a8_t.span \
    -membrane:no_interpolate_Mpair \
    -membrane:Menv_penalties \
    -multiple_processes_writing_to_one_directory true \
    -out:path:pdb decoys \

This step is computationally expensive and benefits from running several processes simultaneously. The option -multiple_processes_writing_to_one_directory ensures that the outputs from different processes do not conflict with each other.

3.4 High-Resolution Refinement

In the previous step, 1000 multi-template homology models are created for each of the three conformations of CT1. These models are subjected to high-resolution refinement to resolve clashes, optimize loop conformations, include possible constraints (e.g., disulfide bond constraints), and superimpose all models (see details below). To ensure that all models are scored correctly in the membrane bilayer during high-resolution refinement, their membrane embedding needs to be similar. This is accomplished by superimposing them onto a single structure for which the membrane embedding is optimized. We use the lowest-scoring model (by total Rosetta score) from the homology modeling step, superimpose it in PyMOL onto the LeuT master template, and save it as a new PDB file; this is the reference model onto which all other models are superimposed during the refinement. It is important to mention that PyMOL can superimpose two proteins of different length, while Rosetta cannot, necessitating a reference model of the same length as the newly built homology models but with optimized membrane embedding (see Note 7). High-resolution refinement is accomplished using RosettaScripts [26] in three consecutive steps:

1. The structures are superimposed onto the reference model to ensure proper membrane embedding.
2. High-resolution refinement [22] with a maximal backbone dihedral angle perturbation of 2 degrees (angle_max = 2) is carried out to create 10 models for each input structure. Because CT1 has an extremely long loop between TM helices 3 and 4, we included two disulfide bond constraints (between residues 121/130 and 139/149) in the refinement step to restrict possible loop conformations.
3. Finally, the models are superimposed onto the reference model again.

The RosettaScripts XML file outlining these steps is presented below:


















The RosettaScripts executable (shown below) is run with the XML file described above:

~/Rosetta/main/source/bin/rosetta_scripts.linuxgccrelease \
    -database ~/Rosetta/main/database \
    -in:file:l decoys_rosettaCM.ls \
    -parser:protocol refinement.xml \
    -nstruct 10 \
    -mp:setup:spanfiles slc6a8_t.span \
    -in:fix_disulf slc6a8_t.disulfide \
    -multiple_processes_writing_to_one_directory true \
    -out:path:pdb decoys_refinement \

The decoys_rosettaCM.ls file contains a list of the 1000 output models generated during the homology modeling step. Full paths need to be given unless the application is run in the same directory. The disulfide bond constraints are provided via the slc6a8_t.disulfide file, which simply lists the residue numbers for each disulfide bond on a new line:

121 130
139 149
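Both input files are plain text and can be generated rather than typed. The following sketch writes a list file with absolute paths and a disulfide file in the two-column layout shown above; the function names are hypothetical, and the demonstration uses a temporary directory with empty placeholder files instead of real decoys.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def write_list_file(decoy_dir: Path, list_path: Path, pattern: str = "*.pdb") -> int:
    """Write one absolute path per line for every matching file; return the count."""
    paths = sorted(Path(decoy_dir).resolve().glob(pattern))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)


def write_disulfide_file(pairs: Iterable[Tuple[int, int]], path: Path) -> None:
    """Write one residue pair per line, e.g. '121 130'."""
    lines = []
    for first, second in pairs:
        if first < 1 or second < 1 or first == second:
            raise ValueError(f"invalid disulfide pair: {first} {second}")
        lines.append(f"{first} {second}\n")
    Path(path).write_text("".join(lines))


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    decoys = work / "decoys"
    decoys.mkdir()
    for name in ("S_0002.pdb", "S_0001.pdb", "notes.txt"):
        (decoys / name).write_text("")
    n_listed = write_list_file(decoys, work / "decoys_rosettaCM.ls")
    listed = (work / "decoys_rosettaCM.ls").read_text().splitlines()
    write_disulfide_file([(121, 130), (139, 149)], work / "slc6a8_t.disulfide")
    disulfide_text = (work / "slc6a8_t.disulfide").read_text()

print(n_listed, disulfide_text)
```

Writing absolute paths sidesteps the requirement that the application be started from the directory containing the models.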

The output files are written into the decoys_refinement directory, which must be created before the run is executed. As with homology modeling, this step is computationally expensive, and it is advisable to execute it as multiple processes, providing the option -multiple_processes_writing_to_one_directory.

3.5 Preparation of the Ligand

Our goal is to build the models with the protein’s natural ligand, creatine, which requires ligand docking. Before this can be accomplished, the ligand files must be prepared to allow sampling of different ligand conformations (i.e., internal degrees of freedom within the ligand) during the docking process. Ligand conformers are generated using two different methods: (1) conformers from known PDB structures and (2) conformers generated using the BioChemical Library (BCL) conformer generator [16], which uses rotamers from the Cambridge Structural Database [28]. The advanced search function on the Protein Data Bank website allows text search (or chemical name search) for creatine, which identifies three PDB structures that contain creatine (CRN) as the ligand (PDB IDs 1V7Z, 3A6J, and 3B6R). The PDBs are downloaded and visualized in PyMOL. Two of the structures are homohexamers with the ligand bound in each subunit. All creatine molecules from the three structures are extracted (using the create command in PyMOL) and superimposed with the pair_fit command, and hydrogens are added. Visualization shows that the creatine conformations are somewhat similar to each other. Each ligand conformer is saved as separate PDB and SDF files in PyMOL. One of the SDF conformers is used as a starting point for conformer generation using the BCL. The BCL conformer generator [16] (see Note 3) is used to create an SDF file of all conformers using rotamers from the Cambridge Structural Database [28]. Conformers are generated using the following command:

~/BCL/3.4.0/bcl molecule:ConformerGenerator \
    -ensemble_filenames crn01_pdb.sdf \
    -temperature 1.0 \
    -max_iterations 10000 \
    -scheduler PThread 24 \
    -add_h \
    -conformers_single_file conformers \
    -sample_all_rotamers \
    -rotamer_library csd \
    -top_models 200 \


Visualizing the BCL-generated conformers reveals that they have large-scale changes around rotatable bonds. In total, there are 12 conformers from PDB structures and 22 conformers generated via the BCL, covering a wide range of creatine conformations. All 34 conformers are saved in a single SDF file. A Rosetta params file is generated from the 34 creatine conformers with the following command:

~/Rosetta/main/source/scripts/python/public/molfile_to_params.py \
    -n CRN -p CRN \
    --conformers-in-one-file crn_all_conformers_34.sdf
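Combining the two conformer sets into a single SDF file amounts to concatenating records, each of which ends with a line containing only `$$$$`. The sketch below does this and counts the records; the function names are hypothetical, and the records used here are short placeholders rather than real molfile blocks.

```python
from typing import List


def split_sdf_records(text: str) -> List[str]:
    """Split SDF text into records, each terminated by a '$$$$' line."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    if any(line.strip() for line in current):
        raise ValueError("text found after the last $$$$ delimiter")
    return records


def merge_sdf(texts: List[str]) -> List[str]:
    """Return all records from several SDF texts, in input order."""
    merged: List[str] = []
    for text in texts:
        merged.extend(split_sdf_records(text))
    return merged


def placeholder_records(prefix: str, count: int) -> str:
    """Build stand-in SDF text; real records would hold atom and bond blocks."""
    return "".join(
        f"{prefix}_{i:02d}\n(placeholder body)\n$$$$\n" for i in range(1, count + 1)
    )


merged = merge_sdf([placeholder_records("crn_pdb", 12),
                    placeholder_records("crn_bcl", 22)])
merged_text = "".join(merged)
print(len(merged))
```

Counting the delimiters before running molfile_to_params.py confirms that all 34 conformers made it into the combined file.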

3.6 Ligand Docking Using RosettaLigand

Most template structures have leucine as the ligand, and most ligands bind to a pocket deep inside the transporter. As a starting position for ligand docking, creatine is manually placed into the transporter models, overlaying them with leucine from PDB ID 4HOD. We chose this template because creatine is most similar to leucine and the 4HOD structure has all ions bound (two sodium ions and one chloride ion). Creatine is then docked into the 10 lowest-scoring models by total Rosetta score from the RosettaCM/refinement run, and 1000 models are generated for each of the 10 input models, generating 10,000 models total. This is accomplished in RosettaScripts with the following command:

~/Rosetta/main/source/bin/rosetta_scripts.linuxgccrelease \
    -database ~/Rosetta/main/database \
    -in:file:l top10_from_refinement.ls \
    -in:file:extra_res_fa CRN.params \
    -packing:ex1 \
    -packing:ex2 \
    -packing:no_optH false \
    -packing:flip_HNQ true \
    -packing:ignore_ligand_chi true \
    -parser:protocol ligand-docking.xml \
    -mistakes:restore_pre_talaris_2013_behavior true \
    -out:path:pdb decoys_ligand_docking \
    -out:file:scorefile scores_ligand_docking.sc \
    -nstruct 1000 \
    -multiple_processes_writing_to_one_directory true \
    -ignore_unrecognized_res true \
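Selecting the ten lowest-scoring refined models can be scripted from the score file. The sketch below assumes the usual Rosetta score file layout, in which data lines start with `SCORE:`, the first such line holds the column names, and the last column (`description`) is the model name; the function names, the toy score values, and the assumption that each model is stored as `<description>.pdb` in decoys_refinement are illustrative only.

```python
from typing import Dict, List

# Toy score file content (values are made up for illustration).
SCORE_TEXT = """\
SEQUENCE:
SCORE: total_score rms description
SCORE: -1520.3 1.2 S_0003_0001
SCORE: -1544.9 0.9 S_0001_0004
SCORE: -1498.0 2.4 S_0002_0007
"""


def read_score_file(text: str) -> List[Dict[str, str]]:
    """Return one dict per model, keyed by the column names of the header line."""
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.startswith("SCORE:"):
            continue
        fields = line.split()[1:]
        if not header:
            header = fields
            continue
        # Skip repeated headers (concatenated files) and malformed lines.
        if fields == header or len(fields) != len(header):
            continue
        rows.append(dict(zip(header, fields)))
    return rows


def lowest_scoring(rows: List[Dict[str, str]], column: str = "total_score",
                   n: int = 10) -> List[Dict[str, str]]:
    """Return the n rows with the lowest value in the given column."""
    return sorted(rows, key=lambda row: float(row[column]))[:n]


top = lowest_scoring(read_score_file(SCORE_TEXT), n=2)
list_file_lines = [f"decoys_refinement/{row['description']}.pdb" for row in top]
print("\n".join(list_file_lines))
```

For the protocol described here, `n` would be 10 and the resulting lines would be written to top10_from_refinement.ls.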

The input list file top10_from_refinement.ls contains the list of filenames of the ten lowest-scoring models from the refinement run. The RosettaScripts XML file is as follows:














size of the pocket sampled, moves outside will be rejected, demo has width=15



initial_perturb will perturb ligand starting position and orientation, wasn’t set in the demo









After 10,000 models are generated, the highest-quality models are identified by plotting the ligand RMSDs against the interface scores. RosettaScripts is used to compute the ligand RMSDs with the XML file being:









and the command being:

~/Rosetta/main/source/bin/rosetta_scripts.linuxgccrelease \
    -database ~/Rosetta/main/database \
    -in:file:l decoys_ligand_docking.ls \
    -in:file:native decoys_ligand_docking/S_00272_0007.pdb_0309.pdb \
    -in:file:extra_res_fa CRN.params \
    -parser:protocol interface_analyzer.xml \
    -mistakes:restore_pre_talaris_2013_behavior true \
    -out:file:scorefile score_interface_analyzer.sc \


The decoys_ligand_docking.ls file contains the filenames of the 10,000 output models from the ligand docking. The lowest-scoring model from ligand docking is used as a reference model against which the RMSDs are calculated. The interface score vs. ligand RMSD plot is analyzed by visualizing the ten lowest-scoring models in PyMOL (see Note 8). Ideally, the RMSDs of these lowest-scoring models should be very similar, indicating that the ligands are located in the same binding pocket or even bind in very similar conformations. However, this is not always the case, and even the single (or few) lowest-scoring model(s) can deviate from the others in terms of RMSD. In this case, it might be appropriate to consider that a binding site (or binding conformation) that is found more often computationally is more likely to be the one found in nature, even if a rarely sampled, lower-energy conformation is available. Ultimately, the modeler has to decide which models to consider highest quality, under which circumstances, and how to justify that decision (see Note 9). Experimental data can influence these decisions. Figure 3 shows the final models with ligand-binding sites in all three conformations.
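The consistency check described above can be made quantitative before opening PyMOL. The sketch below takes the n models with the lowest interface score and reports the fraction whose ligand RMSD to the reference lies under a cutoff. The names, the toy values, and the 2.0 Å cutoff are assumptions for illustration; the actual score-file column names depend on the RosettaScripts protocol and are not taken from the text.

```python
from typing import List, NamedTuple, Tuple


class Decoy(NamedTuple):
    name: str
    interface_score: float  # lower is better
    ligand_rmsd: float      # RMSD to the reference model, in angstroms


def summarize_top(decoys: List[Decoy], n: int = 10,
                  rmsd_cutoff: float = 2.0) -> Tuple[List[Decoy], float]:
    """Return the n lowest-scoring decoys and the fraction within the RMSD cutoff."""
    if not decoys:
        raise ValueError("no decoys given")
    top = sorted(decoys, key=lambda d: d.interface_score)[:n]
    near = [d for d in top if d.ligand_rmsd <= rmsd_cutoff]
    return top, len(near) / len(top)


# Toy data: the best-scoring model is the reference, so its RMSD is 0.
decoys = [
    Decoy("model_a", -12.0, 0.0),
    Decoy("model_b", -11.5, 0.8),
    Decoy("model_c", -11.0, 6.3),
    Decoy("model_d", -10.2, 1.1),
    Decoy("model_e", -9.0, 7.0),
]

top, fraction_near = summarize_top(decoys, n=4)
print([d.name for d in top], fraction_near)
```

A high fraction suggests that the lowest-scoring models converge on one binding mode; a low fraction is a prompt to inspect the alternatives and to weigh how often each binding mode was sampled.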

Fig. 3 Final models of CT1 with ligand-binding sites in all three conformations. The outward-facing (OF) conformation is on the left, the occluded conformation (OCC) is in the center, and the inward-facing (IF) conformation is on the right. The center row shows the ligand-binding site at the center of the protein, and the bottom row shows how the ligand-binding site differs between the protein conformations

4 Notes

1. Rosetta specifics: Rosetta is a large software suite with extensive documentation (https://www.rosettacommons.org/docs/latest/Home). Instructions on installation and compilation can be found here: https://www.rosettacommons.org/docs/latest/build_documentation/Build-Documentation. Rosetta compilation depends on the compute environment (Linux vs. Mac), the compiler (GCC, Clang, or others), and the mode (debug vs. release). The extension of the executable specifies all three of these parameters. For example, rosetta_scripts.linuxgccrelease runs RosettaScripts on Linux, compiled with GCC, in release mode. The compute environment is determined by the machine on which Rosetta is run. Tests can easily be run on a local machine (Linux or Mac), while large-scale runs require a high-performance computing (HPC) cluster (mostly running Linux). The computations performed in this example are mostly run on a cluster, especially the homology modeling and refinement steps. The compiler can be chosen in local environments but is often prescribed on HPC clusters. More details can be found in the installation and build documentation. Two modes are available for compilation, debug and release. Release mode runs about ten times faster than debug mode but does not provide detailed error messages. For testing purposes, we suggest starting in debug mode and switching to release mode for production runs, once executables run without errors.

2. Rosetta revisions: The computations here are run with revision 60538 included in the 2018.51 release and can be run on newer revisions; somewhat older revisions should work as well. We provide revisions for protocol captures because the codebase is constantly changing, and new applications and features are continuously being added. Relevant details about membrane protein modeling in Rosetta are also available [13, 22].

3. BioChemical library: Documentation of the BCL is available through the application itself by adding the --help or --readme options. The options are context-sensitive, meaning results will differ when running bcl.exe --help versus bcl.exe molecule:ConformerGenerator --help. Available applications are also listed on


http://meilerlab.org/index.php/bclcommons/show/b_apps_id/1.

4. Directory structure and file naming: When working through the protocol, we suggest creating a directory for each major step and copying the relevant files into the directory for the following step. This way it is always clear for each step what the inputs and outputs are and at which step specific files are created. This helps tremendously in bookkeeping and when going back to these files at a later point in time. As an example, the directory tree structure for this run with some of the most relevant files is shown below:

1_template_search/
    slc6a8.fasta -- fasta from Uniprot
    cmd_pdbblast.sh -- the cmd files contain the command for running specific applications
    seqsim20-25/
        2A65_tr.pdb
        ...
    seqsim40-45/
        ...
2_mustang_msa_from_structalign/
    slc6a8_t.fasta -- trimmed fasta
    slc6a8_t.span
    slc6a8_t.frag3
    slc6a8_t.frag9
    mustang_mafft.fasta -- MSA in fasta format
    cmd1_create_database.sh
    cmd2_mustang.sh
    cmd3_create_spanfiles.sh
    database/
        2A65_tr.pdb
        2A65_tr_A.pdb
        2A65_tr_A_ignorechain.fasta
        ...
3_seq_alignment/
    all_templates.fasta
    slc6a8_t_2A65_A.grishin
    ...
4_outward-facing_from_MSA/
    3F3A_A_tr_sup.pdb
    ...
    slc6a8_on_3F3A_A_tr_sup.pdb
    ...
    cmd1_rosetta_cm.sh
    cmd2_super_mprr_super.sh
    cmd3_ligand-docking.sh
    decoys1_rosetta_cm/
        ...
    decoys2_refinement/
        ...
    decoys3_ligand_docking/
        ...
    score1_rosettaCM.sc
    score2_refinement.sc
    score3_ligand_docking.sc
    score4_intf_analyzer.sc
5_outward-facing-occluded_from_MSA/ -- setup same as directory 4
6_inward-facing_from_MSA/ -- setup same as directory 4
7_creatine_preparation/ -- output feeds into ligand docking in directories 4, 5, and 6
    CRN.params
    CRN.pdb
    CRN_conformers.pdb
    cmd_create_params.sh
    conformer-generation_BCL/
        ...
    conformers_PDB/
        ...
    crn_all_conformers_34.sdf

5. Protein sequence disparities: The UniProt sequence is the full sequence of the protein. When a protein structure is being determined, the sequence of the construct can differ from the UniProt sequence, for instance, when purification tags (like His-tags) are added or when a chimera is created. The sequence of the construct is reported in the SEQRES lines in the PDB file, which does not always agree with the sequence in the ATOM lines in the case of unresolved residues. This can happen, for instance, during crystallization when the electron density is too weak or too blurred to assign residues, revealing a “gap” in the sequence in the ATOM lines. Further, some residues include modifications of the standard 20 amino acids, for instance, selenomethionine (three-letter code MSE), which is used for crystallization and needs to be taken into account when translating sequences from three-letter codes to one-letter codes and comparing them to the UniProt sequence. PyMOL can display the protein sequence at the top of the structure viewer window. PyMOL versions 2.1.1 and newer display the construct’s sequence from the SEQRES lines and color only the resolved residues (i.e., those present in the ATOM lines). Unresolved residues remain gray in the sequence window and are displayed as dotted lines in the structure window. This feature is very helpful in identifying sequence disparities between the construct and resolved residues and is unfortunately not available in older versions of PyMOL. Additionally, PyMOL’s structural alignment (using the align command) discards atoms when optimizing the alignment between two structures. It does so in multiple iterations, and output can be observed in the text window during alignment. Also, PyMOL can align two structures of different lengths, while Rosetta’s SuperimposeMover cannot easily do that. This is the reason why we create the reference model onto which all of the generated homology models are superimposed in Rosetta.

6. Use of additional information as restraints: When modeling biological targets, it is important to use as much information as possible as restraints to restrict the conformational sampling space, leading to more accurate models. This information can come from the literature, experiments, or general biochemical knowledge. For instance, it is often very useful to look at the distribution and location of cysteines in the templates and assess whether they can be used as restraints in modeling. In the CT1 example, CT1 is the only protein in the MSA that has four cysteines in the loop between TM3 and TM4. The dopamine and serotonin transporters have two of those cysteines. Since the loop in CT1 is 69 residues long, we decided to model the additional two cysteines as a disulfide bond to restrict the conformational space. Otherwise, loop modeling would be meaningless for a loop of that length. Other useful information includes the location of charged residues, metal ion-binding sites, glycosylation or other posttranslational modifications, and known binding interactions with small molecules or other proteins.

7. Reference model: We use the lowest-scoring model from the homology modeling step as the reference model for superposition of all other models. Another valid option is to create the reference model using automated prediction tools like MODELLER, SWISS-MODEL, or I-TASSER. This has the advantage of being able to examine similarities and differences between the automated model and the manually built one. It is important to recognize that automated servers tend to create less accurate models because the hosts only allow a small computational expense per job request, for instance, generating a small number of models from which a single one is chosen by predefined criteria. The manual pipeline we present here allows generating several orders of magnitude more models, minimizing sampling errors and leading to higher model quality. Further, automated servers use default parameters that work well overall but do not necessarily work best for the query protein.

8. Data analysis: We describe above how we choose the best models from a large pool of generated ones. It is important to recognize that data analysis cannot be generalized across Rosetta protocols and requires a good amount of computational expertise. Most often, columns in the score file are plotted against each other (for instance, score vs. RMSD), and decisions are made based on those features; which features are analyzed depends on the application and the scientific problem.


Commonly used features include total score, RMSD, the interface score for protein-protein or protein-ligand docking, and ligand RMSD, although other features can also be used to identify high-quality models. Frequently, the lowest-scoring model is chosen as the final model, which can be a flawed choice if the scoring function is imprecise for that particular example and produces a false minimum with a lower energy than the true minimum. Such a scoring failure happens occasionally, which is why we choose to analyze the ten lowest-scoring models simultaneously. If the score vs. RMSD plot reveals more drastic scoring failures, it might be useful to look at more than just ten models.

9. Model quality: It is important to note that the final model is still a model, which can be close to or far away from the truth. Similar to the resolution of a crystal structure, the computational model has a quality associated with it, which depends on the quality and quantity of the information used for generating the model. Similar to the local resolution of the electron density in a crystal structure, the model typically does not have the same quality throughout. For instance, for the models we created, we are reasonably confident in the fold and the core structure of the TM spans. However, we are much less confident in the structure of the loop between TM spans 3 and 4 because the quality of loop modeling drastically decreases for loops longer than 15 residues, and this specific loop is 69 residues long.

References

1. Webb B, Sali A (2016) Comparative protein structure modeling using MODELLER. In: Current protocols in bioinformatics. Wiley, Hoboken, pp 5.6.1–5.6.37
2. Bienert S, Waterhouse A, de Beer TAP et al (2017) The SWISS-MODEL Repository—new features and functionality. Nucleic Acids Res 45:D313–D319. https://doi.org/10.1093/nar/gkw1132
3. Yang J, Zhang Y (2015) Protein structure and function prediction using I-TASSER. In: Current protocols in bioinformatics. Wiley, Hoboken, pp 5.8.1–5.8.15
4. Zhang J, Yang J, Jang R, Zhang Y (2015) GPCR-I-TASSER: a hybrid approach to G protein-coupled receptor structure modeling and the application to the human genome. Structure 23:1538–1549. https://doi.org/10.1016/j.str.2015.06.007
5. Kelm S, Shi J, Deane CM (2010) MEDELLER: homology-based coordinate generation for membrane proteins. Bioinformatics 26:2833–2840. https://doi.org/10.1093/bioinformatics/btq554
6. Koehler Leman J, Ulmschneider MB, Gray JJ (2015) Computational modeling of membrane proteins. Proteins Struct Funct Bioinform 83:1–24. https://doi.org/10.1002/prot.24703
7. Song Y, Dimaio F, Wang RY-RR et al (2013) High-resolution comparative modeling with RosettaCM. Structure 21:1735–1742. https://doi.org/10.1016/j.str.2013.08.005
8. Christie DL (2007) Functional insights into the creatine transporter. Subcell Biochem 46:99–118. https://doi.org/10.1007/978-1-4020-6486-9_6
9. Salazar MD, Zelt NB, Saldivar R et al (2020) Classification of the molecular defects associated with pathogenic variants of the SLC6A8 creatine transporter. Biochemistry 59:1367–1377. https://doi.org/10.1021/acs.biochem.9b00956
10. Koehler Leman J, Weitzner BD, Lewis SM et al (2020) Macromolecular modeling and design in Rosetta: recent methods and frameworks. Nat Methods 17:665–680
11. Alford RF, Leaver-Fay A, Jeliazkov JR et al (2017) The Rosetta all-atom energy function for macromolecular modeling and design. J Chem Theory Comput 13:1–35. https://doi.org/10.1101/106054
12. Leaver-Fay A, Tyka M, Lewis SM et al (2011) ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487:545–574. https://doi.org/10.1016/B978-0-12-381270-4.00019-6
13. Alford RF, Koehler Leman J, Weitzner BD et al (2015) An integrated framework advancing membrane protein modeling and design. PLoS Comput Biol 11:e1004398. https://doi.org/10.1371/journal.pcbi.1004398
14. Alford RF, Fleming PJ, Fleming KG, Gray JJ (2019) Protein structure prediction and design in a biologically-realistic implicit membrane. bioRxiv 630715. https://doi.org/10.1101/630715
15. Camacho C, Coulouris G, Avagyan V et al (2009) BLAST+: architecture and applications. BMC Bioinform 10:421. https://doi.org/10.1186/1471-2105-10-421
16. Kothiwale S, Mendenhall JL, Meiler J (2015) BCL::Conf: small molecule conformational sampling using a knowledge based rotamer library. J Cheminform 7:47. https://doi.org/10.1186/s13321-015-0095-1
17. The PyMOL Molecular Graphics System, Version 1.8. Schrödinger, LLC
18. Konagurthu AS, Whisstock JC, Stuckey PJ, Lesk AM (2006) MUSTANG: a multiple structural alignment algorithm. Proteins Struct Funct Bioinform 64:559–574. https://doi.org/10.1002/prot.20921
19. Waterhouse AM, Procter JB, Martin DMA et al (2009) Jalview Version 2: a multiple sequence alignment editor and analysis workbench. Bioinformatics 25:1189–1191. https://doi.org/10.1093/bioinformatics/btp033
20. UniProt Consortium (2015) UniProt: a hub for protein information. Nucleic Acids Res 43:D204–D212. https://doi.org/10.1093/nar/gku989
21. Rose PW, Prlić A, Bi C et al (2015) The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Res 43:D345–D356. https://doi.org/10.1093/nar/gku1214
22. Koehler Leman J, Mueller BK, Gray JJ (2016) Expanding the toolkit for membrane protein modeling in Rosetta. Bioinformatics 11:1–3. https://doi.org/10.1093/bioinformatics/btw716
23. Katoh K, Rozewicki J, Yamada KD (2017) MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization. Brief Bioinform 20:1160. https://doi.org/10.1093/bib/bbx108
24. Kim DE, Chivian D, Baker D (2004) Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res 32:526–531. https://doi.org/10.1093/nar/gkh468
25. New Robetta server – http://new.robetta.org/
26. Fleishman SJ, Leaver-Fay A, Corn JE et al (2011) RosettaScripts: a scripting language interface to the Rosetta macromolecular modeling suite. PLoS One 6:1–10. https://doi.org/10.1371/journal.pone.0020161
27. Bender BJ, Cisneros A, Duran AM et al (2016) Protocols for molecular modeling with Rosetta3 and RosettaScripts. Biochemistry 55:4748. https://doi.org/10.1021/acs.biochem.6b00444
28. Groom CR, Bruno IJ, Lightfoot MP et al (2016) The Cambridge structural database. Acta Crystallogr Sect B Struct Sci Cryst Eng Mater 72:171–179. https://doi.org/10.1107/S2052520616003954

Chapter 9

Homology Modeling of the G Protein-Coupled Receptors

Stefan Mordalski and Tomasz Kościółek

Abstract

G protein-coupled receptors (GPCRs) are a therapeutically important family of membrane proteins. Despite the growing number of experimental structures available for GPCRs, homology modeling remains a relevant method for studying these receptors and for discovering new ligands for them. Here we describe state-of-the-art methods for modeling GPCRs, from template selection, through fine-tuning of the sequence alignment, to model refinement.

Key words Comparative modeling, Homology modeling, G protein-coupled receptor, GPCR, Model optimization

1 Introduction

1.1 G Protein-Coupled Receptors

The G protein-coupled receptor superfamily constitutes the largest group of membrane proteins in the human genome, consisting of approximately 800 different members [1]. The variety of receptors in this family is matched by a great diversity of physiological functions. Roughly half of the GPCRs detect external stimuli, including tastes, odors, pheromones, or light, whereas the other members play a regulatory role in physiological processes such as inflammation, mood regulation, pain perception, or immune system regulation [2–4]. This variety of functions is associated with a diversity of GPCR ligands that encompass neurotransmitters, lipids, and peptides, among others. Considering sequence similarity as well as the classes of accommodated ligands and receptor function, the GPCR family can be further divided into several classes of receptors that share specific sequence and structural features [5, 6]. Despite the great number of family members and the diversity of sequences, all GPCRs share a common topology of seven transmembrane helices (for this reason, historically they were also referred to as 7TM receptors [7]), which are responsible for passing the signal across the cellular membrane. The size and role of the extracellular domains may differ between classes (see Note 1), yet the 7TM domain is conserved across all receptors of the family, which also allowed the development of a universal residue numbering system for GPCRs (Fig. 1) [8, 9].

As mentioned before, the role of the 7TM bundle is to pass the signal from external stimuli to the inside of the cell. In the most basic model, the receptor, upon activation by the ligand, engages with signaling proteins, which trigger the appropriate downstream signaling pathways. The key element of this mechanism is the activation of the GPCR, which, as experimental data have revealed, makes the 7TM domain undergo a substantial conformational rearrangement [10]. The feasibility of this rearrangement needs to be considered when building homology models (see Notes 2 and 3).

Fig. 1 A general structure of the transmembrane (7TM) domain of a G protein-coupled receptor. The highlighted amino acids are the most conserved positions of each helix [8] and form the basis of the generic residue numbering systems [9]

Stefan Mordalski and Tomasz Kościółek contributed equally. Sławomir Filipek (ed.), Homology Modeling: Methods and Protocols, Methods in Molecular Biology, vol. 2627, https://doi.org/10.1007/978-1-0716-2974-1_9, © Springer Science+Business Media, LLC, part of Springer Nature 2023

1.2 Homology Modeling

The central paradigm of structural biology holds that the primary sequence of a protein determines its three-dimensional structure [11]; however, owing to the tremendous conformational complexity of the protein-folding problem, analytically predicting structures, for example by using molecular dynamics, is generally limited to small proteins (fewer than 100 amino acids) and burdened by a high computational cost. Substantially more popular approaches fall into two main categories: de novo protein structure prediction, in which a combination of evolutionarily derived constraints, physics-based potentials, and coarse-grained simulations yields a low-resolution model of a protein; and homology modeling. The latter approach relies more heavily on available experimental structural data and is based on the premise that, within a protein family, structure is more conserved than amino acid sequence [12]. Thus, the concept of homology (structural correspondence between features derived from a common ancestor [13]) has been extrapolated to the field of molecular biology. Consequently, the dominant approach in comparative (homology) modeling of proteins relies on the assumption that sequence similarity implies structural similarity. In the case of GPCRs, this paradigm does not imply a linear correspondence between sequence similarity and the quality of the resulting structure, especially in virtual screening applications [14, 15]; however, as a starting point for experiments, its effectiveness has been demonstrated [16].

In general, the process of homology modeling can be narrowed to four steps:

(i) Template selection
(ii) Sequence alignment
(iii) Inferring the coordinates of conserved regions from the template
(iv) Free modeling of non-aligned regions

In the case of G protein-coupled receptors there are, however, a few nuances that need to be considered:

A. Structural anomalies in the templates

For a long time, the helices of the 7TM bundle were considered to be pure α-helices, and that assumption led to the dogma that gaps in the sequence alignment were prohibited in any of the TM regions (see Note 4). However, analysis of the increasing number of available GPCR structures revealed that the α-helices in the 7TM domain can locally become either a π-helix (also called a bulge), which has an extra residue per turn, or a 3₁₀-helix (also called a constriction), which has one amino acid fewer per turn. Either anomaly can lead to a misalignment of the sequences or to gaps in the alignment and, as a result, to homology models with errors (discontinuities of the helical motif in the modeled region). In addition, these anomalies can offset the generic numbering and consequently cause the loss of information on the spatial orientation of several residues. The biological relevance of the helical anomalies is still unclear, as some of them appear only in certain receptor types while others are more prevalent across the GPCR class [9], but they certainly need to be considered when attempting homology modeling of a GPCR.

B. Low homology regions

The greatest structural diversity within the family of G protein-coupled receptors is found in the loop regions. Unlike kinases or ion channels, where there is a limited set of structural motifs and the loop regions do not contain specific secondary structure motifs, GPCRs, owing to the diversity of recognized ligands, have a variety of short structural motifs that help accommodate the stimulus or steer ligand selectivity.

Loops

The extracellular region of GPCRs is composed of the N-terminus of the receptor and the extracellular loops (ECL1-3) connecting the helices of the 7TM bundle. This region is responsible for accommodating orthosteric ligands in class B-F GPCRs and shows significant sequence and structure variability across the GPCR superfamily (Table 1). Thus, it has been postulated that ECLs play crucial roles in ligand selectivity, agonist binding, and receptor activation [17–19]. Analysis of the structural motifs in the structures deposited in the Protein Data Bank [20] reveals that they are characteristic of the GPCR subfamily. Peptide-binding receptors have an extracellular loop 2 that forms a β-sheet motif penetrating the binding site of the receptor. Aminergic GPCRs, on the other hand, form a one-turn helix in intracellular loop 2. Intracellular loop 3 (ICL3) is hardly present in crystal structures.
The handful of examples in which it can be observed come from receptors with a fairly short ICL3 (fewer than 10 residues), such as rhodopsin or μOR. For targets with a long ICL3, such as dopamine or serotonin receptors, the presumed lack of defined secondary structure and the high flexibility of this region prevent its crystallization and detection. In many GPCR crystal structures, ICL3 is replaced by T4 lysozyme to facilitate crystallization of the receptor. Distinct structural motifs present in the loop regions can also serve as a basis for the initial classification of orphan receptors into subfamilies. This was the case for orphan GPCRs, where candidate peptide and protein receptors were filtered for further deorphanization studies [21].
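The alignment issue described under point A (gaps appearing inside TM helices because of bulges or constrictions) can be screened for automatically. The following is a minimal Python sketch under stated assumptions: the function name `gaps_in_tm_regions`, the toy sequences, and the helix boundaries are hypothetical and are not taken from this chapter. A flagged column is not necessarily an error, since it may reflect a genuine helical anomaly, and should be inspected manually.

```python
from typing import Dict, List, Tuple


def gaps_in_tm_regions(
    target_aln: str,
    template_aln: str,
    tm_segments: Dict[str, Tuple[int, int]],
) -> Dict[str, List[int]]:
    """Return alignment columns that contain a gap inside a template TM helix.

    tm_segments maps a helix name to (start, end) residue indices of the
    ungapped template sequence, 0-based and inclusive (hypothetical input).
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")

    flagged: Dict[str, List[int]] = {name: [] for name in tm_segments}
    template_pos = -1  # index of the last template residue seen so far
    for col, (target_res, template_res) in enumerate(zip(target_aln, template_aln)):
        if template_res != "-":
            template_pos += 1
        if target_res != "-" and template_res != "-":
            continue  # aligned residue pair, nothing to flag
        for name, (start, end) in tm_segments.items():
            if template_res == "-":
                # Insertion in the target: it lies between template residues
                # template_pos and template_pos + 1, so both must be in the helix.
                inside = start <= template_pos < end
            else:
                # Deletion in the target opposite a template helix residue.
                inside = start <= template_pos <= end
            if inside:
                flagged[name].append(col)
    return flagged


# Toy example: one deletion in "TM1" and one insertion in "TM2".
example = gaps_in_tm_regions(
    target_aln="ACD-FGHIKWLMN",
    template_aln="ACDEFGHIK-LMN",
    tm_segments={"TM1": (2, 5), "TM2": (7, 10)},
)
print(example)
```

In practice the helix boundaries would come from the template structure annotation, and the flagged columns would be compared with known bulge and constriction positions before the alignment is edited.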

Table 1 Examples of distinct structural motifs found in the structures of GPCRs

Domain        Structural motif   Example receptor   PDB code
N-terminus    [figure]           Rhodopsin          1U19
N-terminus    [figure]           LPA1R              4Z35
ECL1          [figure]           US28               4XT1
ECL2          [figure]           β1-AR              4BVN
ECL2          [figure]           FFA1R              4PHU
ECL1 + 2      [figure]           A2AR               4EYI
ICL2          [figure]           β2-AR              4LDE
ICL2          [figure]           M2R                4MQS
ECL3          [figure]           P2Y1R              4XNV

Despite all of the different conformations and structural motifs that can be adopted by the extracellular loops, there are unique structural features, disulfide bridges, that contribute to the stability of the receptor structure. The most conserved disulfide bridge, linking TM3 (Cys3.25) and ECL2, is present in most GPCRs with very few exceptions, e.g., sphingosine 1-phosphate receptor 1 (S1P1R). In addition, ECL3 contains an intraloop disulfide bridge (CXnC motif), which limits the conformational space of the loop and potentially contributes to the function of the receptor.

Tails

The N-terminal tail of GPCRs is often neglected and until recently was hardly ever present in the receptor structures deposited in the PDB. There are several examples of engineered receptors in which the N-terminus is replaced with T4 lysozyme to aid crystallization (3SN6, 4GBR, 4LDL, 4QKX, or 5JQH for β2-AR) [22–26]. In many other cases, it is simply removed from the construct. On the other hand, in many cases, especially for non-class A GPCRs, the extracellular domain can either bind a different ligand than the 7TM domain or take part in the binding of the ligand interacting with the transmembrane region (see Note 1). The availability of structures of N-termini in GPCRs varies between classes. For class A GPCRs, the N-terminus is relatively short (

$$L(i,j) = \begin{cases} \cdots \\ -1, & i \neq j \text{ and } (v_i, v_j) \in E \\ 0, & i \neq j \text{ and } (v_i, v_j) \notin E. \end{cases} \tag{3}$$

2.3.2 Spectral Simplicial Complex

The spectral simplicial complex model is derived from combinatorial Laplacian (or Hodge-Laplacian) matrices, constructed from a simplicial complex [59–66]. For an oriented simplicial complex, its k-th boundary (or incidence) matrix $B_k$ can be defined as follows:

$$B_k(i,j) = \begin{cases} 1, & \text{if } \sigma_i^{k-1} \subset \sigma_j^k \text{ and } \sigma_i^{k-1} \sim \sigma_j^k \\ -1, & \text{if } \sigma_i^{k-1} \subset \sigma_j^k \text{ and } \sigma_i^{k-1} \not\sim \sigma_j^k \\ 0, & \text{if } \sigma_i^{k-1} \not\subset \sigma_j^k. \end{cases}$$

These boundary matrices satisfy the condition $B_k B_{k+1} = 0$. The k-th combinatorial Laplacian matrix can be expressed as follows:

$$L_k = B_k^T B_k + B_{k+1} B_{k+1}^T.$$

Note that the 0-th combinatorial Laplacian is the graph Laplacian of Eq. (3) and can also be expressed as $L_0 = B_1 B_1^T$. Further, if the highest order of the simplicial complex K is n, then the n-th combinatorial Laplacian matrix is $L_n = B_n^T B_n$. The above combinatorial Laplacian matrices can be explicitly described in terms of the simplex relations. More specifically, $L_0$, i.e., when k = 0, can be expressed as

Persistent Homology for RNA Data Analysis

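$$L_0(i,j) = \begin{cases} d(\sigma_i^0), & \text{if } i = j \\ -1, & \text{if } i \neq j \text{ and } \sigma_i^0 \frown \sigma_j^0 \\ 0, & \text{if } i \neq j \text{ and } \sigma_i^0 \not\frown \sigma_j^0. \end{cases}$$

It can be seen that this expression is exactly the graph Laplacian of Eq. (3). Further, when k > 0, $L_k$ can be expressed as [60]

$$L_k(i,j) = \begin{cases} d(\sigma_i^k) + k + 1, & \text{if } i = j \\ 1, & \text{if } i \neq j,\ \sigma_i^k \not\frown \sigma_j^k,\ \sigma_i^k \smile \sigma_j^k \text{ and } \sigma_i^k \sim \sigma_j^k \\ -1, & \text{if } i \neq j,\ \sigma_i^k \not\frown \sigma_j^k,\ \sigma_i^k \smile \sigma_j^k \text{ and } \sigma_i^k \not\sim \sigma_j^k \\ 0, & \text{if } i \neq j,\ \sigma_i^k \frown \sigma_j^k \text{ or } \sigma_i^k \not\smile \sigma_j^k. \end{cases} \tag{4}$$

Here $\frown$ and $\smile$ denote upper and lower adjacency of two simplices, $\sim$ denotes similar orientation, and $d(\sigma_i^k)$ is the upper degree of $\sigma_i^k$, i.e., the number of (k+1)-simplices of which it is a face.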

Interestingly, the multiplicity of the zero eigenvalue of Lk, i.e., the total number of zero eigenvalues, equals the k-th Betti number βk. Figure 4 illustrates the Hodge-Laplacian matrices constructed from simplicial complexes.
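The relation between the boundary matrices, the Hodge-Laplacian matrices, and the Betti numbers can be checked numerically on a very small complex. The sketch below is illustrative only and is not the authors' implementation: the helper names are hypothetical, simplices are written as sorted vertex tuples, and the orientation is assumed to be the one induced by vertex order. It uses NumPy and compares a hollow triangle (one loop) with a filled triangle (no loop).

```python
import numpy as np


def boundary_matrix(k_simplices, km1_simplices):
    """B_k: rows index (k-1)-simplices, columns index k-simplices.

    With the orientation induced by vertex order, the face obtained by
    dropping the p-th vertex of a simplex receives the sign (-1)**p.
    """
    row = {s: i for i, s in enumerate(km1_simplices)}
    B = np.zeros((len(km1_simplices), len(k_simplices)))
    for j, simplex in enumerate(k_simplices):
        for p in range(len(simplex)):
            face = simplex[:p] + simplex[p + 1:]
            B[row[face], j] = (-1) ** p
    return B


def hodge_laplacians(simplices_by_dim):
    """Return [L_0, ..., L_n] with L_k = B_k^T B_k + B_{k+1} B_{k+1}^T."""
    n = len(simplices_by_dim) - 1
    B = {
        k: boundary_matrix(simplices_by_dim[k], simplices_by_dim[k - 1])
        for k in range(1, n + 1)
    }
    laplacians = []
    for k in range(n + 1):
        size = len(simplices_by_dim[k])
        L = np.zeros((size, size))
        if k >= 1:
            L += B[k].T @ B[k]          # "down" part, absent for k = 0
        if k < n:
            L += B[k + 1] @ B[k + 1].T  # "up" part, absent for k = n
        laplacians.append(L)
    return laplacians


def betti_numbers(laplacians, tol=1e-9):
    """Count the (numerically) zero eigenvalues of each Laplacian."""
    return [int(np.sum(np.abs(np.linalg.eigvalsh(L)) < tol)) for L in laplacians]


hollow = [[(0,), (1,), (2,)], [(0, 1), (0, 2), (1, 2)]]
filled = hollow + [[(0, 1, 2)]]
print(betti_numbers(hodge_laplacians(hollow)))  # [1, 1]
print(betti_numbers(hodge_laplacians(filled)))  # [1, 0, 0]
```

For the hollow triangle, $L_0$ has diagonal entries 2 and off-diagonal entries -1, which is the graph Laplacian form of Eq. (3). Filling the triangle removes the single zero eigenvalue of $L_1$, so the first Betti number drops from 1 to 0.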

Fig. 4 Illustration of a nested sequence of Hodge-Laplacian matrixes from the filtration process of RNA 2Q1O. The Hodge-Laplacian matrixes are generated by using Eq. (4). During the filtration, for L0, i.e., 0-dimensional Hodge-Laplacian matrix, its non-zero off-diagonal terms keep increasing until all off-diagonal terms have the same value -1. For L1 and L2, their non-zero terms in off-diagonal region increase at first and then decrease until they all become zero


Kelin Xia et al.

2.3.3 RNA-Based Persistent Spectral Models

Mathematically, persistent spectral theory studies the persistence and variation of the eigenspectrum during a filtration process. As stated above, a filtration operation delivers a nested sequence of graphs, G0 ⊆ G1 ⊆ ⋯ ⊆ Gm, or a nested sequence of simplicial complexes, K0 ⊆ K1 ⊆ ⋯ ⊆ Km. Here the i-th graph Gi or simplicial complex Ki is generated at a certain filtration value fi. Computationally, we can divide the filtration region (of the filtration parameter) into m equal intervals and consider the topological representation at each interval. A series of Laplacian matrices {Li | i = 1, 2, …, m} can be generated from these graphs. The variation of the spectral information across this series of Laplacian matrices reflects the intrinsic structural information. Figure 4 illustrates the nested sequence of Hodge-Laplacian matrices obtained from the filtration process of RNA 2Q1O. The eigenvalues can be systematically calculated from these sequences of matrices.
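The filtration described above can be sketched for the simplest case, a distance-based filtration of graphs over a point cloud. This is a minimal illustration under stated assumptions, not the procedure used for Fig. 4: the function names and the four toy coordinates are hypothetical, a plain Euclidean cutoff is assumed as the filtration parameter, and only graph (0-dimensional) Laplacians are built.

```python
import numpy as np


def graph_laplacian(points, cutoff):
    """Graph Laplacian of the graph linking points no farther apart than `cutoff`."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adjacency = (dist <= cutoff) & ~np.eye(len(points), dtype=bool)
    return np.diag(adjacency.sum(axis=1)) - adjacency.astype(float)


def laplacian_series(points, f_max, m):
    """Laplacians L_1, ..., L_m at m equally spaced filtration values in (0, f_max]."""
    values = np.linspace(f_max / m, f_max, m)
    return values, [graph_laplacian(points, f) for f in values]


# Toy point cloud: the corners of a unit square (illustrative values only).
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
values, laplacians = laplacian_series(square, f_max=2.0, m=4)

for f, L in zip(values, laplacians):
    zero_count = int(np.sum(np.abs(np.linalg.eigvalsh(L)) < 1e-9))
    print(f"f = {f:.1f}: {zero_count} zero eigenvalue(s)")
```

Because edges are only ever added as the cutoff grows, the graphs are nested. In this toy case the number of zero eigenvalues falls from 4 (isolated vertices at f = 0.5) to 1 once the four sides of the square are connected at f = 1.0.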

2.4 Persistent Models Based Machine Learning Models for RNA Data Analysis

Feature engineering or featurization is key to the performance of learning models for biomolecular data analysis. Both persistent homology and persistent spectral models can be used to generate molecular fingerprints for RNA data. More specifically, for persistent homology, features can be obtained from persistent barcodes in various ways, including barcode statistics, algebraic functions, binning approaches, persistent codebooks, persistent path and signature features, and 2D/3D representations. For persistent spectral models, persistent attributes can be defined on the series of eigenspectra of the Hodge-Laplacian matrices. For instance, the multiplicity (or number) of Dim(k) zero eigenvalues equals the Betti number βk; thus persistent multiplicity, defined as the multiplicity of Dim(k) zero eigenvalues over a filtration process, is exactly the persistent Betti number or Betti curve. Further, we can consider the basic statistical properties of all non-zero eigenvalues, such as the mean, standard deviation, maximum, and minimum, and define four other PerSpect variables, i.e., persistent mean, persistent standard deviation, persistent maximum, and persistent minimum. Other spectral information, including algebraic connectivity, modularity, Cheeger constant, vertex/edge expansion, and other flow, random walk, and heat kernel related properties, can also be generalized into corresponding PerSpect variables or functions (see Notes 3 and 4). An illustration of three persistent attributes from chain A of RNA 2Q1O can be found in Fig. 5.
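The persistent attributes named above can be computed directly from a series of Laplacian matrices. The following is a hedged sketch rather than the authors' featurization code: the function name `perspect_attributes`, the dictionary keys, and the convention of returning 0.0 when a matrix has no non-zero eigenvalue are assumptions made for illustration, and the two small input matrices are toy examples.

```python
import numpy as np


def perspect_attributes(laplacians, tol=1e-9):
    """Spectral statistics of each Laplacian in a filtration series.

    Returns one dict per filtration step with the multiplicity of the zero
    eigenvalue and the mean, standard deviation, maximum, and minimum of the
    non-zero eigenvalues (0.0 if there are none; an assumed convention).
    """
    rows = []
    for L in laplacians:
        eig = np.linalg.eigvalsh(L)  # Laplacians are symmetric
        nonzero = eig[eig > tol]     # and positive semidefinite
        has_nonzero = nonzero.size > 0
        rows.append({
            "multiplicity": int(np.sum(np.abs(eig) <= tol)),
            "mean": float(nonzero.mean()) if has_nonzero else 0.0,
            "std": float(nonzero.std()) if has_nonzero else 0.0,
            "max": float(nonzero.max()) if has_nonzero else 0.0,
            "min": float(nonzero.min()) if has_nonzero else 0.0,
        })
    return rows


# Toy series: three isolated vertices, then a path graph on three vertices.
series = [
    np.zeros((3, 3)),
    np.array([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]),
]
rows = perspect_attributes(series)

# Concatenate the attributes over the filtration into one feature vector.
keys = ["multiplicity", "mean", "std", "max", "min"]
feature_vector = np.array([[r[k] for k in keys] for r in rows]).ravel()
print(feature_vector)
```

Read along the filtration, the "multiplicity" entries form the persistent Betti number (Betti curve), and the remaining entries give the persistent mean, standard deviation, maximum, and minimum. The flattened vector is the kind of fixed-length input that a downstream learning model could consume.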


Fig. 5 Illustration of three types of persistent attributes, i.e., persistent multiplicity (b), persistent mean (c), and persistent standard deviation (d), from chain A of RNA 2Q1O. Persistent multiplicity is equal to the persistent Betti number, which is the number of barcode bars present at each filtration value

Biomolecular flexibility is of great importance for biomolecular dynamics and functional analysis, and various experimental and theoretical models have been developed to study it. Experimentally, the Debye–Waller factor, also known as the B-factor, measures the atomic mean-square displacement and is usually considered an important measure of flexibility. Theoretically, network-based models, including elastic network models, the Gaussian network model, the flexibility-rigidity model, and other computational models, have been proposed for flexibility analysis by shedding light on the inner topological structure of biomolecules. Motivated by the success of element-specific persistent homology-based machine learning models in protein flexibility analysis [67, 68], we proposed a WPH-based machine learning model for RNA flexibility analysis [54]. In our WPH-based machine learning model, we incorporate physical, chemical, and biological information into topological measurements using weight functions. Our model is trained on a well-established RNA dataset, and numerical experiments show that it can achieve a Pearson correlation coefficient (PCC) of up to 0.5822 on its test set. This is an increase in accuracy of at least 10% over previous sequence-information-based learning models [5].
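For readers unfamiliar with the metric, the PCC between predicted and experimental B-factors can be computed as in the sketch below. The function name and the numerical values are purely illustrative assumptions; they are not data or results from this chapter.

```python
import numpy as np


def pearson_cc(predicted, experimental):
    """Pearson correlation coefficient between two equally long sequences."""
    predicted = np.asarray(predicted, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    if predicted.shape != experimental.shape:
        raise ValueError("inputs must have the same shape")
    dp = predicted - predicted.mean()
    de = experimental - experimental.mean()
    return float((dp @ de) / np.sqrt((dp @ dp) * (de @ de)))


# Made-up B-factor values for five atoms, for illustration only.
experimental_b = [20.1, 35.4, 28.0, 50.2, 41.7]
predicted_b = [22.0, 30.5, 31.2, 44.8, 45.0]
print(round(pearson_cc(predicted_b, experimental_b), 4))
```

The PCC measures linear agreement only, so it is insensitive to a constant offset or a uniform rescaling of the predicted B-factors.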

3 Notes

1. Topological models are of key importance for RNA structure data analysis. Traditionally, RNA topology is simply regarded as an RNA graph model. However, a graph, which is composed of vertices and edges, is only suitable for pairwise interactions and cannot be used to describe many-body interactions. Simplicial complexes and hypergraphs are more general topological models that can describe more complicated many-body interactions within RNA systems.
2. Traditional physical models, such as molecular dynamics, normal mode analysis, and elastic network models, are all based on graph representations. Simplicial complex and hypergraph-based physical models may have great potential to reveal more intrinsic properties.
3. Highly effective molecular descriptors are key to the success of molecular machine learning models. Different from all previous models, the persistent models provide a representation that balances structural complexity and data simplification.
4. A multiscale representation of intrinsic mathematical invariants is used, so that the corresponding molecular descriptors and fingerprints are highly abstract and have great transferability. Other than the two persistent models, there are other persistent functions [69–75] that can be constructed and further combined with learning models.

References

1. Singh J, Hanson J, Paliwal K, Zhou Y (2019) RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat Commun 10(1):1–13
2. Liu B (2019) BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief Bioinfor 20(4):1280–1294
3. Puton T, Kozlowski LP, Rother KM, Bujnicki JM (2013) CompaRNA: a server for continuous benchmarking of automated methods for RNA secondary structure prediction. Nucleic Acids Res 41(7):4307–4323
4. Bellaousov S, Mathews DH (2010) ProbKnot: fast prediction of RNA secondary structure including pseudoknots. RNA 16(10):1870–1880
5. Guruge I, Taherzadeh G, Zhan J, Zhou Y, Yang Y (2018) B-factor profile prediction for RNA flexibility using support vector machines. J Comput Chem 39(8):407–411
6. Wei H, Wang B, Yang J, Gao J (2019) RNA flexibility prediction with sequence profile and predicted solvent accessibility. IEEE/ACM Trans Comput Biol Bioinf 18:2017–2022
7. Verri A, Uras C, Frosini P, Ferri M (1993) On the use of size functions for shape analysis. Biolog Cybern 70(2):99–107
8. Edelsbrunner H, Letscher D, Zomorodian A (2002) Topological persistence and simplification. Discrete Comput Geom 28:511–533
9. Zomorodian A, Carlsson G (2005) Computing persistent homology. Discrete Comput Geom 33:249–274
10. Zomorodian A, Carlsson G (2008) Localized homology. Comput Geom Theory Appl 41(3):126–148
11. Edelsbrunner H, Harer J (2010) Computational topology: an introduction. American Mathematical Society, Providence
12. Kaczynski T, Mischaikow K, Mrozek M (2004) Computational homology. Springer, Berlin
13. Xia KL, Wei GW (2014) Persistent homology analysis of protein structure, flexibility and folding. Int J Num Methods Biomed Eng 30:814–844
14. Wang B, Wei GW (2016) Object-oriented persistent homology. J Comput Phys 305:276–299
15. Cang ZX, Wei GW (2017) TopologyNet: topology based deep convolutional and multitask neural networks for biomolecular property predictions. PLoS Comput Biol 13(7):e1005690
16. Cang ZX, Wei GW (2017) Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. Int J Numer Methods Biomed Eng 34:e2914. https://doi.org/10.1002/cnm.2914
17. Nguyen DD, Xiao T, Wang ML, Wei GW (2017) Rigidity strengthening: a mechanism for protein–ligand binding. J Chem Inf Modeling 57(7):1715–1721
18. Cang ZX, Wei GW (2017) Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology. Bioinformatics 33(22):3549–3557
19. Cang ZX, Mu L, Wei GW (2018) Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Comput Biol 14(1):e1005929
20. Wu KD, Wei GW (2018) Quantitative toxicity prediction using topology based multi-task deep neural networks. J Chem Inf Modeling 58:520–531. https://doi.org/10.1021/acs.jcim.7b00558
21. Ghrist R (2008) Barcodes: the persistent topology of data. Bull Amer Math Soc 45(1):61–75
22. Tausz A, Vejdemo-Johansson M, Adams H (2011) Javaplex: a research software package for persistent (co)homology. Software available at http://code.google.com/p/javaplex
23. Nanda V, Perseus: the persistent homology software. Software available at http://www.sas.upenn.edu/~vnanda/perseus
24. Bauer U, Kerber M, Reininghaus J (2014) Distributed computation of persistent homology. In: Proceedings of the sixteenth workshop on algorithm engineering and experiments (ALENEX)
25. Dionysus: the persistent homology software. Software available at http://www.mrzv.org/software/dionysus
26. Binchi J, Merelli E, Rucco M, Petri G, Vaccarino F (2014) jHoles: a tool for understanding biological complex networks via clique weight rank persistent homology. Electron Notes Theoret Comput Sci 306:5–18
27. Maria C (2015) Filtered complexes. In: GUDHI User and Reference Manual, GUDHI Editorial Board
28. Fasy BT, Kim J, Lecci F, Maria C (2014) Introduction to the R package TDA. Preprint arXiv:1411.1830
29. Mischaikow K, Nanda V (2013) Morse theory for filtrations and efficient computation of persistent homology. Discrete Comput Geom 50(2):330–353
30. Bubenik P, Kim PT (2007) A statistical approach to persistent homology. Homol Homotopy Appl 19:337–362
31. Bubenik P (2015) Statistical topological data analysis using persistence landscapes. J Mach Learn Res 16(1):77–102
32. Carlsson G (2009) Topology and data. Am Math Soc 46(2):255–308
33. Chintakunta H, Gentimis T, Gonzalez-Diaz R, Jimenez MJ, Krim H (2015) An entropy-based persistence barcode. Pattern Recogn 48(2):391–401
34. Merelli E, Rucco M, Sloot P, Tesei L (2015) Topological characterization of complex systems: using persistent entropy. Entropy 17(10):6872–6892
35. Rucco M, Castiglione F, Merelli E, Pettini M (2016) Characterisation of the idiotypic immune network through persistent entropy. In: Proceedings of ECCS 2014, pp 117–128. Springer, Berlin
36. Xia KL, Li ZM, Mu L (2018) Multiscale persistent functions for biomolecular structure characterization. Bull Math Biol 80(1):1–31
37. Collins A, Zomorodian A, Carlsson G, Guibas LJ (2004) A barcode shape descriptor for curve point cloud data. Comput Graph 28(6):881–894
38. Cohen-Steiner D, Edelsbrunner H, Harer J (2007) Stability of persistence diagrams. Discrete Comput Geom 37(1):103–120
39. Cohen-Steiner D, Edelsbrunner H, Harer J, Mileyko Y (2010) Lipschitz functions have Lp-stable persistence. Found Comput Math 10(2):127–139
40. Dawson RJM (1990) Homology of weighted simplicial complexes. Cahiers de Topologie et Géométrie Différentielle Catégoriques 31(3):229–243
41. Ren SQ, Wu CY, Wu J (2018) Weighted persistent homology. Rocky Mountain J Math 48(8):2661–2687
42. Wu CY, Ren SQ, Wu J, Xia KL (2018) Weighted (co) homology and weighted Laplacian. Sci China Math
43. Edelsbrunner H (1992) Weighted alpha shapes, vol 92. University of Illinois at Urbana-Champaign, Department of Computer Science, Champaign
44. Bell G, Lawson A, Martin J, Rudzinski J, Smyth C (2017) Weighted persistent homology. Preprint arXiv:1709.00097
45. Guibas L, Morozov D, Mérigot Q (2013) Witnessed k-distance. Discrete Comput Geom 49(1):22–45
46. Buchet M, Chazal F, Oudot SY, Sheehy DR (2016) Efficient and robust persistent homology for measures. Comput Geom 58:70–96
47. Xia KL, Wei GW (2015) Multidimensional persistence in biomolecular data. J Comput Chem 36:1502–1520
48. Xia KL, Zhao ZX, Wei GW (2015) Multiresolution persistent homology for excessively large biomolecular datasets. J Chem Phys 143(13):10B603_1
49. Petri G, Scolamiero M, Donato I, Vaccarino F (2013) Topological strata of weighted complex networks. PLoS One 8(6):e66506
50. Xia KL, Wei GW (2014) Persistent homology analysis of protein structure, flexibility, and folding. Int J Numer Methods Biomed Eng 30(8):814–844
51. Nguyen DD, Cang ZX, Wu KD, Wang ML, Cao Y, Wei GW (2019) Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges. J Comput-Aided Molec Design 33(1):71–82
52. Meng ZY, Anand DV, Lu YP, Wu J, Xia KL (2020) Weighted persistent homology for biomolecular data analysis. Sci Rep 10(1):1–15
53. Anand DV, Meng ZY, Xia KL, Mu YG (2020) Weighted persistent homology for osmolyte molecular aggregation and hydrogen-bonding network analysis. Sci Rep 10(1):1–17
54. Pun CS, Yong BYS, Xia K (2020) Weighted-persistent-homology-based machine learning for RNA flexibility analysis. PLoS One 15(8):e0237747
55. Chung F (1997) Spectral graph theory. American Mathematical Society, Providence
56. Spielman DA (2007) Spectral graph theory and its applications. In: 48th annual IEEE symposium on foundations of computer science (FOCS'07), pp 29–38. IEEE
57. Mohar B, Alavi Y, Chartrand G, Oellermann OR (1991) The Laplacian spectrum of graphs. Graph Theory Combin Appl 2(871–898):12
58. Von Luxburg U (2007) A tutorial on spectral clustering. Statist Comput 17(4):395–416
59. Eckmann B (1944) Harmonische funktionen und randwertaufgaben in einem komplex. Commen Math Helvetici 17(1):240–255
60. Muhammad A, Egerstedt M (2006) Control using higher order Laplacians in network topologies. In: Proceeding of the 17th international symposium on mathematical theory of networks and systems, pp 1024–1038. CiteSeer
61. Horak D, Jost J (2013) Spectra of combinatorial Laplace operators on simplicial complexes. Adv Math 244:303–336
62. Barbarossa S, Sardellitti S (2020) Topological signal processing over simplicial complexes. IEEE Trans Signal Process 68:2992–3007
63. Mukherjee S, Steenbergen J (2016) Random walks on simplicial complexes and harmonics. Random Struct Algor 49(2):379–405
64. Parzanchevski O, Rosenthal R (2017) Simplicial complexes: spectrum, homology and random walks. Random Struct Algor 50(2):225–261
65. Shukla S, Yogeshwaran D (2020) Spectral gap bounds for the simplicial Laplacian and an application to random complexes. J Combin Theory Ser A 169:105134
66. Torres JJ, Bianconi G (2020) Simplicial complexes: higher-order spectral dimension and dynamics. Preprint arXiv:2001.05934
67. Bramer D, Wei G-W (2018) Blind prediction of protein b-factor and flexibility. J Chem Phys 149(13):134107
68. Bramer D, Wei G-W (2020) Atom-specific persistent homology and its application to protein flexibility analysis. Comput Math Biophys 8(1):1–35
69. Wee J, Xia K (2021) Forman persistent Ricci curvature (FPRC) based machine learning models for protein-ligand binding affinity prediction. Briefings in Bioinformatics 22:bbab136
70. Wee J, Xia K (2021) Ollivier persistent Ricci curvature-based machine learning for the protein–ligand binding affinity prediction. J Chem Inf Modeling 61(4):1617–1626
71. Liu X, Wang XJ, Wu J, Xia KL (2021) Hypergraph based persistent cohomology (HPC) for molecular representations in drug design. Briefings in Bioinformatics 22:bbaa411
72. Wang R, Nguyen DD, Wei G-W (2020) Persistent spectral graph. Int J Numer Methods Biomed Eng 36:e3376
73. Wang R, Zhao R, Ribando-Gros E, Chen J, Tong Y, Wei G-W (2020) HERMES: persistent spectral graph software. Found Data Sci 3:67–97
74. Zhao R, Wang M, Chen J, Tong Y, Wei G-W (2020) The de Rham-Hodge analysis and modeling of biomolecules. Bull Math Biol 82(8):1–38
75. Zhao R, Desbrun M, Wei G-W, Tong Y (2019) 3D Hodge decompositions of edge- and face-based vector fields. ACM Trans Graph (TOG) 38(6):1–13

Chapter 13

Computational Methods to Predict Intrinsically Disordered Regions and Functional Regions in Them

Hiroto Anbo, Motonori Ota, and Satoshi Fukuchi

Abstract

Intrinsically disordered regions (IDRs) are protein regions that do not adopt fixed tertiary structures. Since these regions lack ordered three-dimensional structures, they should be excluded from the target portions of homology modeling. IDRs can be predicted from amino acid sequences because their amino acid compositions differ from those of structured domains. This chapter provides a review of IDR prediction methods and a case study of IDR prediction.

Key words Intrinsically disordered protein, Bioinformatics, Amino acid sequences, Machine learning, Molecular recognition features, Protean segments, Short linear motif

1

Introduction It had been believed that most proteins autonomously fold into unique three-dimensional structures determined by the amino acid sequences. However, this classical paradigm is being reexamined since late research revealed that intrinsically disordered proteins (IDPs) are prevalent and that some proteins require molecular chaperons to efficiently fold in the cell. IDPs are proteins that involve intrinsically disordered regions (IDRs) and do not by themselves adopt fixed tertiary structures under physiological conditions [1–3]. Some IDPs are partially disordered (i.e., a part(s) of its sequence is disordered, with the rest assuming globular structure (s)), while the other IDPs are fully disordered. Regardless of the proportion of IDRs, IDPs are known to be enriched in eukaryotic proteomes: one report estimated that more than one-third of human proteins contain IDRs longer than or equal to 30 amino acid residues [4]. IDPs tend to be involved in signaling pathways and transcription regulations and disproportionately function as hub proteins of protein-protein interaction networks [5]. We previously showed that intrinsic disorder, hub proteins, and multiple subcellular localizations (particularly, in the nucleus and cytoplasm)

Sławomir Filipek (ed.), Homology Modeling: Methods and Protocols, Methods in Molecular Biology, vol. 2627, https://doi.org/10.1007/978-1-0716-2974-1_13, © Springer Science+Business Media, LLC, part of Springer Nature 2023

231

232

Hiroto Anbo et al.

are strongly correlated with each other, and these properties are responsible for intercellular information processing including signaling and transcription [6]. Also, long IDRs are targets of posttranslational modifications, such as phosphorylation; these regions are exposed to solvent, so modification enzymes can easily access them [7]. As IDPs are important in biological processes, it is unsurprising that they are frequently associated with diseases [8].

As is common with other proteins, IDPs function via protein interactions. IDRs of some IDPs adopt tertiary structures when they interact with their partners. This folding mechanism is called coupled folding and binding. It is observed in various IDPs and is considered to be one of their universal characteristics. In addition, the induced structural segments are important in protein-protein interactions. Therefore, some databases collect such functional regions in IDRs together with their structures in the bound complexes. These functional regions are variously termed Molecular Recognition Features (MORFs) [9], Short Linear Motifs (SLiMs) [10], Disordered Binding Sites (DIBSs) [11], and Protean Segments (ProSs) [12]. Coupled folding and binding requires the formation of protein structure at the end. However, some proteins reportedly interact with partner proteins without forming a structure. Table 1 summarizes the interaction types of IDRs with references. Interaction mechanisms of IDPs have not been fully elucidated and are a subject of future study.

Let us present an example to facilitate understanding of the structure and function of IDPs. p53 is a typical IDP and is a transcription factor that suppresses many types of tumors. Figure 1a is a schematic diagram of p53 with IDRs and functional regions in the IDRs, based on data in IDEAL, a database of IDPs with experimentally verified IDRs and functional regions (ProSs). As shown in Fig. 1a, p53 has two structural domains (blue bars) corresponding to a DNA binding domain [13] and an oligomerization domain [14], flanked by three IDRs (red bars). The N-terminal IDR [15] contains two transactivation domains: TADI and TADII (striped bars). Several binding-partner proteins of these domains (functional regions, green bars) are known. TADI and TADII form structures by coupled folding and binding when they interact with the binding-partner proteins, and the structures of the complexes have been determined (Fig. 1b). The linker connecting the DNA-binding and oligomerization domains was verified to be disordered, as its residues are missing in many X-ray structures (e.g., PDB: 5a7b). The IDR at the C-terminus has also been shown experimentally to be disordered [16] and has many binding partners (green bar). The multiple functional regions found in p53 are typical of IDRs.

In this review, we describe the prediction methods of IDRs and the functional regions in them. Before modeling protein structures, it is important to know whether the protein eventually assumes a fixed structure. If the target regions are IDRs, modeling is clearly impossible because no template structure is available. If one focuses on the functional region(s) in an IDR that undergoes coupled folding and binding, one can model the structure(s) if homologs of known structural complexes are available. Locating IDRs in the query amino acid sequence is therefore the first and mandatory step in protein structure modeling. IDRs and functional regions can be predicted by computer programs from their amino acid sequences. In the following, we present an overview of several programs, and a case study of the NeProc program, which we developed recently.

Table 1 Interaction types of IDRs

Subject  Verb            Object             Upon binding   References
IDR      Interacts with  Structural domain  Structured^a   [41, 42]
IDR      Interacts with  IDR                Structured^a   [43]
IDR      Interacts with  Structural domain  Unstructured   [44]
IDR      Interacts with  IDR                Unstructured   [45]

a These structured segments are called MORF, SLiM, DIBS, or ProS. In this review, such a segment is referred to as a functional region

Fig. 1 Structure of p53. (a) Structural diagram of p53. The experimentally verified IDRs and structural domains are shown in red and blue, respectively. Binding partner proteins of the functional regions (light green rectangles) are denoted. TAD signifies trans-activation domain. (b) The structure of a functional region formed by coupled folding and binding. The region corresponding to TADI (green) is disordered in isolation (left) [15], but takes local structure (right) upon binding to p300 (white) [40]

2 IDR Prediction Methods

Many IDR prediction programs have been developed (Table 2). Most of the programs determine whether a residue in a query sequence is in a globular domain or in an IDR. These programs assign feature values to each residue, with different programs utilizing different features. Frequently used features are physicochemical properties of amino acids, the tendency to form secondary structures, and a position-specific scoring matrix (PSSM). A PSSM is a score table usually obtained by the PSI-Blast homology search program [17]: the frequency of each amino acid at a site in the multiple sequence alignment produced by PSI-Blast is converted into a score in this matrix. In general, these feature values are utilized in various combinations. Also, many programs employ a window in order to consider the residues surrounding a site of interest. For example,

Table 2 IDR prediction programs

Program        Year  Method            Input feature           Web service  Source code
Uversky plot   2000  SF                PP
GlobPlot       2003  SF                AAC                     +            +
FoldIndex      2005  SF                PP                      +
IUPred2A       2018  SF                Pairwise energy         +            +
DISpro         2005  ML(RNN)           PSSM, 2D                +
PrDOS          2007  ML(SVM)           PSSM, AAS               +
ESpritz        2011  ML(RNN)           PSSM, AAS               +
SPINE-D        2012  ML(NN)            PP, PSSM, 2D            +
MFDp2          2013  ML(SVM)           PSSM, 2D, AAS           +
s2D            2014  ML(NN)            PSSM                    +
SLIDER         2014  ML(NN, SVM)       PP                      +
DISOPRED3      2015  ML(NN, SVM, KNN)  PSSM                    +            +
DisPredict     2015  ML(SVM)           PSSM, AAS, 2D, ASA, PP  +
SPOT-Disorder  2017  ML(RNN)           PSSM, 2D, ASA, PP       +            +
NeProc         2020  ML(NN, SVM)       PSSM                    +
MetaDisorder   2012  CN                13 programs             +
MobiDB-lite    2017  CN                8 programs              +            +

PP physicochemical property, AAC amino acid composition, PSSM position-specific scoring matrix, AAS amino acid sequence, 2D secondary structure prediction, ASA accessible surface area, SF scoring function, ML machine learning, RNN recurrent neural network, SVM support vector machine, NN neural network, KNN k-nearest neighbor algorithm, CN consensus

when one employs a window size of 21, the residue of interest and its 10 flanking residues on each side are considered.

IDR prediction methods are roughly divided into three categories: scoring function methods, machine learning methods, and consensus methods. IDRs frequently contain functional regions that form structures by coupled folding and binding. Some recently developed programs can predict these functional regions as well as IDRs.

2.1 Scoring Function Methods

Scoring function methods introduce a scoring function to predict IDRs. Programs in this category require lower computational costs compared to those of machine learning and consensus methods, and generally compute the result more rapidly. Since the scoring functions are clearly defined, the predictions of scoring function methods are easier to interpret in contrast to machine learning methods, in which it is not straightforward to identify the causal relationship between a query sequence and the prediction result.

2.1.1 Uversky Plot and FoldIndex

The Uversky plot [18] is a classical method to discriminate IDPs from globular proteins. While other programs predict IDRs within a sequence, this method classifies whole proteins as IDPs or globular proteins. It exploits a characteristic of IDPs: charged residues are frequent, while hydrophobic residues are rare [19, 20]. When the mean net charge of each protein is plotted against its mean hydrophobicity, IDPs and globular proteins are separated by a boundary line. Although not many web services compute the Uversky plot, PONDR (see Subheading 2.2.3) provides it (Fig. 2a). While the Uversky plot gives the foldability of entire proteins, FoldIndex [21] predicts IDRs by sliding a window along the query sequence and evaluating the Uversky criterion for each window. FoldIndex is available at https://fold.weizmann.ac.il/fldbin/findex. The window size is 51 residues by default, but can be changed.
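The charge-hydropathy criterion and its sliding-window use can be illustrated with a short Python sketch. This is a minimal illustration, not the FoldIndex source code. It assumes the Kyte-Doolittle hydropathy scale rescaled to the range 0 to 1, counts only K, R, D, and E as charged, and uses the boundary constants 2.785 and 1.151 as we recall them from the publications cited as [18, 21]; these values are not given in this chapter. The function names and the choice to score only full-length windows are likewise assumptions.

```python
# Kyte-Doolittle hydropathy scale (assumed here; rescaled to 0..1 below).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Mean hydropathy, with each residue rescaled to lie between 0 and 1."""
    return sum((KD[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_net_charge(seq):
    """Signed mean net charge: K and R count +1, D and E count -1."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return (positive - negative) / len(seq)


def fold_score(seq):
    """Positive values suggest a folded segment, negative values disorder."""
    return 2.785 * mean_hydropathy(seq) - abs(mean_net_charge(seq)) - 1.151


def sliding_fold_scores(seq, window=51):
    """Score every full-length window; returns (centre index, score) pairs."""
    half = window // 2
    return [
        (start + half, fold_score(seq[start:start + window]))
        for start in range(len(seq) - window + 1)
    ]
```

Calling fold_score on a whole sequence corresponds to placing one protein on the Uversky plot, whereas sliding_fold_scores yields a per-window profile in which stretches of negative scores are candidate IDRs.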

2.1.2 IUPred

IUPred utilizes a statistical potential to characterize the tendencies of amino-acid-residue pairs to be in contact. The energy of the residue in position k of amino acid type i is defined by

$$e^{k}_{i} = \sum_{j=1}^{20} P_{ij}\, c^{k}_{j},$$

where $c^{k}_{j}$ is the ratio of amino acid type j in the sequence neighborhood of position k, and $P_{ij}$ is the contact energy of amino acids i and j. The $P_{ij}$ values are optimized so that the energy calculated from the amino acid sequence fits that of the known globular structure. IUPred thereby determines whether a residue has favorable energies for globular structures, and detects IDRs as stretches of amino acid residues with unfavorable energies. The latest version of IUPred, IUPred2A [22], also provides the ANCHOR2 program. ANCHOR2 is the latest version of ANCHOR, which predicts functional regions in IDRs. By using the statistical potential of IUPred, ANCHOR2 evaluates whether a region can form favorable interactions with globular proteins and whether it resides in an IDR. A region fulfilling both criteria is judged to be a functional region. Figure 2b shows an example of the prediction by IUPred2A. IUPred2A and ANCHOR2, together with their source code, are available at https://iupred2a.elte.hu.

Fig. 2 Examples of IDR predictors. (a) A Uversky plot provided by PONDR. PONDR can provide Uversky plots by selecting “Charge-hydropathy model” (see Subheading 2.2.3). The plot of a typical IDP, Fus RNA-binding protein, is shown (green square). The red and blue points signify representative IDPs and globular proteins so identified by PONDR. The line discriminates these two classes and the section containing the green square and many red points is that of IDPs. (b) An example of IUPred2A output. The top graph shows the disorder scores predicted by IUPred2A in red and the propensity of functional

2.2 Machine Learning Methods

Programs in this category use machine-learning techniques such as neural networks, support vector machines, and k-nearest neighbor methods. Because the predictions come from trained models rather than from an explicit scoring function, it is harder to understand the internal workings of the models than with scoring function methods. However, it is easy to add or delete features in machine learning methods, and most of the programs use a number of feature values to adequately characterize IDRs.

2.2.1 DISOPRED

The DISOPRED family started from DISOPRED, followed by DISOPRED2 and 3. The latest version, DISOPRED3 [23], employs a neural network model using the following four feature values: the output of the original DISOPRED model, the output of the DISOPRED2 model, the output of a model based on a k-nearest neighbor method, and the location of the query residue in the entire chain. A PSSM is used to obtain the first three feature values. DISOPRED3 also predicts functional regions in IDRs based on a support vector machine. As feature values for this purpose, DISOPRED3 uses the PSSM, the length and location of the predicted IDRs, and the amino acid composition in a 15-residue window. The DISOPRED3 server and the source code of the program are available at http://bioinf.cs.ucl.ac.uk/psipred/ and https://github.com/psipred/disopred, respectively. To run DISOPRED3 locally, LIBSVM, dso_lib, PSI-Blast, and Uniref90 are needed (see the DISOPRED3 documentation).
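Window-based features of the kind mentioned above, such as the amino acid composition in a 15-residue window, can be computed as in the following minimal Python sketch. It is an illustration only and is not taken from the DISOPRED3 code; the function name, the ordering of the 20 amino acids, and the truncation of windows at the sequence termini are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def window_composition(seq, center, window=15):
    """Fraction of each amino acid type in a window centred on `center`."""
    half = window // 2
    # Assumption: windows are simply truncated at the sequence termini.
    segment = seq[max(0, center - half):center + half + 1]
    counts = Counter(segment)
    return [counts[aa] / len(segment) for aa in AMINO_ACIDS]
```

Each residue thereby receives a 20-dimensional vector that summarizes its local sequence environment and that can be concatenated with other per-residue features before being passed to a classifier.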

2.2.2 SPOT-Disorder

SPOT-Disorder [24] uses long short-term memory (LSTM), a recurrent neural network technique, which considers the flanking regions of a residue of interest. Thus, although SPOT-Disorder does not employ a window, it implicitly considers the neighboring residues around the one of interest. SPOT-Disorder uses the following feature values: PSSM, Shannon’s entropy of the multiple sequence alignment at the site, secondary structure prediction results of SPIDER2, and physicochemical properties of amino acid residues. SPOT-Disorder is available at https://sparks-lab.org/ together with its source code. PSI-Blast must be installed before the program is installed locally, although SPOT-Disorder also accepts pre-computed PSSMs.

Fig. 2 (continued) regions by ANCHOR2 in blue. Larger disorder scores represent higher probabilities of being disordered. The bar diagrams below show other information such as Pfam domain, short linear motif (SLiM), post-translational modification, known IDRs, and PDB structures. (c) An example of MetaDisorder results. Disorder tendencies by 4 predictors (listed at the right side) are shown by line graphs in different colors

2.2.3 PONDR

The PONDR family has five predictors: VLXT, XL1, CAN_XT, VL3_BA, and VSL2. Short IDR prediction is improved in VSL2, resulting in better performance than the other predictors of the PONDR family [25]. PONDR_VSL2 is composed of two models based on a support vector machine: VSL2-S is used for IDRs of 30 residues or fewer, while VSL2-L is used for longer IDRs. MetaPredictorM judges which of the VSL2-L and VSL2-S models is suitable for a query. All models use the following features: amino acid composition, location in the sequence, local sequence complexity (the K2 entropy), net charge, average hydrophobicity, flexibility, PSSM, and secondary structure prediction results. The PONDR family programs are available at http://www.pondr.com. In addition, the website provides the Uversky plot via the “Charge-Hydropathy model” option.

2.3 Consensus Methods

Consensus methods run several existing programs and build a consensus from their results. Although the consensus rules differ between programs, relatively conservative thresholds tend to be employed to keep false-positive rates low. Most consensus methods thus predict relatively few IDRs.

2.3.1 MobiDB-lite

MobiDB-lite [26] refers to 8 IDR predictors: ESpritz-DisProt, ESpritz-NMR, ESpritz-X-ray, IUPred-long, IUPred-short, DisEMBL-465, DisEMBL-HL, and GlobPlot. MobiDB-lite judges residues as disordered if they are predicted as such by at least 5 of the 8 predictors. Subsequently, disordered and ordered regions shorter than three residues are reclassified as ordered and disordered regions, respectively. Furthermore, ordered regions shorter than 10 residues are recategorized as disordered regions if the flanking 20 residues on both sides are disordered. These stringent conditions of MobiDB-lite result in a low false-positive prediction rate. The source code of MobiDB-lite can be downloaded from http://old.protein.bio.unipd.it/mobidblite/.
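The consensus rules described above can be expressed compactly in code. The following Python sketch is a simplified illustration and not the MobiDB-lite implementation: the function names are hypothetical, the input is assumed to be one list of 0/1 calls per predictor, all short runs are flipped in a single pass, and an ordered gap is filled only when a full 20-residue disordered flank exists on each side. The actual program may handle these details differently.

```python
from itertools import groupby


def runs(states):
    """Yield (state, start, end) for each maximal run; end is exclusive."""
    pos = 0
    for state, group in groupby(states):
        length = len(list(group))
        yield state, pos, pos + length
        pos += length


def consensus_disorder(predictions, min_votes=5, min_run=3, max_gap=10, flank=20):
    """Per-residue consensus (True = disordered) from binary predictor calls."""
    n = len(predictions[0])
    # Step 1: majority vote, at least `min_votes` predictors must agree.
    states = [sum(p[i] for p in predictions) >= min_votes for i in range(n)]

    # Step 2: flip ordered or disordered runs shorter than `min_run` residues.
    smoothed = list(states)
    for state, start, end in runs(states):
        if end - start < min_run:
            smoothed[start:end] = [not state] * (end - start)

    # Step 3: fill short ordered gaps embedded in long disordered stretches.
    final = list(smoothed)
    for state, start, end in runs(smoothed):
        if state or end - start >= max_gap:
            continue
        left = smoothed[max(0, start - flank):start]
        right = smoothed[end:end + flank]
        if len(left) == flank and len(right) == flank and all(left) and all(right):
            final[start:end] = [True] * (end - start)
    return final
```

With the default parameters, an isolated one- or two-residue disordered call is discarded, whereas a five-residue ordered gap inside a long disordered stretch is absorbed into the surrounding IDR, which mirrors the conservative behavior described in the text.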

2.3.2 MetaDisorder

MetaDisorder [27] refers to the following 13 predictors: DisEMBL, DISOPRED2, DISpro, GlobPlot, iPDA, IUPred-long, IUPred-short, Pdisorder, POODLE-S, POODLE-L, PrDOS, Spritz, and RONN. MetaDisorder makes its IDR prediction by summing the scores from all the predictors, each multiplied by the accuracy of that predictor. That is, the accuracy is used as the weight of each prediction result in this program. A result of MetaDisorder is shown in Fig. 2c. MetaDisorder is available at http://iimcb.genesilico.pl/metadisorder/. This site also provides other meta-predictors based on genetic algorithms: MetaDisorder3D, MetaDisorderMD, and MetaDisorderMD2. MetaDisorder3D uses six fold-recognition programs, while MetaDisorderMD and MetaDisorderMD2 are combinations of MetaDisorder and MetaDisorder3D.

Although only several representative predictors were described here, numerous predictors have been developed, some of which are listed in Table 2. Some benchmark tests have reported that predictors based on machine learning tend to show better performance, but most predictors have reached a level sufficient for practical use. We would like to emphasize that different predictors were constructed with different algorithms and with different data sets of disordered and ordered residues. Each predictor naturally has advantages and disadvantages. We recommend that the reader consult at least a couple of predictors.
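When the outputs of several predictors are at hand, the accuracy-weighted combination described above for MetaDisorder can be sketched as follows. This Python sketch is a hedged illustration, not the MetaDisorder implementation: the function name and the dictionary layout are hypothetical, and the normalization by the total weight and the 0.5 threshold are assumptions made so that the combined score stays between 0 and 1.

```python
def weighted_consensus(scores, weights, threshold=0.5):
    """Combine per-residue disorder scores using one weight per predictor.

    scores  : dict mapping predictor name -> list of scores in 0..1
    weights : dict mapping predictor name -> weight (e.g., benchmark accuracy)
    """
    names = list(scores)
    total_weight = sum(weights[name] for name in names)
    length = len(scores[names[0]])
    combined = [
        sum(weights[name] * scores[name][i] for name in names) / total_weight
        for i in range(length)
    ]
    # Assumption: residues at or above the threshold are called disordered.
    return combined, [value >= threshold for value in combined]
```

A predictor with a higher weight thus dominates positions where the predictors disagree, while positions on which all predictors agree are unaffected by the choice of weights.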

3 Case Study

3.1 NeProc

We developed an IDR prediction program, NeProc [28], which predicts not only IDRs but also functional regions in IDRs. NeProc was originally designed to predict functional regions rather than IDRs. The basic idea of NeProc is that functional regions are segments with high structural propensity surrounded by IDRs. Based on this idea, NeProc predicts functional regions in IDRs by searching for such segments within predicted IDRs. NeProc is categorized as a machine learning method, because it is based on neural networks and a support vector machine. An outline of NeProc is shown in Fig. 3. In contrast to most machine learning methods, which employ many feature values as input, NeProc uses only the PSSM (Fig. 3b). NeProc employs two neural network models, Smodel and Lmodel, which use small and large windows, respectively. These neural networks were trained on PSSMs annotated with order or disorder at each position (Fig. 3c). Consequently, Smodel and Lmodel predict local and global structures, respectively (Fig. 3d). The two results are integrated by a support vector machine to produce the final prediction (Fig. 3e). Although the training of NeProc does not require information on functional regions, NeProc can predict them as segments with high structural propensity detected by Smodel within the IDRs predicted by Lmodel. NeProc is unique in predicting functional regions in IDRs without using real sample data, which is an advantage given that the number of experimentally determined functional regions is currently limited.

The performance of NeProc in IDR prediction, as assessed by the Matthews correlation coefficient (MCC) on the test dataset derived from CASP10 [29], is 0.561, while the respective MCC values of SPOT-Disorder, DISOPRED3, IUPred2-short, IUPred2-long, and MobiDB-lite are 0.542, 0.536, 0.270, 0.165, and 0.163. The MCC for functional regions by NeProc is 0.381, while those by DISOPRED3 and ANCHOR2 are 0.175 and 0.381, respectively.

Fig. 3 An outline of prediction by NeProc. A query sequence (a) is converted into PSSM (b) by the PSI-Blast homology search program. The information obtained by PSSM is processed by two neural networks, Smodel and Lmodel, employing small and large windows, respectively (c) to yield the predictions of IDRs and functional regions (d). The two results are integrated by a support vector machine to give the final result (e)

3.2 Prediction by NeProc

Figure 4a shows the submit page of NeProc (http://flab.neproc.org/neproc/). A user needs a query sequence in the FASTA format and an e-mail address to receive the result. The returned e-mail contains the URL of the result page. The result is displayed as a schematic diagram, so the user can interpret it intuitively. The result for p53 is shown in the second gray-tinted box of Fig. 4b, where the three bars at the top represent the prediction (label (1)). The red, blue, and light green bars represent the predicted IDRs, ordered regions, and functional regions, respectively. The long structured region from residue 94 to 285 corresponds to the DNA binding domain of p53, and the structured region from residue 324 to 360 is known as the oligomerization domain, while a short IDR separates these regions. The other regions are predicted to be disordered, and the predicted functional regions mostly correspond to the experimentally determined functional regions of p53 at the N and C termini (see Fig. 1a). In addition to the prediction results of NeProc, the page provides domain assignments by homology searches, including Blast and reverse PSI-Blast searches against the PDB (label (2)), and reverse PSI-Blast and HMMer [30] searches against Pfam [31] (label (3)) as well as against SCOP [32] (label (4)). The page also provides the predictions of trans-membrane regions and low-complexity regions by the TMHMM [33] and SEG [34] programs, respectively.

In conclusion, the prediction of IDRs is considered accurate enough for practical use, as some programs showed MCC values greater than 0.5. The predicted IDRs can then be excluded from the targets of homology modeling. On the other hand, the prediction of functional regions is not as reliable as that of IDRs and has room for improvement. Despite the progress in IDR prediction methods, most functions of IDRs remain to be elucidated. Recently, IDPs have been receiving much attention because of their involvement in liquid-liquid phase separation. IDRs have been reported to be the main contributors to the formation of membrane-less organelles in the cell [35, 36], and functional analyses of IDRs have advanced [37–39]. We anticipate further progress in IDR research and the accumulation of experimental results on IDRs, which will hopefully improve the prediction of IDRs and the functional regions within them.
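For readers who wish to benchmark a predictor on their own data, the MCC used in this chapter to compare predictors can be computed from per-residue labels as in the following minimal Python sketch. It uses the standard definition of the Matthews correlation coefficient; the function name and the convention of returning 0 when the denominator vanishes are assumptions, and the exact evaluation protocol of the cited benchmarks is not reproduced here.

```python
import math


def matthews_corrcoef(true_labels, predicted_labels):
    """MCC for binary per-residue labels (truthy = disordered)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0.0 when any marginal count is zero.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

The MCC ranges from -1 to 1, with 0 corresponding to a random prediction, and it is less sensitive than plain accuracy to the imbalance between ordered and disordered residues.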


Fig. 4 The submit page and the result page of NeProc. (a) The submit page of NeProc. An e-mail address and an amino acid sequence in the FASTA format are input. (b) An example of the NeProc result. The prediction result of p53 is presented (compare this with Fig. 1a). Section (1) shows the prediction result, where the red, blue, and light-green bars represent predicted IDRs, ordered regions, and functional regions, respectively. Section (2) shows the domain assignment obtained by Blast and reverse PSI-Blast homology searches against the PDB. Section (3) shows the domain search results against Pfam by reverse PSI-Blast and HMMer and section (4) shows those against the SCOP database by reverse PSI-Blast and HMMer. The gray bar at the bottom represents a low-complexity region detected by the SEG program


Acknowledgments

We are grateful to Prof. Keiichi Homma for his careful reading of this manuscript.

References

1. Wright PE, Dyson HJ (1999) Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol 293(2):321–331. https://doi.org/10.1006/jmbi.1999.3110
2. Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradovic Z (2002) Intrinsic disorder and protein function. Biochemistry 41(21):6573–6582. https://doi.org/10.1021/bi012159+
3. Tompa P (2005) The interplay between structure and function in intrinsically unstructured proteins. FEBS Lett 579(15):3346–3354. https://doi.org/10.1016/j.febslet.2005.03.072
4. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337(3):635–645. https://doi.org/10.1016/j.jmb.2004.02.002
5. Haynes C, Oldfield CJ, Ji F, Klitgord N, Cusick ME, Radivojac P, Uversky VN, Vidal M, Iakoucheva LM (2006) Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comput Biol 2(8):e100. https://doi.org/10.1371/journal.pcbi.0020100
6. Ota M, Gonja H, Koike R, Fukuchi S (2016) Multiple-localization and hub proteins. PLoS One 11(6):e0156455. https://doi.org/10.1371/journal.pone.0156455
7. Koike R, Amano M, Kaibuchi K, Ota M (2020) Protein kinases phosphorylate long disordered regions in intrinsically disordered proteins. Protein Sci 29(2):564–571. https://doi.org/10.1002/pro.3789
8. Iakoucheva LM, Brown CJ, Lawson JD, Obradović Z, Dunker AK (2002) Intrinsic disorder in cell-signaling and cancer-associated proteins. J Mol Biol 323(3):573–584. https://doi.org/10.1016/s0022-2836(02)00969-5
9. Mohan A, Oldfield CJ, Radivojac P, Vacic V, Cortese MS, Dunker AK, Uversky VN (2006) Analysis of molecular recognition features (MoRFs). J Mol Biol 362(5):1043–1059. https://doi.org/10.1016/j.jmb.2006.07.087
10. Ren S, Uversky VN, Chen Z, Dunker AK, Obradovic Z (2008) Short Linear Motifs recognized by SH2, SH3 and Ser/Thr Kinase domains are conserved in disordered protein regions. BMC Genomics 9 Suppl 2(Suppl 2):S26. https://doi.org/10.1186/1471-2164-9-s2-s26
11. Schad E, Ficho E, Pancsa R, Simon I, Dosztanyi Z, Meszaros B (2018) DIBS: a repository of disordered binding sites mediating interactions with ordered proteins. Bioinformatics 34(3):535–537. https://doi.org/10.1093/bioinformatics/btx640
12. Fukuchi S, Sakamoto S, Nobe Y, Murakami SD, Amemiya T, Hosoda K, Koike R, Hiroaki H, Ota M (2012) IDEAL: intrinsically disordered proteins with extensive annotations and literature. Nucleic Acids Res 40(Database issue):D507–D511. https://doi.org/10.1093/nar/gkr884
13. Canadillas JM, Tidow H, Freund SM, Rutherford TJ, Ang HC, Fersht AR (2006) Solution structure of p53 core domain: structural basis for its instability. Proc Natl Acad Sci U S A 103(7):2109–2114. https://doi.org/10.1073/pnas.0510941103
14. Lee W, Harvey TS, Yin Y, Yau P, Litchfield D, Arrowsmith CH (1994) Solution structure of the tetrameric minimum transforming domain of p53. Nat Struct Biol 1(12):877–890. https://doi.org/10.1038/nsb1294-877
15. Dawson R, Müller L, Dehner A, Klein C, Kessler H, Buchner J (2003) The N-terminal domain of p53 is natively unfolded. J Mol Biol 332(5):1131–1141. https://doi.org/10.1016/j.jmb.2003.08.008
16. Rustandi RR, Baldisseri DM, Weber DJ (2000) Structure of the negative regulatory domain of p53 bound to S100B(betabeta). Nat Struct Biol 7(7):570–574. https://doi.org/10.1038/76797
17. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402. https://doi.org/10.1093/nar/25.17.3389
18. Uversky VN, Gillespie JR, Fink AL (2000) Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins 41(3):415–427. https://doi.org/10.1002/1097-0134(20001115)41:33.0.co;2-7
19. Hemmings HC Jr, Nairn AC, Aswad DW, Greengard P (1984) DARPP-32, a dopamine- and adenosine 3′:5′-monophosphate-regulated phosphoprotein enriched in dopamine-innervated brain regions. II. Purification and characterization of the phosphoprotein from bovine caudate nucleus. J Neurosci 4(1):99–110. https://doi.org/10.1523/jneurosci.04-01-00099.1984
20. Weinreb PH, Zhen W, Poon AW, Conway KA, Lansbury PT Jr (1996) NACP, a protein implicated in Alzheimer’s disease and learning, is natively unfolded. Biochemistry 35(43):13709–13715. https://doi.org/10.1021/bi961799n
21. Prilusky J, Felder CE, Zeev-Ben-Mordehai T, Rydberg EH, Man O, Beckmann JS, Silman I, Sussman JL (2005) FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics 21(16):3435–3438. https://doi.org/10.1093/bioinformatics/bti537
22. Meszaros B, Erdos G, Dosztanyi Z (2018) IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res 46(W1):W329–W337. https://doi.org/10.1093/nar/gky384
23. Jones DT, Cozzetto D (2015) DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 31(6):857–863. https://doi.org/10.1093/bioinformatics/btu744
24. Hanson J, Yang Y, Paliwal K, Zhou Y (2017) Improving protein disorder prediction by deep bidirectional long short-term memory recurrent neural networks. Bioinformatics 33(5):685–692. https://doi.org/10.1093/bioinformatics/btw678
25. Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z (2006) Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics 7:208. https://doi.org/10.1186/1471-2105-7-208
26. Necci M, Piovesan D, Dosztanyi Z, Tosatto SCE (2017) MobiDB-lite: fast and highly specific consensus prediction of intrinsic disorder in proteins. Bioinformatics 33(9):1402–1404. https://doi.org/10.1093/bioinformatics/btx015
27. Kozlowski LP, Bujnicki JM (2012) MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinformatics 13:111. https://doi.org/10.1186/1471-2105-13-111
28. Anbo H, Amagai H, Fukuchi S (2020) NeProc predicts binding segments in intrinsically disordered regions without learning binding region sequences. Biophys Physicobiol 17:147–154. https://doi.org/10.2142/biophysico.BSJ-2020026
29. Monastyrskyy B, Kryshtafovych A, Moult J, Tramontano A, Fidelis K (2014) Assessment of protein disorder region predictions in CASP10. Proteins 82(Suppl 2):127–137. https://doi.org/10.1002/prot.24391
30. Eddy SR (2011) Accelerated profile HMM searches. PLoS Comput Biol 7(10):e1002195. https://doi.org/10.1371/journal.pcbi.1002195
31. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44(D1):D279–D285. https://doi.org/10.1093/nar/gkv1344
32. Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG (2014) SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res 42(Database issue):D310–D314. https://doi.org/10.1093/nar/gkt1242
33. Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305(3):567–580. https://doi.org/10.1006/jmbi.2000.4315
34. Wootton JC (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 18(3):269–285. https://doi.org/10.1016/0097-8485(94)85023-2
35. Mitrea DM, Kriwacki RW (2016) Phase separation in biology; functional organization of a higher order. Cell Commun Signal 14:1. https://doi.org/10.1186/s12964-015-0125-7
36. Chong PA, Forman-Kay JD (2016) Liquid-liquid phase separation in cellular signaling systems. Curr Opin Struct Biol 41:180–186. https://doi.org/10.1016/j.sbi.2016.08.001
37. Kato M, Han TW, Xie S, Shi K, Du X, Wu LC, Mirzaei H, Goldsmith EJ, Longgood J, Pei J, Grishin NV, Frantz DE, Schneider JW, Chen S, Li L, Sawaya MR, Eisenberg D, Tycko R, McKnight SL (2012) Cell-free formation of RNA granules: low complexity sequence domains form dynamic fibers within hydrogels. Cell 149(4):753–767. https://doi.org/10.1016/j.cell.2012.04.017
38. Pak CW, Kosno M, Holehouse AS, Padrick SB, Mittal A, Ali R, Yunus AA, Liu DR, Pappu RV, Rosen MK (2016) Sequence determinants of intracellular phase separation by complex coacervation of a disordered protein. Mol Cell 63(1):72–85. https://doi.org/10.1016/j.molcel.2016.05.042
39. Vernon RM, Chong PA, Tsang B, Kim TH, Bah A, Farber P, Lin H, Forman-Kay JD (2018) Pi-Pi contacts are an overlooked protein feature relevant to phase separation. elife 7. https://doi.org/10.7554/eLife.31486
40. Feng H, Jenkins LM, Durell SR, Hayashi R, Mazur SJ, Cherry S, Tropea JE, Miller M, Wlodawer A, Appella E, Bai Y (2009) Structural basis for p300 Taz2-p53 TAD1 binding and modulation by phosphorylation. Structure 17(2):202–210. https://doi.org/10.1016/j.str.2008.12.009
41. Kriwacki RW, Hengst L, Tennant L, Reed SI, Wright PE (1996) Structural studies of p21Waf1/Cip1/Sdi1 in the free and Cdk2-bound state: conformational disorder mediates binding diversity. Proc Natl Acad Sci U S A 93(21):11504–11509. https://doi.org/10.1073/pnas.93.21.11504
42. Sugase K, Dyson HJ, Wright PE (2007) Mechanism of coupled folding and binding of an intrinsically disordered protein. Nature 447(7147):1021–1025. https://doi.org/10.1038/nature05858
43. Borgia A, Borgia MB, Bugge K, Kissling VM, Heidarsson PO, Fernandes CB, Sottini A, Soranno A, Buholzer KJ, Nettels D, Kragelund BB, Best RB, Schuler B (2018) Extreme disorder in an ultrahigh-affinity protein complex. Nature 555(7694):61–66. https://doi.org/10.1038/nature25762
44. Mittag T, Orlicky S, Choy WY, Tang X, Lin H, Sicheri F, Kay LE, Tyers M, Forman-Kay JD (2008) Dynamic equilibrium engagement of a polyvalent ligand with a single-site receptor. Proc Natl Acad Sci U S A 105(46):17772–17777. https://doi.org/10.1073/pnas.0809222105
45. Demarest SJ, Martinez-Yamout M, Chung J, Chen H, Xu W, Dyson HJ, Evans RM, Wright PE (2002) Mutual synergistic folding in recruitment of CBP/p300 by p160 nuclear receptor coactivators. Nature 415(6871):549–553. https://doi.org/10.1038/415549a

Chapter 14

Homology Modeling of Transporter Proteins

Ingebrigt Sylte, Mari Gabrielsen, and Kurt Kristiansen

Abstract

Membrane transporter proteins are divided into channels/pores and carriers and constitute protein families of physiological and pharmacological importance. Several therapeutic compounds in current use exert their effects by targeting membrane transporter proteins, including anti-arrhythmic, anesthetic, antidepressant, anxiolytic, and diuretic drugs. The lack of three-dimensional structures of human transporters hampers experimental studies and drug discovery. In this chapter, the use of homology modeling for generating structural models of membrane transporter proteins is reviewed. The increasing number of atomic resolution structures available as templates, improvements in methods and algorithms for sequence alignment, secondary structure prediction, and model generation, and the increase in computational power have together widened the applicability of homology modeling to transporter proteins. Pitfalls and hints for template selection, multiple-sequence alignment, model generation and optimization, and model validation are discussed, as is the use of transporter homology models for structure-based virtual ligand screening.

Key words Homology modeling, Transporter proteins, Carriers, Channels and pores, Model building and refinements, Model validation, Structure-based virtual screening

Sławomir Filipek (ed.), Homology Modeling: Methods and Protocols, Methods in Molecular Biology, vol. 2627, https://doi.org/10.1007/978-1-0716-2974-1_14, © Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Membrane transporter proteins (channels and carriers) are responsible for cellular extrusion and uptake of ions, electrons, nutrients, signaling molecules, drugs, toxic substances, metabolic products, macromolecules, and other components involved in cellular regulation and function. Transporter proteins are necessary for establishing and controlling the voltage gradient across cell membranes and are major determinants of the pharmacokinetics, safety, and efficacy of drugs and toxic substances. The Saier group at the University of California, San Diego, designed and maintains the Transporter Classification Database (TCDB; http://www.tcdb.org), which has become an official classification system approved by the International Union of Biochemistry and Molecular Biology [1, 2]. The TCDB includes transporter proteins from all types of living organisms and is organized as a five-level hierarchical system of class, subclass, family, subfamily, and the particular transporter protein. At present (17 March 2020), the TCDB contains 19,634 protein sequences classified into 1449 transporter families based on phylogeny and function. The TCDB also contains PDB database (https://www.rcsb.org/) codes of known transporter protein structures and links to structural data.

Transporters are multi-spanning integral membrane proteins, and most of them form α-helical bundles and/or barrel-like β-hairpin secondary structures [3, 4]. Based on the transporter classification system, membrane transporter proteins are divided into channels and carriers. Channels are water channels [2] or ion channels [5] that passively transport substances down an electrochemical gradient (also called facilitated diffusion). Their transportation rate is rapid (milliseconds) since multiple molecules can pass through the channel simultaneously. The two main types of ion channels are voltage-gated and ligand-gated ion channels. Voltage-gated ion channels are classified according to the ion being translocated, and ions are transported through the channels by diffusion down their electrochemical gradient. Voltage-gated ion channels are common targets for anesthetic and anti-arrhythmic drugs. Ligand-gated ion channels open upon binding of a specific ligand. An example is the γ-aminobutyric acid (GABA) receptor A (GABAAR), in which GABA binding triggers the opening of a chloride ion selective pore [6]. The receptor has a pentameric structure of five homologous subunits (a combination of α-, β-, and γ-subunits) surrounding the chloride ion selective pore [7, 8]. The GABAAR is the main target for the benzodiazepines, which are sedative and anxiolytic drugs that function as allosteric modulators of the receptor. From a pharmacological point of view, ligand-gated ion channels are often classified as ionotropic receptors and not as transporters [9].
Carriers are stereospecific for their substrates: binding of the substrate triggers conformational changes that allow movement of the bound substrate and its release on the other side of the membrane. Carriers mediate passive transport or active transport against a concentration gradient and comprise, among others, solute carriers [10] and ATP-driven pumps, including ABC transporters [11]. During passive carrier transport, the substrate/solute diffuses along the concentration gradient without consuming energy, while active transport requires energy because the substrate moves against the concentration gradient. Carriers for active transport are divided into primary and secondary active transporters. In primary active transport, hydrolysis of molecules such as ATP provides the energy required for the transport of a substrate against its concentration gradient. In secondary active transport, the movement of ions down their electrochemical gradient provides the energy used to transport substrates against their concentration gradient. Secondary active transport can be by antiporters, where the substrate and the ion cross the membrane in opposite directions, or by symporters, where the ion and the substrate are transported in the same direction. Because of the complicated conformational changes that are required, carriers have much lower transportation rates than channels (10^2–10^4 molecules per second for carriers and 10^6–10^7 molecules per second for ion channels) [12].

Examples of carriers include the monoamine transporters of the neurotransmitter:sodium symporter (NSS) family, which belongs to the superfamily of solute carriers [13]. The monoamine transporters are secondary transporters expressed in both the central nervous system (CNS) and the peripheral nervous system, where they are responsible for the reuptake of monoamines (5-hydroxytryptamine (5-HT), dopamine, and norepinephrine) from the extracellular space into the presynaptic cell. Dysregulation of monoamine-mediated synaptic transmission in the CNS is connected to prevalent disorders including major depressive disorder (MDD) [14], schizophrenia [15], Parkinson’s disease [16], and attention deficit hyperactivity disorder (ADHD) [17]. Inhibitors of monoamine transporters, such as the selective serotonin reuptake inhibitors (SSRIs), are therefore therapeutic agents in the pharmacological treatment of mental disorders. In addition, monoamine transporters are the primary sites of action of several psychostimulants and drugs of abuse including cocaine, ecstasy, and methamphetamine [13].

Another therapeutically important superfamily of carriers is the ATP-binding cassette (ABC) superfamily, whose members utilize the energy from hydrolysis of ATP to pump different substrates, including drugs, out of cells [11]. This superfamily includes the permeability glycoprotein (P-glycoprotein), the multidrug resistance-associated protein (MRP1), the breast cancer resistance protein (BCRP), and several others.
Increased expression of these transporters contributes to multidrug resistance (MDR) against multiple structurally unrelated chemotherapeutic drugs [11]. These transporters were first discovered as mediators of MDR, but they are also important for the normal excretion of drugs from the body and for the function of barriers such as the blood-brain barrier (BBB), and are therefore very important for the pharmacokinetics and bioavailability of drugs [11].

In 2011, 67 transport proteins were primary effect-mediating targets for drugs approved by the US Food and Drug Administration (FDA) [18]. This corresponded to 15% of the total of 435 primary effect-mediating targets of FDA-approved drugs in 2011, making transporter proteins the third most common class of human drug targets after receptors and enzymes. The most common type of transporter drug target was voltage-gated ion channels, with 29 primary effect-mediating targets.

At present (April 2021), approximately 177,000 structural entities have been deposited in the PDB database (https://www.rcsb.org/), which is a huge increase from approximately 13,500 entities in 2000 [19]. A recent paper by Goodsell and co-workers shows that in July 2019 the PDB database contained 9834 structures of transporter proteins, of which 4131 were of channels (756 voltage-gated and 968 ligand-gated ion channels) and pores. The 9834 structures also included accessory factors involved in transport (1651 structures) and incompletely characterized transport systems (952 structures) [19]. The number of available structures of clinically important transporter proteins has also increased, but still only a few human transporters are structurally characterized. The first X-ray crystal structure of an NSS transporter, that of the sodium-dependent leucine transporter LeuT from the bacterium Aquifex aeolicus, was published in 2005 [20], and an inhibitor-bound LeuT structure was published in 2008 [21]. The inhibitor-bound structure was used as a template for constructing homology models of human monoamine transporters for several years. The first human NSS transporter structure, that of the serotonin transporter (SERT), followed in 2016 [22].

The increase in the number of deposited structures indicates that technical advances in crystallization and in structural data collection at synchrotrons, together with major advances in three-dimensional cryo-electron microscopy [23] during the last 10 years, have contributed to an increased insight into the three-dimensional structures of transporter proteins and other membrane proteins. Carriers in particular, but also other transporter proteins, are structurally flexible, and capturing the protein in interesting conformational states for further studies may be challenging. In spite of the increase in the number of structures, several transporter families are still poorly characterized at the molecular level, despite their clinical significance and potential as drug targets [24].
The lack of structures with sufficient structural detail hampers rational drug design and limits the understanding of transporter function and interactions. In the absence of an experimental 3D structure, homology modeling is a valuable approach for obtaining structural information about transporter proteins. The increase in the number of available templates and improvements in computational power and molecular modeling methods have widened the applicability of homology modeling for generating structural models of transporter proteins. Computational methods for predicting the 3D structures of proteins have been used for several years. The prediction methods are generally classified as de novo modeling, where the 3D structure is predicted directly from the amino acid sequence [25, 26], traditional homology modeling (comparative modeling), and threading, which is mainly used in combination with one of the other methods [27]. In spite of improved de novo methods, homology modeling is still considered the most accurate approach. A general homology modeling approach can be divided into four steps:

(1) Identification of homologous proteins of known structure and selection of the best template or set of templates for the modeling.
(2) Generation and optimization of (multiple) sequence alignments between the query sequence and homologous sequences (including template protein sequences).
(3) Building and optimization of homology models of the query sequence.
(4) Validation of the model(s).

2 Materials

The quality of the amino acid alignment between the template and the target is very important for the quality of the homology model. Methods for generating multiple-sequence alignments were originally developed for soluble proteins [28], since knowledge-based reference alignments could be generated from the available 3D structures. Most soluble proteins are globular with a hydrophilic surface, while most membrane proteins have hydrophobic membrane-traversing regions, which gives different amino acid substitution preferences in soluble and membrane proteins. Due to the increase in the number of available structures from different membrane protein families, it has been possible to develop alignment methods for membrane proteins that take into account hydrophobicity profiles and transmembrane region predictions. Examples are the AlignMe program [29], which was developed for sequence alignments of solute carriers structurally resembling the bacterial leucine transporter LeuT, and the MP-T (Membrane Protein Threader) program [30]. Such programs have improved the sequence alignments of membrane proteins and thereby the quality of homology models [30].

A table of frequently used online servers and software tools for protein homology modeling is given by Muhammad and Aki-Yalcin [31]. These resources are commonly used in the construction of homology models of transporter proteins. We have long experience with the ICM modeling program package [32, 33], Prime (Schrödinger) [34], and Modeller [35] for generating homology models of solute carriers and G-protein coupled receptors [36–45].

Online servers and services offering automatic generation of homology models are available, such as SWISS-MODEL [46] and Phyre2 [47]. Some online tools are specifically designed for the automatic generation of homology models of membrane proteins.
These tools implement algorithms and methods specifically designed for sequence alignment and for prediction of secondary structure, transmembrane regions, and 3D models of membrane proteins [48, 49]. Examples are MEMOIR [50], MEDELLER [51], and RosettaMembrane [52]. Services for automatic generation of homology models of membrane proteins may produce high-quality models, especially when the sequence similarity between template and target is high [53]. Fully automatically generated models may also be decent starting models at low similarity between target and template, and the RosettaMembrane program has been specifically developed for modeling transmembrane helical proteins based on distant-homology templates [52].

At lower similarity, manual adjustments of the different steps in the modeling procedure are often necessary, such as adjustments of the multiple amino acid sequence alignment and introduction of constraints/restraints during building and optimization of the model. Manual adjustments of the alignment can be based on structural superposition of proteins of known 3D structure. Several molecular modeling packages, such as the ICM and Schrödinger program packages, also provide tools for structural superposition of proteins. Most available templates are bacterial, and when bacterial templates are used for constructing models of human transporters, each step in the modeling must be performed with caution because of differences between prokaryotes and eukaryotes. For example, many post-translational modifications found in eukaryotes are lacking in prokaryotes, which may affect protein structure, folding, and dynamics. In such cases, a less automatic process in which each step is carefully performed, with the necessary manual adjustments, may give more accurate models of human transporters than fully automatic model generation [54].

Relevant experimental data used to guide the modeling can increase the accuracy of the models and the hit rate during docking and virtual screening. Appropriate experimental data are results from site-directed mutagenesis studies or other molecular biology approaches that contribute structural information about the geometry of binding sites or other functionally important protein regions.
Information from biophysical and structural biology studies of transporter structure, function, and dynamics can also be important. Further, ligand binding data such as substrate specificity, inhibition kinetics, and structure-activity relationship studies of inhibitors and substrates can be used to refine models and to obtain ligand-specific (ligand-steered) models.

Several programs are available for structural validation of the generated homology models. These programs include WhatCheck [55] and PROCHECK [56], which both use geometrical, stereochemical, and statistical criteria to check the models, and ERRAT [57], which compares the statistics of non-bonded interactions between different atom types in the model with those of highly refined structures. ICM Protein Health, which is part of the ICM package, uses normalized force field residue energies and compares them with the energies expected from high-quality crystal structures [58]. In addition, Ramachandran plots can be used to check the backbone geometry of the model. Models can be uploaded to structural validation servers such as SAVES (https://servicesn.mbi.ucla.edu/SAVES/), where the user can select between different programs for quality checking.
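
The backbone geometry inspected in a Ramachandran plot is described by the φ and ψ dihedral angles of each residue. As a minimal sketch in Python (an illustration only, not the procedure implemented in any of the validation programs named above), a dihedral angle can be computed from four atomic positions. The function name `dihedral` and the coordinates are hypothetical; φ is defined by the atoms C(i-1), N(i), Cα(i), C(i) and ψ by N(i), Cα(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points given as (x, y, z)."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates (not taken from any structure): a right-angle twist.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # 90.0
```

Applied to every residue of a model, the resulting (φ, ψ) pairs are the points of the Ramachandran plot; judging which regions are allowed is left to the validation programs.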

3 Methods

The choice of methods and the reliability of the models depend on the availability of templates close to the target in amino acid sequence and function. Homology models are computationally derived approximations of a protein structure, and will always contain inaccuracies and sometimes errors. The quality required of a model depends largely on its intended use. Low-accuracy models can be completely sufficient for designing mutagenesis experiments, while an overall sequence similarity of more than 50% between the target and template of soluble proteins is generally believed to be necessary for obtaining models that can be used for structure-based drug discovery [59]. For mechanistic studies, the highest possible level of accuracy is essential [60]. The transmembrane regions and ligand binding sites are highly conserved within membrane protein families, even though the overall sequence similarity can be much lower than 50%. Forrest and co-workers showed that a sequence similarity of approximately 30% in transmembrane regions between template and target gave a Cα root mean square deviation (RMSD) of 2 Å in these regions between the model and the X-ray structure template [61]. A flowchart indicating the main steps in a homology modeling procedure for transporter proteins is shown in Fig. 1. The following sections outline particular steps in the scheme.
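
The Cα RMSD quoted above is the root mean square of the distances between equivalent Cα atoms in two structures. A minimal sketch in Python follows, assuming the two structures have already been superposed and that the atoms are paired by list index. The function name `ca_rmsd` and the coordinates are hypothetical.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD (in the units of the input, normally Å) between two equal-length
    lists of (x, y, z) Cα coordinates.

    Assumption: the structures are already superposed and the i-th atom of
    one list is equivalent to the i-th atom of the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for a, b in zip(coords_a, coords_b):
        total += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.sqrt(total / len(coords_a))


# Hypothetical example: every atom displaced by 2 Å along z.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
reference = [(0.0, 0.0, 2.0), (3.8, 0.0, 2.0)]
print(ca_rmsd(model, reference))  # 2.0
```

In practice the superposition itself (for example by a least-squares fit) is performed by the modeling package, and the RMSD is often reported separately for transmembrane regions and loops.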

3.1 Template Identification and Selection

Template selection is most often based on a traditional BLAST (Basic Local Alignment Search Tool) search to identify the structures most similar in sequence to the query sequence (the target). The structure closest in sequence to the target is then most often used as the template for modeling the target. However, using sequence similarity as the only criterion for template selection will not always give the best model (see Note 1). Carriers undergo substantial conformational changes during the transport cycle. For solute carriers, an “alternating access” mechanism has been proposed that requires transitions between at least three conformational states in which the ion and substrate binding sites are alternately exposed to the inner or outer side of the membrane, or occluded within the carrier [62]. Crystal structures of the sodium-dependent leucine transporter LeuT from the bacterium Aquifex aeolicus [63], and of other secondary transporters with a similar 3D fold [64, 65], support the suggested translocation mechanism. The structures are classified as inward-facing, outward-facing, or substrate-occluded states, describing the putative pathway for substrate binding, translocation, and release.
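
The point that the template closest in sequence is not always the right choice can be illustrated with a small sketch in Python. It assumes that each candidate template has already been annotated with its sequence identity to the target and with its conformational state. The `Template` class, the `rank_templates` function, and the example entries are hypothetical and do not refer to real structures.

```python
from dataclasses import dataclass


@dataclass
class Template:
    name: str        # placeholder label, not a real PDB code
    identity: float  # percent sequence identity to the target
    state: str       # "outward-facing", "occluded" or "inward-facing"


def rank_templates(candidates, wanted_state):
    """Keep templates in the wanted conformational state, best identity first."""
    matching = [t for t in candidates if t.state == wanted_state]
    return sorted(matching, key=lambda t: t.identity, reverse=True)


# Invented candidates for illustration only.
candidates = [
    Template("tmplA", 35.0, "inward-facing"),
    Template("tmplB", 28.0, "outward-facing"),
    Template("tmplC", 22.0, "outward-facing"),
]

for t in rank_templates(candidates, "outward-facing"):
    print(t.name, t.identity)
```

Here the candidate with the highest identity is skipped because it represents a different state of the transport cycle than the one to be modeled. A real selection would weigh further criteria, and the hard filter used here is only one possible design.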

[Figure 1: flowchart. Box labels: Target transporter sequence; Template identification and selection; Template/target alignments; Model building; Refinement and regularization, Generation of multiple conformations; Validation of models (Structure/function analysis, Stereochemistry/geometry, MD simulations, Ligand enrichment, Docking); Models not acceptable; Acceptable models; Docking and scoring, Virtual screening, MD simulations; Experimental testing; Databases of known substrates and inhibitors]

Fig. 1 Flowchart indicating putative steps in the modeling procedure of transporters. Template selection, alignment, model building and refinements (in dark blue), model validation (light blue), and putative use of the final models (green)

Additional elucidation of the “alternating access” mechanism was given by biophysical studies and theoretical calculations [66, 67]. Several energetically stable conformational states may therefore be possible during the transport cycle, which also needs to be taken into account during template selection. The putative template closest in sequence to the target is not necessarily the best template, since it may not represent the conformational state in the transport cycle that one intends to model. Transporter proteins are structurally flexible, and during binding, the ligand binding cavity adapts to the structure of the binding molecule. The binding site structure may therefore differ considerably between apo and holo structures of a transporter protein (see Note 2). Because of this structural adaptation, the holo structure may be quite selective for the particular complexed ligand or for a structural group of ligands. Binding site differences between apo and holo structures also need to be taken into account during template selection and in docking experiments.

Models of transporter proteins are often constructed from templates with relatively low overall sequence identity with the target (less than 30%), due to the lack of available templates with higher similarity [54]. The application of such models may be limited. However, functionally important regions such as substrate binding sites most often exhibit higher degrees of conservation than the rest of the structure, and fairly accurate structural models of binding sites may still be constructed in spite of poor overall accuracy [59]. It has been shown, by us and others, that homology models with an overall similarity of less than 25% between the template and target may be used successfully for structure-based virtual ligand screening. For example, by using homology models of the noradrenaline transporter (NET) and experimental verification, Schlessinger and coworkers identified NET inhibitors [68], while we used homology models of the SERT and experimental verification to identify SERT inhibitors [37]. In both studies, the structure-based virtual ligand screening was performed with homology models based on the structure of the leucine transporter LeuT from the bacterium Aquifex aeolicus [20, 21]. The overall amino acid similarity between LeuT and SERT is 21%, while the similarity in the transmembrane helices involved in substrate binding is 35% [38]. Figure 2 shows the binding site of LeuT with the inhibitor L-Trp [21], which was used as the template for our homology models, and the binding site of human SERT with the SSRI paroxetine [22], which was determined after our homology models were built. These X-ray structures show that the binding sites contain both conserved and non-conserved amino acids, and that the L-Trp binding site in LeuT is narrower than the paroxetine binding site in SERT, which may indicate that LeuT has adapted to the smaller inhibitor. Our initial LeuT-based SERT models could not dock most of the known SERT inhibitors [38], and special treatment of the binding site was necessary. By using the ICM software to generate multiple conformations of the binding site and perform ensemble docking [38, 69, 70], we were able to dock most known SERT inhibitors and to select binding site conformations for structure-based virtual ligand screening that recognized new SERT inhibitors [37]. These studies indicate that special treatment of binding site amino acids may be important for the success rate of docking and structure-based virtual ligand screening when using carrier models based on low sequence identity between template and target (see Note 3).

Fig. 2 Above: The leucine transporter LeuT from the bacterium Aquifex aeolicus in complex with the inhibitor L-Trp (PDB id: 3F3A). Below: The human SERT structure in complex with the inhibitor paroxetine (PDB id: 5I6X). Amino acid side chains within 5 Å of the inhibitors are displayed. Color coding of atoms: oxygen, red; nitrogen, blue; carbon (SERT), light blue; carbon (LeuT), grey; carbon (inhibitors), yellow. Color coding of ions: Na+, blue sphere; Cl-, green sphere

A single model based on one template represents only a static snapshot, which limits the usefulness of the generated model. If several templates are available for the transporter, several models can be constructed, and in that way structural flexibility is partly taken into account. Another relevant factor for template selection is the structural quality of the template structures. In general, high-resolution structural templates should be favored over low-resolution templates. Several putative template structures for transporter proteins are low-resolution structures from cryo-electron microscopy. Experimental conditions, putative structural errors, and crystal packing forces will also affect the quality of homology models and should be considered in the selection of templates.
If crystal packing forces affect regions of particular interest, the quality of the homology model in those regions will also be affected.

3.2 Target-Template Alignments

The sequence alignment between the target and template is a very critical step, and the quality of the alignment determines the quality of the model. Even small mistakes in the alignment may limit the accuracy of the models. A multiple-sequence alignment is recommended as the basis for the alignment between target and templates. Such an alignment will highlight evolutionary relationships within the family and increase the probability that corresponding sequence positions are correctly aligned. However, the sequences used in the alignment should be carefully inspected so that the sequence conservation is not biased towards a subset of sequences within the family/subfamily (see Note 4). If more than one template is available, it may also be useful to adjust the alignment based on structural superposition of the known structures (see Note 5).
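
Sequence identity, which is used throughout this chapter as a measure of how close a template is to the target, can be computed directly from a pairwise alignment. The Python sketch below makes one particular assumption about the definition: identical positions are counted relative to the number of aligned (non-gap) columns. Other denominators, such as the length of the shorter sequence, are also in use and give different values. The function name `percent_identity` and the example sequences are hypothetical.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Both arguments are rows of the same alignment and must have equal length.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # skip insertions and deletions
        aligned += 1
        if a.upper() == b.upper():
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned


# Invented toy alignment: 4 identical residues in 5 aligned columns.
print(percent_identity("ACD-FG", "ACEQFG"))  # 80.0
```

The same function can be applied to the columns of the transmembrane helices only, which is how a value for the binding-site helices can differ from the overall value.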


The amino acid similarity between template and target may be low, which complicates the alignment procedure. For soluble proteins, an overall sequence identity between template and query higher than 40% will normally give an alignment with few gaps, and models in which approximately 90% of the backbone atoms can be modeled with an RMSD of about 1 Å. An overall sequence identity of 30–40% will normally result in more frequent insertions and deletions in the alignment, and models in which 80% of the backbone atoms are modeled with an RMSD of approximately 3.5 Å [71, 72]. For membrane protein families, the overall amino acid similarity may be quite low, but the structures of membrane-spanning regions and of binding sites for endogenous activators are well conserved.

3.3 Model Building and Refinements

The building of transporter homology models involves three main steps:

(1) Construction of the structurally conserved core region, which for most transporters is the transmembrane part.
(2) Construction of the extracellular and intracellular loop regions, which normally are the less conserved parts.
(3) Optimization of side-chain conformations and energy refinement of the model.

The methods used for construction of conserved core regions can be classified into rigid body assembly methods [33], segment-matching methods [73], spatial restraint methods [35], and artificial evolution methods [74, 75]. Reviews of the core construction methods used by the most popular homology modeling programs are given by Xiang [71] and Muhammad and Aki-Yalcin [31]. The ICM program [33] and Prime (Schrödinger) [34] use the rigid body assembly method, while MODELLER, which is the most popular homology modeling program (based on citations), uses the spatial restraint method [35]. Some programs use a combination of different methods for constructing core regions [76].

The length of loops and termini may differ substantially within a membrane protein family, and the construction of these regions is therefore much more uncertain than that of the conserved core regions. The structure of termini and loops may be important for binding specificity and function, and the accuracy of these parts is an important factor for the application of the model in further studies. Whether loops are included in the model may therefore depend on the planned application, and wrongly modeled loop structures may induce structural stress in conserved regions during refinement (see Note 6). For shorter non-conserved loops (4–7 amino acids), a database loop search is most often used, where available structures in the PDB are searched to provide the loop structure. Another approach is to use de novo prediction methods to search the conformational space of the loop.
Monte Carlo (MC) simulations, molecular dynamics (MD) simulations, simulated annealing, and genetic algorithms are used, often in combination [77–79].

The model refinement process usually involves removal of clashes and geometrical regularization of bond lengths and angles, but may also involve more sophisticated structural corrections. The refinement may be performed with traditional molecular mechanics force field programs. It often starts with an energy minimization and involves different steps of side-chain conformational sampling, iterative annealing of backbone atoms, and refinement by MC and MD simulations. The different refinement steps may eliminate structural errors, but it is important to keep in mind that other errors may be introduced as well (see Note 6).

Structural templates often represent an unliganded state of the binding site (apo structure), or the binding site geometry is biased toward the particular compound complexed with the template (holo structure). Models based on such templates may also represent unliganded or biased conformations of the binding site. Docking known target binders into these models may be problematic and give poor docking scores. Incorporating ligand binding data into the model optimization process may improve docking into such models. If the template contains a compound in the binding site, one approach is to replace this compound with a preferred target compound and treat the ligand as a part of the model throughout the modeling process, thereby obtaining a ligand-guided or ligand-steered model with a binding site adjusted to the target compound [80]. A simpler approach is to perform induced fit docking of high-affinity compounds, which will generate additional conformations of the binding site. Structural clustering of known target compounds and induced fit docking of cluster representatives may give binding site conformations that are specific for a structural cluster of known compounds [44].
Homology models optimized by docking of known compounds (ligand-based models) may give improved accuracy of the binding site conformation, and have been shown to increase docking enrichment [81] and the hit rate in structure-based virtual screening experiments [37, 59]. However, success depends on correct docking of the compounds used to optimize the homology models.

3.4 Model Validation

Homology models of channels and carriers will always contain uncertainties and shortcomings, especially when the similarity in sequence and function between the template and target is low. However, models generated from templates of low similarity may still be used as a working tool for generating hypotheses and designing site-directed mutagenesis experiments. The models need to be validated both for spatial feasibility and predictive applicability. Evaluation of spatial feasibility may assess local and global structural errors and may be based on geometrical, stereochemical, statistical, and/or energy criteria. The validation can form the basis for additional refinement of the model, adjustment of the target-template alignment, and rebuilding of models.

Predicting transporter-ligand interactions by structure-based virtual ligand screening enrichment has become a commonly used approach for testing whether a model complies with experimental ligand binding data. A dataset consisting of known potent compounds for the target and decoys (typical ratio of 1:50) is docked, and the compounds are ranked by predicted binding affinity. The decoys resemble potent binders in molecular weight, number of atoms, and physicochemical properties, and are presumed non-binders. If the model ranks the known binders ahead of the decoys, it is considered predictive and has good potential for structure-based drug discovery. This approach can be used to choose between models before a virtual screening campaign (see Note 7).

The optimal test of a model or a set of models is to design and perform in vitro studies based on the models. Several iterative cycles of spatial feasibility testing, predictivity testing, and model adjustment may be necessary before in vitro testing. The in vitro testing may, for example, be site-directed mutagenesis combined with ligand binding studies, or testing of hits from a structure-based virtual screening campaign.
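The enrichment test described above can be quantified in several ways. The following is a minimal sketch, not the authors' protocol, that computes an enrichment factor and a ROC AUC from a ranked list of docked compounds. It assumes that a more negative docking score means a better predicted affinity; the function names and the default top fraction of 1% are illustrative choices.

```python
def enrichment_factor(scores, is_active, fraction=0.01):
    """Ratio of the active rate in the top-ranked fraction to the overall active rate."""
    # Assumption: lower (more negative) docking scores rank first.
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0])
    n_top = max(1, round(len(ranked) * fraction))
    hits_top = sum(active for _, active in ranked[:n_top])
    total_actives = sum(is_active)
    return (hits_top / n_top) / (total_actives / len(ranked))


def roc_auc(scores, is_active):
    """Probability that a randomly chosen active is ranked ahead of a random decoy."""
    actives = [s for s, a in zip(scores, is_active) if a]
    decoys = [s for s, a in zip(scores, is_active) if not a]
    # Count active/decoy pairs in which the active has the better (lower) score;
    # ties contribute one half.
    wins = sum((a < d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))
```

An enrichment factor above 1 and an AUC above 0.5 indicate that the model ranks known binders ahead of decoys more often than random selection would. Comparing these values across candidate models is one way to choose a model before a screening campaign.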

4 Notes

1. Careful selection among appropriate templates is necessary. The template closest in sequence to the target is not necessarily the most appropriate. Most transporter proteins (especially carriers) undergo substantial conformational changes during the transport cycle, and the template may not be in a conformation appropriate for the target model.
2. If the purpose is to study transporter-ligand interactions, it must be considered whether the templates represent conformations biased toward a ligand or a group of ligands (apo or holo structures).
3. Proper treatment of the binding site amino acids is important for the success rate when docking into carrier models based on low sequence identity between template and target. A docking protocol that takes the structural flexibility of the binding site into account (induced fit, ensemble docking) may increase the success rate.
4. The sequences used in the multiple-sequence alignment should be carefully selected so that the sequence conservation is not biased toward a subset of sequences within the family/subfamily. However, sequences lacking known 3D structures should also be included in the multiple-sequence alignment, since that will highlight the family/subfamily sequence conservation.
5. Manual adjustments of the multiple-sequence alignment may be necessary. Such an alignment must be based on a structural superimposition of known structures as a knowledge-based reference for the alignment.
6. Homology models of transporters need careful refinement. An aggressive global refinement process that uses molecular mechanics force field programs for energetic and structural refinement by MD or MC simulations may introduce uncertainties into conserved regions. This is particularly important if the homology between the template and target is low and binding sites for ions and water molecules are not conserved. MD and MC simulations used for refinement may then induce structural stress in conserved regions.
7. The ligand datasets used to validate models by docking should not be biased toward a subset of the known binders for the target.

References

1. Saier MH Jr, Reddy VS, Tsu BV, Ahmed MS, Li C, Moreno-Hagelsieb G (2016) The Transporter Classification Database (TCDB): recent advances. Nucleic Acids Res 44(D1):D372–D379. https://doi.org/10.1093/nar/gkv1103
2. Brown D (2017) The discovery of water channels (aquaporins). Ann Nutr Metab 70(Suppl 1):37–42. https://doi.org/10.1159/000463061
3. Vinothkumar KR, Henderson R (2010) Structures of membrane proteins. Q Rev Biophys 43(1):65–158. https://doi.org/10.1017/S0033583510000041
4. Niitsu A, Heal JW, Fauland K, Thomson AR, Woolfson DN (2017) Membrane-spanning alpha-helical barrels as tractable protein-design targets. Philos Trans R Soc Lond Ser B Biol Sci 372(1726):20160213. https://doi.org/10.1098/rstb.2016.0213
5. Liu Y, Wang K (2019) Exploiting the diversity of ion channels: modulation of ion channels for therapeutic indications. Handb Exp Pharmacol 260:187–205. https://doi.org/10.1007/164_2019_333
6. Sigel E, Steinmann ME (2012) Structure, function, and modulation of GABA(A) receptors. J Biol Chem 287(48):40224–40231. https://doi.org/10.1074/jbc.R112.386664

7. Masiulis S, Desai R, Uchanski T, Martin IS, Laverty D, Karia D, Malinauskas T, Zivanov J, Pardon E, Kotecha A, Steyaert J, Miller KW, Aricescu AR (2019) GABA(A) receptor signalling mechanisms revealed by structural pharmacology. Nature 565(7740):454–459
8. Laverty D, Desai R, Uchanski T, Masiulis S, Stec WJ, Malinauskas T, Zivanov J, Pardon E, Steyaert J, Miller KW, Aricescu AR (2019) Cryo-EM structure of the human alpha 1 beta 3 gamma 2 GABA(A) receptor in a lipid bilayer. Nature 565(7740):516–520
9. Alexander SP, Peters JA, Kelly E, Marrion NV, Faccenda E, Harding SD, Pawson AJ, Sharman JL, Southan C, Davies JA, Collaborators C (2017) The concise guide to PHARMACOLOGY 2017/18: ligand-gated ion channels. Br J Pharmacol 174(Suppl 1):S130–S159. https://doi.org/10.1111/bph.13879
10. Cesar-Razquin A, Snijder B, Frappier-Brinton T, Isserlin R, Gyimesi G, Bai X, Reithmeier RA, Hepworth D, Hediger MA, Edwards AM, Superti-Furga G (2015) A call for systematic research on solute carriers. Cell 162(3):478–487. https://doi.org/10.1016/j.cell.2015.07.022
11. Lusvarghi S, Robey RW, Gottesman MM, Ambudkar SV (2020) Multidrug transporters: recent insights from cryo-electron microscopy-derived atomic structures and animal models. F1000Research 9. https://doi.org/10.12688/f1000research.21295.1
12. Friedmann MH (2008) Facilitated diffusion: channels and carriers. In: Principles and models of biological transport. Springer, pp 111–179
13. Aggarwal S, Mortensen OV (2017) In vitro assays for the functional characterization of the dopamine transporter (DAT). Curr Protoc Pharmacol 79:12.17.11–12.17.21. https://doi.org/10.1002/cpph.33
14. Otte C, Gold SM, Penninx BW, Pariante CM, Etkin A, Fava M, Mohr DC, Schatzberg AF (2016) Major depressive disorder. Nat Rev Dis Primers 2:16065. https://doi.org/10.1038/nrdp.2016.65
15. Takano H (2018) Cognitive function and monoamine neurotransmission in schizophrenia: evidence from positron emission tomography studies. Front Psych 9:228. https://doi.org/10.3389/fpsyt.2018.00228
16. German CL, Baladi MG, McFadden LM, Hanson GR, Fleckenstein AE (2015) Regulation of the dopamine and vesicular monoamine transporters: pharmacological targets and implications for disease. Pharmacol Rev 67(4):1005–1024. https://doi.org/10.1124/pr.114.010397
17. Faraone SV (2018) The pharmacology of amphetamine and methylphenidate: relevance to the neurobiology of attention-deficit/hyperactivity disorder and other psychiatric comorbidities. Neurosci Biobehav Rev 87:255–270. https://doi.org/10.1016/j.neubiorev.2018.02.001
18. Rask-Andersen M, Almen MS, Schioth HB (2011) Trends in the exploitation of novel drug targets. Nat Rev Drug Discov 10(8):579–590. https://doi.org/10.1038/nrd3478
19. Goodsell DS, Zardecki C, Di Costanzo L, Duarte JM, Hudson BP, Persikova I, Segura J, Shao C, Voigt M, Westbrook JD, Young JY, Burley SK (2020) RCSB Protein Data Bank: enabling biomedical research and drug discovery. Protein Sci 29(1):52–65. https://doi.org/10.1002/pro.3730
20. Yamashita A, Singh SK, Kawate T, Jin Y, Gouaux E (2005) Crystal structure of a bacterial homologue of Na+/Cl--dependent neurotransmitter transporters. Nature 437(7056):215–223. https://doi.org/10.1038/nature03978
21. Singh SK, Piscitelli CL, Yamashita A, Gouaux E (2008) A competitive inhibitor traps LeuT in an open-to-out conformation. Science 322(5908):1655–1661. https://doi.org/10.1126/science.1166777

22. Coleman JA, Green EM, Gouaux E (2016) X-ray structures and mechanism of the human serotonin transporter. Nature 532(7599):334–339. https://doi.org/10.1038/nature17629
23. Garcia-Nafria J, Tate CG (2020) Cryo-electron microscopy: moving beyond X-ray crystal structures for drug receptors and drug development. Annu Rev Pharmacol Toxicol 60:51–71. https://doi.org/10.1146/annurev-pharmtox-010919-023545
24. Hediger MA, Clemencon B, Burrier RE, Bruford EA (2013) The ABCs of membrane transporters in health and disease (SLC series): introduction. Mol Asp Med 34(2–3):95–107. https://doi.org/10.1016/j.mam.2012.12.009
25. Mullins JG (2012) Structural modelling pipelines in next generation sequencing projects. Adv Protein Chem Struct Biol 89:117–167. https://doi.org/10.1016/B978-0-12-394287-6.00005-7
26. Kmiecik S, Gront D, Kolinski M, Wieteska L, Dawid AE, Kolinski A (2016) Coarse-grained protein models and their applications. Chem Rev 116(14):7898–7936. https://doi.org/10.1021/acs.chemrev.6b00163
27. Casadio R, Fariselli P, Martelli PL, Tasco G (2007) Thinking the impossible: how to solve the protein folding problem with and without homologous structures and more. Methods Mol Biol 350:305–320. https://doi.org/10.1385/1-59745-189-4:305
28. Venclovas C (2012) Methods for sequence-structure alignment. Methods Mol Biol 857:55–82. https://doi.org/10.1007/978-1-61779-588-6_3
29. Stamm M, Staritzbichler R, Khafizov K, Forrest LR (2014) AlignMe–a membrane protein sequence alignment web server. Nucleic Acids Res 42(Web Server issue):W246–W251. https://doi.org/10.1093/nar/gku291
30. Hill JR, Deane CM (2013) MP-T: improving membrane protein alignment for structure prediction. Bioinformatics 29(1):54–61. https://doi.org/10.1093/bioinformatics/bts640
31. Muhammed MT, Aki-Yalcin E (2019) Homology modeling in drug discovery: overview, current applications, and future perspectives. Chem Biol Drug Des 93(1):12–20. https://doi.org/10.1111/cbdd.13388
32. Cardozo T, Totrov M, Abagyan R (1995) Homology modeling by the ICM method. Proteins 23(3):403–414. https://doi.org/10.1002/prot.340230314
33. Abagyan R, Totrov M, Kuznetsov D (1994) ICM – a new method for protein modeling and design – applications to docking and structure prediction from the distorted native conformation. J Comput Chem 15(5):488–506
34. Jacobson MP, Pincus DL, Rapp CS, Day TJF, Honig B, Shaw DE, Friesner RA (2004) A hierarchical approach to all-atom protein loop prediction. Proteins Struct Funct Bioinform 55(2):351–367
35. Sali A, Blundell TL (1993) Comparative protein modeling by satisfaction of spatial restraints. J Mol Biol 234(3):779–815
36. Gabrielsen M, Ravna AW, Kristiansen K, Sylte I (2012) Substrate binding and translocation of the serotonin transporter studied by docking and molecular dynamics simulations. J Mol Model 18(3):1073–1085. https://doi.org/10.1007/s00894-011-1133-1
37. Gabrielsen M, Kurczab R, Siwek A, Wolak M, Ravna AW, Kristiansen K, Kufareva I, Abagyan R, Nowak G, Chilmonczyk Z, Sylte I, Bojarski AJ (2014) Identification of novel serotonin transporter compounds by virtual screening. J Chem Inf Model 54(3):933–943. https://doi.org/10.1021/ci400742s
38. Gabrielsen M, Kurczab R, Ravna AW, Kufareva I, Abagyan R, Chilmonczyk Z, Bojarski AJ, Sylte I (2012) Molecular mechanism of serotonin transporter inhibition elucidated by a new flexible docking protocol. Eur J Med Chem 47(1):24–37. https://doi.org/10.1016/j.ejmech.2011.09.056
39. Warszycki D, Rueda M, Mordalski S, Kristiansen K, Satala G, Rataj K, Chilmonczyk Z, Sylte I, Abagyan R, Bojarski AJ (2017) From homology models to a set of predictive binding pockets-a 5-HT1A receptor case study. J Chem Inf Model 57(2):311–321. https://doi.org/10.1021/acs.jcim.6b00263
40. Ravna AW, Sylte I, Sager G (2009) Binding site of ABC transporter homology models confirmed by ABCB1 crystal structure. Theor Biol Med Model 6:20. https://doi.org/10.1186/1742-4682-6-20
41. Ravna AW, Sylte I, Kristiansen K, Dahl SG (2006) Putative drug binding conformations of monoamine transporters. Bioorg Med Chem 14(3):666–675. https://doi.org/10.1016/j.bmc.2005.08.054
42. Jaronczyk M, Wolosewicz K, Gabrielsen M, Nowak G, Kufareva I, Mazurek AP, Ravna AW, Abagyan R, Bojarski AJ, Sylte I, Chilmonczyk Z (2012) Synthesis, in vitro binding studies and docking of long-chain arylpiperazine nitroquipazine analogues, as potential serotonin transporter inhibitors. Eur J Med Chem 49:200–210. https://doi.org/10.1016/j.ejmech.2012.01.012

43. Gabrielsen M, Wolosewicz K, Zawadzka A, Kossakowski J, Nowak G, Wolak M, Stachowicz K, Siwek A, Ravna AW, Kufareva I, Kozerski L, Bednarek E, Sitkowski J, Bocian W, Abagyan R, Bojarski AJ, Sylte I, Chilmonczyk Z (2013) Synthesis, antidepressant evaluation and docking studies of long-chain alkylnitroquipazines as serotonin transporter inhibitors. Chem Biol Drug Des 81(6):695–706. https://doi.org/10.1111/cbdd.12116
44. Freyd T, Warszycki D, Mordalski S, Bojarski AJ, Sylte I, Gabrielsen M (2017) Ligand-guided homology modelling of the GABAB2 subunit of the GABAB receptor. PLoS One 12(3):e0173889. https://doi.org/10.1371/journal.pone.0173889
45. Baglo Y, Gabrielsen M, Sylte I, Gederaas OA (2013) Homology modeling of human gamma-butyric acid transporters and the binding of pro-drugs 5-aminolevulinic acid and methyl aminolevulinic acid used in photodynamic therapy. PLoS One 8(6):e65200. https://doi.org/10.1371/journal.pone.0065200
46. Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, Heer FT, de Beer TAP, Rempfer C, Bordoli L, Lepore R, Schwede T (2018) SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res 46(W1):W296–W303. https://doi.org/10.1093/nar/gky427
47. Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJ (2015) The Phyre2 web portal for protein modeling, prediction and analysis. Nat Protoc 10(6):845–858. https://doi.org/10.1038/nprot.2015.053
48. Venko K, Roy Choudhury A, Novic M (2017) Computational approaches for revealing the structure of membrane transporters: case study on bilitranslocase. Comput Struct Biotechnol J 15:232–242. https://doi.org/10.1016/j.csbj.2017.01.008
49. Almeida JG, Preto AJ, Koukos PI, Bonvin A, Moreira IS (2017) Membrane proteins structures: a review on computational modeling tools. Biochim Biophys Acta Biomembr 1859(10):2021–2039. https://doi.org/10.1016/j.bbamem.2017.07.008
50. Ebejer JP, Hill JR, Kelm S, Shi J, Deane CM (2013) Memoir: template-based structure prediction for membrane proteins. Nucleic Acids Res 41(Web Server issue):W379–W383. https://doi.org/10.1093/nar/gkt331
51. Kelm S, Shi J, Deane CM (2010) MEDELLER: homology-based coordinate generation for membrane proteins. Bioinformatics 26(22):2833–2840. https://doi.org/10.1093/bioinformatics/btq554
52. Chen KY, Sun J, Salvo JS, Baker D, Barth P (2014) High-resolution modeling of transmembrane helical protein structures from distant homologues. PLoS Comput Biol 10(5):e1003636. https://doi.org/10.1371/journal.pcbi.1003636
53. Nikolaev DM, Shtyrov AA, Panov MS, Jamal A, Chakchir OB, Kochemirovsky VA, Olivucci M, Ryazantsev MN (2018) A comparative study of modern homology modeling algorithms for rhodopsin structure prediction. ACS Omega 3(7):7555–7566
54. Schlessinger A, Welch MA, van Vlijmen H, Korzekwa K, Swaan PW, Matsson P (2018) Molecular modeling of drug-transporter interactions-an international transporter consortium perspective. Clin Pharmacol Ther 104(5):818–835. https://doi.org/10.1002/cpt.1174
55. Hooft RW, Vriend G, Sander C, Abola EE (1996) Errors in protein structures. Nature 381(6580):272. https://doi.org/10.1038/381272a0
56. Laskowski RA, Macarthur MW, Moss DS, Thornton JM (1993) Procheck – a program to check the stereochemical quality of protein structures. J Appl Crystallogr 26:283–291
57. Colovos C, Yeates TO (1993) Verification of protein structures – patterns of nonbonded atomic interactions. Protein Sci 2(9):1511–1519
58. Maiorov V, Abagyan R (1998) Energy strain in three-dimensional protein structures. Fold Des 3(4):259–269
59. Schmidt T, Bergner A, Schwede T (2014) Modelling three-dimensional protein structures for applications in drug design. Drug Discov Today 19(7):890–897. https://doi.org/10.1016/j.drudis.2013.10.027
60. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294(5540):93–96. https://doi.org/10.1126/science.1065659
61. Forrest LR, Tang CL, Honig B (2006) On the accuracy of homology modeling and sequence alignment methods applied to membrane proteins. Biophys J 91(2):508–517
62. Jardetzky O (1966) Simple allosteric model for membrane pumps. Nature 211(5052):969–970. https://doi.org/10.1038/211969a0
63. Krishnamurthy H, Gouaux E (2012) X-ray structures of LeuT in substrate-free outward-open and apo inward-open states. Nature 481(7382):469–474. https://doi.org/10.1038/nature10737
64. Shimamura T, Weyand S, Beckstein O, Rutherford NG, Hadden JM, Sharples D, Sansom MS, Iwata S, Henderson PJ, Cameron AD (2010) Molecular basis of alternating access membrane transport by the sodium-hydantoin transporter Mhp1. Science 328(5977):470–473. https://doi.org/10.1126/science.1186303
65. Perez C, Koshy C, Yildiz O, Ziegler C (2012) Alternating-access mechanism in conformationally asymmetric trimers of the betaine transporter BetP. Nature 490(7418):126–130. https://doi.org/10.1038/nature11403
66. Khelashvili G, Stanley N, Sahai MA, Medina J, LeVine MV, Shi L, De Fabritiis G, Weinstein H (2015) Spontaneous inward opening of the dopamine transporter is triggered by PIP2-regulated dynamics of the N-terminus. ACS Chem Neurosci 6(11):1825–1837. https://doi.org/10.1021/acschemneuro.5b00179
67. Kazmier K, Sharma S, Quick M, Islam SM, Roux B, Weinstein H, Javitch JA, McHaourab HS (2014) Conformational dynamics of ligand-dependent alternating access in LeuT. Nat Struct Mol Biol 21(5):472–479. https://doi.org/10.1038/nsmb.2816
68. Schlessinger A, Geier E, Fan H, Irwin JJ, Shoichet BK, Giacomini KM, Sali A (2011) Structure-based discovery of prescription drugs that interact with the norepinephrine transporter, NET. Proc Natl Acad Sci U S A 108(38):15810–15815. https://doi.org/10.1073/pnas.1106030108
69. Rueda M, Bottegoni G, Abagyan R (2009) Consistent improvement of cross-docking results using binding site ensembles generated with elastic network normal modes. J Chem Inf Model 49(3):716–725. https://doi.org/10.1021/ci8003732
70. Bottegoni G, Kufareva I, Totrov M, Abagyan R (2009) Four-dimensional docking: a fast and accurate account of discrete receptor flexibility in ligand docking. J Med Chem 52(2):397–406. https://doi.org/10.1021/jm8009958
71. Xiang Z (2006) Advances in homology protein structure modeling. Curr Protein Pept Sci 7(3):217–227. https://doi.org/10.2174/138920306777452312
72. Sauder JM, Arthur JW, Dunbrack RL Jr (2000) Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 40(1):6–22. https://doi.org/10.1002/(sici)1097-0134(20000701)40: 13.0.co;2-7

73. Levitt M (1992) Accurate modeling of protein conformation by automatic segment matching. J Mol Biol 226(2):507–533
74. Xiang ZX, Soto CS, Honig B (2002) Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proc Natl Acad Sci U S A 99(11):7432–7437
75. Xiang ZX, Honig B (2001) Extending the accuracy limits of prediction for side-chain conformations (vol 311, pg 421, 2001). J Mol Biol 312(2):419–419
76. Dorn M, e Silva MB, Buriol LS, Lamb LC (2014) Three-dimensional protein structure prediction: methods and computational strategies. Comput Biol Chem 53PB:251–276. https://doi.org/10.1016/j.compbiolchem.2014.10.001
77. Wong SWK, Liu JS, Kou SC (2017) Fast de novo discovery of low-energy protein loop conformations. Proteins 85(8):1402–1412. https://doi.org/10.1002/prot.25300
78. Krieger E, Joo K, Lee J, Lee J, Raman S, Thompson J, Tyka M, Baker D, Karplus K (2009) Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: four approaches that performed well in CASP8. Proteins 77(Suppl 9):114–122. https://doi.org/10.1002/prot.22570
79. Jamroz M, Kolinski A (2010) Modeling of loops in proteins: a multi-method approach. BMC Struct Biol 10:5. https://doi.org/10.1186/1472-6807-10-5
80. Dalton JA, Jackson RM (2010) Homology-modelling protein-ligand interactions: allowing for ligand-induced conformational change. J Mol Biol 399(4):645–661. https://doi.org/10.1016/j.jmb.2010.04.047
81. Fan H, Irwin JJ, Webb BM, Klebe G, Shoichet BK, Sali A (2009) Molecular docking screens using comparative models of proteins. J Chem Inf Model 49(11):2512–2527. https://doi.org/10.1021/ci9003706

Chapter 15

Modeling of SARS-CoV-2 Virus Proteins: Implications on Its Proteome

Manish Sarkar and Soham Saha

Abstract

COronaVIrus Disease 19 (COVID-19) is a severe acute respiratory syndrome (SARS) caused by SARS-CoV-2, a member of the beta coronaviruses. The SARS-CoV-2 virus is similar to previous SARS- and MERS-causing strains and has infected nearly six hundred and fifty million people all over the globe, while the death toll has crossed the six million mark (as of December 2022). In this chapter, we look at how computational modeling of the viral proteins can help us understand the various processes in the viral life cycle inside the host, an understanding that might provide key insights for mitigating this and future threats. This understanding helps us identify key targets for drug discovery and vaccine development.

Key words SARS-CoV-2, COVID-19, Homology modeling, Protein modeling, Template-based modeling predictions, Ab initio modeling, Protein-protein interactions, Molecular docking, Proteome

Abbreviations

3CL-pro     3C-like protease
Ace2        Angiotensin-converting enzyme II
Arg         Arginine
BLAST       Basic Local Alignment Search Tool
BST-2       Bone marrow stromal antigen type 2
CDK         Cyclin-dependent kinase
CHARMM      Chemistry at Harvard Macromolecular Mechanics
COVID-19    Coronavirus Disease 19
Cryo-EM     Cryo-electron microscopy
CTD         C terminal domain
E           Envelope protein
ERGIC       Endoplasmic reticulum–Golgi intermediate compartment
EVD         Extreme value distribution
GROMOS      GROningen MOlecular Simulation
IFN         Interferon
Leu         Leucine
Lys         Lysine
M           Membrane protein
MD          Molecular dynamics
MERS        Middle East Respiratory Syndrome
N           Nucleocapsid protein
NiRAN       Nidovirus RdRp-associated nucleotidyl transferase
NLS         Nuclear localization signal
NMR         Nuclear magnetic resonance
NRP         Neuropilin
Nsps        Nonstructural proteins
NTD         N terminal domain
ORF         Open reading frame
PBM         PDZ-binding motif
PD          Peptidase domain
Phe         Phenylalanine
Pro         Proline
PTM         Post-translational modification
RBD         Receptor binding domain
RBM         Receptor binding motif
RdRp        RNA-dependent RNA polymerase
RMSD        Root mean square deviation
S           Spike protein
SARS        Severe acute respiratory syndrome
Ser         Serine
SVM         Support vector machine
TBM         Template-based modeling
V-ATPase    Vacuolar-H+ ATPase
VSR         Viral suppressor of RNAi

Sławomir Filipek (ed.), Homology Modeling: Methods and Protocols, Methods in Molecular Biology, vol. 2627, https://doi.org/10.1007/978-1-0716-2974-1_15, © Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

A biological system is not an isolated system. Numerous factors, dependent and independent, work in unison to bring forth a biological function. Almost all macroscopic biological functions in a living organism are the result of coordinated molecular players, the proteins. Not only are proteins the product of years of accumulated evolutionary changes, but these molecules are also surprisingly adept at determining their own plasticity through motions and conformational changes. Proteins drive enzymatic activities [1], signal transduction [2], substance and ion transport [3, 4], and gene expression [5, 6], and also maintain the structural integrity of cells. Allosteric conformational changes upon binding, intermediate excited states, protein transport, and trafficking all represent the ability of a protein to undergo structural changes and function accordingly. Therefore, the structural determination of proteins provides knowledge about intriguing biological processes in the light of diseases and physiology, complementing rational drug design. The development of X-ray crystallography, cryo-electron microscopy (Cryo-EM), and NMR spectroscopy has opened up a vast pool of understanding about protein structures, but these techniques are time- and labor-intensive. In addition, challenges remain for protein structures that are embedded in membranes or are intrinsically disordered [7, 8].

Protein structural biology offers a different degree of complexity compared to the structure of the DNA double helix. Each protein has its own three-dimensional structure, and small changes in sequence often have a pronounced impact on the biophysical properties of the protein. Experimental validation of the different conditions that affect the protein structure-function relationship is a time-intensive process. Computational modeling of protein structure has therefore attracted a lot of attention, leading to the development of the field of structural bioinformatics, which complements experimental work on the structural biology of proteins. Modeling substantially reduces the time needed to obtain structural information on proteins and gives a comprehensive view of the dynamics of their conformational transitions [9–11].

There is a second and more compelling reason for computationally modeling protein structures. The development of high-fidelity and high-efficiency DNA sequencing methods has given rise to a plethora of new gene sequences. Experimental protein structure determination, on the other hand, inevitably lags behind, as determining all the structures is next to impossible. However, many proteins with sufficient sequence similarity exhibit similar three-dimensional structures, and comparative computational modeling allows us to extrapolate experimentally available structures to so-far uncharacterized, yet similar, protein sequences [12, 13]. Combining computational modeling with experimental structure determination therefore proves to be a powerful tool in our understanding of proteins and their interactions. It is not the atomic resolution or precision of a model that determines its usefulness, but the interpretations and conclusions that the model at hand can support [14].

2 Protein Modeling

The structure of a protein depends on two important factors: (i) the sequence of the protein, which determines its primary, secondary, tertiary, and quaternary structural features, and (ii) its intrinsic thermodynamic parameters, which play an important role in maintaining structural stability and integrity. Homology modeling becomes especially important when experimental methods of protein structure determination fail to keep pace with the ever-increasing coverage of GenBank and other nucleotide repositories. Homology modeling of protein structures provides a platform to bridge this gap and helps us understand the structural aspects of proteins involved in cellular and biological processes (see Table 1 for servers that assist in homology modeling). Protein families and superfamilies share homologous structures that are involved in a wide range of functions. The workflow of the entire methodology of homology modeling is represented in Fig. 1. Theoretical structure prediction of proteins is primarily of two types:

(a) Template-based structure prediction, which includes homology modeling.
(b) Ab initio structure prediction, which is template-free modeling.

Table 1 Servers for homology modeling: template-based modeling (TBM) and free modeling (FM)

Name          Method      Web address                                    Reference
Swiss Model   TBM         swissmodel.expasy.org/                         [135]
Modeller      TBM         salilab.org/modeller/                          [136]
ModWeb        TBM         modbase.compbio.ucsf.edu/scgi/modweb.cgi       [137]
RaptorX       TBM         raptorx.uchicago.edu/                          [138]
HHpred        TBM         hhpred.tuebingen.mpg.de/hhpred
Phyre2        TBM         sbg.bio.ic.ac.uk/~phyre2/html/                 [139]
I-TASSER      TBM + FM    zhanglab.ccmb.med.umich.edu/I-TASSER/          [140]
Rosetta       FM          rosettacommons.org/software/                   [141]
Robetta       FM          robetta.bakerlab.org/                          [142]
QUARK         FM          zhanglab.ccmb.med.umich.edu/QUARK/             [31]
GalaxyWeb     TBM         galaxy.seoklab.org/                            [143]
LOMETS2       TBM         https://zhanglab.ccmb.med.umich.edu/LOMETS/    [144]

2.1 Template-Based Structure Prediction

Template Selection It involves the identification of homologous protein structures of our protein of interest in the protein structure databases like PDB, SCOP, ModBase, and SWISSMODEL. Sequence alignments of the query sequence against protein sequence databases are performed depending on sequence identity or evolutionary linkage using the heuristic algorithms constituting the BLAST (Basic Local Alignment Search Tool) family of sequence search programs (Fig. 1a). The basic algorithm of BLAST involves removing the low complexity regions and repeats in the query sequence. The possible matches against the query sequence are listed and scored using the BLOSUM-62 matrix. The high-scoring

Structural Proteomics Using Protein Modeling

269

Fig. 1 (a) Flowchart demonstrating the steps of generalized protein structure modeling (homology-based and ab initio). Important servers and resources used in each step are shown as snapshots of the Web Pages and/or resources. (b) Tools used for structural validation after constructing a protein model are shown as snapshots of the webpages and/or resources

hits are searched in the database for exact matches, and the highscoring exact matches after the database search are extended on both sides to obtain HSPs (High Scoring segment pairs). The HSPs having the maximal scores are taken into consideration using the Gumbel extreme value distribution (EVD). Smith-Waterman local alignments between HSPs are visualized between the query and the target sequence from the database search. The other variants in the BLAST family of sequence search programs are the PSI-BLAST [15], PHI-BLAST [16], and DELTA-BLAST [17]. Alignment of the target and template sequences The target and the template sequences already aligned from the database comparison needs to be realigned using refined alignment algorithms like CLUSTAL, T-Coffee, Praline, and software like MEGA-X [18] to reach the optimal alignment. This is followed by manual alignment to improve the alignment quality. The CLUSTAL group of algorithms follows a heuristic approach to align two or more sequences. This methodology analyses the whole sequences exploiting the

Manish Sarkar and Soham Saha

UPGMA/neighbor-joining method, which helps in the formation of distance matrices for the sequence alignments [19]. Threading techniques: If no suitable template structure is obtained for the query sequence upon database search in PDB, sequence-to-structure alignment is done using a threading algorithm (Fig. 1a). This is a fold recognition process that is useful for template recognition because natural protein sequences fold into a limited number of conformations, which are easy to scan. The threading approach has four main steps:
(a) A dedicated database of structural templates is constructed from PDB, SCOP, ModBase, FSSP, or CATH after removing structures having sequence homology.
(b) A scoring function is generated to measure the significance of the sequence-to-template structure alignment based on environment fitness potential, mutation potential, pairwise potentials, secondary structure compatibilities, and gap penalties. Protein threading methodologies use a nonlinear scoring function which can be optimized by a dynamic programming algorithm [20]. More recently, eigenvalue decomposition of the contact maps of the target and the template has been used for optimal sequence-structure alignment, along with profile alignments, to increase accuracy [21].
(c) The query sequence is aligned with each template of the constructed library by optimizing the scoring function. The scoring function operates with its pairwise contact potential term and determines the accuracy of the aligned sequence-structure pairs.
(d) The threading alignment of the query sequence with a particular structure from the structural database is selected based on its statistical significance. The structure is then modeled by placing the backbone atoms of the query sequence at the corresponding aligned backbone positions of the structural template.
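The alignment step (c) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the scoring function decomposes into independent per-position terms with no pairwise contact potential, which is the situation in which plain dynamic programming is guaranteed to find the optimal alignment. The names `thread` and `toy_fitness`, the three-state environment labels (H, E, C), the residue preferences, and the gap penalty are hypothetical choices made for this example and are not taken from any published threading program.

```python
# Hypothetical residue preferences for helix (H), strand (E), and coil (C).
PREFERRED = {
    **{r: "H" for r in "ALEMQK"},
    **{r: "E" for r in "VIYFWT"},
    **{r: "C" for r in "GPNDS"},
}


def toy_fitness(residue, env):
    """Toy environment-fitness term: +1 if the residue suits the position, else -1."""
    return 1.0 if PREFERRED.get(residue) == env else -1.0


def thread(query, template_env, fitness, gap=-2.0):
    """Globally align query residues to template positions by dynamic programming.

    Returns (score, pairs), where pairs lists (query_index, template_index)
    for every aligned residue. Gaps carry a linear penalty (an assumption).
    """
    n, m = len(query), len(template_env)
    # best[i][j]: best score for query[:i] against template positions [:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], move[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        best[0][j], move[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1] + fitness(query[i - 1], template_env[j - 1]), "diag"),
                (best[i - 1][j] + gap, "up"),    # query residue left unaligned
                (best[i][j - 1] + gap, "left"),  # template position skipped
            ]
            best[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs
```

For example, `thread("ALVIG", "HHEEC", toy_fitness)` aligns every residue with its preferred environment and returns a score of 5.0. Once a pairwise contact term is added, the score of one aligned pair depends on the others, so this simple recursion no longer guarantees an optimal alignment.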
Main chain modeling: A basic structural framework of the target protein, consisting of the main chain atoms, is built by projecting the coordinates of the template protein structure onto the query sequence of the target protein. This is done by simply copying the coordinates of the template residues onto the backbone scaffold of the target protein. In other protocols, the backbone positions of all template structures whose sequence similarity exceeds a particular cut-off value are averaged to generate the final positions of the main chain atoms of the target structure [22]. Other methodologies [23] use spatial restraints like interatomic distances, bond lengths, bond angles, and dihedral angles of the residues to determine the positions of atoms. Hierarchical approaches

for structure prediction identify the template structures from PDB using LOMETS, a multi-threading program developed by the Zhang Lab [24]. Loop building: The alignment of the query and the template sequences may contain gaps, which are caused by insertions in the template sequence or deletions in the query sequence. In both cases, the missing residues need to be modeled to complete the target structure. There are two main approaches for loop modeling to fill the gaps with compatible loop structures:
(a) Knowledge-based approach: Protein structure databases, along with available threading libraries containing structural templates like MUSTER [25], are searched for loop conformations whose end residues match those of the target protein. If a suitable loop region flanked by residues similar to those flanking the gap in the target structure is found, then the main chain atomic coordinates of the loop region in the template structure are modeled onto the target sequence to close the gap.
(b) Energy-based approach: The limitations of the knowledge-based approach are overcome by the energy-based model-building approach when no ideal template for the target protein is obtained. Structural fragments are generated in the gap regions to model loops in compliance with neighboring structural elements, so that the structure is energetically favorable and structural integrity is maintained. An energy function based on molecular force fields is used for loop validation. The energy function is minimized using either Monte Carlo simulation or molecular dynamics techniques. The scoring system [26] used to evaluate modeled loops gives an estimate of their conformational energy. The conformational energy depends on factors like steric hindrance, and intramolecular and intermolecular interactions like hydrogen bonds and covalent, electrostatic, and van der Waals interactions.
Side chain modeling: The side chains of the target structure are modeled from similar structures by isosteric replacement of the side chains of the template structure. Highly conserved regions have similar side chain torsions, and thus their orientations can be directly projected onto the target structure. However, the most successful algorithms for inserting side chains are knowledge-based, relying on rotamer libraries [27] constructed from high-resolution experimentally determined protein structures. The orientation of each rotamer depends on its chemical environment, scored by energy

functions based on molecular force fields (CHARMM force fields, AMBER force fields). Refinement and optimization: The final three-dimensional model of the target protein is refined and optimized so as to reduce the Gibbs free energy of the protein toward that of its native state. This is done by releasing force constraints due to structural anomalies present in the structure prior to refinement. After this step, a final overall refinement of the structure is done using GROMOS [28], a molecular dynamics simulation package, to energetically minimize the protein structure in a realistic environment. Rigid body modeling of the protein structure leads to structural distortions, steric clashes, and energetically unfavorable events like overlap of the van der Waals spheres of neighboring atoms. These are optimized by energy minimization functions representing the molecular force field. The generalized functional form of a molecular force field consists of two sets of energy terms: (a) covalent bond interactions and (b) long-range nonbonded interactions like van der Waals and electrostatic interactions. Present modeling algorithms use simulated annealing as an effective form of energy optimization. It consists of two steps: (a) an increase in the temperature of the system, allowing thermal expansion to a certain extent, and (b) a slow decrease in the temperature of the system, which drives it toward a global minimum energy state. Structural evaluation and validation: The final minimized three-dimensional structure of the modeled protein is validated for various structural features and parameters using validation algorithms (Fig. 1b). 
The parameters considered are as follows: (i) Ramachandran favored residues’ and outliers’ percentage of the main chain atoms, (ii) bond angles and lengths of the side chain atoms, their rotameric conformations, and outlier percentage, (iii) critical steric clashes between atomic pairs of different residues in the final structure, and (iv) Clash score, Molprobity Score (Molprobity) [29], or other appropriate scores.
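The simulated annealing scheme described under refinement can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the procedure used by any specific modeling package: the one-dimensional `toy_energy` landscape, the function name `simulated_annealing`, the geometric cooling factor, and the step size are all hypothetical. The heating phase is represented simply by starting at a high temperature `t_start`, and candidate moves are accepted with the Metropolis criterion.

```python
import math
import random


def toy_energy(x):
    """Hypothetical rugged 1D energy landscape with several local minima."""
    return 0.1 * x * x + math.sin(3.0 * x)


def simulated_annealing(energy, x0, t_start=5.0, t_end=1e-3,
                        cooling=0.995, step=0.5, seed=0):
    """Return (best_x, best_energy) found while cooling from t_start to t_end."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t_start  # (a) the system starts hot, so uphill moves are accepted often
    while t > t_end:
        candidate = x + rng.uniform(-step, step)
        e_cand = energy(candidate)
        # Metropolis criterion: always accept downhill moves, and accept
        # uphill moves with probability exp(-dE / T).
        if e_cand <= e or rng.random() < math.exp(-(e_cand - e) / t):
            x, e = candidate, e_cand
            if e < best_e:
                best_x, best_e = x, e
        t *= cooling  # (b) slow geometric cooling
    return best_x, best_e
```

Calling `simulated_annealing(toy_energy, x0=4.0)` returns the lowest-energy point visited during the run. In practice the search is stochastic and a finite cooling schedule does not guarantee that the global minimum is reached, so several runs with different seeds are commonly compared.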

2.2 Ab Initio Structure Prediction

The limitations of experimental methods often lead to a lack of suitable template structures having high sequence-structure alignment with the target protein locally or globally. In such cases, ab initio methodology is leveraged to model the structure (Fig. 1a). This technique depends on the parameters based on energies of different folded regions and conformations which are statistically significant compared to experimental data [30]. The initial step uses an energy function which is optimized for structural model estimation having stable thermodynamic parameters. This is followed by identification of the conformations having lowest energies. The structure which resembles the native structure of the

protein most significantly is selected from a set of structural decoys. After the initial model is constructed, further optimizations like loop remodeling and library- and energy-based side chain remodeling, followed by energy minimization and optimization, are performed to obtain the final structure. This is followed by rigorous structural validation using the pipelines described earlier [10]. Ab initio modeling is limited by protein size: the modeling algorithms can build structures efficiently only up to about 200 aa [31]. However, the development of AlphaFold [32], a neural network-based prediction algorithm, has led to higher prediction accuracy in ab initio modeling with no limitation on protein size, and it is still under validation by the CASP (Critical Assessment of Protein Structure Prediction) community.

3 The SARS-CoV-2 Proteome

SARS-CoV-2 is an RNA virus with a genome similar to that of previously observed coronaviruses. Multiple sequence alignment of the viral genomes and amino acid sequences of SARS-CoV-1 and SARS-CoV-2 indicates a 99.9% homology in both cases [33]. Compared to previous coronaviral strains, SARS-CoV-2 has shown wider contagion and a much higher efficiency of spread. The overwhelming majority of our understanding of SARS-CoV-2 genomes and proteomes is based on detailed characterizations of SARS-CoV-1 and 2 (2002–2003 and 2019, respectively) and other related coronaviruses (MERS-CoV, 2012) [34]. SARS-CoV-2 has 11 open reading frames (ORFs) encoded by 11 genes: ORF1ab (which undergoes proteolytic cleavage, forming 16 nonstructural proteins or nsps), ORF2 (spike protein), ORF3a, ORF4 (envelope protein), ORF5 (membrane protein), ORF6, ORF7a, ORF7b, ORF8, ORF9 (nucleocapsid protein), and ORF10. All the proteins of SARS-CoV-2 can be grouped into three large classes: a large polyprotein (ORF1ab, cleaved nonstructural protein products), structural proteins (S, E, M, and N), and accessory proteins (ORF3a, 6, 7ab, 8, and 10). The SARS-CoV-2 proteome is shown in Fig. 2.

3.1 Nonstructural Proteins (Nsp)

Viral Nsps are virus-encoded proteins that manipulate host cellular mechanisms. They are mostly proteases, replicases, helicases, and polymerases, along with other co-factors that assist in viral replication and processivity. The first gene of SARS-CoV-2 expresses a polyprotein, ORF1ab, comprising 16 Nsps. Nsp1 is the first protein of the polyprotein, best known as the “leader protein.” As in other coronaviruses, Nsp1 of SARS-CoV-2 binds to the mRNA channel of the 40S ribosome, inactivating translation and selectively promoting host mRNA degradation while keeping viral mRNA intact [35, 36]. Nsp2 is the second protein of ORF1ab,

Fig. 2 Schematic representation of the SARS-CoV-2 proteome. The polyprotein 1ab, structural proteins (S, E, M, and N), and accessory proteins (ORF3a, 6, 7ab, 8, and 10) are shown along the corresponding SARS-CoV-2 genome, together with their modeled structures obtained from the C-I-TASSER COVID-19 repository. Note that while experimental structure determination has been performed for certain SARS-CoV-2 proteins, almost all proteins have been modeled

which binds to host prohibitin 1 and prohibitin 2 [37]. This affects host cell cycle, cell migration, cellular differentiation, and apoptosis. Nsp3, the papain-like proteinase protein, is the largest protein encoded by the coronaviruses. It contains several conserved domains: domains binding ssRNA, ADRP, G-quadruplex, Nsp4, protease, and transmembrane domain. The papain-like protease releases the Nsp1, Nsp2, and Nsp3 from the N-terminal of polyprotein ORF1ab [38, 39]. Nsp3 interacts with Nsp4, affecting viral replication [40] and probably host membrane rearrangement. Nsp5 is the 3C-like proteinase (3CL-pro) which cleaves at eleven distinct sites to yield mature and intermediate Nsps [41]. The Nsp5 from SARS-CoV-1 and SARS-CoV-2 are functionally highly conserved [42]. Nsp7 forms a complex with Nsp8, and this Nsp7-Nsp8 heterodimer complexes with Nsp12 [43]. In addition to this heterodimer, a Nsp8 monomer also complexes with Nsp12, which ultimately forms the RNA polymerase Complex [44]. Nsp10 interacts and stimulates Nsp14, which is (guanine-N7) methyl transferase (N7-MTase) [45]. Furthermore, Nsp10 also stimulates Nsp16 activity, which is a 2′-O-methyltransferase [46]. Nsp12 is the RNA-dependent RNA polymerase (RdRp) that copies viral RNA, complexed with a Nsp7-Nsp8 heterodimer and a Nsp8 monomer [47]. The core replication machinery of SARS-CoV-2 has lower enzymatic activity than SARS-CoV-1, and the subunits show less thermostability in SARS-CoV-2 compared to SARS-CoV-1 [48]. Nsp13 is a helicase enzyme, which unwinds the duplex viral

RNA [49], and has a high-resolution crystal structure [50]. Furthermore, it was shown that Nsp12 (RdRp) can enhance the helicase activity of Nsp13 through direct interactions [50]. Nsp13 of SARS-CoV-1 and SARS-CoV-2 also shows 5′-triphosphatase activity, introducing the 5′-terminal cap to the viral mRNA [49, 51, 52], directly impacting splicing, nuclear export, translation, and mRNA stability. Nsp14 demonstrates 3′-5′ exoribonuclease activity and N7-methyltransferase activity in SARS-CoV-1 [53] and SARS-CoV-2 [54], and also plays a role in the 5′-cap introduction of the virus. Nsp15 of SARS-CoV-1 and 2 is a characterized endoribonuclease, which specifically targets and degrades the viral polyuridine sequences as a preventive measure against the host immune system [55, 56]. Nsp16 in SARS-CoV-1 and 2 forms a complex with Nsp10, protects viral mRNA from degradation, supports its translation, and promotes host immune evasion [57, 58].

3.2 Structural Proteins

The four structural proteins of SARS-CoV-2, as in all other members of the family, are the Spike (S) protein, Envelope (E) protein, Membrane (M) protein, and Nucleocapsid (N) protein. The S protein is a glycoprotein that mediates attachment of the virus to the host cell. S comprises two functional subunits responsible for binding to the host cell receptor (S1 subunit) and fusion of the viral and cellular membranes (S2 subunit). S interacts with host cell receptors, particularly angiotensin-converting enzyme II (ACE2), through its receptor binding domain (RBD). Determination of the structure of the SARS-CoV-2 RBD-ACE2 complex revealed important residues in the RBD that are essential for binding to ACE2 [59]. More recently, it was shown that the C-terminus of the S1 subunit produced by furin cleavage carries a [R/K]XX[R/K] motif which could bind to Neuropilin-1 and 2 (NRP1/2) [60], which are transmembrane receptors regulating biological processes like axon guidance, angiogenesis, and vascular permeability [61–63]. The smallest of the structural proteins of SARS-CoV-2 is the E protein (8–12 kDa). The hydrophobic transmembrane domain of the SARS-CoV-1 E protein forms pentameric ion channels with varying degrees of ion selectivity [64, 65]. The SARS-CoV-1 E protein can regulate a variety of functions associated with viral maturation and release by altering the ion homeostasis of the cellular organelles where it localizes [66]. The ion channeling (IC) activity remains an important determinant of SARS-CoV-2 viral pathogenesis. Furthermore, in silico protein modeling revealed that ion channels formed by the pentameric E protein can switch dynamically between closed and open states [67]. The SARS-CoV-1 M protein is an integral membrane protein that plays a role in viral assembly, assisting the generation of new virion particles by interacting with the nucleoproteins and the spike proteins at the budding site [68, 69] through M-M interactions [70].
Cryo-EM of M protein structure reveals that it can adopt two conformational

states, which regulate membrane curvature [70]. A pro-apoptotic property of the M protein has also been reported [71]. In SARS-CoV-2, the M protein is reported to antagonize type I and III IFN production, which plays a role in the attenuation of antiviral immunity and enhances viral replication [72]. In another study, it was shown that co-expression of the envelope protein, nucleocapsid protein, and membrane protein is a prerequisite for efficient production and release of virion particles [73]. The N protein of CoVs is highly conserved in the Coronaviridae family, showing a high degree of sequence homology (94%). The SARS-CoV-2 N protein acts as a viral suppressor of RNAi (VSR) by sequestering dsRNA in cells [74]. The coronavirus N protein encapsulates viral genomic RNAs as a protection mechanism and plays a role in viral RNA replication, especially at the initiation step. The SARS-CoV-1 N protein has signature binding motifs for cyclin and undergoes phosphorylation by cyclin-dependent kinase (CDK), thereby inhibiting CDK4-cyclin D complex activity [75].

3.3 Accessory Proteins

ORF3a is one of the major accessory proteins in coronaviruses and is a viroporin with IC activity, similar to the E protein. ORF3a also has a PDZ-binding motif (PBM) and is implicated in SARS-CoV-1 replication and virulence [76]. In addition, the induction of the proinflammatory cytokine storm observed in SARS has been attributed to ORF3a. The activation of the NLRP3 inflammasome through NF-κB activation and protein maturation has been shown to be mediated by ORF3a [77]. A recent ex vivo study also indicated that SARS-CoV-2 ORF3a can efficiently induce apoptosis in cell lines [78]. The cytoplasmic domain of ORF3a can also bind Ca2+ in vitro and mediate protein conformational changes [79]. ORF3a has also been shown to interact with Nsp8, which may impact the RNA polymerase activity [80]. ORF6 is a 63-residue amphipathic peptide involved in SARS-CoV-1 infection kinetics. ORF6 demonstrates two domain-specific functions: the N-terminal lipophilic part is implicated in accelerated viral growth, whereas the C-terminal hydrophilic part interferes with protein import into the nucleus [81]. ORF6 of SARS-CoV-2 inhibits STAT1 nuclear translocation, blocking IFN signaling [82, 83]. ORF7a from SARS coronavirus is a type I transmembrane protein which is expressed and segregated as intracellular domains and the transmembrane domain, playing a role in protein trafficking within the ER and Golgi network [84]. ORF7a was also shown to block the restriction activity of bone marrow stromal antigen type 2 (BST-2), leading to a loss of the antiviral potential of BST-2 [85]. Recently, ORF7a in SARS-CoV-2 was shown to inhibit type I IFN signaling via STAT2 phosphorylation [86]. On the other hand, ORF7b is an accessory protein encoded on bicistronic subgenomic RNA 7 [87]. ORF7b demonstrates the biochemical properties of an integral membrane protein and localizes in both cis- and

trans-Golgi networks [88]. During virion production, this protein is incorporated into new virus particles [87]. ORF8 encodes a single protein in SARS-CoV-2, while in SARS-CoV-1 it encodes two smaller proteins, 8a and 8b [89]. Protein 8ab (fusion of 8a and 8b) undergoes post-translational modifications like N-linked glycosylation and ubiquitination. Protein 8b undergoes rapid proteasomal degradation in the absence of 8a, suggesting that the 8ab complex confers stability [89]. In addition, high expression of 8b downregulates E protein expression, implying a role in replication and pathogenesis [90]. The expression of 8b and 8ab also enables viruses to overcome interferon (IFN) activation in hosts to achieve higher replication efficiency. Overexpression of 8b and 8ab resulted in inhibition of the IFN-β signaling pathway [91]. SARS-CoV-2 ORF8 alters the expression of surface MHC-I to evade the host immune response [92]. The ORF10 protein from SARS-CoV-2 comprises 38 amino acids and contains an α-helical region. ORF10 interacts with the cullin-2 (CUL2) RING E3 ligase complex, and possibly hijacks it for the ubiquitination of host proteins [93].

4 Models of Important SARS-CoV-2 Viral Proteins

Proteins with similar sequences share a similar structural and functional repertoire. However, with the SARS-CoV-2 outbreak taking the shape of a global pandemic, it is both necessary and crucial for us to understand the fundamental biology associated with this particular virus and to extend that understanding to develop effective treatments (vaccines or antiviral medications). As in all new viral outbreaks, the SARS-CoV-2 genome is highly mutable and some of the proteins do not have homologous templates for effective protein modeling. These aspects severely impact our understanding of the outbreak from a proteomic point of view. In this chapter, we will discuss some major protein modeling tools and compare some of the SARS-CoV-2 protein models with their experimentally available crystal or cryo-EM structures. We show a comparative structural analysis of three different structural modeling pipelines for these proteins: (i) C-I-TASSER, (ii) SWISS-MODEL, and (iii) GALAXYWEB.

TM-align and MM-align: TM-align [94] compares two protein structures from a conformational point of view, independent of their sequence homology. It generates an optimized amino acid alignment based on the structural similarity of the aligned residues. This protocol follows a heuristic dynamic programming iteration of structural alignments of the two protein structures (sequence/PDB files) to generate an optimal superposition. The TM-score measures the accuracy of the structural superposition and the degree of topological similarity. TM-score along with RMSD (root mean

square deviation) value of the aligned structures provides an estimate of global conformational similarity between the structures rather than of local misaligned structural regions. The TM-score is normalized by protein length, so that the score expected for two protein structures with almost no structural similarity does not depend on their size. The TM-score has a value from 0 to 1, where two identical protein structures with perfect structural superimposition have a value of 1. A value of more than 0.5 indicates significant structural similarity, which can be domain-specific. MM-align [95] is a heuristic, iterative, Needleman-Wunsch dynamic programming algorithm with modified restraints, used for structural superimposition of two protein complexes. The multiple chains are joined in all possible combinations and the best alignment is optimized, excluding cross-chain alignments. An overall TM-score is calculated to quantify the similarity between the structures.

4.1 Homology Modeling of Proteins Where High-Resolution Experimental Structures Are Available: Comparison of Homology Models with Experimental Structures

4.1.1 Nsp1

The Nsp1 of SARS-CoV-2 is the leader protein of the viral polypeptide 1ab and is known to have endonucleolytic activity in the 5′ UTR regions of the host cell mRNA. We aligned the experimentally obtained structure of this protein (PDB id: 6ZB4) with the homology model available from C-I-TASSER. The TM-score and RMSD of the superposition were 0.85 and 2.28, respectively (Fig. 3a(i)). The Nsp1 model of SARS-CoV-2 is structurally very similar to the experimental structure, pointing to the fidelity of homology modeling. The structural core is composed of a βαβ motif along with antiparallel beta sheets formed by the six beta strands. A hairpin loop forms between the third and fourth beta strands, surrounded by the antiparallel beta sheet. These structural features in the model closely resemble the experimental observations. In line with the above observations, structural alignment between the two homology models (C-I-TASSER and SWISS-MODEL) of the protein also showed high similarity (Fig. 3b(i); TM-score: 0.91, RMSD: 1.39).

4.1.2 3CL-pro (Nsp5)

For the 3C-like protease (3CL-pro), we aligned the experimentally known structure (PDB id: 7K3T) with the model generated by C-I-TASSER. The TM-score and RMSD of the superposition were both 0.97 (Fig. 3a(ii)). The protein has a chymotrypsin-like fold and consists of three domains: (i) the N-terminal flap region, which is composed of a Greek key motif followed by an antiparallel beta sheet (Fig. 3a(ii)), (ii) the catalytic site containing the flexible loop region, which has a cysteine-histidine catalytic triad at the active site (Fig. 3a(ii)), and (iii) the alpha-helical dimerization domain (Fig. 3a(ii)). Similarly, structural superposition between the two homology models (C-I-TASSER and SWISS-MODEL) of the protein showed a high degree of similarity (Fig. 3b(ii); TM-score: 0.98, RMSD: 0.92).
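The two metrics quoted for each superposition in this section can be made concrete with a short Python sketch. It assumes that a structural alignment and superposition have already been performed (for example by TM-align) and that the distances, in angstroms, between the aligned Cα pairs are available as a plain list. The function names `tm_d0`, `tm_score`, and `rmsd` are hypothetical and chosen for this illustration. The TM-score reported by the alignment programs is the maximum of this quantity over superpositions, whereas the sketch evaluates it for a single given superposition.

```python
import math


def tm_d0(l_target):
    """Length-dependent distance scale d0 (angstroms) used to normalize the TM-score."""
    if l_target <= 15:
        raise ValueError("this sketch assumes a target chain longer than 15 residues")
    return 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, l_target):
    """TM-score of one superposition.

    distances: distances between aligned residue pairs after superposition.
    l_target:  length of the structure used for normalization; residues
               without an aligned partner contribute nothing to the sum.
    """
    d0 = tm_d0(l_target)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target


def rmsd(distances):
    """Root mean square deviation over the aligned residue pairs only."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))
```

Because every aligned pair contributes at most 1 and the sum is divided by the full target length, a perfect superposition of all residues gives a TM-score of 1, while a perfect superposition of only half of the residues gives 0.5. RMSD, in contrast, is computed over the aligned pairs alone and is dominated by the largest deviations, which is why the two values are usually reported together.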

Fig. 3 (a) Superimposition of experimentally derived structures (in red) of certain SARS-CoV-2 proteins with their corresponding modeled structures (in blue) from the C-I-TASSER COVID-19 repository: (i) nsp1 (green arrow: βαβ motif; purple arrow: hairpin loop); (ii) 3CL-pro (green arrow: flap region; purple arrow: catalytic serine-histidine triad; yellow arrow: dimerization domain); (iii) spike protein (green arrow: receptor binding domain (RBD); purple arrow: trimerization domain); (iv) Nucleocapsid protein (green arrow: NTD; purple arrow: CTD; yellow arrow: linker region); and (v) RdRp protein (green arrow: NiRAN domain; purple arrow: polymerase domain). Note the metrics of comparison RMSD and TM-score for each superposition. (b) Superimposition of modeled protein structures from the modeling servers: C-I-TASSER (in blue), and SWISS-MODEL (in red)— (i) nsp1, (ii) 3CL-pro, (iii) spike protein, (iv) nucleocapsid protein, and (v) RdRp protein. Similarly, note the metrics of comparison: RMSD and TM-score for each superposition. (c) Comparison of the predicted models of the envelope (E) protein from the following servers: (i) C-I-TASSER (in blue) and SWISS-MODEL (in red) (ii) C-ITASSER (in blue) and GALAXYWEB (in green) and (iii) SWISS-MODEL (in red) and GALAXYWEB (in green) (green

4.1.3 S Protein

The experimental structure (PDB id: 6ZB4) of the S protein aligns closely with the C-I-TASSER model, as the TM-score and RMSD of the superposition were 0.83 and 4.23, respectively (Fig. 3a(iii)). Homology-modeled structures of the S protein can complement the crystal structures in our understanding of the protein. The S protein is a trimer, with each monomer consisting of two subunits: S1 and S2. The RBD is proximal to the C-terminal end of the S1 subunit (Fig. 3a(iii)). The RBD-ACE2 structure determined by X-ray crystallography revealed important residues in the RBD that are essential for binding to ACE2 [59]. The RBD contains a receptor binding motif (RBM) which interacts directly with ACE2 through networks of hydrophilic interactions [50]. Similar crystallographic studies also revealed that, compared to the CoV-1 RBD, the CoV-2 RBD is more compact and carries several residue variations that stabilize the RBD-ACE2 interface [96]. Cryo-EM investigations of the CoV-2 spike protein also indicated the presence of a furin cleavage site at the S1/S2 boundary, which undergoes cleavage during biogenesis [97]. Accordingly, structural alignment between the two homology models (C-I-TASSER and SWISS-MODEL) of the protein showed high similarity (Fig. 3b(iii); TM-score: 0.84, RMSD: 4.64).

4.1.4 N Protein

The experimental structure of the partial N protein (PDB id: 6M3M) and the homology model of the full-length protein generated by C-I-TASSER have a TM-score and RMSD of 0.79 and 2.41, respectively (Fig. 3a(iv)). Both values lie within the acceptable range; in particular, the TM-score is well above the 0.5 threshold for significant structural similarity. This example highlights the capability of homology modeling to generate full-length structures, which could give a more complete picture of the protein. The N protein binds the viral RNA and interacts with the M protein to take part in viral processes. The nuclear localization signal (NLS) is present at the C-terminal end of the N protein. The protein has three well-segregated yet conserved domains [98]: (i) the N-terminal domain (NTD), which confers the RNA binding activity of the protein, (ii) the C-terminal domain (CTD), which is responsible for dimerization, and (iii) the central intrinsically disordered region, which acts as a linker between the two other domains (Fig. 3a(iv)). This domain, rich in Ser/Arg, is

Fig. 3 (continued) arrow: transmembrane domain; purple arrow: C terminal domain). Note the TM-scores and RMSD values of each structural alignment from different modeling servers. (d) Structural alignment of the modeled structures of the membrane (M) protein (i) QUARK (in red) and C-I-TASSER (in blue) (green arrow: central α-domain) and (ii) C-QUARK (in red) and C-I-TASSER (in blue). Note the metrics of comparison. Structural superposition of the predicted models of SARS-CoV-2 protein complexes with the corresponding experimentally determined structures. (e) Nsp7-Nsp8 tetrameric unit from hexadecameric model of SWISS-MODEL repository with hetero-tetrameric crystal structure of nsp7-nsp8 complex of SARS-CoV-2. (f) Modeled Nsp7-Nsp8-Nsp12 complex from SWISS-MODEL repository with the cryo-EM structure of the replication machinery. (g) Modeled structure of Spike-Ace2 entry complex with the cryo-EM structure of the complex from SARS-CoV-2. Note the RMSD and TM-score for the complexes after structural superposition using MM-Align

important for post-translational modifications (PTMs) like SUMOylation and phosphorylation [99]. Comparison of the two homology models of the N protein shows a high degree of structural similarity (Fig. 3b(iv); TM-score: 0.96, RMSD: 0.95).

4.1.5 RdRp (Nsp12)

The experimental structure of the RdRp (PDB id: 6XQB) and its homology model from C-I-TASSER show a high degree of structural similarity (Fig. 3a(v); TM-score: 0.99 and RMSD: 0.86). This shows that structure-function relationship studies can be performed using modeled structures. The RdRp consists of the following: (i) an N-terminal nidovirus RdRp-associated nucleotidyl transferase (NiRAN) domain (Fig. 3a(v)) [44], (ii) the interface domain connecting the NiRAN domain with the polymerase domain, and (iii) the C-terminal catalytic domain, which has the classical right-hand architecture found in other viral RNA polymerases and consists of three subdomains: (a) thumb, (b) palm, and (c) finger (Fig. 3a(v)). Structural superimposition of the two homology models showed a high degree of similarity, as seen from the metrics (Fig. 3b(v); TM-score: 0.99, RMSD: 0.86).

4.2 Homology Modeling of Envelope (E) Protein Where High-Resolution Structure Is Unavailable: Comparison Between Different Homology Models

High-resolution experimental structures of the E protein are not available, as there are many challenges to the effective purification of membrane proteins. Homology modeling provides a strong alternative for addressing the structural aspects of this protein. C-I-TASSER, SWISS-MODEL, and the GalaxyWeb server have been used to generate homology models of the E protein using previously determined NMR structures as templates (PDB id: 5X29, 7K3G). All these modeled structures have high degrees of structural similarity, leading to effective alignment. The TM-score and RMSD values for the alignments are 0.36 and 2.94 between C-I-TASSER and SWISS-MODEL (Fig. 3c(i)), 0.35 and 3.55 between C-I-TASSER and GalaxyWeb (Fig. 3c(ii)), and 0.59 and 1.39 between SWISS-MODEL and GalaxyWeb (Fig. 3c(iii)), respectively. The protein has a small N-terminal region followed by a transmembrane domain, which pentamerizes to form pores in the membrane. This is followed by the large C-terminal domain, which is important for interactions through its PBM [100]. There are several sites of PTMs in this domain: (i) phosphorylation, which might affect several protein-protein interactions, and (ii) glycosylation, which might lead to a conformational change in the protein, defining its function as a membrane topology modulator [101].

4.3 Ab Initio Protein Modeling of Membrane Protein (M) Where No Experimental Structure Is Available

An experimental structure of the M protein of SARS-CoV-2 is not available due to its high hydrophobicity. Homology models have been generated by C-I-TASSER, which gives the only full-length structural representation of the protein. QUARK and C-QUARK have been explored to generate ab initio structural models. The TM-score and RMSD values of the alignment of the C-I-TASSER model with the QUARK model are 0.4 and 5.50, respectively (Fig. 3d(i)), while those of C-I-TASSER with C-QUARK are 0.31 and 5.29, respectively (Fig. 3d(ii)). This energy-based template-free modeling might be inaccurate from a structural point of view, but has a significant impact on functional annotation of the protein. The modeled structures all have a common central alpha helical domain consisting of three transmembrane alpha helices, while the N terminal ectodomain and the C terminal endodomain are located on opposite sides of the viral membrane [102]. There are glycosylation sites in the N terminal ectodomain whose functional significance is still unclear, and the truncated protein without the C terminal region is sufficient for its spatial localization along with the S protein in the ERGIC membrane [103].

5 Functional Implications of Protein Modeling

Experimental structure determination (NMR, X-ray diffraction, and Cryo-EM) and protein modeling share a common objective, namely, obtaining 3D atomic-level information from sequences. Protein models are useful in the rational design of bench-based molecular experiments. In silico approaches provide information and possible targets for site-directed mutagenesis and/or protein engineering. An understanding of the structure-function relationship in these aspects is crucial for predicting ligand-binding sites and for docking, and in fact forms the basis of rational drug design (small molecules and peptides) [104, 105]. Models also provide valuable information on the effects of mutations and SNPs [106]. However, the accuracy of predicted drug or molecular interactions depends on the accuracy of the side chain positions of the residues in the binding site [107]. We present some examples of how protein modeling has advanced our understanding of SARS-CoV-2 in general, and of the processes that affect its pathogenicity and virulence. Aspects of molecular understanding such as protein-protein interactions, active site predictions, molecular docking, complex formation and its functionality, and structural insights into mechanisms of action like ion channel activity are highlighted in the following parts.

5.1 Protein-Protein Interactions

Spike-Ace2: The entry of SARS-CoV-2 into human cells is mediated by its S glycoprotein, which interacts as a trimer with a dimeric Ace2 found on the surface of the host cells [108]. High-resolution crystal and Cryo-EM structures (PDB id: 6M0J, 6VW1) of this host-pathogen protein complex are available. High-resolution homology models of the Spike-Ace2 complex have also been predicted, and they align closely with the recent structures from SARS-CoV-2 (Fig. 3g; TM-score: 0.99 and RMSD: 0.21). This can be leveraged to understand intermolecular interactions across the interface regions of the Spike-Ace2 complex. The interface is formed by the Peptidase domain (PD) of Ace2 and the RBD (Fig. 4a-left) [108]. It is composed of the arch-shaped N terminal helix (ɑ1) of the PD of Ace2, spanned by an extension of the loop region of the RBD that forms a bridge-like structure (Fig. 4a-right) [108]. The RBD loop bridge forms a network of H-bonds between its own residues Q498, T500, and N501 and Y41, Q42, K353, and R357 of the Ace2 ɑ1 helix (Fig. 4a-right). K417 and Y453 of the RBD have polar interactions with E30 and H34 of Ace2, respectively, which are situated at the middle of the interface (Fig. 4a-right). The C terminal region of the Ace2-ɑ1 present in the interface of the complex interacts through an H-bond between Ace2-Q24 and RBD-Q474 and a Van der Waals interaction between F486 of the RBD and M82 of Ace2 (Fig. 4a-right) [108]. A wider range of residue pairs (RBD-Ace2) has been identified as mediating the interface interactions through H-bonds: N487-Q24, Q493-E35, Y505-E37, Y449-D38, G446-Q42, Y449-Q42, Y487-Y83, Y489-Y83, G502-K353, and Y505-K393 (Fig. 4a-right) [59]. The interface regions of the SARS-CoV-1 and SARS-CoV-2 complexes are nearly similar except for some variations in the RBD, which result in a lower Kd value and thus stronger binding for the spike protein of SARS-CoV-2 than for that of previous SARS-CoVs [59, 108]. The change of RBD-V404 of SARS-CoV-1 to K417 of SARS-CoV-2 might result in a stronger salt bridge with E30 of Ace2, thus strengthening the interface. The L472-to-F486 substitution in the spike RBD is thought to give a stronger Van der Waals interaction with Ace2-M82 [108]. Structural alignment is thus a powerful tool for understanding the nature of the interactions present at the interface of protein complexes.
It also gives us a comparative view of the Spike-Ace2 interface between different SARS-CoV strains with regard to the interactions and binding affinity of the two protein partners. Recently, the entry of SARS-CoV-2 into human hosts was shown to be facilitated by the human membrane receptors Neuropilin 1 (NRP1) and 2 (NRP2), which are present on a wide range of cell types in the body, such as respiratory epithelium and neurons [109].

5.2 Understanding the Functionality of Proteins

We have used homology modeling pipelines (MODELLER, GALAXYWEB, and SWISSMODEL) to generate lower energy state structural models of the SARS-CoV-2 E protein [67]. These were built by template-based modeling from the initial NMR structure (PDB id: 5X29) [110]. Key inter-helical residues in the structural orientation were identified that provide the basis of its viroporin activity [65]. The orientation of important residues explains how various structural and functional intermediates could be formed during its conformational changes [67]. The E protein of SARS-CoV-2 was shown to localize in the ERGIC of human cells as pentamers, similar to earlier reports for SARS-CoV-1 and MERS [64]. In addition, important structural domains of the E protein were identified from its homology model, such as the bottleneck region, the hydrophobic pocket, the central pore, and the N-terminal gate. Using membrane embedding, open and closed functional states were predicted, and continuous water channels were proposed as a mechanism of viroporin action (Fig. 4b) [67]. The structural insight into the putative mechanism of the SARS-CoV-2 E protein as a conformation-dependent ion channel could provide the basis for designing attenuated and inactivated vaccines.

Fig. 4 (a) Homology models of Spike-Ace2 entry complex from SWISS-MODEL repository showing the interface between the RBD and Ace2 (red box). (Right): Enlarged image of the region in the red box showing amino acids interacting in the interface. (b) Homology model of pentameric envelope (E) protein showing the presence of pore water implicated in its viroporin activity. (c) Active site prediction of 3CL-Pro proteinase for ATP binding using ATPbind (shown in red boxes). The ATPbind probability plot shows that two regions (Arg 40 and Leu250/253) have the highest ATP binding probability. (d) Molecular docking of ATP molecule in the ATP-binding pockets (red boxes in Panel c) of 3CL-Pro (surface groove identification). (e) Homology model of the nsp7-nsp8 tetrameric complex showing the interfaces of interactions (boxes i and ii): (i) Leucine zipper forming the dimerization interface of nsp7 and nsp8. (ii) Disulphide bond between Cys8 of the two nsp7-nsp8 protomers forming the tetramerization interface. The different regions are color coded. (f) Domain architecture of nsp12 and homology model of nsp7-nsp8-nsp12 complex from SWISS-MODEL repository showing specific structural features (boxes i and ii): (i) Beta hairpin loop is enlarged in red. (ii) Catalytic domain containing the fingertip loop (blue arrow) and loop extensions (black arrow). The different regions are color coded

5.3 Binding Site Predictions

The inference of protein functions from limited information on experimental structures is an important challenge in the post-genomic era. Homology-based modeling provides clues about individual residues in a structure for the prediction of function [111], as in the example of predicting catalytic sites [112]. These approaches use an evolutionary modeling technique and information from the structural neighborhoods of residues, such as conserved sequences, charge, solvent accessibility, centrality in the 3D structure, and hydrophobicity [112]. Charge interaction calculations can also be used to predict active sites, accounting for the pH-dependence of activity and transition state stabilization [113]. Beyond structural stability, optimization of scoring functions has also been employed in predicting ligand-binding and enzymatic active sites [114]. Various optimization algorithms, such as Monte Carlo, genetic algorithms, and simulated annealing, have been used for the prediction of multiple side-chain conformations. The orientation of ligands can vary based on hydrogen bonds and hydrophobic interactions in the active site [115]. We have shown an example of active site prediction using ATPbind from the I-TASSER server [116]. In brief, ATPbind employs template-based predictors like S-SITE and TM-SITE and protein sequence features like scoring matrices, predicted secondary structure, and solvent accessibility. Multiple support vector machines (SVMs) based on a random under-sampling technique predict the probability of presence or absence of ATP binding sites. ATPbind was benchmarked on 429 ATP binding proteins from the PDB database to evaluate the proposed binding sites against other existing predictors. We probed the 3CL-pro protein of SARS-CoV-2 for the existence of ATP binding sites (Fig. 4c-top), as the vacuolar-H+ ATPase (V-ATPase) G1 subunit was identified as a 3CL-pro-interacting protein [117]. Using the structure of ATP from PubChem (PubChem CID 5957), ATPbind predicted the potential binding sites on 3CL-pro: Pro39, Arg40, Leu250, and Leu253, each with a predicted ATP binding probability of 40% or more (Fig. 4c-bottom). These residues seemed to form an ATP binding pocket on the protein, which we later confirmed by molecular docking.

5.4 Molecular Docking

Molecular docking is used to predict ligand-receptor complex formation. This is generally achieved by sampling conformations of the ligand in the active site of the protein and ranking them with a scoring function [118]. The translational and rotational degrees of freedom, as well as the conformational degrees of freedom of both the ligand and the protein, allow a large number of possible binding modes. Different kinds of search algorithms are used: geometry-based [119], incremental fragment-based [120], fragment-based methods for de novo design (MCSS and LUDI) [121, 122], and stochastic search (Monte Carlo simulations) [123]. Molecular dynamics simulations can be used to refine the docked structures [124, 125]. Scoring functions can be divided into force-field-based, empirical, and knowledge-based. Empirical scoring functions use binding energy terms such as hydrogen bonding, ionic interaction, hydrophobic effect, and binding entropy [126, 127], while knowledge-based scoring functions consider interatomic contact frequencies and distances between the ligand and protein [128, 129]. More recently, consensus scoring has been used to combine several different scores to assess the quality of the docking [130].
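One simple way to combine several scores, as consensus scoring does, is to rank the poses under each scoring function and average the ranks. The sketch below illustrates this "rank-by-rank" idea only; it is not the specific scheme of [130], the pose names and score values are invented, and it assumes that a lower score means a more favorable pose for every function.

```python
def rank_poses(scores):
    """Map each pose to its rank; rank 1 is the lowest (most favorable) score."""
    ordered = sorted(scores, key=scores.get)
    return {pose: position + 1 for position, pose in enumerate(ordered)}


def consensus_rank(score_tables):
    """Average the rank of each pose across several scoring functions."""
    rank_tables = [rank_poses(table) for table in score_tables]
    poses = score_tables[0].keys()
    return {
        pose: sum(ranks[pose] for ranks in rank_tables) / len(rank_tables)
        for pose in poses
    }


# Hypothetical scores for three poses under three kinds of scoring function.
force_field = {"pose_a": -7.2, "pose_b": -6.1, "pose_c": -8.0}
empirical = {"pose_a": -5.5, "pose_b": -6.0, "pose_c": -5.9}
knowledge = {"pose_a": -110.0, "pose_b": -95.0, "pose_c": -120.0}

consensus = consensus_rank([force_field, empirical, knowledge])
best_pose = min(consensus, key=consensus.get)
print(best_pose, consensus)
```

Averaging ranks rather than raw scores avoids having to put scoring functions with different units and ranges on a common scale.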

5.4.1 ATP Binding Sites on 3CL-pro

Having predicted the favorable binding sites for ATP on 3CL-pro, we wanted to visualize them on the protein. We used the Autodock VINA module [131] to dock the ATP ligand (PubChem CID 5957) to the 3CL-pro model structure from C-I-TASSER. The search space for the VINA algorithm was pre-determined by the ATPbind module and defined by a three-dimensional box covering the region of interest in the protein structure. Visualization of the structures revealed that the docked poses were consistent with the binding sites predicted by ATPbind, and that ATP binding grooves were visible in the regions identified by Autodock VINA (Fig. 4d).
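A search box of this kind can be derived from the coordinates of the predicted binding residues. The following is a minimal sketch, assuming an axis-aligned box with a fixed padding around the selected atoms; the coordinates, the 5 Å padding, and the function name are hypothetical and are not taken from the chapter's actual setup.

```python
def docking_box(coords, padding=5.0):
    """Axis-aligned box around coords, expanded by `padding` Å on every side.

    coords: iterable of (x, y, z) atom positions in Å.
    Returns (center, size), each a 3-tuple in Å.
    """
    xs, ys, zs = zip(*coords)
    center = tuple((min(axis) + max(axis)) / 2.0 for axis in (xs, ys, zs))
    size = tuple((max(axis) - min(axis)) + 2.0 * padding for axis in (xs, ys, zs))
    return center, size


# Hypothetical Cα coordinates standing in for the predicted binding residues.
predicted_site = [
    (12.1, -3.4, 20.5),
    (14.8, -1.0, 22.3),
    (10.2, 2.6, 18.9),
    (11.5, 4.1, 21.7),
]
center, size = docking_box(predicted_site)
print(center, size)
```

The padding matters in practice: a box that hugs the residues too tightly can exclude valid ligand poses, while an oversized box makes the search less efficient.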

5.4.2 Nsp7-nsp8 Primase Complex

The nsp7-nsp8 hetero-tetrameric complex of SARS-CoV-2 has been crystallized (PDB id: 6YHU). The complex has also been predicted by homology modeling, using the experimental structure of the nsp7-nsp8 complex of SARS-CoV-1 [132] as the template, and is available in the SWISSMODEL repository (https://swissmodel.expasy.org/repository/species/2697049). The hexadecameric complex found in SARS-CoV-1 was not observed in SARS-CoV-2. Rather, the nsp7-nsp8 hetero-tetrameric structure of SARS-CoV-2 superposes well with an individual subunit of the hexadecameric complex from SARS-CoV-1 [133]. The modeled complex of the nsp7-nsp8 subunit was structurally aligned with the experimental structure from SARS-CoV-2 using MM-align. The TM-score is 0.87 and the RMSD value is 2.59 (Fig. 3e). The modeled hetero-tetramer has secondary structures and conformations similar to those of the X-ray structure of the complex (PDB id: 6YHU). However, homology modeling cannot determine the oligomeric status of a complex without a knowledge-based approach. The nsp7 consists of a helical bundle composed of three alpha helices (ɑ1–ɑ2–ɑ3) which are connected by loop regions (Fig. 4e). The nsp8 has two structural domains: (i) an ɑ-turn-ɑ motif (ɑ1–ɑ2) and (ii) a C-terminal domain which is composed of four antiparallel β-strands (β1–β4) with an ɑ helix (ɑ3) inserted into the predominantly beta folded region (Fig. 4e). The dimerization interface is around 1340 Å² and is composed of (i) ɑ1 and ɑ3 of nsp7 and (ii) ɑ1’ and ɑ2’ of nsp8. The tetramerization interface, around 950 Å², lies between the two nsp7-nsp8 dimers (2:2 stoichiometry) and is mediated by the ɑ1 and ɑ2 helices of nsp7 and the ɑ1’ helix of nsp8 (Fig. 4e). A leucine zipper motif is present in the dimerization interface (nsp7: Leu 56, 60 and 71 and nsp8: Leu 95 and 103), stabilized by two phenylalanine residues from the two interacting partners (Phe 49 of nsp7 and Phe 92 of nsp8) and other proximal hydrophobic residues (Fig. 4e(i)). The tetramerization interface is stabilized by disulphide bond formation between the symmetric Cys8 of nsp7 (Fig. 4e(ii)). It has a hydrophobic interior and a positive surface charge due to the presence of Lys and Arg residues from both nsp7 and nsp8. This hallmark feature of the primase complex in RNA viruses makes this region an ideal target for RNA binding [133].

5.5 Insights into Viral Replication Machinery

5.5.1 Nsp7-Nsp8-Nsp12 Replication Machinery

The experimental structure of the replication machinery of SARS-CoV-2 has been determined using Cryo-EM (PDB id: 7BW4) [48], and the homology model of this multi-subunit complex has been predicted by SWISSMODEL [132]. The predicted model of this multimeric complex was structurally aligned with its Cryo-EM structure using MM-align, which resulted in a TM-score of 0.98 and an RMSD value of 0.11 (Fig. 3f). These values show a high degree of structural similarity between the experimental structure and the homology model, supporting the accuracy of homology modeling for complex multimeric machinery. This multimeric complex consists of three different proteins, (i) nsp7, (ii) nsp8, and (iii) nsp12 (RdRp), which together fulfill the basic requirements of a functional replication machinery (Fig. 4f). The RdRp NiRAN domain is associated with a β-hairpin motif (Fig. 4f(i)) embedded into the groove formed by the NiRAN domain and the palm subdomain of the catalytic domain. The catalytic domain of RdRp is formed by seven conserved catalytic motifs (A–G) (Fig. 4f-Domain architecture). Motif A spans residues 611 to 626 and contains the phylogenetically conserved D618, which is important for divalent cation binding. The primary catalytic center is in Motif C, which spans residues 753 to 767 and contains the conserved catalytic site at 759–761 (SDD), similar to other viral RdRps [44]. Motif F forms a catalytic loop with extension loops that stabilize it (Fig. 4f(ii)). Motif G is a conserved feature of RdRp in several RNA viruses and interacts with the primer strand for initiation of primer-dependent RNA synthesis [48]. Two conserved metal binding motifs were identified, where zinc ions are reported to stabilize the RdRp structure [134]. The channel for NTP entry in the complex is lined by positively charged Arg and Lys residues in motif F, and the RNA template possibly enters the active site formed by the A and C motifs through a structural groove (Fig. 4f). The template is clamped by the F and G motifs to maintain directionality and processivity, while the E motif and the thumb domain provide the scaffold for the RNA primer strand (Fig. 4f) [44]. Motif F forms a long loop-structured finger extension found in the finger subdomain of RdRp. It consists of a finger-tip closed-ring catalytic loop stabilized spatially by associated finger extension loops intersecting the thumb subdomain (Fig. 4f(ii)) [44, 48]. RdRp is stabilized by interactions with (i) an nsp7-nsp8 heterodimer (nsp7-8.1) and (ii) an nsp8 protomer (nsp8.2) at two different sites. The nsp7-nsp8 heterodimer binds over the thumb domain and stacks in the thumb-finger interface, providing higher stability.
This restricts the spatial movement of the finger extension loops of the F motif of RdRp from both sides, which in turn clamps the catalytic finger near the active sites (Fig. 4f). This particular interface is mediated by interactions between nsp7 and nsp12, while nsp8.1 makes fewer contacts with the RdRp. The nsp8.2 protomer binds to the dorsal region of the finger domain and interacts with the interface domain of RdRp to stabilize the complex (Fig. 4f). Nsp8.1 and nsp8.2 are conformationally distinct because the N terminus region of nsp8.2 is refolded and has a different set of interactions with nsp12 than nsp7-8.1 does [48, 134].
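When inspecting a model, it can help to map residue numbers back to the motifs described in this section. The snippet below is a small illustrative lookup that encodes only the two motif ranges given explicitly in the text (A and C) together with the key residues named there; the data layout and function name are assumptions of this sketch, and the other motifs are omitted because their boundaries are not stated here.

```python
# Residue ranges (inclusive) quoted in the text for two RdRp catalytic motifs.
MOTIF_RANGES = {
    "A": (611, 626),
    "C": (753, 767),
}

# Key residues named in the text.
KEY_RESIDUES = {
    618: "conserved D618, divalent cation binding",
    759: "catalytic site SDD (759-761)",
    760: "catalytic site SDD (759-761)",
    761: "catalytic site SDD (759-761)",
}


def motif_of(residue_number):
    """Return the motif letter containing the residue, or None if outside A and C."""
    for motif, (start, end) in MOTIF_RANGES.items():
        if start <= residue_number <= end:
            return motif
    return None


for residue in (618, 700, 760):
    print(residue, motif_of(residue), KEY_RESIDUES.get(residue, ""))
```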

6 Summary of Methods

6.1 Steps of Homology Modeling

Homology modeling proceeds through the following steps.

1. Template selection: Homologous structures of the protein of interest that might be used for its modeling are identified in the protein structure databases. The accuracy of model prediction depends on the selection of the template structure.
2. Alignment of the target and template sequences: The target and the template sequences are realigned using refined alignment algorithms, followed by manual adjustment to improve the alignment quality.
3. Main chain modeling: A basic structural framework of the target protein is built, consisting of main chain atoms.
4. Loop building: The gaps in the alignment (“holes”) in the model of the target protein are filled by loop modeling.
5. Side chain modeling: After modeling of the main chain atoms, the side chain atoms of the undetermined regions of the protein structure are modeled using SCWRL (side-chain placement with a rotamer library) according to preferred conformations in Dunbrack’s rotamer library.
6. Refinement and optimization: The final three-dimensional model of the target protein is refined and optimized so as to reduce the Gibbs free energy of the protein toward that of its native state.
7. Structural evaluation and validation: The final minimized three-dimensional structure of the protein is validated by PROCHECK or MolProbity. The parameters considered are as follows: (i) the percentages of Ramachandran-favored residues and outliers for the main chain atoms, (ii) bond angles and lengths of the side chain atoms, their rotameric conformations, and the rotamer outlier percentage, (iii) critical steric clashes between atomic pairs of different residues in the final structure, and (iv) Clash score and MolProbity score (MolProbity).

6.2 ATPbind Steps

ATP binding site prediction on a protein is a traditional binary classification problem. The steps are described in detail in [116].

1. A position-specific scoring matrix (PSSM) is computed for each query sequence, using PSI-BLAST to search against the SwissProt database.
2. The predicted secondary structure (PSS) of the protein sequence is generated, giving for each residue the probabilities of belonging to three secondary structure classes (coil, helix, and strand).
3. The predicted solvent accessibility (PSA) of each residue is obtained in three classes: buried, intermediate, and exposed.
4. Protein templates and general-purpose binding sites are detected using the S-SITEatp-based feature.
5. An ATP-specific structure-template-based method derives the general-purpose binding sites by structurally comparing the query protein with template proteins, using the TM-SITEatp-based feature. It generates an ATP binding probability for each residue.
6. A support vector machine (SVM) is used to determine the probability of each residue belonging to the ATP binding site class.
7. The probabilities of the residue belonging to the ATP binding site class are fused by a mean ensemble (ME) scheme.
8. The sensitivity, specificity, accuracy, precision, and the Matthews correlation coefficient (MCC) are used to evaluate predictive ability.

6.3 Molecular Docking of ATP and 3CL-pro

1. The ATP molecular structure in .sdf format is obtained from the PubChem library (PubChem CID 5957). The .sdf file is converted to .mol2 format.
2. The model of the 3CL-pro protein of SARS-CoV-2 is obtained from the C-I-TASSER repository of COVID-19 (C-I-TASSER ID: QHD43415_5).
3. The protein structure is visualized in CHIMERA and docked to the ATP molecule using the following steps:
   (a) The search space for the docking algorithm is defined by a three-dimensional box with appropriate axis lengths that encompasses the region of interest in the protein structure. The region of the docking space is determined by the ATPbind process.
   (b) Docking of the ATP molecule to the 3CL-pro protein in the search space described previously is carried out using Autodock Vina, which uses an Iterated Local Search global optimizer with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm for local optimization to speed up its optimization procedures.
4. The docked conformations of the protein with ATP molecules are visualized using CHIMERA.
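The docking steps above can be captured in a small helper that writes an AutoDock Vina configuration file. This is a sketch under stated assumptions rather than the authors' actual setup: the file names, box center, and box size are hypothetical, and it assumes the receptor and ligand have already been converted to the PDBQT format that Vina reads. Only standard Vina configuration keys are used (`receptor`, `ligand`, `center_x/y/z`, `size_x/y/z`, `exhaustiveness`, `out`).

```python
def vina_config(receptor, ligand, center, size, out="docked.pdbqt", exhaustiveness=8):
    """Build the text of an AutoDock Vina configuration file.

    center, size: (x, y, z) tuples in Å describing the search box.
    """
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"center_x = {center[0]:.3f}",
        f"center_y = {center[1]:.3f}",
        f"center_z = {center[2]:.3f}",
        f"size_x = {size[0]:.3f}",
        f"size_y = {size[1]:.3f}",
        f"size_z = {size[2]:.3f}",
        f"exhaustiveness = {exhaustiveness}",
        f"out = {out}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical file names and box; replace with values for the actual system.
config_text = vina_config(
    receptor="3clpro_model.pdbqt",
    ligand="atp.pdbqt",
    center=(12.5, 0.3, 20.6),
    size=(20.0, 20.0, 20.0),
)
print(config_text)
```

The resulting text can be saved to a file, for example `conf.txt`, and passed to Vina with `vina --config conf.txt`. Keeping the box definition in a file makes the docking run easier to reproduce.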

Acknowledgments

We thank Victor Hannothiaux, Paul Etheimer, Leo Janin and Sophie Ameloot for proof-reading the manuscript for language and technical details. This work was supported by MedInsights SAS (SIRET: 91842274200018; SIREN: 918422742; www.medinsights.fr), Paris. We did not receive any funding for writing this book chapter from any sources.


References

1. Agarwal PK (2006) Enzymes: an integrated view of structure, dynamics and function. Microb Cell Factories 5:2. https://doi.org/10.1186/1475-2859-5-2
2. Lee MJ, Yaffe MB (2016) Protein regulation in signal transduction. Cold Spring Harb Perspect Biol 8. https://doi.org/10.1101/cshperspect.a005918
3. Ashcroft F, Gadsby D, Miller C (2009) Introduction. The blurred boundary between channels and transporters. Philos Trans R Soc Lond Ser B Biol Sci 364:145–147. https://doi.org/10.1098/rstb.2008.0245
4. Gadsby DC (2009) Ion channels versus ion pumps: the principal difference, in principle. Nat Rev Mol Cell Biol 10:344–352. https://doi.org/10.1038/nrm2668
5. Franco-Zorrilla JM, López-Vidriero I, Carrasco JL, Godoy M, Vera P, Solano R (2014) DNA-binding specificities of plant transcription factors and their potential to define target genes. Proc Natl Acad Sci U S A 111:2367–2372. https://doi.org/10.1073/pnas.1316278111
6. Latchman DS (1990) Eukaryotic transcription factors. Biochem J 270:281–289. https://doi.org/10.1042/bj2700281
7. Wright PE, Dyson HJ (2015) Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol 16:18–29. https://doi.org/10.1038/nrm3920
8. Uversky VN (2019) Intrinsically disordered proteins and their “mysterious” (meta)physics. Front Phys 7. https://doi.org/10.3389/fphy.2019.00010
9. Ma B, Tsai C-J, Haliloğlu T, Nussinov R (2011) Dynamic allostery: linkers are not merely flexible. Structure 19:907–917. https://doi.org/10.1016/j.str.2011.06.002
10. Schwede T (2013) Protein modeling: what happened to the “protein structure gap”? Structure 21:1531–1540. https://doi.org/10.1016/j.str.2013.08.007
11. Weinkam P, Pons J, Sali A (2012) Structure-based model of allostery predicts coupling between distant sites. Proc Natl Acad Sci U S A 109:4875–4880. https://doi.org/10.1073/pnas.1116274109
12. Guex N, Peitsch MC, Schwede T (2009) Automated comparative protein structure modeling with SWISS-MODEL and Swiss-PdbViewer: a historical perspective. Electrophoresis 30(Suppl 1):S162–S173. https://doi.org/10.1002/elps.200900140
13. Sánchez R, Sali A (1998) Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc Natl Acad Sci U S A 95:13597–13602. https://doi.org/10.1073/pnas.95.23.13597
14. Jalily Hasani H, Barakat K (2017) Homology modeling: an overview of fundamentals and tools. Int Rev Model Simul (IREMOS) 10:129. https://doi.org/10.15866/iremos.v10i2.11412
15. Bhagwat M, Aravind L (2007) PSI-BLAST tutorial. Methods Mol Biol 395:177–186. https://doi.org/10.1007/978-1-59745-514-5_10
16. Zhang Z, Schäffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV, Altschul SF (1998) Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res 26:3986–3990. https://doi.org/10.1093/nar/26.17.3986
17. Boratyn GM, Schäffer AA, Agarwala R, Altschul SF, Lipman DJ, Madden TL (2012) Domain enhanced lookup time accelerated BLAST. Biol Direct 7:12. https://doi.org/10.1186/1745-6150-7-12
18. Stecher G, Tamura K, Kumar S (2020) Molecular Evolutionary Genetics Analysis (MEGA) for macOS. Mol Biol Evol 37:1237–1239. https://doi.org/10.1093/molbev/msz312
19. Sievers F, Higgins DG (2018) Clustal Omega for making accurate alignments of many protein sequences. Protein Sci 27:135–145. https://doi.org/10.1002/pro.3290
20. Peng J, Xu J (2009) Boosting protein threading accuracy. Res Comput Mol Biol 5541:31–45. https://doi.org/10.1007/978-3-642-02008-7_3
21. Zheng W, Wuyun Q, Li Y, Mortuza SM, Zhang C, Pearce R, Ruan J, Zhang Y (2019) Detecting distant-homology protein structures by aligning deep neural-network based contact maps. PLoS Comput Biol 15:e1007411. https://doi.org/10.1371/journal.pcbi.1007411
22. Peitsch MC (1997) Large scale protein modelling and model repository. Proc Int Conf Intell Syst Mol Biol 5:234–236
23. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234:779–815. https://doi.org/10.1006/jmbi.1993.1626
24. Zheng W, Li Y, Zhang C, Pearce R, Mortuza SM, Zhang Y (2019) Deep-learning contact-map guided protein structure prediction in CASP13. Proteins 87:1149–1164. https://doi.org/10.1002/prot.25792
25. Wu S, Zhang Y (2008) MUSTER: improving protein sequence profile-profile alignments by using multiple sources of structure information. Proteins 72:547–556. https://doi.org/10.1002/prot.21945
26. Fiser A, Do RK, Sali A (2000) Modeling of loops in protein structures. Protein Sci 9:1753–1773. https://doi.org/10.1110/ps.9.9.1753
27. Dunbrack RL, Karplus M (1993) Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J Mol Biol 230:543–574. https://doi.org/10.1006/jmbi.1993.1170
28. Christen M, Hünenberger PH, Bakowies D, Baron R, Bürgi R, Geerke DP, Heinz TN, Kastenholz MA, Kräutler V, Oostenbrink C, Peter C, Trzesniak D, van Gunsteren WF (2005) The GROMOS software for biomolecular simulation: GROMOS05. J Comput Chem 26:1719–1751. https://doi.org/10.1002/jcc.20303
29. Williams CJ, Headd JJ, Moriarty NW, Prisant MG, Videau LL, Deis LN, Verma V, Keedy DA, Hintze BJ, Chen VB, Jain S, Lewis SM, Arendall WB, Snoeyink J, Adams PD, Lovell SC, Richardson JS, Richardson DC (2018) MolProbity: more and better reference data for improved all-atom structure validation. Protein Sci 27:293–315. https://doi.org/10.1002/pro.3330
30. Lee J, Wu S, Zhang Y (2009) Ab initio protein structure prediction. In: Rigden DJ (ed) From protein structure to function with bioinformatics. Springer Netherlands, Dordrecht, pp 3–25
31. Xu D, Zhang Y (2013) Toward optimal fragment generations for ab initio protein structure assembly: Ab Initio Fragment Generation. Proteins Struct Funct Bioinform 81:229–239. https://doi.org/10.1002/prot.24179
32. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Žídek A, Nelson AWR, Bridgland A, Penedones H, Petersen S, Simonyan K, Crossan S, Kohli P, Jones DT, Silver D, Kavukcuoglu K, Hassabis D (2020) Improved protein structure prediction using potentials from deep learning. Nature 577:706–710. https://doi.org/10.1038/s41586-019-1923-7
33. Wang C, Liu Z, Chen Z, Huang X, Xu M, He T, Zhang Z (2020) The establishment of reference sequence for SARS-CoV-2 and variation analysis. J Med Virol 92:667–674. https://doi.org/10.1002/jmv.25762
34. de Wit E, van Doremalen N, Falzarano D, Munster VJ (2016) SARS and MERS: recent insights into emerging coronaviruses. Nat Rev Microbiol 14:523–534. https://doi.org/10.1038/nrmicro.2016.81
35. Huang C, Lokugamage KG, Rozovics JM, Narayanan K, Semler BL, Makino S (2011) SARS coronavirus nsp1 protein induces template-dependent endonucleolytic cleavage of mRNAs: viral mRNAs are resistant to nsp1-induced RNA cleavage. PLoS Pathog 7:e1002433. https://doi.org/10.1371/journal.ppat.1002433
36. Schubert K, Karousis ED, Jomaa A, Scaiola A, Echeverria B, Gurzeler L-A, Leibundgut M, Thiel V, Mühlemann O, Ban N (2020) SARS-CoV-2 Nsp1 binds the ribosomal mRNA channel to inhibit translation. Nat Struct Mol Biol 27:959–966. https://doi.org/10.1038/s41594-020-0511-8
37. Cornillez-Ty CT, Liao L, Yates JR, Kuhn P, Buchmeier MJ (2009) Severe acute respiratory syndrome coronavirus nonstructural protein 2 interacts with a host protein complex involved in mitochondrial biogenesis and intracellular signaling. J Virol 83:10314–10318. https://doi.org/10.1128/JVI.00842-09
38. Angeletti S, Benvenuto D, Bianchi M, Giovanetti M, Pascarella S, Ciccozzi M (2020) COVID-2019: the role of the nsp2 and nsp3 in its pathogenesis. J Med Virol 92:584–588. https://doi.org/10.1002/jmv.25719
39. Lei J, Kusov Y, Hilgenfeld R (2018) Nsp3 of coronaviruses: structures and functions of a large multi-domain protein. Antivir Res 149:58–74. https://doi.org/10.1016/j.antiviral.2017.11.001
40. Sakai Y, Kawachi K, Terada Y, Omori H, Matsuura Y, Kamitani W (2017) Two-amino acids change in the nsp4 of SARS coronavirus abolishes viral replication. Virology 510:165–174. https://doi.org/10.1016/j.virol.2017.07.019
41. Tomar S, Johnston ML, St John SE, Osswald HL, Nyalapatla PR, Paul LN, Ghosh AK, Denison MR, Mesecar AD (2015) Ligand-induced dimerization of Middle East Respiratory Syndrome (MERS) coronavirus nsp5 protease (3CLpro): implications for nsp5 regulation and the development of antivirals. J Biol Chem 290:19403–19422. https://doi.org/10.1074/jbc.M115.651463
42. Roe MK, Junod NA, Young AR, Beachboard DC, Stobart CC (2021) Targeting novel structural and functional features of coronavirus protease nsp5 (3CLpro, Mpro) in the age of COVID-19. J Gen Virol 102. https://doi.org/10.1099/jgv.0.001558
43. te Velthuis AJW, van den Worm SHE, Snijder EJ (2012) The SARS-coronavirus nsp7+nsp8 complex is a unique multimeric RNA polymerase capable of both de novo initiation and primer extension. Nucleic Acids Res 40:1737–1747. https://doi.org/10.1093/nar/gkr893
44. Gao Y, Yan L, Huang Y, Liu F, Zhao Y, Cao L, Wang T, Sun Q, Ming Z, Zhang L, Ge J, Zheng L, Zhang Y, Wang H, Zhu Y, Zhu C, Hu T, Hua T, Zhang B, Yang X, Li J, Yang H, Liu Z, Xu W, Guddat LW, Wang Q, Lou Z, Rao Z (2020) Structure of the RNA-dependent RNA polymerase from COVID-19 virus. Science 368:779–782. https://doi.org/10.1126/science.abb7498
45. Ma Y, Wu L, Shaw N, Gao Y, Wang J, Sun Y, Lou Z, Yan L, Zhang R, Rao Z (2015) Structural basis and functional analysis of the SARS coronavirus nsp14–nsp10 complex. Proc Natl Acad Sci U S A 112:9436–9441. https://doi.org/10.1073/pnas.1508686112
46. Wang Y, Sun Y, Wu A, Xu S, Pan R, Zeng C, Jin X, Ge X, Shi Z, Ahola T, Chen Y, Guo D (2015) Coronavirus nsp10/nsp16 methyltransferase can be targeted by nsp10-derived peptide in vitro and in vivo to reduce replication and pathogenesis. J Virol 89:8416–8427. https://doi.org/10.1128/JVI.00948-15
47. Subissi L, Posthuma CC, Collet A, Zevenhoven-Dobbe JC, Gorbalenya AE, Decroly E, Snijder EJ, Canard B, Imbert I (2014) One severe acute respiratory syndrome coronavirus protein complex integrates processive RNA polymerase and exonuclease activities. Proc Natl Acad Sci U S A 111:E3900–E3909. https://doi.org/10.1073/pnas.1323705111
48. Peng Q, Peng R, Yuan B, Zhao J, Wang M, Wang X, Wang Q, Sun Y, Fan Z, Qi J, Gao GF, Shi Y (2020) Structural and biochemical characterization of the nsp12-nsp7-nsp8 core polymerase complex from SARS-CoV-2. Cell Rep 31:107774. https://doi.org/10.1016/j.celrep.2020.107774
49. Jang K-J, Jeong S, Kang DY, Sp N, Yang YM, Kim D-E (2020) A high ATP concentration enhances the cooperative translocation of the SARS coronavirus helicase nsP13 in the unwinding of duplex RNA. Sci Rep 10:4481. https://doi.org/10.1038/s41598-020-61432-1
50. Jia Z, Yan L, Ren Z, Wu L, Wang J, Guo J, Zheng L, Ming Z, Zhang L, Lou Z, Rao Z (2019) Delicate structural coordination of the Severe Acute Respiratory Syndrome coronavirus Nsp13 upon ATP hydrolysis. Nucleic Acids Res 47:6538–6550. https://doi.org/10.1093/nar/gkz409
51. Ivanov KA, Thiel V, Dobbe JC, van der Meer Y, Snijder EJ, Ziebuhr J (2004) Multiple enzymatic activities associated with severe acute respiratory syndrome coronavirus helicase. J Virol 78:5619–5632. https://doi.org/10.1128/JVI.78.11.5619-5632.2004
52. Shu T, Huang M, Wu D, Ren Y, Zhang X, Han Y, Mu J, Wang R, Qiu Y, Zhang D-Y, Zhou X (2020) SARS-Coronavirus-2 Nsp13 possesses NTPase and RNA helicase activities that can be inhibited by bismuth salts. Virol Sin 35:321–329. https://doi.org/10.1007/s12250-020-00242-1
53. Case JB, Ashbrook AW, Dermody TS, Denison MR (2016) Mutagenesis of S-adenosyl-l-methionine-binding residues in coronavirus nsp14 N7-methyltransferase demonstrates differing requirements for genome translation and resistance to innate immunity. J Virol 90:7248–7256. https://doi.org/10.1128/JVI.00542-16
54. Ogando NS, Zevenhoven-Dobbe JC, van der Meer Y, Bredenbeek PJ, Posthuma CC, Snijder EJ (2020) The enzymatic activity of the nsp14 exoribonuclease is critical for replication of MERS-CoV and SARS-CoV-2. J Virol 94. https://doi.org/10.1128/JVI.01246-20
55. Hong S, Seo SH, Woo S-J, Kwon Y, Song M, Ha N-C (2021) Epigallocatechin gallate inhibits the uridylate-specific endoribonuclease Nsp15 and efficiently neutralizes the SARS-CoV-2 strain. J Agric Food Chem 69:5948–5954. https://doi.org/10.1021/acs.jafc.1c02050
56. Hackbart M, Deng X, Baker SC (2020) Coronavirus endoribonuclease targets viral polyuridine sequences to evade activating host sensors. Proc Natl Acad Sci U S A 117:8094–8103. https://doi.org/10.1073/pnas.1921485117
57. Decroly E, Debarnot C, Ferron F, Bouvet M, Coutard B, Imbert I, Gluais L, Papageorgiou N, Sharff A, Bricogne G, Ortiz-Lombardia M, Lescar J, Canard B (2011) Crystal structure and functional analysis of the SARS-coronavirus RNA cap 2’-O-methyltransferase nsp10/nsp16 complex.
PLoS Pathog 7:e1002059. https://doi.org/ 10.1371/journal.ppat.1002059 58. Vithani N, Ward MD, Zimmerman MI, Novak B, Borowsky JH, Singh S, Bowman GR (2021) SARS-CoV-2 Nsp16 activation mechanism and a cryptic pocket with pan-coronavirus antiviral potential.

294

Manish Sarkar and Soham Saha

Biophys J:S000634952100254X. https:// doi.org/10.1016/j.bpj.2021.03.024 59. Lan J, Ge J, Yu J, Shan S, Zhou H, Fan S, Zhang Q, Shi X, Wang Q, Zhang L, Wang X (2020) Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor. Nature 581:215–220. https://doi.org/10.1038/s41586-0202180-5 60. Daly JL, Simonetti B, Klein K, Chen K-E, Williamson MK, Anto´n-Pla´garo C, Shoemark DK, Simo´n-Gracia L, Bauer M, Hollandi R, Greber UF, Horvath P, Sessions RB, Helenius A, Hiscox JA, Teesalu T, Matthews DA, Davidson AD, Collins BM, Cullen PJ, Yamauchi Y (2020) Neuropilin-1 is a host factor for SARS-CoV-2 infection. Science: eabd3072. https://doi.org/10.1126/sci ence.abd3072 61. Teesalu T, Sugahara KN, Kotamraju VR, Ruoslahti E (2009) C-end rule peptides mediate neuropilin-1-dependent cell, vascular, and tissue penetration. Proc Natl Acad Sci 106: 16157–16162. https://doi.org/10.1073/ pnas.0908201106 62. Guo H-F, Vander Kooi CW (2015) Neuropilin functions as an essential cell surface receptor. J Biol Chem 290:29120–29126. https:// doi.org/10.1074/jbc.R115.687327 63. Plein A, Fantin A, Ruhrberg C (2014) Neuropilin regulation of angiogenesis, arteriogenesis, and vascular permeability. Microcirculation 21:315–323. https://doi. org/10.1111/micc.12124 64. Nieto-Torres JL, DeDiego ML, Verdi˜ o JM, Reglaa´-Ba´guena C, Jimenez-Guarden ˜ oNava JA, Fernandez-Delgado R, Castan Rodriguez C, Alcaraz A, Torres J, Aguilella VM, Enjuanes L (2014) Severe acute respiratory syndrome coronavirus envelope protein ion channel activity promotes virus fitness and pathogenesis. PLoS Pathog 10:e1004077. https://doi.org/10.1371/journal.ppat. 1004077 65. Verdia´-Ba´guena C, Nieto-Torres JL, Alcaraz A, DeDiego ML, Torres J, Aguilella VM, Enjuanes L (2012) Coronavirus E protein forms ion channels with functionally and structurally-involved membrane lipids. Virology 432:485–494. https://doi.org/10. 1016/j.virol.2012.07.005 66. 
Nieva JL, Madan V, Carrasco L (2012) Viroporins: structure and biological functions. Nat Rev Microbiol 10:563–574. https://doi.org/ 10.1038/nrmicro2820 67. Sarkar M, Saha S (2020) Structural insight into the role of novel SARS-CoV-2 E protein: a potential target for vaccine development and

other therapeutic strategies. PLoS One 15: e0237300. https://doi.org/10.1371/jour nal.pone.0237300 68. Escors D, Ortego J, Laude H, Enjuanes L (2001) The membrane M protein carboxy terminus binds to transmissible gastroenteritis coronavirus core and contributes to core stability. J Virol 75:1312–1324. https://doi. org/10.1128/JVI.75.3.1312-1324.2001 69. Kuo L, Masters PS (2003) The small envelope protein E is not essential for murine coronavirus replication. J Virol 77:4597–4608. h t t p s : // d o i . o r g / 1 0 . 1 1 2 8 / j v i . 7 7 . 8 . 4597-4608.2003 70. Neuman BW, Joseph JS, Saikatendu KS, Serrano P, Chatterjee A, Johnson MA, Liao L, Klaus JP, Yates JR, Wu¨thrich K, Stevens RC, Buchmeier MJ, Kuhn P (2008) Proteomics analysis unravels the functional repertoire of coronavirus nonstructural protein 3. J Virol 82:5279–5294. https://doi. org/10.1128/JVI.02631-07 71. Tsoi H, Li L, Chen ZS, Lau K-F, Tsui SKW, Chan HYE (2014) The SARS-coronavirus membrane protein induces apoptosis via interfering with PDK1-PKB/Akt signalling. Biochem J 464:439–447. https://doi.org/10. 1042/BJ20131461 72. Zheng Y, Zhuang M-W, Han L, Zhang J, Nan M-L, Zhan P, Kang D, Liu X, Gao C, Wang P-H (2020) Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) membrane (M) protein inhibits type I and III interferon production by targeting RIG-I/ MDA-5 signaling. Signal Transduct Target Ther 5:299. https://doi.org/10.1038/ s41392-020-00438-7 73. Siu YL, Teoh KT, Lo J, Chan CM, Kien F, Escriou N, Tsao SW, Nicholls JM, Altmeyer R, Peiris JSM, Bruzzone R, Nal B (2008) The M, E, and N structural proteins of the severe acute respiratory syndrome coronavirus are required for efficient assembly, trafficking, and release of virus-like particles. J Virol 82:11318–11330. https://doi.org/10. 1128/JVI.01052-08 74. Mu J, Xu J, Zhang L, Shu T, Wu D, Huang M, Ren Y, Li X, Geng Q, Xu Y, Qiu Y, Zhou X (2020) SARS-CoV-2-encoded nucleocapsid protein acts as a viral suppressor of RNA interference in cells. 
Sci China Life Sci 63:1–4. https://doi.org/10.1007/s11427020-1692-1 75. Surjit M, Liu B, Chow VTK, Lal SK (2006) The nucleocapsid protein of severe acute respiratory syndrome-coronavirus inhibits the activity of cyclin-cyclin-dependent kinase complex and blocks S phase progression in

Structural Proteomics Using Protein Modeling mammalian cells. J Biol Chem 281:10669– 10681. https://doi.org/10.1074/jbc. M509233200 ˜ o-Rodriguez C, Honrubia JM, Gutie´r76. Castan ´ lvarez J, DeDiego ML, Nieto-Torres JL, rez-A ˜ o JM, Regla-Nava JA, Jimenez-Guarden Fernandez-Delgado R, Verdia-Ba´guena C, Queralt-Martı´n M, Kochan G, Perlman S, Aguilella VM, Sola I, Enjuanes L (2018) Role of severe acute respiratory syndrome coronavirus viroporins E, 3a, and 8a in replication and pathogenesis. mBio 9. https://doi. org/10.1128/mBio.02325-17 ˜ o-Rodriguez C, Ye 77. Siu K-L, Yuen K-S, Castan Z-W, Yeung M-L, Fung S-Y, Yuan S, Chan C-P, Yuen K-Y, Enjuanes L, Jin D-Y (2019) Severe acute respiratory syndrome coronavirus ORF3a protein activates the NLRP3 inflammasome by promoting TRAF3dependent ubiquitination of ASC. FASEB J 33:8865–8877. https://doi.org/10.1096/fj. 201802418R 78. Ren Y, Shu T, Wu D, Mu J, Wang C, Huang M, Han Y, Zhang X-Y, Zhou W, Qiu Y, Zhou X (2020) The ORF3a protein of SARS-CoV-2 induces apoptosis in cells. Cell Mol Immunol 17:881–883. https:// doi.org/10.1038/s41423-020-0485-9 79. Minakshi R, Padhan K, Rehman S, Hassan MDI, Ahmad F (2014) The SARS Coronavirus 3a protein binds calcium in its cytoplasmic domain. Virus Res 191:180–183. https:// doi.org/10.1016/j.virusres.2014.08.001 80. Kumar P, Gunalan V, Liu B, Chow VTK, Druce J, Birch C, Catton M, Fielding BC, Tan Y-J, Lal SK (2007) The nonstructural protein 8 (nsp8) of the SARS coronavirus interacts with its ORF6 accessory protein. Virology 366:293–303. https://doi.org/10. 1016/j.virol.2007.04.029 81. Hussain S, Gallagher T (2010) SARScoronavirus protein 6 conformations required to impede protein import into the nucleus. Virus Res 153:299–304. https://doi.org/ 10.1016/j.virusres.2010.08.017 82. 
Miorin L, Kehrer T, Sanchez-Aparicio MT, Zhang K, Cohen P, Patel RS, Cupic A, Makio T, Mei M, Moreno E, Danziger O, White KM, Rathnasinghe R, Uccellini M, Gao S, Aydillo T, Mena I, Yin X, MartinSancho L, Krogan NJ, Chanda SK, Schotsaert M, Wozniak RW, Ren Y, Rosenberg BR, Fontoura BMA, Garcı´a-Sastre A (2020) SARS-CoV-2 Orf6 hijacks Nup98 to block STAT nuclear import and antagonize interferon signaling. Proc Natl Acad Sci U S

295

A 117:28344–28354. https://doi.org/10. 1073/pnas.2016650117 83. Li J-Y, Liao C-H, Wang Q, Tan Y-J, Luo R, Qiu Y, Ge X-Y (2020) The ORF6, ORF8 and nucleocapsid proteins of SARS-CoV-2 inhibit type I interferon signaling pathway. Virus Res 286:198074. https://doi.org/10.1016/j. virusres.2020.198074 84. Nelson CA, Pekosz A, Lee CA, Diamond MS, Fremont DH (2005) Structure and intracellular targeting of the SARS-coronavirus Orf7a accessory protein. Structure 13:75–85. https://doi.org/10.1016/j.str.2004.10.010 85. Taylor JK, Coleman CM, Postel S, Sisk JM, Bernbaum JG, Venkataraman T, Sundberg EJ, Frieman MB (2015) Severe acute respiratory syndrome coronavirus ORF7a inhibits bone marrow stromal antigen 2 virion tethering through a novel mechanism of glycosylation interference. J Virol 89:11820–11833. https://doi.org/10.1128/JVI.02274-15 86. Cao Z, Xia H, Rajsbaum R, Xia X, Wang H, Shi P-Y (2021) Ubiquitination of SARS-CoV2 ORF7a promotes antagonism of interferon response. Cell Mol Immunol 18:746–748. https://doi.org/10.1038/s41423-02000603-6 87. Schaecher SR, Mackenzie JM, Pekosz A (2007) The ORF7b protein of severe acute respiratory syndrome coronavirus (SARSCoV) is expressed in virus-infected cells and incorporated into SARS-CoV particles. J Virol 81:718–731. https://doi.org/10.1128/JVI. 01691-06 88. Schaecher SR, Diamond MS, Pekosz A (2008) The transmembrane domain of the severe acute respiratory syndrome coronavirus ORF7b protein is necessary and sufficient for its retention in the Golgi complex. J Virol 82: 9477–9491. https://doi.org/10.1128/JVI. 00784-08 89. Le TM, Wong HH, Tay FPL, Fang S, Keng C-T, Tan YJ, Liu DX (2007) Expression, post-translational modification and biochemical characterization of proteins encoded by subgenomic mRNA8 of the severe acute respiratory syndrome coronavirus. FEBS J 274:4211–4222. https://doi.org/10.1111/ j.1742-4658.2007.05947.x 90. 
Keng C-T, Choi Y-W, Welkers MRA, Chan DZL, Shen S, Gee Lim S, Hong W, Tan Y-J (2006) The human severe acute respiratory syndrome coronavirus (SARS-CoV) 8b protein is distinct from its counterpart in animal SARS-CoV and down-regulates the expression of the envelope protein in infected cells.

296

Manish Sarkar and Soham Saha

Virology 354:132–142. https://doi.org/10. 1016/j.virol.2006.06.026 91. Wong HH, Fung TS, Fang S, Huang M, Le MT, Liu DX (2018) Accessory proteins 8b and 8ab of severe acute respiratory syndrome coronavirus suppress the interferon signaling pathway by mediating ubiquitin-dependent rapid degradation of interferon regulatory factor 3. Virology 515:165–175. https:// doi.org/10.1016/j.virol.2017.12.028 92. Zhang Y, Chen Y, Li Y, Huang F, Luo B, Yuan Y, Xia B, Ma X, Yang T, Yu F, Liu J, Liu B, Song Z, Chen J, Yan S, Wu L, Pan T, Zhang X, Li R, Huang W, He X, Xiao F, Zhang J, Zhang H (2021) The ORF8 protein of SARS-CoV-2 mediates immune evasion through down-regulating MHC-I. Proc Natl Acad Sci U S A 118:e2024202118. https:// doi.org/10.1073/pnas.2024202118 93. Gordon DE, Jang GM, Bouhaddou M, Xu J, Obernier K, White KM, O’Meara MJ, Rezelj VV, Guo JZ, Swaney DL, Tummino TA, Huettenhain R, Kaake RM, Richards AL, Tutuncuoglu B, Foussard H, Batra J, Haas K, Modak M, Kim M, Haas P, Polacco BJ, Braberg H, Fabius JM, Eckhardt M, Soucheray M, Bennett MJ, Cakir M, McGregor MJ, Li Q, Meyer B, Roesch F, Vallet T, Mac Kain A, Miorin L, Moreno E, Naing ZZC, Zhou Y, Peng S, Shi Y, Zhang Z, Shen W, Kirby IT, Melnyk JE, Chorba JS, Lou K, Dai SA, Barrio-Hernandez I, Memon D, Hernandez-Armenta C, Lyu J, Mathy CJP, Perica T, Pilla KB, Ganesan SJ, Saltzberg DJ, Rakesh R, Liu X, Rosenthal SB, Calviello L, Venkataramanan S, Liboy-Lugo J, Lin Y, Huang X-P, Liu Y, Wankowicz SA, Bohn M, Safari M, Ugur FS, Koh C, Savar NS, Tran QD, Shengjuler D, Fletcher SJ, O’Neal MC, Cai Y, Chang JCJ, Broadhurst DJ, Klippsten S, Sharp PP, Wenzell NA, Kuzuoglu D, Wang H-Y, Trenker R, Young JM, Cavero DA, Hiatt J, Roth TL, Rathore U, Subramanian A, Noack J, Hubert M, Stroud RM, Frankel AD, Rosenberg OS, Verba KA, Agard DA, Ott M, Emerman M, Jura N, von Zastrow M, Verdin E, Ashworth A, Schwartz O, d’Enfert C, Mukherjee S, Jacobson M, Malik HS, Fujimori DG, Ideker T, Craik CS, Floor SN, Fraser JS, Gross JD, Sali A, 
Roth BL, Ruggero D, Taunton J, Kortemme T, Beltrao P, Vignuzzi M, Garcı´a-Sastre A, Shokat KM, Shoichet BK, Krogan NJ (2020) A SARSCoV-2 protein interaction map reveals targets for drug repurposing. Nature. https://doi. org/10.1038/s41586-020-2286-9

94. Zhang Y, Skolnick J (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33:2302– 2309. https://doi.org/10.1093/nar/gki524 95. Mukherjee S, Zhang Y (2009) MM-align: a quick algorithm for aligning multiple-chain protein complex structures using iterative dynamic programming. Nucleic Acids Res 37:e83. https://doi.org/10.1093/nar/ gkp318 96. Shang J, Ye G, Shi K, Wan Y, Luo C, Aihara H, Geng Q, Auerbach A, Li F (2020) Structural basis of receptor recognition by SARS-CoV-2. Nature 581:221–224. https://doi.org/10.1038/s41586-0202179-y 97. Walls AC, Park Y-J, Tortorici MA, Wall A, McGuire AT, Veesler D (2020) Structure, function, and antigenicity of the SARS-CoV2 spike glycoprotein. Cell 181:281–292.e6. https://doi.org/10.1016/j.cell.2020. 02.058 98. Kang S, Yang M, Hong Z, Zhang L, Huang Z, Chen X, He S, Zhou Z, Zhou Z, Chen Q, Yan Y, Zhang C, Shan H, Chen S (2020) Crystal structure of SARS-CoV2 nucleocapsid protein RNA binding domain reveals potential unique drug targeting sites. Acta Pharm Sin B 10:1228–1238. https:// doi.org/10.1016/j.apsb.2020.04.009 99. Surjit M, Lal SK (2010) The nucleocapsid protein of the SARS coronavirus: structure, function and therapeutic potential. In: Lal SK (ed) Molecular biology of the SARScoronavirus. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 129–151 100. Schoeman D, Fielding BC (2019) Coronavirus envelope protein: current knowledge. Virol J 16:69. https://doi.org/10.1186/ s12985-019-1182-0 101. Mukherjee S, Bhattacharyya D, Bhunia A (2020) Host-membrane interacting interface of the SARS coronavirus envelope protein: immense functional potential of C-terminal domain. Biophys Chem 266:106452. https://doi.org/10.1016/j.bpc.2020. 106452 102. Bianchi M, Benvenuto D, Giovanetti M, Angeletti S, Ciccozzi M, Pascarella S (2020) Sars-CoV-2 envelope and membrane proteins: structural differences linked to virus characteristics? Biomed Res Int 2020:1–6. 
https://doi.org/10.1155/2020/4389089 103. Voss D, Pfefferle S, Drosten C, Stevermann L, Traggiai E, Lanzavecchia A, Becker S (2009) Studies on membrane topology,

Structural Proteomics Using Protein Modeling N-glycosylation and functionality of SARSCoV membrane protein. Virol J 6:79. https://doi.org/10.1186/1743-422X-6-79 104. Holm L, Ouzounis C, Sander C, Tuparev G, Vriend G (1992) A database of protein structure families with common folding motifs. Protein Sci 1:1691–1698. https://doi.org/ 10.1002/pro.5560011217 105. Horton P, Nakai K (1997) Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proc Int Conf Intell Syst Mol Biol 5:147–152 106. Feyfant E, Sali A, Fiser A (2007) Modeling mutations in protein structures. Protein Sci 16:2030–2041. https://doi.org/10.1110/ ps.072855507 107. Kopp J, Bordoli L, Battey JND, Kiefer F, Schwede T (2007) Assessment of CASP7 predictions for template-based modeling targets. Proteins 69(Suppl 8):38–56. https://doi. org/10.1002/prot.21753 108. Yan R, Zhang Y, Li Y, Xia L, Guo Y, Zhou Q (2020) Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2. Science 367:1444–1448. https://doi.org/ 10.1126/science.abb2762 109. Cantuti-Castelvetri L, Ojha R, Pedro LD, Djannatian M, Franz J, Kuivanen S, van der Meer F, Kallio K, Kaya T, Anastasina M, Smura T, Levanov L, Szirovicza L, Tobi A, ¨ sterlund P, Joensuu M, Kallio-Kokko H, O Meunier FA, Butcher SJ, Winkler MS, Mollenhauer B, Helenius A, Gokce O, Teesalu T, Hepojoki J, Vapalahti O, Stadelmann C, Balistreri G, Simons M (2020) Neuropilin-1 facilitates SARS-CoV2 cell entry and infectivity. Science 370:856– 860. https://doi.org/10.1126/science. abd2985 110. Surya W, Li Y, Torres J (2018) Structural model of the SARS coronavirus E channel in LMPG micelles. Biochim Biophys Acta Biomembr 1860:1309–1317. https://doi.org/ 10.1016/j.bbamem.2018.02.017 111. George RA, Spriggs RV, Bartlett GJ, Gutteridge A, MacArthur MW, Porter CT, Al-Lazikani B, Thornton JM, Swindells MB (2005) Effective function annotation through catalytic residue conservation. Proc Natl Acad Sci 102:12299–12304. 
https://doi.org/10. 1073/pnas.0504833102 112. Sankararaman S, Sha F, Kirsch JF, Jordan MI, Sjo¨lander K (2010) Active site prediction using evolutionary and structural information. Bioinformatics 26:617–624. https:// doi.org/10.1093/bioinformatics/btq008

297

113. Bate P, Warwicker J (2004) Enzyme/nonenzyme discrimination and prediction of enzyme active site location using chargebased methods. J Mol Biol 340:263–276. https://doi.org/10.1016/j.jmb.2004. 04.070 114. Chakrabarti R, Klibanov AM, Friesner RA (2005) Computational prediction of native protein ligand-binding and enzyme active site sequences. Proc Natl Acad Sci 102: 10153–10158. https://doi.org/10.1073/ pnas.0504023102 115. Yamamoto D, Takai S, Miyazaki M (2007) Prediction of interaction mode between a typical ACE inhibitor and MMP-9 active site. Biochem Biophys Res Commun 354:981– 984. https://doi.org/10.1016/j.bbrc.2007. 01.088 116. Hu J, Li Y, Zhang Y, Yu D-J (2018) ATPbind: accurate protein–ATP binding site prediction by combining sequence-profiling and structure-based comparisons. J Chem Inf Model 58:501–510. https://doi.org/10. 1021/acs.jcim.7b00397 117. Lin C-W, Tsai F-J, Wan L, Lai C-C, Lin K-H, Hsieh T-H, Shiu S-Y, Li J-Y (2005) Binding interaction of SARS coronavirus 3CL(pro) protease with vacuolar-H+ ATPase G1 subunit. FEBS Lett 579:6089–6094. https:// doi.org/10.1016/j.febslet.2005.09.075 118. Meng X-Y, Zhang H-X, Mezei M, Cui M (2011) Molecular docking: a powerful approach for structure-based drug discovery. Curr Comput Aided Drug Des 7:146–157. h t t p s : // d o i . o r g / 1 0 . 2 1 7 4 / 157340911795677602 119. Brint AT, Willett P (1987) Algorithms for the identification of three-dimensional maximal common substructures. J Chem Inf Model 27:152–158. https://doi.org/10.1021/ ci00056a002 120. Rarey M, Kramer B, Lengauer T, Klebe G (1996) A fast flexible docking method using an incremental construction algorithm. J Mol Biol 261:470–489. https://doi.org/10. 1006/jmbi.1996.0477 121. Miranker A, Karplus M (1991) Functionality maps of binding sites: a multiple copy simultaneous search method. Proteins 11:29–34. https://doi.org/10.1002/prot.340110104 122. Bo¨hm HJ (1992) LUDI: rule-based automatic design of new substituents for enzyme inhibitor leads. 
J Comput Aided Mol Des 6: 593–606. https://doi.org/10.1007/ BF00126217 123. Goodsell DS, Lauble H, Stout CD, Olson AJ (1993) Automated docking in

298

Manish Sarkar and Soham Saha

crystallography: analysis of the substrates of aconitase. Proteins 17:1–10. https://doi. org/10.1002/prot.340170104 124. Weiner SJ, Kollman PA, Case DA, Singh UC, Ghio C, Alagona G, Profeta S, Weiner P (1984) A new force field for molecular mechanical simulation of nucleic acids and proteins. J Am Chem Soc 106:765–784. https://doi.org/10.1021/ja00315a051 125. Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA (1995) A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc 117:5179–5197. https://doi.org/10.1021/ja00124a002 126. Bo¨hm HJ (1998) Prediction of binding constants of protein ligands: a fast method for the prioritization of hits obtained from de novo design or 3D database search programs. J Comput Aided Mol Des 12:309–323. h t t p s : // d o i . o r g / 1 0 . 1 0 2 3 / a:1007999920146 127. Gehlhaar DK, Verkhivker GM, Rejto PA, Sherman CJ, Fogel DB, Fogel LJ, Freer ST (1995) Molecular recognition of the inhibitor AG-1343 by HIV-1 protease: conformationally flexible docking by evolutionary programming. Chem Biol 2:317–324. https://doi. org/10.1016/1074-5521(95)90050-0 128. Muegge I, Martin YC (1999) A general and fast scoring function for protein-ligand interactions: a simplified potential approach. J Med Chem 42:791–804. https://doi.org/ 10.1021/jm980536j 129. Ishchenko AV, Shakhnovich EI (2002) SMall Molecule Growth 2001 (SMoG2001): an improved knowledge-based scoring function for protein-ligand interactions. J Med Chem 45:2770–2780. https://doi.org/10.1021/ jm0105833 130. Charifson PS, Corkery JJ, Murcko MA, Walters WP (1999) Consensus scoring: a method for obtaining improved hit rates from docking databases of three-dimensional structures into proteins. J Med Chem 42:5100–5109. https://doi.org/10.1021/jm990352k 131. 
Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 31:455–461. https://doi.org/10.1002/jcc. 21334 132. Kirchdoerfer RN, Ward AB (2019) Structure of the SARS-CoV nsp12 polymerase bound to nsp7 and nsp8 co-factors. Nat Commun 10:2342. https://doi.org/10.1038/s41467019-10280-3

133. Konkolova E, Klima M, Nencka R, Boura E (2020) Structural analysis of the putative SARS-CoV-2 primase complex. J Struct Biol 211:107548. https://doi.org/10.1016/j. jsb.2020.107548 134. Yin W, Mao C, Luan X, Shen D-D, Shen Q, Su H, Wang X, Zhou F, Zhao W, Gao M, Chang S, Xie Y-C, Tian G, Jiang H-W, Tao S-C, Shen J, Jiang Y, Jiang H, Xu Y, Zhang S, Zhang Y, Xu HE (2020) Structural basis for inhibition of the RNA-dependent RNA polymerase from SARS-CoV-2 by remdesivir. Science 368:1499–1504. https://doi.org/10. 1126/science.abc1560 135. Schwede T, Kopp J, Guex N, Peitsch MC (2003) SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Res 31:3381–3385. https://doi.org/ 10.1093/nar/gkg520 136. Webb B, Sali A (2016) Comparative protein structure modeling using MODELLER. Curr Protoc Bioinform 54. https://doi.org/10. 1002/cpbi.3 137. Eswar N (2003) Tools for comparative protein structure modeling and analysis. Nucleic Acids Res 31:3375–3380. https://doi.org/ 10.1093/nar/gkg543 138. Wang S, Li W, Liu S, Xu J (2016) RaptorXProperty: a web server for protein structure property prediction. Nucleic Acids Res 44: W430–W435. https://doi.org/10.1093/ nar/gkw306 139. Bennett-Lovsey RM, Herbert AD, Sternberg MJE, Kelley LA (2008) Exploring the extremes of sequence/structure space with ensemble fold recognition in the program Phyre. Proteins Struct Funct Bioinform 70: 611–625. https://doi.org/10.1002/prot. 21688 140. Bazzoli A, Tettamanzi AGB, Zhang Y (2011) Computational protein design and large-scale assessment by I-TASSER structure assembly simulations. J Mol Biol 407:764–776. https://doi.org/10.1016/j.jmb.2011. 02.017 141. Khare SD, Whitehead TA (2015) Introduction to the Rosetta special collection. PLoS One 10:e0144326. https://doi.org/10. 1371/journal.pone.0144326 142. Yang J, Anishchenko I, Park H, Peng Z, Ovchinnikov S, Baker D (2020) Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci 117:1496–1503. 
https://doi.org/10.1073/ pnas.1914677117 143. Ko J, Park H, Heo L, Seok C (2012) GalaxyWEB server for protein structure prediction

Structural Proteomics Using Protein Modeling and refinement. Nucleic Acids Res 40:W294– W297. https://doi.org/10.1093/nar/ gks493 144. Zheng W, Zhang C, Wuyun Q, Pearce R, Li Y, Zhang Y (2019) LOMETS2: improved

299

meta-threading server for fold-recognition and structure-based function annotation for distant-homology proteins. Nucleic Acids Res 47:W429–W436. https://doi.org/10.1093/ nar/gkz384

Chapter 16
Homology Modeling of Antibody Variable Regions: Methods and Applications
Harsh Bansia and Suryanarayanarao Ramakumar

Abstract
Adaptive immunity specifically protects us from antigenic challenges. Antibodies are key effector proteins of adaptive immunity, and they are remarkable in their ability to recognize a virtually limitless number of antigens. Fragment variable (FV), the antigen-binding region of antibodies, can be split into two main components, namely, the framework and the complementarity determining regions. The framework (FR) consists of the light-chain framework (FRL) and the heavy-chain framework (FRH). Similarly, the complementarity determining regions (CDRs) comprise light-chain CDRs 1–3 (CDRs L1–3) and heavy-chain CDRs 1–3 (CDRs H1–3). While FRs are relatively constant in sequence and structure across diverse antibodies, sequence variation in CDRs, leading to differential conformations of CDR loops, accounts for the distinct antigenic specificities of diverse antibodies. The conserved structural features in FRs and the conformity of CDRs to a limited set of standard conformations allow for the accurate prediction of FV models using homology modeling techniques. Antibody structure prediction from amino acid sequence has numerous important applications, including prediction of antibody-antigen interaction interfaces and redesign of therapeutically and biotechnologically useful antibodies with improved affinity. This chapter summarizes the current practices employed in the successful homology modeling of antibody variable regions and the potential applications of the generated homology models.

Key words Adaptive immune system, Antibody modeling, Antibody modeling tools, Computational structure prediction, Antibody-antigen docking, Paratope, Epitope, SARS-CoV-2, COVID-19 pandemic, Spike protein, Vaccine

Sławomir Filipek (ed.), Homology Modeling: Methods and Protocols, Methods in Molecular Biology, vol. 2627, https://doi.org/10.1007/978-1-0716-2974-1_16, © Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Our immune system is a remarkable defense system that protects us from invading pathogens, toxins, and other harmful foreign substances. The molecules from these invaders that can elicit an immune response are collectively called antigens. One of the main components of the immune system responsible for conferring this protection, the adaptive immune system, possesses not only the exceptional capability of differentiating self versus non-self, but also an extraordinary property of "memory." While the former property allows the immune system to mount a specific response against a particular antigenic challenge, the latter property allows it to respond quickly and strongly to a subsequent challenge by the same antigen, thereby effectively neutralizing and clearing the antigen.

Antibodies are a major agent of adaptive immunity. Antibodies are antigen-recognizing and antigen-binding proteins that are produced in response to a particular antigen. Antibodies are structurally homologous and, after recognizing and binding antigens, participate in antibody-mediated effector processes that help clear the antigen. Through antibodies, adaptive immunity can respond to a virtually limitless number of possible antigenic challenges, which requires an enormous repertoire of diverse antibodies with different antigenic specificities. This diversity is achieved through V(D)J recombination and somatic hypermutation occurring at the DNA level [1], thereby allowing structurally homologous antibodies to bind a diverse array of antigens.

Structures of macromolecules provide insights into their mechanism of action and associated function, and as such, knowledge of antibody structure aids not only in understanding its mechanism of antigen recognition but also in subsequent neutralization of the bound antigen [2]. The importance of antibodies in general, and of the determination of their structure, is underscored by the observation that one of the major responses of scientists to the ongoing coronavirus disease 2019 (COVID-19) pandemic, caused by severe acute respiratory syndrome-coronavirus 2 (SARS-CoV-2) and its ever-emerging variants such as Delta and Omicron, was to find antibodies that can neutralize SARS-CoV-2 and to determine structures of those antibodies alone or in complex with the SARS-CoV-2 antigens [3–7]. These structures provide insights into antibody recognition of SARS-CoV-2 antigens and may provide the rationale for vaccine design [8, 9]. Vaccination is being used as a means to control the COVID-19 pandemic.
Vaccination eventually results in the production of antibodies which neutralize the virus, underlining the importance in general of antibodies to humankind. Antibodies can also outsmart antigens by recognizing cryptic binding sites [10], or cryptic epitopes, also termed cryptotopes, which are potential immunological sites masked within the three-dimensional structure of an antigenic protein. For instance, one of the neutralizing antibodies against the receptor-binding domain (RBD) of the SARS-CoV, CR3022, binds to a cryptic site on the RBD of the SARS-CoV-2 spike protein [9]. Thus, owing to their key role in adaptive immunity, antibodies are frequent targets of structure determination methods including X-ray crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy (cryo-EM). However, due to the intrinsic nature of the molecule, limited resources, and various other factors, structures cannot be experimentally determined for all the sequences. In such instances, computational structure prediction methods incorporating existing experimental information can be used to produce accurate structure models, thereby bridging the sequence-structure gap [11]. The knowledge that variation in antibody sequence is the key factor governing antibody diversity with distinct antigenic specificities, and that diverse antibodies are in general structurally homologous, allows standard homology modeling techniques to generate antibody structure models within the range of accuracy of experimentally determined structures [12]. In homology modeling, known structures with sequences similar to the query sequence are used as templates for modeling the input sequence, building on the assumption that the query sequence will adopt a similar structure owing to the sequence similarity. Thus, using homology modeling one can generate a structure model from its sequence alone, based on existing knowledge derived from experimentally determined template structures. While working on a particular antibody, its sequence is either known or is determined by standard experimental methods [13, 14]. Also, recent developments in next-generation sequencing have allowed for the determination of a large number of antibody sequences [15].

Antibody structure prediction from its amino acid sequence has numerous important applications [16]. Owing to their ability to specifically recognize virtually any foreign molecule, antibodies are used in biotechnology as probes and diagnostics [17], and structure prediction can aid in the redesign of these biotechnologically useful antibodies with improved specificity and affinity [18]. Further, predicted antibody structure models in conjunction with antigen structure can be used in docking simulations to infer atomic-level details of antigen-antibody interactions [19, 20], which can aid in the development of vaccines mimicking the antigen [21].
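The template-selection idea described above, in which known structures whose sequences resemble the query are used as templates, can be illustrated with a minimal sketch. It assumes that pairwise alignments between the query and each candidate template have already been produced by some alignment tool; the class name, function names, template labels, and toy sequences are all hypothetical and are not taken from the chapter or from any particular modeling package. Percent identity over aligned (non-gap) columns is only one simple ranking criterion; practical protocols usually combine it with coverage, resolution, and loop-length considerations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedTemplate:
    """A candidate template with its pairwise alignment to the query (hypothetical container)."""
    name: str              # illustrative label, not a real PDB entry
    aligned_query: str     # query row of the alignment, '-' marks a gap
    aligned_template: str  # template row of the alignment, '-' marks a gap


def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned rows must have the same length")
    aligned = 0
    identical = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == "-" or t == "-":
            continue  # skip gapped columns
        aligned += 1
        if q == t:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def rank_templates(candidates: List[AlignedTemplate]) -> List[AlignedTemplate]:
    """Sort candidates from highest to lowest percent identity to the query."""
    return sorted(
        candidates,
        key=lambda c: percent_identity(c.aligned_query, c.aligned_template),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy alignments, invented purely for illustration.
    candidates = [
        AlignedTemplate("template_A", "ACDE", "ACDF"),
        AlignedTemplate("template_B", "AC-E", "ACDE"),
    ]
    for c in rank_templates(candidates):
        print(c.name, percent_identity(c.aligned_query, c.aligned_template))
```

In an antibody-specific setting, the same ranking would typically be applied separately to the framework and to each CDR, since the best template for one region is not necessarily the best for another.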
Therapeutic antibodies are used as drugs to treat cancer and autoimmune diseases [17, 22–25], and structure prediction can aid in the development and engineering of therapeutic antibodies with improved affinity [18, 26]. Generally, the number of candidates being considered in therapeutic antibody development far exceeds the capacity of experimental structure determination methods, and in such cases computational methods for antibody structure prediction become important [27]. In order to successfully predict antibody structure, an understanding of its various structural components is required.

1.1 Antibody Structure

Classical antibody molecules are a homodimer of heterodimers. Each heterodimer consists of two distinct polypeptide chains, namely, heavy (H) and light (L) chains (Fig. 1). The heterodimer (H-L) is held together and is stabilized by interchain disulfide bonds (Fig. 1) and by noncovalent interactions such as hydrophobic, electrostatic interactions and hydrogen bonds. Two heterodimers (H-L) linked together and stabilized by a set of interchain disulfide bridges and noncovalent interactions give rise to the


Harsh Bansia and Suryanarayanarao Ramakumar

Fig. 1 Schematic of an antibody molecule showing its common “Y” shaped architecture comprising four polypeptide chains, namely, two light-chains (L) and two heavy-chains (H). Various regions of the antibody molecule are labeled: Fab, fragment antigen-binding; FC, fragment crystallizable; FV, fragment variable; VL, variable region of light-chain (dark-green); VH, variable region of heavy-chain (dark-blue); CL, constant region of light-chain (green); CH1, CH2 and CH3, constant regions of heavy-chain (blue); CHO, N-linked glycan (yellow); Hinge (magenta); –S–S–, disulfide bridge; NH3+, amino-terminal; COO−, carboxy-terminal. Terms are explained in the text

common antibody structure, a “Y” shaped homodimer of heterodimers (H-L)2 comprising four polypeptide chains (Fig. 1). Both heavy and light chains consist of several homologous domains, with each domain comprising about 110 amino acids. Among antibodies of different specificities, the amino-terminal domain of both heavy and light chains varies greatly in sequence, whereas the rest of the domains have a relatively constant sequence [28]. The regions of variable sequence are termed V regions and the regions of relatively constant sequence are called C regions. Generally, the light-chain consists of two domains, an amino-terminal variable domain (VL) and a carboxy-terminal constant domain (CL) (Fig. 1). The heavy-chain contains four domains, an amino-terminal variable domain (VH) followed by three adjacent constant domains (CH1, CH2, and CH3) (Fig. 1). Variable and constant domains fold into a characteristic structure called the immunoglobulin fold (Fig. 2) [29]. Topology of the

[Fig. 2 image labels: Light Chain, Heavy Chain, VL, VH, CL, CH1, CH2, CH3, Hinge, N-linked Glycan, Fab, FV, FC]

Fig. 2 Cartoon representation of an intact antibody molecule (PDB ID 1IGT) showing its overall structure and the topology of its folded domains. VL (smudge), VH (slate), CL (green), CH1, CH2 and CH3 (blue) domains fold into the characteristic immunoglobulin fold comprising a sandwich of two β-pleated sheets, with each sheet containing antiparallel β strands connected by loops of varied lengths. Inter- and intrachain disulfide bridges are represented as orange sticks. The VH and CH1 domains interact with the VL and CL domains to form the “arms” of the “Y” shaped antibody molecule, called the fragment antigen-binding (Fab), which contains the antigen-binding region of the antibody. Amino-terminal VH and VL domains in each Fab arm form the fragment variable (FV) or the antigen-binding region of the antibodies. The CH2 and CH3 domains from each heavy-chain interact to form the “stem” of the “Y” shaped antibody molecule, called the fragment crystallizable (FC), which mediates various antibody-mediated effector functions. The N-linked glycan in the CH2 domain is shown as yellow sticks

fold resembles a sandwich of two β-pleated sheets, with each sheet containing antiparallel β strands connected by loops of varied lengths (Fig. 2) [29]. The two β-pleated sheets are held together and stabilized by intrachain disulfide bonds and by noncovalent interactions (Fig. 2). Between the two heterodimers of (H-L)2, the CH2 and CH3 domains from each heavy-chain interact with each other to form the “stem” of the “Y” shaped antibody molecule, which is referred to as the fragment crystallizable (FC) (Figs. 1 and 2). The FC mediates various antibody-mediated effector functions [27] and is glycosylated in the CH2 domains (Figs. 1 and 2).


Glycosylation of antibodies in the FC region has been shown to be critical for maintaining the structural integrity of the molecule [30]. In each heterodimer (H-L), the VH and CH1 domains of the heavy-chain interact with the VL and CL domains of the light-chain to form the “arms” of the “Y” shaped antibody molecule, each referred to as a fragment antigen-binding (Fab) (Figs. 1 and 2). The Fab contains the antigen-binding region of the antibody. The CH1 and CH2 domains are connected by an extended sequence region without similarity to the other antibody domains, called the hinge region (Figs. 1 and 2). The flexibility of the hinge regions allows the angle between the two Fab arms to vary, thus facilitating antigen binding [27].

1.2 Antibody Variable Region

Recognition of a wide variety of antigens requires a diverse population of antibodies and this diversity is achieved through recombination and somatic hypermutation processes occurring in the variable region of antibodies [1]. Amino-terminal VH and VL domains in each Fab arm are collectively referred to as Fragment variable (FV) (Figs. 1, 2, and 3). FV is the antigen-binding region of the antibodies. Though variable domains have greater sequence variability relative to the constant domains, the variation is not uniformly distributed across the variable domains. The sequence variation is concentrated in a few discrete regions of variable domains that correspond to the loops of varied lengths connecting the antiparallel β strands within each sheet of two β-pleated sheets

[Fig. 3 image labels, panels (a) and (b): Antigen (SARS-CoV-2 spike receptor binding domain); Antibody (Fv of SARS-CoV-2 neutralizing antibody S2E12); Epitope; Paratope; H1, H2, H3; L1, L2, L3; FRL; FRH; VH; VL]

Fig. 3 (a) Cartoon and (b) space-filling representation of the FV region of the SARS-CoV-2 neutralizing antibody S2E12 in complex with antigen (SARS-CoV-2 spike receptor-binding domain) (PDB ID 7K45), depicting the predominant interaction of the antigen with the antibody variable region. Three complementarity determining regions (CDRs) from the VH domain (slate), namely, H1 (yellow), H2 (green), H3 (cyan), and three from the VL domain (smudge), namely, L1 (yellow), L2 (green), L3 (cyan), form the paratope, or antigen-binding site, which recognizes the epitope (magenta), the specific region of the antigen. Framework regions (FRs), namely, FRL and FRH, which support the CDRs to maintain their antigen-binding conformations, are shown


of the immunoglobulin fold [28]. These hypervariable loops, located at the tip of the FV and directed away from the rest of the antibody molecule, form the antigen-binding site (Fig. 3) [31]. The specific region of the antigen that is recognized and bound by the antigen-binding site is called the epitope (Fig. 3). Since the structure of the antigen-binding site should be complementary to the structure of the epitope, the hypervariable loops that make up the antigen-binding site are known as complementarity determining regions (CDRs) and form the paratope, the epitope recognition motif of antibodies (Fig. 3) [28, 31]. In comparison to the CDRs, the other regions of the variable domains, comprising the β strands within the β-pleated sheets of the immunoglobulin fold, exhibit far less sequence variation and are termed the framework regions (FRs) (Fig. 3). As the name suggests, FRs act as a supporting scaffold for the CDRs to maintain their antigen-binding conformations. Each variable domain contains three CDRs. Variation in the length and sequence of CDRs contributes to the distinct antigenic specificities of diverse antibodies.

1.3 Homology Modeling of Antibody Variable Region

Since the paratope of the antibody molecule lies in the FV region, this is the segment of the antibody molecule on which antibody modeling methods focus. As described above, based on sequence variation, the FV comprising the VH and VL domains can be split into the following components: light-chain framework (FRL), heavy-chain framework (FRH), light-chain CDRs 1–3 (CDRs L1–3), and heavy-chain CDRs 1–3 (CDRs H1–3) (Fig. 3). Since FRL and FRH display far less sequence variation than the CDRs and also correspond to the structurally conserved sandwich of β-pleated sheets of the immunoglobulin fold, FRs can be modeled fairly accurately from existing experimentally determined template structures. Similarly, five of the six CDR loops (CDR L1, L2, L3, H1, and H2) have only a limited number of main-chain conformations dictated by their length and sequence, referred to as canonical loop conformations or canonical classes [32, 33]. The canonical classes have been catalogued and categorized based on the analysis of available antibody crystal structures [34]. Thus, CDRs L1–3 and H1–2 can also be accurately modeled from a template selected from their representative canonical class [12]. On the other hand, CDR H3 is the most hypervariable loop in terms of both length and sequence and often lacks a representative canonical class [35]. Thus, the CDR H3 component of FV is often modeled de novo [36]. The overall logic of the antibody modeling algorithm is similar across the available modeling tools. The series of steps used in homology modeling of the antibody variable region is summarized in Fig. 4 and elaborated in subsequent sections.

[Fig. 4 flowchart, in order: (1) Input variable region amino acid sequences (VL and VH, each from NH3+ to COO−); (2) Identification of CDRs and FRs in the input VL and VH sequences (FRL segments alternating with CDRs L1, L2, L3; FRH segments alternating with CDRs H1, H2, H3); (3) Selection of template structures to be used for modeling the identified CDRs and FRs; (4) In silico mutation of the selected template structures to match the sequence of the query CDRs and FRs; (5) Optimization of VH/VL orientation; (6) Grafting of the CDR loop templates, CDR H3 modeling and refinement; (7) Output antibody homology model (FV)]

Fig. 4 Schematic illustrating the major steps employed in homology modeling of antibody variable region

2 Materials and Methods

Standard homology modeling practices are used to generate an antibody structure model from the antibody sequence. Since FV comprises the VH and VL domains, the input for FV modeling is the amino acid sequences of the VL and VH domains in FASTA format.
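The input requirement stated above can be checked before a modeling job is submitted. The following is a minimal sketch, not part of any of the servers discussed here: the record names "VL" and "VH", the helper names, and the example sequences are all assumptions made for illustration.

```python
from typing import Dict

STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

# Illustrative two-record input; the record names "VL" and "VH" are an assumed
# convention, and the sequences are examples that do not come from this chapter.
EXAMPLE_FASTA = """\
>VL example light-chain variable domain
DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKLLIYSASFLYSGVPS
RFSGSRSGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGQGTKVEIK
>VH example heavy-chain variable domain
EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLEWVARIYPTNGYTRY
ADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCSRWGGDGFYAMDYWGQGTLVTVSS
"""


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record name: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a record name")
            current = fields[0]  # first word of the header is the record name
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[current] += line.upper()
    return records


def check_fv_input(records: Dict[str, str]) -> None:
    """Raise ValueError unless both chains are present and use the 20 standard residues."""
    for name in ("VL", "VH"):
        if name not in records:
            raise ValueError(f"missing record {name!r}")
        unexpected = set(records[name]) - STANDARD_RESIDUES
        if unexpected:
            raise ValueError(f"{name} contains non-standard symbols: {sorted(unexpected)}")


records = read_fasta(EXAMPLE_FASTA)
check_fv_input(records)
```

A check of this kind only catches formatting problems; unusual residues arising from cloning artifacts or sequencing errors need the dedicated tools mentioned in the Notes.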


Briefly, the input sequences are split into distinct segments, and for each segment the experimental structure closest in sequence is chosen. The chosen structural segments are then assembled into an initial model, which is further refined and energy minimized (Fig. 4). Because accurate template structures are available for most of the FV components, the steps in the computational modeling of FV can be fully automated with minimal intervention. Hence, various fully automated web servers (see Note 1) are freely available for homology modeling of the FV, including RosettaAntibody [37] (see Note 2), Prediction of ImmunoGlobulin Structure (PIGS) [38] (see Note 3), Kotai Antibody Builder [39] (see Note 4), and ABodyBuilder [40] (see Note 5). The order in which the modeling steps (Fig. 4) are carried out varies among these web servers.

2.1 Identification of CDRs and FRs in the Input Sequences

In order to search for template structures for each of the FRL, FRH, CDRs L1–3, and H1–3 components of the FV, these components must first be identified in the input amino acid sequences of the VL and VH domains. To achieve this, the first step is to map the input VL and VH amino acid sequences onto a standardized antibody numbering scheme. Antibody numbering schemes annotate structurally equivalent residue positions within an antibody sequence by assigning an identifier to each amino acid of the VL and VH sequences. Various antibody numbering schemes are available in the literature [41]. Characteristics of commonly used schemes are summarized in Table 1.

Table 1 Different antibody numbering schemes

Scheme | Characteristics | Numbering tool
Kabat [42, 43] | Developed from the variable region sequence data; sequence insertion positions in CDRs L1 and H1 do not match the structural insertion positions | Abnum [44] (see Note 6)
Chothia [33] | Places the insertions in CDRs L1 and H1 at structurally correct positions | Abnum [44] (see Note 6)
Martin (Enhanced Chothia) [45] | Identical to Chothia but also places insertions in FRs at structurally correct positions | Abnum [44] (see Note 6)
Aho [46] | Places alignment gaps in a way that minimizes the average deviation from the averaged structure of the aligned domains; required for antibody design protocols | PyIgClassify [34] (see Note 7)
IMGT [47] | Counts residues in the variable region amino acid sequences continuously without using insertion codes | IMGT/DomainGapAlign [48] (see Note 8)


Table 2 Different CDR definitions following Chothia numbering scheme

CDR | Kabat  | Chothia | Martin | PyIgClassify | IMGT
L1  | 24–34  | 24–34   | 24–34  | 24–34        | 27–32
L2  | 50–56  | 50–56   | 50–56  | 49–56        | 50–52
L3  | 89–97  | 89–97   | 89–97  | 89–97        | 89–97
H1  | 31–35  | 26–32   | 26–35  | 23–35        | 26–33
H2  | 50–65  | 52–56   | 50–58  | 50–58        | 51–57
H3  | 95–102 | 95–102  | 95–102 | 93–102       | 93–102

These schemes vary in how they label positions and in the locations at which they allow insertions and deletions. Analyses of antibody sequences, antibody structures, and antibody-antigen complexes have led researchers to propose a number of CDR definitions for identifying CDRs in VL and VH amino acid sequences. The most commonly used definitions are: Kabat [42, 43] (based on the high variability in CDR sequences compared with other antibody regions), Chothia [33] (based on conserved conformations of CDRs in available crystal structures), Martin (enhanced Chothia) [45] (based on analyses of available antigen-antibody complex structures), PyIgClassify [32, 34] (based on symmetrical CDRs in which the N- and C-terminal residues are opposite each other in the conformation), and IMGT [47, 48] (sequence-based, like Kabat). A combination of an antibody numbering scheme and a CDR definition can be used to identify the FRL, FRH, CDRs L1–3, and H1–3 components in the input VL and VH amino acid sequences. Since the Chothia numbering scheme is based on structural considerations, it is widely used by antibody structure prediction tools (see Note 9). Table 2 lists different CDR definitions following the Chothia numbering scheme. The advantage of a structure-based numbering scheme is that structurally equivalent positions are numbered identically, which facilitates alignments and comparisons. Multiple sequence alignments of VL and VH domains have revealed that the residues flanking the CDRs are conserved, with the Cys residues as the best-conserved feature. Hence, the positions and identities of these flanking residues can also be used to identify CDRs in the input sequences. The rules for identifying CDRs by inspecting the input sequences alone are given in Table 3 (http://www.bioinf.org.uk/abs/info.html#cdrid). Once a numbering scheme is selected, most tools use regular expressions dictated by the CDR definition to identify CDRs in the input sequences (see Notes 10 and 11). RosettaAntibody [37] uses regular expression matching to the Kabat CDR definitions with the antibody sequence numbered according to the Chothia scheme.
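Once a sequence has been numbered, applying a CDR definition amounts to selecting residues by position. The sketch below transcribes the ranges of Table 2 and applies them to a Chothia-numbered chain. It assumes the numbering has already been produced by an external tool; the tuple layout, the function name, and the toy heavy chain are hypothetical and serve only to show the bookkeeping.

```python
from typing import Dict, List, Tuple

# Inclusive (start, end) positions in Chothia numbering, transcribed from Table 2.
CDR_DEFINITIONS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "Kabat": {"L1": (24, 34), "L2": (50, 56), "L3": (89, 97),
              "H1": (31, 35), "H2": (50, 65), "H3": (95, 102)},
    "Chothia": {"L1": (24, 34), "L2": (50, 56), "L3": (89, 97),
                "H1": (26, 32), "H2": (52, 56), "H3": (95, 102)},
    "Martin": {"L1": (24, 34), "L2": (50, 56), "L3": (89, 97),
               "H1": (26, 35), "H2": (50, 58), "H3": (95, 102)},
    "PyIgClassify": {"L1": (24, 34), "L2": (49, 56), "L3": (89, 97),
                     "H1": (23, 35), "H2": (50, 58), "H3": (93, 102)},
    "IMGT": {"L1": (27, 32), "L2": (50, 52), "L3": (89, 97),
             "H1": (26, 33), "H2": (51, 57), "H3": (93, 102)},
}

# One numbered residue: (position, insertion code, one-letter amino acid).
NumberedResidue = Tuple[int, str, str]


def extract_cdrs(numbered: List[NumberedResidue], chain: str,
                 definition: str = "Chothia") -> Dict[str, str]:
    """Return the CDR sequences of one chain ("L" or "H") under the chosen definition.

    Residues carrying an insertion code (for example 100A) keep the position
    number of the residue they follow, so they are selected with that position.
    """
    cdrs: Dict[str, str] = {}
    for name, (start, end) in CDR_DEFINITIONS[definition].items():
        if name.startswith(chain):
            cdrs[name] = "".join(aa for pos, _ins, aa in numbered if start <= pos <= end)
    return cdrs


# Toy heavy chain: glycine at positions 1-113 plus one inserted residue, 100A.
toy_heavy: List[NumberedResidue] = [(pos, "", "G") for pos in range(1, 114)]
toy_heavy.insert(100, (100, "A", "Y"))  # list index 100 is directly after position 100

kabat_cdrs = extract_cdrs(toy_heavy, "H", "Kabat")
chothia_cdrs = extract_cdrs(toy_heavy, "H", "Chothia")
```

The same numbered chain gives different loop boundaries under each definition, which is why a modeling tool has to state both the numbering scheme and the CDR definition it uses.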


Table 3 Rules for identifying CDRs in the input variable region sequences

CDR loop | Start | Residues before | Residues after | Length
L1 | Approximately residue 24 | C | W (typically WYQ, WLQ, WFQ, WYL) | 10–17
L2 | 16 residues after the end of L1 | – | Generally IY but also VY, IK, IF | Mostly 7
L3 | 33 residues after the end of L2 | C | FGXG | 7–11
H1 | Approximately residue 26 | CXXX | W (typically WV, but also WI, WA) | 10–12
H2 | 19 residues after the end of H1 | Typically LEWIG | (KR)(LIVFTA)(TSIA) | 9–12
H3 | 33 residues after the end of H2 | CXX (typically CAR) | WGXG | 3–25
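The flanking-residue rules of Table 3 translate naturally into regular expressions. The sketch below covers only the four loops for which Table 3 gives a motif on both sides (L1, L3, H1, and H3). The patterns are one possible reading of the table rather than the expressions used by any particular tool, and the example sequences, assembled from labeled segments, do not come from this chapter.

```python
import re
from typing import Dict, Optional

# Patterns written from the "Residues before", "Residues after" and "Length"
# columns of Table 3. The lazy quantifier stops at the first acceptable motif.
FLANK_PATTERNS = {
    "L1": re.compile(r"C([A-Z]{10,17}?)W(?:YQ|LQ|FQ|YL)"),
    "L3": re.compile(r"C([A-Z]{7,11}?)FG[A-Z]G"),
    "H1": re.compile(r"C[A-Z]{3}([A-Z]{10,12}?)W[VIA]"),
    "H3": re.compile(r"C[A-Z]{2}([A-Z]{3,25}?)WG[A-Z]G"),
}

# Example sequences for illustration only, written as FR / CDR segments.
EXAMPLE_VL = (
    "DIQMTQSPSSLSASVGDRVTITC"             # FRL, ends with the conserved Cys
    "RASQDVNTAVA"                         # CDR L1
    "WYQQKPGKAPKLLIY"                     # FRL
    "SASFLYS"                             # CDR L2 (not searched for below)
    "GVPSRFSGSRSGTDFTLTISSLQPEDFATYYC"    # FRL
    "QQHYTTPPT"                           # CDR L3
    "FGQGTKVEIK"                          # FRL
)
EXAMPLE_VH = (
    "EVQLVESGGGLVQPGGSLRLSCAAS"           # FRH
    "GFNIKDTYIH"                          # CDR H1
    "WVRQAPGKGLEWVA"                      # FRH
    "RIYPTNGYTRYADSVKG"                   # CDR H2 (not searched for below)
    "RFTISADTSKNTAYLQMNSLRAEDTAVYYCSR"    # FRH
    "WGGDGFYAMDY"                         # CDR H3
    "WGQGTLVTVSS"                         # FRH
)


def find_cdrs_by_flanks(sequence: str, chain: str) -> Dict[str, Optional[str]]:
    """Locate CDRs of one chain ("L" or "H") from conserved flanking residues.

    Returns None for a loop whose flanking motif is not found.
    """
    found: Dict[str, Optional[str]] = {}
    for name, pattern in FLANK_PATTERNS.items():
        if name.startswith(chain):
            match = pattern.search(sequence)
            found[name] = match.group(1) if match else None
    return found


light_cdrs = find_cdrs_by_flanks(EXAMPLE_VL, "L")
heavy_cdrs = find_cdrs_by_flanks(EXAMPLE_VH, "H")
```

Patterns of this kind fail on sequences with unusual flanking residues, which is the situation Note 10 addresses by relaxing the expressions.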

ABodyBuilder [40] employs the IMGT numbering scheme and PyIgClassify CDR definitions. The PIGS server [38] uses the Chothia numbering scheme, and Kotai Antibody Builder [39] uses PyIgClassify CDR definitions.

2.2 Identification and Selection of Template Structures for CDRs and FRs

Because the FRs correspond to the sequentially and structurally conserved sandwich of β-pleated sheets of the immunoglobulin fold, and the CDRs L1–3 and H1–2 belong to a limited number of canonical loop conformational classes, FV can be accurately modeled from the template structures selected for each of the above components. Templates can be selected for each of FRL, FRH, CDRs L1–3, CDRs H1–2, and the initial VL–VH orientation by a maximum sequence similarity search using BLAST+ [49]. Many tools have their own custom databases of high-quality antibody structures curated from the Protein Data Bank against which the BLAST search is made. ABodyBuilder [40] searches the SAbDab database [50] for template selection. Kotai Antibody Builder [39] has a MANGO module that selects template structures for the FRs and each CDR by a sequence-based database search and rule-based heuristics. In comparison to CDR H3, CDRs L1–3 and H1–2 have a limited number of loop conformations dictated by their length and sequence, which give rise to their respective canonical loop conformational classes [32, 33]. Hence, separate databases are maintained for each canonical loop conformational class. For instance, H2–10 represents the database for CDR H2 loops of length 10. If the CDR H2 loop identified from the input sequence is of length 10, its sequence is queried against the curated H2–10 database, which comprises structures of CDR H2 loops of length 10 extracted from the available antibody structures, and the best-scoring template by BLAST bit score is selected for modeling the input CDR H2 sequence.
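The length-specific lookup described above can be sketched in a few lines. This is an illustration under stated assumptions: a simple count of identical positions stands in for the BLAST bit score, and the library contents and template identifiers are hypothetical and do not correspond to PDB entries.

```python
from typing import Dict, List, NamedTuple, Optional


class LoopTemplate(NamedTuple):
    template_id: str  # hypothetical identifier, not a real PDB entry
    sequence: str


def identity_score(query: str, template: str) -> int:
    """Count identical positions; a simple stand-in for a BLAST bit score."""
    return sum(q == t for q, t in zip(query, template))


def select_loop_template(query: str, cdr: str,
                         library: Dict[str, List[LoopTemplate]]) -> Optional[LoopTemplate]:
    """Pick the best-scoring template from the database matching CDR type and length.

    The key mirrors the "H2-10" naming used in the text: CDR name, hyphen, loop length.
    """
    candidates = library.get(f"{cdr}-{len(query)}", [])
    if not candidates:
        return None  # no database exists for this loop length
    return max(candidates, key=lambda t: identity_score(query, t.sequence))


# Hypothetical library with one length-specific database per CDR type.
LOOP_LIBRARY: Dict[str, List[LoopTemplate]] = {
    "H2-10": [LoopTemplate("tmplA", "RIYPTNGYTR"), LoopTemplate("tmplB", "AISGSGGSTY")],
    "H2-9": [LoopTemplate("tmplC", "ISGSGGSTY")],
}

best = select_loop_template("RIDPANGNTK", "H2", LOOP_LIBRARY)
```

Because every candidate in a length-specific database has the same length as the query, no gaps are needed and a position-by-position comparison is well defined, which is part of the appeal of binning templates by loop length.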


The same principle follows for the other CDRs (see Note 7). Similarly, the best-scoring templates by BLAST bit score are selected for the FRs. Once the structural components corresponding to each of the input FRL, FRH, CDRs L1–3, and CDRs H1–2 sequences are selected, they are mutated in silico to match the residues in the query sequence components and are used for assembling the initial FV model.

2.3 Optimization of the Initial VL and VH Orientation

In naturally occurring antibodies, FV is stabilized by specific noncovalent interactions between the VL and VH domains, distributed across the interface formed by the faces of the β sheets from the respective FRL and FRH regions. Proper orientation of the VL and VH domains through the interacting FRL and FRH interfaces is crucial for maintaining the shape of the paratope and hence the antigen-recognizing capability of antibodies (see Note 12). Rearrangements in the VL and VH orientations have been linked to altered antibody affinities [51, 52]. Also, FRs act as a supporting scaffold for the CDRs to maintain their antigen-binding conformations. The templates selected for modeling the FRL and FRH regions may not have the correct relative orientation, which affects the placement of the CDRs. In addition, when residues in the selected FRL and FRH templates are mutated to match the residues of the query sequence, the mutations might introduce steric clashes and other non-permissible residue contacts. Thus, the interaction interface of the resultant FV model is far from optimized, and optimizing the initial orientation of the VL and VH is therefore crucial. One way is to select a single global template for modeling both FRL and FRH regions, that is, a single template with greater than 80% sequence identity for both FRL and FRH regions, with the expectation of a near-native VL and VH orientation. ABodyBuilder [40] and the PIGS server [38] prefer this “same antibody” approach. In cases where a single global template with greater than 80% sequence identity for both FRL and FRH regions is not available, hybrid templates are selected, that is, FRL and FRH templates originating from different antibody structures. In such instances, the VL and VH domains are re-oriented using the orientation from the global template with the highest sequence identity, with the re-orientation procedure carried out as described in Bujotzek et al. [53].

Another solution to this problem is to select multiple templates with different VL and VH orientations and use them to generate multiple initial FV models with different VL–VH orientations. These multiple initial FV models are then used for CDR grafting and further downstream steps. RosettaAntibody [37] selects ten distinct templates with ten different initial VL–VH orientations. Using multiple templates increases the probability that at least one of the templates will conform to the near-native interaction interface (see Notes 13 and 14) during downstream refinement and also avoids selection bias [19, 37].
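The choice between a single global template and hybrid templates can be written as a small decision rule. The sketch below assumes framework sequences that are already aligned to the query and of equal length. Ranking global candidates by the sum of the two identities is an assumption made here, and the antibody identifiers and sequences are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class FrameworkTemplate(NamedTuple):
    antibody_id: str  # hypothetical identifier, not a real PDB entry
    frl: str          # light-chain framework, pre-aligned to the query
    frh: str          # heavy-chain framework, pre-aligned to the query


def percent_identity(query: str, template: str) -> float:
    """Percentage of identical positions between two equal-length sequences."""
    if not query or len(query) != len(template):
        raise ValueError("sequences must be non-empty and pre-aligned to equal length")
    return 100.0 * sum(q == t for q, t in zip(query, template)) / len(query)


def choose_framework_templates(query_frl: str, query_frh: str,
                               templates: List[FrameworkTemplate],
                               threshold: float = 80.0) -> Tuple[str, str, str]:
    """Return (mode, FRL template id, FRH template id).

    mode is "global" when one antibody exceeds the identity threshold for both
    frameworks, and "hybrid" when the best FRL and FRH templates are picked
    independently of each other.
    """
    if not templates:
        raise ValueError("template list is empty")
    scored = [(percent_identity(query_frl, t.frl),
               percent_identity(query_frh, t.frh), t) for t in templates]
    global_hits = [s for s in scored if s[0] > threshold and s[1] > threshold]
    if global_hits:
        # Assumption: rank global candidates by the sum of both identities.
        best = max(global_hits, key=lambda s: s[0] + s[1])[2]
        return ("global", best.antibody_id, best.antibody_id)
    best_l = max(scored, key=lambda s: s[0])[2]
    best_h = max(scored, key=lambda s: s[1])[2]
    return ("hybrid", best_l.antibody_id, best_h.antibody_id)


QUERY_FRL, QUERY_FRH = "DIQMTQSPSS", "EVQLVESGGG"  # short toy framework segments

with_global = [FrameworkTemplate("abA", "DIQMTQSPST", "EVQLVESGGG"),
               FrameworkTemplate("abB", "DIQMTQSPSS", "QVKLQQAGGG")]
without_global = [FrameworkTemplate("abB", "DIQMTQSPSS", "QVKLQQAGGG"),
                  FrameworkTemplate("abC", "EIVLTQSPSS", "EVQLVESGGG")]

global_choice = choose_framework_templates(QUERY_FRL, QUERY_FRH, with_global)
hybrid_choice = choose_framework_templates(QUERY_FRL, QUERY_FRH, without_global)
```

In the hybrid case the two frameworks come from different antibodies, so their relative orientation still has to be set by a separate re-orientation step as described in the text.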


2.4 Grafting of CDR Templates, Building CDR H3, and Assembling Initial Model


Structural templates selected for, and mutated according to, the query CDRs are then grafted onto the FRs assembled in the previous step with optimized VL–VH orientation. As stated previously, residues flanking the CDRs are sequentially conserved, and their identities and positions are used to identify CDRs. Sequence conservation also leads to structure/conformation conservation, and hence the residues flanking the CDRs serve as grafting points. More precisely, grafting involves superimposition of the CDR flanking residues common to both the structural template selected for the query CDR and the FRs assembled in the previous step. The order of CDR modeling is crucial, as the conformation of one modeled loop may influence the conformation of the next loop being modeled. ABodyBuilder [40], which uses the FREAD database method [54] to model CDR loops, models them in the following order to minimize this influence: CDR L2, CDR H2, CDR L1, CDR H1, CDR L3, and CDR H3. Grafting often introduces nonstandard conformations at the graft points, accompanied by steric clashes. RosettaAntibody [37] uses cycles of minimization, random torsional sampling, and cyclic coordinate descent [55] so that only permissible geometric parameters are incorporated into the FV model. CDR H3 is the most hypervariable loop in terms of both length and sequence [35, 36]. Therefore, CDR H3 has to be modeled de novo based on its sequence [35, 36]. CDR H3 is also known to contribute significantly to antigen recognition, and its accurate modeling is crucial for the accuracy of the overall FV model [12]. Though CDR H3 does not have canonical structures like the other CDRs, there exists a classification of this loop’s conformation, based on the presence or absence of a β bulge in the region of the loop closer to the FRH, that can be predicted from its sequence [56]. The PIGS server [38] uses this sequence-based conformational classification of CDR H3 to filter templates from structure databases for modeling the CDR H3 loop.

Kotai Antibody Builder [39] uses Spanner [57] to build CDR H3 loop decoys and scores them with the OSCAR energy function. RosettaAntibody [37] initially builds coarse-grained models of CDR H3 in which each side-chain is represented by a low-resolution pseudo-atom, so that all-atom conformations are initially sampled for the backbone only. Once the all-atom backbone geometry is optimized, all-atom definitions are introduced for the CDR H3 side-chains, followed by all-atom refinement of CDR H3. Since CDR H3 modeling is not based on knowledge-based parameters, its incorporation might disturb the previously optimized VL–VH interface. Hence, the VL and VH domains are re-docked using RosettaDock [58], and the VL–VH interface is reoptimized in the context of the newly introduced CDR H3 loop. This completes the initial assembly of the FV structure model corresponding to the input sequences.
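The idea behind cyclic coordinate descent, mentioned above as one of the loop-closure steps, can be shown on a planar toy chain. The sketch below is a two-dimensional analogue only: real loop closure rotates backbone torsion angles in three dimensions, and nothing here reproduces the RosettaAntibody implementation. The function name, parameters, and example chain are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def ccd_close(chain: List[Point], target: Point,
              tol: float = 1e-6, max_sweeps: int = 1000) -> List[Point]:
    """Move the free end of a planar chain onto `target` by cyclic coordinate descent.

    The first point stays fixed (the anchored end of the loop). In each sweep
    every joint is visited in turn, from the one nearest the free end back to
    the anchor, and everything downstream of it is rotated so that the free end
    lies on the line from that joint to the target. Link lengths never change.
    """
    pts = [list(p) for p in chain]
    for _ in range(max_sweeps):
        for i in range(len(pts) - 2, -1, -1):
            px, py = pts[i]
            end_angle = math.atan2(pts[-1][1] - py, pts[-1][0] - px)
            target_angle = math.atan2(target[1] - py, target[0] - px)
            rotation = target_angle - end_angle
            c, s = math.cos(rotation), math.sin(rotation)
            for j in range(i + 1, len(pts)):
                dx, dy = pts[j][0] - px, pts[j][1] - py
                pts[j] = [px + c * dx - s * dy, py + s * dx + c * dy]
        if math.hypot(pts[-1][0] - target[0], pts[-1][1] - target[1]) < tol:
            break
    return [(p[0], p[1]) for p in pts]


# Four unit-length links stretched along the x axis; the target is within reach.
start_chain: List[Point] = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
target_point: Point = (2.0, 2.0)
closed_chain = ccd_close(start_chain, target_point)
gap = math.hypot(closed_chain[-1][0] - target_point[0],
                 closed_chain[-1][1] - target_point[1])
```

Each single-joint rotation can only reduce the distance between the free end and the target, which is why the method is attractive for closing a grafted or de novo built loop onto a fixed framework anchor.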


2.5 Side-Chain Optimization and Final Refinement

At this stage, the initially assembled FV model has the entire backbone built, along with the side-chains for residues that are identical between the selected templates and the input FV sequence. For residues that differ between the template and query sequences, side-chains are predicted using the SCWRL 4.0 method [59]. ABodyBuilder [40] uses relaxation by MODELLER [60] to remove any side-chain clashes in the final models. All-atom refinement is done to bring the stereochemistry of the modeled FV closer to experimentally observed native conformations, with side-chain conformations sampled from standard rotamer libraries (see Note 15). In antibody modeling protocols such as RosettaAntibody [37], where multiple models are generated, the models are sorted according to knowledge-based scoring potentials, which are akin to the concept of free energy, with low-scoring models having near-native conformations (see Note 16). RosettaAntibody [37] uses the Rosetta Energy Function to score the final models. These FV homology models can be used for further downstream applications.

3 Applications and Further Developments

Experimentally determined structures of antibody-antigen complexes are the best reference points for studying the mechanisms of antibody specificity and of antibody-mediated neutralization of antigens. However, structures cannot be experimentally determined for every antibody-antigen complex. Antibody homology models come to the rescue in cases where the structure of the antigen is either experimentally available or can be modeled with high precision using homology modeling practices. Often, epitope mapping studies provide information about the epitope [14]. Thus, constraints can be derived from all of the above experimentally available information and incorporated into experimentally guided computational antibody-antigen docking algorithms to accurately predict antibody-antigen complexes [19, 20], which can then be used for studying the mechanisms of antibody-mediated neutralization of antigens. For instance, it was possible to provide a structure-based rationalization for the neutralization of cytotoxic abrin by its monoclonal antibody D6F10 by first generating a homology model of the D6F10 FV and then using it in an experimentally guided computational antibody-antigen docking protocol to predict the D6F10-abrin complex [2]. Further, binding affinities can be calculated from the predicted antibody-antigen structures, and the feedback can be used in computational antibody design [61], wherein antibody sequence-structure-docking iterations in conjunction with mutagenesis help in designing antibodies with improved affinities [62]. In the absence of experimentally determined structures, antibody homology models and computationally docked antibody-antigen complexes provide fairly accurate estimates. Hence, the development of improved algorithms for antibody modeling, including increasing the accuracy and speed of CDR H3 loop modeling and of antibody-antigen docking, is being actively pursued [63]. Among machine learning methods, recent advances in deep learning-based techniques have shown promising results in their ability to predict protein structures from sequences [64], demonstrating accuracy on par with experimentally determined structures, as evidenced by AlphaFold2 [65] and RoseTTAFold [66]. While AlphaFold2 and RoseTTAFold are methods for general protein structure prediction, deep learning-based approaches specific to predicting accurate antibody FV structures have also been recently developed. DeepAB [67] combines a deep neural network for predicting 3D residue interaction networks with a Rosetta-based protocol for generating accurate antibody FV structures from predicted network topologies. ABlooper [68] is a deep learning-based tool for predicting accurate structures of CDR loops.

4

Notes 1. Local installations of antibody modeling tools provide greater control at the individual steps of the FV modeling protocol. 2. RosettaAntibody modeling server can be accessed via https:// rosie.rosettacommons.org/antibody/submit. 3. Prediction of ImmunoGlobulin Structure (PIGS) server can be accessed via https://www.hsls.pitt.edu/obrc/index.php? page¼URL1227716072. 4. Kotai Antibody Builder service can be accessed via https:// sysimm.org/rep_builder/. 5. ABodyBuilder antibody modeling service can be accessed via http://opig.stats.ox.ac.uk/webapps/newsabdab/sabpred/ abodybuilder/. This tool can also model nanobodies. 6. Abnum which can number antibody sequence using Kabat or Chothia or Martin schemes can be accessed via http://www. bioinf.org.uk/abs/abnum/. 7. PyIgClassify web server http://dunbrack2.fccc.edu/ PyIgClassify/ automates the process of identifying CDRs and outputs the identified CDRs along with their sequence, length, and canonical loop conformational class. 8. IMGT/DomainGapAlign for analyses of antibody sequences can be accessed via http://www.imgt.org/3Dstructure-DB/ cgi/DomainGapAlign.cgi. 9. RosettaAntibody server uses Chothia numbering scheme by default. However, the local RosettaAntibody application [37]

316

Harsh Bansia and Suryanarayanarao Ramakumar

provides the option to toggle between different numbering schemes among Chothia, Martin, Aho, IMGT and Kabat.
10. Sometimes, the input sequences contain flanking residues that do not match the usual CDR definitions. In such cases, the regular expressions can be altered to accommodate unusual sequences.
11. The AbCheck server (http://www.bioinf.org.uk/abs/seqtest.html) [69] can be used to identify unusual residues arising from potential cloning artifacts and sequencing errors in the input sequence.
12. ABangle [70], a tool for calculating and analyzing the VH–VL orientation in antibodies, can be accessed via https://www.stats.ox.ac.uk/~dunbar/abangle/.
13. The optimized VL–VH orientations of the FV models can be assessed for their closeness to the orientations observed in antibody crystal structures by generating plots for each of the four antibody light–heavy orientational coordinate frame (LHOC) metrics [2]. The LHOC metrics are packing angle, interdomain distance, heavy opening angle, and light opening angle [37].
14. The Packing Angle Prediction Server (PAPS), http://www.bioinf.org.uk/abs/paps/, can be used to predict the VH/VL packing angle.
15. Since it is predominantly the amino acid side-chains of the paratope and epitope that participate in the specific interactions at the antibody-antigen interface, side-chain optimization is of paramount importance in the final refinement stages of the FV modeling protocol.
16. It is preferable to generate multiple models at specific steps of the FV modeling protocol, rank them with knowledge-based scoring functions, and use the best-scoring models in the subsequent steps. For example, ensemble docking approximates backbone conformational flexibility and accounts for uncertainty in homology models [2, 19, 37].

References
1. Chi X, Li Y, Qiu X (2020) V(D)J recombination, somatic hypermutation and class switch recombination of immunoglobulins: mechanism and regulation. Immunology 160:233–247
2. Bansia H, Bagaria S, Karande AA, Ramakumar S (2019) Structural basis for neutralization of cytotoxic abrin by monoclonal antibody D6F10. FEBS J 286:1003–1029
3. Gowthaman R, Guest JD, Yin R, Adolf-Bryfogle J, Schief WR, Pierce BG (2020)

CoV3D: a database of high resolution coronavirus protein structures. Nucleic Acids Res 49:282–287
4. Tortorici MA, Beltramello M, Lempp FA, Pinto D, Dang HV, Rosen LE et al (2020) Ultrapotent human antibodies protect against SARS-CoV-2 challenge via multiple mechanisms. Science 370:950–957
5. Wajnberg A, Amanat F, Firpo A, Altman DR, Bailey MJ, Mansour M et al (2020) Robust neutralizing antibodies to SARS-CoV-2 infection persist for months. Science 370:1227–1230
6. Hurlburt NK, Seydoux E, Wan YH, Edara VV, Stuart AB, Feng J et al (2020) Structural basis for potent neutralization of SARS-CoV-2 and role of antibody affinity maturation. Nat Commun 11(1):5413
7. Piccoli L, Park YJ, Tortorici MA, Czudnochowski N, Walls AC, Beltramello M et al (2020) Mapping neutralizing and immunodominant sites on the SARS-CoV-2 spike receptor-binding domain by structure-guided high-resolution serology. Cell 183:1024–1042
8. Barnes CO, Jette CA, Abernathy ME, Dam KA, Esswein SR, Gristick HB et al (2020) SARS-CoV-2 neutralizing antibody structures inform therapeutic strategies. Nature 588:682–687
9. Yuan M, Wu NC, Zhu X, Lee CD, So RTY, Lv H et al (2020) A highly conserved cryptic epitope in the receptor binding domains of SARS-CoV-2 and SARS-CoV. Science 368:630–633
10. Bansia H, Mahanta P, Yennawar NH, Ramakumar S (2021) Small glycols discover cryptic pockets on proteins for fragment-based approaches. J Chem Inf Model 61:1322–1333
11. Shirai H, Ikeda K, Yamashita K, Tsuchiya Y, Sarmiento J, Liang S et al (2014) High-resolution modeling of antibody structures by a combination of bioinformatics, expert knowledge, and molecular simulations. Proteins 82:1624–1635
12. Almagro JC, Teplyakov A, Luo J, Sweet RW, Kodangattil S, Hernandez-Guzman F et al (2014) Second antibody modeling assessment (AMA-II). Proteins 82:1553–1562
13. Surendranath K, Karande AA (2008) A neutralizing antibody to the A chain of abrin inhibits abrin toxicity both in vitro and in vivo. Clin Vaccine Immunol 15:737–743
14. Bagaria S, Ponnalagu D, Bisht S, Karande AA (2013) Mechanistic insights into the neutralization of cytotoxic abrin by the monoclonal antibody D6F10. PLoS One 8(7):e70273
15. Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, Quake SR (2014) The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotechnol 32:158–168
16. Kuroda D, Shirai H, Jacobson MP, Nakamura H (2012) Computer aided antibody design. Protein Eng Des Sel 25:507–521
17. Kaplon H, Muralidharan M, Schneider Z, Reichert JM (2020) Antibodies to watch in 2020. MAbs 12(1):1703531
18. Lippow SM, Wittrup KD, Tidor B (2007) Computational design of antibody-affinity improvement beyond in vivo maturation. Nat Biotechnol 25:1171–1176
19. Sivasubramanian A, Sircar A, Chaudhury S, Gray JJ (2009) Toward high-resolution homology modeling of antibody Fv regions and application to antibody-antigen docking. Proteins 74:497–514
20. Pedotti M, Simonelli L, Livoti E, Varani L (2011) Computational docking of antibody-antigen complexes, opportunities and pitfalls illustrated by influenza hemagglutinin. Int J Mol Sci 12:226–251
21. Correia BE, Bates JT, Loomis RJ, Baneyx G, Carrico C, Jardine JG et al (2014) Proof of principle for epitope-focused vaccine design. Nature 507:201–206
22. Hoos A, Ibrahim R, Korman A, Abdallah K, Berman D, Shahabi V et al (2010) Development of ipilimumab: contribution to a new paradigm for cancer immunotherapy. Semin Oncol 37:533–546
23. Callahan MK, Wolchok JD, Allison JP (2010) Anti-CTLA-4 antibody therapy: immune monitoring during clinical development of a novel immunotherapy. Semin Oncol 37:473–484
24. Yasunaga M (2020) Antibody therapeutics and immunoregulation in cancer and autoimmune disease. Semin Cancer Biol 64:1–12
25. Hafeez U, Gan HK, Scott AM (2018) Monoclonal antibodies as immunomodulatory therapy against cancer and autoimmune diseases. Curr Opin Pharmacol 41:114–121
26. Norman RA, Ambrosetti F, Bonvin AMJJ, Colwell LJ, Kelm S, Kumar S et al (2020) Computational approaches to therapeutic antibody design: established methods and emerging trends. Brief Bioinform 21:1549–1567
27. Chiu ML, Goulet DR, Teplyakov A, Gilliland GL (2019) Antibody structure and function: the basis for engineering therapeutics. Antibodies 8(4):55
28. Wu TT, Kabat EA (1970) An analysis of the sequences of the variable regions of Bence Jones proteins and myeloma light chains and their implications for antibody complementarity. J Exp Med 132:211–250
29. Padlan EA (1994) Anatomy of the antibody molecule. Mol Immunol 31:169–217
30. Hayes JM, Cosgrave EF, Struwe WB, Wormald M, Davey GP, Jefferis R et al (2014) Glycosylation and Fc receptors. Curr Top Microbiol Immunol 382:165–199
31. Collis AV, Brouwer AP, Martin AC (2003) Analysis of the antigen combining site: correlations between length and sequence composition of the hypervariable loops and the nature of the antigen. J Mol Biol 325:337–354


32. North B, Lehmann A, Dunbrack RL (2011) A new clustering of antibody CDR loop conformations. J Mol Biol 406:228–256
33. Al-Lazikani B, Lesk AM, Chothia C (1997) Standard conformations for the canonical structures of immunoglobulins. J Mol Biol 273:927–948
34. Adolf-Bryfogle J, Xu Q, North B, Lehmann A, Dunbrack RL (2015) PyIgClassify: a database of antibody CDR structural classifications. Nucleic Acids Res 43:432–438
35. Reczko M, Martin AC, Bohr H, Suhai S (1995) Prediction of hypervariable CDR-H3 loop structures in antibodies. Protein Eng 8:389–395
36. Zhu K, Day T (2013) Ab initio structure prediction of the antibody hypervariable H3 loop. Proteins 81:1081–1089
37. Weitzner BD, Jeliazkov JR, Lyskov S, Marze N, Kuroda D, Frick R et al (2017) Modeling and docking of antibody structures with Rosetta. Nat Protoc 12:401–416
38. Marcatili P, Olimpieri PP, Chailyan A, Tramontano A (2014) Antibody modeling using the prediction of immunoglobulin structure (PIGS) web server [corrected]. Nat Protoc 9:2771–2783
39. Yamashita K, Ikeda K, Amada K, Liang S, Tsuchiya Y, Nakamura H et al (2014) Kotai Antibody Builder: automated high-resolution structural modeling of antibodies. Bioinformatics 30:3279–3280
40. Leem J, Dunbar J, Georges G, Shi J, Deane CM (2016) ABodyBuilder: automated antibody structure prediction with data-driven accuracy estimation. MAbs 8:1259–1268
41. Dondelinger M, Filée P, Sauvage E, Quinting B, Muyldermans S, Galleni M et al (2018) Understanding the significance and implications of antibody numbering and antigen-binding surface/residue definition. Front Immunol 9:2278
42. Kabat EA, Wu TT, Bilofsky H (1976) Attempts to locate residues in complementarity-determining regions of antibody combining sites that make contact with antigen. Proc Natl Acad Sci U S A 73:617–619
43. Johnson G, Wu TT (2000) Kabat database and its applications: 30 years after the first variability plot. Nucleic Acids Res 28:214–218
44. Abhinandan KR, Martin AC (2008) Analysis and improvements to Kabat and structurally correct numbering of antibody variable domains. Mol Immunol 45:3832–3839
45. Kontermann R, Dübel S (eds) (2010) Antibody engineering. Springer protocols handbooks. Springer, Berlin, Heidelberg, pp 33–51
46. Honegger A, Plückthun A (2001) Yet another numbering scheme for immunoglobulin variable domains: an automatic modeling and analysis tool. J Mol Biol 309:657–670
47. Lefranc M-P, Pommié C, Ruiz M, Giudicelli V, Foulquier E, Truong L et al (2003) IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains. Dev Comp Immunol 27:55–77
48. Ehrenmann F, Kaas Q, Lefranc MP (2010) IMGT/3Dstructure-DB and IMGT/DomainGapAlign: a database and a tool for immunoglobulins or antibodies, T cell receptors, MHC, IgSF and MhcSF. Nucleic Acids Res 38:301–307
49. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K et al (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421
50. Dunbar J, Krawczyk K, Leem J, Baker T, Fuchs A, Georges G et al (2014) SAbDab: the structural antibody database. Nucleic Acids Res 42:1140–1146
51. Foote J, Winter G (1992) Antibody framework residues affecting the conformation of the hypervariable loops. J Mol Biol 224:487–499
52. Nakanishi T, Tsumoto K, Yokota A, Kondo H, Kumagai I (2008) Critical contribution of VH-VL interaction to reshaping of an antibody: the case of humanization of anti-lysozyme antibody, HyHEL-10. Protein Sci 17:261–270
53. Bujotzek A, Dunbar J, Lipsmeier F, Schäfer W, Antes I, Deane CM et al (2015) Prediction of VH-VL domain orientation for antibody variable domain modeling. Proteins 83:681–695
54. Choi Y, Deane CM (2010) FREAD revisited: accurate loop structure prediction using a database search algorithm. Proteins 78:1431–1440
55. Canutescu AA, Dunbrack RL (2003) Cyclic coordinate descent: a robotics algorithm for protein loop closure. Protein Sci 12:963–972
56. Kuroda D, Shirai H, Kobori M, Nakamura H (2008) Structural classification of CDR-H3 revisited: a lesson in antibody modeling. Proteins 73:608–620
57. Lis M, Kim T, Sarmiento J, Kuroda D, Dinh HQ, Kinjo A et al (2011) Bridging the gap between single-template and fragment based protein structure modeling using Spanner. Immunome Res 7:1–8
58. Gray JJ, Moughon S, Wang C, Schueler-Furman O, Kuhlman B, Rohl CA et al (2003) Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol 331:281–299

59. Krivov GG, Shapovalov MV, Dunbrack RL (2009) Improved prediction of protein side-chain conformations with SCWRL4. Proteins 77:778–795
60. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234:779–815
61. Adolf-Bryfogle J, Kalyuzhniy O, Kubitz M, Weitzner BD, Hu X, Adachi Y et al (2018) RosettaAntibodyDesign (RAbD): a general framework for computational antibody design. PLoS Comput Biol 14(4):e1006112
62. Warszawski S, Borenstein Katz A, Lipsh R, Khmelnitsky L, Ben Nissan G, Javitt G et al (2019) Optimizing antibody affinity and stability by the automated design of the variable light-heavy chain interfaces. PLoS Comput Biol 15(8):e1007207
63. Jeliazkov JR, Frick R, Zhou J, Gray JJ (2021) Robustification of RosettaAntibody and Rosetta SnugDock. PLoS One 16(3):e0234282
64. Gao W, Mahajan SP, Sulam J, Gray JJ (2020) Deep learning in protein structural modeling and design. Patterns (N Y) 1(9):100142
65. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O et al (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596:583–589
66. Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR et al (2021) Accurate prediction of protein structures and interactions using a three-track neural network. Science 373:871–876
67. Ruffolo JA, Sulam J, Gray JJ (2022) Antibody structure prediction using interpretable deep learning. Patterns (N Y) 3(2):100406
68. Abanades B, Georges G, Bujotzek A, Deane CM (2022) ABlooper: fast accurate antibody CDR loop structure prediction with accuracy estimation. Bioinformatics 38:1877–1880
69. Martin AC (1996) Accessing the Kabat antibody sequence database by computer. Proteins 25:130–133
70. Dunbar J, Fuchs A, Shi J, Deane CM (2013) ABangle: characterising the VH-VL orientation in antibodies. Protein Eng Des Sel 26:611–620

Chapter 17
3D-BMPP: 3D Beta-Barrel Membrane Protein Predictor
Wei Tian, Meishan Lin, Ke Tang, Manisha Barse, Hammad Naveed, and Jie Liang

Abstract
β-barrel membrane proteins (βMPs), found in the outer membranes of gram-negative bacteria, mitochondria, and chloroplasts, play important roles in membrane anchoring, pore formation, and enzyme activity. However, it is often difficult to determine their structures experimentally, and knowledge of their structures is currently limited. We have developed a method to predict the 3D architectures of βMPs. We can accurately construct the transmembrane domains of βMPs by predicting their strand registers, from which full 3D atomic structures are derived. Using the 3D Beta-barrel Membrane Protein Predictor (3D-BMPP), we can further accurately model the extended beta barrels and loops in non-TM regions, with overall greater structure prediction coverage. 3D-BMPP is a general technique that can be applied to protein families with limited sequences as well as to proteins with novel folds. 3D-BMPP can be broadly applied to genome-wide βMP structure prediction.

Key words β-barrel membrane proteins, Structure prediction, Sequence covariation, Strand register, Computer simulation, Loop prediction, Sequential Monte Carlo sampling

1 Introduction

β-barrel membrane proteins (βMPs) are medically important: bacterial βMPs are an important candidate class of molecular targets for the development of antimicrobial drugs and vaccines. Advances in the study of βMPs also show promise in bionanotechnology, for example in bionanopore sensor development. A major hindrance in the study of βMPs is the limited structural knowledge: as of November 2020, only 552 βMP structures, of which 323 are unique [1], have been deposited in the Protein Data Bank (PDB), which contains over 170,000 structures [2]. This limitation also hinders understanding of the structural basis of the function and mechanism of βMPs.

Sławomir Filipek (ed.), Homology Modeling: Methods and Protocols, Methods in Molecular Biology, vol. 2627, https://doi.org/10.1007/978-1-0716-2974-1_17, © Springer Science+Business Media, LLC, part of Springer Nature 2023


Wei Tian et al.

Computational structure prediction can bridge the gap between identified βMP sequences and resolved βMP structures by providing high-resolution and high-accuracy model structures. Here we describe a template-free method for predicting 3D structures of βMPs, which provides significant improvement over previous methods [3]. Our method, named 3D beta-barrel membrane protein predictor (3D-BMPP), is based on a statistical mechanical model [4] that incorporates sequence covariation information and is built upon a parametric structural model of intertwined zigzag coils. In addition, predictions are extended to non-TM regions, including both extended β-sheets and loops, with significantly enriched coverage of residues. Furthermore, our method can be applied to model structures of βMPs with novel folds, including those from mitochondria of eukaryotes, as corroborated by the accurately modeled structures of VDAC and FimD. Our method is general and can be broadly applied to genome-wide structure prediction of βMPs.

2 Materials

3D-BMPP is a Python framework with source code and scripts available from the public git repository: https://github.com/jksr/3dbmpp. In addition, the 3D-BMPP predictor depends on other software, including the BBQ algorithm, PSICOV, and SCWRL4 (see Notes 1 and 2).

2.1 Additional Equipment

1. Linux HPC cluster or workstation equipped with at least 2 GB RAM per computational node.
2. G++ compiler version ≥4.7 (https://gcc.gnu.org/).
3. CMake software version ≥4.8 (https://cmake.org/).
4. Proteins as FASTA files can be obtained from the Protein Data Bank [2].
5. Git software, for obtaining the 3D-BMPP and PSICOV source code, version ≥1.7.1 (https://git-scm.com).
6. The Python packages NumPy and Biopython, along with Java class utilities, are required by our predictor.
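Before installation, the software prerequisites listed above can be checked with a short script. The following is a minimal sketch, not part of 3D-BMPP: the executable names (g++, cmake, git, java) are assumptions based on the tools listed, the function name find_missing is hypothetical, and the version requirements are not verified.

```python
import importlib.util
import shutil

# Executable names are assumptions: the list above names the tools,
# not the binaries that provide them on a given system.
REQUIRED_TOOLS = ["g++", "cmake", "git", "java"]
# Import names of the NumPy and Biopython packages.
REQUIRED_MODULES = ["numpy", "Bio"]


def find_missing(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES):
    """Return (missing_tools, missing_modules) for the current environment.

    Only presence is checked; minimum versions are not verified.
    """
    missing_tools = [t for t in tools if shutil.which(t) is None]
    missing_modules = [
        m for m in modules if importlib.util.find_spec(m) is None
    ]
    return missing_tools, missing_modules


if __name__ == "__main__":
    tools, modules = find_missing()
    print("Missing executables:", ", ".join(tools) or "none")
    print("Missing Python modules:", ", ".join(modules) or "none")
```

Anything reported as missing should be installed first; the minimum versions given in the list still need to be confirmed by hand.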

2.2 Equipment Setup

We assume access to a Linux terminal running a bash shell. Download and install all listed software. We recommend downloading 3D-BMPP and PSICOV using the git command line.
1. In a Linux terminal, obtain the 3D-BMPP source code via the git clone command:
   git clone https://github.com/jksr/3dbmpp-pipe.git

Predicting β-Barrel Membrane Protein Structure


2. Next, navigate to the source folder with cd bin/src, run the make command, and then return to the top-level folder with cd ../..

3. Follow analogous procedures to download and install PSICOV, and save it under a folder named psicov. In addition, PSICOV requires HHBLITS as the alignment tool for predicting contacts for a target sequence. Follow the HHBLITS documentation for installation instructions [5].
4. To download SCWRL, you must apply for a license on its website. We then recommend unpacking it into a folder named scwrl in the top-level folder of this repository.
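Once the setup steps are complete, the resulting folder layout can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the helper name missing_dirs is hypothetical, bin/src and scwrl follow the steps above, and placing psicov directly in the top-level folder is an assumption, since the text only names the folder.

```python
from pathlib import Path

# Paths relative to the repository root. "bin/src" and "scwrl" follow the
# setup steps; placing "psicov" at the top level is an assumption.
EXPECTED_DIRS = ["bin/src", "psicov", "scwrl"]


def missing_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected sub-folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in expected if not (root / d).is_dir()]


if __name__ == "__main__":
    # "3dbmpp-pipe" is the folder a default git clone of the URL above creates.
    print("Missing folders:", missing_dirs("3dbmpp-pipe") or "none")
```

An empty list only shows that the folders exist; it does not confirm that the make step or the PSICOV and SCWRL installations succeeded.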

3 Methods

To predict the structures of βMPs, we proceed in three steps: predicting strand registers (interstrand hydrogen bond contacts), predicting 3D coordinates of TM residues, and modeling non-TM residues (Fig. 1). Detailed information on the methods can be found in SI Appendix, sections 2–5 [6] (see Note 3).
1. First, create a folder named after the PDB ID of the target protein to store all input files and results.
2. Put the corresponding FASTA file into the created folder.
3. Put a file with rough information on the start and end points of the beta strands into the folder. Please refer to example/1bxw.strands.

Fig. 1 The flowchart of βMP structure prediction method 3D-BMPP [6]


The sequence IDs (seqid) of the start and end points should be consistent with the FASTA file. The start and end points of the beta strands can be determined approximately with any of the third-party secondary structure prediction software listed at http://www.ompdb.org/links.php. We recommend using a consensus secondary structure prediction from PRED-TMBB2 [7], BOCTOPUS2 [8], and BetAware [9]. Information from the PSICOV sequence covariation analysis [10] may aid the structure prediction, but it is not a mandatory input: to skip this step, create an empty file with a filename ending in .psicov in the folder. However, skipping it might affect the accuracy of the prediction (see Notes 4 and 5).
4. In this method, transmembrane beta-barrel proteins (TMBs) are classified into five groups depending on the number of beta strands (Table 1) (see Notes 6 and 7). Examples of the five groups are shown in Fig. 2.
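The input preparation described in steps 1–3 can be sketched in Python as follows. This is an illustration under stated assumptions rather than part of 3D-BMPP: both function names are hypothetical, the per-residue encoding with "E" for strand residues is an assumed common representation of the predictor outputs, the majority-vote rule is one simple way to form a consensus, and the file names <pdb_id>.fasta and <pdb_id>.psicov are assumptions modeled on example/1bxw.strands. The format of the .strands file is not described here, so the sketch does not write one.

```python
import shutil
from pathlib import Path


def consensus_strands(predictions, min_votes=2):
    """Majority-vote strand segments from equal-length per-residue strings.

    Each string marks strand residues with 'E' (an assumed encoding).
    Returns 1-based inclusive (start, end) pairs.
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same residues")
    votes = [
        sum(p[i] == "E" for p in predictions) >= min_votes
        for i in range(length)
    ]
    segments, start = [], None
    for pos, is_strand in enumerate(votes, start=1):
        if is_strand and start is None:
            start = pos                       # a strand segment opens here
        elif not is_strand and start is not None:
            segments.append((start, pos - 1))  # segment closed at previous residue
            start = None
    if start is not None:                     # segment runs to the last residue
        segments.append((start, length))
    return segments


def prepare_target_folder(pdb_id, fasta_path, workdir="."):
    """Create <workdir>/<pdb_id>, copy the FASTA file, add an empty .psicov file."""
    target = Path(workdir) / pdb_id
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy(fasta_path, target / f"{pdb_id}.fasta")
    psicov = target / f"{pdb_id}.psicov"
    if not psicov.exists():
        psicov.touch()  # empty placeholder: covariation input is skipped
    return target
```

The segments returned by consensus_strands are only rough start and end points; they would still need to be written into the strands file in the layout shown in example/1bxw.strands. If PSICOV output is available, it should replace the empty placeholder file.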

Table 1 Five groups of TMBs based on the number of their strands Groups Description

Example PDB ids

1

Small TMBs (strand#