English Pages 95 [96] Year 2023
SpringerBriefs in Applied Sciences and Technology Silvano Coletti · Gabriella Bernardi Editors
Exscalate4CoV High-Performance Computing for COVID Drug Discovery
SpringerBriefs in Applied Sciences and Technology
SpringerBriefs present concise summaries of cutting-edge research and practical applications across a wide spectrum of fields. Featuring compact volumes of 50 to 125 pages, the series covers a range of content from professional to academic. Typical publications can be:
• A timely report of state-of-the-art methods
• An introduction to or a manual for the application of mathematical or computer techniques
• A bridge between new research results, as published in journal articles
• A snapshot of a hot or emerging topic
• An in-depth case study
• A presentation of core concepts that students must understand in order to make independent contributions
SpringerBriefs are characterized by fast, global electronic dissemination, standard publishing contracts, standardized manuscript preparation and formatting guidelines, and expedited production schedules.
On the one hand, SpringerBriefs in Applied Sciences and Technology are devoted to the publication of fundamentals and applications within the different classical engineering disciplines as well as in interdisciplinary fields that recently emerged between these areas. On the other hand, as the boundary separating fundamental research and applied technology is increasingly dissolving, this series is particularly open to trans-disciplinary topics between fundamental science and engineering.
Indexed by EI-Compendex, SCOPUS and Springerlink.
Editors Silvano Coletti Chelonia SA Lugano, Ticino, Switzerland
Gabriella Bernardi Lugano, Ticino, Switzerland
ISSN 2191-530X ISSN 2191-5318 (electronic) SpringerBriefs in Applied Sciences and Technology ISBN 978-3-031-30690-7 ISBN 978-3-031-30691-4 (eBook) https://doi.org/10.1007/978-3-031-30691-4 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Thales the primary question was not ‘What do we know?’ but ‘How do we know it?’ —Aristotle
Foreword
Computers as Primary Tools in Drug Discovery
The ability to correlate the structure of protein–ligand complexes with their binding affinity has been a long-standing dream and challenge of computational chemists. This challenge reflects the need for effective tools in the rational design of drugs that would block the activity of proteins involved in devastating diseases. Obviously, reaching accurate predictions in calculations of binding energies is one of the most important aims of computational biology. However, progressing in this direction has been far from simple. The starting point for addressing this challenge has been the progress in structural evaluations of protein–ligand complexes. The enormous advances on this front call for corresponding progress in the use of computer modeling to estimate the binding affinity of protein–ligand complexes. The corresponding progress in estimating binding energies has been very slow. Qualitative quantitative structure–activity relationship (QSAR) approaches were reported quite early, followed by very primitive attempts to compute the energetics of ligand–protein complexes. Some of the attempts involved a problematic focus on studies of the ligand without the protein. Others included the protein but did not consider the solvent around the protein and the energetics of the ligand in water. The use of thermodynamic cycles in computations of protein–substrate systems emerged in 1981 and began to be applied with simplified solvent models. More rigorous free energy perturbation (FEP) studies of protein–ligand interactions emerged in 1984 but progressed very slowly, except in the calculation of enzymatic reactions. The problem has been that such calculations face enormous convergence problems when applied to absolute binding free energies. The situation is more reasonable when one focuses on evaluating substitutions of small parts of the ligand.
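The thermodynamic-cycle and FEP ideas mentioned above can be written compactly. What follows is a textbook-style sketch, not necessarily the exact formulation used in the 1981 and 1984 studies; the symbols (ligands L and L', potential energies U_A and U_B, Boltzmann constant k_B, temperature T) are notation introduced here for illustration only.

```latex
% Relative binding free energy of two ligands L and L' from a thermodynamic cycle:
% the difference of two binding free energies equals the difference of the two
% "mutation" free energies, computed once in the protein site and once in water.
\Delta\Delta G_{\mathrm{bind}}(L \to L')
  = \Delta G_{\mathrm{bind}}(L') - \Delta G_{\mathrm{bind}}(L)
  = \Delta G^{\mathrm{protein}}_{L \to L'} - \Delta G^{\mathrm{water}}_{L \to L'}

% Free energy perturbation estimate for transforming state A into state B,
% averaged over configurations sampled in state A:
\Delta G_{A \to B}
  = -k_B T \,\ln \left\langle \exp\!\left[-\frac{U_B - U_A}{k_B T}\right] \right\rangle_A
```

In this picture only the small mutated part of the ligand changes along each leg of the cycle, which is one way to see why substitutions of small ligand fragments tend to converge more readily than absolute binding free energies.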
As far as evaluations of absolute binding energies are concerned, it appeared that, in many cases, macroscopic and semi-macroscopic models gave more stable results than the corresponding microscopic models did. These models included approaches that are based on the linear response approximation, such as the protein–dipole Langevin–dipole model with the linear response approximation treatment with a
scaled non-electrostatic term (PDLD/S-LRA/β) and the significantly less consistent molecular mechanics generalized Born surface area (MM/GBSA) approach. Another major problem is the evaluation of the effect of water penetration. Not only is it difficult in such cases to identify the position of the water molecules, but their positions also change drastically during the binding process. In principle, one can try to insert water molecules in a grand canonical Monte Carlo treatment, but the chance of water molecules being accepted is extremely small because, in most insertion attempts, we will not have a large enough cavity in the protein site. Our solution to this problem has been the “water flooding” approach that effectively reproduces grand canonical results. Another important direction is the use of computations to determine protein–protein interactions, for which, at present, the best option is to use coarse-grained (CG) models. Another promising new direction is the use of machine learning. This direction is “orthogonal” to the use of physics-based modeling because the physics is hidden in the data, but the potential of exploiting big data is very promising. Despite the current problems, there is no doubt that computer-aided calculations will eventually become the key quantitative tool in rational drug design. Once a more quantitative level for calculating protein–ligand binding is achieved, we will see several promising directions. For example, it will be possible to augment automated screening procedures and greatly accelerate the refinement of the final drug candidates. Major advances in computer-aided drug design will clearly emerge in the future. The only question is how long it will take to make such approaches quantitative. We believe that strategies that go below the 1 kcal/mol error limit would be sufficiently quantitative and advance the field toward a new era.
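To give a feel for what a 1 kcal/mol error limit means in practice, the short sketch below converts an error in the binding free energy into a multiplicative error in the dissociation constant, using the standard relation ΔG = RT ln K_d. The temperature (298.15 K) and the function name are assumptions made for this illustration; they do not come from the foreword.

```python
import math

# Gas constant in kcal/(mol*K); standard value, rounded.
R_KCAL_PER_MOL_K = 0.0019872


def fold_error_in_kd(delta_g_error_kcal: float, temperature_k: float = 298.15) -> float:
    """Multiplicative error in K_d implied by an error in the binding free energy.

    Follows from Delta G = R*T*ln(K_d): shifting Delta G by `delta_g_error_kcal`
    multiplies K_d by exp(error / (R*T)).
    """
    rt = R_KCAL_PER_MOL_K * temperature_k
    return math.exp(delta_g_error_kcal / rt)


if __name__ == "__main__":
    # Illustrative error levels only (hypothetical, in kcal/mol).
    for err in (0.5, 1.0, 2.0):
        fold = fold_error_in_kd(err)
        print(f"{err:.1f} kcal/mol error -> about {fold:.1f}-fold error in K_d")
```

At room temperature, an error of 1 kcal/mol corresponds to roughly a fivefold uncertainty in the predicted dissociation constant, which helps explain why this threshold is often treated as the boundary of quantitatively useful predictions.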
The enormous increase in computer power in the past decade is leading to a paradigm shift in computer-aided drug design. We are probably approaching a stage in which the computing power will be translated to reliable performance, and drug screening by computers will be more effective than experimental screening. Here, the Exscalate4COV project and other projects described in this book mark the emergence of such directions. Open-access collaboration is another promising path wherein the effort of a large scientific community can greatly help push the field forward. Educating large parts of the community regarding the power of computers in drug design also holds great value. Educating future medical professionals and future decision-makers will substantially contribute to realizing the aim of using computers as the key tool in developing new medications.
Arieh Warshel
2013 Nobel Prize winner in Chemistry
University of Southern California
Los Angeles, CA, USA
Reference
1. A. Warshel, Theoretical studies of drug receptor interaction. Trends in Biochem. Sci. 1, N105–N106 (1976)
Acknowledgments
This book would not have been possible without the support, encouragement, and feedback of the many team members, authors, and collaborators of the EXSCALATE4COV and LIGATE projects who worked to complete their tasks. This book aims, for the first time, to gather available evidence on research and use cases of artificial intelligence techniques and HPC resources that aim to accelerate the drug discovery and development process. It aims to indicate avenues for future research and policy actions that could impact precision medicine. Eminent international experts have been invited to contribute to this work, thus representing the state of the art in this field. As the recognition and consideration of AI in drug development are increasing in significance and applications at all levels, the need has arisen for a reference text that can be used across disciplines to guide research and practice based on a precision medicine paradigm. This book is meant for drug hunters, AI and HPC experts, students, policymakers, academic researchers, and other stakeholders to encourage and promote a precision medicine approach in basic research and clinical practice, as well as novel policy actions. First, we would like to thank the contributing authors for their enthusiasm in joining the project and their dedication to the chapters. We would like to express gratitude to all our authors, coauthors, and coordinators from the EXSCALATE4COV, LIGATE, and REMEDI4ALL European projects; to SAS Inc. and Nanome Inc.; and finally to the Dompé farmaceutici and Chelonia SA teams for their full-time contributions to this book. Our thanks also go to those who provided support, assistance, collaboration, reviews, and fruitful discussions about the book content together with essential comments.
Exscalate4COV is a project funded by the Horizon 2020 research and innovation programme under grant agreement No. 101003551.
LIGATE is a project funded by the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 956137. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Italy, Sweden, Austria, Czech Republic, and Switzerland.
REMEDI4ALL is a project funded by the Horizon Europe Research and Innovation programme under grant agreement No. 101057442.
It would not be possible to write a book without input from authors and their organizations that allowed the sharing of relevant studies of scientific and industrial applications, use cases in the field of in silico drug discovery, drug repurposing and drug development, big data, and artificial intelligence. Special thanks go to the Nobel Laureate Dr. Arieh Warshel from the University of Southern California and to Dr. Thomas Skordas, Deputy Director-General of the Directorate-General for Communications Networks, Content and Technology of the European Commission, for their input, which frames and underscores the importance of the work done by the science and technology teams. Thanks to all.
Silvano Coletti
Gabriella Bernardi
About This Book: How Should This Text be Read
The worldwide emergence of COVID-19 forced the scientific community to accomplish in a short time what normally takes much longer. The pandemic brought many topics to the forefront, first and foremost vaccines, but finding new therapies against the infection has been equally important. Drug development takes a long time, both because of the need for human trials and, before that, because promising molecules must be selected. The latter step, however, can be greatly sped up by numerical simulations and AI tools. The European consortium EXSCALATE4COV set out to find molecules active against SARS-CoV-2 by leveraging the best European supercomputing resources, and it succeeded. At the end of the EXSCALATE4COV project, the members of the consortium decided not to let this work be dispersed and to document their contributions in this text, in addition to all the open-access resources already available. Last but not least, one of the chapters introduces the LIGATE and REMEDI4ALL projects. The central goal of LIGATE is to create and validate a leading application solution for drug discovery in HPC systems up to the exascale level. REMEDI4ALL aims to build a sustainable European innovation platform to enhance the repurposing of medicines for all. The proposed topics are very diverse and specific at the same time. The book should be read with this context in mind: each chapter is independent of the others because it explains a single aspect of the problem. Altogether, however, the chapters provide a comprehensive view of this unprecedented effort, so that such a multi-faceted and complex problem can be appreciated in its entirety and from many different points of view, from High-Performance Computing to pharmaceutical outcomes. Because each chapter can be read and understood as a self-contained contribution, it may be a good strategy to start with the chapters that intrigue you the most.
Also, keep in mind that this is not a popularized text, although it is clearly and concisely written, because the goal was to summarize the highlights of the project just completed.
Last but not least, I would like to thank all the authors who dedicated their time and a lot of effort to the realization of this book. Enjoy your reading!
Gabriella Bernardi
Contents
1 Introduction
  Marcello Allegretti and Silvano Coletti
  References
2 A European Drug-Discovery Platform: From In Silico to Experimental Validation
  Gianluca Palermo, Daniela Iaconis, and Philip Gribbon
  2.1 Potential Benefits of Virtual Screening and In Silico Experimentation
  2.2 EXSCALATE4CoV and Accelerated Drug Development
  2.3 The EXSCALATE Platform
  2.4 Extreme Scale Simulations: The E4C Big Run
  2.5 In Vitro and Experimental Evidence
  References
3 The Drug Repurposing Strategy in the Exscalate4CoV Project: Raloxifene Clinical Trials
  Andrea Beccari, Lamberto Dionigi, Emanuele Nicastri, Candida Manelfi, and Elizabeth Gavioli
  3.1 Drug Repurposing for COVID-19
  3.2 Therapeutic Potential of Raloxifene
  3.3 Raloxifene Clinical Trials
  3.4 Summary
  References
4 The High-Performance Computing Resources for the EXSCALATE4CoV Project
  Andrew Emerson, Federico Ficarelli, Gianluca Palermo, and Francesco Frigerio
  4.1 Supercomputer Resources
  4.2 HPC-Layer 5 (HPC5)
  4.3 CINECA Marconi M100
  4.4 Uses of HPC in EXSCALATE4CoV
    4.4.1 HPC Workflow
    4.4.2 High-Throughput Virtual Screening
    4.4.3 The Big Run
    4.4.4 Molecular Dynamics
  4.5 Analysis
  4.6 Summary
  References
5 The Impact of the Scientific Metaverse on the Biotech Industry: How Virtual Reality Helped Researchers Fight Back Against COVID-19
  Carmine Talarico and Edgardo Leija
6 From Genomes to Variant Interpretations Through Protein Structures
  Janani Durairaj, Leila Tamara Alexander, Gabriel Studer, Gerardo Tauriello, Ingrid Guarnetti Prandi, Rosalba Lepore, Giovanni Chillemi, and Torsten Schwede
  6.1 Interpreting Genetic Variants in the Context of Protein Structures
  6.2 Workflows for Identifying Relevant Variants
  6.3 From Relevant Variants to Relevant Protein Structures
  6.4 Variants and Structures in the Context of Protein Environment
  6.5 What is in the Future?
  References
7 The Role of Structural Biology Task Force: Validation of the Binding Mode of Repurposed Drugs Against SARS-CoV-2 Protein Targets
  Stefano Morasso, Elisa Costanzi, Nicola Demitri, Barbara Giabbai, and Paola Storici
  7.1 Mpro as a Drug Target: Structural Properties
  7.2 Known Inhibitors of Mpro Bind into the Active Site
  7.3 Myricetin Binds Covalently with Cys145 in the MPro Active Site
  7.4 The Peptidomimetic MG-132 Acts as Dual Inhibitor of Mpro and Cathepsin L
  References
8 Drug Discovery and Big Data: From Research to the Community
  Luca Barbanotti, Marta Cicchetti, and Gaetano Varriale
  8.1 The Evolution of Clinical Data: From Hand-Written Case Reports to Real-World Data
  8.2 RWD and Real-World Evidence
  8.3 The Clinical Trial System is Broken
  8.4 Advantages of Using RWD
    8.4.1 Various Stakeholders
    8.4.2 How the Entire Clinical Trial Process is Affected
  8.5 RWE Supporting Drug Repositioning
  8.6 Old Challenges, New Opportunities
  8.7 Healthcare Analytics Hubs
  References
9 Exploiting Drug-Discovery Research for Educational Purposes
  Giuliana Catara and Cristina Rigutto
  References
10 Beyond the Exscalate4CoV Project: LIGATE and REMEDI4ALL Projects
  Carmine Talarico, Andrea R. Beccari, and Davide Graziani
  10.1 LIGATE (www.ligateproject.eu/)
    10.1.1 Final Objectives and Anticipated Outcomes
  10.2 REpurposing MEDIcines for All
    10.2.1 Final Objectives and Anticipated Outcomes
  References
Conclusions
Chapter 1
Introduction
Marcello Allegretti and Silvano Coletti
M. Allegretti, Dompé farmaceutici S.p.A., Milan, Italy
S. Coletti, Chelonia SA, Lugano, Switzerland
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. S. Coletti and G. Bernardi (eds.), Exscalate4CoV, SpringerBriefs in Applied Sciences and Technology, https://doi.org/10.1007/978-3-031-30691-4_1
Despite recent advancements, biopharma drug research and development remains expensive and time-consuming. However, there are numerous opportunities to build capabilities that enhance productivity and improve the probability of success. Computer-assisted decision-making has found its place in modern medicinal chemistry. Although its beginnings were comparatively modest, today the fast development of high-performance computing (HPC) technologies and the rapid growth of artificial intelligence (AI) applications in biopharma offer the opportunity to deliver value at scale by fully integrating in silico approaches into scientific process changes. As high-quality experimental data accumulate, the in silico drug-discovery approach extends its potential beyond the prediction of small molecular ligand/single-receptor interactions, allowing us to address more complex questions involving multiple ligands, multiple binding sites, and multiple-receptor molecules. One of the biggest challenges is the programmed design of polypharmacological drugs with the specific ability to modulate the activity of multiple targets. For many diseases, it may no longer be sensible to pursue a one-target–one-drug philosophy, and the prospect of developing novel agents that block a network of pathways is considered a tremendous opportunity to treat complex multifactorial diseases. Though a ligand might interact with many targets, and a target may accommodate different types of ligands, the possibility of effectively predicting multiple interactions leading to programmed polypharmacology has been limited by the availability of efficient and reliable tools. The computer sciences have contributed fast hardware and computing solutions, as well as excellent algorithms that have already partially been transferred to the area of molecular informatics, in particular sophisticated machine learning techniques for pattern recognition in large datasets and modeling of functional relationships between data classes. In silico polypharmacology is just one of the avenues that have opened thanks to the unimaginable evolution of HPC and AI, but only specific case studies and applications will enable us to unravel the enormous potential of these technologies in the drug-discovery process.
Developed by Dompé farmaceutici with the support of Politecnico di Milano and CINECA, EXaSCale smArt pLatform Against paThogEns (EXSCALATE) is one of the most powerful computer-aided drug design platforms to date and represents the technological application of state-of-the-art tools in AI exploiting the best European supercomputing resources [1].
In December 2019, the outbreak of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in Wuhan caused a public health crisis in China that rapidly spread across the world, causing general concern. On February 11, 2020, the World Health Organization officially named the disease COVID-19. The Chinese government took strong and harsh measures to control the progression of the outbreak, which was characterized by atypical pneumonia cases. Similar measures were also adopted in Europe and worldwide, and in a short period, the identification of new vaccines and therapeutic solutions against COVID-19 became a global challenge. In parallel with the massive effort in novel vaccine development, at the very beginning of 2020, the EXSCALATE for CoronaVirus (Exscalate4CoV) consortium (E4C), composed of European and national infrastructures, universities, a center of excellence, and a pharmaceutical company, successfully applied to an urgent call from the European Commission (call H2020-SC1-PHE-CORONAVIRUS-2020) to find new therapeutics against COVID-19.
The primary goal of the E4C project was to identify small-molecular-weight drugs already available in the therapeutic armamentarium or well characterized in clinical studies (safe in humans) that were active against coronavirus in vitro and ready for repurposing in humans. Using the EXSCALATE supercomputing platform, researchers selected several molecules with potential efficacy against SARS-CoV-2 to counteract the COVID-19 pandemic and improve the management and care of patients. The underlying idea was to screen in silico a library of safe chemical compounds in parallel against multiple viral target proteins with an established key role in the infection and replication processes. The selection of molecules showing a moderate affinity for multiple targets rather than molecules with a high affinity for a single target is a typical application case of polypharmacologic research, thus offering the opportunity to put the potential of HPC-driven drug discovery to the test. The EXSCALATE platform, the engine on which E4C runs, is the only platform capable of exascale-ready virtual screening. Thus, it enables the evaluation of billions of molecules on multiple targets in hours. This facility empowered smart in silico molecular dynamics and docking studies and increased the accuracy and predictability of computer-aided drug design (CADD). In the context of the collaborative program, the combination of advanced CADD with high-throughput biochemical and phenotypic screening allowed rapid evaluation of the simulation results and reduced the time needed to select the most promising drugs against SARS-CoV-2 from marketed drugs and drugs in development already known to be safe for humans.
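The selection principle described above (moderate predicted affinity on several targets rather than high affinity on a single one) can be illustrated with a toy ranking. This is only a sketch under stated assumptions: the compound names, target names, scores, and threshold are invented for illustration, and the code is not the EXSCALATE scoring or selection procedure.

```python
from statistics import mean

# Hypothetical docking scores (higher = better predicted affinity),
# one score per viral target. All names and values are made up.
SCORES = {
    "compound_A": {"target_1": 9.5, "target_2": 2.0, "target_3": 1.5},
    "compound_B": {"target_1": 6.5, "target_2": 6.0, "target_3": 6.8},
    "compound_C": {"target_1": 4.0, "target_2": 3.5, "target_3": 4.2},
}


def multi_target_hits(scores, threshold=6.0, min_targets=2):
    """Keep compounds scoring at least `threshold` on `min_targets` or more targets.

    Returns (name, number_of_targets_hit, mean_score) tuples, best mean first.
    """
    hits = []
    for name, per_target in scores.items():
        n_hit = sum(1 for s in per_target.values() if s >= threshold)
        if n_hit >= min_targets:
            hits.append((name, n_hit, mean(per_target.values())))
    return sorted(hits, key=lambda h: h[2], reverse=True)


def single_target_best(scores, target):
    """Compound with the highest score on one chosen target."""
    return max(scores, key=lambda name: scores[name][target])


if __name__ == "__main__":
    print("Best on target_1 alone:", single_target_best(SCORES, "target_1"))
    print("Multi-target hits:", multi_target_hits(SCORES))
```

With these made-up numbers, the single-target view favors compound_A, whereas the multi-target filter retains only compound_B, which scores moderately on all three targets. A real pipeline would use calibrated scores and target-specific thresholds.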
This approach is especially useful against pandemic viruses and other pathogens, when the immediate identification of effective treatments is paramount. In parallel with the repurposing program, the proprietary Tangible Chemical Database (TCDb), which comprises > 500 billion molecules, was screened to identify innovative drugs to be tested against coronavirus. Such massive virtual screening activities, with billions of molecules screened against multiple target proteins, required huge computational resources; therefore, the activities were supported and empowered by four of the most powerful computer centers in Europe, namely CINECA, Barcelona Supercomputing Center, Jülich, and ENI. Jointly, these supercomputing centers were able to guarantee the best combination of hardware architectures, the required knowledge, and the highest speed-up for the simulations, performing all the molecular dynamics simulations of the viral proteins and the ultrafast virtual screening of the E4C library. The Swiss Institute of Bioinformatics (SIB) provided the homology models for the viral proteins to be virtually screened. Furthermore, the E4C project took advantage of the SIB infrastructure for the key bioinformatic characterization of the virus (i.e., its RNA and protein sequences), covering phylogenetic, co-evolutionary, and pathogenicity aspects. The considerable amount of sequence information coming from all scientific communities and hospitals has been continuously captured by consolidated pipelines of several public databases. The results of this huge virtual screening culminated in the selection of putative active compounds that were tested in a phenotypic screening to identify molecules capable of blocking virus infection or replication in in vitro models.
This step was performed at the KU Leuven research infrastructure, an automated platform for the multiparameter high-throughput screening data collection on live pathogens of high or unknown biosafety risk, a world-class high-throughput screening facility authorized to work with biosafety level 3 pathogens [2]. The Fraunhofer Institute for Molecular Biology and Applied Ecology (IME) complemented the phenotypic screening with biochemical assays on the different putative virus targets, like coronavirus proteases, ligases, and polymerases, and provided access to Fraunhofer’s Broad Repurposing Library [3, 4]. The University of Naples synthesized the identified compounds for experimental testing, and the University of Cagliari completed the biological assessment defining the mechanism of action of the inhibitors and the selection of mutants in replicon systems. To support the rational design of new chemical entities able to inhibit the coronavirus, Elettra Sincrotrone Trieste and the International Institute of Molecular and Cell Biology, providing complementary expertise and technology infrastructures, produced X-ray structures for the most interesting viral enzymes, as apo structure and, where applicable, with the inhibitor bound, to further enhance the quality of the in silico models and to evaluate the structural similarities with other viral proteins [3]. The entire process was designed to allow the rapid identification of active and safe molecules to be further tested in animal models and clinical trials. The coupling of exascale-ready virtual screening with multiple high-throughput biochemical and phenotypic screenings not only provides a powerful tool for the rapid identification of safe-in-human drugs that can be deployed immediately to
treat the already infected population but also represents a promising platform for the design of drug candidates for novel pan-coronavirus inhibitors to address future emergencies. Such a massive data production made possible by the EXSCALATE platform and the available computing resources required an extremely efficient way to share the information and the results of the analysis. INFN high-throughput and data sharing infrastructure (CERN Tier-1 Centre) complemented the simulations with data staging, dissemination, and re-analysis to maximize product exploitation. In a relatively short time, the consortium generated valuable information and data confirming the validity of the collaborative approach for a rapid reaction to emergent medical needs: more than 400 active molecules identified so far out of > 30,000, 3 patents within 9 months, and 29 peer-reviewed papers with a global impact factor > 154 points in a single year. The E4C experience also confirmed the tremendous potential of the repurposing approach for the rational selection of treatments for emergent needs from the existing safe-in-human molecular libraries. In fact, in agreement with the project objectives, E4C studies selected raloxifene as a clinical candidate against SARS-CoV-2 with a potential polypharmacologic mechanism of action (i.e., cooperatively acting on multiple viral targets). After the experimental preclinical validation, raloxifene was selected to enter a fast-track clinical trial that provided preliminary hints on the potential value of the treatment to prevent disease progression in patients recently infected with SARS-CoV-2 and presenting with mild-to-moderate symptoms. 
Identification of raloxifene as a new agent for the treatment of COVID-19 and the thorough comprehension of its mechanism of action were made possible by an integrated approach combining virtual screening protocols and tailored wet-lab experiments, thus confirming the strong reliability of HPC to support in silico screening for the purpose of polypharmacologic research applications. The results of the studies conducted in the context of the E4C program will be presented in detail in the chapters of this book, but, by way of introduction, we believe it is extremely important to emphasize the technological challenges faced during the project and the significance and value of the data, protocols, and algorithms generated and made available to the scientific community. In fact, as a reaction to the health emergency, we had the opportunity to push the best hardware and software technologies to the extreme by performing the world’s largest and fastest virtual screening simulation ever, running more than 1 trillion simulations in a single run. This unprecedented experiment, carried out in November 2020, involved the two supercomputers of ENI and CINECA and allowed virtual testing of more than 1 billion molecules against the most important viral proteins in just 60 h. The deployment of ad hoc virtual screening protocols and X-ray validation of the most relevant hypotheses offered the opportunity to test the reliability of in silico technologies and to further optimize standard protocols. The E4C consortium promptly shared all the scientific outcomes with the research community by using established channels like the ChEMBL portal for the biochemical data, the SWISS-MODEL portal for the homology models of wild-type and
1 Introduction
Fig. 1.1 Exscalate4CoV summary
mutant viral proteins, the Protein Data Bank for the experimentally resolved protein structures, EUDAT for the data generated by in silico simulations, and the E4C project website. The infographic shown in Fig. 1.1 summarizes the results of the project, highlighting the most important goals reached during the period. Over 60 million hours of computation were needed to perform molecular dynamics experiments, which made it possible to evaluate and understand the structural behavior of over 45 viral proteins. Moreover, in the initial phase of the pandemic, given the lack of experimental protein models, a Web platform was created for the generation of homology models of viral proteins, useful for the entire scientific community (https://swissmodel.expasy.org/). Thanks to the collaboration among the project partners and the entities belonging to Exscalate4CoV’s League, 35 viral proteins were identified and experimentally solved by using X-ray techniques. In December 2022, the keyword Exscalate4CoV returned nearly 8000 results in a Google search, an indicator of the extensive work performed in communication and dissemination activities led by the Swiss partner Chelonia SA in tight collaboration with Dompé farmaceutici. The large amount of experimental and theoretical data produced led to the publication of 29 peer-reviewed papers with a global impact factor > 154 points in 1 year (of which around 50 impact factor points were achieved since January 2021) and the achievement of 3 patents during the first 9 months of the program. The most complete (>40 simulations) and most informative (>10 µs) set of SARS-CoV-2 molecular dynamics simulations was released thanks to the effort of the most important European HPC resources. With the aim to make all this information available and promote open scientific collaboration, the consortium
Fig. 1.2 Exscalate4CoV website home page
deployed valuable Web platforms to support the global research community with bioinformatics and simulation tools (Fig. 1.2).
MEDIATE (MolEcular DockIng AT home) is accessible via the direct link mediate.exscalate4cov.eu and gives free access to the largest database available today on the SARS-CoV-2 virus both from a structural (3D structures) and functional (proteins interacting with human cells) point of view, including all the molecular dynamics involved in cellular interaction and active sites for potential drug entry. In this regard, the molecular bank of MEDIATE has been generated considering the main classes of molecules, which have been selected to allow accelerated clinical development. The library contains 10,000 drugs, 400,000 natural products, 70,000 nutraceuticals, 100 million oligopeptides, 5 million molecules already on the market for research purposes, and 72 billion easily synthesizable de novo molecules. The MEDIATE portal collects the predictions made by research groups around the world (crowdsourcing) and combines them into a single model using the neural networks and AI from SAS with the aim of identifying new and more effective treatments against COVID-19 in the shortest possible time. The MEDIATE project relied on a scientific board chaired by Dr. Arieh Warshel, who received a Nobel Prize in Chemistry.
1 TRILLION DOCK: During the weekend of November 21, 2020, the public–private consortium E4C, supported by the European Commission, carried out the most complex supercomputing simulation ever realized. The objective was to simulate the behavior of the SARS-CoV-2 virus to identify the best therapeutic treatment. More than 70 billion molecules were simulated on the 15 active interaction sites of the virus for a total of more than a thousand billion interactions evaluated in just 60 h. 
This feat was made possible by the simultaneous availability of the computing power (81 petaflops: millions of billions of operations per second) of Eni’s HPC5, the most powerful industrial supercomputer in the world; CINECA’s Marconi100 supercomputer; the virtual screening software accelerated by the Politecnico di Milano and
CINECA; and the Exscalate molecular library from Dompé. Using these technologies and methods, it has been possible to reach the new goal of 5 million simulated molecules per second, making the most out of the supercomputing infrastructure. The data from the simulation was processed with SAS Viya using AI techniques and advanced analytics. Results were available in real time on the portal 1trilliondock.exscalate4cov.eu, designed in collaboration with SAS to allow scientists around the world to carry out their own simulations, benefiting from state-of-the-art knowledge.
SPIKE MUTANTS: The Spike Mutants website (spikemutants.exscalate4cov.eu) aims to provide the scientific community with structural information on emerging variants involving the protein sequence of the SARS-CoV-2 spike protein. The emergence of new SARS-CoV-2 variants harboring mutations in the spike protein that might affect viral fitness and transmissibility has been an issue of great concern, particularly after the identification of two independent emerging strains in the UK and South Africa that had a larger-than-usual number of mutations in the spike protein that may have functional significance [5]. Previous reports of the D614G mutation and reports of virus variants from Denmark, Great Britain, Northern Ireland, and South Africa have raised concern regarding the impact of viral changes [6].
VIRALSEQ: Using data provided by GISAID and analysis of SARS-CoV-2 sequences carried out within the E4C project, the E4C consortium is collecting essential information and making it easily usable and accessible to the scientific community and beyond. All data are accessible via viralseq.exscalate4cov.eu.
MOLECULAR ANATOMY (https://ma.exscalate4cov.eu/): This Web server allows molecular framework generation according to the definition rules identifying a set of nine molecular representations at different abstraction levels to define a multidimensional network of hierarchically interconnected molecular frameworks. 
The protocols also prepare the files for a network visualization that allows a full graphical representation of a compound dataset, permitting efficient navigation in the scaffold space and significantly contributing to high-quality structure–activity relationship analysis.
DRUGBOX: Exscalate4CoV opened the “drug box” (https://www.exscalate4cov.eu/login.php) where companies and research institutes can send molecular structures from their compound libraries for screening against the 3D crystal structure of SARS-CoV-2. This initiative helps to identify new treatments against SARS-CoV-2 by analyzing third-party compound structures [7].
An unprecedented deployment of forces for the COVID-19 pandemic allowed the rapid creation of a European infrastructure designed to leverage the integration of competences, resources, and technologies to generate an immediate response to health emergencies. It is hoped that the effort made to establish efficient collaborative models and the important milestones reached along the E4C program will serve as an example and pave the way to the design of a permanent European network for pandemic preparedness.
References
1. D. Gadioli, E. Vitali, F. Ficarelli, C. Latini, C. Manelfi, C. Talarico et al., EXSCALATE: an extreme-scale virtual screening platform for drug discovery targeting polypharmacology to fight SARS-CoV-2. IEEE Trans. Emerg. Top. Comput. (2022). https://doi.org/10.1109/TETC.2022.3187134
2. A. Zaliani, L. Vangeel, J. Reinshagen, D. Iaconis, M. Kuzikov, O. Keminer et al., Cytopathic SARS-CoV-2 screening on VERO-E6 cells in a large-scale repurposing effort. Sci. Data 9(1), 405 (2022)
3. M. Kuzikov, E. Costanzi, J. Reinshagen, F. Esposito, L. Vangeel, M. Wolf, Identification of inhibitors of SARS-CoV-2 3CL-pro enzymatic activity using a small molecule in vitro repurposing screen. ACS Pharmacol. Transl. Sci. 4(3), 1096–1110 (2021)
4. A. Corona, K. Wycisk, C. Talarico, C. Manelfi, J. Milia, R. Cannalire, Natural compounds inhibit SARS-CoV-2 nsp13 unwinding and ATPase enzyme activities. ACS Pharmacol. Transl. Sci. 5(4), 226–239 (2022)
5. F. Naveca, V. Nascimento, V. Souza, A. Corado, F. Nascimento, G. Silva et al., Phylogenetic relationship of SARS-CoV-2 sequences from Amazonas with emerging Brazilian variants harboring mutations E484K and N501Y in the Spike protein. https://virological.org/t/phylogenetic-relationship-of-sars-cov-2-sequences-from-amazonas-with-emerging-brazilian-variants-harboring-mutations-e484k-and-n501y-in-the-spike-protein/585. Accessed January 11, 2023
6. World Health Organization, COVID-19: Global. https://www.who.int/emergencies/disease-outbreak-news/item/2020-DON305. Published December 31, 2020. Accessed January 11, 2023
7. F. Zubașcu, Research group to crowdsource compounds to test efficacy against COVID-19. Science Business. https://sciencebusiness.net/covid-19/news/research-group-crowdsource-compounds-test-efficacy-against-covid-19. Published April 3, 2020. Accessed January 11, 2023
Chapter 2
A European Drug-Discovery Platform: From In Silico to Experimental Validation
Gianluca Palermo, Daniela Iaconis, and Philip Gribbon
Abstract The COVID-19 pandemic highlighted an urgent need for streamlined drug development processes. Enhanced virtual screening methods could expedite drug discovery via rapid screening of large virtual compound libraries to identify high-priority drug candidates. The EXSCALATE4CoV (EXaSCale smArt pLatform Against paThogEns for CoronaVirus) consortium (E4C) research team developed EXSCALATE (EXaSCale smArt pLatform Against paThogEns), a platform comprising a virtual library of >500 billion compounds and a high-throughput docking software, LiGen (Ligand Generator), and used it to run the most complex screening simulation to date. Additionally, E4C ran a smaller virtual screen of a “safe-in-man” drug library to identify optimal candidates for drug repurposing. To identify compounds targeting SARS-CoV-2, EXSCALATE performed >1 trillion docking simulations to maximize the probability of identifying successful drug candidates. Ligands identified in simulations underwent subsequent in vitro experimentation to determine drug candidates that have anti-SARS-CoV-2 activity and probable in-human efficacy. While many compound candidates were validated to have anti-SARS-CoV-2 properties, raloxifene had the best outcome and subsequently demonstrated efficacy in a phase 2 clinical trial in patients with early mild-to-moderate COVID-19, providing proof of concept that the in silico approaches used here are a valuable resource during emergencies.
G. Palermo (✉)
DEIB Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
D. Iaconis
Dompé farmaceutici S.p.A., Milan, Italy
P. Gribbon
Screening Port, and Fraunhofer Cluster of Excellence for Immune-Mediated Diseases (CIMD), Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP), Frankfurt, Germany
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. S. Coletti and G. Bernardi (eds.), Exscalate4CoV, SpringerBriefs in Applied Sciences and Technology, https://doi.org/10.1007/978-3-031-30691-4_2
After its emergence in 2019, the SARS-CoV-2 coronavirus spread internationally at a rapid pace, leading to the designation of COVID-19 as a pandemic in March 2020. In addition to a devastating impact on public health, COVID-19 has resulted in extensive negative social and economic effects in every corner of the
globe. When the pandemic arrived, the medical and scientific communities identified an urgent need to establish more rapid therapeutic and vaccine development processes for COVID-19. However, it was clear that any new measures needed to be implemented in a way that also supported rapid mobilization to fight potential future pandemics. Therapeutic discovery is a complicated and prolonged process, often taking 10–15 years to complete all stages, and typically involves a linear workflow starting with in silico investigations, followed by increasingly complex and correspondingly expensive in vitro, in vivo, and clinical studies. In the context of the pandemic, the importance of the in silico stage increased because of the capacity of exascale computational methods to identify and prioritize small molecule (and biological) agents with the greatest therapeutic potential. Better in silico-generated starting points for drug-discovery efforts increase the likelihood of success in downstream laboratory-based experimental stages and can contribute to vitally needed reductions in costs and time to market for new therapies.
2.1 Potential Benefits of Virtual Screening and In Silico Experimentation
Virtual screening is one of the key techniques applied in the early stages of drug discovery [1]. The procedure is intended to analyze large virtual libraries of candidate molecules and predict a subset of ligands with high affinity for the therapeutic target of interest. Because virtual screening is performed in silico, by means of computer simulation, this approach allows the evaluation of virtual compound collections that are typically many orders of magnitude greater in size than can be investigated by classical experimental approaches [2]. The virtual compound collections may contain known drugs, natural compounds, commercially available “catalog” molecules, or even completely novel structures that have yet to be synthesized. Two types of in silico virtual screening strategies are commonly applied, which are generally termed ligand-based and structure-based [3]. In the absence of a 3D structure for the therapeutically relevant target protein, complex, or nucleic acid, the virtual screening uses a ligand-based route. This approach exploits preexisting data from molecules known to bind the target directly or even molecules that may indirectly modulate the function of the target via specific biological pathways. By modeling these data in silico, researchers can predict new candidate hits with improved properties. Alternatively, if the 3D structure of the target is known or can be accurately predicted using resources such as AlphaFold [4], then the structure-based strategy can be applied. Here, the shape of both the ligand and the biological entity, their electrostatic-, hydrophobicity-, and chemical bond-related properties, and the effects of water molecules or attendant counter ions can all be used to predict novel ligands. 
Structure-based approaches make use of molecular docking software that searches for ligand orientations and conformations within one or more binding sites identified for the target protein and generates scoring functions to rank the relative binding
affinity of a given ligand for the putative binding sites presented by the target [5]. The docking procedure explores the different positions within a binding site, applying not only rigid roto-translations of the ligands (called rigid docking) but also adding to the search the degrees of freedom provided by the rotatable bonds of the molecule (flexible docking). Rotatable bonds are a subset of the bonds within a molecule that split the molecule into two disjoint sets of atoms that are then able to rotate independently about the axis of the considered bond, without altering the chemical properties of the ligand. On the protein target side, one widely adopted assumption is that the binding site acts as a rigid body. The scoring functions take as input the ligand poses generated by the docking phase and the region of a protein considered as the binding surface, and then predict the binding affinity to estimate the interaction strength between the two. The reliability of scoring function calculations is one of the most important factors determining the success rate of in silico approaches. In contrast to ligand-based approaches, structure-based virtual screening explicitly predicts where and how candidate ligands may interact with target proteins. In the case of viral variants of concern (VOCs), this also permits predictions of the potential effect of mutations in therapeutically relevant viral genes and therefore their capacity to develop resistant phenotypes toward existing or future drug treatments. The outcome of the virtual screening process is a set of candidate ligands that can be prioritized for future experimental analyses in downstream stages of the drug-discovery process. To be included in the set, a ligand must demonstrate strong predicted interactions with one or more binding sites of the target proteins. The size of the set generated is normally not fixed because it depends on the extent of the investment planned in the subsequent phases. 
However, the molecules in the set are roughly ranked in terms of several metrics (e.g., binding affinity prediction or the number of sites with strong interactions). The virtual screening phase is a complex and inexact operation because the docking and scoring phases are based respectively on heuristics and interaction models. Moreover, both ligands and pockets can alter their shape upon binding, which may invalidate important assumptions underlying the docking operation. The efficient implementation of virtual screening procedures gives rise to two main positive effects [6]. First, it can reduce the time and resources needed to identify hits by allowing downstream experimental screening to be focused on an “enriched” set of molecules with higher potential to demonstrate ligand properties when compared with classical high-throughput screening-based identification. Second, when combined with rapid access to chemical matter for testing, for example in conjunction with click-chemistry-based synthesis workflows or just-in-time delivery of cherry-picked compounds from commercial vendors, it facilitates the investigation of larger regions of chemical space. The greater number of molecules evaluated for their potential as ligands elevates the probability of finding a suitable compound for progression. These advantages are clear and have become more evident during the COVID-19 pandemic.
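The rotatable-bond definition given in this section can be illustrated with a small sketch. It is not taken from LiGen or any other docking package: the function names, the toy four-atom chain, and the purely topological test (a bond counts as rotatable here if cutting it separates the molecule into two disjoint atom sets) are assumptions made for illustration, and real tools apply additional chemical rules. The rotation itself uses Rodrigues' formula about the bond axis.

```python
import math
from collections import deque


def atoms_beyond_bond(bonds, a, b):
    """Return the atoms reachable from b without crossing the bond a-b."""
    neighbours = {}
    for u, v in bonds:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    seen, queue = {b}, deque([b])
    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            if {u, v} == {a, b} or v in seen:
                continue  # never walk across the bond itself
            seen.add(v)
            queue.append(v)
    return seen


def splits_molecule(bonds, a, b):
    """True if cutting bond a-b leaves two disjoint atom sets (no ring)."""
    return a not in atoms_beyond_bond(bonds, a, b)


def rotate_about_bond(coords, bonds, a, b, angle):
    """Rotate the atoms on the b side by `angle` radians about the a->b axis."""
    moving = atoms_beyond_bond(bonds, a, b)
    if a in moving:
        raise ValueError("bond is part of a ring and does not split the molecule")
    ax, ay, az = coords[a]
    bx, by, bz = coords[b]
    kx, ky, kz = bx - ax, by - ay, bz - az
    norm = math.sqrt(kx * kx + ky * ky + kz * kz)
    kx, ky, kz = kx / norm, ky / norm, kz / norm  # unit vector along the bond
    c, s = math.cos(angle), math.sin(angle)
    rotated = dict(coords)
    for atom in moving:
        px, py, pz = (coords[atom][i] - coords[b][i] for i in range(3))
        dot = kx * px + ky * py + kz * pz
        cx, cy, cz = ky * pz - kz * py, kz * px - kx * pz, kx * py - ky * px
        # Rodrigues' formula: p*cos + (k x p)*sin + k*(k.p)*(1 - cos)
        rotated[atom] = (
            bx + px * c + cx * s + kx * dot * (1 - c),
            by + py * c + cy * s + ky * dot * (1 - c),
            bz + pz * c + cz * s + kz * dot * (1 - c),
        )
    return rotated


# Hypothetical four-atom chain 0-1-2-3 with the central bond along the x axis.
CHAIN_BONDS = [(0, 1), (1, 2), (2, 3)]
CHAIN_COORDS = {
    0: (-0.5, 1.0, 0.0),
    1: (0.0, 0.0, 0.0),
    2: (1.0, 0.0, 0.0),
    3: (1.5, 1.0, 0.0),
}
flipped = rotate_about_bond(CHAIN_COORDS, CHAIN_BONDS, 1, 2, math.pi)
```

Rotating by half a turn moves only atom 3 to the opposite side of the axis; the atoms on the other side of the bond and all bond lengths are unchanged, which is the sense in which the rotation adds a degree of freedom without altering the ligand's chemistry.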
2.2 EXSCALATE4CoV and Accelerated Drug Development
The EXSCALATE4CoV consortium (E4C) was a European Union-funded research team whose goal was to find drugs to treat SARS-CoV-2 viral infection and COVID-19 and in the process establish know-how and resources relevant for tackling future pandemics [7]. The team chose an in silico method to identify small molecule therapeutics that could be validated in vitro and in vivo and then further progressed through clinical trials and into patient use. The project’s objectives were divided into two phases, or waves, that were implemented in parallel. The first objective was the identification of promising marketed therapeutics that could be adapted to the treatment of infected individuals [8], a process termed drug repurposing. It was prioritized within the project because it can shorten drug development times by reducing the number of studies needed to bring the molecule to the clinic. The second objective was the identification of novel antiviral compounds effective against the current SARS-CoV-2 virus, as well as related VOCs. The complete development of new chemical entities was out of the scope of the project, given the different time scales, but the results of this phase have been made available to the scientific community for further advancement through the MEDIATE portal. Aside from the main goal of implementing an urgent drug-discovery process, the E4C project has also been used to further develop and demonstrate the feasibility of building a high-throughput drug-discovery platform capable of running extreme-scale virtual screening experiments. The goal of this platform is to have a suitable tool ready to be used at the European level to tackle possible future pandemics.
2.3 The EXSCALATE Platform
The project finalized and utilized EXSCALATE (EXaSCale smArt pLatform Against paThogEns), an extreme-scale in silico screening platform for computer-aided drug design (CADD) with a powerful computation engine, LiGen (Ligand Generator) [9, 10]. This platform has been a key element within the E4C project because current state-of-the-art tools were not sufficient to meet the goal of the extreme-scale virtual screening campaign planned for the second wave of the project. The EXSCALATE platform is composed of two main pillars: (i) a virtual library of more than 500 billion target compounds that has been built from databases of millions of commercial reagents wherein compound structures were further elaborated to include new theoretical structures accessible via robust single-step synthetic reactions and (ii) a molecular docking software, called LiGen, for high-throughput virtual screening designed from the bottom up to run on high-performance computing (HPC) architectures and to screen billions of compounds in a very short time. In this context,
molecular docking refers to a method to calculate the preferred position and shape of a small molecule (the ligand) when bound to a larger one (the protein). LiGen solves the problem using a two-phase approach. First, it focuses only on the geometrical characteristics of the two molecules searching for shape complementarity. The docking algorithm considers the target protein as a rigid body, while it employs rigid roto-translation possibilities and ligand flexibilities through the rotatable bonds for the geometric search of the most suitable poses of the small molecule. Second, it estimates the intensity of the actual physical and chemical interaction between the two molecules by means of an empirical scoring function. The scoring function is used to select the best poses within the geometrically suitable set and to rank the different molecules to prioritize the search. LiGen code has been designed to run on modern heterogeneous HPC machines and can be scaled up to the entire supercomputer. The application code is written in C++ with CUDA kernels to exploit the multiple NVIDIA graphics processing unit (GPU) cards available within the node. Alternative versions to target different node architectures are also available for code portability [11]. A Message Passing Interface (MPI) backbone has been used to manage parallel node architectures and to synchronize the input/output accesses to the storage [12]. The interaction with the file system, to read the target molecules and to write the estimated scores, is the only reason for synchronizing the different evaluations. Indeed, the target problem is embarrassingly parallel because the evaluation of every ligand is independent of all others.
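The two-phase evaluation and its embarrassingly parallel structure can be sketched as follows. This is a minimal illustration under stated assumptions, not LiGen's implementation: poses are reduced to hypothetical `(clashes, contacts, hbonds)` tuples, `toy_score` is an invented stand-in for an empirical scoring function, and Python threads stand in for the MPI ranks and GPUs used on a real machine. The point is only that each ligand is evaluated independently, so the ranking does not depend on how the library is split among workers.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_score(pose):
    """Hypothetical empirical score: reward contacts and hydrogen bonds."""
    _clashes, contacts, hbonds = pose
    return contacts + 2 * hbonds


def dock_ligand(ligand):
    """Evaluate one ligand independently of every other ligand."""
    name, poses = ligand
    # Phase 1 (geometry): keep only poses that fit the rigid pocket without clashes.
    fitting = [pose for pose in poses if pose[0] == 0]
    if not fitting:
        return name, None
    # Phase 2 (scoring): keep the score of the best geometrically suitable pose.
    return name, max(toy_score(pose) for pose in fitting)


def dock_chunk(chunk):
    """Work done by one worker on its own share of the library."""
    return [dock_ligand(ligand) for ligand in chunk]


def screen(ligands, n_workers):
    """Split the library among workers, then merge and rank the results."""
    chunks = [ligands[w::n_workers] for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = [r for part in pool.map(dock_chunk, chunks) for r in part]
    scored = [(name, score) for name, score in results if score is not None]
    # Highest score first; ties broken by name so the order is deterministic.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical library: each pose is (clashes, contacts, hbonds).
LIBRARY = [
    ("lig_a", [(0, 5, 1), (1, 9, 3)]),  # the second pose clashes with the pocket
    ("lig_b", [(0, 3, 0), (0, 4, 2)]),
    ("lig_c", [(2, 8, 2)]),             # no clash-free pose, so it is dropped
    ("lig_d", [(0, 6, 2)]),
]
ranking = screen(LIBRARY, n_workers=3)
```

Because no worker needs data from another, the only shared steps are reading the library and merging the scores, which mirrors the observation above that file-system access is the sole synchronization point.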
2.4 Extreme Scale Simulations: The E4C Big Run
The COVID-19 pandemic presented a great challenge for the high-performance computing community to help find ways to support the search for antiviral agents against SARS-CoV-2. In 2020, two large experiments emerged, one from each side of the Atlantic Ocean. In the United States, at the beginning of the pandemic, the first attempt was made to test over 1 billion molecules on two SARS-CoV-2 protein structures by using the SUMMIT supercomputer at Oak Ridge National Laboratory [13]. The system used AutoDock as the docking engine and was adapted to run extreme-scale virtual screening on a multi-GPU node [14, 15]. This feat was achieved in some 12 h. A similar approach, but with a different virtual screening platform and experiment size, was carried out by the E4C [7]. By the end of 2020, the EXSCALATE platform had completed the most complex virtual screening simulation ever realized, testing more than 70 billion molecules across 15 sites of the SARS-CoV-2 proteins in 60 h. This extreme-scale experiment involved 50× more molecules and 7.5× more protein targets than previous efforts, and it required 80% of the capacity of the two fastest supercomputers available in Europe at the time: CINECA-Marconi100 and ENI-HPC5, with an aggregated computational power of 81 PetaFLOPS [13]. Within E4C, LiGen has been tuned to fit the HPC node of the two supercomputers (both based on NVIDIA-V100 GPUs), while optimizing the input/output accesses and throughput per node. Overall, LiGen sustained a throughput close to 2000 ligands
per second per node on ENI-HPC5 and 2500 ligands per second per node on CINECA-Marconi100, with an average sustained throughput across both machines exceeding 5 million ligands per second. In addition to the main docking experiment, two other aspects of the in silico studies used considerable computational resources in the pre- and post-docking stages. The first related to upfront data preparation for both viral proteins and target ligands, requiring computational resources at ENI and CINECA to resolve the protein structures and to determine the conformational states of target sites. The chemical library members were also encoded as Simplified Molecular Input Line Entry System (SMILES) strings in a compact hydrogen-free format, and the 3D position of each atom (including added hydrogens) was determined for every compound [16]. The second major computationally intensive procedure was the post-processing of the results coming from the virtual screening phase. Indeed, in addition to an offline rescoring of the compounds, statistical descriptors related to the score distribution on each target site have been computed using Dask on the CINECA-Marconi100 machine. Moreover, the outcome of this experiment was further enhanced with an additional four protein targets in the first half of 2021. Overall, the E4C project performed over 1 trillion docking evaluations in that timeframe, generating more than 65 TB of data representing the binding affinity of the evaluated chemical space with the different targets. For each target, the set of most promising compounds (500 million out of more than 70 billion) has been released publicly on the MEDIATE portal [17].
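As a rough consistency check on the figures quoted in this section, the arithmetic can be spelled out in a few lines. The inputs are the rounded lower bounds given in the text ("more than 70 billion" molecules, 15 sites, 60 h, "more than 65 TB"), so the results are only order-of-magnitude estimates; the variable names are illustrative, and the per-evaluation storage figure assumes, as a simplification, that the 65 TB correspond to roughly the same number of evaluations.

```python
# Rounded figures quoted in the text, treated here as lower bounds.
MOLECULES = 70_000_000_000      # "more than 70 billion" molecules
BINDING_SITES = 15              # sites on the SARS-CoV-2 proteins
WALL_CLOCK_HOURS = 60           # duration of the big run
DATA_BYTES = 65 * 10**12        # "more than 65 TB" (decimal terabytes assumed)

# Every molecule is docked on every site.
docking_evaluations = MOLECULES * BINDING_SITES

# Average rate needed to finish within the wall-clock time.
wall_clock_seconds = WALL_CLOCK_HOURS * 3600
evaluations_per_second = docking_evaluations / wall_clock_seconds

# Storage per evaluation, under the simplifying assumption stated above.
bytes_per_evaluation = DATA_BYTES / docking_evaluations

print(f"docking evaluations: {docking_evaluations:.3e}")
print(f"average rate:        {evaluations_per_second:.3e} per second")
print(f"stored data:         {bytes_per_evaluation:.1f} bytes per evaluation")
```

With these inputs the total is about 1.05 trillion evaluations and the average rate is just under 5 million per second, in line with the "over 1 trillion" evaluations and the roughly 5 million ligands per second reported above once the "more than" qualifiers are taken into account.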
2.5 In Vitro and Experimental Evidence
The EXSCALATE CADD platform allows for the rapid identification of ligands with high medical relevance to disrupt the function and spread of SARS-CoV-2. The ligands identified by the European platform set the stage for in vitro experimental validation and illustrate the practical applicability of this approach. This novel approach forms a connection between computational and experimental methods to rapidly select the most promising drug candidates for the treatment of viral infection. The workflow used during the E4C project acted as a demonstration of the viability of the approach. The extreme-scale virtual screening experiment described above gave us the opportunity to generate information on new compounds and scaffolds with potential antiviral activity and provided extensive data resources to support future experimental evaluations. By contrast, the smaller virtual screening campaign executed with the “safe-in-man” drug library allowed us to support the possible repositioning of existing pharmaceutical products as a rapid answer to the medical emergency. Of the 10,000 drugs belonging to the repurposing library, approximately 1% were selected for the subsequent phase because, in addition to experimentally determined activities, they also showed high binding scores on at least one protein. Approximately 50% of these also showed a polypharmacological profile. In the selected set, during the
pandemic period, approximately 40% were experimentally validated by the project or by other studies reported in the literature. Thanks to the great effort spent in the context of E4C, we could describe the anti-cytopathic effect of the selected compounds on different cell lines, such as VeroE6 and A549, and Calu-3 as models for human pulmonary infection. Additionally, we were able to validate the antiviral activity on different SARS-CoV-2 VOCs. Moreover, we generated eight in vitro functional assays to test the activity of the compounds on the main viral proteins following the virus life cycle from entry through to replication. A cytopathic SARS-CoV-2 screening on VeroE6 cells allowed us to identify 110 compounds with an anti-cytopathic IC50 < 20 μM. Of these, 18 are marketed drugs. Interestingly, 70% modulate intracellular signaling pathways: notable groups are inhibitors of growth factor receptors (e.g., masitinib and tandutinib), inhibitors of dihydrofolate reductase (e.g., trimetrexate), and estrogen receptor modulators (e.g., clomiphene and raloxifene). In addition, where a therapeutic indication was annotated, the majority of compounds were associated with cancer and anti-infective (antifungal and antimalarial) therapy. These observations suggest that drugs associated with cell survival and growth may be an optimal choice for antiviral therapies for SARS-CoV-2 if adequate safety and exposure/efficacy can be achieved [18]. We also reported the results of a screening run against the SARS-CoV-2 viral proteins. We confirmed previously reported inhibitors of the main protease (3CL-Pro) and identified 62 additional compounds with IC50 values below 1 μM [19]. Additionally, we experimentally validated flavonoids, which emerged as the best-scored binders of nsp13, as active compounds against the unwinding and ATPase helicase activities in the low micromolar range [20]. Moreover, a major success of E4C is the characterization of raloxifene as a novel agent to fight COVID-19. 
This drug was selected from the virtual screening repurposing campaign among the best scored compounds. Raloxifene is a drug marketed for osteoporosis. It has a well-known safety profile and has already been proposed as an antiviral against the Ebola virus, hepatitis C virus, hepatitis B virus, Zika virus, and influenza virus A, supporting our interest in further characterization. We described the anti-cytopathic effect of raloxifene on different cell lines and against the main SARS-CoV-2 VOCs [18, 21]. We demonstrated that its antiviral activity is exerted through its polypharmacological profile: raloxifene can act directly on viral replication mechanisms, as well as host proteins involved in the clinical outcome of the disease. Raloxifene is a selective estrogen receptor modulator (SERM). Interestingly, the estrogen receptor (ER) is considered to play a crucial role in inhibiting viral replication, as well as in inflammation, lung activity, and cardiovascular system modulation, suggesting that modulation of this receptor could influence host response to COVID-19 [8, 21]. Moreover, to assess its activity on viral mechanisms, we tested this drug on the different assays developed by the E4C consortium. Interestingly, we found that raloxifene affects viral entry by acting on the modulation of ACE2 and ADAM17 transcription and is involved in the inhibition of TMPRSS2 enzymatic activity: these three proteins are key host proteins that mediate viral anchoring and infection [21, 22]. Additionally, we found that raloxifene, despite not being active on 3CLpro enzymatic activity, is able to inhibit protein dimerization, finally
inhibiting 3CLpro function. Moreover, others have shown that raloxifene partially inhibits RdRp enzymatic activity [23]. Because of its activity on the ER, we used a bioinformatics approach to relate the ER to viral proteins and predicted a potential function of the viral spike protein as a cofactor for ERα nuclear signaling. This function arises from the direct interaction between a nuclear receptor coregulator LXD-like motif, predicted by the EXSCALATE platform and present on the S2 subunit of the spike protein, and the activation function 2 (AF-2) region of ERα [22]. Interestingly, we demonstrated that this interaction is responsible for procoagulation activity, possibly associated with COVID-19 outcome, and that raloxifene inhibits spike protein–estrogen receptor (S-ER)-mediated cellular processes [22, 24]. These data supported a phase 2 clinical trial in which raloxifene showed evidence of effect on the primary virologic endpoint in the treatment of patients with early mild-to-moderate COVID-19, shortening the time of viral shedding. In conclusion, our approach allowed the rapid identification of active and safe molecules to quickly address medical emergencies. In a few months, we were able to pass from in silico prediction to a successful clinical trial, representing the end stage of a great collaborative effort and supporting the high quality of EXSCALATE platform performance.
References

1. B.K. Shoichet, Virtual screening of chemical libraries. Nature 432(7019), 862–865 (2004)
2. C. Lipinski, A. Hopkins, Navigating chemical space for biology and medicine. Nature 432(7019), 855–861 (2004)
3. I.M. Kapetanovic, Computer-aided drug discovery and development (CADDD): in silico-chemico-biological approach. Chem. Biol. Interact. 171(2), 165–176 (2008)
4. J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger et al., Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589 (2021)
5. E. Lionta, G. Spyrou, D.K. Vassilatis, Z. Cournia, Structure-based virtual screening for drug discovery: principles, applications and recent advances. Curr. Top. Med. Chem. 14(16), 1923–1938 (2014)
6. N.A. Murugan, A. Podobas, D. Gadioli, E. Vitali, G. Palermo, S. Markidis, A review on parallel virtual screening softwares for high-performance computers. Pharmaceuticals 15(1), 63 (2022)
7. The EXSCALATE4CoV (E4C) project, https://www.exscalate4cov.eu/. Accessed 13 Oct 2022
8. M. Allegretti, M.C. Cesta, M. Zippoli, A. Beccari, C. Talarico, F. Mantelli et al., Repurposing the estrogen receptor modulator raloxifene to treat SARS-CoV-2 infection. Cell Death Differ. 29(1), 156–166 (2022)
9. D. Gadioli, E. Vitali, F. Ficarelli, C. Latini, C. Manelfi, C. Talarico et al., EXSCALATE: An extreme-scale virtual screening platform for drug discovery targeting polypharmacology to fight SARS-CoV-2. IEEE Trans. Emerg. Top. Comput. (2022). https://doi.org/10.1109/TETC.2022.3187134
10. A.R. Beccari, C. Cavazzoni, C. Beato, G. Costantino, LiGen: a high performance workflow for chemistry driven de novo design. J. Chem. Inf. Model. 53(6), 1518–1527 (2013)
11. E. Vitali, D. Gadioli, G. Palermo, A. Beccari, C. Cavazzoni, C. Silvano, Exploiting OpenMP and OpenACC to accelerate a geometric approach to molecular docking in heterogeneous HPC nodes. J. Supercomput. 75(7), 3374–3396 (2019)
12. S. Markidis, D. Gadioli, E. Vitali, G. Palermo, Understanding the I/O impact on the performance of high-throughput molecular docking, in 2021 IEEE/ACM Sixth International Parallel Data Systems Workshop (PDSW) (2021), pp. 9–14
13. J. Glaser, J.V. Vermaas, D.M. Rogers, J. Larkin, S. LeGrand, S. Boehm et al., High-throughput virtual laboratory for drug discovery using massive datasets. Int. J. High Perform. Comput. Appl. 35(5), 452–468 (2021)
14. G.M. Morris, R. Huey, W. Lindstrom, M.F. Sanner, R.K. Belew, D.S. Goodsell et al., AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 30(16), 2785–2791 (2009)
15. S. LeGrand, A. Scheinberg, A.F. Tillack, M. Thavappiragasam, J.V. Vermaas, R. Agarwal et al., GPU-accelerated drug discovery with docking on the Summit supercomputer: porting, optimization, and application to COVID-19 research, in Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (2020), p. 43
16. D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28(1), 31–36 (1988)
17. MEDIATE - MolEcular DockIng AT home, https://mediate.exscalate4cov.eu/. Accessed 13 Oct 2022
18. A. Zaliani, L. Vangeel, J. Reinshagen, D. Iaconis, M. Kuzikov, O. Keminer et al., Cytopathic SARS-CoV-2 screening on VERO-E6 cells in a large-scale repurposing effort. Sci. Data 9(1), 405 (2022)
19. M. Kuzikov, E. Costanzi, J. Reinshagen, F. Esposito, L. Vangeel, M. Wolf et al., Identification of inhibitors of SARS-CoV-2 3CL-Pro enzymatic activity using a small molecule in vitro repurposing screen. ACS Pharmacol. Transl. Sci. 4(3), 1096–1110 (2021)
20. A. Corona, K. Wycisk, C. Talarico, C. Manelfi, J. Milia, R. Cannalire et al., Natural compounds inhibit SARS-CoV-2 nsp13 unwinding and ATPase enzyme activities. ACS Pharmacol. Transl. Sci. 5(4), 226–239 (2022)
21. D. Iaconis, L. Bordi, G. Matusali, C. Talarico, C. Manelfi, M. Candida Cesta et al., Characterization of raloxifene as a potential pharmacological agent against SARS-CoV-2 and its variants. Cell Death Dis. 13(5), 498 (2022)
22. O. Solis, A.R. Beccari, D. Iaconis, C. Talarico, C.A. Ruiz-Bedoya, J.C. Nwachukwu et al., The SARS-CoV-2 spike protein binds and modulates estrogen receptors. bioRxiv (2022)
23. National Institutes of Health, National Center for Advancing Translational Sciences. OpenData: COVID-19, https://opendata.ncats.nih.gov/covid19/. Accessed 13 Oct 2022
24. S.S. Barbieri, F. Cattani, L. Sandrini, M.M. Grillo, C. Talarico, D. Iaconis et al., Relevance of the viral spike protein/cellular estrogen receptor-α interaction for endothelial-based coagulopathy induced by SARS-CoV-2. bioRxiv (2022)
Chapter 3
The Drug Repurposing Strategy in the Exscalate4CoV Project: Raloxifene Clinical Trials

Andrea Beccari, Lamberto Dionigi, Emanuele Nicastri, Candida Manelfi, and Elizabeth Gavioli

Abstract Drug repurposing is a cost-effective process to identify therapeutic candidates during a medical crisis or pandemic. The supercomputing platform, EXaSCale smArt pLatform Against paThogEns for CoronaVirus (EXSCALATE4CoV; E4C), was used to identify drug candidates for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. E4C identified raloxifene as having great therapeutic potential, confirmed by in vitro data, which led to the progression of clinical trials to assess its efficacy. Raloxifene met the primary virologic endpoint in the treatment of early mild coronavirus disease 2019 (COVID-19), and although additional clinical trials are needed to confirm these results, there is evidence in support of in silico drug repurposing to provide cost-effective and rapid drug screening to identify treatment options for the pandemic and future pandemics.
A. Beccari · L. Dionigi · C. Manelfi · E. Gavioli (B)
Dompé farmaceutici S.p.A., L’Aquila, Italy
e-mail: [email protected]

E. Nicastri
Lazzaro Spallanzani National Institute for Infectious Diseases, IRCCS, Rome, Italy

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
S. Coletti and G. Bernardi (eds.), Exscalate4CoV, SpringerBriefs in Applied Sciences and Technology, https://doi.org/10.1007/978-3-031-30691-4_3

3.1 Drug Repurposing for COVID-19

There remains an urgent need to find cost-effective treatments for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and to address future pandemics. It is estimated that bringing a new drug to market may cost $161 million to $2 billion [1]. Drug repurposing is the process of identifying novel uses for approved and investigational drugs and may be a compelling strategy to combat the coronavirus disease 2019 (COVID-19) pandemic or future pandemics [2]. This method may be more cost- and time-efficient than de novo drug development, because repurposed drugs already have safety, tolerability, and pharmacokinetic data established for other indications [3].

Three common approaches to repurposing are computational approaches, biological experimental approaches, and mixed approaches [2]. Experimental approaches may include binding assays such as proteomics and mass spectrometry, or phenotypic methods used to identify binding interactions of ligands and assay components to identify lead compounds and mechanisms of action [4]. Computational or in silico approaches are an especially exciting avenue because they can simulate numerous interactions between drugs and pathogens, and the number of simulations is limited only by the system’s computational power [5]. Li et al. recently identified 30 drugs that can potentially be repurposed for treating COVID-19 by analyzing the genome of SARS-CoV-2 and using a novel network-based drug repurposing platform [2, 6].

The EXaSCale smArt pLatform Against paThogEns for CoronaVirus (EXSCALATE4CoV; E4C) is a European project that aims to identify promising therapeutic agents for SARS-CoV-2 infection by using one of the world’s most powerful supercomputing platforms, capable of evaluating the interactions between 70 billion molecules and 15 binding sites of 12 viral SARS-CoV-2 proteins within 60 h [5]. During the E4C project, the 7000 most promising candidates were identified and analyzed for their potential involvement in host adaptation, pathogenicity, and host–pathogen interactions employed against the virus based on subsequent in vitro screenings of SARS-CoV-2 target proteins [3]. Out of these potential agents, raloxifene, an estrogen receptor (ER) modulator, was identified as a potential therapeutic agent for mild-to-moderate disease owing to its predicted high probability of interacting with several relevant SARS-CoV-2 proteins, including papain-like protease (PLpro) and the spike protein (S) (Fig. 3.1). E4C also used supercomputing and bioinformatics to determine interactions between S and ERs, finding that S binds with high affinity to human ERα and uses this mechanism to modulate ER-dependent biological functions [3, 7].
Fig. 3.1 In silico polypharmacologic effects of raloxifene performed on the main SARS-CoV-2 proteins. Figure reproduced from [3]. Reproduced under Creative Commons license https://creativecommons.org/licenses/by/4.0
3.2 Therapeutic Potential of Raloxifene

Sex hormone receptors such as the ER may be involved in regulating the expression and activity of viral entry proteins [8]. Angiotensin-converting enzyme 2 (ACE2), expressed on the cell surface, is the primary functional receptor for SARS-CoV and SARS-CoV-2 and is considered the major tropism determinant [3, 9]. Despite similar structures and binding patterns, SARS-CoV-2 spreads more rapidly than SARS-CoV. This is likely owing to the stronger binding of the ACE2–SARS-CoV-2 complex, as the virus is optimized for binding to this receptor [10]. Estrogens are known to modulate ACE2, and most literature reports the upregulation of this enzyme through a proposed process such as the following: SARS-CoV-2 enters host cells via the ACE2 receptor, initiating an inflammatory response and producing proinflammatory cytokines, specifically interleukin-6 (IL-6) and tumor necrosis factor-α (TNF-α). This leads to cytokine storm and cytokine release syndrome, life-threatening systemic inflammatory responses that can lead to the development of acute respiratory distress syndrome and the dysfunction of multiple organ systems [11]. Estrogens may also provide a protective effect in the progression of COVID-19 owing to their role in the regulation of innate and adaptive immune responses, as well as in the control of the hyperactivation of the immune system [12, 13]. Antiinflammatory activity of estrogen is mediated by multiple mechanisms including increased production of antiinflammatory cytokines and decreased production of proinflammatory cytokines, increased antibody production, and suppressed macrophage and monocyte migration [14].

Raloxifene, or 1-[6-hydroxy-2-(4-hydroxyphenyl)benzo[b]thien-3-yl]-1-[4-[2-(1-piperidinyl)ethoxy]phenyl]methanone, is a selective benzothiophene ER modulator (BT-SERM) that has demonstrated both antagonistic and agonistic effects on estrogen-responsive tissues (Fig. 3.2) [3].
This medication has been used around the world to treat and prevent osteoporosis in postmenopausal women (Europe and the United States) and to reduce the risk of invasive breast cancer in postmenopausal women (United States) [3, 15, 16]. Its membrane permeability allows raloxifene to reach and bind the ER in the nucleus, with an affinity similar to that of the ligand estradiol. In reproductive tissue, activation of genes that contain estrogen response elements is blocked. In bone and nonreproductive tissue, raloxifene-responding elements in DNA sequences are activated, resulting in decreased osteoclastic resorptive activity and reduced production of IL-6 and TNF-α [17]. Raloxifene has been used for >20 years and has one of the best long-term safety profiles among the class of SERMs, along with a pharmacologic profile that indicates a good risk-to-benefit ratio and tolerability [3]. Additionally, the cardiovascular safety profile of raloxifene has been established through several long-term studies and marketing analyses [16]. Raloxifene may be a promising SERM for the treatment of SARS-CoV-2 and has already been shown to have antiviral activity in vitro against Ebola virus, flaviviruses, hepatitis C virus, and influenza A virus [18–23]. This is owing to raloxifene’s broad tissue distribution, including to the lungs, and its ability to inhibit the production of viral progeny and target the life cycle of the virus [24]. In a prospective, open-label, randomized controlled trial of postmenopausal women with hepatitis C virus genotype 1b, raloxifene demonstrated a statistically significant benefit in sustained virologic response and end-of-treatment response compared with the standard of care after 24 weeks [22]. There are many connections between ER modulators and host response to viral infections that suggest raloxifene as a promising therapeutic agent to combat the COVID-19 pandemic [8]. Raloxifene was identified by the E4C project owing to its activity and tissue distribution in the lungs and its ability to regulate SARS-CoV-2 cell entry receptor expression and modulate the inflammatory response (Fig. 3.3).

Fig. 3.2 SERM structural analogies with steroids. Figure reproduced from [3]. Reproduced under Creative Commons license https://creativecommons.org/licenses/by/4.0
Fig. 3.3 Potential inhibitory actions of raloxifene on SARS-CoV-2 infection and COVID-19 progression

3.3 Raloxifene Clinical Trials

E4C identified raloxifene as having great therapeutic potential against SARS-CoV-2, and in vitro data confirmed raloxifene as the most potent molecule of the SERM class, with antiviral activity at low concentrations against several SARS-CoV-2 strains [3]. The identification of this promising drug candidate by EXSCALATE has led to the progression of clinical trials to assess its efficacy in COVID-19 [11].

A phase II, adaptive, multicenter, randomized, placebo-controlled, double-blind study sponsored by Dompé farmaceutici S.p.A. was designed to evaluate the efficacy and safety of raloxifene in adult patients with mild-to-moderate COVID-19 [11]. The study was conducted in participants aged ≥40 years with symptomatic SARS-CoV-2 infection confirmed by polymerase chain reaction within 7 days and no need for supplemental oxygen or mechanical ventilation [11]. Sixty-eight participants were enrolled and randomized to raloxifene 120 mg (n = 23), raloxifene 60 mg (n = 24), or a placebo (n = 21); the study design can be seen in Fig. 3.4. The primary virologic endpoint, the proportion of participants with undetectable virus at day 7, was met for raloxifene 60 mg compared with placebo (P = 0.01) (Fig. 3.5). The percentage of participants with undetectable SARS-CoV-2 did not significantly differ among the groups at later time points, but a favorable trend was observed. The results of this study suggest raloxifene has the potential to induce a reduction in early and sustained viral load and the ability to act as an ER agonist and antiinflammatory agent at low doses while acting as an ER antagonist at higher doses [11]. Safety outcomes of the study included the proportion of participants with a grade ≤ 2 according to the Common Terminology Criteria for Adverse Events (CTCAE) on days 7, 14, and 28, and the proportion of participants with any severe adverse events (grade ≥ 3 according to CTCAE) at days 7, 14, and 28. Treatment-emergent
adverse events occurred in 52.6% of participants in the placebo group compared with 36.4% in the 60-mg group and 50.0% in the 120-mg group. The most commonly reported adverse event was gastrointestinal discomfort. Treatment-emergent serious adverse events leading to hospitalization occurred more frequently in the placebo group (26.3%) compared with the 60-mg group (13.6%) and the 120-mg group (10.0%) [11]. This study provides a great platform to initiate further clinical studies to evaluate the potential of raloxifene as a COVID-19 drug candidate.

Fig. 3.4 Study design and primary objective

Fig. 3.5 Proportion of patients with undetectable SARS-CoV-2
3.4 Summary

The EXSCALATE platform has proven to have great utility in drug repurposing during a pandemic and should be used in the future to identify candidates for other diseases. Raloxifene was identified as a potential agent against SARS-CoV-2 and met the primary virologic endpoint in the treatment of early mild COVID-19. Although additional clinical trials are needed to confirm the efficacy and safety of raloxifene in treating SARS-CoV-2 infection, there is sufficient evidence in support of in silico drug repurposing to provide cost-effective and rapid screening of drugs to identify treatment options for the COVID-19 pandemic and future pandemics.
References

1. A. Sertkaya, A. Birkenbach, A. Berlind, J. Eyraud, Eastern Research Group, Inc., Examination of clinical trial costs and barriers for drug development, https://aspe.hhs.gov/reports/examination-clinical-trial-costs-barriers-drug-development-0. Accessed 19 June 2022
2. T.U. Singh, S. Parida, M.C. Lingaraju, M. Kesavan, D. Kumar, R.K. Singh, Drug repurposing approach to fight COVID-19. Pharmacol. Rep. 72(6), 1479–1508 (2020)
3. M. Allegretti, M.C. Cesta, M. Zippoli, A. Beccari, C. Talarico, F. Mantelli et al., Repurposing the estrogen receptor modulator raloxifene to treat SARS-CoV-2 infection. Cell Death Differ. 29(1), 156–166 (2022)
4. V. Parvathaneni, N.S. Kulkarni, A. Muth, V. Gupta, Drug repurposing: a promising tool to accelerate the drug discovery process. Drug Discov. Today 24(10), 2076–2085 (2019)
5. D. Gadioli, E. Vitali, F. Ficarelli, C. Latini, C. Manelfi, C. Talarico et al., EXSCALATE: An extreme-scale in-silico virtual screening platform to evaluate 1 trillion compounds in 60 hours on 81 PFLOPS supercomputers. Preprint posted online 22 Oct 2021
6. X. Li, J. Yu, Z. Zhang, J. Ren, A.E. Peluffo, W. Zhang et al., Network bioinformatics analysis provides insight into drug repurposing for COVID-2019. Med. Drug Discov. 10, 100090 (2021)
7. O. Solis, A.R. Beccari, D. Iaconis, C. Talarico, C.A. Ruiz-Bedoya, J.C. Nwachukwu et al., The SARS-CoV-2 spike protein binds and modulates estrogen receptors. bioRxiv. Preprint posted online 23 May 2022
8. D. Iaconis, L. Bordi, G. Matusali, C. Talarico, C. Manelfi, M. Candida Cesta et al., Characterization of raloxifene as a potential pharmacological agent against SARS-CoV-2 and its variants. Cell Death Dis. 13(5), 498 (2022)
9. H. Zhang, M.R. Rostami, P.L. Leopold, J.G. Mezey, S.L. O’Beirne, Y. Strulovici-Barel et al., Expression of the SARS-CoV-2 ACE2 receptor in the human airway epithelium. Am. J. Respir. Crit. Care Med. 202(2), 219–229 (2020)
10. C. Bai, A. Warshel, Critical differences between the binding features of the spike proteins of SARS-CoV-2 and SARS-CoV. J. Phys. Chem. B 124(28), 5907–5912 (2020)
11. E. Nicastri, F. Marinangeli, E. Pivetta, E. Torri, F. Reggiani, G. Fiorentino et al., A phase 2 randomized, double-blinded, placebo-controlled, multicenter trial evaluating the efficacy and safety of raloxifene for patients with mild to moderate COVID-19. EClinicalMedicine 48, 101450 (2022)
12. R. Mishra, L.M. Behera, S. Rana, Binding of raloxifene to human complement fragment 5a (h C5a): a perspective on cytokine storm and COVID19. J. Biomol. Struct. Dyn. 40(3), 982–994 (2022)
13. S. Hong, J. Chang, K. Jeong, W. Lee, Raloxifene as a treatment option for viral infections. J. Microbiol. 59(2), 124–131 (2021)
14. Q. Ma, Z.W. Hao, Y.F. Wang, The effect of estrogen in coronavirus disease 2019. Am. J. Physiol. Lung Cell Mol. Physiol. 321(1), L219–L227 (2021)
15. J.R. Caeiro Rey, E. Vaquero Cervino, M. Luz Rentero, E. Calvo Crespo, A. Oteo Alvaro, M. Casillas, Raloxifene: mechanism of action, effects on bone tissue, and applicability in clinical traumatology practice. Open Orthop. J. 3, 14–21 (2009)
16. EVISTA (raloxifene hydrochloride). Package insert, https://www.accessdata.fda.gov/drugsatfda_docs/label/2018/020815s034lbl.pdf. Accessed 2 Nov 2022
17. H.K. Patel, T. Bihani, Selective estrogen receptor modulators (SERMs) and selective estrogen receptor degraders (SERDs) in cancer treatment. Pharmacol. Ther. 186, 1–24 (2018)
18. N.S. Eyre, E.N. Kirby, D.R. Anfiteatro, G. Bracho, A.G. Russo, P.A. White et al., Identification of estrogen receptor modulators as inhibitors of flavivirus infection. Antimicrob. Agents Chemother. 64(8), e00289-e320 (2020)
19. Y.S. Yoon, Y. Jang, T. Hoenen, H. Shin, Y. Lee, M. Kim, Antiviral activity of sertindole, raloxifene and ibutamoren against transcription and replication-competent Ebola virus-like particles. BMB Rep. 53(3), 166–171 (2020)
20. Y. Murakami, M. Fukasawa, Y. Kaneko, T. Suzuki, T. Wakita, H. Fukazawa, Selective estrogen receptor modulators inhibit hepatitis C virus infection at multiple steps of the virus life cycle. Microbes Infect. 15(1), 45–55 (2013)
21. M. Takeda, M. Ikeda, K. Mori, M. Yano, Y. Ariumi, H. Dansako et al., Raloxifene inhibits hepatitis C virus infection and replication. FEBS Open Bio 22(2), 279–283 (2012)
22. N. Furusyo, E. Ogawa, M. Sudoh, M. Murata, T. Ihara, T. Hayashi et al., Raloxifene hydrochloride is an adjuvant antiviral treatment of postmenopausal women with chronic hepatitis C: a randomized trial. J. Hepatol. 57(6), 1186–1192 (2012)
23. J. Peretz, A. Pekosz, A.P. Lane, S.L. Klein, Estrogenic compounds reduce influenza A virus replication in primary human nasal epithelial cells derived from female, but not male, donors. Am. J. Physiol. Lung Cell Mol. Physiol. 310(5), L415–L425 (2016)
24. J.A. Dodge, C.W. Lugar, S. Cho, L.L. Short, M. Sato, N.N. Yang et al., Evaluation of the major metabolites of raloxifene as modulators of tissue selectivity. J. Steroid Biochem. Mol. Biol. 61(1–2), 97–106 (1997)
Chapter 4
The High-Performance Computing Resources for the EXSCALATE4CoV Project

Andrew Emerson, Federico Ficarelli, Gianluca Palermo, and Francesco Frigerio

Abstract In order to repurpose currently available therapeutics for novel diseases, druggable targets have to be identified and matched with small molecules. In the case of a public health emergency, such as the ongoing coronavirus disease 2019 (COVID-19) pandemic, this identification needs to be accomplished quickly to support the rapid initiation of effective treatments to minimize casualties. The utilization of supercomputers, or more generally High-Performance Computing (HPC) facilities, to accelerate drug design is well established, but when the pandemic emerged in early 2020, it was necessary to activate a process of urgent computing, i.e., prioritized and immediate access to the most powerful computing resources available. Thanks to the close collaboration of the partners in the HPC activity, it was possible to rapidly deploy an urgent computing infrastructure of world-class supercomputers, massive cloud storage, efficient simulation software, and analysis tools. With this infrastructure, the project team performed very long molecular dynamics simulations and extreme-scale virtual drug screening experiments, eventually identifying molecules with potential antiviral activity. In conclusion, the EXaSCale smArt pLatform Against paThogEns for CoronaVirus (EXSCALATE4CoV) project successfully brought together Italian computing resources to help identify effective drugs to stop the spread of the SARS-CoV-2 virus.
A. Emerson (B) · F. Ficarelli
CINECA, Casalecchio di Reno, Italy
e-mail: [email protected]

G. Palermo
Politecnico Milano, Milan, Italy

F. Frigerio
Eni S.p.A, Milan, Italy

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
S. Coletti and G. Bernardi (eds.), Exscalate4CoV, SpringerBriefs in Applied Sciences and Technology, https://doi.org/10.1007/978-3-031-30691-4_4
Table 4.1 Supercomputers used in EXSCALATE4CoV

System name    Total nodes   Node architecture                                          Total system performance (PFLOPS)
Eni HPC5       3188          2× Intel Xeon Gold + 4× NVIDIA V100 GPUs with PCIe         51.7
MARCONI M100   980           2× IBM Power AC922 + 4× NVIDIA V100 GPUs with NVLink 2.0   32
4.1 Supercomputer Resources

At the start of the project (March 2020), the High-Performance Computing (HPC) resources available were limited to the Tier-1 Galileo cluster at CINECA, since the flagship Marconi100 system had not yet been put into service; the pandemic started during the installation period of the larger machine. However, in April 2020, Eni S.p.A. decided to support the project by providing multiple nodes of their NVIDIA V100 GPU-equipped HPC5 system, at the time the most powerful supercomputer in Europe and the sixth most powerful in the world. This action allowed both the molecular dynamics (MD) and virtual screening teams to start optimizing the GPU-based software stacks. In May 2020, the Eni cluster was joined by the CINECA Marconi100 (M100) system, also equipped with NVIDIA V100 GPUs and at the time the second most powerful supercomputer in Europe. In addition to computing cycles, the project also benefited from cloud storage made available by the Italian National Institute for Nuclear Physics (INFN) for storing the results of the calculations. The specifications of the two supercomputers are given in Table 4.1, and in the following sections we describe in more detail the HPC resources and how they were used in the project.
4.2 HPC-Layer 5 (HPC5)

HPC5 was developed for the energy industry by Eni in February 2020 to reduce computing times, making idea generation quicker while requiring less energy consumption. The cluster is based on nodes containing 2× Intel Gold CPUs and 4 NVIDIA V100 GPUs connected via PCIe links. For the lifetime of the EXSCALATE4CoV project, Eni provided exclusive access to 50 of these nodes together with a dedicated file system to support the research [1]. In addition, 1500 HPC5 nodes were made available (in conjunction with the CINECA M100 system) for the so-called “big run” of the high-throughput virtual screening experiments executed by the project. Eni also provided research and support staff for using the HPC resources.
4.3 CINECA Marconi M100

CINECA is a nonprofit supercomputing consortium in Italy that comprises 103 members, including 69 universities, 32 Italian national institutions, the Italian Ministry of Universities and Research, and the Italian Ministry of Education. CINECA is the largest computer center in Italy and one of the most important in Europe and the world. For EXSCALATE4CoV, CINECA provided dedicated time and support staff on the flagship Marconi100 (M100) system, which consists of nodes equipped with IBM POWER9 processors and NVIDIA V100 GPUs connected via the high-bandwidth NVLink interconnect. Marconi100 is a production system with many users; therefore, unlike on HPC5, it was not possible to allocate dedicated nodes to the project. An exception was made for the big virtual screening run, where the center agreed to dedicate the whole system (nearly 1000 nodes) for 60 h to run the experiment.
4.4 Uses of HPC in EXSCALATE4CoV

4.4.1 HPC Workflow

The HPC workflow in EXSCALATE4CoV shown in Fig. 4.1 is centered around the EXSCALATE platform [2], a high-throughput virtual screening system based on the LiGen package, proprietary software that has been codeveloped by Dompé farmaceutici, CINECA, Politecnico di Milano, and other partners [3]. The LiGen software screens a database of putative drug molecules or ligands against accurate 3D structures of one or more protein targets. Ligands that are ranked highly by the system can then be examined further in the drug-discovery pipeline. The 3D protein structures that LiGen reads are usually obtained from experiment, especially high-quality X-ray diffraction data, which are considered the gold standard for protein structures. However, the crystals used for these experiments are not good representations of the proteins in vivo because the molecules are highly flexible, and physiological conditions favor a wide range of possible shapes, or conformations. One way to obtain representative sets of conformations in a more realistic environment is to apply the technique of MD simulation to the protein structure. The EXSCALATE4CoV HPC workflow therefore has two components requiring computational resources: the first is the virtual screening based on LiGen, and the second is the MD simulations that provide representative 3D protein coordinates as inputs to LiGen. In addition to providing structures for LiGen, the MD simulations were also used directly for studying protein dynamics, with the raw trajectories being made available to other researchers via the cloud. We now describe these two components in more detail.
Fig. 4.1 EXSCALATE4CoV HPC Workflow
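
The coupling between the two workflow components can be illustrated with a minimal sketch: each ligand is scored against several protein conformations (for example, frames taken from an MD trajectory) and is ranked by its best score. This is not LiGen code. The names `ensemble_rank` and `toy_score` are hypothetical, the scoring function is a deterministic placeholder with no chemical meaning, and the convention that a higher score is better is an assumption made only for the example.

```python
from typing import Callable, List, Sequence, Tuple


def toy_score(ligand: str, conformation: str) -> float:
    """Placeholder score (NOT a docking score): deterministic and cheap."""
    return float(sum(map(ord, ligand + conformation)) % 100)


def ensemble_rank(
    ligands: Sequence[str],
    conformations: Sequence[str],
    score: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Rank ligands by their best score over all protein conformations.

    Assumes a higher score means a better predicted binder.
    """
    best = []
    for ligand in ligands:
        # Keep only the most favorable conformation for each ligand.
        best_score = max(score(ligand, conf) for conf in conformations)
        best.append((ligand, best_score))
    # Highest-scoring ligands first.
    return sorted(best, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    ligands = ["CCO", "c1ccccc1", "CC(=O)O"]      # SMILES strings
    frames = ["frame_0", "frame_1", "frame_2"]   # stand-ins for MD snapshots
    for ligand, value in ensemble_rank(ligands, frames, toy_score):
        print(ligand, value)
```

Taking the maximum over conformations is only one possible aggregation; an average or a consensus over frames would be equally easy to plug into the same loop.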
4.4.2 High-Throughput Virtual Screening

4.4.2.1 Software Porting of the EXSCALATE4CoV Platform
Virtual screening is a computational technique that searches libraries of small molecules (termed ligands) to identify those structures that are most likely to bind to a drug target, usually a protein such as an enzyme. For in silico high-throughput virtual screening (HTVS), a large number of ligands can be processed by a parallel computer because each evaluation is independent of the others, thus allowing high parallelism. In the urgent computing environment of the EXSCALATE4CoV project, the HTVS framework needed to be not only parallel but also highly optimized at the CPU, GPU, and node levels and tuned for both the M100 and HPC5 architectures. The first stage of the optimization involved tuning the main modules of LiGen for single-node execution. Thus, in early 2020, the LiGen docking and scoring algorithms were ported to CUDA to run on NVIDIA GPUs, and the CPU-only versions also underwent a major rewrite to fit the new code structure. After several rounds of parameter tuning and kernel optimization, it was possible to obtain a sustained throughput of approximately 2400 ligands per second on a fully utilized M100 node with 4× NVIDIA V100 GPUs. The availability of multiple implementations of the dock and score algorithm, i.e., for both CPUs and GPUs, allowed the throughput to be maximized. The next stage was to scale up to all the available nodes of the target cluster, which involved optimizing several steps in the docking pipeline, including (i) the transfer of data within a node to feed the accelerators, (ii) data transfer from storage devices to the machine’s nodes and vice versa, (iii) minimization of the communications between nodes and synchronizations between processes, and (iv) improving resilience to reduce the impact of hardware faults on the time to solution.
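
Because each ligand evaluation is independent, the screening loop can be split into batches that are processed concurrently with no communication between them. The sketch below illustrates that pattern only; it is not the LiGen implementation. The helper names (`batches`, `score_batch`, `screen`) and the placeholder `toy_score` are hypothetical, and a thread pool is used purely for portability of the example. A production run would instead distribute batches across processes, GPUs, and nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def toy_score(smiles: str) -> float:
    """Placeholder for a dock-and-score kernel (NOT a real docking score)."""
    return float(len(smiles))


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def score_batch(batch: List[str]) -> List[Tuple[str, float]]:
    """Score one batch; batches share no state, so they can run concurrently."""
    return [(ligand, toy_score(ligand)) for ligand in batch]


def screen(ligands: Iterable[str], batch_size: int = 2, workers: int = 4) -> List[Tuple[str, float]]:
    """Score all ligands batch by batch, preserving the input order."""
    results: List[Tuple[str, float]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        for scored in pool.map(score_batch, batches(ligands, batch_size)):
            results.extend(scored)
    return results


if __name__ == "__main__":
    for ligand, value in screen(["CCO", "c1ccccc1", "CC(=O)O", "N", "CCN"]):
        print(ligand, value)
```

Batching amortizes scheduling overhead and mirrors steps (i) and (ii) above, where data must be moved in bulk to keep the accelerators busy.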
It should be emphasized that low-level tuning of the kernels continued throughout the development phase to match the characteristics and memory sizes of the HPC platforms. In fact, fitting the data structures into the most suitable memories has been one of the key features contributing to the high performance of the system. Although the two types of HPC nodes available in CINECA Marconi100 and Eni HPC5 are similar, each being accelerated by 4 NVIDIA GPUs, custom tuning was needed to address the different file system structures and the different CPU architectures. In addition to the software optimization, the parameters for the pose search were also chosen to provide a balance between the quality of the solution and the docking throughput.
4.4.2.2 Drug Repurposing
One of the first major tests of the LiGen-enabled EXSCALATE platform was a drug repurposing experiment, i.e., a screening of existing drugs approved for different therapeutic conditions. In this experiment, a database of 400,000 molecules, consisting of so-called “safe-in-man” drugs (around 10,000) and natural products, was screened against 3D structures of the viral 3CLPro protein, which were obtained from MD simulations (see below). From the docking results, it was possible to identify a small subset of candidate molecules that could be tested in the laboratory. One outcome of the laboratory tests was the identification of the osteoporosis drug raloxifene as a possibly active molecule against the SARS-CoV-2 virus. After approval by the EMA, the drug entered clinical trials for its use in COVID-19 patients [4].
4.4.3 The Big Run

After only a few months of development and testing, the EXSCALATE platform was deemed ready for the first massive HTVS experiment, known within the project as the Big Run. Thus, in November 2020, HPC resources consisting of all the nodes of both the CINECA M100 and Eni HPC5 clusters, a combined theoretical performance of 81 PFLOPS, were allocated to execute a massive virtual screening, which, at the time of writing and to the best of our knowledge, has been the largest high-throughput docking event ever achieved [5]. In the course of 60 h, a chemical library of more than 70 billion ligands was docked against 15 binding sites of 12 viral SARS-CoV-2 proteins to achieve more than a trillion docking operations. The results of this run, amounting to about 65 TB of data (approximately 4.33 TB for each active site), are still being analyzed more than one year after the end of the project. In fact, the large amount of data has necessitated the development of ad hoc parallel tools to help perform the post-processing and a completely new approach to identify the more interesting compounds from the initial huge size of the target library.
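The figures quoted for the Big Run can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are ours, "70 billion" is taken as a lower bound, and a decimal terabyte (10^12 bytes) is an assumption made purely for the last estimate.

```python
LIGANDS = 70e9        # "more than 70 billion ligands" (lower bound)
BINDING_SITES = 15    # binding sites across 12 viral proteins
WALL_HOURS = 60       # duration of the run
OUTPUT_TB = 65        # total size of the results

# Every ligand is docked against every binding site.
dockings = LIGANDS * BINDING_SITES            # 1.05e12, i.e. over a trillion

# Average sustained rate over the whole run, both clusters combined.
seconds = WALL_HOURS * 3600
dockings_per_second = dockings / seconds      # roughly 4.9 million per second

# Output volume per binding site, as quoted in the text.
tb_per_site = OUTPUT_TB / BINDING_SITES       # about 4.33 TB

# Average output per docking operation (assumes 1 TB = 1e12 bytes).
bytes_per_docking = OUTPUT_TB * 1e12 / dockings   # on the order of 60 bytes
```

The last value is only an average over all docking operations; it says nothing about what was actually stored per ligand, but it gives a feeling for why dedicated parallel post-processing tools were needed.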
4.4.4 Molecular Dynamics

4.4.4.1 Software Optimization
For the MD simulations, we used the open-source GROMACS software [6] since this is very efficient, and one of the developers of the program, KTH, was also a partner of the project. Toward the beginning of 2020, the GROMACS 2020 version became available and was installed on the HPC infrastructure. This and subsequent versions have the advantage that almost all of the atomic interactions can be executed on the GPU, with only limited communications toward the CPU. After some tuning of the compile and runtime parameters on HPC5 and M100, it was possible to achieve a minimum of 100 ns/day performances for the COVID-19 proteins using only one compute node (the performance varies according to the size of the simulated system). The high single-node performance is very convenient because it avoids a time-consuming parallel scaling analysis to determine the optimum number of nodes to use and minimizes the wait time for batch resources in the shared computer systems. GROMACS 2020 was installed on both M100 and HPC5, and batch scripts in the SLURM (M100) and PBS (HPC5) job schedulers were prepared and optimized. To overcome the 24-h wall time limit present on both systems, UNIX bash scripts were designed to allow batch jobs to be pipelined in such a way that each job would only start when the previous one in the pipeline had finished.
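The job pipelining described above can be sketched as follows. The project used UNIX bash scripts that are not reproduced in this chapter; the Python snippet below is only an illustration of the idea. It estimates how many 24-h jobs a 10 µs trajectory needs at 100 ns/day and builds a chain of SLURM submission commands using the `--dependency=afterok:` option of `sbatch`. The script name `md_segment.sh` and the `$JOBn` placeholders are hypothetical, and `afterok` (start only if the previous job succeeded) is our choice rather than a documented detail of the project. PBS offers a comparable mechanism through `qsub -W depend=afterok:<jobid>`.

```python
import math


def jobs_needed(target_ns, ns_per_day, wall_hours=24.0):
    """Number of wall-time-limited jobs needed to simulate target_ns."""
    ns_per_job = ns_per_day * wall_hours / 24.0
    return math.ceil(target_ns / ns_per_job)


def slurm_chain(script, n_jobs):
    """Build sbatch command lines for a chain of dependent jobs.

    Job IDs are only known after submission, so a placeholder such as
    $JOB1 marks where the ID printed by the previous sbatch call would go.
    """
    commands = []
    for i in range(1, n_jobs + 1):
        cmd = ["sbatch", "--parsable"]  # --parsable prints just the job ID
        if i > 1:
            # Start only after the previous job has finished successfully.
            cmd.append(f"--dependency=afterok:$JOB{i - 1}")
        cmd.append(script)
        commands.append(" ".join(cmd))
    return commands


# 10 microseconds = 10,000 ns at the quoted minimum of 100 ns/day.
n_jobs = jobs_needed(target_ns=10_000, ns_per_day=100)
chain = slurm_chain("md_segment.sh", 3)  # hypothetical batch script name
```

At the minimum quoted performance, a single 10 µs trajectory therefore corresponds to about 100 consecutive one-day jobs, which is why automatic chaining matters more than the raw speed of any single job.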
4.4.4.2 Proteins Studied
The initial aim was to simulate, for 10 µs each, all the proteins coded by the SARS-CoV-2 genome, which amounts to about 25 different structures. However, since some proteins such as the proteases are more pharmacologically relevant, the list of systems to simulate with MD was prioritized. A further consideration was the availability of good-quality structures from X-ray experiments, needed in the MD runs as starting coordinates for the simulations. X-ray structures of many viral proteins were not available at the start of the project; in these cases, coordinates based on homology models (i.e., structures based on the known structures of similar proteins) were used instead. These are less accurate than X-ray data; therefore, if a high-quality X-ray structure for one of these molecules became available during the course of the project, the homology-based simulation was abandoned, and the dynamics restarted with the new coordinates. In addition to the single protein simulations, so-called holo systems consisting of a protein and bound ligand were also simulated if a high-quality structure became available. In total, about 30 MD simulations of 10 µs each were completed, analyzed, and uploaded to the cloud storage by the end of the project. More details on the parameters and conditions used in the MD simulations can be found, for example, in Grottesi et al. [7].
4.5 Analysis

At the end of each run, the MD trajectories were processed to remove the solvent, significantly reducing the sizes of the trajectory files. These were then analyzed with programs available from the GROMACS suite to ensure that artifacts introduced by the periodic boundary conditions used in the simulation had been removed correctly. In order to find representative protein structures for LiGen, principal component analysis (PCA) was employed to identify clusters in the conformational space that were preferentially explored by the protein during the MD run. These clusters were then sampled to provide input structures (e.g., about 200) for the screening with LiGen.
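The selection of representative structures by PCA can be illustrated with a toy example. The sketch below is not the analysis pipeline used in the project: the "trajectory" is synthetic (two invented conformational states in three coordinates), the first principal component is obtained by simple power iteration instead of a numerical library, equal-width binning along that component stands in for a proper clustering step, and the superposition of frames onto a reference structure that a real trajectory would need is omitted.

```python
import math
import random


def center(frames):
    """Subtract the per-coordinate mean from every frame."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(dim)]
    return [[f[j] - mean[j] for j in range(dim)] for f in frames]


def first_principal_component(centered, n_iter=50):
    """Leading eigenvector of the covariance matrix via power iteration."""
    dim = len(centered[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        # w = X^T (X v), proportional to the covariance matrix applied to v
        proj = [sum(x[j] * v[j] for j in range(dim)) for x in centered]
        w = [sum(p * x[j] for p, x in zip(proj, centered)) for j in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


def representatives(projections, n_bins):
    """Pick, for each occupied bin, the frame closest to the bin mean."""
    lo, hi = min(projections), max(projections)
    width = (hi - lo) / n_bins or 1.0
    bins = {}
    for idx, p in enumerate(projections):
        b = min(int((p - lo) / width), n_bins - 1)
        bins.setdefault(b, []).append(idx)
    reps = []
    for members in bins.values():
        mean = sum(projections[i] for i in members) / len(members)
        reps.append(min(members, key=lambda i: abs(projections[i] - mean)))
    return sorted(reps)


# Synthetic trajectory: frames 0-49 near x = 0, frames 50-99 near x = 5.
random.seed(0)
frames = [[random.gauss(0.0, 0.1) for _ in range(3)] for _ in range(50)]
frames += [[5.0 + random.gauss(0.0, 0.1)] + [random.gauss(0.0, 0.1) for _ in range(2)]
           for _ in range(50)]

centered = center(frames)
pc1 = first_principal_component(centered)
projections = [sum(x[j] * pc1[j] for j in range(3)) for x in centered]
reps = representatives(projections, n_bins=2)
```

In this toy case the first component aligns with the coordinate that separates the two states, and one representative frame is returned per state. With real trajectories, the same idea is applied to many atomic coordinates and several components, and the number of sampled structures is chosen to cover the populated regions.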
4.6 Summary

In the space of only a few months at the beginning of 2020, an HPC infrastructure involving the two most powerful supercomputers in Europe at the time was deployed and used to couple extremely long MD simulations together with an optimized HTVS based on the LiGen software. The application of the infrastructure in a drug repurposing experiment resulted in the identification of potential novel compounds, one of which, the osteoporosis and breast cancer drug raloxifene, later entered clinical trials for treating COVID-19. In addition, in a later stage of the project, a massive virtual screening involving 70 billion ligands was performed, which, to our knowledge, is the largest screening ever carried out. The MD simulations on the other hand have provided a wealth of data, which was made freely available to the scientific community. It should be emphasized that these impressive achievements were only made possible thanks to the collaboration of the project partners who provided computing and data resources, software, personnel, and expertise at very short notice. This project was therefore a triumph not only of the latest computer technology but also of the cooperation and understanding of many people (users, managers, scientists, and programmers) united by a global emergency.
References

1. ENI. HPC5 for EXSCALATE4CoV: supercomputer versus coronavirus, https://www.eni.com/en-IT/operations/hpc5-for-exscalate4cov.html. Accessed 9 Dec 2022
2. D. Gadioli, E. Vitali, F. Ficarelli, C. Latini, C. Manelfi, C. Talarico et al., EXSCALATE: an extreme-scale virtual screening platform for drug discovery targeting polypharmacology to fight SARS-CoV-2. IEEE Trans. Emerg. Top. Comput. (2022). https://doi.org/10.1109/TETC.2022.3187134
3. A.R. Beccari, C. Cavazzoni, C. Beato, G.L. Costantino, A high performance workflow for chemistry driven de novo design. J. Chem. Inf. Model. 53, 1518–152 (2013)
4. E. Nicastri, F. Marinangeli, E. Pivetta, E. Torri, F. Reggiani, G. Fiorentino et al., A phase 2 randomized, double-blinded, placebo-controlled, multicenter trial evaluating the efficacy and safety of raloxifene for patients with mild to moderate COVID-19. EClinicalMedicine 48, 101450 (2022)
5. D. Gadioli, E. Vitali, F. Ficarelli, C. Latini, C. Manelfi, C. Talarico et al., EXSCALATE: An extreme-scale in-silico virtual screening platform to evaluate 1 trillion compounds in 60 hours on 81 PFLOPS supercomputers. arXiv:2110.11644 (2021)
6. E. Lindahl, B. Hess, D. Van Der Spoel, GROMACS 3.0: a package for molecular simulation and trajectory analysis. J. Mol. Model. 7(8), 306–317 (2001)
7. A. Grottesi, N. Besker, A. Emerson, C. Manelfi, A.R. Beccari, F. Frigerio et al., Computational studies of SARS-CoV-2 3CLpro: insights from MD simulations. Int. J. Mol. Sci. 21(15) (2020)
Chapter 5
The Impact of the Scientific Metaverse on the Biotech Industry: How Virtual Reality Helped Researchers Fight Back Against COVID-19

Carmine Talarico and Edgardo Leija

Abstract The coronavirus disease 2019 pandemic not only precipitated a digital revolution but also led to one of the largest scientific collaborative open-source initiatives. The EXaSCale smArt pLatform Against paThogEns for CoronaVirus (EXSCALATE4CoV) consortium, led by Dompé farmaceutici S.p.A., brought together 18 global organizations to counter international pandemics more rapidly and efficiently. The consortium also partnered with Nanome, an extended reality software company whose software facilitates the visualization, modification, and simulation of molecules via augmented reality, mixed reality, and virtual reality applications. To characterize the molecular structure of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and to identify promising drug targets, the EXSCALATE4CoV team utilized methods such as homology modeling, molecular dynamics simulations, high-throughput virtual screening, docking, and other computational procedures. Nanome provided analysis of those computational procedures and supplied virtual reality headsets to help scientists better understand and interact with the molecular dynamics and key chemical interactions of SARS-CoV-2. Nanome’s collaborative ideation platform enables scientific breakthroughs across research institutions in the fight against the coronavirus pandemic and other diseases.

C. Talarico: Dompé farmaceutici S.p.A., L’Aquila, Italy
E. Leija: Nanome Inc., San Diego, California, United States

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 S. Coletti and G. Bernardi (eds.), Exscalate4CoV, SpringerBriefs in Applied Sciences and Technology, https://doi.org/10.1007/978-3-031-30691-4_5
The coronavirus disease 2019 (COVID-19) pandemic was a catalyst for people all over the world to adapt to an inevitable digital revolution. Owing to its global scale, it became one of the largest open-source initiatives in which many scientific researchers collaborated to combat the threat to public health. In the field of drug discovery, there is something called Computer-Aided Drug Design (CADD) in which researchers model and simulate predictions of how the design of a molecule may behave under specific conditions. For a long time, these complex 3D data were being analyzed through a 2D medium; however, thanks to advancements in virtual reality hardware and the software development by the Nanome team, anyone can now visualize and interact with molecular data in an immersive and collaborative environment. Nanome is an extended reality (XR) software company that was founded in 2015 to change the way we understand and interact with science at the molecular level in a real-time collaboration platform. The XR term refers to a superset category that includes a combination of environments between the real world and a virtual world through wearable human–computer interactions generated via augmented reality, mixed reality, and virtual reality (VR) applications. Nanome’s software enables people to visualize, modify, and simulate molecules such as proteins, chemical compounds, and nucleic acids to accelerate and enhance scientific ideation, communication, and fail-fast decision-making. The platform facilitates effective communication of data and integrates with existing computational chemistry workflows— benefits that have led to the adoption of the San Diego-based company’s enterprise solution by several pharmaceutical and biotech companies worldwide. There were several organizations across Europe, Australia, America, and Asia that used Nanome for their COVID-19 research. 
In Europe, the EXaSCale smArt pLatform Against paThogEns for CoronaVirus (EXSCALATE4CoV; E4C) consortium,
led by Dompé farmaceutici S.p.A., received funding from the European Commission’s Horizon 2020 (H2020) programme to bring together 18 organizations across seven countries to leverage the continent’s best life-science research labs with an aim to counter international pandemics faster and more efficiently. Because the Nanome platform was a great way to connect all of these groups together, Nanome provided several VR headsets across these organizations. To see the full list of the organizations in that consortium and stay up to date on their research efforts, visit www.exscalate4cov.eu. In the first couple of months of the pandemic, researchers didn’t know the exact molecular structure of the most important drug targets of the newly emerged coronavirus behind COVID-19. In the scientific world, this virus was known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2 for short). However, owing to previous work on SARS-CoV-1 and through advanced CADD methods such as homology modeling, molecular dynamics (MD) simulations, high-throughput virtual screening, docking, and other computational procedures, the EXSCALATE4CoV team was able to start developing an early hypothesis for a plan of attack. The collaborative analysis of the output from those computational procedures is where Nanome comes into the picture. In Nanome, scientists analyzed a drug target of SARS-CoV-2 known as the main protease (Mpro) because it provided a basis for designing potential small molecules. The scientists were able to more clearly understand what happened to the chemical interactions of the catalytic dyad, which indicated the potential prevention of enzymatic activity between the two dimers of the Mpro. It was an eye-opening experience for many researchers to hold that molecule in their hands and witness how different regions of the structure moved over time when playing back the MD trajectories (Fig. 5.1).
Analyzing the MD in VR helped the scientists better understand how those movements disrupted multiple chemical interactions, including key hydrogen bonds. The Mpro was selected for deeper analysis because it provided a basis for designing potential small molecules. These small molecules, also known as ligands, could interact in the same active site where the catalytic domain of the Mpro is located. In Nanome, rendering that pocket with a hydrophobic surface representation made it simple to understand which regions of the Mpro could interact with the hydrophobic core of the small molecules. The binding pocket is essentially a cavelike region on a structure that has enough space and the appropriate properties to fit potential ligands. Understanding the shape and intricacies of a pocket and the type of chemical interactions that could be formed is critical for a good structure-based drug design process. Imagine being inside the Mpro binding pocket (Fig. 5.2), you would be able to look in any direction and start to generate a deeper understanding of the structural nuances of the Mpro. It would immediately become obvious in which direction you could grow a molecule to potentially generate additional hydrogen bonds. Now, grab the protein with both of your virtual hands and scale it down. Notice how there is an avatar in the virtual room? This is your medicinal chemistry colleague joining you remotely from another lab to discuss potential modifications to the small molecule.
Fig. 5.1 Carmine Talarico from Dompé farmaceutici S.p.A. analyzes the SARS-CoV-2 Main Protease with Edgardo Leija and Daniel Gruffat from Nanome. Source https://youtu.be/oiKdWtqbOZA
Based on their virtual hands pointing to a specific area of the small molecule in the pocket and their audio description coming through the speakers of the VR headset, you can easily understand their recommendation to add an additional carbon ring and know exactly where they suggest adding it. Your colleague then raises an arm and points to an area next to you where you find a floating PDF containing an SAR table relevant to your project. You take out your molecule-building tool and based on your colleagues’ suggestions and the additional guidance from the floating literature, you edit the small molecule accordingly and are surprised to see it automatically start to optimize its geometry through a quick minimization. As the minimization completes, a table containing your molecule’s chemical property predictions appears and shows you its values for total polar surface area, lipophilicity, LogD, and others. Your computational chemist colleague then decides to port into your virtual room and brings additional molecules that were generated via machine learning methods. Being able to easily discern the false positives and false negatives, the enhanced medicinal chemist in your room ensures that your team focuses only on the best options. You then ask the application to hide the molecular surface and, because there are voice commands, the surface clears the scene effortlessly. Finally, you decide to save everything that just happened in your room by stopping the spatial recording that was capturing all the molecular changes, avatar gestures, and audio narrations during that meeting. You share this spatial recording with a colleague who couldn’t make the meeting and they start to play back the VR session. They would see your avatar go through every decision that was made earlier, and, when an insightful
moment prompts them to pause the recording, they could change to “interact mode” and continue the design iterations from that specific point. This is not fiction or some distant future. This is happening now across many pharmaceutical and biotech companies, government labs, and academic research institutions that leverage Nanome as a collaborative ideation platform for their drug-discovery work. Nanome has enabled a profound impact and scientific breakthroughs in the fight against the coronavirus pandemic and other diseases. To experience this for yourself, visit www.nanome.ai.

Fig. 5.2 Carmine Talarico from Dompé farmaceutici S.p.A. analyzes spike protein mutations and how the spike protein interacts with the ACE2 receptor with Steve McCloskey and Daniel Gruffat from Nanome. Source https://youtu.be/qc_7GPJSoFQ
Chapter 6
From Genomes to Variant Interpretations Through Protein Structures

Janani Durairaj, Leila Tamara Alexander, Gabriel Studer, Gerardo Tauriello, Ingrid Guarnetti Prandi, Rosalba Lepore, Giovanni Chillemi, and Torsten Schwede

Abstract The large amount of genetic, phenotypic, and structural data from diverse conditions and environments offers opportunities for new groundbreaking research. Today, the major scientific task is to interpret the vast number of genetic variants within these data. As described in this chapter, identifying relevant variants and connecting them with the associated protein structural and environmental information is a powerful approach to biological discoveries. The unified view of the data brings us a step closer to understanding genetic variation, which is also fundamental for achieving the goals of personalized medicine and the planet’s environment.

Authors Janani Durairaj and Leila Tamara Alexander contributed equally to this work.

J. Durairaj · L. T. Alexander · G. Studer · G. Tauriello · T. Schwede: Biozentrum, University of Basel, Basel, Switzerland; Computational Structural Biology, SIB Swiss Institute of Bioinformatics, Basel, Switzerland
I. G. Prandi · G. Chillemi: Dipartimento per la Innovazione nei Sistemi Biologici, Agroalimentari e Forestali, Università Della Tuscia, 01100 Viterbo, Italy
R. Lepore: Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 S. Coletti and G. Bernardi (eds.), Exscalate4CoV, SpringerBriefs in Applied Sciences and Technology, https://doi.org/10.1007/978-3-031-30691-4_6

6.1 Interpreting Genetic Variants in the Context of Protein Structures

Progress in DNA technologies resulted in rapid and cost-effective genome sequencing, which has become an integral part of medical and scientific investigations. We also witness a rapid increase in phenotypic data in both routine clinical as well as research settings, often associated with different variations in the
sequenced genomes. At the same time, the breakthroughs in experimental and algorithmic approaches in the protein structural biology field allowed for streamlined and scalable protein structure determination and analysis. We therefore have an unprecedented amount of data on genetic, protein structure, and phenotypic levels for addressing a broad spectrum of scientific questions. Protein structure can provide insights that would not be obtainable from the sequence alone [1]. Since variations in genome sequence can affect phenotype, translation of linear gene sequence into respective protein structures and consequent variant mapping can be effective tools for gaining knowledge that explains observed phenotypes. This could result in better hypotheses for the role and mechanism of action behind variants that confer resistance, specificity, or novel function. The SARS-CoV-2 virus of the COVID-19 pandemic with major variants of concern Delta and Omicron was an excellent illustration of this concept. The increased virulence of the Delta variant and increased transmissibility of the Omicron variant were explained by only a number of critical mutations that were observed in key structural regions of the spike protein (Fig. 6.1A–C).
6.2 Workflows for Identifying Relevant Variants

Only some of the variants in a protein sequence are linked to phenotypic changes. Looking for patterns across different genomes and homologous proteins allows us to pinpoint relevant residue variants: those that are more likely to be the determinants for resistance, specificity, or evolution of function. These pattern-finding approaches often rely on multiple sequence alignments (MSAs), arrangements of protein sequences that take into account evolutionary events such as mutations, insertions, deletions, and rearrangements of amino acids (Fig. 6.1B). Subsequent analysis of variations and co-variations in MSAs is crucial to infer functionally relevant residues and their respective mutations. Here, we describe three major ways of using MSAs to pinpoint relevant variant positions: specificity-determining positions, mutational hotspots, and co-evolutionary analysis. The first approach is to group proteins from the same family according to some sub-functionality, such as a substrate or product specificity, activity levels, promiscuity, or resistance. Alternatively, unsupervised grouping, based on sequence covariation, phylogeny, or other patterns, can also be performed. In the resulting MSA, residue positions that are conserved within specific subgroups but differ across them are called specificity-determining positions (SDPs); these are the residues which determine the sub-functionality in question. For example, as a part of the EXSCALATE4CoV project, SDPs within different sub-genera of beta-coronavirus spike proteins were examined [2]. The analysis uncovered a set of residues that defined a specific region of the receptor binding domains, and was in direct or partial contact with the host cell angiotensin-converting enzyme 2 (ACE2, human ACE2 in case of SARS-CoV-2). These positions display
Fig. 6.1 A A schematic diagram illustrating the organization of SARS-CoV-2 genome and the domain arrangement of the spike protein within it. B Multiple alignments of SARS-CoV-2 spike protein sequences for the six labeled variants, colored in accordance with amino acid properties; the columns where omicron differs from the reference are marked with circles: green when a mutation is present only in omicron and amino acid properties differ (an example of defining relevant variants), red otherwise. C Structure of the spike protein trimer (PDB ID: 6VYB) with omicron mutation positions shown in red, and the relevant variants as identified in (B) shown in green. D Incorporating the effects of the environment: spike protein trimer (gray) in complex with hACE2 receptor (orange, PDB ID: 6M18), glycans (purple), and water (blue) as an input to MD simulation studies
typical properties of protein functional sites, and are very likely involved in determining host cell receptor specificity for the analyzed spike proteins. Similarly, an SDP analysis of Ebola viruses helped to assess the human infection potential of Reston virus [3], and the more novel Bombali virus [4]. The identified SDPs were thought to be involved in interaction with host cell karyopherins, thereby inhibiting interferon signaling. Two critical SDPs differed in the Reston and Bombali viruses, indicating
that these two viruses did not have pathogenicity in humans, or pathogenicity was reduced. However, mutating SDPs across pathogenic and reduced pathogenic groups has been used to successfully replace function or selectivity from one group to the other [5, 6]. Likewise, in the case of SARS-CoV-2, one phylogenetically distant Sarbecovirus clade of coronaviruses had mostly identical SDPs, indicating that the sequences in this clade may only be a few mutations away from developing human infection capabilities. The second common application is looking at MSAs of closely related strains to find mutational hotspots within the linear genome. Coupled with phenotypic annotations for the strains such as disease causation or drug resistance, certain positions or combinations of positions can be linked to the phenotype in question. A recent approach trained a deep generative model on MSAs of human proteins to predict the likelihoods of each amino acid variant occurrence with respect to the wild type [7]. The distribution of these likelihoods formed two distinct peaks, thereby binning most variants into respective benign or pathogenic categories. The concerted worldwide response to the SARS-CoV-2 pandemic led to massive sequencing efforts, surpassing that of any other pathogen. As of June 2022, over 11 million HCoV-19 genome sequences were made available in GISAID [8], and clinically relevant variants with measured phenotypes were made available in ClinVar [9] and WHO [10], allowing for many large-scale analyses of variations and their effects. In one such study, variants from 48,000 genomes were mapped onto the 3D structures of SARS-CoV-2 proteins in order to give structural and energetic context to conservation and other MSA-derived features [11]. 
In another study, the authors performed deep sequencing of over a thousand individual SARS-CoV-2 samples and analyzed within-host variation, identifying recurrent mutations linked to a number of mutational hotspots under purifying selection [12]. Co-evolutionary analysis is a popular concept that goes hand in hand with MSA construction, named after the biological phenomenon of co-evolution. This theory postulates that evolution preferentially selects for protein pairs with matching mutations, i.e., mutations in interacting residues that leave the interaction preserved. In accordance with this concept, pairs of residues with correlated mutations in MSAs are likely to have co-evolved due to respective protein interactions [13]. This approach allows interacting residues to be identified within and across proteins, without the need to incorporate structural information [14]. Analysis of co-evolving mutations across an MSA that concatenated SARS-CoV-2 spike proteins and ACE2 receptors pinpointed key intermolecular contacts on the interface of the spike protein–ACE2 complex, and confirmed the role of previously determined SDPs in this interaction [2]. Incorporating residue correlations from the MSA into mutant effect prediction also led to improved results, underlining the epistatic nature of mutations and the interlink between function and co-evolution [15]. Co-evolutionary information has been taken to the extreme in recent protein structure prediction methods such as AlphaFold2, which harness deep learning to predict 3D protein structure at experimental levels of accuracy [16].
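Two of the MSA-based ideas in this section, specificity-determining positions and correlated mutations, can be illustrated on a toy alignment. The sketch below is a deliberately simplified assumption-laden example and not the method of the cited studies: the five-residue sequences and the two group labels are invented, SDPs are defined in the strictest possible way (fully conserved within every group, not identical across groups), and co-variation is measured with plain mutual information between columns, without the corrections for phylogeny and sampling that published approaches typically add.

```python
import math
from collections import Counter


def specificity_determining_positions(msa, groups):
    """Columns conserved within every group but not identical across groups."""
    labels = sorted(set(groups))
    sdps = []
    for col in range(len(msa[0])):
        per_group = [
            {seq[col] for seq, g in zip(msa, groups) if g == label}
            for label in labels
        ]
        conserved = all(len(residues) == 1 for residues in per_group)
        distinct = len(set.union(*per_group)) > 1
        if conserved and distinct:
            sdps.append(col)
    return sdps


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    p_i = Counter(seq[i] for seq in msa)
    p_j = Counter(seq[j] for seq in msa)
    p_ij = Counter((seq[i], seq[j]) for seq in msa)
    return sum(
        (c / n) * math.log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )


# Invented toy alignment: two sequences per functional group.
msa = ["ACDKL", "ACDKV", "ACERL", "ACERV"]
groups = ["A", "A", "B", "B"]

sdps = specificity_determining_positions(msa, groups)   # columns 2 and 3
mi_coupled = mutual_information(msa, 2, 3)              # D/K versus E/R
mi_unrelated = mutual_information(msa, 2, 4)            # no co-variation
```

In this toy alignment, columns 2 and 3 are both SDPs and vary together perfectly, so their mutual information is one bit, whereas column 4 varies independently of the grouping and carries no information about column 2. Real alignments contain thousands of sequences, gaps, and uneven sampling, all of which this sketch ignores.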
6.3 From Relevant Variants to Relevant Protein Structures

The Protein Data Bank (PDB) is the single worldwide archive and a primary source of experimentally determined structural data of proteins [17]. Yet experimental structure determination can be time-consuming and laborious, particularly for proteins from novel or poorly characterized pathogens, and the number of proteins available in the PDB is substantially lower than the number of known protein sequences [18]. An attractive alternative to the experimental structural determination is to apply theoretical modeling approaches, which have evolved into standardized easy-to-use online pipelines, such as SWISS-MODEL [19], I-TASSER [20], IntFOLD [21], and Robetta [22]. These approaches rely either on the concepts of homology modeling (also known as comparative or template-based modeling, where a protein structure model is extrapolated from evolutionarily related proteins of known structure), physics-based de novo modeling, or a combination of both. In recent years, deep learning-based modeling approaches have begun to achieve near-experimental accuracy, with the AlphaFold2 prediction results being considered as a major breakthrough in protein modeling [23, 24]. Mere weeks after the start of the COVID-19 pandemic, SWISS-MODEL was used to generate predicted 3D structures for the entire SARS-CoV-2 proteome, which were kept up to date with the new releases of experimental structures and emerging variants (https://swissmodel.expasy.org/repository/species/2697049). Despite the great advances in computational structure prediction, building high-accuracy 3D structures for any query protein is not yet a given. Even in the case of experimentally determined structures, some structural elements might be more accurate than others. Quality assessment tools and algorithms are therefore indispensable in any structure determination or prediction pipeline.
Classical stereochemistry checks used in experimental structure determination, for example the MolProbity macromolecular validation tool [25], also help to indicate inaccuracies in theoretical models. However, to assess model accuracy in the absence of experimental observables, one should turn to specialized tools that attach confidence scores to individual residues and spatial regions, allowing low-quality models to be filtered out. In fact, Critical Assessment of Structure Prediction (CASP) [26] and Continuous Automated Model EvaluatiOn (CAMEO) [27] are community-wide efforts for benchmarking the state of the art in structure prediction that have dedicated categories for model accuracy assessment. The online modeling pipelines mentioned above automatically provide model accuracy scores along with their predicted models. Additionally, SWISS-MODEL and IntFOLD allow uploading models from arbitrary sources to run through their accuracy assessment pipelines: QMEANDisCo [28] and ModFOLD [29], respectively. Recently, the Coronavirus Structural Task Force took on the gargantuan and highly important task of assessing the huge inflow of experimental structures of SARS-CoV-2 variants. The assessment is done both in terms of the fit of the final model to the raw experimental data, and the agreement of the model with prior biological knowledge of the mechanisms involved [30].
J. Durairaj et al.
With the set of quality-annotated structures in our possession, the next step is to map the relevant information at the protein, motif, and residue levels. This includes strain-specific variation and associated phenotypic information, as described in the previous section, and also prior knowledge gained from studying related homologous proteins or similar sequence motifs and structural regions, as well as information calculated from the amino acid sequence and 3D coordinates. Examples of such calculations and predictions include sequence conservation from MSAs of related proteins, surface accessibility and residue depths, amino acid physicochemical and electrostatic properties, and predicted cavities and pockets that may be involved in ligand binding. Residue-level quality scores can also be combined with indicators of structural instability, such as intrinsic disorder [31]. Many proteins have disordered regions, either because they are in an unbound form (binding to a ligand or another protein restores order), or simply because flexible, disordered regions aid or at least do not affect the protein's activity. Altogether, mapping gene variation to relevant, high-quality structures can give context to observed phenotypes, and even help predict the phenotypic effect of new mutations.
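One of the calculations listed above, sequence conservation from an MSA, can be sketched in a few lines. The scoring scheme used here (one minus the normalized Shannon entropy of each alignment column, with gaps ignored) is one common choice and an assumption on our part, not necessarily the measure used by the authors; the toy alignment is invented for illustration.

```python
import math
from collections import Counter
from typing import List


def column_conservation(msa: List[str]) -> List[float]:
    """Return a conservation score in [0, 1] for every alignment column.

    Score = 1 - H / log2(20), where H is the Shannon entropy of the
    residue frequencies in the column. Gap characters ('-') are ignored.
    """
    if not msa or len({len(seq) for seq in msa}) != 1:
        raise ValueError("aligned sequences of equal length are required")
    max_entropy = math.log2(20)  # entropy of 20 equally likely amino acids
    scores = []
    for column in zip(*msa):
        residues = [aa for aa in column if aa != "-"]
        if not residues:
            scores.append(0.0)  # an all-gap column carries no signal
            continue
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in Counter(residues).values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


toy_msa = ["MKV", "MRV", "MKI", "M-V"]
print([round(score, 2) for score in column_conservation(toy_msa)])
```

Columns that score close to 1 are invariant across the aligned sequences; mapping these scores onto the 3D structure highlights conserved patches such as active sites or interaction interfaces.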
6.4 Variants and Structures in the Context of Protein Environment

Proteins are not independent entities within a cell, but tend to interact with their environment and undergo conformational changes that can alter their function in an intricate biological network. Information about protein interactions with the environment, coupled with the annotated protein structures, helps to extend the context of functional and phenotype-driven interpretation. Many proteins, particularly enzymes, carry out their activity by interacting with small molecules, including drugs and metabolites. Knowledge of active sites, binding cavities, and residues that are involved in protein–ligand interactions is fundamental for drug discovery and design. One way to identify which ligands can bind to a protein is to look at the protein’s homologues for which binding assays and co-crystallization studies have been performed. Mapping this information onto the annotated structure can be an insightful approach. Another common practice in homology modeling, which was applied in the EXSCALATE4CoV project, is to transfer ligand coordinate information from similar proteins for which ligand-bound structures are available. Similarly, AlphaFill is an initiative that “transplants” ligands into AlphaFold-predicted structures based on sequence and structure similarity [32]. Such approaches can be especially useful in drug repurposing campaigns, where approved drugs with a track record of safe use in humans can be investigated outside the scope of their original medical indication [33].
6 From Genomes to Variant Interpretations Through Protein Structures
If the binding mechanism, pose, and residues involved differ significantly even across similar proteins, computational methods such as molecular docking can be used to predict how well a ligand binds to a specific protein. Molecular docking takes the query protein structure and the small molecule as inputs, along with some indication of the appropriate binding site cavity, often obtained from the mapped structural annotations. It then returns biophysically plausible, optimal poses of the small molecule within the designated binding pocket, yielding another set of binding residues to add to the annotations. To scale up molecular docking for the evaluation of millions of potential drug-like candidates, virtual screening approaches have been developed. They allow for the quick filtering of a large number of molecules to obtain a smaller set of binding candidates, to which the more accurate docking algorithms can then be applied. The LIGATE project, which builds upon the EXSCALATE4CoV project results, aims to port virtual screening to exascale computing platforms and enable the screening of billions of molecules in a matter of days (https://www.ligateproject.eu/).

When it comes to understanding intermolecular complexity, the analysis of protein interactions with other proteins is important. Identifying and predicting interacting proteins is possible computationally, through co-evolutionary analysis as discussed in Sect. 6.2, as well as experimentally, through site-directed mutagenesis, complex formation, and affinity assays. This information forms a protein–protein interaction (PPI) network that links protein entries and their annotations together. PPIs are crucial to understand when tracking disease variants: the immune escape strategies of SARS-CoV-2 were intrinsically linked to mutations that either inhibited binding to antibodies or increased receptor binding affinity [34].
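The two-stage idea behind virtual screening described earlier in this section (a fast filter over the whole library, followed by more accurate rescoring of the survivors) can be sketched as follows. This is a schematic illustration only: the function name `screen`, both scoring callables, and the toy "library" of strings are hypothetical placeholders and do not represent the EXSCALATE4CoV or LIGATE software.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def screen(molecules: Iterable[str],
           fast_score: Callable[[str], float],
           accurate_score: Callable[[str], float],
           keep: int) -> List[Tuple[float, str]]:
    """Two-stage screening funnel; lower scores mean better predicted binding.

    Stage 1 ranks every molecule with the cheap score and keeps the best
    `keep` candidates. Stage 2 rescores only those with the expensive method.
    """
    shortlist = heapq.nsmallest(keep, molecules, key=fast_score)
    rescored = [(accurate_score(molecule), molecule) for molecule in shortlist]
    rescored.sort(key=lambda pair: pair[0])
    return rescored


# Toy stand-ins for a compound library and two scoring functions.
library = ["a", "bb", "ccc", "dddd"]
ranked = screen(library,
                fast_score=len,
                accurate_score=lambda molecule: -float(len(molecule)),
                keep=2)
print(ranked)
```

The design point is that the expensive scoring function is called only `keep` times, regardless of the library size, which is what makes screening very large libraries tractable.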
The STRING database consists of PPIs collected from a variety of different sources for proteins across the tree of life [35]. Once an interaction is predicted or established, the next step is to pinpoint the residues in the interaction interface. Mutations or drugs that target the interface could disrupt or modify protein activity. Protein–protein complexes can be determined experimentally through approaches such as co-crystallization and cryo-electron microscopy, or computationally through protein docking and multimeric modeling.

Until now, we have described protein structures in bound and unbound forms as static objects. Proteins, however, may create and destroy contacts within themselves and with the environment as they move, a phenomenon often captured neither in experimentally determined nor in modeled structures. Molecular dynamics simulation is the technique that accounts for these effects (Fig. 1D). In this technique, the atoms are described as charged particles that interact with each other, governed by a classical potential. Electronic effects are included only indirectly, through the parameters of the potential, called force-field parameters, and the resulting equations of motion are solved over time. Therefore, the accuracy of the result depends not only on the precise prediction of the initial model but also on how well these parameters are described and applied. As part of the EXSCALATE4CoV project, an online resource systematically organizing atomistic simulations of the SARS-CoV-2 proteome was developed and released, allowing for interactive analysis of viral proteins and the effect of the emerging variants [36].
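To make the idea of charged particles moving under a classical potential concrete, the following toy sketch integrates two particles in one dimension with the velocity Verlet scheme, using a Lennard-Jones term plus a Coulomb term as the potential. All parameter values are arbitrary reduced units chosen for illustration; production simulations use full biomolecular force fields and dedicated software, and this sketch is not the setup used in the project.

```python
from typing import List, Tuple


def pair_energy(r: float, epsilon: float, sigma: float,
                q1: float, q2: float, k: float) -> float:
    """Lennard-Jones plus Coulomb potential energy at separation r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6) + k * q1 * q2 / r


def pair_force(r: float, epsilon: float, sigma: float,
               q1: float, q2: float, k: float) -> float:
    """Force along r, i.e. -dU/dr (positive values push the particles apart)."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r + k * q1 * q2 / (r * r)


def simulate(positions: List[float], velocities: List[float],
             masses: List[float], n_steps: int, dt: float,
             **params: float) -> Tuple[List[float], List[float], List[float]]:
    """Velocity Verlet integration of two particles on a line (x[0] < x[1])."""
    x, v = list(positions), list(velocities)
    f = pair_force(x[1] - x[0], **params)
    forces = [-f, f]
    energies = []
    for _ in range(n_steps):
        for i in range(2):
            v[i] += 0.5 * dt * forces[i] / masses[i]  # half-step velocity
            x[i] += dt * v[i]                          # full-step position
        f = pair_force(x[1] - x[0], **params)          # forces at new positions
        forces = [-f, f]
        for i in range(2):
            v[i] += 0.5 * dt * forces[i] / masses[i]  # second half-step
        kinetic = sum(0.5 * masses[i] * v[i] ** 2 for i in range(2))
        energies.append(kinetic + pair_energy(x[1] - x[0], **params))
    return x, v, energies


force_field = dict(epsilon=1.0, sigma=1.0, q1=0.1, q2=-0.1, k=1.0)
x, v, energies = simulate([0.0, 1.2], [0.0, 0.0], [1.0, 1.0],
                          n_steps=5000, dt=0.001, **force_field)
print(max(energies) - min(energies))  # total energy should stay nearly constant
```

The `force_field` dictionary plays the role of the force-field parameters: changing `epsilon`, `sigma`, or the charges changes the trajectory, which illustrates why the quality of these parameters matters as much as the quality of the starting structure.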
Protein activity and function can also be regulated via post-translational modifications (PTMs). These modifications can also cause significant changes to the protein structure and hence directly mediate protein function [37]. For example, the correct modeling of the spike protein’s glycans on the surface of the SARS-CoV-2 virus was essential to elucidate the mechanism behind host interaction and infection [38, 39]. Thus, we come closer to a more detailed understanding of proteins and their function by combining annotations and information from different levels and perspectives: the gene and amino acid sequence, changes in these sequences across different species and conditions, the 3D fold adopted, and interactions of this fold with the environment and regulatory elements.
6.5 What is in the Future?

Over the years, both the amount of experimental data and the computational power available to us have been increasing at a rapid rate. The ultimate task now is to make use of computational resources and consolidate the data in a meaningful way. Each individual protein has its own unique signature of genetic variants, influencing how it interacts with its environment and how the environment interacts with it. In an ideal future, interventions for medical, research, or other purposes would be tailor-made on a case-by-case basis, taking into account this unique variation, as well as its complex and interconnected effects at the sequence, structure, and molecular environment levels. Linking different perspectives into a unified view, as described in this chapter, brings us a step closer to the interpretation of the vast and ever-growing sequence and phenotypic data. At the same time, despite the vast amount of data available at different levels, there are still gaps in our knowledge that preclude a full understanding of a variety of biological mechanisms. These gaps could be filled in the future not only by developing novel data collection techniques and instruments but also by making use of computational resources and algorithms for more advanced simulations, visualizations, and predictions, as well as by consolidating data from different sources and modalities. With these advancements, we become better at addressing the goals of personalized medicine and the planet’s environment, both of which rely heavily on our understanding of the biological activity within and around us.

Acknowledgements This work was supported by the EuroHPC-JU grant No. 956137 (LIGATE) and SIB, Swiss Institute of Bioinformatics.
References
1. S. Borocci, C. Cerchia, A. Grottesi et al., Altered local interactions and long-range communications in UK variant (B.1.1.7) spike glycoprotein. Int. J. Mol. Sci. 22(11) (2021)
2. C. Pontes, V. Ruiz-Serra, R. Lepore, A. Valencia, Unraveling the molecular basis of host cell receptor usage in SARS-CoV-2 and other human pathogenic β-CoVs. Comput. Struct. Biotechnol. J. 19, 759–766 (2021)
3. M. Pappalardo, I.G. Reddin, D. Cantoni, J.S. Rossman, M. Michaelis, M.N. Wass, Changes associated with Ebola virus adaptation to novel species. Bioinformatics 33(13), 1911–1915 (2017)
4. H.J. Martell, S.G. Masterson, J.E. McGreig, M. Michaelis, M.N. Wass, Is the Bombali virus pathogenic in humans? Bioinformatics 35(19), 3553–3558 (2019)
5. V. Lagrée, A. Froger, S. Deschamps et al., Switch from an aquaporin to a glycerol channel by two amino acids substitution. J. Biol. Chem. 274(11), 6817–6819 (1999)
6. W.D. Heo, T. Meyer, Switch-of-function mutants based on morphology classification of Ras superfamily small GTPases. Cell 113(3), 315–328 (2003)
7. J. Frazer, P. Notin, M. Dias et al., Disease variant prediction with deep generative models of evolutionary data. Nature 599(7883), 91–95 (2021)
8. S. Elbe, G. Buckland-Merrett, Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Chall. 1(1), 33–46 (2017)
9. M.J. Landrum, S. Chitipiralla, G.R. Brown et al., ClinVar: improvements to accessing data. Nucleic Acids Res. 48(D1), D835–D844 (2020)
10. D. Parums, Editorial: Revised World Health Organization (WHO) terminology for variants of concern and variants of interest of SARS-CoV-2. Med. Sci. Monit. 27, e933622 (2021)
11. J.H. Lubin, C. Zardecki, E.M. Dolan et al., Evolution of the SARS-CoV-2 proteome in three dimensions (3D) during the first six months of the COVID-19 pandemic. bioRxiv (2020)
12. G. Tonkin-Hill, I. Martincorena, R. Amato et al., Patterns of within-host genetic diversity in SARS-CoV-2. Elife 10 (2021)
13. U. Göbel, C. Sander, R. Schneider, A. Valencia, Correlated mutations and residue contacts in proteins. Proteins Struct. Funct. Genet. 18(4), 309–317 (1994)
14. F. Morcos, A. Pagnani, B. Lunt et al., Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl. Acad. Sci. USA 108(49), E1293–E1301 (2011)
15. T.A. Hopf, J.B. Ingraham, F.J. Poelwijk et al., Mutation effects predicted from sequence covariation. Nat. Biotechnol. 35(2), 128–135 (2017)
16. J. Jumper, D. Hassabis, Protein structure predictions to atomic accuracy with AlphaFold. Nat. Methods 19(1), 11–12 (2022)
17. H.M. Berman, J. Westbrook, Z. Feng et al., The protein data bank. Nucleic Acids Res. 28(1), 235–242 (2000)
18. T. Schwede, Protein modeling: what happened to the “protein structure gap”? Structure 21(9), 1531–1540 (2013)
19. A. Waterhouse, M. Bertoni, S. Bienert et al., SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 46(W1), W296–W303 (2018)
20. W. Zheng, C. Zhang, Y. Li, R. Pearce, E.W. Bell, Y. Zhang, Folding non-homologous proteins by coupling deep-learning contact maps with I-TASSER assembly simulations. Cell Rep. Methods 1(3) (2021)
21. L.J. McGuffin, R. Adiyaman, A.H.A. Maghrabi et al., IntFOLD: an integrated web resource for high performance protein structure and function prediction. Nucleic Acids Res. 47(W1), W408–W413 (2019)
22. M. Baek, F. DiMaio, I. Anishchenko et al., Accurate prediction of protein structures and interactions using a three-track neural network. Science 373(6557), 871–876 (2021)
23. J. Jumper, R. Evans, A. Pritzel et al., Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589 (2021)
24. E. Callaway, ‘It will change everything’: DeepMind’s AI makes gigantic leap in solving protein structures. Nature 588(7837), 203–204 (2020)
25. C.J. Williams, J.J. Headd, N.W. Moriarty et al., MolProbity: More and better reference data for improved all-atom structure validation. Protein Sci. 27(1), 293–315 (2018)
26. A. Kryshtafovych, T. Schwede, M. Topf, K. Fidelis, J. Moult, Critical assessment of methods of protein structure prediction (CASP)-Round XIV. Proteins 89(12), 1607–1617 (2021)
27. X. Robin, J. Haas, R. Gumienny, A. Smolinski, G. Tauriello, T. Schwede, Continuous Automated Model EvaluatiOn (CAMEO)-Perspectives on the future of fully automated evaluation of structure prediction methods. Proteins 89(12), 1977–1986 (2021)
28. G. Studer, C. Rempfer, A.M. Waterhouse, R. Gumienny, J. Haas, T. Schwede, QMEANDisCo-distance constraints applied on model quality estimation. Bioinformatics 36(6), 1765–1771 (2020)
29. L.J. McGuffin, F.M.F. Aldowsari, S.M.A. Alharbi, R. Adiyaman, ModFOLD8: accurate global and local quality estimates for 3D protein models. Nucleic Acids Res. 49(W1), W425–W430 (2021)
30. A. Thorn, Die coronavirus structural task force. BIOspektrum 26(4), 442–443 (2020)
31. F. Quaglia, B. Mészáros, E. Salladini et al., DisProt in 2022: improved quality and accessibility of protein intrinsic disorder annotation. Nucleic Acids Res. 50(D1), D480–D487 (2022)
32. M.L. Hekkelman, I. de Vries, R.P. Joosten, A. Perrakis, AlphaFill: enriching the AlphaFold models with ligands and co-factors. bioRxiv (2021)
33. S. Pushpakom, F. Iorio, P.A. Eyers et al., Drug repurposing: progress, challenges and recommendations. Nat. Rev. Drug Discov. 18(1), 41–58 (2019)
34. W.T. Harvey, A.M. Carabelli, B. Jackson et al., SARS-CoV-2 variants, spike mutations and immune escape. Nat. Rev. Microbiol. 19(7), 409–424 (2021)
35. D. Szklarczyk, A.L. Gable, K.C. Nastou et al., The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 49(D1), D605–D612 (2021)
36. M. Torrens-Fontanals, A. Peralta-García, C. Talarico, R. Guixà-González, T. Giorgino, J. Selent, SCoV2-MD: a database for the dynamics of the SARS-CoV-2 proteome and variant impact predictions. Nucleic Acids Res. 50(D1), D858–D866 (2022)
37. I. Bludau, S. Willems, W.-F. Zeng et al., The structural context of posttranslational modifications at a proteome-wide scale. PLoS Biol. 20(5), e3001636 (2022)
38. O.C. Grant, D. Montgomery, K. Ito, R.J. Woods, Analysis of the SARS-CoV-2 spike protein glycan shield reveals implications for immune recognition. Sci. Rep. 10(1), 14991 (2020)
39. M.S. Tagliamonte, N. Abid, S. Borocci et al., Multiple recombination events and strong purifying selection at the origin of SARS-CoV-2 spike glycoprotein increased correlated dynamic movements. Int. J. Mol. Sci. 22(1) (2020)
Chapter 7
The Role of Structural Biology Task Force: Validation of the Binding Mode of Repurposed Drugs Against SARS-CoV-2 Protein Targets

Focus on SARS-CoV-2 Main Protease (Mpro): A Promising Target for COVID-19 Treatment

Stefano Morasso, Elisa Costanzi, Nicola Demitri, Barbara Giabbai, and Paola Storici

Abstract The main protease (Mpro) of SARS-CoV-2, a cysteine protease that plays a key role in generating the active proteins essential for coronavirus replication, is a validated drug target for treating COVID-19. The structure of Mpro has been elucidated by macromolecular crystallography, but owing to its conformational flexibility, finding effective inhibitory ligands was challenging. Screening libraries of ligands as part of EXaSCale smArt pLatform Against paThogEns (ExScalate4CoV) yielded several potential drug molecules that inhibit SARS-CoV-2 replication in vitro. We solved the crystal structures of Mpro in complex with repurposed drugs like myricetin, a natural flavonoid, and MG-132, a synthetic peptide aldehyde. We found that both inhibitors covalently bind the catalytic cysteine. Notably, myricetin has an unexpected binding mode, showing an inverted orientation with respect to that of the flavonoid baicalein. Moreover, the crystallographic model validates the docking pose suggested by molecular dynamics experiments. The mechanism of MG-132 activity against SARS-CoV-2 Mpro was elucidated by comparison of apo and inhibitor-bound crystals, showing that regardless of the redox state of the environment and the crystalline symmetry, this inhibitor binds covalently to Cys145 with a well-preserved binding pose that extends along the whole substrate binding site. MG-132 also fits well into the catalytic pocket of human cathepsin L, as shown by computational docking, suggesting that it might represent a good start to developing dual-targeting drugs against COVID-19.
S. Morasso · E. Costanzi · B. Giabbai · P. Storici (B) Protein Facility, Structural Biology Lab, Elettra Sincrotrone Trieste S.C.p.A., Trieste, Italy e-mail: [email protected] N. Demitri XRD2 Beamline, Elettra Sincrotrone Trieste S.C.p.A., Trieste, Italy © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 S. Coletti and G. Bernardi (eds.), Exscalate4CoV, SpringerBriefs in Applied Sciences and Technology, https://doi.org/10.1007/978-3-031-30691-4_7
7.1 Mpro as a Drug Target: Structural Properties

The main protease (Mpro) of SARS-CoV-2, also referred to as 3-chymotrypsin-like protease (3CLpro) or nonstructural protein 5 (nsp5), is a cysteine protease that is part of the polyproteins (pp1a and pp1ab) encoded by the viral RNA genome. It catalyzes its own excision from pp1a/pp1ab and that of 15 other mature nonstructural proteins (nsps 1–16) [1]. Mpro activity is essential to the viral replication cycle and RNA transcription. The protein is fully functional as a dimer composed of two 33.8 kDa monomers that share a large dimerization interface and are arranged perpendicularly to one another, forming a characteristic heart-shaped particle. The monomer structure is composed of three domains: domains I (residues 8–101) and II (residues 102–184), arranged in a β-barrel, and domain III (residues 185–303), which folds into a five-α-helix bundle (Fig. 7.1a, b). The active site is located on the surface, at the interface between domains I and II, and contains the noncanonical Cys145-His41 catalytic dyad (Fig. 7.1c). A conserved water molecule located in proximity to His41 is also important for catalysis [2]. Domain III largely contributes to dimer formation and is critical for enzyme activity. Dimer formation is required for activation, whereas the single monomers are mostly inactive [3, 4]. These structural features are shared with the main proteases of other coronaviruses [5]. The highly conserved structure and the low homology with human proteins make Mpro a recognized target for the development of therapeutics against COVID-19, as well as against other coronavirus infections.

Within the EXSCALATE4CoV (E4C) consortium, the structural biology group of Elettra Synchrotron (www.elettra.eu) committed to validating the binding modes of the compounds selected through virtual and repurposing screenings [6, 7]. In this framework, we first determined the crystal structure of the apoprotein in different space groups.
Initial crystals were obtained from the PACT commercial screening,
Fig. 7.1 Structure of SARS-CoV-2 Mpro (PDB ID: 7BB2). a The enzyme is organized as a homodimer (chains A and B, colored cyan and pink, respectively). b Each protomer consists of three domains (I–III): the chymotrypsin-like and picornavirus 3C protease-like domains I and II (in blue and green, respectively) harbor the active site. Domain III (in yellow–red) forms a five-helix bundle and is involved in the dimerization of the protein. c Close-up of the active site and the hydrogen bond network. Atoms are in stick representation colored according to atom type, while hydrogen bonds are shown as dashed lines
reproducing the conditions published by Zhang et al. [3]. Thus, we solved the structures of Mpro in the apo form at 1.65 Å resolution in the space groups C2 (Protein Data Bank [PDB] identifier [ID]: 7ALH) and P2₁ (PDB ID: 7ALI). These space groups are very frequent among the Mpro structures deposited in the PDB. In both, the main protease always organizes as a dimer, with the twofold axis being either crystallographic (C2) or noncrystallographic (P2₁). Subsequently, we obtained apo crystals by seeding techniques in space group P2₁2₁2₁, which contains the whole dimer in the asymmetric unit (PDB ID: 7BB2). Among the three apo structures, the fold of the dimer is conserved, with minor differences in local regions that adopt slightly different conformations. In particular, the two loops ASL1 and ASL2 (residues 44–53 and 184–194, respectively) delimiting the active site differ between the three structures, suggesting that Mpro has high plasticity for adapting to different substrates, as reported in the literature by us and others [2, 6, 8].

Using the seeding technique, we co-crystallized Mpro in complex with various ligands. The main effort was dedicated to clarifying the binding mode of repurposed drugs selected within the E4C consortium. Between February 2020, when the first SARS-CoV-2 Mpro structure was released in the PDB [9], and November 2022, almost 650 structures were solved by X-ray crystallography and made available to the scientific community, in several cases even before publication. This momentous body of work, which included co-crystals with repurposed drugs as well as other small molecules such as natural compounds and chemical fragments, has contributed greatly to the computer-aided drug design campaigns [10] of E4C.
7.2 Known Inhibitors of Mpro Bind into the Active Site

The druggability of SARS-CoV-2 Mpro is confirmed by the thousands of small-molecule inhibitors that have been identified through screening campaigns performed as a global effort to fight COVID-19 [11, 12]. Joint programs of virtual screening and biochemical high-throughput screening have evaluated broad lists of chemicals, including natural compounds, repurposed drugs, and novel entities. The first actions for the fast development of drugs against SARS-CoV-2 focused on screening inhibitors derived from previous research on the main proteases of other coronaviruses such as SARS-CoV and MERS-CoV [5, 13]. The resulting compounds had limited potency in enzymatic assays, despite the high sequence homology among the three viruses. Notably, the sequences of Mpro from SARS-CoV-2 and SARS-CoV differ by only 12 amino acids, only 1 of which lies in the active site. The limited potency therefore suggests that the remaining residues, though distant from the binding site, contribute to enzyme plasticity and ligand binding via allosteric regulation [6]. These findings led to the hypothesis that the complexity and variety of Mpro conformational changes and interactions with ligands would make fast drug identification challenging [14]. In this respect, macromolecular crystallography has largely
contributed to protein–ligand models useful for accelerating the drug-discovery process. Interestingly, although some allosteric binding sites have been identified, including a few that sit at the dimer interface [11, 15], most of the inhibitors crystallized with Mpro bind into the active site [11, 16]. The enzyme’s active-site cavity shows a high degree of malleability, allowing a variety of different chemical moieties to bind and inhibit SARS-CoV-2 Mpro. The compounds identified are either covalent or noncovalent inhibitors. Among the covalent inhibitors, a substantial number are peptidomimetics or peptidomimetic-derived molecules that carry different warheads that react with Cys145. Indeed, the first Mpro inhibitor authorized in many countries for COVID-19 treatment was PF-07321332 (nirmatrelvir) [17], an orally available peptidomimetic developed by Pfizer from a series of molecules active against the Mpro of multiple coronaviruses. Nirmatrelvir was formulated in combination with ritonavir and branded as Paxlovid. Beyond nirmatrelvir, other potent compounds targeting Mpro, derived from chemical scaffolds or of natural origin, are under development [18–22].
7.3 Myricetin Binds Covalently with Cys145 in the Mpro Active Site

Flavonoids account for a large and important group of natural products widely found in plants. They are polyphenolic secondary metabolites, and, owing to a combination of biochemical and antioxidant effects, they are considered beneficial in various diseases such as cancer, Alzheimer’s disease, and atherosclerosis. Interestingly, flavonoids have also been reported to have antiviral activity [23]. Specifically, flavonoids of natural or synthetic origin have been proposed to target SARS-CoV-2 [24, 25]. Within the E4C initiative, a biochemical repurposing screen was performed on 8700 compounds comprising marketed drugs, clinical and preclinical candidates, and small molecules regarded as safe in humans [7]. Among the 256 hits, some flavonoids were confirmed to inhibit Mpro with IC50s ranging from 3.6 to 0.18 μM. In this context, myricetin showed an IC50 of 220 nM in the biochemical assay; interestingly, it was also predicted to be a nanomolar-level binder by a docking-based virtual screening performed on a collection of 30,000 compounds [6]. Given the interest in this molecule, which is abundant in several edible plants and a key ingredient of various foods and beverages, we determined the X-ray co-crystal structure of myricetin in complex with Mpro at 1.77 Å resolution (Fig. 7.2a, b; PDB ID: 7B3E) [7]. Unexpectedly, myricetin formed a covalent bond between the Cys145 sulfur and the 2' position of the flavonoid, a binding mode unprecedented for a flavonoid scaffold. At that time, the only X-ray structure of SARS-CoV-2 Mpro in complex with a flavonoid had been obtained with baicalein (PDB ID: 6M2N), which
was modeled in reverse orientation, with the chromone moiety occupying the S1 subpocket and no possibility of forming a covalent adduct with Cys145 [26]. Consistent with the 7B3E crystallographic structure, the in silico noncovalent docking calculations led myricetin to adopt a similar pose (Fig. 7.2d) [6]. Moreover, the electron-density map of the X-ray structure showed that the Mpro binding pocket is only partially occupied by myricetin and that the voids are filled by solvent molecules (ethylene glycol and water), suggesting an opportunity for future structure-based drug design efforts. The same conclusions were subsequently reported by Su and colleagues [27], who proposed pyrogallol as a convenient warhead for designing new flavone-based covalent inhibitors of Mpro. Indeed, as suggested by Kuzikov et al. [7], the pyrogallol reactivity requires an oxidative step for the sulfur addition to the 2' position of myricetin, as shown in Fig. 7.2c. The unexpected binding of myricetin to Cys145 opens new routes for the development of more potent covalent ligands, which are of great interest as therapeutics and biochemical tools.
Fig. 7.2 Myricetin binds to Cys145 in the Mpro active site (PDB ID: 7B3E). a Overall structure of the Mpro dimer in complex with myricetin, which occupies both active sites: the dimer is shown as a cartoon model (chain A in light green, chain B in light blue) superimposed on the surface (white); myricetin is shown as magenta sticks overlaid with the Fo–Fc map contoured at 1σ. b Interaction of myricetin (yellow) with active-site residues (blue) and water molecules (white). Hydrogen bonds are shown as blue lines, water bridges in light blue, and hydrophobic interactions as dotted gray lines. c Mechanism of myricetin oxidation and Michael addition to Cys145. d Overlay of crystal structure (green), docked (blue, RMSD 3.14 Å), and refined (yellow, RMSD 0.46 Å) binding poses of myricetin. The image in panel c was adapted from Kuzikov et al. [7], and the image in panel d was adapted from Gossen et al. [6]
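The RMSD values quoted in the caption compare ligand poses within the same protein frame. As a rough sketch of how such a number is obtained, assuming both poses list the same atoms in the same order and are already expressed in a common coordinate frame (no superposition and no handling of symmetry-equivalent atoms), the calculation reduces to the following. The coordinates are invented for illustration and are not taken from 7B3E.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def pose_rmsd(pose_a: Sequence[Point], pose_b: Sequence[Point]) -> float:
    """Root-mean-square deviation between two poses with matched atom order."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must contain the same non-zero number of atoms")
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(pose_a, pose_b)
                  for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(pose_a))


# Hypothetical two-atom poses, coordinates in angstroms.
crystal_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(pose_rmsd(crystal_pose, docked_pose))
```

In practice the comparison is usually restricted to ligand heavy atoms, and dedicated tools additionally account for chemically equivalent atom permutations, which this sketch does not.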
7.4 The Peptidomimetic MG-132 Acts as Dual Inhibitor of Mpro and Cathepsin L

The same screening of Mpro ligands against SARS-CoV-2, performed as part of the E4C program and mentioned in the previous section, led to the identification of MG-132 (Fig. 7.3b). This synthetic peptidomimetic aldehyde, originally identified as a proteasome inhibitor and investigated as an antineoplastic drug, blocks SARS-CoV-2 Mpro enzymatic activity with an IC50 of 7.4 μM and shows good antiviral activity, detected as a reduction of viral RNA (EC50 of 0.1 μM) in cells infected with SARS-CoV-2. We solved the crystal structure of Mpro in complex with MG-132 under different crystallization conditions, showing that, in the presence or absence of reducing agents and independently of the crystal packing, this compound attaches covalently to Cys145 through a Michael addition and has a well-defined binding mode that does not alter the overall fold of the dimer [28]. The covalent stereoselective (S) hemithioacetal bond is clearly defined in the electron-density maps of our structures (Fig. 7.3a). MG-132 extends along the S1–S4 subsites of the substrate
Fig. 7.3 Binding mode of MG-132 in the Mpro active site. a Electron-density map (Fo–Fc, contoured at 1σ) of MG-132 covalently bound to Cys145. b Chemical model of MG-132. c Main interactions of MG-132 bound to Cys145 with Mpro active-site residues (ligand and residues are represented as sticks, MG-132 and Cys145 are colored in cyan, and other residues and the cartoon protein model are in light blue). d Best-scoring pose obtained from the covalent docking of MG-132 on cathepsin L. The main interactions of MG-132 are shown: hydrogen bonds (blue lines), hydrophobic interactions (gray dashed lines), and pi-stacking interactions (green dashed lines). Figure adapted from Costanzi et al. [28]
binding pocket, interacting with residues through hydrogen bonds and hydrophobic interactions, in addition to the covalent bond with Cys145 (Fig. 7.3c). An extensive biochemical analysis revealed that this bond is reversible, as expected, but Km and Vmax measured at different incubation times suggest a slow koff, indicative of a long residence time. Considering that MG-132 is known to inhibit other cysteine proteases and that, despite its rather poor inhibition of Mpro, it has sub-micromolar potency in antiviral cell assays [29, 30], we investigated the role of MG-132 in inhibiting human cathepsin L, a host protease that is important for viral entry. This lysosomal cysteine protease has been proposed as a target for COVID-19, as it cleaves the viral S protein to promote entry of the virus into host cells [31, 32]. MG-132 is known to inhibit cathepsin L in the nanomolar range [28, 33]. Induced-fit docking and covalent docking models of MG-132 bound to cathepsin L show that the ligand can form a covalent linkage with Cys26 and engage the active site through numerous hydrogen bonds and a pi-stacking interaction, as shown in Fig. 7.3d [28]. This analysis provides new hints for the development of Mpro/cathepsin L dual inhibitors that may prove beneficial against COVID-19, increasing efficacy and reducing the threat of drug resistance.
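The link between a slow koff and a long residence time can be made explicit with a short calculation. Under the simplifying assumption of one-step, first-order dissociation (a reversible covalent inhibitor such as MG-132 may follow more complex, multi-step kinetics), the residence time is 1/koff and the half-life of the complex is ln 2/koff. The rate constant used below is a made-up placeholder, not a value measured for MG-132.

```python
import math


def residence_time(k_off: float) -> float:
    """Mean lifetime of the complex (1 / k_off), in the reciprocal unit of k_off."""
    if k_off <= 0:
        raise ValueError("k_off must be positive")
    return 1.0 / k_off


def complex_half_life(k_off: float) -> float:
    """Time after which half of the complexes have dissociated (ln 2 / k_off)."""
    return math.log(2.0) * residence_time(k_off)


hypothetical_k_off = 0.01  # per second; placeholder value for illustration only
print(residence_time(hypothetical_k_off), complex_half_life(hypothetical_k_off))
```

Halving koff doubles both quantities, which is why a slowly dissociating inhibitor can remain effective even when its equilibrium potency is modest.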
References
1. P. V’kovski, A. Kratzel, S. Steiner, H. Stalder, V. Thiel, Coronavirus biology and replication: implications for SARS-CoV-2. Nat. Rev. Microbiol. 19(3), 155–170 (2021)
2. D.W. Kneller, G. Phillips, H.M. O’Neill, R. Jedrzejczak, L. Stols, P. Langan et al., Structural plasticity of SARS-CoV-2 3CL Mpro active site cavity revealed by room temperature X-ray crystallography. Nat. Commun. 11(1), 3202 (2020)
3. L. Zhang, D. Lin, X. Sun, U. Curth, C. Drosten, L. Sauerhering et al., Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved α-ketoamide inhibitors. Science 368(6489), 409–412 (2020)
4. B. Goyal, D. Goyal, Targeting the dimerization of the main protease of coronaviruses: a potential broad-spectrum therapeutic strategy. ACS Comb. Sci. 22(6), 297–305 (2020)
5. S.A. Amin, S. Banerjee, S. Gayen, T. Jha, Protease targeted COVID-19 drug discovery: what we have learned from the past SARS-CoV inhibitors? Eur. J. Med. Chem. 215, 113294 (2021)
6. J. Gossen, S. Albani, A. Hanke, B.P. Joseph, C. Bergh, M. Kuzikov et al., A blueprint for high affinity SARS-CoV-2 Mpro inhibitors from activity-based compound library screening guided by analysis of protein dynamics. ACS Pharmacol. Transl. Sci. 4(3), 1079–1095 (2021)
7. M. Kuzikov, E. Costanzi, J. Reinshagen, F. Esposito, L. Vangeel, M. Wolf et al., Identification of inhibitors of SARS-CoV-2 3CL-pro enzymatic activity using a small molecule in vitro repurposing screen. ACS Pharmacol. Transl. Sci. 4(3), 1096–1110 (2021)
8. A. Ebrahim, B.T. Riley, D. Kumaran, B. Andi, M.R. Fuchs, S. McSweeney et al., The temperature-dependent conformational ensemble of SARS-CoV-2 main protease (Mpro). IUCrJ 9(5) (2022)
9. Z. Jin, X. Du, Y. Xu, Y. Deng, M. Liu, Y. Zhao et al., Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors. Nature 582(7811), 289–293 (2020)
10. Y. Liu, J. Gan, R. Wang, X. Yang, Z. Xiao, Y. Cao, DrugDevCovid19: an atlas of anti-COVID-19 compounds derived by computer-aided drug design. Molecules 27(3), 683 (2022)
S. Morasso et al.
11. G. Macip, P. Garcia-Segura, J. Mestres-Truyol, B. Saldivar-Espinoza, G. Pujadas, S. Garcia-Vallvé, A review of the current landscape of SARS-CoV-2 main protease inhibitors: have we hit the bullseye yet? Int. J. Mol. Sci. 23(1), 259 (2021)
12. S. Maghsoudi, B. Taghavi Shahraki, F. Rameh, M. Nazarabi, Y. Fatahi, O. Akhavan et al., A review on computer-aided chemogenomics and drug repositioning for rational COVID-19 drug discovery. Chem. Biol. Drug Des. 100(5), 699–721 (2022)
13. H. Yang, J. Yang, A review of the latest research on Mpro targeting SARS-COV inhibitors. RSC Med. Chem. 12(7), 1026 (2021)
14. M. Bzówka, K. Mitusińska, A. Raczyńska, A. Samol, J.A. Tuszyński, A. Góra, Structural and evolutionary analysis indicate that the SARS-CoV-2 Mpro is a challenging target for small-molecule inhibitor design. Int. J. Mol. Sci. 21(9), 3099 (2020)
15. S. Gunther, P.Y.A. Reinke, Y. Fernandez-Garcia, J. Lieske, T.J. Lane, H.M. Ginn et al., X-ray screening identifies active site and allosteric inhibitors of SARS-CoV-2 main protease. Science 372(6542), 642–646 (2021)
16. D.D. Nguyen, K. Gao, J. Chen, R. Wang, G.-W. Wei, Unveiling the molecular mechanism of SARS-CoV-2 main protease inhibition from 137 crystal structures using algebraic topology and deep learning. Chem. Sci. 11(44), 12036–12046 (2020)
17. D.R. Owen, C.M.N. Allerton, A.S. Anderson, L. Aschenbrenner, M. Avery, S. Berritt et al., An oral SARS-CoV-2 Mpro inhibitor clinical candidate for the treatment of COVID-19. Science 374(6575), 1586–1593 (2021)
18. K. Raman, K. Rajagopal, F. Islam, M. Dhawan, S. Mitra, B. Aparna et al., Role of natural products towards the SARS-CoV-2: a critical review. Ann. Med. Surg. 104062 (2022)
19. K. Nepali, R. Sharma, S. Sharma, A. Thakur, J.-P. Liou, Beyond the vaccines: a glance at the small molecule and peptide-based anti-COVID19 arsenal. J. Biomed. Sci. 29(1), 65 (2022)
20. L. Zhong, Z. Zhao, X. Peng, J. Zou, S. Yang, Recent advances in small-molecular therapeutics for COVID-19. Precis. Clin. Med. 5(4), pbac024 (2022)
21. S. Mousavi, S. Zare, M. Mirzaei, A. Feizi, Novel drug design for treatment of COVID-19: a systematic review of preclinical studies. Can. J. Infect. Dis. Med. Microbiol. 2022, 2044282 (2022)
22. A.-T. Ton, M. Pandey, J.R. Smith, F. Ban, M. Fernandez, A. Cherkasov, Targeting SARS-CoV-2 papain-like protease in the post-vaccine era. Trends Pharmacol. Sci. 43(11), 906–919 (2022)
23. S.L. Badshah, S. Faisal, A. Muhammad, B.G. Poulson, A.H. Emwas, M. Jaremko, Antiviral activities of flavonoids. Biomed. Pharmacother. 140, 111596 (2021)
24. S. Jo, S. Kim, D.H. Shin, M.-S. Kim, Inhibition of SARS-CoV 3CL protease by flavonoids. J. Enzyme Inhib. Med. Chem. 35(1), 145–151 (2020)
25. F. Batool, E.U. Mughal, K. Zia, A. Sadiq, N. Naeem, A. Javid et al., Synthetic flavonoids as potential antiviral agents against SARS-CoV-2 main protease. J. Biomol. Struct. Dyn. 40(8), 3777–3788 (2022)
26. H. Su, S. Yao, W. Zhao, M. Li, J. Liu, W. Shang et al., Discovery of baicalin and baicalein as novel, natural product inhibitors of SARS-CoV-2 3CL protease in vitro. BioRxiv 038687 (2020). https://doi.org/10.1101/2020.04.13.038687
27. H. Su, S. Yao, W. Zhao, Y. Zhang, J. Liu, Q. Shao et al., Identification of pyrogallol as a warhead in design of covalent inhibitors for the SARS-CoV-2 3CL protease. Nat. Commun. 12(1), 3623 (2021)
28. E. Costanzi, M. Kuzikov, F. Esposito, S. Albani, N. Demitri, B. Giabbai et al., Structural and biochemical analysis of the dual inhibition of MG-132 against SARS-CoV-2 main protease (Mpro/3CLpro) and human cathepsin-L. Int. J. Mol. Sci. 22(21), 11779 (2021)
29. M. Dittmar, J.S. Lee, K. Whig, E. Segrist, M. Li, K. Jurado et al., Drug repurposing screens reveal FDA approved drugs active against SARS-Cov-2. Cell Rep. 35(1), 108959 (2021)
30. A. Zaliani, L. Vangeel, J. Reinshagen, D. Iaconis, M. Kuzikov, O. Keminer et al., Cytopathic SARS-CoV-2 screening on VERO-E6 cells in a large-scale repurposing effort. Sci. Data 9(1), 405 (2022)
31. J. Shang, Y. Wan, C. Luo, G. Ye, Q. Geng, A. Auerbach et al., Cell entry mechanisms of SARS-CoV-2. Proc. Natl. Acad. Sci. USA 117(21), 11727–11734 (2020)
32. M.-M. Zhao, W.-L. Yang, F.-Y. Yang, L. Zhang, W.-J. Huang, W. Hou et al., Cathepsin L plays a key role in SARS-CoV-2 infection in humans and humanized mice and is a promising target for new drug development. Signal Transduct. Target Ther. 6(1), 134 (2021)
33. H. Ito, M. Watanabe, Y.-T. Kim, K. Takahashi, Inhibition of rat liver cathepsins B and L by the peptide aldehyde benzyloxycarbonyl-leucyl-leucyl-leucinal and its analogues. J. Enzyme Inhib. Med. Chem. 24(1), 279–286 (2009)
Chapter 8
Drug Discovery and Big Data: From Research to the Community
Luca Barbanotti, Marta Cicchetti, and Gaetano Varriale
Abstract Technology and artificial intelligence, alongside the COVID-19 pandemic vastly increasing technology use in health care, have precipitated an escalation of big data. Although real-world data (RWD) and real-world evidence (RWE) have contributed to determining outcomes outside the scope of randomized clinical trials (RCTs), RWD and RWE are underutilized in demonstrating drug effectiveness. Utilizing RWD may enhance the ability of regulatory agencies to approve drugs, provide drug effectiveness insight to payers, and improve personalized medicine. Additionally, RWD and RWE may assist in overcoming the limitations of RCT data such as treatment adherence and underrepresented patient subgroups and may support and expedite drug repositioning. Even though the limitations of using RWE and RWD include fragmented data context, poor data quality, and information governance, healthcare analytics hubs such as the European Health Data Space are designed to foster synergy among private and public healthcare players and may assist in overcoming these potential limitations. Such healthcare analytics hubs may enhance the utilization of RWE and/or RWD, which could ultimately result in better patient outcomes.
L. Barbanotti · M. Cicchetti: Advanced Analytics & AI Team, SAS Institute, Milano, Italy
G. Varriale (corresponding author): SAS Institute, Milano, Italy
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. S. Coletti and G. Bernardi (eds.), Exscalate4CoV, SpringerBriefs in Applied Sciences and Technology, https://doi.org/10.1007/978-3-031-30691-4_8
8.1 The Evolution of Clinical Data: From Hand-Written Case Reports to Real-World Data
The term “big data” first appeared in the academic literature in the early 2000s, in the field of statistics and econometrics [1]. Diebold was the first to use the term in one of his papers to describe the phenomenon of the explosive growth of data in volume, velocity, and variety [2]. Before him, it seems academics were aware of the emerging phenomenon but not the term. Conversely, a few pre-2000 references, both academic and non-academic, used the term but were not thoroughly aware of the phenomenon [3].
Although big data has no unambiguous definition, the term has become part of the common lexicon over the past decade and touches various fields, from social sciences to health care. Figure 10.1, from Dash et al. [4], is highly representative of the wide spread of big data in health care. The authors liken health care to a big data repository [4]. Indeed, the practice of saving every single record related to a patient’s medical history is very old and can be traced to ancient Egypt, with the oldest case report written on a papyrus dating back to 1600 BC [5]. Since then, many steps forward have been taken: the evolution of technology and the advent of artificial intelligence (AI) have gone hand in hand with the enrichment and digitization of clinical data, from written case report forms (CRFs) to electronic CRFs and electronic health records (EHRs). Clinical data are enriched with laboratory examinations, various types of diagnoses, images, and the patient’s own reported symptoms. A large amount of unstructured data therefore exists that can be analyzed by means of natural language processing and image processing techniques to discover hidden trends and support the physician’s decisions. This revolution has been accelerated by COVID-19, which has made it impossible for many patients to travel to the point of care, thus imposing a new paradigm both in the conduct of clinical trials and in treatment methodologies: decentralized clinical trials, remote monitoring, and telemedicine are just a few examples. New types of data and new data sources have emerged: sensors, wearables, and medical devices connected to improve patients’ outcomes, as well as data reported directly by a patient or generated by clinical practice or even during administrative claims processing [6]. The Internet of Medical Things is a central aspect of the current trend of healthcare digitization: connected items make it possible to produce, collect, and transmit health data even in real time.
Data, therefore, come not only from the clinical setting but, above all, from the real environment in which the patient lives. For this reason, these kinds of data are called real-world data (RWD). The United States (US) Food and Drug Administration (FDA) defines RWD as “the data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources. RWD can come from a number of sources, for example:
• Electronic health records (EHRs)
• Claims and billing activities
• Product and disease registries
• Patient-generated data including in home-use settings
• Data gathered from other sources that can inform on health status, such as mobile devices” [7].
Information can be collected 24/7 from every patient, thus producing a large amount of data coming from mere observations of daily conditions. That is why the terms RWD and big data seem to be used as synonyms in the field of observational studies (i.e., studies in which the investigator does not intervene and no attempt is made to affect the outcome [e.g., no treatment is given]) [8]. This large amount of data needs to be integrated, managed, and properly handled. Consequently, challenges in terms of privacy, security, data ownership, data stewardship, and governance arise [9]. These issues will be discussed in more detail in the following paragraphs, along with the benefits of using RWD, especially for the patient/citizen community.
8.2 RWD and Real-World Evidence
When it comes to RWD and real-world evidence (RWE), what are we actually talking about? And why has RWD been gaining increasing acceptance over the last few years? With regard to the first question, it seems that there is no univocal definition; rather, approaches and perspectives vary depending on stakeholders, their specific needs, and the region in which they operate [10]. Let’s try to clarify by referring to the definitions given by three regulatory agencies: the US FDA, the European Medicines Agency (EMA), and the Pharmaceuticals and Medical Devices Agency (PMDA) of Japan (Table 8.1) [7, 11, 12].
Table 8.1 Definitions of RWD and RWE given by three regulatory agencies
Regulatory agency | RWD definition | RWE definition
FDA [7] | “Data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources” | “Clinical evidence regarding the usage and potential benefits or risks of a medical product derived from analysis of RWD. RWE can be generated by different study designs or analyses, including but not limited to, randomized trials, including large simple trials, pragmatic trials, and observational studies (prospective and/or retrospective)”
EMA [11] | “Routinely collected data relating to patient health status or the delivery of health care from a variety of sources other than traditional clinical trials” | “Information derived from analysis of real-world data”
PMDA [12] | “Health-related data are gathered and accumulated in the clinical practice day by day” | No formal definition so far
EMA European Medicines Agency; FDA Food and Drug Administration; PMDA Pharmaceuticals and Medical Devices Agency; RWD real-world data; RWE real-world evidence
When looking at the three definitions shown in Table 8.1, points of contact and points of contrast can be discerned. The similarities suggest that one can state, with some confidence, that RWD is patient-level data, relating to a patient’s state of health, and that the clinical evidence generated from the analysis of these data constitutes RWE. The main point of contrast, at least at first glance, is the line of thinking that sets the EMA against the FDA: while the EMA speaks of any sources “other than traditional clinical trials,” the FDA states that RWE can also be generated from randomized trials. However, the real difference lies not in whether the reference is made to clinical trials generically understood, but rather in the study design. According to the Glossary of Common Site Terms, study design consists of “the investigative methods and strategies used in the clinical study” [13]. What emerges here, therefore, is the aim of overcoming the limitations of an outdated traditional clinical trial system. Progress in digital technology and analytics makes it possible to collect large amounts of data 24/7, also in real time, and process it to continuously monitor the patient in his/her real-world environment. Hence, RWD can help fill the gap between a traditional clinical trial and real-world clinical practice. In other words, RWD can help us understand the effectiveness of a drug in the real world [14], going beyond the common application to postapproval surveillance and using RWD to support regulatory decision-making [10]. It is important to emphasize that RWD are not intended to replace classic clinical trials, but rather to enrich and supplement them [15]. Indeed, RWD can bring tremendous advantages. Let’s take a deep dive into them.
8.3 The Clinical Trial System is Broken
In 2019, Dr. Janet Woodcock, at that time director of the US FDA Center for Drug Evaluation and Research, publicly declared: “I personally believe the clinical trial system is broken.” Then, she continued: “FDA will work with its stakeholders to understand how RWE can best be used to increase the efficiency of clinical research and answer questions that may not have been answered in the trials that led to the drug approval, for example how a drug works in populations that weren’t studied prior to approval” [10]. The reason behind such a striking statement must be found in what now has become widely acknowledged across the scientific community: randomized controlled trials (RCTs), at least in the form they were originally conceived, might not be our best option anymore. For decades, RCTs have been considered the only effective way to support decision-making within the regulatory approval process. From the experimental design point of view, RCTs represent the best possible setup to demonstrate causality between drugs and intended or unintended effects [16]. This is undoubtedly true for studies conducted under “ideal experimental conditions.” Without questioning the high control level, high content standardization, and low bias provided by design in RCTs, concern is rising around the fact that “ideal experimental conditions” may differ significantly from actual clinical practice. As a matter of fact, it becomes clear that evidence resulting from such an approach excels in internal validity but often fails to properly represent effectiveness in real-world patients because of their heterogeneity in terms of genetic make-up, comorbidities, and other ongoing medications [16]. On the other hand, RWE is, as of today, still substantially under-exploited in demonstrating drug effectiveness [10]. Nevertheless, where it has found application outside the limited perimeter of postapproval safety requirements, it has contributed insights about external validity, relative effectiveness compared with products in state-of-the-art clinical practice, appropriate treatment patterns, biomarker development, cost-effectiveness, and actual availability [17].
8.4 Advantages of Using RWD
Let’s try to better understand what kind of opportunities arise from RWD and RWE by considering the various stakeholders in both the healthcare ecosystem and the entire clinical trial process itself, as these aspects ultimately affect the citizen/patient.
8.4.1 Various Stakeholders
Regulatory agencies: In recent years, regulatory agencies have tried to address the issue of RWD use by better defining and regulating its application contours: from drug development to drug approval, from label extensions and new indications for already marketed drugs to postmarket studies, and approval of medical devices. In this direction, for example, the US Congress approved the 21st Century Cures Act in 2016, mandating that the US FDA define guidelines on how pharmaceutical companies could use RWE to support the approval of a new indication for an already approved drug and postmarketing commitments, with consequent benefits in terms of cost and time of drug development [6].
Payers: Let’s now consider the payers’ point of view, focusing on how RWD and RWE can help avoid investing money in bad outcomes. Indeed, the collection of RWD provides information about outcomes in real settings, as well as results of drug effectiveness in population subgroups that are not adequately represented in clinical trials [6]. This is particularly relevant to rare diseases, genetic therapies, and surrogate endpoints. The small sample sizes on which drugs have been approved so far, and in which endpoints may not occur frequently, may lead to subjective conclusions of efficacy and effectiveness. Such risks also discourage pharmaceutical companies from undertaking RCTs, as drug developers would have to use proxies for the true clinical endpoints. RWD can help create larger cohorts of patients, virtual control arms, or synthetic patient cohorts made up of patients who have similar conditions or are at the same stage of a certain disease. However, these topics will be explored further in the next paragraphs. For now, let’s just keep in mind that these large datasets can be used to compare available endpoints with surrogate endpoints reported from RCTs, thus providing estimates of drug treatment effects in a broader population.
Care delivery system: Finally, considering that RWD is patient-level data, and a source of information that could not otherwise be traced, it has a great impact on care delivery systems and medical practice, as it enables better alignment of drugs with patient characteristics.
8.4.2 How the Entire Clinical Trial Process is Affected
As stated so far, the most revolutionary aspect of using RWD in clinical trials is the possibility to overcome traditional limits and gaps between the controlled environment in which the patient is involved and looked after and the real environment in which the patient lives. This is the first real advantage of exploiting these data, as it enables researchers to understand how a drug will perform in the real world. Indeed, inclusion/exclusion criteria are stricter, and adherence and patient compliance are higher, in a clinical trial setting [14]. These factors can lead to different treatment effects in clinical trials and real-world clinical practice. Moreover, patient subgroups or niches may be excluded from a study, or data regarding their clinical situation may not be sufficient. Collecting RWD improves the ability to profile risks and benefits and thus provides more support in choosing and optimizing the dosing regimen. Also, the possibility to leverage digital biomarkers improves trial designs. For example, data on activity, sleep, vital signs, and so on can help researchers to understand the determinants of placebo and drug response, going beyond demographics and other variables delimited in clinical trials [6]. Finally, integration between RWD and RCT data constitutes an important database on which to carry out various types of analysis (e.g., drug effectiveness prediction considering different rates of adherence ascertained from prescription refill history and cost-effectiveness analysis at an early stage [6]).
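As a rough illustration of how adherence might be ascertained from prescription refill history, the sketch below computes the proportion of days covered (PDC), one commonly used refill-based adherence measure. The chapter does not name a specific metric, so the choice of PDC, the `(fill_date, days_supply)` record layout, the function name `proportion_of_days_covered`, and the refill dates are all assumptions made for this example.

```python
from datetime import date, timedelta


def proportion_of_days_covered(refills, start, end):
    """Fraction of days in [start, end] on which the patient had drug supply.

    refills: list of (fill_date, days_supply) tuples (hypothetical layout).
    Overlapping supplies are counted once, because covered days go into a set.
    """
    covered = set()
    for fill_date, days_supply in refills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if start <= day <= end:
                covered.add(day)
    total_days = (end - start).days + 1
    return len(covered) / total_days


# Hypothetical refill history for one patient over a 90-day window.
refills = [
    (date(2023, 1, 1), 30),
    (date(2023, 2, 5), 30),
    (date(2023, 3, 20), 30),
]
pdc = proportion_of_days_covered(refills, date(2023, 1, 1), date(2023, 3, 31))
print(f"PDC = {pdc:.2f}")
```

A value computed this way could then enter an effectiveness model as a covariate; how best to do so is a modeling decision that this sketch does not address.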
8.5 RWE Supporting Drug Repositioning
Real-world evidence also has been successfully exploited, coupled with machine learning (ML)/AI, to support drug repositioning. Starting in the 2000s, new drugs and biological applications submitted to the FDA decreased significantly owing to the overwhelming time, costs, and investments needed in the development process [18]. Since then, drug repositioning has been identified as one of the most viable options to “feed the pipeline” with low-risk candidates, making the whole process faster, cheaper, and less risky [18]. Today, the widespread availability of omics-based and computational technologies is enabling scientists to generate and analyze many new structured and unstructured data sources and, thus, is pushing drug repositioning even further. As a result, repositioning strategies that leverage both AI and RWD are among the most promising ones. For example, a two-step filtering process can be structured: “Step 1, apply ML/AI approaches to identify top-ranked drugs based on their drug-target binding affinity scores while considering their challenges, and Step 2, investigate only the top-ranked drugs to validate their clinical effectiveness using RWD while considering the realities of dealing with electronic health records (EHR)” [18].
Step 1 is focused on predicting drug-target binding affinity. Even with computational power being much more affordable than it was previously, algorithms doubling their predictive power year after year, and data science skills being at an all-time high in popularity, the complexity involved in this task is still extraordinary. Hundreds of different modeling strategies can be generated just by tuning a few basic components. Target representation, input feature parameters, feature selection, feature engineering, algorithm selection, learning approach, and validation settings are only a few of the dimensions one might incorporate.
Step 2 is focused on further investigating the shortlist identified in Step 1, mainly by extracting value from related EHRs. An EHR is ideally an electronic version of the patient history; thus, it might include care providers’ notes, radiology reports, lab reports, and other relevant insights expressed in a descriptive, but unstructured, form [18]. Analyzing such diverse asset collections is typically complex in terms of quality, sparsity, and domain understanding. Then again, ML/AI approaches in this field are very numerous and very diverse in nature and complexity. Here, as in Step 1, it is important to stress that the quality and relevance of results are not always directly dependent on algorithmic complexity.
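The two-step filter described above can be sketched in a few lines. This is only a toy outline under stated assumptions: the affinity scores are taken as already predicted by some upstream model (higher meaning stronger predicted binding), the EHR rows are reduced to a set of drugs and a binary outcome flag, and Step 2 is a crude, unadjusted comparison of outcome rates that ignores confounding. The function names, field names, and data are hypothetical and are not taken from [18].

```python
def shortlist_by_affinity(predicted_affinity, top_k):
    """Step 1: keep the top_k drugs with the highest predicted affinity score."""
    ranked = sorted(predicted_affinity, key=predicted_affinity.get, reverse=True)
    return ranked[:top_k]


def crude_outcome_difference(ehr_rows, drug):
    """Step 2: bad-outcome rate in unexposed minus exposed patients.

    A positive value means fewer bad outcomes among exposed patients.
    This is a naive, unadjusted signal, not a causal estimate.
    """
    exposed = [r["bad_outcome"] for r in ehr_rows if drug in r["drugs"]]
    unexposed = [r["bad_outcome"] for r in ehr_rows if drug not in r["drugs"]]
    if not exposed or not unexposed:
        return None  # not enough data to compare
    return sum(unexposed) / len(unexposed) - sum(exposed) / len(exposed)


# Hypothetical model output and hypothetical, already structured EHR extract.
predicted_affinity = {"drug_a": 0.91, "drug_b": 0.42, "drug_c": 0.78, "drug_d": 0.15}
ehr_rows = [
    {"drugs": {"drug_a"}, "bad_outcome": 0},
    {"drugs": {"drug_a"}, "bad_outcome": 0},
    {"drugs": {"drug_c"}, "bad_outcome": 1},
    {"drugs": set(), "bad_outcome": 1},
    {"drugs": set(), "bad_outcome": 0},
    {"drugs": {"drug_a", "drug_c"}, "bad_outcome": 1},
]

shortlist = shortlist_by_affinity(predicted_affinity, top_k=2)
signals = {drug: crude_outcome_difference(ehr_rows, drug) for drug in shortlist}
print(shortlist, signals)
```

In a real study, Step 2 would require extracting structured variables from unstructured EHR content and adjusting for differences between exposed and unexposed patients; the sketch only shows how the shortlist from Step 1 narrows the work done in Step 2.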
8.6 Old Challenges, New Opportunities
The fulfillment of RWE potential lies primarily in an adequate data management system. This knowledge has focused renewed attention on many already-known technological challenges that must be addressed to optimize information technology (IT) systems and data infrastructures of hospitals and research centers. Such challenges could be summarized as “fragmented clinical IT systems, the quality of the data therein, [and] information governance and operations” [19].
First, clinical data often reside in silos. They are typically generated across different clinical units and then stored in different systems and infrastructure units [19]. For example, laboratory results generally reside in laboratory information management systems, radiological imaging data in picture archiving and communication systems, prescription data in pharmacy systems, and so on. Highly fragmented data contexts, as well as the absence of a coherent and widespread data model, are a severe barrier to any relevant initiative based on the consumption of data. Moreover, the same issues limit data comparability between centers and/or other entities willing to cooperate by redirecting their joint research efforts into the exploration of newly enabled data cohorts.
Second, data quality is often deficient. Tools and interfaces designed for data collection are not conceived to equally support the use of collected data for research purposes. Lack of data quality can be broken down into four dimensions: completeness, structure, accuracy, and availability [19]. For example, new, primarily scientific data sources, such as biomarkers, genomics data, and novel laboratory tests, often are not collected systematically, with consequent harm to completeness and availability [19]. Additionally, anamnestic data, comorbidities, and toxicities are often recorded as free, descriptive text by clinicians. This practice usually compromises data structure and potentially limits completeness as well as availability [19].
Third, in a highly complex and inefficient data landscape, the indispensable observance of patient data privacy and protection further limits the rise and strengthening of multicenter studies and other relevant partnership networks. Besides forgoing potential benefits from organizational design and economies of scale, hospitals and research centers that are discouraged from creating networks are also prevented from assembling sufficiently large patient cohorts [19]. The lack of such cohorts will prevent them from meeting the minimal experimental setting requirements needed to explore more advanced approaches in both RWE-based studies and ML/AI applications in this area.
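Of the four data-quality dimensions mentioned above, completeness is the simplest to quantify. The following minimal sketch reports, for each field, the share of records in which a value is present. The record layout, field names, and values are hypothetical, and treating `None`, a missing key, and an empty string as "missing" is an assumption of this example rather than a rule taken from [19].

```python
def completeness(records, fields):
    """Share of records with a non-missing value, per field.

    A value counts as missing if the key is absent, None, or an empty string.
    """
    n = len(records)
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / n
        for field in fields
    }


# Hypothetical patient-level records merged from several source systems.
records = [
    {"patient_id": "p1", "lab_result": 5.4, "biomarker": None, "comorbidities": "type 2 diabetes"},
    {"patient_id": "p2", "lab_result": None, "biomarker": None, "comorbidities": ""},
    {"patient_id": "p3", "lab_result": 6.1, "biomarker": 0.8, "comorbidities": "none reported"},
    {"patient_id": "p4", "lab_result": 4.9, "comorbidities": "hypertension"},
]

report = completeness(records, ["lab_result", "biomarker", "comorbidities"])
print(report)
```

Note that the free-text `comorbidities` field scores as fairly complete here even though its content is unstructured, which is why completeness alone does not capture the structure dimension discussed in the text.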
8.7 Healthcare Analytics Hubs
Overcoming the organizational and technological challenges depicted above will open up new horizons. Researchers, patients, healthcare companies, and public administrators may find their lives and activities disrupted by brand-new paradigms. Once the challenges are faced, relevant drivers of change should first be identified and then properly triggered. By broadening the concept of hospital and research center networks, we arrive at healthcare analytics hubs. Healthcare analytics hubs are structures conceived to engage and foster synergies among private and public players from across the healthcare value chain. The core objective of such an organization is maximizing the quality and effectiveness of healthcare outcomes for citizens, as well as profitability for for-profit partners.
Let’s try to depict how the collaboration might take place and what the expected outcomes might be. The COVID-19 pandemic has heightened the importance of, and the need for, fast access to healthcare data for both prevention and research purposes. Therefore, even entities such as the European Commission have begun to move in the direction of data sharing. Specifically, “The European strategy for data proposed the establishment of domain-specific common European data spaces. The European Health Data Space (‘EHDS’) is the first proposal of such domain-specific common European data spaces” [20]. It seems that the ultimate goal is the enhancement of health data and the ability to make full use of it. And this is where the synergies between the various players in the healthcare ecosystem come into play. In our experience, the healthcare ecosystem is extremely interconnected, meaning that the networks that connect the various actors are very dense and cross-linked, with different players working very close to each other (consider the setup of a clinical trial that, in its basic and traditional form, involves at least a pharmaceutical company, a contract research organization, and a hospital) and communicating about data regulation and drug or medical device approvals. Even the emerging field of software as a medical device involves, at least, a tech company and a pharmaceutical one. The software is intended to be used by a person, a patient who is being cared for by a physician … and so on. Consequently, various types of players, with different purposes, collaborate in an ecosystem in which more and more data are being produced but not fully exploited. Indeed, the data flow is interrupted at some point. Hospitals, healthcare infrastructures, and points of care have a great deal of patient-level data. Advances in analytical techniques and technology allow researchers to make full use of this type of data without even sharing sensitive data.
This is the foundation of federated learning (FL), an ML technique that addresses the issues of data privacy, data governance, and data sharing [21, 22]. Models are trained collaboratively, without moving patient data outward from the institutions in which they reside [21]. Raw data are not moved to a single server or data center; instead, only model characteristics (such as parameters) are transferred. In other words, it is the model and not the data that is shared. That is why FL finds its natural application in health care, and it is one of the techniques that can be applied to overcome issues linked to the nature of patients’ data. Also, it is clear why FL can foster collaboration among private and public players in the ecosystem. Artificial intelligence also allows for synthetic data generation. Synthetic data are data artificially generated by algorithms, taking as input real data and applying models in such a way that original patterns are reproduced in the synthetic dataset. SAS has recently investigated the value of synthetic data, assessing data quality, validity, and usability by comparing AI-generated data with original data [23]. Results are astounding, as they demonstrate that synthetic data are as valid as real data and that analysis of synthetic data outperforms analysis of traditionally anonymized data. Other examples in the literature show that synthetic data, even if they do not exactly reproduce the results in numerical terms, lead to the same conclusions as studies conducted with real data [24]. These findings are especially true for research
purposes, where clinical trial datasets often need to be used for secondary analysis, revealing new insights (on drug safety, bias evaluation, replication, or validation of a published study, for example). In this case, synthetic data have the potential to overcome the need for ethics board reviews, to be a solution to the problem of sensitive data access and, thus, can simplify and accelerate research studies [24]. Another important outcome of the analytics hub concerns the patient directly. Indeed, the integration of different data owned by different players can lead to tailored, personalized treatment. According to the FDA, “Precision medicine, sometimes known as ‘personalized medicine’ is an innovative approach to tailoring disease prevention and treatment that takes into account differences in people’s genes, environments, and lifestyles. The goal of precision medicine is to target the right treatments to the right patients at the right time” [25]. Therefore, the first step is to profile each patient. Using RWD can certainly help from the environmental and lifestyle perspective. Omics data, on the other hand, provide an all-around view of the biological system. Despite the complexity of such data, new technologies such as next-generation sequencing have allowed researchers to sequence genomes in a short time, thus greatly reducing the cost of obtaining genome data [4]. Having all this information, and combining it with data from clinical patient history, clinical assessment, medical reports, insurance records, and pharmacy prescription information [4], leads to better patient outcomes by providing better treatments and care (e.g., helping to answer questions like “which is the better dosage of a drug for that specific patient?”). To sum up, the main actors of the healthcare analytics hub are both public and private players who work with and for different purposes.
Regulation of these aspects is critical and central, and collaboration should be governed by partnerships or agreements that protect the parties involved. It must also be considered that profit-making purposes are not extraneous to the logic by which players operate in this context, whether they are technology and service providers or data providers.
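To make the federated learning idea described in this section more concrete, the following minimal Python sketch shows one possible training loop in which only model parameters leave each institution. Everything in it is an illustrative assumption rather than a description of any system cited here: the three "sites" hold synthetic toy data, the model is a one-variable linear regression, and the server combines the local results with sample-weighted averaging (the common "federated averaging" rule, which the chapter itself does not prescribe).

```python
import random

def local_update(weights, data, lr=0.5, epochs=5):
    """Run a few gradient-descent epochs on one site's private data.

    Only the updated parameters and the sample count are returned;
    the (x, y) records themselves never leave this function.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n

def federated_average(updates):
    """Server step: average parameters, weighted by each site's sample count."""
    total = sum(n for _, n in updates)
    w = sum(params[0] * n for params, n in updates) / total
    b = sum(params[1] * n for params, n in updates) / total
    return (w, b)

def train_federated(sites, rounds=60):
    """Alternate local training and server-side averaging for a fixed number of rounds."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in sites]
        global_weights = federated_average(updates)
    return global_weights

def make_site(n, seed):
    """Hypothetical site holding n toy records that follow y = 2x + 1 plus small noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.01)) for x in xs]

# Three hypothetical institutions with different amounts of data.
sites = [make_site(40, 1), make_site(60, 2), make_site(100, 3)]
w, b = train_federated(sites)
print(f"learned slope={w:.3f}, intercept={b:.3f}")
```

In this sketch the server only ever sees pairs of numbers (slope and intercept) and sample counts. A real deployment would add secure communication, and possibly further privacy safeguards, since model parameters can themselves leak information.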
References
1. M. Favaretto, E. De Clercq, C.O. Schneble, B.S. Elger, What is your definition of Big Data? Researchers’ understanding of the phenomenon of the decade. PLoS ONE 15(2), e0228987 (2020)
2. F.X. Diebold, Big data dynamic factor models for macroeconomic measurement and forecasting, in Eighth World Congress of the Econometric Society, Seattle, http://www.ssc.upenn.edu/~fdiebold/papers/paper40/temp-wc.PDF (2000)
3. F.X. Diebold, “Big Data” and its origins (2020), arXiv:2008.05835
4. S. Dash, S.K. Shakyawar, M. Sharma, S. Kaushik, Big data in healthcare: management, analysis and future prospects. J. Big Data. 6(1), 54 (2019)
5. R.F. Gillum, From papyrus to the electronic tablet: a brief history of the clinical medical record with lessons for the digital age. Am. J. Med. 126(10), 853–857 (2013)
6. B. Swift, L. Jain, C. White, V. Chandrasekaran, A. Bhandari, D.A. Hughes et al., Innovation at the intersection of clinical trials and real-world data science to advance patient care. Clin. Transl. Sci. 11(5), 450–460 (2018)
7. U.S. Food & Drug Administration. Real-world evidence, https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence. Accessed 6 Dec 2022
8. M. Okada, Big data and real-world data-based medicine in the management of hypertension. Hypertens. Res. 44(2), 147–153 (2021)
9. J. Andreu-Perez, C.C. Poon, R.D. Merrifield, S.T. Wong, G.Z. Yang, Big data for health. IEEE J. Biomed. Health Inform. 19(4), 1193–1208 (2015)
10. E. Baumfeld Andre, R. Reynolds, P. Caubel, L. Azoulay, N.A. Dreyer, Trial designs using real-world data: the changing landscape of the regulatory approval process. Pharmacoepidemiol. Drug Saf. 29(10), 1201–1212 (2020)
11. G. Candore, Update on real-world evidence and DARWIN EU (2021), https://www.ema.europa.eu/en/documents/presentation/update-real-world-evidence-darwin-eu-gianmario-candore_en.pdf. Accessed 6 Dec 2022
12. Y. Fujiwara, Utilization of real world data: PMDA’s approaches (2021), https://www.pmda.go.jp/english/about-pmda/0004.pdf. Accessed 6 Dec 2022
13. National Institutes of Health, U.S. National Library of Medicine, ClinicalTrials.gov. Glossary of common site terms, https://clinicaltrials.gov/ct2/about-studies/glossary. Accessed 6 Dec 2022
14. Q. Liu, A. Ramamoorthy, S.-M. Huang, Real-world data and clinical pharmacology: a regulatory science perspective. Clin. Pharmacol. Ther. 106(1), 67–71 (2019)
15. V.L. Bartlett, S.S. Dhruva, N.D. Shah, P. Ryan, J.S. Ross, Feasibility of using real-world data to replicate clinical trial evidence. JAMA Netw. Open. 2(10), e1912869 (2019)
16. A. Makady, A. de Boer, H. Hillege, O. Klungel, W. Goettsch, What is real-world data? A review of definitions based on literature and stakeholder interviews. Value Health. 20(7), 858–865 (2017)
17. M. Burcu, N.A. Dreyer, J.M. Franklin, M.D. Blum, C.W. Critchlow, E.M. Perfetto et al., Real-world evidence to support regulatory decision-making for medicines: considerations for external control arms. Pharmacoepidemiol. Drug Saf. 29(10), 1228–1235 (2020)
18. R. Wieder, N. Adam, Drug repositioning for cancer in the era of AI, big omics, and real-world data. Crit. Rev. Oncol. Hematol. 175, 103730 (2022)
19. B.E. Maissenhaelter, A.L. Woolmore, P.M. Schlag, Real-world evidence research based on big data: motivation-challenges-success factors. Onkologe 24(2), 91–98 (2018)
20. European Commission. Proposal for a regulation of the European Parliament and of the council on European Health Data Space (2022), https://eur-lex.europa.eu/resource.html?uri=cellar:dbfd8974-cb79-11ec-b6f4-01aa75ed71a1.0001.02/DOC_1&format=PDF. Accessed 6 Dec 2022
21. N. Rieke, J. Hancox, W. Li, F. Milletari, H.R. Roth, S. Albarqouni et al., The future of digital health with federated learning. NPJ Digit. Med. 3(1), 119 (2020)
22. H. Islam, K. Alaboud, T. Paul, M.K.Z. Rana, A. Mosa, A privacy-preserved transfer learning concept to predict diabetic kidney disease at out-of-network siloed sites using an in-network federated model on real-world data. AMIA Ann. Symp. Proc. 2022, 264–273 (2022)
23. E. Van Unen, SAS. Using AI-generated synthetic data for easy and fast access to high quality data, https://blogs.sas.com/content/hiddeninsights/2022/07/07/ai-generated-syntheticdata-easy-and-fast-access-to-high-quality-data/. Accessed 6 Dec 2022
24. Z. Azizi, C. Zheng, L. Mosquera, L. Pilote, K. El Emam, Can synthetic data be a proxy for real clinical trial data? A validation study. BMJ Open 11(4), e043497 (2021)
25. U.S. Food & Drug Administration. Precision medicine (2018), https://www.fda.gov/medicaldevices/in-vitro-diagnostics/precision-medicine. Accessed 6 Dec 2022
Chapter 9
Exploiting Drug-Discovery Research for Educational Purposes
Giuliana Catara and Cristina Rigutto
Abstract Sustained and innovative communication is needed to engage citizens as science and technology rapidly evolve to meet global challenges. The role of science in society has become a role of science for society, underscoring the importance of effective communication in fostering scientific literacy. Informal science education experiences, facilitated by the more widespread implementation of information technologies, are becoming increasingly relevant to science understanding over time. Additionally, social media provides opportunities for learners to interact with content and to become active creators of information. The life sciences have pioneered innovative educational programs, particularly virtual reality techniques, that represent a successful approach to learning and teaching chemical interactions.
G. Catara (corresponding author), Institute of Biochemistry and Cell Biology (IBBC), National Research Council of Italy, Naples, Italy
C. Rigutto, Department of Sociology and Social Research, University of Trento, Trento, Italy
Chapter DOI: https://doi.org/10.1007/978-3-031-30691-4_9
Pandemics, climate change, carbon-free environments, and sustainability are some of the challenging threats that science must tackle urgently in order to improve our lives. As science and technology move ahead, it has become clear that citizens are part of this development and change, and continuous and upgraded communication of science is needed to engage people with it. The role of science in society has turned into the role of science for society [1]. This shift highlights the importance of scientific communication in supporting and fostering citizens’ scientific literacy, defined as the attitude toward science and technology-related issues, the understanding of concepts and processes required for personal decision-making, participation in civic and cultural affairs, and economic productivity (science-based knowledge) [2, 3]. Over the past three decades, institutional programs, as well as professional and voluntary practices in public communication of science, have increased and diversified in favor of a knowledge-based society [4, 5]. Therefore, the public understanding of
science over the years has increased. According to the latest European survey on “European citizens’ knowledge towards science and technology” released in 2021, European Union (EU) citizens are informed about new medical discoveries (67%), environmental problems (82%), and scientific discoveries and technological development (66%), indicating a higher engagement with these issues with respect to previous surveys [6]. In addition, respondents think that health care and climate change are the fields in which science and technology can drive change. Horizon Europe, the current key European funding programme for research and innovation, aims to reinforce public engagement in accordance with the open-science approach, which consists of sharing knowledge with all relevant actors, including academia, industry, public authorities, end-users, citizens, and society at large [7]. Therefore, science education has assumed a pivotal role in keeping students and the lay public educated and informed. Science education research shows that people learn science across a broad range of physical and social contexts, including school (everyday environment), home, museums or science centers (designed environments), and science programs (science clubs, citizen science activities), to cite the most popular. Among these, out-of-school experiences appear to be relevant in contributing to people’s science understanding over time [8] and are collectively known as informal science learning. Traditionally, the general distinction between formal, informal, and non-formal learning considers three key characteristics: “whether the learning involves objectives, whether it is intentional, and whether it leads to a qualification” [9]. According to the European guidelines for validating non-formal and informal learning [10], formal learning occurs in an organized and structured environment, such as in an educational institution or on the job, and is clearly designated as learning.
In the case of formal education, the goals, places, and methods are externally determined by the educational providers, whereas in informal learning, the aims and acquisition of knowledge or skills are determined by the end-users [11, 12]. Given that an estimated 70% to 90% of human learning falls into this category [13], informal learning is now regarded as a valuable form of education worthy of recognition [14]. Accordingly, formal and informal science education mirror these properties as well. More generally, non-formal education is defined as “organised and sustained educational activities that do not correspond exactly to the definition of formal education and may have differing durations and may or may not confer certification” [15]. It follows that the overall purpose of non-formal education is to raise the skills of an ever-larger segment of the world’s population so as to achieve a lifelong learning outcome. The last 15 years have seen a growing interest in non-formal and informal science learning both in Europe and worldwide, highlighting the need for coordinated measures to integrate non-formal learning more fully into political priorities. Now, it is generally accepted that non-formal science education can be functional in providing appropriate learning opportunities to different learners and in motivating them to learn science in both institutional and non-institutional contexts. A great contribution to the diffusion of non-formal science learning has also derived from the implementation of information technologies, which have influenced
aspects of life as diverse as education, communication, and work. Digital competencies are considered key skills for lifelong learning across a wide age range of the EU population [16]. Thus, information technologies have further enriched the set of learning experiences through the innovation or introduction of novel educational processes, enabling the engagement of a larger number of end-users with science [17–19]. Nevertheless, to increase learners’ understanding rather than merely communicating knowledge, educators need to provide active learning experiences to different audiences and enhance engagement [20, 21]. Social media provides science learners with opportunities to interact with content-specific messages, turning any content into a conversation and becoming co-creators of knowledge. To keep pace with the digital world in which knowledge is rapidly expanding, science information should be designed to encourage dialogue, not only about matters internal to science but also about public understanding of science, the impacts of scientific knowledge on society, and the issue of scientific uncertainty. In this scenario, social media companies such as Meta are funding the development of immersive, engaging content for the metaverse that increases access to learning through technology. These rapidly evolving digital technologies also have significant potential to improve learning experiences that support inclusive education of minority groups. Yet, the lack of high-quality, reliable Internet access in many countries widens preexisting gaps in access to a learning environment, thus hindering inclusion in non-formal science learning. The pandemic has further transformed education [22, 23]. Educators had to face new and unexpected challenges posed by remote learning. They had to learn how to stay connected with their students while apart and how to provide active learning experiences.
Thus, the exploitation of emerging technologies and practices that rely on constructionism, coding, and joyful activities and that currently occur in various spaces such as hackerspaces, makerspaces, TechShops, FabLabs, and museums, to cite a few, offers the potential to increase engagement and creativity in science education at the intersection between formal and informal science education [18, 24, 25]. Among the diverse scientific fields, the life sciences have attracted attention through the development of innovative educational programs. In particular, virtual reality techniques, which exploit 3D molecular models, have improved the visualization of chemical compounds and their chemical interactions. Thus, virtual reality has emerged as a successful alternative method for both learning and teaching this multidisciplinary topic [26].
References
1. F. Gannon, Science for society. EMBO Rep. 7(6), 561 (2006)
2. B.K. Haywood, J.C. Besley, Education, outreach, and inclusive engagement: towards integrated indicators of successful program outcomes in participatory science. Public Understand Sci. 23(1), 92–106 (2014)
3. R.M. Vieira, C. Tenreiro-Vieira, Fostering scientific literacy and critical thinking in elementary science education. Int. J. Sci. Math. Educ. 14(4), 659–680 (2016)
4. M. Bucchi, B. Trench, Science communication and science in society: a conceptual review in ten keywords. Tecnoscienza 7(2), 151–168 (2016)
5. C.M. Reincke, A.L. Bredenoord, M.H.W. van Mil, From deficit to dialogue in science communication: the dialogue communication model requires additional roles from scientists. EMBO Rep. 21(9), e51278 (2020)
6. European Commission. Special Eurobarometer 516: European citizen’s knowledge and attitudes towards science and technology. Brussels, European Union (2021). https://doi.org/10.2775/071577
7. European Commission. Directorate-General for Research and Innovation, Horizon Europe, open science: early knowledge and data sharing, and open collaboration (2021), https://data.europa.eu/doi/10.2777/18252. Accessed 7 Dec 2022
8. C.R. Latchem, Informal learning and non-formal education for development. J. Learn. Dev. 1(1) (2014). https://doi.org/10.56059/jl4d.v1i1.6
9. P. Werquin, Recognition of non-formal and informal learning in OECD countries: an overview of some key issues. REPORT-Zeitschrift für Weiterbildungsforschung 32(3), 11–23 (2009). https://doi.org/10.3278/REP0903W011
10. CEDEFOP, European Center for the Development of Vocational Training. European guidelines for validating non-formal and informal learning (2009), https://www.cedefop.europa.eu/en/publications/4054. Accessed 7 Dec 2022
11. D.A. Cofer, Informal Workplace Learning. Practice Application Brief No. 10 (2000)
12. T.J. Conlon, A review of informal learning literature, theory and implications for practice in developing global professional competence. J. Eur. Ind. Train. 28(2/3/4), 283–295 (2004)
13. T. Jeffs, M.K. Smith, What is informal education. The Encyclopedia of Pedagogy and Informal Education (1997, 2005, 2011). Updated 19 Oct 2019, https://infed.org/mobi/what-is-informaleducation. Accessed 7 Dec 2022
14. Organisation for Economic Co-operation and Development. Recognition of Non-formal and Informal Learning – Home, https://www.oecd.org/education/skills-beyond-school/recognitionofnon-formalandinformallearning-home.htm. Accessed 9 Dec 2016
15. UNESCO. Non-formal education, http://uis.unesco.org/en/glossary-term/non-formal-education. Accessed 7 Dec 2022
16. European Commission. Directorate-General for Education, Youth, Sport and Culture, Education and training monitor 2020: teaching and learning in a digital age. Publications Office of the European Union (2020). https://data.europa.eu/doi/10.2766/917974. Accessed 9 Dec 2016
17. S.A. Kalaian, R.M. Kasim, Effectiveness of various innovative learning methods in health science classrooms: a meta-analysis. Adv. Health Sci. Educ. Theory Pract. 22(5), 1151–1167 (2017)
18. M. Giannakos, An introduction to non-formal and informal science learning in the ICT era, in Non-Formal and Informal Science Learning in the ICT Era, ed. by M. Giannakos (Springer, Singapore, 2020), pp. 3–13
19. U. Kossybayeva, B. Shaldykova, D. Akhmanova, S. Kulanina, Improving teaching in different disciplines of natural science and mathematics with innovative technologies. Educ. Inf. Technol. (Dordr). 27(6), 7869–7891 (2022)
20. B. Lee, Social media as a non-formal learning platform. Procedia Soc. Behav. Sci. 103, 837–843 (2013)
21. C. Kivunja, Innovative methodologies for 21st century learning, teaching and assessment: a convenience sampling investigation into the use of social media technologies in higher education. Int. J. High. Educ. 4(2), 1–26 (2015)
22. S.G. Huber, C. Helm, COVID-19 and schooling: evaluation, assessment and accountability in times of crises—reacting quickly to explore key issues for policy, practice and research with the school barometer. Educ. Assess. Eval. Account. 32(2), 237–270 (2020)
23. S. González, X. Bonal, COVID-19 school closures and cumulative disadvantage: assessing the learning gap in formal, informal and non-formal education. Eur. J. Educ. 56(4), 607–622 (2021)
24. M.-J. Castro, M. López, M.-J. Cao, M. Fernández-Castro, S. García, M. Frutos et al., Impact of educational games on academic outcomes of students in the degree in nursing. PLoS ONE 14(7), e0220388 (2019)
25. A.S. Bahrin, M.S. Sunar, A. Azman, Enjoyment as gamified experience for informal learning in virtual reality, in Intelligent Technologies for Interactive Entertainment: Proceedings of the 13th International Conference on Intelligent Technologies for Interactive Entertainment, INTETAIN 2021, December 3–4, 2021, ed. by Z. Lv, H. Song (Springer, Cham, 2022), pp. 383–399, https://doi.org/10.1007/978-3-030-99188-3_24
26. J. Falah, M. Wedyan, S.F.M. Alfalah, M. Abu-Tarboush, A. Al-Jakheem, M. Al-Faraneh et al., Identifying the characteristics of virtual reality gamification for complex educational topics. Multimodal Technol Interact 5(9), 53 (2021)
Chapter 10
Beyond the Exscalate4CoV Project: LIGATE and REMEDI4ALL Projects
Carmine Talarico, Andrea R. Beccari, and Davide Graziani
Abstract In the last 2 years, the SARS-CoV-2 (COVID-19) pandemic demonstrated that rapid response to outbreaks with readily effective treatments represents a primary health and societal priority. At the same time, we became conscious that technological resources are often not used in the most efficient manner. The LIGATE and REpurposing MEDIcines For All (REMEDI4ALL) projects build on the large-scale mobilization efforts of the EXaSCale smArt pLatform Against paThogEns (Exscalate4CoV) project with the aim of applying cutting-edge technologies in drug discovery, sustaining the fight against future pandemics, and promoting the everyday fight against rare diseases. In particular, the LIGATE project, using the drug-discovery platform Exscalate, intends to boost virtual screening campaigns at an extreme scale in terms of performance and to streamline the drug-development process. The aim of the REMEDI4ALL project is to bring together scientific expertise and innovative technology platforms for the repurposing of medicines to treat rare diseases or other pathologic conditions with no current therapy.
C. Talarico (corresponding author), A. R. Beccari, D. Graziani, Dompé farmaceutici S.p.A., L’Aquila, Italy
Chapter DOI: https://doi.org/10.1007/978-3-031-30691-4_10
10.1 LIGATE (www.ligateproject.eu/)
A plethora of molecular simulation tools are available to perform in silico virtual screening studies, but just a few are able to screen millions of molecules on a specific target with the efficiency and precision needed in emergency situations such as that presented by COVID-19. The LIGATE project was born in response to possible new pandemic health crises and builds on the experience gained in Exscalate4CoV. It combines supercomputing and artificial intelligence (AI) resources, completely agnostic to the type of high-performance computing (HPC) infrastructure, with state-of-the-art facilities and institutes for validating hypotheses that are generated in silico [1]. The process that leads from virtual screening to the identification of new active molecules is then followed by a clinical validation phase. All this is ensured by the
expertise of each group within the LIGATE team. Specifically, LIGATE is a 3-year, public–private consortium coordinated by Dompé Farmaceutici and supported by the European Commission (grant agreement 956137). Begun in April 2021, LIGATE aggregates 11 institutions and companies in 5 European countries. The main scope of this initiative is to obtain a fully automated workflow at an exascale level, with the goal of completing an in silico drug-discovery campaign in less than 1 day. The platform is based on Dompé’s proprietary application with integrated chemical functions and a heterogeneous architecture capable of running on exascale systems. In addition, given the presence of expert partners in the areas of software, hardware, drug discovery, and AI, specific modules are being developed and continuously optimized to generate a dedicated computer-aided drug design (CADD) workflow. This new CADD solution will be available if new problems of social and health interest arise. In the meantime, the platform will be validated and tuned on real-world cases to identify new molecules with antiviral activity. The data generated in silico will go through an experimental validation phase and then be made available to the scientific community. The workflow of LIGATE fully automates the drug design process, as summarized in Fig. 10.1.
Fig. 10.1 The LIGATE drug-discovery workflow
10.1.1 Final Objectives and Anticipated Outcomes
The groundbreaking outputs of the project can be summarized as follows:
1. To develop a portable and tunable drug-discovery platform ready for exascale HPC systems to respond promptly to a worldwide pandemic. This enhanced
CADD platform will be freely available for academic and nonprofit organizations through the HPC centers.
2. To validate the CADD solution on health problems of social interest. A tool for immediate response to a public health emergency (e.g., outbreaks of Ebola or Zika virus and multidrug-resistant bacteria) will be available.
3. To produce and share data and information that are recognized by the scientific community by publishing the results in high-level scientific journals [2, 3]. Public Web servers supporting the scientific community were released to distribute all the produced data [4, 5].
Proof of concept of the HPC approach to accelerated drug discovery was previously reported via the ANTAREX project [6], which evaluated 1.2 billion virtual molecules versus the Zika proteome; this dataset also contained ~12,000 launched and clinical phase drugs, allowing for rapid response to Zika infection [6].
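To give a feel for the scale involved, the short Python sketch below works out the sustained throughput needed to evaluate a library of the size reported for ANTAREX (1.2 billion molecules) within the one-day target stated for LIGATE, and shows how the best-scoring candidates can be kept while streaming through a library without holding every score in memory. It is only an illustration under stated assumptions: `toy_score` is a hypothetical stand-in for a real docking score, the ligand identifiers are plain integers, and nothing here reflects the actual Exscalate or LIGATE software.

```python
import heapq

SECONDS_PER_DAY = 24 * 60 * 60

def required_throughput(n_ligands, deadline_seconds=SECONDS_PER_DAY):
    """Ligand evaluations per second needed to finish within the deadline."""
    return n_ligands / deadline_seconds

def toy_score(ligand_id):
    """Hypothetical placeholder for a docking score (lower means better).

    A real campaign would call a docking engine here; this function only
    produces a deterministic pseudo-random value in [0, 1).
    """
    return ((ligand_id * 2654435761) % 1000003) / 1000003

def screen(ligand_ids, score_fn, top_k=5):
    """Stream through the library and keep only the top_k lowest-scoring ligands."""
    scored = ((score_fn(ligand), ligand) for ligand in ligand_ids)
    return heapq.nsmallest(top_k, scored)

# Throughput needed to evaluate 1.2 billion molecules in 24 hours.
rate = required_throughput(1.2e9)
print(f"required throughput: about {rate:,.0f} ligand evaluations per second")

# Screening a small toy library of 100,000 hypothetical ligand identifiers.
hits = screen(range(100_000), toy_score)
for score, ligand in hits:
    print(f"ligand {ligand}: score {score:.6f}")
```

Because each ligand is scored independently, the same streaming pattern can be split across many workers, each returning its own short list of hits to be merged at the end. That independence is what makes virtual screening a natural fit for large HPC systems.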
10.2 REpurposing MEDIcines for All
REpurposing MEDIcines For All (REMEDI4ALL) is a collaborative research and innovation environment in which 24 European partners are engaged in facilitating the development of cost-effective and patient-centric repurposed medicines to cover unmet medical needs across relevant disease areas [7]. Although the discovery of new uses for repurposed drugs is the primary and more evident aim of REMEDI4ALL, its more ambitious aim is to reshape all aspects of drug repurposing implementation and to establish a research and innovation platform based on enhanced collaboration among all actors central to the drug repurposing process. The plan is to establish strong collaboration within a multidisciplinary group that supports the operational activities of high-potential projects at any stage of development. REMEDI4ALL is a 5-year project structured around 13 work packages that support each project. To test the platform, REMEDI4ALL launched a group of four demonstrator projects [7]. The platform will integrate the scientific, methodological, financial, legal, regulatory, and intellectual property aspects of the repurposing approach, ensuring communication among all the relevant stakeholders of the repurposing process (researchers, clinicians, patients, policymakers, regulators, and funders). Ultimately, the main focus of REMEDI4ALL will be to increase the positive outcomes of high-potential projects and to improve collaboration at any stage of development. To obtain this result, REMEDI4ALL will support the groups through a comprehensive education and training portfolio and will prepare the ground for an advanced cross-sectoral policy dialogue with all relevant stakeholders and thought leaders. The tools and processes developed through REMEDI4ALL will be validated in a set of clinical phase demonstrator projects that represent areas of high patient need, including oncology and rare and infectious diseases [7].
10.2.1 Final Objectives and Anticipated Outcomes
The key drug repurposing objectives that REMEDI4ALL proposes to achieve within 5 years [7] can be summarized in the following items and are described in the infographic of Fig. 10.2:
• Create a fully operational, sustainable, and accessible platform.
• Assemble advanced in silico tools for AI-driven drug repurposing, open and accessible datasets, and a broad range of laboratory and clinical development tools and expertise.
• Design, test, and optimize the platform.
• Advance the policy environment.
• Train and integrate the next generation of researchers, clinicians, patients, policymakers, regulators, and research funders involved in the repurposing projects.
The four demonstrator projects on which the platform will be tested and optimized have been carefully selected and are reported in Table 10.1. Projected enduring outcomes of REMEDI4ALL include the following: a broader portfolio of effective medicines for patients, cost-effective medicines and controlled healthcare system costs, a more efficient drug repurposing ecosystem, and a better understanding of drug mechanisms of action.
Fig. 10.2 Summary of the REMEDI4ALL concept and approach [7]
Table 10.1 Selected drug repurposing projects [7]

| Demonstrator project | Approved indication(s) | Repurposing concept | Disease area or potential new indication |
|---|---|---|---|
| Crizotinib; rimcazole | Oncology (crizotinib); schizophrenia (rimcazole) | Discovery and preclinical target validation for inhibition of viral replication | COVID-19 |
| Valproic acid; simvastatin | Epilepsy, depression (valproic acid); cardiovascular disease (simvastatin) | Preclinical dose-finding and phase 1/2 proof-of-concept study with gemcitabine and taxol | Metastatic pancreatic ductal adenocarcinoma |
| Tazarotene | Psoriasis; acne vulgaris | Proof-of-concept study in ultra-rare disease using drug identified in high-throughput screen | Multiple sulfatase deficiency |
| Losartan | Hypertension | Dose-finding for safe and effective reduction of transforming growth factor β | Osteogenesis imperfecta |
References
1. LIGATE: The game changer in drug discovery, https://www.ligateproject.eu/. Accessed 7 June 2022
2. N.A. Murugan, A. Podobas, D. Gadioli, E. Vitali, G. Palermo, S. Markidis, A review on parallel virtual screening softwares for high-performance computers. Pharmaceuticals 15(1), 63 (2022)
3. S. Markidis, D. Gadioli, E. Vitali, G. Palermo, Understanding the I/O impact on the performance of high-throughput molecular docking, in Proceedings of the IEEE/ACM Sixth International Parallel Data Systems Workshop (PDSW), St. Louis, MO (2021), pp. 9–14
4. Dompe Farmaceutici. Spike mutants, https://spikemutants.exscalate4cov.eu/. Accessed 8 Dec 2022
5. I. Romeo, I.G. Prandi, E. Giombini, C.E.M. Gruber, D. Pietrucci, S. Borocci et al., The Spike Mutants website: a worldwide used resource against SARS-CoV-2. Int. J. Mol. Sci. 23(21), 13082 (2022)
6. European Commission. High performance computing used to develop new drugs against the Zika virus, https://digital-strategy.ec.europa.eu/en/news/high-performance-computing-useddevelop-new-drugs-against-zika-virus. Accessed 7 June 2022
7. REMEDi4ALL: Repurposing of medicines 4All, https://remedi4all.org/. Accessed 7 June 2022
Conclusions
High-Performance Computing (HPC), combined with AI and data analytics, is indispensable for the new global data economy. The exponential increase in the volume and diversity of big data is creating new possibilities for sharing knowledge, carrying out scientific research, fostering industrial innovation, and developing public policies. In almost every scientific discipline and in a large number of industrial sectors, researchers and engineers are using advanced modeling and simulations, as well as high-performance big data analytics techniques, to answer fundamental research questions and enable new discoveries and breakthrough innovations. The use of HPC systems is fundamentally affecting our everyday lives. For example, they allow us to better predict and monitor natural disasters, discover new clean energy technologies, and model our climate with increasing precision and accuracy. Many critical breakthroughs would be impossible today without large computational resources and related software and tools. Biology and life sciences are two of the key application areas of HPC, as it enables the understanding of the dynamics of biomolecules and proteins, the design of new materials and the discovery of novel drugs, a better understanding of the functioning of the human brain, and the real-time prediction of pandemic trajectories. The EU-funded Exscalate4CoV initiative was carried out by a successful European consortium of 18 European institutions and coordinated by Dompé farmaceutici S.p.A. The initiative is an outstanding illustration that HPC can respond urgently to real-world problems, ultimately saving lives and reducing economic loss. We will soon see exciting new EU projects in the biomedical field powered by HPC, such as the Digital Twin of the Human Body, which will allow us to simulate medical treatments tailored to each individual or to model tumor response to targeted therapies.
The support at the EU level to all these initiatives, and to HPC in general, is being carried out by a legal and funding entity—the EuroHPC Joint Undertaking (JU). EuroHPC is enabling 32 EU Member States and Associated European countries to
coordinate their supercomputing strategies and investments together with the EU and to support highly innovative initiatives such as Exscalate4CoV. Since its creation, EuroHPC has substantially increased investments in HPC at the European level and has started to restore Europe’s position as a leading HPC power globally. The development in Europe of a competitive HPC ecosystem and an integrated world-class exascale supercomputing and quantum computing capability will be crucial for this ambitious goal. This ecosystem ensures that the EU maintains a leading position in the digital economy and contributes to strengthening Europe’s technological and data sovereignty. I hope that many major breakthroughs will emerge from our investments in a very wide range of application areas, and that many initiatives will follow the example of Exscalate4CoV in delivering concrete solutions for the critical challenges that our society faces today.
Thomas Skordas
Deputy Director General, DG Communications Networks, Content and Technology, European Commission